Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Eric Dumazet @ 2011-05-17  8:34 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: Benoit Sigoure, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	netdev, linux-kernel
In-Reply-To: <E5F8880D-72DF-488B-9515-60453D3EC240@comsys.rwth-aachen.de>

Le mardi 17 mai 2011 à 10:01 +0200, Alexander Zimmermann a écrit :

> 
> regardless of netdev will accept this patch or not, the
> upcoming initRTO is 1s. See
> http://tools.ietf.org/id/draft-paxson-tcpm-rfc2988bis-02.txt
> 
> The draft is IESG approved and will become an RFC soon.

Thanks Alex for this link / information.

^ permalink raw reply

* Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Michał Mirosław @ 2011-05-17  8:45 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, David Miller
In-Reply-To: <1305583775.2885.65.camel@bwh-desktop>

On Mon, May 16, 2011 at 11:09:35PM +0100, Ben Hutchings wrote:
> On Mon, 2011-05-16 at 23:50 +0200, Michał Mirosław wrote:
> > On Mon, May 16, 2011 at 10:08:59PM +0100, Ben Hutchings wrote:
> > > On Mon, 2011-05-16 at 22:51 +0200, Michał Mirosław wrote:
> > > > On Mon, May 16, 2011 at 03:53:17PM +0100, Ben Hutchings wrote:
> > > > > On Mon, 2011-05-16 at 16:23 +0200, Michał Mirosław wrote:
> > > > > > On Mon, May 16, 2011 at 02:37:46PM +0100, Ben Hutchings wrote:
> > > > > > > On Mon, 2011-05-16 at 15:28 +0200, Michał Mirosław wrote:
> > > > > > > > Remove NETIF_F_COMPAT since it's redundant and will be unused after
> > > > > > > > all drivers are converted to fix/set_features.
> > > > > > > > 
> > > > > > > > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > > > > > > > ---
> > > > > > > > 
> > > > > > > > For net as we don't want to have ETHTOOL_F_COMPAT hit stable release.
> > > > > > > [...]
> > > > > > > ETHTOOL_F_WISH means that the requested features could not all be
> > > > > > > enabled, *but are remembered*.  ETHTOOL_F_COMPAT means they were not
> > > > > > > remembered.
> > > > > > Hmm. So, lets just revert 39fc0ce5710c53bad14aaba1a789eec810c556f9
> > > > > > (net: Implement SFEATURES compatibility for not updated drivers).
> > > > > That's also problematic because it means we can't make any use of the
> > > > > 'available' masks from ETHTOOL_GFEATURES.
> > > > > 
> > > > > The patch I sent is actually tested with a modified ethtool.  The
> > > > > fallback works.  I don't think you've tested whether any of your
> > > > > proposals can actually practically be used by ethtool.
> > > > 
> > > > While reading your patches I noted some differences in the way we see
> > > > the new [GS]FEATURES ops.
> > > > 
> > > > First, you make NETIF_F_* flags part of the ethtool ABI. In my approach
> > > > feature names become an ABI instead. That's what ETH_SS_FEATURES string
> > > > set is for, and that's what comments in kernel's <linux/ethtool.h>
> > > > include say.
> > > 
> > > We've been through this before.  I can't use those names in ethtool
> > > because they aren't the same as ethtool used previously.  I could make
> > > it map strings to strings, but I don't see the point.
> > > 
> > > > dev->features are exposed directly by kernel only in two ways:
> > > >  1. /sys/class/net/*/features - since NETIF_F_* flags are not exported
> > > >     in headers for userspace, this should be treated like a debugging
> > > >     facility and not an ABI
> > > >  2. ETHTOOL_[GS]FLAGS - these export 5 flags (LRO, VLAN offload, NTuple,
> > > >     and RX hashing) that are renamed to ETH_FLAG_* - only those constants
> > > >     are in the ABI and only in relation with ETHTOOL_[GS]FLAGS
> > > > 
> > > > Second, you reimplement 'ethtool -K' using ETHTOOL_SFEATURES. Does this mean
> > > > that we want to get rid of ETHTOOL_[GS]{FLAGS,SG,...} from kernel?
> > > We must not.
> > 
> > So what's the point in reimplementing old options via ETHTOOL_SFEATURES?
> 
> Where, in ethtool?  The benefits include:
> - Kernel remembers all the features the user wants on, even if the
> combination is impossible.  Turning TX checksumming off and on no longer
> forces TSO off.

This is what you get by using old ethtool on new kernel. And only for
converted drivers, whether using SFEATURES or old calls.

> - ethtool can distinguish and report whether a feature is unsupported or
> its dependencies are not met.

In this case, when feature is unsupported at all, you still get -EOPNOTSUPP.
If you get no error from old call but after readback (via GSG, etc.) the
feature is still disabled - it means that there are some unmet dependencies.
This is the same information you get from [GS]FEATURES calls.

> > > > The
> > > > assumptions in those calls are a bit different from ETHTOOL_[GS]FEATURES
> > > > but there is an conversion layer in kernel that allows old binaries to
> > > > work correctly in the common case. (-EOPNOTSUPP is still returned for
> > > > drivers which can't change particular feature. The difference is seen
> > > > only in that disabling and enabling e.g. checksumming won't disable other
> > > > dependent features in the result.)
> > > > 
> > > > Right now we already agree that NETIF_F_COMPAT should go.
> > > > 
> > > > I'll send my idea of the ethtool code using ETHTOOL_[GS]FEATURES and
> > > > keeping NETIF_F_* flags internal to the kernel. It adds new modes (-w/-W).
> > > > This might be made even more useful by adding simple wildcard matching.
> > > I've explained before that I do not want to add new options to do
> > > (mostly) the same thing.  Users should have not have to use a different
> > > command depending on the kernel version.
> > 
> > We can avoid new option by checking feature-strings for unrecognised
> > arguments to -K. This way, we will have the old options which work
> > regardless of kernel version ('tx', 'rx', 'sg', etc.) and new options
> > which need recent kernel anyway (separated 'tx-checksum-*', 'loopback',
> > others coming in for 2.6.40).
> This is just too subtle a distinction.  It will mostly confuse users.

We should just document the difference. I expect users who don't care about
new features to not read docs. So 'tx' will still mean 'all TX checksumming'
for them, and they will expect it to turn all TX checksumming
offloads driver supports. If the set changes (like: even in earlier kernels,
some drivers add NETIF_F_SCTP_CSUM to this set) you'll need to update
ethtool userspace. That won't happen if you keep using old calls for
old options as the change will be contained in kernel-side wrapper.

> > Also, this way fallbacks in userspace are avoided.
> No, ethtool will be supporting kernels <2.6.40 for many years yet.

Sure it will. I meant no fallbacks for old options (because they aren't
needed for the tool to work correctly) and no fallback for new options
(as that is not supported in old kernels anyway).

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: linux-next: Tree for May 16 (net/ipv4/ping)
From: Eric Dumazet @ 2011-05-17  9:54 UTC (permalink / raw)
  To: David Miller; +Cc: randy.dunlap, sfr, netdev, linux-next, linux-kernel, segoon
In-Reply-To: <20110516.153844.1072050131335197211.davem@davemloft.net>

Le lundi 16 mai 2011 à 15:38 -0400, David Miller a écrit :
> From: Randy Dunlap <randy.dunlap@oracle.com>
> Date: Mon, 16 May 2011 12:35:34 -0700
> 
> > On Mon, 16 May 2011 15:10:19 +1000 Stephen Rothwell wrote:
> > 
> >> Hi all,
> >> 
> >> Changes since 20110513:
> > 
> > 
> > when CONFIG_PROC_SYSCTL is not enabled:
> > 
> > ping.c:(.text+0x52af3): undefined reference to `inet_get_ping_group_range_net'
> 
> Vasiliy, please fix this.

Vasily seems busy, here is a fix for this problem.

I tested new ping was working even if we could not set the group range
anymore.

Thanks

[PATCH net-next-2.6] net: ping: fix build error

Randy Dunlap reported following build error if CONFIG_PROC_SYSCTL is not
enabled:

ping.c:(.text+0x52af3): undefined reference to
`inet_get_ping_group_range_net'

Also made inet_get_ping_group_range_table() static

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
---
 net/ipv4/inet_connection_sock.c |   14 ++++++++++++++
 net/ipv4/sysctl_net_ipv4.c      |   14 +-------------
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 3a2ba56..e5c0729 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -23,6 +23,7 @@
 #include <net/route.h>
 #include <net/tcp_states.h>
 #include <net/xfrm.h>
+#include <net/ping.h>
 
 #ifdef INET_CSK_DEBUG
 const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
@@ -52,6 +53,19 @@ void inet_get_local_port_range(int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
+void inet_get_ping_group_range_net(struct net *net, gid_t *low, gid_t *high)
+{
+	const gid_t *data = net->ipv4.sysctl_ping_group_range;
+	unsigned int seq;
+
+	do {
+		seq = read_seqbegin(&sysctl_local_ports.lock);
+
+		*low = data[0];
+		*high = data[1];
+	} while (read_seqretry(&sysctl_local_ports.lock, seq));
+}
+
 int inet_csk_bind_conflict(const struct sock *sk,
 			   const struct inet_bind_bucket *tb)
 {
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 28e8273..dc5d2a0 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -73,19 +73,7 @@ static int ipv4_local_port_range(ctl_table *table, int write,
 }
 
 
-void inet_get_ping_group_range_net(struct net *net, gid_t *low, gid_t *high)
-{
-	gid_t *data = net->ipv4.sysctl_ping_group_range;
-	unsigned seq;
-	do {
-		seq = read_seqbegin(&sysctl_local_ports.lock);
-
-		*low = data[0];
-		*high = data[1];
-	} while (read_seqretry(&sysctl_local_ports.lock, seq));
-}
-
-void inet_get_ping_group_range_table(struct ctl_table *table, gid_t *low, gid_t *high)
+static void inet_get_ping_group_range_table(struct ctl_table *table, gid_t *low, gid_t *high)
 {
 	gid_t *data = table->data;
 	unsigned seq;

^ permalink raw reply related

* Confirm Receipt
From: Western Union Money Transfer @ 2011-05-17  8:49 UTC (permalink / raw)


You have a money transfer of 85,000 USD. Confirm receipt with your name and
country via email: engwutf@live.co.uk


^ permalink raw reply

* [PATCH] xen: netback: use __CONST_RING_SIZE not __RING_SIZE
From: Ian Campbell @ 2011-05-17  9:59 UTC (permalink / raw)
  To: xen-devel, netdev; +Cc: Ian Campbell, Witold Baryluk

The later causes warnings with gcc 4.5+. __CONST_RING_SIZE was introduced in
667c78afaec0 to fix this but as netback wasn't upstream at the time it did not
benefit, hence:

  CC      drivers/net/xen-netback/netback.o
drivers/net/xen-netback/netback.c:110:37: warning: variably modified 'grant_copy_op' at file scope [enabled by default]
drivers/net/xen-netback/netback.c:111:30: warning: variably modified 'meta' at file scope [enabled by default]
drivers/net/xen-netback/netback.c: In function 'xen_netbk_rx_action':
drivers/net/xen-netback/netback.c:584:6: warning: variable 'irq' set but not used [-Wunused-but-set-variable]

Thanks to Witold Baryluk for pointing this out.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
---
 drivers/net/xen-netback/common.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 5d7bbf2..e85474e 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -109,8 +109,8 @@ struct xenvif {
 	wait_queue_head_t waiting_to_free;
 };
 
-#define XEN_NETIF_TX_RING_SIZE __RING_SIZE((struct xen_netif_tx_sring *)0, PAGE_SIZE)
-#define XEN_NETIF_RX_RING_SIZE __RING_SIZE((struct xen_netif_rx_sring *)0, PAGE_SIZE)
+#define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE)
+#define XEN_NETIF_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, PAGE_SIZE)
 
 struct xenvif *xenvif_alloc(struct device *parent,
 			    domid_t domid,
-- 
1.7.2.5


^ permalink raw reply related

* [PATCH] net: ping: fix build failure
From: Vasiliy Kulikov @ 2011-05-17 10:16 UTC (permalink / raw)
  To: David Miller; +Cc: randy.dunlap, sfr, netdev, linux-next, linux-kernel
In-Reply-To: <20110516.153844.1072050131335197211.davem@davemloft.net>

On Mon, May 16, 2011 at 15:38 -0400, David Miller wrote:
> From: Randy Dunlap <randy.dunlap@oracle.com>
> Date: Mon, 16 May 2011 12:35:34 -0700
> 
> > On Mon, 16 May 2011 15:10:19 +1000 Stephen Rothwell wrote:
> > when CONFIG_PROC_SYSCTL is not enabled:
> > 
> > ping.c:(.text+0x52af3): undefined reference to `inet_get_ping_group_range_net'
> 
> Vasiliy, please fix this.

I wonder whether there is any way to test such unusual configurations?
Only randconfig or are there any (partly-)automated tools for it?


[PATCH] net: ping: fix build failure

If CONFIG_PROC_SYSCTL=n the building process fails:

    ping.c:(.text+0x52af3): undefined reference to `inet_get_ping_group_range_net'

Moved inet_get_ping_group_range_net() to ping.c.

Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
---
 include/net/ping.h         |    2 --
 net/ipv4/ping.c            |   13 +++++++++++++
 net/ipv4/sysctl_net_ipv4.c |   12 ------------
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/net/ping.h b/include/net/ping.h
index 23062c3..682b5ae 100644
--- a/include/net/ping.h
+++ b/include/net/ping.h
@@ -44,8 +44,6 @@ extern struct proto ping_prot;
 extern void ping_rcv(struct sk_buff *);
 extern void ping_err(struct sk_buff *, u32 info);
 
-extern void inet_get_ping_group_range_net(struct net *net, unsigned int *low, unsigned int *high);
-
 #ifdef CONFIG_PROC_FS
 extern int __init ping_proc_init(void);
 extern void ping_proc_exit(void);
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 7041d09..3635975 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -188,6 +188,19 @@ exit:
 	return sk;
 }
 
+static void inet_get_ping_group_range_net(struct net *net, gid_t *low, gid_t *high)
+{
+	gid_t *data = net->ipv4.sysctl_ping_group_range;
+	unsigned seq;
+	do {
+		seq = read_seqbegin(&sysctl_local_ports.lock);
+
+		*low = data[0];
+		*high = data[1];
+	} while (read_seqretry(&sysctl_local_ports.lock, seq));
+}
+
+
 static int ping_init_sock(struct sock *sk)
 {
 	struct net *net = sock_net(sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 28e8273..57d0752 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -73,18 +73,6 @@ static int ipv4_local_port_range(ctl_table *table, int write,
 }
 
 
-void inet_get_ping_group_range_net(struct net *net, gid_t *low, gid_t *high)
-{
-	gid_t *data = net->ipv4.sysctl_ping_group_range;
-	unsigned seq;
-	do {
-		seq = read_seqbegin(&sysctl_local_ports.lock);
-
-		*low = data[0];
-		*high = data[1];
-	} while (read_seqretry(&sysctl_local_ports.lock, seq));
-}
-
 void inet_get_ping_group_range_table(struct ctl_table *table, gid_t *low, gid_t *high)
 {
 	gid_t *data = table->data;
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH] net: ping: fix build failure
From: Eric Dumazet @ 2011-05-17 10:27 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: David Miller, randy.dunlap, sfr, netdev, linux-next, linux-kernel
In-Reply-To: <20110517101656.GA28685@albatros>

Le mardi 17 mai 2011 à 14:16 +0400, Vasiliy Kulikov a écrit :
> On Mon, May 16, 2011 at 15:38 -0400, David Miller wrote:
> > From: Randy Dunlap <randy.dunlap@oracle.com>
> > Date: Mon, 16 May 2011 12:35:34 -0700
> > 
> > > On Mon, 16 May 2011 15:10:19 +1000 Stephen Rothwell wrote:
> > > when CONFIG_PROC_SYSCTL is not enabled:
> > > 
> > > ping.c:(.text+0x52af3): undefined reference to `inet_get_ping_group_range_net'
> > 
> > Vasiliy, please fix this.
> 
> I wonder whether there is any way to test such unusual configurations?
> Only randconfig or are there any (partly-)automated tools for it?
> 
> 
> [PATCH] net: ping: fix build failure
> 
> If CONFIG_PROC_SYSCTL=n the building process fails:
> 
>     ping.c:(.text+0x52af3): undefined reference to `inet_get_ping_group_range_net'
> 
> Moved inet_get_ping_group_range_net() to ping.c.
> 
> Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
> Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply

* [stable submission] vmxnet3: Fix inconsistent LRO state after initialization
From: Thomas Jarosch @ 2011-05-17 10:52 UTC (permalink / raw)
  To: stable; +Cc: netdev

Hi greg k-h,

please include commit ebde6f8acba92abfc203585198a54f47e83e2cd0
"vmxnet3: Fix inconsistent LRO state after initialization"

in 2.6.37 / 2.6.38 stable.


Kernel 2.6.32 first included the vmxnet3 driver and I've checked
that the patch applies to 2.6.32.40 and 2.6.35.13, so it might be
worth to include it in all maintained kernels.

Best regards,
Thomas Jarosch

^ permalink raw reply

* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Hagen Paul Pfeifer @ 2011-05-17 11:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benoit Sigoure, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	netdev, linux-kernel
In-Reply-To: <1305619677.2850.11.camel@edumazet-laptop>


On Tue, 17 May 2011 10:07:57 +0200, Eric Dumazet wrote:



> I wont discuss if introducing a new sysctl is welcomed, only on patch

> issues. I believe some work in IETF is done to reduce the 3sec value to

> 1sec anyway.



Why not? I though all new knobs in this area should be done on a per route

metric so it can be controlled on a per path basis. RTO should be

adjustable on a per path basis, because it depends on the path.



Some months back [1] I posted a patch to enable/disable TCP quick ack

mode, which has nothing to do with network paths, just with a local server

policy. But David rejected the patch with the argument that I should use a

per path knob (this is a little bit inapprehensible for me, but David has

the last word).



Hagen





[1] http://kerneltrap.org/mailarchive/linux-netdev/2010/8/23/6283640

^ permalink raw reply

* Re: [PATCH 0/7] Network namespace manipulation with file descriptors
From: David Lamparter @ 2011-05-17 11:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-arch-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linux Containers
In-Reply-To: <m1fwoqoapn.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>

On Sat, May 07, 2011 at 07:18:44AM -0700, Eric W. Biederman wrote:
> You can read the processes network namespace by opening
> /proc/<pid>/ns/net.  Unfortunately comparing the network
> namespaces for identity is another matter.  You will probably
> be better off simply forcing the routing daemon to start
> in the desired network namespace in it's initscript.
> 
> For purposes of clarity please have a look at my work in
> progress patch for iproute2.  This demonstrates how I expect
> userspace to work in a multi-network namespace world.
> 
[...]
> Subject: [PATCH] iproute2:  Add processless netnwork namespace support.
[...]
> Configuration specific to a network namespace that
> would ordinarily be stored under /etc/ is stored under
> /etc/netns/<name>.  For example if the dns server
> configuration is different for your vpn you would
> create a file /etc/netns/myvpn/resolv.conf.
> 
> File descriptors that can be used to manipulate a
> network namespace can be created by opening
> /var/run/netns/<NAME>.
> 
> This adds the following commands to iproute.
> ip netns add NAME
> ip netns delete NAME
> ip netns monitor
> ip netns list
> ip netns exec NAME cmd ....
> ip link set DEV netns NAME

funny, this is almost exactly what my code does - though you're probably
doing it better and have more features ;)
http://git.spaceboyz.net/equinox/vrf-tools.git/
git://spaceboyz.net/equinox/vrf-tools.git

It currently forks off a daemon to keep the namespace open; attaching is
not possible yet, but opening a socket in a different namespace is.

Most of the actual management (mounting things & co.) I offloaded to
some shell scripts; I use it together with GNU screen (which makes it
very nice to grab one of the namespaces and start/stop/manage/...
things).

I also have patches for OpenVPN and pptpd floating around that make it
possible to 'cross' namespace boundaries, i.e. the VPN servers listen in
one namespace and have their devices in another.


-David

^ permalink raw reply

* Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.
From: Eric Dumazet @ 2011-05-17 12:20 UTC (permalink / raw)
  To: Hagen Paul Pfeifer
  Cc: Benoit Sigoure, davem, kuznet, pekkas, jmorris, yoshfuji, kaber,
	netdev, linux-kernel
In-Reply-To: <e6b52d895fca947b4cce2896d593619b@localhost>

Le mardi 17 mai 2011 à 13:02 +0200, Hagen Paul Pfeifer a écrit :
> On Tue, 17 May 2011 10:07:57 +0200, Eric Dumazet wrote:
> 
> > I wont discuss if introducing a new sysctl is welcomed, only on patch
> > issues. I believe some work in IETF is done to reduce the 3sec value to
> > 1sec anyway.
> 
> Why not? 

Just because I let this point to David and others. I personally dont
care that much.

> I though all new knobs in this area should be done on a per route
> metric so it can be controlled on a per path basis. RTO should be
> adjustable on a per path basis, because it depends on the path.
> 

Adding many knobs to each clone had a huge cost on previous kernels.
(Think some machines have millions entries in IP route cache), this used
quite a lot of memory.

With latest David work, we'll consume less ram, because we can now share
settings, instead of copying them on each dst entry.



> Some months back [1] I posted a patch to enable/disable TCP quick ack
> mode, which has nothing to do with network paths, just with a local server
> policy. But David rejected the patch with the argument that I should use a
> per path knob (this is a little bit inapprehensible for me, but David has
> the last word).

Well, if nobody speaks after David, he has the last word indeed.

BTW, I remember Stephen actually asked the per route thing, not David.

http://kerneltrap.org/mailarchive/linux-netdev/2010/8/23/6283641

Then David also stated it :

http://kerneltrap.org/mailarchive/linux-netdev/2010/8/23/6283678

If you really want tcp_quickack thing you really should do it as
requested by both Stephen & David ;)

Unfortunately, I dont know if its really needed or worthwhile.

^ permalink raw reply

* WARNING, net/core/dev.c, skb_gso_segment, 2.6.39-rc7-git11 inclusive
From: Denys Fedoryshchenko @ 2011-05-17 12:52 UTC (permalink / raw)
  To: netdev

 Hi

 Just tried some non-standard pedit rules to modify tcp data, and got 
 WARNING almost instantly.
 Maybe it is just invalid pedit, but probably will be interesting. My 
 idea was to modify outgoing tcp window size,
 rules was not complete by mistake, but got that warning.

 Rules
 tc qdisc del dev eth1 root
 tc qdisc add dev eth1 root handle 1: htb default 1
 tc class add dev eth1 parent 1: classid 1:1 htb rate 900Mbit ceil 
 900Mbit burst 10000 cburst 10000 quantum 60000
 tc qdisc add dev eth1 parent 1:1 handle 11 pfifo limit 10000

 tc filter add dev eth1 prio 1 protocol ip parent 1: \
   u32 match ip protocol 0x06 0xff match ip dport 80 0x2000 flowid 1:1 \
   match u16 0x0000 0xffff at 34 action gact ok

 tc filter add dev eth1 prio 1 protocol ip parent 1: \
   u32 match ip protocol 0x06 0xff match ip dport 80 0x2000 flowid 1:1 \
     action pedit munge offset 34 u16 set 0x2000 \
       pipe csum ip and tcp

 Warning:


  2298.298075] GACT probability on
 [ 2298.369166] ------------[ cut here ]------------
 [ 2298.369175] WARNING: at net/core/dev.c:1876 
 skb_gso_segment+0x156/0x2b8()
 [ 2298.369178] Hardware name: RS700-E6/ERS4
 [ 2298.369181] igb: caps=(0x21143b3, 0x0) len=11636 data_len=0 
 ip_summed=0
 [ 2298.369184] Modules linked in: act_gact act_csum act_pedit cls_u32 
 sch_htb ipv6 i2c_i801 i7core_edac i2c_core edac_core rtc_cmos igb e1000e 
 rtc_core rtc_lib iTCO_wdt ghes hed button
 [ 2298.369200] Pid: 0, comm: swapper Not tainted 2.6.39-rc7-insat #1
 [ 2298.369203] Call Trace:
 [ 2298.369205]  <IRQ>  [<ffffffff810338cf>] 
 warn_slowpath_common+0x80/0x98
 [ 2298.369218]  [<ffffffff8103397b>] warn_slowpath_fmt+0x41/0x43
 [ 2298.369222]  [<ffffffff8132a10d>] skb_gso_segment+0x156/0x2b8
 [ 2298.369227]  [<ffffffff8132a700>] dev_hard_start_xmit+0x37c/0x522
 [ 2298.369233]  [<ffffffff81341135>] sch_direct_xmit+0x67/0x18d
 [ 2298.369237]  [<ffffffff81341365>] __qdisc_run+0x10a/0x126
 [ 2298.369241]  [<ffffffff8132ac39>] dev_queue_xmit+0x393/0x4c3
 [ 2298.369247]  [<ffffffff81332a8a>] neigh_resolve_output+0x28f/0x2ed
 [ 2298.369250]  [<ffffffff8134b8da>] ? nf_hook_slow+0x6d/0x108
 [ 2298.369253]  [<ffffffff81357ec4>] ? ip_options_build+0x143/0x143
 [ 2298.369255]  [<ffffffff8135a964>] ip_finish_output+0x260/0x2a3
 [ 2298.369257]  [<ffffffff8135aa4d>] ip_output+0xa6/0xad
 [ 2298.369258]  [<ffffffff81359d79>] ? __ip_local_out+0x9c/0x9e
 [ 2298.369260]  [<ffffffff81359d9f>] ip_local_out+0x24/0x28
 [ 2298.369262]  [<ffffffff8135a31c>] ip_queue_xmit+0x2dd/0x328
 [ 2298.369265]  [<ffffffff8131fcd5>] ? __skb_clone+0x29/0xf2
 [ 2298.369268]  [<ffffffff8136bbd4>] tcp_transmit_skb+0x74d/0x78b
 [ 2298.369270]  [<ffffffff8136e389>] tcp_write_xmit+0x806/0x8f5
 [ 2298.369272]  [<ffffffff8136b04f>] ? 
 tcp_established_options+0x2e/0xa9
 [ 2298.369274]  [<ffffffff8136e4c9>] 
 __tcp_push_pending_frames+0x20/0x7c
 [ 2298.369276]  [<ffffffff8136a685>] tcp_rcv_established+0x104/0x60f
 [ 2298.369278]  [<ffffffff81371696>] tcp_v4_do_rcv+0x1ba/0x377
 [ 2298.369283]  [<ffffffff81173e9f>] ? security_sock_rcv_skb+0x11/0x13
 [ 2298.369284]  [<ffffffff81371d6f>] tcp_v4_rcv+0x51c/0x869
 [ 2298.369287]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
 [ 2298.369289]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
 [ 2298.369291]  [<ffffffff81355cac>] 
 ip_local_deliver_finish+0x178/0x235
 [ 2298.369293]  [<ffffffff81355ddb>] ip_local_deliver+0x72/0x79
 [ 2298.369295]  [<ffffffff81355880>] ip_rcv_finish+0x2dc/0x306
 [ 2298.369297]  [<ffffffff81355b06>] ip_rcv+0x25c/0x28a
 [ 2298.369299]  [<ffffffff81329536>] __netif_receive_skb+0x4b8/0x4ea
 [ 2298.369301]  [<ffffffff81329778>] netif_receive_skb+0x67/0x6e
 [ 2298.369303]  [<ffffffff81329861>] napi_skb_finish+0x24/0x3b
 [ 2298.369305]  [<ffffffff81329d44>] napi_gro_receive+0xa8/0xad
 [ 2298.369309]  [<ffffffffa003668f>] igb_poll+0x783/0xaaa [igb]
 [ 2298.369314]  [<ffffffff81213d16>] ? put_device+0x12/0x14
 [ 2298.369317]  [<ffffffff81226e3d>] ? scsi_next_command+0x3e/0x46
 [ 2298.369319]  [<ffffffff81329e79>] net_rx_action+0xa3/0x1bb
 [ 2298.369322]  [<ffffffff81038537>] __do_softirq+0x83/0x114
 [ 2298.369324]  [<ffffffff813babcc>] call_softirq+0x1c/0x30
 [ 2298.369328]  [<ffffffff81003e0f>] do_softirq+0x33/0x68
 [ 2298.369329]  [<ffffffff810383d4>] irq_exit+0x3f/0x88
 [ 2298.369331]  [<ffffffff810036f6>] do_IRQ+0x98/0xaf
 [ 2298.369334]  [<ffffffff813b9453>] common_interrupt+0x13/0x13
 [ 2298.369335]  <EOI>  [<ffffffff8104d2d7>] ? 
 __hrtimer_start_range_ns+0x2a3/0x2b5
 [ 2298.369343]  [<ffffffff811c05a9>] ? intel_idle+0xc3/0xe9
 [ 2298.369345]  [<ffffffff811c058c>] ? intel_idle+0xa6/0xe9
 [ 2298.369348]  [<ffffffff813097cb>] cpuidle_idle_call+0x94/0xcd
 [ 2298.369350]  [<ffffffff81001c47>] cpu_idle+0x5a/0x91
 [ 2298.369353]  [<ffffffff813a5564>] rest_init+0x68/0x6a
 [ 2298.369356]  [<ffffffff8160d930>] start_kernel+0x2f3/0x2fe
 [ 2298.369358]  [<ffffffff8160d085>] 
 x86_64_start_reservations+0x82/0x86
 [ 2298.369360]  [<ffffffff8160d172>] x86_64_start_kernel+0xe9/0xf0
 [ 2298.369361] ---[ end trace 408f79c72c45f490 ]---
 [ 2298.369807] ------------[ cut here ]------------
 [ 2298.369809] WARNING: at net/core/dev.c:1876 
 skb_gso_segment+0x156/0x2b8()
 [ 2298.369810] Hardware name: RS700-E6/ERS4
 [ 2298.369811] igb: caps=(0x21143b3, 0x0) len=2948 data_len=0 
 ip_summed=0
 [ 2298.369812] Modules linked in: act_gact act_csum act_pedit cls_u32 
 sch_htb ipv6 i2c_i801 i7core_edac i2c_core edac_core rtc_cmos igb e1000e 
 rtc_core rtc_lib iTCO_wdt ghes hed button
 [ 2298.369818] Pid: 5867, comm: syslog-ng Tainted: G        W   
 2.6.39-rc7-insat #1
 [ 2298.369819] Call Trace:
 [ 2298.369820]  <IRQ>  [<ffffffff810338cf>] 
 warn_slowpath_common+0x80/0x98
 [ 2298.369824]  [<ffffffff8103397b>] warn_slowpath_fmt+0x41/0x43
 [ 2298.369826]  [<ffffffff8132a10d>] skb_gso_segment+0x156/0x2b8
 [ 2298.369828]  [<ffffffff8132a700>] dev_hard_start_xmit+0x37c/0x522
 [ 2298.369830]  [<ffffffff81341135>] sch_direct_xmit+0x67/0x18d
 [ 2298.369831]  [<ffffffff81341365>] __qdisc_run+0x10a/0x126
 [ 2298.369833]  [<ffffffff8132ac39>] dev_queue_xmit+0x393/0x4c3
 [ 2298.369835]  [<ffffffff81332a8a>] neigh_resolve_output+0x28f/0x2ed
 [ 2298.369837]  [<ffffffff8134b8da>] ? nf_hook_slow+0x6d/0x108
 [ 2298.369838]  [<ffffffff81357ec4>] ? ip_options_build+0x143/0x143
 [ 2298.369840]  [<ffffffff8135a964>] ip_finish_output+0x260/0x2a3
 [ 2298.369841]  [<ffffffff8135aa4d>] ip_output+0xa6/0xad
 [ 2298.369843]  [<ffffffff81359d79>] ? __ip_local_out+0x9c/0x9e
 [ 2298.369844]  [<ffffffff81359d9f>] ip_local_out+0x24/0x28
 [ 2298.369846]  [<ffffffff8135a31c>] ip_queue_xmit+0x2dd/0x328
 [ 2298.369848]  [<ffffffff8131fcd5>] ? __skb_clone+0x29/0xf2
 [ 2298.369850]  [<ffffffff8136bbd4>] tcp_transmit_skb+0x74d/0x78b
 [ 2298.369852]  [<ffffffff8136e389>] tcp_write_xmit+0x806/0x8f5
 [ 2298.369854]  [<ffffffff8136b04f>] ? 
 tcp_established_options+0x2e/0xa9
 [ 2298.369856]  [<ffffffff8136e4c9>] 
 __tcp_push_pending_frames+0x20/0x7c
 [ 2298.369857]  [<ffffffff8136a685>] tcp_rcv_established+0x104/0x60f
 [ 2298.369859]  [<ffffffff81371696>] tcp_v4_do_rcv+0x1ba/0x377
 [ 2298.369861]  [<ffffffff81173e9f>] ? security_sock_rcv_skb+0x11/0x13
 [ 2298.369863]  [<ffffffff81371d6f>] tcp_v4_rcv+0x51c/0x869
 [ 2298.369865]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
 [ 2298.369867]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
 [ 2298.369869]  [<ffffffff81355cac>] 
 ip_local_deliver_finish+0x178/0x235
 [ 2298.369870]  [<ffffffff81355ddb>] ip_local_deliver+0x72/0x79
 [ 2298.369872]  [<ffffffff81355880>] ip_rcv_finish+0x2dc/0x306
 [ 2298.369874]  [<ffffffff81355b06>] ip_rcv+0x25c/0x28a
 [ 2298.369876]  [<ffffffff81329536>] __netif_receive_skb+0x4b8/0x4ea
 [ 2298.369878]  [<ffffffff81329778>] netif_receive_skb+0x67/0x6e
 [ 2298.369879]  [<ffffffff81329861>] napi_skb_finish+0x24/0x3b
 [ 2298.369881]  [<ffffffff81329d44>] napi_gro_receive+0xa8/0xad
 [ 2298.369884]  [<ffffffffa003668f>] igb_poll+0x783/0xaaa [igb]
 [ 2298.369886]  [<ffffffff8100e0d3>] ? x86_pmu_enable+0x1fc/0x263
 [ 2298.369889]  [<ffffffff810794cb>] ? perf_ctx_adjust_freq+0x29/0x10a
 [ 2298.369890]  [<ffffffff81329e79>] net_rx_action+0xa3/0x1bb
 [ 2298.369893]  [<ffffffff8106f3d4>] ? 
 __rcu_process_callbacks+0x75/0x265
 [ 2298.369895]  [<ffffffff81038537>] __do_softirq+0x83/0x114
 [ 2298.369897]  [<ffffffff813babcc>] call_softirq+0x1c/0x30
 [ 2298.369899]  [<ffffffff81003e0f>] do_softirq+0x33/0x68
 [ 2298.369900]  [<ffffffff810383d4>] irq_exit+0x3f/0x88
 [ 2298.369902]  [<ffffffff810036f6>] do_IRQ+0x98/0xaf
 [ 2298.369904]  [<ffffffff813b9453>] common_interrupt+0x13/0x13
 [ 2298.369905]  <EOI>  [<ffffffff8104dc74>] ? 
 atomic_notifier_call_chain+0x13/0x15
 [ 2298.369910]  [<ffffffff81202714>] ? do_con_write+0x57b/0x1db7
 [ 2298.369912]  [<ffffffff812025eb>] ? do_con_write+0x452/0x1db7
 [ 2298.369913]  [<ffffffff810340da>] ? console_unlock+0x170/0x19a
 [ 2298.369915]  [<ffffffff811ff9b0>] ? con_flush_chars+0x3e/0x43
 [ 2298.369917]  [<ffffffff8104daa8>] ? up+0x34/0x39
 [ 2298.369918]  [<ffffffff81203f97>] con_write+0x11/0x26
 [ 2298.369920]  [<ffffffff8104a0c5>] ? add_wait_queue+0x3f/0x46
 [ 2298.369923]  [<ffffffff811f3896>] n_tty_write+0x239/0x35a
 [ 2298.369925]  [<ffffffff8102f2a9>] ? try_to_wake_up+0x240/0x240
 [ 2298.369926]  [<ffffffff811f0b13>] tty_write+0x19d/0x22f
 [ 2298.369928]  [<ffffffff811f365d>] ? n_tty_ioctl+0xab/0xab
 [ 2298.369931]  [<ffffffff810b3e82>] vfs_write+0xae/0x133
 [ 2298.369933]  [<ffffffff810b3fc0>] sys_write+0x45/0x6c
 [ 2298.369935]  [<ffffffff813b9a7b>] system_call_fastpath+0x16/0x1b
 [ 2298.369936] ---[ end trace 408f79c72c45f491 ]---


^ permalink raw reply

* Re: WARNING, net/core/dev.c, skb_gso_segment, 2.6.39-rc7-git11 inclusive
From: Eric Dumazet @ 2011-05-17 13:16 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <75df1ebd245faa29f6d9c067bad94e8e@visp.net.lb>

Le mardi 17 mai 2011 à 15:52 +0300, Denys Fedoryshchenko a écrit :
> Hi
> 
>  Just tried some non-standard pedit rules to modify tcp data, and got 
>  WARNING almost instantly.
>  Maybe it is just invalid pedit, but probably will be interesting. My 
>  idea was to modify outgoing tcp window size,
>  rules was not complete by mistake, but got that warning.
> 
>  Rules
>  tc qdisc del dev eth1 root
>  tc qdisc add dev eth1 root handle 1: htb default 1
>  tc class add dev eth1 parent 1: classid 1:1 htb rate 900Mbit ceil 
>  900Mbit burst 10000 cburst 10000 quantum 60000
>  tc qdisc add dev eth1 parent 1:1 handle 11 pfifo limit 10000
> 
>  tc filter add dev eth1 prio 1 protocol ip parent 1: \
>    u32 match ip protocol 0x06 0xff match ip dport 80 0x2000 flowid 1:1 \
>    match u16 0x0000 0xffff at 34 action gact ok
> 
>  tc filter add dev eth1 prio 1 protocol ip parent 1: \
>    u32 match ip protocol 0x06 0xff match ip dport 80 0x2000 flowid 1:1 \
>      action pedit munge offset 34 u16 set 0x2000 \
>        pipe csum ip and tcp
> 
>  Warning:
> 
> 
>   2298.298075] GACT probability on
>  [ 2298.369166] ------------[ cut here ]------------
>  [ 2298.369175] WARNING: at net/core/dev.c:1876 
>  skb_gso_segment+0x156/0x2b8()
>  [ 2298.369178] Hardware name: RS700-E6/ERS4
>  [ 2298.369181] igb: caps=(0x21143b3, 0x0) len=11636 data_len=0 
>  ip_summed=0
>  [ 2298.369184] Modules linked in: act_gact act_csum act_pedit cls_u32 
>  sch_htb ipv6 i2c_i801 i7core_edac i2c_core edac_core rtc_cmos igb e1000e 
>  rtc_core rtc_lib iTCO_wdt ghes hed button
>  [ 2298.369200] Pid: 0, comm: swapper Not tainted 2.6.39-rc7-insat #1
>  [ 2298.369203] Call Trace:
>  [ 2298.369205]  <IRQ>  [<ffffffff810338cf>] 
>  warn_slowpath_common+0x80/0x98
>  [ 2298.369218]  [<ffffffff8103397b>] warn_slowpath_fmt+0x41/0x43
>  [ 2298.369222]  [<ffffffff8132a10d>] skb_gso_segment+0x156/0x2b8
>  [ 2298.369227]  [<ffffffff8132a700>] dev_hard_start_xmit+0x37c/0x522
>  [ 2298.369233]  [<ffffffff81341135>] sch_direct_xmit+0x67/0x18d
>  [ 2298.369237]  [<ffffffff81341365>] __qdisc_run+0x10a/0x126
>  [ 2298.369241]  [<ffffffff8132ac39>] dev_queue_xmit+0x393/0x4c3
>  [ 2298.369247]  [<ffffffff81332a8a>] neigh_resolve_output+0x28f/0x2ed
>  [ 2298.369250]  [<ffffffff8134b8da>] ? nf_hook_slow+0x6d/0x108
>  [ 2298.369253]  [<ffffffff81357ec4>] ? ip_options_build+0x143/0x143
>  [ 2298.369255]  [<ffffffff8135a964>] ip_finish_output+0x260/0x2a3
>  [ 2298.369257]  [<ffffffff8135aa4d>] ip_output+0xa6/0xad
>  [ 2298.369258]  [<ffffffff81359d79>] ? __ip_local_out+0x9c/0x9e
>  [ 2298.369260]  [<ffffffff81359d9f>] ip_local_out+0x24/0x28
>  [ 2298.369262]  [<ffffffff8135a31c>] ip_queue_xmit+0x2dd/0x328
>  [ 2298.369265]  [<ffffffff8131fcd5>] ? __skb_clone+0x29/0xf2
>  [ 2298.369268]  [<ffffffff8136bbd4>] tcp_transmit_skb+0x74d/0x78b
>  [ 2298.369270]  [<ffffffff8136e389>] tcp_write_xmit+0x806/0x8f5
>  [ 2298.369272]  [<ffffffff8136b04f>] ? 
>  tcp_established_options+0x2e/0xa9
>  [ 2298.369274]  [<ffffffff8136e4c9>] 
>  __tcp_push_pending_frames+0x20/0x7c
>  [ 2298.369276]  [<ffffffff8136a685>] tcp_rcv_established+0x104/0x60f
>  [ 2298.369278]  [<ffffffff81371696>] tcp_v4_do_rcv+0x1ba/0x377
>  [ 2298.369283]  [<ffffffff81173e9f>] ? security_sock_rcv_skb+0x11/0x13
>  [ 2298.369284]  [<ffffffff81371d6f>] tcp_v4_rcv+0x51c/0x869
>  [ 2298.369287]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
>  [ 2298.369289]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
>  [ 2298.369291]  [<ffffffff81355cac>] 
>  ip_local_deliver_finish+0x178/0x235
>  [ 2298.369293]  [<ffffffff81355ddb>] ip_local_deliver+0x72/0x79
>  [ 2298.369295]  [<ffffffff81355880>] ip_rcv_finish+0x2dc/0x306
>  [ 2298.369297]  [<ffffffff81355b06>] ip_rcv+0x25c/0x28a
>  [ 2298.369299]  [<ffffffff81329536>] __netif_receive_skb+0x4b8/0x4ea
>  [ 2298.369301]  [<ffffffff81329778>] netif_receive_skb+0x67/0x6e
>  [ 2298.369303]  [<ffffffff81329861>] napi_skb_finish+0x24/0x3b
>  [ 2298.369305]  [<ffffffff81329d44>] napi_gro_receive+0xa8/0xad
>  [ 2298.369309]  [<ffffffffa003668f>] igb_poll+0x783/0xaaa [igb]
>  [ 2298.369314]  [<ffffffff81213d16>] ? put_device+0x12/0x14
>  [ 2298.369317]  [<ffffffff81226e3d>] ? scsi_next_command+0x3e/0x46
>  [ 2298.369319]  [<ffffffff81329e79>] net_rx_action+0xa3/0x1bb
>  [ 2298.369322]  [<ffffffff81038537>] __do_softirq+0x83/0x114
>  [ 2298.369324]  [<ffffffff813babcc>] call_softirq+0x1c/0x30
>  [ 2298.369328]  [<ffffffff81003e0f>] do_softirq+0x33/0x68
>  [ 2298.369329]  [<ffffffff810383d4>] irq_exit+0x3f/0x88
>  [ 2298.369331]  [<ffffffff810036f6>] do_IRQ+0x98/0xaf
>  [ 2298.369334]  [<ffffffff813b9453>] common_interrupt+0x13/0x13
>  [ 2298.369335]  <EOI>  [<ffffffff8104d2d7>] ? 
>  __hrtimer_start_range_ns+0x2a3/0x2b5
>  [ 2298.369343]  [<ffffffff811c05a9>] ? intel_idle+0xc3/0xe9
>  [ 2298.369345]  [<ffffffff811c058c>] ? intel_idle+0xa6/0xe9
>  [ 2298.369348]  [<ffffffff813097cb>] cpuidle_idle_call+0x94/0xcd
>  [ 2298.369350]  [<ffffffff81001c47>] cpu_idle+0x5a/0x91
>  [ 2298.369353]  [<ffffffff813a5564>] rest_init+0x68/0x6a
>  [ 2298.369356]  [<ffffffff8160d930>] start_kernel+0x2f3/0x2fe
>  [ 2298.369358]  [<ffffffff8160d085>] 
>  x86_64_start_reservations+0x82/0x86
>  [ 2298.369360]  [<ffffffff8160d172>] x86_64_start_kernel+0xe9/0xf0
>  [ 2298.369361] ---[ end trace 408f79c72c45f490 ]---
>  [ 2298.369807] ------------[ cut here ]------------
>  [ 2298.369809] WARNING: at net/core/dev.c:1876 
>  skb_gso_segment+0x156/0x2b8()
>  [ 2298.369810] Hardware name: RS700-E6/ERS4
>  [ 2298.369811] igb: caps=(0x21143b3, 0x0) len=2948 data_len=0 
>  ip_summed=0
>  [ 2298.369812] Modules linked in: act_gact act_csum act_pedit cls_u32 
>  sch_htb ipv6 i2c_i801 i7core_edac i2c_core edac_core rtc_cmos igb e1000e 
>  rtc_core rtc_lib iTCO_wdt ghes hed button
>  [ 2298.369818] Pid: 5867, comm: syslog-ng Tainted: G        W   
>  2.6.39-rc7-insat #1
>  [ 2298.369819] Call Trace:
>  [ 2298.369820]  <IRQ>  [<ffffffff810338cf>] 
>  warn_slowpath_common+0x80/0x98
>  [ 2298.369824]  [<ffffffff8103397b>] warn_slowpath_fmt+0x41/0x43
>  [ 2298.369826]  [<ffffffff8132a10d>] skb_gso_segment+0x156/0x2b8
>  [ 2298.369828]  [<ffffffff8132a700>] dev_hard_start_xmit+0x37c/0x522
>  [ 2298.369830]  [<ffffffff81341135>] sch_direct_xmit+0x67/0x18d
>  [ 2298.369831]  [<ffffffff81341365>] __qdisc_run+0x10a/0x126
>  [ 2298.369833]  [<ffffffff8132ac39>] dev_queue_xmit+0x393/0x4c3
>  [ 2298.369835]  [<ffffffff81332a8a>] neigh_resolve_output+0x28f/0x2ed
>  [ 2298.369837]  [<ffffffff8134b8da>] ? nf_hook_slow+0x6d/0x108
>  [ 2298.369838]  [<ffffffff81357ec4>] ? ip_options_build+0x143/0x143
>  [ 2298.369840]  [<ffffffff8135a964>] ip_finish_output+0x260/0x2a3
>  [ 2298.369841]  [<ffffffff8135aa4d>] ip_output+0xa6/0xad
>  [ 2298.369843]  [<ffffffff81359d79>] ? __ip_local_out+0x9c/0x9e
>  [ 2298.369844]  [<ffffffff81359d9f>] ip_local_out+0x24/0x28
>  [ 2298.369846]  [<ffffffff8135a31c>] ip_queue_xmit+0x2dd/0x328
>  [ 2298.369848]  [<ffffffff8131fcd5>] ? __skb_clone+0x29/0xf2
>  [ 2298.369850]  [<ffffffff8136bbd4>] tcp_transmit_skb+0x74d/0x78b
>  [ 2298.369852]  [<ffffffff8136e389>] tcp_write_xmit+0x806/0x8f5
>  [ 2298.369854]  [<ffffffff8136b04f>] ? 
>  tcp_established_options+0x2e/0xa9
>  [ 2298.369856]  [<ffffffff8136e4c9>] 
>  __tcp_push_pending_frames+0x20/0x7c
>  [ 2298.369857]  [<ffffffff8136a685>] tcp_rcv_established+0x104/0x60f
>  [ 2298.369859]  [<ffffffff81371696>] tcp_v4_do_rcv+0x1ba/0x377
>  [ 2298.369861]  [<ffffffff81173e9f>] ? security_sock_rcv_skb+0x11/0x13
>  [ 2298.369863]  [<ffffffff81371d6f>] tcp_v4_rcv+0x51c/0x869
>  [ 2298.369865]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
>  [ 2298.369867]  [<ffffffff81355b34>] ? ip_rcv+0x28a/0x28a
>  [ 2298.369869]  [<ffffffff81355cac>] 
>  ip_local_deliver_finish+0x178/0x235
>  [ 2298.369870]  [<ffffffff81355ddb>] ip_local_deliver+0x72/0x79
>  [ 2298.369872]  [<ffffffff81355880>] ip_rcv_finish+0x2dc/0x306
>  [ 2298.369874]  [<ffffffff81355b06>] ip_rcv+0x25c/0x28a
>  [ 2298.369876]  [<ffffffff81329536>] __netif_receive_skb+0x4b8/0x4ea
>  [ 2298.369878]  [<ffffffff81329778>] netif_receive_skb+0x67/0x6e
>  [ 2298.369879]  [<ffffffff81329861>] napi_skb_finish+0x24/0x3b
>  [ 2298.369881]  [<ffffffff81329d44>] napi_gro_receive+0xa8/0xad
>  [ 2298.369884]  [<ffffffffa003668f>] igb_poll+0x783/0xaaa [igb]
>  [ 2298.369886]  [<ffffffff8100e0d3>] ? x86_pmu_enable+0x1fc/0x263
>  [ 2298.369889]  [<ffffffff810794cb>] ? perf_ctx_adjust_freq+0x29/0x10a
>  [ 2298.369890]  [<ffffffff81329e79>] net_rx_action+0xa3/0x1bb
>  [ 2298.369893]  [<ffffffff8106f3d4>] ? 
>  __rcu_process_callbacks+0x75/0x265
>  [ 2298.369895]  [<ffffffff81038537>] __do_softirq+0x83/0x114
>  [ 2298.369897]  [<ffffffff813babcc>] call_softirq+0x1c/0x30
>  [ 2298.369899]  [<ffffffff81003e0f>] do_softirq+0x33/0x68
>  [ 2298.369900]  [<ffffffff810383d4>] irq_exit+0x3f/0x88
>  [ 2298.369902]  [<ffffffff810036f6>] do_IRQ+0x98/0xaf
>  [ 2298.369904]  [<ffffffff813b9453>] common_interrupt+0x13/0x13
>  [ 2298.369905]  <EOI>  [<ffffffff8104dc74>] ? 
>  atomic_notifier_call_chain+0x13/0x15
>  [ 2298.369910]  [<ffffffff81202714>] ? do_con_write+0x57b/0x1db7
>  [ 2298.369912]  [<ffffffff812025eb>] ? do_con_write+0x452/0x1db7
>  [ 2298.369913]  [<ffffffff810340da>] ? console_unlock+0x170/0x19a
>  [ 2298.369915]  [<ffffffff811ff9b0>] ? con_flush_chars+0x3e/0x43
>  [ 2298.369917]  [<ffffffff8104daa8>] ? up+0x34/0x39
>  [ 2298.369918]  [<ffffffff81203f97>] con_write+0x11/0x26
>  [ 2298.369920]  [<ffffffff8104a0c5>] ? add_wait_queue+0x3f/0x46
>  [ 2298.369923]  [<ffffffff811f3896>] n_tty_write+0x239/0x35a
>  [ 2298.369925]  [<ffffffff8102f2a9>] ? try_to_wake_up+0x240/0x240
>  [ 2298.369926]  [<ffffffff811f0b13>] tty_write+0x19d/0x22f
>  [ 2298.369928]  [<ffffffff811f365d>] ? n_tty_ioctl+0xab/0xab
>  [ 2298.369931]  [<ffffffff810b3e82>] vfs_write+0xae/0x133
>  [ 2298.369933]  [<ffffffff810b3fc0>] sys_write+0x45/0x6c
>  [ 2298.369935]  [<ffffffff813b9a7b>] system_call_fastpath+0x16/0x1b
>  [ 2298.369936] ---[ end trace 408f79c72c45f491 ]---
> 
> --

Hi Denys

net/sched/act_csum.c sets skb->ip_summed = CHECKSUM_NONE;  after csum
changes.

and skb_gso_segment() barfs if skb->ip_summed != CHECKSUM_PARTIAL

You could disable GRO for the time being.




^ permalink raw reply

* [gianfar PATCH] RFC v2: Add rx_ntuple feature
From: Sebastian Pöhn @ 2011-05-17 13:18 UTC (permalink / raw)
  To: netdev; +Cc: sebastian.poehn

This patch implements rx ntuple filtering for the gianfar driver.
Changes since last version:
#Added code is now re-entrant
#Consolidation of convert routines

I am planing to do some final testing. After that I hope I can send the
final patch.

Signed-off-by: Sebastian Poehn <sebastian.poehn@belden.com>
---

 drivers/net/gianfar.c         |   16 +-
 drivers/net/gianfar.h         |   44 ++
 drivers/net/gianfar_ethtool.c | 1006
+++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1062 insertions(+), 4 deletions(-)

diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index ff60b23..ddd4007 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -658,6 +658,11 @@ static int gfar_of_init(struct platform_device
*ofdev, struct net_device **pdev)
 	priv->num_rx_queues = num_rx_qs;
 	priv->num_grps = 0x0;

+	/* Init Rx queue filer rule set linked list*/
+	INIT_LIST_HEAD(&priv->ntuple_list.list);
+	priv->ntuple_list.count = 0;
+	mutex_init(&priv->rx_queue_access);
+
 	model = of_get_property(np, "model", NULL);

 	for (i = 0; i < MAXGROUPS; i++)
@@ -751,7 +756,8 @@ static int gfar_of_init(struct platform_device *ofdev,
struct net_device **pdev)
 			FSL_GIANFAR_DEV_HAS_VLAN |
 			FSL_GIANFAR_DEV_HAS_MAGIC_PACKET |
 			FSL_GIANFAR_DEV_HAS_EXTENDED_HASH |
-			FSL_GIANFAR_DEV_HAS_TIMER;
+			FSL_GIANFAR_DEV_HAS_TIMER |
+			FSL_GIANFAR_DEV_HAS_RX_FILER;

 	ctype = of_get_property(np, "phy-connection-type", NULL);

@@ -1042,6 +1048,9 @@ static int gfar_probe(struct platform_device *ofdev)
 	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_VLAN)
 		dev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;

+	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_RX_FILER)
+		dev->features |= NETIF_F_NTUPLE;
+
 	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_EXTENDED_HASH) {
 		priv->extended_hash = 1;
 		priv->hash_width = 9;
@@ -1151,9 +1160,8 @@ static int gfar_probe(struct platform_device *ofdev)
 		priv->rx_queue[i]->rxic = DEFAULT_RXIC;
 	}

-	/* enable filer if using multiple RX queues*/
-	if(priv->num_rx_queues > 1)
-		priv->rx_filer_enable = 1;
+	/* always enable rx filer*/
+	priv->rx_filer_enable = 1;
 	/* Enable most messages by default */
 	priv->msg_enable = (NETIF_MSG_IFUP << 1 ) - 1;

diff --git a/drivers/net/gianfar.h b/drivers/net/gianfar.h
index fc86f51..7897e81 100644
--- a/drivers/net/gianfar.h
+++ b/drivers/net/gianfar.h
@@ -168,6 +168,7 @@ extern const char gfar_driver_version[];
 #define MACCFG2_LENGTHCHECK	0x00000010
 #define MACCFG2_MPEN		0x00000008

+#define ECNTRL_FIFM		0x00008000
 #define ECNTRL_INIT_SETTINGS	0x00001000
 #define ECNTRL_TBI_MODE         0x00000020
 #define ECNTRL_REDUCED_MODE	0x00000010
@@ -271,6 +272,7 @@ extern const char gfar_driver_version[];
 #define RCTRL_TUCSEN		0x00000100
 #define RCTRL_PRSDEP_MASK	0x000000c0
 #define RCTRL_PRSDEP_INIT	0x000000c0
+#define RCTRL_PRSFM		0x00000020
 #define RCTRL_PROM		0x00000008
 #define RCTRL_EMEN		0x00000002
 #define RCTRL_REQ_PARSER	(RCTRL_VLEX | RCTRL_IPCSEN | \
@@ -870,6 +872,7 @@ struct gfar {
 #define FSL_GIANFAR_DEV_HAS_BD_STASHING		0x00000200
 #define FSL_GIANFAR_DEV_HAS_BUF_STASHING	0x00000400
 #define FSL_GIANFAR_DEV_HAS_TIMER		0x00000800
+#define FSL_GIANFAR_DEV_HAS_RX_FILER		0x00001000

 #if (MAXGROUPS == 2)
 #define DEFAULT_MAPPING 	0xAA
@@ -1066,6 +1069,9 @@ struct gfar_private {

 	struct vlan_group *vlgrp;

+	/* RX queue filer rule set*/
+	struct ethtool_rx_ntuple_list ntuple_list;
+	struct mutex rx_queue_access;

 	/* Hash registers and their width */
 	u32 __iomem *hash_regs[16];
@@ -1140,6 +1146,16 @@ static inline void gfar_write_filer(struct
gfar_private *priv,
 	gfar_write(&regs->rqfpr, fpr);
 }

+static inline void gfar_read_filer(struct gfar_private *priv,
+		unsigned int far, unsigned int *fcr, unsigned int *fpr)
+{
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+
+	gfar_write(&regs->rqfar, far);
+	*fcr = gfar_read(&regs->rqfcr);
+	*fpr = gfar_read(&regs->rqfpr);
+}
+
 extern void lock_rx_qs(struct gfar_private *priv);
 extern void lock_tx_qs(struct gfar_private *priv);
 extern void unlock_rx_qs(struct gfar_private *priv);
@@ -1157,4 +1173,32 @@ int gfar_set_features(struct net_device *dev, u32
features);

 extern const struct ethtool_ops gfar_ethtool_ops;

+#define ESWFULL 160
+#define EHWFULL 161
+#define EOUTOFRANGE 162
+/*XXX: Drop in final*/
+#define PRINT_MAX 255
+#define MAX_FILER_CACHE_IDX (2*(MAX_FILER_IDX))
+
+struct gfar_mask_entry {
+	unsigned int mask; /*The mask value which is valid for a block with*/
+	unsigned int start; /*start*/
+	unsigned int end; /*till end*/
+	unsigned int block; /*Same block values indicate depended entries*/
+};
+
+/*Represents a receive filer table entry */
+struct gfar_filer_entry {
+	u32 ctrl; /*The control field from HW*/
+	u32 prop; /*The property field from HW*/
+};
+
+/*Full table*/
+	/*Only temporary needed! The 20 additional
+	* entries are a shadow for one extra element*/
+struct filer_table {
+	u32 index;
+	struct gfar_filer_entry fe[MAX_FILER_CACHE_IDX + 20];
+};
+
 #endif /* __GIANFAR_H */
diff --git a/drivers/net/gianfar_ethtool.c b/drivers/net/gianfar_ethtool.c
index 493d743..98561b5 100644
--- a/drivers/net/gianfar_ethtool.c
+++ b/drivers/net/gianfar_ethtool.c
@@ -38,6 +38,8 @@
 #include <linux/mii.h>
 #include <linux/phy.h>

+#include <linux/sort.h>
+
 #include "gianfar.h"

 extern void gfar_start(struct net_device *dev);
@@ -787,6 +789,1009 @@ static int gfar_set_nfc(struct net_device *dev,
struct ethtool_rxnfc *cmd)
 	return ret;
 }

+static int gfar_check_filer_hardware(struct gfar_private *priv)
+{
+	struct gfar __iomem *regs = NULL;
+	u32 i;
+
+	regs = priv->gfargrp[0].regs;
+
+	/*Check if we are in FIFO mode*/
+	i = gfar_read(&regs->ecntrl);
+	i &= ECNTRL_FIFM;
+	if (i == ECNTRL_FIFM) {
+		printk(KERN_WARNING "Interface in FIFO mode\n");
+		i = gfar_read(&regs->rctrl);
+		i &= RCTRL_PRSDEP_MASK | RCTRL_PRSFM;
+		if (i == (RCTRL_PRSDEP_MASK | RCTRL_PRSFM)) {
+			printk(KERN_WARNING
+			"Receive Queue Filtering is enabled\n");
+		} else {
+			printk(KERN_WARNING
+			"Receive Queue Filtering is disabled\n");
+			return -1;
+		}
+	}
+	/*Or in standard mode*/
+	else{
+		i = gfar_read(&regs->rctrl);
+		i &= RCTRL_PRSDEP_MASK;
+		if (i == RCTRL_PRSDEP_MASK) {
+			printk(KERN_WARNING
+			"Receive Queue Filtering is enabled\n");
+		} else {
+			printk(KERN_WARNING
+			"Receive Queue Filtering is disabled\n");
+			return -1;
+		}
+	}
+
+	/* Sets the properties for arbitrary filer rule
+	 * to the first 4 Layer 4 Bytes*/
+	regs->rbifx = 0xC0C1C2C3;
+	return 0;
+}
+
+static int gfar_comp_asc(const void *a, const void *b)
+{
+	if (*(u32 *) a > *(u32 *) b)
+		return 1;
+	else if (*(u32 *) a == *(u32 *) b)
+		return 0;
+	else
+		return -1;
+}
+
+static int gfar_comp_desc(const void *a, const void *b)
+{
+	if (*(u32 *) a > *(u32 *) b)
+		return -1;
+	else if (*(u32 *) a == *(u32 *) b)
+		return 0;
+	else
+		return 1;
+}
+
+static void gfar_swap(void *a, void *b, int size)
+{
+	u32 t1 = *(u32 *) a;
+	u32 t2 = *(u32 *) (a + 4);
+	u32 t3 = *(u32 *) (a + 8);
+	u32 t4 = *(u32 *) (a + 12);
+	*(u32 *) a = *(u32 *) b;
+	*(u32 *) (a + 4) = *(u32 *) (b + 4);
+	*(u32 *) (a + 8) = *(u32 *) (b + 8);
+	*(u32 *) (a + 12) = *(u32 *) (b + 12);
+	*(u32 *) b = t1;
+	*(u32 *) (b + 4) = t2;
+	*(u32 *) (b + 8) = t3;
+	*(u32 *) (b + 12) = t4;
+}
+
+/*Write a mask to filer cache*/
+static void gfar_set_mask(u32 mask, struct filer_table *tab)
+{
+	tab->fe[tab->index].ctrl = RQFCR_AND | RQFCR_PID_MASK
+		| RQFCR_CMP_EXACT;
+	tab->fe[tab->index].prop = mask;
+	tab->index++;
+}
+
+/*Sets parse bits (e.g. IP or TCP)*/
+static void gfar_set_parse_bits(u32 value, u32 mask, struct filer_table
*tab)
+{
+	gfar_set_mask(mask, tab);
+	tab->fe[tab->index].ctrl = RQFCR_CMP_EXACT | RQFCR_PID_PARSE
+			| RQFCR_AND;
+	tab->fe[tab->index].prop = value;
+	tab->index++;
+}
+
+static void gfar_set_general_attribute(u32 value,
+		 u32 mask, u32 flag, struct filer_table *tab)
+{
+		gfar_set_mask(mask, tab);
+		tab->fe[tab->index].ctrl = RQFCR_CMP_EXACT | RQFCR_AND
+				| flag;
+		tab->fe[tab->index].prop = value;
+		tab->index++;
+}
+
+#define RQFCR_PID_PRI_IRANGE 0xFFFFFFF8
+#define RQFCR_PID_L4P_IRANGE 0xFFFFFF00
+#define RQFCR_PID_VID_IRANGE 0xFFFFF000
+#define RQFCR_PID_PORT_IRANGE 0xFFFF0000
+#define RQFCR_PID_MAC_IRANGE 0xFF000000
+
+/*
+ * For setting a tuple of value and mask of type flag
+ * Example:
+ * IP-Src = 10.0.0.0/255.0.0.0
+ * value: 0x0A000000 mask: FF000000 flag: RQFPR_IPV4
+ *
+ * Ethtool gives us a value=0 and mask=~0 for don't care a tuple
+ * For a don't care mask it gives us a 0
+ *
+ * The check if don't care and the mask adjustment if mask=0 is done for
VLAN
+ * and MAC stuff on an upper level (due to missing information on this
level).
+ * For these guys we can discard them if they are value=0 and mask=0.
+ *
+ * Further the all masks are one-padded for better hardware efficiency.
+ */
+static void gfar_set_attribute(u32 value, u32 mask,
+		u32 flag, struct filer_table *tab)
+{
+	switch (flag) {
+		/*3bit*/
+	case RQFCR_PID_PRI:
+		if (!(value | mask))
+			return;
+		mask |= RQFCR_PID_PRI_IRANGE;
+		break;
+		/*8bit*/
+	case RQFCR_PID_L4P:
+	case RQFCR_PID_TOS:
+		if (!~(mask|RQFCR_PID_L4P_IRANGE))
+			return;
+		if (!mask)
+			mask = ~0;
+		else
+			mask |= RQFCR_PID_L4P_IRANGE;
+		break;
+		/*12bit*/
+	case RQFCR_PID_VID:
+		if (!(value | mask))
+			return;
+		mask |= RQFCR_PID_VID_IRANGE;
+		break;
+		/*16bit*/
+	case RQFCR_PID_DPT:
+	case RQFCR_PID_SPT:
+	case RQFCR_PID_ETY:
+		if (!~(mask | RQFCR_PID_PORT_IRANGE))
+			return;
+		if (!mask)
+			mask = ~0;
+		else
+			mask |= RQFCR_PID_PORT_IRANGE;
+		break;
+		/*24bit*/
+	case RQFCR_PID_DAH:
+	case RQFCR_PID_DAL:
+	case RQFCR_PID_SAH:
+	case RQFCR_PID_SAL:
+		if (!(value | mask))
+			return;
+		mask |= RQFCR_PID_MAC_IRANGE;
+		break;
+		/*for all real 32bit masks*/
+	default:
+		if (!~mask)
+			return;
+		if (!mask)
+			mask = ~0;
+		break;
+	}
+	gfar_set_general_attribute(value, mask, flag, tab);
+}
+
+/*Translates value and mask for UDP,TCP or SCTP*/
+static void gfar_set_basic_ip(struct ethtool_tcpip4_spec *value,
+		struct ethtool_tcpip4_spec *mask, struct filer_table *tab)
+{
+	gfar_set_attribute(value->ip4src, mask->ip4src, RQFCR_PID_SIA, tab);
+	gfar_set_attribute(value->ip4dst, mask->ip4dst, RQFCR_PID_DIA, tab);
+	gfar_set_attribute(value->pdst, mask->pdst, RQFCR_PID_DPT, tab);
+	gfar_set_attribute(value->psrc, mask->psrc, RQFCR_PID_SPT, tab);
+	gfar_set_attribute(value->tos, mask->tos, RQFCR_PID_TOS, tab);
+}
+
+/*Translates value and mask for USER-IP4*/
+static void gfar_set_user_ip(struct ethtool_usrip4_spec *value,
+		struct ethtool_usrip4_spec *mask, struct filer_table *tab)
+{
+	gfar_set_attribute(value->ip4src, mask->ip4src, RQFCR_PID_SIA, tab);
+	gfar_set_attribute(value->ip4dst, mask->ip4dst, RQFCR_PID_DIA, tab);
+	gfar_set_attribute(value->tos, mask->tos, RQFCR_PID_TOS, tab);
+	gfar_set_attribute(value->proto, mask->proto, RQFCR_PID_L4P, tab);
+	gfar_set_attribute(value->l4_4_bytes, mask->l4_4_bytes, RQFCR_PID_ARB,
+			tab);
+
+}
+
+/*Translates value and mask for ETHER spec*/
+static void gfar_set_ether(struct ethhdr *value, struct ethhdr *mask,
+		struct filer_table *tab)
+{
+	u32 upper_temp_mask = 0;
+	u32 lower_temp_mask = 0;
+	/*Source address*/
+	if (!is_broadcast_ether_addr(mask->h_source)) {
+
+		if (is_zero_ether_addr(mask->h_source)) {
+			upper_temp_mask = 0xFFFFFFFF;
+			lower_temp_mask = 0xFFFFFFFF;
+		} else {
+			upper_temp_mask = mask->h_source[0] << 16
+					| mask->h_source[1] << 8
+					| mask->h_source[2] ;
+			lower_temp_mask = mask->h_source[3] << 16
+					| mask->h_source[4] << 8
+					| mask->h_source[5] ;
+		}
+		/*Upper 24bit*/
+		gfar_set_attribute(value->h_source[0] << 16
+				| value->h_source[1] << 8
+				| value->h_source[2],
+				upper_temp_mask, RQFCR_PID_SAH, tab);
+		/*And the same for the lower part*/
+		gfar_set_attribute(value->h_source[3] << 16
+				| value->h_source[4] << 8
+				| value->h_source[5],
+				lower_temp_mask, RQFCR_PID_SAL, tab);
+	}
+	/*Destination address*/
+	if (!is_broadcast_ether_addr(mask->h_dest)) {
+
+		/*Special for destination is limited broadcast*/
+		if ((is_broadcast_ether_addr(value->h_dest)
+				&& is_zero_ether_addr(mask->h_dest))) {
+			gfar_set_parse_bits(RQFPR_EBC, RQFPR_EBC, tab);
+		} else {
+
+			if (is_zero_ether_addr(mask->h_dest)) {
+				upper_temp_mask = 0xFFFFFFFF;
+				lower_temp_mask = 0xFFFFFFFF;
+			} else {
+				upper_temp_mask = mask->h_dest[0] << 16
+						| mask->h_dest[1] << 8
+						| mask->h_dest[2] ;
+				lower_temp_mask = mask->h_dest[3] << 16
+						| mask->h_dest[4] << 8
+						| mask->h_dest[5] ;
+			}
+
+			/*Upper 24bit*/
+			gfar_set_attribute(value->h_dest[0] << 16
+					| value->h_dest[1] << 8
+					| value->h_dest[2],
+					upper_temp_mask, RQFCR_PID_DAH, tab);
+			/*And the same for the lower part*/
+			gfar_set_attribute(value->h_dest[3] << 16
+					| value->h_dest[4] << 8
+					| value->h_dest[5],
+					lower_temp_mask, RQFCR_PID_DAL, tab);
+		}
+	}
+
+	gfar_set_attribute(value->h_proto, mask->h_proto, RQFCR_PID_ETY, tab);
+
+}
+
+/*Convert a ethtool_rx_ntuple to binary filter format of gianfar*/
+static int gfar_convert_to_filer(struct ethtool_rx_ntuple_flow_spec *rule,
+		struct filer_table *tab)
+{
+	u32 vlan = 0, vlan_mask = 0;
+	u32 id = 0, id_mask = 0;
+	u32 cfi = 0, cfi_mask = 0;
+	u32 prio = 0, prio_mask = 0;
+
+	u32 old_index = tab->index;
+
+	/*Check if vlan is wanted*/
+	if (rule->vlan_tag_mask != 0xFFFF) {
+		if (!rule->vlan_tag_mask)
+			rule->vlan_tag_mask = 0xFFFF;
+
+		vlan = RQFPR_VLN;
+		vlan_mask = RQFPR_VLN;
+
+		/*Separate the fields*/
+		id = rule->vlan_tag & 0xFFF;
+		id_mask = rule->vlan_tag_mask & 0xFFF;
+		cfi = (rule->vlan_tag >> 12) & 1;
+		cfi_mask = (rule->vlan_tag_mask >> 12) & 1;
+		prio = (rule->vlan_tag >> 13) & 0x7;
+		prio_mask = (rule->vlan_tag_mask >> 13) & 0x7;
+
+		if (cfi == 1 && cfi_mask == 1) {
+			vlan |= RQFPR_CFI;
+			vlan_mask |= RQFPR_CFI;
+		} else if (cfi == 0 && cfi_mask == 1) {
+			vlan_mask |= RQFPR_CFI;
+		}
+	}
+
+	switch (rule->flow_type) {
+	case TCP_V4_FLOW:
+		gfar_set_parse_bits(RQFPR_IPV4 | RQFPR_TCP | vlan, RQFPR_IPV4
+				| RQFPR_TCP | vlan_mask, tab);
+		gfar_set_basic_ip((struct ethtool_tcpip4_spec *) &rule->h_u,
+				(struct ethtool_tcpip4_spec *) &rule->m_u, tab);
+
+		break;
+	case UDP_V4_FLOW:
+		gfar_set_parse_bits(RQFPR_IPV4 | RQFPR_UDP | vlan, RQFPR_IPV4
+				| RQFPR_UDP | vlan_mask, tab);
+		gfar_set_basic_ip((struct ethtool_tcpip4_spec *) &rule->h_u,
+				(struct ethtool_tcpip4_spec *) &rule->m_u, tab);
+		break;
+	case SCTP_V4_FLOW:
+		gfar_set_parse_bits(RQFPR_IPV4 | vlan, RQFPR_IPV4 | vlan_mask,
+				tab);
+		gfar_set_attribute(132, 0xFFFFFFFF, RQFCR_PID_L4P, tab);
+		gfar_set_basic_ip((struct ethtool_tcpip4_spec *) &rule->h_u,
+				(struct ethtool_tcpip4_spec *) &rule->m_u, tab);
+		break;
+	case IP_USER_FLOW:
+		gfar_set_parse_bits(RQFPR_IPV4 | vlan, RQFPR_IPV4 | vlan_mask,
+				tab);
+		gfar_set_user_ip((struct ethtool_usrip4_spec *) &rule->h_u,
+				(struct ethtool_usrip4_spec *) &rule->m_u, tab);
+		break;
+	case ETHER_FLOW:
+		if (vlan)
+			gfar_set_parse_bits(vlan, vlan_mask, tab);
+		gfar_set_ether((struct ethhdr *) &rule->h_u,
+				(struct ethhdr *) &rule->m_u, tab);
+		break;
+	default:
+		return -1;
+	}
+
+	/*Set the vlan attributes in the end*/
+	if (vlan) {
+		gfar_set_attribute(id, id_mask,	RQFCR_PID_VID, tab);
+		gfar_set_attribute(prio, prio_mask, RQFCR_PID_PRI, tab);
+	}
+
+	/*If there has been nothing written till now, it must be a default*/
+	if (tab->index == old_index) {
+		gfar_set_mask(0xFFFFFFFF, tab);
+		tab->fe[tab->index].ctrl = 0x20;
+		tab->fe[tab->index].prop = 0x0;
+		tab->index++;
+	}
+
+	/*Remove last AND*/
+	tab->fe[tab->index - 1].ctrl &= (~RQFCR_AND);
+
+	/*Specify which queue to use or to drop*/
+	if (rule->action == ETHTOOL_RXNTUPLE_ACTION_DROP)
+		tab->fe[tab->index - 1].ctrl |= RQFCR_RJE;
+	else
+		tab->fe[tab->index - 1].ctrl |= (rule->action << 10);
+
+	/*Only big enough entries can be clustered*/
+	if (tab->index > (old_index + 2)) {
+		tab->fe[old_index + 1].ctrl |= RQFCR_CLE;
+		tab->fe[tab->index - 1].ctrl |= RQFCR_CLE;
+	}
+
+	/*In rare cases the cache can be full while there is free space in hw*/
+	if (tab->index > MAX_FILER_CACHE_IDX - 1)
+		return -ESWFULL;
+
+	return 0;
+}
+
+/*XXX: Drop this in final version!*/
+/*For debugging*/
+void print_hw(struct gfar_private *p)
+{
+
+	int i = 0;
+	unsigned int a, b;
+	printk(KERN_DEBUG "No.  Control   Properties\n");
+	for (i = 0; i < PRINT_MAX; i++) {
+		gfar_read_filer(p, i, &a, &b);
+		printk(KERN_DEBUG "%3d  %08x  %08x\n", i, a, b);
+	}
+}
+
+/*Copy size filer entries*/
+static void gfar_copy_filer_entries(struct gfar_filer_entry dst[0],
+		struct gfar_filer_entry src[0], s32 size)
+{
+	while (size > 0) {
+		size--;
+		dst[size].ctrl = src[size].ctrl;
+		dst[size].prop = src[size].prop;
+	}
+}
+
+/* Delete the contents of the filer-table between start and end
+ * and collapse them*/
+static int gfar_trim_filer_entries(int begin, int end, struct filer_table
*tab)
+{
+	int length;
+	if (end > MAX_FILER_CACHE_IDX || begin < 0 || end < begin)
+		return -EOUTOFRANGE;
+
+	end++;
+	length = end - begin;
+
+	/*Copy*/
+	while (end < tab->index) {
+		tab->fe[begin].ctrl = tab->fe[end].ctrl;
+		tab->fe[begin++].prop = tab->fe[end++].prop;
+
+	}
+	/*Fill up with don't cares*/
+	while (begin <= tab->index) {
+		tab->fe[begin].ctrl = 0x60;
+		tab->fe[begin].prop = 0xFFFFFFFF;
+		begin++;
+	}
+
+	tab->index -= length;
+	return 0;
+}
+
+/*Make space on the wanted location*/
+static int gfar_expand_filer_entries(int begin, int length,
+		struct filer_table *tab)
+{
+	int i = 0;
+	if (begin < 0 || length <= 0 || length + tab->index
+			> MAX_FILER_CACHE_IDX || begin > MAX_FILER_CACHE_IDX)
+		return -EOUTOFRANGE;
+
+	/*Copy*/
+	gfar_copy_filer_entries(&(tab->fe[begin + length]),
+			&(tab->fe[begin]),
+			tab->index - length + 1);
+
+	/*XXX: Can be dropped in final version*/
+	/*Fill up with zeros*/
+	i = length;
+	while (i > 0) {
+		tab->fe[i + begin].ctrl = 0;
+		tab->fe[i + begin].prop = 0;
+		i--;
+	}
+
+	tab->index += length;
+	return 0;
+}
+
+static int gfar_get_next_cluster_start(int start, struct filer_table *tab)
+{
+	for (; (start < tab->index) && (start < MAX_FILER_CACHE_IDX - 1);
+	 start++) {
+		if ((tab->fe[start].ctrl & (RQFCR_AND | RQFCR_CLE))
+				== (RQFCR_AND | RQFCR_CLE)) {
+			return start;
+		}
+	}
+	return -1;
+}
+
+static int gfar_get_next_cluster_end(int start, struct filer_table *tab)
+{
+	for (; (start < tab->index) && (start < MAX_FILER_CACHE_IDX - 1);
+	 start++) {
+		if ((tab->fe[start].ctrl & (RQFCR_AND | RQFCR_CLE))
+				== (RQFCR_CLE))
+			return start;
+	}
+	return -1;
+}
+
+/*
+ * Uses hardwares clustering option to reduce
+ * the number of filer table entries
+ */
+static void gfar_cluster_filer(struct filer_table *tab)
+{
+	s32 i = -1, j, iend, jend;
+
+	while ((i = gfar_get_next_cluster_start(++i, tab)) != -1) {
+		j = i;
+		while ((j = gfar_get_next_cluster_start(++j, tab)) != -1) {
+			/*
+			 * The cluster entries self and the previous one
+			 * (a mask) must be identical!
+			 */
+			if (tab->fe[i].ctrl != tab->fe[j].ctrl)
+				break;
+			if (tab->fe[i].prop != tab->fe[j].prop)
+				break;
+			if (tab->fe[i - 1].ctrl != tab->fe[j - 1].ctrl)
+				break;
+			if (tab->fe[i - 1].prop != tab->fe[j - 1].prop)
+				break;
+			iend = gfar_get_next_cluster_end(i, tab);
+			jend = gfar_get_next_cluster_end(j, tab);
+			if (jend == -1 || iend == -1)
+				break;
+			/*
+			* First we make some free space, where our cluster
+			* element should be. Then we copy it there and finally
+			* delete in from its old location.
+			*/
+
+			if (gfar_expand_filer_entries(iend, (jend - j), tab)
+					== -EOUTOFRANGE)
+				break;
+
+			gfar_copy_filer_entries(&(tab->fe[iend + 1]),
+				&(tab->fe[jend + 1]), jend - j);
+
+			if (gfar_trim_filer_entries(jend - 1, jend + (jend - j),
+					tab) == -EOUTOFRANGE)
+				return;
+
+			/*Mask out cluster bit*/
+			tab->fe[iend].ctrl &= ~(RQFCR_CLE);
+		}
+	}
+}
+
+/*Swaps the 0xFF80 masked bits of a1<>a2 and b1<>b2*/
+static void gfar_swap_ff80_bits(struct gfar_filer_entry *a1,
+		struct gfar_filer_entry *a2, struct gfar_filer_entry *b1,
+		struct gfar_filer_entry *b2)
+{
+	u32 temp[4];
+	temp[0] = a1->ctrl & 0xFF80;
+	temp[1] = a2->ctrl & 0xFF80;
+	temp[2] = b1->ctrl & 0xFF80;
+	temp[3] = b2->ctrl & 0xFF80;
+
+	a1->ctrl &= ~0xFF80;
+	a2->ctrl &= ~0xFF80;
+	b1->ctrl &= ~0xFF80;
+	b2->ctrl &= ~0xFF80;
+
+	a1->ctrl |= temp[1];
+	a2->ctrl |= temp[0];
+	b1->ctrl |= temp[3];
+	b2->ctrl |= temp[2];
+}
+
+/* Generate a list consisting of masks values with their start and
+ * end of validity and block as indicator for parts belonging
+ * together (glued by ANDs) in mask_table*/
+u32 gfar_generate_mask_table(struct gfar_mask_entry *mask_table,
+		struct filer_table *tab)
+{
+	u32 i, and_index = 0, block_index = 1;
+
+	for (i = 0; i < tab->index; i++) {
+
+		/*LSByte of control = 0 sets a mask*/
+		if ((tab->fe[i].ctrl & 0xF) == 0) {
+			mask_table[and_index].mask = tab->fe[i].prop;
+			mask_table[and_index].start = i;
+			mask_table[and_index].block = block_index;
+			if (and_index >= 1)
+				mask_table[and_index - 1].end = i - 1;
+			and_index++;
+		}
+		/*cluster starts will be separated because they should
+		* hold their position*/
+		if (tab->fe[i].ctrl & RQFCR_CLE)
+			block_index++;
+		/*A not set AND indicated the end of a depended block*/
+		if (!(tab->fe[i].ctrl & RQFCR_AND))
+			block_index++;
+
+	}
+
+	mask_table[and_index - 1].end = i - 1;
+
+	return and_index;
+}
+
+/*
+* Sorts the entries of mask_table by the values of the masks.
+* Important: The 0xFF80 flags of the first and last entry of a
+* block must hold their position (which queue, CLusterEnable, ReJEct,
+* AND)
+*/
+void gfar_sort_mask_table(struct gfar_mask_entry *mask_table,
+		struct filer_table *temp_table, u32 and_index)
+{
+	/*Pointer to compare function (_asc or _desc)*/
+	int (*gfar_comp) (const void *, const void *);
+
+	u32 i, size = 0, start = 0, prev = 1;
+	u32 old_first, old_last, new_first, new_last;
+
+	gfar_comp = &gfar_comp_desc;
+
+	for (i = 0; i < and_index; i++) {
+
+		if (prev != mask_table[i].block) {
+			old_first = mask_table[start].start + 1;
+			old_last = mask_table[i - 1].end;
+			/*I my opinion start should be multiplied by
+			* sizeof(struct gfar_mask_entry) do not ask me why
+			* only this version is working */
+			sort(mask_table + start, size,
+					sizeof(struct gfar_mask_entry),
+					gfar_comp, &gfar_swap);
+
+			/*Toggle order for every block. This makes the
+			* thing more efficient! Believe me!*/
+			if (gfar_comp == gfar_comp_desc)
+				gfar_comp = &gfar_comp_asc;
+			else
+				gfar_comp = &gfar_comp_desc;
+
+			new_first = mask_table[start].start + 1;
+			new_last = mask_table[i - 1].end;
+
+			gfar_swap_ff80_bits(&temp_table->fe[new_first],
+					&temp_table->fe[old_first],
+					&temp_table->fe[new_last],
+					&temp_table->fe[old_last]);
+
+			start = i;
+			size = 0;
+		}
+		size++;
+		prev = mask_table[i].block;
+	}
+}
+
+/*
+ * Reduces the number of masks needed in the filer table to save entries
+ * This is done by sorting the masks of a depended block. A depended
block is
+ * identified by gluing ANDs or CLE. The sorting order toggles after every
+ * block. Of course entries in scope of a mask must change their location
with
+ * it.
+*/
+static int gfar_optimize_filer_masks(struct filer_table *tab)
+{
+	struct filer_table *temp_table;
+	struct gfar_mask_entry *mask_table;
+
+	u32 and_index = 0, previous_mask = 0, i = 0, j = 0, size = 0;
+	s32 ret = 0;
+
+	/*We need a copy of the filer table because
+	* we want to change its order*/
+	temp_table = kmalloc(sizeof(struct filer_table), GFP_KERNEL);
+	if (temp_table == NULL)
+		return -ENOMEM;
+	memcpy(temp_table, tab, sizeof(struct filer_table));
+
+	mask_table = kzalloc(sizeof(struct gfar_mask_entry)
+			* (MAX_FILER_CACHE_IDX / 2 + 1), GFP_KERNEL);
+	if (mask_table == NULL) {
+		ret = -ENOMEM;
+		goto end;
+	}
+
+	and_index = gfar_generate_mask_table(mask_table, tab);
+
+	gfar_sort_mask_table(mask_table, temp_table, and_index);
+
+	/*Now we can copy the data from our duplicated filer table to
+	* the real one in the order the mask table says*/
+	for (i = 0; i < and_index; i++) {
+		size = mask_table[i].end - mask_table[i].start + 1;
+		gfar_copy_filer_entries(&(tab->fe[j]),
+				&(temp_table->fe[mask_table[i].start]), size);
+		j += size;
+	}
+
+	/* And finally we just have to check for duplicated masks and drop the
+	 * second ones*/
+	for (i = 0; i < tab->index && i < MAX_FILER_CACHE_IDX; i++) {
+		if (tab->fe[i].ctrl == 0x80) {
+			previous_mask = i++;
+			break;
+		}
+	}
+	for (; i < tab->index && i < MAX_FILER_CACHE_IDX; i++) {
+		if (tab->fe[i].ctrl == 0x80) {
+			if (tab->fe[i].prop == tab->fe[previous_mask].prop) {
+				/*Two identical ones found!
+				* So drop the second one!*/
+				gfar_trim_filer_entries(i, i, tab);
+
+			} else
+				/*Not identical!*/
+				previous_mask = i;
+		}
+	}
+
+	kfree(mask_table);
+end:	kfree(temp_table);
+	return ret;
+}
+
+/*Write the bit-pattern from software's buffer to hardware registers*/
+static int gfar_write_filer_table(struct gfar_private *priv,
+		struct filer_table *tab)
+{
+	u32 i = 0;
+	if (tab->index > MAX_FILER_IDX - 1)
+		return -EHWFULL;
+
+	/*Avoid inconsistent filer table to be processed*/
+	lock_rx_qs(priv);
+
+	/*Fill regular entries*/
+	for (; i < MAX_FILER_IDX - 1 &&
+			(tab->fe[i].ctrl | tab->fe[i].ctrl) ; i++)
+		gfar_write_filer(priv, i, tab->fe[i].ctrl,
+				tab->fe[i].prop);
+	/*Fill the rest with fall-troughs*/
+	for (; i < MAX_FILER_IDX - 1; i++)
+		gfar_write_filer(priv, i, 0x60, 0xFFFFFFFF);
+	/* Last entry must be default accept
+	 * because that's what people expect*/
+	gfar_write_filer(priv, i, 0x20, 0x0);
+
+	unlock_rx_qs(priv);
+
+	return 0;
+}
+
+static int gfar_add_table_entry(struct ethtool_rx_ntuple_flow_spec *flow,
+		struct ethtool_rx_ntuple_list *list)
+{
+	struct ethtool_rx_ntuple_flow_spec_container *temp;
+	temp = kmalloc(sizeof(struct ethtool_rx_ntuple_flow_spec_container),
+			GFP_KERNEL);
+	if (temp == NULL)
+		return -ENOMEM;
+	memcpy(&temp->fs, flow, sizeof(struct ethtool_rx_ntuple_flow_spec));
+	list_add_tail(&temp->list, &list->list);
+	list->count++;
+
+	if (~flow->data_mask)
+		printk(KERN_WARNING
+			"User-specific data is not supported by hardware!\n");
+	if (flow->flow_type == IP_USER_FLOW)
+		if (flow->m_u.usr_ip4_spec.ip_ver != 255)
+			printk(KERN_WARNING
+				"IP-Version is not supported by hardware!\n");
+	if (flow->flow_type == ETHER_FLOW)
+		if ((is_broadcast_ether_addr(flow->h_u.ether_spec.h_dest)
+				&& is_zero_ether_addr(
+						flow->m_u.ether_spec.h_dest)))
+			printk(KERN_DEBUG
+			"Filtering broadcast destination is very cheap!\n");
+
+	return 0;
+
+}
+/*
+ * Compares flow-specs a and b
+ * if a==b return 0
+ * if a!=b return 1
+ * if error return -1
+ */
+static int gfar_compare_flow_spec(struct ethtool_rx_ntuple_flow_spec *a,
+		struct ethtool_rx_ntuple_flow_spec *b)
+{
+	if (a == 0 || b == 0)
+		return -1;
+	if (a->flow_type != b->flow_type)
+		return 1;
+	if (a->vlan_tag != b->vlan_tag)
+		return 1;
+	if (a->vlan_tag_mask != b->vlan_tag_mask)
+		return 1;
+	switch (a->flow_type) {
+	case TCP_V4_FLOW:
+	case UDP_V4_FLOW:
+	case SCTP_V4_FLOW:
+		if (a->h_u.tcp_ip4_spec.pdst != b->h_u.tcp_ip4_spec.pdst)
+			return 1;
+		if (a->m_u.tcp_ip4_spec.pdst != b->m_u.tcp_ip4_spec.pdst)
+			return 1;
+		if (a->h_u.tcp_ip4_spec.psrc != b->h_u.tcp_ip4_spec.psrc)
+			return 1;
+		if (a->m_u.tcp_ip4_spec.psrc != b->m_u.tcp_ip4_spec.psrc)
+			return 1;
+
+		goto gfar_compare_basic_ip_stuff;
+	case IP_USER_FLOW:
+		if (a->h_u.usr_ip4_spec.proto != b->h_u.usr_ip4_spec.proto)
+			return 1;
+		if (a->m_u.usr_ip4_spec.proto != b->m_u.usr_ip4_spec.proto)
+			return 1;
+		if (a->h_u.usr_ip4_spec.ip_ver != b->h_u.usr_ip4_spec.ip_ver)
+			return 1;
+		if (a->m_u.usr_ip4_spec.ip_ver != b->m_u.usr_ip4_spec.ip_ver)
+			return 1;
+		if (a->h_u.usr_ip4_spec.l4_4_bytes
+				!= b->h_u.usr_ip4_spec.l4_4_bytes)
+			return 1;
+		if (a->m_u.usr_ip4_spec.l4_4_bytes
+				!= b->m_u.usr_ip4_spec.l4_4_bytes)
+			return 1;
+
+		goto gfar_compare_basic_ip_stuff;
+	case ETHER_FLOW:
+		if (compare_ether_addr(a->h_u.ether_spec.h_dest,
+				b->h_u.ether_spec.h_dest))
+			return 1;
+		if (compare_ether_addr(a->h_u.ether_spec.h_source,
+				b->h_u.ether_spec.h_source))
+			return 1;
+		if (compare_ether_addr(a->m_u.ether_spec.h_dest,
+				b->m_u.ether_spec.h_dest))
+			return 1;
+		if (compare_ether_addr(a->m_u.ether_spec.h_source,
+				b->m_u.ether_spec.h_source))
+			return 1;
+		if (a->h_u.ether_spec.h_proto != b->h_u.ether_spec.h_proto)
+			return 1;
+		if (a->m_u.ether_spec.h_proto != b->m_u.ether_spec.h_proto)
+			return 1;
+		return 0;
+	default:
+		return -1;
+	}
+
+	/*Control-flow never passes here!*/
+
+gfar_compare_basic_ip_stuff:
+	if (a->h_u.tcp_ip4_spec.ip4dst != b->h_u.tcp_ip4_spec.ip4dst)
+		return 1;
+	if (a->m_u.tcp_ip4_spec.ip4dst != b->m_u.tcp_ip4_spec.ip4dst)
+		return 1;
+	if (a->h_u.tcp_ip4_spec.ip4src != b->h_u.tcp_ip4_spec.ip4src)
+		return 1;
+	if (a->m_u.tcp_ip4_spec.ip4src != b->m_u.tcp_ip4_spec.ip4src)
+		return 1;
+	if (a->h_u.tcp_ip4_spec.tos != b->h_u.tcp_ip4_spec.tos)
+		return 1;
+	if (a->m_u.tcp_ip4_spec.tos != b->m_u.tcp_ip4_spec.tos)
+		return 1;
+
+	return 0;
+}
+
+/* Searches the existing flow_specs for flow and return NULL if none found
+ * or the address of the container in the linked list in case of success*/
+static struct ethtool_rx_ntuple_flow_spec_container
*gfar_search_table_entry(
+		struct ethtool_rx_ntuple_flow_spec *flow,
+		struct ethtool_rx_ntuple_list *list)
+{
+	struct ethtool_rx_ntuple_flow_spec_container *loop;
+	list_for_each_entry(loop, &list->list, list) {
+		if (gfar_compare_flow_spec(flow, &loop->fs) == 0)
+			return loop;
+	}
+	return NULL;
+}
+
+static int gfar_del_table_entry(
+		struct ethtool_rx_ntuple_flow_spec_container *cont,
+		struct ethtool_rx_ntuple_list *list)
+{
+	list_del(&cont->list);
+	kfree(cont);
+	list->count--;
+	return 0;
+}
+
+static int gfar_process_filer_changes(struct ethtool_rx_ntuple_flow_spec
*flow,
+		struct gfar_private *priv)
+{
+	struct ethtool_rx_ntuple_flow_spec_container *j;
+	struct filer_table *tab;
+	s32 i = 0;
+	s32 ret = 0;
+
+	/*So index is set to zero, too!*/
+	tab = kzalloc(sizeof(struct filer_table), GFP_KERNEL);
+	if (tab == NULL) {
+		printk(KERN_WARNING "Can not get memory!\n");
+		return -ENOMEM;
+	}
+
+	j = gfar_search_table_entry(flow, &priv->ntuple_list);
+
+	if (flow->action == ETHTOOL_RXNTUPLE_ACTION_CLEAR) {
+		if (j != NULL)
+			gfar_del_table_entry(j, &priv->ntuple_list);
+		else {
+			printk(KERN_WARNING
+			"Deleting this rule is not possible,"
+			" because it does not exist!\n");
+			return -1;
+		}
+	} else if (j != NULL) {
+		printk(KERN_WARNING "Adding this rule is not possible,"
+			" because it already exists!\n");
+		return -1;
+	}
+
+	/*Now convert the existing filer data from flow_spec into
+	* filer tables binary format*/
+	list_for_each_entry(j, &priv->ntuple_list.list, list) {
+		ret = gfar_convert_to_filer(&j->fs, tab);
+		if (ret == -ESWFULL) {
+			printk(KERN_WARNING
+			"Adding this rule is not possible,"
+			" because there is not space left!\n");
+			goto end;
+		}
+	}
+
+	/*Here add the new one*/
+	if (flow->action != ETHTOOL_RXNTUPLE_ACTION_CLEAR) {
+		ret = gfar_convert_to_filer(flow, tab);
+		if (ret == -ESWFULL) {
+			printk(KERN_WARNING
+			"Adding this rule is not possible,"
+			" because there is not space left!\n");
+			goto end;
+		}
+		if (ret == -1) {
+			printk(KERN_WARNING
+			"Adding this rule is not possible,"
+			" because this flow-type is not supported"
+			" by hardware!\n");
+			goto end;
+		}
+	}
+
+	i = tab->index;
+
+	/*Optimizations to save entries*/
+	gfar_cluster_filer(tab);
+	gfar_optimize_filer_masks(tab);
+
+	printk(KERN_DEBUG "\tSummary:\n"
+	"\tData on hardware: %d\n"
+	"\tCompression rate: %d %%\n", tab->index, 100 - (100 * tab->index)
+			/ i);
+
+	/*Write everything to hardware*/
+	ret = gfar_write_filer_table(priv, tab);
+	if (ret == -EHWFULL) {
+		printk(KERN_WARNING
+		"Adding this rule is not possible,"
+		" because there is not space left!\n");
+		goto end;
+	}
+
+	/*Only if all worked fine, add the flow*/
+	if (flow->action != ETHTOOL_RXNTUPLE_ACTION_CLEAR)
+		gfar_add_table_entry(flow, &priv->ntuple_list);
+
+end:	kfree(tab);
+	return ret;
+}
+
+static int gfar_set_rx_ntuple(struct net_device *dev,
+		struct ethtool_rx_ntuple *cmd)
+{	struct gfar_private *priv = netdev_priv(dev);
+
+	/*Only values between -2 and num_rx_queues - 1 allowed*/
+	if ((cmd->fs.action >= (signed int)priv->num_rx_queues) ||
+	(cmd->fs.action < ETHTOOL_RXNTUPLE_ACTION_CLEAR))
+		return -EINVAL;
+
+	/* Only one process per device in this region, because the linked list
+	 * ntuple_list and the hardware are critical resources*/
+	mutex_lock(&priv->rx_queue_access);
+
+	if (list_empty(&priv->ntuple_list.list))
+		if (gfar_check_filer_hardware(priv) != 0)
+			return -1;
+
+	gfar_process_filer_changes(&cmd->fs, priv);
+
+	mutex_unlock(&priv->rx_queue_access);
+
+	print_hw(priv);
+
+	return 0;
+}
+
+
 const struct ethtool_ops gfar_ethtool_ops = {
 	.get_settings = gfar_gsettings,
 	.set_settings = gfar_ssettings,
@@ -808,4 +1813,5 @@ const struct ethtool_ops gfar_ethtool_ops = {
 	.set_wol = gfar_set_wol,
 #endif
 	.set_rxnfc = gfar_set_nfc,
+	.set_rx_ntuple = gfar_set_rx_ntuple
 };

^ permalink raw reply related

* Re: WARNING, net/core/dev.c, skb_gso_segment, 2.6.39-rc7-git11 inclusive
From: Denys Fedoryshchenko @ 2011-05-17 13:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1305638195.2850.97.camel@edumazet-laptop>

>>
>> --
>
> Hi Denys
>
> net/sched/act_csum.c sets skb->ip_summed = CHECKSUM_NONE;  after csum
> changes.
>
> and skb_gso_segment() barfs if skb->ip_summed != CHECKSUM_PARTIAL
>
> You could disable GRO for the time being.
 Thanks a lot, i will, it is just test.
 By the way, is it a bug and it will be fixed, or pedit is not intended 
 to be (ab)used like that?


^ permalink raw reply

* Re: [PATCH 1/1] igmp: fix ip_mc_clear_src to not reset ip_mc_list->sf{mode,count}
From: Veaceslav Falico @ 2011-05-17 13:30 UTC (permalink / raw)
  To: David Stevens
  Cc: David Miller, jmorris, kaber, kuznet, linux-kbuild, linux-kernel,
	mmarek, netdev, pekkas, yoshfuji
In-Reply-To: <OF948388D5.90B3F5A7-ON88257892.006C3228-88257892.0071B9D5@us.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 3249 bytes --]

On Mon, May 16, 2011 at 01:42:11PM -0700, David Stevens wrote:
> 
>         On NETDEV_DOWN, all group memberships are dropped. 
> ip_mc_clear_src()
> is simply freeing all the source filters and turning it into an "EXCLUDE 
> nobody"
> membership (ie, the same as an ordinary join without source filtering). 
> This
> ordinarily happens when you are deleting the group entirely (when the 
> reference
> count goes to 0), but is also called on device down.
>         This patch is not appropriate; when the groups are deleted, the 
> source
> filters are deleted, and the filter counts have to reflect the source 
> filters
> in the list. If you had an "INCLUDE A" filter, for example, that would 
> become
> an "INCLUDE nobody" filter and drop all traffic (from A or not). The 
> number
> of source filters is not related to the number of listener sockets, and 
> the
> function of ip_mc_clear_src() is to make it 0 (with the special case of 1 
> for
> EXCLUDE), so setting the counts has to be done for proper functioning.
>         I don't quite understand the problem you're trying to solve here 
> --
> when the device comes back up, the group should be re-added with 
> {EXCLUDE,nobody} and
> ip_check_mc() should therefore return 1. Of course, while the interface is
> down, the mc_list is empty and it'd return 0 in that case.
>         Do you have a small test program to demonstrate the problem?

Yes, attached are two programs, one sender and one receiver, they bind both
to localhost and send each other traffic. To reproduce, start the sender
and two instances of receivers, then do an ifconfig lo up; ifconfig lo
down;, restart the sender program (both of the receivers should once again
receive the multicast traffic). Then kill one receiver (the MCAST_EXCLUDE
will become 0), and do an "ip route flush cache". The new route cache will
be without the local flag on, and the remaining receiver will stop
receiving traffic.

What happens:

1) When both receivers start, ip_mc_list->sfcount[MCAST_EXCLUDE] == 2
2) On NETDEV_DOWN event, groups are dropped and sfmode = MCAST_EXCLUDE,
	sfcount[MCAST_EXCLUDE] = 1
3) On NETDEV_UP, the group is re-joined, but kernel thinks that there's
	only one listener (sfcount[MCAST_EXCLUDE]).
4) On socket destroy (when one receiver is terminated), the count is 0.
5) On route cache flush, __mkroute_output() doesn't see the remaining
	listener, and creates a route cache without RTCF_LOCAL flag, thus not
	allowing any traffic on that group to local listeners.

The igmp_group_dropped() (the actual routine that drops a group) is called
when:

1) ip_mc_dec_group() is called and im->users == 0
2) ip_mc_unmap()
3) ip_mc_down()
4) ip_mc_destroy_dev()

The 1) we call either on socket destroy or when the socket actually asks to
leave a group. In this case, we need to "reset" the state on no listeners.

2),3),4) are called on various device modifications
(NETDEV_PRE_TYPE_CHANGE, NETDEV_DOWN and NETDEV_UNREGISTER) - but the group
can be rejoined on their next events - NETDEV_POST_TYPE_CHANGE, NETDEV_UP
and NETDEV_REGISTER, which will cause the ip_mc_list to loose track of
existing listeners.

So, I tend to think that we must clear the sources only on 1).

Will send the patch shortly.

Thank you!

[-- Attachment #2: mcsend.c --]
[-- Type: text/plain, Size: 3595 bytes --]

#include <sys/types.h>   /* for type definitions */
#include <sys/socket.h>  /* for socket API function calls */
#include <netinet/in.h>  /* for address structs */
#include <arpa/inet.h>   /* for sockaddr_in */
#include <stdio.h>       /* for printf() */
#include <stdlib.h>      /* for atoi() */
#include <string.h>      /* for strlen() */
#include <unistd.h>      /* for close() */

#define MAX_LEN  1024    /* maximum string size to send */
#define MIN_PORT 1024    /* minimum port allowed */
#define MAX_PORT 65535   /* maximum port allowed */

int main(int argc, char *argv[]) {

  int sock;                   /* socket descriptor */
  char send_str[MAX_LEN];     /* string to send */
  struct sockaddr_in mc_addr; /* socket address structure */
  unsigned int send_len;      /* length of string to send */
  char* mc_addr_str;          /* multicast IP address */
  unsigned short mc_port;     /* multicast port */
  unsigned char mc_ttl=1;     /* time to live (hop count) */

  /* validate number of arguments */
  if (argc != 3) {
    fprintf(stderr,
            "Usage: %s <Multicast IP> <Multicast Port>\n",
            argv[0]);
    exit(1);
  }

  mc_addr_str = argv[1];       /* arg 1: multicast IP address */
  mc_port     = atoi(argv[2]); /* arg 2: multicast port number */

  /* validate the port range */
  if ((mc_port < MIN_PORT) || (mc_port > MAX_PORT)) {
    fprintf(stderr, "Invalid port number argument %d.\n",
            mc_port);
    fprintf(stderr, "Valid range is between %d and %d.\n",
            MIN_PORT, MAX_PORT);
    exit(1);
  }

  /* create a socket for sending to the multicast address */
  if ((sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP)) < 0) {
    perror("socket() failed");
    exit(1);
  }

  /* bind socket  to "localhost" */
  struct sockaddr_in lh_addr;   /* socket address structure */
  memset(&lh_addr, 0, sizeof(lh_addr));
  lh_addr.sin_family      = AF_INET;
  lh_addr.sin_addr.s_addr = inet_addr("127.0.0.1");
  lh_addr.sin_port        = 0;
  if ((bind(sock, (struct sockaddr *) &lh_addr,
       sizeof(lh_addr))) < 0) {
    perror("bind() failed");
    exit(1);
  }

  /* enable loopback (should be the default) */
  int mc_loopback = 1;
  if ((setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP,
       (void*) &mc_loopback, sizeof(mc_loopback))) < 0) {
    perror("setsockopt(IP_MULTICAST_LOOP) failed");
    exit(1);
  }

  /* set interface address */
  if ((setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
       (void*) &lh_addr.sin_addr, sizeof(lh_addr.sin_addr))) < 0) {
    perror("setsockopt(IP_MULTICAST_IF) failed");
    exit(1);
  }

  /* set the TTL (time to live/hop count) for the send
  if ((setsockopt(sock, IPPROTO_IP, IP_MULTICAST_TTL,
       (void*) &mc_ttl, sizeof(mc_ttl))) < 0) {
    perror("setsockopt() failed");
    exit(1);
  } */

  /* construct a multicast address structure */
  memset(&mc_addr, 0, sizeof(mc_addr));
  mc_addr.sin_family      = AF_INET;
  mc_addr.sin_addr.s_addr = inet_addr(mc_addr_str);
  mc_addr.sin_port        = htons(mc_port);

  printf("Begin typing (return to send, ctrl-C to quit):\n");

  /* clear send buffer */
  memset(send_str, 0, sizeof(send_str));

  while (fgets(send_str, MAX_LEN, stdin)) {
    send_len = strlen(send_str);

    /* send string to multicast address */
    if ((sendto(sock, send_str, send_len, 0,
         (struct sockaddr *) &mc_addr,
         sizeof(mc_addr))) != send_len) {
      perror("sendto() sent incorrect number of bytes");
      exit(1);
    }

    /* clear send buffer */
    memset(send_str, 0, sizeof(send_str));
  }

  close(sock);

  exit(0);
}


[-- Attachment #3: mcreceive.c --]
[-- Type: text/plain, Size: 3821 bytes --]

#include <sys/types.h>  /* for type definitions */
#include <sys/socket.h> /* for socket API calls */
#include <netinet/in.h> /* for address structs */
#include <arpa/inet.h>  /* for sockaddr_in */
#include <stdio.h>      /* for printf() and fprintf() */
#include <stdlib.h>     /* for atoi() */
#include <string.h>     /* for strlen() */
#include <unistd.h>     /* for close() */

#define MAX_LEN  1024   /* maximum receive string size */
#define MIN_PORT 1024   /* minimum port allowed */
#define MAX_PORT 65535  /* maximum port allowed */

int main(int argc, char *argv[]) {

  int sock;                     /* socket descriptor */
  int flag_on = 1;              /* socket option flag */
  struct sockaddr_in mc_addr;   /* socket address structure */
  char recv_str[MAX_LEN+1];     /* buffer to receive string */
  int recv_len;                 /* length of string received */
  struct ip_mreq mc_req;        /* multicast request structure */
  char* mc_addr_str;            /* multicast IP address */
  unsigned short mc_port;       /* multicast port */
  struct sockaddr_in from_addr; /* packet source */
  unsigned int from_len;        /* source addr length */
  unsigned int fl=1;

  /* validate number of arguments */
  if (argc != 3) {
    fprintf(stderr,
            "Usage: %s <Multicast IP> <Multicast Port>\n",
            argv[0]);
    exit(1);
  }

  mc_addr_str = argv[1];      /* arg 1: multicast ip address */
  mc_port = atoi(argv[2]);    /* arg 2: multicast port number */

  /* validate the port range */
  if ((mc_port < MIN_PORT) || (mc_port > MAX_PORT)) {
    fprintf(stderr, "Invalid port number argument %d.\n",
            mc_port);
    fprintf(stderr, "Valid range is between %d and %d.\n",
            MIN_PORT, MAX_PORT);
    exit(1);
  }

  /* create socket to join multicast group on */
  if ((sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP)) < 0) {
    perror("socket() failed");
    exit(1);
  }

  /* set reuse port to on to allow multiple binds per host */
  if ((setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &flag_on,
       sizeof(flag_on))) < 0) {
    perror("setsockopt() failed");
    exit(1);
  }

  /* construct a multicast address structure */
  memset(&mc_addr, 0, sizeof(mc_addr));
  mc_addr.sin_family      = AF_INET;
//  mc_addr.sin_addr.s_addr = htonl(INADDR_ANY);
  mc_addr.sin_addr.s_addr = inet_addr(mc_addr_str);
  mc_addr.sin_port        = htons(mc_port);

  /* bind to multicast address to socket */
  if ((bind(sock, (struct sockaddr *) &mc_addr,
       sizeof(mc_addr))) < 0) {
    perror("bind() failed");
    exit(1);
  }

  /* construct an IGMP join request structure */
  mc_req.imr_multiaddr.s_addr = inet_addr(mc_addr_str);
//  mc_req.imr_interface.s_addr = htonl(INADDR_ANY);
  mc_req.imr_interface.s_addr = inet_addr("127.0.0.1");

  /* send an ADD MEMBERSHIP message via setsockopt */
  if (fl && (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
       (void*) &mc_req, sizeof(mc_req))) < 0) {
    perror("setsockopt() failed");
    exit(1);
  }

  for (;;) {          /* loop forever */

    /* clear the receive buffers & structs */
    memset(recv_str, 0, sizeof(recv_str));
    from_len = sizeof(from_addr);
    memset(&from_addr, 0, from_len);

    /* block waiting to receive a packet */
    if ((recv_len = recvfrom(sock, recv_str, MAX_LEN, 0,
         (struct sockaddr*)&from_addr, &from_len)) < 0) {
      perror("recvfrom() failed");
      exit(1);
    }

    /* output received string */
    printf("Received %d bytes from %s: ", recv_len,
           inet_ntoa(from_addr.sin_addr));
    printf("%s", recv_str);
  }

  /* send a DROP MEMBERSHIP message via setsockopt */
  if ((setsockopt(sock, IPPROTO_IP, IP_DROP_MEMBERSHIP,
       (void*) &mc_req, sizeof(mc_req))) < 0) {
    perror("setsockopt() failed");
    exit(1);
  }

  close(sock);
}

^ permalink raw reply

* Re: WARNING, net/core/dev.c, skb_gso_segment, 2.6.39-rc7-git11 inclusive
From: Denys Fedoryshchenko @ 2011-05-17 13:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1305638195.2850.97.camel@edumazet-laptop>

 On Tue, 17 May 2011 15:16:35 +0200, Eric Dumazet wrote:
>
> Hi Denys
>
> net/sched/act_csum.c sets skb->ip_summed = CHECKSUM_NONE;  after csum
> changes.
>
> and skb_gso_segment() barfs if skb->ip_summed != CHECKSUM_PARTIAL
>
> You could disable GRO for the time being.
 It seems mistake from my side, seems, csum was not being called, i 
 forgot to add pipe before csum.


^ permalink raw reply

* [PATCH v2 1/1] igmp: call ip_mc_clear_src() only when we have no users of ip_mc_list
From: Veaceslav Falico @ 2011-05-17 13:38 UTC (permalink / raw)
  To: David Stevens
  Cc: David Miller, jmorris, kaber, kuznet, linux-kbuild, linux-kernel,
	mmarek, netdev, pekkas, yoshfuji
In-Reply-To: <OF948388D5.90B3F5A7-ON88257892.006C3228-88257892.0071B9D5@us.ibm.com>

In igmp_group_dropped() we call ip_mc_clear_src(), which resets the number
of source filters per mulitcast. However, igmp_group_dropped() is also
called on NETDEV_DOWN, NETDEV_PRE_TYPE_CHANGE and NETDEV_UNREGISTER, which
means that the group might get added back on NETDEV_UP, NETDEV_REGISTER and
NETDEV_POST_TYPE_CHANGE respectively, leaving us with broken source
filters.

To fix that, we must clear the source filters only when there are no users
in the ip_mc_list, i.e. in ip_mc_dec_group().

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>

---
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 1fd3d9c..732e30b 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1182,7 +1182,6 @@ static void igmp_group_dropped(struct ip_mc_list *im)
 	}
 done:
 #endif
-	ip_mc_clear_src(im);
 }
 
 static void igmp_group_added(struct ip_mc_list *im)
@@ -1319,6 +1318,7 @@ void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
 				*ip = i->next_rcu;
 				in_dev->mc_count--;
 				igmp_group_dropped(i);
+				ip_mc_clear_src(i);
 
 				if (!in_dev->dead)
 					ip_rt_multicast_event(in_dev);

^ permalink raw reply related

* bonding flaps between member interfaces
From: Patrick Schaaf @ 2011-05-17 13:27 UTC (permalink / raw)
  To: netdev

Dear netdev,

I'm experiencing a regression with bonding. Bugzilla and cursory
searching of the list did not immediately show up anything that seems
related, so here's the report:

Short summary: bonding flips between members every second

bonding in active-backup mode with ARP monitoring
two members in the bond, both being VLAN interfaces on top of two
separate ethernet interfaces
bnx2 ethernet driver, but saw the same behaviour with a tigon box
concrete settings:
BONDING_MODULE_OPTS="mode=active-backup primary=eth0.24 arp_interval=250
arp_ip_target=192.168.x.x"
See below for a /proc/net/bonding/bond24 output reflecing the
configuration.

This setup I have in production on 2.6.36.2, and it works fine.
It also works fine, tested today, with 2.6.36,4 and 2.6.37.6

Starting with 2.6.38 (2.6.38.6 tested today), and still happening with
2.6.39-rc7, I experience problems. While I can still work over the
interface, it is flipping once per second between the two member
interfaces. There is no indication of the underlying interface going
up/down, but bonding seems to think so.

See below an excerpt of the kernel log for two back-and-forth flapping
cycles.

In /proc/net/bonding/bond24, I see the failure counter of the configured
primary interface counting up with each flap. The counter of the non
primary interface does not move. When I switch the primary interface by
echoing to /sys, the behaviour of the counters flips: always the
configured primary has the counter going up.
 
best regards
  Patrick

Here is /proc/net/bonding/bond24 while running on 2.6.37.6, to show the
concrete configuration from this POV. Everything looks the same with the
failing kernels, except for the noted behaviour of the Failure Counts.

Ethernet Channel Bonding Driver: v3.7.0 (June 2, 2010)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth0.24 (primary_reselect always)
Currently Active Slave: eth0.24
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 250
ARP IP target/s (n.n.n.n form): 192.168.x.x

Slave Interface: eth0.24
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: d4:85:64:ca:1c:12
Slave queue ID: 0

Slave Interface: eth1.24
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: d4:85:64:ca:1c:14
Slave queue ID: 0

Here is kernel log output for two flapping cycles (booted kernel was
2.6.39-rc7):

May 17 14:58:22 myserver kernel: [ 1016.629155] bonding: bond24: link
status definitely down for interface eth0.24, disabling it
May 17 14:58:22 myserver kernel: [ 1016.629159] bonding: bond24: making
interface eth1.24 the new active one.
May 17 14:58:22 myserver kernel: [ 1016.629162] device eth0.24 left
promiscuous mode
May 17 14:58:22 myserver kernel: [ 1016.629164] device eth0 left
promiscuous mode
May 17 14:58:22 myserver kernel: [ 1016.629191] device eth1.24 entered
promiscuous mode
May 17 14:58:22 myserver kernel: [ 1016.629193] device eth1 entered
promiscuous mode
May 17 14:58:22 myserver kernel: [ 1016.878596] bonding: bond24: link
status definitely up for interface eth0.24.
May 17 14:58:22 myserver kernel: [ 1016.878600] bonding: bond24: making
interface eth0.24 the new active one.
May 17 14:58:22 myserver kernel: [ 1016.878603] device eth1.24 left
promiscuous mode
May 17 14:58:22 myserver kernel: [ 1016.878605] device eth1 left
promiscuous mode
May 17 14:58:22 myserver kernel: [ 1016.878631] device eth0.24 entered
promiscuous mode
May 17 14:58:22 myserver kernel: [ 1016.878633] device eth0 entered
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.626919] bonding: bond24: link
status definitely down for interface eth0.24, disabling it
May 17 14:58:23 myserver kernel: [ 1017.626923] bonding: bond24: making
interface eth1.24 the new active one.
May 17 14:58:23 myserver kernel: [ 1017.626926] device eth0.24 left
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.626928] device eth0 left
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.626955] device eth1.24 entered
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.626957] device eth1 entered
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.876359] bonding: bond24: link
status definitely up for interface eth0.24.
May 17 14:58:23 myserver kernel: [ 1017.876363] bonding: bond24: making
interface eth0.24 the new active one.
May 17 14:58:23 myserver kernel: [ 1017.876366] device eth1.24 left
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.876368] device eth1 left
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.876394] device eth0.24 entered
promiscuous mode
May 17 14:58:23 myserver kernel: [ 1017.876396] device eth0 entered
promiscuous mode



^ permalink raw reply

* Re: pull request: wireless-next-2.6 2011-05-16-v2
From: Gustavo F. Padovan @ 2011-05-17 14:13 UTC (permalink / raw)
  To: David Miller; +Cc: linville, linux-wireless, netdev
In-Reply-To: <20110516.230835.989629139579226478.davem@davemloft.net>

* David Miller <davem@davemloft.net> [2011-05-16 23:08:35 -0400]:

> From: "John W. Linville" <linville@tuxdriver.com>
> Date: Mon, 16 May 2011 20:08:09 -0400
> 
> > I'm sorry, Dave!  In my defense, I am on the verge of sickness and got
> > very little sleep last night... :-(
> > 
> > I have corrected the problem and was sure to perform the correct build
> > test this time! :-)
> 
> So these hunks made to net/bluetooth/l2cap_core.c:
> 
> @@ -3745,11 +3758,14 @@ done:
>  static inline int l2cap_conless_channel(struct l2cap_conn *conn, __le16 psm, struct sk_buff *skb)
>  {
>  	struct sock *sk;
> +	struct l2cap_chan *chan;
>  
> -	sk = l2cap_get_sock_by_psm(0, psm, conn->src);
> -	if (!sk)
> +	chan = l2cap_global_chan_by_psm(0, psm, conn->src);
> +	if (!chan)
>  		goto drop;
>  
> +	sk = chan->sk;
> +
>  	bh_lock_sock(sk);
>  
>  	BT_DBG("sk %p, len %d", sk, skb->len);
> @@ -3745,11 +3758,14 @@ done:
>  static inline int l2cap_conless_channel(struct l2cap_conn *conn, __le16 psm, struct sk_buff *skb)
>  {
>  	struct sock *sk;
> +	struct l2cap_chan *chan;
>  
> -	sk = l2cap_get_sock_by_psm(0, psm, conn->src);
> -	if (!sk)
> +	chan = l2cap_global_chan_by_psm(0, psm, conn->src);
> +	if (!chan)
>  		goto drop;
>  
> +	sk = chan->sk;
> +
>  	bh_lock_sock(sk);
>  
>  	BT_DBG("sk %p, len %d", sk, skb->len);
> 
> Doesn't generate the following warnings on your compiler?
> 
> net/bluetooth/l2cap_core.c: In function ‘l2cap_recv_frame’:
> net/bluetooth/l2cap_core.c:3758:15: warning: ‘sk’ may be used uninitialized in this function
> net/bluetooth/l2cap_core.c:3758:15: note: ‘sk’ was declared here
> net/bluetooth/l2cap_core.c:3791:15: warning: ‘sk’ may be used uninitialized in this function
> net/bluetooth/l2cap_core.c:3791:15: note: ‘sk’ was declared here
> 

I keep not seeing these warnings. I'll upgrade to gcc 4.6 and check this again.

padovan bluetooth-next-2.6 $ rm net/bluetooth/*.o net/bluetooth/*.ko
padovan bluetooth-next-2.6 $ make
  CHK     include/linux/version.h
  CHK     include/generated/utsrelease.h
  UPD     include/generated/utsrelease.h
  CALL    scripts/checksyscalls.sh                                                                  
  CHK     include/generated/compile.h                                                               
  UPD     include/generated/compile.h                                                               
  CC      init/version.o                                                                            
  CC      kernel/module.o                                                                           
  LD      init/built-in.o                                                                           
  CC      kernel/kexec.o                                                                            
  LD      kernel/built-in.o                                                                         
  LD      net/bluetooth/built-in.o                                                                  
  CC [M]  net/bluetooth/af_bluetooth.o                                                              
  CC [M]  net/bluetooth/hci_core.o                                                                  
  CC [M]  net/bluetooth/hci_conn.o                                                                  
  CC [M]  net/bluetooth/hci_event.o                                                                 
  CC [M]  net/bluetooth/mgmt.o                                                                      
  CC [M]  net/bluetooth/hci_sock.o                                                                  
  CC [M]  net/bluetooth/hci_sysfs.o                                                                 
  CC [M]  net/bluetooth/lib.o                                                                       
  CC [M]  net/bluetooth/l2cap_core.o                                                                
  CC [M]  net/bluetooth/l2cap_sock.o                                                                
  CC [M]  net/bluetooth/sco.o                                                                       
  LD [M]  net/bluetooth/bluetooth.o                                                                 
  LD      vmlinux.o   

-- 
Gustavo F. Padovan
http://profusion.mobi

^ permalink raw reply

* Re: [PATCH] net: tuntap: Fix tun_net_fix_features()
From: Michael S. Tsirkin @ 2011-05-17 14:29 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: netdev, Herbert Xu, Ben Hutchings, Shan Wei
In-Reply-To: <20110517081954.A227013A6A@rere.qmqm.pl>

On Tue, May 17, 2011 at 10:19:54AM +0200, Michał Mirosław wrote:
> tun->set_features are meant to limit not force the features.
> 
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> ---
>  drivers/net/tun.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 74e9405..f77c6d0 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -458,7 +458,7 @@ static u32 tun_net_fix_features(struct net_device *dev, u32 features)
>  {
>  	struct tun_struct *tun = netdev_priv(dev);
>  
> -	return (features & tun->set_features) | (features & ~TUN_USER_FEATURES);
> +	return features & (tun->set_features | ~TUN_USER_FEATURES);
>  }
>  
>  static const struct net_device_ops tun_netdev_ops = {
> -- 
> 1.7.2.5

One thing that this will do though: previously, if
ethtool disables offloads, then an application enables
them, the application will have the last say.
With this patch, the most conservative approach wins.
Right?
If we want to have the existing behaviour
I think the following would do this (untested). What do you think?


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 74e9405..1d6c7bc 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1199,6 +1199,8 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
 		return -EINVAL;
 
 	tun->set_features = features;
+	tun->dev->features &= TUN_USER_FEATURES;
+	tun->dev->features |= (features & TUN_USER_FEATURES);
 	netdev_update_features(tun->dev);
 
 	return 0;

^ permalink raw reply related

* Re: [PATCH 0/7] Network namespace manipulation with file descriptors
From: Eric W. Biederman @ 2011-05-17 14:33 UTC (permalink / raw)
  To: David Lamparter
  Cc: Alex Bligh, linux-arch, netdev, linux-kernel, Linux Containers,
	linux-fsdevel
In-Reply-To: <20110517111148.GA3762520@jupiter.n2.diac24.net>

David Lamparter <equinox@diac24.net> writes:

> On Sat, May 07, 2011 at 07:18:44AM -0700, Eric W. Biederman wrote:
>> You can read the processes network namespace by opening
>> /proc/<pid>/ns/net.  Unfortunately comparing the network
>> namespaces for identity is another matter.  You will probably
>> be better off simply forcing the routing daemon to start
>> in the desired network namespace in it's initscript.
>> 
>> For purposes of clarity please have a look at my work in
>> progress patch for iproute2.  This demonstrates how I expect
>> userspace to work in a multi-network namespace world.
>> 
> [...]
>> Subject: [PATCH] iproute2:  Add processless netnwork namespace support.
> [...]
>> Configuration specific to a network namespace that
>> would ordinarily be stored under /etc/ is stored under
>> /etc/netns/<name>.  For example if the dns server
>> configuration is different for your vpn you would
>> create a file /etc/netns/myvpn/resolv.conf.
>> 
>> File descriptors that can be used to manipulate a
>> network namespace can be created by opening
>> /var/run/netns/<NAME>.
>> 
>> This adds the following commands to iproute.
>> ip netns add NAME
>> ip netns delete NAME
>> ip netns monitor
>> ip netns list
>> ip netns exec NAME cmd ....
>> ip link set DEV netns NAME
>
> funny, this is almost exactly what my code does - though you're probably
> doing it better and have more features ;)

Well if it has more features it is only because I have managed to keep
everything simple enough that adding features was easy.  I ignored all
of the hard bits.

> http://git.spaceboyz.net/equinox/vrf-tools.git/
> git://spaceboyz.net/equinox/vrf-tools.git
>
> It currently forks off a daemon to keep the namespace open; attaching is
> not possible yet, but opening a socket in a different namespace is.

I went the round of keeping a daemon open, saw how much code that
takes and how fragile that can be in the corner cases and decided to
patch the kernel to make the interfaces better.

> Most of the actual management (mounting things & co.) I offloaded to
> some shell scripts; I use it together with GNU screen (which makes it
> very nice to grab one of the namespaces and start/stop/manage/...
> things).

That does sound like a nice way of doing the management.

> I also have patches for OpenVPN and pptpd floating around that make it
> possible to 'cross' namespace boundaries, i.e. the VPN servers listen in
> one namespace and have their devices in another.

For openvpn I have managed to get away with simply using an up script. 
Mostly the script is:

ip netns add $NSNAME || true
ip netns exec $NSNAME ip link set lo up
ip link set $dev netns $NSNAME
ip netns exec $NSNAME ip link set $dev up
ip netns exec $NSNAME ifconfig $dev $ifconfig_local netmask $ifconfig_netmask broadcast $ifconfig_broadcast

With a few extra bits for dns options and routes.  If I had an openvpn
built with the iproute option I expect I could get away by just wrapping
iproute.  Not that I would mind a patched openvpn.

Personally I think using a vpn in a network namespace seems like a
killer feature.

Eric

^ permalink raw reply

* [PATCH v3 1/1] igmp: call ip_mc_clear_src() only when we have no users of ip_mc_list
From: Veaceslav Falico @ 2011-05-17 14:37 UTC (permalink / raw)
  To: David Stevens
  Cc: David Miller, jmorris, kaber, kuznet, linux-kbuild, linux-kernel,
	mmarek, netdev, pekkas, yoshfuji
In-Reply-To: <20110517133801.GC30366@darkmag.usersys.redhat.com>

In igmp_group_dropped() we call ip_mc_clear_src(), which resets the number
of source filters per mulitcast. However, igmp_group_dropped() is also
called on NETDEV_DOWN, NETDEV_PRE_TYPE_CHANGE and NETDEV_UNREGISTER, which
means that the group might get added back on NETDEV_UP, NETDEV_REGISTER and
NETDEV_POST_TYPE_CHANGE respectively, leaving us with broken source
filters.

To fix that, we must clear the source filters only when there are no users
in the ip_mc_list, i.e. in ip_mc_dec_group().

Correct version of the patch.

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
---
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 1fd3d9c..142ca0d 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -1169,20 +1169,18 @@ static void igmp_group_dropped(struct ip_mc_list *im)
 
 	if (!in_dev->dead) {
 		if (IGMP_V1_SEEN(in_dev))
-			goto done;
+			return;
 		if (IGMP_V2_SEEN(in_dev)) {
 			if (reporter)
 				igmp_send_report(in_dev, im, IGMP_HOST_LEAVE_MESSAGE);
-			goto done;
+			return;
 		}
 		/* IGMPv3 */
 		igmpv3_add_delrec(in_dev, im);
 
 		igmp_ifc_event(in_dev);
 	}
-done:
 #endif
-	ip_mc_clear_src(im);
 }
 
 static void igmp_group_added(struct ip_mc_list *im)
@@ -1319,6 +1317,7 @@ void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
 				*ip = i->next_rcu;
 				in_dev->mc_count--;
 				igmp_group_dropped(i);
+				ip_mc_clear_src(i);
 
 				if (!in_dev->dead)
 					ip_rt_multicast_event(in_dev);

^ permalink raw reply related

* Re: [PATCH] net: tuntap: Fix tun_net_fix_features()
From: Michał Mirosław @ 2011-05-17 14:46 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, Herbert Xu, Ben Hutchings, Shan Wei
In-Reply-To: <20110517142943.GA926@redhat.com>

On Tue, May 17, 2011 at 05:29:43PM +0300, Michael S. Tsirkin wrote:
> On Tue, May 17, 2011 at 10:19:54AM +0200, Michał Mirosław wrote:
> > tun->set_features are meant to limit not force the features.
> > 
> > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > ---
> >  drivers/net/tun.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 74e9405..f77c6d0 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -458,7 +458,7 @@ static u32 tun_net_fix_features(struct net_device *dev, u32 features)
> >  {
> >  	struct tun_struct *tun = netdev_priv(dev);
> >  
> > -	return (features & tun->set_features) | (features & ~TUN_USER_FEATURES);
> > +	return features & (tun->set_features | ~TUN_USER_FEATURES);
> >  }
> >  
> >  static const struct net_device_ops tun_netdev_ops = {
> > -- 
> > 1.7.2.5
> 
> One thing that this will do though: previously, if
> ethtool disables offloads, then an application enables
> them, the application will have the last say.
> With this patch, the most conservative approach wins.
> Right?

Exactly.

On device creation, wanted_features default to all offloads
enabled, so unless an admin changes the flags, the application controls
what is enabled. This matters only when using persistent tun/tap and
admin and user are two different people. If the admin is using queues
and doesn't want to handle e.g. TSO packets (I'm not sure if they are
properly accounted in all queuing disciplines), then the feature should
not be enabled by user.

> If we want to have the existing behaviour
> I think the following would do this (untested). What do you think?
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 74e9405..1d6c7bc 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1199,6 +1199,8 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
>  		return -EINVAL;
>  
>  	tun->set_features = features;
> +	tun->dev->features &= TUN_USER_FEATURES;
> +	tun->dev->features |= (features & TUN_USER_FEATURES);
>  	netdev_update_features(tun->dev);

tun->dev->features will be recalculated by netdev_update_features()
anyway. For this to work as you described it would need to alter
wanted_features. I don't like the idea that something other than one
of ethtool_ops is changing this field, as it then becomes something
else that what the admin wants (even if that is not what he gets).

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH] net: tuntap: Fix tun_net_fix_features()
From: Michael S. Tsirkin @ 2011-05-17 14:54 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: netdev, Herbert Xu, Ben Hutchings, Shan Wei
In-Reply-To: <20110517144635.GA22878@rere.qmqm.pl>

On Tue, May 17, 2011 at 04:46:35PM +0200, Michał Mirosław wrote:
> On Tue, May 17, 2011 at 05:29:43PM +0300, Michael S. Tsirkin wrote:
> > On Tue, May 17, 2011 at 10:19:54AM +0200, Michał Mirosław wrote:
> > > tun->set_features are meant to limit not force the features.
> > > 
> > > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> > > ---
> > >  drivers/net/tun.c |    2 +-
> > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > > index 74e9405..f77c6d0 100644
> > > --- a/drivers/net/tun.c
> > > +++ b/drivers/net/tun.c
> > > @@ -458,7 +458,7 @@ static u32 tun_net_fix_features(struct net_device *dev, u32 features)
> > >  {
> > >  	struct tun_struct *tun = netdev_priv(dev);
> > >  
> > > -	return (features & tun->set_features) | (features & ~TUN_USER_FEATURES);
> > > +	return features & (tun->set_features | ~TUN_USER_FEATURES);
> > >  }
> > >  
> > >  static const struct net_device_ops tun_netdev_ops = {
> > > -- 
> > > 1.7.2.5
> > 
> > One thing that this will do though: previously, if
> > ethtool disables offloads, then an application enables
> > them, the application will have the last say.
> > With this patch, the most conservative approach wins.
> > Right?
> 
> Exactly.
> 
> On device creation, wanted_features default to all offloads
> enabled, so unless an admin changes the flags, the application controls
> what is enabled. This matters only when using persistent tun/tap and
> admin and user are two different people. If the admin is using queues
> and doesn't want to handle e.g. TSO packets (I'm not sure if they are
> properly accounted in all queuing disciplines), then the feature should
> not be enabled by user.
> 
> > If we want to have the existing behaviour
> > I think the following would do this (untested). What do you think?
> > 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 74e9405..1d6c7bc 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -1199,6 +1199,8 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
> >  		return -EINVAL;
> >  
> >  	tun->set_features = features;
> > +	tun->dev->features &= TUN_USER_FEATURES;
> > +	tun->dev->features |= (features & TUN_USER_FEATURES);
> >  	netdev_update_features(tun->dev);
> 
> tun->dev->features will be recalculated by netdev_update_features()
> anyway. For this to work as you described it would need to alter
> wanted_features. I don't like the idea that something other than one
> of ethtool_ops is changing this field, as it then becomes something
> else that what the admin wants (even if that is not what he gets).
> 
> Best Regards,
> Michał Mirosław

Yes, with virtualization admin and the app are two different people
usually.  The device doesn't have to be persistent though I think -
what limits this to persistent devices?
I agree this behaviour seems more consistent, I just hope this change
does not break any setups.

-- 
MST

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox