Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v7 0/8] Request for inclusion: tcp memory buffers
From: David Miller @ 2011-10-13 20:18 UTC (permalink / raw)
  To: glommer
  Cc: linux-kernel, akpm, lizf, kamezawa.hiroyu, ebiederm, paul,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <4E9746B0.7030603@parallels.com>

From: Glauber Costa <glommer@parallels.com>
Date: Fri, 14 Oct 2011 00:14:40 +0400

> Are you happy, or at least willing to accept, an approach that keep
> things as they were with cgroups *compiled out*, or were you referring
> to not in use == compiled in, but with no users?

To me these are the same exact thing, because %99 of users will be running
a kernel with every feature turned on in the Kconfig.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: can: the mailinglist moved to vger.kernel.org
From: Oliver Hartkopp @ 2011-10-13 20:17 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: linux-can, netdev
In-Reply-To: <1318506157-10329-1-git-send-email-mkl@pengutronix.de>

On 10/13/11 13:42, Marc Kleine-Budde wrote:

> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>


Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>

> ---
>  MAINTAINERS |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index aac56f9..5008b08 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1671,7 +1671,7 @@ CAN NETWORK LAYER
>  M:	Oliver Hartkopp <socketcan@hartkopp.net>
>  M:	Oliver Hartkopp <oliver.hartkopp@volkswagen.de>
>  M:	Urs Thuermann <urs.thuermann@volkswagen.de>
> -L:	socketcan-core@lists.berlios.de (subscribers-only)
> +L:	linux-can@vger.kernel.org
>  L:	netdev@vger.kernel.org
>  W:	http://developer.berlios.de/projects/socketcan/
>  S:	Maintained
> @@ -1683,7 +1683,7 @@ F:	include/linux/can/raw.h
>  
>  CAN NETWORK DRIVERS
>  M:	Wolfgang Grandegger <wg@grandegger.com>
> -L:	socketcan-core@lists.berlios.de (subscribers-only)
> +L:	linux-can@vger.kernel.org
>  L:	netdev@vger.kernel.org
>  W:	http://developer.berlios.de/projects/socketcan/
>  S:	Maintained

^ permalink raw reply

* Re: [PATCH v7 0/8] Request for inclusion: tcp memory buffers
From: David Miller @ 2011-10-13 20:16 UTC (permalink / raw)
  To: glommer
  Cc: linux-kernel, akpm, lizf, kamezawa.hiroyu, ebiederm, paul,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <4E9744A6.5010101@parallels.com>

From: Glauber Costa <glommer@parallels.com>
Date: Fri, 14 Oct 2011 00:05:58 +0400

> Thank you for letting me now about your view of this that early.

I depend upon my colleagues to assist me in the large task that is reviewing
the enormous number of networking patches that get submitted.

Unfortunately, none of them got a chance to review this patch set
seriously, since I know most of them (especially Eric Dumazet) would
balk at the overhead you're proposing to add to our stack, just as I
did.

This is the reality of the situation, and I'm sorry to tell you that
snippy retorts when someone does take the time out to review your work
won't help at all.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v7 0/8] Request for inclusion: tcp memory buffers
From: Glauber Costa @ 2011-10-13 20:14 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, akpm, lizf, kamezawa.hiroyu, ebiederm, paul,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <20111013.161221.1969725742975317077.davem@davemloft.net>

On 10/14/2011 12:12 AM, David Miller wrote:
> From: Glauber Costa<glommer@parallels.com>
> Date: Fri, 14 Oct 2011 00:05:58 +0400
>
>> Also, I kind of dispute the affirmation that !cgroup will encompass
>> the majority of users, since cgroups is being enabled by default by
>> most vendors. All systemd based systems use it extensively, for
>> instance.
>
> I will definitely advise people against this, since the cost of having
> this on by default is absolutely non-trivial.
>
> People keep asking every few releases "where the heck has my performance
> gone" and it's because of creeping features like this.  This socket
> cgroup feature is a prime example of where that kind of stuff comes
> from.
>
> I really get irritated when people go "oh, it's just one indirect
> function call" and "oh, it's just one more pointer in struct sock"
>
> We work really hard to _remove_ elements from structures and make them
> smaller, and to remove expensive operations from the fast paths.
>
> It might take someone weeks if not months to find a way to make a
> patch which compensates for the extra overhead your patches are adding.
>
> And I don't think you fully appreciate that.

Let's focus on this:
Are you happy, or at least willing to accept, an approach that keep 
things as they were with cgroups *compiled out*, or were you referring 
to not in use == compiled in, but with no users?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v7 0/8] Request for inclusion: tcp memory buffers
From: David Miller @ 2011-10-13 20:12 UTC (permalink / raw)
  To: glommer
  Cc: linux-kernel, akpm, lizf, kamezawa.hiroyu, ebiederm, paul,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <4E9744A6.5010101@parallels.com>

From: Glauber Costa <glommer@parallels.com>
Date: Fri, 14 Oct 2011 00:05:58 +0400

> Also, I kind of dispute the affirmation that !cgroup will encompass
> the majority of users, since cgroups is being enabled by default by
> most vendors. All systemd based systems use it extensively, for
> instance.

I will definitely advise people against this, since the cost of having
this on by default is absolutely non-trivial.

People keep asking every few releases "where the heck has my performance
gone" and it's because of creeping features like this.  This socket
cgroup feature is a prime example of where that kind of stuff comes
from.

I really get irritated when people go "oh, it's just one indirect
function call" and "oh, it's just one more pointer in struct sock"

We work really hard to _remove_ elements from structures and make them
smaller, and to remove expensive operations from the fast paths.

It might take someone weeks if not months to find a way to make a
patch which compensates for the extra overhead your patches are adding.

And I don't think you fully appreciate that.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v7 0/8] Request for inclusion: tcp memory buffers
From: David Miller @ 2011-10-13 20:08 UTC (permalink / raw)
  To: glommer
  Cc: linux-kernel, akpm, lizf, kamezawa.hiroyu, ebiederm, paul,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <4E9744A6.5010101@parallels.com>

From: Glauber Costa <glommer@parallels.com>
Date: Fri, 14 Oct 2011 00:05:58 +0400

> On 10/14/2011 12:00 AM, David Miller wrote:
>> That imposes a new non-trivial cost, in fast paths, even when people
>> do not use your feature.
> Well, there is a cost, but all past submissions included round trip
> benchmarks.
> In none of them I could see any significant slowdown.

Did you try millions of sockets doing all kinds of different accesses?

Did you check the nanosecond latency of operations over loopback so
that the real cost of you change can be isolated and thus measured
properly?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next] e1000e: fix skb truesize underestimation
From: David Miller @ 2011-10-13 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <1318529016.2393.62.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Oct 2011 20:03:36 +0200

> e1000e allocates a page per skb fragment. We must account
> PAGE_SIZE increments on skb->truesize, not the actual frag length.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] igb: fix skb truesize underestimation
From: David Miller @ 2011-10-13 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <1318528601.2393.57.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Oct 2011 19:56:41 +0200

> e1000 allocates half a page per skb fragment. We must account
> PAGE_SIZE/2 increments on skb->truesize, not the actual frag length.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] ixgbe: fix skb truesize underestimation
From: David Miller @ 2011-10-13 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <1318528781.2393.59.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Oct 2011 19:59:41 +0200

> ixgbe allocates half a page per skb fragment. We must account
> PAGE_SIZE/2 increments on skb->truesize, not the actual frag length.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] e1000: fix skb truesize underestimation
From: David Miller @ 2011-10-13 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, jeffrey.t.kirsher
In-Reply-To: <1318528422.2393.55.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Oct 2011 19:53:42 +0200

> e1000 allocates a full page per skb fragment. We must account PAGE_SIZE
> increments on skb->truesize, not the actual frag length.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] bnx2: fix skb truesize underestimation
From: David Miller @ 2011-10-13 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, mchan
In-Reply-To: <1318528219.2393.52.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Oct 2011 19:50:19 +0200

> bnx2 allocates a full page per fragment. We must account PAGE_SIZE
> increments on skb->truesize, not the actual frag length.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] be2net: fix truesize errors
From: David Miller @ 2011-10-13 20:06 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, sathya.perla, subbu.seetharaman, ajit.khaparde
In-Reply-To: <1318523553.2393.34.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Oct 2011 18:32:33 +0200

> Le jeudi 13 octobre 2011 à 18:31 +0200, Eric Dumazet a écrit :
>> Fix skb truesize underestimations of this driver.
>> 
>> Each frag truesize is exactly rx_frag_size bytes. (2048 bytes per
>> default)
>> 
>> A driver should not use "sizeof(struct sk_buff)" at all.
>> 
>> Signed-off-by: Eric Dumazet <eric.dumazet>
> 
> Oh well, garbled Signed-off-by, sorry !
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

:-)  Applied.

^ permalink raw reply

* Re: [PATCH v7 0/8] Request for inclusion: tcp memory buffers
From: Glauber Costa @ 2011-10-13 20:05 UTC (permalink / raw)
  To: David Miller
  Cc: linux-kernel, akpm, lizf, kamezawa.hiroyu, ebiederm, paul,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <20111013.160031.605700447623532119.davem@davemloft.net>

On 10/14/2011 12:00 AM, David Miller wrote:
> From: Glauber Costa<glommer@parallels.com>
> Date: Thu, 13 Oct 2011 17:09:34 +0400
>
>> This series was extensively reviewed over the past month, and after
>> all major comments were merged, I feel it is ready for inclusion when
>> the next merge window opens. Minor fixes will be provided if they
>> prove to be necessary.
>
> I'm not applying this.

Thank you for letting me now about your view of this that early.

> You're turning inline increments and decrements of the existing memory
> limits into indirect function calls.

Yes, indeed.

> That imposes a new non-trivial cost, in fast paths, even when people
> do not use your feature.
Well, there is a cost, but all past submissions included round trip 
benchmarks.
In none of them I could see any significant slowdown.

> Make this evaluate into exactly the same exact code stream we have
> now when the memory cgroup feature is not in use, which will be the
> majority of users.

What exactly do you mean by "not in use" ? Not compiled in or not 
actively being exercised ? If you mean the later, I appreciate tips on 
how to achieve it.

Also, I kind of dispute the affirmation that !cgroup will encompass
the majority of users, since cgroups is being enabled by default by
most vendors. All systemd based systems use it extensively, for instance.

^ permalink raw reply

* Re: [PATCH net-next] net: more accurate skb truesize
From: David Miller @ 2011-10-13 20:05 UTC (permalink / raw)
  To: eric.dumazet; +Cc: bhutchings, netdev, ak
In-Reply-To: <1318526934.2393.49.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 13 Oct 2011 19:28:54 +0200

> [PATCH V2 net-next] net: more accurate skb truesize
> 
> skb truesize currently accounts for sk_buff struct and part of skb head.
> kmalloc() roundings are also ignored.
> 
> Considering that skb_shared_info is larger than sk_buff, its time to
> take it into account for better memory accounting.
> 
> This patch introduces SKB_TRUESIZE(X) macro to centralize various
> assumptions into a single place.
> 
> At skb alloc phase, we put skb_shared_info struct at the exact end of
> skb head, to allow a better use of memory (lowering number of
> reallocations), since kmalloc() gives us power-of-two memory blocks.
> 
> Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
> aligned to cache lines, as before.
> 
> Note: This patch might trigger performance regressions because of
> misconfigured protocol stacks, hitting per socket or global memory
> limits that were previously not reached. But its a necessary step for a
> more accurate memory accounting.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] ehea: Change maintainer to me
From: Breno Leitao @ 2011-10-13 20:05 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo; +Cc: netdev, linux-kernel, davem
In-Reply-To: <1318535779-18275-1-git-send-email-cascardo@linux.vnet.ibm.com>

On 10/13/2011 04:56 PM, Thadeu Lima de Souza Cascardo wrote:
> Breno Leitao has passed the maintainership to me.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Breno Leitão <leitao@linux.vnet.ibm.com>

^ permalink raw reply

* Re: [PATCH v7 0/8] Request for inclusion: tcp memory buffers
From: David Miller @ 2011-10-13 20:00 UTC (permalink / raw)
  To: glommer
  Cc: linux-kernel, akpm, lizf, kamezawa.hiroyu, ebiederm, paul,
	gthelen, netdev, linux-mm, kirill, avagin, devel
In-Reply-To: <1318511382-31051-1-git-send-email-glommer@parallels.com>

From: Glauber Costa <glommer@parallels.com>
Date: Thu, 13 Oct 2011 17:09:34 +0400

> This series was extensively reviewed over the past month, and after
> all major comments were merged, I feel it is ready for inclusion when
> the next merge window opens. Minor fixes will be provided if they
> prove to be necessary.

I'm not applying this.

You're turning inline increments and decrements of the existing memory
limits into indirect function calls.

That imposes a new non-trivial cost, in fast paths, even when people
do not use your feature.

Make this evaluate into exactly the same exact code stream we have
now when the memory cgroup feature is not in use, which will be the
majority of users.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] ehea: Change maintainer to me
From: Thadeu Lima de Souza Cascardo @ 2011-10-13 19:56 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, davem, Thadeu Lima de Souza Cascardo, Breno Leitao

Breno Leitao has passed the maintainership to me.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Cc: Breno Leitao <leitao@linux.vnet.ibm.com>
---
 MAINTAINERS |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index aac56f9..c25e93a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2468,7 +2468,7 @@ S:	Supported
 F:	drivers/infiniband/hw/ehca/
 
 EHEA (IBM pSeries eHEA 10Gb ethernet adapter) DRIVER
-M:	Breno Leitao <leitao@linux.vnet.ibm.com>
+M:	Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	drivers/net/ethernet/ibm/ehea/
-- 
1.7.4.4

^ permalink raw reply related

* Help with NULL pointer derefrence in net/ipv4/xfrm4_policy.c
From: Mark Larwill @ 2011-10-13 19:39 UTC (permalink / raw)
  To: netdev

I'm an software developer working for a small firewall company. I am
running into a kernel NULL pointer deference bug (using an older
2.6.35.12 kernel), and noticed a patch submitted that seems like it
might fix my problem. I was hoping someone could help me out a bit. It
is not practical for me to simply upgrade to the latest Linux, rather
I must make a targeted fix.

First a little on my specific issue, then my questions. My problem is
in the file net/ipv4/xfrm4_policy.c in the function
xfrm4_dst_ifdown(), on this line: " if (xdst->u.rt.idev->dev == dev)
{" basically idev is all 0 so xdst->u.rt.idev->dev will crash.

But before I blindly apply the patch I was hoping someone could help
answer some questions on it so I don't make things worse by not
understanding everything. The patch I reference below removes all of
this idev stuff from the kernel. I can see that as long as the if
condition (mentioned above) is met the (xfrm4_dst_ifdown) code
basically goes through a list and assigns the idev structures loopback
devices, so my guess is that for some reason these idev devices are no
longer being used or never were used to begin with. Also similar
behavior is done in the generic ipv4_dst_ifdown() function in
net/ipv4/route.c.

My questions are:
Q1) Why was the idev stuff there to begin with if it was unnecessary?
/ What was it's original purpose?
Q2) Why did we have to assign the devices to the loopback device? (Was
it so no packets would go out during deletion?)
Q3) When, and in what context are the ipv4_dst_ops.ifdown, and
ops->ifdown function pointers called? (What are they doing and why?)
Q4) Between 2.6.35.12 and the patch that removes the idev stuff in
from 11 Nov 2010 in 2.6.38-rc8 are there any other things
removed/changed in between that made it possible to remove the idef
stuff?
Q5) All of these basically lead up to: Why is it safe to remove all of
the idev stuff?
Q6) Is there a safe way I can safely remove this idev stuff or at
least avoid the NULL pointer deference? (Just blindly returning if it
is NULL would be the immediate reaction without thinking about it, but
chances are that's wrong---otherwise why is xfrm4_dst_ifdown written
the way it is?

Any help is greatly appreciated, thanks very much!

--Mark Larwill

REFERENCES:
A)
This post here to kerneltrap.org which seems to describe the old behavior:
http://kerneltrap.org/mailarchive/linux-netdev/2007/9/27/324077

B) The main change I am referring to, and text below
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=72cdd1d971c0deb1619c5c339270570c43647a78

It seems idev field in struct rtable has no special purpose, but adding
extra atomic ops.

We hold refcounts on the device itself (using percpu data, so pretty
cheap in current kernel).

infiniband case is solved using dst.dev instead of idev->dev

Removal of this field means routing without route cache is now using
shared data, percpu data, and only potential contention is a pair of
atomic ops on struct neighbour per forwarded packet.

About 5% speedup on routing test.

^ permalink raw reply

* [v3 PATCH 0/2] NETFILTER new target module, HMARK
From: Hans Schillstrom @ 2011-10-13 19:02 UTC (permalink / raw)
  To: kaber, pablo, jengelh, netfilter-devel, netdev; +Cc: hans, Hans Schillstrom

The target allows you to create rules in the "raw" and "mangle" tables
which alter the netfilter mark (nfmark) field within a given range.
First a 32 bit hash value is generated then modulus by <limit> and
finally an offset is added before it's written to nfmark.
Prior to routing, the nfmark can influence the routing method (see
"Use netfilter MARK value as routing key") and can also be used by
other subsystems to change their behaviour.

The mark match can also be used to match nfmark produced by this module.
See the kernel module for more info.

REVISION
Version 3
        Handling of SCTP for IPv6 added.

Version 2
	NAT Added for IPv4
	IPv6 ICMP handling enhanced.
	Usage example added

Version 1
	Initial RFC


We (Ericsson) use hmark in-front of ipvs as a pre-loadbalancer and
handles up to 70 ipvs running in parallel in clusters.
However hmark is not restricted to run infront of IPVS it can also be used as
"poor mans" load balancer.
With this version is also NAT supported as an option, with very high flows
you might not want to use conntrack.

The idea is to generate a direction independent fw mark range to use as input to
the routing (i.e. ip rule add fwmark ...).
Pretty straight forward and simple.


Example:
                                      App Server (Real Server)

                                           +---------+
                                        -->| Service |
     Gateway A                             +---------+
                          /
            +----------+ /     +----+      +---------+
--- if -A---| selector |---->  |ipvs|  --->| Service |
            +----------+ \     +----+      +---------+
                          \
                               +----+      +---------+
                               |ipvs|   -->| Service |
                               +----+      +---------+
      Gateway C
            +----------+ /     +----+
--- if-B ---| selector | --->  |ipvs|
            +----------+ \     +----+      +---------+
                                           | Service |
                                           +---------+
                          /
            +----------+ /     +----+     ..
--- if-B ---| selector | --->  |ipvs|      +---------+
            +----------+ \     +----+      | Service |
                          \                +---------+
#
# Example with four ipvs loadbalancers
#
iptables -t mangle -I PREROUTING -d $IPADDR -j HMARK --hmark-mod 4 --hmark-offs 100

ip rule add fwmark 100 table 100
ip rule add fwmark 101 table 101
ip rule add fwmark 102 table 102
ip rule add fwmark 103 table 103

ip ro ad table 100 default via x.y.z.1 dev bond1
ip ro ad table 101 default via x.y.z.2 dev bond1
ip ro ad table 102 default via x.y.z.3 dev bond1
ip ro ad table 103 default via x.y.z.4 dev bond1


If conntrack doesn't handle the return path,
do the oposite with HMARK and send it back right to ipvs.

Another exmaple of usage could be if you have cluster originated connections
and want to spread the connections over a number of interfaces
(NAT will complpicate things for you in this case)



                     \  Blade 1
                      \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /
   +------+
-- | Gw-A |          \  Blade 2
   +------+           \ +----------+      +---------+
   +------+         <-- | selector | <--- | Service |
-- | Gw-B |           / +----------+      +---------+
   +------+          /
   +------+
-- | Gw-C |          \
   +------+           \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /

                     \  Blande -n
                      \ +----------+      +---------+
                    <-- | selector | <--- | Service |
                      / +----------+      +---------+
                     /

Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply

* [v2 PATCH 2/2] NETFILTER userspace part for target HMARK
From: Hans Schillstrom @ 2011-10-13 19:02 UTC (permalink / raw)
  To: kaber, pablo, jengelh, netfilter-devel, netdev; +Cc: hans, Hans Schillstrom
In-Reply-To: <1318532530-29446-1-git-send-email-hans.schillstrom@ericsson.com>

The target allows you to create rules in the "raw" and "mangle" tables
which alter the netfilter mark (nfmark) field within a given range.
First a 32 bit hash value is generated then modulus by <limit> and
finally an offset is added before it's written to nfmark.
Prior to routing, the nfmark can influence the routing method (see
"Use netfilter MARK value as routing key") and can also be used by
other subsystems to change their behaviour.

The mark match can also be used to match nfmark produced by this module.

Ver 2
  IPv4 NAT added
  iptables ver 1.4.12.1 adaptions.

Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
---
 extensions/libxt_HMARK.c           |  381 ++++++++++++++++++++++++++++++++++++
 extensions/libxt_HMARK.man         |   66 ++++++
 include/linux/netfilter/xt_hmark.h |   48 +++++
 3 files changed, 495 insertions(+), 0 deletions(-)
 create mode 100644 extensions/libxt_HMARK.c
 create mode 100644 extensions/libxt_HMARK.man
 create mode 100644 include/linux/netfilter/xt_hmark.h

diff --git a/extensions/libxt_HMARK.c b/extensions/libxt_HMARK.c
new file mode 100644
index 0000000..0def034
--- /dev/null
+++ b/extensions/libxt_HMARK.c
@@ -0,0 +1,381 @@
+/*
+ * Shared library add-on to iptables to add HMARK target support.
+ *
+ * The kernel module calculates a hash value that can be modified by modulus
+ * and an offset. The hash value is based on a direction independent
+ * five tuple: src & dst addr src & dst ports and protocol.
+ * However src & dst port can be masked and are not used for fragmented
+ * packets, ESP and AH don't have ports so SPI will be used instead.
+ * For ICMP error messages the hash mark values will be calculated on
+ * the source packet i.e. the packet caused the error (If sufficient
+ * amount of data exists).
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <getopt.h>
+
+#include <xtables.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_hmark.h>
+
+
+#define DEF_HRAND 0xc175a3b8	/* Default "random" value to jhash */
+
+static void HMARK_help(void)
+{
+	printf(
+"HMARK target options, i.e. modify hash calculation by:\n"
+"  --hmark-smask value                Mask source address with value\n"
+"  --hmark-dmask value                Mask Dest. address with value\n"
+"  --hmark-sp-mask value              Mask src port with value\n"
+"  --hmark-dp-mask value              Mask dst port with value\n"
+"  --hmark-spi-mask value             For esp and ah AND spi with value\n"
+"  --hmark-sp-set value               OR src port with value\n"
+"  --hmark-dp-set value               OR dst port with value\n"
+"  --hmark-spi-set value              For esp and ah OR spi with value\n"
+"  --hmark-proto-mask value           Mask Protocol with value\n"
+"  --hmark-rnd                        Random value to hash cacl.\n"
+"  Limit/modify the calculated hash mark by:\n"
+"  --hmark-mod value                  nfmark modulus value\n"
+"  --hmark-offs value                 Last action add value to nfmark\n"
+" For NAT in IPv4 the original address can be used in the return path.\n"
+" Make sure to qualify the statement in a proper way when using nat flags\n"
+"  --hmark-dnat                       Replace src addr/port with original dst addr/port\n"
+"  --hmark-snat                       Replace dst addr/port with original src addr/port\n"
+" In many cases hmark can be omitted i.e. --smask can be used\n");
+}
+
+static const struct option HMARK_opts[] = {
+	{ "hmark-smask", 1, NULL, XT_HMARK_SADR_AND },
+	{ "hmark-dmask", 1, NULL, XT_HMARK_DADR_AND },
+	{ "hmark-sp-mask", 1, NULL, XT_HMARK_SPORT_AND },
+	{ "hmark-dp-mask", 1, NULL, XT_HMARK_DPORT_AND },
+	{ "hmark-spi-mask", 1, NULL, XT_HMARK_SPI_AND },
+	{ "hmark-sp-set", 1, NULL, XT_HMARK_SPORT_OR },
+	{ "hmark-dp-set", 1, NULL, XT_HMARK_DPORT_OR },
+	{ "hmark-spi-set", 1, NULL, XT_HMARK_SPI_OR },
+	{ "hmark-proto-mask", 1, NULL, XT_HMARK_PROTO_AND },
+	{ "hmark-rnd", 1, NULL, XT_HMARK_RND },
+	{ "hmark-mod", 1, NULL, XT_HMARK_MODULUS },
+	{ "hmark-offs", 1, NULL, XT_HMARK_OFFSET },
+	{ "hmark-dnat", 1, NULL, XT_HMARK_USE_DNAT },
+	{ "hmark-snat", 1, NULL, XT_HMARK_USE_SNAT },
+	{ "smask", 1, NULL, XT_HMARK_SADR_AND },
+	{ "dmask", 1, NULL, XT_HMARK_DADR_AND },
+	{ "sp-mask", 1, NULL, XT_HMARK_SPORT_AND },
+	{ "dp-mask", 1, NULL, XT_HMARK_DPORT_AND },
+	{ "spi-mask", 1, NULL, XT_HMARK_SPI_AND },
+	{ "sp-set", 1, NULL, XT_HMARK_SPORT_OR },
+	{ "dp-set", 1, NULL, XT_HMARK_DPORT_OR },
+	{ "spi-set", 1, NULL, XT_HMARK_SPI_OR },
+	{ "proto-mask", 1, NULL, XT_HMARK_PROTO_AND },
+	{ "rnd", 1, NULL, XT_HMARK_RND },
+	{ "mod", 1, NULL, XT_HMARK_MODULUS },
+	{ "offs", 1, NULL, XT_HMARK_OFFSET },
+	{ "dnat", 1, NULL, XT_HMARK_USE_DNAT },
+	{ "snat", 1, NULL, XT_HMARK_USE_SNAT },
+	{ .name = NULL }
+};
+
+static int
+HMARK_parse(int c, char **argv, int invert, unsigned int *flags,
+	    const void *entry, struct xt_entry_target **target)
+{
+	struct xt_hmark_info *hmarkinfo
+		= (struct xt_hmark_info *)(*target)->data;
+	unsigned int value = 0xffffffff;
+	unsigned int maxint = UINT32_MAX;
+
+	if ((c < XT_HMARK_SADR_AND) || (c > XT_HMARK_OFFSET)) {
+		xtables_error(PARAMETER_PROBLEM, "Bad HMARK option \"%s\"",
+			      optarg);
+		return 0;
+	}
+
+	if (c >= XT_HMARK_SPORT_AND && c <= XT_HMARK_DPORT_OR)
+		maxint = UINT16_MAX;
+	else if (c == XT_HMARK_PROTO_AND)
+		maxint = UINT8_MAX;
+
+	if (!xtables_strtoui(optarg, NULL, &value, 0, maxint))
+		xtables_error(PARAMETER_PROBLEM, "Bad HMARK value \"%s\"",
+			      optarg);
+
+	if (*flags == 0) {
+		memset(hmarkinfo, 0xff, sizeof(struct xt_hmark_info));
+		hmarkinfo->pset.v32 = 0;
+		hmarkinfo->flags = 0;
+		hmarkinfo->spiset = 0;
+		hmarkinfo->hoffs = 0;
+		hmarkinfo->hashrnd = DEF_HRAND;
+	}
+	switch (c) {
+	case XT_HMARK_SADR_AND:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-smask' once");
+		}
+		hmarkinfo->smask = htonl(value);
+		if (value == maxint)
+			c = 0;
+		break;
+
+	case XT_HMARK_DADR_AND:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-dmask' once");
+		}
+		hmarkinfo->dmask = htonl(value);
+		if (value == maxint)
+			c = 0;
+		break;
+
+	case XT_HMARK_MODULUS:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-mod' once");
+		}
+		if (value == 0) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "xxx modulus 0 ? "
+				      "thats a div by 0");
+			value = 0xffffffff;
+		}
+		hmarkinfo->hmod = value;
+		if (value == maxint)
+			c = 0;
+		break;
+
+	case XT_HMARK_OFFSET:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-offs' once");
+		}
+		hmarkinfo->hoffs = value;
+		if (value == 0)
+			c = 0;
+		break;
+
+	case XT_HMARK_SPORT_AND:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-sp-mask' once");
+		}
+		hmarkinfo->pmask.p16.src = htons(value);
+		if (value == maxint)
+			c = 0;
+		break;
+
+	case XT_HMARK_DPORT_AND:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-dp-mask' once");
+		}
+		hmarkinfo->pmask.p16.dst = htons(value);
+		if (value == maxint)
+			c = 0;
+		break;
+
+	case XT_HMARK_SPI_AND:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-spi-mask' once");
+		}
+		hmarkinfo->spimask = htonl(value);
+		if (value == maxint)
+			c = 0;
+		break;
+
+	case XT_HMARK_SPORT_OR:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-sp-set' once");
+		}
+		hmarkinfo->pset.p16.src = htons(value);
+		if (!value)
+			c = 0;
+		break;
+
+	case XT_HMARK_DPORT_OR:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-dp-set' once");
+		}
+		hmarkinfo->pset.p16.dst = htons(value);
+		if (!value)
+			c = 0;
+		break;
+
+	case XT_HMARK_SPI_OR:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-spi-set' once");
+		}
+		hmarkinfo->spiset = htonl(value);
+		if (!value)
+			c = 0;
+		break;
+
+	case XT_HMARK_PROTO_AND:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify "
+				      "`--hmark-proto-mask' once");
+		}
+		hmarkinfo->prmask = value;
+		if (value == maxint)
+			c = 0;
+		break;
+
+	case XT_HMARK_RND:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify `--hmark-rnd' once");
+		}
+		hmarkinfo->hashrnd = value;
+		if (value == DEF_HRAND)
+			c = 0;
+		break;
+
+	case XT_HMARK_USE_DNAT:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify `--hmark-rnd' once");
+		}
+		break;
+
+	case XT_HMARK_USE_SNAT:
+		if (*flags & (1 << c)) {
+			xtables_error(PARAMETER_PROBLEM,
+				      "Can only specify `--hmark-rnd' once");
+		}
+		break;
+
+	default:
+		return 0;
+	}
+	*flags |= 1 << c;
+	hmarkinfo->flags = *flags;
+
+	return 1;
+}
+
+static void HMARK_check(unsigned int flags)
+{
+	if (!(flags & (1 << XT_HMARK_MODULUS)))
+		xtables_error(PARAMETER_PROBLEM, "HMARK: the --hmark-mod, "
+			   "is not set, that means the nfmark will be in range"
+			   " 0 - 0xffffffff");
+}
+
+static void HMARK_print(const void *ip, const struct xt_entry_target *target,
+			int numeric)
+{
+	const struct xt_hmark_info *info =
+			(const struct xt_hmark_info *)target->data;
+
+	printf(" HMARK ");
+	if (info->flags & (1 << XT_HMARK_USE_SNAT))
+		printf("snat, ");
+	if (info->flags & (1 << XT_HMARK_SADR_AND))
+		printf("smask 0x%x ", htonl(info->smask));
+
+	if (info->flags & (1 << XT_HMARK_USE_DNAT))
+		printf("dnat, ");
+	if (info->flags & (1 << XT_HMARK_DADR_AND))
+		printf("dmask 0x%x ", htonl(info->dmask));
+
+	if (info->flags & (1 << XT_HMARK_SPORT_AND))
+		printf("sp-mask 0x%x ", htons(info->pmask.p16.src));
+	if (info->flags & (1 << XT_HMARK_DPORT_AND))
+		printf("dp-mask 0x%x ", htons(info->pmask.p16.dst));
+	if (info->flags & (1 << XT_HMARK_SPI_AND))
+		printf("spi-mask 0x%x ", htonl(info->spimask));
+	if (info->flags & (1 << XT_HMARK_SPORT_OR))
+		printf("sp-set 0x%x ", htons(info->pset.p16.src));
+	if (info->flags & (1 << XT_HMARK_DPORT_OR))
+		printf("dp-set 0x%x ", htons(info->pset.p16.dst));
+	if (info->flags & (1 << XT_HMARK_SPI_OR))
+		printf("spi-set 0x%x ", htonl(info->spiset));
+	if (info->flags & (1 << XT_HMARK_PROTO_AND))
+		printf("proto-mask 0x%x ", info->prmask);
+	if (info->flags & (1 << XT_HMARK_RND))
+		printf("rnd 0x%x ", info->hashrnd);
+	if (info->flags & (1 << XT_HMARK_MODULUS))
+		printf("mark=hv %% 0x%x ", info->hmod);
+	if (info->flags & (1 << XT_HMARK_OFFSET))
+		printf("+ 0x%x ", info->hoffs);
+}
+
+static void HMARK_save(const void *ip, const struct xt_entry_target *target)
+{
+	const struct xt_hmark_info *info =
+		(const struct xt_hmark_info *)target->data;
+
+	if (info->flags & (1 << XT_HMARK_SADR_AND))
+		printf("--hmark-smask 0x%x ", htonl(info->smask));
+	if (info->flags & (1 << XT_HMARK_DADR_AND))
+		printf("--hmark-dmask 0x%x ", htonl(info->dmask));
+	if (info->flags & (1 << XT_HMARK_SPORT_AND))
+		printf("--hmark-sp-mask 0x%x ", htons(info->pmask.p16.src));
+	if (info->flags & (1 << XT_HMARK_DPORT_AND))
+		printf("--hmark-dp-mask 0x%x ", htons(info->pmask.p16.dst));
+	if (info->flags & (1 << XT_HMARK_SPI_AND))
+		printf("--hmark-spi-mask 0x%x ", htonl(info->spimask));
+	if (info->flags & (1 << XT_HMARK_SPORT_OR))
+		printf("--hmark-sp-set 0x%x ", htons(info->pset.p16.src));
+	if (info->flags & (1 << XT_HMARK_DPORT_OR))
+		printf("--hmark-dp-set 0x%x ", htons(info->pset.p16.dst));
+	if (info->flags & (1 << XT_HMARK_SPI_OR))
+		printf("--hmark-spi-set 0x%x ", htonl(info->spiset));
+	if (info->flags & (1 << XT_HMARK_PROTO_AND))
+		printf("--hmark-proto-mask 0x%x ", info->prmask);
+	if (info->flags & (1 << XT_HMARK_RND))
+		printf("--hmark-rnd 0x%x ", info->hashrnd);
+	if (info->flags & (1 << XT_HMARK_MODULUS))
+		printf("--hmark-mod 0x%x ", info->hmod);
+	if (info->flags & (1 << XT_HMARK_OFFSET))
+		printf("--hmark-offs 0x%x ", info->hoffs);
+	if (info->flags & (1 << XT_HMARK_USE_DNAT))
+		printf("--hmark-dnat ");
+	if (info->flags & (1 << XT_HMARK_USE_SNAT))
+		printf("--hmark-snat ");
+}
+
+static struct xtables_target mark_tg_reg[] = {
+	{
+		.family        = NFPROTO_UNSPEC,
+		.name          = "HMARK",
+		.version       = XTABLES_VERSION,
+		.revision      = 0,
+		.size          = XT_ALIGN(sizeof(struct xt_hmark_info)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_hmark_info)),
+		.help          = HMARK_help,
+		.parse         = HMARK_parse,
+		.final_check   = HMARK_check,
+		.print         = HMARK_print,
+		.save          = HMARK_save,
+		.extra_opts    = HMARK_opts,
+	},
+};
+
+void _init(void)
+{
+	xtables_register_targets(mark_tg_reg, ARRAY_SIZE(mark_tg_reg));
+}
+
diff --git a/extensions/libxt_HMARK.man b/extensions/libxt_HMARK.man
new file mode 100644
index 0000000..8f44676
--- /dev/null
+++ b/extensions/libxt_HMARK.man
@@ -0,0 +1,66 @@
+This module does the same as MARK, i.e. set an fwmark, but the mark is based on a hash value.
+The hash is based on saddr, daddr, sport, dport and proto. The same mark will be produced independet of direction if no masks is set or the same masks is used for src and dest.
+The hash mark could be adjusted by modulus and finally an offset could be added, i.e the final mark will be within a range. If state RELATED is used icmp will be handled also, i.e. hash will be calculated on the original message not the icmp it self.
+Note: None of the parameters effect the packet it self only the calculated hash value.
+.PP
+Parameters:
+For all masks default is all "1:s", to disable a field use mask 0
+For IPv6 it's just the last 32 bits that is included in the hash
+.TP
+\fB\-\-hmark\-smask\fP \fIvalue\fP
+The value to AND the source address with (saddr & value).
+.TP
+\fB\-\-hmark\-dmask\fP \fIvalue\fP
+The value to AND the dest. address with (daddr & value).
+.TP
+\fB\-\-hmark\-sp\-mask\fP \fIvalue\fP
+A 16 bit value to AND the src port with (sport & value).
+.TP
+\fB\-\-hmark\-dp\-mask\fP \fIvalue\fP
+A 16 bit value to AND the dest port with (dport & value).
+.TP
+\fB\-\-hmark\-sp\-set\fP \fIvalue\fP
+A 16 bit value to OR the src port with (sport | value).
+.TP
+\fB\-\-hmark\-dp\-set\fP \fIvalue\fP
+A 16 bit value to OR the dest port with (dport | value).
+.TP
+\fB\-\-hmark\-spi\-mask\fP \fIvalue\fP
+Value to AND the spi field with (spi & value) valid for proto esp or ah.
+.TP
+\fB\-\-hmark\-spi\-set\fP \fIvalue\fP
+Value to OR the spi field with (spi | value) valid for proto esp or ah.
+.TP
+\fB\-\-hmark\-proto\-mask\fP \fIvalue\fP
+An 8 bit value to AND the L4 proto field with (proto & value).
+.TP
+\fB\-\-hmark\-rnd\fP \fIvalue\fP
+A 32 bit initial value for hash calc, default is 0xc175a3b8.
+.TP
+\fB\-\-hmark\-dnat\fP
+Replace src addr/port with original dst addr/port before calc, hash
+.TP
+\fB\-\-hmark\-dnat\fP
+Replace dst addr/port with original src addr/port before calc, hash
+.PP
+Final processing of the mark in order of execution.
+.TP
+\fB\-\-hmark\-mod\fP \fvalue (must be > 0)\fP
+The easiest way to describe this is:  hash = hash mod <value>
+.TP
+\fB\-\-hmark\-offs\fP \fvalue\fP
+The easiest way to describe this is:  hash = hash + <value>
+.PP
+\fIExamples:\fP
+.PP
+Default rule handles all TCP, UDP, SCTP, ESP & AH
+.IP
+iptables \-t mangle \-A PREROUTING \-m state \-\-state NEW,ESTABLISHED,RELATED
+ \-j HMARK \-\-hmark-offs 10000 \-\-hmark-mod 10
+.PP
+Handle SCTP and hash dest port only and produce a nfmark between 100-119.
+.IP
+iptables \-t mangle \-A PREROUTING -p SCTP \-j HMARK \-\-smask 0 \-\-dmask 0
+ \-\-sp\-mask 0 \-\-offs 100 \-\-mod 20
+.PP
+
diff --git a/include/linux/netfilter/xt_hmark.h b/include/linux/netfilter/xt_hmark.h
new file mode 100644
index 0000000..7b3ee5d
--- /dev/null
+++ b/include/linux/netfilter/xt_hmark.h
@@ -0,0 +1,48 @@
+#ifndef XT_HMARK_H_
+#define XT_HMARK_H_
+
+#include <linux/types.h>
+
+/*
+ * Flags must not start at 0, since it's used as none.
+ */
+enum {
+	XT_HMARK_USE_SNAT = 1,	/* SNAT & DNAT are used by the kernel module */
+	XT_HMARK_USE_DNAT,
+	XT_HMARK_SADR_AND,
+	XT_HMARK_DADR_AND,
+	XT_HMARK_SPI_AND,
+	XT_HMARK_SPI_OR,
+	XT_HMARK_SPORT_AND,
+	XT_HMARK_DPORT_AND,
+	XT_HMARK_SPORT_OR,
+	XT_HMARK_DPORT_OR,
+	XT_HMARK_PROTO_AND,
+	XT_HMARK_RND,
+	XT_HMARK_MODULUS,
+	XT_HMARK_OFFSET,
+};
+
+union ports {
+	struct {
+		__u16	src;
+		__u16	dst;
+	} p16;
+	__u32	v32;
+};
+
+struct xt_hmark_info {
+	__u32		smask;		/* Source address mask */
+	__u32		dmask;		/* Dest address mask */
+	union ports	pmask;
+	union ports	pset;
+	__u32		spimask;
+	__u32		spiset;
+	__u16		flags;		/* Print out only */
+	__u16		prmask;		/* L4 Proto mask */
+	__u32		hashrnd;
+	__u32		hmod;		/* Modulus */
+	__u32		hoffs;		/* Offset */
+};
+
+#endif /* XT_HMARK_H_ */
-- 
1.7.4.4

^ permalink raw reply related

* [v3 PATCH 1/2] NETFILTER module xt_hmark, new target for HASH based fwmark
From: Hans Schillstrom @ 2011-10-13 19:02 UTC (permalink / raw)
  To: kaber, pablo, jengelh, netfilter-devel, netdev; +Cc: hans, Hans Schillstrom
In-Reply-To: <1318532530-29446-1-git-send-email-hans.schillstrom@ericsson.com>

From: Hans Schillstrom <hans@schillstrom.com>

The target allows you to create rules in the "raw" and "mangle" tables
which alter the netfilter mark (nfmark) field within a given range.
First a 32 bit hash value is generated then modulus by <limit> and
finally an offset is added before it's written to nfmark.
Prior to routing, the nfmark can influence the routing method (see
"Use netfilter MARK value as routing key") and can also be used by
other subsystems to change their behavior.

man page
   HMARK
       This  module  does  the same as MARK, i.e. set an fwmark,
       but the mark is based on a hash value.  The hash is based on
       saddr, daddr, sport, dport and proto. The same mark will be produced
       independet of direction if no masks is set or the same masks is used for
       src and dest. The hash mark could be adjusted by modulus and finaly an
       offset could be added, i.e the final mark will be within a range.
       ICMP errors will have hash calc based on the original message.
       Note: None of the parameters effect the packet it self
       only the calculated hash value.

       Parameters: For all masks default is all "1:s", to disable a field
                   use mask 0. For IPv6 it's just the last 32 bits that
                   is included in the hash.

       --hmark-smask value
              The value to AND the source address with (saddr & value).

       --hmark-dmask value
              The value to AND the dest. address with (daddr & value).

       --hmark-sp-mask value
              A 16 bit value to AND the src port with (sport & value).

       --hmark-dp-mask value
              A 16 bit value to AND the dest port with (dport & value).

       --hmark-sp-set value
              A 16 bit value to OR the src port with (sport | value).

       --hmark-dp-set value
              A 16 bit value to OR the dest port with (dport | value).

       --hmark-spi-mask value
              Value to AND the spi field with (spi & value) valid for proto esp or ah.

       --hmark-spi-set value
              Value to OR the spi field with (spi | value) valid for proto esp or ah.

       --hmark-proto-mask value
              A 16 bit value to AND the L4 proto field with (proto & value).

       --hmark-rnd value
              A 32 bit intitial value for hash calc, default is 0xc175a3b8.

       --hmark-dnat
              Replace src addr/port with original dst addr/port before calc, hash

       --hmark-snat
              Replace dst addr/port with original src addr/port before calc, hash

       Final processing of the mark in order of execution.

       --hmark-mod value (must be > 0)
              The easiest way to describe this is:  hash = hash mod <value>

       --hmark-offs alue (must be > 0)
              The easiest way to describe this is:  hash = hash + <value>

       Examples:

       Default rule handles all TCP, UDP, SCTP, ESP & AH
Rev 3
      Support added to SCTP for IPv6
Rev 2
      IPv6 header scan changed to follow RFC 2640
      IPv4 icmp echo fragmented does now use proto as ipv6
      IPv6 pskb_may_pull() check is done in every time in header loop.
      IPv4 nat support added.
      default added in IPv6 loop and null check of hp

Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
---
 include/linux/netfilter/xt_hmark.h |   48 ++++++
 include/net/ipv6.h                 |    1 +
 net/netfilter/Kconfig              |   17 ++
 net/netfilter/Makefile             |    1 +
 net/netfilter/xt_hmark.c           |  321 ++++++++++++++++++++++++++++++++++++
 5 files changed, 388 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_hmark.h
 create mode 100644 net/netfilter/xt_hmark.c

diff --git a/include/linux/netfilter/xt_hmark.h b/include/linux/netfilter/xt_hmark.h
new file mode 100644
index 0000000..6c1436a
--- /dev/null
+++ b/include/linux/netfilter/xt_hmark.h
@@ -0,0 +1,48 @@
+#ifndef XT_HMARK_H_
+#define XT_HMARK_H_
+
+#include <linux/types.h>
+
+/*
+ * Flags must not start at 0, since it's used as none.
+ */
+enum {
+	XT_HMARK_SADR_AND = 1,	/* SNAT & DNAT are used by the kernel module */
+	XT_HMARK_DADR_AND,
+	XT_HMARK_SPI_AND,
+	XT_HMARK_SPI_OR,
+	XT_HMARK_SPORT_AND,
+	XT_HMARK_DPORT_AND,
+	XT_HMARK_SPORT_OR,
+	XT_HMARK_DPORT_OR,
+	XT_HMARK_PROTO_AND,
+	XT_HMARK_RND,
+	XT_HMARK_MODULUS,
+	XT_HMARK_OFFSET,
+	XT_HMARK_USE_SNAT,
+	XT_HMARK_USE_DNAT,
+};
+
+union ports {
+	struct {
+		__u16	src;
+		__u16	dst;
+	} p16;
+	__u32	v32;
+};
+
+struct xt_hmark_info {
+	__u32		smask;		/* Source address mask */
+	__u32		dmask;		/* Dest address mask */
+	union ports	pmask;
+	union ports	pset;
+	__u32		spimask;
+	__u32		spiset;
+	__u16		flags;		/* Print out only */
+	__u16		prmask;		/* L4 Proto mask */
+	__u32		hashrnd;
+	__u32		hmod;		/* Modulus */
+	__u32		hoffs;		/* Offset */
+};
+
+#endif /* XT_HMARK_H_ */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 3b5ac1f..e0c77ff 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -39,6 +39,7 @@
 #define NEXTHDR_ICMP		58	/* ICMP for IPv6. */
 #define NEXTHDR_NONE		59	/* No next header */
 #define NEXTHDR_DEST		60	/* Destination options header. */
+#define NEXTHDR_SCTP		132	/* Stream Control Transport Protocol */
 #define NEXTHDR_MOBILITY	135	/* Mobility header. */
 
 #define NEXTHDR_MAX		255
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 32bff6d..3abd3a4 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -483,6 +483,23 @@ config NETFILTER_XT_TARGET_IDLETIMER
 
 	  To compile it as a module, choose M here.  If unsure, say N.
 
+config NETFILTER_XT_TARGET_HMARK
+	tristate '"HMARK" target support'
+	depends on NETFILTER_ADVANCED
+	---help---
+	This option adds the "HMARK" target.
+
+	The target allows you to create rules in the "raw" and "mangle" tables
+	which alter the netfilter mark (nfmark) field within a given range.
+	First a 32 bit hash value is generated then modulus by <limit> and
+	finally an offset is added before it's written to nfmark.
+
+	Prior to routing, the nfmark can influence the routing method (see
+	"Use netfilter MARK value as routing key") and can also be used by
+	other subsystems to change their behavior.
+
+	The mark match can also be used to match nfmark produced by this module.
+
 config NETFILTER_XT_TARGET_LED
 	tristate '"LED" target support'
 	depends on LEDS_CLASS && LEDS_TRIGGERS
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 1a02853..359eeb6 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -56,6 +56,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_CONNSECMARK) += xt_CONNSECMARK.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_CT) += xt_CT.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_DSCP) += xt_DSCP.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_HL) += xt_HL.o
+obj-$(CONFIG_NETFILTER_XT_TARGET_HMARK) += xt_hmark.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_LED) += xt_LED.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_NFLOG) += xt_NFLOG.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_NFQUEUE) += xt_NFQUEUE.o
diff --git a/net/netfilter/xt_hmark.c b/net/netfilter/xt_hmark.c
new file mode 100644
index 0000000..fad15ee
--- /dev/null
+++ b/net/netfilter/xt_hmark.c
@@ -0,0 +1,321 @@
+/*
+ *	xt_hmark - Netfilter module to set mark as hash value
+ *
+ *	(C) 2011 Hans Schillstrom <hans.schillstrom@ericsson.com>
+ *
+ *	Description:
+ *	This module calculates a hash value that can be modified by modulus
+ *	and an offset. The hash value is based on a direction independent
+ *	five tuple: src & dst addr src & dst ports and protocol.
+ *	However src & dst port can be masked and are not used for fragmented
+ *	packets, ESP and AH don't have ports so SPI will be used instead.
+ *	For ICMP error messages the hash mark values will be calculated on
+ *	the source packet i.e. the packet caused the error (If sufficient
+ *	amount of data exists).
+ *
+ *	This program is free software; you can redistribute it and/or modify
+ *	it under the terms of the GNU General Public License version 2 as
+ *	published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <net/ip.h>
+#include <linux/icmp.h>
+
+#include <linux/netfilter/xt_hmark.h>
+#include <linux/netfilter/x_tables.h>
+#include <net/netfilter/nf_nat.h>
+
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+#	define WITH_IPV6 1
+#include <net/ipv6.h>
+#include <linux/netfilter_ipv6/ip6_tables.h>
+#endif
+
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Hans Schillstrom <hans.schillstrom@ericsson.com>");
+MODULE_DESCRIPTION("Xtables: packet range mark operations by hash value");
+MODULE_ALIAS("ipt_HMARK");
+MODULE_ALIAS("ip6t_HMARK");
+
+/*
+ * ICMP, get inner header so calc can be made on the source message
+ *       not the icmp header, i.e. same hash mark must be produced
+ *       on an icmp error message.
+ */
+static int get_inner_hdr(struct sk_buff *skb, int iphsz, int nhoff)
+{
+	const struct icmphdr *icmph;
+	struct icmphdr _ih;
+	struct iphdr *iph = NULL;
+
+	/* Not enough header? */
+	icmph = skb_header_pointer(skb, nhoff + iphsz, sizeof(_ih), &_ih);
+	if (icmph == NULL)
+		goto out;
+
+	if (icmph->type > NR_ICMP_TYPES)
+		goto out;
+
+
+	/* Error message? */
+	if (icmph->type != ICMP_DEST_UNREACH &&
+	    icmph->type != ICMP_SOURCE_QUENCH &&
+	    icmph->type != ICMP_TIME_EXCEEDED &&
+	    icmph->type != ICMP_PARAMETERPROB &&
+	    icmph->type != ICMP_REDIRECT)
+		goto out;
+	/* Checkin full IP header plus 8 bytes of protocol to
+	 * avoid additional coding at protocol handlers.
+	 */
+	if (!pskb_may_pull(skb, nhoff + iphsz + sizeof(_ih) + 8))
+		goto out;
+
+	iph = (struct iphdr *)(skb->data + nhoff + iphsz + sizeof(_ih));
+	return nhoff + iphsz + sizeof(_ih);
+out:
+	return nhoff;
+}
+/*
+ * ICMPv6
+ * Input nhoff Offset into network header
+ *       offset where ICMPv6 header starts
+ * Returns true if it's a icmp error and updates nhoff
+ */
+#ifdef WITH_IPV6
+static int get_inner6_hdr(struct sk_buff *skb, int *offset, int hdrlen)
+{
+	struct icmp6hdr *icmp6h;
+	struct icmp6hdr _ih6;
+
+	icmp6h = skb_header_pointer(skb, *offset + hdrlen, sizeof(_ih6), &_ih6);
+	if (icmp6h == NULL)
+		goto out;
+
+	if (icmp6h->icmp6_type && icmp6h->icmp6_type < 128) {
+		*offset += hdrlen + sizeof(_ih6);
+		return 1;
+	}
+out:
+	return 0;
+}
+#endif
+
+/*
+ * Calc hash value, special casre is taken on icmp and fragmented messages
+ * i.e. fragmented messages don't use ports.
+ */
+static __u32 get_hash(struct sk_buff *skb, struct xt_hmark_info *info)
+{
+	int nhoff, hash = 0, poff, proto, frag = 0;
+	struct iphdr *ip;
+	u8 ip_proto;
+	u32 addr1, addr2, ihl;
+	u16 snatport = 0, dnatport = 0;
+	union {
+		u32 v32;
+		u16 v16[2];
+	} ports;
+
+	nhoff = skb_network_offset(skb);
+	proto = skb->protocol;
+
+	if (!proto && skb->sk) {
+		if (skb->sk->sk_family == AF_INET)
+			proto = __constant_htons(ETH_P_IP);
+		else if (skb->sk->sk_family == AF_INET6)
+			proto = __constant_htons(ETH_P_IPV6);
+	}
+
+	switch (proto) {
+	case __constant_htons(ETH_P_IP):
+	{
+		enum ip_conntrack_info ctinfo;
+		struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
+		struct nf_conntrack_tuple *otuple, *rtuple;
+
+		if (!pskb_may_pull(skb, sizeof(*ip) + nhoff))
+			goto done;
+
+		ip = (struct iphdr *) (skb->data + nhoff);
+		if (ip->protocol == IPPROTO_ICMP) {
+			/* Switch hash calc to inner header ? */
+			nhoff = get_inner_hdr(skb, ip->ihl * 4, nhoff);
+			ip = (struct iphdr *) (skb->data + nhoff);
+		}
+
+		if (ip->frag_off & htons(IP_MF | IP_OFFSET))
+			frag = 1;
+
+		ip_proto = ip->protocol;
+		ihl = ip->ihl;
+		addr1 = (__force u32) ip->saddr & info->smask;
+		addr2 = (__force u32) ip->daddr & info->dmask;
+
+		if (!ct || !nf_ct_is_confirmed(ct))
+			break;
+		otuple = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+		/* On the "return flow", to get the original address
+		 * i,e, replace the source address.
+		 */
+		if (ct->status & IPS_DST_NAT &&
+		    info->flags & XT_HMARK_USE_DNAT) {
+			rtuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+			addr1 = (__force u32) otuple->dst.u3.in.s_addr;
+			dnatport = otuple->dst.u.udp.port;
+		}
+		/* On the "return flow", to get the original address
+		 * i,e, replace the destination address.
+		 */
+		if (ct->status & IPS_SRC_NAT &&
+		    info->flags & XT_HMARK_USE_SNAT) {
+			rtuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+			addr2 = (__force u32) otuple->src.u3.in.s_addr;
+			snatport = otuple->src.u.udp.port;
+		}
+		break;
+	}
+#ifdef WITH_IPV6
+	case __constant_htons(ETH_P_IPV6):
+	{
+		struct ipv6hdr *ip6;	/* ip hdr */
+		int hdrlen = 0;	/* In ip header */
+		u8 nexthdr;
+		int ip6hdrlvl = 0;	/* Header level */
+		struct ipv6_opt_hdr _hdr, *hp;
+
+hdr_new:
+		if (!pskb_may_pull(skb, sizeof(*ip6) + nhoff))
+			goto done;
+
+		/* ip header */
+		ip6 = (struct ipv6hdr *) (skb->data + nhoff);
+		nexthdr = ip6->nexthdr;
+		/* nhoff += sizeof(struct ipv6hdr);  Where hdr starts */
+		hdrlen = sizeof(struct ipv6hdr);
+		hp = skb_header_pointer(skb, nhoff + hdrlen, sizeof(_hdr),
+					&_hdr);
+		while (nexthdr) {
+			switch (nexthdr) {
+			case IPPROTO_ICMPV6:
+				/* ICMP Error then move ptr to inner header */
+				if (get_inner6_hdr(skb, &nhoff, hdrlen)) {
+					ip6hdrlvl++;
+					goto hdr_new;
+				}
+				nhoff += hdrlen;
+				goto hdr_rdy;
+
+			case NEXTHDR_FRAGMENT:
+				if (!ip6hdrlvl)
+					frag = 1;
+				break;
+			/* End of hdr traversing */
+			case NEXTHDR_IPV6:	/* Do not process tunnels */
+			case NEXTHDR_TCP:
+			case NEXTHDR_UDP:
+			case NEXTHDR_ESP:
+			case NEXTHDR_AUTH:
+			case NEXTHDR_SCTP:
+			case NEXTHDR_NONE:
+				nhoff += hdrlen;
+				goto hdr_rdy;
+			default:
+				goto done;
+			}
+			if (!hp)
+				goto done;
+			nhoff += hdrlen;	/* eat current header */
+			nexthdr =  hp->nexthdr;	/* Next header */
+			hdrlen = ipv6_optlen(hp);
+			hp = skb_header_pointer(skb, nhoff + hdrlen,
+						sizeof(_hdr), &_hdr);
+
+			if (!pskb_may_pull(skb, nhoff))
+				goto done;
+		}
+hdr_rdy:
+		ip_proto = nexthdr;
+
+		addr1 = (__force u32) ip6->saddr.s6_addr32[3];
+		addr2 = (__force u32) ip6->daddr.s6_addr32[3];
+		ihl = 0; /* (40 >> 2); */
+		break;
+	}
+#endif
+	default:
+		goto done;
+	}
+
+	ports.v32 = 0;
+	poff = proto_ports_offset(ip_proto);
+	nhoff += ihl * 4 + poff;
+	if (!frag && poff >= 0 && pskb_may_pull(skb, nhoff + 4)) {
+		ports.v32 = * (__force u32 *) (skb->data + nhoff);
+		if (ip_proto == IPPROTO_ESP || ip_proto == IPPROTO_AH) {
+			ports.v32 = (ports.v32 & info->spimask) | info->spiset;
+		} else { /* Handle endian */
+			if (snatport)	/* Replace snated dst port (ret flow) */
+				ports.v16[1] = snatport;
+			if (dnatport)
+				ports.v16[0] = dnatport;
+			ports.v32 = (ports.v32 & info->pmask.v32) |
+				    info->pset.v32;
+			if (ports.v16[1] < ports.v16[0])
+				swap(ports.v16[0], ports.v16[1]);
+		}
+	}
+	ip_proto &= info->prmask;
+	/* get a consistent hash (same value on both flow directions) */
+	if (addr2 < addr1)
+		swap(addr1, addr2);
+
+	hash = jhash_3words(addr1, addr2, ports.v32, info->hashrnd) ^ ip_proto;
+	if (!hash)
+		hash = 1;
+
+	return hash;
+
+done:
+	return 0;
+}
+
+static unsigned int
+hmark_tg(struct sk_buff *skb, const struct xt_action_param *par)
+{
+	struct xt_hmark_info *info = (struct xt_hmark_info *)par->targinfo;
+	__u32 hash = get_hash(skb, info);
+
+	if (info->hmod && hash)
+		skb->mark = (hash % info->hmod) + info->hoffs;
+	return XT_CONTINUE;
+}
+
+static struct xt_target hmark_tg_reg __read_mostly = {
+	.name           = "HMARK",
+	.revision       = 0,
+	.family         = NFPROTO_UNSPEC,
+	.target         = hmark_tg,
+	.targetsize     = sizeof(struct xt_hmark_info),
+	.me             = THIS_MODULE,
+};
+
+static int __init hmark_mt_init(void)
+{
+	int ret;
+
+	ret = xt_register_target(&hmark_tg_reg);
+	if (ret < 0)
+		return ret;
+	return 0;
+}
+
+static void __exit hmark_mt_exit(void)
+{
+	xt_unregister_target(&hmark_tg_reg);
+}
+
+module_init(hmark_mt_init);
+module_exit(hmark_mt_exit);
-- 
1.7.6

^ permalink raw reply related

* Route flagged RTCF_REDIRECTED without ICMP redirs?
From: sveniu @ 2011-10-13 18:50 UTC (permalink / raw)
  To: netdev

How can a route end up with being flagged with RTCF_REDIRECTED, and
point to the default gateway, even though it's explicitly set to route
to another node in the same subnet, in the rpdb and routing tables?
There is zero trace of icmp redirects, and all redirect sysctls have
been disabled, and the route cache flushed before every test.

The flag is only set in route.c:rt_init_metrics() and check_peer_redir(),
only if peer->redirect_learned.a4 is set. The only place I see that
being modified, is in route.c:ip_rt_redirect(), which I only see called
from icmp.c:icmp_redirect(). What gives?

This is using kernel version 3.0.

This is happening on a two-node LVS/ipvs setup, where the master node A
schedules packets to node B, and due to having to use NETMAP to handle
multiple overlapping source subnets, node B must send return packets back
to node A for correct translation back to the requestor.

However, node B (172.16.0.3) insists on sending packets straight to its
default gateway (172.16.0.1). Excessive logging in all netfilter tables
and chains, and tcpdump on all interfaces, doesn't show abnormal activity.
Node B's lvs/ipvs does not touch the packet at all.

Here's how it looks after node B has seen a packet, and has responded (by
wrongly sending the response to its default gateway):

# ip route show cache
10.0.0.2 from 172.16.0.3 via 172.16.0.1 dev bond0.310
   cache <redirected>  ipid 0x80e3 rtt 80ms rttvar 70ms cwnd 10

Entry in the rpdb:

# ip rule show
0:      from all lookup local
99:     from 172.16.0.3 to 10.0.0.0/24 lookup to_node1
32766:  from all lookup main
32767:  from all lookup default
(The rpdb really should have eval/match counters, btw!)

Corresponding routing table:

# ip route show table to_node1
default via 172.16.0.2 dev bond0.310

# ip route show
default via 172.16.0.1 dev bond0.310
172.16.0.0/24 dev bond0.310  proto kernel  scope link  src 172.16.0.3
172.16.1.0/24 dev bond0.311  proto kernel  scope link  src 172.16.1.3

Relevant sysctls have been configured on both node A and B:
net.ipv4.conf.*.shared_media = 0
net.ipv4.conf.*.accept_redirects = 0
net.ipv4.conf.*.secure_redirects = 0
net.ipv4.conf.*.send_redirects = 0
* = {all,default,devices}
(Same for ipv6 too, for good measure, although there's no ipv6 traffic.)

Tcpdump on all interfaces shows no traces of any icmp activity. The
'netstat -s' icmp redirect counter does not increase.

What am I missing?

best regards,
Sven Ulland

^ permalink raw reply

* RE: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Hans Schillström @ 2011-10-13 18:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jiri Pirko, Maxime Bizon, davem@davemloft.net,
	netdev@vger.kernel.org, jesse@nicira.com, fubar@us.ibm.com
In-Reply-To: <4E972301.4030004@intel.com>


>On 10/13/2011 8:59 AM, Hans Schillström wrote:
>> Thu, Oct 13, 2011 at 05:04:34PM CEST, mbizon@freebox.fr wrote:
>>>>
>>>> On Tue, 2011-10-11 at 00:37 +0200, Jiri Pirko wrote:
>>>>
>>>>> Hmm, I must look at this again tomorrow but I have strong feeling this
>>>>> will break some some scenario including vlan-bridge-macvlan.
>>>>
>>>> unless I'm mistaken, today's behaviour:
>>>>
>>>> # vconfig add eth0 100
>>>> # brctl addbr br0
>>>> # brctl addif br0 eth0
>>>>
>>>> => eth0.100 gets no more packets, br0.100 is to be used
>>>>
>>>> after the patch won't we get the opposite ?
>>>
>>> Looks like it. The question is what is the correct behaviour...
>>
>> I think this it become correct now, you should not destroy lover level if possible.
>> I.e. as John wrote "it's not an unexpected behaviour"
>>
>> Consider adding a bridge to a vlan like this
>>
>> vconfig add eth0 100
>> brctl addbr br1
>> brctl addif br1 eth0.100
>>
>> If you later add a bridge (or bond) should the previous added bridge still work ?
>> Yes I think so, for me it's the expected behaviour.
>>
>> brctl addbr br0
>> brctl addif br0 eth0
>>
>
>Sorry I'm not entirely sure I followed the above two posts. Note
>this patch restores behavior that has existed for most of the
>2.6.x kernels. In the 3.x kernels VLAN100 is dropped in the
>schematic below (assuming eth0 is active), I think this is
>incorrect.
>
>      ---> eth0.100
>       |
>       |
>eth0 -----> br0
>
>With this patch VLAN100 is delivered to eth0.100 as I expect. Now
>adding a VLAN100 to br0.
>
>
>       ---> eth0.100
>       |
>       |
>eth0 -----> br0---> br0.100
>
>With this patch in the above case VLAN100 is delivered to the
>first matching vlan. Here eth0.100 will receive the packet If
>you really want br0.100 to receive the packet remove eth0.100.
>Without this patch the packet will be delivered to br0.100,
>but no configuration will allow a packet to be delivered to
>eth0.100 with a bond.
>
>The first schematic is used for doing bonding on LAN traffic
>and SAN (storage) traffic with MPIO. So I think my expectations
>are correct and have a real use case.
>
>Thanks,
>John

Sorry if I caused confusion,  I do agree with you to 100%.

Regards
Hans

^ permalink raw reply

* Re: [PATCH net-next] bonding: fix wrong port enabling in 802.3ad
From: Flavio Leitner @ 2011-10-13 18:20 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, Andy Gospodarek
In-Reply-To: <18216.1318527991@death>

On Thu, 13 Oct 2011 10:46:31 -0700
Jay Vosburgh <fubar@us.ibm.com> wrote:

> Flavio Leitner <fbl@redhat.com> wrote:
> 
> >The port shouldn't be enabled unless its current MUX
> >state is DISTRIBUTING which is correctly handled by
> >ad_mux_machine(), otherwise the packet sent can be
> >lost because the other end may not be ready.
> >
> >The issue happens on every port initialization, but
> >as the ports are expected to move quickly to DISTRIBUTING,
> >it doesn't cause much problem.  However, it does cause
> >constant packet loss if the other peer has the port
> >configured to stay in STANDBY (i.e. SYNC set to OFF).
> 
> 	This may explain another misbehavior I've been looking into: if
> the bond's outgoing LACPDUs are lost (never received by the switch), but
> the switch's incoming LACPDUs are received, bonding puts the port into
> use, and packets to the switch are dropped by the switch.
>

Yeah, it could explain that as well because on my tests here,
all ports were enabled as soon as the aggregator is active.

 
> >Signed-off-by: Flavio Leitner <fbl@redhat.com>
> >---
> >
> >The comments there suggests it was a workaround for losses
> >of link events, but I couldn't track the changelog as it
> >seems to be pretty old.  Thus, as all the link notification
> >stuff has been improved a lot, maybe this is not an issue
> >anymore.  At least, I didn't find any problem while
> >unplugging/plugging cables here.
> 
> 	I believe this code fragment is original to the 802.3ad
> submission, which would have been around 2003 or so.
> 
> 	Did you check the standard for what it says should happen in
> this case?  I'm guessing this is something not specified by the
> standard, given the comment, but we should check to make sure.
> 

The standard says that the port should receive when its mux state
is COLLECTING and transmitting when its mux state is DISTRIBUTING.
So, that seems to be violationing the standard because it doesn't
consider the mux state at all.  It is harmless for almost all cases
though, because ports move quickly to DISTRIBUTING or no link at
all, which disables the port.

fbl

^ permalink raw reply

* [PATCH net-next] e1000e: fix skb truesize underestimation
From: Eric Dumazet @ 2011-10-13 18:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jeff Kirsher

e1000e allocates a page per skb fragment. We must account
PAGE_SIZE increments on skb->truesize, not the actual frag length.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000e/netdev.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 78c5d21..035ce73 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1300,7 +1300,7 @@ static bool e1000_clean_rx_irq_ps(struct e1000_adapter *adapter,
 			ps_page->page = NULL;
 			skb->len += length;
 			skb->data_len += length;
-			skb->truesize += length;
+			skb->truesize += PAGE_SIZE;
 		}
 
 		/* strip the ethernet crc, problem is we're using pages now so
@@ -1360,7 +1360,7 @@ static void e1000_consume_page(struct e1000_buffer *bi, struct sk_buff *skb,
 	bi->page = NULL;
 	skb->len += length;
 	skb->data_len += length;
-	skb->truesize += length;
+	skb->truesize += PAGE_SIZE;
 }
 
 /**

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox