Netdev List
* Re: Linux Route Cache performance tests
From: Paweł Staszewski @ 2011-11-07  8:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list
In-Reply-To: <1320620915.6506.44.camel@edumazet-laptop>

On 2011-11-07 00:08, Eric Dumazet wrote:
> On Sunday 06 November 2011 at 22:57 +0100, Paweł Staszewski wrote:
>> On 2011-11-06 22:26, Eric Dumazet wrote:
>>> On Sunday 06 November 2011 at 21:25 +0100, Paweł Staszewski wrote:
>>>> Yes, I think there is a small problem with this on kernel 3.1, because
>>>> dmesg | egrep  '(rhash)|(route)'
>>>> [    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
>>>> [    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
>>>> [    4.697294] IP route cache hash table entries: 524288 (order: 10,
>>>> 4194304 bytes)
>>>>
>>>>
>>> Dont tell me you _still_ use a 32bit kernel ?
>> no it is 64bit :)
>> Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
>> :)
>>
>>> If so, you need to tweak alloc_large_system_hash() to use vmalloc,
>>> because you hit MAX_ORDER (10) page allocations.
>> funny then :)
>> Maybe I turned off too many kernel features
>>> But considering LOWMEM is about 700 Mbytes, you wont be able to create a
>>> lot of route cache entries.
>>>
>>> Come on, do us a favor, and enter new era of computing.
> OK, then your kernel is not CONFIG_NUMA enabled
>
> It seems strange given you probably have a NUMA machine (24 cpus)
Yes, NUMA was not enabled.
I have been running some tests with and without NUMA to compare ixgbe
performance when using the Node="" parameter of the ixgbe module.

> If so, your choices are :
>
> 1) enable CONFIG_NUMA. Really this is a must given the workload of your
> machine.
>
> 2) Or : you need to add "hashdist=1" on boot params
>     and patch your kernel with following patch :
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..07f86e0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5362,7 +5362,6 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
>
>   int hashdist = HASHDIST_DEFAULT;
>
> -#ifdef CONFIG_NUMA
>   static int __init set_hashdist(char *str)
>   {
>   	if (!str)
> @@ -5371,7 +5370,6 @@ static int __init set_hashdist(char *str)
>   	return 1;
>   }
>   __setup("hashdist=", set_hashdist);
> -#endif
>
>   /*
>    * allocate a large system hash table from bootmem
>
Yes, after enabling NUMA I can change rhash_entries at kernel boot.

The most important setting for a big route cache is rhash_entries:
if the route cache size exceeds the hash size, performance drops 6x to 8x.
So the best settings for the route cache are:
rhash_entries = gc_thresh = max_size

Eric, can you tell me what the plans are for removing the route cache
from the kernel? As you can see, performance is better with the route cache.
Without the route cache, performance is not as good, but it is stable in
all situations, even a DDoS with 10 million (10kk) random IPs.

So in the future, do we need to prepare for lower kernel IP forwarding
performance because there is no route cache?
Or will removing the route cache save some time in IP stack processing?


Thanks
Pawel
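
The numbers in the dmesg output quoted above are consistent with a
page-allocator cap on the table size, and the arithmetic can be sketched
as follows. This is only an illustration, assuming 4096-byte pages and
8 bytes per hash bucket (4194304 bytes / 524288 entries in the dmesg
line); the function names are hypothetical and are not kernel APIs.

```c
#include <stdint.h>

/* Assumed values: 4 KiB pages, and 8 bytes per bucket as implied by the
 * dmesg line (4194304 bytes / 524288 entries). */
#define SKETCH_PAGE_SIZE   4096ULL
#define SKETCH_BUCKET_SIZE 8ULL

/* Largest number of buckets that fits in one allocation of 2^order pages. */
static uint64_t sketch_max_entries(unsigned int order)
{
	return (SKETCH_PAGE_SIZE << order) / SKETCH_BUCKET_SIZE;
}

/* Smallest page order whose allocation can hold the given bucket count. */
static unsigned int sketch_order_for_entries(uint64_t entries)
{
	uint64_t bytes = entries * SKETCH_BUCKET_SIZE;
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}
```

Under these assumptions sketch_max_entries(10) gives 524288, matching the
dmesg line, while rhash_entries=2097152 would need an order-12 allocation
(16 MiB), which is more than an order-10 page allocation can provide.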



^ permalink raw reply

* Re: [GIT PULL nf-next] IPVS
From: Pablo Neira Ayuso @ 2011-11-07  8:29 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Krzysztof Wilczynski
In-Reply-To: <20111107030659.GA24405@verge.net.au>

Hi Simon,

On Mon, Nov 07, 2011 at 12:07:01PM +0900, Simon Horman wrote:
> Hi Pablo,
> 
> I am a little confused. The nf-next branch seems to have disappeared.
> 
> Could you consider pulling git://github.com/horms/ipvs-next.git master
> to get the following changes that were in your nf-next branch.

I was late to get it into net-next. Since net-next became net after
the 3.1 release, I moved those changes to net to get them into 3.2
once Linus announced that the merge window was open again.

> Or would
> you like me to rebase the ipvs patches (9 of the 11 changes below) on
> top of git://1984.lsi.us.es/net-next/.git master ?

They are already in davem's net tree and will be included in the
upcoming 3.2 release.

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fdavem%2Fnet.git&a=search&h=HEAD&st=commit&s=Neira

^ permalink raw reply

* [PATCH 3/3] wanrouter: Remove kernel_lock annotations
From: Richard Weinberger @ 2011-11-07  8:24 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-kernel, Richard Weinberger

The BKL is gone, these annotations are useless.

Signed-off-by: Richard Weinberger <richard@nod.at>
---
 net/wanrouter/wanproc.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/net/wanrouter/wanproc.c b/net/wanrouter/wanproc.c
index f346395..c43612e 100644
--- a/net/wanrouter/wanproc.c
+++ b/net/wanrouter/wanproc.c
@@ -81,7 +81,6 @@ static struct proc_dir_entry *proc_router;
  *	Iterator
  */
 static void *r_start(struct seq_file *m, loff_t *pos)
-	__acquires(kernel_lock)
 {
 	struct wan_device *wandev;
 	loff_t l = *pos;
@@ -103,7 +102,6 @@ static void *r_next(struct seq_file *m, void *v, loff_t *pos)
 }
 
 static void r_stop(struct seq_file *m, void *v)
-	__releases(kernel_lock)
 {
 	mutex_unlock(&config_mutex);
 }
-- 
1.7.7

^ permalink raw reply related

* Re: net-next tree question: time to submit new features
From: Eric Dumazet @ 2011-11-07  8:22 UTC (permalink / raw)
  To: Alexander Smirnov; +Cc: open list:NETWORKING [GENERAL]
In-Reply-To: <CAJmB2rAHiFopSxBVco6=7fgPi3XezSncrYw-NVtxq1otcMSTqw@mail.gmail.com>

On Monday 07 November 2011 at 11:11 +0300, Alexander Smirnov wrote:
> Hello everybody,
> 
> last week I sent to the list patch series for 6LoWPAN, but the reply
> was that it wasn't proper time for submitting new features right in
> the middle of merge window.
> 
> Could anyone please specify when the net-next tree opens back and the
> 'proper' time comes?
> 

It reopens after Linus announces linux-3.x.0-rc1 

Right now, we are in the 'merge window'.

^ permalink raw reply

* net-next tree question: time to submit new features
From: Alexander Smirnov @ 2011-11-07  8:11 UTC (permalink / raw)
  To: open list:NETWORKING [GENERAL]

Hello everybody,

last week I sent to the list patch series for 6LoWPAN, but the reply
was that it wasn't proper time for submitting new features right in
the middle of merge window.

Could anyone please specify when the net-next tree opens back and the
'proper' time comes?

With best regards,
Alex

^ permalink raw reply

* [PATCH] MAINTAINERS/rds: update maintainer
From: Or Gerlitz @ 2011-11-07  7:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, Venkat Venkatsubra, Joe Perches

update for the actual maintainer

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>

---
This was previously sent by Joe Perches but somehow missed upstream

 MAINTAINERS |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/MAINTAINERS
===================================================================
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5470,7 +5470,7 @@ S:	Maintained
 F:	drivers/net/ethernet/rdc/r6040.c

 RDS - RELIABLE DATAGRAM SOCKETS
-M:	Andy Grover <andy.grover@oracle.com>
+M:	Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
 L:	rds-devel@oss.oracle.com (moderated for non-subscribers)
 S:	Supported
 F:	net/rds/

^ permalink raw reply

* Re: [PATCH v3 0/3] SUNRPC: rcbind clients virtualization
From: Stanislav Kinsbursky @ 2011-11-07  8:02 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond.Myklebust@netapp.com, linux-nfs@vger.kernel.org,
	Pavel Emelianov, neilb@suse.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	devel@openvz.org
In-Reply-To: <20111104221045.GL721@fieldses.org>

On 05.11.2011 02:10, J. Bruce Fields wrote:
>> BTW, Bruce, please, have a brief look at my e-mail to
>> linux-nfs@vger.kernel.org named "SUNRPC: non-exclusive pipe
>> creation".
>> I've done a lot in "RPC pipefs per net ns" task, and going to send
>> first patches soon. But right now I'm really confused will this
>> non-exclusive pipes creation and almost ready so remove this
>> functionality. But I'm afraid, that I've missed something. Would be
>> greatly appreciate for your opinion about my question.
>
> Sorry for the delay--it looks reasonable to me on a quick skim, but I'm
> assuming it's Trond that will need to review this.
>

Thanks for your review.
And, yep, still waiting for Trond's answer too...

-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply

* Re: Regarding Routing table in Linux kernel
From: Eric Dumazet @ 2011-11-07  6:21 UTC (permalink / raw)
  To: Ajith Adapa; +Cc: netdev
In-Reply-To: <CADAe=+KTYAGT4N9R2eTHstbBP89PJPD57jYNp6diGocnoYkaFw@mail.gmail.com>

On Monday 07 November 2011 at 09:52 +0530, Ajith Adapa wrote:

> I will check it with the latest kernel. Actually I am just checking in
> a 2.6.18 kernel.
> 

2.6.18 is very old in this respect. A lot of things have changed in
route cache handling in recent kernels. We no longer have long pauses
caused by garbage collector runs.

^ permalink raw reply

* Re: linux-next: build failure after merge of the origin tree
From: Kirsher, Jeffrey T @ 2011-11-07  5:29 UTC (permalink / raw)
  To: David Miller
  Cc: sfr@canb.auug.org.au, torvalds@linux-foundation.org,
	linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	Rose, Gregory V, netdev@vger.kernel.org
In-Reply-To: <20111106.223619.1348742215191441592.davem@davemloft.net>



Cheers,
Jeff

On Nov 6, 2011, at 19:38, "David Miller" <davem@davemloft.net> wrote:

> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Mon, 7 Nov 2011 13:47:06 +1100
> 
>>> If you just revert the commit in origin from -next, then you will get
>>> conflicts when you pull the net.git tree in.
>> 
>> I got no conflicts when I merged in the net tree and can see no fix for
>> this problem in the net tree.  My current head of the net tree is 1a6422f
>> "etherh: Add MAINTAINERS entry for etherh".
> 
> Ok, Jeff please take a look at this and send me a fix soon.
> 
> Thanks.

OK Dave, at this point I am putting together a patch to revert this fix,
since it appears to bring more trouble with it. I will take a quick look
at it before sending out a patch to fix the issue.

^ permalink raw reply

* Re: [PATCH] data: hello
From: Srivatsa S. Bhat @ 2011-11-07  5:16 UTC (permalink / raw)
  To: Feng King; +Cc: netdev, davem, linux-kernel
In-Reply-To: <1320502893-5136-1-git-send-email-kinwin2008@gmail.com>

On 11/05/2011 07:51 PM, Feng King wrote:
>  great
> 
> 
> Signed-off-by: Feng King <kinwin2008@gmail.com>
> ---
>  b |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>  create mode 100644 b
> 
> diff --git a/b b/b
> new file mode 100644
> index 0000000..a60a1f3
> --- /dev/null
> +++ b/b
> @@ -0,0 +1 @@
> +hell

Hi Feng and other potential contributors,

Usually kernel developers simply ignore spam or filter it out. But since
you also sent a patch after this mail, I am replying to this.

Please don't spam LKML with "test" emails like this. Already LKML receives
far more _useful_ emails than what one can possibly read, so adding junk
like this to that pile only annoys people. And please bear in mind that it
also gives a bad impression about the sender and may also tempt people to
ignore your patches on purpose, as a natural human tendency.

If you have genuine contributions to make, like patches or bug reports or
even want to ask questions, please feel free to do so directly. That's what
this mailing list is for, after all.


Thanks,
Srivatsa S. Bhat

^ permalink raw reply

* Re: Regarding Routing table in Linux kernel
From: Ajith Adapa @ 2011-11-07  4:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1320601372.6506.11.camel@edumazet-laptop>

Thanks for the reply

>> 1. Why does the linux kernel use hashing table for storing routing
>> entries when there is patricia trie or radix tree which is more faster
>> than Hash table ?
>
> I think you are mistaken. Routes are stored in a trie in recent kernels.
>
> And route cache is scheduled to be removed at some point.
>
> In normal situation, one hash lookup is the faster way to find a random
> item, its a single memory cache line cost.

I will check it with the latest kernel. Actually I am just checking in
a 2.6.18 kernel.

>> 2. Is there any way we can test the performance of an routing
>> algorithm before deploying in a real time scenario to check its
>> performance ? I would like to test one of my implementations to check
>> if there are any performance gains or not ?
>
> Sorry there is no general answer. It all depends on your needs.

Hmm... It seems there is no simple way to do this other than trying it
out in a real-time scenario :(

^ permalink raw reply

* Re: [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw
From: Jan Engelhardt @ 2011-11-07  3:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Hans Schillstrom, kaber, netfilter-devel, netdev, hans
In-Reply-To: <20111107005237.GA29665@1984>


On Monday 2011-11-07 01:52, Pablo Neira Ayuso wrote:
>> +static __u32 get_hash(struct sk_buff *skb, struct xt_hmark_info *info)
>> +{
>> +	int nhoff, hash = 0, poff, proto, frag = 0;
>> +	struct iphdr *ip;
>> +	u8 ip_proto;
>> +	u32 addr1, addr2, ihl;
>> +	u16 snatport = 0, dnatport = 0;
>> +	union {
>> +		u32 v32;
>> +		u16 v16[2];
>> +	} ports;
>> +
>> +	nhoff = skb_network_offset(skb);
>> +	proto = skb->protocol;
>> +
>> +	if (!proto && skb->sk) {
>> +		if (skb->sk->sk_family == AF_INET)
>> +			proto = __constant_htons(ETH_P_IP);
>> +		else if (skb->sk->sk_family == AF_INET6)
>> +			proto = __constant_htons(ETH_P_IPV6);
>
>You already have the layer3 protocol number in xt_action_param. No
>need to use the socket information then.

xt_action_param.family (NFPROTO_) is not the same class as AF_ or ETH_.
Though, wouldn't proto = skb->protocol; just be simpler here?
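
To illustrate why the three constant classes are not interchangeable, here
is a minimal userspace sketch that maps a netfilter family value to an
EtherType in network byte order. The SK_ constants are local stand-ins
(their values are meant to mirror NFPROTO_IPV4/NFPROTO_IPV6 and
ETH_P_IP/ETH_P_IPV6), and sk_family_to_ethertype() is a hypothetical
helper, not something from the patch under review.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Local stand-ins for the netfilter family numbers (NFPROTO_*). */
enum { SK_NFPROTO_IPV4 = 2, SK_NFPROTO_IPV6 = 10 };

/* Local stand-ins for the EtherType values (ETH_P_*), host byte order. */
enum { SK_ETH_P_IP = 0x0800, SK_ETH_P_IPV6 = 0x86DD };

/*
 * Translate a family number into the EtherType that would be stored in
 * a packet's protocol field (network byte order).
 * Returns 0 for families this sketch does not handle.
 */
static uint16_t sk_family_to_ethertype(uint8_t family)
{
	switch (family) {
	case SK_NFPROTO_IPV4:
		return htons(SK_ETH_P_IP);
	case SK_NFPROTO_IPV6:
		return htons(SK_ETH_P_IPV6);
	default:
		return 0;
	}
}
```

The point of the sketch is only that a translation step is needed: a
family number such as 2 can never be compared directly against a
network-order EtherType.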

^ permalink raw reply

* Re: linux-next: build failure after merge of the origin tree
From: David Miller @ 2011-11-07  3:36 UTC (permalink / raw)
  To: sfr
  Cc: torvalds, linux-next, linux-kernel, gregory.v.rose,
	jeffrey.t.kirsher, netdev
In-Reply-To: <20111107134706.3e7fc633f913b5e155153127@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 7 Nov 2011 13:47:06 +1100

>> If you just revert the commit in origin from -next, then you will get
>> conflicts when you pull the net.git tree in.
> 
> I got no conflicts when I merged in the net tree and can see no fix for
> this problem in the net tree.  My current head of the net tree is 1a6422f
> "etherh: Add MAINTAINERS entry for etherh".

Ok, Jeff please take a look at this and send me a fix soon.

Thanks.

^ permalink raw reply

* Re: [GIT PULL nf-next] IPVS
From: Simon Horman @ 2011-11-07  3:07 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Krzysztof Wilczynski
In-Reply-To: <20111021072715.GA842@1984>

On Fri, Oct 21, 2011 at 09:27:15AM +0200, Pablo Neira Ayuso wrote:
> On Fri, Oct 21, 2011 at 10:33:01AM +0900, Simon Horman wrote:
> > Hi Pablo,
> > 
> > please consider pulling the following to get compile fix
> > and cleanup patches from Krzysztof Wilczynski.
> > 
> > The following changes since commit 2ca5b853f1dd81c605ddc8a55e06bdad85636597:
> > 
> >   netfilter: export NAT definitions through linux/netfilter_ipv4/nf_nat.h (2011-10-11 03:32:34 +0200)
> > 
> > are available in the git repository at:
> >   git://github.com/horms/ipvs-next.git master
> 
> Pulled, thanks.
> 
> http://1984.lsi.us.es/git/?p=net-next/.git;a=shortlog;h=refs/heads/nf-next
> 
> > Krzysztof Wilczynski (2):
> >       ipvs: Remove unused variable "cs" from ip_vs_leave function.
> >       ipvs: Fix compilation error in ip_vs.h for ip_vs_confirm_conntrack function.
> 
> Strange, I have all IPVS configs enabled here and I didn't hit this error.

Hi Pablo,

I am a little confused. The nf-next branch seems to have disappeared.

Could you consider pulling git://github.com/horms/ipvs-next.git master
to get the following changes that were in your nf-next branch. Or would
you like me to rebase the ipvs patches (9 of the 11 changes below) on
top of git://1984.lsi.us.es/net-next/.git master ?

------

The following changes since commit a9e9fd7182332d0cf5f3e601df3e71dd431b70d7:

  skge: handle irq better on single port card (2011-09-27 13:41:37 -0400)

are available in the git repository at:
  git://github.com/horms/ipvs-next.git master

Joe Perches (1):
      netfilter: Remove unnecessary OOM logging messages

Krzysztof Wilczynski (3):
      ipvs: Expose ip_vs_ftp module parameters via sysfs.
      ipvs: Remove unused variable "cs" from ip_vs_leave function.
      ipvs: Fix compilation error in ip_vs.h for ip_vs_confirm_conntrack function.

Pablo Neira Ayuso (1):
      netfilter: export NAT definitions through linux/netfilter_ipv4/nf_nat.h

Simon Horman (6):
      ipvs: Add documentation for new sysctl entries
      ipvs: Remove unused parameter from ip_vs_confirm_conntrack()
      ipvs: Remove unused return value of protocol state transitions
      ipvs: Removed unused variables
      ipvs: secure_tcp does provide alternate state timeouts
      ipvs: Enhance grammar used to refer to Kconfig options

 Documentation/networking/ipvs-sysctl.txt   |   62 ++++++++++++++++++++++++---
 include/linux/netfilter_ipv4/Kbuild        |    1 +
 include/linux/netfilter_ipv4/nf_nat.h      |   58 ++++++++++++++++++++++++++
 include/net/ip_vs.h                        |   11 ++---
 include/net/netfilter/nf_conntrack_tuple.h |   27 +------------
 include/net/netfilter/nf_nat.h             |   26 +-----------
 net/bridge/netfilter/ebt_ulog.c            |    7 +--
 net/ipv4/netfilter/ipt_CLUSTERIP.c         |    1 -
 net/ipv4/netfilter/ipt_ULOG.c              |    4 +-
 net/ipv4/netfilter/nf_nat_snmp_basic.c     |   22 +---------
 net/ipv6/netfilter/nf_conntrack_reasm.c    |    7 +--
 net/netfilter/ipset/ip_set_core.c          |    4 +-
 net/netfilter/ipvs/ip_vs_core.c            |   20 ++++-----
 net/netfilter/ipvs/ip_vs_ctl.c             |   22 +++-------
 net/netfilter/ipvs/ip_vs_dh.c              |    5 +-
 net/netfilter/ipvs/ip_vs_ftp.c             |    5 +-
 net/netfilter/ipvs/ip_vs_lblc.c            |    9 +---
 net/netfilter/ipvs/ip_vs_lblcr.c           |   13 ++----
 net/netfilter/ipvs/ip_vs_nfct.c            |    2 +-
 net/netfilter/ipvs/ip_vs_proto.c           |    5 +-
 net/netfilter/ipvs/ip_vs_proto_sctp.c      |   14 ++----
 net/netfilter/ipvs/ip_vs_proto_tcp.c       |    6 +--
 net/netfilter/ipvs/ip_vs_proto_udp.c       |    5 +-
 net/netfilter/ipvs/ip_vs_sh.c              |    5 +-
 net/netfilter/ipvs/ip_vs_wrr.c             |    5 +-
 net/netfilter/ipvs/ip_vs_xmit.c            |    2 +-
 net/netfilter/nf_conntrack_core.c          |    5 +--
 net/netfilter/nfnetlink_log.c              |    7 +--
 net/netfilter/xt_IDLETIMER.c               |    2 -
 net/netfilter/xt_hashlimit.c               |    5 +--
 30 files changed, 178 insertions(+), 189 deletions(-)
 create mode 100644 include/linux/netfilter_ipv4/nf_nat.h

^ permalink raw reply

* Re: linux-next: build failure after merge of the origin tree
From: Stephen Rothwell @ 2011-11-07  2:47 UTC (permalink / raw)
  To: David Miller
  Cc: torvalds, linux-next, linux-kernel, gregory.v.rose,
	jeffrey.t.kirsher, netdev
In-Reply-To: <20111106.205259.1237400984015921904.davem@davemloft.net>

Hi Dave,

On Sun, 06 Nov 2011 20:52:59 -0500 (EST) David Miller <davem@davemloft.net> wrote:
>
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Mon, 7 Nov 2011 10:12:02 +1100
> 
> > Starting with the origin tree, today's linux-next build (powerpc
> > ppc64_defconfig) failed like this:
>  ...
> > Caused by commit 9487dc844054 ("ixgbe: Fix compiler warnings") which hid
> > the declarations of ixgbe_disable_sriov() and ixgbe_check_vf_assignment()
> > when CONFIG_PCI_IOV is not defined.
> > 
> > I have reverted that commit for today.
> 
> It should be fixed in net.git, can you please check that the build
> succeeds after you pull it into -next?

I reverted the commit above ...

> If you just revert the commit in origin from -next, then you will get
> conflicts when you pull the net.git tree in.

I got no conflicts when I merged in the net tree and can see no fix for
this problem in the net tree.  My current head of the net tree is 1a6422f
"etherh: Add MAINTAINERS entry for etherh".

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


^ permalink raw reply

* Re: linux-next: build failure after merge of the origin tree
From: David Miller @ 2011-11-07  1:52 UTC (permalink / raw)
  To: sfr
  Cc: torvalds, linux-next, linux-kernel, gregory.v.rose,
	jeffrey.t.kirsher, netdev
In-Reply-To: <20111107101202.96fcb76d580e5265bd99aeee@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 7 Nov 2011 10:12:02 +1100

> Starting with the origin tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
 ...
> Caused by commit 9487dc844054 ("ixgbe: Fix compiler warnings") which hid
> the declarations of ixgbe_disable_sriov() and ixgbe_check_vf_assignment()
> when CONFIG_PCI_IOV is not defined.
> 
> I have reverted that commit for today.

It should be fixed in net.git, can you please check that the build
succeeds after you pull it into -next?

If you just revert the commit in origin from -next, then you will get
conflicts when you pull the net.git tree in.
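
A common way to avoid this kind of breakage is to keep the function
visible in both configurations and provide a no-op inline stub when the
option is off. Below is a minimal sketch of that pattern. All names
(DEMO_PCI_IOV, demo_adapter, demo_disable_sriov) are hypothetical; this
is neither the real ixgbe code nor the fix that was eventually applied.

```c
struct demo_adapter {
	int num_vfs;	/* number of virtual functions currently enabled */
};

#ifdef DEMO_PCI_IOV
/* Option enabled: the real implementation tears down the VFs. */
static int demo_disable_sriov(struct demo_adapter *adapter)
{
	int had_vfs = adapter->num_vfs;

	adapter->num_vfs = 0;
	return had_vfs;
}
#else
/* Option disabled: a stub, so callers still compile and link. */
static inline int demo_disable_sriov(struct demo_adapter *adapter)
{
	(void)adapter;
	return 0;
}
#endif

/* A caller that must build regardless of the configuration. */
static int demo_remove(struct demo_adapter *adapter)
{
	return demo_disable_sriov(adapter);
}
```

With this layout demo_remove() never needs its own #ifdef, which is the
property that was lost when the declarations were hidden.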

^ permalink raw reply

* Re: [v2 PATCH 2/2] NETFILTER userspace part for target HMARK
From: Pablo Neira Ayuso @ 2011-11-07  0:55 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: kaber, jengelh, netfilter-devel, netdev, hans
In-Reply-To: <1317664003-28189-3-git-send-email-hans.schillstrom@ericsson.com>

On Mon, Oct 03, 2011 at 07:46:43PM +0200, Hans Schillstrom wrote:
> The target allows you to create rules in the "raw" and "mangle" tables
> which alter the netfilter mark (nfmark) field within a given range.
> First a 32 bit hash value is generated then modulus by <limit> and
> finally an offset is added before it's written to nfmark.
> Prior to routing, the nfmark can influence the routing method (see
> "Use netfilter MARK value as routing key") and can also be used by
> other subsystems to change their behaviour.
> 
> The mark match can also be used to match nfmark produced by this module.
> 
> Ver 2
>   IPv4 NAT added
>   iptables ver 1.4.12.1 adaptions.
> 
> Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
> ---
>  extensions/libxt_HMARK.c           |  381 ++++++++++++++++++++++++++++++++++++
>  extensions/libxt_HMARK.man         |   66 ++++++
>  include/linux/netfilter/xt_hmark.h |   48 +++++
>  3 files changed, 495 insertions(+), 0 deletions(-)
>  create mode 100644 extensions/libxt_HMARK.c
>  create mode 100644 extensions/libxt_HMARK.man
>  create mode 100644 include/linux/netfilter/xt_hmark.h
> 
> diff --git a/extensions/libxt_HMARK.c b/extensions/libxt_HMARK.c
> new file mode 100644
> index 0000000..0def034
> --- /dev/null
> +++ b/extensions/libxt_HMARK.c
> @@ -0,0 +1,381 @@
> +/*
> + * Shared library add-on to iptables to add HMARK target support.
> + *
> + * The kernel module calculates a hash value that can be modified by modulus
> + * and an offset. The hash value is based on a direction independent
> + * five tuple: src & dst addr src & dst ports and protocol.
> + * However src & dst port can be masked and are not used for fragmented
> + * packets, ESP and AH don't have ports so SPI will be used instead.
> + * For ICMP error messages the hash mark values will be calculated on
> + * the source packet i.e. the packet caused the error (If sufficient
> + * amount of data exists).
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <string.h>
> +#include <stdlib.h>
> +#include <getopt.h>
> +
> +#include <xtables.h>
> +#include <linux/netfilter/x_tables.h>
> +#include <linux/netfilter/xt_hmark.h>
> +
> +
> +#define DEF_HRAND 0xc175a3b8	/* Default "random" value to jhash */
> +
> +static void HMARK_help(void)
> +{
> +	printf(
> +"HMARK target options, i.e. modify hash calculation by:\n"
> +"  --hmark-smask value                Mask source address with value\n"
> +"  --hmark-dmask value                Mask Dest. address with value\n"
> +"  --hmark-sp-mask value              Mask src port with value\n"
> +"  --hmark-dp-mask value              Mask dst port with value\n"
> +"  --hmark-spi-mask value             For esp and ah AND spi with value\n"
> +"  --hmark-sp-set value               OR src port with value\n"
> +"  --hmark-dp-set value               OR dst port with value\n"
> +"  --hmark-spi-set value              For esp and ah OR spi with value\n"
> +"  --hmark-proto-mask value           Mask Protocol with value\n"
> +"  --hmark-rnd                        Random value to hash cacl.\n"
> +"  Limit/modify the calculated hash mark by:\n"
> +"  --hmark-mod value                  nfmark modulus value\n"
> +"  --hmark-offs value                 Last action add value to nfmark\n"
> +" For NAT in IPv4 the original address can be used in the return path.\n"
> +" Make sure to qualify the statement in a proper way when using nat flags\n"
> +"  --hmark-dnat                       Replace src addr/port with original dst addr/port\n"
> +"  --hmark-snat                       Replace dst addr/port with original src addr/port\n"
> +" In many cases hmark can be omitted i.e. --smask can be used\n");
> +}
> +
> +static const struct option HMARK_opts[] = {
> +	{ "hmark-smask", 1, NULL, XT_HMARK_SADR_AND },
> +	{ "hmark-dmask", 1, NULL, XT_HMARK_DADR_AND },
> +	{ "hmark-sp-mask", 1, NULL, XT_HMARK_SPORT_AND },
> +	{ "hmark-dp-mask", 1, NULL, XT_HMARK_DPORT_AND },
> +	{ "hmark-spi-mask", 1, NULL, XT_HMARK_SPI_AND },
> +	{ "hmark-sp-set", 1, NULL, XT_HMARK_SPORT_OR },
> +	{ "hmark-dp-set", 1, NULL, XT_HMARK_DPORT_OR },
> +	{ "hmark-spi-set", 1, NULL, XT_HMARK_SPI_OR },
> +	{ "hmark-proto-mask", 1, NULL, XT_HMARK_PROTO_AND },
> +	{ "hmark-rnd", 1, NULL, XT_HMARK_RND },
> +	{ "hmark-mod", 1, NULL, XT_HMARK_MODULUS },
> +	{ "hmark-offs", 1, NULL, XT_HMARK_OFFSET },
> +	{ "hmark-dnat", 1, NULL, XT_HMARK_USE_DNAT },
> +	{ "hmark-snat", 1, NULL, XT_HMARK_USE_SNAT },
> +	{ "smask", 1, NULL, XT_HMARK_SADR_AND },
> +	{ "dmask", 1, NULL, XT_HMARK_DADR_AND },
> +	{ "sp-mask", 1, NULL, XT_HMARK_SPORT_AND },
> +	{ "dp-mask", 1, NULL, XT_HMARK_DPORT_AND },
> +	{ "spi-mask", 1, NULL, XT_HMARK_SPI_AND },
> +	{ "sp-set", 1, NULL, XT_HMARK_SPORT_OR },
> +	{ "dp-set", 1, NULL, XT_HMARK_DPORT_OR },
> +	{ "spi-set", 1, NULL, XT_HMARK_SPI_OR },
> +	{ "proto-mask", 1, NULL, XT_HMARK_PROTO_AND },
> +	{ "rnd", 1, NULL, XT_HMARK_RND },
> +	{ "mod", 1, NULL, XT_HMARK_MODULUS },
> +	{ "offs", 1, NULL, XT_HMARK_OFFSET },
> +	{ "dnat", 1, NULL, XT_HMARK_USE_DNAT },
> +	{ "snat", 1, NULL, XT_HMARK_USE_SNAT },
> +	{ .name = NULL }
> +};
> +
> +static int
> +HMARK_parse(int c, char **argv, int invert, unsigned int *flags,
> +	    const void *entry, struct xt_entry_target **target)
> +{
> +	struct xt_hmark_info *hmarkinfo
> +		= (struct xt_hmark_info *)(*target)->data;
> +	unsigned int value = 0xffffffff;
> +	unsigned int maxint = UINT32_MAX;
> +
> +	if ((c < XT_HMARK_SADR_AND) || (c > XT_HMARK_OFFSET)) {
> +		xtables_error(PARAMETER_PROBLEM, "Bad HMARK option \"%s\"",
> +			      optarg);
> +		return 0;
> +	}
> +
> +	if (c >= XT_HMARK_SPORT_AND && c <= XT_HMARK_DPORT_OR)
> +		maxint = UINT16_MAX;
> +	else if (c == XT_HMARK_PROTO_AND)
> +		maxint = UINT8_MAX;
> +
> +	if (!xtables_strtoui(optarg, NULL, &value, 0, maxint))
> +		xtables_error(PARAMETER_PROBLEM, "Bad HMARK value \"%s\"",
> +			      optarg);
> +
> +	if (*flags == 0) {
> +		memset(hmarkinfo, 0xff, sizeof(struct xt_hmark_info));
> +		hmarkinfo->pset.v32 = 0;
> +		hmarkinfo->flags = 0;
> +		hmarkinfo->spiset = 0;
> +		hmarkinfo->hoffs = 0;
> +		hmarkinfo->hashrnd = DEF_HRAND;
> +	}
> +	switch (c) {
> +	case XT_HMARK_SADR_AND:
> +		if (*flags & (1 << c)) {
> +			xtables_error(PARAMETER_PROBLEM,
> +				      "Can only specify "
> +				      "`--hmark-smask' once");
> +		}
> +		hmarkinfo->smask = htonl(value);
> +		if (value == maxint)
> +			c = 0;
> +		break;

Please check the current iptables git tree. Jan implemented a more
advanced method to handle options; for instance, have a look at
libxt_cluster.c.
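
For readers following the quoted description, the mark computation (hash,
then modulus, then offset) can be sketched in plain C. This is an
illustration only: sketch_mix() is a toy mixing function standing in for
the kernel's jhash, sorting the address and port pairs is just one simple
way to get a direction-independent value, and treating a zero modulus as
"no modulus" is an assumption rather than what xt_hmark necessarily does.

```c
#include <stdint.h>

/* Toy mixing step; a stand-in for jhash, not the real thing. */
static uint32_t sketch_mix(uint32_t h, uint32_t v)
{
	h ^= v;
	h *= 0x9e3779b1u;	/* odd multiplicative constant */
	return h ^ (h >> 15);
}

/*
 * Direction-independent hash: order each pair before hashing so that
 * swapping source and destination yields the same value.
 */
static uint32_t sketch_tuple_hash(uint32_t saddr, uint32_t daddr,
				  uint16_t sport, uint16_t dport,
				  uint8_t proto, uint32_t rnd)
{
	uint32_t a_lo = saddr < daddr ? saddr : daddr;
	uint32_t a_hi = saddr < daddr ? daddr : saddr;
	uint16_t p_lo = sport < dport ? sport : dport;
	uint16_t p_hi = sport < dport ? dport : sport;
	uint32_t h = rnd;

	h = sketch_mix(h, a_lo);
	h = sketch_mix(h, a_hi);
	h = sketch_mix(h, ((uint32_t)p_hi << 16) | p_lo);
	h = sketch_mix(h, proto);
	return h;
}

/* Final mark: (hash modulo hmod) plus hoffs; hmod == 0 skips the modulus. */
static uint32_t sketch_hmark(uint32_t hash, uint32_t hmod, uint32_t hoffs)
{
	if (hmod)
		hash %= hmod;
	return hash + hoffs;
}
```

With hmod = 4 and hoffs = 100, every flow lands on a mark between 100 and
103, and both directions of the same flow get the same mark.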

^ permalink raw reply

* Re: [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw
From: Pablo Neira Ayuso @ 2011-11-07  0:52 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: kaber, jengelh, netfilter-devel, netdev, hans
In-Reply-To: <1317664003-28189-2-git-send-email-hans.schillstrom@ericsson.com>

Hi Hans,

On Mon, Oct 03, 2011 at 07:46:42PM +0200, Hans Schillstrom wrote:
> diff --git a/include/linux/netfilter/xt_hmark.h b/include/linux/netfilter/xt_hmark.h
> new file mode 100644
> index 0000000..6c1436a
> --- /dev/null
> +++ b/include/linux/netfilter/xt_hmark.h
> @@ -0,0 +1,48 @@
> +#ifndef XT_HMARK_H_
> +#define XT_HMARK_H_
> +
> +#include <linux/types.h>
> +
> +/*
> + * Flags must not start at 0, since it's used as none.
> + */
> +enum {
> +	XT_HMARK_SADR_AND = 1,	/* SNAT & DNAT are used by the kernel module */
> +	XT_HMARK_DADR_AND,
> +	XT_HMARK_SPI_AND,
> +	XT_HMARK_SPI_OR,
> +	XT_HMARK_SPORT_AND,
> +	XT_HMARK_DPORT_AND,
> +	XT_HMARK_SPORT_OR,
> +	XT_HMARK_DPORT_OR,
> +	XT_HMARK_PROTO_AND,
> +	XT_HMARK_RND,
> +	XT_HMARK_MODULUS,
> +	XT_HMARK_OFFSET,
> +	XT_HMARK_USE_SNAT,
> +	XT_HMARK_USE_DNAT,
> +};
> +
> +union ports {
> +	struct {
> +		__u16	src;
> +		__u16	dst;
> +	} p16;
> +	__u32	v32;
> +};
> +
> +struct xt_hmark_info {
> +	__u32		smask;		/* Source address mask */
> +	__u32		dmask;		/* Dest address mask */
> +	union ports	pmask;
> +	union ports	pset;
> +	__u32		spimask;
> +	__u32		spiset;
> +	__u16		flags;		/* Print out only */
> +	__u16		prmask;		/* L4 Proto mask */
> +	__u32		hashrnd;
> +	__u32		hmod;		/* Modulus */
> +	__u32		hoffs;		/* Offset */
> +};
> +
> +#endif /* XT_HMARK_H_ */
> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
> index 32bff6d..3abd3a4 100644
> --- a/net/netfilter/Kconfig
> +++ b/net/netfilter/Kconfig
> @@ -483,6 +483,23 @@ config NETFILTER_XT_TARGET_IDLETIMER
>  
>  	  To compile it as a module, choose M here.  If unsure, say N.
>  
> +config NETFILTER_XT_TARGET_HMARK

New config options have to go in alphabetical order (this one should go
after NETFILTER_XT_TARGET_HL).

> +	tristate '"HMARK" target support'
> +	depends on NETFILTER_ADVANCED
> +	---help---
> +	This option adds the "HMARK" target.
> +
> +	The target allows you to create rules in the "raw" and "mangle" tables
> +	which alter the netfilter mark (nfmark) field within a given range.
> +	First a 32 bit hash value is generated then modulus by <limit> and
> +	finally an offset is added before it's written to nfmark.
> +
> +	Prior to routing, the nfmark can influence the routing method (see
> +	"Use netfilter MARK value as routing key") and can also be used by
> +	other subsystems to change their behavior.
> +
> +	The mark match can also be used to match nfmark produced by this module.
> +
>  config NETFILTER_XT_TARGET_LED
>  	tristate '"LED" target support'
>  	depends on LEDS_CLASS && LEDS_TRIGGERS
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index 1a02853..359eeb6 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -56,6 +56,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_CONNSECMARK) += xt_CONNSECMARK.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_CT) += xt_CT.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_DSCP) += xt_DSCP.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_HL) += xt_HL.o
> +obj-$(CONFIG_NETFILTER_XT_TARGET_HMARK) += xt_hmark.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_LED) += xt_LED.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_NFLOG) += xt_NFLOG.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_NFQUEUE) += xt_NFQUEUE.o
> diff --git a/net/netfilter/xt_hmark.c b/net/netfilter/xt_hmark.c
> new file mode 100644
> index 0000000..2f0aa93
> --- /dev/null
> +++ b/net/netfilter/xt_hmark.c
> @@ -0,0 +1,320 @@
> +/*
> + *	xt_hmark - Netfilter module to set mark as hash value
> + *
> + *	(C) 2010 Hans Schillstrom <hans.schillstrom@ericsson.com>
> + *
> + *	Description:
> + *	This module calculates a hash value that can be modified by modulus
> + *	and an offset. The hash value is based on a direction independent
> + *	five tuple: src & dst addr src & dst ports and protocol.
> + *	However src & dst port can be masked and are not used for fragmented
> + *	packets, ESP and AH don't have ports so SPI will be used instead.
> + *	For ICMP error messages the hash mark values will be calculated on
> + *	the source packet i.e. the packet caused the error (If sufficient
> + *	amount of data exists).
> + *
> + *	This program is free software; you can redistribute it and/or modify
> + *	it under the terms of the GNU General Public License version 2 as
> + *	published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <net/ip.h>
> +#include <linux/icmp.h>
> +
> +#include <linux/netfilter/xt_hmark.h>
> +#include <linux/netfilter/x_tables.h>
> +#include <net/netfilter/nf_nat.h>
> +
> +#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
> +#	define WITH_IPV6 1
> +#include <net/ipv6.h>
> +#include <linux/netfilter_ipv6/ip6_tables.h>
> +#endif
> +
> +

Extra blank line not required.

> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Hans Schillstrom <hans.schillstrom@ericsson.com>");
> +MODULE_DESCRIPTION("Xtables: packet range mark operations by hash value");
> +MODULE_ALIAS("ipt_HMARK");
> +MODULE_ALIAS("ip6t_HMARK");
> +
> +/*
> + * ICMP, get inner header so calc can be made on the source message
> + *       not the icmp header, i.e. same hash mark must be produced
> + *       on an icmp error message.
> + */
> +static int get_inner_hdr(struct sk_buff *skb, int iphsz, int nhoff)

This looks very similar to icmp_error in nf_conntrack_proto_icmp.c.
Yours lacks checksum validation, btw.

I'm trying to find some place where we can put this function to make
it available for both nf_conntrack_ipv4 and your module (to avoid code
redundancy), but I didn't find any so far.

It would be nice to find some way to avoid duplicating code with
similar functionality.

> +{
> +	const struct icmphdr *icmph;
> +	struct icmphdr _ih;
> +	struct iphdr *iph = NULL;
> +
> +	/* Not enough header? */
> +	icmph = skb_header_pointer(skb, nhoff + iphsz, sizeof(_ih), &_ih);
> +	if (icmph == NULL)
> +		goto out;
> +
> +	if (icmph->type > NR_ICMP_TYPES)
> +		goto out;
> +
> +

Extra blank line not required.

> +	/* Error message? */
> +	if (icmph->type != ICMP_DEST_UNREACH &&
> +	    icmph->type != ICMP_SOURCE_QUENCH &&
> +	    icmph->type != ICMP_TIME_EXCEEDED &&
> +	    icmph->type != ICMP_PARAMETERPROB &&
> +	    icmph->type != ICMP_REDIRECT)
> +		goto out;
> +	/* Checkin full IP header plus 8 bytes of protocol to
> +	 * avoid additional coding at protocol handlers.
> +	 */
> +	if (!pskb_may_pull(skb, nhoff + iphsz + sizeof(_ih) + 8))
> +		goto out;

We prefer skb_header_pointer instead. If conntrack is enabled, we can
benefit from defragmentation. Please replace all pskb_may_pull calls
with skb_header_pointer in this code.

We can assume that the IP header is linear (not fragmented).
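
As a rough illustration of the difference: skb_header_pointer returns a
pointer into the linear data when the requested bytes are already
contiguous, and otherwise copies them into a caller-supplied buffer. It
does not reshape the skb the way pskb_may_pull can. The following is a
userspace sketch of that pattern only. struct fake_buf and
fake_header_pointer() are hypothetical stand-ins, not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb: a linear head plus one fragment. */
struct fake_buf {
	const unsigned char *head;	/* linear part */
	size_t head_len;
	const unsigned char *frag;	/* non-linear remainder */
	size_t frag_len;
};

/*
 * Return a pointer to len bytes starting at offset. If they all sit in
 * the linear part, point straight at them; otherwise copy them into the
 * caller's scratch buffer. NULL means the packet is too short.
 * The buffer itself is never modified.
 */
static const void *fake_header_pointer(const struct fake_buf *b,
				       size_t offset, size_t len,
				       void *scratch)
{
	size_t total, i;
	unsigned char *out = scratch;

	assert(b != NULL && scratch != NULL);
	total = b->head_len + b->frag_len;

	if (offset > total || len > total - offset)
		return NULL;
	if (offset + len <= b->head_len)
		return b->head + offset;

	for (i = 0; i < len; i++) {
		size_t pos = offset + i;

		out[i] = pos < b->head_len ? b->head[pos]
					   : b->frag[pos - b->head_len];
	}
	return scratch;
}
```

The caller only ever reads through the returned pointer, so pointers
taken earlier into the linear data stay valid.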

> +	iph = (struct iphdr *)(skb->data + nhoff + iphsz + sizeof(_ih));
> +	return nhoff + iphsz + sizeof(_ih);
> +out:
> +	return nhoff;
> +}
> +/*
> + * ICMPv6
> + * Input nhoff Offset into network header
> + *       offset where ICMPv6 header starts
> + * Returns true if it's a icmp error and updates nhoff
> + */
> +#ifdef WITH_IPV6
> +static int get_inner6_hdr(struct sk_buff *skb, int *offset, int hdrlen)
> +{
> +	struct icmp6hdr *icmp6h;
> +	struct icmp6hdr _ih6;
> +
> +	icmp6h = skb_header_pointer(skb, *offset + hdrlen, sizeof(_ih6), &_ih6);
> +	if (icmp6h == NULL)
> +		goto out;
> +
> +	if (icmp6h->icmp6_type && icmp6h->icmp6_type < 128) {
> +		*offset += hdrlen + sizeof(_ih6);
> +		return 1;
> +	}
> +out:
> +	return 0;
> +}
> +#endif
> +
> +/*
> + * Calc hash value, special casre is taken on icmp and fragmented messages
> + * i.e. fragmented messages don't use ports.
> + */
> +static __u32 get_hash(struct sk_buff *skb, struct xt_hmark_info *info)

This function seems too big to me. Please split it into smaller
chunks, like get_hash_ipv4, get_hash_ipv6 and get_hash_ports.

> +{
> +	int nhoff, hash = 0, poff, proto, frag = 0;
> +	struct iphdr *ip;
> +	u8 ip_proto;
> +	u32 addr1, addr2, ihl;
> +	u16 snatport = 0, dnatport = 0;
> +	union {
> +		u32 v32;
> +		u16 v16[2];
> +	} ports;
> +
> +	nhoff = skb_network_offset(skb);
> +	proto = skb->protocol;
> +
> +	if (!proto && skb->sk) {
> +		if (skb->sk->sk_family == AF_INET)
> +			proto = __constant_htons(ETH_P_IP);
> +		else if (skb->sk->sk_family == AF_INET6)
> +			proto = __constant_htons(ETH_P_IPV6);

You already have the layer3 protocol number in xt_action_param. No
need to use the socket information then.

> +	}
> +
> +	switch (proto) {
> +	case __constant_htons(ETH_P_IP):
> +	{
> +		enum ip_conntrack_info ctinfo;
> +		struct nf_conn *ct = ct = nf_ct_get(skb, &ctinfo);
> +		struct nf_conntrack_tuple *otuple, *rtuple;
> +
> +		if (!pskb_may_pull(skb, sizeof(*ip) + nhoff))
> +			goto done;
> +
> +		ip = (struct iphdr *) (skb->data + nhoff);
> +		if (ip->protocol == IPPROTO_ICMP) {
> +			/* Switch hash calc to inner header ? */
> +			nhoff = get_inner_hdr(skb, ip->ihl * 4, nhoff);
> +			ip = (struct iphdr *) (skb->data + nhoff);
> +		}
> +
> +		if (ip->frag_off & htons(IP_MF | IP_OFFSET))
> +			frag = 1;
> +
> +		ip_proto = ip->protocol;
> +		ihl = ip->ihl;
> +		addr1 = (__force u32) ip->saddr & info->smask;
> +		addr2 = (__force u32) ip->daddr & info->dmask;
> +
> +		if (!ct || !nf_ct_is_confirmed(ct))

You seem to (ab)use nf_ct_is_confirmed to make sure you're not in the
original direction. Better use the direction that you get by means of
nf_ct_get.

> +			break;
> +		otuple = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
> +		/* On the "return flow", to get the original address
> +		 * i,e, replace the source address.
> +		 */
> +		if (ct->status & IPS_DST_NAT &&
> +		    info->flags & XT_HMARK_USE_DNAT) {
> +			rtuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
> +			addr1 = (__force u32) otuple->dst.u3.in.s_addr;
> +			dnatport = otuple->dst.u.udp.port;
> +		}
> +		/* On the "return flow", to get the original address
> +		 * i,e, replace the destination address.
> +		 */
> +		if (ct->status & IPS_SRC_NAT &&
> +		    info->flags & XT_HMARK_USE_SNAT) {
> +			rtuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
> +			addr2 = (__force u32) otuple->src.u3.in.s_addr;
> +			snatport = otuple->src.u.udp.port;
> +		}
> +		break;
> +	}
> +#ifdef WITH_IPV6
> +	case __constant_htons(ETH_P_IPV6):
> +	{
> +		struct ipv6hdr *ip6;	/* ip hdr */
> +		int hdrlen = 0;	/* In ip header */
> +		u8 nexthdr;
> +		int ip6hdrlvl = 0;	/* Header level */
> +		struct ipv6_opt_hdr _hdr, *hp;
> +
> +hdr_new:
> +		if (!pskb_may_pull(skb, sizeof(*ip6) + nhoff))
> +			goto done;
> +
> +		/* ip header */
> +		ip6 = (struct ipv6hdr *) (skb->data + nhoff);
> +		nexthdr = ip6->nexthdr;
> +		/* nhoff += sizeof(struct ipv6hdr);  Where hdr starts */
> +		hdrlen = sizeof(struct ipv6hdr);
> +		hp = skb_header_pointer(skb, nhoff + hdrlen, sizeof(_hdr),
> +					&_hdr);
> +		while (nexthdr) {
> +			switch (nexthdr) {
> +			case IPPROTO_ICMPV6:
> +				/* ICMP Error then move ptr to inner header */
> +				if (get_inner6_hdr(skb, &nhoff, hdrlen)) {
> +					ip6hdrlvl++;
> +					goto hdr_new;
> +				}
> +				nhoff += hdrlen;
> +				goto hdr_rdy;
> +
> +			case NEXTHDR_FRAGMENT:
> +				if (!ip6hdrlvl)
> +					frag = 1;
> +				break;
> +			/* End of hdr traversing */
> +			case NEXTHDR_IPV6:	/* Do not process tunnels */
> +			case NEXTHDR_TCP:
> +			case NEXTHDR_UDP:
> +			case NEXTHDR_ESP:
> +			case NEXTHDR_AUTH:
> +			case NEXTHDR_NONE:
> +				nhoff += hdrlen;
> +				goto hdr_rdy;
> +			default:
> +				goto done;

This goto doesn't make too much sense to me, better return 0.

> +			}
> +			if (!hp)
> +				goto done;
> +			nhoff += hdrlen;	/* eat current header */
> +			nexthdr =  hp->nexthdr;	/* Next header */
> +			hdrlen = ipv6_optlen(hp);
> +			hp = skb_header_pointer(skb, nhoff + hdrlen,
> +						sizeof(_hdr), &_hdr);
> +
> +			if (!pskb_may_pull(skb, nhoff))
> +				goto done;
> +		}
> +hdr_rdy:
> +		ip_proto = nexthdr;
> +
> +		addr1 = (__force u32) ip6->saddr.s6_addr32[3];
> +		addr2 = (__force u32) ip6->daddr.s6_addr32[3];
> +		ihl = 0; /* (40 >> 2); */
> +		break;
> +	}
> +#endif
> +	default:
> +		goto done;
> +	}
> +
> +	ports.v32 = 0;
> +	poff = proto_ports_offset(ip_proto);
> +	nhoff += ihl * 4 + poff;
> +	if (!frag && poff >= 0 && pskb_may_pull(skb, nhoff + 4)) {
> +		ports.v32 = * (__force u32 *) (skb->data + nhoff);
> +		if (ip_proto == IPPROTO_ESP || ip_proto == IPPROTO_AH) {
> +			ports.v32 = (ports.v32 & info->spimask) | info->spiset;
> +		} else { /* Handle endian */
> +			if (snatport)	/* Replace snated dst port (ret flow) */
> +				ports.v16[1] = snatport;
> +			if (dnatport)
> +				ports.v16[0] = dnatport;
> +			ports.v32 = (ports.v32 & info->pmask.v32) |
> +				    info->pset.v32;
> +			if (ports.v16[1] < ports.v16[0])
> +				swap(ports.v16[0], ports.v16[1]);
> +		}
> +	}
> +	ip_proto &= info->prmask;
> +	/* get a consistent hash (same value on both flow directions) */
> +	if (addr2 < addr1)
> +		swap(addr1, addr2);
> +
> +	hash = jhash_3words(addr1, addr2, ports.v32, info->hashrnd) ^ ip_proto;
> +	if (!hash)
> +		hash = 1;
> +
> +	return hash;
> +
> +done:
> +	return 0;
> +}

I'll try to find more time to look into this. Specifically, I want to
review the IPv6 bits more carefully.
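
For readers following along, the direction-independent part of the hash
can be sketched in userspace as below. This is only a sketch under
assumptions: mix3() is a made-up mixer standing in for jhash_3words(),
flow_mark() is a hypothetical name, and the final "modulus then offset"
step follows the Kconfig help text rather than get_hash() itself.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in mixer; the patch itself uses jhash_3words(). */
static uint32_t mix3(uint32_t a, uint32_t b, uint32_t c, uint32_t seed)
{
	uint32_t h = seed ^ 0x9e3779b9u;

	h = (h ^ a) * 0x85ebca6bu;
	h = (h ^ b) * 0xc2b2ae35u;
	h = (h ^ c) * 0x27d4eb2fu;
	return h ^ (h >> 15);
}

/*
 * Order the two addresses and the two ports before hashing, so that a
 * packet and its reply produce the same value. Then reduce the hash to
 * the range [hoffs, hoffs + hmod).
 */
static uint32_t flow_mark(uint32_t saddr, uint32_t daddr,
			  uint16_t sport, uint16_t dport, uint8_t proto,
			  uint32_t seed, uint32_t hmod, uint32_t hoffs)
{
	uint32_t lo_addr = saddr < daddr ? saddr : daddr;
	uint32_t hi_addr = saddr < daddr ? daddr : saddr;
	uint16_t lo_port = sport < dport ? sport : dport;
	uint16_t hi_port = sport < dport ? dport : sport;
	uint32_t ports = ((uint32_t)hi_port << 16) | lo_port;
	uint32_t hash;

	assert(hmod != 0);
	hash = mix3(lo_addr, hi_addr, ports, seed) ^ proto;
	return hash % hmod + hoffs;
}
```

Sorting addresses and ports independently is what the patch does with
swap(); it also means that two flows which differ only in which side
owns which port land on the same mark.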

^ permalink raw reply

* [PATCH] rtl8192e: Don't copy huge struct by value (and make it const).
From: Jesper Juhl @ 2011-11-06 23:21 UTC (permalink / raw)
  To: devel, linux-kernel
  Cc: Andrea Merello, Mike McCormack, Larry Finger, Greg Kroah-Hartman,
	netdev

rtllib_is_shortslot() takes one argument - a struct that's more than a
kilobyte large. It should take a pointer instead of copying such a
huge struct - and the argument might as well be declared 'const' now
that we are at it, since it is not modified. This patch makes these
changes.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
 drivers/staging/rtl8192e/rtllib.h         |    2 +-
 drivers/staging/rtl8192e/rtllib_softmac.c |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/rtl8192e/rtllib.h b/drivers/staging/rtl8192e/rtllib.h
index de25975..3a52120 100644
--- a/drivers/staging/rtl8192e/rtllib.h
+++ b/drivers/staging/rtl8192e/rtllib.h
@@ -2804,7 +2804,7 @@ extern int rtllib_wx_set_gen_ie(struct rtllib_device *ieee, u8 *ie, size_t len);
 
 /* rtllib_softmac.c */
 extern short rtllib_is_54g(struct rtllib_network *net);
-extern short rtllib_is_shortslot(struct rtllib_network net);
+extern short rtllib_is_shortslot(const struct rtllib_network *net);
 extern int rtllib_rx_frame_softmac(struct rtllib_device *ieee,
 				   struct sk_buff *skb,
 				   struct rtllib_rx_stats *rx_stats, u16 type,
diff --git a/drivers/staging/rtl8192e/rtllib_softmac.c b/drivers/staging/rtl8192e/rtllib_softmac.c
index b508685..fa774cf 100644
--- a/drivers/staging/rtl8192e/rtllib_softmac.c
+++ b/drivers/staging/rtl8192e/rtllib_softmac.c
@@ -28,9 +28,9 @@ short rtllib_is_54g(struct rtllib_network *net)
 	return (net->rates_ex_len > 0) || (net->rates_len > 4);
 }
 
-short rtllib_is_shortslot(struct rtllib_network net)
+short rtllib_is_shortslot(const struct rtllib_network *net)
 {
-	return net.capability & WLAN_CAPABILITY_SHORT_SLOT_TIME;
+	return net->capability & WLAN_CAPABILITY_SHORT_SLOT_TIME;
 }
 
 /* returns the total length needed for pleacing the RATE MFIE
-- 
1.7.7.2


-- 
Jesper Juhl <jj@chaosbits.net>       http://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.

^ permalink raw reply related

* linux-next: build failure after merge of the origin tree
From: Stephen Rothwell @ 2011-11-06 23:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-next, linux-kernel, Greg Rose, Jeff Kirsher, David Miller,
	netdev

Hi Linus,

Starting with the origin tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: In function 'ixgbe_set_interrupt_capability':
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:4724:3: error: implicit declaration of function 'ixgbe_disable_sriov'
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c: In function 'ixgbe_remove':
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:7773:3: error: implicit declaration of function 'ixgbe_check_vf_assignment'

Caused by commit 9487dc844054 ("ixgbe: Fix compiler warnings") which hid
the declarations of ixgbe_disable_sriov() and ixgbe_check_vf_assignment()
when CONFIG_PCI_IOV is not defined.
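
One common way to avoid this class of failure (not necessarily what the
ixgbe maintainers will choose) is to keep the declarations visible in
both configurations and supply empty inline stubs when the feature is
compiled out. A minimal sketch, in which DEMO_CONFIG_PCI_IOV,
struct demo_adapter and all demo_* functions are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

struct demo_adapter {
	int num_vfs;
};

#ifdef DEMO_CONFIG_PCI_IOV
/* Real implementations live in the SR-IOV source file. */
void demo_disable_sriov(struct demo_adapter *adapter);
int demo_check_vf_assignment(struct demo_adapter *adapter);
#else
/* Feature compiled out: callers still build, the calls do nothing. */
static inline void demo_disable_sriov(struct demo_adapter *adapter)
{
	(void)adapter;
}

static inline int demo_check_vf_assignment(struct demo_adapter *adapter)
{
	(void)adapter;
	return 0;
}
#endif

/* Caller needs no #ifdef of its own. */
static int demo_remove(struct demo_adapter *adapter)
{
	assert(adapter != NULL);
	if (demo_check_vf_assignment(adapter))
		return -1;
	demo_disable_sriov(adapter);
	return 0;
}
```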

I have reverted that commit for today.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

^ permalink raw reply

* Re: Linux Route Cache performance tests
From: Eric Dumazet @ 2011-11-06 23:08 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list
In-Reply-To: <4EB702D3.1090703@itcare.pl>

On Sunday, 6 November 2011 at 22:57 +0100, Paweł Staszewski wrote:
> On 2011-11-06 22:26, Eric Dumazet wrote:
> > On Sunday, 6 November 2011 at 21:25 +0100, Paweł Staszewski wrote:
> >> Yes, I think there is a small problem with this on kernel 3.1, because
> >> dmesg | egrep  '(rhash)|(route)'
> >> [    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
> >> [    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
> >> [    4.697294] IP route cache hash table entries: 524288 (order: 10,
> >> 4194304 bytes)
> >>
> >>
> > Don't tell me you _still_ use a 32-bit kernel?
> No, it is 64-bit :)
> Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
> :)
> 
> > If so, you need to tweak alloc_large_system_hash() to use vmalloc,
> > because you hit MAX_ORDER (10) page allocations.
> funny then :)
> Maybe I turned off too many kernel features
> > But considering LOWMEM is about 700 Mbytes, you won't be able to create a
> > lot of route cache entries.
> >
> > Come on, do us a favor, and enter the new era of computing.

OK, then your kernel is built without CONFIG_NUMA.

That seems strange, given that you probably have a NUMA machine (24 CPUs).

If so, your choices are:

1) Enable CONFIG_NUMA. This really is a must given the workload of your
machine.

2) Or add "hashdist=1" to the boot parameters
   and patch your kernel with the following patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..07f86e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5362,7 +5362,6 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
 
 int hashdist = HASHDIST_DEFAULT;
 
-#ifdef CONFIG_NUMA
 static int __init set_hashdist(char *str)
 {
 	if (!str)
@@ -5371,7 +5370,6 @@ static int __init set_hashdist(char *str)
 	return 1;
 }
 __setup("hashdist=", set_hashdist);
-#endif
 
 /*
  * allocate a large system hash table from bootmem
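
For reference, the order-10 figure in the dmesg output quoted above can
be checked with a little arithmetic. The sketch below assumes 8 bytes
per hash bucket (4194304 bytes / 524288 entries) and 4096-byte pages,
both inferred from that dmesg line; alloc_order() is a hypothetical
userspace helper, not the kernel's own function.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) covers bytes. */
static unsigned int alloc_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(page_size != 0);
	while (span < bytes) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With these assumptions, the 524288 entries that were actually allocated
need an order-10 block, while the requested rhash_entries=2097152 would
need order 12, which is more than the page allocator hands out in one
contiguous piece. That is why the table has to come from vmalloc.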

^ permalink raw reply related

* Re: [PATCH] net, wireless, mwifiex: Fix mem leak in mwifiex_update_curr_bss_params()
From: Srivatsa S. Bhat @ 2011-11-06 22:24 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: linux-wireless, netdev, linux-kernel, John W. Linville, Bing Zhao
In-Reply-To: <alpine.LNX.2.00.1111062255310.5763@swampdragon.chaosbits.net>

On 11/07/2011 03:28 AM, Jesper Juhl wrote:
> If kmemdup() fails we leak the memory allocated to bss_desc.
> This patch fixes the leak.
> I also removed the pointless default assignment of 'NULL' to 'bss_desc' 
> while I was there anyway.
> 
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>

Looks good to me.
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>

Thanks,
Srivatsa S. Bhat

> ---
>  drivers/net/wireless/mwifiex/scan.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
>  note: patch is compile tested only since I don't have the hardware.
> 
> diff --git a/drivers/net/wireless/mwifiex/scan.c b/drivers/net/wireless/mwifiex/scan.c
> index dae8dbb..8a3f959 100644
> --- a/drivers/net/wireless/mwifiex/scan.c
> +++ b/drivers/net/wireless/mwifiex/scan.c
> @@ -1469,7 +1469,7 @@ mwifiex_update_curr_bss_params(struct mwifiex_private *priv, u8 *bssid,
>  			       s32 rssi, const u8 *ie_buf, size_t ie_len,
>  			       u16 beacon_period, u16 cap_info_bitmap, u8 band)
>  {
> -	struct mwifiex_bssdescriptor *bss_desc = NULL;
> +	struct mwifiex_bssdescriptor *bss_desc;
>  	int ret;
>  	unsigned long flags;
>  	u8 *beacon_ie;
> @@ -1484,6 +1484,7 @@ mwifiex_update_curr_bss_params(struct mwifiex_private *priv, u8 *bssid,
> 
>  	beacon_ie = kmemdup(ie_buf, ie_len, GFP_KERNEL);
>  	if (!beacon_ie) {
> +		kfree(bss_desc);
>  		dev_err(priv->adapter->dev, " failed to alloc beacon_ie\n");
>  		return -ENOMEM;
>  	}

^ permalink raw reply

* [PATCH] net, wireless, mwifiex: Fix mem leak in mwifiex_update_curr_bss_params()
From: Jesper Juhl @ 2011-11-06 21:58 UTC (permalink / raw)
  To: linux-wireless, netdev, linux-kernel; +Cc: John W. Linville, Bing Zhao

If kmemdup() fails we leak the memory allocated to bss_desc.
This patch fixes the leak.
I also removed the pointless default assignment of 'NULL' to 'bss_desc' 
while I was there anyway.
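
The shape of the bug, reduced to a userspace sketch: an object is
allocated first, a second allocation fails, and the early return has to
release the first object. All demo_* names and the alloc_fn hook are
hypothetical and exist only so the failure path can be exercised.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_desc {
	unsigned char *beacon;
	size_t beacon_len;
};

typedef void *(*alloc_fn)(size_t);

/* Allocator that always fails, to drive the error path. */
static void *demo_alloc_fail(size_t n)
{
	(void)n;
	return NULL;
}

static struct demo_desc *demo_desc_create(const unsigned char *ie,
					  size_t len, alloc_fn alloc_copy)
{
	struct demo_desc *desc;

	assert(alloc_copy != NULL);
	desc = malloc(sizeof(*desc));
	if (!desc)
		return NULL;

	desc->beacon = alloc_copy(len);
	if (!desc->beacon) {
		free(desc);	/* without this, desc is leaked */
		return NULL;
	}
	memcpy(desc->beacon, ie, len);
	desc->beacon_len = len;
	return desc;
}
```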

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
 drivers/net/wireless/mwifiex/scan.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

 note: patch is compile tested only since I don't have the hardware.

diff --git a/drivers/net/wireless/mwifiex/scan.c b/drivers/net/wireless/mwifiex/scan.c
index dae8dbb..8a3f959 100644
--- a/drivers/net/wireless/mwifiex/scan.c
+++ b/drivers/net/wireless/mwifiex/scan.c
@@ -1469,7 +1469,7 @@ mwifiex_update_curr_bss_params(struct mwifiex_private *priv, u8 *bssid,
 			       s32 rssi, const u8 *ie_buf, size_t ie_len,
 			       u16 beacon_period, u16 cap_info_bitmap, u8 band)
 {
-	struct mwifiex_bssdescriptor *bss_desc = NULL;
+	struct mwifiex_bssdescriptor *bss_desc;
 	int ret;
 	unsigned long flags;
 	u8 *beacon_ie;
@@ -1484,6 +1484,7 @@ mwifiex_update_curr_bss_params(struct mwifiex_private *priv, u8 *bssid,
 
 	beacon_ie = kmemdup(ie_buf, ie_len, GFP_KERNEL);
 	if (!beacon_ie) {
+		kfree(bss_desc);
 		dev_err(priv->adapter->dev, " failed to alloc beacon_ie\n");
 		return -ENOMEM;
 	}
-- 
1.7.7.2


-- 
Jesper Juhl <jj@chaosbits.net>       http://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.

^ permalink raw reply related

* Re: Linux Route Cache performance tests
From: Paweł Staszewski @ 2011-11-06 21:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list
In-Reply-To: <1320614788.6506.38.camel@edumazet-laptop>

On 2011-11-06 22:26, Eric Dumazet wrote:
> On Sunday, 6 November 2011 at 21:25 +0100, Paweł Staszewski wrote:
>> Yes, I think there is a small problem with this on kernel 3.1, because
>> dmesg | egrep  '(rhash)|(route)'
>> [    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
>> [    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
>> [    4.697294] IP route cache hash table entries: 524288 (order: 10,
>> 4194304 bytes)
>>
>>
> Don't tell me you _still_ use a 32-bit kernel?
No, it is 64-bit :)
Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
:)

> If so, you need to tweak alloc_large_system_hash() to use vmalloc,
> because you hit MAX_ORDER (10) page allocations.
funny then :)
Maybe I turned off too many kernel features
> But considering LOWMEM is about 700 Mbytes, you won't be able to create a
> lot of route cache entries.
>
> Come on, do us a favor, and enter the new era of computing.
>
>
>
>
>

^ permalink raw reply

