Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] ipvs: Kconfig cleanup
From: Simon Horman @ 2010-07-04  7:13 UTC (permalink / raw)
  To: Michal Marek
  Cc: lvs-devel, netdev, Julian Anastasov, Wensong Zhang, linux-kernel,
	Patrick McHardy
In-Reply-To: <20100704070516.GA9437@verge.net.au>

[ Added Patrick McHardy to CC ]

On Sun, Jul 04, 2010 at 04:05:16PM +0900, Simon Horman wrote:
> On Fri, Jul 02, 2010 at 10:32:08PM +0200, Michal Marek wrote:
> > IP_VS_PROTO_AH_ESP should be set iff either of IP_VS_PROTO_{AH,ESP} is
> > selected. Express this with standard kconfig syntax.
> > 
> > Signed-off-by: Michal Marek <mmarek@suse.cz>
> 
> Acked-by: Simon Horman <horms@verge.net.au>
> 
> > ---
> >  net/netfilter/ipvs/Kconfig |    5 +----
> >  1 files changed, 1 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> > index f2d7623..91e7373 100644
> > --- a/net/netfilter/ipvs/Kconfig
> > +++ b/net/netfilter/ipvs/Kconfig
> > @@ -83,19 +83,16 @@ config	IP_VS_PROTO_UDP
> >  	  protocol. Say Y if unsure.
> >  
> >  config	IP_VS_PROTO_AH_ESP
> > -	bool
> > -	depends on UNDEFINED
> > +	def_bool IP_VS_PROTO_ESP || IP_VS_PROTO_AH
> >  
> >  config	IP_VS_PROTO_ESP
> >  	bool "ESP load balancing support"
> > -	select IP_VS_PROTO_AH_ESP
> >  	---help---
> >  	  This option enables support for load balancing ESP (Encapsulation
> >  	  Security Payload) transport protocol. Say Y if unsure.
> >  
> >  config	IP_VS_PROTO_AH
> >  	bool "AH load balancing support"
> > -	select IP_VS_PROTO_AH_ESP
> >  	---help---
> >  	  This option enables support for load balancing AH (Authentication
> >  	  Header) transport protocol. Say Y if unsure.
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe lvs-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* How to detect ethernet cable plug in event when ethernet card is in D3hot power state?
From: LionSky @ 2010-07-04  8:50 UTC (permalink / raw)
  To: netdev

Just as the title.
A scenario is as follows:
When the cable is unplug-in, the ethernet card is placed into D3hot
power state for power saving.
When the cable is plug in later, it will be resumed back to D0 for user usage.

But I do not know whether the ethernet driver/OS can detect the cable
plug-in event when ethernet card is at D3hot power state?
 If it can do that, OS can resume it from D3 to D0.  Thanks

^ permalink raw reply

* Re: [PATCH repost] sched: export sched_set/getaffinity to modules
From: Michael S. Tsirkin @ 2010-07-04  9:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Sridhar Samudrala, Tejun Heo, Ingo Molnar, netdev,
	lkml, kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev,
	Jiri Kosina, Thomas Gleixner, Andi Kleen
In-Reply-To: <20100702210637.GA12433@redhat.com>

On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
> On 07/02, Peter Zijlstra wrote:
> >
> > On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
> > >
> > >  Does  it (Tejun's kthread_clone() patch) also  inherit the
> > > cgroup of the caller?
> >
> > Of course, its a simple do_fork() which inherits everything just as you
> > would expect from a similar sys_clone()/sys_fork() call.
> 
> Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
> from ioctl(), right?
> 
> Then the new thread becomes the natural child of the caller, and it shares
> ->mm with the parent. And files, dup_fd() without CLONE_FS.
> 
> Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
> TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
> just because the parent gets SIGQUIT or abother coredumpable signal.
> Or the new thread can recieve SIGSTOP via ^Z.
> 
> Perhaps this is OK, I do not know. Just to remind that kernel_thread()
> is merely clone(CLONE_VM).
> 
> Oleg.


Right. Doing this might break things like flush.  The signal and exit
behaviour needs to be examined carefully. I am also unsure whether
using such threads might be more expensive than inheriting kthreadd.

-- 
MST

^ permalink raw reply

* [patch v2.3 0/4], [patch v2.3 0/4] IPVS full NAT support + netfilter 'ipvs' match support
From: Simon Horman @ 2010-07-04 11:32 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
	David S. Miller, Hannes Eder

This is a repost of a patch-series posted by Hannes Eder last September.
This is v2 of the patch series and I don't see any outstanding objections to
it in the mailing list archives.

After I posted v2.2 of this series in May several concerns were raised
by Patrick McHardy. This series should address all of those concerns.

Malcolm Turnbull has offered to test this code so I'd like to get
a Reviewed-by from him before the code gets merged. In other words,
at this stage these patches are for review not merging.

The original cover-email from Hannes follows.
The diffstat output has been updated to reflect minor up-porting by me.

From:	Hannes Eder <heder@google.com>

The following series implements full NAT support for IPVS.  The
approach is via a minimal change to IPVS (make friends with
nf_conntrack) and adding a netfilter matcher, kernel- and user-space
part, i.e. xt_ipvs and libxt_ipvs.

Example usage:

% ipvsadm -A -t 192.168.100.30:80 -s rr
% ipvsadm -a -t 192.168.100.30:80 -r 192.168.10.20:80 -m
# ...

# Source NAT for VIP 192.168.100.30:80
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 80 -j SNAT --to-source 192.168.10.10

or SNAT-ing only a specific real server:

% iptables -t nat -A POSTROUTING --dst 192.168.11.20 \
> -m ipvs --vaddr 192.168.100.30/32 -j SNAT --to-source 192.168.10.10


First of all, thanks for all the feedback.  This is the changelog for v2:

- Make ip_vs_ftp work again.  Setup nf_conntrack expectations for
  related data connections (based on Julian's patch see
  http://www.ssi.bg/~ja/nfct/) and let nf_conntrack/nf_nat do the
  packet mangling and the TCP sequence adjusting.

  This change rises the question how to deal with ip_vs_sync?  Does it
  work together with conntrackd?  Wild idea: what about getting rid of
  ip_vs_sync and piggy packing all on nf_conntrack and use conntrackd?

  Any comments on this?

- xt_ipvs: add new rule '--vportctl port' to match the VIP port of the
  controlling connection, e.g. port 21 for FTP.  Can be used to match
  a related data connection for FTP:

  # SNAT FTP control connection
  % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
  > --vport 21 -j SNAT --to-source 192.168.10.10
  
  # SNAT FTP passive data connection
  % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
  > --vportctl 21 -j SNAT --to-source 192.168.10.10

- xt_ipvs: use 'par->family' instead of 'skb->protocol'

- xt_ipvs: add ipvs_mt_check and restrict to NFPROTO_IPV4 and NFPROTO_IPV6

- Call nf_conntrack_alter_reply(), so helper lookup is performed based
  on the changed tuple.

Changes to the linux kernel
(nf-next-2.6, "bridge: add per bridge device controls for invoking iptables")

Hannes Eder (3):
      netfilter: xt_ipvs (netfilter matcher for IPVS)
      IPVS: make friends with nf_conntrack
      IPVS: make FTP work with full NAT support


 include/linux/netfilter/xt_ipvs.h |   25 +++++
 include/net/ip_vs.h              |    2 
 net/netfilter/Kconfig            |   10 ++
 net/netfilter/Makefile           |    1 
 net/netfilter/ipvs/Kconfig       |    4 
 net/netfilter/ipvs/ip_vs_app.c   |   43 ---------
 net/netfilter/ipvs/ip_vs_core.c  |   37 --------
 net/netfilter/ipvs/ip_vs_ftp.c   |  173 +++++++++++++++++++++++++++++++++++---
 net/netfilter/ipvs/ip_vs_proto.c |    1 
 net/netfilter/ipvs/ip_vs_xmit.c  |   30 ++++++
 net/netfilter/xt_ipvs.c           |  189 +++++++++++++++++++++++++++++++++++++
 11 files changed, 419 insertions(+), 96 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c


Changes to iptables
(iptables.git, "xt_quota: also document negation")

Hannes Eder (1):
      libxt_ipvs: user-space lib for netfilter matcher xt_ipvs

 configure.ac                      |   10 1
 extensions/libxt_ipvs.c           |  365 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   24 ++
 include/linux/netfilter/xt_ipvs.h |   25 +++
 4 files changed, 422 insertions(+), 2 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h

^ permalink raw reply

* [patch v2.3 1/4] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Simon Horman @ 2010-07-04 11:32 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
	David S. Miller, Hannes Eder
In-Reply-To: <20100704113246.562399500@vergenet.net>

[-- Attachment #1: netfilter-xt_ipvs-netfilter-matcher-for-IPVS.patch --]
[-- Type: text/plain, Size: 8570 bytes --]

From:	Hannes Eder <heder@google.com>

This implements the kernel-space side of the netfilter matcher xt_ipvs.

[ minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>

--- 

 include/linux/netfilter/xt_ipvs.h |   25 ++++
 net/netfilter/Kconfig             |   10 +
 net/netfilter/Makefile            |    1 
 net/netfilter/ipvs/ip_vs_proto.c  |    1 
 net/netfilter/xt_ipvs.c           |  187 +++++++++++++++++++++++++++++++++++++
 5 files changed, 224 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c

v2.1, v2.2
No Change

v2.3
As per advice from Patrick McHardy
* Don't define a value for _XT_IPVS_H in xt_ipvs.h
* Depend on NF_CONNTRACK
* Update to new API
  - ipvs_mt_check() should return an int rather than a bool
  - Change type of ipvs_mt()'s par parameter from
    struct xt_action_param to struct xt_match_param
  - Make ipvs_mt()'s par parameter non-const
Index: nf-next-2.6/include/linux/netfilter/xt_ipvs.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ nf-next-2.6/include/linux/netfilter/xt_ipvs.h	2010-07-04 16:22:19.000000000 +0900
@@ -0,0 +1,25 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H
+
+#define XT_IPVS_IPVS_PROPERTY	(1 << 0) /* all other options imply this one */
+#define XT_IPVS_PROTO		(1 << 1)
+#define XT_IPVS_VADDR		(1 << 2)
+#define XT_IPVS_VPORT		(1 << 3)
+#define XT_IPVS_DIR		(1 << 4)
+#define XT_IPVS_METHOD		(1 << 5)
+#define XT_IPVS_VPORTCTL	(1 << 6)
+#define XT_IPVS_MASK		((1 << 7) - 1)
+#define XT_IPVS_ONCE_MASK	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
+
+struct xt_ipvs_mtinfo {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u16			l4proto;
+	__u16			fwd_method;
+	__be16			vportctl;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */
Index: nf-next-2.6/net/netfilter/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/Kconfig	2010-07-04 16:21:28.000000000 +0900
+++ nf-next-2.6/net/netfilter/Kconfig	2010-07-04 16:22:19.000000000 +0900
@@ -726,6 +726,16 @@ config NETFILTER_XT_MATCH_IPRANGE
 
 	If unsure, say M.
 
+config NETFILTER_XT_MATCH_IPVS
+	tristate '"ipvs" match support'
+	depends on IP_VS
+	depends on NETFILTER_ADVANCED
+	depends on NF_CONNTRACK
+	help
+	  This option allows you to match against IPVS properties of a packet.
+
+	  If unsure, say N.
+
 config NETFILTER_XT_MATCH_LENGTH
 	tristate '"length" match support'
 	depends on NETFILTER_ADVANCED
Index: nf-next-2.6/net/netfilter/Makefile
===================================================================
--- nf-next-2.6.orig/net/netfilter/Makefile	2010-07-04 16:21:28.000000000 +0900
+++ nf-next-2.6/net/netfilter/Makefile	2010-07-04 16:22:19.000000000 +0900
@@ -76,6 +76,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_HASHLIMI
 obj-$(CONFIG_NETFILTER_XT_MATCH_HELPER) += xt_helper.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_HL) += xt_hl.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_IPRANGE) += xt_iprange.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_IPVS) += xt_ipvs.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LENGTH) += xt_length.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LIMIT) += xt_limit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_MAC) += xt_mac.o
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_proto.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_proto.c	2010-07-04 16:21:28.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_proto.c	2010-07-04 16:22:19.000000000 +0900
@@ -98,6 +98,7 @@ struct ip_vs_protocol * ip_vs_proto_get(
 
 	return NULL;
 }
+EXPORT_SYMBOL(ip_vs_proto_get);
 
 
 /*
Index: nf-next-2.6/net/netfilter/xt_ipvs.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ nf-next-2.6/net/netfilter/xt_ipvs.c	2010-07-04 16:43:17.000000000 +0900
@@ -0,0 +1,189 @@
+/*
+ *	xt_ipvs - kernel module to match IPVS connection properties
+ *
+ *	Author: Hannes Eder <heder@google.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#ifdef CONFIG_IP_VS_IPV6
+#include <net/ipv6.h>
+#endif
+#include <linux/ip_vs.h>
+#include <linux/types.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_ipvs.h>
+#include <net/netfilter/nf_conntrack.h>
+
+#include <net/ip_vs.h>
+
+MODULE_AUTHOR("Hannes Eder <heder@google.com>");
+MODULE_DESCRIPTION("Xtables: match IPVS connection properties");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_ipvs");
+MODULE_ALIAS("ip6t_ipvs");
+
+/* borrowed from xt_conntrack */
+static bool ipvs_mt_addrcmp(const union nf_inet_addr *kaddr,
+			    const union nf_inet_addr *uaddr,
+			    const union nf_inet_addr *umask,
+			    unsigned int l3proto)
+{
+	if (l3proto == NFPROTO_IPV4)
+		return ((kaddr->ip ^ uaddr->ip) & umask->ip) == 0;
+#ifdef CONFIG_IP_VS_IPV6
+	else if (l3proto == NFPROTO_IPV6)
+		return ipv6_masked_addr_cmp(&kaddr->in6, &umask->in6,
+		       &uaddr->in6) == 0;
+#endif
+	else
+		return false;
+}
+
+static bool
+ipvs_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_ipvs_mtinfo *data = par->matchinfo;
+	/* ipvs_mt_check ensures that family is only NFPROTO_IPV[46]. */
+	const u_int8_t family = par->family;
+	struct ip_vs_iphdr iph;
+	struct ip_vs_protocol *pp;
+	struct ip_vs_conn *cp;
+	bool match = true;
+
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		match = skb->ipvs_property ^
+			!!(data->invert & XT_IPVS_IPVS_PROPERTY);
+		goto out;
+	}
+
+	/* other flags than XT_IPVS_IPVS_PROPERTY are set */
+	if (!skb->ipvs_property) {
+		match = false;
+		goto out;
+	}
+
+	ip_vs_fill_iphdr(family, skb_network_header(skb), &iph);
+
+	if (data->bitmask & XT_IPVS_PROTO)
+		if ((iph.protocol == data->l4proto) ^
+		    !(data->invert & XT_IPVS_PROTO)) {
+			match = false;
+			goto out;
+		}
+
+	pp = ip_vs_proto_get(iph.protocol);
+	if (unlikely(!pp)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * Check if the packet belongs to an existing entry
+	 */
+	cp = pp->conn_out_get(family, skb, pp, &iph, iph.len, 1 /* inverse */);
+	if (unlikely(cp == NULL)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * We found a connection, i.e. ct != 0, make sure to call
+	 * __ip_vs_conn_put before returning.  In our case jump to out_put_con.
+	 */
+
+	if (data->bitmask & XT_IPVS_VPORT)
+		if ((cp->vport == data->vport) ^
+		    !(data->invert & XT_IPVS_VPORT)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_VPORTCTL)
+		if ((cp->control != NULL &&
+		     cp->control->vport == data->vportctl) ^
+		    !(data->invert & XT_IPVS_VPORTCTL)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		enum ip_conntrack_info ctinfo;
+		struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+
+		if (ct == NULL || ct == &nf_conntrack_untracked) {
+			match = false;
+			goto out_put_cp;
+		}
+
+		if ((ctinfo >= IP_CT_IS_REPLY) ^
+		    !!(data->invert & XT_IPVS_DIR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD)
+		if (((cp->flags & IP_VS_CONN_F_FWD_MASK) == data->fwd_method) ^
+		    !(data->invert & XT_IPVS_METHOD)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (ipvs_mt_addrcmp(&cp->vaddr, &data->vaddr,
+				    &data->vmask, family) ^
+		    !(data->invert & XT_IPVS_VADDR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+out_put_cp:
+	__ip_vs_conn_put(cp);
+out:
+	pr_debug("match=%d\n", match);
+	return match;
+}
+
+static int ipvs_mt_check(const struct xt_mtchk_param *par)
+{
+	if (par->family != NFPROTO_IPV4
+#ifdef CONFIG_IP_VS_IPV6
+	    && par->family != NFPROTO_IPV6
+#endif
+		) {
+		pr_info("protocol family %u not supported\n", par->family);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct xt_match xt_ipvs_mt_reg __read_mostly = {
+	.name       = "ipvs",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.match      = ipvs_mt,
+	.checkentry = ipvs_mt_check,
+	.matchsize  = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+	.me         = THIS_MODULE,
+};
+
+static int __init ipvs_mt_init(void)
+{
+	return xt_register_match(&xt_ipvs_mt_reg);
+}
+
+static void __exit ipvs_mt_exit(void)
+{
+	xt_unregister_match(&xt_ipvs_mt_reg);
+}
+
+module_init(ipvs_mt_init);
+module_exit(ipvs_mt_exit);

^ permalink raw reply

* [patch v2.3 2/4] IPVS: make friends with nf_conntrack
From: Simon Horman @ 2010-07-04 11:32 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
	David S. Miller, Hannes Eder
In-Reply-To: <20100704113246.562399500@vergenet.net>

[-- Attachment #1: IPVS-make-friends-with-nf_conntrack.patch --]
[-- Type: text/plain, Size: 5504 bytes --]

From:	Hannes Eder <heder@google.com>

Update the nf_conntrack tuple in reply direction, as we will see
traffic from the real server (RIP) to the client (CIP).  Once this is
done we can use netfilters SNAT in POSTROUTING, especially with
xt_ipvs, to do source NAT, e.g.:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 80 \
> -j SNAT --to-source 192.168.10.10

Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>

--- 

 net/netfilter/ipvs/Kconfig      |    2 +-
 net/netfilter/ipvs/ip_vs_core.c |   36 ------------------------------------
 net/netfilter/ipvs/ip_vs_xmit.c |   30 ++++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 37 deletions(-)

v2.1, v2.2, v2.3
No change

Index: nf-next-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/Kconfig	2010-07-04 20:18:12.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/Kconfig	2010-07-04 20:18:33.000000000 +0900
@@ -3,7 +3,7 @@
 #
 menuconfig IP_VS
 	tristate "IP virtual server support"
-	depends on NET && INET && NETFILTER
+	depends on NET && INET && NETFILTER && NF_CONNTRACK
 	---help---
 	  IP Virtual Server support will let you build a high-performance
 	  virtual server based on cluster of two or more real servers. This
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-07-04 20:18:12.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-07-04 20:18:33.000000000 +0900
@@ -536,26 +536,6 @@ int ip_vs_leave(struct ip_vs_service *sv
 	return NF_DROP;
 }
 
-
-/*
- *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
- *      chain, and is used for VS/NAT.
- *      It detects packets for VS/NAT connections and sends the packets
- *      immediately. This can avoid that iptable_nat mangles the packets
- *      for VS/NAT.
- */
-static unsigned int ip_vs_post_routing(unsigned int hooknum,
-				       struct sk_buff *skb,
-				       const struct net_device *in,
-				       const struct net_device *out,
-				       int (*okfn)(struct sk_buff *))
-{
-	if (!skb->ipvs_property)
-		return NF_ACCEPT;
-	/* The packet was sent from IPVS, exit this chain */
-	return NF_STOP;
-}
-
 __sum16 ip_vs_checksum_complete(struct sk_buff *skb, int offset)
 {
 	return csum_fold(skb_checksum(skb, offset, skb->len - offset, 0));
@@ -1499,14 +1479,6 @@ static struct nf_hook_ops ip_vs_ops[] __
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP_PRI_NAT_SRC-1,
-	},
 #ifdef CONFIG_IP_VS_IPV6
 	/* After packet filtering, forward packet through VS/DR, VS/TUN,
 	 * or VS/NAT(change destination), so that filtering rules can be
@@ -1535,14 +1507,6 @@ static struct nf_hook_ops ip_vs_ops[] __
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET6,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP6_PRI_NAT_SRC-1,
-	},
 #endif
 };
 
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_xmit.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_xmit.c	2010-07-04 16:15:59.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_xmit.c	2010-07-04 20:18:33.000000000 +0900
@@ -28,6 +28,7 @@
 #include <net/ip6_route.h>
 #include <linux/icmpv6.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
 #include <linux/netfilter_ipv4.h>
 
 #include <net/ip_vs.h>
@@ -348,6 +349,31 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb
 }
 #endif
 
+static void
+ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp)
+{
+	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+	struct nf_conntrack_tuple new_tuple;
+
+	if (ct == NULL || ct == &nf_conntrack_untracked ||
+	    nf_ct_is_confirmed(ct))
+		return;
+
+	/*
+	 * The connection is not yet in the hashtable, so we update it.
+	 * CIP->VIP will remain the same, so leave the tuple in
+	 * IP_CT_DIR_ORIGINAL untouched.  When the reply comes back from the
+	 * real-server we will see RIP->DIP.
+	 */
+	new_tuple = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+	new_tuple.src.u3 = cp->daddr;
+	/*
+	 * This will also take care of UDP and other protocols.
+	 */
+	new_tuple.src.u.tcp.port = cp->dport;
+	nf_conntrack_alter_reply(ct, &new_tuple);
+}
+
 /*
  *      NAT transmitter (only for outside-to-inside nat forwarding)
  *      Not used for related ICMP
@@ -403,6 +429,8 @@ ip_vs_nat_xmit(struct sk_buff *skb, stru
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */
@@ -479,6 +507,8 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, s
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */


^ permalink raw reply

* [patch v2.3 3/4] IPVS: make FTP work with full NAT support
From: Simon Horman @ 2010-07-04 11:32 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
	David S. Miller, Hannes Eder
In-Reply-To: <20100704113246.562399500@vergenet.net>

[-- Attachment #1: IPVS-make-FTP-work-with-full-NAT-support.patch --]
[-- Type: text/plain, Size: 11854 bytes --]

From:	Hannes Eder <heder@google.com>

Use nf_conntrack/nf_nat code to do the packet mangling and the TCP
sequence adjusting.  The function 'ip_vs_skb_replace' is now dead
code, so it is removed.

To SNAT FTP, use something like:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 21 -j SNAT --to-source 192.168.10.10

and for the data connections in passive mode:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vportctl 21 -j SNAT --to-source 192.168.10.10

using '-m state --state RELATED' would also works.

Make sure the kernel modules ip_vs_ftp, nf_conntrack_ftp, and
nf_nat_ftp are loaded.

[ up-port and minor fixes by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Signed-off-by: Simon Horman <horms@verge.net.au>

--- 

 include/net/ip_vs.h             |    2 
 net/netfilter/ipvs/Kconfig      |    2 
 net/netfilter/ipvs/ip_vs_app.c  |   43 ----------
 net/netfilter/ipvs/ip_vs_core.c |    1 
 net/netfilter/ipvs/ip_vs_ftp.c  |  164 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 153 insertions(+), 59 deletions(-)

v2.1
* Up-port

v2.2
* No change

v2.3
* Up-port
* Use %pI4 instead of NIPQUAD
* Drop buf_len = snprintf() change - its a separate, cosmetic, fix

Index: nf-next-2.6/include/net/ip_vs.h
===================================================================
--- nf-next-2.6.orig/include/net/ip_vs.h	2010-07-04 20:30:19.000000000 +0900
+++ nf-next-2.6/include/net/ip_vs.h	2010-07-04 20:32:06.000000000 +0900
@@ -736,8 +736,6 @@ extern void ip_vs_app_inc_put(struct ip_
 
 extern int ip_vs_app_pkt_out(struct ip_vs_conn *, struct sk_buff *skb);
 extern int ip_vs_app_pkt_in(struct ip_vs_conn *, struct sk_buff *skb);
-extern int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-			     char *o_buf, int o_len, char *n_buf, int n_len);
 extern int ip_vs_app_init(void);
 extern void ip_vs_app_cleanup(void);
 
Index: nf-next-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/Kconfig	2010-07-04 20:32:05.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/Kconfig	2010-07-04 20:32:06.000000000 +0900
@@ -238,7 +238,7 @@ comment 'IPVS application helper'
 
 config	IP_VS_FTP
   	tristate "FTP protocol helper"
-        depends on IP_VS_PROTO_TCP
+        depends on IP_VS_PROTO_TCP && NF_NAT
 	---help---
 	  FTP is a protocol that transfers IP address and/or port number in
 	  the payload. In the virtual server via Network Address Translation,
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_app.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_app.c	2010-07-04 20:30:19.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_app.c	2010-07-04 20:32:06.000000000 +0900
@@ -569,49 +569,6 @@ static const struct file_operations ip_v
 };
 #endif
 
-
-/*
- *	Replace a segment of data with a new segment
- */
-int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-		      char *o_buf, int o_len, char *n_buf, int n_len)
-{
-	int diff;
-	int o_offset;
-	int o_left;
-
-	EnterFunction(9);
-
-	diff = n_len - o_len;
-	o_offset = o_buf - (char *)skb->data;
-	/* The length of left data after o_buf+o_len in the skb data */
-	o_left = skb->len - (o_offset + o_len);
-
-	if (diff <= 0) {
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-		skb_trim(skb, skb->len + diff);
-	} else if (diff <= skb_tailroom(skb)) {
-		skb_put(skb, diff);
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-	} else {
-		if (pskb_expand_head(skb, skb_headroom(skb), diff, pri))
-			return -ENOMEM;
-		skb_put(skb, diff);
-		memmove(skb->data + o_offset + n_len,
-			skb->data + o_offset + o_len, o_left);
-		skb_copy_to_linear_data_offset(skb, o_offset, n_buf, n_len);
-	}
-
-	/* must update the iph total length here */
-	ip_hdr(skb)->tot_len = htons(skb->len);
-
-	LeaveFunction(9);
-	return 0;
-}
-
-
 int __init ip_vs_app_init(void)
 {
 	/* we will replace it with proc_net_ipvs_create() soon */
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-07-04 20:32:05.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-07-04 20:32:06.000000000 +0900
@@ -54,7 +54,6 @@
 
 EXPORT_SYMBOL(register_ip_vs_scheduler);
 EXPORT_SYMBOL(unregister_ip_vs_scheduler);
-EXPORT_SYMBOL(ip_vs_skb_replace);
 EXPORT_SYMBOL(ip_vs_proto_name);
 EXPORT_SYMBOL(ip_vs_conn_new);
 EXPORT_SYMBOL(ip_vs_conn_in_get);
Index: nf-next-2.6/net/netfilter/ipvs/ip_vs_ftp.c
===================================================================
--- nf-next-2.6.orig/net/netfilter/ipvs/ip_vs_ftp.c	2010-07-04 20:30:19.000000000 +0900
+++ nf-next-2.6/net/netfilter/ipvs/ip_vs_ftp.c	2010-07-04 20:32:06.000000000 +0900
@@ -20,6 +20,17 @@
  *
  * Author:	Wouter Gadeyne
  *
+ *
+ * Code for ip_vs_expect_related and ip_vs_expect_callback is taken from
+ * http://www.ssi.bg/~ja/nfct/:
+ *
+ * ip_vs_nfct.c:	Netfilter connection tracking support for IPVS
+ *
+ * Portions Copyright (C) 2001-2002
+ * Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland.
+ *
+ * Portions Copyright (C) 2003-2008
+ * Julian Anastasov
  */
 
 #define KMSG_COMPONENT "IPVS"
@@ -32,6 +43,9 @@
 #include <linux/in.h>
 #include <linux/ip.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_nat_helper.h>
 #include <linux/gfp.h>
 #include <net/protocol.h>
 #include <net/tcp.h>
@@ -43,6 +57,16 @@
 #define SERVER_STRING "227 Entering Passive Mode ("
 #define CLIENT_STRING "PORT "
 
+#define FMT_TUPLE	"%pI4:%u->%pI4:%u/%u"
+#define ARG_TUPLE(T)	(T)->src.u3.ip, ntohs((T)->src.u.all), \
+			(T)->dst.u3.ip, ntohs((T)->dst.u.all), \
+			(T)->dst.protonum
+
+#define FMT_CONN	"%pI4:%u->%pI4:%u->%pI4:%u/%u:%u"
+#define ARG_CONN(C)	(C)->caddr, ntohs((C)->cport), \
+			(C)->vaddr, ntohs((C)->vport), \
+			(C)->daddr, ntohs((C)->dport), \
+			(C)->protocol, (C)->state
 
 /*
  * List of ports (up to IP_VS_APP_MAX_PORTS) to be handled by helper
@@ -123,6 +147,119 @@ static int ip_vs_ftp_get_addrport(char *
 	return 1;
 }
 
+/*
+ * Called from init_conntrack() as expectfn handler.
+ */
+static void
+ip_vs_expect_callback(struct nf_conn *ct,
+		      struct nf_conntrack_expect *exp)
+{
+	struct nf_conntrack_tuple *orig, new_reply;
+	struct ip_vs_conn *cp;
+
+	if (exp->tuple.src.l3num != PF_INET)
+		return;
+
+	/*
+	 * We assume that no NF locks are held before this callback.
+	 * ip_vs_conn_out_get and ip_vs_conn_in_get should match their
+	 * expectations even if they use wildcard values, now we provide the
+	 * actual values from the newly created original conntrack direction.
+	 * The conntrack is confirmed when packet reaches IPVS hooks.
+	 */
+
+	/* RS->CLIENT */
+	orig = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+	cp = ip_vs_conn_out_get(exp->tuple.src.l3num, orig->dst.protonum,
+				&orig->src.u3, orig->src.u.tcp.port,
+				&orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply CLIENT->RS to CLIENT->VS */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found inout cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.dst.u3 = cp->vaddr;
+		new_reply.dst.u.tcp.port = cp->vport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", " FMT_TUPLE
+			  ", inout cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	/* CLIENT->VS */
+	cp = ip_vs_conn_in_get(exp->tuple.src.l3num, orig->dst.protonum,
+			       &orig->src.u3, orig->src.u.tcp.port,
+			       &orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply VS->CLIENT to RS->CLIENT */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found outin cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.src.u3 = cp->daddr;
+		new_reply.src.u.tcp.port = cp->dport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", outin cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuple=" FMT_TUPLE
+		  " - unknown expect\n",
+		  __func__, ct, ct->status, ARG_TUPLE(orig));
+	return;
+
+alter:
+	/* Never alter conntrack for non-NAT conns */
+	if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_MASQ)
+		nf_conntrack_alter_reply(ct, &new_reply);
+	ip_vs_conn_put(cp);
+	return;
+}
+
+/*
+ * Create NF conntrack expectation with wildcard (optional) source port.
+ * Then the default callback function will alter the reply and will confirm
+ * the conntrack entry when the first packet comes.
+ */
+static void
+ip_vs_expect_related(struct sk_buff *skb, struct nf_conn *ct,
+		     struct ip_vs_conn *cp, u_int8_t proto,
+		     const __be16 *port, int from_rs)
+{
+	struct nf_conntrack_expect *exp;
+
+	BUG_ON(!ct || ct == &nf_conntrack_untracked);
+
+	exp = nf_ct_expect_alloc(ct);
+	if (!exp)
+		return;
+
+	if (from_rs)
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->daddr, &cp->caddr,
+				  proto, port, &cp->cport);
+	else
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->caddr, &cp->vaddr,
+				  proto, port, &cp->vport);
+
+	exp->expectfn = ip_vs_expect_callback;
+
+	IP_VS_DBG(7, "%s(): ct=%p, expect tuple=" FMT_TUPLE "\n",
+		  __func__, ct, ARG_TUPLE(&exp->tuple));
+	nf_ct_expect_related(exp);
+	nf_ct_expect_put(exp);
+}
 
 /*
  * Look at outgoing ftp packets to catch the response to a PASV command
@@ -147,9 +284,11 @@ static int ip_vs_ftp_out(struct ip_vs_ap
 	union nf_inet_addr from;
 	__be16 port;
 	struct ip_vs_conn *n_cp;
-	char buf[24];		/* xxx.xxx.xxx.xxx,ppp,ppp\000 */
+	char buf[sizeof("xxx,xxx,xxx,xxx,ppp,ppp")];
 	unsigned buf_len;
 	int ret;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -219,19 +358,23 @@ static int ip_vs_ftp_out(struct ip_vs_ap
 
 		buf_len = strlen(buf);
 
+		ct = nf_ct_get(skb, &ctinfo);
+		ret = nf_nat_mangle_tcp_packet(skb,
+					       ct,
+					       ctinfo,
+					       start-data,
+					       end-start,
+					       buf,
+					       buf_len);
+
+		if (ct && ct != &nf_conntrack_untracked)
+			ip_vs_expect_related(skb, ct, n_cp,
+					     IPPROTO_TCP, NULL, 0);
+
 		/*
-		 * Calculate required delta-offset to keep TCP happy
+		 * Not setting 'diff' is intentional, otherwise the sequence
+		 * would be adjusted twice.
 		 */
-		*diff = buf_len - (end-start);
-
-		if (*diff == 0) {
-			/* simply replace it with new passive address */
-			memcpy(start, buf, buf_len);
-			ret = 1;
-		} else {
-			ret = !ip_vs_skb_replace(skb, GFP_ATOMIC, start,
-					  end-start, buf, buf_len);
-		}
 
 		cp->app_data = NULL;
 		ip_vs_tcp_conn_listen(n_cp);
@@ -263,6 +406,7 @@ static int ip_vs_ftp_in(struct ip_vs_app
 	union nf_inet_addr to;
 	__be16 port;
 	struct ip_vs_conn *n_cp;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -349,6 +493,11 @@ static int ip_vs_ftp_in(struct ip_vs_app
 		ip_vs_control_add(n_cp, cp);
 	}
 
+	ct = (struct nf_conn *)skb->nfct;
+	if (ct && ct != &nf_conntrack_untracked)
+		ip_vs_expect_related(skb, ct, n_cp,
+				     IPPROTO_TCP, &n_cp->dport, 1);
+
 	/*
 	 *	Move tunnel to listen state
 	 */


^ permalink raw reply

* [patch v2.3 4/4] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Simon Horman @ 2010-07-04 11:32 UTC (permalink / raw)
  To: lvs-devel, netdev, linux-kernel, netfilter, netfilter-devel
  Cc: Malcolm Turnbull, Wensong Zhang, Julius Volz, Patrick McHardy,
	David S. Miller, Hannes Eder
In-Reply-To: <20100704113246.562399500@vergenet.net>

[-- Attachment #1: libxt_ipvs-user-space-lib-for-netfilter-matcher-xt_ipvs.patch --]
[-- Type: text/plain, Size: 13930 bytes --]

From:	Hannes Eder <heder@google.com>

The user-space library for the netfilter matcher xt_ipvs.

[ trivial up-port by Simon Horman <horms@verge.net.au> ]
Signed-off-by: Hannes Eder <heder@google.com>
Acked-by: Simon Horman <horms@verge.net.au>

 configure.ac                      |   10 -
 extensions/libxt_ipvs.c           |  365 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   24 ++
 include/linux/netfilter/xt_ipvs.h |   25 +++
 4 files changed, 422 insertions(+), 2 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h

v2.1, v2.3
Trivial up-port

v2.2
No change

Index: iptables/configure.ac
===================================================================
--- iptables.orig/configure.ac	2010-07-04 20:21:07.000000000 +0900
+++ iptables/configure.ac	2010-07-04 20:23:30.000000000 +0900
@@ -52,12 +52,18 @@ AC_ARG_WITH([pkgconfigdir], AS_HELP_STRI
 	[Path to the pkgconfig directory [[LIBDIR/pkgconfig]]]),
 	[pkgconfigdir="$withval"], [pkgconfigdir='${libdir}/pkgconfig'])
 
-AC_CHECK_HEADER([linux/dccp.h])
-
 blacklist_modules="";
+
+AC_CHECK_HEADER([linux/dccp.h])
 if test "$ac_cv_header_linux_dccp_h" != "yes"; then
 	blacklist_modules="$blacklist_modules dccp";
 fi;
+
+AC_CHECK_HEADER([linux/ip_vs.h])
+if test "$ac_cv_header_linux_ip_vs_h" != "yes"; then
+	blacklist_modules="$blacklist_modules ipvs";
+fi;
+
 AC_SUBST([blacklist_modules])
 
 AM_CONDITIONAL([ENABLE_STATIC], [test "$enable_static" = "yes"])
Index: iptables/extensions/libxt_ipvs.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ iptables/extensions/libxt_ipvs.c	2010-07-04 20:24:01.000000000 +0900
@@ -0,0 +1,365 @@
+/*
+ * Shared library add-on to iptables to add IPVS matching.
+ *
+ * Detailed doc is in the kernel module source net/netfilter/xt_ipvs.c
+ *
+ * Author: Hannes Eder <heder@google.com>
+ */
+#include <sys/types.h>
+#include <assert.h>
+#include <ctype.h>
+#include <errno.h>
+#include <getopt.h>
+#include <netdb.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <xtables.h>
+#include <linux/ip_vs.h>
+#include <linux/netfilter/xt_ipvs.h>
+
+static const struct option ipvs_mt_opts[] = {
+	{ .name = "ipvs",     .has_arg = false, .val = '0' },
+	{ .name = "vproto",   .has_arg = true,  .val = '1' },
+	{ .name = "vaddr",    .has_arg = true,  .val = '2' },
+	{ .name = "vport",    .has_arg = true,  .val = '3' },
+	{ .name = "vdir",     .has_arg = true,  .val = '4' },
+	{ .name = "vmethod",  .has_arg = true,  .val = '5' },
+	{ .name = "vportctl", .has_arg = true,  .val = '6' },
+	{ .name = NULL }
+};
+
+static void ipvs_mt_help(void)
+{
+	printf(
+"IPVS match options:\n"
+"[!] --ipvs                      packet belongs to an IPVS connection\n"
+"\n"
+"Any of the following options implies --ipvs (even negated)\n"
+"[!] --vproto protocol           VIP protocol to match; by number or name,\n"
+"                                e.g. \"tcp\"\n"
+"[!] --vaddr address[/mask]      VIP address to match\n"
+"[!] --vport port                VIP port to match; by number or name,\n"
+"                                e.g. \"http\"\n"
+"    --vdir {ORIGINAL|REPLY}     flow direction of packet\n"
+"[!] --vmethod {GATE|IPIP|MASQ}  IPVS forwarding method used\n"
+"[!] --vportctl port             VIP port of the controlling connection to\n"
+"                                match, e.g. 21 for FTP\n"
+		);
+}
+
+static void ipvs_mt_parse_addr_and_mask(const char *arg,
+					union nf_inet_addr *address,
+					union nf_inet_addr *mask,
+					unsigned int family)
+{
+	struct in_addr *addr = NULL;
+	struct in6_addr *addr6 = NULL;
+	unsigned int naddrs = 0;
+
+	if (family == NFPROTO_IPV4) {
+		xtables_ipparse_any(arg, &addr, &mask->in, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in, addr, sizeof(*addr));
+	} else if (family == NFPROTO_IPV6) {
+		xtables_ip6parse_any(arg, &addr6, &mask->in6, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in6, addr6, sizeof(*addr6));
+	} else {
+		/* Hu? */
+		assert(false);
+	}
+}
+
+/* Function which parses command options; returns true if it ate an option */
+static int ipvs_mt_parse(int c, char **argv, int invert, unsigned int *flags,
+			 const void *entry, struct xt_entry_match **match,
+			 unsigned int family)
+{
+	struct xt_ipvs_mtinfo *data = (void *)(*match)->data;
+	char *p = NULL;
+	u_int8_t op = 0;
+
+	if ('0' <= c && c <= '6') {
+		static const int ops[] = {
+			XT_IPVS_IPVS_PROPERTY,
+			XT_IPVS_PROTO,
+			XT_IPVS_VADDR,
+			XT_IPVS_VPORT,
+			XT_IPVS_DIR,
+			XT_IPVS_METHOD,
+			XT_IPVS_VPORTCTL
+		};
+		op = ops[c - '0'];
+	} else
+		return 0;
+
+	if (*flags & op & XT_IPVS_ONCE_MASK)
+		goto multiple_use;
+
+	switch (c) {
+	case '0': /* --ipvs */
+		/* Nothing to do here. */
+		break;
+
+	case '1': /* --vproto */
+		/* Canonicalize into lower case */
+		for (p = optarg; *p != '\0'; ++p)
+			*p = tolower(*p);
+
+		data->l4proto = xtables_parse_protocol(optarg);
+		break;
+
+	case '2': /* --vaddr */
+		ipvs_mt_parse_addr_and_mask(optarg, &data->vaddr,
+					    &data->vmask, family);
+		break;
+
+	case '3': /* --vport */
+		data->vport = htons(xtables_parse_port(optarg, "tcp"));
+		break;
+
+	case '4': /* --vdir */
+		xtables_param_act(XTF_NO_INVERT, "ipvs", "--vdir", invert);
+		if (strcasecmp(optarg, "ORIGINAL") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert   &= ~XT_IPVS_DIR;
+		} else if (strcasecmp(optarg, "REPLY") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert  |= XT_IPVS_DIR;
+		} else {
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vdir", optarg);
+		}
+		break;
+
+	case '5': /* --vmethod */
+		if (strcasecmp(optarg, "GATE") == 0)
+			data->fwd_method = IP_VS_CONN_F_DROUTE;
+		else if (strcasecmp(optarg, "IPIP") == 0)
+			data->fwd_method = IP_VS_CONN_F_TUNNEL;
+		else if (strcasecmp(optarg, "MASQ") == 0)
+			data->fwd_method = IP_VS_CONN_F_MASQ;
+		else
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vmethod", optarg);
+		break;
+
+	case '6': /* --vportctl */
+		data->vportctl = htons(xtables_parse_port(optarg, "tcp"));
+		break;
+
+	default:
+		/* Hu? How did we come here? */
+		assert(false);
+		return 0;
+	}
+
+	if (op & XT_IPVS_ONCE_MASK) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			xtables_error(PARAMETER_PROBLEM,
+				      "! --ipvs cannot be together with"
+				      " other options");
+		data->bitmask |= XT_IPVS_IPVS_PROPERTY;
+	}
+
+	data->bitmask |= op;
+	if (invert)
+		data->invert |= op;
+	*flags |= op;
+	return 1;
+
+multiple_use:
+	xtables_error(PARAMETER_PROBLEM,
+		      "multiple use of the same IPVS option is not allowed");
+}
+
+static int ipvs_mt4_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV4);
+}
+
+static int ipvs_mt6_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV6);
+}
+
+static void ipvs_mt_check(unsigned int flags)
+{
+	if (flags == 0)
+		xtables_error(PARAMETER_PROBLEM,
+			      "IPVS: At least one option is required");
+}
+
+/* Shamelessly copied from libxt_conntrack.c */
+static void ipvs_mt_dump_addr(const union nf_inet_addr *addr,
+			      const union nf_inet_addr *mask,
+			      unsigned int family, bool numeric)
+{
+	char buf[BUFSIZ];
+
+	if (family == NFPROTO_IPV4) {
+		if (!numeric && addr->ip == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ipaddr_to_numeric(&addr->in));
+		else
+			strcpy(buf, xtables_ipaddr_to_anyname(&addr->in));
+		strcat(buf, xtables_ipmask_to_numeric(&mask->in));
+		printf("%s ", buf);
+	} else if (family == NFPROTO_IPV6) {
+		if (!numeric && addr->ip6[0] == 0 && addr->ip6[1] == 0 &&
+		    addr->ip6[2] == 0 && addr->ip6[3] == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ip6addr_to_numeric(&addr->in6));
+		else
+			strcpy(buf, xtables_ip6addr_to_anyname(&addr->in6));
+		strcat(buf, xtables_ip6mask_to_numeric(&mask->in6));
+		printf("%s ", buf);
+	}
+}
+
+static void ipvs_mt_dump(const void *ip, const struct xt_ipvs_mtinfo *data,
+			 unsigned int family, bool numeric, const char *prefix)
+{
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			printf("! ");
+		printf("%sipvs ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_PROTO) {
+		if (data->invert & XT_IPVS_PROTO)
+			printf("! ");
+		printf("%sproto %u ", prefix, data->l4proto);
+	}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (data->invert & XT_IPVS_VADDR)
+			printf("! ");
+
+		printf("%svaddr ", prefix);
+		ipvs_mt_dump_addr(&data->vaddr, &data->vmask, family, numeric);
+	}
+
+	if (data->bitmask & XT_IPVS_VPORT) {
+		if (data->invert & XT_IPVS_VPORT)
+			printf("! ");
+
+		printf("%svport %u ", prefix, ntohs(data->vport));
+	}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		if (data->invert & XT_IPVS_DIR)
+			printf("%svdir REPLY ", prefix);
+		else
+			printf("%svdir ORIGINAL ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD) {
+		if (data->invert & XT_IPVS_METHOD)
+			printf("! ");
+
+		printf("%svmethod ", prefix);
+		switch (data->fwd_method) {
+		case IP_VS_CONN_F_DROUTE:
+			printf("GATE ");
+			break;
+		case IP_VS_CONN_F_TUNNEL:
+			printf("IPIP ");
+			break;
+		case IP_VS_CONN_F_MASQ:
+			printf("MASQ ");
+			break;
+		default:
+			/* Hu? */
+			printf("UNKNOWN ");
+			break;
+		}
+	}
+
+	if (data->bitmask & XT_IPVS_VPORTCTL) {
+		if (data->invert & XT_IPVS_VPORTCTL)
+			printf("! ");
+
+		printf("%svportctl %u ", prefix, ntohs(data->vportctl));
+	}
+}
+
+static void ipvs_mt4_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, numeric, "");
+}
+
+static void ipvs_mt6_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, numeric, "");
+}
+
+static void ipvs_mt4_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, true, "--");
+}
+
+static void ipvs_mt6_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, true, "--");
+}
+
+static struct xtables_match ipvs_matches_reg[] = {
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV4,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt4_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt4_print,
+		.save          = ipvs_mt4_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV6,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt6_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt6_print,
+		.save          = ipvs_mt6_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+};
+
+void _init(void)
+{
+	xtables_register_matches(ipvs_matches_reg,
+				 ARRAY_SIZE(ipvs_matches_reg));
+}
Index: iptables/extensions/libxt_ipvs.man
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ iptables/extensions/libxt_ipvs.man	2010-07-04 20:23:30.000000000 +0900
@@ -0,0 +1,24 @@
+Match IPVS connection properties.
+.TP
+[\fB!\fR] \fB\-\-ipvs\fP
+packet belongs to an IPVS connection
+.TP
+Any of the following options implies \-\-ipvs (even negated)
+.TP
+[\fB!\fR] \fB\-\-vproto\fP \fIprotocol\fP
+VIP protocol to match; by number or name, e.g. "tcp"
+.TP
+[\fB!\fR] \fB\-\-vaddr\fP \fIaddress\fP[\fB/\fP\fImask\fP]
+VIP address to match
+.TP
+[\fB!\fR] \fB\-\-vport\fP \fIport\fP
+VIP port to match; by number or name, e.g. "http"
+.TP
+\fB\-\-vdir\fP {\fBORIGINAL\fP|\fBREPLY\fP}
+flow direction of packet
+.TP
+[\fB!\fR] \fB\-\-vmethod\fP {\fBGATE\fP|\fBIPIP\fP|\fBMASQ\fP}
+IPVS forwarding method used
+.TP
+[\fB!\fR] \fB\-\-vportctl\fP \fIport\fP
+VIP port of the controlling connection to match, e.g. 21 for FTP
Index: iptables/include/linux/netfilter/xt_ipvs.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ iptables/include/linux/netfilter/xt_ipvs.h	2010-07-04 20:23:30.000000000 +0900
@@ -0,0 +1,25 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H 1
+
+#define XT_IPVS_IPVS_PROPERTY	(1 << 0) /* all other options imply this one */
+#define XT_IPVS_PROTO		(1 << 1)
+#define XT_IPVS_VADDR		(1 << 2)
+#define XT_IPVS_VPORT		(1 << 3)
+#define XT_IPVS_DIR		(1 << 4)
+#define XT_IPVS_METHOD		(1 << 5)
+#define XT_IPVS_VPORTCTL	(1 << 6)
+#define XT_IPVS_MASK		((1 << 7) - 1)
+#define XT_IPVS_ONCE_MASK	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
+
+struct xt_ipvs_mtinfo {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u16			l4proto;
+	__u16			fwd_method;
+	__be16			vportctl;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */


^ permalink raw reply

* RE: [PATCH] bnx2x: add support for receive hashing
From: Vladislav Zolotarov @ 2010-07-04 16:36 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev@vger.kernel.org
In-Reply-To: <alpine.DEB.1.00.1004222249400.27016@pokey.mtv.corp.google.com>

Tom, could u, pls., explain what did u mean by taking the (RSS) flags configuration out of RSS "if"? To recall "if(is_multi(bp))" is true iff RSS is enabled.

Thanks,
vlad

> @@ -5750,10 +5757,10 @@ static void bnx2x_init_internal_func(struct bnx2x
> *bp)
>  	u32 offset;
>  	u16 max_agg_size;
> 
> -	if (is_multi(bp)) {
> -		tstorm_config.config_flags = MULTI_FLAGS(bp);
> +	tstorm_config.config_flags = RSS_FLAGS(bp);
> +
> +	if (is_multi(bp))
>  		tstorm_config.rss_result_mask = MULTI_MASK;
> -	}
> 
>  	/* Enable TPA if needed */
>  	if (bp->flags & TPA_ENABLE_FLAG)


> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> Behalf Of Tom Herbert
> Sent: Friday, April 23, 2010 8:54 AM
> To: davem@davemloft.net; netdev@vger.kernel.org
> Subject: [PATCH] bnx2x: add support for receive hashing
> 
> Add support to bnx2x to extract Toeplitz hash out of the receive descriptor
> for use in skb->rxhash.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 0819530..8bd2368 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -1330,7 +1330,7 @@ static inline u32 reg_poll(struct bnx2x *bp, u32 reg,
> u32 expected, int ms,
>  		AEU_INPUTS_ATTN_BITS_MCP_LATCHED_UMP_TX_PARITY | \
>  		AEU_INPUTS_ATTN_BITS_MCP_LATCHED_SCPAD_PARITY)
> 
> -#define MULTI_FLAGS(bp) \
> +#define RSS_FLAGS(bp) \
>  		(TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_CAPABILITY | \
>  		 TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_TCP_CAPABILITY | \
>  		 TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV6_CAPABILITY | \
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index 0c6dba2..613f727 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -1582,7 +1582,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int
> budget)
>  		struct sw_rx_bd *rx_buf = NULL;
>  		struct sk_buff *skb;
>  		union eth_rx_cqe *cqe;
> -		u8 cqe_fp_flags;
> +		u8 cqe_fp_flags, cqe_fp_status_flags;
>  		u16 len, pad;
> 
>  		comp_ring_cons = RCQ_BD(sw_comp_cons);
> @@ -1598,6 +1598,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int
> budget)
> 
>  		cqe = &fp->rx_comp_ring[comp_ring_cons];
>  		cqe_fp_flags = cqe->fast_path_cqe.type_error_flags;
> +		cqe_fp_status_flags = cqe->fast_path_cqe.status_flags;
> 
>  		DP(NETIF_MSG_RX_STATUS, "CQE type %x  err %x  status %x"
>  		   "  queue %x  vlan %x  len %u\n", CQE_TYPE(cqe_fp_flags),
> @@ -1727,6 +1728,12 @@ reuse_rx:
> 
>  			skb->protocol = eth_type_trans(skb, bp->dev);
> 
> +			if ((bp->dev->features & ETH_FLAG_RXHASH) &&
> +			    (cqe_fp_status_flags &
> +			     ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> +				skb->rxhash = le32_to_cpu(
> +				    cqe->fast_path_cqe.rss_hash_result);
> +
>  			skb->ip_summed = CHECKSUM_NONE;
>  			if (bp->rx_csum) {
>  				if (likely(BNX2X_RX_CSUM_OK(cqe)))
> @@ -5750,10 +5757,10 @@ static void bnx2x_init_internal_func(struct bnx2x
> *bp)
>  	u32 offset;
>  	u16 max_agg_size;
> 
> -	if (is_multi(bp)) {
> -		tstorm_config.config_flags = MULTI_FLAGS(bp);
> +	tstorm_config.config_flags = RSS_FLAGS(bp);
> +
> +	if (is_multi(bp))
>  		tstorm_config.rss_result_mask = MULTI_MASK;
> -	}
> 
>  	/* Enable TPA if needed */
>  	if (bp->flags & TPA_ENABLE_FLAG)
> @@ -6629,10 +6636,8 @@ static int bnx2x_init_common(struct bnx2x *bp)
>  	bnx2x_init_block(bp, PBF_BLOCK, COMMON_STAGE);
> 
>  	REG_WR(bp, SRC_REG_SOFT_RST, 1);
> -	for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4) {
> -		REG_WR(bp, i, 0xc0cac01a);
> -		/* TODO: replace with something meaningful */
> -	}
> +	for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4)
> +		REG_WR(bp, i, random32());
>  	bnx2x_init_block(bp, SRCH_BLOCK, COMMON_STAGE);
>  #ifdef BCM_CNIC
>  	REG_WR(bp, SRC_REG_KEYSEARCH_0, 0x63285672);
> @@ -11001,6 +11006,11 @@ static int bnx2x_set_flags(struct net_device *dev,
> u32 data)
>  		changed = 1;
>  	}
> 
> +	if (data & ETH_FLAG_RXHASH)
> +		dev->features |= NETIF_F_RXHASH;
> +	else
> +		dev->features &= ~NETIF_F_RXHASH;
> +
>  	if (changed && netif_running(dev)) {
>  		bnx2x_nic_unload(bp, UNLOAD_NORMAL);
>  		rc = bnx2x_nic_load(bp, LOAD_NORMAL);
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply

* RE: [PATCH] bnx2x: add support for receive hashing
From: Vladislav Zolotarov @ 2010-07-04 16:46 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev@vger.kernel.org
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDE646FBFB@SJEXCHCCR02.corp.ad.broadcom.com>

Is there any reason not to fill skb->rxhash for LROed packets?

Thanks,
vlad

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> Behalf Of Vladislav Zolotarov
> Sent: Sunday, July 04, 2010 7:36 PM
> To: Tom Herbert
> Cc: netdev@vger.kernel.org
> Subject: RE: [PATCH] bnx2x: add support for receive hashing
> 
> Tom, could u, pls., explain what did u mean by taking the (RSS) flags
> configuration out of RSS "if"? To recall "if(is_multi(bp))" is true iff RSS
> is enabled.
> 
> Thanks,
> vlad
> 
> > @@ -5750,10 +5757,10 @@ static void bnx2x_init_internal_func(struct bnx2x
> > *bp)
> >  	u32 offset;
> >  	u16 max_agg_size;
> >
> > -	if (is_multi(bp)) {
> > -		tstorm_config.config_flags = MULTI_FLAGS(bp);
> > +	tstorm_config.config_flags = RSS_FLAGS(bp);
> > +
> > +	if (is_multi(bp))
> >  		tstorm_config.rss_result_mask = MULTI_MASK;
> > -	}
> >
> >  	/* Enable TPA if needed */
> >  	if (bp->flags & TPA_ENABLE_FLAG)
> 
> 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> > Behalf Of Tom Herbert
> > Sent: Friday, April 23, 2010 8:54 AM
> > To: davem@davemloft.net; netdev@vger.kernel.org
> > Subject: [PATCH] bnx2x: add support for receive hashing
> >
> > Add support to bnx2x to extract Toeplitz hash out of the receive descriptor
> > for use in skb->rxhash.
> >
> > Signed-off-by: Tom Herbert <therbert@google.com>
> > ---
> > diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> > index 0819530..8bd2368 100644
> > --- a/drivers/net/bnx2x.h
> > +++ b/drivers/net/bnx2x.h
> > @@ -1330,7 +1330,7 @@ static inline u32 reg_poll(struct bnx2x *bp, u32 reg,
> > u32 expected, int ms,
> >  		AEU_INPUTS_ATTN_BITS_MCP_LATCHED_UMP_TX_PARITY | \
> >  		AEU_INPUTS_ATTN_BITS_MCP_LATCHED_SCPAD_PARITY)
> >
> > -#define MULTI_FLAGS(bp) \
> > +#define RSS_FLAGS(bp) \
> >  		(TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_CAPABILITY | \
> >  		 TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_TCP_CAPABILITY | \
> >  		 TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV6_CAPABILITY | \
> > diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> > index 0c6dba2..613f727 100644
> > --- a/drivers/net/bnx2x_main.c
> > +++ b/drivers/net/bnx2x_main.c
> > @@ -1582,7 +1582,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp,
> int
> > budget)
> >  		struct sw_rx_bd *rx_buf = NULL;
> >  		struct sk_buff *skb;
> >  		union eth_rx_cqe *cqe;
> > -		u8 cqe_fp_flags;
> > +		u8 cqe_fp_flags, cqe_fp_status_flags;
> >  		u16 len, pad;
> >
> >  		comp_ring_cons = RCQ_BD(sw_comp_cons);
> > @@ -1598,6 +1598,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp,
> int
> > budget)
> >
> >  		cqe = &fp->rx_comp_ring[comp_ring_cons];
> >  		cqe_fp_flags = cqe->fast_path_cqe.type_error_flags;
> > +		cqe_fp_status_flags = cqe->fast_path_cqe.status_flags;
> >
> >  		DP(NETIF_MSG_RX_STATUS, "CQE type %x  err %x  status %x"
> >  		   "  queue %x  vlan %x  len %u\n", CQE_TYPE(cqe_fp_flags),
> > @@ -1727,6 +1728,12 @@ reuse_rx:
> >
> >  			skb->protocol = eth_type_trans(skb, bp->dev);
> >
> > +			if ((bp->dev->features & ETH_FLAG_RXHASH) &&
> > +			    (cqe_fp_status_flags &
> > +			     ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> > +				skb->rxhash = le32_to_cpu(
> > +				    cqe->fast_path_cqe.rss_hash_result);
> > +
> >  			skb->ip_summed = CHECKSUM_NONE;
> >  			if (bp->rx_csum) {
> >  				if (likely(BNX2X_RX_CSUM_OK(cqe)))
> > @@ -5750,10 +5757,10 @@ static void bnx2x_init_internal_func(struct bnx2x
> > *bp)
> >  	u32 offset;
> >  	u16 max_agg_size;
> >
> > -	if (is_multi(bp)) {
> > -		tstorm_config.config_flags = MULTI_FLAGS(bp);
> > +	tstorm_config.config_flags = RSS_FLAGS(bp);
> > +
> > +	if (is_multi(bp))
> >  		tstorm_config.rss_result_mask = MULTI_MASK;
> > -	}
> >
> >  	/* Enable TPA if needed */
> >  	if (bp->flags & TPA_ENABLE_FLAG)
> > @@ -6629,10 +6636,8 @@ static int bnx2x_init_common(struct bnx2x *bp)
> >  	bnx2x_init_block(bp, PBF_BLOCK, COMMON_STAGE);
> >
> >  	REG_WR(bp, SRC_REG_SOFT_RST, 1);
> > -	for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4) {
> > -		REG_WR(bp, i, 0xc0cac01a);
> > -		/* TODO: replace with something meaningful */
> > -	}
> > +	for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4)
> > +		REG_WR(bp, i, random32());
> >  	bnx2x_init_block(bp, SRCH_BLOCK, COMMON_STAGE);
> >  #ifdef BCM_CNIC
> >  	REG_WR(bp, SRC_REG_KEYSEARCH_0, 0x63285672);
> > @@ -11001,6 +11006,11 @@ static int bnx2x_set_flags(struct net_device *dev,
> > u32 data)
> >  		changed = 1;
> >  	}
> >
> > +	if (data & ETH_FLAG_RXHASH)
> > +		dev->features |= NETIF_F_RXHASH;
> > +	else
> > +		dev->features &= ~NETIF_F_RXHASH;
> > +
> >  	if (changed && netif_running(dev)) {
> >  		bnx2x_nic_unload(bp, UNLOAD_NORMAL);
> >  		rc = bnx2x_nic_load(bp, LOAD_NORMAL);
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply

* Re: [PATCH 0/4] Introduce and use printk pointer extension %pV
From: David Miller @ 2010-07-04 17:40 UTC (permalink / raw)
  To: greg; +Cc: joe, akpm, linux-kernel, netdev
In-Reply-To: <20100703160857.GA29043@kroah.com>

From: Greg KH <greg@kroah.com>
Date: Sat, 3 Jul 2010 09:08:57 -0700

> On Fri, Jul 02, 2010 at 10:32:44PM -0700, David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Wed, 30 Jun 2010 13:07:09 -0700 (PDT)
>> 
>> > From: Joe Perches <joe@perches.com>
>> > Date: Sun, 27 Jun 2010 04:02:32 -0700
>> > 
>> >> Recursive printk can reduce the total image size of an x86 defconfig about 1% 
>> >> by reducing duplicated KERN_<level> strings and centralizing the functions
>> >> used by macros in new separate functions.
>> >> 
>> >> Joe Perches (4):
>> >>   vsprintf: Recursive vsnprintf: Add "%pV", struct va_format
>> >>   device.h drivers/base/core.c Convert dev_<level> logging macros to functions
>> >>   netdevice.h net/core/dev.c: Convert netdev_<level> logging macros to functions
>> >>   netdevice.h: Change netif_<level> macros to call netdev_<level> functions
>> > 
>> > I'm fine with this, thanks Joe.
>> > 
>> > Greg, could you ACK this and let me know if it's OK if it swings
>> > through my net-next-2.6 tree?
>> 
>> Greg, ping?
> 
> Sorry about the delay.
> 
> Yes, that's fine to take it through your tree, thanks for doing that:
> 	Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

Thanks!  I've added this set to my tree.

^ permalink raw reply

* Re: QoS weirdness : HTB accuracy
From: Andrew Beverley @ 2010-07-04 17:50 UTC (permalink / raw)
  To: Julien Vehent; +Cc: Philip A. Prindeville, Netdev, netfilter
In-Reply-To: <1276204948.1403.13.camel@andybev>

> > I was, in fact, an error in my ruleset. I had put the 'linklayer atm' at
> > both the branch and leaf levels, so the overhead was computed twice,
> > creating those holes in the bandwidth.
> 
> I am seeing similar behaviour with my setup. Am I making the same
> mistake? A subset of my rules is as follows:
> 
> 
> tc qdisc add dev ppp0 root handle 1: htb r2q 1
> 
> tc class add dev ppp0 parent 1: classid 1:1 htb \
>     rate ${DOWNLINK}kbit ceil ${DOWNLINK}kbit \
>     overhead $overhead linklayer atm                   <------- Here
> 
> tc class add dev ppp0 parent 1:1 classid 1:10 htb \
>     rate 612kbit ceil 612kbit prio 0 \
>     overhead $overhead linklayer atm                   <------- And here
> 
> tc qdisc add dev ppp0 parent 1:10 handle 4210: \
>     sfq perturb 10 limit 50
> 
> tc filter add dev ppp0 parent 1:0 protocol ip \
>     prio 10 handle 10 fw flowid 1:10

I removed the overhead option on the first leaf, and the speeds change
to what I expect. However, the rules above are taken straight from the
ADSL Optimizer project, which was the source of the original overhead
patch for tc. So is the ADSL Optimizer project wrong?

Andy



^ permalink raw reply

* Re: [PATCH net-next 1/4] bnx2: Always enable MSI-X on 5709.
From: David Miller @ 2010-07-04 18:44 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1278225738-7795-1-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Sat, 3 Jul 2010 23:42:15 -0700

> Minor change to use MSI-X even if there is only one CPU.  This allows
> the CNIC driver to always have a dedicated MSI-X vector to handle
> iSCSI events, instead of sharing the MSI vector.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 2/4] bnx2: Add support for skb->rxhash.
From: David Miller @ 2010-07-04 18:44 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1278225738-7795-2-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Sat, 3 Jul 2010 23:42:16 -0700

> Add skb->rxhash support for TCP packets only because the bnx2 RSS hash
> does not hash UDP ports.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 3/4] bnx2: Dump some config space registers during TX timeout.
From: David Miller @ 2010-07-04 18:44 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1278225738-7795-3-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Sat, 3 Jul 2010 23:42:17 -0700

> These config register values will be useful when the memory registers
> are returning 0xffffffff which has been reported.
> 
> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 4/4] bnx2: Update version to 2.0.16.
From: David Miller @ 2010-07-04 18:44 UTC (permalink / raw)
  To: mchan; +Cc: netdev
In-Reply-To: <1278225738-7795-4-git-send-email-mchan@broadcom.com>

From: "Michael Chan" <mchan@broadcom.com>
Date: Sat, 3 Jul 2010 23:42:18 -0700

> Signed-off-by: Michael Chan <mchan@broadcom.com>

Applied.

^ permalink raw reply

* Re: [PATCHv2] xfrm: fix xfrm by MARK logic
From: David Miller @ 2010-07-04 18:46 UTC (permalink / raw)
  To: eric.dumazet; +Cc: p.kosyh, netdev
In-Reply-To: <1278139083.2474.42.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 03 Jul 2010 08:38:03 +0200

> Le vendredi 02 juillet 2010 à 21:47 +0400, Peter Kosyh a écrit :
>> From: Peter Kosyh <p.kosyh@gmail.com>
>> 
>> While using xfrm by MARK feature in
>> 2.6.34 - 2.6.35 kernels, the mark 
>> is always cleared in flowi structure via memset in 
>> _decode_session4 (net/ipv4/xfrm4_policy.c), so
>> the policy lookup fails.
>> IPv6 code is affected by this bug too.
>> 
>> Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
>> ---
>> 
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks everyone.

^ permalink raw reply

* Re: [PATCH net-next-2.6] IB/{nes,ipoib}: Pass supported flags to ethtool_op_set_flags()
From: David Miller @ 2010-07-04 18:48 UTC (permalink / raw)
  To: rdreier
  Cc: bhutchings, rolandd, randy.dunlap, netdev, linux-net-drivers,
	sgruszka, amit.salecha, amwang, anirban.chakraborty, dm, scofeldm,
	vkolluri, roprabhu, e1000-devel, buytenh, gallatin, brice,
	shemminger, jgarzik, faisal.latif, chien.tin.tung, linux-rdma
In-Reply-To: <adask40nxaa.fsf@roland-alpha.cisco.com>

From: Roland Dreier <rdreier@cisco.com>
Date: Sat, 03 Jul 2010 13:08:29 -0700

>  > Following commit 1437ce3983bcbc0447a0dedcd644c14fe833d266 "ethtool:
>  > Change ethtool_op_set_flags to validate flags", ethtool_op_set_flags
>  > takes a third parameter and cannot be used directly as an
>  > implementation of ethtool_ops::set_flags.
> 
> Acked-by: Roland Dreier <rolandd@cisco.com>

Applied, thanks guys.

^ permalink raw reply

* Re: bnx2/5709: Strange interrupts spread
From: Christophe Ngo Van Duc @ 2010-07-04 20:36 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <AANLkTimTcqANb3HQuC1W6VHWTh0ixO1lXAmVrmP6OZy2@mail.gmail.com>

Is it a requirement that the interface has an IP address for the TX
and RX hash to work?

Christophe.

On Fri, Jul 2, 2010 at 7:12 PM, Christophe Ngo Van Duc
<cngovanduc@gmail.com> wrote:
> Hi
>
> Well that's the strange thing: it is IP traffic. The only difference
> with eth0 and eth1 is that eth2 and eth3 belongs to a bridge (br0).
>
> Best Regards,
> Christophe.
>
> On 7/2/10, Michael Chan <mchan@broadcom.com> wrote:
>>
>> On Fri, 2010-07-02 at 13:33 -0700, Christophe Ngo Van Duc wrote:
>>> On eth2 (external card) all interrupts goes to CPU0
>>>
>>>
>>>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
>>>   CPU6       CPU7
>>>   80:   46973077          0          0          0          0          0
>>>       0          0   PCI-MSI-edge      eth2-0
>>>   81:          0          0          0          0          0          0
>>>       0          0   PCI-MSI-edge      eth2-1
>>>   82:          0          0          0          0          0          0
>>>       0          0   PCI-MSI-edge      eth2-2
>>>   83:          0          0          0          0          0          0
>>>       0          0   PCI-MSI-edge      eth2-3
>>>   84:          0          0          0          0          0          0
>>>       0          0   PCI-MSI-edge      eth2-4
>>>   85:          0          0          0          0          0          0
>>>       0          0   PCI-MSI-edge      eth2-5
>>>   86:          0          0       2445          0         37          0
>>>    8463         13   PCI-MSI-edge      eth2-6
>>>   87:          0          0          0          0          0          0
>>>       0          0   PCI-MSI-edge      eth2-7
>>
>> Reformatted your output
>>
>>> If I understand correctly the RSS hash is used to dispatch the packets
>>> into the different queues running on the different CPU.
>>
>> It looks like most interrupts go to eth2-0, a few go to eth2-6.  The rx
>> ring for eth2-0 is for non-IP packets.  The RSS hash will hash IP
>> packets and place them on eth2-1 to eth2-7.  eth2-0 also handles tx
>> interrupts for TX ring 0.  TX traffic is hashed by the stack.
>>
>> What kind of traffic is passing through eth2?
>>
>> Thanks.
>>
>>
>>
>
> --
> Sent from my mobile device
>

^ permalink raw reply

* RE: [REGRESSION] e1000e stopped working [MANUALLY BISECTED]
From: Maxim Levitsky @ 2010-07-04 22:48 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: netdev@vger.kernel.org, Allan, Bruce W, Pieper, Jeffrey E
In-Reply-To: <1278204106.21020.0.camel@localhost.localdomain>

Did few guesses, and now I see that reverting the below commit fixes the
problem.

"e1000e: Fix/cleanup PHY reset code for ICHx/PCHx"
e98cac447cc1cc418dff1d610a5c79c4f2bdec7f.


Best regards,
	Maxim Levitsky


^ permalink raw reply

* [PATCH] ks8842: Replace usage of dev_dbg with netdev_dbg
From: Richard Röjfors @ 2010-07-04 23:26 UTC (permalink / raw)
  To: netdev; +Cc: davem

This patch replaces all usage of dev_dbg with netdev_dbg.

A side effect is that the pointer to the platform device in the adapter
struct can be removed.

Signed-off-by: Richard Röjfors <richard.rojfors@pelagicore.com>
---
diff --git a/drivers/net/ks8842.c b/drivers/net/ks8842.c
index f852ab3..d47bba9 100644
--- a/drivers/net/ks8842.c
+++ b/drivers/net/ks8842.c
@@ -119,7 +119,6 @@ struct ks8842_adapter {
 	int		irq;
 	struct tasklet_struct	tasklet;
 	spinlock_t	lock; /* spinlock to be interrupt safe */
-	struct platform_device *pdev;
 };
 
 static inline void ks8842_select_bank(struct ks8842_adapter *adapter, u16 bank)
@@ -331,8 +330,7 @@ static int ks8842_tx_frame(struct sk_buff *skb, struct net_device *netdev)
 	u32 *ptr = (u32 *)skb->data;
 	u32 ctrl;
 
-	dev_dbg(&adapter->pdev->dev,
-		"%s: len %u head %p data %p tail %p end %p\n",
+	netdev_dbg(netdev, "%s: len %u head %p data %p tail %p end %p\n",
 		__func__, skb->len, skb->head, skb->data,
 		skb_tail_pointer(skb), skb_end_pointer(skb));
 
@@ -369,15 +367,13 @@ static void ks8842_rx_frame(struct net_device *netdev,
 
 	status &= 0xffff;
 
-	dev_dbg(&adapter->pdev->dev, "%s - rx_data: status: %x\n",
-		__func__, status);
+	netdev_dbg(netdev, "%s - rx_data: status: %x\n", __func__, status);
 
 	/* check the status */
 	if ((status & RXSR_VALID) && !(status & RXSR_ERROR)) {
 		struct sk_buff *skb = netdev_alloc_skb_ip_align(netdev, len);
 
-		dev_dbg(&adapter->pdev->dev, "%s, got package, len: %d\n",
-			__func__, len);
+		netdev_dbg(netdev, "%s, got package, len: %d\n", __func__, len);
 		if (skb) {
 			u32 *data;
 
@@ -400,7 +396,7 @@ static void ks8842_rx_frame(struct net_device *netdev,
 		} else
 			netdev->stats.rx_dropped++;
 	} else {
-		dev_dbg(&adapter->pdev->dev, "RX error, status: %x\n", status);
+		netdev_dbg(netdev, "RX error, status: %x\n", status);
 		netdev->stats.rx_errors++;
 		if (status & RXSR_TOO_LONG)
 			netdev->stats.rx_length_errors++;
@@ -423,8 +419,7 @@ static void ks8842_rx_frame(struct net_device *netdev,
 void ks8842_handle_rx(struct net_device *netdev, struct ks8842_adapter *adapter)
 {
 	u16 rx_data = ks8842_read16(adapter, 16, REG_RXMIR) & 0x1fff;
-	dev_dbg(&adapter->pdev->dev, "%s Entry - rx_data: %d\n",
-		__func__, rx_data);
+	netdev_dbg(netdev, "%s Entry - rx_data: %d\n", __func__, rx_data);
 	while (rx_data) {
 		ks8842_rx_frame(netdev, adapter);
 		rx_data = ks8842_read16(adapter, 16, REG_RXMIR) & 0x1fff;
@@ -434,7 +429,7 @@ void ks8842_handle_rx(struct net_device *netdev, struct ks8842_adapter *adapter)
 void ks8842_handle_tx(struct net_device *netdev, struct ks8842_adapter *adapter)
 {
 	u16 sr = ks8842_read16(adapter, 16, REG_TXSR);
-	dev_dbg(&adapter->pdev->dev, "%s - entry, sr: %x\n", __func__, sr);
+	netdev_dbg(netdev, "%s - entry, sr: %x\n", __func__, sr);
 	netdev->stats.tx_packets++;
 	if (netif_queue_stopped(netdev))
 		netif_wake_queue(netdev);
@@ -443,7 +438,7 @@ void ks8842_handle_tx(struct net_device *netdev, struct ks8842_adapter *adapter)
 void ks8842_handle_rx_overrun(struct net_device *netdev,
 	struct ks8842_adapter *adapter)
 {
-	dev_dbg(&adapter->pdev->dev, "%s: entry\n", __func__);
+	netdev_dbg(netdev, "%s: entry\n", __func__);
 	netdev->stats.rx_errors++;
 	netdev->stats.rx_fifo_errors++;
 }
@@ -462,7 +457,7 @@ void ks8842_tasklet(unsigned long arg)
 	spin_unlock_irqrestore(&adapter->lock, flags);
 
 	isr = ks8842_read16(adapter, 18, REG_ISR);
-	dev_dbg(&adapter->pdev->dev, "%s - ISR: 0x%x\n", __func__, isr);
+	netdev_dbg(netdev, "%s - ISR: 0x%x\n", __func__, isr);
 
 	/* Ack */
 	ks8842_write16(adapter, 18, isr, REG_ISR);
@@ -501,13 +496,14 @@ void ks8842_tasklet(unsigned long arg)
 
 static irqreturn_t ks8842_irq(int irq, void *devid)
 {
-	struct ks8842_adapter *adapter = devid;
+	struct net_device *netdev = devid;
+	struct ks8842_adapter *adapter = netdev_priv(netdev);
 	u16 isr;
 	u16 entry_bank = ioread16(adapter->hw_addr + REG_SELECT_BANK);
 	irqreturn_t ret = IRQ_NONE;
 
 	isr = ks8842_read16(adapter, 18, REG_ISR);
-	dev_dbg(&adapter->pdev->dev, "%s - ISR: 0x%x\n", __func__, isr);
+	netdev_dbg(netdev, "%s - ISR: 0x%x\n", __func__, isr);
 
 	if (isr) {
 		/* disable IRQ */
@@ -532,7 +528,7 @@ static int ks8842_open(struct net_device *netdev)
 	struct ks8842_adapter *adapter = netdev_priv(netdev);
 	int err;
 
-	dev_dbg(&adapter->pdev->dev, "%s - entry\n", __func__);
+	netdev_dbg(netdev, "%s - entry\n", __func__);
 
 	/* reset the HW */
 	ks8842_reset_hw(adapter);
@@ -542,7 +538,7 @@ static int ks8842_open(struct net_device *netdev)
 	ks8842_update_link_status(netdev, adapter);
 
 	err = request_irq(adapter->irq, ks8842_irq, IRQF_SHARED, DRV_NAME,
-		adapter);
+		netdev);
 	if (err) {
 		pr_err("Failed to request IRQ: %d: %d\n", adapter->irq, err);
 		return err;
@@ -555,10 +551,10 @@ static int ks8842_close(struct net_device *netdev)
 {
 	struct ks8842_adapter *adapter = netdev_priv(netdev);
 
-	dev_dbg(&adapter->pdev->dev, "%s - entry\n", __func__);
+	netdev_dbg(netdev, "%s - entry\n", __func__);
 
 	/* free the irq */
-	free_irq(adapter->irq, adapter);
+	free_irq(adapter->irq, netdev);
 
 	/* disable the switch */
 	ks8842_write16(adapter, 32, 0x0, REG_SW_ID_AND_ENABLE);
@@ -572,7 +568,7 @@ static netdev_tx_t ks8842_xmit_frame(struct sk_buff *skb,
 	int ret;
 	struct ks8842_adapter *adapter = netdev_priv(netdev);
 
-	dev_dbg(&adapter->pdev->dev, "%s: entry\n", __func__);
+	netdev_dbg(netdev, "%s: entry\n", __func__);
 
 	ret = ks8842_tx_frame(skb, netdev);
 
@@ -588,7 +584,7 @@ static int ks8842_set_mac(struct net_device *netdev, void *p)
 	struct sockaddr *addr = p;
 	char *mac = (u8 *)addr->sa_data;
 
-	dev_dbg(&adapter->pdev->dev, "%s: entry\n", __func__);
+	netdev_dbg(netdev, "%s: entry\n", __func__);
 
 	if (!is_valid_ether_addr(addr->sa_data))
 		return -EADDRNOTAVAIL;
@@ -604,7 +600,7 @@ static void ks8842_tx_timeout(struct net_device *netdev)
 	struct ks8842_adapter *adapter = netdev_priv(netdev);
 	unsigned long flags;
 
-	dev_dbg(&adapter->pdev->dev, "%s: entry\n", __func__);
+	netdev_dbg(netdev, "%s: entry\n", __func__);
 
 	spin_lock_irqsave(&adapter->lock, flags);
 	/* disable interrupts */
@@ -663,8 +659,6 @@ static int __devinit ks8842_probe(struct platform_device *pdev)
 		goto err_get_irq;
 	}
 
-	adapter->pdev = pdev;
-
 	tasklet_init(&adapter->tasklet, ks8842_tasklet, (unsigned long)netdev);
 	spin_lock_init(&adapter->lock);
 


^ permalink raw reply related

* Re: [PATCH net-next-2.6] ipv6: adding ip_nonlocal_bind option from ipv4
From: Simon Horman @ 2010-07-05  2:03 UTC (permalink / raw)
  To: Michal Humpula; +Cc: netdev
In-Reply-To: <201007032238.28344.michal.humpula@web4u.cz>

On Sat, Jul 03, 2010 at 10:38:28PM +0200, Michal Humpula wrote:
> Adds ability to bind non-local IPv6 address the same way as for IPv4
> 
> Signed-off-by: Michal Humpula <michal.humpula@web4u.cz>
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index f350c69..27fa09a 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -962,6 +962,10 @@ bindv6only - BOOLEAN
>  		FALSE: enable IPv4-mapped address feature
>  
>  	Default: FALSE (as specified in RFC2553bis)

I think a blank line here would be nice.

> +ipv6_nonlocal_bind - BOOLEAN
> +	If set, allows processes to bind() to non-local IPv6 addresses,
> +	which can be quite useful - but may break some applications.
> +	Default: 0
>  
>  IPv6 Fragmentation:
>  
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 7bb5cb6..8957ead 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -528,6 +528,7 @@ enum {
>  	NET_IPV6_IP6FRAG_TIME=23,
>  	NET_IPV6_IP6FRAG_SECRET_INTERVAL=24,
>  	NET_IPV6_MLD_MAX_MSF=25,
> +	NET_IPV6_NONLOCAL_BIND=26
>  };
>  
>  enum {
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 1f84124..f459fcb 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -641,6 +641,8 @@ static inline int snmp6_unregister_dev(struct inet6_dev *idev) { return 0; }
>  #endif
>  
>  #ifdef CONFIG_SYSCTL
> +extern int sysctl_ipv6_nonlocal_bind;
> +
>  extern ctl_table ipv6_route_table_template[];
>  extern ctl_table ipv6_icmp_table_template[];
>  
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 1357c57..525edae 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -559,6 +559,7 @@ static const struct bin_table bin_net_ipv6_table[] = {
>  	{ CTL_DIR,	NET_IPV6_ROUTE,		"route",	bin_net_ipv6_route_table },
>  	{ CTL_DIR,	NET_IPV6_ICMP,		"icmp",		bin_net_ipv6_icmp_table },
>  	{ CTL_INT,	NET_IPV6_BINDV6ONLY,		"bindv6only" },
> +	{ CTL_INT,	NET_IPV6_NONLOCAL_BIND,		"ipv6_nonlocal_bind" },
>  	{ CTL_INT,	NET_IPV6_IP6FRAG_HIGH_THRESH,	"ip6frag_high_thresh" },
>  	{ CTL_INT,	NET_IPV6_IP6FRAG_LOW_THRESH,	"ip6frag_low_thresh" },
>  	{ CTL_INT,	NET_IPV6_IP6FRAG_TIME,		"ip6frag_time" },
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index e830cd4..55b3552 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -252,6 +252,8 @@ out_rcu_unlock:
>  	goto out;
>  }
>  
> +int sysctl_ipv6_nonlocal_bind __read_mostly;
> +EXPORT_SYMBOL(sysctl_ipv6_nonlocal_bind);
>  
>  /* bind for INET6 API */
>  int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
> @@ -345,8 +347,10 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
>  			if (!(addr_type & IPV6_ADDR_MULTICAST))	{
>  				if (!ipv6_chk_addr(net, &addr->sin6_addr,
>  						   dev, 0)) {
> -					err = -EADDRNOTAVAIL;
> -					goto out_unlock;
> +					if (!sysctl_ipv6_nonlocal_bind) {
> +						err = -EADDRNOTAVAIL;
> +						goto out_unlock;
> +					}
>  				}
>  			}
>  			rcu_read_unlock();

Perhaps the inner two if statements could be combined to remove
unnecessary nesting? And perhaps check for sysctl_ipv6_nonlocal_bind first,
as presumably its a cheaper, though less likely to succeed check.

			if (!(addr_type & IPV6_ADDR_MULTICAST))	{
				if (!sysctl_ipv6_nonlocal_bind &&
				    !ipv6_chk_addr(net, &addr->sin6_addr,
						   dev, 0)) {
					err = -EADDRNOTAVAIL;
					goto out_unlock;
				}
			}

Actually, I wonder if all three if statements could be combined.

			if (!(addr_type & IPV6_ADDR_MULTICAST) &&
		            !sysctl_ipv6_nonlocal_bind &&
			    !ipv6_chk_addr(net, &addr->sin6_addr, dev, 0)) {
				err = -EADDRNOTAVAIL;
				goto out_unlock;
			}

> diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
> index fa1d8f4..56bfe76 100644
> --- a/net/ipv6/sysctl_net_ipv6.c
> +++ b/net/ipv6/sysctl_net_ipv6.c
> @@ -35,6 +35,13 @@ static ctl_table ipv6_table_template[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +	{
> +		.procname = "ipv6_nonlocal_bind",
> +		.data   = &sysctl_ipv6_nonlocal_bind,
> +		.maxlen   = sizeof(int),
> +		.mode   = 0644,
> +		.proc_handler = proc_dointvec
> +	},
>  	{ }
>  };
>  
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2 0/4] Extend Time Stamping
From: Richard Cochran @ 2010-07-05  5:30 UTC (permalink / raw)
  To: netdev

This patch set extends the packet time stamping capabilites of the
network stack in two ways.

1. The first patch presents a work-around for the TX software time
   stamping fallback problem cited in cd4d8fdad1f1. The idea is to add
   two inline functions into each MAC driver. The functions will act
   as hooks for current (and possible future) time stamping needs,
   once they are placed correctly within each MAC driver.

2. The other patches prepare the way for PHY drivers to offer time
   stamping.

I am preparing a new round of patches for PTP support, but it will
require the changes in this patch set in order to function. Thus I
would like to have this patch set reviewed (and hopefully merged) in
order to go forward.

Thanks,
Richard

* Patch ChangeLog
** v2
   Removed the CONFIG option for the driver hooks.


Richard Cochran (4):
  net: add driver hooks for time stamping.
  phylib: add a way to make PHY time stamps possible.
  phylib: preserve ifreq parameter when calling generic phy_mii_ioctl()
  phylib: Allow reading and writing a mii bus from atomic context.

 drivers/net/arm/ixp4xx_eth.c           |    3 +-
 drivers/net/au1000_eth.c               |    2 +-
 drivers/net/bcm63xx_enet.c             |    2 +-
 drivers/net/cpmac.c                    |    5 +--
 drivers/net/dnet.c                     |    2 +-
 drivers/net/ethoc.c                    |    2 +-
 drivers/net/fec.c                      |    2 +-
 drivers/net/fec_mpc52xx.c              |    2 +-
 drivers/net/fs_enet/fs_enet-main.c     |    3 +-
 drivers/net/fsl_pq_mdio.c              |    4 +-
 drivers/net/gianfar.c                  |    2 +-
 drivers/net/macb.c                     |    2 +-
 drivers/net/mv643xx_eth.c              |    2 +-
 drivers/net/octeon/octeon_mgmt.c       |    2 +-
 drivers/net/phy/mdio_bus.c             |   45 ++++++++++++++++++---
 drivers/net/phy/phy.c                  |    8 +++-
 drivers/net/sb1250-mac.c               |    2 +-
 drivers/net/sh_eth.c                   |    2 +-
 drivers/net/smsc911x.c                 |    2 +-
 drivers/net/smsc9420.c                 |    2 +-
 drivers/net/stmmac/stmmac_main.c       |   22 ++++-------
 drivers/net/tc35815.c                  |    2 +-
 drivers/net/tg3.c                      |    2 +-
 drivers/net/ucc_geth.c                 |    2 +-
 drivers/staging/octeon/ethernet-mdio.c |    2 +-
 include/linux/phy.h                    |   24 ++++++++++-
 include/linux/skbuff.h                 |   65 ++++++++++++++++++++++++++++++++
 net/Kconfig                            |   11 +++++
 net/dsa/slave.c                        |    3 +-
 29 files changed, 175 insertions(+), 54 deletions(-)


^ permalink raw reply

* [PATCH 1/4] net: add driver hooks for time stamping.
From: Richard Cochran @ 2010-07-05  5:31 UTC (permalink / raw)
  To: netdev
In-Reply-To: <cover.1278307573.git.richard.cochran@omicron.at>

This patch adds hooks for transmit and receive time stamps. The
transmit hook allows a software fallback for transmit time stamps,
for MACs lacking time stamping hardware. The receive hook does not yet
have any effect, but it prepares the way for hardware time stamping
in PHY devices. Using the hooks will still require adding two inline
function calls to each MAC driver.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
 include/linux/skbuff.h |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac74ee0..75323fe 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -29,6 +29,7 @@
 #include <linux/rcupdate.h>
 #include <linux/dmaengine.h>
 #include <linux/hrtimer.h>
+#include <linux/phy.h>
 
 /* Don't change this without changing skb_csum_unnecessary! */
 #define CHECKSUM_NONE 0
@@ -1947,6 +1948,41 @@ static inline ktime_t net_invalid_timestamp(void)
 extern void skb_tstamp_tx(struct sk_buff *orig_skb,
 			struct skb_shared_hwtstamps *hwtstamps);
 
+static inline void sw_tx_timestamp(struct sk_buff *skb)
+{
+	union skb_shared_tx *shtx = skb_tx(skb);
+	if (shtx->software && !shtx->in_progress)
+		skb_tstamp_tx(skb, NULL);
+}
+
+/**
+ * skb_tx_timestamp() - Driver hook for transmit timestamping
+ *
+ * Ethernet MAC Drivers should call this function in their hard_xmit()
+ * function as soon as possible after giving the sk_buff to the MAC
+ * hardware, but before freeing the sk_buff.
+ *
+ * @phy: The port's phy_device. Pass NULL if this is not available.
+ * @skb: A socket buffer.
+ */
+static inline void skb_tx_timestamp(struct phy_device *phy, struct sk_buff *skb)
+{
+	sw_tx_timestamp(skb);
+}
+
+/**
+ * skb_rx_timestamp() - Driver hook for receive timestamping
+ *
+ * Ethernet MAC Drivers should call this function in their NAPI poll()
+ * function immediately before calling eth_type_trans().
+ *
+ * @phy: The port's phy_device. Pass NULL if this is not available.
+ * @skb: A socket buffer.
+ */
+static inline void skb_rx_timestamp(struct phy_device *phy, struct sk_buff *skb)
+{
+}
+
 extern __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len);
 extern __sum16 __skb_checksum_complete(struct sk_buff *skb);
 
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 2/4] phylib: add a way to make PHY time stamps possible.
From: Richard Cochran @ 2010-07-05  5:32 UTC (permalink / raw)
  To: netdev
In-Reply-To: <cover.1278307573.git.richard.cochran@omicron.at>

This patch adds a new networking option to allow hardware time stamps
from PHY devices. Using PHY time stamps will still require adding two
inline function calls to each MAC driver. The CONFIG option makes
these calls safe to add, since the calls become NOOPs when the option
is disabled.

The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl
and callbacks for transmit and receive time stamping. Drivers may
optionally implement these functions.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
 include/linux/phy.h    |    7 +++++++
 include/linux/skbuff.h |   29 +++++++++++++++++++++++++++++
 net/Kconfig            |   11 +++++++++++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index 987e111..3566bde 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -234,6 +234,8 @@ enum phy_state {
 	PHY_RESUMING
 };
 
+struct sk_buff;
+
 /* phy_device: An instance of a PHY
  *
  * drv: Pointer to the driver for this PHY instance
@@ -402,6 +404,11 @@ struct phy_driver {
 	/* Clears up any memory if needed */
 	void (*remove)(struct phy_device *phydev);
 
+	/* Handles SIOCSHWTSTAMP ioctl for hardware time stamping. */
+	int (*hwtstamp)(struct phy_device *phydev, struct ifreq *ifr);
+	int (*rxtstamp)(struct phy_device *phydev, struct sk_buff *skb);
+	int (*txtstamp)(struct phy_device *phydev, struct sk_buff *skb);
+
 	struct device_driver driver;
 };
 #define to_phy_driver(d) container_of(d, struct phy_driver, driver)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 75323fe..8905508 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1955,6 +1955,33 @@ static inline void sw_tx_timestamp(struct sk_buff *skb)
 		skb_tstamp_tx(skb, NULL);
 }
 
+#ifdef CONFIG_NETWORK_PHY_TIMESTAMPING
+
+static inline void phy_tx_timestamp(struct phy_device *phy, struct sk_buff *skb)
+{
+	union skb_shared_tx *shtx = skb_tx(skb);
+	if (shtx->hardware && phy && phy->drv->txtstamp)
+		phy->drv->txtstamp(phy, skb);
+}
+
+static inline void phy_rx_timestamp(struct phy_device *phy, struct sk_buff *skb)
+{
+	if (phy && phy->drv->rxtstamp)
+		phy->drv->rxtstamp(phy, skb);
+}
+
+#else /* CONFIG_NETWORK_PHY_TIMESTAMPING */
+
+static inline void phy_tx_timestamp(struct phy_device *phy, struct sk_buff *skb)
+{
+}
+
+static inline void phy_rx_timestamp(struct phy_device *phy, struct sk_buff *skb)
+{
+}
+
+#endif /* !CONFIG_NETWORK_PHY_TIMESTAMPING */
+
 /**
  * skb_tx_timestamp() - Driver hook for transmit timestamping
  *
@@ -1967,6 +1994,7 @@ static inline void sw_tx_timestamp(struct sk_buff *skb)
  */
 static inline void skb_tx_timestamp(struct phy_device *phy, struct sk_buff *skb)
 {
+	phy_tx_timestamp(phy, skb);
 	sw_tx_timestamp(skb);
 }
 
@@ -1981,6 +2009,7 @@ static inline void skb_tx_timestamp(struct phy_device *phy, struct sk_buff *skb)
  */
 static inline void skb_rx_timestamp(struct phy_device *phy, struct sk_buff *skb)
 {
+	phy_rx_timestamp(phy, skb);
 }
 
 extern __sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len);
diff --git a/net/Kconfig b/net/Kconfig
index 0d68b40..3fa7ae3 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -86,6 +86,17 @@ config NETWORK_SECMARK
 	  to nfmark, but designated for security purposes.
 	  If you are unsure how to answer this question, answer N.
 
+config NETWORK_PHY_TIMESTAMPING
+	bool "Timestamping in PHY devices"
+	depends on EXPERIMENTAL
+	help
+	  This allows timestamping of network packets by PHYs with
+	  hardware timestamping capabilities. This option adds some
+	  overhead in the transmit and receive paths. Note that this
+	  option also requires support in the MAC driver.
+
+	  If you are unsure how to answer this question, answer N.
+
 menuconfig NETFILTER
 	bool "Network packet filtering framework (Netfilter)"
 	---help---
-- 
1.7.0.4


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox