Netdev List
 help / color / mirror / Atom feed
* [PATCH] connector: Keep the skb in cn_callback_data
From: Philipp Reisner @ 2009-09-29 14:48 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, Lars Ellenberg, Philipp Reisner
In-Reply-To: <1254235692-1631-1-git-send-email-philipp.reisner@linbit.com>

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
---
 drivers/connector/cn_queue.c  |    3 ++-
 drivers/connector/connector.c |   11 +++++------
 include/linux/connector.h     |    4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/connector/cn_queue.c b/drivers/connector/cn_queue.c
index 4a1dfe1..b4cfac9 100644
--- a/drivers/connector/cn_queue.c
+++ b/drivers/connector/cn_queue.c
@@ -78,8 +78,9 @@ void cn_queue_wrapper(struct work_struct *work)
 	struct cn_callback_entry *cbq =
 		container_of(work, struct cn_callback_entry, work);
 	struct cn_callback_data *d = &cbq->data;
+	struct cn_msg *msg = NLMSG_DATA(nlmsg_hdr(d->skb));
 
-	d->callback(d->callback_priv);
+	d->callback(msg);
 
 	d->destruct_data(d->ddata);
 	d->ddata = NULL;
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 74f52af..fc9887f 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -129,10 +129,11 @@ EXPORT_SYMBOL_GPL(cn_netlink_send);
 /*
  * Callback helper - queues work and setup destructor for given data.
  */
-static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), void *data)
+static int cn_call_callback(struct sk_buff *skb, void (*destruct_data)(void *), void *data)
 {
 	struct cn_callback_entry *__cbq, *__new_cbq;
 	struct cn_dev *dev = &cdev;
+	struct cn_msg *msg = NLMSG_DATA(nlmsg_hdr(skb));
 	int err = -ENODEV;
 
 	spin_lock_bh(&dev->cbdev->queue_lock);
@@ -140,7 +141,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
 		if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
 			if (likely(!work_pending(&__cbq->work) &&
 					__cbq->data.ddata == NULL)) {
-				__cbq->data.callback_priv = msg;
+				__cbq->data.skb = skb;
 
 				__cbq->data.ddata = data;
 				__cbq->data.destruct_data = destruct_data;
@@ -156,7 +157,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
 				__new_cbq = kzalloc(sizeof(struct cn_callback_entry), GFP_ATOMIC);
 				if (__new_cbq) {
 					d = &__new_cbq->data;
-					d->callback_priv = msg;
+					d->skb = skb;
 					d->callback = __cbq->data.callback;
 					d->ddata = data;
 					d->destruct_data = destruct_data;
@@ -191,7 +192,6 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
  */
 static void cn_rx_skb(struct sk_buff *__skb)
 {
-	struct cn_msg *msg;
 	struct nlmsghdr *nlh;
 	int err;
 	struct sk_buff *skb;
@@ -208,8 +208,7 @@ static void cn_rx_skb(struct sk_buff *__skb)
 			return;
 		}
 
-		msg = NLMSG_DATA(nlh);
-		err = cn_call_callback(msg, (void (*)(void *))kfree_skb, skb);
+		err = cn_call_callback(skb, (void (*)(void *))kfree_skb, skb);
 		if (err < 0)
 			kfree_skb(skb);
 	}
diff --git a/include/linux/connector.h b/include/linux/connector.h
index 47ebf41..05a7a14 100644
--- a/include/linux/connector.h
+++ b/include/linux/connector.h
@@ -134,8 +134,8 @@ struct cn_callback_id {
 struct cn_callback_data {
 	void (*destruct_data) (void *);
 	void *ddata;
-	
-	void *callback_priv;
+
+	struct sk_buff *skb;
 	void (*callback) (struct cn_msg *);
 
 	void *free;
-- 
1.6.0.4

^ permalink raw reply related

* [PATCH] connector: Allow permission checking in the receiver callbacks
From: Philipp Reisner @ 2009-09-29 14:48 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel, netdev, Lars Ellenberg, Philipp Reisner

Various users of the connector should actually check if the
sender's capabilities of a netlink/connector packet are
actually sufficient for the operation they trigger. Up to
now the connector framework did not allow the kernel side
receiver to do so.

This patch set does the groundwork.

Philipp Reisner (4):
  connector: Keep the skb in cn_callback_data
  connector: Provide the sender's credentials to the callback
  connector/dm: Fixed a compilation warning
  connector: Removed the destruct_data callback since it is always
    kfree_skb()

 Documentation/connector/cn_test.c      |    2 +-
 Documentation/connector/connector.txt  |    8 ++++----
 drivers/connector/cn_queue.c           |   12 +++++++-----
 drivers/connector/connector.c          |   22 ++++++++--------------
 drivers/md/dm-log-userspace-transfer.c |    3 +--
 drivers/staging/dst/dcore.c            |    2 +-
 drivers/staging/pohmelfs/config.c      |    2 +-
 drivers/video/uvesafb.c                |    2 +-
 drivers/w1/w1_netlink.c                |    2 +-
 include/linux/connector.h              |   11 ++++-------
 10 files changed, 29 insertions(+), 37 deletions(-)

^ permalink raw reply

* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Oleg Nesterov @ 2009-09-29 14:25 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Christian Borntraeger, Evgeny Polyakov, Scott James Remnant,
	Linux Kernel, Matt Helsley, David S. Miller, netdev
In-Reply-To: <20090929140718.GA23858@ioremap.net>

On 09/29, Evgeniy Polyakov wrote:
>
> On Tue, Sep 29, 2009 at 03:47:21PM +0200, Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> > Ok,  can confirm that this patch fixes my problem, but I am not sure if the
> > intended behaviour is still working as expected.
>
> Your patch breaks assumption that task_session(current->group_leader) is
> not equal to new session id,

Afaics, no.

> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1090,6 +1090,7 @@ SYSCALL_DEFINE0(setsid)
>  	struct pid *sid = task_pid(group_leader);
>  	pid_t session = pid_vnr(sid);
>  	int err = -EPERM;
> +	int send_cn = 0;
>
>  	write_lock_irq(&tasklist_lock);
>  	/* Fail if I am already a session leader */
> @@ -1104,12 +1105,18 @@ SYSCALL_DEFINE0(setsid)
>
>  	group_leader->signal->leader = 1;
>  	__set_special_pids(sid);
> +	if (task_session(group_leader) != sid)
> +		send_cn = 1;

This is not right, task_session(group_leader) must be == sid after
__set_special_pids().

And I don't think "int send_cn" is needed. sys_setsid() must not
succeed if the caller lived in session == task_pid(group_leader).

Or I missed your point?

Oleg.

^ permalink raw reply

* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Christian Borntraeger @ 2009-09-29 14:15 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Oleg Nesterov, Evgeny Polyakov, Scott James Remnant, Linux Kernel,
	Matt Helsley, David S. Miller, netdev
In-Reply-To: <20090929140718.GA23858@ioremap.net>

Am Dienstag 29 September 2009 16:07:18 schrieb Evgeniy Polyakov:
> Your patch breaks assumption that task_session(current->group_leader) is
> not equal to new session id, but to check task_session() we need either
> rcu or task lock. Also setsid() return value is not zero or negative
> error, but new session ID or negative error,

Right.

> so I believe attached patch is a proper fix, although it looks rather ugly.
> 
> Also proc_sid_connector() uses GFP_KERNEL allocation which is way too
> wrong to use under any locks.
> 
> Something like this (not tested :)

Patch compiles and seems to work.

Christian

^ permalink raw reply

* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Evgeniy Polyakov @ 2009-09-29 14:07 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Oleg Nesterov, Evgeny Polyakov, Scott James Remnant, Linux Kernel,
	Matt Helsley, David S. Miller, netdev
In-Reply-To: <200909291547.21528.borntraeger@de.ibm.com>

On Tue, Sep 29, 2009 at 03:47:21PM +0200, Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> Ok,  can confirm that this patch fixes my problem, but I am not sure if the
> intended behaviour is still working as expected.

Your patch breaks assumption that task_session(current->group_leader) is
not equal to new session id, but to check task_session() we need either
rcu or task lock. Also setsid() return value is not zero or negative
error, but new session ID or negative error, so I believe attached patch
is a proper fix, although it looks rather ugly.

Also proc_sid_connector() uses GFP_KERNEL allocation which is way too
wrong to use under any locks.

Something like this (not tested :)

diff --git a/kernel/exit.c b/kernel/exit.c
index 5859f59..1565baf 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -359,10 +359,8 @@ void __set_special_pids(struct pid *pid)
 {
 	struct task_struct *curr = current->group_leader;
 
-	if (task_session(curr) != pid) {
+	if (task_session(curr) != pid)
 		change_pid(curr, PIDTYPE_SID, pid);
-		proc_sid_connector(curr);
-	}
 
 	if (task_pgrp(curr) != pid)
 		change_pid(curr, PIDTYPE_PGID, pid);
diff --git a/kernel/sys.c b/kernel/sys.c
index 255475d..b852a8b 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1090,6 +1090,7 @@ SYSCALL_DEFINE0(setsid)
 	struct pid *sid = task_pid(group_leader);
 	pid_t session = pid_vnr(sid);
 	int err = -EPERM;
+	int send_cn = 0;
 
 	write_lock_irq(&tasklist_lock);
 	/* Fail if I am already a session leader */
@@ -1104,12 +1105,18 @@ SYSCALL_DEFINE0(setsid)
 
 	group_leader->signal->leader = 1;
 	__set_special_pids(sid);
+	if (task_session(group_leader) != sid)
+		send_cn = 1;
 
 	proc_clear_tty(group_leader);
 
 	err = session;
 out:
 	write_unlock_irq(&tasklist_lock);
+
+	if (send_cn)
+		proc_sid_connector(group_leader);
+
 	return err;
 }
 

-- 
	Evgeniy Polyakov

^ permalink raw reply related

* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Oleg Nesterov @ 2009-09-29 13:59 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Evgeny Polyakov, Scott James Remnant, Linux Kernel, Matt Helsley,
	David S. Miller, Evgeniy Polyakov, netdev
In-Reply-To: <200909291547.21528.borntraeger@de.ibm.com>

On 09/29, Christian Borntraeger wrote:
>
> --- linux-2.6.orig/kernel/sys.c
> +++ linux-2.6/kernel/sys.c
> @@ -1110,6 +1110,8 @@ SYSCALL_DEFINE0(setsid)
>  	err = session;
>  out:
>  	write_unlock_irq(&tasklist_lock);
> +	if (!err)
> +		proc_sid_connector(sid);

sys_setsid() returns the session nr on success, not zero.

	if (err > 0)
		proc_sid_connector(sid);

Otherwize I think the patch is fine. Not only it should fix the problem,
imho it makes the code cleaner.

If Scott still thinks daemonize() should report too, we can change it.

(I'd suggest you to CC Andrew if you are going to re-send)

Oleg.

^ permalink raw reply

* [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Christian Borntraeger @ 2009-09-29 13:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Evgeny Polyakov, Scott James Remnant, Linux Kernel, Matt Helsley,
	David S. Miller, Evgeniy Polyakov, netdev
In-Reply-To: <20090929132415.GB4538@redhat.com>

Am Dienstag 29 September 2009 15:24:15 schrieb Oleg Nesterov:
> On 09/29, Evgeny Polyakov wrote:
> > I'm yet do download the latest git to check this particular patch, but
> > I suppose it is possible to copy data under the lock into temporary
> > buffer and then send it using existing infrastructure instead of
> > calling it under the tasklist lock.
> 
> Afaics, we can just shift proc_sid_connector() from __set_special_pids()
> to sys_setsid().

Ok,  can confirm that this patch fixes my problem, but I am not sure if the
intended behaviour is still working as expected.

[PATCH] connector: Fix sid connector

The sid connector gives the following warning:
Badness at kernel/softirq.c:143
[...]
Call Trace:
([<000000013fe04100>] 0x13fe04100)
 [<000000000048a946>] sk_filter+0x9a/0xd0
 [<000000000049d938>] netlink_broadcast+0x2c0/0x53c
 [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
 [<00000000003baef0>] proc_sid_connector+0xc4/0xd4
 [<0000000000142604>] __set_special_pids+0x58/0x90
 [<0000000000159938>] sys_setsid+0xb4/0xd8
 [<00000000001187fe>] sysc_noemu+0x10/0x16
 [<00000041616cb266>] 0x41616cb266

The warning is
--->    WARN_ON_ONCE(in_irq() || irqs_disabled());

The network code must not be called with disabled interrupts but
sys_setsid holds the tasklist_lock with spinlock_irq while calling
the connector. We can safely move proc_sid_connector from
__set_special_pids to sys_setsid.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

---
 kernel/exit.c |    4 +---
 kernel/sys.c  |    2 ++
 2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -359,10 +359,8 @@ void __set_special_pids(struct pid *pid)
 {
 	struct task_struct *curr = current->group_leader;
 
-	if (task_session(curr) != pid) {
+	if (task_session(curr) != pid)
 		change_pid(curr, PIDTYPE_SID, pid);
-		proc_sid_connector(curr);
-	}
 
 	if (task_pgrp(curr) != pid)
 		change_pid(curr, PIDTYPE_PGID, pid);
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -1110,6 +1110,8 @@ SYSCALL_DEFINE0(setsid)
 	err = session;
 out:
 	write_unlock_irq(&tasklist_lock);
+	if (!err)
+		proc_sid_connector(sid);
 	return err;
 }
 



^ permalink raw reply

* Re: 2.3.31++: Badness at kernel/softirq.c:143 due to new session leader connector
From: Oleg Nesterov @ 2009-09-29 13:24 UTC (permalink / raw)
  To: Evgeny Polyakov
  Cc: Christian Borntraeger, Scott James Remnant, Linux Kernel,
	Matt Helsley, David S. Miller, Evgeniy Polyakov, netdev
In-Reply-To: <fbbbe0a028fa.4ac1f1c0@2ka.mipt.ru>

On 09/29, Evgeny Polyakov wrote:
>
> I'm yet do download the latest git to check this particular patch, but
> I suppose it is possible to copy data under the lock into temporary
> buffer and then send it using existing infrastructure instead of
> calling it under the tasklist lock.

Afaics, we can just shift proc_sid_connector() from __set_special_pids()
to sys_setsid().

Oleg.


^ permalink raw reply

* Re: [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Arnd Bergmann @ 2009-09-29 13:05 UTC (permalink / raw)
  To: Chris Wright
  Cc: Shreyas Bhatewara, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, Stephen Hemminger, David S. Miller,
	Jeff Garzik, Anthony Liguori, Greg Kroah-Hartman, Andrew Morton,
	virtualization, pv-drivers@vmware.com
In-Reply-To: <20090929085333.GC3958@sequoia.sous-sol.org>

On Tuesday 29 September 2009, Chris Wright wrote:
> > +struct Vmxnet3_MiscConf {
> > +       struct Vmxnet3_DriverInfo driverInfo;
> > +       uint64_t             uptFeatures;
> > +       uint64_t             ddPA;         /* driver data PA */
> > +       uint64_t             queueDescPA;  /* queue descriptor table PA */
> > +       uint32_t             ddLen;        /* driver data len */
> > +       uint32_t             queueDescLen; /* queue desc. table len in bytes */
> > +       uint32_t             mtu;
> > +       uint16_t             maxNumRxSG;
> > +       uint8_t              numTxQueues;
> > +       uint8_t              numRxQueues;
> > +       uint32_t             reserved[4];
> > +};
> 
> should this be packed (or others that are shared w/ device)?  i assume
> you've already done 32 vs 64 here

I would not mark it packed, because it already is well-defined on all
systems. You should add __packed only to the fields where you screwed
up, but not to structures that already work fine.

One thing that should possibly be fixed is the naming of identifiers, e.g.
's/Vmxnet3_MiscConf/vmxnet3_misc_conf/g', unless these header files are
shared with the host implementation.

	Arnd <><

^ permalink raw reply

* [PATCH] net: restore tx timestamping for accelerated vlans
From: Eric Dumazet @ 2009-09-29 12:57 UTC (permalink / raw)
  To: David S. Miller, Patrick McHardy; +Cc: Linux Netdev List

Since commit 9b22ea560957de1484e6b3e8538f7eef202e3596
( net: fix packet socket delivery in rx irq handler )

We lost rx timestamping of packets received on accelerated vlans.

Effect is that tcpdump on real dev can show strange timings, since it gets rx timestamps
too late (ie at skb dequeueing time, not at skb queueing time)

14:47:26.986871 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 1
14:47:26.986786 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 1

14:47:27.986888 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 2
14:47:27.986781 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 2

14:47:28.986896 IP 192.168.20.110 > 192.168.20.141: icmp 64: echo request seq 3
14:47:28.986780 IP 192.168.20.141 > 192.168.20.110: icmp 64: echo reply seq 3

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/dev.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 560c8c9..b8f74cf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2288,6 +2288,9 @@ int netif_receive_skb(struct sk_buff *skb)
 	int ret = NET_RX_DROP;
 	__be16 type;
 
+	if (!skb->tstamp.tv64)
+		net_timestamp(skb);
+
 	if (skb->vlan_tci && vlan_hwaccel_do_receive(skb))
 		return NET_RX_SUCCESS;
 
@@ -2295,9 +2298,6 @@ int netif_receive_skb(struct sk_buff *skb)
 	if (netpoll_receive_skb(skb))
 		return NET_RX_DROP;
 
-	if (!skb->tstamp.tv64)
-		net_timestamp(skb);
-
 	if (!skb->iif)
 		skb->iif = skb->dev->ifindex;
 

^ permalink raw reply related

* [PATCH v2 4/4] libxt_ipvs: user-space lib for netfilter matcher xt_ipvs
From: Hannes Eder @ 2009-09-29 12:36 UTC (permalink / raw)
  To: lvs-devel
  Cc: Wensong Zhang, Julius Volz, lvs-users, Laurent Grawet,
	Jean-Luc Fortemaison, linux-kernel, Jan Engelhardt,
	Julian Anastasov, Simon Horman, netfilter-devel, netdev,
	Fabien Duchêne, Joseph Mack NA3T, Patrick McHardy
In-Reply-To: <20090929123501.13798.84004.stgit@jazzy.zrh.corp.google.com>

The user-space library for the netfilter matcher xt_ipvs.

Signed-off-by: Hannes Eder <heder@google.com>

 configure.ac                      |   11 +
 extensions/libxt_ipvs.c           |  365 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   24 ++
 include/linux/netfilter/xt_ipvs.h |   25 +++
 4 files changed, 422 insertions(+), 3 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h

diff --git a/configure.ac b/configure.ac
index 0419ea7..52e9223 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,3 @@
-
 AC_INIT([iptables], [1.4.5])
 
 # See libtool.info "Libtool's versioning system"
@@ -47,12 +46,18 @@ AC_ARG_WITH([pkgconfigdir], AS_HELP_STRING([--with-pkgconfigdir=PATH],
 	[Path to the pkgconfig directory [[LIBDIR/pkgconfig]]]),
 	[pkgconfigdir="$withval"], [pkgconfigdir='${libdir}/pkgconfig'])
 
-AC_CHECK_HEADER([linux/dccp.h])
-
 blacklist_modules="";
+
+AC_CHECK_HEADER([linux/dccp.h])
 if test "$ac_cv_header_linux_dccp_h" != "yes"; then
 	blacklist_modules="$blacklist_modules dccp";
 fi;
+
+AC_CHECK_HEADER([linux/ip_vs.h])
+if test "$ac_cv_header_linux_ip_vs_h" != "yes"; then
+	blacklist_modules="$blacklist_modules ipvs";
+fi;
+
 AC_SUBST([blacklist_modules])
 
 AM_CONDITIONAL([ENABLE_STATIC], [test "$enable_static" = "yes"])
diff --git a/extensions/libxt_ipvs.c b/extensions/libxt_ipvs.c
new file mode 100644
index 0000000..6843551
--- /dev/null
+++ b/extensions/libxt_ipvs.c
@@ -0,0 +1,365 @@
+/*
+ * Shared library add-on to iptables to add IPVS matching.
+ *
+ * Detailed doc is in the kernel module source net/netfilter/xt_ipvs.c
+ *
+ * Author: Hannes Eder <heder@google.com>
+ */
+#include <sys/types.h>
+#include <assert.h>
+#include <ctype.h>
+#include <errno.h>
+#include <getopt.h>
+#include <netdb.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <xtables.h>
+#include <linux/ip_vs.h>
+#include <linux/netfilter/xt_ipvs.h>
+
+static const struct option ipvs_mt_opts[] = {
+	{ .name = "ipvs",     .has_arg = false, .val = '0' },
+	{ .name = "vproto",   .has_arg = true,  .val = '1' },
+	{ .name = "vaddr",    .has_arg = true,  .val = '2' },
+	{ .name = "vport",    .has_arg = true,  .val = '3' },
+	{ .name = "vdir",     .has_arg = true,  .val = '4' },
+	{ .name = "vmethod",  .has_arg = true,  .val = '5' },
+	{ .name = "vportctl", .has_arg = true,  .val = '6' },
+	{ .name = NULL }
+};
+
+static void ipvs_mt_help(void)
+{
+	printf(
+"IPVS match options:\n"
+"[!] --ipvs                      packet belongs to an IPVS connection\n"
+"\n"
+"Any of the following options implies --ipvs (even negated)\n"
+"[!] --vproto protocol           VIP protocol to match; by number or name,\n"
+"                                e.g. \"tcp\"\n"
+"[!] --vaddr address[/mask]      VIP address to match\n"
+"[!] --vport port                VIP port to match; by number or name,\n"
+"                                e.g. \"http\"\n"
+"    --vdir {ORIGINAL|REPLY}     flow direction of packet\n"
+"[!] --vmethod {GATE|IPIP|MASQ}  IPVS forwarding method used\n"
+"[!] --vportctl port             VIP port of the controlling connection to\n"
+"                                match, e.g. 21 for FTP\n"
+		);
+}
+
+static void ipvs_mt_parse_addr_and_mask(const char *arg,
+					union nf_inet_addr *address,
+					union nf_inet_addr *mask,
+					unsigned int family)
+{
+	struct in_addr *addr = NULL;
+	struct in6_addr *addr6 = NULL;
+	unsigned int naddrs = 0;
+
+	if (family == NFPROTO_IPV4) {
+		xtables_ipparse_any(arg, &addr, &mask->in, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in, addr, sizeof(*addr));
+	} else if (family == NFPROTO_IPV6) {
+		xtables_ip6parse_any(arg, &addr6, &mask->in6, &naddrs);
+		if (naddrs > 1)
+			xtables_error(PARAMETER_PROBLEM,
+				      "multiple IP addresses not allowed");
+		if (naddrs == 1)
+			memcpy(&address->in6, addr6, sizeof(*addr6));
+	} else {
+		/* Hu? */
+		assert(false);
+	}
+}
+
+/* Function which parses command options; returns true if it ate an option */
+static int ipvs_mt_parse(int c, char **argv, int invert, unsigned int *flags,
+			 const void *entry, struct xt_entry_match **match,
+			 unsigned int family)
+{
+	struct xt_ipvs_mtinfo *data = (void *)(*match)->data;
+	char *p = NULL;
+	u_int8_t op = 0;
+
+	if ('0' <= c && c <= '6') {
+		static const int ops[] = {
+			XT_IPVS_IPVS_PROPERTY,
+			XT_IPVS_PROTO,
+			XT_IPVS_VADDR,
+			XT_IPVS_VPORT,
+			XT_IPVS_DIR,
+			XT_IPVS_METHOD,
+			XT_IPVS_VPORTCTL
+		};
+		op = ops[c - '0'];
+	} else
+		return 0;
+
+	if (*flags & op & XT_IPVS_ONCE_MASK)
+		goto multiple_use;
+
+	switch (c) {
+	case '0': /* --ipvs */
+		/* Nothing to do here. */
+		break;
+
+	case '1': /* --vproto */
+		/* Canonicalize into lower case */
+		for (p = optarg; *p != '\0'; ++p)
+			*p = tolower(*p);
+
+		data->l4proto = xtables_parse_protocol(optarg);
+		break;
+
+	case '2': /* --vaddr */
+		ipvs_mt_parse_addr_and_mask(optarg, &data->vaddr,
+					    &data->vmask, family);
+		break;
+
+	case '3': /* --vport */
+		data->vport = htons(xtables_parse_port(optarg, "tcp"));
+		break;
+
+	case '4': /* --vdir */
+		xtables_param_act(XTF_NO_INVERT, "ipvs", "--vdir", invert);
+		if (strcasecmp(optarg, "ORIGINAL") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert   &= ~XT_IPVS_DIR;
+		} else if (strcasecmp(optarg, "REPLY") == 0) {
+			data->bitmask |= XT_IPVS_DIR;
+			data->invert  |= XT_IPVS_DIR;
+		} else {
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vdir", optarg);
+		}
+		break;
+
+	case '5': /* --vmethod */
+		if (strcasecmp(optarg, "GATE") == 0)
+			data->fwd_method = IP_VS_CONN_F_DROUTE;
+		else if (strcasecmp(optarg, "IPIP") == 0)
+			data->fwd_method = IP_VS_CONN_F_TUNNEL;
+		else if (strcasecmp(optarg, "MASQ") == 0)
+			data->fwd_method = IP_VS_CONN_F_MASQ;
+		else
+			xtables_param_act(XTF_BAD_VALUE,
+					  "ipvs", "--vmethod", optarg);
+		break;
+
+	case '6': /* --vportctl */
+		data->vportctl = htons(xtables_parse_port(optarg, "tcp"));
+		break;
+
+	default:
+		/* Hu? How did we come here? */
+		assert(false);
+		return 0;
+	}
+
+	if (op & XT_IPVS_ONCE_MASK) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			xtables_error(PARAMETER_PROBLEM,
+				      "! --ipvs cannot be together with"
+				      " other options");
+		data->bitmask |= XT_IPVS_IPVS_PROPERTY;
+	}
+
+	data->bitmask |= op;
+	if (invert)
+		data->invert |= op;
+	*flags |= op;
+	return 1;
+
+multiple_use:
+	xtables_error(PARAMETER_PROBLEM,
+		      "multiple use of the same IPVS option is not allowed");
+}
+
+static int ipvs_mt4_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV4);
+}
+
+static int ipvs_mt6_parse(int c, char **argv, int invert, unsigned int *flags,
+			  const void *entry, struct xt_entry_match **match)
+{
+	return ipvs_mt_parse(c, argv, invert, flags, entry, match,
+			     NFPROTO_IPV6);
+}
+
+static void ipvs_mt_check(unsigned int flags)
+{
+	if (flags == 0)
+		xtables_error(PARAMETER_PROBLEM,
+			      "IPVS: At least one option is required");
+}
+
+/* Shamelessly copied from libxt_conntrack.c */
+static void ipvs_mt_dump_addr(const union nf_inet_addr *addr,
+			      const union nf_inet_addr *mask,
+			      unsigned int family, bool numeric)
+{
+	char buf[BUFSIZ];
+
+	if (family == NFPROTO_IPV4) {
+		if (!numeric && addr->ip == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ipaddr_to_numeric(&addr->in));
+		else
+			strcpy(buf, xtables_ipaddr_to_anyname(&addr->in));
+		strcat(buf, xtables_ipmask_to_numeric(&mask->in));
+		printf("%s ", buf);
+	} else if (family == NFPROTO_IPV6) {
+		if (!numeric && addr->ip6[0] == 0 && addr->ip6[1] == 0 &&
+		    addr->ip6[2] == 0 && addr->ip6[3] == 0) {
+			printf("anywhere ");
+			return;
+		}
+		if (numeric)
+			strcpy(buf, xtables_ip6addr_to_numeric(&addr->in6));
+		else
+			strcpy(buf, xtables_ip6addr_to_anyname(&addr->in6));
+		strcat(buf, xtables_ip6mask_to_numeric(&mask->in6));
+		printf("%s ", buf);
+	}
+}
+
+static void ipvs_mt_dump(const void *ip, const struct xt_ipvs_mtinfo *data,
+			 unsigned int family, bool numeric, const char *prefix)
+{
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		if (data->invert & XT_IPVS_IPVS_PROPERTY)
+			printf("! ");
+		printf("%sipvs ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_PROTO) {
+		if (data->invert & XT_IPVS_PROTO)
+			printf("! ");
+		printf("%sproto %u ", prefix, data->l4proto);
+	}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (data->invert & XT_IPVS_VADDR)
+			printf("! ");
+
+		printf("%svaddr ", prefix);
+		ipvs_mt_dump_addr(&data->vaddr, &data->vmask, family, numeric);
+	}
+
+	if (data->bitmask & XT_IPVS_VPORT) {
+		if (data->invert & XT_IPVS_VPORT)
+			printf("! ");
+
+		printf("%svport %u ", prefix, ntohs(data->vport));
+	}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		if (data->invert & XT_IPVS_DIR)
+			printf("%svdir REPLY ", prefix);
+		else
+			printf("%svdir ORIGINAL ", prefix);
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD) {
+		if (data->invert & XT_IPVS_METHOD)
+			printf("! ");
+
+		printf("%svmethod ", prefix);
+		switch (data->fwd_method) {
+		case IP_VS_CONN_F_DROUTE:
+			printf("GATE ");
+			break;
+		case IP_VS_CONN_F_TUNNEL:
+			printf("IPIP ");
+			break;
+		case IP_VS_CONN_F_MASQ:
+			printf("MASQ ");
+			break;
+		default:
+			/* Hu? */
+			printf("UNKNOWN ");
+			break;
+		}
+	}
+
+	if (data->bitmask & XT_IPVS_VPORTCTL) {
+		if (data->invert & XT_IPVS_VPORTCTL)
+			printf("! ");
+
+		printf("%svportctl %u ", prefix, ntohs(data->vportctl));
+	}
+}
+
+static void ipvs_mt4_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, numeric, "");
+}
+
+static void ipvs_mt6_print(const void *ip, const struct xt_entry_match *match,
+			   int numeric)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, numeric, "");
+}
+
+static void ipvs_mt4_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV4, true, "--");
+}
+
+static void ipvs_mt6_save(const void *ip, const struct xt_entry_match *match)
+{
+	const struct xt_ipvs_mtinfo *data = (const void *)match->data;
+	ipvs_mt_dump(ip, data, NFPROTO_IPV6, true, "--");
+}
+
+static struct xtables_match ipvs_matches_reg[] = {
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV4,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt4_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt4_print,
+		.save          = ipvs_mt4_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+	{
+		.version       = XTABLES_VERSION,
+		.name          = "ipvs",
+		.revision      = 0,
+		.family        = NFPROTO_IPV6,
+		.size          = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.userspacesize = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+		.help          = ipvs_mt_help,
+		.parse         = ipvs_mt6_parse,
+		.final_check   = ipvs_mt_check,
+		.print         = ipvs_mt6_print,
+		.save          = ipvs_mt6_save,
+		.extra_opts    = ipvs_mt_opts,
+	},
+};
+
+void _init(void)
+{
+	xtables_register_matches(ipvs_matches_reg,
+				 ARRAY_SIZE(ipvs_matches_reg));
+}
diff --git a/extensions/libxt_ipvs.man b/extensions/libxt_ipvs.man
new file mode 100644
index 0000000..8968e1a
--- /dev/null
+++ b/extensions/libxt_ipvs.man
@@ -0,0 +1,24 @@
+Match IPVS connection properties.
+.TP
+[\fB!\fR] \fB\-\-ipvs\fP
+packet belongs to an IPVS connection
+.TP
+Any of the following options implies \-\-ipvs (even negated)
+.TP
+[\fB!\fR] \fB\-\-vproto\fP \fIprotocol\fP
+VIP protocol to match; by number or name, e.g. "tcp"
+.TP
+[\fB!\fR] \fB\-\-vaddr\fP \fIaddress\fP[\fB/\fP\fImask\fP]
+VIP address to match
+.TP
+[\fB!\fR] \fB\-\-vport\fP \fIport\fP
+VIP port to match; by number or name, e.g. "http"
+.TP
+\fB\-\-vdir\fP {\fBORIGINAL\fP|\fBREPLY\fP}
+flow direction of packet
+.TP
+[\fB!\fR] \fB\-\-vmethod\fP {\fBGATE\fP|\fBIPIP\fP|\fBMASQ\fP}
+IPVS forwarding method used
+.TP
+[\fB!\fR] \fB\-\-vportctl\fP \fIport\fP
+VIP port of the controlling connection to match, e.g. 21 for FTP
diff --git a/include/linux/netfilter/xt_ipvs.h b/include/linux/netfilter/xt_ipvs.h
new file mode 100644
index 0000000..32f3051
--- /dev/null
+++ b/include/linux/netfilter/xt_ipvs.h
@@ -0,0 +1,25 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H 1
+
+#define XT_IPVS_IPVS_PROPERTY	(1 << 0) /* all other options imply this one */
+#define XT_IPVS_PROTO		(1 << 1)
+#define XT_IPVS_VADDR		(1 << 2)
+#define XT_IPVS_VPORT		(1 << 3)
+#define XT_IPVS_DIR		(1 << 4)
+#define XT_IPVS_METHOD		(1 << 5)
+#define XT_IPVS_VPORTCTL	(1 << 6)
+#define XT_IPVS_MASK		((1 << 7) - 1)
+#define XT_IPVS_ONCE_MASK	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
+
+struct xt_ipvs_mtinfo {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u16			l4proto;
+	__u16			fwd_method;
+	__be16			vportctl;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */


^ permalink raw reply related

* [PATCH v2 3/4] IPVS: make FTP work with full NAT support
From: Hannes Eder @ 2009-09-29 12:35 UTC (permalink / raw)
  To: lvs-devel
  Cc: Wensong Zhang, Julius Volz, lvs-users, Laurent Grawet,
	Jean-Luc Fortemaison, linux-kernel, Jan Engelhardt,
	Julian Anastasov, Simon Horman, netfilter-devel, netdev,
	Fabien Duchêne, Joseph Mack NA3T, Patrick McHardy
In-Reply-To: <20090929123501.13798.84004.stgit@jazzy.zrh.corp.google.com>

Use nf_conntrack/nf_nat code to do the packet mangling and the TCP
sequence adjusting.  The function 'ip_vs_skb_replace' is now dead
code, so it is removed.

To SNAT FTP, use something like:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 21 -j SNAT --to-source 192.168.10.10

and for the data connections in passive mode:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vportctl 21 -j SNAT --to-source 192.168.10.10

using '-m state --state RELATED' would also works.

Make sure the kernel modules ip_vs_ftp, nf_conntrack_ftp, and
nf_nat_ftp are loaded.

Signed-off-by: Hannes Eder <heder@google.com>

 include/net/ip_vs.h             |    2 
 net/netfilter/ipvs/Kconfig      |    2 
 net/netfilter/ipvs/ip_vs_app.c  |   43 ---------
 net/netfilter/ipvs/ip_vs_core.c |    1 
 net/netfilter/ipvs/ip_vs_ftp.c  |  178 ++++++++++++++++++++++++++++++++++++---
 5 files changed, 164 insertions(+), 62 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 98978e7..ec467de 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -724,8 +724,6 @@ extern void ip_vs_app_inc_put(struct ip_vs_app *inc);
 
 extern int ip_vs_app_pkt_out(struct ip_vs_conn *, struct sk_buff *skb);
 extern int ip_vs_app_pkt_in(struct ip_vs_conn *, struct sk_buff *skb);
-extern int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-			     char *o_buf, int o_len, char *n_buf, int n_len);
 extern int ip_vs_app_init(void);
 extern void ip_vs_app_cleanup(void);
 
diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index fca5379..afc03ec 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -226,7 +226,7 @@ comment 'IPVS application helper'
 
 config	IP_VS_FTP
   	tristate "FTP protocol helper"
-        depends on IP_VS_PROTO_TCP
+        depends on IP_VS_PROTO_TCP && NF_NAT
 	---help---
 	  FTP is a protocol that transfers IP address and/or port number in
 	  the payload. In the virtual server via Network Address Translation,
diff --git a/net/netfilter/ipvs/ip_vs_app.c b/net/netfilter/ipvs/ip_vs_app.c
index 3c7e427..1e2d450 100644
--- a/net/netfilter/ipvs/ip_vs_app.c
+++ b/net/netfilter/ipvs/ip_vs_app.c
@@ -568,49 +568,6 @@ static const struct file_operations ip_vs_app_fops = {
 };
 #endif
 
-
-/*
- *	Replace a segment of data with a new segment
- */
-int ip_vs_skb_replace(struct sk_buff *skb, gfp_t pri,
-		      char *o_buf, int o_len, char *n_buf, int n_len)
-{
-	int diff;
-	int o_offset;
-	int o_left;
-
-	EnterFunction(9);
-
-	diff = n_len - o_len;
-	o_offset = o_buf - (char *)skb->data;
-	/* The length of left data after o_buf+o_len in the skb data */
-	o_left = skb->len - (o_offset + o_len);
-
-	if (diff <= 0) {
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-		skb_trim(skb, skb->len + diff);
-	} else if (diff <= skb_tailroom(skb)) {
-		skb_put(skb, diff);
-		memmove(o_buf + n_len, o_buf + o_len, o_left);
-		memcpy(o_buf, n_buf, n_len);
-	} else {
-		if (pskb_expand_head(skb, skb_headroom(skb), diff, pri))
-			return -ENOMEM;
-		skb_put(skb, diff);
-		memmove(skb->data + o_offset + n_len,
-			skb->data + o_offset + o_len, o_left);
-		skb_copy_to_linear_data_offset(skb, o_offset, n_buf, n_len);
-	}
-
-	/* must update the iph total length here */
-	ip_hdr(skb)->tot_len = htons(skb->len);
-
-	LeaveFunction(9);
-	return 0;
-}
-
-
 int __init ip_vs_app_init(void)
 {
 	/* we will replace it with proc_net_ipvs_create() soon */
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index d5e00ae..e200725 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -52,7 +52,6 @@
 
 EXPORT_SYMBOL(register_ip_vs_scheduler);
 EXPORT_SYMBOL(unregister_ip_vs_scheduler);
-EXPORT_SYMBOL(ip_vs_skb_replace);
 EXPORT_SYMBOL(ip_vs_proto_name);
 EXPORT_SYMBOL(ip_vs_conn_new);
 EXPORT_SYMBOL(ip_vs_conn_in_get);
diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
index 33e2c79..a810ed2 100644
--- a/net/netfilter/ipvs/ip_vs_ftp.c
+++ b/net/netfilter/ipvs/ip_vs_ftp.c
@@ -20,6 +20,17 @@
  *
  * Author:	Wouter Gadeyne
  *
+ *
+ * Code for ip_vs_expect_related and ip_vs_expect_callback is taken from
+ * http://www.ssi.bg/~ja/nfct/:
+ *
+ * ip_vs_nfct.c:	Netfilter connection tracking support for IPVS
+ *
+ * Portions Copyright (C) 2001-2002
+ * Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland.
+ *
+ * Portions Copyright (C) 2003-2008
+ * Julian Anastasov
  */
 
 #define KMSG_COMPONENT "IPVS"
@@ -32,6 +43,9 @@
 #include <linux/in.h>
 #include <linux/ip.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_nat_helper.h>
 #include <net/protocol.h>
 #include <net/tcp.h>
 #include <asm/unaligned.h>
@@ -42,6 +56,16 @@
 #define SERVER_STRING "227 Entering Passive Mode ("
 #define CLIENT_STRING "PORT "
 
+#define FMT_TUPLE	"%u.%u.%u.%u:%u->%u.%u.%u.%u:%u/%u"
+#define ARG_TUPLE(T)	NIPQUAD((T)->src.u3.ip), ntohs((T)->src.u.all), \
+			NIPQUAD((T)->dst.u3.ip), ntohs((T)->dst.u.all), \
+			(T)->dst.protonum
+
+#define FMT_CONN	"%u.%u.%u.%u:%u->%u.%u.%u.%u:%u->%u.%u.%u.%u:%u/%u:%u"
+#define ARG_CONN(C)	NIPQUAD((C)->caddr), ntohs((C)->cport), \
+			NIPQUAD((C)->vaddr), ntohs((C)->vport), \
+			NIPQUAD((C)->daddr), ntohs((C)->dport), \
+			(C)->protocol, (C)->state
 
 /*
  * List of ports (up to IP_VS_APP_MAX_PORTS) to be handled by helper
@@ -122,6 +146,119 @@ static int ip_vs_ftp_get_addrport(char *data, char *data_limit,
 	return 1;
 }
 
+/*
+ * Called from init_conntrack() as expectfn handler.
+ */
+static void
+ip_vs_expect_callback(struct nf_conn *ct,
+		      struct nf_conntrack_expect *exp)
+{
+	struct nf_conntrack_tuple *orig, new_reply;
+	struct ip_vs_conn *cp;
+
+	if (exp->tuple.src.l3num != PF_INET)
+		return;
+
+	/*
+	 * We assume that no NF locks are held before this callback.
+	 * ip_vs_conn_out_get and ip_vs_conn_in_get should match their
+	 * expectations even if they use wildcard values, now we provide the
+	 * actual values from the newly created original conntrack direction.
+	 * The conntrack is confirmed when packet reaches IPVS hooks.
+	 */
+
+	/* RS->CLIENT */
+	orig = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+	cp = ip_vs_conn_out_get(exp->tuple.src.l3num, orig->dst.protonum,
+				&orig->src.u3, orig->src.u.tcp.port,
+				&orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply CLIENT->RS to CLIENT->VS */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found inout cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.dst.u3 = cp->vaddr;
+		new_reply.dst.u.tcp.port = cp->vport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", " FMT_TUPLE
+			  ", inout cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	/* CLIENT->VS */
+	cp = ip_vs_conn_in_get(exp->tuple.src.l3num, orig->dst.protonum,
+			       &orig->src.u3, orig->src.u.tcp.port,
+			       &orig->dst.u3, orig->dst.u.tcp.port);
+	if (cp) {
+		/* Change reply VS->CLIENT to RS->CLIENT */
+		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+		IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", found outin cp=" FMT_CONN "\n",
+			  __func__, ct, ct->status,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		new_reply.src.u3 = cp->daddr;
+		new_reply.src.u.tcp.port = cp->dport;
+		IP_VS_DBG(7, "%s(): ct=%p, new tuples=" FMT_TUPLE ", "
+			  FMT_TUPLE ", outin cp=" FMT_CONN "\n",
+			  __func__, ct,
+			  ARG_TUPLE(orig), ARG_TUPLE(&new_reply),
+			  ARG_CONN(cp));
+		goto alter;
+	}
+
+	IP_VS_DBG(7, "%s(): ct=%p, status=0x%lX, tuple=" FMT_TUPLE
+		  " - unknown expect\n",
+		  __func__, ct, ct->status, ARG_TUPLE(orig));
+	return;
+
+alter:
+	/* Never alter conntrack for non-NAT conns */
+	if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_MASQ)
+		nf_conntrack_alter_reply(ct, &new_reply);
+	ip_vs_conn_put(cp);
+	return;
+}
+
+/*
+ * Create NF conntrack expectation with wildcard (optional) source port.
+ * Then the default callback function will alter the reply and will confirm
+ * the conntrack entry when the first packet comes.
+ */
+static void
+ip_vs_expect_related(struct sk_buff *skb, struct nf_conn *ct,
+		     struct ip_vs_conn *cp, u_int8_t proto,
+		     const __be16 *port, int from_rs)
+{
+	struct nf_conntrack_expect *exp;
+
+	BUG_ON(!ct || ct == &nf_conntrack_untracked);
+
+	exp = nf_ct_expect_alloc(ct);
+	if (!exp)
+		return;
+
+	if (from_rs)
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->daddr, &cp->caddr,
+				  proto, port, &cp->cport);
+	else
+		nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT,
+				  nf_ct_l3num(ct), &cp->caddr, &cp->vaddr,
+				  proto, port, &cp->vport);
+
+	exp->expectfn = ip_vs_expect_callback;
+
+	IP_VS_DBG(7, "%s(): ct=%p, expect tuple=" FMT_TUPLE "\n",
+		  __func__, ct, ARG_TUPLE(&exp->tuple));
+	nf_ct_expect_related(exp);
+	nf_ct_expect_put(exp);
+}
 
 /*
  * Look at outgoing ftp packets to catch the response to a PASV command
@@ -146,9 +283,11 @@ static int ip_vs_ftp_out(struct ip_vs_app *app, struct ip_vs_conn *cp,
 	union nf_inet_addr from;
 	__be16 port;
 	struct ip_vs_conn *n_cp;
-	char buf[24];		/* xxx.xxx.xxx.xxx,ppp,ppp\000 */
+	char buf[sizeof("xxx,xxx,xxx,xxx,ppp,ppp")];
 	unsigned buf_len;
 	int ret;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -208,23 +347,26 @@ static int ip_vs_ftp_out(struct ip_vs_app *app, struct ip_vs_conn *cp,
 		 */
 		from.ip = n_cp->vaddr.ip;
 		port = n_cp->vport;
-		sprintf(buf, "%d,%d,%d,%d,%d,%d", NIPQUAD(from.ip),
-			(ntohs(port)>>8)&255, ntohs(port)&255);
-		buf_len = strlen(buf);
+		buf_len = sprintf(buf, "%d,%d,%d,%d,%d,%d", NIPQUAD(from.ip),
+				  (ntohs(port)>>8)&255, ntohs(port)&255);
+
+		ct = nf_ct_get(skb, &ctinfo);
+		ret = nf_nat_mangle_tcp_packet(skb,
+					       ct,
+					       ctinfo,
+					       start-data,
+					       end-start,
+					       buf,
+					       buf_len);
+
+		if (ct && ct != &nf_conntrack_untracked)
+			ip_vs_expect_related(skb, ct, n_cp,
+					     IPPROTO_TCP, NULL, 0);
 
 		/*
-		 * Calculate required delta-offset to keep TCP happy
+		 * Not setting 'diff' is intentional, otherwise the sequence
+		 * would be adjusted twice.
 		 */
-		*diff = buf_len - (end-start);
-
-		if (*diff == 0) {
-			/* simply replace it with new passive address */
-			memcpy(start, buf, buf_len);
-			ret = 1;
-		} else {
-			ret = !ip_vs_skb_replace(skb, GFP_ATOMIC, start,
-					  end-start, buf, buf_len);
-		}
 
 		cp->app_data = NULL;
 		ip_vs_tcp_conn_listen(n_cp);
@@ -256,6 +398,7 @@ static int ip_vs_ftp_in(struct ip_vs_app *app, struct ip_vs_conn *cp,
 	union nf_inet_addr to;
 	__be16 port;
 	struct ip_vs_conn *n_cp;
+	struct nf_conn *ct;
 
 #ifdef CONFIG_IP_VS_IPV6
 	/* This application helper doesn't work with IPv6 yet,
@@ -342,6 +485,11 @@ static int ip_vs_ftp_in(struct ip_vs_app *app, struct ip_vs_conn *cp,
 		ip_vs_control_add(n_cp, cp);
 	}
 
+	ct = (struct nf_conn *)skb->nfct;
+	if (ct && ct != &nf_conntrack_untracked)
+		ip_vs_expect_related(skb, ct, n_cp,
+				     IPPROTO_TCP, &n_cp->dport, 1);
+
 	/*
 	 *	Move tunnel to listen state
 	 */


^ permalink raw reply related

* [PATCH v2 2/4] IPVS: make friends with nf_conntrack
From: Hannes Eder @ 2009-09-29 12:35 UTC (permalink / raw)
  To: lvs-devel
  Cc: Wensong Zhang, Julius Volz, lvs-users, Laurent Grawet,
	Jean-Luc Fortemaison, linux-kernel, Jan Engelhardt,
	Julian Anastasov, Simon Horman, netfilter-devel, netdev,
	Fabien Duchêne, Joseph Mack NA3T, Patrick McHardy
In-Reply-To: <20090929123501.13798.84004.stgit@jazzy.zrh.corp.google.com>

Update the nf_conntrack tuple in reply direction, as we will see
traffic from the real server (RIP) to the client (CIP).  Once this is
done we can use netfilters SNAT in POSTROUTING, especially with
xt_ipvs, to do source NAT, e.g.:

% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 --vport 80 \
> -j SNAT --to-source 192.168.10.10

Signed-off-by: Hannes Eder <heder@google.com>

 net/netfilter/ipvs/Kconfig      |    2 +-
 net/netfilter/ipvs/ip_vs_core.c |   36 ------------------------------------
 net/netfilter/ipvs/ip_vs_xmit.c |   30 ++++++++++++++++++++++++++++++
 3 files changed, 31 insertions(+), 37 deletions(-)

diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index 79a6980..fca5379 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -3,7 +3,7 @@
 #
 menuconfig IP_VS
 	tristate "IP virtual server support"
-	depends on NET && INET && NETFILTER
+	depends on NET && INET && NETFILTER && NF_CONNTRACK
 	---help---
 	  IP Virtual Server support will let you build a high-performance
 	  virtual server based on cluster of two or more real servers. This
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index b95699f..d5e00ae 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -521,26 +521,6 @@ int ip_vs_leave(struct ip_vs_service *svc, struct sk_buff *skb,
 	return NF_DROP;
 }
 
-
-/*
- *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
- *      chain, and is used for VS/NAT.
- *      It detects packets for VS/NAT connections and sends the packets
- *      immediately. This can avoid that iptable_nat mangles the packets
- *      for VS/NAT.
- */
-static unsigned int ip_vs_post_routing(unsigned int hooknum,
-				       struct sk_buff *skb,
-				       const struct net_device *in,
-				       const struct net_device *out,
-				       int (*okfn)(struct sk_buff *))
-{
-	if (!skb->ipvs_property)
-		return NF_ACCEPT;
-	/* The packet was sent from IPVS, exit this chain */
-	return NF_STOP;
-}
-
 __sum16 ip_vs_checksum_complete(struct sk_buff *skb, int offset)
 {
 	return csum_fold(skb_checksum(skb, offset, skb->len - offset, 0));
@@ -1442,14 +1422,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP_PRI_NAT_SRC-1,
-	},
 #ifdef CONFIG_IP_VS_IPV6
 	/* After packet filtering, forward packet through VS/DR, VS/TUN,
 	 * or VS/NAT(change destination), so that filtering rules can be
@@ -1478,14 +1450,6 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
 		.hooknum        = NF_INET_FORWARD,
 		.priority       = 99,
 	},
-	/* Before the netfilter connection tracking, exit from POST_ROUTING */
-	{
-		.hook		= ip_vs_post_routing,
-		.owner		= THIS_MODULE,
-		.pf		= PF_INET6,
-		.hooknum        = NF_INET_POST_ROUTING,
-		.priority       = NF_IP6_PRI_NAT_SRC-1,
-	},
 #endif
 };
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 30b3189..d7198e2 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -27,6 +27,7 @@
 #include <net/ip6_route.h>
 #include <linux/icmpv6.h>
 #include <linux/netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
 #include <linux/netfilter_ipv4.h>
 
 #include <net/ip_vs.h>
@@ -347,6 +348,31 @@ ip_vs_bypass_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 }
 #endif
 
+static void
+ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp)
+{
+	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+	struct nf_conntrack_tuple new_tuple;
+
+	if (ct == NULL || ct == &nf_conntrack_untracked ||
+	    nf_ct_is_confirmed(ct))
+		return;
+
+	/*
+	 * The connection is not yet in the hashtable, so we update it.
+	 * CIP->VIP will remain the same, so leave the tuple in
+	 * IP_CT_DIR_ORIGINAL untouched.  When the reply comes back from the
+	 * real-server we will see RIP->DIP.
+	 */
+	new_tuple = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+	new_tuple.src.u3 = cp->daddr;
+	/*
+	 * This will also take care of UDP and other protocols.
+	 */
+	new_tuple.src.u.tcp.port = cp->dport;
+	nf_conntrack_alter_reply(ct, &new_tuple);
+}
+
 /*
  *      NAT transmitter (only for outside-to-inside nat forwarding)
  *      Not used for related ICMP
@@ -402,6 +428,8 @@ ip_vs_nat_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */
@@ -478,6 +506,8 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	IP_VS_DBG_PKT(10, pp, skb, 0, "After DNAT");
 
+	ip_vs_update_conntrack(skb, cp);
+
 	/* FIXME: when application helper enlarges the packet and the length
 	   is larger than the MTU of outgoing device, there will be still
 	   MTU problem. */

^ permalink raw reply related

* [PATCH v2 1/4] netfilter: xt_ipvs (netfilter matcher for IPVS)
From: Hannes Eder @ 2009-09-29 12:35 UTC (permalink / raw)
  To: lvs-devel
  Cc: Wensong Zhang, Julius Volz, lvs-users, Laurent Grawet,
	Jean-Luc Fortemaison, linux-kernel, Jan Engelhardt,
	Julian Anastasov, Simon Horman, netfilter-devel, netdev,
	Fabien Duchêne, Joseph Mack NA3T, Patrick McHardy
In-Reply-To: <20090929123501.13798.84004.stgit@jazzy.zrh.corp.google.com>

This implements the kernel-space side of the netfilter matcher
xt_ipvs.

Signed-off-by: Hannes Eder <heder@google.com>

 include/linux/netfilter/xt_ipvs.h |   25 +++++
 net/netfilter/Kconfig             |    9 ++
 net/netfilter/Makefile            |    1 
 net/netfilter/ipvs/ip_vs_proto.c  |    1 
 net/netfilter/xt_ipvs.c           |  187 +++++++++++++++++++++++++++++++++++++
 5 files changed, 223 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c

diff --git a/include/linux/netfilter/xt_ipvs.h b/include/linux/netfilter/xt_ipvs.h
new file mode 100644
index 0000000..32f3051
--- /dev/null
+++ b/include/linux/netfilter/xt_ipvs.h
@@ -0,0 +1,25 @@
+#ifndef _XT_IPVS_H
+#define _XT_IPVS_H 1
+
+#define XT_IPVS_IPVS_PROPERTY	(1 << 0) /* all other options imply this one */
+#define XT_IPVS_PROTO		(1 << 1)
+#define XT_IPVS_VADDR		(1 << 2)
+#define XT_IPVS_VPORT		(1 << 3)
+#define XT_IPVS_DIR		(1 << 4)
+#define XT_IPVS_METHOD		(1 << 5)
+#define XT_IPVS_VPORTCTL	(1 << 6)
+#define XT_IPVS_MASK		((1 << 7) - 1)
+#define XT_IPVS_ONCE_MASK	(XT_IPVS_MASK & ~XT_IPVS_IPVS_PROPERTY)
+
+struct xt_ipvs_mtinfo {
+	union nf_inet_addr	vaddr, vmask;
+	__be16			vport;
+	__u16			l4proto;
+	__u16			fwd_method;
+	__be16			vportctl;
+
+	__u8			invert;
+	__u8			bitmask;
+};
+
+#endif /* _XT_IPVS_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 634d14a..fc35bd6 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -678,6 +678,15 @@ config NETFILTER_XT_MATCH_IPRANGE
 
 	If unsure, say M.
 
+config NETFILTER_XT_MATCH_IPVS
+	tristate '"ipvs" match support'
+	depends on IP_VS
+	depends on NETFILTER_ADVANCED
+	help
+	  This option allows you to match against IPVS properties of a packet.
+
+	  If unsure, say N.
+
 config NETFILTER_XT_MATCH_LENGTH
 	tristate '"length" match support'
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 49f62ee..ff95372 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_HASHLIMIT) += xt_hashlimit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_HELPER) += xt_helper.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_HL) += xt_hl.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_IPRANGE) += xt_iprange.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_IPVS) += xt_ipvs.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LENGTH) += xt_length.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_LIMIT) += xt_limit.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_MAC) += xt_mac.o
diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c
index 3e76716..db083c3 100644
--- a/net/netfilter/ipvs/ip_vs_proto.c
+++ b/net/netfilter/ipvs/ip_vs_proto.c
@@ -97,6 +97,7 @@ struct ip_vs_protocol * ip_vs_proto_get(unsigned short proto)
 
 	return NULL;
 }
+EXPORT_SYMBOL(ip_vs_proto_get);
 
 
 /*
diff --git a/net/netfilter/xt_ipvs.c b/net/netfilter/xt_ipvs.c
new file mode 100644
index 0000000..da7b634
--- /dev/null
+++ b/net/netfilter/xt_ipvs.c
@@ -0,0 +1,187 @@
+/*
+ *	xt_ipvs - kernel module to match IPVS connection properties
+ *
+ *	Author: Hannes Eder <heder@google.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/spinlock.h>
+#include <linux/skbuff.h>
+#ifdef CONFIG_IP_VS_IPV6
+#include <net/ipv6.h>
+#endif
+#include <linux/ip_vs.h>
+#include <linux/types.h>
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_ipvs.h>
+#include <net/netfilter/nf_conntrack.h>
+
+#include <net/ip_vs.h>
+
+MODULE_AUTHOR("Hannes Eder <heder@google.com>");
+MODULE_DESCRIPTION("Xtables: match IPVS connection properties");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_ipvs");
+MODULE_ALIAS("ip6t_ipvs");
+
+/* borrowed from xt_conntrack */
+static bool ipvs_mt_addrcmp(const union nf_inet_addr *kaddr,
+			    const union nf_inet_addr *uaddr,
+			    const union nf_inet_addr *umask,
+			    unsigned int l3proto)
+{
+	if (l3proto == NFPROTO_IPV4)
+		return ((kaddr->ip ^ uaddr->ip) & umask->ip) == 0;
+#ifdef CONFIG_IP_VS_IPV6
+	else if (l3proto == NFPROTO_IPV6)
+		return ipv6_masked_addr_cmp(&kaddr->in6, &umask->in6,
+		       &uaddr->in6) == 0;
+#endif
+	else
+		return false;
+}
+
+static bool ipvs_mt(const struct sk_buff *skb, const struct xt_match_param *par)
+{
+	const struct xt_ipvs_mtinfo *data = par->matchinfo;
+	/* ipvs_mt_check ensures that family is only NFPROTO_IPV[46]. */
+	const u_int8_t family = par->family;
+	struct ip_vs_iphdr iph;
+	struct ip_vs_protocol *pp;
+	struct ip_vs_conn *cp;
+	bool match = true;
+
+	if (data->bitmask == XT_IPVS_IPVS_PROPERTY) {
+		match = skb->ipvs_property ^
+			!!(data->invert & XT_IPVS_IPVS_PROPERTY);
+		goto out;
+	}
+
+	/* other flags than XT_IPVS_IPVS_PROPERTY are set */
+	if (!skb->ipvs_property) {
+		match = false;
+		goto out;
+	}
+
+	ip_vs_fill_iphdr(family, skb_network_header(skb), &iph);
+
+	if (data->bitmask & XT_IPVS_PROTO)
+		if ((iph.protocol == data->l4proto) ^
+		    !(data->invert & XT_IPVS_PROTO)) {
+			match = false;
+			goto out;
+		}
+
+	pp = ip_vs_proto_get(iph.protocol);
+	if (unlikely(!pp)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * Check if the packet belongs to an existing entry
+	 */
+	cp = pp->conn_out_get(family, skb, pp, &iph, iph.len, 1 /* inverse */);
+	if (unlikely(cp == NULL)) {
+		match = false;
+		goto out;
+	}
+
+	/*
+	 * We found a connection, i.e. ct != 0, make sure to call
+	 * __ip_vs_conn_put before returning.  In our case jump to out_put_con.
+	 */
+
+	if (data->bitmask & XT_IPVS_VPORT)
+		if ((cp->vport == data->vport) ^
+		    !(data->invert & XT_IPVS_VPORT)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_VPORTCTL)
+		if ((cp->control != NULL &&
+		     cp->control->vport == data->vportctl) ^
+		    !(data->invert & XT_IPVS_VPORTCTL)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_DIR) {
+		enum ip_conntrack_info ctinfo;
+		struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+
+		if (ct == NULL || ct == &nf_conntrack_untracked) {
+			match = false;
+			goto out_put_cp;
+		}
+
+		if ((ctinfo >= IP_CT_IS_REPLY) ^
+		    !!(data->invert & XT_IPVS_DIR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+	if (data->bitmask & XT_IPVS_METHOD)
+		if (((cp->flags & IP_VS_CONN_F_FWD_MASK) == data->fwd_method) ^
+		    !(data->invert & XT_IPVS_METHOD)) {
+			match = false;
+			goto out_put_cp;
+		}
+
+	if (data->bitmask & XT_IPVS_VADDR) {
+		if (ipvs_mt_addrcmp(&cp->vaddr, &data->vaddr,
+				    &data->vmask, family) ^
+		    !(data->invert & XT_IPVS_VADDR)) {
+			match = false;
+			goto out_put_cp;
+		}
+	}
+
+out_put_cp:
+	__ip_vs_conn_put(cp);
+out:
+	pr_debug("match=%d\n", match);
+	return match;
+}
+
+static bool ipvs_mt_check(const struct xt_mtchk_param *par)
+{
+	if (par->family != NFPROTO_IPV4
+#ifdef CONFIG_IP_VS_IPV6
+	    && par->family != NFPROTO_IPV6
+#endif
+		) {
+		pr_info("protocol family %u not supported\n", par->family);
+		return false;
+	}
+
+	return true;
+}
+
+static struct xt_match xt_ipvs_mt_reg __read_mostly = {
+	.name       = "ipvs",
+	.revision   = 0,
+	.family     = NFPROTO_UNSPEC,
+	.match      = ipvs_mt,
+	.checkentry = ipvs_mt_check,
+	.matchsize  = XT_ALIGN(sizeof(struct xt_ipvs_mtinfo)),
+	.me         = THIS_MODULE,
+};
+
+static int __init ipvs_mt_init(void)
+{
+	return xt_register_match(&xt_ipvs_mt_reg);
+}
+
+static void __exit ipvs_mt_exit(void)
+{
+	xt_unregister_match(&xt_ipvs_mt_reg);
+}
+
+module_init(ipvs_mt_init);
+module_exit(ipvs_mt_exit);


^ permalink raw reply related

* [PATCH v2 0/4] IPVS full NAT support + netfilter 'ipvs' match support
From: Hannes Eder @ 2009-09-29 12:35 UTC (permalink / raw)
  To: lvs-devel
  Cc: Wensong Zhang, Julius Volz, lvs-users, Laurent Grawet,
	Jean-Luc Fortemaison, linux-kernel, Jan Engelhardt,
	Julian Anastasov, Simon Horman, netfilter-devel, netdev,
	Fabien Duchêne, Joseph Mack NA3T, Patrick McHardy

The following series implements full NAT support for IPVS.  The
approach is via a minimal change to IPVS (make friends with
nf_conntrack) and adding a netfilter matcher, kernel- and user-space
part, i.e. xt_ipvs and libxt_ipvs.

Example usage:

% ipvsadm -A -t 192.168.100.30:80 -s rr
% ipvsadm -a -t 192.168.100.30:80 -r 192.168.10.20:80 -m
# ...

# Source NAT for VIP 192.168.100.30:80
% iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> --vport 80 -j SNAT --to-source 192.168.10.10

or SNAT-ing only a specific real server:

% iptables -t nat -A POSTROUTING --dst 192.168.11.20 \
> -m ipvs --vaddr 192.168.100.30/32 -j SNAT --to-source 192.168.10.10


First of all, thanks for all the feedback.  This is the changelog for v2:

- Make ip_vs_ftp work again.  Setup nf_conntrack expectations for
  related data connections (based on Julian's patch see
  http://www.ssi.bg/~ja/nfct/) and let nf_conntrack/nf_nat do the
  packet mangling and the TCP sequence adjusting.

  This change rises the question how to deal with ip_vs_sync?  Does it
  work together with conntrackd?  Wild idea: what about getting rid of
  ip_vs_sync and piggy packing all on nf_conntrack and use conntrackd?

  Any comments on this?

- xt_ipvs: add new rule '--vportctl port' to match the VIP port of the
  controlling connection, e.g. port 21 for FTP.  Can be used to match
  a related data connection for FTP:

  # SNAT FTP control connection
  % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
  > --vport 21 -j SNAT --to-source 192.168.10.10
  
  # SNAT FTP passive data connection
  % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
  > --vportctl 21 -j SNAT --to-source 192.168.10.10

- xt_ipvs: use 'par->family' instead of 'skb->protocol'

- xt_ipvs: add ipvs_mt_check and restrict to NFPROTO_IPV4 and NFPROTO_IPV6

- Call nf_conntrack_alter_reply(), so helper lookup is performed based
  on the changed tuple.

Changes to the linux kernel (rebased to next-20090925):

Hannes Eder (3):
      netfilter: xt_ipvs (netfilter matcher for IPVS)
      IPVS: make friends with nf_conntrack
      IPVS: make FTP work with full NAT support


 include/linux/netfilter/xt_ipvs.h |   25 +++++
 include/net/ip_vs.h               |    2 
 net/netfilter/Kconfig             |    9 ++
 net/netfilter/Makefile            |    1 
 net/netfilter/ipvs/Kconfig        |    4 -
 net/netfilter/ipvs/ip_vs_app.c    |   43 ---------
 net/netfilter/ipvs/ip_vs_core.c   |   37 -------
 net/netfilter/ipvs/ip_vs_ftp.c    |  178 ++++++++++++++++++++++++++++++++---
 net/netfilter/ipvs/ip_vs_proto.c  |    1 
 net/netfilter/ipvs/ip_vs_xmit.c   |   30 ++++++
 net/netfilter/xt_ipvs.c           |  187 +++++++++++++++++++++++++++++++++++++
 11 files changed, 418 insertions(+), 99 deletions(-)
 create mode 100644 include/linux/netfilter/xt_ipvs.h
 create mode 100644 net/netfilter/xt_ipvs.c


Changes to iptables (relative to 1.4.5):

Hannes Eder (1):
      libxt_ipvs: user-space lib for netfilter matcher xt_ipvs

 configure.ac                      |   11 +
 extensions/libxt_ipvs.c           |  365 +++++++++++++++++++++++++++++++++++++
 extensions/libxt_ipvs.man         |   24 ++
 include/linux/netfilter/xt_ipvs.h |   25 +++
 4 files changed, 422 insertions(+), 3 deletions(-)
 create mode 100644 extensions/libxt_ipvs.c
 create mode 100644 extensions/libxt_ipvs.man
 create mode 100644 include/linux/netfilter/xt_ipvs.h

^ permalink raw reply

* Re: WARNING: at net/ipv4/af_inet.c:154 inet_sock_destruct
From: Francis Moreau @ 2009-09-29  9:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Kernel Mailing List, Linux Netdev List, David S. Miller
In-Reply-To: <4AC1D0F5.4050709@gmail.com>

On Tue, Sep 29, 2009 at 11:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Francis Moreau a écrit :
>>
>> It happens on 2.6.31 and older kernels as well though I don't remember
>> when it really started.
>
> Could you please try following patch ?

I'll report back the result at the end of the day (ie in 8 hours).

Thanks
-- 
Francis

^ permalink raw reply

* Re: [Fwd: Re: Bug#538372: header failure including netlink.h (or uio.h)]
From: Jarek Poplawski @ 2009-09-29  9:27 UTC (permalink / raw)
  To: Manuel Prinz; +Cc: netdev
In-Reply-To: <1254137084.4756.11.camel@ce170155.zmb.uni-duisburg-essen.de>

On 28-09-2009 13:24, Manuel Prinz wrote:
> Hi everyone,
> 
> I'm forwarding this bug in Debian (http://bugs.debian.org/538372) as
> requested by the Debian kernel team. A patch is available. Applying just
> the first hunk fixes the issue for me. I've not enough kernel knowledge
> to judge if this fix is a proper solution, though.
> 
> It would be really great if someone could have a look at it. Thanks in
> advance! (And please CC me in replies. Thanks!)

I've tried it with current include/linux and it works OK. Replacing
uio.h on Debian really was not enough, but it looks like missing
compiler.h entries could be the reason. Otherwise, please send your
compile error log.

Best regards,
Jarek P.

^ permalink raw reply

* Re: [PATCH 2/3] iwmc3200wifi: select IWMC3200TOP in Kconfig
From: Zhu Yi @ 2009-09-29  9:22 UTC (permalink / raw)
  To: Winkler, Tomas
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Perez-Gonzalez, Inaky, Kao, Cindy H, Cohen, Guy, Rindjunsky, Ron
In-Reply-To: <1253662724-16497-3-git-send-email-tomas.winkler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Wed, 2009-09-23 at 07:38 +0800, Winkler, Tomas wrote:
> iwmc3200wifi requires iwmc3200top  for its operation
> 
> Signed-off-by: Tomas Winkler <tomas.winkler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Acked-by: Zhu Yi <yi.zhu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Thanks,
-yi

> ---
>  drivers/net/wireless/iwmc3200wifi/Kconfig |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/wireless/iwmc3200wifi/Kconfig b/drivers/net/wireless/iwmc3200wifi/Kconfig
> index c62da43..69faaf1 100644
> --- a/drivers/net/wireless/iwmc3200wifi/Kconfig
> +++ b/drivers/net/wireless/iwmc3200wifi/Kconfig
> @@ -3,6 +3,7 @@ config IWM
>  	depends on MMC && WLAN_80211 && EXPERIMENTAL
>  	depends on CFG80211
>  	select FW_LOADER
> +	select IWMC3200TOP
>  	help
>  	  The Intel Wireless Multicomm 3200 hardware is a combo
>  	  card with GPS, Bluetooth, WiMax and 802.11 radios. It

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: WARNING: at net/ipv4/af_inet.c:154 inet_sock_destruct
From: Eric Dumazet @ 2009-09-29  9:18 UTC (permalink / raw)
  To: Francis Moreau
  Cc: Linux Kernel Mailing List, Linux Netdev List, David S. Miller
In-Reply-To: <38b2ab8a0909290109m3f82c161j4fb0f1266152877e@mail.gmail.com>

Francis Moreau a écrit :
> Hello,
> 
> I got this kernel warning when stopping nfsd:
> 
> [260104.553720] WARNING: at net/ipv4/af_inet.c:154
> inet_sock_destruct+0x164/0x182()
> [260104.553722] Hardware name: P5K-VM
> [260104.553724] Modules linked in: jfs loop nfsd lockd nfs_acl
> auth_rpcgss exportfs sunrpc [last unloaded: microcode]
> [260104.553736] Pid: 858, comm: nfsd Tainted: G   M       2.6.31 #13
> [260104.553738] Call Trace:
> [260104.553743]  [<ffffffff813ed53a>] ? inet_sock_destruct+0x164/0x182
> [260104.553748]  [<ffffffff81044471>] warn_slowpath_common+0x7c/0xa9
> [260104.553751]  [<ffffffff810444b2>] warn_slowpath_null+0x14/0x16
> [260104.553754]  [<ffffffff813ed53a>] inet_sock_destruct+0x164/0x182
> [260104.553759]  [<ffffffff8138e1c0>] __sk_free+0x23/0xe7
> [260104.553762]  [<ffffffff8138e2fd>] sk_free+0x1f/0x21
> [260104.553765]  [<ffffffff8138e3c7>] sk_common_release+0xc8/0xcd
> [260104.553769]  [<ffffffff813e4459>] udp_lib_close+0xe/0x10
> [260104.553772]  [<ffffffff813ecfe2>] inet_release+0x55/0x5c
> [260104.553775]  [<ffffffff8138b746>] sock_release+0x1f/0x71
> [260104.553778]  [<ffffffff8138b7bf>] sock_close+0x27/0x2b
> [260104.553782]  [<ffffffff810d0641>] __fput+0xfb/0x1c0
> [260104.553787]  [<ffffffff8104a197>] ? local_bh_disable+0x12/0x14
> [260104.553790]  [<ffffffff810d0723>] fput+0x1d/0x1f
> [260104.553810]  [<ffffffffa0014035>] svc_sock_free+0x40/0x56 [sunrpc]
> [260104.553827]  [<ffffffffa001dea0>] svc_xprt_free+0x43/0x53 [sunrpc]
> [260104.553843]  [<ffffffffa001de5d>] ? svc_xprt_free+0x0/0x53 [sunrpc]
> [260104.553847]  [<ffffffff811b4641>] kref_put+0x43/0x4f
> [260104.553863]  [<ffffffffa001d224>] svc_close_xprt+0x55/0x5e [sunrpc]
> [260104.553879]  [<ffffffffa001d27d>] svc_close_all+0x50/0x69 [sunrpc]
> [260104.553894]  [<ffffffffa0012922>] svc_destroy+0x9e/0x142 [sunrpc]
> [260104.553910]  [<ffffffffa0012a7f>] svc_exit_thread+0xb9/0xc2 [sunrpc]
> [260104.553922]  [<ffffffffa00707b1>] ? nfsd+0x0/0x151 [nfsd]
> [260104.553932]  [<ffffffffa00708e8>] nfsd+0x137/0x151 [nfsd]
> [260104.553936]  [<ffffffff8105ad28>] kthread+0x94/0x9c
> [260104.553941]  [<ffffffff8100c1fa>] child_rip+0xa/0x20
> [260104.553944]  [<ffffffff81047b00>] ? do_exit+0x5d7/0x691
> [260104.553948]  [<ffffffff81039cf8>] ? finish_task_switch+0x6a/0xc7
> [260104.553953]  [<ffffffff8100bb6d>] ? restore_args+0x0/0x30
> [260104.553956]  [<ffffffff8105ac94>] ? kthread+0x0/0x9c
> [260104.553959]  [<ffffffff8100c1f0>] ? child_rip+0x0/0x20
> 
> It happens on 2.6.31 and older kernels as well though I don't remember
> when it really started.

Could you please try following patch ?

Thanks

[PATCH] net: Fix sock_wfree() race

Commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
(net: No more expensive sock_hold()/sock_put() on each tx)
opens a window in sock_wfree() where another cpu
might free the socket we are working on.

A fix is to call sk->sk_write_space(sk) while still
holding a reference on sk.


Reported-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/core/sock.c |   19 ++++++++++++-------
 1 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 30d5446..e1f034e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1228,17 +1228,22 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
-	int res;
+	unsigned int len = skb->truesize;
 
-	/* In case it might be waiting for more memory. */
-	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
-	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
+	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
+		/*
+		 * Keep a reference on sk_wmem_alloc, this will be released
+		 * after sk_write_space() call
+		 */
+		atomic_sub(len - 1, &sk->sk_wmem_alloc);
 		sk->sk_write_space(sk);
+		len = 1;
+	}
 	/*
-	 * if sk_wmem_alloc reached 0, we are last user and should
-	 * free this sock, as sk_free() call could not do it.
+	 * if sk_wmem_alloc reaches 0, we must finish what sk_free()
+	 * could not do because of in-flight packets
 	 */
-	if (res == 0)
+	if (atomic_sub_and_test(len, &sk->sk_wmem_alloc))
 		__sk_free(sk);
 }
 EXPORT_SYMBOL(sock_wfree);

^ permalink raw reply related

* Re: [PATCH] /proc/net/tcp, overhead removed
From: Yakov Lerner @ 2009-09-29  8:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem
In-Reply-To: <4AC1BDAD.1010400@gmail.com>

On Tue, Sep 29, 2009 at 10:56, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Yakov Lerner a écrit :
> > Take 2.
> >
> > "Sharp improvement in performance of /proc/net/tcp when number of
> > sockets is large and hashsize is large.
> > O(numsock * hashsize) time becomes O(numsock + hashsize). On slow
> > processors, speed difference can be x100 and more."
> >
> > I must say that I'm not fully satisfied with my choice of "st->sbucket"
> > for the new preserved index. The better name would be "st->snum".
> > Re-using "st->sbucket" saves 4 bytes, and keeps the patch to one sourcefile.
> > But "st->sbucket" has different meaning in OPENREQ and LISTEN states;
> > this can be confusing.
> > Maybe better add "snum" member to struct tcp_iter_state ?
> >
> > Shall I change subject when sending "take N+1", or keep the old subject ?
> >
> > Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
> > ---
> >  net/ipv4/tcp_ipv4.c |   35 +++++++++++++++++++++++++++++++++--
> >  1 files changed, 33 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 7cda24b..e4c4f19 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
> >               hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
> >  }
> >
> > -static void *established_get_first(struct seq_file *seq)
> > +static void *established_get_first_after(struct seq_file *seq, int bucket)
> >  {
> >       struct tcp_iter_state *st = seq->private;
> >       struct net *net = seq_file_net(seq);
> >       void *rc = NULL;
> >
> > -     for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
> > +     for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
> > +          ++st->bucket) {
> >               struct sock *sk;
> >               struct hlist_nulls_node *node;
> >               struct inet_timewait_sock *tw;
> > @@ -2010,6 +2011,8 @@ static void *established_get_first(struct seq_file *seq)
> >               if (empty_bucket(st))
> >                       continue;
> >
> > +             st->sbucket = st->num;
> > +
> >               spin_lock_bh(lock);
> >               sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
> >                       if (sk->sk_family != st->family ||
> > @@ -2036,6 +2039,11 @@ out:
> >       return rc;
> >  }
> >
> > +static void *established_get_first(struct seq_file *seq)
> > +{
> > +     return established_get_first_after(seq, 0);
> > +}
> > +
> >  static void *established_get_next(struct seq_file *seq, void *cur)
> >  {
> >       struct sock *sk = cur;
> > @@ -2064,6 +2072,9 @@ get_tw:
> >               while (++st->bucket < tcp_hashinfo.ehash_size &&
> >                               empty_bucket(st))
> >                       ;
> > +
> > +             st->sbucket = st->num;
> > +
> >               if (st->bucket >= tcp_hashinfo.ehash_size)
> >                       return NULL;
> >
> > @@ -2107,6 +2118,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
> >
> >       if (!rc) {
> >               st->state = TCP_SEQ_STATE_ESTABLISHED;
> > +             st->sbucket = 0;
> >               rc        = established_get_idx(seq, pos);
> >       }
> >
> > @@ -2116,6 +2128,25 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
> >  static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
> >  {
> >       struct tcp_iter_state *st = seq->private;
> > +
> > +     if (*pos && *pos >= st->sbucket &&
> > +         (st->state == TCP_SEQ_STATE_ESTABLISHED ||
> > +          st->state == TCP_SEQ_STATE_TIME_WAIT)) {
> > +             void *cur;
> > +             int nskip;
> > +
> > +             /* for states estab and tw, st->sbucket is index (*pos) */
> > +             /* corresponding to the beginning of bucket st->bucket */
> > +
> > +             st->num = st->sbucket;
> > +             /* jump to st->bucket, then skip (*pos - st->sbucket) items */
> > +             st->state = TCP_SEQ_STATE_ESTABLISHED;
> > +             cur = established_get_first_after(seq, st->bucket);
> > +             for (nskip = *pos - st->num; cur && nskip > 0; --nskip)
> > +                     cur = established_get_next(seq, cur);
> > +             return cur;
> > +     }
> > +
> >       st->state = TCP_SEQ_STATE_LISTENING;
> >       st->num = 0;
> >       return *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
>
> Just in case you are working on "take 3" of the patch, there is a fondamental problem.
>
> All the scalability problems come from the fact that tcp_seq_start()
> *has* to rescan all the tables from the begining, because of lseek() capability
> on /proc/net/tcp file
>
> We probably could disable llseek() (on other positions than start of the file),
> and rely only on internal state (listening/established hashtable, hash bucket, position in chain)
>
> I cannot imagine how an application could rely on lseek() on >0 position in this file.


I thought  /proc/net/tcp  can  both  be fast and allow lseek;
(1) when no lseek was issued since last read
(we can detect this), /proc/net/tcp can jump to the
last known bucket (common case), vs
(2) switch to slow mode (scan from the beginning of hash)
when lseek was used , no ?

^ permalink raw reply

* Re: [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Chris Wright @ 2009-09-29  8:53 UTC (permalink / raw)
  To: Shreyas Bhatewara
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Stephen Hemminger, David S. Miller, Jeff Garzik, Anthony Liguori,
	Chris Wright, Greg Kroah-Hartman, Andrew Morton, virtualization,
	pv-drivers@vmware.com
In-Reply-To: <89E2752CFA8EC044846EB849981913410173CDFAF6@EXCH-MBX-4.vmware.com>

* Shreyas Bhatewara (sbhatewara@vmware.com) wrote:
> Some of the features of vmxnet3 are :
>         PCIe 2.0 compliant PCI device: Vendor ID 0x15ad, Device ID 0x07b0
>         INTx, MSI, MSI-X (25 vectors) interrupts
>         16 Rx queues, 8 Tx queues

Driver doesn't appear to actually support more than a single MSI-X interrupt.
What is your plan for doing real multiqueue?

>         Offloads: TCP/UDP checksum, TSO over IPv4/IPv6,
>                     802.1q VLAN tag insertion, filtering, stripping
>                     Multicast filtering, Jumbo Frames

How about GRO conversion?

>         Wake-on-LAN, PCI Power Management D0-D3 states
>         PXE-ROM for boot support
> 

Whole thing appears to be space indented, and is fairly noisy w/ printk.
Also, heavy use of BUG_ON() (counted 51 of them), are you sure that none
of them can be triggered by guest or remote (esp. the ones that happen
in interrupt context)?  Some initial thoughts below.

<snip>
> diff --git a/drivers/net/vmxnet3/upt1_defs.h b/drivers/net/vmxnet3/upt1_defs.h
> new file mode 100644
> index 0000000..b50f91b
> --- /dev/null
> +++ b/drivers/net/vmxnet3/upt1_defs.h
> @@ -0,0 +1,104 @@
> +/*
> + * Linux driver for VMware's vmxnet3 ethernet NIC.
> + *
> + * Copyright (C) 2008-2009, VMware, Inc. All Rights Reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; version 2 of the License and no later version.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + * NON INFRINGEMENT.  See the GNU General Public License for more
> + * details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
> + *
> + * The full GNU General Public License is included in this distribution in
> + * the file called "COPYING".
> + *
> + * Maintained by: Shreyas Bhatewara <pv-drivers@vmware.com>
> + *
> + */
> +
> +/* upt1_defs.h
> + *
> + *      Definitions for Uniform Pass Through.
> + */

Most of the source files have this format (some include -- after file
name).  Could just keep it all w/in the same comment block.  Since you
went to the trouble of saying what the file does, something a tad more
descriptive would be welcome.

> +
> +#ifndef _UPT1_DEFS_H
> +#define _UPT1_DEFS_H
> +
> +#define UPT1_MAX_TX_QUEUES  64
> +#define UPT1_MAX_RX_QUEUES  64

This is different than the 16/8 described above (and seemingly all moot
since it becomes a single queue device).

> +
> +/* interrupt moderation level */
> +#define UPT1_IML_NONE     0 /* no interrupt moderation */
> +#define UPT1_IML_HIGHEST  7 /* least intr generated */
> +#define UPT1_IML_ADAPTIVE 8 /* adpative intr moderation */

enum?  also only appears to support adaptive mode?

> +/* values for UPT1_RSSConf.hashFunc */
> +enum {
> +       UPT1_RSS_HASH_TYPE_NONE      = 0x0,
> +       UPT1_RSS_HASH_TYPE_IPV4      = 0x01,
> +       UPT1_RSS_HASH_TYPE_TCP_IPV4  = 0x02,
> +       UPT1_RSS_HASH_TYPE_IPV6      = 0x04,
> +       UPT1_RSS_HASH_TYPE_TCP_IPV6  = 0x08,
> +};
> +
> +enum {
> +       UPT1_RSS_HASH_FUNC_NONE      = 0x0,
> +       UPT1_RSS_HASH_FUNC_TOEPLITZ  = 0x01,
> +};
> +
> +#define UPT1_RSS_MAX_KEY_SIZE        40
> +#define UPT1_RSS_MAX_IND_TABLE_SIZE  128
> +
> +struct UPT1_RSSConf {
> +       uint16_t   hashType;
> +       uint16_t   hashFunc;
> +       uint16_t   hashKeySize;
> +       uint16_t   indTableSize;
> +       uint8_t    hashKey[UPT1_RSS_MAX_KEY_SIZE];
> +       uint8_t    indTable[UPT1_RSS_MAX_IND_TABLE_SIZE];
> +};
> +
> +/* features */
> +enum {
> +       UPT1_F_RXCSUM      = 0x0001,   /* rx csum verification */
> +       UPT1_F_RSS         = 0x0002,
> +       UPT1_F_RXVLAN      = 0x0004,   /* VLAN tag stripping */
> +       UPT1_F_LRO         = 0x0008,
> +};
> +#endif
> diff --git a/drivers/net/vmxnet3/vmxnet3_defs.h b/drivers/net/vmxnet3/vmxnet3_defs.h
> new file mode 100644
> index 0000000..a33a90b
> --- /dev/null
> +++ b/drivers/net/vmxnet3/vmxnet3_defs.h
> @@ -0,0 +1,534 @@
> +/*
> + * Linux driver for VMware's vmxnet3 ethernet NIC.
> + *
> + * Copyright (C) 2008-2009, VMware, Inc. All Rights Reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; version 2 of the License and no later version.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
> + * NON INFRINGEMENT.  See the GNU General Public License for more
> + * details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
> + *
> + * The full GNU General Public License is included in this distribution in
> + * the file called "COPYING".
> + *
> + * Maintained by: Shreyas Bhatewara <pv-drivers@vmware.com>
> + *
> + */
> +
> +/*
> + * vmxnet3_defs.h --

Not particularly useful ;-)

> + */
> +
> +#ifndef _VMXNET3_DEFS_H_
> +#define _VMXNET3_DEFS_H_
> +
> +#include "upt1_defs.h"
> +
> +/* all registers are 32 bit wide */
> +/* BAR 1 */
> +enum {
> +       VMXNET3_REG_VRRS  = 0x0,        /* Vmxnet3 Revision Report Selection */
> +       VMXNET3_REG_UVRS  = 0x8,        /* UPT Version Report Selection */
> +       VMXNET3_REG_DSAL  = 0x10,       /* Driver Shared Address Low */
> +       VMXNET3_REG_DSAH  = 0x18,       /* Driver Shared Address High */
> +       VMXNET3_REG_CMD   = 0x20,       /* Command */
> +       VMXNET3_REG_MACL  = 0x28,       /* MAC Address Low */
> +       VMXNET3_REG_MACH  = 0x30,       /* MAC Address High */
> +       VMXNET3_REG_ICR   = 0x38,       /* Interrupt Cause Register */
> +       VMXNET3_REG_ECR   = 0x40        /* Event Cause Register */
> +};
> +
> +/* BAR 0 */
> +enum {
> +       VMXNET3_REG_IMR      = 0x0,     /* Interrupt Mask Register */
> +       VMXNET3_REG_TXPROD   = 0x600,   /* Tx Producer Index */
> +       VMXNET3_REG_RXPROD   = 0x800,   /* Rx Producer Index for ring 1 */
> +       VMXNET3_REG_RXPROD2  = 0xA00    /* Rx Producer Index for ring 2 */
> +};
> +
> +#define VMXNET3_PT_REG_SIZE     4096   /* BAR 0 */
> +#define VMXNET3_VD_REG_SIZE     4096   /* BAR 1 */
> +
> +#define VMXNET3_REG_ALIGN       8      /* All registers are 8-byte aligned. */
> +#define VMXNET3_REG_ALIGN_MASK  0x7
> +
> +/* I/O Mapped access to registers */
> +#define VMXNET3_IO_TYPE_PT              0
> +#define VMXNET3_IO_TYPE_VD              1
> +#define VMXNET3_IO_ADDR(type, reg)      (((type) << 24) | ((reg) & 0xFFFFFF))
> +#define VMXNET3_IO_TYPE(addr)           ((addr) >> 24)
> +#define VMXNET3_IO_REG(addr)            ((addr) & 0xFFFFFF)
> +
> +enum {
> +       VMXNET3_CMD_FIRST_SET = 0xCAFE0000,
> +       VMXNET3_CMD_ACTIVATE_DEV = VMXNET3_CMD_FIRST_SET,
> +       VMXNET3_CMD_QUIESCE_DEV,
> +       VMXNET3_CMD_RESET_DEV,
> +       VMXNET3_CMD_UPDATE_RX_MODE,
> +       VMXNET3_CMD_UPDATE_MAC_FILTERS,
> +       VMXNET3_CMD_UPDATE_VLAN_FILTERS,
> +       VMXNET3_CMD_UPDATE_RSSIDT,
> +       VMXNET3_CMD_UPDATE_IML,
> +       VMXNET3_CMD_UPDATE_PMCFG,
> +       VMXNET3_CMD_UPDATE_FEATURE,
> +       VMXNET3_CMD_LOAD_PLUGIN,
> +
> +       VMXNET3_CMD_FIRST_GET = 0xF00D0000,
> +       VMXNET3_CMD_GET_QUEUE_STATUS = VMXNET3_CMD_FIRST_GET,
> +       VMXNET3_CMD_GET_STATS,
> +       VMXNET3_CMD_GET_LINK,
> +       VMXNET3_CMD_GET_PERM_MAC_LO,
> +       VMXNET3_CMD_GET_PERM_MAC_HI,
> +       VMXNET3_CMD_GET_DID_LO,
> +       VMXNET3_CMD_GET_DID_HI,
> +       VMXNET3_CMD_GET_DEV_EXTRA_INFO,
> +       VMXNET3_CMD_GET_CONF_INTR
> +};
> +
> +struct Vmxnet3_TxDesc {
> +       uint64_t addr;
> +
> +       uint32_t len:14;
> +       uint32_t gen:1;      /* generation bit */
> +       uint32_t rsvd:1;
> +       uint32_t dtype:1;    /* descriptor type */
> +       uint32_t ext1:1;
> +       uint32_t msscof:14;  /* MSS, checksum offset, flags */
> +
> +       uint32_t hlen:10;    /* header len */
> +       uint32_t om:2;       /* offload mode */
> +       uint32_t eop:1;      /* End Of Packet */
> +       uint32_t cq:1;       /* completion request */
> +       uint32_t ext2:1;
> +       uint32_t ti:1;       /* VLAN Tag Insertion */
> +       uint32_t tci:16;     /* Tag to Insert */
> +};
> +
> +/* TxDesc.OM values */
> +#define VMXNET3_OM_NONE  0
> +#define VMXNET3_OM_CSUM  2
> +#define VMXNET3_OM_TSO   3
> +
> +/* fields in TxDesc we access w/o using bit fields */
> +#define VMXNET3_TXD_EOP_SHIFT 12
> +#define VMXNET3_TXD_CQ_SHIFT  13
> +#define VMXNET3_TXD_GEN_SHIFT 14
> +
> +#define VMXNET3_TXD_CQ  (1 << VMXNET3_TXD_CQ_SHIFT)
> +#define VMXNET3_TXD_EOP (1 << VMXNET3_TXD_EOP_SHIFT)
> +#define VMXNET3_TXD_GEN (1 << VMXNET3_TXD_GEN_SHIFT)
> +
> +#define VMXNET3_HDR_COPY_SIZE   128
> +
> +
> +struct Vmxnet3_TxDataDesc {
> +       uint8_t data[VMXNET3_HDR_COPY_SIZE];
> +};
> +
> +
> +struct Vmxnet3_TxCompDesc {
> +       uint32_t txdIdx:12;    /* Index of the EOP TxDesc */
> +       uint32_t ext1:20;
> +
> +       uint32_t ext2;
> +       uint32_t ext3;
> +
> +       uint32_t rsvd:24;
> +       uint32_t type:7;       /* completion type */
> +       uint32_t gen:1;        /* generation bit */
> +};
> +
> +
> +struct Vmxnet3_RxDesc {
> +       uint64_t addr;
> +
> +       uint32_t len:14;
> +       uint32_t btype:1;      /* Buffer Type */
> +       uint32_t dtype:1;      /* Descriptor type */
> +       uint32_t rsvd:15;
> +       uint32_t gen:1;        /* Generation bit */
> +
> +       uint32_t ext1;
> +};
> +
> +/* values of RXD.BTYPE */
> +#define VMXNET3_RXD_BTYPE_HEAD   0    /* head only */
> +#define VMXNET3_RXD_BTYPE_BODY   1    /* body only */
> +
> +/* fields in RxDesc we access w/o using bit fields */
> +#define VMXNET3_RXD_BTYPE_SHIFT  14
> +#define VMXNET3_RXD_GEN_SHIFT    31
> +
> +
> +struct Vmxnet3_RxCompDesc {
> +       uint32_t rxdIdx:12;    /* Index of the RxDesc */
> +       uint32_t ext1:2;
> +       uint32_t eop:1;        /* End of Packet */
> +       uint32_t sop:1;        /* Start of Packet */
> +       uint32_t rqID:10;      /* rx queue/ring ID */
> +       uint32_t rssType:4;    /* RSS hash type used */
> +       uint32_t cnc:1;        /* Checksum Not Calculated */
> +       uint32_t ext2:1;
> +
> +       uint32_t rssHash;      /* RSS hash value */
> +
> +       uint32_t len:14;       /* data length */
> +       uint32_t err:1;        /* Error */
> +       uint32_t ts:1;         /* Tag is stripped */
> +       uint32_t tci:16;       /* Tag stripped */
> +
> +       uint32_t csum:16;
> +       uint32_t tuc:1;        /* TCP/UDP Checksum Correct */
> +       uint32_t udp:1;        /* UDP packet */
> +       uint32_t tcp:1;        /* TCP packet */
> +       uint32_t ipc:1;        /* IP Checksum Correct */
> +       uint32_t v6:1;         /* IPv6 */
> +       uint32_t v4:1;         /* IPv4 */
> +       uint32_t frg:1;        /* IP Fragment */
> +       uint32_t fcs:1;        /* Frame CRC correct */
> +       uint32_t type:7;       /* completion type */
> +       uint32_t gen:1;        /* generation bit */
> +};
> +
> +/* fields in RxCompDesc we access via Vmxnet3_GenericDesc.dword[3] */
> +#define VMXNET3_RCD_TUC_SHIFT  16
> +#define VMXNET3_RCD_IPC_SHIFT  19
> +
> +/* fields in RxCompDesc we access via Vmxnet3_GenericDesc.qword[1] */
> +#define VMXNET3_RCD_TYPE_SHIFT 56
> +#define VMXNET3_RCD_GEN_SHIFT  63
> +
> +/* csum OK for TCP/UDP pkts over IP */
> +#define VMXNET3_RCD_CSUM_OK (1 << VMXNET3_RCD_TUC_SHIFT | \
> +                            1 << VMXNET3_RCD_IPC_SHIFT)
> +
> +/* value of RxCompDesc.rssType */
> +enum {
> +       VMXNET3_RCD_RSS_TYPE_NONE     = 0,
> +       VMXNET3_RCD_RSS_TYPE_IPV4     = 1,
> +       VMXNET3_RCD_RSS_TYPE_TCPIPV4  = 2,
> +       VMXNET3_RCD_RSS_TYPE_IPV6     = 3,
> +       VMXNET3_RCD_RSS_TYPE_TCPIPV6  = 4,
> +};
> +
> +/* a union for accessing all cmd/completion descriptors */
> +union Vmxnet3_GenericDesc {
> +       uint64_t                        qword[2];
> +       uint32_t                        dword[4];
> +       uint16_t                        word[8];
> +       struct Vmxnet3_TxDesc           txd;
> +       struct Vmxnet3_RxDesc           rxd;
> +       struct Vmxnet3_TxCompDesc       tcd;
> +       struct Vmxnet3_RxCompDesc       rcd;
> +};
> +
> +#define VMXNET3_INIT_GEN       1
> +
> +/* Max size of a single tx buffer */
> +#define VMXNET3_MAX_TX_BUF_SIZE  (1 << 14)
> +
> +/* # of tx desc needed for a tx buffer size */
> +#define VMXNET3_TXD_NEEDED(size) (((size) + VMXNET3_MAX_TX_BUF_SIZE - 1) / \
> +                                 VMXNET3_MAX_TX_BUF_SIZE)
> +
> +/* max # of tx descs for a non-tso pkt */
> +#define VMXNET3_MAX_TXD_PER_PKT 16
> +
> +/* Max size of a single rx buffer */
> +#define VMXNET3_MAX_RX_BUF_SIZE  ((1 << 14) - 1)
> +/* Minimum size of a type 0 buffer */
> +#define VMXNET3_MIN_T0_BUF_SIZE  128
> +#define VMXNET3_MAX_CSUM_OFFSET  1024
> +
> +/* Ring base address alignment */
> +#define VMXNET3_RING_BA_ALIGN   512
> +#define VMXNET3_RING_BA_MASK    (VMXNET3_RING_BA_ALIGN - 1)
> +
> +/* Ring size must be a multiple of 32 */
> +#define VMXNET3_RING_SIZE_ALIGN 32
> +#define VMXNET3_RING_SIZE_MASK  (VMXNET3_RING_SIZE_ALIGN - 1)
> +
> +/* Max ring size */
> +#define VMXNET3_TX_RING_MAX_SIZE   4096
> +#define VMXNET3_TC_RING_MAX_SIZE   4096
> +#define VMXNET3_RX_RING_MAX_SIZE   4096
> +#define VMXNET3_RC_RING_MAX_SIZE   8192
> +
> +/* a list of reasons for queue stop */
> +
> +enum {
> + VMXNET3_ERR_NOEOP        = 0x80000000,  /* cannot find the EOP desc of a pkt */
> + VMXNET3_ERR_TXD_REUSE    = 0x80000001,  /* reuse TxDesc before tx completion */
> + VMXNET3_ERR_BIG_PKT      = 0x80000002,  /* too many TxDesc for a pkt */
> + VMXNET3_ERR_DESC_NOT_SPT = 0x80000003,  /* descriptor type not supported */
> + VMXNET3_ERR_SMALL_BUF    = 0x80000004,  /* type 0 buffer too small */
> + VMXNET3_ERR_STRESS       = 0x80000005,  /* stress option firing in vmkernel */
> + VMXNET3_ERR_SWITCH       = 0x80000006,  /* mode switch failure */
> + VMXNET3_ERR_TXD_INVALID  = 0x80000007,  /* invalid TxDesc */
> +};
> +
> +/* completion descriptor types */
> +#define VMXNET3_CDTYPE_TXCOMP      0    /* Tx Completion Descriptor */
> +#define VMXNET3_CDTYPE_RXCOMP      3    /* Rx Completion Descriptor */
> +
> +enum {
> +       VMXNET3_GOS_BITS_UNK    = 0,   /* unknown */
> +       VMXNET3_GOS_BITS_32     = 1,
> +       VMXNET3_GOS_BITS_64     = 2,
> +};
> +
> +#define VMXNET3_GOS_TYPE_LINUX 1
> +
> +/* All structures in DriverShared are padded to multiples of 8 bytes */
> +
> +
> +struct Vmxnet3_GOSInfo {
> +       uint32_t   gosBits:2;   /* 32-bit or 64-bit? */
> +       uint32_t   gosType:4;   /* which guest */
> +       uint32_t   gosVer:16;   /* gos version */
> +       uint32_t   gosMisc:10;  /* other info about gos */
> +};
> +
> +
> +struct Vmxnet3_DriverInfo {
> +       uint32_t          version;        /* driver version */
> +       struct Vmxnet3_GOSInfo gos;
> +       uint32_t          vmxnet3RevSpt;  /* vmxnet3 revision supported */
> +       uint32_t          uptVerSpt;      /* upt version supported */
> +};
> +
> +#define VMXNET3_REV1_MAGIC  0xbabefee1
> 
> +
> +/*
> + * QueueDescPA must be 128 bytes aligned. It points to an array of
> + * Vmxnet3_TxQueueDesc followed by an array of Vmxnet3_RxQueueDesc.
> + * The number of Vmxnet3_TxQueueDesc/Vmxnet3_RxQueueDesc are specified by
> + * Vmxnet3_MiscConf.numTxQueues/numRxQueues, respectively.
> + */
> +#define VMXNET3_QUEUE_DESC_ALIGN  128

Lot of inconsistent spacing between types and names in the structure def'ns

> +struct Vmxnet3_MiscConf {
> +       struct Vmxnet3_DriverInfo driverInfo;
> +       uint64_t             uptFeatures;
> +       uint64_t             ddPA;         /* driver data PA */
> +       uint64_t             queueDescPA;  /* queue descriptor table PA */
> +       uint32_t             ddLen;        /* driver data len */
> +       uint32_t             queueDescLen; /* queue desc. table len in bytes */
> +       uint32_t             mtu;
> +       uint16_t             maxNumRxSG;
> +       uint8_t              numTxQueues;
> +       uint8_t              numRxQueues;
> +       uint32_t             reserved[4];
> +};

should this be packed (or others that are shared w/ device)?  i assume
you've already done 32 vs 64 here

> +struct Vmxnet3_TxQueueConf {
> +       uint64_t    txRingBasePA;
> +       uint64_t    dataRingBasePA;
> +       uint64_t    compRingBasePA;
> +       uint64_t    ddPA;         /* driver data */
> +       uint64_t    reserved;
> +       uint32_t    txRingSize;   /* # of tx desc */
> +       uint32_t    dataRingSize; /* # of data desc */
> +       uint32_t    compRingSize; /* # of comp desc */
> +       uint32_t    ddLen;        /* size of driver data */
> +       uint8_t     intrIdx;
> +       uint8_t     _pad[7];
> +};
> +
> +
> +struct Vmxnet3_RxQueueConf {
> +       uint64_t    rxRingBasePA[2];
> +       uint64_t    compRingBasePA;
> +       uint64_t    ddPA;            /* driver data */
> +       uint64_t    reserved;
> +       uint32_t    rxRingSize[2];   /* # of rx desc */
> +       uint32_t    compRingSize;    /* # of rx comp desc */
> +       uint32_t    ddLen;           /* size of driver data */
> +       uint8_t     intrIdx;
> +       uint8_t     _pad[7];
> +};
> +
> +enum vmxnet3_intr_mask_mode {
> +       VMXNET3_IMM_AUTO   = 0,
> +       VMXNET3_IMM_ACTIVE = 1,
> +       VMXNET3_IMM_LAZY   = 2
> +};
> +
> +enum vmxnet3_intr_type {
> +       VMXNET3_IT_AUTO = 0,
> +       VMXNET3_IT_INTX = 1,
> +       VMXNET3_IT_MSI  = 2,
> +       VMXNET3_IT_MSIX = 3
> +};
> +
> +#define VMXNET3_MAX_TX_QUEUES  8
> +#define VMXNET3_MAX_RX_QUEUES  16

different to UPT, I must've missed some layering here

> +/* addition 1 for events */
> +#define VMXNET3_MAX_INTRS      25
> +
> +
<snip>

> --- /dev/null
> +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> @@ -0,0 +1,2608 @@
> +/*
> + * Linux driver for VMware's vmxnet3 ethernet NIC.
<snip>
> +/*
> + * vmxnet3_drv.c --
> + *
> + *      Linux driver for VMware's vmxnet3 NIC
> + */

Not useful

> +static void
> +vmxnet3_enable_intr(struct vmxnet3_adapter *adapter, unsigned intr_idx)
> +{
> +       VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_IMR + intr_idx * 8, 0);

	writel(0, adapter->hw_addr0 + VMXNET3_REG_IMR + intr_idx * 8)

seems just as clear to me.

> +vmxnet3_enable_all_intrs(struct vmxnet3_adapter *adapter)
> +{
> +       int i;
> +
> +       for (i = 0; i < adapter->intr.num_intrs; i++)
> +               vmxnet3_enable_intr(adapter, i);
> +}
> +
> +static void
> +vmxnet3_disable_all_intrs(struct vmxnet3_adapter *adapter)
> +{
> +       int i;
> +
> +       for (i = 0; i < adapter->intr.num_intrs; i++)
> +               vmxnet3_disable_intr(adapter, i);
> +}

only ever num_intrs=1, so there's some plan to bump this up and make
these wrappers useful?

> +static void
> +vmxnet3_ack_events(struct vmxnet3_adapter *adapter, u32 events)
> +{
> +       VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_ECR, events);
> +}
> +
> +
> +static bool
> +vmxnet3_tq_stopped(struct vmxnet3_tx_queue *tq, struct vmxnet3_adapter *adapter)
> +{
> +       return netif_queue_stopped(adapter->netdev);
> +}
> +
> +
> +static void
> +vmxnet3_tq_start(struct vmxnet3_tx_queue *tq, struct vmxnet3_adapter  *adapter)
> +{
> +       tq->stopped = false;

is tq->stopped used besides just toggling back and forth?

> +       netif_start_queue(adapter->netdev);
> +}

> +static void
> +vmxnet3_process_events(struct vmxnet3_adapter *adapter)

Should be trivial to break out to it's own MSI-X vector, basically set
up to do that already.

> +{
> +       u32 events = adapter->shared->ecr;
> +       if (!events)
> +               return;
> +
> +       vmxnet3_ack_events(adapter, events);
> +
> +       /* Check if link state has changed */
> +       if (events & VMXNET3_ECR_LINK)
> +               vmxnet3_check_link(adapter);
> +
> +       /* Check if there is an error on xmit/recv queues */
> +       if (events & (VMXNET3_ECR_TQERR | VMXNET3_ECR_RQERR)) {
> +               VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD,
> +                                      VMXNET3_CMD_GET_QUEUE_STATUS);
> +
> +               if (adapter->tqd_start->status.stopped) {
> +                       printk(KERN_ERR "%s: tq error 0x%x\n",
> +                              adapter->netdev->name,
> +                              adapter->tqd_start->status.error);
> +               }
> +               if (adapter->rqd_start->status.stopped) {
> +                       printk(KERN_ERR "%s: rq error 0x%x\n",
> +                              adapter->netdev->name,
> +                              adapter->rqd_start->status.error);
> +               }
> +
> +               schedule_work(&adapter->work);
> +       }
> +}
<snip>

> +
> +       tq->buf_info = kcalloc(sizeof(tq->buf_info[0]), tq->tx_ring.size,
> +                              GFP_KERNEL);

kcalloc args look backwards

<snip>
> +static int
> +vmxnet3_alloc_pci_resources(struct vmxnet3_adapter *adapter, bool *dma64)
> +{
> +       int err;
> +       unsigned long mmio_start, mmio_len;
> +       struct pci_dev *pdev = adapter->pdev;
> +
> +       err = pci_enable_device(pdev);

looks ioport free, can be pci_enable_device_mem()...

> +       if (err) {
> +               printk(KERN_ERR "Failed to enable adapter %s: error %d\n",
> +                      pci_name(pdev), err);
> +               return err;
> +       }
> +
> +       if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
> +               if (pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64)) != 0) {
> +                       printk(KERN_ERR "pci_set_consistent_dma_mask failed "
> +                              "for adapter %s\n", pci_name(pdev));
> +                       err = -EIO;
> +                       goto err_set_mask;
> +               }
> +               *dma64 = true;
> +       } else {
> +               if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) != 0) {
> +                       printk(KERN_ERR "pci_set_dma_mask failed for adapter "
> +                              "%s\n",  pci_name(pdev));
> +                       err = -EIO;
> +                       goto err_set_mask;
> +               }
> +               *dma64 = false;
> +       }
> +
> +       err = pci_request_regions(pdev, vmxnet3_driver_name);

...pci_request_selected_regions()

> +       if (err) {
> +               printk(KERN_ERR "Failed to request region for adapter %s: "
> +                      "error %d\n", pci_name(pdev), err);
> +               goto err_set_mask;
> +       }
> +
> +       pci_set_master(pdev);
> +
> +       mmio_start = pci_resource_start(pdev, 0);
> +       mmio_len = pci_resource_len(pdev, 0);
> +       adapter->hw_addr0 = ioremap(mmio_start, mmio_len);
> +       if (!adapter->hw_addr0) {
> +               printk(KERN_ERR "Failed to map bar0 for adapter %s\n",
> +                      pci_name(pdev));
> +               err = -EIO;
> +               goto err_ioremap;
> +       }
> +
> +       mmio_start = pci_resource_start(pdev, 1);
> +       mmio_len = pci_resource_len(pdev, 1);
> +       adapter->hw_addr1 = ioremap(mmio_start, mmio_len);
> +       if (!adapter->hw_addr1) {
> +               printk(KERN_ERR "Failed to map bar1 for adapter %s\n",
> +                      pci_name(pdev));
> +               err = -EIO;
> +               goto err_bar1;
> +       }
> +       return 0;
> +
> +err_bar1:
> +       iounmap(adapter->hw_addr0);
> +err_ioremap:
> +       pci_release_regions(pdev);

...and pci_release_selected_regions()

> +err_set_mask:
> +       pci_disable_device(pdev);
> +       return err;
> +}
> +

<snip>
> +vmxnet3_declare_features(struct vmxnet3_adapter *adapter, bool dma64)
> +{
> +       struct net_device *netdev = adapter->netdev;
> +
> +       netdev->features = NETIF_F_SG |
> +               NETIF_F_HW_CSUM |
> +               NETIF_F_HW_VLAN_TX |
> +               NETIF_F_HW_VLAN_RX |
> +               NETIF_F_HW_VLAN_FILTER |
> +               NETIF_F_TSO |
> +               NETIF_F_TSO6;
> +
> +       printk(KERN_INFO "features: sg csum vlan jf tso tsoIPv6");
> +
> +       adapter->rxcsum = true;
> +       adapter->jumbo_frame = true;
> +
> +       if (!disable_lro) {
> +               adapter->lro = true;
> +               printk(" lro");
> +       }

Plan to switch to GRO?

> +       if (dma64) {
> +               netdev->features |= NETIF_F_HIGHDMA;
> +               printk(" highDMA");
> +       }
> +
> +       netdev->vlan_features = netdev->features;
> +       printk("\n");
> +}
> +
> +static int __devinit
> +vmxnet3_probe_device(struct pci_dev *pdev,
> +                    const struct pci_device_id *id)
> +{
> +       static const struct net_device_ops vmxnet3_netdev_ops = {
> +               .ndo_open  = vmxnet3_open,
> +               .ndo_stop  = vmxnet3_close,
> +               .ndo_start_xmit = vmxnet3_xmit_frame,
> +               .ndo_set_mac_address = vmxnet3_set_mac_addr,
> +               .ndo_change_mtu = vmxnet3_change_mtu,
> +               .ndo_get_stats = vmxnet3_get_stats,
> +               .ndo_tx_timeout = vmxnet3_tx_timeout,
> +               .ndo_set_multicast_list = vmxnet3_set_mc,
> +               .ndo_vlan_rx_register = vmxnet3_vlan_rx_register,
> +               .ndo_vlan_rx_add_vid = vmxnet3_vlan_rx_add_vid,
> +               .ndo_vlan_rx_kill_vid = vmxnet3_vlan_rx_kill_vid,
> +#   ifdef CONFIG_NET_POLL_CONTROLLER
> +               .ndo_poll_controller = vmxnet3_netpoll,
> +#   endif

#ifdef
#endif

is more typical style here

> +       };
> +       int err;
> +       bool dma64 = false; /* stupid gcc */
> +       u32 ver;
> +       struct net_device *netdev;
> +       struct vmxnet3_adapter *adapter;
> +       u8  mac[ETH_ALEN];

extra space between type and name

> +
> +       netdev = alloc_etherdev(sizeof(struct vmxnet3_adapter));
> +       if (!netdev) {
> +               printk(KERN_ERR "Failed to alloc ethernet device for adapter "
> +                       "%s\n", pci_name(pdev));
> +               return -ENOMEM;
> +       }
> +
> +       pci_set_drvdata(pdev, netdev);
> +       adapter = netdev_priv(netdev);
> +       adapter->netdev = netdev;
> +       adapter->pdev = pdev;
> +
> +       adapter->shared = pci_alloc_consistent(adapter->pdev,
> +                         sizeof(struct Vmxnet3_DriverShared),
> +                         &adapter->shared_pa);
> +       if (!adapter->shared) {
> +               printk(KERN_ERR "Failed to allocate memory for %s\n",
> +                       pci_name(pdev));
> +               err = -ENOMEM;
> +               goto err_alloc_shared;
> +       }
> +
> +       adapter->tqd_start  = pci_alloc_consistent(adapter->pdev,

extra space before =

> diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> new file mode 100644
> index 0000000..490577f
> --- /dev/null
> +++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> +#include "vmxnet3_int.h"
> +
> +struct vmxnet3_stat_desc {
> +       char desc[ETH_GSTRING_LEN];
> +       int  offset;
> +};
> +
> +
> +static u32
> +vmxnet3_get_rx_csum(struct net_device *netdev)
> +{
> +       struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +       return adapter->rxcsum;
> +}
> +
> +
> +static int
> +vmxnet3_set_rx_csum(struct net_device *netdev, u32 val)
> +{
> +       struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +
> +       if (adapter->rxcsum != val) {
> +               adapter->rxcsum = val;
> +               if (netif_running(netdev)) {
> +                       if (val)
> +                               adapter->shared->devRead.misc.uptFeatures |=
> +                                                               UPT1_F_RXCSUM;
> +                       else
> +                               adapter->shared->devRead.misc.uptFeatures &=
> +                                                               ~UPT1_F_RXCSUM;
> +
> +                       VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD,
> +                                              VMXNET3_CMD_UPDATE_FEATURE);
> +               }
> +       }
> +       return 0;
> +}
> +
> +
> +static u32
> +vmxnet3_get_tx_csum(struct net_device *netdev)
> +{
> +       return (netdev->features & NETIF_F_HW_CSUM) != 0;
> +}

Not needed

> +static int
> +vmxnet3_set_tx_csum(struct net_device *netdev, u32 val)
> +{
> +       if (val)
> +               netdev->features |= NETIF_F_HW_CSUM;
> +       else
> +               netdev->features &= ~NETIF_F_HW_CSUM;
> +
> +       return 0;
> +}

This is just ethtool_op_set_tx_hw_csum()

> +static int
> +vmxnet3_set_sg(struct net_device *netdev, u32 val)
> +{
> +       ethtool_op_set_sg(netdev, val);
> +       return 0;
> +}

Useless wrapper

> +static int
> +vmxnet3_set_tso(struct net_device *netdev, u32 val)
> +{
> +       ethtool_op_set_tso(netdev, val);
> +       return 0;
> +}

Useless wrapper

> +struct net_device_stats*
> +vmxnet3_get_stats(struct net_device *netdev)
> +{
> +       struct vmxnet3_adapter *adapter;
> +       struct vmxnet3_tq_driver_stats *drvTxStats;
> +       struct vmxnet3_rq_driver_stats *drvRxStats;
> +       struct UPT1_TxStats *devTxStats;
> +       struct UPT1_RxStats *devRxStats;
> +
> +       adapter = netdev_priv(netdev);
> +
> +       /* Collect the dev stats into the shared area */
> +       VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD, VMXNET3_CMD_GET_STATS);
> +
> +       /* Assuming that we have a single queue device */
> +       devTxStats = &adapter->tqd_start->stats;
> +       devRxStats = &adapter->rqd_start->stats;

Another single queue assumption

> +
> +       /* Get access to the driver stats per queue */
> +       drvTxStats = &adapter->tx_queue.stats;
> +       drvRxStats = &adapter->rx_queue.stats;
> +
> +       memset(&adapter->net_stats, 0, sizeof(adapter->net_stats));
> +
> +       adapter->net_stats.rx_packets = devRxStats->ucastPktsRxOK +
> +                                       devRxStats->mcastPktsRxOK +
> +                                       devRxStats->bcastPktsRxOK;
> +
> +       adapter->net_stats.tx_packets = devTxStats->ucastPktsTxOK +
> +                                       devTxStats->mcastPktsTxOK +
> +                                       devTxStats->bcastPktsTxOK;
> +
> +       adapter->net_stats.rx_bytes = devRxStats->ucastBytesRxOK +
> +                                       devRxStats->mcastBytesRxOK +
> +                                       devRxStats->bcastBytesRxOK;
> +
> +       adapter->net_stats.tx_bytes = devTxStats->ucastBytesTxOK +
> +                                       devTxStats->mcastBytesTxOK +
> +                                       devTxStats->bcastBytesTxOK;
> +
> +       adapter->net_stats.rx_errors = devRxStats->pktsRxError;
> +       adapter->net_stats.tx_errors = devTxStats->pktsTxError;
> +       adapter->net_stats.rx_dropped = drvRxStats->drop_total;
> +       adapter->net_stats.tx_dropped = drvTxStats->drop_total;
> +       adapter->net_stats.multicast =  devRxStats->mcastPktsRxOK;
> +
> +       return &adapter->net_stats;
> +}
> +
> +static int
> +vmxnet3_get_stats_count(struct net_device *netdev)
> +{
> +       return ARRAY_SIZE(vmxnet3_tq_dev_stats) +
> +               ARRAY_SIZE(vmxnet3_tq_driver_stats) +
> +               ARRAY_SIZE(vmxnet3_rq_dev_stats) +
> +               ARRAY_SIZE(vmxnet3_rq_driver_stats) +
> +               ARRAY_SIZE(vmxnet3_global_stats);
> +}
> +
> +
> +static int
> +vmxnet3_get_regs_len(struct net_device *netdev)
> +{
> +       return 20 * sizeof(u32);
> +}
> +
> +
> +static void
> +vmxnet3_get_drvinfo(struct net_device *netdev, struct ethtool_drvinfo *drvinfo)
> +{
> +       struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +
> +       strncpy(drvinfo->driver, vmxnet3_driver_name, sizeof(drvinfo->driver));
> +       drvinfo->driver[sizeof(drvinfo->driver) - 1] = '\0';
> +
> +       strncpy(drvinfo->version, VMXNET3_DRIVER_VERSION_REPORT,
> +               sizeof(drvinfo->version));
> +       drvinfo->driver[sizeof(drvinfo->version) - 1] = '\0';
> +
> +       strncpy(drvinfo->fw_version, "N/A", sizeof(drvinfo->fw_version));
> +       drvinfo->fw_version[sizeof(drvinfo->fw_version) - 1] = '\0';
> +
> +       strncpy(drvinfo->bus_info,   pci_name(adapter->pdev),
> +               ETHTOOL_BUSINFO_LEN);

simplify all these to strlcpy

> +       drvinfo->n_stats = vmxnet3_get_stats_count(netdev);
> +       drvinfo->testinfo_len = 0;
> +       drvinfo->eedump_len   = 0;
> +       drvinfo->regdump_len  = vmxnet3_get_regs_len(netdev);
> +}

> +static int
> +vmxnet3_set_ringparam(struct net_device *netdev,
> +               struct ethtool_ringparam *param)
> +{
> +       struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +       u32 new_tx_ring_size, new_rx_ring_size;
> +       u32 sz;
> +       int err = 0;
> +
> +       if (param->tx_pending == 0 || param->tx_pending >
> +                                               VMXNET3_TX_RING_MAX_SIZE) {
> +               printk(KERN_ERR "%s: invalid tx ring size %u\n", netdev->name,
> +                       param->tx_pending);

Seems noisy

> +               return -EINVAL;
> +       }
> +       if (param->rx_pending == 0 || param->rx_pending >
> +                                       VMXNET3_RX_RING_MAX_SIZE) {
> +               printk(KERN_ERR "%s: invalid rx ring size %u\n", netdev->name,
> +                       param->rx_pending);

Same here

> +               return -EINVAL;
> +       }
> +
> +       /* round it up to a multiple of VMXNET3_RING_SIZE_ALIGN */
> +       new_tx_ring_size = (param->tx_pending + VMXNET3_RING_SIZE_MASK) &
> +                                                       ~VMXNET3_RING_SIZE_MASK;
> +       new_tx_ring_size = min_t(u32, new_tx_ring_size,
> +                                VMXNET3_TX_RING_MAX_SIZE);
> +       BUG_ON(new_tx_ring_size > VMXNET3_TX_RING_MAX_SIZE);
> +       BUG_ON(new_tx_ring_size % VMXNET3_RING_SIZE_ALIGN != 0);

Don't use BUG_ON for validating user input

> +
> +       /* ring0 has to be a multiple of
> +        * rx_buf_per_pkt * VMXNET3_RING_SIZE_ALIGN
> +        */
> +       sz = adapter->rx_buf_per_pkt * VMXNET3_RING_SIZE_ALIGN;
> +       new_rx_ring_size = (param->rx_pending + sz - 1) / sz * sz;
> +       new_rx_ring_size = min_t(u32, new_rx_ring_size,
> +                                VMXNET3_RX_RING_MAX_SIZE / sz * sz);
> +       BUG_ON(new_rx_ring_size > VMXNET3_RX_RING_MAX_SIZE);
> +       BUG_ON(new_rx_ring_size % sz != 0);
> +
> +       if (new_tx_ring_size == adapter->tx_queue.tx_ring.size &&
> +                       new_rx_ring_size == adapter->rx_queue.rx_ring[0].size) {
> +               return 0;
> +       }
> +
> +       /*
> +        * Reset_work may be in the middle of resetting the device, wait for its
> +        * completion.
> +        */
> +       while (test_and_set_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state))
> +               msleep(1);
> +
> +       if (netif_running(netdev)) {
> +               vmxnet3_quiesce_dev(adapter);
> +               vmxnet3_reset_dev(adapter);
> +
> +               /* recreate the rx queue and the tx queue based on the
> +                * new sizes */
> +               vmxnet3_tq_destroy(&adapter->tx_queue, adapter);
> +               vmxnet3_rq_destroy(&adapter->rx_queue, adapter);
> +
> +               err = vmxnet3_create_queues(adapter, new_tx_ring_size,
> +                       new_rx_ring_size, VMXNET3_DEF_RX_RING_SIZE);
> +               if (err) {
> +                       /* failed, most likely because of OOM, try default
> +                        * size */
> +                       printk(KERN_ERR "%s: failed to apply new sizes, try the"
> +                               " default ones\n", netdev->name);
> +                       err = vmxnet3_create_queues(adapter,
> +                                                   VMXNET3_DEF_TX_RING_SIZE,
> +                                                   VMXNET3_DEF_RX_RING_SIZE,
> +                                                   VMXNET3_DEF_RX_RING_SIZE);
> +                       if (err) {
> +                               printk(KERN_ERR "%s: failed to create queues "
> +                                       "with default sizes. Closing it\n",
> +                                       netdev->name);
> +                               goto out;
> +                       }
> +               }
> +
> +               err = vmxnet3_activate_dev(adapter);
> +               if (err) {
> +                       printk(KERN_ERR "%s: failed to re-activate, error %d."
> +                               " Closing it\n", netdev->name, err);
> +                       goto out;

Going to out: anyway...

> +               }
> +       }
> +
> +out:
> +       clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
> +       if (err)
> +               vmxnet3_force_close(adapter);
> +
> +       return err;
> +}
> +
> +
> +static struct ethtool_ops vmxnet3_ethtool_ops = {
> +       .get_settings      = vmxnet3_get_settings,
> +       .get_drvinfo       = vmxnet3_get_drvinfo,
> +       .get_regs_len      = vmxnet3_get_regs_len,
> +       .get_regs          = vmxnet3_get_regs,
> +       .get_wol           = vmxnet3_get_wol,
> +       .set_wol           = vmxnet3_set_wol,
> +       .get_link          = ethtool_op_get_link,
> +       .get_rx_csum       = vmxnet3_get_rx_csum,
> +       .set_rx_csum       = vmxnet3_set_rx_csum,
> +       .get_tx_csum       = vmxnet3_get_tx_csum,
> +       .set_tx_csum       = vmxnet3_set_tx_csum,
> +       .get_sg            = ethtool_op_get_sg,
> +       .set_sg            = vmxnet3_set_sg,
> +       .get_tso           = ethtool_op_get_tso,
> +       .set_tso           = vmxnet3_set_tso,
> +       .get_strings       = vmxnet3_get_strings,
> +       .get_stats_count   = vmxnet3_get_stats_count,

use get_sset_count instead

> +       .get_ethtool_stats = vmxnet3_get_ethtool_stats,
> +       .get_ringparam     = vmxnet3_get_ringparam,
> +       .set_ringparam     = vmxnet3_set_ringparam,
> +};
> +
> +void vmxnet3_set_ethtool_ops(struct net_device *netdev)
> +{
> +       SET_ETHTOOL_OPS(netdev, &vmxnet3_ethtool_ops);
> +}
<snip>

^ permalink raw reply

* Re: [PATCH] /proc/net/tcp, overhead removed
From: Eric Dumazet @ 2009-09-29  7:56 UTC (permalink / raw)
  To: Yakov Lerner; +Cc: netdev, davem
In-Reply-To: <1254178906-5293-1-git-send-email-iler.ml@gmail.com>

Yakov Lerner a écrit :
> Take 2. 
> 
> "Sharp improvement in performance of /proc/net/tcp when number of 
> sockets is large and hashsize is large. 
> O(numsock * hashsize) time becomes O(numsock + hashsize). On slow
> processors, speed difference can be x100 and more."
> 
> I must say that I'm not fully satisfied with my choice of "st->sbucket" 
> for the new preserved index. The better name would be "st->snum". 
> Re-using "st->sbucket" saves 4 bytes, and keeps the patch to one sourcefile.
> But "st->sbucket" has different meaning in OPENREQ and LISTEN states;
> this can be confusing. 
> Maybe better add "snum" member to struct tcp_iter_state ?
> 
> Shall I change subject when sending "take N+1", or keep the old subject ?
> 
> Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
> ---
>  net/ipv4/tcp_ipv4.c |   35 +++++++++++++++++++++++++++++++++--
>  1 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 7cda24b..e4c4f19 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
>  		hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
>  }
>  
> -static void *established_get_first(struct seq_file *seq)
> +static void *established_get_first_after(struct seq_file *seq, int bucket)
>  {
>  	struct tcp_iter_state *st = seq->private;
>  	struct net *net = seq_file_net(seq);
>  	void *rc = NULL;
>  
> -	for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
> +	for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
> +	     ++st->bucket) {
>  		struct sock *sk;
>  		struct hlist_nulls_node *node;
>  		struct inet_timewait_sock *tw;
> @@ -2010,6 +2011,8 @@ static void *established_get_first(struct seq_file *seq)
>  		if (empty_bucket(st))
>  			continue;
>  
> +		st->sbucket = st->num;
> +
>  		spin_lock_bh(lock);
>  		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
>  			if (sk->sk_family != st->family ||
> @@ -2036,6 +2039,11 @@ out:
>  	return rc;
>  }
>  
> +static void *established_get_first(struct seq_file *seq)
> +{
> +	return established_get_first_after(seq, 0);
> +}
> +
>  static void *established_get_next(struct seq_file *seq, void *cur)
>  {
>  	struct sock *sk = cur;
> @@ -2064,6 +2072,9 @@ get_tw:
>  		while (++st->bucket < tcp_hashinfo.ehash_size &&
>  				empty_bucket(st))
>  			;
> +
> +		st->sbucket = st->num;
> +
>  		if (st->bucket >= tcp_hashinfo.ehash_size)
>  			return NULL;
>  
> @@ -2107,6 +2118,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>  
>  	if (!rc) {
>  		st->state = TCP_SEQ_STATE_ESTABLISHED;
> +		st->sbucket = 0;
>  		rc	  = established_get_idx(seq, pos);
>  	}
>  
> @@ -2116,6 +2128,25 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>  static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
>  {
>  	struct tcp_iter_state *st = seq->private;
> +
> +	if (*pos && *pos >= st->sbucket &&
> +	    (st->state == TCP_SEQ_STATE_ESTABLISHED ||
> +	     st->state == TCP_SEQ_STATE_TIME_WAIT)) {
> +		void *cur;
> +		int nskip;
> +
> +		/* for states estab and tw, st->sbucket is index (*pos) */
> +		/* corresponding to the beginning of bucket st->bucket */
> +
> +		st->num = st->sbucket;
> +		/* jump to st->bucket, then skip (*pos - st->sbucket) items */
> +		st->state = TCP_SEQ_STATE_ESTABLISHED;
> +		cur = established_get_first_after(seq, st->bucket);
> +		for (nskip = *pos - st->num; cur && nskip > 0; --nskip)
> +			cur = established_get_next(seq, cur);
> +		return cur;
> +	}
> +
>  	st->state = TCP_SEQ_STATE_LISTENING;
>  	st->num = 0;
>  	return *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;

Just in case you are working on "take 3" of the patch, there is a fondamental problem.

All the scalability problems come from the fact that tcp_seq_start()
*has* to rescan all the tables from the begining, because of lseek() capability
on /proc/net/tcp file 

We probably could disable llseek() (on other positions than start of the file),
and rely only on internal state (listening/established hashtable, hash bucket, position in chain)

I cannot imagine how an application could rely on lseek() on >0 position in this file.



^ permalink raw reply

* Re: [PATCH] /proc/net/tcp, overhead removed
From: Yakov Lerner @ 2009-09-29  7:43 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, netdev, David Miller
In-Reply-To: <20090928162417.59640672@nehalam>

On Tue, Sep 29, 2009 at 02:24, Stephen Hemminger <shemminger@vyatta.com> wrote:
> On Tue, 29 Sep 2009 00:20:07 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> Yakov Lerner a écrit :
>> > On Sun, Sep 27, 2009 at 12:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> Yakov Lerner a écrit :
>> >>> /proc/net/tcp does 20,000 sockets in 60-80 milliseconds, with this patch.
>> >>>
>> >>> The overhead was in tcp_seq_start(). See analysis (3) below.
>> >>> The patch is against Linus git tree (1). The patch is small.
>> >>>
>> >>> ------------  -----------   ------------------------------------
>> >>> Before patch  After patch   20,000 sockets (10,000 tw + 10,000 estab)(2)
>> >>> ------------  -----------   ------------------------------------
>> >>> 6 sec          0.06 sec     dd bs=1k if=/proc/net/tcp >/dev/null
>> >>> 1.5 sec        0.06 sec     dd bs=4k if=/proc/net/tcp >/dev/null
>> >>>
>> >>> 1.9 sec        0.16 sec     netstat -4ant >/dev/null
>> >>> ------------  -----------   ------------------------------------
>> >>>
>> >>> This is ~ x25 improvement.
>> >>> The new time is not dependent on read blockize.
>> >>> Speed of netstat, naturally, improves, too; both -4 and -6.
>> >>> /proc/net/tcp6 does 20,000 sockets in 100 millisec.
>> >>>
>> >>> (1) against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>> >>>
>> >>> (2) Used 'manysock' utility to stress system with large number of sockets:
>> >>>   "manysock 10000 10000"    - 10,000 tw + 10,000 estab ip4 sockets.
>> >>>   "manysock -6 10000 10000" - 10,000 tw + 10,000 estab ip6 sockets.
>> >>> Found at http://ilerner.3b1.org/manysock/manysock.c
>> >>>
>> >>> (3) Algorithmic analysis.
>> >>>     Old algorithm.
>> >>>
>> >>> During 'cat </proc/net/tcp', tcp_seq_start() is called O(numsockets) times (4).
>> >>> On average, every call to tcp_seq_start() scans half the whole hashtable. Ouch.
>> >>> This is O(numsockets * hashsize). 95-99% of 'cat </proc/net/tcp' is spent in
>> >>> tcp_seq_start()->tcp_get_idx. This overhead is eliminated by new algorithm,
>> >>> which is O(numsockets + hashsize).
>> >>>
>> >>>     New algorithm.
>> >>>
>> >>> New algorithms is O(numsockets + hashsize). We jump to the right
>> >>> hash bucket in tcp_seq_start(), without scanning half the hash.
>> >>> To jump right to the hash bucket corresponding to *pos in tcp_seq_start(),
>> >>> we reuse three pieces of state (st->num, st->bucket, st->sbucket)
>> >>> as follows:
>> >>>  - we check that requested pos >= last seen pos (st->num), the typical case.
>> >>>  - if so, we jump to bucket st->bucket
>> >>>  - to arrive to the right item after beginning of st->bucket, we
>> >>> keep in st->sbucket the position corresponding to the beginning of
>> >>> bucket.
>> >>>
>> >>> (4) Explanation of O( numsockets * hashsize) of old algorithm.
>> >>>
>> >>> tcp_seq_start() is called once for every ~7 lines of netstat output
>> >>> if readsize is 1kb, or once for every ~28 lines if readsize >= 4kb.
>> >>> Since record length of /proc/net/tcp records is 150 bytes, formula for
>> >>> number of calls to tcp_seq_start() is
>> >>>             (numsockets * 150 / min(4096,readsize)).
>> >>> Netstat uses 4kb readsize (newer versions), or 1kb (older versions).
>> >>> Note that speed of old algorithm does not improve above 4kb blocksize.
>> >>>
>> >>> Speed of the new algorithm does not depend on blocksize.
>> >>>
>> >>> Speed of the new algorithm does not perceptibly depend on hashsize (which
>> >>> depends on ramsize). Speed of old algorithm drops with bigger hashsize.
>> >>>
>> >>> (5) Reporting order.
>> >>>
>> >>> Reporting order is exactly same as before if hash does not change underfoot.
>> >>> When hash elements come and go during report, reporting order will be
>> >>> same as that of tcpdiag.
>> >>>
>> >>> Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
>
> Does the netlink interface used by ss command have the problem?

No. It's  /proc/net/tcp that has fixable problem.

Yakov

^ permalink raw reply

* [PATCH] Phonet: fix mutex imbalance
From: Rémi Denis-Courmont @ 2009-09-29  7:16 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

port_mutex was unlocked twice.

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 net/phonet/socket.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index 07aa9f0..aa5b5a9 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -407,7 +407,6 @@ int pn_sock_get_port(struct sock *sk, unsigned short sport)
 	return -EADDRINUSE;
 
 found:
-	mutex_unlock(&port_mutex);
 	pn->sobject = pn_object(pn_addr(pn->sobject), sport);
 	return 0;
 }
-- 
1.6.0.4


^ permalink raw reply related

* Re: b43 is broken in latest net-2.6 and linux-2.6
From: Oliver Hartkopp @ 2009-09-29  7:13 UTC (permalink / raw)
  To: John W. Linville; +Cc: Michael Buesch, Linux Netdev List
In-Reply-To: <20090928184211.GC4737@tuxdriver.com>

John W. Linville wrote:
> On Sat, Sep 19, 2009 at 01:23:11PM +0200, Oliver Hartkopp wrote:
>> Hello Michael,
>>
>> my b43 wireless card (Dell 830) is not working with the latest net-2.6 (and
>> also linux-2.6 2.6.31-05767-gdf58bee).
>>
>> net-2.6 2.6.31-03263-gc29854e is working
>> net-2.6 2.6.31-03301-ga97e178 is broken
>>
>> I removed the patch with the work_queue stuff which did not help - so it's
>> probably the other patch you added to b43 recently.
>>
>> Don't know ... the wlan0 link does not become ready anymore.
>>
>> If you need some more information - please let me know.
> 
> Is this working better now, with 2.6.31-rc1?

Thanks for coming back on this.

Yes it is fixed now.

   'cfg80211: fix SME connect'

broke it and

   'cfg80211: don't overwrite privacy setting'

fixed it afterwards (at least in my setup).

So it's pretty well working now.

Best regards,
Oliver

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox