Netdev List
* [patch v2 06/12] [PATCH 06/12] IPVS: ip_vs_{un,}bind_scheduler NULL arguments
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0006-IPVS-ip_vs_-un-bind_scheduler-NULL-arguments.patch --]
[-- Type: text/plain, Size: 1770 bytes --]

In general NULL arguments aren't passed by the few callers that exist,
so don't test for them.

The exception is ip_vs_unbind_scheduler(), which becomes a no-op when the
service has no scheduler bound.
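
A minimal userspace sketch of the resulting behaviour follows. The types and
names (struct service, struct scheduler, unbind_scheduler(), count_done()) are
hypothetical stand-ins for illustration, not the kernel definitions, and the
error value on done_service() failure is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct ip_vs_scheduler / struct ip_vs_service. */
struct service;

struct scheduler {
	int (*done_service)(struct service *svc);
};

struct service {
	struct scheduler *scheduler;
	int done_calls;	/* illustration only: counts done_service() calls */
};

/* Example callback that records it was invoked. */
static int count_done(struct service *svc)
{
	svc->done_calls++;
	return 0;
}

/* Mirrors the patched logic: an unbound service is a no-op, not an error. */
static int unbind_scheduler(struct service *svc)
{
	struct scheduler *sched = svc->scheduler;

	if (!sched)
		return 0;

	if (sched->done_service && sched->done_service(svc) != 0)
		return -1;	/* assumption: the kernel reports an error code here */

	svc->scheduler = NULL;
	return 0;
}
```

Note that the service pointer itself must still be valid; only a missing
scheduler is tolerated.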

Signed-off-by: Simon Horman <horms@verge.net.au>

v2
* Trivial rediff

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 84dae47..d57cc4a 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1229,8 +1229,7 @@ ip_vs_add_service(struct ip_vs_service_user_kern *u,
 
  out_err:
 	if (svc != NULL) {
-		if (svc->scheduler)
-			ip_vs_unbind_scheduler(svc);
+		ip_vs_unbind_scheduler(svc);
 		if (svc->inc) {
 			local_bh_disable();
 			ip_vs_app_inc_put(svc->inc);
diff --git a/net/netfilter/ipvs/ip_vs_sched.c b/net/netfilter/ipvs/ip_vs_sched.c
index cd77902..be0780a 100644
--- a/net/netfilter/ipvs/ip_vs_sched.c
+++ b/net/netfilter/ipvs/ip_vs_sched.c
@@ -46,15 +46,6 @@ int ip_vs_bind_scheduler(struct ip_vs_service *svc,
 {
 	int ret;
 
-	if (svc == NULL) {
-		pr_err("%s(): svc arg NULL\n", __func__);
-		return -EINVAL;
-	}
-	if (scheduler == NULL) {
-		pr_err("%s(): scheduler arg NULL\n", __func__);
-		return -EINVAL;
-	}
-
 	svc->scheduler = scheduler;
 
 	if (scheduler->init_service) {
@@ -74,18 +65,10 @@ int ip_vs_bind_scheduler(struct ip_vs_service *svc,
  */
 int ip_vs_unbind_scheduler(struct ip_vs_service *svc)
 {
-	struct ip_vs_scheduler *sched;
+	struct ip_vs_scheduler *sched = svc->scheduler;
 
-	if (svc == NULL) {
-		pr_err("%s(): svc arg NULL\n", __func__);
-		return -EINVAL;
-	}
-
-	sched = svc->scheduler;
-	if (sched == NULL) {
-		pr_err("%s(): svc isn't bound\n", __func__);
-		return -EINVAL;
-	}
+	if (!sched)
+		return 0;
 
 	if (sched->done_service) {
 		if (sched->done_service(svc) != 0) {
-- 
1.7.1




* [patch v2 02/12] [PATCH 02/12] netfilter: nf_conntrack_sip: Add callid parser
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0002-netfilter-nf_conntrack_sip-Add-callid-parser.patch --]
[-- Type: text/plain, Size: 2444 bytes --]

Signed-off-by: Simon Horman <horms@verge.net.au>

--- 

This parser is intended for use by LVS in subsequent patches.

* Patrick McHardy suggested changing word_len() to scan for the
  next newline or whitespace, but I believe this is incorrect.
  For example, '#' would then be permitted even though it is invalid
  in a word according to RFC 3261.
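
The point about '#' can be checked with a userspace sketch of the same
scanning logic. This is an adaptation for illustration, assuming the character
set used by iswordc() in the diff below; the unused conntrack arguments are
dropped and is_word_char() is a renamed stand-in.

```c
#include <ctype.h>

/* Same character set as iswordc() in the patch. */
static int is_word_char(char c)
{
	return isalnum((unsigned char)c) || c == '!' || c == '"' || c == '%' ||
	       (c >= '(' && c <= '/') || c == ':' || c == '<' || c == '>' ||
	       c == '?' || (c >= '[' && c <= ']') || c == '_' || c == '`' ||
	       c == '{' || c == '}' || c == '~';
}

/* Number of consecutive word characters starting at p, bounded by limit. */
static int word_len(const char *p, const char *limit)
{
	int len = 0;

	while (p < limit && is_word_char(*p)) {
		p++;
		len++;
	}
	return len;
}

/* Length of "word" or "word@word" at p; 0 if "word@" lacks a second word. */
static int callid_len(const char *p, const char *limit)
{
	int len = word_len(p, limit);
	int domain_len;

	p += len;
	if (!len || p == limit || *p != '@')
		return len;
	p++;
	len++;

	domain_len = word_len(p, limit);
	if (!domain_len)
		return 0;
	return len + domain_len;
}
```

With this logic a value such as "ab#cd" yields a length of 2, because the scan
stops at '#', whereas scanning to the next whitespace would accept all five
characters.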


diff --git a/include/linux/netfilter/nf_conntrack_sip.h b/include/linux/netfilter/nf_conntrack_sip.h
index ff8cfbc..0ce91d5 100644
--- a/include/linux/netfilter/nf_conntrack_sip.h
+++ b/include/linux/netfilter/nf_conntrack_sip.h
@@ -89,6 +89,7 @@ enum sip_header_types {
 	SIP_HDR_VIA_TCP,
 	SIP_HDR_EXPIRES,
 	SIP_HDR_CONTENT_LENGTH,
+	SIP_HDR_CALL_ID,
 };
 
 enum sdp_header_types {
diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 2fd1ea2..715ce54 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -130,6 +130,44 @@ static int digits_len(const struct nf_conn *ct, const char *dptr,
 	return len;
 }
 
+static int iswordc(const char c)
+{
+	if (isalnum(c) || c == '!' || c == '"' || c == '%' ||
+	    (c >= '(' && c <= '/') || c == ':' || c == '<' || c == '>' ||
+	    c == '?' || (c >= '[' && c <= ']') || c == '_' || c == '`' ||
+	    c == '{' || c == '}' || c == '~')
+		return 1;
+	return 0;
+}
+
+static int word_len(const char *dptr, const char *limit)
+{
+	int len = 0;
+	while (dptr < limit && iswordc(*dptr)) {
+		dptr++;
+		len++;
+	}
+	return len;
+}
+
+static int callid_len(const struct nf_conn *ct, const char *dptr,
+		      const char *limit, int *shift)
+{
+	int len, domain_len;
+
+	len = word_len(dptr, limit);
+	dptr += len;
+	if (!len || dptr == limit || *dptr != '@')
+		return len;
+	dptr++;
+	len++;
+
+	domain_len = word_len(dptr, limit);
+	if (!domain_len)
+		return 0;
+	return len + domain_len;
+}
+
 /* get media type + port length */
 static int media_len(const struct nf_conn *ct, const char *dptr,
 		     const char *limit, int *shift)
@@ -299,6 +337,7 @@ static const struct sip_header ct_sip_hdrs[] = {
 	[SIP_HDR_VIA_TCP]		= SIP_HDR("Via", "v", "TCP ", epaddr_len),
 	[SIP_HDR_EXPIRES]		= SIP_HDR("Expires", NULL, NULL, digits_len),
 	[SIP_HDR_CONTENT_LENGTH]	= SIP_HDR("Content-Length", "l", NULL, digits_len),
+	[SIP_HDR_CALL_ID]		= SIP_HDR("Call-Id", "i", NULL, callid_len),
 };
 
 static const char *sip_follow_continuation(const char *dptr, const char *limit)
-- 
1.7.1




* [patch v2 1/2] [PATCH 1/2] Slightly simplify options conflicts logic
From: Simon Horman @ 2010-10-01 14:40 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001144041.414393254@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0001-Slightly-simplify-options-conflicts-logic.patch --]
[-- Type: text/plain, Size: 744 bytes --]

Signed-off-by: Simon Horman <horms@verge.net.au>

diff --git a/ipvsadm.c b/ipvsadm.c
index 76ec7c4..1ac6c7f 100644
--- a/ipvsadm.c
+++ b/ipvsadm.c
@@ -763,11 +763,9 @@ static int process_options(int argc, char **argv, int reading_stdin)
 
 	switch (ce.cmd) {
 	case CMD_LIST:
-		if ((options & OPT_CONNECTION ||
-		     options & OPT_TIMEOUT || options & OPT_DAEMON) &&
-		    (options & OPT_STATS ||
-		     options & OPT_PERSISTENTCONN ||
-		     options & OPT_RATE || options & OPT_THRESHOLDS))
+		if (options & (OPT_CONNECTION|OPT_TIMEOUT|OPT_DAEMON) &&
+		    options & (OPT_STATS|OPT_PERSISTENTCONN|
+			       OPT_RATE|OPT_THRESHOLDS))
 			fail(2, "options conflicts in the list command");
 
 		if (options & OPT_CONNECTION)
-- 
1.7.1
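
The simplification above relies on the fact that testing several flag bits one
at a time and OR-ing the results is equivalent to testing once against the
OR of the masks. A small sketch, using hypothetical option bits (OPT_A, OPT_B,
OPT_C are not ipvsadm's actual values):

```c
/* Hypothetical option bits for illustration only. */
#define OPT_A 0x1u
#define OPT_B 0x2u
#define OPT_C 0x4u

/* Original style: test each bit separately. */
static int any_set_verbose(unsigned int options)
{
	return (options & OPT_A || options & OPT_B || options & OPT_C) ? 1 : 0;
}

/* Simplified style: OR the masks together and test once. */
static int any_set_masked(unsigned int options)
{
	return (options & (OPT_A | OPT_B | OPT_C)) ? 1 : 0;
}
```

Both helpers return the same result for every input, which is why the
rewritten condition preserves behaviour.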




* Re: [PATCH 0/2] qcusbnet: Cleanups
From: Elly Jones @ 2010-10-01 14:42 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, dbrownell, mjg59, jglasgow, msb, olofj
In-Reply-To: <1285893935.10780.11.camel@Joe-Laptop>

On Thu, Sep 30, 2010 at 05:45:35PM -0700, Joe Perches wrote:
> On Thu, 2010-09-30 at 17:33 -0700, David Miller wrote:
> > From: Joe Perches <joe@perches.com>
> > Date: Tue, 28 Sep 2010 19:39:56 -0700
> > > Perhaps some of these cleanups are in order?
> > I don't see this driver in any of my trees, so someone else
> > should be taking this in it seems.
> 
> These cleanups are meant for Elly Jones on top
> of the Qualcomm Gobi 2000 driver she submitted.
> 
> http://patchwork.ozlabs.org/patch/66006/
> 

Wow, thank you! I'll incorporate your fixes, make sure I have clean
checkpatch output, and send a v2.

-- Elly


* [patch v2 2/2] [PATCH 2/2] Add support for persistence engines.
From: Simon Horman @ 2010-10-01 14:40 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001144041.414393254@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0002-Add-support-for-persistence-engines.patch --]
[-- Type: text/plain, Size: 13923 bytes --]

This adds the --pe [engine] option to the -A and -E commands
which allows a persistence engine to be associated with a virtual service.
The absence of --pe sets no persistence engine.

The --pe option only works when ipvsadm is compiled to use netlink
for user-space/kernel communication.

This patch also allows the --persistent-conn option to be given to the -L
command, which will list persistence engine data, if any is present, when
listing connections (and persistence templates).

At this time the only (proposed) persistence engine is sip.
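
The netlink-only restriction can be pictured as a guard on the legacy
setsockopt path. The following is a sketch under assumptions: struct service,
set_pe_name() and compat_check() are hypothetical names, and only the buffer
size matches IP_VS_PENAME_MAXLEN from this patch.

```c
#include <errno.h>
#include <string.h>

#define PENAME_MAXLEN 16	/* same size as IP_VS_PENAME_MAXLEN in the patch */

/* Hypothetical, reduced stand-in for ipvs_service_t. */
struct service {
	char pe_name[PENAME_MAXLEN];
};

/* Copy an engine name, always leaving the buffer NUL-terminated. */
static void set_pe_name(struct service *svc, const char *name)
{
	strncpy(svc->pe_name, name, PENAME_MAXLEN - 1);
	svc->pe_name[PENAME_MAXLEN - 1] = '\0';
}

/* Legacy (non-netlink) path: refuse services that name an engine. */
static int compat_check(const struct service *svc)
{
	if (svc->pe_name[0]) {
		errno = EAFNOSUPPORT;
		return -1;
	}
	return 0;
}
```

A service with an empty pe_name passes the check, so existing setups that do
not use --pe are unaffected.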

Signed-off-by: Simon Horman <horms@verge.net.au>

---

v0.4
* Fix indentation of --pe help text

v2
* Only display pe_data if it is present

Index: github.com/Makefile
===================================================================
--- github.com.orig/Makefile	2010-10-01 22:58:26.000000000 +0900
+++ github.com/Makefile	2010-10-01 23:00:10.000000000 +0900
@@ -29,6 +29,7 @@ NAME		= ipvsadm
 VERSION		= $(shell cat VERSION)
 RELEASE		= 1
 SCHEDULERS	= "$(shell cat SCHEDULERS)"
+PE_LIST		= "$(shell cat PERSISTENCE_ENGINES)"
 PROGROOT	= $(shell basename `pwd`)
 ARCH		= $(shell uname -m)
 RPMSOURCEDIR	= $(shell rpm --eval '%_sourcedir')
@@ -83,7 +84,7 @@ ifneq (0,$(HAVE_NL))
 LIBS		+= -lnl
 endif
 DEFINES		= -DVERSION=\"$(VERSION)\" -DSCHEDULERS=\"$(SCHEDULERS)\" \
-		  $(POPT_DEFINE)
+		  -DPE_LIST=\"$(PE_LIST)\" $(POPT_DEFINE)
 DEFINES		+= $(shell if [ ! -f ../ip_vs.h ]; then	\
 		     echo "-DHAVE_NET_IP_VS_H"; fi;)
 
Index: github.com/PERSISTENCE_ENGINES
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ github.com/PERSISTENCE_ENGINES	2010-10-01 23:00:10.000000000 +0900
@@ -0,0 +1 @@
+sip
Index: github.com/ipvsadm.8
===================================================================
--- github.com.orig/ipvsadm.8	2010-09-26 22:07:58.000000000 +0900
+++ github.com/ipvsadm.8	2010-10-01 23:00:10.000000000 +0900
@@ -391,6 +391,10 @@ with this option will display the persis
 information of each server in service listing. The persistent
 connection is used to forward the actual connections from the same
 client/network to the same server.
+.sp
+The \fIlist\fP command with the -c, --connection option and this option
+will include persistence engine data, if any is present, when listing
+connections.
 .TP
 .B --sort
 Sort the list of virtual services and real servers. The virtual
Index: github.com/ipvsadm.c
===================================================================
--- github.com.orig/ipvsadm.c	2010-10-01 23:00:09.000000000 +0900
+++ github.com/ipvsadm.c	2010-10-01 23:00:20.000000000 +0900
@@ -181,13 +181,15 @@ static const char* cmdnames[] = {
 #define OPT_SYNCID		0x080000
 #define OPT_EXACT		0x100000
 #define OPT_ONEPACKET		0x200000
-#define NUMBER_OF_OPT		22
+#define OPT_PERSISTENCE_ENGINE  0x400000
+#define NUMBER_OF_OPT		23
 
 static const char* optnames[] = {
 	"numeric",
 	"connection",
 	"service-address",
 	"scheduler",
+	"pe",
 	"persistent",
 	"netmask",
 	"real-server",
@@ -282,6 +284,7 @@ enum {
 	TAG_PERSISTENTCONN,
 	TAG_SORT,
 	TAG_NO_SORT,
+	TAG_PERSISTENCE_ENGINE,
 };
 
 /* various parsing helpers & parsing functions */
@@ -421,6 +424,8 @@ parse_options(int argc, char **argv, str
 		{ "exact", 'X', POPT_ARG_NONE, NULL, 'X', NULL, NULL },
 		{ "ipv6", '6', POPT_ARG_NONE, NULL, '6', NULL, NULL },
 		{ "ops", 'o', POPT_ARG_NONE, NULL, 'o', NULL, NULL },
+		{ "pe", '\0', POPT_ARG_STRING, &optarg, TAG_PERSISTENCE_ENGINE,
+		  NULL, NULL },
 		{ NULL, 0, 0, NULL, 0, NULL, NULL }
 	};
 
@@ -647,6 +652,10 @@ parse_options(int argc, char **argv, str
 			set_option(options, OPT_ONEPACKET);
 			ce->svc.flags |= IP_VS_SVC_F_ONEPACKET;
 			break;
+		case TAG_PERSISTENCE_ENGINE:
+			set_option(options, OPT_PERSISTENCE_ENGINE);
+			strncpy(ce->svc.pe_name, optarg, IP_VS_PENAME_MAXLEN);
+			break;
 		default:
 			fail(2, "invalid option `%s'",
 			     poptBadOption(context, POPT_BADOPTION_NOALIAS));
@@ -763,9 +772,10 @@ static int process_options(int argc, cha
 
 	switch (ce.cmd) {
 	case CMD_LIST:
-		if (options & (OPT_CONNECTION|OPT_TIMEOUT|OPT_DAEMON) &&
-		    options & (OPT_STATS|OPT_PERSISTENTCONN|
-			       OPT_RATE|OPT_THRESHOLDS))
+		if ((options & (OPT_CONNECTION|OPT_TIMEOUT|OPT_DAEMON) &&
+		     options & (OPT_STATS|OPT_RATE|OPT_THRESHOLDS)) ||
+		    (options & (OPT_TIMEOUT|OPT_DAEMON) &&
+		     options & OPT_PERSISTENTCONN))
 			fail(2, "options conflicts in the list command");
 
 		if (options & OPT_CONNECTION)
@@ -1060,7 +1070,7 @@ static void usage_exit(const char *progr
 	version(stream);
 	fprintf(stream,
 		"Usage:\n"
-		"  %s -A|E -t|u|f service-address [-s scheduler] [-p [timeout]] [-M netmask]\n"
+		"  %s -A|E -t|u|f service-address [-s scheduler] [-p [timeout]] [-M netmask] [--pe persistence_engine]\n"
 		"  %s -D -t|u|f service-address\n"
 		"  %s -C\n"
 		"  %s -R\n"
@@ -1105,6 +1115,8 @@ static void usage_exit(const char *progr
 		"  --ipv6         -6                   fwmark entry uses IPv6\n"
 		"  --scheduler    -s scheduler         one of " SCHEDULERS ",\n"
 		"                                      the default scheduler is %s.\n"
+		"  --pe            engine              alternate persistence engine may be " PE_LIST ",\n"
+		"                                      not set by default.\n"
 		"  --persistent   -p [timeout]         persistent service\n"
 		"  --netmask      -M netmask           persistent granularity mask\n"
 		"  --real-server  -r server-address    server-address is host (and port)\n"
@@ -1225,6 +1237,8 @@ static void print_conn(char *buf, unsign
 	char            state[16];
 	unsigned int    expires;
 	unsigned short  af = AF_INET;
+	char		pe_name[IP_VS_PENAME_MAXLEN];
+	char		pe_data[IP_VS_PEDATA_MAXLEN];
 
 	int n;
 	char temp1[INET6_ADDRSTRLEN], temp2[INET6_ADDRSTRLEN], temp3[INET6_ADDRSTRLEN];
@@ -1232,9 +1246,10 @@ static void print_conn(char *buf, unsign
 	unsigned int	minutes, seconds;
 	char		expire_str[12];
 
-	if ((n = sscanf(buf, "%s %s %hX %s %hX %s %hX %s %d",
+	if ((n = sscanf(buf, "%s %s %hX %s %hX %s %hX %s %d %s %s",
 			protocol, temp1, &cport, temp2, &vport,
-			temp3, &dport, state, &expires)) == -1)
+			temp3, &dport, state, &expires,
+			pe_name, pe_data)) == -1)
 		exit(1);
 
 	if (strcmp(protocol, "TCP") == 0)
@@ -1268,8 +1283,13 @@ static void print_conn(char *buf, unsign
 	minutes = expires / 60;
 	sprintf(expire_str, "%02d:%02d", minutes, seconds);
 
-	printf("%-3s %-6s %-11s %-18s %-18s %s\n",
-	       protocol, expire_str, state, cname, vname, dname);
+	if (format & FMT_PERSISTENTCONN && n == 11)
+		printf("%-3s %-6s %-11s %-18s %-18s %-18s %-16s %s\n",
+		       protocol, expire_str, state, cname, vname, dname,
+		       pe_name, pe_data);
+	else
+		printf("%-3s %-6s %-11s %-18s %-18s %s\n",
+		       protocol, expire_str, state, cname, vname, dname);
 
 	free(cname);
 	free(vname);
@@ -1295,8 +1315,13 @@ void list_conn(unsigned int format)
 		exit(1);
 	}
 	printf("IPVS connection entries\n");
-	printf("pro expire %-11s %-18s %-18s %s\n",
-	       "state", "source", "virtual", "destination");
+	if (format & FMT_PERSISTENTCONN)
+		printf("pro expire %-11s %-18s %-18s %-18s %-16s %s\n",
+		       "state", "source", "virtual", "destination",
+		       "pe name", "pe_data");
+	else
+		printf("pro expire %-11s %-18s %-18s %s\n",
+		       "state", "source", "virtual", "destination");
 
 	/*
 	 * Print the VS information according to the format
@@ -1459,6 +1484,8 @@ print_service_entry(ipvs_service_entry_t
 					printf(" -M %i", se->netmask);
 				}
 		}
+		if (se->pe_name[0])
+			printf(" pe %s", se->pe_name);
 		if (se->flags & IP_VS_SVC_F_ONEPACKET)
 			printf(" ops");
 	} else if (format & FMT_STATS) {
@@ -1488,6 +1515,8 @@ print_service_entry(ipvs_service_entry_t
 			if (se->af == AF_INET6)
 				if (se->netmask != 128)
 					printf(" mask %i", se->netmask);
+			if (se->pe_name[0])
+				printf(" pe %s", se->pe_name);
 			if (se->flags & IP_VS_SVC_F_ONEPACKET)
 				printf(" ops");
 		}
Index: github.com/libipvs/ip_vs.h
===================================================================
--- github.com.orig/libipvs/ip_vs.h	2010-09-26 22:07:58.000000000 +0900
+++ github.com/libipvs/ip_vs.h	2010-10-01 23:00:10.000000000 +0900
@@ -92,8 +92,11 @@
 #define IP_VS_CONN_F_ONE_PACKET	0x2000		/* forward only one packet */
 
 #define IP_VS_SCHEDNAME_MAXLEN	16
+#define IP_VS_PENAME_MAXLEN	16
 #define IP_VS_IFNAME_MAXLEN	16
 
+#define IP_VS_PEDATA_MAXLEN	255
+
 union nf_inet_addr {
         __u32           all[4];
         __be32          ip;
@@ -134,6 +137,7 @@ struct ip_vs_service_user {
 	__be32			netmask;	/* persistent netmask */
 	u_int16_t		af;
 	union nf_inet_addr	addr;
+	char			pe_name[IP_VS_PENAME_MAXLEN];
 };
 
 struct ip_vs_dest_kern {
@@ -240,6 +244,7 @@ struct ip_vs_service_entry {
 
 	u_int16_t		af;
 	union nf_inet_addr	addr;
+	char			pe_name[IP_VS_PENAME_MAXLEN];
 
 };
 
@@ -429,6 +434,9 @@ enum {
 	IPVS_SVC_ATTR_NETMASK,		/* persistent netmask */
 
 	IPVS_SVC_ATTR_STATS,		/* nested attribute for service stats */
+
+	IPVS_SVC_ATTR_PE_NAME,		/* name of persistence engine */
+
 	__IPVS_SVC_ATTR_MAX,
 };
 
Index: github.com/libipvs/libipvs.c
===================================================================
--- github.com.orig/libipvs/libipvs.c	2010-09-26 22:07:58.000000000 +0900
+++ github.com/libipvs/libipvs.c	2010-10-01 23:00:10.000000000 +0900
@@ -40,6 +40,15 @@ static int family, try_nl = 1;
 	{ errno = EAFNOSUPPORT; return ret; }			\
 	s->__addr_v4 = s->addr.ip;				\
 
+#define CHECK_PE(s, ret) if (s->pe_name[0])			\
+	{ errno = EAFNOSUPPORT; return ret; }
+
+#define CHECK_COMPAT_DEST(s, ret) CHECK_IPV4(s, ret)
+
+#define CHECK_COMPAT_SVC(s, ret)				\
+	CHECK_IPV4(s, ret);					\
+	CHECK_PE(s, ret);
+
 #ifdef LIBIPVS_USE_NL
 struct nl_msg *ipvs_nl_message(int cmd, int flags)
 {
@@ -218,6 +227,8 @@ static int ipvs_nl_fill_service_attr(str
 	}
 
 	NLA_PUT_STRING(msg, IPVS_SVC_ATTR_SCHED_NAME, svc->sched_name);
+	if (svc->pe_name[0])
+		NLA_PUT_STRING(msg, IPVS_SVC_ATTR_PE_NAME, svc->pe_name);
 	NLA_PUT(msg, IPVS_SVC_ATTR_FLAGS, sizeof(flags), &flags);
 	NLA_PUT_U32(msg, IPVS_SVC_ATTR_TIMEOUT, svc->timeout);
 	NLA_PUT_U32(msg, IPVS_SVC_ATTR_NETMASK, svc->netmask);
@@ -245,7 +256,7 @@ int ipvs_add_service(ipvs_service_t *svc
 	}
 #endif
 
-	CHECK_IPV4(svc, -1);
+	CHECK_COMPAT_SVC(svc, -1);
 	return setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_ADD, (char *)svc,
 			  sizeof(struct ip_vs_service_kern));
 }
@@ -265,7 +276,7 @@ int ipvs_update_service(ipvs_service_t *
 		return ipvs_nl_send_message(msg, ipvs_nl_noop_cb, NULL);
 	}
 #endif
-	CHECK_IPV4(svc, -1);
+	CHECK_COMPAT_SVC(svc, -1);
 	return setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_EDIT, (char *)svc,
 			  sizeof(struct ip_vs_service_kern));
 }
@@ -285,7 +296,7 @@ int ipvs_del_service(ipvs_service_t *svc
 		return ipvs_nl_send_message(msg, ipvs_nl_noop_cb, NULL);
 	}
 #endif
-	CHECK_IPV4(svc, -1);
+	CHECK_COMPAT_SVC(svc, -1);
 	return setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_DEL, (char *)svc,
 			  sizeof(struct ip_vs_service_kern));
 }
@@ -310,7 +321,7 @@ int ipvs_zero_service(ipvs_service_t *sv
 		return ipvs_nl_send_message(msg, ipvs_nl_noop_cb, NULL);
 	}
 #endif
-	CHECK_IPV4(svc, -1);
+	CHECK_COMPAT_SVC(svc, -1);
 	return setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_ZERO, (char *)svc,
 			  sizeof(struct ip_vs_service_kern));
 }
@@ -360,8 +371,8 @@ nla_put_failure:
 	}
 #endif
 
-	CHECK_IPV4(svc, -1);
-	CHECK_IPV4(dest, -1);
+	CHECK_COMPAT_SVC(svc, -1);
+	CHECK_COMPAT_DEST(dest, -1);
 	memcpy(&svcdest.svc, svc, sizeof(svcdest.svc));
 	memcpy(&svcdest.dest, dest, sizeof(svcdest.dest));
 	return setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_ADDDEST,
@@ -389,8 +400,8 @@ nla_put_failure:
 		return -1;
 	}
 #endif
-	CHECK_IPV4(svc, -1);
-	CHECK_IPV4(dest, -1);
+	CHECK_COMPAT_SVC(svc, -1);
+	CHECK_COMPAT_DEST(dest, -1);
 	memcpy(&svcdest.svc, svc, sizeof(svcdest.svc));
 	memcpy(&svcdest.dest, dest, sizeof(svcdest.dest));
 	return setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_EDITDEST,
@@ -419,8 +430,8 @@ nla_put_failure:
 	}
 #endif
 
-	CHECK_IPV4(svc, -1);
-	CHECK_IPV4(dest, -1);
+	CHECK_COMPAT_SVC(svc, -1);
+	CHECK_COMPAT_DEST(dest, -1);
 	memcpy(&svcdest.svc, svc, sizeof(svcdest.svc));
 	memcpy(&svcdest.dest, dest, sizeof(svcdest.dest));
 	return setsockopt(sockfd, IPPROTO_IP, IP_VS_SO_SET_DELDEST,
@@ -593,6 +604,11 @@ static int ipvs_services_parse_cb(struct
 		nla_get_string(svc_attrs[IPVS_SVC_ATTR_SCHED_NAME]),
 		IP_VS_SCHEDNAME_MAXLEN);
 
+	if (svc_attrs[IPVS_SVC_ATTR_PE_NAME])
+		strncpy(get->entrytable[i].pe_name,
+			nla_get_string(svc_attrs[IPVS_SVC_ATTR_PE_NAME]),
+			IP_VS_PENAME_MAXLEN);
+
 	get->entrytable[i].netmask = nla_get_u32(svc_attrs[IPVS_SVC_ATTR_NETMASK]);
 	get->entrytable[i].timeout = nla_get_u32(svc_attrs[IPVS_SVC_ATTR_TIMEOUT]);
 	nla_memcpy(&flags, svc_attrs[IPVS_SVC_ATTR_FLAGS], sizeof(flags));
@@ -937,7 +953,8 @@ ipvs_get_service_err2:
 	}
 #endif
 
-	CHECK_IPV4(svc, NULL);
+	CHECK_COMPAT_SVC(svc, NULL);
+	CHECK_PE(svc, NULL);
 	if (getsockopt(sockfd, IPPROTO_IP, IP_VS_SO_GET_SERVICE,
 		       (char *)svc, &len)) {
 		free(svc);
@@ -945,6 +962,7 @@ ipvs_get_service_err2:
 	}
 	svc->af = AF_INET;
 	svc->addr.ip = svc->__addr_v4;
+	svc->pe_name[0] = '\0';
 	return svc;
 }
 
@@ -1086,9 +1104,9 @@ const char *ipvs_strerror(int err)
 		const char *message;
 	} table [] = {
 		{ ipvs_add_service, EEXIST, "Service already exists" },
-		{ ipvs_add_service, ENOENT, "Scheduler not found" },
+		{ ipvs_add_service, ENOENT, "Scheduler or persistence engine not found" },
 		{ ipvs_update_service, ESRCH, "No such service" },
-		{ ipvs_update_service, ENOENT, "Scheduler not found" },
+		{ ipvs_update_service, ENOENT, "Scheduler or persistence engine not found" },
 		{ ipvs_del_service, ESRCH, "No such service" },
 		{ ipvs_zero_service, ESRCH, "No such service" },
 		{ ipvs_add_dest, ESRCH, "Service not defined" },



* [patch v2 0/2] [patch v1 0/2] ipvsadm: SIP Persistence Engine
From: Simon Horman @ 2010-10-01 14:40 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy

This series is the ipvsadm companion to the kernel
patch series "IPVS: SIP Persistence Engine" v2.



* [patch v2 12/12] [PATCH 12/12] IPVS: sip persistence engine
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0012-IPVS-sip-persistence-engine.patch --]
[-- Type: text/plain, Size: 7063 bytes --]

Add the SIP callid as a key for persistence.

This allows multiple connections from the same IP address to be
differentiated on the basis of the callid.

When used in conjunction with the persistence mask, it allows connections
from different IP addresses to be aggregated on the basis of the callid.

It is envisaged that a persistence mask of 0.0.0.0 will be a useful
setting.  That is, ignore the source IP address when checking for
persistence.

It is envisaged that this option will be used in conjunction with
one-packet scheduling.

This only works with UDP and cannot be made to work with TCP
within the current framework.
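
The callid comparison at the heart of this can be sketched in userspace as
follows. This is a reduced illustration, not the kernel code: struct pe_key
and callid_match() are hypothetical names, and the address, port and protocol
checks done by ip_vs_sip_ct_match() are omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced view of the persistence data carried by a template. */
struct pe_key {
	const char *data;	/* Call-ID bytes, not NUL-terminated */
	size_t len;
};

/* Two keys match only if lengths agree and the bytes are identical. */
static bool callid_match(const struct pe_key *a, const struct pe_key *b)
{
	return a->data && b->data && a->len == b->len &&
	       memcmp(a->data, b->data, a->len) == 0;
}
```

Because the comparison is byte-wise, two Call-IDs that differ only in case are
treated as different keys.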

Signed-off-by: Simon Horman <horms@verge.net.au>

---

v1
* Use buf[] instead of pointer arithmetic in ip_vs_dbg_callid()
  As suggested by Jan Engelhardt

v2
* Use GFP_ATOMIC for allocations inside of ip_vs_sip_fill_param()
  which is called in an atomic context. This resolves the
  "scheduling while atomic" problem.
* As noted by Julian Anastasov, RFC 3261 section 8.1.1.4 says
  "Call-IDs are case-sensitive and are simply compared byte-by-byte",
  so use memcmp() instead of strnicmp() in ip_vs_sip_ct_match().
* Spelling fix in comment: persistance -> persistence
* Trivial rediff

Index: lvs-test-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/Kconfig	2010-10-01 22:46:59.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/Kconfig	2010-10-01 22:50:17.000000000 +0900
@@ -256,4 +256,11 @@ config	IP_VS_NFCT
 	  connection state to be exported to the Netfilter framework
 	  for filtering purposes.
 
+config	IP_VS_PE_SIP
+	tristate "SIP persistence engine"
+        depends on IP_VS_PROTO_UDP
+	depends on NF_CONNTRACK_SIP
+	---help---
+	  Allow persistence based on the SIP Call-ID
+
 endif # IP_VS
Index: lvs-test-2.6/net/netfilter/ipvs/Makefile
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/Makefile	2010-10-01 22:50:17.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/Makefile	2010-10-01 22:50:17.000000000 +0900
@@ -35,3 +35,6 @@ obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
 
 # IPVS application helpers
 obj-$(CONFIG_IP_VS_FTP) += ip_vs_ftp.o
+
+# IPVS connection template retrievers
+obj-$(CONFIG_IP_VS_PE_SIP) += ip_vs_pe_sip.o
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_pe_sip.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_pe_sip.c	2010-10-01 22:50:23.000000000 +0900
@@ -0,0 +1,167 @@
+#define KMSG_COMPONENT "IPVS"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+
+#include <net/ip_vs.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <linux/netfilter/nf_conntrack_sip.h>
+
+static const char *ip_vs_dbg_callid(char *buf, size_t buf_len,
+				    const char *callid, size_t callid_len,
+				    int *idx)
+{
+	size_t len = min(min(callid_len, (size_t)64), buf_len - *idx - 1);
+	memcpy(buf + *idx, callid, len);
+	buf[*idx+len] = '\0';
+	*idx += len + 1;
+	return buf + *idx - len;
+}
+
+#define IP_VS_DEBUG_CALLID(callid, len)					\
+	ip_vs_dbg_callid(ip_vs_dbg_buf, sizeof(ip_vs_dbg_buf),		\
+			 callid, len, &ip_vs_dbg_idx)
+
+static int get_callid(const char *dptr, unsigned int dataoff,
+		      unsigned int datalen,
+		      unsigned int *matchoff, unsigned int *matchlen)
+{
+	/* Find callid */
+	while (1) {
+		int ret = ct_sip_get_header(NULL, dptr, dataoff, datalen,
+					    SIP_HDR_CALL_ID, matchoff,
+					    matchlen);
+		if (ret > 0)
+			break;
+		if (!ret)
+			return 0;
+		dataoff += *matchoff;
+	}
+
+	/* Empty callid is useless */
+	if (!*matchlen)
+		return -EINVAL;
+
+	/* Too large is useless */
+	if (*matchlen > IP_VS_PEDATA_MAXLEN)
+		return -EINVAL;
+
+	/* SIP headers are always followed by a line terminator */
+	if (*matchoff + *matchlen == datalen)
+		return -EINVAL;
+
+	/* RFC 2543 allows lines to be terminated with CR, LF or CRLF,
+	 * RFC 3261 allows only CRLF, we support both. */
+	if (*(dptr + *matchoff + *matchlen) != '\r' &&
+	    *(dptr + *matchoff + *matchlen) != '\n')
+		return -EINVAL;
+
+	IP_VS_DBG_BUF(9, "SIP callid %s (%d bytes)\n",
+		      IP_VS_DEBUG_CALLID(dptr + *matchoff, *matchlen),
+		      *matchlen);
+	return 0;
+}
+
+static int
+ip_vs_sip_fill_param(struct ip_vs_conn_param *p, struct sk_buff *skb)
+{
+	struct ip_vs_iphdr iph;
+	unsigned int dataoff, datalen, matchoff, matchlen;
+	const char *dptr;
+
+	ip_vs_fill_iphdr(p->af, skb_network_header(skb), &iph);
+
+	/* Only useful with UDP */
+	if (iph.protocol != IPPROTO_UDP)
+		return -EINVAL;
+
+	/* No Data ? */
+	dataoff = iph.len + sizeof(struct udphdr);
+	if (dataoff >= skb->len)
+		return -EINVAL;
+
+	dptr = skb->data + dataoff;
+	datalen = skb->len - dataoff;
+
+	if (get_callid(dptr, dataoff, datalen, &matchoff, &matchlen))
+		return -EINVAL;
+
+	p->pe_data = kmalloc(matchlen, GFP_ATOMIC);
+	if (!p->pe_data)
+		return -ENOMEM;
+
+	/* N.B: pe_data is only set on success,
+	 * this allows fallback to the default persistence logic on failure
+	 */
+	memcpy(p->pe_data, dptr + matchoff, matchlen);
+	p->pe_data_len = matchlen;
+
+	return 0;
+}
+
+static bool ip_vs_sip_ct_match(const struct ip_vs_conn_param *p,
+				  struct ip_vs_conn *ct)
+
+{
+	bool ret = 0;
+
+	if (ct->af == p->af &&
+	    ip_vs_addr_equal(p->af, p->caddr, &ct->caddr) &&
+	    /* protocol should only be IPPROTO_IP if
+	     * d_addr is a fwmark */
+	    ip_vs_addr_equal(p->protocol == IPPROTO_IP ? AF_UNSPEC : p->af,
+			     p->vaddr, &ct->vaddr) &&
+	    ct->vport == p->vport &&
+	    ct->flags & IP_VS_CONN_F_TEMPLATE &&
+	    ct->protocol == p->protocol &&
+	    ct->pe_data && ct->pe_data_len == p->pe_data_len &&
+	    !memcmp(ct->pe_data, p->pe_data, p->pe_data_len))
+		ret = 1;
+
+	IP_VS_DBG_BUF(9, "SIP template match %s %s->%s:%d %s\n",
+		      ip_vs_proto_name(p->protocol),
+		      IP_VS_DEBUG_CALLID(p->pe_data, p->pe_data_len),
+		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
+		      ret ? "hit" : "not hit");
+
+	return ret;
+}
+
+static u32 ip_vs_sip_hashkey_raw(const struct ip_vs_conn_param *p,
+				 u32 initval)
+{
+	return jhash(p->pe_data, p->pe_data_len, initval);
+}
+
+static int ip_vs_sip_show_pe_data(const struct ip_vs_conn *cp, char *buf)
+{
+	memcpy(buf, cp->pe_data, cp->pe_data_len);
+	return cp->pe_data_len;
+}
+
+static struct ip_vs_pe ip_vs_sip_pe =
+{
+	.name =			"sip",
+	.refcnt =		ATOMIC_INIT(0),
+	.module =		THIS_MODULE,
+	.n_list =		LIST_HEAD_INIT(ip_vs_sip_pe.n_list),
+	.fill_param =		ip_vs_sip_fill_param,
+	.ct_match =		ip_vs_sip_ct_match,
+	.hashkey_raw =		ip_vs_sip_hashkey_raw,
+	.show_pe_data =	ip_vs_sip_show_pe_data,
+};
+
+static int __init ip_vs_sip_init(void)
+{
+	return register_ip_vs_pe(&ip_vs_sip_pe);
+}
+
+static void __exit ip_vs_sip_cleanup(void)
+{
+	unregister_ip_vs_pe(&ip_vs_sip_pe);
+}
+
+module_init(ip_vs_sip_init);
+module_exit(ip_vs_sip_cleanup);
+MODULE_LICENSE("GPL");



* [patch v2 11/12] [PATCH 11/12] IPVS: Fallback if persistence engine fails
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0011-IPVS-Fallback-if-persistence-engine-fails.patch --]
[-- Type: text/plain, Size: 2961 bytes --]

Fall back to normal persistence handling if the persistence
engine fails to recognise a packet.

This way, at least the packet will go somewhere.

It is envisaged that iptables could be used to block such packets
if this is not desired, although nf_conntrack_sip would likely need
to be enhanced first.
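
The fallback can be sketched as a hash selection that keys on engine data only
when the engine actually filled it in. This is a userspace illustration under
assumptions: struct conn_param is a reduced hypothetical stand-in for struct
ip_vs_conn_param, and both hash functions are placeholders rather than the
kernel's.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct ip_vs_conn_param. */
struct conn_param {
	unsigned int caddr;	/* client address, used by the default hash */
	const char *pe_data;	/* NULL when the engine did not recognise the packet */
	size_t pe_data_len;
};

/* Placeholder for the normal address-based hash. */
static unsigned int default_hash(const struct conn_param *p)
{
	return p->caddr * 2654435761u;
}

/* Placeholder for the engine's hash over its own data. */
static unsigned int pe_hash(const struct conn_param *p)
{
	unsigned int h = 5381;

	for (size_t i = 0; i < p->pe_data_len; i++)
		h = h * 33 + (unsigned char)p->pe_data[i];
	return h;
}

/* Key on engine data only when it was filled in; otherwise fall back. */
static unsigned int hashkey(const struct conn_param *p)
{
	if (p->pe_data)
		return pe_hash(p);
	return default_hash(p);
}
```

Testing pe_data rather than the presence of an engine is what lets an
unrecognised packet take the normal persistence path.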

Signed-off-by: Simon Horman <horms@verge.net.au>

---

v2
* Trivial rediff

Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 22:27:32.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 22:37:32.000000000 +0900
@@ -150,7 +150,7 @@ static unsigned int ip_vs_conn_hashkey(i
 
 static unsigned int ip_vs_conn_hashkey_param(const struct ip_vs_conn_param *p)
 {
-	if (p->pe && p->pe->hashkey_raw)
+	if (p->pe_data && p->pe->hashkey_raw)
 		return p->pe->hashkey_raw(p, ip_vs_conn_rnd) &
 			ip_vs_conn_tab_mask;
 	return ip_vs_conn_hashkey(p->af, p->protocol, p->caddr, p->cport);
@@ -340,7 +340,7 @@ struct ip_vs_conn *ip_vs_ct_in_get(const
 	ct_read_lock(hash);
 
 	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
-		if (p->pe && p->pe->ct_match) {
+		if (p->pe_data && p->pe->ct_match) {
 			if (p->pe->ct_match(p, cp))
 				goto out;
 			continue;
@@ -944,7 +944,7 @@ static int ip_vs_conn_seq_show(struct se
 		char pe_data[IP_VS_PENAME_MAXLEN + IP_VS_PEDATA_MAXLEN + 3];
 		size_t len = 0;
 
-		if (cp->dest->svc->pe && cp->dest->svc->pe->show_pe_data) {
+		if (cp->pe_data && cp->dest->svc->pe->show_pe_data) {
 			pe_data[0] = ' ';
 			len = strlen(cp->dest->svc->pe->name);
 			memcpy(pe_data + 1, cp->dest->svc->pe->name, len);
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 22:27:17.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 22:37:32.000000000 +0900
@@ -176,7 +176,7 @@ ip_vs_set_state(struct ip_vs_conn *cp, i
 	return pp->state_transition(cp, direction, skb, pp);
 }
 
-static inline int
+static inline void
 ip_vs_conn_fill_param_persist(const struct ip_vs_service *svc,
 			      struct sk_buff *skb, int protocol,
 			      const union nf_inet_addr *caddr, __be16 cport,
@@ -186,8 +186,7 @@ ip_vs_conn_fill_param_persist(const stru
 	ip_vs_conn_fill_param(svc->af, protocol, caddr, cport, vaddr, vport, p);
 	p->pe = svc->pe;
 	if (p->pe && p->pe->fill_param)
-		return p->pe->fill_param(p, skb);
-	return 0;
+		p->pe->fill_param(p, skb);
 }
 
 /*
@@ -268,9 +267,8 @@ ip_vs_sched_persist(struct ip_vs_service
 				vaddr = &fwmark;
 			}
 		}
-		if (ip_vs_conn_fill_param_persist(svc, skb, protocol, &snet, 0,
-						  vaddr, vport, &param))
-			return NULL;
+		ip_vs_conn_fill_param_persist(svc, skb, protocol, &snet, 0,
+					      vaddr, vport, &param);
 	}
 
 	/* Check if a template already exists */


^ permalink raw reply

* [patch v2 08/12] [PATCH 08/12] IPVS: Add persistence engine data to /proc/net/ip_vs_conn
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0008-IPVS-Add-persistence-engine-data-to-proc-net-ip_vs_c.patch --]
[-- Type: text/plain, Size: 2901 bytes --]

This shouldn't break compatibility with userspace as the new data
is at the end of the line.

I have confirmed that this doesn't break ipvsadm, the main (only?)
user-space user of this data.

Signed-off-by: Simon Horman <horms@verge.net.au>

---

* Jan Engelhardt suggested using netlink to do this, but it seems like
  overkill to me. I'm willing to be convinced otherwise.

v2
* Trivial rediff
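
As an aside, the way the extra columns are appended can be sketched in
user space as follows. This is a minimal illustration under assumptions:
format_pe_suffix() is a hypothetical helper, and the two length macros
merely mirror IP_VS_PENAME_MAXLEN and IP_VS_PEDATA_MAXLEN. It builds the
" <name> <data>" suffix that is printed through the trailing %s, and
yields an empty string when no persistence engine is in use, so existing
lines are unchanged in that case.

```c
#include <stddef.h>
#include <string.h>

#define PENAME_MAXLEN 16	/* mirrors IP_VS_PENAME_MAXLEN */
#define PEDATA_MAXLEN 255	/* mirrors IP_VS_PEDATA_MAXLEN */

/*
 * Build " <name> <data>" in buf and return its length.
 * buf must hold at least PENAME_MAXLEN + PEDATA_MAXLEN + 3 bytes:
 * two separating spaces plus the terminating NUL.
 * A NULL name means "no engine" and produces an empty string.
 */
static size_t format_pe_suffix(char *buf, const char *name,
			       const char *data, size_t data_len)
{
	size_t len = 0;

	if (name) {
		size_t name_len = strlen(name);

		/* Clamp both parts so the buffer cannot overflow. */
		if (name_len > PENAME_MAXLEN)
			name_len = PENAME_MAXLEN;
		if (data_len > PEDATA_MAXLEN)
			data_len = PEDATA_MAXLEN;

		buf[len++] = ' ';
		memcpy(buf + len, name, name_len);
		len += name_len;
		buf[len++] = ' ';
		if (data_len)
			memcpy(buf + len, data, data_len);
		len += data_len;
	}
	buf[len] = '\0';
	return len;
}
```

A caller would print the existing fields first and then this buffer, so
a reader that stops after the Expires column never sees the new data.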

Index: lvs-test-2.6/include/net/ip_vs.h
===================================================================
--- lvs-test-2.6.orig/include/net/ip_vs.h	2010-10-01 22:27:17.000000000 +0900
+++ lvs-test-2.6/include/net/ip_vs.h	2010-10-01 22:27:32.000000000 +0900
@@ -571,6 +571,7 @@ struct ip_vs_pe {
 	bool (*ct_match)(const struct ip_vs_conn_param *p,
 			 struct ip_vs_conn *ct);
 	u32 (*hashkey_raw)(const struct ip_vs_conn_param *p, u32 initval);
+	int (*show_pe_data)(const struct ip_vs_conn *cp, char *buf);
 };
 
 /*
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 22:27:17.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 22:27:32.000000000 +0900
@@ -938,30 +938,44 @@ static int ip_vs_conn_seq_show(struct se
 
 	if (v == SEQ_START_TOKEN)
 		seq_puts(seq,
-   "Pro FromIP   FPrt ToIP     TPrt DestIP   DPrt State       Expires\n");
+   "Pro FromIP   FPrt ToIP     TPrt DestIP   DPrt State       Expires PEName PEData\n");
 	else {
 		const struct ip_vs_conn *cp = v;
+		char pe_data[IP_VS_PENAME_MAXLEN + IP_VS_PEDATA_MAXLEN + 3];
+		size_t len = 0;
+
+		if (cp->dest->svc->pe && cp->dest->svc->pe->show_pe_data) {
+			pe_data[0] = ' ';
+			len = strlen(cp->dest->svc->pe->name);
+			memcpy(pe_data + 1, cp->dest->svc->pe->name, len);
+			pe_data[len + 1] = ' ';
+			len += 2;
+			len += cp->dest->svc->pe->show_pe_data(cp,
+							       pe_data + len);
+		}
+		pe_data[len] = '\0';
 
 #ifdef CONFIG_IP_VS_IPV6
 		if (cp->af == AF_INET6)
-			seq_printf(seq, "%-3s %pI6 %04X %pI6 %04X %pI6 %04X %-11s %7lu\n",
+			seq_printf(seq, "%-3s %pI6 %04X %pI6 %04X "
+				"%pI6 %04X %-11s %7lu%s\n",
 				ip_vs_proto_name(cp->protocol),
 				&cp->caddr.in6, ntohs(cp->cport),
 				&cp->vaddr.in6, ntohs(cp->vport),
 				&cp->daddr.in6, ntohs(cp->dport),
 				ip_vs_state_name(cp->protocol, cp->state),
-				(cp->timer.expires-jiffies)/HZ);
+				(cp->timer.expires-jiffies)/HZ, pe_data);
 		else
 #endif
 			seq_printf(seq,
 				"%-3s %08X %04X %08X %04X"
-				" %08X %04X %-11s %7lu\n",
+				" %08X %04X %-11s %7lu%s\n",
 				ip_vs_proto_name(cp->protocol),
 				ntohl(cp->caddr.ip), ntohs(cp->cport),
 				ntohl(cp->vaddr.ip), ntohs(cp->vport),
 				ntohl(cp->daddr.ip), ntohs(cp->dport),
 				ip_vs_state_name(cp->protocol, cp->state),
-				(cp->timer.expires-jiffies)/HZ);
+				(cp->timer.expires-jiffies)/HZ, pe_data);
 	}
 	return 0;
 }


^ permalink raw reply

* [patch v2 07/12] [PATCH 07/12] IPVS: Add struct ip_vs_pe
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0007-IPVS-Add-struct-ip_vs_pe.patch --]
[-- Type: text/plain, Size: 11305 bytes --]

Signed-off-by: Simon Horman <horms@verge.net.au>
--- 

This is the first of several patches that add persistence engines.

v2
* Don't leak pe_data
  - It wasn't being freed anywhere, ever
* Trivial rediff
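
The ownership rule behind the pe_data fix can be sketched in user space
as below. Everything here is an assumption made for illustration: the
struct and function names are hypothetical, malloc/free stand in for
kmalloc/kfree, and the live_pe_data counter exists only so the example
can check for leaks. The idea is that pe_data moves into the template
when one is created and is freed when the template expires; on every
other path the caller frees it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for ip_vs_conn_param and a template conn. */
struct param { char *pe_data; };
struct tmpl { char *pe_data; };

static int live_pe_data;	/* outstanding pe_data allocations */

static char *pe_data_dup(const char *s)
{
	size_t n = strlen(s) + 1;
	char *d = malloc(n);

	if (d) {
		memcpy(d, s, n);
		live_pe_data++;
	}
	return d;
}

static void pe_data_free(char *d)
{
	if (d) {
		free(d);
		live_pe_data--;
	}
}

/* On success the template takes ownership of p->pe_data. */
static struct tmpl *tmpl_new(struct param *p, int simulate_failure)
{
	struct tmpl *t;

	if (simulate_failure)
		return NULL;
	t = calloc(1, sizeof(*t));
	if (!t)
		return NULL;
	t->pe_data = p->pe_data;
	return t;
}

/* Expiry is the only place a template's pe_data is released. */
static void tmpl_expire(struct tmpl *t)
{
	pe_data_free(t->pe_data);
	free(t);
}

static struct tmpl *sched_persist(const char *key, struct tmpl *found,
				  int simulate_failure)
{
	struct param p = { pe_data_dup(key) };
	struct tmpl *t;

	if (found) {
		/* Template exists: the freshly filled copy is unused. */
		pe_data_free(p.pe_data);
		return found;
	}

	t = tmpl_new(&p, simulate_failure);
	if (!t) {
		/* Creation failed: ownership never moved. */
		pe_data_free(p.pe_data);
		return NULL;
	}
	return t;
}
```

Each of the three paths (new template, existing template, failed
creation) leaves exactly one owner or none, which is the property the
kfree() calls added to ip_vs_sched_persist() are meant to provide.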

Index: lvs-test-2.6/include/linux/ip_vs.h
===================================================================
--- lvs-test-2.6.orig/include/linux/ip_vs.h	2010-10-01 22:47:39.000000000 +0900
+++ lvs-test-2.6/include/linux/ip_vs.h	2010-10-01 22:48:51.000000000 +0900
@@ -99,8 +99,10 @@
 				0)
 
 #define IP_VS_SCHEDNAME_MAXLEN	16
+#define IP_VS_PENAME_MAXLEN	16
 #define IP_VS_IFNAME_MAXLEN	16
 
+#define IP_VS_PEDATA_MAXLEN     255
 
 /*
  *	The struct ip_vs_service_user and struct ip_vs_dest_user are
Index: lvs-test-2.6/include/net/ip_vs.h
===================================================================
--- lvs-test-2.6.orig/include/net/ip_vs.h	2010-10-01 22:48:42.000000000 +0900
+++ lvs-test-2.6/include/net/ip_vs.h	2010-10-01 22:48:51.000000000 +0900
@@ -364,6 +364,10 @@ struct ip_vs_conn_param {
 	__be16				vport;
 	__u16				protocol;
 	u16				af;
+
+	const struct ip_vs_pe		*pe;
+	char				*pe_data;
+	__u8				pe_data_len;
 };
 
 /*
@@ -416,6 +420,9 @@ struct ip_vs_conn {
 	void                    *app_data;      /* Application private data */
 	struct ip_vs_seq        in_seq;         /* incoming seq. struct */
 	struct ip_vs_seq        out_seq;        /* outgoing seq. struct */
+
+	char			*pe_data;
+	__u8			pe_data_len;
 };
 
 
@@ -486,6 +493,9 @@ struct ip_vs_service {
 	struct ip_vs_scheduler	*scheduler;    /* bound scheduler object */
 	rwlock_t		sched_lock;    /* lock sched_data */
 	void			*sched_data;   /* scheduler application data */
+
+	/* alternate persistence engine */
+	struct ip_vs_pe		*pe;
 };
 
 
@@ -549,6 +559,19 @@ struct ip_vs_scheduler {
 				       const struct sk_buff *skb);
 };
 
+/* The persistence engine object */
+struct ip_vs_pe {
+	struct list_head	n_list;		/* d-linked list head */
+	char			*name;		/* scheduler name */
+	atomic_t		refcnt;		/* reference counter */
+	struct module		*module;	/* THIS_MODULE/NULL */
+
+	/* get the connection template, if any */
+	int (*fill_param)(struct ip_vs_conn_param *p, struct sk_buff *skb);
+	bool (*ct_match)(const struct ip_vs_conn_param *p,
+			 struct ip_vs_conn *ct);
+	u32 (*hashkey_raw)(const struct ip_vs_conn_param *p, u32 initval);
+};
 
 /*
  *	The application module object (a.k.a. app incarnation)
@@ -648,6 +671,8 @@ static inline void ip_vs_conn_fill_param
 	p->cport = cport;
 	p->vaddr = vaddr;
 	p->vport = vport;
+	p->pe = NULL;
+	p->pe_data = NULL;
 }
 
 struct ip_vs_conn *ip_vs_conn_in_get(const struct ip_vs_conn_param *p);
@@ -803,7 +828,7 @@ extern int ip_vs_unbind_scheduler(struct
 extern struct ip_vs_scheduler *ip_vs_scheduler_get(const char *sched_name);
 extern void ip_vs_scheduler_put(struct ip_vs_scheduler *scheduler);
 extern struct ip_vs_conn *
-ip_vs_schedule(struct ip_vs_service *svc, const struct sk_buff *skb);
+ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb);
 extern int ip_vs_leave(struct ip_vs_service *svc, struct sk_buff *skb,
 			struct ip_vs_protocol *pp);
 
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 22:48:42.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 22:49:15.000000000 +0900
@@ -148,6 +148,29 @@ static unsigned int ip_vs_conn_hashkey(i
 		& ip_vs_conn_tab_mask;
 }
 
+static unsigned int ip_vs_conn_hashkey_param(const struct ip_vs_conn_param *p)
+{
+	if (p->pe && p->pe->hashkey_raw)
+		return p->pe->hashkey_raw(p, ip_vs_conn_rnd) &
+			ip_vs_conn_tab_mask;
+	return ip_vs_conn_hashkey(p->af, p->protocol, p->caddr, p->cport);
+}
+
+static unsigned int ip_vs_conn_hashkey_conn(const struct ip_vs_conn *cp)
+{
+	struct ip_vs_conn_param p;
+
+	ip_vs_conn_fill_param(cp->af, cp->protocol, &cp->caddr, cp->cport,
+			      NULL, 0, &p);
+
+	if (cp->dest->svc->pe) {
+		p.pe = cp->dest->svc->pe;
+		p.pe_data = cp->pe_data;
+		p.pe_data_len = cp->pe_data_len;
+	}
+
+	return ip_vs_conn_hashkey_param(&p);
+}
 
 /*
  *	Hashes ip_vs_conn in ip_vs_conn_tab by proto,addr,port.
@@ -162,7 +185,7 @@ static inline int ip_vs_conn_hash(struct
 		return 0;
 
 	/* Hash by protocol, client address and port */
-	hash = ip_vs_conn_hashkey(cp->af, cp->protocol, &cp->caddr, cp->cport);
+	hash = ip_vs_conn_hashkey_conn(cp);
 
 	ct_write_lock(hash);
 	spin_lock(&cp->lock);
@@ -195,7 +218,7 @@ static inline int ip_vs_conn_unhash(stru
 	int ret;
 
 	/* unhash it and decrease its reference counter */
-	hash = ip_vs_conn_hashkey(cp->af, cp->protocol, &cp->caddr, cp->cport);
+	hash = ip_vs_conn_hashkey_conn(cp);
 
 	ct_write_lock(hash);
 	spin_lock(&cp->lock);
@@ -227,7 +250,7 @@ __ip_vs_conn_in_get(const struct ip_vs_c
 	unsigned hash;
 	struct ip_vs_conn *cp;
 
-	hash = ip_vs_conn_hashkey(p->af, p->protocol, p->caddr, p->cport);
+	hash = ip_vs_conn_hashkey_param(p);
 
 	ct_read_lock(hash);
 
@@ -312,11 +335,17 @@ struct ip_vs_conn *ip_vs_ct_in_get(const
 	unsigned hash;
 	struct ip_vs_conn *cp;
 
-	hash = ip_vs_conn_hashkey(p->af, p->protocol, p->caddr, p->cport);
+	hash = ip_vs_conn_hashkey_param(p);
 
 	ct_read_lock(hash);
 
 	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+		if (p->pe && p->pe->ct_match) {
+			if (p->pe->ct_match(p, cp))
+				goto out;
+			continue;
+		}
+
 		if (cp->af == p->af &&
 		    ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
 		    /* protocol should only be IPPROTO_IP if
@@ -325,15 +354,14 @@ struct ip_vs_conn *ip_vs_ct_in_get(const
 				     p->af, p->vaddr, &cp->vaddr) &&
 		    p->cport == cp->cport && p->vport == cp->vport &&
 		    cp->flags & IP_VS_CONN_F_TEMPLATE &&
-		    p->protocol == cp->protocol) {
-			/* HIT */
-			atomic_inc(&cp->refcnt);
+		    p->protocol == cp->protocol)
 			goto out;
-		}
 	}
 	cp = NULL;
 
   out:
+	if (cp)
+		atomic_inc(&cp->refcnt);
 	ct_read_unlock(hash);
 
 	IP_VS_DBG_BUF(9, "template lookup/in %s %s:%d->%s:%d %s\n",
@@ -359,7 +387,7 @@ struct ip_vs_conn *ip_vs_conn_out_get(co
 	/*
 	 *	Check for "full" addressed entries
 	 */
-	hash = ip_vs_conn_hashkey(p->af, p->protocol, p->vaddr, p->vport);
+	hash = ip_vs_conn_hashkey_param(p);
 
 	ct_read_lock(hash);
 
@@ -724,6 +752,7 @@ static void ip_vs_conn_expire(unsigned l
 		if (cp->flags & IP_VS_CONN_F_NFCT)
 			ip_vs_conn_drop_conntrack(cp);
 
+		kfree(cp->pe_data);
 		if (unlikely(cp->app != NULL))
 			ip_vs_unbind_app(cp);
 		ip_vs_unbind_dest(cp);
@@ -784,6 +813,10 @@ ip_vs_conn_new(const struct ip_vs_conn_p
 			&cp->daddr, daddr);
 	cp->dport          = dport;
 	cp->flags	   = flags;
+	if (flags & IP_VS_CONN_F_TEMPLATE && p->pe_data) {
+		cp->pe_data = p->pe_data;
+		cp->pe_data_len = p->pe_data_len;
+	}
 	spin_lock_init(&cp->lock);
 
 	/*
@@ -834,7 +867,6 @@ ip_vs_conn_new(const struct ip_vs_conn_p
 	return cp;
 }
 
-
 /*
  *	/proc/net/ip_vs_conn entries
  */
@@ -850,7 +882,7 @@ static void *ip_vs_conn_array(struct seq
 		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
 			if (pos-- == 0) {
 				seq->private = &ip_vs_conn_tab[idx];
-				return cp;
+			return cp;
 			}
 		}
 		ct_read_unlock_bh(idx);
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 22:48:42.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 22:49:15.000000000 +0900
@@ -176,6 +176,19 @@ ip_vs_set_state(struct ip_vs_conn *cp, i
 	return pp->state_transition(cp, direction, skb, pp);
 }
 
+static inline int
+ip_vs_conn_fill_param_persist(const struct ip_vs_service *svc,
+			      struct sk_buff *skb, int protocol,
+			      const union nf_inet_addr *caddr, __be16 cport,
+			      const union nf_inet_addr *vaddr, __be16 vport,
+			      struct ip_vs_conn_param *p)
+{
+	ip_vs_conn_fill_param(svc->af, protocol, caddr, cport, vaddr, vport, p);
+	p->pe = svc->pe;
+	if (p->pe && p->pe->fill_param)
+		return p->pe->fill_param(p, skb);
+	return 0;
+}
 
 /*
  *  IPVS persistent scheduling function
@@ -186,7 +199,7 @@ ip_vs_set_state(struct ip_vs_conn *cp, i
  */
 static struct ip_vs_conn *
 ip_vs_sched_persist(struct ip_vs_service *svc,
-		    const struct sk_buff *skb,
+		    struct sk_buff *skb,
 		    __be16 ports[2])
 {
 	struct ip_vs_conn *cp = NULL;
@@ -255,8 +268,9 @@ ip_vs_sched_persist(struct ip_vs_service
 				vaddr = &fwmark;
 			}
 		}
-		ip_vs_conn_fill_param(svc->af, protocol, &snet, 0,
-				      vaddr, vport, &param);
+		if (ip_vs_conn_fill_param_persist(svc, skb, protocol, &snet, 0,
+						  vaddr, vport, &param))
+			return NULL;
 	}
 
 	/* Check if a template already exists */
@@ -268,22 +282,31 @@ ip_vs_sched_persist(struct ip_vs_service
 		dest = svc->scheduler->schedule(svc, skb);
 		if (!dest) {
 			IP_VS_DBG(1, "p-schedule: no dest found.\n");
+			kfree(param.pe_data);
 			return NULL;
 		}
 
 		if (ports[1] == svc->port && svc->port != FTPPORT)
 			dport = dest->port;
 
-		/* Create a template */
+		/* Create a template
+		 * This adds param.pe_data to the template,
+		 * and thus param.pe_data will be destroyed
+		 * when the template expires */
 		ct = ip_vs_conn_new(&param, &dest->addr, dport,
 				    IP_VS_CONN_F_TEMPLATE, dest);
-		if (ct == NULL)
+		if (ct == NULL) {
+			kfree(param.pe_data);
 			return NULL;
+		}
 
 		ct->timeout = svc->timeout;
-	} else
+	} else {
 		/* set destination with the found template */
 		dest = ct->dest;
+		kfree(param.pe_data);
+	}
+
 	dport = dest->port;
 
 	flags = (svc->flags & IP_VS_SVC_F_ONEPACKET
@@ -317,7 +340,7 @@ ip_vs_sched_persist(struct ip_vs_service
  *  Protocols supported: TCP, UDP
  */
 struct ip_vs_conn *
-ip_vs_schedule(struct ip_vs_service *svc, const struct sk_buff *skb)
+ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff *skb)
 {
 	struct ip_vs_conn *cp = NULL;
 	struct ip_vs_iphdr iph;
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_sync.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_sync.c	2010-10-01 22:48:42.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_sync.c	2010-10-01 22:48:51.000000000 +0900
@@ -288,6 +288,16 @@ void ip_vs_sync_conn(struct ip_vs_conn *
 		ip_vs_sync_conn(cp->control);
 }
 
+static inline int
+ip_vs_conn_fill_param_sync(int af, int protocol,
+			   const union nf_inet_addr *caddr, __be16 cport,
+			   const union nf_inet_addr *vaddr, __be16 vport,
+			   struct ip_vs_conn_param *p)
+{
+	/* XXX: Need to take into account persistence engine */
+	ip_vs_conn_fill_param(af, protocol, caddr, cport, vaddr, vport, p);
+	return 0;
+}
 
 /*
  *      Process received multicast message and create the corresponding
@@ -372,11 +382,14 @@ static void ip_vs_process_message(const
 		}
 
 		{
-			ip_vs_conn_fill_param(AF_INET, s->protocol,
+			if (ip_vs_conn_fill_param_sync(AF_INET, s->protocol,
 					      (union nf_inet_addr *)&s->caddr,
 					      s->cport,
 					      (union nf_inet_addr *)&s->vaddr,
-					      s->vport, &param);
+					      s->vport, &param)) {
+				pr_err("ip_vs_conn_fill_param_sync failed");
+				return;
+			}
 			if (!(flags & IP_VS_CONN_F_TEMPLATE))
 				cp = ip_vs_conn_in_get(&param);
 			else


^ permalink raw reply

* [patch v2 05/12] [PATCH 05/12] IPVS: Allow null argument to ip_vs_scheduler_put()
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0005-IPVS-Allow-null-argument-to-ip_vs_scheduler_put.patch --]
[-- Type: text/plain, Size: 2041 bytes --]

This simplifies caller logic slightly.

Signed-off-by: Simon Horman <horms@verge.net.au>
---

v2
* Trivial rediff
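
A user-space sketch of the pattern, for illustration only: the toy_*
types and functions below are hypothetical stand-ins, not the kernel's
struct module or module_put(). It shows a put routine that treats NULL
as a no-op, which is what lets callers drop their own NULL checks.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct module and struct ip_vs_scheduler. */
struct toy_module { int refcnt; };
struct toy_scheduler { struct toy_module *module; };

static void toy_module_put(struct toy_module *m)
{
	m->refcnt--;
}

/* Accepts NULL, and schedulers without a module, as no-ops. */
static void toy_scheduler_put(struct toy_scheduler *s)
{
	if (s && s->module)
		toy_module_put(s->module);
}
```

An error path can then call toy_scheduler_put(sched) unconditionally,
whether or not the scheduler lookup succeeded, so a single exit label
is enough.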

Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_ctl.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_ctl.c	2010-10-01 22:25:53.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_ctl.c	2010-10-01 22:26:10.000000000 +0900
@@ -1144,7 +1144,7 @@ ip_vs_add_service(struct ip_vs_service_u
 	if (sched == NULL) {
 		pr_info("Scheduler module ip_vs_%s not found\n", u->sched_name);
 		ret = -ENOENT;
-		goto out_mod_dec;
+		goto out_err;
 	}
 
 #ifdef CONFIG_IP_VS_IPV6
@@ -1204,7 +1204,7 @@ ip_vs_add_service(struct ip_vs_service_u
 	*svc_p = svc;
 	return 0;
 
-  out_err:
+ out_err:
 	if (svc != NULL) {
 		if (svc->scheduler)
 			ip_vs_unbind_scheduler(svc);
@@ -1217,7 +1217,6 @@ ip_vs_add_service(struct ip_vs_service_u
 	}
 	ip_vs_scheduler_put(sched);
 
-  out_mod_dec:
 	/* decrease the module use count */
 	ip_vs_use_count_dec();
 
@@ -1300,10 +1299,7 @@ ip_vs_edit_service(struct ip_vs_service
 #ifdef CONFIG_IP_VS_IPV6
   out:
 #endif
-
-	if (old_sched)
-		ip_vs_scheduler_put(old_sched);
-
+	ip_vs_scheduler_put(old_sched);
 	return ret;
 }
 
@@ -1327,8 +1323,7 @@ static void __ip_vs_del_service(struct i
 	/* Unbind scheduler */
 	old_sched = svc->scheduler;
 	ip_vs_unbind_scheduler(svc);
-	if (old_sched)
-		ip_vs_scheduler_put(old_sched);
+	ip_vs_scheduler_put(old_sched);
 
 	/* Unbind app inc */
 	if (svc->inc) {
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_sched.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_sched.c	2010-10-01 22:25:53.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_sched.c	2010-10-01 22:26:10.000000000 +0900
@@ -159,7 +159,7 @@ struct ip_vs_scheduler *ip_vs_scheduler_
 
 void ip_vs_scheduler_put(struct ip_vs_scheduler *scheduler)
 {
-	if (scheduler->module)
+	if (scheduler && scheduler->module)
 		module_put(scheduler->module);
 }
 


^ permalink raw reply

* [patch v2 04/12] [PATCH 04/12] IPVS: Add struct ip_vs_conn_param
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0004-IPVS-Add-struct-ip_vs_conn_param.patch --]
[-- Type: text/plain, Size: 25185 bytes --]

Signed-off-by: Simon Horman <horms@verge.net.au>
---

The motivation for this is to allow persistence engine modules to
fill in the parameters.

v0.3
* Add missing changes to ip_vs_ftp.c

v2
* make "union nf_inet_addr fwmark" const
* Update for the recent addition of ip_vs_nfct.c
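
To illustrate the shape of the change, here is a user-space sketch
under assumptions: the structures are simplified, addresses are plain
32-bit integers, the connection table is a small static array rather
than a hash table, and the IP_VS_CONN_F_NO_CPORT flag check is omitted.
It shows the six lookup arguments bundled into one parameter structure,
and the retry with a zero client port done by copying that structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified version of struct ip_vs_conn_param. */
struct conn_param {
	int af;
	int protocol;
	uint32_t caddr;
	uint16_t cport;
	uint32_t vaddr;
	uint16_t vport;
};

struct conn {
	int af;
	int protocol;
	uint32_t caddr;
	uint16_t cport;
	uint32_t vaddr;
	uint16_t vport;
};

/* Toy table: one fully specified entry, one with no client port. */
static const struct conn conn_tab[] = {
	{ 2, 6, 0x0a000001, 40000, 0xc0a80001, 80 },
	{ 2, 6, 0x0a000002, 0,     0xc0a80001, 21 },
};

static void fill_param(int af, int protocol,
		       uint32_t caddr, uint16_t cport,
		       uint32_t vaddr, uint16_t vport,
		       struct conn_param *p)
{
	p->af = af;
	p->protocol = protocol;
	p->caddr = caddr;
	p->cport = cport;
	p->vaddr = vaddr;
	p->vport = vport;
}

/* Exact match on every field of the parameter structure. */
static const struct conn *conn_lookup(const struct conn_param *p)
{
	size_t i;

	for (i = 0; i < sizeof(conn_tab) / sizeof(conn_tab[0]); i++) {
		const struct conn *c = &conn_tab[i];

		if (c->af == p->af && c->protocol == p->protocol &&
		    c->caddr == p->caddr && c->cport == p->cport &&
		    c->vaddr == p->vaddr && c->vport == p->vport)
			return c;
	}
	return NULL;
}

/* Try the exact parameters, then a copy with the client port zeroed. */
static const struct conn *conn_in_get(const struct conn_param *p)
{
	const struct conn *c = conn_lookup(p);

	if (!c) {
		struct conn_param cport_zero_p = *p;

		cport_zero_p.cport = 0;
		c = conn_lookup(&cport_zero_p);
	}
	return c;
}
```

Because the lookup takes a single pointer, a later patch can add fields
to the structure without touching every caller's argument list.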

Index: lvs-test-2.6/include/net/ip_vs.h
===================================================================
--- lvs-test-2.6.orig/include/net/ip_vs.h	2010-10-01 21:56:39.000000000 +0900
+++ lvs-test-2.6/include/net/ip_vs.h	2010-10-01 22:07:22.000000000 +0900
@@ -357,6 +357,15 @@ struct ip_vs_protocol {
 
 extern struct ip_vs_protocol * ip_vs_proto_get(unsigned short proto);
 
+struct ip_vs_conn_param {
+	const union nf_inet_addr	*caddr;
+	const union nf_inet_addr	*vaddr;
+	__be16				cport;
+	__be16				vport;
+	__u16				protocol;
+	u16				af;
+};
+
 /*
  *	IP_VS structure allocated for each dynamically scheduled connection
  */
@@ -626,13 +635,23 @@ enum {
 	IP_VS_DIR_LAST,
 };
 
-extern struct ip_vs_conn *ip_vs_conn_in_get
-(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
- const union nf_inet_addr *d_addr, __be16 d_port);
-
-extern struct ip_vs_conn *ip_vs_ct_in_get
-(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
- const union nf_inet_addr *d_addr, __be16 d_port);
+static inline void ip_vs_conn_fill_param(int af, int protocol,
+					 const union nf_inet_addr *caddr,
+					 __be16 cport,
+					 const union nf_inet_addr *vaddr,
+					 __be16 vport,
+					 struct ip_vs_conn_param *p)
+{
+	p->af = af;
+	p->protocol = protocol;
+	p->caddr = caddr;
+	p->cport = cport;
+	p->vaddr = vaddr;
+	p->vport = vport;
+}
+
+struct ip_vs_conn *ip_vs_conn_in_get(const struct ip_vs_conn_param *p);
+struct ip_vs_conn *ip_vs_ct_in_get(const struct ip_vs_conn_param *p);
 
 struct ip_vs_conn * ip_vs_conn_in_get_proto(int af, const struct sk_buff *skb,
 					    struct ip_vs_protocol *pp,
@@ -640,9 +659,7 @@ struct ip_vs_conn * ip_vs_conn_in_get_pr
 					    unsigned int proto_off,
 					    int inverse);
 
-extern struct ip_vs_conn *ip_vs_conn_out_get
-(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
- const union nf_inet_addr *d_addr, __be16 d_port);
+struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p);
 
 struct ip_vs_conn * ip_vs_conn_out_get_proto(int af, const struct sk_buff *skb,
 					     struct ip_vs_protocol *pp,
@@ -658,11 +675,10 @@ static inline void __ip_vs_conn_put(stru
 extern void ip_vs_conn_put(struct ip_vs_conn *cp);
 extern void ip_vs_conn_fill_cport(struct ip_vs_conn *cp, __be16 cport);
 
-extern struct ip_vs_conn *
-ip_vs_conn_new(int af, int proto, const union nf_inet_addr *caddr, __be16 cport,
-	       const union nf_inet_addr *vaddr, __be16 vport,
-	       const union nf_inet_addr *daddr, __be16 dport, unsigned flags,
-	       struct ip_vs_dest *dest);
+struct ip_vs_conn *ip_vs_conn_new(const struct ip_vs_conn_param *p,
+				  const union nf_inet_addr *daddr,
+				  __be16 dport, unsigned flags,
+				  struct ip_vs_dest *dest);
 extern void ip_vs_conn_expire_now(struct ip_vs_conn *cp);
 
 extern const char * ip_vs_state_name(__u16 proto, int state);
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 21:56:39.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c	2010-10-01 22:07:22.000000000 +0900
@@ -218,27 +218,26 @@ static inline int ip_vs_conn_unhash(stru
 /*
  *  Gets ip_vs_conn associated with supplied parameters in the ip_vs_conn_tab.
  *  Called for pkts coming from OUTside-to-INside.
- *	s_addr, s_port: pkt source address (foreign host)
- *	d_addr, d_port: pkt dest address (load balancer)
+ *	p->caddr, p->cport: pkt source address (foreign host)
+ *	p->vaddr, p->vport: pkt dest address (load balancer)
  */
-static inline struct ip_vs_conn *__ip_vs_conn_in_get
-(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
- const union nf_inet_addr *d_addr, __be16 d_port)
+static inline struct ip_vs_conn *
+__ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
 {
 	unsigned hash;
 	struct ip_vs_conn *cp;
 
-	hash = ip_vs_conn_hashkey(af, protocol, s_addr, s_port);
+	hash = ip_vs_conn_hashkey(p->af, p->protocol, p->caddr, p->cport);
 
 	ct_read_lock(hash);
 
 	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
-		if (cp->af == af &&
-		    ip_vs_addr_equal(af, s_addr, &cp->caddr) &&
-		    ip_vs_addr_equal(af, d_addr, &cp->vaddr) &&
-		    s_port == cp->cport && d_port == cp->vport &&
-		    ((!s_port) ^ (!(cp->flags & IP_VS_CONN_F_NO_CPORT))) &&
-		    protocol == cp->protocol) {
+		if (cp->af == p->af &&
+		    ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
+		    ip_vs_addr_equal(p->af, p->vaddr, &cp->vaddr) &&
+		    p->cport == cp->cport && p->vport == cp->vport &&
+		    ((!p->cport) ^ (!(cp->flags & IP_VS_CONN_F_NO_CPORT))) &&
+		    p->protocol == cp->protocol) {
 			/* HIT */
 			atomic_inc(&cp->refcnt);
 			ct_read_unlock(hash);
@@ -251,71 +250,82 @@ static inline struct ip_vs_conn *__ip_vs
 	return NULL;
 }
 
-struct ip_vs_conn *ip_vs_conn_in_get
-(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
- const union nf_inet_addr *d_addr, __be16 d_port)
+struct ip_vs_conn *ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
 {
 	struct ip_vs_conn *cp;
 
-	cp = __ip_vs_conn_in_get(af, protocol, s_addr, s_port, d_addr, d_port);
-	if (!cp && atomic_read(&ip_vs_conn_no_cport_cnt))
-		cp = __ip_vs_conn_in_get(af, protocol, s_addr, 0, d_addr,
-					 d_port);
+	cp = __ip_vs_conn_in_get(p);
+	if (!cp && atomic_read(&ip_vs_conn_no_cport_cnt)) {
+		struct ip_vs_conn_param cport_zero_p = *p;
+		cport_zero_p.cport = 0;
+		cp = __ip_vs_conn_in_get(&cport_zero_p);
+	}
 
 	IP_VS_DBG_BUF(9, "lookup/in %s %s:%d->%s:%d %s\n",
-		      ip_vs_proto_name(protocol),
-		      IP_VS_DBG_ADDR(af, s_addr), ntohs(s_port),
-		      IP_VS_DBG_ADDR(af, d_addr), ntohs(d_port),
+		      ip_vs_proto_name(p->protocol),
+		      IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport),
+		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
 		      cp ? "hit" : "not hit");
 
 	return cp;
 }
 
+static int
+ip_vs_conn_fill_param_proto(int af, const struct sk_buff *skb,
+			    const struct ip_vs_iphdr *iph,
+			    unsigned int proto_off, int inverse,
+			    struct ip_vs_conn_param *p)
+{
+	__be16 _ports[2], *pptr;
+
+	pptr = skb_header_pointer(skb, proto_off, sizeof(_ports), _ports);
+	if (pptr == NULL)
+		return 1;
+
+	if (likely(!inverse))
+		ip_vs_conn_fill_param(af, iph->protocol, &iph->saddr, pptr[0],
+				      &iph->daddr, pptr[1], p);
+	else
+		ip_vs_conn_fill_param(af, iph->protocol, &iph->daddr, pptr[1],
+				      &iph->saddr, pptr[0], p);
+	return 0;
+}
+
 struct ip_vs_conn *
 ip_vs_conn_in_get_proto(int af, const struct sk_buff *skb,
 			struct ip_vs_protocol *pp,
 			const struct ip_vs_iphdr *iph,
 			unsigned int proto_off, int inverse)
 {
-	__be16 _ports[2], *pptr;
+	struct ip_vs_conn_param p;
 
-	pptr = skb_header_pointer(skb, proto_off, sizeof(_ports), _ports);
-	if (pptr == NULL)
+	if (ip_vs_conn_fill_param_proto(af, skb, iph, proto_off, inverse, &p))
 		return NULL;
 
-	if (likely(!inverse))
-		return ip_vs_conn_in_get(af, iph->protocol,
-					 &iph->saddr, pptr[0],
-					 &iph->daddr, pptr[1]);
-	else
-		return ip_vs_conn_in_get(af, iph->protocol,
-					 &iph->daddr, pptr[1],
-					 &iph->saddr, pptr[0]);
+	return ip_vs_conn_in_get(&p);
 }
 EXPORT_SYMBOL_GPL(ip_vs_conn_in_get_proto);
 
 /* Get reference to connection template */
-struct ip_vs_conn *ip_vs_ct_in_get
-(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
- const union nf_inet_addr *d_addr, __be16 d_port)
+struct ip_vs_conn *ip_vs_ct_in_get(const struct ip_vs_conn_param *p)
 {
 	unsigned hash;
 	struct ip_vs_conn *cp;
 
-	hash = ip_vs_conn_hashkey(af, protocol, s_addr, s_port);
+	hash = ip_vs_conn_hashkey(p->af, p->protocol, p->caddr, p->cport);
 
 	ct_read_lock(hash);
 
 	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
-		if (cp->af == af &&
-		    ip_vs_addr_equal(af, s_addr, &cp->caddr) &&
+		if (cp->af == p->af &&
+		    ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
 		    /* protocol should only be IPPROTO_IP if
-		     * d_addr is a fwmark */
-		    ip_vs_addr_equal(protocol == IPPROTO_IP ? AF_UNSPEC : af,
-		                     d_addr, &cp->vaddr) &&
-		    s_port == cp->cport && d_port == cp->vport &&
+		     * p->vaddr is a fwmark */
+		    ip_vs_addr_equal(p->protocol == IPPROTO_IP ? AF_UNSPEC :
+				     p->af, p->vaddr, &cp->vaddr) &&
+		    p->cport == cp->cport && p->vport == cp->vport &&
 		    cp->flags & IP_VS_CONN_F_TEMPLATE &&
-		    protocol == cp->protocol) {
+		    p->protocol == cp->protocol) {
 			/* HIT */
 			atomic_inc(&cp->refcnt);
 			goto out;
@@ -327,9 +337,9 @@ struct ip_vs_conn *ip_vs_ct_in_get
 	ct_read_unlock(hash);
 
 	IP_VS_DBG_BUF(9, "template lookup/in %s %s:%d->%s:%d %s\n",
-		      ip_vs_proto_name(protocol),
-		      IP_VS_DBG_ADDR(af, s_addr), ntohs(s_port),
-		      IP_VS_DBG_ADDR(af, d_addr), ntohs(d_port),
+		      ip_vs_proto_name(p->protocol),
+		      IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport),
+		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
 		      cp ? "hit" : "not hit");
 
 	return cp;
@@ -341,9 +351,7 @@ struct ip_vs_conn *ip_vs_ct_in_get
  *	s_addr, s_port: pkt source address (inside host)
  *	d_addr, d_port: pkt dest address (foreign host)
  */
-struct ip_vs_conn *ip_vs_conn_out_get
-(int af, int protocol, const union nf_inet_addr *s_addr, __be16 s_port,
- const union nf_inet_addr *d_addr, __be16 d_port)
+struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p)
 {
 	unsigned hash;
 	struct ip_vs_conn *cp, *ret=NULL;
@@ -351,16 +359,16 @@ struct ip_vs_conn *ip_vs_conn_out_get
 	/*
 	 *	Check for "full" addressed entries
 	 */
-	hash = ip_vs_conn_hashkey(af, protocol, d_addr, d_port);
+	hash = ip_vs_conn_hashkey(p->af, p->protocol, p->vaddr, p->vport);
 
 	ct_read_lock(hash);
 
 	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
-		if (cp->af == af &&
-		    ip_vs_addr_equal(af, d_addr, &cp->caddr) &&
-		    ip_vs_addr_equal(af, s_addr, &cp->daddr) &&
-		    d_port == cp->cport && s_port == cp->dport &&
-		    protocol == cp->protocol) {
+		if (cp->af == p->af &&
+		    ip_vs_addr_equal(p->af, p->vaddr, &cp->caddr) &&
+		    ip_vs_addr_equal(p->af, p->caddr, &cp->daddr) &&
+		    p->vport == cp->cport && p->cport == cp->dport &&
+		    p->protocol == cp->protocol) {
 			/* HIT */
 			atomic_inc(&cp->refcnt);
 			ret = cp;
@@ -371,9 +379,9 @@ struct ip_vs_conn *ip_vs_conn_out_get
 	ct_read_unlock(hash);
 
 	IP_VS_DBG_BUF(9, "lookup/out %s %s:%d->%s:%d %s\n",
-		      ip_vs_proto_name(protocol),
-		      IP_VS_DBG_ADDR(af, s_addr), ntohs(s_port),
-		      IP_VS_DBG_ADDR(af, d_addr), ntohs(d_port),
+		      ip_vs_proto_name(p->protocol),
+		      IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport),
+		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
 		      ret ? "hit" : "not hit");
 
 	return ret;
@@ -385,20 +393,12 @@ ip_vs_conn_out_get_proto(int af, const s
 			 const struct ip_vs_iphdr *iph,
 			 unsigned int proto_off, int inverse)
 {
-	__be16 _ports[2], *pptr;
+	struct ip_vs_conn_param p;
 
-	pptr = skb_header_pointer(skb, proto_off, sizeof(_ports), _ports);
-	if (pptr == NULL)
+	if (ip_vs_conn_fill_param_proto(af, skb, iph, proto_off, inverse, &p))
 		return NULL;
 
-	if (likely(!inverse))
-		return ip_vs_conn_out_get(af, iph->protocol,
-					  &iph->saddr, pptr[0],
-					  &iph->daddr, pptr[1]);
-	else
-		return ip_vs_conn_out_get(af, iph->protocol,
-					  &iph->daddr, pptr[1],
-					  &iph->saddr, pptr[0]);
+	return ip_vs_conn_out_get(&p);
 }
 EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto);
 
@@ -758,13 +758,12 @@ void ip_vs_conn_expire_now(struct ip_vs_
  *	Create a new connection entry and hash it into the ip_vs_conn_tab
  */
 struct ip_vs_conn *
-ip_vs_conn_new(int af, int proto, const union nf_inet_addr *caddr, __be16 cport,
-	       const union nf_inet_addr *vaddr, __be16 vport,
+ip_vs_conn_new(const struct ip_vs_conn_param *p,
 	       const union nf_inet_addr *daddr, __be16 dport, unsigned flags,
 	       struct ip_vs_dest *dest)
 {
 	struct ip_vs_conn *cp;
-	struct ip_vs_protocol *pp = ip_vs_proto_get(proto);
+	struct ip_vs_protocol *pp = ip_vs_proto_get(p->protocol);
 
 	cp = kmem_cache_zalloc(ip_vs_conn_cachep, GFP_ATOMIC);
 	if (cp == NULL) {
@@ -774,14 +773,14 @@ ip_vs_conn_new(int af, int proto, const
 
 	INIT_LIST_HEAD(&cp->c_list);
 	setup_timer(&cp->timer, ip_vs_conn_expire, (unsigned long)cp);
-	cp->af		   = af;
-	cp->protocol	   = proto;
-	ip_vs_addr_copy(af, &cp->caddr, caddr);
-	cp->cport	   = cport;
-	ip_vs_addr_copy(af, &cp->vaddr, vaddr);
-	cp->vport	   = vport;
+	cp->af		   = p->af;
+	cp->protocol	   = p->protocol;
+	ip_vs_addr_copy(p->af, &cp->caddr, p->caddr);
+	cp->cport	   = p->cport;
+	ip_vs_addr_copy(p->af, &cp->vaddr, p->vaddr);
+	cp->vport	   = p->vport;
 	/* proto should only be IPPROTO_IP if d_addr is a fwmark */
-	ip_vs_addr_copy(proto == IPPROTO_IP ? AF_UNSPEC : af,
+	ip_vs_addr_copy(p->protocol == IPPROTO_IP ? AF_UNSPEC : p->af,
 			&cp->daddr, daddr);
 	cp->dport          = dport;
 	cp->flags	   = flags;
@@ -810,7 +809,7 @@ ip_vs_conn_new(int af, int proto, const
 
 	/* Bind its packet transmitter */
 #ifdef CONFIG_IP_VS_IPV6
-	if (af == AF_INET6)
+	if (p->af == AF_INET6)
 		ip_vs_bind_xmit_v6(cp);
 	else
 #endif
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 22:06:23.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 22:10:46.000000000 +0900
@@ -193,14 +193,11 @@ ip_vs_sched_persist(struct ip_vs_service
 	struct ip_vs_iphdr iph;
 	struct ip_vs_dest *dest;
 	struct ip_vs_conn *ct;
-	int protocol = iph.protocol;
 	__be16 dport = 0;		/* destination port to forward */
-	__be16 vport = 0;		/* virtual service port */
 	unsigned int flags;
 	union nf_inet_addr snet;	/* source network of the client,
 					   after masking */
-	const union nf_inet_addr fwmark = { .ip = htonl(svc->fwmark) };
-	const union nf_inet_addr *vaddr = &iph.daddr;
+	struct ip_vs_conn_param param;
 
 	ip_vs_fill_iphdr(svc->af, skb_network_header(skb), &iph);
 
@@ -232,6 +229,11 @@ ip_vs_sched_persist(struct ip_vs_service
 	 * is created for other persistent services.
 	 */
 	{
+		int protocol = iph.protocol;
+		const union nf_inet_addr *vaddr = &iph.daddr;
+		const union nf_inet_addr fwmark = { .ip = htonl(svc->fwmark) };
+		__be16 vport = 0;
+
 		if (ports[1] == svc->port) {
 			/* non-FTP template:
 			 * <protocol, caddr, 0, vaddr, vport, daddr, dport>
@@ -253,11 +255,12 @@ ip_vs_sched_persist(struct ip_vs_service
 				vaddr = &fwmark;
 			}
 		}
+		ip_vs_conn_fill_param(svc->af, protocol, &snet, 0,
+				      vaddr, vport, &param);
 	}
 
 	/* Check if a template already exists */
-	ct = ip_vs_ct_in_get(svc->af, protocol, &snet, 0, vaddr, vport);
-
+	ct = ip_vs_ct_in_get(&param);
 	if (!ct || !ip_vs_check_template(ct)) {
 		/* No template found or the dest of the connection
 		 * template is not available.
@@ -272,8 +275,7 @@ ip_vs_sched_persist(struct ip_vs_service
 			dport = dest->port;
 
 		/* Create a template */
-		ct = ip_vs_conn_new(svc->af, protocol, &snet, 0,vaddr, vport,
-				    &dest->addr, dport,
+		ct = ip_vs_conn_new(&param, &dest->addr, dport,
 				    IP_VS_CONN_F_TEMPLATE, dest);
 		if (ct == NULL)
 			return NULL;
@@ -291,12 +293,7 @@ ip_vs_sched_persist(struct ip_vs_service
 	/*
 	 *    Create a new connection according to the template
 	 */
-	cp = ip_vs_conn_new(svc->af, iph.protocol,
-			    &iph.saddr, ports[0],
-			    &iph.daddr, ports[1],
-			    &dest->addr, dport,
-			    flags,
-			    dest);
+	cp = ip_vs_conn_new(&param, &dest->addr, dport, flags, dest);
 	if (cp == NULL) {
 		ip_vs_conn_put(ct);
 		return NULL;
@@ -363,14 +360,16 @@ ip_vs_schedule(struct ip_vs_service *svc
 	/*
 	 *    Create a connection entry.
 	 */
-	cp = ip_vs_conn_new(svc->af, iph.protocol,
-			    &iph.saddr, pptr[0],
-			    &iph.daddr, pptr[1],
-			    &dest->addr, dest->port ? dest->port : pptr[1],
-			    flags,
-			    dest);
-	if (cp == NULL)
-		return NULL;
+	{
+		struct ip_vs_conn_param p;
+		ip_vs_conn_fill_param(svc->af, iph.protocol, &iph.saddr,
+				      pptr[0], &iph.daddr, pptr[1], &p);
+		cp = ip_vs_conn_new(&p, &dest->addr,
+				    dest->port ? dest->port : pptr[1],
+				    flags, dest);
+		if (!cp)
+			return NULL;
+	}
 
 	IP_VS_DBG_BUF(6, "Schedule fwd:%c c:%s:%u v:%s:%u "
 		      "d:%s:%u conn->flags:%X conn->refcnt:%d\n",
@@ -426,14 +425,17 @@ int ip_vs_leave(struct ip_vs_service *sv
 
 		/* create a new connection entry */
 		IP_VS_DBG(6, "%s(): create a cache_bypass entry\n", __func__);
-		cp = ip_vs_conn_new(svc->af, iph.protocol,
-				    &iph.saddr, pptr[0],
-				    &iph.daddr, pptr[1],
-				    &daddr, 0,
-				    IP_VS_CONN_F_BYPASS | flags,
-				    NULL);
-		if (cp == NULL)
-			return NF_DROP;
+		{
+			struct ip_vs_conn_param p;
+			ip_vs_conn_fill_param(svc->af, iph.protocol,
+					      &iph.saddr, pptr[0],
+					      &iph.daddr, pptr[1], &p);
+			cp = ip_vs_conn_new(&p, &daddr, 0,
+					    IP_VS_CONN_F_BYPASS | flags,
+					    NULL);
+			if (!cp)
+				return NF_DROP;
+		}
 
 		/* statistics */
 		ip_vs_in_stats(cp, skb);
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_ftp.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_ftp.c	2010-10-01 22:14:10.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_ftp.c	2010-10-01 22:21:09.000000000 +0900
@@ -195,13 +195,17 @@ static int ip_vs_ftp_out(struct ip_vs_ap
 		/*
 		 * Now update or create an connection entry for it
 		 */
-		n_cp = ip_vs_conn_out_get(AF_INET, iph->protocol, &from, port,
-					  &cp->caddr, 0);
+		{
+			struct ip_vs_conn_param p;
+			ip_vs_conn_fill_param(AF_INET, iph->protocol,
+					      &from, port, &cp->caddr, 0, &p);
+			n_cp = ip_vs_conn_out_get(&p);
+		}
 		if (!n_cp) {
-			n_cp = ip_vs_conn_new(AF_INET, IPPROTO_TCP,
-					      &cp->caddr, 0,
-					      &cp->vaddr, port,
-					      &from, port,
+			struct ip_vs_conn_param p;
+			ip_vs_conn_fill_param(AF_INET, IPPROTO_TCP, &cp->caddr,
+					      0, &cp->vaddr, port, &p);
+			n_cp = ip_vs_conn_new(&p, &from, port,
 					      IP_VS_CONN_F_NO_CPORT |
 					      IP_VS_CONN_F_NFCT,
 					      cp->dest);
@@ -347,21 +351,22 @@ static int ip_vs_ftp_in(struct ip_vs_app
 		  ip_vs_proto_name(iph->protocol),
 		  &to.ip, ntohs(port), &cp->vaddr.ip, 0);
 
-	n_cp = ip_vs_conn_in_get(AF_INET, iph->protocol,
-				 &to, port,
-				 &cp->vaddr, htons(ntohs(cp->vport)-1));
-	if (!n_cp) {
-		n_cp = ip_vs_conn_new(AF_INET, IPPROTO_TCP,
-				      &to, port,
+	{
+		struct ip_vs_conn_param p;
+		ip_vs_conn_fill_param(AF_INET, iph->protocol, &to, port,
 				      &cp->vaddr, htons(ntohs(cp->vport)-1),
-				      &cp->daddr, htons(ntohs(cp->dport)-1),
-				      IP_VS_CONN_F_NFCT,
-				      cp->dest);
-		if (!n_cp)
-			return 0;
+				      &p);
+		n_cp = ip_vs_conn_in_get(&p);
+		if (!n_cp) {
+			n_cp = ip_vs_conn_new(&p, &cp->daddr,
+					      htons(ntohs(cp->dport)-1),
+					      IP_VS_CONN_F_NFCT, cp->dest);
+			if (!n_cp)
+				return 0;
 
-		/* add its controller */
-		ip_vs_control_add(n_cp, cp);
+			/* add its controller */
+			ip_vs_control_add(n_cp, cp);
+		}
 	}
 
 	/*
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_nfct.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_nfct.c	2010-10-01 22:11:53.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_nfct.c	2010-10-01 22:24:29.000000000 +0900
@@ -140,6 +140,7 @@ static void ip_vs_nfct_expect_callback(s
 {
 	struct nf_conntrack_tuple *orig, new_reply;
 	struct ip_vs_conn *cp;
+	struct ip_vs_conn_param p;
 
 	if (exp->tuple.src.l3num != PF_INET)
 		return;
@@ -154,9 +155,10 @@ static void ip_vs_nfct_expect_callback(s
 
 	/* RS->CLIENT */
 	orig = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
-	cp = ip_vs_conn_out_get(exp->tuple.src.l3num, orig->dst.protonum,
-				&orig->src.u3, orig->src.u.tcp.port,
-				&orig->dst.u3, orig->dst.u.tcp.port);
+	ip_vs_conn_fill_param(exp->tuple.src.l3num, orig->dst.protonum,
+			      &orig->src.u3, orig->src.u.tcp.port,
+			      &orig->dst.u3, orig->dst.u.tcp.port, &p);
+	cp = ip_vs_conn_out_get(&p);
 	if (cp) {
 		/* Change reply CLIENT->RS to CLIENT->VS */
 		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
@@ -176,9 +178,7 @@ static void ip_vs_nfct_expect_callback(s
 	}
 
 	/* CLIENT->VS */
-	cp = ip_vs_conn_in_get(exp->tuple.src.l3num, orig->dst.protonum,
-			       &orig->src.u3, orig->src.u.tcp.port,
-			       &orig->dst.u3, orig->dst.u.tcp.port);
+	cp = ip_vs_conn_in_get(&p);
 	if (cp) {
 		/* Change reply VS->CLIENT to RS->CLIENT */
 		new_reply = ct->tuplehash[IP_CT_DIR_REPLY].tuple;
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_proto_ah_esp.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_proto_ah_esp.c	2010-10-01 21:55:19.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_proto_ah_esp.c	2010-10-01 22:23:33.000000000 +0900
@@ -40,6 +40,19 @@ struct isakmp_hdr {
 
 #define PORT_ISAKMP	500
 
+static void
+ah_esp_conn_fill_param_proto(int af, const struct ip_vs_iphdr *iph,
+			     int inverse, struct ip_vs_conn_param *p)
+{
+	if (likely(!inverse))
+		ip_vs_conn_fill_param(af, IPPROTO_UDP,
+				      &iph->saddr, htons(PORT_ISAKMP),
+				      &iph->daddr, htons(PORT_ISAKMP), p);
+	else
+		ip_vs_conn_fill_param(af, IPPROTO_UDP,
+				      &iph->daddr, htons(PORT_ISAKMP),
+				      &iph->saddr, htons(PORT_ISAKMP), p);
+}
 
 static struct ip_vs_conn *
 ah_esp_conn_in_get(int af, const struct sk_buff *skb, struct ip_vs_protocol *pp,
@@ -47,21 +60,10 @@ ah_esp_conn_in_get(int af, const struct
 		   int inverse)
 {
 	struct ip_vs_conn *cp;
+	struct ip_vs_conn_param p;
 
-	if (likely(!inverse)) {
-		cp = ip_vs_conn_in_get(af, IPPROTO_UDP,
-				       &iph->saddr,
-				       htons(PORT_ISAKMP),
-				       &iph->daddr,
-				       htons(PORT_ISAKMP));
-	} else {
-		cp = ip_vs_conn_in_get(af, IPPROTO_UDP,
-				       &iph->daddr,
-				       htons(PORT_ISAKMP),
-				       &iph->saddr,
-				       htons(PORT_ISAKMP));
-	}
-
+	ah_esp_conn_fill_param_proto(af, iph, inverse, &p);
+	cp = ip_vs_conn_in_get(&p);
 	if (!cp) {
 		/*
 		 * We are not sure if the packet is from our
@@ -87,21 +89,10 @@ ah_esp_conn_out_get(int af, const struct
 		    int inverse)
 {
 	struct ip_vs_conn *cp;
+	struct ip_vs_conn_param p;
 
-	if (likely(!inverse)) {
-		cp = ip_vs_conn_out_get(af, IPPROTO_UDP,
-					&iph->saddr,
-					htons(PORT_ISAKMP),
-					&iph->daddr,
-					htons(PORT_ISAKMP));
-	} else {
-		cp = ip_vs_conn_out_get(af, IPPROTO_UDP,
-					&iph->daddr,
-					htons(PORT_ISAKMP),
-					&iph->saddr,
-					htons(PORT_ISAKMP));
-	}
-
+	ah_esp_conn_fill_param_proto(af, iph, inverse, &p);
+	cp = ip_vs_conn_out_get(&p);
 	if (!cp) {
 		IP_VS_DBG_BUF(12, "Unknown ISAKMP entry for inout packet "
 			      "%s%s %s->%s\n",
Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_sync.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_sync.c	2010-10-01 21:55:19.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_sync.c	2010-10-01 22:23:33.000000000 +0900
@@ -301,6 +301,7 @@ static void ip_vs_process_message(const
 	struct ip_vs_conn *cp;
 	struct ip_vs_protocol *pp;
 	struct ip_vs_dest *dest;
+	struct ip_vs_conn_param param;
 	char *p;
 	int i;
 
@@ -370,18 +371,17 @@ static void ip_vs_process_message(const
 			}
 		}
 
-		if (!(flags & IP_VS_CONN_F_TEMPLATE))
-			cp = ip_vs_conn_in_get(AF_INET, s->protocol,
-					       (union nf_inet_addr *)&s->caddr,
-					       s->cport,
-					       (union nf_inet_addr *)&s->vaddr,
-					       s->vport);
-		else
-			cp = ip_vs_ct_in_get(AF_INET, s->protocol,
-					     (union nf_inet_addr *)&s->caddr,
-					     s->cport,
-					     (union nf_inet_addr *)&s->vaddr,
-					     s->vport);
+		{
+			ip_vs_conn_fill_param(AF_INET, s->protocol,
+					      (union nf_inet_addr *)&s->caddr,
+					      s->cport,
+					      (union nf_inet_addr *)&s->vaddr,
+					      s->vport, &param);
+			if (!(flags & IP_VS_CONN_F_TEMPLATE))
+				cp = ip_vs_conn_in_get(&param);
+			else
+				cp = ip_vs_ct_in_get(&param);
+		}
 		if (!cp) {
 			/*
 			 * Find the appropriate destination for the connection.
@@ -406,14 +406,9 @@ static void ip_vs_process_message(const
 				else
 					flags &= ~IP_VS_CONN_F_INACTIVE;
 			}
-			cp = ip_vs_conn_new(AF_INET, s->protocol,
-					    (union nf_inet_addr *)&s->caddr,
-					    s->cport,
-					    (union nf_inet_addr *)&s->vaddr,
-					    s->vport,
+			cp = ip_vs_conn_new(&param,
 					    (union nf_inet_addr *)&s->daddr,
-					    s->dport,
-					    flags, dest);
+					    s->dport, flags, dest);
 			if (dest)
 				atomic_dec(&dest->refcnt);
 			if (!cp) {


^ permalink raw reply

* [patch v2 03/12] [PATCH 03/12] IPVS: compact ip_vs_sched_persist()
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0003-IPVS-compact-ip_vs_sched_persist.patch --]
[-- Type: text/plain, Size: 5731 bytes --]

Compact ip_vs_sched_persist() by setting up parameters
and calling functions once.

Signed-off-by: Simon Horman <horms@verge.net.au>
---

v2
* Make "union nf_inet_addr fwmark" const
* Don't remove the comment next to the declaration of dport
* Add a comment to the declaration of vport

Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 21:56:39.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_core.c	2010-10-01 22:02:41.000000000 +0900
@@ -193,10 +193,14 @@ ip_vs_sched_persist(struct ip_vs_service
 	struct ip_vs_iphdr iph;
 	struct ip_vs_dest *dest;
 	struct ip_vs_conn *ct;
-	__be16  dport;			/* destination port to forward */
+	int protocol = iph.protocol;
+	__be16 dport = 0;		/* destination port to forward */
+	__be16 vport = 0;		/* virtual service port */
 	unsigned int flags;
 	union nf_inet_addr snet;	/* source network of the client,
 					   after masking */
+	const union nf_inet_addr fwmark = { .ip = htonl(svc->fwmark) };
+	const union nf_inet_addr *vaddr = &iph.daddr;
 
 	ip_vs_fill_iphdr(svc->af, skb_network_header(skb), &iph);
 
@@ -227,119 +231,58 @@ ip_vs_sched_persist(struct ip_vs_service
 	 * service, and a template like <caddr, 0, vaddr, vport, daddr, dport>
 	 * is created for other persistent services.
 	 */
-	if (ports[1] == svc->port) {
-		/* Check if a template already exists */
-		if (svc->port != FTPPORT)
-			ct = ip_vs_ct_in_get(svc->af, iph.protocol, &snet, 0,
-					     &iph.daddr, ports[1]);
-		else
-			ct = ip_vs_ct_in_get(svc->af, iph.protocol, &snet, 0,
-					     &iph.daddr, 0);
-
-		if (!ct || !ip_vs_check_template(ct)) {
-			/*
-			 * No template found or the dest of the connection
-			 * template is not available.
-			 */
-			dest = svc->scheduler->schedule(svc, skb);
-			if (dest == NULL) {
-				IP_VS_DBG(1, "p-schedule: no dest found.\n");
-				return NULL;
-			}
-
-			/*
-			 * Create a template like <protocol,caddr,0,
-			 * vaddr,vport,daddr,dport> for non-ftp service,
-			 * and <protocol,caddr,0,vaddr,0,daddr,0>
-			 * for ftp service.
+	{
+		if (ports[1] == svc->port) {
+			/* non-FTP template:
+			 * <protocol, caddr, 0, vaddr, vport, daddr, dport>
+			 * FTP template:
+			 * <protocol, caddr, 0, vaddr, 0, daddr, 0>
 			 */
 			if (svc->port != FTPPORT)
-				ct = ip_vs_conn_new(svc->af, iph.protocol,
-						    &snet, 0,
-						    &iph.daddr,
-						    ports[1],
-						    &dest->addr, dest->port,
-						    IP_VS_CONN_F_TEMPLATE,
-						    dest);
-			else
-				ct = ip_vs_conn_new(svc->af, iph.protocol,
-						    &snet, 0,
-						    &iph.daddr, 0,
-						    &dest->addr, 0,
-						    IP_VS_CONN_F_TEMPLATE,
-						    dest);
-			if (ct == NULL)
-				return NULL;
-
-			ct->timeout = svc->timeout;
+				vport = ports[1];
 		} else {
-			/* set destination with the found template */
-			dest = ct->dest;
-		}
-		dport = dest->port;
-	} else {
-		/*
-		 * Note: persistent fwmark-based services and persistent
-		 * port zero service are handled here.
-		 * fwmark template: <IPPROTO_IP,caddr,0,fwmark,0,daddr,0>
-		 * port zero template: <protocol,caddr,0,vaddr,0,daddr,0>
-		 */
-		if (svc->fwmark) {
-			union nf_inet_addr fwmark = {
-				.ip = htonl(svc->fwmark)
-			};
-
-			ct = ip_vs_ct_in_get(svc->af, IPPROTO_IP, &snet, 0,
-					     &fwmark, 0);
-		} else
-			ct = ip_vs_ct_in_get(svc->af, iph.protocol, &snet, 0,
-					     &iph.daddr, 0);
-
-		if (!ct || !ip_vs_check_template(ct)) {
-			/*
-			 * If it is not persistent port zero, return NULL,
-			 * otherwise create a connection template.
+			/* Note: persistent fwmark-based services and
+			 * persistent port zero service are handled here.
+			 * fwmark template:
+			 * <IPPROTO_IP,caddr,0,fwmark,0,daddr,0>
+			 * port zero template:
+			 * <protocol,caddr,0,vaddr,0,daddr,0>
 			 */
-			if (svc->port)
-				return NULL;
-
-			dest = svc->scheduler->schedule(svc, skb);
-			if (dest == NULL) {
-				IP_VS_DBG(1, "p-schedule: no dest found.\n");
-				return NULL;
+			if (svc->fwmark) {
+				protocol = IPPROTO_IP;
+				vaddr = &fwmark;
 			}
+		}
+	}
 
-			/*
-			 * Create a template according to the service
-			 */
-			if (svc->fwmark) {
-				union nf_inet_addr fwmark = {
-					.ip = htonl(svc->fwmark)
-				};
-
-				ct = ip_vs_conn_new(svc->af, IPPROTO_IP,
-						    &snet, 0,
-						    &fwmark, 0,
-						    &dest->addr, 0,
-						    IP_VS_CONN_F_TEMPLATE,
-						    dest);
-			} else
-				ct = ip_vs_conn_new(svc->af, iph.protocol,
-						    &snet, 0,
-						    &iph.daddr, 0,
-						    &dest->addr, 0,
-						    IP_VS_CONN_F_TEMPLATE,
-						    dest);
-			if (ct == NULL)
-				return NULL;
+	/* Check if a template already exists */
+	ct = ip_vs_ct_in_get(svc->af, protocol, &snet, 0, vaddr, vport);
 
-			ct->timeout = svc->timeout;
-		} else {
-			/* set destination with the found template */
-			dest = ct->dest;
+	if (!ct || !ip_vs_check_template(ct)) {
+		/* No template found or the dest of the connection
+		 * template is not available.
+		 */
+		dest = svc->scheduler->schedule(svc, skb);
+		if (!dest) {
+			IP_VS_DBG(1, "p-schedule: no dest found.\n");
+			return NULL;
 		}
-		dport = ports[1];
-	}
+
+		if (ports[1] == svc->port && svc->port != FTPPORT)
+			dport = dest->port;
+
+		/* Create a template */
+		ct = ip_vs_conn_new(svc->af, protocol, &snet, 0,vaddr, vport,
+				    &dest->addr, dport,
+				    IP_VS_CONN_F_TEMPLATE, dest);
+		if (ct == NULL)
+			return NULL;
+
+		ct->timeout = svc->timeout;
+	} else
+		/* set destination with the found template */
+		dest = ct->dest;
+	dport = dest->port;
 
 	flags = (svc->flags & IP_VS_SVC_F_ONEPACKET
 		 && iph.protocol == IPPROTO_UDP)?


^ permalink raw reply

* [patch v2 01/12] [PATCH 01/12] netfilter: nf_conntrack_sip: Allow ct_sip_get_header() to be called with a null ct argument
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy
In-Reply-To: <20101001143517.645421976@akiko.akashicho.tokyo.vergenet.net>

[-- Attachment #1: 0001-netfilter-nf_conntrack_sip-Allow-ct_sip_get_header-t.patch --]
[-- Type: text/plain, Size: 476 bytes --]

Signed-off-by: Simon Horman <horms@verge.net.au>

diff --git a/net/netfilter/nf_conntrack_sip.c b/net/netfilter/nf_conntrack_sip.c
index 53d8922..2fd1ea2 100644
--- a/net/netfilter/nf_conntrack_sip.c
+++ b/net/netfilter/nf_conntrack_sip.c
@@ -152,6 +152,9 @@ static int parse_addr(const struct nf_conn *ct, const char *cp,
 	const char *end;
 	int ret = 0;
 
+	if (!ct)
+		return 0;
+
 	memset(addr, 0, sizeof(*addr));
 	switch (nf_ct_l3num(ct)) {
 	case AF_INET:
-- 
1.7.1



^ permalink raw reply related

* [patch v2 00/12] IPVS: SIP Persistence Engine
From: Simon Horman @ 2010-10-01 14:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter, netfilter-devel
  Cc: Jan Engelhardt, Stephen Hemminger, Wensong Zhang,
	Julian Anastasov, Patrick McHardy

This patch series adds load-balancing of UDP SIP based on Call-ID to
IPVS as well as a framework for extending IPVS to handle alternate
persistence requirements.

REVISIONS

This is v2 of this patch series, which fixes several problems,
including non-atomic allocations while running atomic and a memory leak.

v1 of this series addressed a few minor problems.

Internally there were 4 rfc versions, 0.1, 0.2, 0.3 and 0.4.

All changes are noted on a per-patch basis.

OVERVIEW

The approach that I have taken is what I call persistence engines.
The basic idea is that you can provide a module to LVS that alters
the way that it handles connection templates, which are at the core
of persistence. In particular, an additional key can be added, and
any of the normal IP address, port and protocol information can either
be used or ignored.

In the case of the SIP persistence engine, the only persistence engine so
far, all the keys used by the default persistence behaviour are used and
the callid is added as an extra key. I originally intended to ignore the
client address (cip); instead, this can optionally be done by setting the
persistence mask (-M) to 0.0.0.0, which keeps the flexibility of other
mask values.
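
To make the key handling above a little more concrete, here is a minimal
userspace sketch. It is not the code from this series: struct tmpl_key,
tmpl_mask_caddr() and tmpl_key_match() are hypothetical names, and addresses
are plain host-order integers for simplicity. It only illustrates the idea
that the usual keys are compared first and the optional "pe data" is
compared as one extra key.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified template key; not the kernel's ip_vs_conn. */
struct tmpl_key {
	uint8_t protocol;
	uint32_t caddr;		/* client address after masking */
	uint32_t vaddr;		/* virtual service address */
	uint16_t vport;		/* virtual service port */
	const char *pe_data;	/* extra key from a persistence engine, or NULL */
	size_t pe_data_len;
};

/* Apply the persistence mask (-M); a mask of 0 ignores the client address. */
static uint32_t tmpl_mask_caddr(uint32_t caddr, uint32_t netmask)
{
	return caddr & netmask;
}

/* Compare the default keys first, then the optional extra key. */
static bool tmpl_key_match(const struct tmpl_key *a, const struct tmpl_key *b)
{
	if (a->protocol != b->protocol || a->caddr != b->caddr ||
	    a->vaddr != b->vaddr || a->vport != b->vport)
		return false;
	if (!a->pe_data && !b->pe_data)
		return true;	/* plain persistence: no extra key on either side */
	if (!a->pe_data || !b->pe_data)
		return false;	/* only one side carries an extra key */
	return a->pe_data_len == b->pe_data_len &&
	       memcmp(a->pe_data, b->pe_data, a->pe_data_len) == 0;
}
```

With a mask of 0.0.0.0 two different clients collapse to the same caddr, so
in this sketch only the callid separates their templates.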

It is envisaged that the SIP persistence engine will be used in conjunction
with one-packet scheduling. I'm interested to hear if that doesn't fit your
needs.


CONFIGURATION

A persistence engine is associated with a virtual service
(as are schedulers). I have added the --pe option to the
ipvsadm -A and -E commands to allow the persistence engine
of a virtual service to be added, changed, or deleted.

e.g. ipvsadm -A -u 10.4.3.192:5060 -p 60 -M 0.0.0.0 -o --pe sip

There are no other configuration parameters at this time.


RUNNING

When a connection template is created, if its virtual service
has a persistence engine, then the persistence engine can add
an extra key to the connection template. For the SIP module this
is the callid. More generically, it is known as "pe data". And
both the name of the persistence engine, "pe name", and "pe data"
can be viewed in /proc/net/ip_vs_conn and by passing the
--persistent-conn option to ipvsadm -Lc.

e.g.
# ipvsadm -Lcn --persistent-conn
UDP 00:38  UDP         10.4.3.0:0         10.4.3.192:5060    127.0.0.1:5060 sip 193373839

Here we see a single persistence template (cport is 0), which has been
handled by the sip persistence engine. The pe data (callid) is 193373839.

In the case where the persistence engine can't match a packet for some
reason, the connection will fall back to the normal persistence handling.
This seems reasonable, since if the packet ought to be dropped, iptables
could be used.
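
The fallback described above could be sketched roughly as follows. Again
this is a userspace illustration under my own assumptions, not the series'
code: struct lookup_param, pe_fill_fn, toy_sip_fill() and
fill_lookup_param() are hypothetical, and the toy extractor only handles a
literal "Call-ID: " header in a NUL-terminated payload (real SIP parsing is
case-insensitive and has a compact header form).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup parameters; pe_data stays NULL for plain persistence. */
struct lookup_param {
	const char *pe_data;
	size_t pe_data_len;
};

/* Hypothetical engine hook: returns 0 on success, non-zero if no key found. */
typedef int (*pe_fill_fn)(const char *payload, struct lookup_param *p);

/* Toy extractor: take the text after "Call-ID: " up to the end of the line. */
static int toy_sip_fill(const char *payload, struct lookup_param *p)
{
	static const char hdr[] = "Call-ID: ";
	const char *start = strstr(payload, hdr);

	if (!start)
		return -1;
	start += sizeof(hdr) - 1;
	p->pe_data = start;
	p->pe_data_len = strcspn(start, "\r\n");
	return p->pe_data_len ? 0 : -1;
}

/* Let the engine add its key; on failure fall back to the plain keys. */
static void fill_lookup_param(pe_fill_fn pe, const char *payload,
			      struct lookup_param *p)
{
	p->pe_data = NULL;
	p->pe_data_len = 0;
	if (pe && pe(payload, p) != 0) {
		/* Engine could not match: behave as if there were no engine. */
		p->pe_data = NULL;
		p->pe_data_len = 0;
	}
}
```

The point of the sketch is only that a failed match leaves the lookup
parameters in the same state as a service with no persistence engine.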

A limited amount of debugging information has been added which
can be enabled using a value of 9 or greater in
/proc/sys/net/ipv4/vs/debug_level

CODE AVAILABILITY

The kernel patches (12) are available in git as the pe-2 branch of
git://git.kernel.org/pub/scm/linux/kernel/git/horms/lvs-test-2.6.git

The ipvsadm patches (2) are available in git as the pe-2 branch of
git://github.com/horms/ipvsadm-test.git

I will post the ipvsadm patches separately.


^ permalink raw reply

* Re: Packet time delays on multi-core systems
From: Alexey Vlasov @ 2010-10-01 14:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Kernel Mailing List, netdev
In-Reply-To: <1285937966.2641.117.camel@edumazet-laptop>

On Fri, Oct 01, 2010 at 02:59:26PM +0200, Eric Dumazet wrote:
> On Friday, 1 October 2010 at 14:16 +0400, Alexey Vlasov wrote:
> 
> > I have also found that:
> > 1. rx overruns is increasing.
> > 2. rx_queue_drop_packet_count is increasing.
> 
> So you flood machine with packets, its not an idle one ?
> 
> I thought you were doing experiments with light trafic.

No, it's an ordinary production server for shared hosting. There are about
1000 client websites on it, and I'm not flooding it deliberately. Frankly
speaking, I don't see any suspicious network activity.
 
> > # ethtool -S eth0 | grep drop
> >      tx_dropped: 0
> >      rx_queue_drop_packet_count: 1260743751
> >      dropped_smbus: 0
> >      rx_queue_0_drops: 0
> >      rx_queue_1_drops: 0
> >      rx_queue_2_drops: 0
> >      rx_queue_3_drops: 0
> > 
> 
> 
> ethtool -S eth0   (full output, not small parts)

NIC statistics:
     rx_packets: 2973717440
     tx_packets: 3032670910
     rx_bytes: 1892633650741
     tx_bytes: 2536130682695
     rx_broadcast: 118773199
     tx_broadcast: 68013
     rx_multicast: 95257
     tx_multicast: 0
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 95257
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 7939
     rx_queue_drop_packet_count: 1324025520
     rx_missed_errors: 146631
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 50715
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 344724062
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 1892633650741
     rx_csum_offload_good: 2973697420
     rx_csum_offload_errors: 6235
     tx_dma_out_of_sync: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 9327
     rx_smbus: 118531661
     dropped_smbus: 0
     tx_queue_0_packets: 797617475
     tx_queue_0_bytes: 630191908685
     tx_queue_1_packets: 719681297
     tx_queue_1_bytes: 625907304846
     tx_queue_2_packets: 718841556
     tx_queue_2_bytes: 620522418855
     tx_queue_3_packets: 796521255
     tx_queue_3_bytes: 646196024585
     rx_queue_0_packets: 788885797
     rx_queue_0_bytes: 458936338699
     rx_queue_0_drops: 0
     rx_queue_1_packets: 701354604
     rx_queue_1_bytes: 457490536453
     rx_queue_1_drops: 0
     rx_queue_2_packets: 791887663
     rx_queue_2_bytes: 534425333616
     rx_queue_2_drops: 0
     rx_queue_3_packets: 691579028
     rx_queue_3_bytes: 429887244557
     rx_queue_3_drops: 0

> > 3. By sending SYN-packets by hping, RST packet doesn't send, but I don't know may 
> > be it is just some feature in 2.6.32.
> > newbox # hping -c 1 -S -p 80 111.111.111.111
> > HPING 111.111.111.111 (eth0 111.111.111.111): S set, 40 headers + 0 data bytes
> > len=46 ip=111.111.111.111 ttl=58 DF id=11471 sport=80 flags=SA seq=0 win=65535 rtt=99.0 ms
> > 
> > --- 111.111.111.111 hping statistic ---
> > 1 packets tramitted, 1 packets received, 0% packet loss
> > round-trip min/avg/max = 99.0/99.0/99.0 ms
> > 
> > 13:59:07.439528 IP newbox.2777 > 111.111.111.111.80: S 345595033:345595033(0) win 512
> > 13:59:07.439626 IP 111.111.111.111.80 > newbox.2777: S 1178827395:1178827395(0) ack 345595034 win 65535 <mss 1460>
> > 13:59:10.439368 IP 111.111.111.111.80 > newbox.2777: S 1178827395:1178827395(0) ack 345595034 win 65535 <mss 1460>
> > 13:59:16.439313 IP 111.111.111.111.80 > newbox.2777: S 1178827395:1178827395(0) ack 345595034 win 65535 <mss 1460>
> > 13:59:28.439206 IP 111.111.111.111.80 > newbox.2777: S 1178827395:1178827395(0) ack 345595034 win 65535 <mss 1460>
> > 
> > As a result I got doubles:
> 
> Are you playing with trafic shaping ?
> 
> tc -s -d qdisc
 
No, nothing like that; no shapers.

# tc -s -d qdisc
bash: tc: command not found
 
> > DUP! len=46 ip=111.111.111.111 ttl=58 DF id=27454 sport=80 flags=SA seq=0 win=65535 rtt=3137.8 ms
> > 
> > Example of another TCP-session from 2.6.28 kernel:
> > oldbox # hping -c 1 -S -p 80 111.111.111.111
> > HPING 111.111.111.111 (eth0 111.111.111.111): S set, 40 headers + 0 data bytes
> > len=46 ip=111.111.111.111 ttl=58 DF id=53180 sport=80 flags=SA seq=0 win=65535 rtt=2.9 ms
> > 
> > --- 111.111.111.111 hping statistic ---
> > 1 packets tramitted, 1 packets received, 0% packet loss
> > round-trip min/avg/max = 2.9/2.9/2.9 ms
> > 
> > 14:01:45.225136 IP oldbox.2776 > 111.111.111.111.80: S 1983626200:1983626200(0) win 512
> > 14:01:45.225288 IP 111.111.111.111.80 > oldbox.2776: S 3796385036:3796385036(0) ack 1983626201 win 65535 <mss 1460>
> > 14:01:45.227990 IP oldbox.2776 > 111.111.111.111.80: R 1983626201:1983626201(0) win 0
> > 

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply

* [PATCH net-next V2] net: dynamic ingress_queue allocation
From: Eric Dumazet @ 2010-10-01 13:56 UTC (permalink / raw)
  To: hadi; +Cc: Jarek Poplawski, David Miller, netdev
In-Reply-To: <1285933506.3553.176.camel@bigi>

On Friday, 1 October 2010 at 07:45 -0400, jamal wrote:
> On Fri, 2010-10-01 at 00:58 +0200, Eric Dumazet wrote:
> > Hi Jamal
> > 
> > Here is the dynamic allocation I promised. I lightly tested it, could
> > you review it please ?
> > Thanks !
> > 
> > [PATCH net-next2.6] net: dynamic ingress_queue allocation
> > 
> > ingress being not used very much, and net_device->ingress_queue being
> > quite a big object (128 or 256 bytes), use a dynamic allocation if
> > needed (tc qdisc add dev eth0 ingress ...)
> 
> I agree with the principle that it is valuable in making it dynamic for
> people who dont want it; but, but (like my kid would say, sniff, sniff)
> you are making me sad saying it is not used very much ;-> It is used
> very much in my world. My friend Jarek uses it;->
> 

;)

> 
> > +#ifdef CONFIG_NET_CLS_ACT
> 
> I think appropriately this should be NET_SCH_INGRESS (everywhere else as
> well).
> 

I first thought of this, but found it would add a new dependency on
vmlinux:

If someone wants to add the NET_SCH_INGRESS module, he would need to
recompile the whole kernel and reboot.

This is probably why ing_filter() and handle_ing() are enclosed in
CONFIG_NET_CLS_ACT, not CONFIG_NET_SCH_INGRESS.

Since struct net_device only holds one pointer (after this patch), I
believe it's better to use the same dependency.
> 
> > +static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
> > +{
> > +#ifdef CONFIG_NET_CLS_ACT
> > +	return dev->ingress_queue;
> > +#else
> > +	return NULL;
> > +#endif
> 
> Above, if you just returned dev->ingress_queue wouldnt it always be 
> NULL if it was not allocated?
> 

ingress_queue is not defined in "struct net_device" if
!CONFIG_NET_CLS_ACT.

Returning NULL here permits dead code elimination by the compiler.

Then again, probably nobody unsets CONFIG_NET_CLS_ACT, so we can probably
avoid this preprocessor stuff.
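
For readers following along, a standalone sketch of the accessor pattern
discussed here. The names are stand-ins I made up (HAVE_INGRESS plays the
role of CONFIG_NET_CLS_ACT, struct device the role of struct net_device);
it is not the code from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct queue {
	int qdisc_id;
};

struct device {
#ifdef HAVE_INGRESS		/* plays the role of CONFIG_NET_CLS_ACT */
	struct queue *ingress_queue;
#endif
	int ifindex;
};

static inline struct queue *dev_ingress_queue_sketch(struct device *dev)
{
#ifdef HAVE_INGRESS
	return dev->ingress_queue;
#else
	(void)dev;
	return NULL;	/* the field does not exist in this configuration */
#endif
}

static int ingress_active(struct device *dev)
{
	struct queue *q = dev_ingress_queue_sketch(dev);

	/*
	 * Without HAVE_INGRESS, q is the constant NULL after inlining, so
	 * an optimizing compiler can drop everything after this test.
	 */
	if (!q)
		return 0;
	return q->qdisc_id != 0;
}
```

Callers never touch the field directly, so they compile in both
configurations without their own #ifdef.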

> 
> > @@ -2737,7 +2734,9 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
> >  					 struct packet_type **pt_prev,
> >  					 int *ret, struct net_device *orig_dev)
> >  {
> > -	if (skb->dev->ingress_queue.qdisc == &noop_qdisc)
> > +	struct netdev_queue *rxq = dev_ingress_queue(skb->dev);
> > +
> > +	if (!rxq || rxq->qdisc == &noop_qdisc)
> >  		goto out;
> 
> I stared at above a little longer since this is the only fast path
> affected; is it a few more cycles now for people who love ingress?

I see, this adds an indirection and a conditional branch, but this
should be in the cpu cache and well predicted.

I thought about adding a fake "struct netdev_queue" object, with a qdisc
pointing to noop_qdisc. But this would slow down non-ingress users
a bit ;)

For people not using ingress, my solution removes an access to an extra
cache line. So network latency is improved a bit when cpu caches are
full of userland data.
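
A rough userspace sketch of the trade-off, under my own assumptions:
dev_ingress_queue_create_sketch(), ingress_needs_work() and
noop_qdisc_sketch are hypothetical names, and calloc() stands in for the
kernel allocator. The queue is allocated on first use, and the fast path
pays one pointer load plus a NULL test.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for struct netdev_queue and struct net_device. */
struct queue {
	const void *qdisc;
};

struct device {
	struct queue *ingress_queue;	/* NULL until ingress is first used */
};

static const int noop_qdisc_sketch;	/* stand-in for noop_qdisc */

/* Slow path: allocate the queue the first time an ingress qdisc is added. */
static struct queue *dev_ingress_queue_create_sketch(struct device *dev)
{
	struct queue *q = dev->ingress_queue;

	if (q)
		return q;
	q = calloc(1, sizeof(*q));
	if (!q)
		return NULL;
	q->qdisc = &noop_qdisc_sketch;
	dev->ingress_queue = q;
	return q;
}

/* Fast path: devices without ingress never dereference the queue. */
static int ingress_needs_work(const struct device *dev)
{
	const struct queue *q = dev->ingress_queue;

	return q && q->qdisc != &noop_qdisc_sketch;
}
```

For a device that never uses ingress, the fast path reads only the pointer
inside the device structure itself, which is the "one less cache line"
argument.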

> 
> > @@ -690,6 +693,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> >  		    (new && new->flags & TCQ_F_INGRESS)) {
> >  			num_q = 1;
> >  			ingress = 1;
> > +			if (!dev_ingress_queue(dev))
> > +				return -ENOENT;
> >  		}
> >  
> 
> The above looks clever but worries me because it changes the old flow.
> If you have time,  the following tests will alleviate my fears
> 
> 1) compile support for ingress and add/delete ingress qdisc

This worked for me, but I don't know about complex setups.

> 2) Dont compile support and add/delete ingress qdisc

tc gives an error (a bit like !CONFIG_NET_SCH_INGRESS)

# tc qdisc add dev eth0 ingress
RTNETLINK answers: No such file or directory
# tc -s -d qdisc show dev eth0
qdisc mq 0: root 
 Sent 636 bytes 10 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
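
The text tc prints is the errno string for the -ENOENT the patch
returns when there is no ingress queue. A tiny sketch, assuming the
userspace side reports the errno string unchanged; rtnetlink_answer() is
a made-up helper, not part of tc:

```c
#include <errno.h>
#include <string.h>

/* Kernel-style handlers return negative errno values, e.g. -ENOENT. */
static const char *rtnetlink_answer(int kernel_ret)
{
	return strerror(-kernel_ret);
}
```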


> 3) Compile ingress as a module and add/delete ingress qdisc
> 
> 

Seems to work like 1)

> Other than that excellent work Eric. And you can add my
> Acked/reviewed-by etc.
> 
> BTW, did i say i like your per-cpu stats stuff? It applies nicely to
> qdiscs, actions etc ;->

I took a look at ifb as suggested by Stephen but could not see trivial
changes (LLTX or per-cpu stats), since a central lock is needed, I am
afraid. And qdiscs are the same: stats updates are mostly free as we
already dirtied the cache line for the lock.

Thanks Jamal !

Here is the V2, with two #ifdef removed.
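
The create-on-demand step in this version can be sketched in userspace
as follows. It is only an illustration: struct queue and struct device
are made-up stand-ins, calloc()/free() replace kzalloc()/kfree(), and
the write barrier the patch issues before publishing the pointer is only
noted in a comment.

```c
#include <stdlib.h>

struct device;

/* Made-up userspace stand-ins for the kernel structures. */
struct queue {
	const void *qdisc;
	const void *qdisc_sleeping;
	struct device *dev;
};

struct device {
	struct queue *ingress_queue;	/* NULL until ingress is requested */
};

static const char noop_qdisc = 0;	/* placeholder for noop_qdisc */

/* Return the existing queue, or allocate and publish one on first use. */
static struct queue *ingress_queue_create(struct device *dev)
{
	struct queue *q = dev->ingress_queue;

	if (q)
		return q;
	q = calloc(1, sizeof(*q));	/* kzalloc(..., GFP_KERNEL) in the patch */
	if (!q)
		return NULL;
	q->dev = dev;
	q->qdisc = &noop_qdisc;
	q->qdisc_sleeping = &noop_qdisc;
	/* The patch calls smp_wmb() here, before the pointer becomes visible. */
	dev->ingress_queue = q;
	return q;
}

/* Mirrors the kfree() added to free_netdev(); free(NULL) is a no-op. */
static void device_free_queue(struct device *dev)
{
	free(dev->ingress_queue);
	dev->ingress_queue = NULL;
}
```

A second call returns the same queue, so the allocation happens at most
once per device.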


[PATCH net-next V2] net: dynamic ingress_queue allocation

Since ingress is not used very much and net_device->ingress_queue is
quite a big object (128 or 256 bytes), allocate it dynamically, only
when needed (tc qdisc add dev eth0 ingress ...)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |   11 ++++++--
 net/core/dev.c            |   48 +++++++++++++++++++++++++++---------
 net/sched/sch_api.c       |   40 ++++++++++++++++++++----------
 net/sched/sch_generic.c   |   36 +++++++++++++++------------
 4 files changed, 92 insertions(+), 43 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ceed347..4f86009 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -986,8 +986,7 @@ struct net_device {
 	rx_handler_func_t	*rx_handler;
 	void			*rx_handler_data;
 
-	struct netdev_queue	ingress_queue; /* use two cache lines */
-
+	struct netdev_queue	*ingress_queue;
 /*
  * Cache lines mostly used on transmit path
  */
@@ -1115,6 +1114,14 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
 		f(dev, &dev->_tx[i], arg);
 }
 
+
+static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
+{
+	return dev->ingress_queue;
+}
+
+extern struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
+
 /*
  * Net namespace inlines
  */
diff --git a/net/core/dev.c b/net/core/dev.c
index a313bab..e3bb8c9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2702,11 +2702,10 @@ EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
  * the ingress scheduler, you just cant add policies on ingress.
  *
  */
-static int ing_filter(struct sk_buff *skb)
+static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
 {
 	struct net_device *dev = skb->dev;
 	u32 ttl = G_TC_RTTL(skb->tc_verd);
-	struct netdev_queue *rxq;
 	int result = TC_ACT_OK;
 	struct Qdisc *q;
 
@@ -2720,8 +2719,6 @@ static int ing_filter(struct sk_buff *skb)
 	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
 	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
 
-	rxq = &dev->ingress_queue;
-
 	q = rxq->qdisc;
 	if (q != &noop_qdisc) {
 		spin_lock(qdisc_lock(q));
@@ -2737,7 +2734,9 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 					 struct packet_type **pt_prev,
 					 int *ret, struct net_device *orig_dev)
 {
-	if (skb->dev->ingress_queue.qdisc == &noop_qdisc)
+	struct netdev_queue *rxq = dev_ingress_queue(skb->dev);
+
+	if (!rxq || rxq->qdisc == &noop_qdisc)
 		goto out;
 
 	if (*pt_prev) {
@@ -2745,7 +2744,7 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 		*pt_prev = NULL;
 	}
 
-	switch (ing_filter(skb)) {
+	switch (ing_filter(skb, rxq)) {
 	case TC_ACT_SHOT:
 	case TC_ACT_STOLEN:
 		kfree_skb(skb);
@@ -4932,15 +4931,17 @@ static void __netdev_init_queue_locks_one(struct net_device *dev,
 					  struct netdev_queue *dev_queue,
 					  void *_unused)
 {
-	spin_lock_init(&dev_queue->_xmit_lock);
-	netdev_set_xmit_lockdep_class(&dev_queue->_xmit_lock, dev->type);
-	dev_queue->xmit_lock_owner = -1;
+	if (dev_queue) {
+		spin_lock_init(&dev_queue->_xmit_lock);
+		netdev_set_xmit_lockdep_class(&dev_queue->_xmit_lock, dev->type);
+		dev_queue->xmit_lock_owner = -1;
+	}
 }
 
 static void netdev_init_queue_locks(struct net_device *dev)
 {
 	netdev_for_each_tx_queue(dev, __netdev_init_queue_locks_one, NULL);
-	__netdev_init_queue_locks_one(dev, &dev->ingress_queue, NULL);
+	__netdev_init_queue_locks_one(dev, dev_ingress_queue(dev), NULL);
 }
 
 unsigned long netdev_fix_features(unsigned long features, const char *name)
@@ -5447,16 +5448,37 @@ static void netdev_init_one_queue(struct net_device *dev,
 				  struct netdev_queue *queue,
 				  void *_unused)
 {
-	queue->dev = dev;
+	if (queue)
+		queue->dev = dev;
 }
 
 static void netdev_init_queues(struct net_device *dev)
 {
-	netdev_init_one_queue(dev, &dev->ingress_queue, NULL);
+	netdev_init_one_queue(dev, dev_ingress_queue(dev), NULL);
 	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
 	spin_lock_init(&dev->tx_global_lock);
 }
 
+struct netdev_queue *dev_ingress_queue_create(struct net_device *dev)
+{
+	struct netdev_queue *queue = dev_ingress_queue(dev);
+
+#ifdef CONFIG_NET_CLS_ACT
+	if (queue)
+		return queue;
+	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+	if (!queue)
+		return NULL;
+	netdev_init_one_queue(dev, queue, NULL);
+	__netdev_init_queue_locks_one(dev, queue, NULL);
+	queue->qdisc = &noop_qdisc;
+	queue->qdisc_sleeping = &noop_qdisc;
+	smp_wmb();
+	dev->ingress_queue = queue;
+#endif
+	return queue;
+}
+
 /**
  *	alloc_netdev_mq - allocate network device
  *	@sizeof_priv:	size of private data to allocate space for
@@ -5559,6 +5581,8 @@ void free_netdev(struct net_device *dev)
 
 	kfree(dev->_tx);
 
+	kfree(dev_ingress_queue(dev));
+
 	/* Flush device addresses */
 	dev_addr_flush(dev);
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b802078..8635110 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -240,7 +240,10 @@ struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
 	if (q)
 		goto out;
 
-	q = qdisc_match_from_root(dev->ingress_queue.qdisc_sleeping, handle);
+	if (!dev_ingress_queue(dev))
+		goto out;
+	q = qdisc_match_from_root(dev_ingress_queue(dev)->qdisc_sleeping,
+				  handle);
 out:
 	return q;
 }
@@ -690,6 +693,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			num_q = 1;
 			ingress = 1;
+			if (!dev_ingress_queue(dev))
+				return -ENOENT;
 		}
 
 		if (dev->flags & IFF_UP)
@@ -701,7 +706,7 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		}
 
 		for (i = 0; i < num_q; i++) {
-			struct netdev_queue *dev_queue = &dev->ingress_queue;
+			struct netdev_queue *dev_queue = dev_ingress_queue(dev);
 
 			if (!ingress)
 				dev_queue = netdev_get_tx_queue(dev, i);
@@ -979,7 +984,8 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
 					return -ENOENT;
 				q = qdisc_leaf(p, clid);
 			} else { /* ingress */
-				q = dev->ingress_queue.qdisc_sleeping;
+				if (dev_ingress_queue(dev))
+					q = dev_ingress_queue(dev)->qdisc_sleeping;
 			}
 		} else {
 			q = dev->qdisc;
@@ -1044,7 +1050,8 @@ replay:
 					return -ENOENT;
 				q = qdisc_leaf(p, clid);
 			} else { /*ingress */
-				q = dev->ingress_queue.qdisc_sleeping;
+				if (dev_ingress_queue_create(dev))
+					q = dev_ingress_queue(dev)->qdisc_sleeping;
 			}
 		} else {
 			q = dev->qdisc;
@@ -1123,11 +1130,14 @@ replay:
 create_n_graft:
 	if (!(n->nlmsg_flags&NLM_F_CREATE))
 		return -ENOENT;
-	if (clid == TC_H_INGRESS)
-		q = qdisc_create(dev, &dev->ingress_queue, p,
-				 tcm->tcm_parent, tcm->tcm_parent,
-				 tca, &err);
-	else {
+	if (clid == TC_H_INGRESS) {
+		if (dev_ingress_queue(dev))
+			q = qdisc_create(dev, dev_ingress_queue(dev), p,
+					 tcm->tcm_parent, tcm->tcm_parent,
+					 tca, &err);
+		else
+			err = -ENOENT;
+	} else {
 		struct netdev_queue *dev_queue;
 
 		if (p && p->ops->cl_ops && p->ops->cl_ops->select_queue)
@@ -1304,8 +1314,10 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct netlink_callback *cb)
 		if (tc_dump_qdisc_root(dev->qdisc, skb, cb, &q_idx, s_q_idx) < 0)
 			goto done;
 
-		dev_queue = &dev->ingress_queue;
-		if (tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb, &q_idx, s_q_idx) < 0)
+		dev_queue = dev_ingress_queue(dev);
+		if (dev_queue &&
+		    tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb,
+				       &q_idx, s_q_idx) < 0)
 			goto done;
 
 cont:
@@ -1595,8 +1607,10 @@ static int tc_dump_tclass(struct sk_buff *skb, struct netlink_callback *cb)
 	if (tc_dump_tclass_root(dev->qdisc, skb, tcm, cb, &t, s_t) < 0)
 		goto done;
 
-	dev_queue = &dev->ingress_queue;
-	if (tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb, &t, s_t) < 0)
+	dev_queue = dev_ingress_queue(dev);
+	if (dev_queue &&
+	    tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb,
+				&t, s_t) < 0)
 		goto done;
 
 done:
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 545278a..c42dec5 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -721,16 +721,18 @@ static void transition_one_qdisc(struct net_device *dev,
 				 struct netdev_queue *dev_queue,
 				 void *_need_watchdog)
 {
-	struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
-	int *need_watchdog_p = _need_watchdog;
+	if (dev_queue) {
+		struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
+		int *need_watchdog_p = _need_watchdog;
 
-	if (!(new_qdisc->flags & TCQ_F_BUILTIN))
-		clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
+		if (!(new_qdisc->flags & TCQ_F_BUILTIN))
+			clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
 
-	rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
-	if (need_watchdog_p && new_qdisc != &noqueue_qdisc) {
-		dev_queue->trans_start = 0;
-		*need_watchdog_p = 1;
+		rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
+		if (need_watchdog_p && new_qdisc != &noqueue_qdisc) {
+			dev_queue->trans_start = 0;
+			*need_watchdog_p = 1;
+		}
 	}
 }
 
@@ -753,7 +755,7 @@ void dev_activate(struct net_device *dev)
 
 	need_watchdog = 0;
 	netdev_for_each_tx_queue(dev, transition_one_qdisc, &need_watchdog);
-	transition_one_qdisc(dev, &dev->ingress_queue, NULL);
+	transition_one_qdisc(dev, dev_ingress_queue(dev), NULL);
 
 	if (need_watchdog) {
 		dev->trans_start = jiffies;
@@ -768,7 +770,7 @@ static void dev_deactivate_queue(struct net_device *dev,
 	struct Qdisc *qdisc_default = _qdisc_default;
 	struct Qdisc *qdisc;
 
-	qdisc = dev_queue->qdisc;
+	qdisc = dev_queue ? dev_queue->qdisc : NULL;
 	if (qdisc) {
 		spin_lock_bh(qdisc_lock(qdisc));
 
@@ -812,7 +814,7 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	netdev_for_each_tx_queue(dev, dev_deactivate_queue, &noop_qdisc);
-	dev_deactivate_queue(dev, &dev->ingress_queue, &noop_qdisc);
+	dev_deactivate_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
 
 	dev_watchdog_down(dev);
 
@@ -830,15 +832,17 @@ static void dev_init_scheduler_queue(struct net_device *dev,
 {
 	struct Qdisc *qdisc = _qdisc;
 
-	dev_queue->qdisc = qdisc;
-	dev_queue->qdisc_sleeping = qdisc;
+	if (dev_queue) {
+		dev_queue->qdisc = qdisc;
+		dev_queue->qdisc_sleeping = qdisc;
+	}
 }
 
 void dev_init_scheduler(struct net_device *dev)
 {
 	dev->qdisc = &noop_qdisc;
 	netdev_for_each_tx_queue(dev, dev_init_scheduler_queue, &noop_qdisc);
-	dev_init_scheduler_queue(dev, &dev->ingress_queue, &noop_qdisc);
+	dev_init_scheduler_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
 
 	setup_timer(&dev->watchdog_timer, dev_watchdog, (unsigned long)dev);
 }
@@ -847,7 +851,7 @@ static void shutdown_scheduler_queue(struct net_device *dev,
 				     struct netdev_queue *dev_queue,
 				     void *_qdisc_default)
 {
-	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+	struct Qdisc *qdisc = dev_queue ? dev_queue->qdisc_sleeping : NULL;
 	struct Qdisc *qdisc_default = _qdisc_default;
 
 	if (qdisc) {
@@ -861,7 +865,7 @@ static void shutdown_scheduler_queue(struct net_device *dev,
 void dev_shutdown(struct net_device *dev)
 {
 	netdev_for_each_tx_queue(dev, shutdown_scheduler_queue, &noop_qdisc);
-	shutdown_scheduler_queue(dev, &dev->ingress_queue, &noop_qdisc);
+	shutdown_scheduler_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
 	qdisc_destroy(dev->qdisc);
 	dev->qdisc = &noop_qdisc;
 




* Re: [PATCH 1/2] drivers/net/usb/qcusbnet: Checkpatch cleanups
From: Paulius Zaleckas @ 2010-10-01 13:54 UTC (permalink / raw)
  To: Joe Perches; +Cc: Elly Jones, netdev, dbrownell, mjg59, jglasgow, msb, olofj
In-Reply-To: <1285940519.752.10.camel@Joe-Laptop>

On Fri, Oct 1, 2010 at 4:41 PM, Joe Perches <joe@perches.com> wrote:
> On Fri, 2010-10-01 at 16:26 +0300, Paulius Zaleckas wrote:
>> On 09/29/2010 05:39 AM, Joe Perches wrote:
>> > Whitespace and removal of KERNEL_VERSION tests
>> > Neaten DBG macro
>>
>> Why not use dev_dbg instead of this ugly DBG macro?
>
> Currently, like a lot of other macros
> in the tree, it uses a runtime flag to
> control output.
>
> dev_dbg doesn't have that capability.

It has a runtime flag if CONFIG_DYNAMIC_DEBUG is enabled.


* Re: [PATCH] Add Qualcomm Gobi 2000 driver.
From: Matthew Garrett @ 2010-10-01 13:50 UTC (permalink / raw)
  To: Paulius Zaleckas; +Cc: Elly Jones, netdev, dbrownell, jglasgow, msb, olofj
In-Reply-To: <4CA5E40D.4080507@gmail.com>

On Fri, Oct 01, 2010 at 04:37:17PM +0300, Paulius Zaleckas wrote:
> On 09/28/2010 08:10 PM, Elly Jones wrote:
>> From: Elizabeth Jones<ellyjones@google.com>
>>
>> This driver is a rewrite of the original Qualcomm GPL driver, released as part
>> of Qualcomm's "Code Aurora" initiative. The driver has been transformed into
>> Linux kernel style and made to use kernel APIs where appropriate; some bugs have
>> also been fixed. Note that the device in question requires firmware and a
>> firmware loader; the latter has been written by mjg (see
>> http://www.codon.org.uk/~mjg59/gobi_loader/).
>
> Why not use the already existing in-kernel firmware upload API?

Because choosing the correct firmware requires making a runtime policy 
decision, the firmware is of variable size (anywhere from 12MB to 16MB) 
and it needs to be reencoded with something resembling PPP framing 
before it gets dumped to the device. It's certainly possible to do it 
in-kernel, but it'd be rather a lot more code and allocating that much 
unswappable space sounds like something that would make some use cases 
impossible.
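
To give an idea of what that kind of reencoding involves, here is a 
sketch of plain HDLC-style byte stuffing as used by PPP (flag 0x7e, 
escape 0x7d, escaped bytes XORed with 0x20). It is an assumption for 
illustration only: it is not the loader's code, the device's exact 
encoding is not described here, and frame_encode() is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_FLAG   0x7e	/* frame delimiter */
#define FRAME_ESCAPE 0x7d	/* escape marker */
#define FRAME_XOR    0x20	/* escaped bytes are XORed with this */

/*
 * Encode src[0..len) into dst as: FLAG, stuffed payload, FLAG.
 * dst must hold at least 2 * len + 2 bytes. Returns bytes written.
 * No checksum is added; real PPP framing also carries an FCS.
 */
static size_t frame_encode(const uint8_t *src, size_t len, uint8_t *dst)
{
	size_t n = 0;

	dst[n++] = FRAME_FLAG;
	for (size_t i = 0; i < len; i++) {
		uint8_t b = src[i];

		if (b == FRAME_FLAG || b == FRAME_ESCAPE) {
			dst[n++] = FRAME_ESCAPE;
			dst[n++] = b ^ FRAME_XOR;
		} else {
			dst[n++] = b;
		}
	}
	dst[n++] = FRAME_FLAG;
	return n;
}
```

In this sketch the output buffer has to be sized for the worst case of 
twice the input plus the two flags.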

-- 
Matthew Garrett | mjg59@srcf.ucam.org


* Re: [PATCH 1/2] drivers/net/usb/qcusbnet: Checkpatch cleanups
From: Joe Perches @ 2010-10-01 13:41 UTC (permalink / raw)
  To: Paulius Zaleckas
  Cc: Elly Jones, netdev, dbrownell, mjg59, jglasgow, msb, olofj
In-Reply-To: <4CA5E185.3090902@gmail.com>

On Fri, 2010-10-01 at 16:26 +0300, Paulius Zaleckas wrote:
> On 09/29/2010 05:39 AM, Joe Perches wrote:
> > Whitespace and removal of KERNEL_VERSION tests
> > Neaten DBG macro
> 
> Why not use dev_dbg instead of this ugly DBG macro?

Currently, like a lot of other macros
in the tree, it uses a runtime flag to
control output.

dev_dbg doesn't have that capability.




* Re: [PATCH V4] fs: allow for more than 2^31 files
From: Robin Holt @ 2010-10-01 13:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Robin Holt, David Miller, dipankar, viro, bcrl, den, mingo,
	mszeredi, cmm, npiggin, xemul, linux-kernel, netdev
In-Reply-To: <1285910958.2705.56.camel@edumazet-laptop>

Looks good.

Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>

I don't mean to flood this with my name; I merely mean that I find this
patch acceptable and worthy, and have tested it.  Feel free to lop off
any of these lines that are offensive.

Robin

On Fri, Oct 01, 2010 at 07:29:18AM +0200, Eric Dumazet wrote:
> Le vendredi 01 octobre 2010 à 07:03 +0200, Eric Dumazet a écrit :
> > Le jeudi 30 septembre 2010 à 23:34 -0500, Robin Holt a écrit :
> > 
> > > The proc_handler used to be proc_nr_files() which would call
> > > get_nr_files() and deposit the result in files_stat.nr_files then cascade
> > > to proc_dointvec() which would dump the 3 values.  Now it will dump the
> > > three values, but not update the middle (nr_files) value first.
> > > 
> > 
> > Ah I get it now, thanks !
> > 
> > I'll send a V4 shortly.
> > 
> > 
> 
> In this v4, I call proc_nr_files() again, and proc_nr_files() calls
> proc_doulongvec_minmax() instead of proc_dointvec()
> 
> Added the "cat /proc/sys/fs/file-nr" in Changelog
> 
> Thanks again Robin
> 
> [PATCH V4] fs: allow for more than 2^31 files
> 
> Robin Holt tried to boot a 16TB system and found af_unix was overflowing
> a 32bit value :
> 
> <quote>
> 
> We were seeing a failure which prevented boot.  The kernel was incapable
> of creating either a named pipe or unix domain socket.  This comes down
> to a common kernel function called unix_create1() which does:
> 
>         atomic_inc(&unix_nr_socks);
>         if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
>                 goto out;
> 
> The function get_max_files() is a simple return of files_stat.max_files.
> files_stat.max_files is a signed integer and is computed in
> fs/file_table.c's files_init().
> 
>         n = (mempages * (PAGE_SIZE / 1024)) / 10;
>         files_stat.max_files = n;
> 
> In our case, mempages (total_ram_pages) is approx 3,758,096,384
> (0xe0000000).  That leaves max_files at approximately 1,503,238,553.
> This causes 2 * get_max_files() to integer overflow.
> 
> </quote>
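
The arithmetic in the quoted report can be checked in userspace. A small
sketch, assuming 4096-byte pages (which is what makes the quoted numbers
agree) and a 32-bit two's-complement int; the function names are made up:

```c
#include <stdint.h>

/* files_init() arithmetic from the report, carried out in 64 bits. */
static int64_t max_files_for(uint64_t mempages, uint64_t page_size)
{
	return (int64_t)((mempages * (page_size / 1024)) / 10);
}

/* Non-zero when 2 * max_files no longer fits in a 32-bit signed int. */
static int doubled_limit_overflows_int32(int64_t max_files)
{
	return 2 * max_files > INT32_MAX;
}

/* Value a wrapping 32-bit int would hold after the multiply. */
static int32_t wrapped_doubled_limit(int64_t max_files)
{
	uint32_t low = (uint32_t)(2 * (uint64_t)max_files);

	/* Implementation-defined conversion; wraps on common ABIs. */
	return (int32_t)low;
}
```

For mempages = 3758096384 this gives max_files = 1503238553, and the
doubled value wraps to -1288490190. Against a negative limit, the test
atomic_read(&unix_nr_socks) > 2 * get_max_files() would already be true
for the first socket, which matches the failure described.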
> 
> Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
> integers, and change af_unix to use an atomic_long_t instead of
> atomic_t.
> 
> get_max_files() is changed to return an unsigned long.
> get_nr_files() is changed to return a long.
> 
> unix_nr_socks is changed from atomic_t to atomic_long_t, although this
> is not strictly needed to address Robin's problem.
>  
> Before patch (on a 64bit kernel) :
> # echo 2147483648 >/proc/sys/fs/file-max
> # cat /proc/sys/fs/file-max
> -18446744071562067968
> 
> After patch:
> # echo 2147483648 >/proc/sys/fs/file-max
> # cat /proc/sys/fs/file-max
> 2147483648
> # cat /proc/sys/fs/file-nr
> 704	0	2147483648
> 
> 
> Reported-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  fs/file_table.c    |   17 +++++++----------
>  include/linux/fs.h |    8 ++++----
>  kernel/sysctl.c    |    6 +++---
>  net/unix/af_unix.c |   14 +++++++-------
>  4 files changed, 21 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index a04bdd8..c3dee38 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -60,7 +60,7 @@ static inline void file_free(struct file *f)
>  /*
>   * Return the total number of open files in the system
>   */
> -static int get_nr_files(void)
> +static long get_nr_files(void)
>  {
>  	return percpu_counter_read_positive(&nr_files);
>  }
> @@ -68,7 +68,7 @@ static int get_nr_files(void)
>  /*
>   * Return the maximum number of open files in the system
>   */
> -int get_max_files(void)
> +unsigned long get_max_files(void)
>  {
>  	return files_stat.max_files;
>  }
> @@ -82,7 +82,7 @@ int proc_nr_files(ctl_table *table, int write,
>                       void __user *buffer, size_t *lenp, loff_t *ppos)
>  {
>  	files_stat.nr_files = get_nr_files();
> -	return proc_dointvec(table, write, buffer, lenp, ppos);
> +	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
>  }
>  #else
>  int proc_nr_files(ctl_table *table, int write,
> @@ -105,7 +105,7 @@ int proc_nr_files(ctl_table *table, int write,
>  struct file *get_empty_filp(void)
>  {
>  	const struct cred *cred = current_cred();
> -	static int old_max;
> +	static long old_max;
>  	struct file * f;
>  
>  	/*
> @@ -140,8 +140,7 @@ struct file *get_empty_filp(void)
>  over:
>  	/* Ran out of filps - report that */
>  	if (get_nr_files() > old_max) {
> -		printk(KERN_INFO "VFS: file-max limit %d reached\n",
> -					get_max_files());
> +		pr_info("VFS: file-max limit %lu reached\n", get_max_files());
>  		old_max = get_nr_files();
>  	}
>  	goto fail;
> @@ -487,7 +486,7 @@ retry:
>  
>  void __init files_init(unsigned long mempages)
>  { 
> -	int n; 
> +	unsigned long n;
>  
>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>  			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> @@ -498,9 +497,7 @@ void __init files_init(unsigned long mempages)
>  	 */ 
>  
>  	n = (mempages * (PAGE_SIZE / 1024)) / 10;
> -	files_stat.max_files = n; 
> -	if (files_stat.max_files < NR_FILE)
> -		files_stat.max_files = NR_FILE;
> +	files_stat.max_files = max_t(unsigned long, n, NR_FILE);
>  	files_defer_init();
>  	lg_lock_init(files_lglock);
>  	percpu_counter_init(&nr_files, 0);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 63d069b..8c06590 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -34,9 +34,9 @@
>  
>  /* And dynamically-tunable limits and defaults: */
>  struct files_stat_struct {
> -	int nr_files;		/* read only */
> -	int nr_free_files;	/* read only */
> -	int max_files;		/* tunable */
> +	unsigned long nr_files;		/* read only */
> +	unsigned long nr_free_files;	/* read only */
> +	unsigned long max_files;		/* tunable */
>  };
>  
>  struct inodes_stat_t {
> @@ -404,7 +404,7 @@ extern void __init inode_init_early(void);
>  extern void __init files_init(unsigned long);
>  
>  extern struct files_stat_struct files_stat;
> -extern int get_max_files(void);
> +extern unsigned long get_max_files(void);
>  extern int sysctl_nr_open;
>  extern struct inodes_stat_t inodes_stat;
>  extern int leases_enable, lease_break_time;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index f88552c..f789a0a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1352,16 +1352,16 @@ static struct ctl_table fs_table[] = {
>  	{
>  		.procname	= "file-nr",
>  		.data		= &files_stat,
> -		.maxlen		= 3*sizeof(int),
> +		.maxlen		= sizeof(files_stat),
>  		.mode		= 0444,
>  		.proc_handler	= proc_nr_files,
>  	},
>  	{
>  		.procname	= "file-max",
>  		.data		= &files_stat.max_files,
> -		.maxlen		= sizeof(int),
> +		.maxlen		= sizeof(files_stat.max_files),
>  		.mode		= 0644,
> -		.proc_handler	= proc_dointvec,
> +		.proc_handler	= proc_doulongvec_minmax,
>  	},
>  	{
>  		.procname	= "nr_open",
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 0b39b24..3e1d7d1 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -117,7 +117,7 @@
>  
>  static struct hlist_head unix_socket_table[UNIX_HASH_SIZE + 1];
>  static DEFINE_SPINLOCK(unix_table_lock);
> -static atomic_t unix_nr_socks = ATOMIC_INIT(0);
> +static atomic_long_t unix_nr_socks;
>  
>  #define unix_sockets_unbound	(&unix_socket_table[UNIX_HASH_SIZE])
>  
> @@ -360,13 +360,13 @@ static void unix_sock_destructor(struct sock *sk)
>  	if (u->addr)
>  		unix_release_addr(u->addr);
>  
> -	atomic_dec(&unix_nr_socks);
> +	atomic_long_dec(&unix_nr_socks);
>  	local_bh_disable();
>  	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>  	local_bh_enable();
>  #ifdef UNIX_REFCNT_DEBUG
> -	printk(KERN_DEBUG "UNIX %p is destroyed, %d are still alive.\n", sk,
> -		atomic_read(&unix_nr_socks));
> +	printk(KERN_DEBUG "UNIX %p is destroyed, %ld are still alive.\n", sk,
> +		atomic_long_read(&unix_nr_socks));
>  #endif
>  }
>  
> @@ -606,8 +606,8 @@ static struct sock *unix_create1(struct net *net, struct socket *sock)
>  	struct sock *sk = NULL;
>  	struct unix_sock *u;
>  
> -	atomic_inc(&unix_nr_socks);
> -	if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
> +	atomic_long_inc(&unix_nr_socks);
> +	if (atomic_long_read(&unix_nr_socks) > 2 * get_max_files())
>  		goto out;
>  
>  	sk = sk_alloc(net, PF_UNIX, GFP_KERNEL, &unix_proto);
> @@ -632,7 +632,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock)
>  	unix_insert_socket(unix_sockets_unbound, sk);
>  out:
>  	if (sk == NULL)
> -		atomic_dec(&unix_nr_socks);
> +		atomic_long_dec(&unix_nr_socks);
>  	else {
>  		local_bh_disable();
>  		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
> 
> 


* Re: [PATCH] Add Qualcomm Gobi 2000 driver.
From: Paulius Zaleckas @ 2010-10-01 13:37 UTC (permalink / raw)
  To: Elly Jones; +Cc: netdev, dbrownell, mjg59, jglasgow, msb, olofj
In-Reply-To: <20100928171026.GB6083@google.com>

On 09/28/2010 08:10 PM, Elly Jones wrote:
> From: Elizabeth Jones<ellyjones@google.com>
>
> This driver is a rewrite of the original Qualcomm GPL driver, released as part
> of Qualcomm's "Code Aurora" initiative. The driver has been transformed into
> Linux kernel style and made to use kernel APIs where appropriate; some bugs have
> also been fixed. Note that the device in question requires firmware and a
> firmware loader; the latter has been written by mjg (see
> http://www.codon.org.uk/~mjg59/gobi_loader/).

Why not use the already existing in-kernel firmware upload API?

> Signed-Off-By: Elizabeth Jones<ellyjones@google.com>
> Signed-Off-By: Jason Glasgow<jglasgow@google.com>


* Re: [PATCH 1/2] drivers/net/usb/qcusbnet: Checkpatch cleanups
From: Paulius Zaleckas @ 2010-10-01 13:26 UTC (permalink / raw)
  To: Joe Perches; +Cc: Elly Jones, netdev, dbrownell, mjg59, jglasgow, msb, olofj
In-Reply-To: <0aa502d0e385f2333f8bc12dafdcde88e5ca0262.1285727642.git.joe@perches.com>

On 09/29/2010 05:39 AM, Joe Perches wrote:
> Whitespace and removal of KERNEL_VERSION tests
> Neaten DBG macro

Why not use dev_dbg instead of this ugly DBG macro?


* Re: [PATCH v4 1/2] HID: Add Support for Setting and Getting Feature Reports from hidraw
From: Jiri Kosina @ 2010-10-01 13:30 UTC (permalink / raw)
  To: Antonio Ospite
  Cc: Alan Ott, Stefan Achatz, Alexey Dobriyan, Tejun Heo, Alan Stern,
	Greg Kroah-Hartman, Marcel Holtmann, Stephane Chatty,
	Michael Poole, David S. Miller, Bastien Nocera, Eric Dumazet,
	linux-input-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100928153011.32750e5d.ospite-aNJ+ML1ZbiP93QAQaVx+gl6hYfS7NtTn@public.gmane.org>

On Tue, 28 Sep 2010, Antonio Ospite wrote:

> Hi Alan, I am doing some stress testing on hidraw, if I have a loop with
> HIDIOCGFEATURE on a given report and I disconnect the device while the
> loop is running I get this:
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> IP: [<ffffffffa02c66b4>] hidraw_ioctl+0xfc/0x32c [hid]
> 
> Full log attached along with the test program, the device is a Sony PS3
> Controller (sixaxis).
> 
> If my objdump analysis is right, hidraw_ioctl+0xfc should be around line
> 361 in hidraw.c (with your patch applied):
> 
> struct hid_device *hid = dev->hid;
> 
> It looks like 'dev' (which is hidraw_table[minor]) can be NULL
> sometimes, can't it?
> This is not introduced by your changes tho.
> 
> Just as a side note, the bug does not show up if the userspace program
> handles return values properly and exits as soon as it gets an error
> from the HID layer, see the XXX comment in test_hidraw_feature.c.
> 
> This fixes it, if it looks ok I will resend the patch rebased on
> mainline code:
> 
> diff --git a/drivers/hid/hidraw.c b/drivers/hid/hidraw.c
> index 7df1310..3c040c6 100644
> --- a/drivers/hid/hidraw.c
> +++ b/drivers/hid/hidraw.c
> @@ -322,6 +322,10 @@ static long hidraw_ioctl(struct file *file, unsigned int cmd,
> 
>         mutex_lock(&minors_lock);
>         dev = hidraw_table[minor];
> +       if (!dev) {
> +               ret = -ENODEV;
> +               goto out;
> +       }
> 
>         switch (cmd) {
>                 case HIDIOCGRDESCSIZE:
> @@ -412,6 +416,7 @@ static long hidraw_ioctl(struct file *file, unsigned int cmd,
> 
>                 ret = -ENOTTY;
>         }
> +out:
>         mutex_unlock(&minors_lock);
>         return ret;
>  }

Yes, this patch makes sense even for current mainline code. Could you 
please resend it to me with Signed-off-by: and changelog text, so that I 
could apply it?
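
The pattern in the patch, as a userspace sketch: a counter stands in for 
minors_lock, and raw_dev, table and raw_ioctl() are made-up names, not 
the hidraw code. The point is that the slot is checked under the lock 
and that every exit path drops the lock again.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_MINORS 4

struct raw_dev {
	int report_size;
};

static struct raw_dev *table[MAX_MINORS];	/* slot is NULL after disconnect */
static int lock_depth;				/* stands in for minors_lock */

static void fake_lock(void)   { lock_depth++; }
static void fake_unlock(void) { lock_depth--; }

static long raw_ioctl(unsigned int minor)
{
	struct raw_dev *dev;
	long ret;

	if (minor >= MAX_MINORS)
		return -ENODEV;

	fake_lock();
	dev = table[minor];
	if (!dev) {
		/* Device went away: bail out instead of dereferencing NULL. */
		ret = -ENODEV;
		goto out;
	}
	ret = dev->report_size;
out:
	fake_unlock();
	return ret;
}
```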

Thanks!

-- 
Jiri Kosina
SUSE Labs, Novell Inc.


