Netdev List

Netdev List
 help / color / mirror / Atom feed

* [ethtool PATCH 5/6] v4 Add RX packet classification interface
From: Alexander Duyck @ 2011-04-21 20:40 UTC (permalink / raw)
  To: davem, jeffrey.t.kirsher, bhutchings; +Cc: netdev
In-Reply-To: <20110421202857.23054.63316.stgit@gitlad.jf.intel.com>

From: Santwona Behera <santwona.behera@sun.com>

This patch was originally introduced as:
  [PATCH 1/3] [ethtool] Add rx pkt classification interface
  Signed-off-by: Santwona Behera <santwona.behera@sun.com>
  http://patchwork.ozlabs.org/patch/23223/

v2:
I have updated it to address a number of issues.  As a result I removed the
local caching of rules due to the fact that there were memory leaks in this
code and the rule manager would consume over 1Mb of space for an 8K table
when all that was needed was 1K in order to store which rules were active
and which were not.

In addition I dropped the use of regions as there were multiple issue found
including the fact that the regions were not properly expanding beyond 2
and the fact that the regions required reading all of the rules in order to
correctly expand beyond 2.  By dropping the regions from the rule manager
it is possible to write a much cleaner interface leaving region management
to be done by either the driver or by external management scripts.

v3:
The latest update to this patch now inverts the masks to match the mask
types used for n-tuple.  As such a network flow classifier is defined using
the exact same syntax as n-tuple, and the tool will correct for the fact
that NFC uses the 1's compliment of the n-tuple mask.

I also updated the ordering of new rules being added.  All new rules will
take the highest numbered open rule when no location is specified.

Since NFC now uses the same syntax as n-tuple I added code such that now
when location is not specified the -U option will first try to add a new
n-tuple rule, and if that fails with a ENOTSUPP it will then try to add the
rule via the NFC interface.

Finally I split out the addition of bitops and the updates to documentation
into separate patches.  This makes the total patch size a bit more
manageable since the addition of NFC and the merging of it with n-tuple were
combined into this patch.

v4:
This change merges the ntuple and network flow classifier rules so that if
we setup a rule and the device has the NTUPLE flag set we will first try to
use set_rx_ntuple.  If that fails with EOPNOTSUPP we then will attempt to
use the network flow classifier rule insertion.  This way we can support
legacy configurations such as niu on kernels prior to 2.6.40 that support
network flow classifier but not ntuple, but for drivers such as ixgbe we
can test for ntuple first, and then network flow classifier.

This patch has also updated the output to make use of the updated network
flow classifier extensions that have been accepted into the kernel.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 Makefile.am    |    3 
 ethtool-util.h |   14 +
 ethtool.c      |  357 ++++++++++--------
 rxclass.c      | 1106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1318 insertions(+), 162 deletions(-)
 create mode 100644 rxclass.c

diff --git a/Makefile.am b/Makefile.am
index a0d2116..0262c31 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -8,7 +8,8 @@ ethtool_SOURCES = ethtool.c ethtool-copy.h ethtool-util.h	\
 		  amd8111e.c de2104x.c e100.c e1000.c igb.c	\
 		  fec_8xx.c ibm_emac.c ixgb.c ixgbe.c natsemi.c	\
 		  pcnet32.c realtek.c tg3.c marvell.c vioc.c	\
-		  smsc911x.c at76c50x-usb.c sfc.c stmmac.c
+		  smsc911x.c at76c50x-usb.c sfc.c stmmac.c	\
+		  rxclass.c
 
 dist-hook:
 	cp $(top_srcdir)/ethtool.spec $(distdir)
diff --git a/ethtool-util.h b/ethtool-util.h
index 3d46faf..adf8fcc 100644
--- a/ethtool-util.h
+++ b/ethtool-util.h
@@ -62,6 +62,8 @@ static inline u64 cpu_to_be64(u64 value)
 #define SIOCETHTOOL     0x8946
 #endif
 
+#define	RX_CLS_LOC_UNSPEC	0xffffffffUL
+
 /* National Semiconductor DP83815, DP83816 */
 int natsemi_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs);
 int natsemi_dump_eeprom(struct ethtool_drvinfo *info,
@@ -125,4 +127,14 @@ int sfc_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs);
 int st_mac100_dump_regs(struct ethtool_drvinfo *info,
 			struct ethtool_regs *regs);
 int st_gmac_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs);
-#endif
+
+/* Rx flow classification */
+int rxclass_parse_ruleopts(char **optstr, int opt_cnt,
+			   struct ethtool_rx_flow_spec *fsp);
+int rxclass_rule_getall(int fd, struct ifreq *ifr);
+int rxclass_rule_get(int fd, struct ifreq *ifr, __u32 loc);
+int rxclass_rule_ins(int fd, struct ifreq *ifr,
+		     struct ethtool_rx_flow_spec *fsp);
+int rxclass_rule_del(int fd, struct ifreq *ifr, __u32 loc);
+
+#endif /* ETHTOOL_UTIL_H__ */
diff --git a/ethtool.c b/ethtool.c
index 15af86a..421fe20 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -6,6 +6,7 @@
  * Kernel 2.4 update Copyright 2001 Jeff Garzik <jgarzik@mandrakesoft.com>
  * Wake-on-LAN,natsemi,misc support by Tim Hockin <thockin@sun.com>
  * Portions Copyright 2002 Intel
+ * Portions Copyright (C) Sun Microsystems 2008
  * do_test support by Eli Kupermann <eli.kupermann@intel.com>
  * ETHTOOL_PHYS_ID support by Chris Leech <christopher.leech@intel.com>
  * e1000 support by Scott Feldman <scott.feldman@intel.com>
@@ -14,6 +15,7 @@
  * amd8111e support by Reeja John <reeja.john@amd.com>
  * long arguments by Andi Kleen.
  * SMSC LAN911x support by Steve Glendinning <steve.glendinning@smsc.com>
+ * Rx Network Flow Control configuration support <santwona.behera@sun.com>
  * Various features by Ben Hutchings <bhutchings@solarflare.com>;
  *	Copyright 2009, 2010 Solarflare Communications
  *
@@ -43,6 +45,8 @@
 #include <arpa/inet.h>
 
 #include <linux/sockios.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
 #include "ethtool-util.h"
 
 #ifndef MAX_ADDR_LEN
@@ -93,14 +97,13 @@ static int do_gstats(int fd, struct ifreq *ifr);
 static int rxflow_str_to_type(const char *str);
 static int parse_rxfhashopts(char *optstr, u32 *data);
 static char *unparse_rxfhashopts(u64 opts);
-static void parse_rxntupleopts(int argc, char **argp, int first_arg);
 static int dump_rxfhash(int fhash, u64 val);
 static int do_srxclass(int fd, struct ifreq *ifr);
 static int do_grxclass(int fd, struct ifreq *ifr);
 static int do_grxfhindir(int fd, struct ifreq *ifr);
 static int do_srxfhindir(int fd, struct ifreq *ifr);
-static int do_srxntuple(int fd, struct ifreq *ifr);
-static int do_grxntuple(int fd, struct ifreq *ifr);
+static int do_srxclsrule(int fd, struct ifreq *ifr);
+static int do_grxclsrule(int fd, struct ifreq *ifr);
 static int do_flash(int fd, struct ifreq *ifr);
 static int do_permaddr(int fd, struct ifreq *ifr);
 
@@ -131,8 +134,8 @@ static enum {
 	MODE_SNFC,
 	MODE_GRXFHINDIR,
 	MODE_SRXFHINDIR,
-	MODE_SNTUPLE,
-	MODE_GNTUPLE,
+	MODE_SCLSRULE,
+	MODE_GCLSRULE,
 	MODE_FLASHDEV,
 	MODE_PERMADDR,
 } mode = MODE_GSET;
@@ -238,7 +241,7 @@ static struct option {
 		"indirection" },
     { "-X", "--set-rxfh-indir", MODE_SRXFHINDIR, "Set Rx flow hash indirection",
 		"		equal N | weight W0 W1 ...\n" },
-    { "-U", "--config-ntuple", MODE_SNTUPLE, "Configure Rx ntuple filters "
+    { "-U", "--config-ntuple", MODE_SCLSRULE, "Configure Rx ntuple filters "
 		"and actions",
 		"		{ flow-type tcp4|udp4|sctp4\n"
 		"		  [ src-ip ADDR [src-ip-mask MASK] ]\n"
@@ -252,7 +255,7 @@ static struct option {
 		"		[ vlan VLAN-TAG [vlan-mask MASK] ]\n"
 		"		[ user-def DATA [user-def-mask MASK] ]\n"
 		"		action N\n" },
-    { "-u", "--show-ntuple", MODE_GNTUPLE,
+    { "-u", "--show-ntuple", MODE_GCLSRULE,
 		"Get Rx ntuple filters and actions\n" },
     { "-P", "--show-permaddr", MODE_PERMADDR,
 		"Show permanent hardware address" },
@@ -379,26 +382,6 @@ static u32 rx_fhash_val = 0;
 static int rx_fhash_changed = 0;
 static int rxfhindir_equal = 0;
 static char **rxfhindir_weight = NULL;
-static int sntuple_changed = 0;
-static struct ethtool_rx_ntuple_flow_spec ntuple_fs;
-static int ntuple_ip4src_seen = 0;
-static int ntuple_ip4src_mask_seen = 0;
-static int ntuple_ip4dst_seen = 0;
-static int ntuple_ip4dst_mask_seen = 0;
-static int ntuple_psrc_seen = 0;
-static int ntuple_psrc_mask_seen = 0;
-static int ntuple_pdst_seen = 0;
-static int ntuple_pdst_mask_seen = 0;
-static int ntuple_ether_dst_seen = 0;
-static int ntuple_ether_dst_mask_seen = 0;
-static int ntuple_ether_src_seen = 0;
-static int ntuple_ether_src_mask_seen = 0;
-static int ntuple_ether_proto_seen = 0;
-static int ntuple_ether_proto_mask_seen = 0;
-static int ntuple_vlan_tag_seen = 0;
-static int ntuple_vlan_tag_mask_seen = 0;
-static int ntuple_user_def_seen = 0;
-static int ntuple_user_def_mask_seen = 0;
 static char *flash_file = NULL;
 static int flash = -1;
 static int flash_region = -1;
@@ -407,6 +390,11 @@ static int msglvl_changed;
 static u32 msglvl_wanted = 0;
 static u32 msglvl_mask = 0;
 
+static int rx_class_rule_get = -1;
+static int rx_class_rule_del = -1;
+static int rx_class_rule_added = 0;
+static struct ethtool_rx_flow_spec rx_rule_fs;
+
 static enum {
 	ONLINE=0,
 	OFFLINE,
@@ -519,58 +507,6 @@ static struct cmdline_info cmdline_coalesce[] = {
 	{ "tx-frames-high", CMDL_S32, &coal_tx_frames_high_wanted, &ecoal.tx_max_coalesced_frames_high },
 };
 
-static struct cmdline_info cmdline_ntuple_tcp_ip4[] = {
-	{ "src-ip", CMDL_IP4, &ntuple_fs.h_u.tcp_ip4_spec.ip4src, NULL,
-	  0, &ntuple_ip4src_seen },
-	{ "src-ip-mask", CMDL_IP4, &ntuple_fs.m_u.tcp_ip4_spec.ip4src, NULL,
-	  0, &ntuple_ip4src_mask_seen },
-	{ "dst-ip", CMDL_IP4, &ntuple_fs.h_u.tcp_ip4_spec.ip4dst, NULL,
-	  0, &ntuple_ip4dst_seen },
-	{ "dst-ip-mask", CMDL_IP4, &ntuple_fs.m_u.tcp_ip4_spec.ip4dst, NULL,
-	  0, &ntuple_ip4dst_mask_seen },
-	{ "src-port", CMDL_BE16, &ntuple_fs.h_u.tcp_ip4_spec.psrc, NULL,
-	  0, &ntuple_psrc_seen },
-	{ "src-port-mask", CMDL_BE16, &ntuple_fs.m_u.tcp_ip4_spec.psrc, NULL,
-	  0, &ntuple_psrc_mask_seen },
-	{ "dst-port", CMDL_BE16, &ntuple_fs.h_u.tcp_ip4_spec.pdst, NULL,
-	  0, &ntuple_pdst_seen },
-	{ "dst-port-mask", CMDL_BE16, &ntuple_fs.m_u.tcp_ip4_spec.pdst, NULL,
-	  0, &ntuple_pdst_mask_seen },
-	{ "vlan", CMDL_U16, &ntuple_fs.vlan_tag, NULL,
-	  0, &ntuple_vlan_tag_seen },
-	{ "vlan-mask", CMDL_U16, &ntuple_fs.vlan_tag_mask, NULL,
-	  0, &ntuple_vlan_tag_mask_seen },
-	{ "user-def", CMDL_U64, &ntuple_fs.data, NULL,
-	  0, &ntuple_user_def_seen },
-	{ "user-def-mask", CMDL_U64, &ntuple_fs.data_mask, NULL,
-	  0, &ntuple_user_def_mask_seen },
-	{ "action", CMDL_S32, &ntuple_fs.action, NULL },
-};
-
-static struct cmdline_info cmdline_ntuple_ether[] = {
-	{ "dst", CMDL_MAC, ntuple_fs.h_u.ether_spec.h_dest, NULL,
-	  0, &ntuple_ether_dst_seen },
-	{ "dst-mask", CMDL_MAC, ntuple_fs.m_u.ether_spec.h_dest, NULL,
-	  0, &ntuple_ether_dst_mask_seen },
-	{ "src", CMDL_MAC, ntuple_fs.h_u.ether_spec.h_source, NULL,
-	  0, &ntuple_ether_src_seen },
-	{ "src-mask", CMDL_MAC, ntuple_fs.m_u.ether_spec.h_source, NULL,
-	  0, &ntuple_ether_src_mask_seen },
-	{ "proto", CMDL_BE16, &ntuple_fs.h_u.ether_spec.h_proto, NULL,
-	  0, &ntuple_ether_proto_seen },
-	{ "proto-mask", CMDL_BE16, &ntuple_fs.m_u.ether_spec.h_proto, NULL,
-	  0, &ntuple_ether_proto_mask_seen },
-	{ "vlan", CMDL_U16, &ntuple_fs.vlan_tag, NULL,
-	  0, &ntuple_vlan_tag_seen },
-	{ "vlan-mask", CMDL_U16, &ntuple_fs.vlan_tag_mask, NULL,
-	  0, &ntuple_vlan_tag_mask_seen },
-	{ "user-def", CMDL_U64, &ntuple_fs.data, NULL,
-	  0, &ntuple_user_def_seen },
-	{ "user-def-mask", CMDL_U64, &ntuple_fs.data_mask, NULL,
-	  0, &ntuple_user_def_mask_seen },
-	{ "action", CMDL_S32, &ntuple_fs.action, NULL },
-};
-
 static struct cmdline_info cmdline_msglvl[] = {
 	{ "drv", CMDL_FLAG, &msglvl_wanted, NULL,
 	  NETIF_MSG_DRV, &msglvl_mask },
@@ -841,8 +777,8 @@ static void parse_cmdline(int argc, char **argp)
 			    (mode == MODE_SNFC) ||
 			    (mode == MODE_GRXFHINDIR) ||
 			    (mode == MODE_SRXFHINDIR) ||
-			    (mode == MODE_SNTUPLE) ||
-			    (mode == MODE_GNTUPLE) ||
+			    (mode == MODE_SCLSRULE) ||
+			    (mode == MODE_GCLSRULE) ||
 			    (mode == MODE_PHYS_ID) ||
 			    (mode == MODE_FLASHDEV) ||
 			    (mode == MODE_PERMADDR)) {
@@ -926,16 +862,45 @@ static void parse_cmdline(int argc, char **argp)
 				i = argc;
 				break;
 			}
-			if (mode == MODE_SNTUPLE) {
+			if (mode == MODE_SCLSRULE) {
 				if (!strcmp(argp[i], "flow-type")) {
 					i += 1;
 					if (i >= argc) {
 						exit_bad_args();
 						break;
 					}
-					parse_rxntupleopts(argc, argp, i);
-					i = argc;
-					break;
+					if (rxclass_parse_ruleopts(&argp[i],
+							argc - i,
+							&rx_rule_fs) < 0) {
+						exit_bad_args();
+					} else {
+						i = argc;
+						rx_class_rule_added = 1;
+					}
+				} else if (!strcmp(argp[i], "class-rule-del")) {
+					i += 1;
+					if (i >= argc) {
+						exit_bad_args();
+						break;
+					}
+					rx_class_rule_del =
+						get_uint_range(argp[i], 0,
+							       INT_MAX);
+				} else {
+					exit_bad_args();
+				}
+				break;
+			}
+			if (mode == MODE_GCLSRULE) {
+				if (!strcmp(argp[i], "class-rule")) {
+					i += 1;
+					if (i >= argc) {
+						exit_bad_args();
+						break;
+					}
+					rx_class_rule_get =
+						get_uint_range(argp[i], 0,
+							       INT_MAX);
 				} else {
 					exit_bad_args();
 				}
@@ -1606,66 +1571,6 @@ static char *unparse_rxfhashopts(u64 opts)
 	return buf;
 }
 
-static void parse_rxntupleopts(int argc, char **argp, int i)
-{
-	ntuple_fs.flow_type = rxflow_str_to_type(argp[i]);
-
-	switch (ntuple_fs.flow_type) {
-	case TCP_V4_FLOW:
-	case UDP_V4_FLOW:
-	case SCTP_V4_FLOW:
-		parse_generic_cmdline(argc, argp, i + 1,
-				      &sntuple_changed,
-				      cmdline_ntuple_tcp_ip4,
-				      ARRAY_SIZE(cmdline_ntuple_tcp_ip4));
-		if (!ntuple_ip4src_seen)
-			ntuple_fs.m_u.tcp_ip4_spec.ip4src = 0xffffffff;
-		if (!ntuple_ip4dst_seen)
-			ntuple_fs.m_u.tcp_ip4_spec.ip4dst = 0xffffffff;
-		if (!ntuple_psrc_seen)
-			ntuple_fs.m_u.tcp_ip4_spec.psrc = 0xffff;
-		if (!ntuple_pdst_seen)
-			ntuple_fs.m_u.tcp_ip4_spec.pdst = 0xffff;
-		ntuple_fs.m_u.tcp_ip4_spec.tos = 0xff;
-		break;
-	case ETHER_FLOW:
-		parse_generic_cmdline(argc, argp, i + 1,
-				      &sntuple_changed,
-				      cmdline_ntuple_ether,
-				      ARRAY_SIZE(cmdline_ntuple_ether));
-		if (!ntuple_ether_dst_seen)
-			memset(ntuple_fs.m_u.ether_spec.h_dest, 0xff, ETH_ALEN);
-		if (!ntuple_ether_src_seen)
-			memset(ntuple_fs.m_u.ether_spec.h_source, 0xff,
-			       ETH_ALEN);
-		if (!ntuple_ether_proto_seen)
-			ntuple_fs.m_u.ether_spec.h_proto = 0xffff;
-		break;
-	default:
-		fprintf(stderr, "Unsupported flow type \"%s\"\n", argp[i]);
-		exit(106);
-		break;
-	}
-
-	if (!ntuple_vlan_tag_seen)
-		ntuple_fs.vlan_tag_mask = 0xffff;
-	if (!ntuple_user_def_seen)
-		ntuple_fs.data_mask = 0xffffffffffffffffULL;
-
-	if ((ntuple_ip4src_mask_seen && !ntuple_ip4src_seen) ||
-	    (ntuple_ip4dst_mask_seen && !ntuple_ip4dst_seen) ||
-	    (ntuple_psrc_mask_seen && !ntuple_psrc_seen) ||
-	    (ntuple_pdst_mask_seen && !ntuple_pdst_seen) ||
-	    (ntuple_ether_dst_mask_seen && !ntuple_ether_dst_seen) ||
-	    (ntuple_ether_src_mask_seen && !ntuple_ether_src_seen) ||
-	    (ntuple_ether_proto_mask_seen && !ntuple_ether_proto_seen) ||
-	    (ntuple_vlan_tag_mask_seen && !ntuple_vlan_tag_seen) ||
-	    (ntuple_user_def_mask_seen && !ntuple_user_def_seen)) {
-		fprintf(stderr, "Cannot specify mask without value\n");
-		exit(107);
-	}
-}
-
 static struct {
 	const char *name;
 	int (*func)(struct ethtool_drvinfo *info, struct ethtool_regs *regs);
@@ -2027,10 +1932,10 @@ static int doit(void)
 		return do_grxfhindir(fd, &ifr);
 	} else if (mode == MODE_SRXFHINDIR) {
 		return do_srxfhindir(fd, &ifr);
-	} else if (mode == MODE_SNTUPLE) {
-		return do_srxntuple(fd, &ifr);
-	} else if (mode == MODE_GNTUPLE) {
-		return do_grxntuple(fd, &ifr);
+	} else if (mode == MODE_SCLSRULE) {
+		return do_srxclsrule(fd, &ifr);
+	} else if (mode == MODE_GCLSRULE) {
+		return do_grxclsrule(fd, &ifr);
 	} else if (mode == MODE_FLASHDEV) {
 		return do_flash(fd, &ifr);
 	} else if (mode == MODE_PERMADDR) {
@@ -3163,21 +3068,130 @@ static int do_permaddr(int fd, struct ifreq *ifr)
 	return err;
 }
 
+static int flow_spec_to_ntuple(struct ethtool_rx_flow_spec *fsp,
+			       struct ethtool_rx_ntuple_flow_spec *ntuple)
+{
+	int i;
+
+	/* verify location is not specified */
+	if (fsp->location != RX_CLS_LOC_UNSPEC)
+		return -1;
+
+	/* verify ring cookie can transfer to action */
+	if (fsp->ring_cookie > INT_MAX &&
+	    ~fsp->ring_cookie > 1)
+		return -1;
+
+	/* verify only one field is setting data field */
+	if ((fsp->flow_type & FLOW_EXT) &&
+	    (fsp->m_ext.data[0] || fsp->m_ext.data[1]) &&
+	    fsp->m_ext.vlan_etype)
+		return -1;
+
+	/* initialize entire ntuple to all 0xFF */
+	memset(ntuple, ~0, sizeof(*ntuple));
+
+	/* set non-filter values */
+	ntuple->flow_type = fsp->flow_type;
+	ntuple->action = fsp->ring_cookie;
+
+	/* copy header portion over */
+	memcpy(&ntuple->h_u, &fsp->h_u, sizeof(fsp->h_u));
+
+	/* copy mask portion over and invert */
+	memcpy(&ntuple->m_u, &fsp->m_u, sizeof(fsp->m_u));
+	for (i = 0; i < sizeof(fsp->m_u); i++)
+		ntuple->m_u.hdata[i] ^= 0xFF;
+
+	/* copy extended fields */
+	if (fsp->flow_type & FLOW_EXT) {
+		ntuple->vlan_tag =
+			ntohs(fsp->h_ext.vlan_tci);
+		ntuple->vlan_tag_mask =
+			~ntohs(fsp->m_ext.vlan_tci);
+		if (fsp->m_ext.vlan_etype) {
+			ntuple->data =
+				ntohl(fsp->h_ext.vlan_etype);
+			ntuple->data_mask =
+				~(u64)ntohl(fsp->m_ext.vlan_etype);
+		} else {
+			ntuple->data =
+				(u64)ntohl(fsp->h_ext.data[0]);
+			ntuple->data |=
+				(u64)ntohl(fsp->h_ext.data[1]) << 32;
+			ntuple->data_mask =
+				(u64)ntohl(~fsp->m_ext.data[0]);
+			ntuple->data_mask |=
+				(u64)ntohl(~fsp->m_ext.data[1]) << 32;
+		}
+	}
+
+	return 0;
+}
+
 static int do_srxntuple(int fd, struct ifreq *ifr)
 {
+	struct ethtool_rx_ntuple ntuplecmd;
+	struct ethtool_value eval;
 	int err;
 
-	if (sntuple_changed) {
-		struct ethtool_rx_ntuple ntuplecmd;
+	/* verify if Ntuple is supported on the HW */
+	err = flow_spec_to_ntuple(&rx_rule_fs, &ntuplecmd.fs);
+	if (err)
+		return -1;
+
+	/*
+	 * Check to see if the flag is set for N-tuple, this allows
+	 * us to avoid the possible EINVAL response for the N-tuple
+	 * flag not being set on the device
+	 */
+	eval.cmd = ETHTOOL_GFLAGS;
+	ifr->ifr_data = (caddr_t)&eval;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
+	if (err || !(eval.data & ETH_FLAG_NTUPLE))
+		return -1;
 
-		ntuplecmd.cmd = ETHTOOL_SRXNTUPLE;
-		memcpy(&ntuplecmd.fs, &ntuple_fs,
-		       sizeof(struct ethtool_rx_ntuple_flow_spec));
+	/* send rule via N-tuple */
+	ntuplecmd.cmd = ETHTOOL_SRXNTUPLE;
+	ifr->ifr_data = (caddr_t)&ntuplecmd;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
 
-		ifr->ifr_data = (caddr_t)&ntuplecmd;
-		err = ioctl(fd, SIOCETHTOOL, ifr);
-		if (err < 0)
-			perror("Cannot add new RX n-tuple filter");
+	/*
+	 * Display error only if reponse is something other than op not
+	 * supported.  It is possible that the interface uses the network
+	 * flow classifier interface instead of N-tuple. 
+	 */ 
+	if (err && errno != EOPNOTSUPP)
+		perror("Cannot add new rule via N-tuple");
+
+	return err;
+}
+
+static int do_srxclsrule(int fd, struct ifreq *ifr)
+{
+	int err;
+
+	if (rx_class_rule_added) {
+		/* attempt to add rule via N-tuple specifier */
+		err = do_srxntuple(fd, ifr);
+		if (!err)
+			return 0;
+
+		/* attempt to add rule via network flow classifier */
+		err = rxclass_rule_ins(fd, ifr, &rx_rule_fs);
+		if (err < 0) {
+			fprintf(stderr, "Cannot insert"
+				" classification rule\n");
+			return 1;
+		}
+	} else if (rx_class_rule_del >= 0) {
+		err = rxclass_rule_del(fd, ifr, rx_class_rule_del);
+
+		if (err < 0) {
+			fprintf(stderr, "Cannot delete"
+				" classification rule\n");
+			return 1;
+		}
 	} else {
 		exit_bad_args();
 	}
@@ -3185,9 +3199,32 @@ static int do_srxntuple(int fd, struct ifreq *ifr)
 	return 0;
 }
 
-static int do_grxntuple(int fd, struct ifreq *ifr)
+static int do_grxclsrule(int fd, struct ifreq *ifr)
 {
-	return 0;
+	struct ethtool_rxnfc nfccmd;
+	int err;
+
+	if (rx_class_rule_get >= 0) {
+		err = rxclass_rule_get(fd, ifr, rx_class_rule_get);
+		if (err < 0)
+			fprintf(stderr, "Cannot get RX classification rule\n");
+		return err ? 1 : 0;
+	}
+
+	nfccmd.cmd = ETHTOOL_GRXRINGS;
+	ifr->ifr_data = (caddr_t)&nfccmd;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
+	if (err < 0)
+		perror("Cannot get RX rings");
+	else
+		fprintf(stdout, "%d RX rings available\n",
+			(int)nfccmd.data);
+
+	err = rxclass_rule_getall(fd, ifr);
+	if (err < 0)
+		fprintf(stderr, "RX classification rule retrieval failed\n");
+
+	return err ? 1 : 0;
 }
 
 static int send_ioctl(int fd, struct ifreq *ifr)
diff --git a/rxclass.c b/rxclass.c
new file mode 100644
index 0000000..5ad0639
--- /dev/null
+++ b/rxclass.c
@@ -0,0 +1,1106 @@
+/*
+ * Copyright (C) 2008 Sun Microsystems, Inc. All rights reserved.
+ */
+#include <stdio.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <linux/sockios.h>
+#include <arpa/inet.h>
+#include "ethtool-util.h"
+#include "ethtool-bitops.h"
+
+/*
+ * This is a rule manager implementation for ordering rx flow
+ * classification rules in a longest prefix first match order.
+ * The assumption is that this rule manager is the only one adding rules to
+ * the device's hardware classifier.
+ */
+
+struct rmgr_ctrl {
+	/* slot contains a bitmap indicating which filters are valid */
+	unsigned long		*slot;
+	__u32			n_rules;
+	__u32			size;
+};
+
+static struct rmgr_ctrl rmgr;
+static int rmgr_init_done = 0;
+
+static void invert_flow_mask(struct ethtool_rx_flow_spec *fsp)
+{
+	int i;
+
+	for (i = 0; i < sizeof(fsp->m_u); i++)
+		fsp->m_u.hdata[i] ^= 0xFF;
+}
+
+static void rmgr_print_nfc_spec_ext(struct ethtool_rx_flow_spec *fsp)
+{
+	u64 data, datam;
+	__u16 etype, etypem, tci, tcim;
+
+	if (!(fsp->flow_type & FLOW_EXT))
+		return;
+
+	etype = ntohs(fsp->h_ext.vlan_etype);
+	etypem = ntohs(~fsp->m_ext.vlan_etype);
+	tci = ntohs(fsp->h_ext.vlan_tci);
+	tcim = ntohs(~fsp->m_ext.vlan_tci);
+	data = (u64)ntohl(fsp->h_ext.data[0]) << 32;
+	data = (u64)ntohl(fsp->h_ext.data[1]);
+	datam = (u64)ntohl(~fsp->m_ext.data[0]) << 32;
+	datam |= (u64)ntohl(~fsp->m_ext.data[1]);
+
+	fprintf(stdout,
+		"\tVLAN EtherType: 0x%x mask: 0x%x\n"
+		"\tVLAN: 0x%x mask: 0x%x\n"
+		"\tUser-defined: 0x%Lx mask: 0x%Lx\n",
+		etype, etypem, tci, tcim, data, datam);
+}
+
+static void rmgr_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
+{
+	unsigned char	*smac, *smacm, *dmac, *dmacm;
+	__u32		sip, dip, sipm, dipm, flow_type;
+	__u16		proto, protom;
+
+	if (fsp->location != RX_CLS_LOC_UNSPEC)
+		fprintf(stdout,	"Filter: %d\n", fsp->location);
+	else
+		fprintf(stdout,	"Filter: Unspecified\n");
+
+	flow_type = fsp->flow_type & ~FLOW_EXT;
+
+	invert_flow_mask(fsp);
+
+	switch (flow_type) {
+	case TCP_V4_FLOW:
+	case UDP_V4_FLOW:
+	case SCTP_V4_FLOW:
+	case AH_V4_FLOW:
+	case ESP_V4_FLOW:
+	case IP_USER_FLOW:
+		sip = ntohl(fsp->h_u.tcp_ip4_spec.ip4src);
+		dip = ntohl(fsp->h_u.tcp_ip4_spec.ip4dst);
+		sipm = ntohl(fsp->m_u.tcp_ip4_spec.ip4src);
+		dipm = ntohl(fsp->m_u.tcp_ip4_spec.ip4dst);
+
+		switch (flow_type) {
+		case TCP_V4_FLOW:
+			fprintf(stdout, "\tRule Type: TCP over IPv4\n");
+			break;
+		case UDP_V4_FLOW:
+			fprintf(stdout, "\tRule Type: UDP over IPv4\n");
+			break;
+		case SCTP_V4_FLOW:
+			fprintf(stdout, "\tRule Type: SCTP over IPv4\n");
+			break;
+		case AH_V4_FLOW:
+			fprintf(stdout, "\tRule Type: IPSEC AH over IPv4\n");
+			break;
+		case ESP_V4_FLOW:
+			fprintf(stdout, "\tRule Type: IPSEC ESP over IPv4\n");
+			break;
+		case IP_USER_FLOW:
+			fprintf(stdout, "\tRule Type: Raw IPv4\n");
+			break;
+		default:
+			break;
+		}
+
+		fprintf(stdout,
+			"\tSrc IP addr: %d.%d.%d.%d mask: %d.%d.%d.%d\n"
+			"\tDest IP addr: %d.%d.%d.%d mask: %d.%d.%d.%d\n"
+			"\tTOS: 0x%x mask: 0x%x\n",
+			(sip & 0xff000000) >> 24,
+			(sip & 0xff0000) >> 16,
+			(sip & 0xff00) >> 8,
+			sip & 0xff,
+			(sipm & 0xff000000) >> 24,
+			(sipm & 0xff0000) >> 16,
+			(sipm & 0xff00) >> 8,
+			sipm & 0xff,
+			(dip & 0xff000000) >> 24,
+			(dip & 0xff0000) >> 16,
+			(dip & 0xff00) >> 8,
+			dip & 0xff,
+			(dipm & 0xff000000) >> 24,
+			(dipm & 0xff0000) >> 16,
+			(dipm & 0xff00) >> 8,
+			dipm & 0xff,
+			fsp->h_u.tcp_ip4_spec.tos,
+			fsp->m_u.tcp_ip4_spec.tos);
+
+		switch (flow_type) {
+		case TCP_V4_FLOW:
+		case UDP_V4_FLOW:
+		case SCTP_V4_FLOW:
+			fprintf(stdout,
+				"\tSrc port: %d mask: 0x%x\n"
+				"\tDest port: %d mask: 0x%x\n",
+				ntohs(fsp->h_u.tcp_ip4_spec.psrc),
+				ntohs(fsp->m_u.tcp_ip4_spec.psrc),
+				ntohs(fsp->h_u.tcp_ip4_spec.pdst),
+				ntohs(fsp->m_u.tcp_ip4_spec.pdst));
+			break;
+		case AH_V4_FLOW:
+		case ESP_V4_FLOW:
+			fprintf(stdout,
+				"\tSPI: %d mask: 0x%x\n",
+				ntohl(fsp->h_u.esp_ip4_spec.spi),
+				ntohl(fsp->m_u.esp_ip4_spec.spi));
+			break;
+		case IP_USER_FLOW:
+			fprintf(stdout,
+				"\tProtocol: %d mask: 0x%x\n"
+				"\tL4 bytes: 0x%x mask: 0x%x\n",
+				fsp->h_u.usr_ip4_spec.proto,
+				fsp->m_u.usr_ip4_spec.proto,
+				ntohl(fsp->h_u.usr_ip4_spec.l4_4_bytes),
+				ntohl(fsp->m_u.usr_ip4_spec.l4_4_bytes));
+			break;
+		default:
+			break;
+		}
+		rmgr_print_nfc_spec_ext(fsp);
+		break;
+	case ETHER_FLOW:
+		dmac = fsp->h_u.ether_spec.h_dest;
+		dmacm = fsp->m_u.ether_spec.h_dest;
+		smac = fsp->h_u.ether_spec.h_source;
+		smacm = fsp->m_u.ether_spec.h_source;
+		proto = ntohs(fsp->h_u.ether_spec.h_proto);
+		protom = ntohs(fsp->m_u.ether_spec.h_proto);
+
+		fprintf(stdout,
+			"\tFlow Type: Raw Ethernet\n"
+			"\tSrc MAC addr: %02X:%02X:%02X:%02X:%02X:%02X"
+			" mask: %02X:%02X:%02X:%02X:%02X:%02X\n"
+			"\tDest MAC addr: %02X:%02X:%02X:%02X:%02X:%02X"
+			" mask: %02X:%02X:%02X:%02X:%02X:%02X\n"
+			"\tEthertype: 0x%X mask: 0x%X\n",
+			smac[0], smac[1], smac[2], smac[3], smac[4], smac[5],
+			smacm[0], smacm[1], smacm[2], smacm[3], smacm[4],
+			smacm[5], dmac[0], dmac[1], dmac[2], dmac[3], dmac[4],
+			dmac[5], dmacm[0], dmacm[1], dmacm[2], dmacm[3],
+			dmacm[4], dmacm[5], proto, protom);
+		rmgr_print_nfc_spec_ext(fsp);
+		break;
+	default:
+		fprintf(stdout,
+			"\tUnknown Flow type: %d\n", flow_type);
+		break;
+	}
+
+	if (fsp->ring_cookie != RX_CLS_FLOW_DISC)
+		fprintf(stdout, "\tAction: Direct to queue %llu\n",
+			fsp->ring_cookie);
+	else
+		fprintf(stdout, "\tAction: Drop\n");
+
+	fprintf(stdout, "\n");
+}
+
+static void rmgr_print_rule(struct ethtool_rx_flow_spec *fsp)
+{
+	/* print the rule in this location */
+	switch (fsp->flow_type & ~FLOW_EXT) {
+	case TCP_V4_FLOW:
+	case UDP_V4_FLOW:
+	case SCTP_V4_FLOW:
+	case AH_V4_FLOW:
+	case ESP_V4_FLOW:
+	case ETHER_FLOW:
+		rmgr_print_nfc_rule(fsp);
+		break;
+	case IP_USER_FLOW:
+		if (fsp->h_u.usr_ip4_spec.ip_ver == ETH_RX_NFC_IP4) {
+			rmgr_print_nfc_rule(fsp);
+			break;
+		}
+		/* IPv6 User Flow falls through to the case below */
+	case TCP_V6_FLOW:
+	case UDP_V6_FLOW:
+	case SCTP_V6_FLOW:
+	case AH_V6_FLOW:
+	case ESP_V6_FLOW:
+		fprintf(stderr, "IPv6 flows not implemented\n");
+		break;
+	default:
+		fprintf(stderr, "rmgr: Unknown flow type\n");
+		break;
+	}
+}
+
+static int rmgr_ins(__u32 loc)
+{
+	/* verify location is in rule manager range */
+	if ((loc < 0) || (loc >= rmgr.size)) {
+		fprintf(stderr, "rmgr: Location out of range\n");
+		return -1;
+	}
+
+	/* set bit for the rule */
+	set_bit(loc, rmgr.slot);
+
+	return 0;
+}
+
+static int rmgr_find(__u32 loc)
+{
+	/* verify location is in rule manager range */
+	if ((loc < 0) || (loc >= rmgr.size)) {
+		fprintf(stderr, "rmgr: Location out of range\n");
+		return -1;
+	}
+
+	/* if slot is found return 0 indicating success */
+	if (test_bit(loc, rmgr.slot))
+		return 0;
+
+	/* rule not found */
+	fprintf(stderr, "rmgr: No such rule\n");
+	return -1;
+}
+
+static int rmgr_del(__u32 loc)
+{
+	/* verify rule exists before attempting to delete */
+	int err = rmgr_find(loc);
+	if (err)
+		return err;
+
+	/* clear bit for the rule */
+	clear_bit(loc, rmgr.slot);
+
+	return 0;
+}
+
+static int rmgr_add(struct ethtool_rx_flow_spec *fsp)
+{
+	__u32 loc = fsp->location;
+
+	/* location provided, insert rule and update regions to match rule */
+	if (loc != RX_CLS_LOC_UNSPEC)
+		return rmgr_ins(loc);
+
+	/* start at the end of the list since it is lowest priority */
+	loc = rmgr.size - 1;
+
+	/* only part of last word is set so fill in remaining bits and test */
+	if (!~(rmgr.slot[loc / BITS_PER_LONG] |
+	       (~1UL << (loc % BITS_PER_LONG))))
+		loc -= 1 + (loc % BITS_PER_LONG);
+
+	/* find an open slot */
+	while (loc != RX_CLS_LOC_UNSPEC && !~rmgr.slot[loc / BITS_PER_LONG])
+		loc -= BITS_PER_LONG;
+
+	/* find and use available location in slot */
+	while (loc != RX_CLS_LOC_UNSPEC && test_bit(loc, rmgr.slot))
+		loc--;
+
+	/* location found, insert rule */
+	if (loc != RX_CLS_LOC_UNSPEC) {
+		fsp->location = loc;
+		return rmgr_ins(loc);
+	}
+
+	/* No space to add this rule */
+	fprintf(stderr, "rmgr: Cannot find appropriate slot to insert rule\n");
+
+	return -1;
+}
+
+static int rmgr_init(int fd, struct ifreq *ifr)
+{
+	struct ethtool_rxnfc *nfccmd;
+	int err, i;
+	__u32 *rule_locs;
+
+	if (rmgr_init_done)
+		return 0;
+
+	/* clear rule manager settings */
+	memset(&rmgr, 0, sizeof(struct rmgr_ctrl));
+
+	/* allocate memory for count request */
+	nfccmd = calloc(1, sizeof(*nfccmd));
+	if (!nfccmd) {
+		perror("rmgr: Cannot allocate memory for RX class rule data");
+		return -1;
+	}
+
+	/* request count and store in rmgr.n_rules */
+	nfccmd->cmd = ETHTOOL_GRXCLSRLCNT;
+	ifr->ifr_data = (caddr_t)nfccmd;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
+	rmgr.n_rules = nfccmd->rule_cnt;
+	free(nfccmd);
+	if (err < 0) {
+		perror("rmgr: Cannot get RX class rule count");
+		return -1;
+	}
+
+	/* alloc memory for request of location list */
+	nfccmd = calloc(1, sizeof(*nfccmd) + (rmgr.n_rules * sizeof(__u32)));
+	if (!nfccmd) {
+		perror("rmgr: Cannot allocate memory for"
+		       " RX class rule locations");
+		return -1;
+	}
+
+	/* request location list */
+	nfccmd->cmd = ETHTOOL_GRXCLSRLALL;
+	nfccmd->rule_cnt = rmgr.n_rules;
+	ifr->ifr_data = (caddr_t)nfccmd;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
+	if (err < 0) {
+		perror("rmgr: Cannot get RX class rules");
+		free(nfccmd);
+		return -1;
+	}
+
+	/* intitialize bitmap for storage of valid locations */
+	rmgr.size = nfccmd->data;
+	rmgr.slot = calloc(1, BITS_TO_LONGS(rmgr.size) * sizeof(long));
+	if (!rmgr.slot) {
+		perror("rmgr: Cannot allocate memory for RX class rules");
+		return -1;
+	}
+
+	/* write locations to bitmap */
+	rule_locs = nfccmd->rule_locs;
+	for (i = 0; i < rmgr.n_rules; i++) {
+		err = rmgr_ins(rule_locs[i]);
+		if (err < 0)
+			break;
+	}
+
+	/* free memory and set flag to avoid reinit */
+	free(nfccmd);
+	rmgr_init_done = 1;
+
+	return err;
+}
+
+static void rmgr_cleanup(void)
+{
+	if (!rmgr_init_done)
+		return;
+
+	rmgr_init_done = 0;
+
+	free(rmgr.slot);
+	rmgr.slot = NULL;
+	rmgr.size = 0;
+}
+
+int rxclass_rule_getall(int fd, struct ifreq *ifr)
+{
+	struct ethtool_rxnfc nfccmd;
+	int err, i, j;
+
+	/* init table of available rules */
+	err = rmgr_init(fd, ifr);
+	if (err < 0)
+		return err;
+
+	fprintf(stdout, "Total %d rules\n\n", rmgr.n_rules);
+
+	/* fetch and display all available rules */
+	for (i = 0; i < rmgr.size; i += BITS_PER_LONG) {
+		if (!~rmgr.slot[i / BITS_PER_LONG])
+			continue;
+		for (j = 0; j < BITS_PER_LONG; j++) {
+			if (!test_bit(i + j, rmgr.slot))
+				continue;
+			nfccmd.cmd = ETHTOOL_GRXCLSRULE;
+			memset(&nfccmd.fs, 0,
+			       sizeof(struct ethtool_rx_flow_spec));
+			nfccmd.fs.location = i + j;
+			ifr->ifr_data = (caddr_t)&nfccmd;
+			err = ioctl(fd, SIOCETHTOOL, ifr);
+			if (err < 0) {
+				perror("rmgr: Cannot get RX class rule");
+				return -1;
+			}
+			rmgr_print_rule(&nfccmd.fs);
+		}
+	}
+
+	rmgr_cleanup();
+
+	return 0;
+}
+
+int rxclass_rule_get(int fd, struct ifreq *ifr, __u32 loc)
+{
+	struct ethtool_rxnfc nfccmd;
+	int err;
+
+	/* init table of available rules */
+	err = rmgr_init(fd, ifr);
+	if (err < 0)
+		return err;
+
+	/* verify rule exists before attempting to display */
+	err = rmgr_find(loc);
+	if (err < 0)
+		return err;
+
+	/* fetch rule from netdev and display */
+	nfccmd.cmd = ETHTOOL_GRXCLSRULE;
+	memset(&nfccmd.fs, 0, sizeof(struct ethtool_rx_flow_spec));
+	nfccmd.fs.location = loc;
+	ifr->ifr_data = (caddr_t)&nfccmd;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
+	if (err < 0) {
+		perror("rmgr: Cannot get RX class rule");
+		return -1;
+	}
+	rmgr_print_rule(&nfccmd.fs);
+
+	rmgr_cleanup();
+
+	return 0;
+}
+
+int rxclass_rule_ins(int fd, struct ifreq *ifr,
+		     struct ethtool_rx_flow_spec *fsp)
+{
+	struct ethtool_rxnfc nfccmd;
+	int err;
+
+	/* init table of available rules */
+	err = rmgr_init(fd, ifr);
+	if (err < 0)
+		return err;
+
+	/* verify rule location */
+	err = rmgr_add(fsp);
+	if (err < 0)
+		return err;
+
+	/* notify netdev of new rule */
+	nfccmd.cmd = ETHTOOL_SRXCLSRLINS;
+	nfccmd.fs = *fsp;
+	ifr->ifr_data = (caddr_t)&nfccmd;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
+	if (err < 0) {
+		perror("rmgr: Cannot insert RX class rule");
+		return -1;
+	}
+	rmgr.n_rules++;
+
+	printf("Added rule with ID %d\n", fsp->location);
+
+	rmgr_cleanup();
+
+	return 0;
+}
+
+int rxclass_rule_del(int fd, struct ifreq *ifr, __u32 loc)
+{
+	struct ethtool_rxnfc nfccmd;
+	int err;
+
+	/* init table of available rules */
+	err = rmgr_init(fd, ifr);
+	if (err < 0)
+		return err;
+
+	/* verify rule exists */
+	err = rmgr_del(loc);
+	if (err < 0)
+		return err;
+
+	/* notify netdev of rule removal */
+	nfccmd.cmd = ETHTOOL_SRXCLSRLDEL;
+	nfccmd.fs.location = loc;
+	ifr->ifr_data = (caddr_t)&nfccmd;
+	err = ioctl(fd, SIOCETHTOOL, ifr);
+	if (err < 0) {
+		perror("rmgr: Cannot delete RX class rule");
+		return -1;
+	}
+	rmgr.n_rules--;
+
+	rmgr_cleanup();
+
+	return 0;
+}
+
+typedef enum {
+	OPT_NONE = 0,
+	OPT_S32,
+	OPT_U8,
+	OPT_U16,
+	OPT_U32,
+	OPT_U64,
+	OPT_BE16,
+	OPT_BE32,
+	OPT_BE64,
+	OPT_IP4,
+	OPT_MAC,
+} rule_opt_type_t;
+
+#define NFC_FLAG_RING		0x001
+#define NFC_FLAG_LOC		0x002
+#define NFC_FLAG_SADDR		0x004
+#define NFC_FLAG_DADDR		0x008
+#define NFC_FLAG_SPORT		0x010
+#define NFC_FLAG_DPORT		0x020
+#define NFC_FLAG_SPI		0x030
+#define NFC_FLAG_TOS		0x040
+#define NFC_FLAG_PROTO		0x080
+#define NTUPLE_FLAG_VLAN	0x100
+#define NTUPLE_FLAG_UDEF	0x200
+#define NTUPLE_FLAG_VETH	0x400
+
+struct rule_opts {
+	const char	*name;
+	rule_opt_type_t	type;
+	u32		flag;
+	int		offset;
+	int		moffset;
+};
+
+static struct rule_opts rule_nfc_tcp_ip4[] = {
+	{ "src-ip", OPT_IP4, NFC_FLAG_SADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.tcp_ip4_spec.ip4src),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.tcp_ip4_spec.ip4src) },
+	{ "dst-ip", OPT_IP4, NFC_FLAG_DADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.tcp_ip4_spec.ip4dst),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.tcp_ip4_spec.ip4dst) },
+	{ "tos", OPT_U8, NFC_FLAG_TOS,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.tcp_ip4_spec.tos),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.tcp_ip4_spec.tos) },
+	{ "src-port", OPT_BE16, NFC_FLAG_SPORT,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.tcp_ip4_spec.psrc),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.tcp_ip4_spec.psrc) },
+	{ "dst-port", OPT_BE16, NFC_FLAG_DPORT,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.tcp_ip4_spec.pdst),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.tcp_ip4_spec.pdst) },
+	{ "action", OPT_U64, NFC_FLAG_RING,
+	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
+	{ "loc", OPT_U32, NFC_FLAG_LOC,
+	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
+	{ "vlan-etype", OPT_BE16, NTUPLE_FLAG_VETH,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_etype),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_etype) },
+	{ "vlan", OPT_BE16, NTUPLE_FLAG_VLAN,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_tci),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_tci) },
+	{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.data),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+};
+
+static struct rule_opts rule_nfc_esp_ip4[] = {
+	{ "src-ip", OPT_IP4, NFC_FLAG_SADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.esp_ip4_spec.ip4src),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.esp_ip4_spec.ip4src) },
+	{ "dst-ip", OPT_IP4, NFC_FLAG_DADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.esp_ip4_spec.ip4dst),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.esp_ip4_spec.ip4dst) },
+	{ "tos", OPT_U8, NFC_FLAG_TOS,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.esp_ip4_spec.tos),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.esp_ip4_spec.tos) },
+	{ "spi", OPT_BE32, NFC_FLAG_SPI,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.esp_ip4_spec.spi),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.esp_ip4_spec.spi) },
+	{ "action", OPT_U64, NFC_FLAG_RING,
+	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
+	{ "loc", OPT_U32, NFC_FLAG_LOC,
+	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
+	{ "vlan-etype", OPT_BE16, NTUPLE_FLAG_VETH,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_etype),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_etype) },
+	{ "vlan", OPT_BE16, NTUPLE_FLAG_VLAN,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_tci),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_tci) },
+	{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.data),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+};
+
+static struct rule_opts rule_nfc_usr_ip4[] = {
+	{ "src-ip", OPT_IP4, NFC_FLAG_SADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.ip4src),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.ip4src) },
+	{ "dst-ip", OPT_IP4, NFC_FLAG_DADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.ip4dst),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.ip4dst) },
+	{ "tos", OPT_U8, NFC_FLAG_TOS,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.tos),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.tos) },
+	{ "l4proto", OPT_U8, NFC_FLAG_PROTO,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.proto),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.proto) },
+	{ "spi", OPT_BE32, NFC_FLAG_SPI,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.l4_4_bytes),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.l4_4_bytes) },
+	{ "src-port", OPT_BE16, NFC_FLAG_SPORT,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.l4_4_bytes),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.l4_4_bytes) },
+	{ "dst-port", OPT_BE16, NFC_FLAG_DPORT,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.usr_ip4_spec.l4_4_bytes) + 2,
+	  offsetof(struct ethtool_rx_flow_spec, m_u.usr_ip4_spec.l4_4_bytes) + 2 },
+	{ "action", OPT_U64, NFC_FLAG_RING,
+	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
+	{ "loc", OPT_U32, NFC_FLAG_LOC,
+	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
+	{ "vlan-etype", OPT_BE16, NTUPLE_FLAG_VETH,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_etype),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_etype) },
+	{ "vlan", OPT_BE16, NTUPLE_FLAG_VLAN,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_tci),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_tci) },
+	{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.data),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+};
+
+static struct rule_opts rule_nfc_ether[] = {
+	{ "src", OPT_MAC, NFC_FLAG_SADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.ether_spec.h_dest),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.ether_spec.h_dest) },
+	{ "dst", OPT_MAC, NFC_FLAG_DADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.ether_spec.h_source),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.ether_spec.h_source) },
+	{ "proto", OPT_BE16, NFC_FLAG_PROTO,
+	  offsetof(struct ethtool_rx_flow_spec, h_u.ether_spec.h_proto),
+	  offsetof(struct ethtool_rx_flow_spec, m_u.ether_spec.h_proto) },
+	{ "action", OPT_U64, NFC_FLAG_RING,
+	  offsetof(struct ethtool_rx_flow_spec, ring_cookie), -1 },
+	{ "loc", OPT_U32, NFC_FLAG_LOC,
+	  offsetof(struct ethtool_rx_flow_spec, location), -1 },
+	{ "vlan-etype", OPT_BE16, NTUPLE_FLAG_VETH,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_etype),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_etype) },
+	{ "vlan", OPT_BE16, NTUPLE_FLAG_VLAN,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.vlan_tci),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.vlan_tci) },
+	{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.data),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+};
+
+static int rxclass_get_long(char *str, long long *val, int size)
+{
+	long long max = ~0ULL >> (65 - size);
+	char *endp;
+
+	errno = 0;
+
+	*val = strtoll(str, &endp, 0);
+
+	if (*endp || errno || (*val > max) || (*val < ~max))
+		return -1;
+
+	return 0;
+}
+
+static int rxclass_get_ulong(char *str, unsigned long long *val, int size)
+{
+	long long max = ~0ULL >> (64 - size);
+	char *endp;
+
+	errno = 0;
+
+	*val = strtoull(str, &endp, 0);
+
+	if (*endp || errno || (*val > max))
+		return -1;
+
+	return 0;
+}
+
+static int rxclass_get_ipv4(char *str, __be32 *val)
+{
+	if (!strchr(str, '.')) {
+		unsigned long long v;
+		int err;
+
+		err = rxclass_get_ulong(str, &v, 32);
+		if (err)
+			return -1;
+
+		*val = htonl((u32)v);
+
+		return 0;
+	}
+
+	if (!inet_pton(AF_INET, str, val))
+		return -1;
+
+	return 0;
+}
+
+static int rxclass_get_ether(char *str, unsigned char *val)
+{
+	unsigned int buf[ETH_ALEN];
+	int count;
+
+	if (!strchr(str, ':'))
+		return -1;
+
+	count = sscanf(str, "%2x:%2x:%2x:%2x:%2x:%2x",
+		       &buf[0], &buf[1], &buf[2],
+		       &buf[3], &buf[4], &buf[5]);
+
+	if (count != ETH_ALEN)
+		return -1;
+
+	do {
+		count--;
+		val[count] = buf[count];
+	} while (count);
+
+	return 0;
+}
+
+static int rxclass_get_val(char *str, unsigned char *p, u32 *flags,
+			   const struct rule_opts *opt)
+{
+	unsigned long long mask = ~0ULL;
+	int err = 0;
+
+	if (*flags & opt->flag)
+		return -1;
+
+	*flags |= opt->flag;
+
+	switch (opt->type) {
+	case OPT_S32: {
+		long long val;
+		err = rxclass_get_long(str, &val, 32);
+		if (err)
+			return -1;
+		*(int *)&p[opt->offset] = (int)val;
+		if (opt->moffset >= 0)
+			*(int *)&p[opt->moffset] = (int)mask;
+		break;
+	}
+	case OPT_U8: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 8);
+		if (err)
+			return -1;
+		*(u8 *)&p[opt->offset] = (u8)val;
+		if (opt->moffset >= 0)
+			*(u8 *)&p[opt->moffset] = (u8)mask;
+		break;
+	}
+	case OPT_U16: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 16);
+		if (err)
+			return -1;
+		*(u16 *)&p[opt->offset] = (u16)val;
+		if (opt->moffset >= 0)
+			*(u16 *)&p[opt->moffset] = (u16)mask;
+		break;
+	}
+	case OPT_U32: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 32);
+		if (err)
+			return -1;
+		*(u32 *)&p[opt->offset] = (u32)val;
+		if (opt->moffset >= 0)
+			*(u32 *)&p[opt->moffset] = (u32)mask;
+		break;
+	}
+	case OPT_U64: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 64);
+		if (err)
+			return -1;
+		*(u64 *)&p[opt->offset] = (u64)val;
+		if (opt->moffset >= 0)
+			*(u64 *)&p[opt->moffset] = (u64)mask;
+		break;
+	}
+	case OPT_BE16: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 16);
+		if (err)
+			return -1;
+		*(__be16 *)&p[opt->offset] = htons((u16)val);
+		if (opt->moffset >= 0)
+			*(__be16 *)&p[opt->moffset] = (__be16)mask;
+		break;
+	}
+	case OPT_BE32: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 32);
+		if (err)
+			return -1;
+		*(__be32 *)&p[opt->offset] = htonl((u32)val);
+		if (opt->moffset >= 0)
+			*(__be32 *)&p[opt->moffset] = (__be32)mask;
+		break;
+	}
+	case OPT_BE64: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 64);
+		if (err)
+			return -1;
+		*(__be64 *)&p[opt->offset] = htonll((u64)val);
+		if (opt->moffset >= 0)
+			*(__be64 *)&p[opt->moffset] = (__be64)mask;
+		break;
+	}
+	case OPT_IP4: {
+		__be32 val;
+		err = rxclass_get_ipv4(str, &val);
+		if (err)
+			return -1;
+		*(__be32 *)&p[opt->offset] = val;
+		if (opt->moffset >= 0)
+			*(__be32 *)&p[opt->moffset] = (__be32)mask;
+		break;
+	}
+	case OPT_MAC: {
+		unsigned char val[ETH_ALEN];
+		err = rxclass_get_ether(str, val);
+		if (err)
+			return -1;
+		memcpy(&p[opt->offset], val, ETH_ALEN);
+		if (opt->moffset >= 0)
+			memcpy(&p[opt->moffset], &mask, ETH_ALEN);
+		break;
+	}
+	case OPT_NONE:
+	default:
+		return -1;
+	}
+
+	return 0;
+}
+
+static int rxclass_get_mask(char *str, unsigned char *p,
+			    const struct rule_opts *opt)
+{
+	int err = 0;
+
+	if (opt->moffset < 0)
+		return -1;
+
+	switch (opt->type) {
+	case OPT_S32: {
+		long long val;
+		err = rxclass_get_long(str, &val, 32);
+		if (err)
+			return -1;
+		*(int *)&p[opt->moffset] = ~(int)val;
+		break;
+	}
+	case OPT_U8: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 8);
+		if (err)
+			return -1;
+		*(u8 *)&p[opt->moffset] = ~(u8)val;
+		break;
+	}
+	case OPT_U16: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 16);
+		if (err)
+			return -1;
+		*(u16 *)&p[opt->moffset] = ~(u16)val;
+		break;
+	}
+	case OPT_U32: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 32);
+		if (err)
+			return -1;
+		*(u32 *)&p[opt->moffset] = ~(u32)val;
+		break;
+	}
+	case OPT_U64: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 64);
+		if (err)
+			return -1;
+		*(u64 *)&p[opt->moffset] = ~(u64)val;
+		break;
+	}
+	case OPT_BE16: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 16);
+		if (err)
+			return -1;
+		*(__be16 *)&p[opt->moffset] = ~htons((u16)val);
+		break;
+	}
+	case OPT_BE32: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 32);
+		if (err)
+			return -1;
+		*(__be32 *)&p[opt->moffset] = ~htonl((u32)val);
+		break;
+	}
+	case OPT_BE64: {
+		unsigned long long val;
+		err = rxclass_get_ulong(str, &val, 64);
+		if (err)
+			return -1;
+		*(__be64 *)&p[opt->moffset] = ~htonll((u64)val);
+		break;
+	}
+	case OPT_IP4: {
+		__be32 val;
+		err = rxclass_get_ipv4(str, &val);
+		if (err)
+			return -1;
+		*(__be32 *)&p[opt->moffset] = ~val;
+		break;
+	}
+	case OPT_MAC: {
+		unsigned char val[ETH_ALEN];
+		int i;
+		err = rxclass_get_ether(str, val);
+		if (err)
+			return -1;
+
+		for (i = 0; i < ETH_ALEN; i++)
+			val[i] = ~val[i];
+
+		memcpy(&p[opt->moffset], val, ETH_ALEN);
+		break;
+	}
+	case OPT_NONE:
+	default:
+		return -1;
+	}
+
+	return 0;
+}
+
+int rxclass_parse_ruleopts(char **argp, int argc,
+			   struct ethtool_rx_flow_spec *fsp)
+{
+	const struct rule_opts *options;
+	unsigned char *p = (unsigned char *)fsp;
+	int i = 0, n_opts, err;
+	u32 flags = 0;
+	int flow_type;
+
+	if (*argp == NULL || **argp == '\0' || argc < 1)
+		goto syntax_err;
+
+	if (!strcmp(argp[0], "tcp4"))
+		flow_type = TCP_V4_FLOW;
+	else if (!strcmp(argp[0], "udp4"))
+		flow_type = UDP_V4_FLOW;
+	else if (!strcmp(argp[0], "sctp4"))
+		flow_type = SCTP_V4_FLOW;
+	else if (!strcmp(argp[0], "ah4"))
+		flow_type = AH_V4_FLOW;
+	else if (!strcmp(argp[0], "esp4"))
+		flow_type = ESP_V4_FLOW;
+	else if (!strcmp(argp[0], "ip4"))
+		flow_type = IP_USER_FLOW;
+	else if (!strcmp(argp[0], "ether"))
+		flow_type = ETHER_FLOW;
+	else
+		goto syntax_err;
+
+	switch (flow_type) {
+	case TCP_V4_FLOW:
+	case UDP_V4_FLOW:
+	case SCTP_V4_FLOW:
+		options = rule_nfc_tcp_ip4;
+		n_opts = ARRAY_SIZE(rule_nfc_tcp_ip4);
+		break;
+	case AH_V4_FLOW:
+	case ESP_V4_FLOW:
+		options = rule_nfc_esp_ip4;
+		n_opts = ARRAY_SIZE(rule_nfc_esp_ip4);
+		break;
+	case IP_USER_FLOW:
+		options = rule_nfc_usr_ip4;
+		n_opts = ARRAY_SIZE(rule_nfc_usr_ip4);
+		break;
+	case ETHER_FLOW:
+		options = rule_nfc_ether;
+		n_opts = ARRAY_SIZE(rule_nfc_ether);
+		break;
+	default:
+		fprintf(stdout, "Add rule, invalid rule type[%s]\n", argp[0]);
+		return -1;
+	}
+
+	memset(p, 0, sizeof(*fsp));
+	fsp->flow_type = flow_type;
+	fsp->location = RX_CLS_LOC_UNSPEC;
+
+	for (i = 1; i < argc;) {
+		const struct rule_opts *opt;
+		int idx;
+		for (opt = options, idx = 0; idx < n_opts; idx++, opt++) {
+			char mask_name[16];
+
+			if (strcmp(argp[i], opt->name))
+				continue;
+
+			i++;
+			if (i >= argc)
+				break;
+
+			err = rxclass_get_val(argp[i], p, &flags, opt);
+			if (err) {
+				fprintf(stderr, "Invalid %s value[%s]\n",
+					opt->name, argp[i]);
+				return -1;
+			}
+
+			i++;
+			if (i >= argc)
+				break;
+
+			sprintf(mask_name, "%s-mask", opt->name);
+			if (strcmp(argp[i], "m") && strcmp(argp[i], mask_name))
+				break;
+
+			i++;
+			if (i >= argc)
+				goto syntax_err;
+
+			err = rxclass_get_mask(argp[i], p, opt);
+			if (err) {
+				fprintf(stderr, "Invalid %s mask[%s]\n",
+					opt->name, argp[i]);
+				return -1;
+			}
+
+			i++;
+
+			break;
+		}
+		if (idx == n_opts) {
+			fprintf(stdout, "Add rule, unreconized option[%s]\n",
+				argp[i]);
+			return -1;
+		}
+	}
+
+	if (flags & (NTUPLE_FLAG_VLAN | NTUPLE_FLAG_UDEF | NTUPLE_FLAG_VETH))
+		fsp->flow_type |= FLOW_EXT;
+
+	return 0;
+
+syntax_err:
+	fprintf(stdout, "Add rule, invalid syntax\n");
+	return -1;
+}


^ permalink raw reply related

* [ethtool PATCH 4/6] Add support for __be64 and bitops to ethtool
From: Alexander Duyck @ 2011-04-21 20:40 UTC (permalink / raw)
  To: davem, jeffrey.t.kirsher, bhutchings; +Cc: netdev
In-Reply-To: <20110421202857.23054.63316.stgit@gitlad.jf.intel.com>

This change is meant to add support for __be64 values and bitops to
ethtool.  These changes will be needed in order to support network flow
classifier rule configuration.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 ethtool-bitops.h |   25 +++++++++++++++++++++++++
 ethtool-util.h   |   30 ++++++++++++++++++++++++++----
 ethtool.c        |    7 -------
 3 files changed, 51 insertions(+), 11 deletions(-)
 create mode 100644 ethtool-bitops.h

diff --git a/ethtool-bitops.h b/ethtool-bitops.h
new file mode 100644
index 0000000..0ff14f1
--- /dev/null
+++ b/ethtool-bitops.h
@@ -0,0 +1,25 @@
+#ifndef ETHTOOL_BITOPS_H__
+#define ETHTOOL_BITOPS_H__
+
+#define BITS_PER_BYTE		8
+#define BITS_PER_LONG		(BITS_PER_BYTE * sizeof(long))
+#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
+#define BITS_TO_LONGS(nr)	DIV_ROUND_UP(nr, BITS_PER_LONG)
+
+static inline void set_bit(int nr, unsigned long *addr)
+{
+	addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
+}
+
+static inline void clear_bit(int nr, unsigned long *addr)
+{
+	addr[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
+}
+
+static __always_inline int test_bit(unsigned int nr, const unsigned long *addr)
+{
+	return !!((1UL << (nr % BITS_PER_LONG)) &
+		  (((unsigned long *)addr)[nr / BITS_PER_LONG]));
+}
+
+#endif
diff --git a/ethtool-util.h b/ethtool-util.h
index f053028..3d46faf 100644
--- a/ethtool-util.h
+++ b/ethtool-util.h
@@ -5,15 +5,18 @@
 
 #include <sys/types.h>
 #include <endian.h>
+#include <sys/ioctl.h>
+#include <net/if.h>
+#include "ethtool-config.h"
+#include "ethtool-copy.h"
 
 /* ethtool.h expects these to be defined by <linux/types.h> */
 #ifndef HAVE_BE_TYPES
 typedef __uint16_t __be16;
 typedef __uint32_t __be32;
+typedef unsigned long long __be64;
 #endif
 
-#include "ethtool-copy.h"
-
 typedef unsigned long long u64;
 typedef __uint32_t u32;
 typedef __uint16_t u16;
@@ -23,11 +26,15 @@ typedef __int32_t s32;
 #if __BYTE_ORDER == __BIG_ENDIAN
 static inline u16 cpu_to_be16(u16 value)
 {
-    return value;
+	return value;
 }
 static inline u32 cpu_to_be32(u32 value)
 {
-    return value;
+	return value;
+}
+static inline u64 cpu_to_be64(u64 value)
+{
+	return value;
 }
 #else
 static inline u16 cpu_to_be16(u16 value)
@@ -38,6 +45,21 @@ static inline u32 cpu_to_be32(u32 value)
 {
 	return cpu_to_be16(value >> 16) | (cpu_to_be16(value) << 16);
 }
+static inline u64 cpu_to_be64(u64 value)
+{
+	return cpu_to_be32(value >> 32) | ((u64)cpu_to_be32(value) << 32);
+}
+#endif
+
+#define ntohll cpu_to_be64
+#define htonll cpu_to_be64
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+#ifndef SIOCETHTOOL
+#define SIOCETHTOOL     0x8946
 #endif
 
 /* National Semiconductor DP83815, DP83816 */
diff --git a/ethtool.c b/ethtool.c
index 9ad7000..15af86a 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -45,16 +45,9 @@
 #include <linux/sockios.h>
 #include "ethtool-util.h"
 
-
-#ifndef SIOCETHTOOL
-#define SIOCETHTOOL     0x8946
-#endif
 #ifndef MAX_ADDR_LEN
 #define MAX_ADDR_LEN	32
 #endif
-#ifndef ARRAY_SIZE
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
-#endif
 
 #ifndef HAVE_NETIF_MSG
 enum {


^ permalink raw reply related

* [ethtool PATCH 3/6] ethtool: remove strings based approach for displaying n-tuple
From: Alexander Duyck @ 2011-04-21 20:40 UTC (permalink / raw)
  To: davem, jeffrey.t.kirsher, bhutchings; +Cc: netdev
In-Reply-To: <20110421202857.23054.63316.stgit@gitlad.jf.intel.com>

This change is meant to remove the strings based approach for displaying
n-tuple filters.  A follow-on patch will replace that functionality with a
network flow classification based approach that will get the number of
filters, get their locations, and then request and display them
individually.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 ethtool.c |   44 --------------------------------------------
 1 files changed, 0 insertions(+), 44 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index cfdac65..9ad7000 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -3194,50 +3194,6 @@ static int do_srxntuple(int fd, struct ifreq *ifr)
 
 static int do_grxntuple(int fd, struct ifreq *ifr)
 {
-	struct ethtool_sset_info *sset_info;
-	struct ethtool_gstrings *strings;
-	int sz_str, n_strings, err, i;
-
-	sset_info = malloc(sizeof(struct ethtool_sset_info) + sizeof(u32));
-	sset_info->cmd = ETHTOOL_GSSET_INFO;
-	sset_info->sset_mask = (1ULL << ETH_SS_NTUPLE_FILTERS);
-	ifr->ifr_data = (caddr_t)sset_info;
-	err = send_ioctl(fd, ifr);
-
-	if ((err < 0) ||
-	    (!(sset_info->sset_mask & (1ULL << ETH_SS_NTUPLE_FILTERS)))) {
-		perror("Cannot get driver strings info");
-		return 100;
-	}
-
-	n_strings = sset_info->data[0];
-	free(sset_info);
-	sz_str = n_strings * ETH_GSTRING_LEN;
-
-	strings = calloc(1, sz_str + sizeof(struct ethtool_gstrings));
-	if (!strings) {
-		fprintf(stderr, "no memory available\n");
-		return 95;
-	}
-
-	strings->cmd = ETHTOOL_GRXNTUPLE;
-	strings->string_set = ETH_SS_NTUPLE_FILTERS;
-	strings->len = n_strings;
-	ifr->ifr_data = (caddr_t) strings;
-	err = send_ioctl(fd, ifr);
-	if (err < 0) {
-		perror("Cannot get Rx n-tuple information");
-		free(strings);
-		return 101;
-	}
-
-	n_strings = strings->len;
-	fprintf(stdout, "Rx n-tuple filters:\n");
-	for (i = 0; i < n_strings; i++)
-		fprintf(stdout, "%s", &strings->data[i * ETH_GSTRING_LEN]);
-
-	free(strings);
-
 	return 0;
 }
 


^ permalink raw reply related

* [ethtool PATCH 2/6] Add support for NFC flow classifier extensions
From: Alexander Duyck @ 2011-04-21 20:40 UTC (permalink / raw)
  To: davem, jeffrey.t.kirsher, bhutchings; +Cc: netdev
In-Reply-To: <20110421202857.23054.63316.stgit@gitlad.jf.intel.com>

This change makes it so that we can add VLAN Ethertype, TCI, and 64bits of
driver defined data to a network flow classifier allowing us to handle a
n-tuple flow contained within a network flow classifier.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 ethtool-copy.h |   39 ++++++++++++++++++++++++++++-----------
 1 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 75c3ae7..5b45442 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -376,27 +376,42 @@ struct ethtool_usrip4_spec {
 	__u8    proto;
 };
 
+union ethtool_flow_union {
+	struct ethtool_tcpip4_spec		tcp_ip4_spec;
+	struct ethtool_tcpip4_spec		udp_ip4_spec;
+	struct ethtool_tcpip4_spec		sctp_ip4_spec;
+	struct ethtool_ah_espip4_spec		ah_ip4_spec;
+	struct ethtool_ah_espip4_spec		esp_ip4_spec;
+	struct ethtool_usrip4_spec		usr_ip4_spec;
+	struct ethhdr				ether_spec;
+	__u8					hdata[60];
+};
+
+struct ethtool_flow_ext {
+	__be16	vlan_etype;
+	__be16	vlan_tci;
+	__be32	data[2];
+};
+
 /**
  * struct ethtool_rx_flow_spec - specification for RX flow filter
  * @flow_type: Type of match to perform, e.g. %TCP_V4_FLOW
  * @h_u: Flow fields to match (dependent on @flow_type)
- * @m_u: Masks for flow field bits to be ignored
+ * @h_ext: Additional fields to match
+ * @m_u: Masks for flow field bits to be matched
+ * @m_ext: Masks for additional field bits to be matched
+ *	Note, all additional fields must be ignored unless @flow_type
+ *	includes the %FLOW_EXT flag.
  * @ring_cookie: RX ring/queue index to deliver to, or %RX_CLS_FLOW_DISC
  *	if packets should be discarded
  * @location: Index of filter in hardware table
  */
 struct ethtool_rx_flow_spec {
 	__u32		flow_type;
-	union {
-		struct ethtool_tcpip4_spec		tcp_ip4_spec;
-		struct ethtool_tcpip4_spec		udp_ip4_spec;
-		struct ethtool_tcpip4_spec		sctp_ip4_spec;
-		struct ethtool_ah_espip4_spec		ah_ip4_spec;
-		struct ethtool_ah_espip4_spec		esp_ip4_spec;
-		struct ethtool_usrip4_spec		usr_ip4_spec;
-		struct ethhdr				ether_spec;
-		__u8					hdata[72];
-	} h_u, m_u;
+	union ethtool_flow_union h_u;
+	struct ethtool_flow_ext h_ext;
+	union ethtool_flow_union m_u;
+	struct ethtool_flow_ext m_ext;
 	__u64		ring_cookie;
 	__u32		location;
 };
@@ -706,6 +721,8 @@ struct ethtool_flash {
 #define	IPV4_FLOW	0x10	/* hash only */
 #define	IPV6_FLOW	0x11	/* hash only */
 #define	ETHER_FLOW	0x12	/* spec only (ether_spec) */
+/* Flag to enable additional fields in struct ethtool_rx_flow_spec */
+#define	FLOW_EXT	0x80000000
 
 /* L3-L4 network traffic flow hash options */
 #define	RXH_L2DA	(1 << 1)


^ permalink raw reply related

* [ethtool PATCH 1/6] Add support for ESP as a separate protocol from AH
From: Alexander Duyck @ 2011-04-21 20:40 UTC (permalink / raw)
  To: davem, jeffrey.t.kirsher, bhutchings; +Cc: netdev
In-Reply-To: <20110421202857.23054.63316.stgit@gitlad.jf.intel.com>

This change is mostly cosmetic.  NIU had supported AH and ESP seperately.
As such it is possible that a return value of ESP or AH may be returned
for a has request instead of the AH_ESP combined value.  To resolve that
the inputs are combined for AH and ESP into the AH_ESP value and return
values for AH and ESP will display the combined string info.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 ethtool.8.in |    8 +++++---
 ethtool.c    |   21 ++++++++++++---------
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/ethtool.8.in b/ethtool.8.in
index 714486e..12a1d1d 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -52,7 +52,7 @@
 .\"
 .\"	\(*FL - flow type values
 .\"
-.ds FL \fBtcp4\fP|\fBudp4\fP|\fBah4\fP|\fBsctp4\fP|\fBtcp6\fP|\fBudp6\fP|\fBah6\fP|\fBsctp6\fP
+.ds FL \fBtcp4\fP|\fBudp4\fP|\fBah4\fP|\fBesp4\fP|\fBsctp4\fP|\fBtcp6\fP|\fBudp6\fP|\fBah6\fP|\fBesp6\fP|\fBsctp6\fP
 .\"
 .\"	\(*HO - hash options
 .\"
@@ -572,11 +572,13 @@ nokeep;
 lB	l.
 tcp4	TCP over IPv4
 udp4	UDP over IPv4
-ah4	IPSEC AH/ESP over IPv4
+ah4	IPSEC AH over IPv4
+esp4	IPSEC ESP over IPv4
 sctp4	SCTP over IPv4
 tcp6	TCP over IPv6
 udp6	UDP over IPv6
-ah6	IPSEC AH/ESP over IPv6
+ah6	IPSEC AH over IPv6
+esp6	IPSEC ESP over IPv6
 sctp6	SCTP over IPv6
 .TE
 .TP
diff --git a/ethtool.c b/ethtool.c
index 32df0ee..cfdac65 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -32,7 +32,6 @@
 #include <sys/ioctl.h>
 #include <sys/stat.h>
 #include <stdio.h>
-#include <string.h>
 #include <errno.h>
 #include <net/if.h>
 #include <sys/utsname.h>
@@ -233,15 +232,15 @@ static struct option {
     { "-S", "--statistics", MODE_GSTATS, "Show adapter statistics" },
     { "-n", "--show-nfc", MODE_GNFC, "Show Rx network flow classification "
 		"options",
-		"		[ rx-flow-hash tcp4|udp4|ah4|sctp4|"
-		"tcp6|udp6|ah6|sctp6 ]\n" },
+		"		[ rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|"
+		"tcp6|udp6|ah6|esp6|sctp6 ]\n" },
     { "-f", "--flash", MODE_FLASHDEV, "FILENAME " "Flash firmware image "
     		"from the specified file to a region on the device",
 		"               [ REGION-NUMBER-TO-FLASH ]\n" },
     { "-N", "--config-nfc", MODE_SNFC, "Configure Rx network flow "
 		"classification options",
-		"		[ rx-flow-hash tcp4|udp4|ah4|sctp4|"
-		"tcp6|udp6|ah6|sctp6 m|v|t|s|d|f|n|r... ]\n" },
+		"		[ rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|"
+		"tcp6|udp6|ah6|esp6|sctp6 m|v|t|s|d|f|n|r... ]\n" },
     { "-x", "--show-rxfh-indir", MODE_GRXFHINDIR, "Show Rx flow hash "
 		"indirection" },
     { "-X", "--set-rxfh-indir", MODE_SRXFHINDIR, "Set Rx flow hash indirection",
@@ -783,7 +782,7 @@ static int rxflow_str_to_type(const char *str)
 		flow_type = TCP_V4_FLOW;
 	else if (!strcmp(str, "udp4"))
 		flow_type = UDP_V4_FLOW;
-	else if (!strcmp(str, "ah4"))
+	else if (!strcmp(str, "ah4") || !strcmp(str, "esp4"))
 		flow_type = AH_ESP_V4_FLOW;
 	else if (!strcmp(str, "sctp4"))
 		flow_type = SCTP_V4_FLOW;
@@ -791,7 +790,7 @@ static int rxflow_str_to_type(const char *str)
 		flow_type = TCP_V6_FLOW;
 	else if (!strcmp(str, "udp6"))
 		flow_type = UDP_V6_FLOW;
-	else if (!strcmp(str, "ah6"))
+	else if (!strcmp(str, "ah6") || !strcmp(str, "esp6"))
 		flow_type = AH_ESP_V6_FLOW;
 	else if (!strcmp(str, "sctp6"))
 		flow_type = SCTP_V6_FLOW;
@@ -1941,7 +1940,9 @@ static int dump_rxfhash(int fhash, u64 val)
 		fprintf(stdout, "SCTP over IPV4 flows");
 		break;
 	case AH_ESP_V4_FLOW:
-		fprintf(stdout, "IPSEC AH over IPV4 flows");
+	case AH_V4_FLOW:
+	case ESP_V4_FLOW:
+		fprintf(stdout, "IPSEC AH/ESP over IPV4 flows");
 		break;
 	case TCP_V6_FLOW:
 		fprintf(stdout, "TCP over IPV6 flows");
@@ -1953,7 +1954,9 @@ static int dump_rxfhash(int fhash, u64 val)
 		fprintf(stdout, "SCTP over IPV6 flows");
 		break;
 	case AH_ESP_V6_FLOW:
-		fprintf(stdout, "IPSEC AH over IPV6 flows");
+	case AH_V6_FLOW:
+	case ESP_V6_FLOW:
+		fprintf(stdout, "IPSEC AH/ESP over IPV6 flows");
 		break;
 	default:
 		break;


^ permalink raw reply related

* [ethtool PATCH 0/6] Network flow classifier
From: Alexander Duyck @ 2011-04-21 20:40 UTC (permalink / raw)
  To: davem, jeffrey.t.kirsher, bhutchings; +Cc: netdev

The following patches are meant to add support for the network flow classifier
interface built into the ethtool_rxnfc.

Once this has been accepted I plan to submit the kernel portion to Jeff
Kirsher as it is almost all ixgbe code changes with the exception of the
removal of support for ethtool_get_rx_ntuple from the kernel.

Thanks,

Alex

---

Alexander Duyck (5):
      Update documentation for -u/-U operations
      Add support for __be64 and bitops to ethtool
      ethtool: remove strings based approach for displaying n-tuple
      Add support for NFC flow classifier extensions
      Add support for ESP as a separate protocol from AH

Santwona Behera (1):
      v4 Add RX packet classification interface


 Makefile.am      |    3 
 ethtool-bitops.h |   25 +
 ethtool-copy.h   |   39 +-
 ethtool-util.h   |   44 ++
 ethtool.8.in     |  193 +++++----
 ethtool.c        |  449 +++++++++++-----------
 rxclass.c        | 1106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 1519 insertions(+), 340 deletions(-)
 create mode 100644 ethtool-bitops.h
 create mode 100644 rxclass.c

-- 

^ permalink raw reply

* Re: Hight speed data sending from custom IP out of kernel
From: juice @ 2011-04-21 20:27 UTC (permalink / raw)
  To: zhou rui, monstr, netdev
In-Reply-To: <BANLkTikF5qz4pBLHsTZS1+oQxYAfQSD+cg@mail.gmail.com>

>> There is not a lot of documentation, and the module is still "work in
>> progress" as I am going to fix it to work with more than one interface
>> at
>> the same time when I get to do it. Currently it can only use one
>> interface
>> on the sending host machine.
>>
>
> does it use the same command/config file as pktgen?
> or special command?
>

The commands are similar type but different than pktgen.

First, when loading the module the intreface to be used must be given as
parameter, like this: "sudo insmod ./streamgen.ko interface=eth0"
Optional parameter is the UDP port that the userspace loader program uses
to communicate with the module. default port is 5555.

Then, the streamer module creates file "/proc/net/streamgen/control" which
is used to read status of module and command it:

juice@alaspin:~$ sudo cat /proc/net/streamgen/control
Stream Generator v.0.02

  State:      off
  Repeat:     1
  Queue size: 600 packets

juice@alaspin:~$

The commands that can be used are "repeat <num>", "start", "stop",
"delete" and "clear_counters"
The repeat command sets how many times the loaded sequence is output, zero
means forever. Commands run and stop work as expected. Command delete
clears the sequence from memory and command clear_counters zeroes the
packet counters.

^ permalink raw reply

* Re: [PATCHv4] usbnet: Resubmit interrupt URB once if halted
From: Oliver Neukum @ 2011-04-21 20:00 UTC (permalink / raw)
  To: Alan Stern; +Cc: Paul Stewart, netdev, linux-usb, davem, greg
In-Reply-To: <Pine.LNX.4.44L0.1104210958560.1939-100000@iolanthe.rowland.org>

Am Donnerstag, 21. April 2011, 16:03:34 schrieb Alan Stern:
> On Tue, 19 Apr 2011, Paul Stewart wrote:

> > This version of the patch moves the urb submit directly into
> > usbnet_resume.  Is it okay to submit a GFP_KERNEL urb from
> > usbnet_resume()?

Suppose a device of two interfaces one of them storage is autosuspended.
GFP_KERNEL in the first device to be resumed triggers a pageout to the
suspended storage device.

	Regards
		Oliver

^ permalink raw reply

* [PATCH] inet: add RCU protection to inet->opt
From: Eric Dumazet @ 2011-04-21 19:45 UTC (permalink / raw)
  To: David Miller; +Cc: Herbert Xu, netdev
In-Reply-To: <1302881994.3613.34.camel@edumazet-laptop>

Le vendredi 15 avril 2011 à 17:39 +0200, Eric Dumazet a écrit :
> In commit 903ab86d19 (udp: Add lockless transmit path), we added a
> fastpath to avoid taking socket lock if we dont use corking.
> 
> Prior work were commit 1c32c5ad6fac8c (inet: Add ip_make_skb and
> ip_finish_skb) and commit 1470ddf7f8cecf776921e5 (inet: Remove explicit
> write references to sk/inet in ip_append_data)
> 
> Problem is ip_make_skb() calls ip_setup_cork() and
> ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
> without any protection against another thread manipulating inet->opt.
> 
> Another thread can change inet->opt pointer and free old one... kaboom.
> 
> This was discovered by code analysis (I am trying to remove the zeroing
> of cork variable in ip_make_skb(), since its a bit expensive and
> probably useless)
> 
> Note : race was there before Herbert patches.
> 
> My plan is to add RCU protection on inet->opt, unless someone has better
> idea ?
> 
> 

OK, here is the patch I cooked to address this problem, on top of
net-next-2.6

[PATCH] inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 include/net/inet_sock.h         |   14 ++-
 include/net/ip.h                |   11 +-
 net/dccp/ipv4.c                 |   16 ++--
 net/dccp/ipv6.c                 |    2 
 net/ipv4/af_inet.c              |   17 +++-
 net/ipv4/cipso_ipv4.c           |  113 ++++++++++++++++--------------
 net/ipv4/icmp.c                 |   23 ++----
 net/ipv4/inet_connection_sock.c |    6 -
 net/ipv4/ip_options.c           |   38 ++++------
 net/ipv4/ip_output.c            |   44 +++++------
 net/ipv4/ip_sockglue.c          |   35 ++++++---
 net/ipv4/raw.c                  |   20 ++++-
 net/ipv4/syncookies.c           |    4 -
 net/ipv4/tcp_ipv4.c             |   34 +++++----
 net/ipv4/udp.c                  |   22 ++++-
 net/ipv6/tcp_ipv6.c             |    2 
 16 files changed, 236 insertions(+), 165 deletions(-)

diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 7a37369..ed2ba6e 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -57,7 +57,15 @@ struct ip_options {
 	unsigned char	__data[0];
 };
 
-#define optlength(opt) (sizeof(struct ip_options) + opt->optlen)
+struct ip_options_rcu {
+	struct rcu_head rcu;
+	struct ip_options opt;
+};
+
+struct ip_options_data {
+	struct ip_options_rcu	opt;
+	char			data[40];
+};
 
 struct inet_request_sock {
 	struct request_sock	req;
@@ -78,7 +86,7 @@ struct inet_request_sock {
 				acked	   : 1,
 				no_srccheck: 1;
 	kmemcheck_bitfield_end(flags);
-	struct ip_options	*opt;
+	struct ip_options_rcu	*opt;
 };
 
 static inline struct inet_request_sock *inet_rsk(const struct request_sock *sk)
@@ -140,7 +148,7 @@ struct inet_sock {
 	__be16			inet_sport;
 	__u16			inet_id;
 
-	struct ip_options	*opt;
+	struct ip_options_rcu __rcu	*inet_opt;
 	__u8			tos;
 	__u8			min_ttl;
 	__u8			mc_ttl;
diff --git a/include/net/ip.h b/include/net/ip.h
index 7c41658..3a59bf9 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -52,7 +52,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff *skb)
 struct ipcm_cookie {
 	__be32			addr;
 	int			oif;
-	struct ip_options	*opt;
+	struct ip_options_rcu	*opt;
 	__u8			tx_flags;
 };
 
@@ -92,7 +92,7 @@ extern int		igmp_mc_proc_init(void);
 
 extern int		ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 					      __be32 saddr, __be32 daddr,
-					      struct ip_options *opt);
+					      struct ip_options_rcu *opt);
 extern int		ip_rcv(struct sk_buff *skb, struct net_device *dev,
 			       struct packet_type *pt, struct net_device *orig_dev);
 extern int		ip_local_deliver(struct sk_buff *skb);
@@ -416,14 +416,15 @@ extern int ip_forward(struct sk_buff *skb);
  *	Functions provided by ip_options.c
  */
  
-extern void ip_options_build(struct sk_buff *skb, struct ip_options *opt, __be32 daddr, struct rtable *rt, int is_frag);
+extern void ip_options_build(struct sk_buff *skb, struct ip_options *opt,
+			     __be32 daddr, struct rtable *rt, int is_frag);
 extern int ip_options_echo(struct ip_options *dopt, struct sk_buff *skb);
 extern void ip_options_fragment(struct sk_buff *skb);
 extern int ip_options_compile(struct net *net,
 			      struct ip_options *opt, struct sk_buff *skb);
-extern int ip_options_get(struct net *net, struct ip_options **optp,
+extern int ip_options_get(struct net *net, struct ip_options_rcu **optp,
 			  unsigned char *data, int optlen);
-extern int ip_options_get_from_user(struct net *net, struct ip_options **optp,
+extern int ip_options_get_from_user(struct net *net, struct ip_options_rcu **optp,
 				    unsigned char __user *data, int optlen);
 extern void ip_options_undo(struct ip_options * opt);
 extern void ip_forward_options(struct sk_buff *skb);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ae451c6..1f54ee2 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -47,6 +47,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct rtable *rt;
 	__be32 daddr, nexthop;
 	int err;
+	struct ip_options_rcu *inet_opt;
 
 	dp->dccps_role = DCCP_ROLE_CLIENT;
 
@@ -57,10 +58,13 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -EAFNOSUPPORT;
 
 	nexthop = daddr = usin->sin_addr.s_addr;
-	if (inet->opt != NULL && inet->opt->srr) {
+
+	inet_opt = rcu_dereference_protected(inet->inet_opt,
+					     sock_owned_by_user(sk));
+	if (inet_opt != NULL && inet_opt->opt.srr) {
 		if (daddr == 0)
 			return -EINVAL;
-		nexthop = inet->opt->faddr;
+		nexthop = inet_opt->opt.faddr;
 	}
 
 	orig_sport = inet->inet_sport;
@@ -77,7 +81,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -ENETUNREACH;
 	}
 
-	if (inet->opt == NULL || !inet->opt->srr)
+	if (inet_opt == NULL || !inet_opt->opt.srr)
 		daddr = rt->rt_dst;
 
 	if (inet->inet_saddr == 0)
@@ -88,8 +92,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	inet->inet_daddr = daddr;
 
 	inet_csk(sk)->icsk_ext_hdr_len = 0;
-	if (inet->opt != NULL)
-		inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
+	if (inet_opt)
+		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 	/*
 	 * Socket identity is still unknown (sport may be zero).
 	 * However we set state to DCCP_REQUESTING and not releasing socket
@@ -405,7 +409,7 @@ struct sock *dccp_v4_request_recv_sock(struct sock *sk, struct sk_buff *skb,
 	newinet->inet_daddr	= ireq->rmt_addr;
 	newinet->inet_rcv_saddr = ireq->loc_addr;
 	newinet->inet_saddr	= ireq->loc_addr;
-	newinet->opt	   = ireq->opt;
+	newinet->inet_opt	= ireq->opt;
 	ireq->opt	   = NULL;
 	newinet->mc_index  = inet_iif(skb);
 	newinet->mc_ttl	   = ip_hdr(skb)->ttl;
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index de1b7e3..43eba99 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -573,7 +573,7 @@ static struct sock *dccp_v6_request_recv_sock(struct sock *sk,
 
 	   First: no IPv4 options.
 	 */
-	newinet->opt = NULL;
+	newinet->inet_opt = NULL;
 
 	/* Clone RX bits */
 	newnp->rxopt.all = np->rxopt.all;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 807d83c..01ae9b5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -153,7 +153,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON(sk->sk_wmem_queued);
 	WARN_ON(sk->sk_forward_alloc);
 
-	kfree(inet->opt);
+	kfree(rcu_dereference_protected(inet->inet_opt, 1));
 	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
 	sk_refcnt_debug_dec(sk);
 }
@@ -1105,9 +1105,12 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	__be32 daddr = inet->inet_daddr;
 	struct rtable *rt;
 	__be32 new_saddr;
+	struct ip_options_rcu *inet_opt;
 
-	if (inet->opt && inet->opt->srr)
-		daddr = inet->opt->faddr;
+	inet_opt = rcu_dereference_protected(inet->inet_opt,
+					     sock_owned_by_user(sk));
+	if (inet_opt && inet_opt->opt.srr)
+		daddr = inet_opt->opt.faddr;
 
 	/* Query new route. */
 	rt = ip_route_connect(daddr, 0, RT_CONN_FLAGS(sk),
@@ -1147,6 +1150,7 @@ int inet_sk_rebuild_header(struct sock *sk)
 	struct inet_sock *inet = inet_sk(sk);
 	struct rtable *rt = (struct rtable *)__sk_dst_check(sk, 0);
 	__be32 daddr;
+	struct ip_options_rcu *inet_opt;
 	int err;
 
 	/* Route is OK, nothing to do. */
@@ -1154,9 +1158,12 @@ int inet_sk_rebuild_header(struct sock *sk)
 		return 0;
 
 	/* Reroute. */
+	rcu_read_lock();
+	inet_opt = rcu_dereference(inet->inet_opt);
 	daddr = inet->inet_daddr;
-	if (inet->opt && inet->opt->srr)
-		daddr = inet->opt->faddr;
+	if (inet_opt && inet_opt->opt.srr)
+		daddr = inet_opt->opt.faddr;
+	rcu_read_unlock();
 	rt = ip_route_output_ports(sock_net(sk), sk, daddr, inet->inet_saddr,
 				   inet->inet_dport, inet->inet_sport,
 				   sk->sk_protocol, RT_CONN_FLAGS(sk),
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index a0af7ea..2b3c23c 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1857,6 +1857,11 @@ static int cipso_v4_genopt(unsigned char *buf, u32 buf_len,
 	return CIPSO_V4_HDR_LEN + ret_val;
 }
 
+static void opt_kfree_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_options_rcu, rcu));
+}
+
 /**
  * cipso_v4_sock_setattr - Add a CIPSO option to a socket
  * @sk: the socket
@@ -1879,7 +1884,7 @@ int cipso_v4_sock_setattr(struct sock *sk,
 	unsigned char *buf = NULL;
 	u32 buf_len;
 	u32 opt_len;
-	struct ip_options *opt = NULL;
+	struct ip_options_rcu *old, *opt = NULL;
 	struct inet_sock *sk_inet;
 	struct inet_connection_sock *sk_conn;
 
@@ -1915,22 +1920,25 @@ int cipso_v4_sock_setattr(struct sock *sk,
 		ret_val = -ENOMEM;
 		goto socket_setattr_failure;
 	}
-	memcpy(opt->__data, buf, buf_len);
-	opt->optlen = opt_len;
-	opt->cipso = sizeof(struct iphdr);
+	memcpy(opt->opt.__data, buf, buf_len);
+	opt->opt.optlen = opt_len;
+	opt->opt.cipso = sizeof(struct iphdr);
 	kfree(buf);
 	buf = NULL;
 
 	sk_inet = inet_sk(sk);
+
+	old = rcu_dereference_protected(sk_inet->inet_opt, sock_owned_by_user(sk));
 	if (sk_inet->is_icsk) {
 		sk_conn = inet_csk(sk);
-		if (sk_inet->opt)
-			sk_conn->icsk_ext_hdr_len -= sk_inet->opt->optlen;
-		sk_conn->icsk_ext_hdr_len += opt->optlen;
+		if (old)
+			sk_conn->icsk_ext_hdr_len -= old->opt.optlen;
+		sk_conn->icsk_ext_hdr_len += opt->opt.optlen;
 		sk_conn->icsk_sync_mss(sk, sk_conn->icsk_pmtu_cookie);
 	}
-	opt = xchg(&sk_inet->opt, opt);
-	kfree(opt);
+	rcu_assign_pointer(sk_inet->inet_opt, opt);
+	if (old)
+		call_rcu(&old->rcu, opt_kfree_rcu);
 
 	return 0;
 
@@ -1960,7 +1968,7 @@ int cipso_v4_req_setattr(struct request_sock *req,
 	unsigned char *buf = NULL;
 	u32 buf_len;
 	u32 opt_len;
-	struct ip_options *opt = NULL;
+	struct ip_options_rcu *opt = NULL;
 	struct inet_request_sock *req_inet;
 
 	/* We allocate the maximum CIPSO option size here so we are probably
@@ -1988,15 +1996,16 @@ int cipso_v4_req_setattr(struct request_sock *req,
 		ret_val = -ENOMEM;
 		goto req_setattr_failure;
 	}
-	memcpy(opt->__data, buf, buf_len);
-	opt->optlen = opt_len;
-	opt->cipso = sizeof(struct iphdr);
+	memcpy(opt->opt.__data, buf, buf_len);
+	opt->opt.optlen = opt_len;
+	opt->opt.cipso = sizeof(struct iphdr);
 	kfree(buf);
 	buf = NULL;
 
 	req_inet = inet_rsk(req);
 	opt = xchg(&req_inet->opt, opt);
-	kfree(opt);
+	if (opt)
+		call_rcu(&opt->rcu, opt_kfree_rcu);
 
 	return 0;
 
@@ -2016,34 +2025,34 @@ req_setattr_failure:
  * values on failure.
  *
  */
-static int cipso_v4_delopt(struct ip_options **opt_ptr)
+static int cipso_v4_delopt(struct ip_options_rcu **opt_ptr)
 {
 	int hdr_delta = 0;
-	struct ip_options *opt = *opt_ptr;
+	struct ip_options_rcu *opt = *opt_ptr;
 
-	if (opt->srr || opt->rr || opt->ts || opt->router_alert) {
+	if (opt->opt.srr || opt->opt.rr || opt->opt.ts || opt->opt.router_alert) {
 		u8 cipso_len;
 		u8 cipso_off;
 		unsigned char *cipso_ptr;
 		int iter;
 		int optlen_new;
 
-		cipso_off = opt->cipso - sizeof(struct iphdr);
-		cipso_ptr = &opt->__data[cipso_off];
+		cipso_off = opt->opt.cipso - sizeof(struct iphdr);
+		cipso_ptr = &opt->opt.__data[cipso_off];
 		cipso_len = cipso_ptr[1];
 
-		if (opt->srr > opt->cipso)
-			opt->srr -= cipso_len;
-		if (opt->rr > opt->cipso)
-			opt->rr -= cipso_len;
-		if (opt->ts > opt->cipso)
-			opt->ts -= cipso_len;
-		if (opt->router_alert > opt->cipso)
-			opt->router_alert -= cipso_len;
-		opt->cipso = 0;
+		if (opt->opt.srr > opt->opt.cipso)
+			opt->opt.srr -= cipso_len;
+		if (opt->opt.rr > opt->opt.cipso)
+			opt->opt.rr -= cipso_len;
+		if (opt->opt.ts > opt->opt.cipso)
+			opt->opt.ts -= cipso_len;
+		if (opt->opt.router_alert > opt->opt.cipso)
+			opt->opt.router_alert -= cipso_len;
+		opt->opt.cipso = 0;
 
 		memmove(cipso_ptr, cipso_ptr + cipso_len,
-			opt->optlen - cipso_off - cipso_len);
+			opt->opt.optlen - cipso_off - cipso_len);
 
 		/* determining the new total option length is tricky because of
 		 * the padding necessary, the only thing i can think to do at
@@ -2052,21 +2061,21 @@ static int cipso_v4_delopt(struct ip_options **opt_ptr)
 		 * from there we can determine the new total option length */
 		iter = 0;
 		optlen_new = 0;
-		while (iter < opt->optlen)
-			if (opt->__data[iter] != IPOPT_NOP) {
-				iter += opt->__data[iter + 1];
+		while (iter < opt->opt.optlen)
+			if (opt->opt.__data[iter] != IPOPT_NOP) {
+				iter += opt->opt.__data[iter + 1];
 				optlen_new = iter;
 			} else
 				iter++;
-		hdr_delta = opt->optlen;
-		opt->optlen = (optlen_new + 3) & ~3;
-		hdr_delta -= opt->optlen;
+		hdr_delta = opt->opt.optlen;
+		opt->opt.optlen = (optlen_new + 3) & ~3;
+		hdr_delta -= opt->opt.optlen;
 	} else {
 		/* only the cipso option was present on the socket so we can
 		 * remove the entire option struct */
 		*opt_ptr = NULL;
-		hdr_delta = opt->optlen;
-		kfree(opt);
+		hdr_delta = opt->opt.optlen;
+		call_rcu(&opt->rcu, opt_kfree_rcu);
 	}
 
 	return hdr_delta;
@@ -2083,15 +2092,15 @@ static int cipso_v4_delopt(struct ip_options **opt_ptr)
 void cipso_v4_sock_delattr(struct sock *sk)
 {
 	int hdr_delta;
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
 	struct inet_sock *sk_inet;
 
 	sk_inet = inet_sk(sk);
-	opt = sk_inet->opt;
-	if (opt == NULL || opt->cipso == 0)
+	opt = rcu_dereference_protected(sk_inet->inet_opt, 1);
+	if (opt == NULL || opt->opt.cipso == 0)
 		return;
 
-	hdr_delta = cipso_v4_delopt(&sk_inet->opt);
+	hdr_delta = cipso_v4_delopt(&sk_inet->inet_opt);
 	if (sk_inet->is_icsk && hdr_delta > 0) {
 		struct inet_connection_sock *sk_conn = inet_csk(sk);
 		sk_conn->icsk_ext_hdr_len -= hdr_delta;
@@ -2109,12 +2118,12 @@ void cipso_v4_sock_delattr(struct sock *sk)
  */
 void cipso_v4_req_delattr(struct request_sock *req)
 {
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
 	struct inet_request_sock *req_inet;
 
 	req_inet = inet_rsk(req);
 	opt = req_inet->opt;
-	if (opt == NULL || opt->cipso == 0)
+	if (opt == NULL || opt->opt.cipso == 0)
 		return;
 
 	cipso_v4_delopt(&req_inet->opt);
@@ -2184,14 +2193,18 @@ getattr_return:
  */
 int cipso_v4_sock_getattr(struct sock *sk, struct netlbl_lsm_secattr *secattr)
 {
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
+	int res = -ENOMSG;
 
-	opt = inet_sk(sk)->opt;
-	if (opt == NULL || opt->cipso == 0)
-		return -ENOMSG;
-
-	return cipso_v4_getattr(opt->__data + opt->cipso - sizeof(struct iphdr),
-				secattr);
+	rcu_read_lock();
+	opt = rcu_dereference(inet_sk(sk)->inet_opt);
+	if (opt && opt->opt.cipso)
+		res = cipso_v4_getattr(opt->opt.__data +
+						opt->opt.cipso -
+						sizeof(struct iphdr),
+				       secattr);
+	rcu_read_unlock();
+	return res;
 }
 
 /**
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e5f8a71..831e7dc 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -108,8 +108,7 @@ struct icmp_bxm {
 		__be32	       times[3];
 	} data;
 	int head_len;
-	struct ip_options replyopts;
-	unsigned char  optbuf[40];
+	struct ip_options_data replyopts;
 };
 
 /* An array of errno for error messages from dest unreach. */
@@ -333,7 +332,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	struct inet_sock *inet;
 	__be32 daddr;
 
-	if (ip_options_echo(&icmp_param->replyopts, skb))
+	if (ip_options_echo(&icmp_param->replyopts.opt.opt, skb))
 		return;
 
 	sk = icmp_xmit_lock(net);
@@ -347,10 +346,10 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	daddr = ipc.addr = rt->rt_src;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
-	if (icmp_param->replyopts.optlen) {
-		ipc.opt = &icmp_param->replyopts;
-		if (ipc.opt->srr)
-			daddr = icmp_param->replyopts.faddr;
+	if (icmp_param->replyopts.opt.opt.optlen) {
+		ipc.opt = &icmp_param->replyopts.opt;
+		if (ipc.opt->opt.srr)
+			daddr = icmp_param->replyopts.opt.opt.faddr;
 	}
 	{
 		struct flowi4 fl4 = {
@@ -379,8 +378,8 @@ static struct rtable *icmp_route_lookup(struct net *net, struct sk_buff *skb_in,
 					struct icmp_bxm *param)
 {
 	struct flowi4 fl4 = {
-		.daddr = (param->replyopts.srr ?
-			  param->replyopts.faddr : iph->saddr),
+		.daddr = (param->replyopts.opt.opt.srr ?
+			  param->replyopts.opt.opt.faddr : iph->saddr),
 		.saddr = saddr,
 		.flowi4_tos = RT_TOS(tos),
 		.flowi4_proto = IPPROTO_ICMP,
@@ -581,7 +580,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 					   IPTOS_PREC_INTERNETCONTROL) :
 					  iph->tos;
 
-	if (ip_options_echo(&icmp_param.replyopts, skb_in))
+	if (ip_options_echo(&icmp_param.replyopts.opt.opt, skb_in))
 		goto out_unlock;
 
 
@@ -597,7 +596,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	icmp_param.offset = skb_network_offset(skb_in);
 	inet_sk(sk)->tos = tos;
 	ipc.addr = iph->saddr;
-	ipc.opt = &icmp_param.replyopts;
+	ipc.opt = &icmp_param.replyopts.opt;
 	ipc.tx_flags = 0;
 
 	rt = icmp_route_lookup(net, skb_in, iph, saddr, tos,
@@ -613,7 +612,7 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 	room = dst_mtu(&rt->dst);
 	if (room > 576)
 		room = 576;
-	room -= sizeof(struct iphdr) + icmp_param.replyopts.optlen;
+	room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
 	room -= sizeof(struct icmphdr);
 
 	icmp_param.data_len = skb_in->len - icmp_param.offset;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8514db5..3282cb2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -354,20 +354,20 @@ struct dst_entry *inet_csk_route_req(struct sock *sk,
 {
 	struct rtable *rt;
 	const struct inet_request_sock *ireq = inet_rsk(req);
-	struct ip_options *opt = inet_rsk(req)->opt;
+	struct ip_options_rcu *opt = inet_rsk(req)->opt;
 	struct net *net = sock_net(sk);
 	struct flowi4 fl4;
 
 	flowi4_init_output(&fl4, sk->sk_bound_dev_if, sk->sk_mark,
 			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
 			   sk->sk_protocol, inet_sk_flowi_flags(sk),
-			   (opt && opt->srr) ? opt->faddr : ireq->rmt_addr,
+			   (opt && opt->opt.srr) ? opt->opt.faddr : ireq->rmt_addr,
 			   ireq->loc_addr, ireq->rmt_port, inet_sk(sk)->inet_sport);
 	security_req_classify_flow(req, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_flow(net, &fl4, sk);
 	if (IS_ERR(rt))
 		goto no_route;
-	if (opt && opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
+	if (opt && opt->opt.is_strictroute && rt->rt_dst != rt->rt_gateway)
 		goto route_err;
 	return &rt->dst;
 
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 2391b24..01fc409 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -36,7 +36,7 @@
  * saddr is address of outgoing interface.
  */
 
-void ip_options_build(struct sk_buff * skb, struct ip_options * opt,
+void ip_options_build(struct sk_buff *skb, struct ip_options *opt,
 			    __be32 daddr, struct rtable *rt, int is_frag)
 {
 	unsigned char *iph = skb_network_header(skb);
@@ -83,9 +83,9 @@ void ip_options_build(struct sk_buff * skb, struct ip_options * opt,
  * NOTE: dopt cannot point to skb.
  */
 
-int ip_options_echo(struct ip_options * dopt, struct sk_buff * skb)
+int ip_options_echo(struct ip_options *dopt, struct sk_buff *skb)
 {
-	struct ip_options *sopt;
+	const struct ip_options *sopt;
 	unsigned char *sptr, *dptr;
 	int soffset, doffset;
 	int	optlen;
@@ -95,10 +95,8 @@ int ip_options_echo(struct ip_options * dopt, struct sk_buff * skb)
 
 	sopt = &(IPCB(skb)->opt);
 
-	if (sopt->optlen == 0) {
-		dopt->optlen = 0;
+	if (sopt->optlen == 0)
 		return 0;
-	}
 
 	sptr = skb_network_header(skb);
 	dptr = dopt->__data;
@@ -157,7 +155,7 @@ int ip_options_echo(struct ip_options * dopt, struct sk_buff * skb)
 		dopt->optlen += optlen;
 	}
 	if (sopt->srr) {
-		unsigned char * start = sptr+sopt->srr;
+		unsigned char *start = sptr+sopt->srr;
 		__be32 faddr;
 
 		optlen  = start[1];
@@ -499,19 +497,19 @@ void ip_options_undo(struct ip_options * opt)
 	}
 }
 
-static struct ip_options *ip_options_get_alloc(const int optlen)
+static struct ip_options_rcu *ip_options_get_alloc(const int optlen)
 {
-	return kzalloc(sizeof(struct ip_options) + ((optlen + 3) & ~3),
+	return kzalloc(sizeof(struct ip_options_rcu) + ((optlen + 3) & ~3),
 		       GFP_KERNEL);
 }
 
-static int ip_options_get_finish(struct net *net, struct ip_options **optp,
-				 struct ip_options *opt, int optlen)
+static int ip_options_get_finish(struct net *net, struct ip_options_rcu **optp,
+				 struct ip_options_rcu *opt, int optlen)
 {
 	while (optlen & 3)
-		opt->__data[optlen++] = IPOPT_END;
-	opt->optlen = optlen;
-	if (optlen && ip_options_compile(net, opt, NULL)) {
+		opt->opt.__data[optlen++] = IPOPT_END;
+	opt->opt.optlen = optlen;
+	if (optlen && ip_options_compile(net, &opt->opt, NULL)) {
 		kfree(opt);
 		return -EINVAL;
 	}
@@ -520,29 +518,29 @@ static int ip_options_get_finish(struct net *net, struct ip_options **optp,
 	return 0;
 }
 
-int ip_options_get_from_user(struct net *net, struct ip_options **optp,
+int ip_options_get_from_user(struct net *net, struct ip_options_rcu **optp,
 			     unsigned char __user *data, int optlen)
 {
-	struct ip_options *opt = ip_options_get_alloc(optlen);
+	struct ip_options_rcu *opt = ip_options_get_alloc(optlen);
 
 	if (!opt)
 		return -ENOMEM;
-	if (optlen && copy_from_user(opt->__data, data, optlen)) {
+	if (optlen && copy_from_user(opt->opt.__data, data, optlen)) {
 		kfree(opt);
 		return -EFAULT;
 	}
 	return ip_options_get_finish(net, optp, opt, optlen);
 }
 
-int ip_options_get(struct net *net, struct ip_options **optp,
+int ip_options_get(struct net *net, struct ip_options_rcu **optp,
 		   unsigned char *data, int optlen)
 {
-	struct ip_options *opt = ip_options_get_alloc(optlen);
+	struct ip_options_rcu *opt = ip_options_get_alloc(optlen);
 
 	if (!opt)
 		return -ENOMEM;
 	if (optlen)
-		memcpy(opt->__data, data, optlen);
+		memcpy(opt->opt.__data, data, optlen);
 	return ip_options_get_finish(net, optp, opt, optlen);
 }
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index bdad3d6..362e66f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -140,14 +140,14 @@ static inline int ip_select_ttl(struct inet_sock *inet, struct dst_entry *dst)
  *
  */
 int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
-			  __be32 saddr, __be32 daddr, struct ip_options *opt)
+			  __be32 saddr, __be32 daddr, struct ip_options_rcu *opt)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct rtable *rt = skb_rtable(skb);
 	struct iphdr *iph;
 
 	/* Build the IP header. */
-	skb_push(skb, sizeof(struct iphdr) + (opt ? opt->optlen : 0));
+	skb_push(skb, sizeof(struct iphdr) + (opt ? opt->opt.optlen : 0));
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
 	iph->version  = 4;
@@ -163,9 +163,9 @@ int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 	iph->protocol = sk->sk_protocol;
 	ip_select_ident(iph, &rt->dst, sk);
 
-	if (opt && opt->optlen) {
-		iph->ihl += opt->optlen>>2;
-		ip_options_build(skb, opt, daddr, rt, 0);
+	if (opt && opt->opt.optlen) {
+		iph->ihl += opt->opt.optlen>>2;
+		ip_options_build(skb, &opt->opt, daddr, rt, 0);
 	}
 
 	skb->priority = sk->sk_priority;
@@ -316,7 +316,7 @@ int ip_queue_xmit(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
 	struct inet_sock *inet = inet_sk(sk);
-	struct ip_options *opt = inet->opt;
+	struct ip_options_rcu *inet_opt;
 	struct rtable *rt;
 	struct iphdr *iph;
 	int res;
@@ -325,6 +325,7 @@ int ip_queue_xmit(struct sk_buff *skb)
 	 * f.e. by something like SCTP.
 	 */
 	rcu_read_lock();
+	inet_opt = rcu_dereference(inet->inet_opt);
 	rt = skb_rtable(skb);
 	if (rt != NULL)
 		goto packet_routed;
@@ -336,8 +337,8 @@ int ip_queue_xmit(struct sk_buff *skb)
 
 		/* Use correct destination address if we have options. */
 		daddr = inet->inet_daddr;
-		if(opt && opt->srr)
-			daddr = opt->faddr;
+		if (inet_opt && inet_opt->opt.srr)
+			daddr = inet_opt->opt.faddr;
 
 		/* If this fails, retransmit mechanism of transport layer will
 		 * keep trying until route appears or the connection times
@@ -357,11 +358,11 @@ int ip_queue_xmit(struct sk_buff *skb)
 	skb_dst_set_noref(skb, &rt->dst);
 
 packet_routed:
-	if (opt && opt->is_strictroute && rt->rt_dst != rt->rt_gateway)
+	if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_dst != rt->rt_gateway)
 		goto no_route;
 
 	/* OK, we know where to send it, allocate and build IP header. */
-	skb_push(skb, sizeof(struct iphdr) + (opt ? opt->optlen : 0));
+	skb_push(skb, sizeof(struct iphdr) + (inet_opt ? inet_opt->opt.optlen : 0));
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
 	*((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
@@ -375,9 +376,9 @@ packet_routed:
 	iph->daddr    = rt->rt_dst;
 	/* Transport layer set skb->h.foo itself. */
 
-	if (opt && opt->optlen) {
-		iph->ihl += opt->optlen >> 2;
-		ip_options_build(skb, opt, inet->inet_daddr, rt, 0);
+	if (inet_opt && inet_opt->opt.optlen) {
+		iph->ihl += inet_opt->opt.optlen >> 2;
+		ip_options_build(skb, &inet_opt->opt, inet->inet_daddr, rt, 0);
 	}
 
 	ip_select_ident_more(iph, &rt->dst, sk,
@@ -1033,7 +1034,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 			 struct ipcm_cookie *ipc, struct rtable **rtp)
 {
 	struct inet_sock *inet = inet_sk(sk);
-	struct ip_options *opt;
+	struct ip_options_rcu *opt;
 	struct rtable *rt;
 
 	/*
@@ -1047,7 +1048,7 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 			if (unlikely(cork->opt == NULL))
 				return -ENOBUFS;
 		}
-		memcpy(cork->opt, opt, sizeof(struct ip_options) + opt->optlen);
+		memcpy(cork->opt, &opt->opt, sizeof(struct ip_options) + opt->opt.optlen);
 		cork->flags |= IPCORK_OPT;
 		cork->addr = ipc->addr;
 	}
@@ -1451,26 +1452,23 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
 		   unsigned int len)
 {
 	struct inet_sock *inet = inet_sk(sk);
-	struct {
-		struct ip_options	opt;
-		char			data[40];
-	} replyopts;
+	struct ip_options_data replyopts;
 	struct ipcm_cookie ipc;
 	__be32 daddr;
 	struct rtable *rt = skb_rtable(skb);
 
-	if (ip_options_echo(&replyopts.opt, skb))
+	if (ip_options_echo(&replyopts.opt.opt, skb))
 		return;
 
 	daddr = ipc.addr = rt->rt_src;
 	ipc.opt = NULL;
 	ipc.tx_flags = 0;
 
-	if (replyopts.opt.optlen) {
+	if (replyopts.opt.opt.optlen) {
 		ipc.opt = &replyopts.opt;
 
-		if (ipc.opt->srr)
-			daddr = replyopts.opt.faddr;
+		if (replyopts.opt.opt.srr)
+			daddr = replyopts.opt.opt.faddr;
 	}
 
 	{
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 3948c86..68cdf2b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -451,6 +451,11 @@ out:
 }
 
 
+static void opt_kfree_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_options_rcu, rcu));
+}
+
 /*
  *	Socket option code for IP. This is the end of the line after any
  *	TCP,UDP etc options on an IP socket.
@@ -497,13 +502,16 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 	switch (optname) {
 	case IP_OPTIONS:
 	{
-		struct ip_options *opt = NULL;
+		struct ip_options_rcu *old, *opt = NULL;
+
 		if (optlen > 40)
 			goto e_inval;
 		err = ip_options_get_from_user(sock_net(sk), &opt,
 					       optval, optlen);
 		if (err)
 			break;
+		old = rcu_dereference_protected(inet->inet_opt,
+						sock_owned_by_user(sk));
 		if (inet->is_icsk) {
 			struct inet_connection_sock *icsk = inet_csk(sk);
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
@@ -512,17 +520,18 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 			       (TCPF_LISTEN | TCPF_CLOSE)) &&
 			     inet->inet_daddr != LOOPBACK4_IPV6)) {
 #endif
-				if (inet->opt)
-					icsk->icsk_ext_hdr_len -= inet->opt->optlen;
+				if (old)
+					icsk->icsk_ext_hdr_len -= old->opt.optlen;
 				if (opt)
-					icsk->icsk_ext_hdr_len += opt->optlen;
+					icsk->icsk_ext_hdr_len += opt->opt.optlen;
 				icsk->icsk_sync_mss(sk, icsk->icsk_pmtu_cookie);
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
 			}
 #endif
 		}
-		opt = xchg(&inet->opt, opt);
-		kfree(opt);
+		rcu_assign_pointer(inet->inet_opt, opt);
+		if (old)
+			call_rcu(&old->rcu, opt_kfree_rcu);
 		break;
 	}
 	case IP_PKTINFO:
@@ -1081,12 +1090,16 @@ static int do_ip_getsockopt(struct sock *sk, int level, int optname,
 	case IP_OPTIONS:
 	{
 		unsigned char optbuf[sizeof(struct ip_options)+40];
-		struct ip_options * opt = (struct ip_options *)optbuf;
+		struct ip_options *opt = (struct ip_options *)optbuf;
+		struct ip_options_rcu *inet_opt;
+
+		inet_opt = rcu_dereference_protected(inet->inet_opt,
+						     sock_owned_by_user(sk));
 		opt->optlen = 0;
-		if (inet->opt)
-			memcpy(optbuf, inet->opt,
-			       sizeof(struct ip_options)+
-			       inet->opt->optlen);
+		if (inet_opt)
+			memcpy(optbuf, &inet_opt->opt,
+			       sizeof(struct ip_options) +
+			       inet_opt->opt.optlen);
 		release_sock(sk);
 
 		if (opt->optlen == 0)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2b50cc2..c577d25 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -460,6 +460,7 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	__be32 saddr;
 	u8  tos;
 	int err;
+	struct ip_options_data opt_copy;
 
 	err = -EMSGSIZE;
 	if (len > 0xFFFF)
@@ -520,8 +521,18 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	saddr = ipc.addr;
 	ipc.addr = daddr;
 
-	if (!ipc.opt)
-		ipc.opt = inet->opt;
+	if (!ipc.opt) {
+		struct ip_options_rcu *inet_opt;
+
+		rcu_read_lock();
+		inet_opt = rcu_dereference(inet->inet_opt);
+		if (inet_opt) {
+			memcpy(&opt_copy, inet_opt,
+			       sizeof(*inet_opt) + inet_opt->opt.optlen);
+			ipc.opt = &opt_copy.opt;
+		}
+		rcu_read_unlock();
+	}
 
 	if (ipc.opt) {
 		err = -EINVAL;
@@ -530,10 +541,10 @@ static int raw_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		 */
 		if (inet->hdrincl)
 			goto done;
-		if (ipc.opt->srr) {
+		if (ipc.opt->opt.srr) {
 			if (!daddr)
 				goto done;
-			daddr = ipc.opt->faddr;
+			daddr = ipc.opt->opt.faddr;
 		}
 	}
 	tos = RT_CONN_FLAGS(sk);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 71e0296..2646149 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -321,10 +321,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
 	 * the ACK carries the same options again (see RFC1122 4.2.3.8)
 	 */
 	if (opt && opt->optlen) {
-		int opt_size = sizeof(struct ip_options) + opt->optlen;
+		int opt_size = sizeof(struct ip_options_rcu) + opt->optlen;
 
 		ireq->opt = kmalloc(opt_size, GFP_ATOMIC);
-		if (ireq->opt != NULL && ip_options_echo(ireq->opt, skb)) {
+		if (ireq->opt != NULL && ip_options_echo(&ireq->opt->opt, skb)) {
 			kfree(ireq->opt);
 			ireq->opt = NULL;
 		}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..d120f6f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -153,6 +153,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct rtable *rt;
 	__be32 daddr, nexthop;
 	int err;
+	struct ip_options_rcu *inet_opt;
 
 	if (addr_len < sizeof(struct sockaddr_in))
 		return -EINVAL;
@@ -161,10 +162,12 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -EAFNOSUPPORT;
 
 	nexthop = daddr = usin->sin_addr.s_addr;
-	if (inet->opt && inet->opt->srr) {
+	inet_opt = rcu_dereference_protected(inet->inet_opt,
+					     sock_owned_by_user(sk));
+	if (inet_opt && inet_opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
-		nexthop = inet->opt->faddr;
+		nexthop = inet_opt->opt.faddr;
 	}
 
 	orig_sport = inet->inet_sport;
@@ -185,7 +188,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		return -ENETUNREACH;
 	}
 
-	if (!inet->opt || !inet->opt->srr)
+	if (!inet_opt || !inet_opt->opt.srr)
 		daddr = rt->rt_dst;
 
 	if (!inet->inet_saddr)
@@ -221,8 +224,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	inet->inet_daddr = daddr;
 
 	inet_csk(sk)->icsk_ext_hdr_len = 0;
-	if (inet->opt)
-		inet_csk(sk)->icsk_ext_hdr_len = inet->opt->optlen;
+	if (inet_opt)
+		inet_csk(sk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 
 	tp->rx_opt.mss_clamp = TCP_MSS_DEFAULT;
 
@@ -820,17 +823,18 @@ static void syn_flood_warning(const struct sk_buff *skb)
 /*
  * Save and compile IPv4 options into the request_sock if needed.
  */
-static struct ip_options *tcp_v4_save_options(struct sock *sk,
-					      struct sk_buff *skb)
+static struct ip_options_rcu *tcp_v4_save_options(struct sock *sk,
+						  struct sk_buff *skb)
 {
-	struct ip_options *opt = &(IPCB(skb)->opt);
-	struct ip_options *dopt = NULL;
+	const struct ip_options *opt = &(IPCB(skb)->opt);
+	struct ip_options_rcu *dopt = NULL;
 
 	if (opt && opt->optlen) {
-		int opt_size = optlength(opt);
+		int opt_size = sizeof(*dopt) + opt->optlen;
+
 		dopt = kmalloc(opt_size, GFP_ATOMIC);
 		if (dopt) {
-			if (ip_options_echo(dopt, skb)) {
+			if (ip_options_echo(&dopt->opt, skb)) {
 				kfree(dopt);
 				dopt = NULL;
 			}
@@ -1411,6 +1415,7 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_md5sig_key *key;
 #endif
+	struct ip_options_rcu *inet_opt;
 
 	if (sk_acceptq_is_full(sk))
 		goto exit_overflow;
@@ -1431,13 +1436,14 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	newinet->inet_daddr   = ireq->rmt_addr;
 	newinet->inet_rcv_saddr = ireq->loc_addr;
 	newinet->inet_saddr	      = ireq->loc_addr;
-	newinet->opt	      = ireq->opt;
+	inet_opt	      = ireq->opt;
+	rcu_assign_pointer(newinet->inet_opt, inet_opt);
 	ireq->opt	      = NULL;
 	newinet->mc_index     = inet_iif(skb);
 	newinet->mc_ttl	      = ip_hdr(skb)->ttl;
 	inet_csk(newsk)->icsk_ext_hdr_len = 0;
-	if (newinet->opt)
-		inet_csk(newsk)->icsk_ext_hdr_len = newinet->opt->optlen;
+	if (inet_opt)
+		inet_csk(newsk)->icsk_ext_hdr_len = inet_opt->opt.optlen;
 	newinet->inet_id = newtp->write_seq ^ jiffies;
 
 	tcp_mtup_init(newsk);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a15c8fb..75d24ce 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -804,6 +804,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
 	int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
 	struct sk_buff *skb;
+	struct ip_options_data opt_copy;
 
 	if (len > 0xFFFF)
 		return -EMSGSIZE;
@@ -877,22 +878,32 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			free = 1;
 		connected = 0;
 	}
-	if (!ipc.opt)
-		ipc.opt = inet->opt;
+	if (!ipc.opt) {
+		struct ip_options_rcu *inet_opt;
+
+		rcu_read_lock();
+		inet_opt = rcu_dereference(inet->inet_opt);
+		if (inet_opt) {
+			memcpy(&opt_copy, inet_opt,
+			       sizeof(*inet_opt) + inet_opt->opt.optlen);
+			ipc.opt = &opt_copy.opt;
+		}
+		rcu_read_unlock();
+	}
 
 	saddr = ipc.addr;
 	ipc.addr = faddr = daddr;
 
-	if (ipc.opt && ipc.opt->srr) {
+	if (ipc.opt && ipc.opt->opt.srr) {
 		if (!daddr)
 			return -EINVAL;
-		faddr = ipc.opt->faddr;
+		faddr = ipc.opt->opt.faddr;
 		connected = 0;
 	}
 	tos = RT_TOS(inet->tos);
 	if (sock_flag(sk, SOCK_LOCALROUTE) ||
 	    (msg->msg_flags & MSG_DONTROUTE) ||
-	    (ipc.opt && ipc.opt->is_strictroute)) {
+	    (ipc.opt && ipc.opt->opt.is_strictroute)) {
 		tos |= RTO_ONLINK;
 		connected = 0;
 	}
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4f49e5d..2c6e606 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1469,7 +1469,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 
 	   First: no IPv4 options.
 	 */
-	newinet->opt = NULL;
+	newinet->inet_opt = NULL;
 	newnp->ipv6_fl_list = NULL;
 
 	/* Clone RX bits */



^ permalink raw reply related

* Re: [PATCH] bonding: fix bridged bonds in 802.3ad mode
From: Jiri Bohac @ 2011-04-21 19:27 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Jiri Bohac, netdev, Stephen Hemminger, Jay Vosburgh,
	Andy Gospodarek, Benjamin Poirier
In-Reply-To: <1303412899.3165.47.camel@bwh-desktop>

On Thu, Apr 21, 2011 at 08:08:19PM +0100, Ben Hutchings wrote:
> It seems to me that 1e253c3b8a1aeed51eef6fc366812f219b97de65 is bogus
> and should be reverted, rather than worked around by other drivers.  We
> shouldn't enable non-conformant forwarding behaviour by default just
> because some people find it useful.  The administrator should have to
> explicitly enable it.

This is what I thought as well. I find it even more awkward to 
make this behaviour dependend on the STP setting of the bridge
(turning on STP works around this bonding problem, btw).

But even if forwarding of link-local frames is made optional in
some way, it will still break bonding when turned on. So unless
this it is completely reverted, we still need some kind of fix
for bonding.

Btw, this could also be fixed by checking for 
skb->protocol == htons(ETH_P_SLOW) directly in the bridging code.
However, I think the wonderful rx_handler infrastructure makes it
much cleaner this way...

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply

* Re: [PATCH v5] net: bnx2x: convert to hw_features
From: Ben Hutchings @ 2011-04-21 19:19 UTC (permalink / raw)
  To: Vladislav Zolotarov
  Cc: Eric Dumazet, Michał Mirosław, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <1303411750.19212.123.camel@lb-tlvb-vladz>

On Thu, 2011-04-21 at 21:49 +0300, Vladislav Zolotarov wrote:
[...]
> More than that, in addition it is impossible to disable the LRO with 
> the current bnx2x upstream driver (ethtool -K ethX lro off) and this is
> because dev_disable_lro passes to __ethtool_set_flags() flags based on
> the current value of dev->features while __ethtool_set_flags() expects
> only the flags set in dev->hw_features. bnx2x has NETIF_F_HW_VLAN_RX
> that is set in dev->features and not set in dev->hw_features and it's
> passed down to the __ethtool_set_flags(). 
[...]

The problem is here in register_netdevice():

	/* Transfer changeable features to wanted_features and enable
	 * software offloads (GSO and GRO).
	 */
	dev->hw_features |= NETIF_F_SOFT_FEATURES;
	dev->features |= NETIF_F_SOFT_FEATURES;
	dev->wanted_features = dev->features & dev->hw_features;

This doesn't work correctly for features that are always enabled, like
NETIF_F_HW_VLAN_RX in bnx2x, which are set in dev->features but not in
dev->hw_features.

The name 'hw_features' really wasn't a good choice - the obvious meaning
and the meaning assumed by this code is 'hardware-supported features'
and not 'hardware-supported features that can be toggled'.  And since we
add NETIF_F_SOFT_FEATURES, it really only means 'features that can be
toggled'.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] bonding: fix bridged bonds in 802.3ad mode
From: Ben Hutchings @ 2011-04-21 19:08 UTC (permalink / raw)
  To: Jiri Bohac
  Cc: netdev, Stephen Hemminger, Jay Vosburgh, Andy Gospodarek,
	Benjamin Poirier
In-Reply-To: <20110421184748.GD23756@midget.suse.cz>

On Thu, 2011-04-21 at 20:47 +0200, Jiri Bohac wrote:
> 802.3ad bonding inside a bridge is broken again. Originally fixed by
> 43aa1920117801fe9ae3d1fad886b62511e09bee, the bug was re-introduced by
> 1e253c3b8a1aeed51eef6fc366812f219b97de65.
> 
> LACP frames must not have their skb->dev changed by the bridging hook.
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> 
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1514,6 +1514,11 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
>  		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
>  	}
>  
> +	/* prevent bridging code from mangling and forwarding LACP frames */
> +	if (bond->params.mode == BOND_MODE_8023AD &&
> +	    skb->protocol == htons(ETH_P_SLOW))
> +		return RX_HANDLER_PASS;
> +
>  	return RX_HANDLER_ANOTHER;
>  }
>  

It seems to me that 1e253c3b8a1aeed51eef6fc366812f219b97de65 is bogus
and should be reverted, rather than worked around by other drivers.  We
shouldn't enable non-conformant forwarding behaviour by default just
because some people find it useful.  The administrator should have to
explicitly enable it.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH v5] net: bnx2x: convert to hw_features
From: Vladislav Zolotarov @ 2011-04-21 18:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michał Mirosław, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <1303397531.3685.16.camel@edumazet-laptop>

On Thu, 2011-04-21 at 07:52 -0700, Eric Dumazet wrote:
> Le mardi 12 avril 2011  21:38 +0200, Micha Mirosaw a crit :
> > Since ndo_fix_features callback is postponing features change when
> > bp->recovery_state != BNX2X_RECOVERY_DONE, netdev_update_features()
> > has to be called again when this condition changes. Previously,
> > ethtool_ops->set_flags callback returned -EBUSY in that case
> > (it's not possible in the new model).
> > 
> > Signed-off-by: Micha Mirosaw <mirq-linux@rere.qmqm.pl>
> > 
> > v5: - don't delay set_features, as it's rtnl_locked - same as recovery process
> > v4: - complete bp->rx_csum -> NETIF_F_RXCSUM conversion
> >     - add check for failed ndo_set_features in ndo_open callback
> > v3: - include NETIF_F_LRO in hw_features
> >     - don't call netdev_update_features() if bnx2x_nic_load() failed
> > v2: - comment in ndo_fix_features callback
> > ---
> 
> Hi guys
> 
> I am not sure its related to these changes, but I now have in
> net-next-2.6 :
> 
> [   23.674263] ------------[ cut here ]------------
> [   23.674266] WARNING: at net/core/dev.c:1318 dev_disable_lro+0x83/0x90()
> [   23.674270] Hardware name: ProLiant BL460c G6
> [   23.674273] Modules linked in: tg3 libphy sg
> [   23.674280] Pid: 3070, comm: sysctl Tainted: G        W   2.6.39-rc2-01242-g3ef22b9-dirty #669
> [   23.674282] Call Trace:
> [   23.674285]  [<ffffffff813b94f3>] ? dev_disable_lro+0x83/0x90
> [   23.674291]  [<ffffffff81042c9b>] warn_slowpath_common+0x8b/0xc0
> [   23.674298]  [<ffffffff81042ce5>] warn_slowpath_null+0x15/0x20
> [   23.674304]  [<ffffffff813b94f3>] dev_disable_lro+0x83/0x90
> [   23.674309]  [<ffffffff81429789>] devinet_sysctl_forward+0x199/0x210
> [   23.674313]  [<ffffffff814296e4>] ? devinet_sysctl_forward+0xf4/0x210
> [   23.674318]  [<ffffffff8104e712>] ? capable+0x12/0x20
> [   23.674324]  [<ffffffff81168f45>] proc_sys_call_handler+0xb5/0xd0
> [   23.674328]  [<ffffffff81168f6f>] proc_sys_write+0xf/0x20
> [   23.674334]  [<ffffffff81105f39>] vfs_write+0xc9/0x170
> [   23.674339]  [<ffffffff81106550>] sys_write+0x50/0x90
> [   23.674345]  [<ffffffff814b95a0>] sysenter_dispatch+0x7/0x33
> [   23.674350] ---[ end trace 051ec497c66b228e ]---
> 
> Thanks
> 

More than that, in addition it is impossible to disable the LRO with 
the current bnx2x upstream driver (ethtool -K ethX lro off) and this is
because dev_disable_lro passes to __ethtool_set_flags() flags based on
the current value of dev->features while __ethtool_set_flags() expects
only the flags set in dev->hw_features. bnx2x has NETIF_F_HW_VLAN_RX
that is set in dev->features and not set in dev->hw_features and it's
passed down to the __ethtool_set_flags(). 

Regarding the 'ethtool -K ethX lro off' I noticed that there is the same
problem: the flags that are passed to the __ethtool_set_flags() include
NETIF_F_HW_VLAN_RX. The userspace app gets the flags set in
dev->features to be able to display them, then clears the needed bits
and passes the resulting value down to kernel. So, I don't know how to
fix this without changing __ethtool_set_flags() not to check the bits or
by the taking care of the bit mask before it's passed to the
__ethtool_set_flags(). Below is the implementation of the second
option: 

diff --git a/net/core/dev.c b/net/core/dev.c
index 3871bf6..fd9175b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1314,7 +1314,7 @@ void dev_disable_lro(struct net_device *dev)
 	if (!(flags & ETH_FLAG_LRO))
 		return;
 
-	__ethtool_set_flags(dev, flags & ~ETH_FLAG_LRO);
+	__ethtool_set_flags(dev, flags & ~ETH_FLAG_LRO & dev->hw_features);
 	WARN_ON(dev->features & NETIF_F_LRO);
 }
 EXPORT_SYMBOL(dev_disable_lro);
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 13d79f5..bbf75fc 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -1810,7 +1810,7 @@ static int ethtool_set_value(struct net_device *dev, char __user *useraddr,
 	if (copy_from_user(&edata, useraddr, sizeof(edata)))
 		return -EFAULT;
 
-	return actor(dev, edata.data);
+	return actor(dev, edata.data & dev->hw_features);
 }
 
 static noinline_for_stack int ethtool_flash_device(struct net_device *dev,

Pls., comment.

Thanks,
vlad



> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 




^ permalink raw reply related

* Re: [PATCHv4] usbnet: Resubmit interrupt URB once if halted
From: Alan Stern @ 2011-04-21 18:48 UTC (permalink / raw)
  To: Paul Stewart; +Cc: netdev, USB list
In-Reply-To: <BANLkTinhmAfe1V2SvoY+J6Tu_DdnZMoYgw@mail.gmail.com>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: TEXT/PLAIN; charset=UTF-8, Size: 1715 bytes --]

On Thu, 21 Apr 2011, Paul Stewart wrote:

> > The driver needs better coordination between open/stop and
> > resume/suspend.  The interrupt and receive URBs are supposed to be
> > active whenever the interface is up and not suspended, right?  Which
> > means that usbnet_resume() shouldn't submit anything if the interface
> > isn't up.
> 
> How do we define "up" here (from a network perspective there are many
> ways to interpret that)?  How does this concept compare to the user's
> "ifconfig up/down" state?

I don't know the details of how network drivers are supposed to work.  
But it doesn't matter -- for your purposes you should define "up" to 
mean "whenever the URBs are supposed to be active (unless the interface 
is suspended)".

>  What call do I use in usbnet_resume() to
> tell that the interface isn't up?  Currently I'm using netif_running()
> which responds true in this condition, which is why I'm resorting to
> the flag.

Again, I don't know.  However, the URBs get submitted from within
usbnet_open() and killed within usbnet_stop(), right?  Therefore you
can use any condition which gets set to True in usbnet_open() and set
to False in usbnet_stop().  (If nothing else is suitable, use a flag of
your own.)  And be careful of the edge case: Since usbnet_open() itself
performs a resume operation, you need to make sure the resume takes
place before the condition becomes True -- otherwise the URBs will get
submitted twice.

One more thing to keep in mind: If the kernel is built without PM
support, the resume and suspend routines will never get called.  
Therefore they must not be the only places where URBs are submitted and
killed.

Alan Stern

^ permalink raw reply

* [PATCH] bonding: fix bridged bonds in 802.3ad mode
From: Jiri Bohac @ 2011-04-21 18:47 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Jay Vosburgh, Andy Gospodarek,
	Benjamin Poirier

802.3ad bonding inside a bridge is broken again. Originally fixed by
43aa1920117801fe9ae3d1fad886b62511e09bee, the bug was re-introduced by
1e253c3b8a1aeed51eef6fc366812f219b97de65.

LACP frames must not have their skb->dev changed by the bridging hook.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1514,6 +1514,11 @@ static rx_handler_result_t bond_handle_frame(struct sk_buff **pskb)
 		memcpy(eth_hdr(skb)->h_dest, bond->dev->dev_addr, ETH_ALEN);
 	}
 
+	/* prevent bridging code from mangling and forwarding LACP frames */
+	if (bond->params.mode == BOND_MODE_8023AD &&
+	    skb->protocol == htons(ETH_P_SLOW))
+		return RX_HANDLER_PASS;
+
 	return RX_HANDLER_ANOTHER;
 }
 
-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* RE: rtnetlink and many VFs
From: Rose, Gregory V @ 2011-04-21 18:28 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <1303409517.3165.39.camel@bwh-desktop>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Ben Hutchings
> Sent: Thursday, April 21, 2011 11:12 AM
> To: Rose, Gregory V
> Cc: David Miller; netdev; sf-linux-drivers
> Subject: RE: rtnetlink and many VFs
> 
> On Thu, 2011-04-21 at 10:50 -0700, Rose, Gregory V wrote:
> > > -----Original Message-----
> > > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> > > Sent: Thursday, April 21, 2011 10:40 AM
> > > To: Rose, Gregory V
> > > Cc: David Miller; netdev; sf-linux-drivers
> > > Subject: RE: rtnetlink and many VFs
> > >
> 
> > I still feel like eventually the number of VFs will outgrow the
> > capability of a single message to handle, especially when VFs will
> > have the capability of having multiple MAC address and VLAN filters
> > assigned to them.  And it seems orthogonal to me to mirror the 'ip
> > link set eth(x) vf (n)' syntax with a 'ip link show eth(x) vf (n)'
> > syntax.
> >
> > That's just me though.
> [...]
> 
> I think it would be a useful extension, but we have to keep the current
> API working as far as possible.

Sure, sounds good.  I'm  working on something that will patch things up for the present use case right now.

- Greg


^ permalink raw reply

* RE: rtnetlink and many VFs
From: Ben Hutchings @ 2011-04-21 18:11 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <43F901BD926A4E43B106BF17856F07550145FC75D6@orsmsx508.amr.corp.intel.com>

On Thu, 2011-04-21 at 10:50 -0700, Rose, Gregory V wrote:
> > -----Original Message-----
> > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> > Sent: Thursday, April 21, 2011 10:40 AM
> > To: Rose, Gregory V
> > Cc: David Miller; netdev; sf-linux-drivers
> > Subject: RE: rtnetlink and many VFs
> > 
> > On Thu, 2011-04-21 at 10:02 -0700, Rose, Gregory V wrote:
> > > > -----Original Message-----
> > > > From: netdev-owner@vger.kernel.org [mailto:netdev-
> > owner@vger.kernel.org]
> > > > On Behalf Of Ben Hutchings
> > > > Sent: Thursday, April 21, 2011 7:36 AM
> > > > To: David Miller
> > > > Cc: netdev; sf-linux-drivers
> > > > Subject: rtnetlink and many VFs
> > > >
> > >
> > > As more VFs become possible it really needs a fix.  I was thinking
> > > about something along the lines of this:
> > >
> > > # ip link show eth(x) vf (n)
> > >
> > > Where eth(x) is the physical function that owns the VFs and (n) is the
> > > specific VF you want information for.  That way one could easily
> > > script something that loops through the VFs and gets the information
> > > for each.  This really becomes necessary when we start adding
> > > additional MAC and VLAN filters for each VF that need to be displayed.
> > > In that case you can only show a few VFs before you run out of space.
> > 
> > I think that what 'ip link show' is doing now seems to be perfectly
> > valid.  It allocates a 16K buffer which would be enough if netlink
> > didn't apply this PAGE_SIZE limit to single messages.
> 
> Ah, I hadn't seen that it was allocating 16K, that would then be
> enough for 128 VFs but in the future would not be enough for 40Gig (or
> higher speed) devices that might support two or 4 times that many VFs.

There are only 256 available function addresses per PCIe link, so there
can be at most 255 VFs associated with a single PF.

(The function number could be extended further by effectively assigning
multiple bus numbers to a link, but that would be a significantly more
disruptive change than ARI.)

> I still feel like eventually the number of VFs will outgrow the
> capability of a single message to handle, especially when VFs will
> have the capability of having multiple MAC address and VLAN filters
> assigned to them.  And it seems orthogonal to me to mirror the 'ip
> link set eth(x) vf (n)' syntax with a 'ip link show eth(x) vf (n)'
> syntax.
> 
> That's just me though.
[...]

I think it would be a useful extension, but we have to keep the current
API working as far as possible.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] tg3: Convert u32 flag,flg2,flg3 uses to bitmap
From: Joe Perches @ 2011-04-21 17:51 UTC (permalink / raw)
  To: Michael Chan
  Cc: Matthew Carlson, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1303406280.8857.13.camel@HP1>

On Thu, 2011-04-21 at 10:18 -0700, Michael Chan wrote:
> On Wed, 2011-04-20 at 23:39 -0700, Joe Perches wrote:
> > @@ -4622,7 +4611,7 @@ static void tg3_tx_recover(struct tg3 *tp)
> >                     "and include system chipset information.\n");
> > 
> >         spin_lock(&tp->lock);
> > -       tp->tg3_flags |= TG3_FLAG_TX_RECOVERY_PENDING;
> > +       tg3_flag_set(tp, TX_RECOVERY_PENDING);
> >         spin_unlock(&tp->lock);
> >  }
> By using set_bit() now, we can eliminate the spin_lock() here.  This
> flag is checked much later when tg3_reset_task() is scheduled to run in
> workqueue, so no memory barrier is needed either.

Sure, but as a separate patch.

^ permalink raw reply

* RE: rtnetlink and many VFs
From: Rose, Gregory V @ 2011-04-21 17:50 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <1303407628.3165.30.camel@bwh-desktop>

> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Thursday, April 21, 2011 10:40 AM
> To: Rose, Gregory V
> Cc: David Miller; netdev; sf-linux-drivers
> Subject: RE: rtnetlink and many VFs
> 
> On Thu, 2011-04-21 at 10:02 -0700, Rose, Gregory V wrote:
> > > -----Original Message-----
> > > From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org]
> > > On Behalf Of Ben Hutchings
> > > Sent: Thursday, April 21, 2011 7:36 AM
> > > To: David Miller
> > > Cc: netdev; sf-linux-drivers
> > > Subject: rtnetlink and many VFs
> > >
> >
> > As more VFs become possible it really needs a fix.  I was thinking
> > about something along the lines of this:
> >
> > # ip link show eth(x) vf (n)
> >
> > Where eth(x) is the physical function that owns the VFs and (n) is the
> > specific VF you want information for.  That way one could easily
> > script something that loops through the VFs and gets the information
> > for each.  This really becomes necessary when we start adding
> > additional MAC and VLAN filters for each VF that need to be displayed.
> > In that case you can only show a few VFs before you run out of space.
> 
> I think that what 'ip link show' is doing now seems to be perfectly
> valid.  It allocates a 16K buffer which would be enough if netlink
> didn't apply this PAGE_SIZE limit to single messages.

Ah, I hadn't seen that it was allocating 16K, that would then be enough for 128 VFs but in the future would not be enough for 40Gig (or higher speed) devices that might support two or 4 times that many VFs.  I still feel like eventually the number of VFs will outgrow the capability of a single message to handle, especially when VFs will have the capability of having multiple MAC address and VLAN filters assigned to them.  And it seems orthogonal to me to mirror the 'ip link set eth(x) vf (n)' syntax with a 'ip link show eth(x) vf (n)' syntax.

That's just me though.

> 
> It is certainly a bug.  rtnetlink is the currently favoured API for
> querying and configuring network device settings, but there are now
> valid device settings that cannot be queried.

I'll look into just fixing the bug for now and reserve (at least what I consider to be) future improvements for later.

- Greg


^ permalink raw reply

* RE: rtnetlink and many VFs
From: Ben Hutchings @ 2011-04-21 17:40 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, netdev, sf-linux-drivers
In-Reply-To: <43F901BD926A4E43B106BF17856F07550145FC7526@orsmsx508.amr.corp.intel.com>

On Thu, 2011-04-21 at 10:02 -0700, Rose, Gregory V wrote:
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> > On Behalf Of Ben Hutchings
> > Sent: Thursday, April 21, 2011 7:36 AM
> > To: David Miller
> > Cc: netdev; sf-linux-drivers
> > Subject: rtnetlink and many VFs
> > 
> > My colleagues have been working on SR-IOV support for sfc.  The hardware
> > supports up to 127 VFs per port.
> > 
> > If we configure all 127 VFs through the net device, an RTM_GETLINK dump
> > will need to include messages describing them, with a total size of:
> > 
> > 127 * (sizeof(struct ifla_vf_mac) + sizeof(struct ifla_vf_vlan) +
> >        sizeof(struct ifla_vf_tx_rate) + protocol overhead)
> > > 7112
> > 
> > These messages are nested within the message describing the device as a
> > whole, so they cannot be split.  The maximum size of an outgoing netlink
> > message, based on NLMSG_GOODSIZE, seems to be min(PAGE_SIZE, 8192).  So
> > when PAGE_SIZE = 4096 it is simply impossible to dump information about
> > such a device!
> > 
> > I think it needs to be made possible to grow a netlink skb during
> > generation of the first message.  Userspace may still be unable to
> > receive the large message but at least it has a chance.
> 
> I've been looking at this one too.  The limit seems to be about 40 or
> so in the most common case.

Right.  When Steve Hodgson investigated this here, he found that 46 VFs
would fit.

> My netlink fu is weak but I've been looking at the code in iproute2/ip
> and netlink to see what we can do about it.
> 
> As more VFs become possible it really needs a fix.  I was thinking
> about something along the lines of this:
> 
> # ip link show eth(x) vf (n)
> 
> Where eth(x) is the physical function that owns the VFs and (n) is the
> specific VF you want information for.  That way one could easily
> script something that loops through the VFs and gets the information
> for each.  This really becomes necessary when we start adding
> additional MAC and VLAN filters for each VF that need to be displayed.
> In that case you can only show a few VFs before you run out of space.

I think that what 'ip link show' is doing now seems to be perfectly
valid.  It allocates a 16K buffer which would be enough if netlink
didn't apply this PAGE_SIZE limit to single messages.

> In any case I've been working on an RFC patch for this and hope to
> have it soon.  I consider this a pretty serious limitation and one
> could even view it as a bug.

It is certainly a bug.  rtnetlink is the currently favoured API for
querying and configuring network device settings, but there are now
valid device settings that cannot be queried.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH BUG-FIX] ipv6: udp: fix the wrong headroom check
From: David Miller @ 2011-04-21 17:39 UTC (permalink / raw)
  To: herbert; +Cc: shanwei, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <20110420105007.GA16248@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 20 Apr 2011 18:50:07 +0800

> On Wed, Apr 20, 2011 at 04:52:49PM +0800, Shan Wei wrote:
>> At this point, skb->data points to skb_transport_header.
>> So, headroom check is wrong. 
>> 
>> For some case:bridge(UFO is on) + eth device(UFO is off),
>> there is no enough headroom for IPv6 frag head.
>> But headroom check is always false.
>> 
>> This will bring about data be moved to there prior to skb->head,
>> when adding IPv6 frag header to skb.
>> 
>> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
> 
> Ouch.
> 
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6 0/3] new SCTP sockets APIs
From: David Miller @ 2011-04-21 17:36 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <4DABAEF1.1010506@cn.fujitsu.com>

From: Wei Yongjun <yjwei@cn.fujitsu.com>
Date: Mon, 18 Apr 2011 11:24:33 +0800

> This patchset implement some new sockets APIs for net-next-2.6.
> 
> Wei Yongjun (3):
>   sctp: implement socket option SCTP_GET_ASSOC_ID_LIST
>   sctp: change auth event type name to SCTP_AUTHENTICATION_EVENT
>   sctp: implement event notification SCTP_SENDER_DRY_EVENT

All 3 patches applied, thanks.

^ permalink raw reply

* [PATCH] [RFC] davinci_emac: don't WARN_ON cpdma_chan_submit -ENOMEM
From: Ben Gardiner @ 2011-04-21 17:30 UTC (permalink / raw)
  To: netdev; +Cc: davinci-linux-open-source, Sriramakrishnan A G

The current implementation of emac_rx_handler, when the host is
flooded, will result in a great deal of WARNs on the console; due to
the return value of cpdma_chan_submit. This function can error with
EINVAL and ENOMEM; the former when the channel is in an invalid
state, in which case the caller is in error. The latter when a cpdma
descriptor cannot be allocated.

When flooded, cpdma_chan_submit will return -ENOMEM;
treat the inability to allocate a cpdma descriptor as an rx error
similar in behaviour to when emac_rx_alloc returns NULL.

No Signed-off-by yet; not complete fix (see below).
CC: Sriramakrishnan A G <srk@ti.com>

---
I'm new to network drivers -- and kernel development, really. I'd be
happy to receive feedback on this approach of resolving the -ENOMEM
when flooded. Is there a more conventional approach? Shoud these frames
be recorded as 'dropped'?

Testing was performed on da850evm both with and without  "net:
davinci_emac: fix spinlock bug with dma channel cleanup" from
Sriramakrishnan A G applied. The behaviour was the same: the emac is
not able to receive any frames after being flooded -- but it can still
send. I would appreciate any insight into the potential causes of the
lockup.

Best Regards,
Ben Gardiner

Nanometrics Inc.
http://www.nanometrics.ca

---
 drivers/net/davinci_emac.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/davinci_emac.c b/drivers/net/davinci_emac.c
index 7018bfe..17c48d6 100644
--- a/drivers/net/davinci_emac.c
+++ b/drivers/net/davinci_emac.c
@@ -1037,8 +1037,12 @@ static void emac_rx_handler(void *token, int len, int status)
 recycle:
 	ret = cpdma_chan_submit(priv->rxchan, skb, skb->data,
 			skb_tailroom(skb), GFP_KERNEL);
-	if (WARN_ON(ret < 0))
+	WARN_ON(ret == -EINVAL);
+	if (ret < 0) {
+		if (netif_msg_rx_err(priv) && net_ratelimit())
+			dev_err(emac_dev, "failed cpdma submit\n");
 		dev_kfree_skb_any(skb);
+	}
 }

 static void emac_tx_handler(void *token, int len, int status)
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] tg3: Convert u32 flag,flg2,flg3 uses to bitmap
From: Michael Chan @ 2011-04-21 17:18 UTC (permalink / raw)
  To: Joe Perches
  Cc: Matthew Carlson, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <02bf2aa5c08514641ecbe7c39ef976918fad036c.1303367730.git.joe@perches.com>


On Wed, 2011-04-20 at 23:39 -0700, Joe Perches wrote:
> @@ -4622,7 +4611,7 @@ static void tg3_tx_recover(struct tg3 *tp)
>                     "and include system chipset information.\n");
> 
>         spin_lock(&tp->lock);
> -       tp->tg3_flags |= TG3_FLAG_TX_RECOVERY_PENDING;
> +       tg3_flag_set(tp, TX_RECOVERY_PENDING);
>         spin_unlock(&tp->lock);
>  }
> 

By using set_bit() now, we can eliminate the spin_lock() here.  This
flag is checked much later when tg3_reset_task() is scheduled to run in
workqueue, so no memory barrier is needed either.

Thanks.

^ permalink raw reply

* RE: rtnetlink and many VFs
From: Rose, Gregory V @ 2011-04-21 17:02 UTC (permalink / raw)
  To: Ben Hutchings, David Miller; +Cc: netdev, sf-linux-drivers
In-Reply-To: <1303396576.3165.13.camel@bwh-desktop>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Ben Hutchings
> Sent: Thursday, April 21, 2011 7:36 AM
> To: David Miller
> Cc: netdev; sf-linux-drivers
> Subject: rtnetlink and many VFs
> 
> My colleagues have been working on SR-IOV support for sfc.  The hardware
> supports up to 127 VFs per port.
> 
> If we configure all 127 VFs through the net device, an RTM_GETLINK dump
> will need to include messages describing them, with a total size of:
> 
> 127 * (sizeof(struct ifla_vf_mac) + sizeof(struct ifla_vf_vlan) +
>        sizeof(struct ifla_vf_tx_rate) + protocol overhead)
> > 7112
> 
> These messages are nested within the message describing the device as a
> whole, so they cannot be split.  The maximum size of an outgoing netlink
> message, based on NLMSG_GOODSIZE, seems to be min(PAGE_SIZE, 8192).  So
> when PAGE_SIZE = 4096 it is simply impossible to dump information about
> such a device!
> 
> I think it needs to be made possible to grow a netlink skb during
> generation of the first message.  Userspace may still be unable to
> receive the large message but at least it has a chance.

I've been looking at this one too.  The limit seems to be about 40 or so in the most common case.  My netlink fu is weak but I've been looking at the code in iproute2/ip and netlink to see what we can do about it.

As more VFs become possible it really needs a fix.  I was thinking about something along the lines of this:

# ip link show eth(x) vf (n)

Where eth(x) is the physical function that owns the VFs and (n) is the specific VF you want information for.  That way one could easily script something that loops through the VFs and gets the information for each.  This really becomes necessary when we start adding additional MAC and VLAN filters for each VF that need to be displayed.  In that case you can only show a few VFs before you run out of space.

In any case I've been working on an RFC patch for this and hope to have it soon.  I consider this a pretty serious limitation and one could even view it as a bug.

- Greg
Greg Rose
LAD Division
Intel Corp.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox