Netdev List

* AW: RFC: (now non Base64) replace packets already in queue
From: Erdt, Ralph @ 2012-07-18 14:50 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: netdev@vger.kernel.org, Eric Dumazet, Rick Jones
In-Reply-To: <4FF4A873.1000001@gmail.com>

Hello.

I'm sorry for the very late answer; I had top-priority family issues.

> I suggest you try and send a properly formated patch with your code, so
> that people here can have a look at it and evaluate the interest of
> integrating it into main line kernel.

Attached at the bottom of the email.


> That being said, I really think you should try to manage a userspace
> queue, [..] you can
> add many nice features into userspace to enhance the speed/quality 
> [..]
> And I really see your packet replacement system as one of those nice
> features and cannot imagine a good reason not to put it in userspace.

All these features are already done. E.g. we are using RoHC.
But we also want to use the TC stuff. It's already there, so why reimplement it?

Here is the patch. I couldn't find out which git tree I should use, so this patch is against Linux-2.6; sorry about that. Can you tell me which tree I have to use?
-------------
From 52f27fa2b0867de821af38c731c2ebc763afb1f1 Mon Sep 17 00:00:00 2001
From: Ralph Erdt <Ralph.Robert.Erdt@fkie.fraunhofer.de>
Date: Wed, 18 Jul 2012 16:43:44 +0200
Subject: [PATCH] RFC: TC qdisc "replace packet in queue"

This adds a new TC qdisc, which replaces packets in the queue. It
compares every incoming packet with all of the packets in the queue.
If the incoming and the compared packet meet all these conditions:
 - UDPv4
 - not fragmented
 - TOS like the given value(s)
 - same TOS
 - same source IP
 - same destination IP
 - same destination port
the packet in the queue will be replaced with the incoming packet.

The variable "overlimit" is the counter of replaced packets

Background:
In very low bandwidth networks (<=9.6Kbps, shared, etc.) it's hard
(rather: impossible) to get all packets sent.
But some of the packets contain information, which gets obsolete over
time. E.g. (GPS) positions, which will be sent periodically. If the
application sends a new packet while an old position packet is still in
the queue, the old packet is obsolete. This can be dropped. But just
dropping the old packet and queuing the new packet will result in never
sending a packet of this type. So this qdisc replaces the old packet with
the new one. The information gets the chance to get sent - with the
newest available information.

Code-Status:
RFC for discussing.
The configuration by "debug-fs" is ... not optimal. But following the
"little step" rules this is a first option. A configuration with tc will
be done later (if this patch becomes official).

Signed-off-by: Ralph Erdt <Ralph.Robert.Erdt@fkie.fraunhofer.de>
---
 net/sched/Kconfig  |   16 +++
 net/sched/Makefile |    1 +
 net/sched/sch_pr.c |  264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 281 insertions(+), 0 deletions(-)
 create mode 100644 net/sched/sch_pr.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e7a8976..e29ad48 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -308,6 +308,22 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_PR
+	tristate "Packet Replace"
+	help
+	  Say Y here if you want to use the "Packet Replace"
+	  packet scheduling algorithm.
+
+	  This qdisc will replace packets in the queue, if this is a packet
+	  from the same UDP stream (IP/Port).
+
+	  See the top of <file:net/sched/sch_pr.c> for more details.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_pr.
+
+	  If unsure, say N.
+
 comment "Classification"
 
 config NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5940a19..ef669ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_PR)	+= sch_pr.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_pr.c b/net/sched/sch_pr.c
new file mode 100644
index 0000000..5cbf8d8
--- /dev/null
+++ b/net/sched/sch_pr.c
@@ -0,0 +1,264 @@
+/*
+ * net/sched/sch_pr.c	"packet replace"
+ *
+ * Copyright (c) 2012 Fraunhofer FKIE, all rights reserved.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Ralph Erdt (Fraunhofer FKIE),
+ *                                 <ralph.robert.erdt@fkie.fraunhofer.de>
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+
+#include <linux/ip.h>
+#include <net/ip.h>
+#include <linux/debugfs.h>
+
+/*
+ * replace packet in queue
+ * ==========================
+ * This is a modified fifo queue (fifo by Alexey Kuznetsov).
+ *
+ * This qdisc compares every incoming packet with all of the packets in the
+ * queue.
+ * If the incoming and the compared packet meet all these conditions:
+ *  - UDPv4
+ *  - not fragmented
+ *  - TOS like the given value(s)
+ *  - same TOS
+ *  - same source IP
+ *  - same destination IP
+ *  - same destination port
+ * the packet in the queue will be replaced with the incoming packet.
+ *
+ * The variable "overlimit" is the counter of replaced packets
+ *
+ * Background:
+ * In very low bandwidth networks (<=9.6Kbps, shared, etc.) it's hard
+ * (rather: impossible) to get all packets sent.
+ * But some of the packets contain information, which gets obsolete over time.
+ * E.g. (GPS) positions, which will be sent periodically. If the application
+ * sends a new packet while an old position packet is still in the queue, the
+ * old packet is obsolete. This can be dropped. But just dropping the old
+ * packet and queuing the new packet will result in never sending a packet
+ * of this type.
+ * So this qdisc replaces the old packet with the new one. The information gets
+ * the chance to get sent - with the newest available information.
+ *
+ * DRAWBACKS:
+ * It's not very CPU cycle saving. But on very low bandwidth networks the
+ * application has to be careful with sending packets. And with a proper
+ * configuration, this will be OK.
+ */
+
+struct dentry *dgdir, *dgfile;
+
+#define TOSBITMASK 0
+#define TOSCOMPARE 1
+/* tos Flag. 1.: BitMask. 2.: Compare with */
+static u8 tos[] = {0xFF, 0xFF};
+
+bool pr_packet_to_work_with(struct sk_buff *pkt)
+{
+	struct iphdr *hdr;
+
+	if (unlikely(pkt->protocol != htons(ETH_P_IP)))
+		return false;
+
+	/* Only compare UDP - Layer 4 must be there */
+	if (unlikely(pkt->network_header == NULL))
+		return false;
+
+	hdr = ip_hdr(pkt);
+
+	/* Check for UDPv4 */
+	if (unlikely(hdr->protocol != IPPROTO_UDP))
+		return false;
+
+	/* no fragmented packets */
+	if (unlikely(ip_is_fragment(hdr)))
+		return false;
+
+	/* Correct TOS ? */
+	if ((hdr->tos & tos[TOSBITMASK]) != tos[TOSCOMPARE])
+		return false;
+
+	return true;
+}
+
+bool comp(struct sk_buff *a, struct sk_buff *b)
+{
+	struct iphdr *ah = NULL;
+	struct iphdr *bh = NULL;
+	u32 ipA, ipB;
+	u16 portsA, portsB;
+	int poff;
+	/* The packet has a header
+	 *  - the existence was already checked by "pr_packet_to_work_with" */
+	ah = ip_hdr(a);
+	bh = ip_hdr(b);
+
+	/* TOS must be the same */
+	if (ah->tos != bh->tos)
+		return false;
+
+	/* IP and Port must be the same */
+	ipA = (__force u32)ah->daddr;
+	ipB = (__force u32)bh->daddr;
+	if ((ipA != ipB))
+		return false;
+	ipA = (__force u32)ah->saddr;
+	ipB = (__force u32)bh->saddr;
+	if ((ipA != ipB))
+		return false;
+
+	poff = proto_ports_offset(IPPROTO_UDP);
+	if (unlikely(poff < 0))
+		/* This should be impossible.. */
+		return false;
+
+	/* Src Ports are always different - just compare destination ports */
+	portsA = *(u16 *)((void *)ah + ah->ihl * 4 + poff + 2);
+	portsB = *(u16 *)((void *)bh + bh->ihl * 4 + poff + 2);
+	if ((portsA != portsB))
+		return false;
+
+	return true;
+}
+
+static int pr_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct sk_buff *replace = NULL;
+
+	/* Search, if there is a packet with same IDs */
+	/* Only search, if this packet is worth it */
+	if (pr_packet_to_work_with(skb)) {
+		struct sk_buff *it;
+		skb_queue_walk((&(sch->q)), it) {
+			/* If the other packet is worth it? */
+			if (pr_packet_to_work_with(it)) {
+				if (comp(skb, it)) {
+					replace = it;
+					break;
+				}
+			}
+		}
+	}
+
+	if (replace == NULL) {
+		/* a new kind of packet. Just enqueue */
+		if (likely(skb_queue_len(&sch->q) < sch->limit))
+			return qdisc_enqueue_tail(skb, sch);
+		return qdisc_reshape_fail(skb, sch);
+	} else {
+		/* replace the packet */
+		sch->qstats.overlimits++;
+		/* There is no drop nor replace. So do the replace myself */
+		skb->next = replace->next;
+		skb->prev = replace->prev;
+		if (replace->next != NULL)
+			replace->next->prev = skb;
+		if (replace->prev != NULL)
+			replace->prev->next = skb;
+		kfree_skb(replace);
+		return NET_XMIT_SUCCESS;
+	}
+}
+
+static int pr_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	sch->flags |= TCQ_F_CAN_BYPASS; /* sounds good, but what? */
+	sch->limit = qdisc_dev(sch)->tx_queue_len ? : 1;
+	return 0;
+}
+
+static int pr_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct tc_fifo_qopt opt = { .limit = sch->limit };
+
+	if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	return -1;
+}
+
+struct Qdisc_ops pr_qdisc_ops __read_mostly = {
+	.id		=	"pr",
+	.priv_size	=	0,
+	.enqueue	=	pr_enqueue,
+	.dequeue	=	qdisc_dequeue_head,
+	.peek		=	qdisc_peek_head,
+	.drop		=	qdisc_queue_drop,
+	.init		=	pr_init,
+	.reset		=	qdisc_reset_queue,
+	.change		=	pr_init,
+	.dump		=	pr_dump,
+	.owner		=	THIS_MODULE,
+};
+EXPORT_SYMBOL(pr_qdisc_ops);
+
+/* DebugFS interface as first shot configuration */
+static ssize_t dg_read_file(struct file *file, char __user *userbuf,
+					size_t count, loff_t *ppos)
+{
+	return simple_read_from_buffer(userbuf, count, ppos, tos, 2);
+}
+
+static ssize_t dg_write_file(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	u8 tmp[] = {0xFF, 0xFF};
+	int res;
+	if (count != 2)
+		return -EINVAL;
+
+	res = copy_from_user(tmp, buf, count);
+	if (res != 0)
+		return -EINVAL;
+
+	/* Two bytes to copy.. for this a memcpy with errorhandling?!? */
+	tos[0] = tmp[0];
+	tos[1] = tmp[1];
+
+	return count;
+}
+
+static const struct file_operations dgfops = {
+	.read = dg_read_file,
+	.write = dg_write_file,
+};
+
+static int __init pr_module_init(void)
+{
+	int ret = register_qdisc(&pr_qdisc_ops);
+	if (!ret) {
+		/* open Communication channel */
+		dgdir = debugfs_create_dir("sch_pr", NULL);
+		dgfile = debugfs_create_file("tos", 0644, dgdir, tos, &dgfops);
+	}
+	return ret;
+}
+
+static void __exit pr_module_exit(void)
+{
+	debugfs_remove(dgfile);
+	debugfs_remove(dgdir);
+	unregister_qdisc(&pr_qdisc_ops);
+}
+
+module_init(pr_module_init);
+module_exit(pr_module_exit);
+MODULE_LICENSE("GPL");
-- 
1.7.7
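
The replace-in-queue behaviour described in the patch can be modelled in userspace. The following is only a minimal sketch, not the qdisc itself: it assumes a fixed-size array instead of the kernel's sk_buff list, and every name in it (demo_pkt, demo_queue, demo_enqueue, DEMO_QUEUE_LIMIT) is hypothetical. It shows the one property the commit message argues for: a newer packet of the same flow takes over the queue position of the older one instead of being appended at the tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields sch_pr compares. */
struct demo_pkt {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t dport;
	uint8_t tos;
	int payload;		/* e.g. a position sample */
};

#define DEMO_QUEUE_LIMIT 4

struct demo_queue {
	struct demo_pkt slot[DEMO_QUEUE_LIMIT];
	size_t len;
	unsigned int replaced;	/* plays the role of the "overlimit" counter */
};

/* Same TOS, same source IP, same destination IP, same destination port. */
static bool demo_same_flow(const struct demo_pkt *a, const struct demo_pkt *b)
{
	return a->tos == b->tos && a->saddr == b->saddr &&
	       a->daddr == b->daddr && a->dport == b->dport;
}

/*
 * Returns true if the packet was queued or replaced an older one,
 * false if the queue was full and the packet was dropped.
 */
static bool demo_enqueue(struct demo_queue *q, const struct demo_pkt *pkt)
{
	size_t i;

	assert(q != NULL && pkt != NULL);

	for (i = 0; i < q->len; i++) {
		if (demo_same_flow(&q->slot[i], pkt)) {
			q->slot[i] = *pkt;	/* keep the old queue position */
			q->replaced++;
			return true;
		}
	}

	if (q->len >= DEMO_QUEUE_LIMIT)
		return false;

	q->slot[q->len++] = *pkt;
	return true;
}
```

Enqueuing a position packet, then a packet of another flow, then a newer position packet leaves two entries in the queue, with the newer position in the first slot.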

* [PATCH net-next v3] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 14:27 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <1342620841.2626.2786.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

Introduce ipv6_addr_hash() helper doing a XOR on all bits
of an IPv6 address, with an optimized x86_64 version.

Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)

Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as existing hash was ignoring most of them.

Use it in sunrpc and use more bit shuffling, using hash_32().

As a cleanup, use it in net/ipv4/tcp_metrics.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Joe Perches <joe@perches.com>
---
v3: use the explicit cast in ipv6_addr_hash() as suggested by Joe

 include/net/ipv6.h        |   13 +++++++++++++
 net/core/flow_dissector.c |    5 +++--
 net/ipv4/tcp_metrics.c    |   15 +++------------
 net/ipv6/ip6_tunnel.c     |   20 ++++++++++++--------
 net/sunrpc/svcauth_unix.c |   22 ++++------------------
 5 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f695f39..01c34b3 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
 #endif
 }
 
+static inline u32 ipv6_addr_hash(const struct in6_addr *a)
+{
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
+	const unsigned long *ul = (const unsigned long *)a;
+	unsigned long x = ul[0] ^ ul[1];
+
+	return (u32)(x ^ (x >> 32));
+#else
+	return (__force u32)(a->s6_addr32[0] ^ a->s6_addr32[1] ^
+			     a->s6_addr32[2] ^ a->s6_addr32[3]);
+#endif
+}
+
 static inline bool ipv6_addr_loopback(const struct in6_addr *a)
 {
 	return (a->s6_addr32[0] | a->s6_addr32[1] |
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a225089..466820b 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -4,6 +4,7 @@
 #include <linux/ipv6.h>
 #include <linux/if_vlan.h>
 #include <net/ip.h>
+#include <net/ipv6.h>
 #include <linux/if_tunnel.h>
 #include <linux/if_pppox.h>
 #include <linux/ppp_defs.h>
@@ -55,8 +56,8 @@ ipv6:
 			return false;
 
 		ip_proto = iph->nexthdr;
-		flow->src = iph->saddr.s6_addr32[3];
-		flow->dst = iph->daddr.s6_addr32[3];
+		flow->src = (__force __be32)ipv6_addr_hash(&iph->saddr);
+		flow->dst = (__force __be32)ipv6_addr_hash(&iph->daddr);
 		nhoff += sizeof(struct ipv6hdr);
 		break;
 	}
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 5a38a2d..1a115b6 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -211,10 +211,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_rsk(req)->rmt_addr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_rsk(req)->rmt_addr);
 		break;
 	default:
 		return NULL;
@@ -251,10 +248,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	case AF_INET6:
 		tw6 = inet6_twsk((struct sock *)tw);
 		*(struct in6_addr *)addr.addr.a6 = tw6->tw_v6_daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&tw6->tw_v6_daddr);
 		break;
 	default:
 		return NULL;
@@ -291,10 +285,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_sk(sk)->daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_sk(sk)->daddr);
 		break;
 	default:
 		return NULL;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index db32846..9a1d5fe 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -40,6 +40,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/netfilter_ipv6.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
 
 #include <asm/uaccess.h>
 #include <linux/atomic.h>
@@ -70,11 +71,15 @@ MODULE_ALIAS_NETDEV("ip6tnl0");
 #define IPV6_TCLASS_MASK (IPV6_FLOWINFO_MASK & ~IPV6_FLOWLABEL_MASK)
 #define IPV6_TCLASS_SHIFT 20
 
-#define HASH_SIZE  32
+#define HASH_SIZE_SHIFT  5
+#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
-#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
-		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
-		    (HASH_SIZE - 1))
+static u32 HASH(const struct in6_addr *addr1, const struct in6_addr *addr2)
+{
+	u32 hash = ipv6_addr_hash(addr1) ^ ipv6_addr_hash(addr2);
+
+	return hash_32(hash, HASH_SIZE_SHIFT);
+}
 
 static int ip6_tnl_dev_init(struct net_device *dev);
 static void ip6_tnl_dev_setup(struct net_device *dev);
@@ -166,12 +171,11 @@ static inline void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
 static struct ip6_tnl *
 ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_addr *local)
 {
-	unsigned int h0 = HASH(remote);
-	unsigned int h1 = HASH(local);
+	unsigned int hash = HASH(remote, local);
 	struct ip6_tnl *t;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
-	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[h0 ^ h1]) {
+	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
 		if (ipv6_addr_equal(local, &t->parms.laddr) &&
 		    ipv6_addr_equal(remote, &t->parms.raddr) &&
 		    (t->dev->flags & IFF_UP))
@@ -205,7 +209,7 @@ ip6_tnl_bucket(struct ip6_tnl_net *ip6n, const struct ip6_tnl_parm *p)
 
 	if (!ipv6_addr_any(remote) || !ipv6_addr_any(local)) {
 		prio = 1;
-		h = HASH(remote) ^ HASH(local);
+		h = HASH(remote, local);
 	}
 	return &ip6n->tnls[prio][h];
 }
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 2777fa8..4d01292 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -104,23 +104,9 @@ static void ip_map_put(struct kref *kref)
 	kfree(im);
 }
 
-#if IP_HASHBITS == 8
-/* hash_long on a 64 bit machine is currently REALLY BAD for
- * IP addresses in reverse-endian (i.e. on a little-endian machine).
- * So use a trivial but reliable hash instead
- */
-static inline int hash_ip(__be32 ip)
-{
-	int hash = (__force u32)ip ^ ((__force u32)ip>>16);
-	return (hash ^ (hash>>8)) & 0xff;
-}
-#endif
-static inline int hash_ip6(struct in6_addr ip)
+static inline int hash_ip6(const struct in6_addr *ip)
 {
-	return (hash_ip(ip.s6_addr32[0]) ^
-		hash_ip(ip.s6_addr32[1]) ^
-		hash_ip(ip.s6_addr32[2]) ^
-		hash_ip(ip.s6_addr32[3]));
+	return hash_32(ipv6_addr_hash(ip), IP_HASHBITS);
 }
 static int ip_map_match(struct cache_head *corig, struct cache_head *cnew)
 {
@@ -301,7 +287,7 @@ static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
 	ip.m_addr = *addr;
 	ch = sunrpc_cache_lookup(cd, &ip.h,
 				 hash_str(class, IP_HASHBITS) ^
-				 hash_ip6(*addr));
+				 hash_ip6(addr));
 
 	if (ch)
 		return container_of(ch, struct ip_map, h);
@@ -331,7 +317,7 @@ static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
 	ip.h.expiry_time = expiry;
 	ch = sunrpc_cache_update(cd, &ip.h, &ipm->h,
 				 hash_str(ipm->m_class, IP_HASHBITS) ^
-				 hash_ip6(ipm->m_addr));
+				 hash_ip6(&ipm->m_addr));
 	if (!ch)
 		return -ENOMEM;
 	cache_put(ch, cd);
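
The two branches of ipv6_addr_hash() above should produce the same value, because XOR-ing the two 64-bit halves and then folding the result to 32 bits combines the same four 32-bit words, whatever the byte order. A minimal userspace sketch of that equivalence follows; the type and function names (demo_in6_addr, demo_hash_32bit, demo_hash_64bit) are hypothetical, and memcpy is used here instead of the pointer cast in the patch so that the example stays within standard C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct in6_addr. */
struct demo_in6_addr {
	uint32_t word[4];
};

/* Portable variant: XOR the four 32-bit words. */
static uint32_t demo_hash_32bit(const struct demo_in6_addr *a)
{
	assert(a != NULL);
	return a->word[0] ^ a->word[1] ^ a->word[2] ^ a->word[3];
}

/* 64-bit variant: XOR the two 64-bit halves, then fold 64 -> 32 bits. */
static uint32_t demo_hash_64bit(const struct demo_in6_addr *a)
{
	uint64_t half[2];
	uint64_t x;

	assert(a != NULL);
	memcpy(half, a->word, sizeof(half));	/* copy instead of casting */
	x = half[0] ^ half[1];
	return (uint32_t)(x ^ (x >> 32));
}
```

Both functions can be called on the same address and compared; addresses that differ only in the last word hash to different values, which the previous flow dissector code (using only s6_addr32[3]) also achieved, but differences in the other words are now taken into account as well.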

* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Neil Horman @ 2012-07-18 14:26 UTC (permalink / raw)
  To: John Fastabend
  Cc: Gao feng, eric.dumazet, linux-kernel, netdev, davem, Eric Dumazet,
	Rustad, Mark D
In-Reply-To: <5006C3CA.3010007@intel.com>

On Wed, Jul 18, 2012 at 07:10:18AM -0700, John Fastabend wrote:
> On 7/18/2012 5:21 AM, Neil Horman wrote:
> >On Tue, Jul 17, 2012 at 01:47:25PM -0700, John Fastabend wrote:
> >>On 7/12/2012 12:50 AM, Gao feng wrote:
> >>>there are some out of bound accesses in netprio cgroup.
> >>>
> >>>now before accessing the dev->priomap.priomap array,we only check
> >>>if the dev->priomap exist.and because we don't want to see
> >>>additional bound checkings in fast path, so we should make sure
> >>>that dev->priomap is null or array size of dev->priomap.priomap
> >>>is equal to max_prioidx + 1;
> >>>
> >>>so in write_priomap logic,we should call extend_netdev_table when
> >>>dev->priomap is null and dev->priomap.priomap_len < max_len.
> >>>and in cgrp_create->update_netdev_tables logic,we should call
> >>>extend_netdev_table only when dev->priomap exist and
> >>>dev->priomap.priomap_len < max_len.
> >>>
> >>>and it's not needed to call update_netdev_tables in write_priomap,
> >>>we can only allocate the net device's priomap which we change through
> >>>net_prio.ifpriomap.
> >>>
> >>>this patch also add a return value for update_netdev_tables &
> >>>extend_netdev_table, so when new_priomap is allocated failed,
> >>>write_priomap will stop to access the priomap,and return -ENOMEM
> >>>back to the userspace to tell the user what happend.
> >>>
> >>>Change From v3:
> >>>1. add rtnl protect when reading max_prioidx in write_priomap.
> >>>
> >>>2. only call extend_netdev_table when map->priomap_len < max_len,
> >>>    this will make sure array size of dev->map->priomap always
> >>>    bigger than any prioidx.
> >>>
> >>>3. add a function write_update_netdev_table to make codes clear.
> >>>
> >>>Change From v2:
> >>>1. protect extend_netdev_table by RTNL.
> >>>2. when extend_netdev_table failed,call dev_put to reduce device's refcount.
> >>>
> >>>Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> >>>Cc: Neil Horman <nhorman@tuxdriver.com>
> >>>Cc: Eric Dumazet <edumazet@google.com>
> >>>---
> >>>  net/core/netprio_cgroup.c |   71 ++++++++++++++++++++++++++++++++++-----------
> >>>  1 files changed, 54 insertions(+), 17 deletions(-)
> >>>
> >>
> >>[...]
> >>
> >>>+
> >>>+static int update_netdev_tables(void)
> >>>+{
> >>>+	int ret = 0;
> >>>  	struct net_device *dev;
> >>>-	u32 max_len = atomic_read(&max_prioidx) + 1;
> >>>+	u32 max_len;
> >>>  	struct netprio_map *map;
> >>
> >>
> >>need to check if net subsystem is initialized before we try
> >>to use it here...
> >>
> >>	if (some_check)     -> need to lookup what this check is
> >>		return ret;
> >>
> >>>
> >>>  	rtnl_lock();
> >>>+	max_len = atomic_read(&max_prioidx) + 1;
> >>>  	for_each_netdev(&init_net, dev) {
> >>>  		map = rtnl_dereference(dev->priomap);
> >>>-		if ((!map) ||
> >>>-		    (map->priomap_len < max_len))
> >>>-			extend_netdev_table(dev, max_len);
> >>>+		/*
> >>>+		 * don't allocate priomap if we didn't
> >>>+		 * change net_prio.ifpriomap (map == NULL),
> >>>+		 * this will speed up skb_update_prio.
> >>>+		 */
> >>>+		if (map && map->priomap_len < max_len) {
> >>>+			ret = extend_netdev_table(dev, max_len);
> >>>+			if (ret < 0)
> >>>+				break;
> >>>+		}
> >>>  	}
> >>>  	rtnl_unlock();
> >>>+	return ret;
> >>>  }
> >>>
> >>>  static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
> >>>  {
> >>>  	struct cgroup_netprio_state *cs;
> >>>-	int ret;
> >>>+	int ret = -EINVAL;
> >>>
> >>>  	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> >>>  	if (!cs)
> >>>  		return ERR_PTR(-ENOMEM);
> >>>
> >>>-	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
> >>>-		kfree(cs);
> >>>-		return ERR_PTR(-EINVAL);
> >>>-	}
> >>>+	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
> >>>+		goto out;
> >>>
> >>>  	ret = get_prioidx(&cs->prioidx);
> >>>-	if (ret != 0) {
> >>>+	if (ret < 0) {
> >>>  		pr_warn("No space in priority index array\n");
> >>>-		kfree(cs);
> >>>-		return ERR_PTR(ret);
> >>>+		goto out;
> >>>+	}
> >>>+
> >>>+	ret = update_netdev_tables();
> >>>+	if (ret < 0) {
> >>>+		put_prioidx(cs->prioidx);
> >>>+		goto out;
> >>>  	}
> >>
> >>Gao,
> >>
> >>This introduces a null ptr dereference when netprio_cgroup is built
> >>into the kernel because update_netdev_tables() depends on init_net.
> >>However cgrp_create is being called by cgroup_init before
> >>do_initcalls() is called and before net_dev_init().
> >>
> >>.John
> >>
> >Not sure I follow here John.  Shouldn't init_net be initialized prior to any
> >network devices getting registered?  In other words, shouldn't for_each_netdev
> >just result in zero iterations through the loop?
> >Neil
> >
> 
> init_net _is_ initialized prior to any network devices getting
> registered but not before cgrp_create called via cgroup_init.
> 
> #define for_each_netdev(net, d)         \
>                 list_for_each_entry(d, &(net)->dev_base_head, dev_list)
> 
> but dev_base_head is zeroed at this time. In netdev_init we have,
> 
>         INIT_LIST_HEAD(&net->dev_base_head);
> 
> but we haven't got that far yet because cgroup_init is called
> before do_initcalls().
> 
ok, I see that, and it makes sense, but at this point I'm more concerned with
cgroups getting initialized twice.  The early_init flag is clear in the
cgroup_subsystem for netprio, so we really shouldn't be getting initialized from
cgroup_init.  We should be getting initialized from the module_init() call that
we register.
Neil
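
The failure mode John describes (walking a list whose head is still zero-filled) can be illustrated outside the kernel. The sketch below is a userspace imitation with hypothetical names (demo_list_head, demo_init_list_head, demo_list_count). It assumes only what the quoted macro shows: the loop starts at head->next and stops when it is back at the head, so a head whose next pointer is NULL is dereferenced instead of yielding zero iterations. The guard on head->next is chosen purely for illustration and is not the check proposed in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace imitation of the kernel's circular list head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* An initialised, empty list points back at its own head. */
static void demo_init_list_head(struct demo_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* A zero-filled head has next == NULL and must not be walked. */
static bool demo_list_initialised(const struct demo_list_head *head)
{
	assert(head != NULL);
	return head->next != NULL;
}

/* Counts entries, or returns -1 instead of walking an uninitialised list. */
static int demo_list_count(const struct demo_list_head *head)
{
	const struct demo_list_head *pos;
	int n = 0;

	if (!demo_list_initialised(head))
		return -1;

	for (pos = head->next; pos != head; pos = pos->next)
		n++;
	return n;
}
```

With a zero-filled head the function reports -1; after demo_init_list_head() it reports 0, which is the "zero iterations" case Neil expected.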

> 
> 
> 
> 

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: John Fastabend @ 2012-07-18 14:21 UTC (permalink / raw)
  To: Neil Horman; +Cc: davem, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <20120718124539.GC25563@hmsreliant.think-freely.org>

On 7/18/2012 5:45 AM, Neil Horman wrote:
> On Tue, Jul 17, 2012 at 05:33:16PM -0700, John Fastabend wrote:
>> When the netprio cgroup is built in the kernel cgroup_init will call
>> cgrp_create which eventually calls update_netdev_tables. This is
>> being called before do_initcalls() so a null ptr dereference occurs
>> on init_net.
>>
>> This patch adds a check on init_net.count to verify the structure
>> has been initialized. The failure was introduced here,
>>
>> commit ef209f15980360f6945873df3cd710c5f62f2a3e
>> Author: Gao feng <gaofeng@cn.fujitsu.com>
>> Date:   Wed Jul 11 21:50:15 2012 +0000
>>
>>      net: cgroup: fix access the unallocated memory in netprio cgroup
>>
>> Tested with ping with netprio_cgroup as a module and built in.
>>
>> Marked RFC for now I think DaveM might have a reason why this needs
>> some improvement.
>>
>> Reported-by: Mark Rustad <mark.d.rustad@intel.com>
>> Cc: Neil Horman <nhorman@tuxdriver.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Gao feng <gaofeng@cn.fujitsu.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>
>>   net/core/netprio_cgroup.c |    3 +++
>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
>> index b2e9caa..e9fd7fd 100644
>> --- a/net/core/netprio_cgroup.c
>> +++ b/net/core/netprio_cgroup.c
>> @@ -116,6 +116,9 @@ static int update_netdev_tables(void)
>>   	u32 max_len;
>>   	struct netprio_map *map;
>>
>> +	if (!atomic_read(&init_net.count))
>> +		return ret;
>> +
>>   	rtnl_lock();
>>   	max_len = atomic_read(&max_prioidx) + 1;
>>   	for_each_netdev(&init_net, dev) {
>>
>>
>
> John, do you have a stack trace of this?  I'm having a hard time seeing how we
> get into this path prior to the network stack being initialized.

Mark had a partial trace

[    0.003455] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.005550] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
[    0.007165] Mount-cache hash table entries: 256
[    0.010289] Initializing cgroup subsys net_cls
[    0.010947] Initializing cgroup subsys net_prio
[    0.011039] BUG: unable to handle kernel NULL pointer dereference at 0000000000000828
[    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0


>
> It also brings up another point.  If this is happening, and we're creating the
> root cgroup from start_kernel, then we're actually initializing some cgroups
> twice, because a few cgroups register themselves via cgroup_load_subsys in
> module_init specified routines.  So if you're building netprio_cgroup or
> net_cls_cgroup as part of the monolithic kernel, you'll get cgroup_create called
> prior to your module_init() call.  That's not good.

Well, your module_init() wouldn't be called in this case, right? I think
netprio has a bug where we only register a netdevice notifier when
it's built as a module.

same issue with cls_cgroup and register_tcf_proto_ops?

>
> In fact, the cgroup_subsys struct has an early_init flag that cgroup_init
> appears to use to skip the initialization of subsystems that don't need to be
> initialized that early in boot (assuming that's the path we're going down to get
> to this oops).

Do you mean ss->early_init? Not sure that helps us: either we get called
by cgroup_init because we don't have an early_init callback, or we get
called via cgroup_init_early even earlier.

>
> If you can post the call stack, I'd appreciate it, I'd like to dig a bit deeper
> into this.

Yes I'll do this shortly.

> Neil
>

* [PATCH net-next 2/4] net/rps: Protect cpu_rmap.h from double inclusion
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/linux/cpu_rmap.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
index 473771a..ac3bbb5 100644
--- a/include/linux/cpu_rmap.h
+++ b/include/linux/cpu_rmap.h
@@ -1,3 +1,6 @@
+#ifndef __LINUX_CPU_RMAP_H
+#define __LINUX_CPU_RMAP_H
+
 /*
  * cpu_rmap.c: CPU affinity reverse-map support
  * Copyright 2011 Solarflare Communications Inc.
@@ -71,3 +74,4 @@ extern void free_irq_cpu_rmap(struct cpu_rmap *rmap);
 extern int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq);
 
 #endif
+#endif /* __LINUX_CPU_RMAP_H */
-- 
1.7.1
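
For readers unfamiliar with the idiom, the effect of the guard added above can be shown in a single file. This is a sketch only: the "header" body is pasted twice to simulate double inclusion, and DEMO_RMAP_H, demo_rmap and demo_rmap_size are hypothetical names, not taken from cpu_rmap.h. Without the guard, the second copy would redefine the struct and the function and the file would not compile.

```c
#include <assert.h>
#include <stddef.h>

/* First "inclusion" of a hypothetical header demo_rmap.h. */
#ifndef DEMO_RMAP_H
#define DEMO_RMAP_H

struct demo_rmap {
	unsigned short size;
};

static unsigned short demo_rmap_size(const struct demo_rmap *rmap)
{
	assert(rmap != NULL);
	return rmap->size;
}

#endif /* DEMO_RMAP_H */

/* Second "inclusion": DEMO_RMAP_H is already defined, so this is skipped. */
#ifndef DEMO_RMAP_H
#define DEMO_RMAP_H

struct demo_rmap {
	unsigned short size;
};

static unsigned short demo_rmap_size(const struct demo_rmap *rmap)
{
	assert(rmap != NULL);
	return rmap->size;
}

#endif /* DEMO_RMAP_H */
```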

* [PATCH net-next 0/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Or Gerlitz, Amir Vadai

Hi Dave, 

So now a pure Ethernet post from us...

This series from Amir Vadai adds support for Accelerated RFS 
to the mlx4_en Ethernet driver.

The code uses the Accelerated RFS infrastructure and HW flow steering 
to keep CPU affinity of rx interrupts and applications per TCP stream.

To do so, we had to add a little protection to cpu_rmap.h against double
inclusion. We also added linking between CPU and IRQ using rmap in the
mlx4_core driver.

Or.

Amir Vadai (4):
  net/mlx4: Move MAC_MASK to a common place
  net/rps: Protect cpu_rmap.h from double inclusion
  {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
  net/mlx4_en: Add accelerated RFS support

 drivers/infiniband/hw/mlx4/main.c                  |    3 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c         |    9 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    6 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |  316 ++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |    3 +
 drivers/net/ethernet/mellanox/mlx4/eq.c            |   12 +-
 drivers/net/ethernet/mellanox/mlx4/mcg.c           |    1 -
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |   16 +
 drivers/net/ethernet/mellanox/mlx4/port.c          |    1 -
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    3 +-
 include/linux/cpu_rmap.h                           |    4 +
 include/linux/mlx4/device.h                        |    4 +-
 include/linux/mlx4/driver.h                        |    2 +
 13 files changed, 369 insertions(+), 11 deletions(-)

CC: Amir Vadai <amirv@mellanox.com>
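
Patch 4/4 of this series describes keeping the installed filters in a hash-indexed database, so that repeated ndo callbacks for the same stream do not configure the HW twice. The idea can be sketched in userspace as follows. This is an assumption-laden illustration, not the driver code: all names (demo_filter, demo_filter_db, demo_filter_add), the table sizes and the hash function are hypothetical, the "HW programming" is only a counter, and filter expiry is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FILTER_HASH_SHIFT 4
#define DEMO_FILTER_BUCKETS (1u << DEMO_FILTER_HASH_SHIFT)
#define DEMO_FILTER_MAX 32

struct demo_filter {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	struct demo_filter *next;	/* chain within one bucket */
};

struct demo_filter_db {
	struct demo_filter *bucket[DEMO_FILTER_BUCKETS];
	struct demo_filter pool[DEMO_FILTER_MAX];
	size_t used;
	unsigned int hw_programmed;	/* how often "HW" was configured */
};

/* Illustrative hash only: mix the 4-tuple and fold it to a bucket index. */
static unsigned int demo_filter_hash(const struct demo_filter *f)
{
	uint32_t h = f->src_ip ^ f->dst_ip ^
		     (((uint32_t)f->src_port << 16) | f->dst_port);

	h ^= h >> 16;
	h ^= h >> 8;
	h ^= h >> 4;
	return h & (DEMO_FILTER_BUCKETS - 1);
}

static bool demo_filter_equal(const struct demo_filter *a,
			      const struct demo_filter *b)
{
	return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/* Returns true only when a new filter was stored and "programmed". */
static bool demo_filter_add(struct demo_filter_db *db,
			    const struct demo_filter *key)
{
	unsigned int idx;
	struct demo_filter *f;

	assert(db != NULL && key != NULL);
	idx = demo_filter_hash(key);

	for (f = db->bucket[idx]; f != NULL; f = f->next)
		if (demo_filter_equal(f, key))
			return false;	/* already known: skip HW setup */

	if (db->used >= DEMO_FILTER_MAX)
		return false;		/* pool exhausted */

	f = &db->pool[db->used++];
	*f = *key;
	f->next = db->bucket[idx];
	db->bucket[idx] = f;
	db->hw_programmed++;
	return true;
}
```

Adding the same 4-tuple twice programs the "HW" once; a second stream with a different source port is stored as a separate filter.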

* [PATCH net-next 4/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Use RFS infrastructure and flow steering in HW to keep CPU
affinity of rx interrupts and application per TCP stream.

A flow steering filter is added to the HW whenever the RFS
ndo callback is invoked by core networking code.

Because the invocation takes place in interrupt context, the
actual setup of HW is done using a workqueue. Whenever a new filter
is added, the driver checks for expiry of existing filters.

Since there is a window of time between the point where the core
RFS code invokes the ndo callback and the point where the HW
is configured from the workqueue context, the 2nd, 3rd, etc.
packets from that stream will cause the net core to invoke
the callback again and again.

To prevent inefficient/double configuration of the HW, the filters
are kept in a database which is indexed using a hash function to enable
fast access.
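
The find-or-create idea described above can be sketched in userspace roughly as follows. This is an illustration only, not the driver code: struct flow_filter, struct filter_table, flow_steer() and the mixing function are hypothetical, and the hw_updates counter merely stands in for queueing the work item.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FILTER_HASH_SHIFT 4			/* 16 buckets, as in the patch */
#define FILTER_BUCKETS    (1u << FILTER_HASH_SHIFT)

/* Hypothetical stand-in for one steering filter entry. */
struct flow_filter {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
	int rxq_index;			/* RX queue the flow is steered to */
	int id;				/* table-local filter id */
	struct flow_filter *next;	/* bucket chain */
};

struct filter_table {
	struct flow_filter *bucket[FILTER_BUCKETS];
	int last_id;
	int hw_updates;			/* counts simulated HW configurations */
};

static void filter_table_init(struct filter_table *t)
{
	memset(t, 0, sizeof(*t));
}

/* Illustrative mixing function; not the hash used by the driver. */
static unsigned int flow_bucket(uint32_t src_ip, uint32_t dst_ip,
				uint16_t src_port, uint16_t dst_port)
{
	uint32_t h = src_ip ^ dst_ip ^
		     (((uint32_t)src_port << 16) | dst_port);

	h *= 0x9e370001u;		/* multiplicative scramble */
	return h >> (32 - FILTER_HASH_SHIFT);
}

/*
 * Return the filter id for a flow, or -1 on allocation failure.
 * The (simulated) HW is touched only when the flow is new or its RX
 * queue changed; repeated callbacks for the same flow are absorbed.
 */
static int flow_steer(struct filter_table *t, uint32_t src_ip, uint32_t dst_ip,
		      uint16_t src_port, uint16_t dst_port, int rxq_index)
{
	unsigned int b = flow_bucket(src_ip, dst_ip, src_port, dst_port);
	struct flow_filter *f;

	assert(b < FILTER_BUCKETS);
	for (f = t->bucket[b]; f; f = f->next)
		if (f->src_ip == src_ip && f->dst_ip == dst_ip &&
		    f->src_port == src_port && f->dst_port == dst_port)
			break;

	if (f) {
		if (f->rxq_index == rxq_index)
			return f->id;	/* duplicate callback: nothing to do */
		f->rxq_index = rxq_index;
	} else {
		f = calloc(1, sizeof(*f));
		if (!f)
			return -1;
		f->src_ip = src_ip;
		f->dst_ip = dst_ip;
		f->src_port = src_port;
		f->dst_port = dst_port;
		f->rxq_index = rxq_index;
		f->id = t->last_id++;
		f->next = t->bucket[b];
		t->bucket[b] = f;
	}

	t->hw_updates++;		/* stands in for queue_work() */
	return f->id;
}
```

Calling flow_steer() twice with the same 4-tuple and queue returns the same id and bumps hw_updates only once, which is the behaviour the paragraph above is after.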

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c     |    8 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  316 ++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |    3 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   16 ++
 4 files changed, 342 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 0ef6156..2d6f1ba 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -77,6 +77,12 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 	struct mlx4_en_dev *mdev = priv->mdev;
 	int err = 0;
 	char name[25];
+	struct cpu_rmap *rmap =
+#ifdef CONFIG_CPU_RMAP
+		priv->dev->rx_cpu_rmap;
+#else
+		NULL;
+#endif
 
 	cq->dev = mdev->pndev[priv->port];
 	cq->mcq.set_ci_db  = cq->wqres.db.db;
@@ -91,7 +97,7 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 				sprintf(name, "%s-%d", priv->dev->name,
 					cq->ring);
 				/* Set IRQ for specific name (per ring) */
-				if (mlx4_assign_eq(mdev->dev, name, NULL,
+				if (mlx4_assign_eq(mdev->dev, name, rmap,
 						   &cq->vector)) {
 					cq->vector = (cq->ring + 1 + priv->port)
 					    % mdev->dev->caps.num_comp_vectors;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 4ce5ca8..8864d8b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -36,6 +36,8 @@
 #include <linux/if_vlan.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
+#include <net/ip.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/device.h>
@@ -66,6 +68,299 @@ static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 	return 0;
 }
 
+#ifdef CONFIG_RFS_ACCEL
+
+struct mlx4_en_filter {
+	struct list_head next;
+	struct work_struct work;
+
+	__be32 src_ip;
+	__be32 dst_ip;
+	__be16 src_port;
+	__be16 dst_port;
+
+	int rxq_index;
+	struct mlx4_en_priv *priv;
+	u32 flow_id;			/* RFS infrastructure id */
+	int id;				/* mlx4_en driver id */
+	u64 reg_id;			/* Flow steering API id */
+	u8 activated;			/* Used to prevent expiry before filter
+					 * is attached
+					 */
+	struct hlist_node filter_chain;
+};
+
+static void mlx4_en_filter_rfs_expire(struct mlx4_en_priv *priv);
+
+static void mlx4_en_filter_work(struct work_struct *work)
+{
+	struct mlx4_en_filter *filter = container_of(work,
+						     struct mlx4_en_filter,
+						     work);
+	struct mlx4_en_priv *priv = filter->priv;
+	struct mlx4_spec_list spec_tcp = {
+		.id = MLX4_NET_TRANS_RULE_ID_TCP,
+		{
+			.tcp_udp = {
+				.dst_port = filter->dst_port,
+				.dst_port_msk = (__force __be16)-1,
+				.src_port = filter->src_port,
+				.src_port_msk = (__force __be16)-1,
+			},
+		},
+	};
+	struct mlx4_spec_list spec_ip = {
+		.id = MLX4_NET_TRANS_RULE_ID_IPV4,
+		{
+			.ipv4 = {
+				.dst_ip = filter->dst_ip,
+				.dst_ip_msk = (__force __be32)-1,
+				.src_ip = filter->src_ip,
+				.src_ip_msk = (__force __be32)-1,
+			},
+		},
+	};
+	struct mlx4_spec_list spec_eth = {
+		.id = MLX4_NET_TRANS_RULE_ID_ETH,
+	};
+	struct mlx4_net_trans_rule rule = {
+		.list = LIST_HEAD_INIT(rule.list),
+		.queue_mode = MLX4_NET_TRANS_Q_LIFO,
+		.exclusive = 1,
+		.allow_loopback = 1,
+		.promisc_mode = MLX4_FS_PROMISC_NONE,
+		.port = priv->port,
+		.priority = MLX4_DOMAIN_RFS,
+	};
+	int rc;
+	__be64 mac;
+	__be64 mac_mask = cpu_to_be64(MLX4_MAC_MASK << 16);
+
+	list_add_tail(&spec_eth.list, &rule.list);
+	list_add_tail(&spec_ip.list, &rule.list);
+	list_add_tail(&spec_tcp.list, &rule.list);
+
+	mac = cpu_to_be64((priv->mac & MLX4_MAC_MASK) << 16);
+
+	rule.qpn = priv->rss_map.qps[filter->rxq_index].qpn;
+	memcpy(spec_eth.eth.dst_mac, &mac, ETH_ALEN);
+	memcpy(spec_eth.eth.dst_mac_msk, &mac_mask, ETH_ALEN);
+
+	filter->activated = 0;
+
+	if (filter->reg_id) {
+		rc = mlx4_flow_detach(priv->mdev->dev, filter->reg_id);
+		if (rc && rc != -ENOENT)
+			en_err(priv, "Error detaching flow. rc = %d\n", rc);
+	}
+
+	rc = mlx4_flow_attach(priv->mdev->dev, &rule, &filter->reg_id);
+	if (rc)
+		en_err(priv, "Error attaching flow. err = %d\n", rc);
+
+	mlx4_en_filter_rfs_expire(priv);
+
+	filter->activated = 1;
+}
+
+static inline struct hlist_head *
+filter_hash_bucket(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
+		   __be16 src_port, __be16 dst_port)
+{
+	unsigned long l;
+	int bucket_idx;
+
+	l = (__force unsigned long)src_port |
+	    ((__force unsigned long)dst_port << 2);
+	l ^= (__force unsigned long)(src_ip ^ dst_ip);
+
+	bucket_idx = hash_long(l, MLX4_EN_FILTER_HASH_SHIFT);
+
+	return &priv->filter_hash[bucket_idx];
+}
+
+static struct mlx4_en_filter *
+mlx4_en_filter_alloc(struct mlx4_en_priv *priv, int rxq_index, __be32 src_ip,
+		     __be32 dst_ip, __be16 src_port, __be16 dst_port,
+		     u32 flow_id)
+{
+	struct mlx4_en_filter *filter = NULL;
+
+	filter = kzalloc(sizeof(struct mlx4_en_filter), GFP_ATOMIC);
+	if (!filter)
+		return NULL;
+
+	filter->priv = priv;
+	filter->rxq_index = rxq_index;
+	INIT_WORK(&filter->work, mlx4_en_filter_work);
+
+	filter->src_ip = src_ip;
+	filter->dst_ip = dst_ip;
+	filter->src_port = src_port;
+	filter->dst_port = dst_port;
+
+	filter->flow_id = flow_id;
+
+	filter->id = priv->last_filter_id++;
+
+	list_add_tail(&filter->next, &priv->filters);
+	hlist_add_head(&filter->filter_chain,
+		       filter_hash_bucket(priv, src_ip, dst_ip, src_port,
+					  dst_port));
+
+	return filter;
+}
+
+static void mlx4_en_filter_free(struct mlx4_en_filter *filter)
+{
+	struct mlx4_en_priv *priv = filter->priv;
+	int rc;
+
+	list_del(&filter->next);
+
+	rc = mlx4_flow_detach(priv->mdev->dev, filter->reg_id);
+	if (rc && rc != -ENOENT)
+		en_err(priv, "Error detaching flow. rc = %d\n", rc);
+
+	kfree(filter);
+}
+
+static inline struct mlx4_en_filter *
+mlx4_en_filter_find(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
+		    __be16 src_port, __be16 dst_port)
+{
+	struct hlist_node *elem;
+	struct mlx4_en_filter *filter;
+	struct mlx4_en_filter *ret = NULL;
+
+	hlist_for_each_entry(filter, elem,
+			     filter_hash_bucket(priv, src_ip, dst_ip,
+						src_port, dst_port),
+			     filter_chain) {
+		if (filter->src_ip == src_ip &&
+		    filter->dst_ip == dst_ip &&
+		    filter->src_port == src_port &&
+		    filter->dst_port == dst_port) {
+			ret = filter;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int
+mlx4_en_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+		   u16 rxq_index, u32 flow_id)
+{
+	struct mlx4_en_priv *priv = netdev_priv(net_dev);
+	struct mlx4_en_filter *filter;
+	const struct iphdr *ip;
+	const __be16 *ports;
+	__be32 src_ip;
+	__be32 dst_ip;
+	__be16 src_port;
+	__be16 dst_port;
+	int nhoff = skb_network_offset(skb);
+	int ret = 0;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		return -EPROTONOSUPPORT;
+
+	ip = (const struct iphdr *)(skb->data + nhoff);
+	if (ip_is_fragment(ip))
+		return -EPROTONOSUPPORT;
+
+	ports = (const __be16 *)(skb->data + nhoff + 4 * ip->ihl);
+
+	src_ip = ip->saddr;
+	dst_ip = ip->daddr;
+	src_port = ports[0];
+	dst_port = ports[1];
+
+	if (ip->protocol != IPPROTO_TCP)
+		return -EPROTONOSUPPORT;
+
+	spin_lock_bh(&priv->filters_lock);
+	filter = mlx4_en_filter_find(priv, src_ip, dst_ip, src_port, dst_port);
+	if (filter) {
+		if (filter->rxq_index == rxq_index)
+			goto out;
+
+		filter->rxq_index = rxq_index;
+	} else {
+		filter = mlx4_en_filter_alloc(priv, rxq_index,
+					      src_ip, dst_ip,
+					      src_port, dst_port, flow_id);
+		if (!filter) {
+			ret = -ENOMEM;
+			goto err;
+		}
+	}
+
+	queue_work(priv->mdev->workqueue, &filter->work);
+
+out:
+	ret = filter->id;
+err:
+	spin_unlock_bh(&priv->filters_lock);
+
+	return ret;
+}
+
+void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
+			     struct mlx4_en_rx_ring *rx_ring)
+{
+	struct mlx4_en_filter *filter, *tmp;
+	LIST_HEAD(del_list);
+
+	spin_lock_bh(&priv->filters_lock);
+	list_for_each_entry_safe(filter, tmp, &priv->filters, next) {
+		list_move(&filter->next, &del_list);
+		hlist_del(&filter->filter_chain);
+	}
+	spin_unlock_bh(&priv->filters_lock);
+
+	list_for_each_entry_safe(filter, tmp, &del_list, next) {
+		cancel_work_sync(&filter->work);
+		mlx4_en_filter_free(filter);
+	}
+}
+
+static void mlx4_en_filter_rfs_expire(struct mlx4_en_priv *priv)
+{
+	struct mlx4_en_filter *filter = NULL, *tmp, *last_filter = NULL;
+	LIST_HEAD(del_list);
+	int i = 0;
+
+	spin_lock_bh(&priv->filters_lock);
+	list_for_each_entry_safe(filter, tmp, &priv->filters, next) {
+		if (i > MLX4_EN_FILTER_EXPIRY_QUOTA)
+			break;
+
+		if (filter->activated &&
+		    !work_pending(&filter->work) &&
+		    rps_may_expire_flow(priv->dev,
+					filter->rxq_index, filter->flow_id,
+					filter->id)) {
+			list_move(&filter->next, &del_list);
+			hlist_del(&filter->filter_chain);
+		} else
+			last_filter = filter;
+
+		i++;
+	}
+
+	if (last_filter && (&last_filter->next != priv->filters.next))
+		list_move(&priv->filters, &last_filter->next);
+
+	spin_unlock_bh(&priv->filters_lock);
+
+	list_for_each_entry_safe(filter, tmp, &del_list, next)
+		mlx4_en_filter_free(filter);
+}
+#endif
+
 static int mlx4_en_vlan_rx_add_vid(struct net_device *dev, unsigned short vid)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1079,6 +1374,11 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 {
 	int i;
 
+#ifdef CONFIG_RFS_ACCEL
+	free_irq_cpu_rmap(priv->dev->rx_cpu_rmap);
+	priv->dev->rx_cpu_rmap = NULL;
+#endif
+
 	for (i = 0; i < priv->tx_ring_num; i++) {
 		if (priv->tx_ring[i].tx_info)
 			mlx4_en_destroy_tx_ring(priv, &priv->tx_ring[i]);
@@ -1134,6 +1434,15 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
 			goto err;
 	}
 
+#ifdef CONFIG_RFS_ACCEL
+	priv->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(priv->rx_ring_num);
+	if (!priv->dev->rx_cpu_rmap)
+		goto err;
+
+	INIT_LIST_HEAD(&priv->filters);
+	spin_lock_init(&priv->filters_lock);
+#endif
+
 	return 0;
 
 err:
@@ -1241,6 +1550,9 @@ static const struct net_device_ops mlx4_netdev_ops = {
 #endif
 	.ndo_set_features	= mlx4_en_set_features,
 	.ndo_setup_tc		= mlx4_en_setup_tc,
+#ifdef CONFIG_RFS_ACCEL
+	.ndo_rx_flow_steer	= mlx4_en_filter_rfs,
+#endif
 };
 
 int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
@@ -1358,6 +1670,10 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 			NETIF_F_HW_VLAN_FILTER;
 	dev->hw_features |= NETIF_F_LOOPBACK;
 
+	if (mdev->dev->caps.steering_mode ==
+	    MLX4_STEERING_MODE_DEVICE_MANAGED)
+		dev->hw_features |= NETIF_F_NTUPLE;
+
 	mdev->pndev[port] = dev;
 
 	netif_carrier_off(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a04cbf7..796cd58 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -389,6 +389,9 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
 	vfree(ring->rx_info);
 	ring->rx_info = NULL;
+#ifdef CONFIG_RFS_ACCEL
+	mlx4_en_cleanup_filters(priv, ring);
+#endif
 }
 
 void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a126321..af34c98 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -43,6 +43,7 @@
 #ifdef CONFIG_MLX4_EN_DCB
 #include <linux/dcbnl.h>
 #endif
+#include <linux/cpu_rmap.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/qp.h>
@@ -77,6 +78,9 @@
 #define STATS_DELAY		(HZ / 4)
 #define MAX_NUM_OF_FS_RULES	256
 
+#define MLX4_EN_FILTER_HASH_SHIFT 4
+#define MLX4_EN_FILTER_EXPIRY_QUOTA 60
+
 /* Typical TSO descriptor with 16 gather entries is 352 bytes... */
 #define MAX_DESC_SIZE		512
 #define MAX_DESC_TXBBS		(MAX_DESC_SIZE / TXBB_SIZE)
@@ -523,6 +527,13 @@ struct mlx4_en_priv {
 	struct ieee_ets ets;
 	u16 maxrate[IEEE_8021QAZ_MAX_TCS];
 #endif
+#ifdef CONFIG_RFS_ACCEL
+	spinlock_t filters_lock;
+	int last_filter_id;
+	struct list_head filters;
+	struct hlist_head filter_hash[1 << MLX4_EN_FILTER_HASH_SHIFT];
+#endif
+
 };
 
 enum mlx4_en_wol {
@@ -602,6 +613,11 @@ int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 extern const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops;
 #endif
 
+#ifdef CONFIG_RFS_ACCEL
+void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
+			     struct mlx4_en_rx_ring *rx_ring);
+#endif
+
 #define MLX4_EN_NUM_SELF_TEST	5
 void mlx4_en_ex_selftest(struct net_device *dev, u32 *flags, u64 *buf);
 u64 mlx4_en_mac_to_u64(u8 *addr);
-- 
1.7.1

^ permalink raw reply related

* [PATCH net-next 3/4] {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.
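
As a rough illustration of the optional-pointer convention described above, consider the userspace sketch below. struct irq_tracker, tracker_add() and assign_vector() are hypothetical stand-ins and do not reflect the real cpu_rmap or mlx4 APIs; the point is only that passing NULL skips the tracking step.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 8

/* Hypothetical stand-in for a reverse map that records IRQ numbers. */
struct irq_tracker {
	int irq[MAX_TRACKED];
	int used;
};

static int tracker_add(struct irq_tracker *t, int irq)
{
	if (t->used >= MAX_TRACKED)
		return -1;
	t->irq[t->used++] = irq;
	return 0;
}

/*
 * Hand out a vector for the given IRQ. The IRQ is recorded only when
 * the caller supplies a tracker; callers that do not care pass NULL.
 */
static int assign_vector(int irq, struct irq_tracker *tracker, int *vector)
{
	assert(vector != NULL);

	if (tracker)
		(void)tracker_add(tracker, irq);	/* best effort */

	*vector = irq;		/* simplified: vector equals IRQ here */
	return 0;
}
```

Existing callers keep their behaviour by passing NULL, which is what the InfiniBand call site in this patch does.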

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c          |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/eq.c    |   12 +++++++++++-
 include/linux/mlx4/device.h                |    4 +++-
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 8a3a203..a07b774 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1159,7 +1159,8 @@ static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
 			sprintf(name, "mlx4-ib-%d-%d@%s",
 				i, j, dev->pdev->bus->name);
 			/* Set IRQ for specific name (per ring) */
-			if (mlx4_assign_eq(dev, name, &ibdev->eq_table[eq])) {
+			if (mlx4_assign_eq(dev, name, NULL,
+					   &ibdev->eq_table[eq])) {
 				/* Use legacy (same as mlx4_en driver) */
 				pr_warn("Can't allocate EQ %d; reverting to legacy\n", eq);
 				ibdev->eq_table[eq] =
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 908a460..0ef6156 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -91,7 +91,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 				sprintf(name, "%s-%d", priv->dev->name,
 					cq->ring);
 				/* Set IRQ for specific name (per ring) */
-				if (mlx4_assign_eq(mdev->dev, name, &cq->vector)) {
+				if (mlx4_assign_eq(mdev->dev, name, NULL,
+						   &cq->vector)) {
 					cq->vector = (cq->ring + 1 + priv->port)
 					    % mdev->dev->caps.num_comp_vectors;
 					mlx4_warn(mdev, "Failed Assigning an EQ to "
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index bce98d9..12c3ed2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -39,6 +39,7 @@
 #include <linux/dma-mapping.h>
 
 #include <linux/mlx4/cmd.h>
+#include <linux/cpu_rmap.h>
 
 #include "mlx4.h"
 #include "fw.h"
@@ -1060,7 +1061,8 @@ int mlx4_test_interrupts(struct mlx4_dev *dev)
 }
 EXPORT_SYMBOL(mlx4_test_interrupts);
 
-int mlx4_assign_eq(struct mlx4_dev *dev, char* name, int * vector)
+int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
+		   int *vector)
 {
 
 	struct mlx4_priv *priv = mlx4_priv(dev);
@@ -1074,6 +1076,14 @@ int mlx4_assign_eq(struct mlx4_dev *dev, char* name, int * vector)
 			snprintf(priv->eq_table.irq_names +
 					vec * MLX4_IRQNAME_SIZE,
 					MLX4_IRQNAME_SIZE, "%s", name);
+#ifdef CONFIG_CPU_RMAP
+			if (rmap) {
+				err = irq_cpu_rmap_add(rmap,
+						       priv->eq_table.eq[vec].irq);
+				if (err)
+					mlx4_warn(dev, "Failed adding irq rmap\n");
+			}
+#endif
 			err = request_irq(priv->eq_table.eq[vec].irq,
 					  mlx4_msi_x_interrupt, 0,
 					  &priv->eq_table.irq_names[vec<<5],
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 6f0d133..4d7761f 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -36,6 +36,7 @@
 #include <linux/pci.h>
 #include <linux/completion.h>
 #include <linux/radix-tree.h>
+#include <linux/cpu_rmap.h>
 
 #include <linux/atomic.h>
 
@@ -784,7 +785,8 @@ void mlx4_fmr_unmap(struct mlx4_dev *dev, struct mlx4_fmr *fmr,
 int mlx4_fmr_free(struct mlx4_dev *dev, struct mlx4_fmr *fmr);
 int mlx4_SYNC_TPT(struct mlx4_dev *dev);
 int mlx4_test_interrupts(struct mlx4_dev *dev);
-int mlx4_assign_eq(struct mlx4_dev *dev, char* name , int* vector);
+int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
+		   int *vector);
 void mlx4_release_eq(struct mlx4_dev *dev, int vec);
 
 int mlx4_wol_read(struct mlx4_dev *dev, u64 *config, int port);
-- 
1.7.1

^ permalink raw reply related

* [PATCH net-next 1/4] net/mlx4: Move MAC_MASK to a common place
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Define this macro in one common place instead of duplicating it across the code.
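
To show why the mask is combined with a 16-bit shift at its users, here is a small userspace sketch. mac_to_wire() is a hypothetical helper, and the big-endian conversion is written out byte by byte instead of using cpu_to_be64(). Note that the diff in this message also removes a local definition in resource_tracker.c that had a different value (0x7fffffffffffffffULL).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_MASK 0xffffffffffffULL	/* value of MLX4_MAC_MASK in the patch */
#define MAC_LEN  6

/*
 * Convert a MAC held in the low 48 bits of a u64 into 6 wire-order
 * bytes. Shifting left by 16 moves the MAC to the top of the 64-bit
 * word, so its big-endian representation starts with the MAC bytes.
 */
static void mac_to_wire(uint64_t mac, uint8_t out[MAC_LEN])
{
	uint64_t v = (mac & MAC_MASK) << 16;
	uint8_t be[8];
	int i;

	for (i = 0; i < 8; i++)
		be[i] = (uint8_t)(v >> (56 - 8 * i));	/* big-endian order */

	assert(be[6] == 0 && be[7] == 0);	/* padding left by the shift */
	memcpy(out, be, MAC_LEN);
}
```

With this layout a plain memcpy of the first 6 bytes yields the MAC in transmission order, and any bits above bit 47 in the input are discarded by the mask.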

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    6 +++---
 drivers/net/ethernet/mellanox/mlx4/mcg.c           |    1 -
 drivers/net/ethernet/mellanox/mlx4/port.c          |    1 -
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    3 +--
 include/linux/mlx4/driver.h                        |    2 ++
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index dd6a77b..9d0b88e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -34,12 +34,12 @@
 #include <linux/kernel.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/mlx4/driver.h>
 
 #include "mlx4_en.h"
 #include "en_port.h"
 
 #define EN_ETHTOOL_QP_ATTACH (1ull << 63)
-#define EN_ETHTOOL_MAC_MASK 0xffffffffffffULL
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0xffff)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0xffffffff)
 
@@ -751,7 +751,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	struct ethhdr *eth_spec;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_spec_list *spec_l2;
-	__be64 mac_msk = cpu_to_be64(EN_ETHTOOL_MAC_MASK << 16);
+	__be64 mac_msk = cpu_to_be64(MLX4_MAC_MASK << 16);
 
 	err = mlx4_en_validate_flow(dev, cmd);
 	if (err)
@@ -761,7 +761,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	if (!spec_l2)
 		return -ENOMEM;
 
-	mac = priv->mac & EN_ETHTOOL_MAC_MASK;
+	mac = priv->mac & MLX4_MAC_MASK;
 	be_mac = cpu_to_be64(mac << 16);
 
 	spec_l2->id = MLX4_NET_TRANS_RULE_ID_ETH;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 5bac0df..4ec3835 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -41,7 +41,6 @@
 
 #define MGM_QPN_MASK       0x00FFFFFF
 #define MGM_BLCK_LB_BIT    30
-#define MLX4_MAC_MASK	   0xffffffffffffULL
 
 static const u8 zero_gid[16];	/* automatically initialized to 0 */
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index a51d1b9..028833f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -39,7 +39,6 @@
 #include "mlx4.h"
 
 #define MLX4_MAC_VALID		(1ull << 63)
-#define MLX4_MAC_MASK		0xffffffffffffULL
 
 #define MLX4_VLAN_VALID		(1u << 31)
 #define MLX4_VLAN_MASK		0xfff
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index c3fa919..94ceddd 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -41,13 +41,12 @@
 #include <linux/slab.h>
 #include <linux/mlx4/cmd.h>
 #include <linux/mlx4/qp.h>
+#include <linux/if_ether.h>
 
 #include "mlx4.h"
 #include "fw.h"
 
 #define MLX4_MAC_VALID		(1ull << 63)
-#define MLX4_MAC_MASK		0x7fffffffffffffffULL
-#define ETH_ALEN		6
 
 struct mac_res {
 	struct list_head list;
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 5f1298b..8dc485f 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -37,6 +37,8 @@
 
 struct mlx4_dev;
 
+#define MLX4_MAC_MASK	   0xffffffffffffULL
+
 enum mlx4_dev_event {
 	MLX4_DEV_EVENT_CATASTROPHIC_ERROR,
 	MLX4_DEV_EVENT_PORT_UP,
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 14:14 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <1342619879.9551.14.camel@joe2Laptop>

On Wed, 2012-07-18 at 06:57 -0700, Joe Perches wrote:
> On Wed, 2012-07-18 at 14:08 +0200, Eric Dumazet wrote:
> > Introduce ipv6_addr_hash() helper doing a XOR on all bits
> > of an IPv6 address, with an optimized x86_64 version.
> []
> > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> []
> > @@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
> >  #endif
> >  }
> >  
> > +static inline u32 ipv6_addr_hash(const struct in6_addr *a)
> > +{
> > +#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
> > +	const unsigned long *ul = (const unsigned long *)a;
> > +	unsigned long x = ul[0] ^ ul[1];
> > +
> > +	return x ^ (x >> 32);
> 
> Thanks Eric.
> 
> Perhaps this would be better with an explicit rather
> than implicit cast.

In fact, returning an "unsigned long" here might give more shuffling
capabilities on 64bit arches, thanks to hash_long()

but hash_long() on 64bit sounds a bit expensive for our needs...
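
For illustration, the fold with an explicit cast could look like the userspace sketch below. It uses memcpy() instead of pointer casts, and the function names are made up; it is not the code from the patch. Both paths XOR the same four 32-bit words, so they agree on either endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable path: XOR the four 32-bit words of a 16-byte address. */
static uint32_t addr_hash_words(const uint8_t addr[16])
{
	uint32_t w[4];

	memcpy(w, addr, sizeof(w));	/* sidesteps alignment and aliasing */
	return w[0] ^ w[1] ^ w[2] ^ w[3];
}

/* 64-bit path: XOR the two halves, then fold to 32 bits. */
static uint32_t addr_hash_fold64(const uint8_t addr[16])
{
	uint64_t q[2], x;

	memcpy(q, addr, sizeof(q));
	x = q[0] ^ q[1];
	return (uint32_t)(x ^ (x >> 32));	/* truncation made explicit */
}

/* Wrapper that checks the two paths against each other. */
static uint32_t addr_hash(const uint8_t addr[16])
{
	uint32_t h = addr_hash_fold64(addr);

	assert(h == addr_hash_words(addr));
	return h;
}
```

The cast does not change the generated result; it only documents that the upper 32 bits are dropped on purpose.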

^ permalink raw reply

* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: John Fastabend @ 2012-07-18 14:10 UTC (permalink / raw)
  To: Neil Horman
  Cc: Gao feng, eric.dumazet, linux-kernel, netdev, davem, Eric Dumazet,
	Rustad, Mark D
In-Reply-To: <20120718122106.GB25563@hmsreliant.think-freely.org>

On 7/18/2012 5:21 AM, Neil Horman wrote:
> On Tue, Jul 17, 2012 at 01:47:25PM -0700, John Fastabend wrote:
>> On 7/12/2012 12:50 AM, Gao feng wrote:
>>> there are some out of bound accesses in netprio cgroup.
>>>
>>> now before accessing the dev->priomap.priomap array,we only check
>>> if the dev->priomap exist.and because we don't want to see
>>> additional bound checkings in fast path, so we should make sure
>>> that dev->priomap is null or array size of dev->priomap.priomap
>>> is equal to max_prioidx + 1;
>>>
>>> so in write_priomap logic,we should call extend_netdev_table when
>>> dev->priomap is null and dev->priomap.priomap_len < max_len.
>>> and in cgrp_create->update_netdev_tables logic,we should call
>>> extend_netdev_table only when dev->priomap exist and
>>> dev->priomap.priomap_len < max_len.
>>>
>>> and it's not needed to call update_netdev_tables in write_priomap,
>>> we can only allocate the net device's priomap which we change through
>>> net_prio.ifpriomap.
>>>
>>> this patch also add a return value for update_netdev_tables &
>>> extend_netdev_table, so when new_priomap is allocated failed,
>>> write_priomap will stop to access the priomap,and return -ENOMEM
>>> back to the userspace to tell the user what happend.
>>>
>>> Change From v3:
>>> 1. add rtnl protect when reading max_prioidx in write_priomap.
>>>
>>> 2. only call extend_netdev_table when map->priomap_len < max_len,
>>>     this will make sure array size of dev->map->priomap always
>>>     bigger than any prioidx.
>>>
>>> 3. add a function write_update_netdev_table to make codes clear.
>>>
>>> Change From v2:
>>> 1. protect extend_netdev_table by RTNL.
>>> 2. when extend_netdev_table failed,call dev_put to reduce device's refcount.
>>>
>>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>>> Cc: Neil Horman <nhorman@tuxdriver.com>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> ---
>>>   net/core/netprio_cgroup.c |   71 ++++++++++++++++++++++++++++++++++-----------
>>>   1 files changed, 54 insertions(+), 17 deletions(-)
>>>
>>
>> [...]
>>
>>> +
>>> +static int update_netdev_tables(void)
>>> +{
>>> +	int ret = 0;
>>>   	struct net_device *dev;
>>> -	u32 max_len = atomic_read(&max_prioidx) + 1;
>>> +	u32 max_len;
>>>   	struct netprio_map *map;
>>
>>
>> need to check if net subsystem is initialized before we try
>> to use it here...
>>
>> 	if (some_check)     -> need to lookup what this check is
>> 		return ret;
>>
>>>
>>>   	rtnl_lock();
>>> +	max_len = atomic_read(&max_prioidx) + 1;
>>>   	for_each_netdev(&init_net, dev) {
>>>   		map = rtnl_dereference(dev->priomap);
>>> -		if ((!map) ||
>>> -		    (map->priomap_len < max_len))
>>> -			extend_netdev_table(dev, max_len);
>>> +		/*
>>> +		 * don't allocate priomap if we didn't
>>> +		 * change net_prio.ifpriomap (map == NULL),
>>> +		 * this will speed up skb_update_prio.
>>> +		 */
>>> +		if (map && map->priomap_len < max_len) {
>>> +			ret = extend_netdev_table(dev, max_len);
>>> +			if (ret < 0)
>>> +				break;
>>> +		}
>>>   	}
>>>   	rtnl_unlock();
>>> +	return ret;
>>>   }
>>>
>>>   static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
>>>   {
>>>   	struct cgroup_netprio_state *cs;
>>> -	int ret;
>>> +	int ret = -EINVAL;
>>>
>>>   	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
>>>   	if (!cs)
>>>   		return ERR_PTR(-ENOMEM);
>>>
>>> -	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
>>> -		kfree(cs);
>>> -		return ERR_PTR(-EINVAL);
>>> -	}
>>> +	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
>>> +		goto out;
>>>
>>>   	ret = get_prioidx(&cs->prioidx);
>>> -	if (ret != 0) {
>>> +	if (ret < 0) {
>>>   		pr_warn("No space in priority index array\n");
>>> -		kfree(cs);
>>> -		return ERR_PTR(ret);
>>> +		goto out;
>>> +	}
>>> +
>>> +	ret = update_netdev_tables();
>>> +	if (ret < 0) {
>>> +		put_prioidx(cs->prioidx);
>>> +		goto out;
>>>   	}
>>
>> Gao,
>>
>> This introduces a null ptr dereference when netprio_cgroup is built
>> into the kernel because update_netdev_tables() depends on init_net.
>> However cgrp_create is being called by cgroup_init before
>> do_initcalls() is called and before net_dev_init().
>>
>> .John
>>
> Not sure I follow here John.  Shouldn't init_net be initialized prior to any
> network devices getting registered?  In other words, shouldn't for_each_netdev
> just result in zero iterations through the loop?
> Neil
>

init_net _is_ initialized prior to any network devices getting
registered, but not before cgrp_create is called via cgroup_init.

#define for_each_netdev(net, d)         \
                 list_for_each_entry(d, &(net)->dev_base_head, dev_list)

but dev_base_head is zeroed at this time. In netdev_init we have,

         INIT_LIST_HEAD(&net->dev_base_head);

but we haven't got that far yet because cgroup_init is called
before do_initcalls().
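
A userspace sketch of the difference, using a minimal re-implementation of the list head (not the kernel's list.h), is below. An initialized empty head points at itself, so a walk ends immediately. A zeroed head has next == NULL, which is not equal to the head, so a walk would start from a NULL-derived pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Same effect as INIT_LIST_HEAD(): an empty list points at itself. */
static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static int list_head_is_initialized(const struct list_head *h)
{
	return h->next != NULL && h->prev != NULL;
}

/*
 * Count entries by walking until we are back at the head. The walk is
 * only safe on an initialized head, hence the assertion.
 */
static int list_count(const struct list_head *head)
{
	const struct list_head *p;
	int n = 0;

	assert(head->next != NULL);	/* catches the zeroed-head case */
	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

On an initialized head list_count() returns 0 without touching any entry, which is the "zero iterations" case Neil describes; the zeroed head is the case that trips the assertion here and dereferences a bad pointer in the kernel.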

^ permalink raw reply

* RE: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 14:06 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B6F93@saturn3.aculab.com>

From: Eric Dumazet <edumazet@google.com>

On Wed, 2012-07-18 at 13:28 +0100, David Laight wrote:
> >  #define HASH_SIZE  32
> > 
> > -#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
> > -		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
> > -		    (HASH_SIZE - 1))
> > +#define HASH(addr) (ipv6_addr_hash(addr) & (HASH_SIZE - 1))
> 
> That hash doesn't seem to include many variable bits at all!
> Especially on LE systems where it doesn't contain any of
> the low bits of a mac address based IPv6 address.
> 

Good point.

Apparently nobody uses a lot of ipv6 tunnels ;)

Thanks

[PATCH net-next v2] ipv6: add ipv6_addr_hash() helper

Introduce ipv6_addr_hash() helper doing a XOR on all bits
of an IPv6 address, with an optimized x86_64 version.

Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)

Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as existing hash was ignoring most of them.

Use it in sunrpc and use more bit shuffling, using hash_32().

As cleanup, use it in net/ipv4/tcp_metrics.c
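
To illustrate what "more bit shuffling" buys, the sketch below compares picking a bucket by masking the low bits with a multiplicative hash that keeps the top bits. The constant is assumed to match GOLDEN_RATIO_PRIME_32 from include/linux/hash.h of this era, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_BITS 5			/* 32 buckets */
#define GOLDEN_RATIO_32 0x9e370001u	/* assumed GOLDEN_RATIO_PRIME_32 */

/* Bucket from the low bits only: high-order input bits are ignored. */
static uint32_t bucket_by_mask(uint32_t h)
{
	return h & ((1u << HASH_BITS) - 1);
}

/* Multiply, then keep the top bits, in the style of hash_32(). */
static uint32_t bucket_by_mult(uint32_t h)
{
	uint32_t b = (h * GOLDEN_RATIO_32) >> (32 - HASH_BITS);

	assert(b < (1u << HASH_BITS));
	return b;
}
```

For example, 0x00000001 and 0x10000001 differ only in a high bit: masking puts both in bucket 1, while the multiplicative version separates them.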

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
---
 include/net/ipv6.h        |   13 +++++++++++++
 net/core/flow_dissector.c |    5 +++--
 net/ipv4/tcp_metrics.c    |   15 +++------------
 net/ipv6/ip6_tunnel.c     |   20 ++++++++++++--------
 net/sunrpc/svcauth_unix.c |   22 ++++------------------
 5 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f695f39..56ff725 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
 #endif
 }
 
+static inline u32 ipv6_addr_hash(const struct in6_addr *a)
+{
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
+	const unsigned long *ul = (const unsigned long *)a;
+	unsigned long x = ul[0] ^ ul[1];
+
+	return x ^ (x >> 32);
+#else
+	return (__force u32)(a->s6_addr32[0] ^ a->s6_addr32[1] ^
+			     a->s6_addr32[2] ^ a->s6_addr32[3]);
+#endif
+}
+
 static inline bool ipv6_addr_loopback(const struct in6_addr *a)
 {
 	return (a->s6_addr32[0] | a->s6_addr32[1] |
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a225089..466820b 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -4,6 +4,7 @@
 #include <linux/ipv6.h>
 #include <linux/if_vlan.h>
 #include <net/ip.h>
+#include <net/ipv6.h>
 #include <linux/if_tunnel.h>
 #include <linux/if_pppox.h>
 #include <linux/ppp_defs.h>
@@ -55,8 +56,8 @@ ipv6:
 			return false;
 
 		ip_proto = iph->nexthdr;
-		flow->src = iph->saddr.s6_addr32[3];
-		flow->dst = iph->daddr.s6_addr32[3];
+		flow->src = (__force __be32)ipv6_addr_hash(&iph->saddr);
+		flow->dst = (__force __be32)ipv6_addr_hash(&iph->daddr);
 		nhoff += sizeof(struct ipv6hdr);
 		break;
 	}
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 5a38a2d..1a115b6 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -211,10 +211,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_rsk(req)->rmt_addr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_rsk(req)->rmt_addr);
 		break;
 	default:
 		return NULL;
@@ -251,10 +248,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	case AF_INET6:
 		tw6 = inet6_twsk((struct sock *)tw);
 		*(struct in6_addr *)addr.addr.a6 = tw6->tw_v6_daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&tw6->tw_v6_daddr);
 		break;
 	default:
 		return NULL;
@@ -291,10 +285,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_sk(sk)->daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_sk(sk)->daddr);
 		break;
 	default:
 		return NULL;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index db32846..9a1d5fe 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -40,6 +40,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/netfilter_ipv6.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
 
 #include <asm/uaccess.h>
 #include <linux/atomic.h>
@@ -70,11 +71,15 @@ MODULE_ALIAS_NETDEV("ip6tnl0");
 #define IPV6_TCLASS_MASK (IPV6_FLOWINFO_MASK & ~IPV6_FLOWLABEL_MASK)
 #define IPV6_TCLASS_SHIFT 20
 
-#define HASH_SIZE  32
+#define HASH_SIZE_SHIFT  5
+#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
-#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
-		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
-		    (HASH_SIZE - 1))
+static u32 HASH(const struct in6_addr *addr1, const struct in6_addr *addr2)
+{
+	u32 hash = ipv6_addr_hash(addr1) ^ ipv6_addr_hash(addr2);
+
+	return hash_32(hash, HASH_SIZE_SHIFT);
+}
 
 static int ip6_tnl_dev_init(struct net_device *dev);
 static void ip6_tnl_dev_setup(struct net_device *dev);
@@ -166,12 +171,11 @@ static inline void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
 static struct ip6_tnl *
 ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_addr *local)
 {
-	unsigned int h0 = HASH(remote);
-	unsigned int h1 = HASH(local);
+	unsigned int hash = HASH(remote, local);
 	struct ip6_tnl *t;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
-	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[h0 ^ h1]) {
+	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
 		if (ipv6_addr_equal(local, &t->parms.laddr) &&
 		    ipv6_addr_equal(remote, &t->parms.raddr) &&
 		    (t->dev->flags & IFF_UP))
@@ -205,7 +209,7 @@ ip6_tnl_bucket(struct ip6_tnl_net *ip6n, const struct ip6_tnl_parm *p)
 
 	if (!ipv6_addr_any(remote) || !ipv6_addr_any(local)) {
 		prio = 1;
-		h = HASH(remote) ^ HASH(local);
+		h = HASH(remote, local);
 	}
 	return &ip6n->tnls[prio][h];
 }
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 2777fa8..4d01292 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -104,23 +104,9 @@ static void ip_map_put(struct kref *kref)
 	kfree(im);
 }
 
-#if IP_HASHBITS == 8
-/* hash_long on a 64 bit machine is currently REALLY BAD for
- * IP addresses in reverse-endian (i.e. on a little-endian machine).
- * So use a trivial but reliable hash instead
- */
-static inline int hash_ip(__be32 ip)
-{
-	int hash = (__force u32)ip ^ ((__force u32)ip>>16);
-	return (hash ^ (hash>>8)) & 0xff;
-}
-#endif
-static inline int hash_ip6(struct in6_addr ip)
+static inline int hash_ip6(const struct in6_addr *ip)
 {
-	return (hash_ip(ip.s6_addr32[0]) ^
-		hash_ip(ip.s6_addr32[1]) ^
-		hash_ip(ip.s6_addr32[2]) ^
-		hash_ip(ip.s6_addr32[3]));
+	return hash_32(ipv6_addr_hash(ip), IP_HASHBITS);
 }
 static int ip_map_match(struct cache_head *corig, struct cache_head *cnew)
 {
@@ -301,7 +287,7 @@ static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
 	ip.m_addr = *addr;
 	ch = sunrpc_cache_lookup(cd, &ip.h,
 				 hash_str(class, IP_HASHBITS) ^
-				 hash_ip6(*addr));
+				 hash_ip6(addr));
 
 	if (ch)
 		return container_of(ch, struct ip_map, h);
@@ -331,7 +317,7 @@ static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
 	ip.h.expiry_time = expiry;
 	ch = sunrpc_cache_update(cd, &ip.h, &ipm->h,
 				 hash_str(ipm->m_class, IP_HASHBITS) ^
-				 hash_ip6(ipm->m_addr));
+				 hash_ip6(&ipm->m_addr));
 	if (!ch)
 		return -ENOMEM;
 	cache_put(ch, cd);

* Re: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: Joe Perches @ 2012-07-18 13:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <1342613334.2626.2504.camel@edumazet-glaptop>

On Wed, 2012-07-18 at 14:08 +0200, Eric Dumazet wrote:
> Introduce ipv6_addr_hash() helper doing a XOR on all bits
> of an IPv6 address, with an optimized x86_64 version.
[]
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
[]
> @@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
>  #endif
>  }
>  
> +static inline u32 ipv6_addr_hash(const struct in6_addr *a)
> +{
> +#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
> +	const unsigned long *ul = (const unsigned long *)a;
> +	unsigned long x = ul[0] ^ ul[1];
> +
> +	return x ^ (x >> 32);

Thanks Eric.

Perhaps this would be better with an explicit rather
than implicit cast.
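
One way to read the suggestion above, as a userspace sketch rather than the kernel code: the 64-bit fold is narrowed to 32 bits with a visible cast instead of relying on the implicit conversion at the return statement. The names addr128, addr_hash64 and addr_hash32 are made up for illustration, and memcpy stands in for the kernel's direct pointer cast.

```c
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for struct in6_addr: 16 raw address bytes. */
struct addr128 {
	uint8_t b[16];
};

/* XOR-fold a 128-bit address to 32 bits by way of two 64-bit words. */
static uint32_t addr_hash64(const struct addr128 *a)
{
	uint64_t w[2];
	uint64_t x;

	memcpy(w, a->b, sizeof(w));	/* sidesteps alignment and aliasing */
	x = w[0] ^ w[1];

	/* Explicit narrowing: keep the low 32 bits of the folded value. */
	return (uint32_t)(x ^ (x >> 32));
}

/* Reference version: plain XOR of the four 32-bit words. */
static uint32_t addr_hash32(const struct addr128 *a)
{
	uint32_t w[4];

	memcpy(w, a->b, sizeof(w));
	return w[0] ^ w[1] ^ w[2] ^ w[3];
}
```

Both functions should return the same value for any address on either byte order, since the low half of x ^ (x >> 32) is the XOR of all four 32-bit words.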

* Re: That's pretty much it for 3.5.0
From: Neil Horman @ 2012-07-18 13:04 UTC (permalink / raw)
  To: John Fastabend, h
  Cc: David Miller, mark.d.rustad, netdev, linux-wireless,
	netfilter-devel
In-Reply-To: <5005F4F9.6010208@intel.com>

On Tue, Jul 17, 2012 at 04:27:53PM -0700, John Fastabend wrote:
> On 7/17/2012 3:18 PM, David Miller wrote:
> >From: John Fastabend <john.r.fastabend@intel.com>
> >Date: Tue, 17 Jul 2012 15:13:36 -0700
> >
> >>Perhaps the easiest way is to check net->count this should be zero
> >>until setup_net is called.
> >>
> >>if (!atomic_read(&init_net.count))
> >>	return ret;
> >>
> >
> >Won't work, setup_net() runs via a pure_initcall().
> >
> 
> Why not must have missed something? cgroup_init() and
> cgroup_early_init() both run before _initcall() routines are
> called via kernel_init() so this will stop the update in
> netprio from occurring.
> 
> And I don't see any race elsewhere for this.
John, can you post the backtrace you got for this?  I replied to the patch that
you posted for this fix.  The cgroup subsystem has an early_init flag that's
supposed to prevent the initialization of cgroups that don't need initialization
until later (like via module_init() calls).
Neil

* Re: [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: Paul Moore @ 2012-07-18 13:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120717.143137.1363542154253711667.davem@davemloft.net>

On Tuesday, July 17, 2012 02:31:37 PM David Miller wrote:
> From: Paul Moore <pmoore@redhat.com>
> Date: Tue, 17 Jul 2012 17:24:50 -0400
> 
> > David, if you don't queue this up for them, let me know and I'll resend it
> > to stable once it hits Linus' tree.
> 
> I will be sure to queue it up to -stable when I apply it, as I always
> do for appropriate patches.

Okay, thanks.  I just wanted to make sure it hit -stable one way or another.

-- 
paul moore
security and virtualization @ redhat

* Problem facing with ipv6 router advertisement in kernel-3.0.26
From: BALAKUMARAN KANNAN @ 2012-07-18 13:02 UTC (permalink / raw)
  To: netdev@vger.kernel.org

       As per the RFC, the curhoplimit of a router advertisement (RA) should be used as the hop limit for the routing table entry, and the hop limit of a route should not be affected if an RA arrives for that route with a curhoplimit of 0. But with kernel-3.0.26 the value is changed to 255. This happens only in the routing cache (#ip -6 route show cached), not in the main routing table (#ip -6 route show). However, if I disable the 'multiple routing tables' option while building the kernel, the main routing table is affected as well.

	Timer expiry for a routing table entry: my gc_interval is 30 seconds and I am receiving an RA with a preferred lifetime of 20 seconds. The routing table entry expires and is removed from the routing table after 20 seconds, but the cache keeps the expired entry until the next flush.

	As far as I know, the MTU value is not stored in the routing table: I can see the MTU value in the routing cache but not in the main routing table. So in my case, once the routing cache is flushed, the kernel does not fragment packets according to the MTU value even though the timer has not expired. I don't know whether I am right or wrong, so I will describe what is happening. I have a router connected to a target board. The router sends an RA with MTU 1200 and then an ICMP_REQUEST of length 1500. It expects the target board to send the ICMP_REPLY fragmented into two packets (1200 and 300). The target board does this correctly while the cache entry is present, but once the cache is flushed, it fails.

Note: I am using Kernel-3.0.26 for arm architecture.

Thank you,

--Regards,
K.Balakumaran
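
For reference, the fragment sizes expected in the MTU scenario can be worked out with a short sketch. It assumes the 1500 in the report is the total IPv6 packet length, that the only unfragmentable header is the 40-byte IPv6 header, and that each fragment carries an 8-byte fragment header with non-final fragments holding a multiple of 8 data bytes. The function name ipv6_fragment_sizes is made up for illustration and is not a kernel function.

```c
#define IPV6_HDR_LEN 40	/* fixed IPv6 header */
#define FRAG_HDR_LEN 8	/* IPv6 fragment extension header */

/*
 * Fill sizes[] with the on-wire size of each fragment of a packet of
 * pkt_len bytes sent over a path with the given MTU.
 * Returns the number of fragments, or -1 on bad input or overflow.
 */
static int ipv6_fragment_sizes(int pkt_len, int mtu, int *sizes, int max)
{
	int payload = pkt_len - IPV6_HDR_LEN;
	/* Data bytes per non-final fragment, rounded down to 8. */
	int chunk = (mtu - IPV6_HDR_LEN - FRAG_HDR_LEN) & ~7;
	int n = 0;

	if (payload < 0 || chunk <= 0 || max < 1)
		return -1;

	if (pkt_len <= mtu) {		/* fits: no fragmentation needed */
		sizes[0] = pkt_len;
		return 1;
	}

	while (payload > 0) {
		int data = payload < chunk ? payload : chunk;

		if (n == max)
			return -1;
		sizes[n++] = IPV6_HDR_LEN + FRAG_HDR_LEN + data;
		payload -= data;
	}
	return n;
}
```

Under these assumptions a 1500-byte packet over an MTU of 1200 splits into two fragments of 1200 and 356 bytes on the wire, so the "1200 and 300" in the description is approximate once the per-fragment headers are counted.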

* [PATCH net-next 2/2] be2net: Ignore physical link async event for Lancer
From: Padmanabh Ratnakar @ 2012-07-18 12:52 UTC (permalink / raw)
  To: netdev; +Cc: Padmanabh Ratnakar

The ability of driver to transmit packets depends on logical state
of the link. Ignore physical link status.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_cmds.c |    5 +++++
 drivers/net/ethernet/emulex/benet/be_cmds.h |    1 +
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index 7932490..7fac97b 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -141,6 +141,11 @@ static void be_async_link_state_process(struct be_adapter *adapter,
 	/* When link status changes, link speed must be re-queried from FW */
 	adapter->phy.link_speed = -1;
 
+	/* Ignore physical link event */
+	if (lancer_chip(adapter) &&
+	    !(evt->port_link_status & LOGICAL_LINK_STATUS_MASK))
+		return;
+
 	/* For the initial link status do not rely on the ASYNC event as
 	 * it may not be received in some cases.
 	 */
diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.h b/drivers/net/ethernet/emulex/benet/be_cmds.h
index d5a4ded..250f19b 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.h
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.h
@@ -93,6 +93,7 @@ enum {
 	LINK_UP		= 0x1
 };
 #define LINK_STATUS_MASK			0x1
+#define LOGICAL_LINK_STATUS_MASK		0x2
 
 /* When the event code of an async trailer is link-state, the mcc_compl
  * must be interpreted as follows
-- 
1.6.0.2

* [PATCH net-next 1/2] be2net: Fix VF driver load for Lancer
From: Padmanabh Ratnakar @ 2012-07-18 12:51 UTC (permalink / raw)
  To: netdev; +Cc: Padmanabh Ratnakar

Lancer FW has added new capability checks for VFs.
Driver should only use those capabilities which are allowed for VFs.

Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be_cmds.c |    3 ++-
 drivers/net/ethernet/emulex/benet/be_main.c |    7 +++++++
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index ddfca65..7932490 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -1631,7 +1631,8 @@ int be_cmd_rx_filter(struct be_adapter *adapter, u32 flags, u32 value)
 		/* Reset mcast promisc mode if already set by setting mask
 		 * and not setting flags field
 		 */
-		req->if_flags_mask |=
+		if (!lancer_chip(adapter) || be_physfn(adapter))
+			req->if_flags_mask |=
 				cpu_to_le32(BE_IF_FLAGS_MCAST_PROMISCUOUS);
 
 		req->mcast_num = cpu_to_le32(netdev_mc_count(adapter->netdev));
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index f18375c..4d96771 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2772,6 +2772,13 @@ static int be_setup(struct be_adapter *adapter)
 		en_flags |= BE_IF_FLAGS_RSS;
 	}
 
+	if (lancer_chip(adapter) && !be_physfn(adapter)) {
+		en_flags = BE_IF_FLAGS_UNTAGGED |
+			    BE_IF_FLAGS_BROADCAST |
+			    BE_IF_FLAGS_MULTICAST;
+		cap_flags = en_flags;
+	}
+
 	status = be_cmd_if_create(adapter, cap_flags, en_flags,
 				  &adapter->if_handle, 0);
 	if (status != 0)
-- 
1.6.0.2

* [PATCH net-next 0/2] be2net fixes
From: Padmanabh Ratnakar @ 2012-07-18 12:51 UTC (permalink / raw)
  To: netdev; +Cc: Padmanabh Ratnakar

Driver fixes for recent Lancer FW changes.
Please apply.
Thanks,
Padmanabh

Padmanabh Ratnakar (2):
  be2net: Fix VF driver load for Lancer
  be2net: Ignore physical link async event for Lancer

 drivers/net/ethernet/emulex/benet/be_cmds.c |    8 +++++++-
 drivers/net/ethernet/emulex/benet/be_cmds.h |    1 +
 drivers/net/ethernet/emulex/benet/be_main.c |    7 +++++++
 3 files changed, 15 insertions(+), 1 deletions(-)

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: Neil Horman @ 2012-07-18 12:45 UTC (permalink / raw)
  To: John Fastabend; +Cc: davem, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <20120718003316.2979.49278.stgit@jf-dev1-dcblab>

On Tue, Jul 17, 2012 at 05:33:16PM -0700, John Fastabend wrote:
> When the netprio cgroup is built in the kernel cgroup_init will call
> cgrp_create which eventually calls update_netdev_tables. This is
> being called before do_initcalls() so a null ptr dereference occurs
> on init_net.
> 
> This patch adds a check on init_net.count to verify the structure
> has been initialized. The failure was introduced here,
> 
> commit ef209f15980360f6945873df3cd710c5f62f2a3e
> Author: Gao feng <gaofeng@cn.fujitsu.com>
> Date:   Wed Jul 11 21:50:15 2012 +0000
> 
>     net: cgroup: fix access the unallocated memory in netprio cgroup
> 
> Tested with ping with netprio_cgroup as a module and built in.
> 
> Marked RFC for now I think DaveM might have a reason why this needs
> some improvement.
> 
> Reported-by: Mark Rustad <mark.d.rustad@intel.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Gao feng <gaofeng@cn.fujitsu.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 
>  net/core/netprio_cgroup.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index b2e9caa..e9fd7fd 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -116,6 +116,9 @@ static int update_netdev_tables(void)
>  	u32 max_len;
>  	struct netprio_map *map;
>  
> +	if (!atomic_read(&init_net.count))
> +		return ret;
> +
>  	rtnl_lock();
>  	max_len = atomic_read(&max_prioidx) + 1;
>  	for_each_netdev(&init_net, dev) {
> 
> 

John, do you have a stack trace of this?  I'm having a hard time seeing how we
get into this path prior to the network stack being initialized.

It also brings up another point.  If this is happening, and we're creating the
root cgroup from start_kernel, then we're actually initializing some cgroups
twice, because a few cgroups register themselves via cgroup_load_subsys in
module_init specified routines.  So if you're building netprio_cgroup or
net_cls_cgroup as part of the monolithic kernel, you'll get cgroup_create called
prior to your module_init() call.  That's not good.

In fact, the cgroup_subsys struct has an early_init flag that cgroup_init
appears to use to skip the initialization of subsystems that don't need to be
initialized that early in boot (assuming that's the path we're going down to get
to this oops).

If you can post the call stack, I'd appreciate it, I'd like to dig a bit deeper
into this.
Neil

* RE: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: David Laight @ 2012-07-18 12:28 UTC (permalink / raw)
  To: Eric Dumazet, David Miller
  Cc: netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <1342613334.2626.2504.camel@edumazet-glaptop>

>  #define HASH_SIZE  32
> 
> -#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
> -		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
> -		    (HASH_SIZE - 1))
> +#define HASH(addr) (ipv6_addr_hash(addr) & (HASH_SIZE - 1))

That hash doesn't seem to include many variable bits at all!
Especially on LE systems where it doesn't contain any of
the low bits of a mac address based IPv6 address.

	David
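
The point about little-endian systems can be checked with a small userspace sketch. It reads the address words the way a little-endian CPU would, then compares bucketing by mask against a multiplicative hash that keeps the top bits. The constant is assumed to match the kernel's GOLDEN_RATIO_PRIME_32 of that era, and all function names here are made up for illustration.

```c
#include <stdint.h>

/* Assumed value of the kernel's 32-bit golden-ratio prime. */
#define GOLDEN_RATIO_PRIME_32 0x9e370001u

typedef uint32_t (*bucket_fn)(const uint8_t addr[16]);

/* Load 4 bytes the way a little-endian CPU reads a __be32 field. */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* XOR of the four 32-bit words, as seen on a little-endian machine. */
static uint32_t xor_fold_le(const uint8_t addr[16])
{
	return load_le32(addr) ^ load_le32(addr + 4) ^
	       load_le32(addr + 8) ^ load_le32(addr + 12);
}

/* Bucket by masking: only bits 0..4 of the fold are used. */
static uint32_t bucket_mask(const uint8_t addr[16])
{
	return xor_fold_le(addr) & 31;
}

/* Bucket by multiplicative hash: keep the top 5 bits of the product. */
static uint32_t bucket_mult(const uint8_t addr[16])
{
	return (xor_fold_le(addr) * GOLDEN_RATIO_PRIME_32) >> 27;
}

/* Vary only the last address byte and count the distinct buckets hit. */
static int count_buckets(bucket_fn fn)
{
	/* fe80::211:22ff:fe33:44xx, a MAC-derived link-local address */
	uint8_t addr[16] = { 0xfe, 0x80, 0, 0, 0, 0, 0, 0,
			     0x02, 0x11, 0x22, 0xff, 0xfe, 0x33, 0x44, 0x00 };
	uint8_t seen[32] = { 0 };
	int i, n = 0;

	for (i = 0; i < 256; i++) {
		uint32_t b;

		addr[15] = (uint8_t)i;
		b = fn(addr);
		if (!seen[b]) {
			seen[b] = 1;
			n++;
		}
	}
	return n;
}
```

With the mask, all 256 addresses land in a single bucket, because the last address byte sits in bits 24..31 of the folded word and the mask only looks at bits 0..4. The multiplicative variant spreads the same addresses over all 32 buckets.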


* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Neil Horman @ 2012-07-18 12:21 UTC (permalink / raw)
  To: John Fastabend
  Cc: Gao feng, eric.dumazet, linux-kernel, netdev, davem, Eric Dumazet,
	Rustad, Mark D
In-Reply-To: <5005CF5D.9070905@intel.com>

On Tue, Jul 17, 2012 at 01:47:25PM -0700, John Fastabend wrote:
> On 7/12/2012 12:50 AM, Gao feng wrote:
> >there are some out of bound accesses in netprio cgroup.
> >
> >now before accessing the dev->priomap.priomap array,we only check
> >if the dev->priomap exist.and because we don't want to see
> >additional bound checkings in fast path, so we should make sure
> >that dev->priomap is null or array size of dev->priomap.priomap
> >is equal to max_prioidx + 1;
> >
> >so in write_priomap logic,we should call extend_netdev_table when
> >dev->priomap is null and dev->priomap.priomap_len < max_len.
> >and in cgrp_create->update_netdev_tables logic,we should call
> >extend_netdev_table only when dev->priomap exist and
> >dev->priomap.priomap_len < max_len.
> >
> >and it's not needed to call update_netdev_tables in write_priomap,
> >we can only allocate the net device's priomap which we change through
> >net_prio.ifpriomap.
> >
> >this patch also add a return value for update_netdev_tables &
> >extend_netdev_table, so when new_priomap is allocated failed,
> >write_priomap will stop to access the priomap,and return -ENOMEM
> >back to the userspace to tell the user what happend.
> >
> >Change From v3:
> >1. add rtnl protect when reading max_prioidx in write_priomap.
> >
> >2. only call extend_netdev_table when map->priomap_len < max_len,
> >    this will make sure array size of dev->map->priomap always
> >    bigger than any prioidx.
> >
> >3. add a function write_update_netdev_table to make codes clear.
> >
> >Change From v2:
> >1. protect extend_netdev_table by RTNL.
> >2. when extend_netdev_table failed,call dev_put to reduce device's refcount.
> >
> >Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> >Cc: Neil Horman <nhorman@tuxdriver.com>
> >Cc: Eric Dumazet <edumazet@google.com>
> >---
> >  net/core/netprio_cgroup.c |   71 ++++++++++++++++++++++++++++++++++-----------
> >  1 files changed, 54 insertions(+), 17 deletions(-)
> >
> 
> [...]
> 
> >+
> >+static int update_netdev_tables(void)
> >+{
> >+	int ret = 0;
> >  	struct net_device *dev;
> >-	u32 max_len = atomic_read(&max_prioidx) + 1;
> >+	u32 max_len;
> >  	struct netprio_map *map;
> 
> 
> need to check if net subsystem is initialized before we try
> to use it here...
> 
> 	if (some_check)     -> need to lookup what this check is
> 		return ret;
> 
> >
> >  	rtnl_lock();
> >+	max_len = atomic_read(&max_prioidx) + 1;
> >  	for_each_netdev(&init_net, dev) {
> >  		map = rtnl_dereference(dev->priomap);
> >-		if ((!map) ||
> >-		    (map->priomap_len < max_len))
> >-			extend_netdev_table(dev, max_len);
> >+		/*
> >+		 * don't allocate priomap if we didn't
> >+		 * change net_prio.ifpriomap (map == NULL),
> >+		 * this will speed up skb_update_prio.
> >+		 */
> >+		if (map && map->priomap_len < max_len) {
> >+			ret = extend_netdev_table(dev, max_len);
> >+			if (ret < 0)
> >+				break;
> >+		}
> >  	}
> >  	rtnl_unlock();
> >+	return ret;
> >  }
> >
> >  static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
> >  {
> >  	struct cgroup_netprio_state *cs;
> >-	int ret;
> >+	int ret = -EINVAL;
> >
> >  	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> >  	if (!cs)
> >  		return ERR_PTR(-ENOMEM);
> >
> >-	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
> >-		kfree(cs);
> >-		return ERR_PTR(-EINVAL);
> >-	}
> >+	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
> >+		goto out;
> >
> >  	ret = get_prioidx(&cs->prioidx);
> >-	if (ret != 0) {
> >+	if (ret < 0) {
> >  		pr_warn("No space in priority index array\n");
> >-		kfree(cs);
> >-		return ERR_PTR(ret);
> >+		goto out;
> >+	}
> >+
> >+	ret = update_netdev_tables();
> >+	if (ret < 0) {
> >+		put_prioidx(cs->prioidx);
> >+		goto out;
> >  	}
> 
> Gao,
> 
> This introduces a null ptr dereference when netprio_cgroup is built
> into the kernel because update_netdev_tables() depends on init_net.
> However cgrp_create is being called by cgroup_init before
> do_initcalls() is called and before net_dev_init().
> 
> .John
> 
Not sure I follow here John.  Shouldn't init_net be initialized prior to any
network devices getting registered?  In other words, shouldn't for_each_netdev
just result in zero iterations through the loop?
Neil
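
Whether the loop is harmless depends on whether the list head itself has been set up yet. The following is a toy model in plain C, not kernel code, and it assumes (the thread does not confirm this) that the failure comes from walking a list head that is still zero-filled. The names init_list and count_entries are made up for illustration.

```c
#include <stddef.h>

/* Minimal circular doubly linked list head, modelled on list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An initialized empty list points back at itself. */
static void init_list(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/*
 * Count the entries on the list, or return -1 if the head was never
 * initialized (still zero-filled), in which case an unguarded walk
 * would dereference a NULL pointer on its first step.
 */
static int count_entries(const struct list_head *h)
{
	const struct list_head *p;
	int n = 0;

	if (h->next == NULL)
		return -1;

	for (p = h->next; p != h; p = p->next)
		n++;
	return n;
}
```

An initialized but empty head gives zero iterations, which is the behaviour expected above. A zero-filled head is the case where a walk without the guard would crash.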

* [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 12:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Andrew McGregor, Dave Taht, Tom Herbert

From: Eric Dumazet <edumazet@google.com>

Introduce ipv6_addr_hash() helper doing a XOR on all bits
of an IPv6 address, with an optimized x86_64 version.

Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)

As cleanups, use it in :

 net/sunrpc/svcauth_unix.c
 net/ipv6/ip6_tunnel.c
 net/ipv4/tcp_metrics.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
---
 include/net/ipv6.h        |   13 +++++++++++++
 net/core/flow_dissector.c |    5 +++--
 net/ipv4/tcp_metrics.c    |   15 +++------------
 net/ipv6/ip6_tunnel.c     |    4 +---
 net/sunrpc/svcauth_unix.c |   18 ++++++++----------
 5 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f695f39..079cd9c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
 #endif
 }
 
+static inline u32 ipv6_addr_hash(const struct in6_addr *a)
+{
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
+	const unsigned long *ul = (const unsigned long *)a;
+	unsigned long x = ul[0] ^ ul[1];
+
+	return x ^ (x >> 32);
+#else
+	return (__force u32)(a->s6_addr32[0] ^ a->s6_addr32[1] ^
+			     a->s6_addr32[2] ^ a->s6_addr32[3]);
+#endif
+}
+
 static inline bool ipv6_addr_loopback(const struct in6_addr *a)
 {
 	return (a->s6_addr32[0] | a->s6_addr32[1] |
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a225089..466820b 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -4,6 +4,7 @@
 #include <linux/ipv6.h>
 #include <linux/if_vlan.h>
 #include <net/ip.h>
+#include <net/ipv6.h>
 #include <linux/if_tunnel.h>
 #include <linux/if_pppox.h>
 #include <linux/ppp_defs.h>
@@ -55,8 +56,8 @@ ipv6:
 			return false;
 
 		ip_proto = iph->nexthdr;
-		flow->src = iph->saddr.s6_addr32[3];
-		flow->dst = iph->daddr.s6_addr32[3];
+		flow->src = (__force __be32)ipv6_addr_hash(&iph->saddr);
+		flow->dst = (__force __be32)ipv6_addr_hash(&iph->daddr);
 		nhoff += sizeof(struct ipv6hdr);
 		break;
 	}
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 5a38a2d..1a115b6 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -211,10 +211,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_rsk(req)->rmt_addr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_rsk(req)->rmt_addr);
 		break;
 	default:
 		return NULL;
@@ -251,10 +248,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	case AF_INET6:
 		tw6 = inet6_twsk((struct sock *)tw);
 		*(struct in6_addr *)addr.addr.a6 = tw6->tw_v6_daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&tw6->tw_v6_daddr);
 		break;
 	default:
 		return NULL;
@@ -291,10 +285,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_sk(sk)->daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_sk(sk)->daddr);
 		break;
 	default:
 		return NULL;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index db32846..819b8eb 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -72,9 +72,7 @@ MODULE_ALIAS_NETDEV("ip6tnl0");
 
 #define HASH_SIZE  32
 
-#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
-		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
-		    (HASH_SIZE - 1))
+#define HASH(addr) (ipv6_addr_hash(addr) & (HASH_SIZE - 1))
 
 static int ip6_tnl_dev_init(struct net_device *dev);
 static void ip6_tnl_dev_setup(struct net_device *dev);
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 2777fa8..e7e1dfe 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -109,18 +109,16 @@ static void ip_map_put(struct kref *kref)
  * IP addresses in reverse-endian (i.e. on a little-endian machine).
  * So use a trivial but reliable hash instead
  */
-static inline int hash_ip(__be32 ip)
+static inline int hash_ip(u32 ip)
 {
-	int hash = (__force u32)ip ^ ((__force u32)ip>>16);
-	return (hash ^ (hash>>8)) & 0xff;
+	ip ^= (ip >> 16);
+	ip ^= (ip >> 8);
+	return ip & 0xff;
 }
 #endif
-static inline int hash_ip6(struct in6_addr ip)
+static inline int hash_ip6(const struct in6_addr *ip)
 {
-	return (hash_ip(ip.s6_addr32[0]) ^
-		hash_ip(ip.s6_addr32[1]) ^
-		hash_ip(ip.s6_addr32[2]) ^
-		hash_ip(ip.s6_addr32[3]));
+	return hash_ip(ipv6_addr_hash(ip));
 }
 static int ip_map_match(struct cache_head *corig, struct cache_head *cnew)
 {
@@ -301,7 +299,7 @@ static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
 	ip.m_addr = *addr;
 	ch = sunrpc_cache_lookup(cd, &ip.h,
 				 hash_str(class, IP_HASHBITS) ^
-				 hash_ip6(*addr));
+				 hash_ip6(addr));
 
 	if (ch)
 		return container_of(ch, struct ip_map, h);
@@ -331,7 +329,7 @@ static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
 	ip.h.expiry_time = expiry;
 	ch = sunrpc_cache_update(cd, &ip.h, &ipm->h,
 				 hash_str(ipm->m_class, IP_HASHBITS) ^
-				 hash_ip6(ipm->m_addr));
+				 hash_ip6(&ipm->m_addr));
 	if (!ch)
 		return -ENOMEM;
 	cache_put(ch, cd);

* [PATCH net-next V1 6/9] net/eipoib: Add sysfs support
From: Or Gerlitz @ 2012-07-18 10:59 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit,
	Or Gerlitz
In-Reply-To: <1342609202-32427-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

The management interface for the driver uses sysfs entries. Via these sysfs
entries the driver gets details on new VIFs to manage. The driver can
enslave a new VIF (an IPoIB cloned interface) or detach from it.

Here are a few sysfs commands that are used to manage the driver,
grouped by scenario:

1. create new clone of IPoIB interface:

	$ echo .Y > /sys/class/net/ibX/create_child

create new clone ibX.Y with the same pkey as ibX, for example:

	$ echo .1 > /sys/class/net/ib0/create_child

will create new interface ib0.1

2. notify parent interface on new VIF to enslave:

	$ echo +ibX.Y > /sys/class/net/ethZ/eth/slaves

where ethZ is the driver interface, for example:

	$ echo +ib0.1 > /sys/class/net/eth4/eth/slaves

will enslave ib0.1 to eth4

3. notify parent interface on VIF details (mac and vlan)

	$ echo +ibX.Y <MAC address> > /sys/class/net/ethZ/eth/vifs

for example:

	$ echo +ib0.1 00:02:c9:43:3b:f1 > /sys/class/net/eth4/eth/vifs

4. notify parent to release VIF:

	$ echo -ibX.Y > /sys/class/net/ethZ/eth/slaves

where ethZ is the driver interface, for example:

        $ echo -ib0.1 > /sys/class/net/eth4/eth/slaves

will release ib0.1 from eth4

5. see the list of ipoib interfaces enslaved under eipoib interface,

	$ cat /sys/class/net/ethX/eth/vifs

for example:

	$ cat /sys/class/net/eth4/eth/vifs

	SLAVE=ib0.1      MAC=9a:c2:1f:d7:3b:63 VLAN=N/A
	SLAVE=ib0.2      MAC=52:54:00:60:55:88 VLAN=N/A
	SLAVE=ib0.3      MAC=52:54:00:60:55:89 VLAN=N/A

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib_sysfs.c |  640 ++++++++++++++++++++++++++++++++++
 1 files changed, 640 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib_sysfs.c

diff --git a/drivers/net/eipoib/eth_ipoib_sysfs.c b/drivers/net/eipoib/eth_ipoib_sysfs.c
new file mode 100644
index 0000000..c3fc121
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib_sysfs.c
@@ -0,0 +1,640 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/in.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/inet.h>
+#include <linux/rtnetlink.h>
+#include <linux/etherdevice.h>
+#include <net/net_namespace.h>
+
+#include "eth_ipoib.h"
+
+#define to_dev(obj)	container_of(obj, struct device, kobj)
+#define to_parent(cd)	((struct parent *)(netdev_priv(to_net_dev(cd))))
+#define MOD_NA_STRING		"N/A"
+
+#define _sprintf(p, buf, format, arg...)				\
+((PAGE_SIZE - (int)(p - buf)) <= 0 ? 0 :				\
+	scnprintf(p, PAGE_SIZE - (int)(p - buf), format, ## arg))
+
+#define _end_of_line(_p, _buf)					\
+do { if (_p - _buf) /* eat the leftover space */			\
+		(_buf)[(_p) - (_buf) - 1] = '\n';			\
+} while (0)
+
+/* helper functions */
+static int get_emac(u8 *mac, char *s)
+{
+	if (sscanf(s, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx",
+		   mac + 0, mac + 1, mac + 2, mac + 3, mac + 4,
+		   mac + 5) != 6)
+		return -1;
+
+	return 0;
+}
+
+static int get_imac(u8 *mac, char *s)
+{
+	if (sscanf(s, "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:"
+		   "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:%hhx:"
+		   "%hhx:%hhx:%hhx:%hhx",
+		   mac + 0, mac + 1, mac + 2, mac + 3, mac + 4,
+		   mac + 5, mac + 6, mac + 7, mac + 8, mac + 9,
+		   mac + 10, mac + 11, mac + 12, mac + 13,
+		   mac + 14, mac + 15, mac + 16, mac + 17,
+		   mac + 18, mac + 19) != 20)
+		return -1;
+
+	return 0;
+}
+
+/* show/store functions per module (CLASS_ATTR) */
+static ssize_t show_parents(struct class *cls, struct class_attribute *attr,
+			    char *buf)
+{
+	char *p = buf;
+	struct parent *parent;
+
+	rtnl_lock(); /* because of parent_dev_list */
+
+	list_for_each_entry(parent, &parent_dev_list, parent_list) {
+		p += _sprintf(p, buf, "%s over IB port: %s\n",
+			      parent->dev->name,
+			      parent->ipoib_main_interface);
+	}
+	_end_of_line(p, buf);
+
+	rtnl_unlock();
+	return (ssize_t)(p - buf);
+}
+
+/* show/store functions per parent (DEVICE_ATTR) */
+static ssize_t parent_show_neighs(struct device *d,
+				  struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct neigh *neigh;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave) {
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			p += _sprintf(p, buf, "SLAVE=%-10s EMAC=%pM IMAC=%pM:%pM:%pM:%.2x:%.2x\n",
+				      slave->dev->name,
+				      neigh->emac,
+				      neigh->imac, neigh->imac + 6, neigh->imac + 12,
+				      neigh->imac[18], neigh->imac[19]);
+		}
+	}
+
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+struct neigh *parent_get_neigh_cmd(char op,
+				   char *ifname, u8 *remac, u8 *rimac)
+{
+	struct neigh *neigh_cmd;
+
+	neigh_cmd = kzalloc(sizeof *neigh_cmd, GFP_ATOMIC);
+	if (!neigh_cmd) {
+		pr_err("%s cannot allocate neigh struct\n", ifname);
+		goto out;
+	}
+
+	/*
+	 * populate emac field so it can be used easily
+	 * in neigh_cmd_find_by_mac()
+	 */
+	memcpy(neigh_cmd->emac, remac, ETH_ALEN);
+	memcpy(neigh_cmd->imac, rimac, INFINIBAND_ALEN);
+
+	/* prepare the command as a string */
+	sprintf(neigh_cmd->cmd, "%c%s %pM %pM:%pM:%pM:%.2x:%.2x",
+		op, ifname, remac, rimac, rimac + 6, rimac + 12, rimac[18], rimac[19]);
+out:
+	return neigh_cmd;
+}
+
+/* write_lock_bh(&parent->lock) must be held */
+ssize_t __parent_store_neighs(struct device *d,
+			      struct device_attribute *attr,
+			      const char *buffer, size_t count)
+{
+	char command[IFNAMSIZ + 1] = { 0, };
+	char emac_str[ETH_ALEN * 3] = { 0, };
+	u8 emac[ETH_ALEN];
+	char imac_str[INFINIBAND_ALEN * 3] = { 0, };
+	u8 imac[INFINIBAND_ALEN];
+	char *ifname;
+	int found = 0, ret = count;
+	struct slave *slave = NULL, *slave_tmp;
+	struct neigh *neigh;
+	struct parent *parent = to_parent(d);
+
+	sscanf(buffer, "%16s %17s %59s", command, emac_str, imac_str);
+
+	/* check ifname */
+	ifname = command + 1;
+	if ((strlen(command) <= 1) || !dev_valid_name(ifname) ||
+	    (command[0] != '+' && command[0] != '-'))
+		goto err_no_cmd;
+
+	/* check if ifname exists */
+	parent_for_each_slave(parent, slave_tmp) {
+		if (!strcmp(slave_tmp->dev->name, ifname)) {
+			found = 1;
+			slave = slave_tmp;
+		}
+	}
+
+	if (!found) {
+		pr_err("%s could not find slave\n", ifname);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (get_emac(emac, emac_str)) {
+		pr_err("%s bad emac %s\n", ifname, emac_str);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (get_imac(imac, imac_str)) {
+		pr_err("%s bad imac %s\n", ifname, imac_str);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* process command */
+	if (command[0] == '+') {
+		found = 0;
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			if (!memcmp(neigh->emac, emac, ETH_ALEN))
+				found = 1;
+		}
+
+		if (found) {
+			pr_err("%s: cannot update neigh, slave already has "
+			       "this neigh mac %pM\n",
+			       slave->dev->name, emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		neigh = kzalloc(sizeof *neigh, GFP_ATOMIC);
+		if (!neigh) {
+			pr_err("%s cannot allocate neigh struct\n",
+			       slave->dev->name);
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		/* ready to go */
+		pr_info("%s: slave %s neigh mac is set to %pM\n",
+			parent->dev->name, ifname, emac);
+		memcpy(neigh->emac, emac, ETH_ALEN);
+		memcpy(neigh->imac, imac, INFINIBAND_ALEN);
+
+		list_add_tail(&neigh->list, &slave->neigh_list);
+
+		goto out;
+	}
+
+	if (command[0] == '-') {
+		found = 0;
+		list_for_each_entry(neigh, &slave->neigh_list, list) {
+			if ((found = !memcmp(neigh->emac, emac, ETH_ALEN)))
+				break; /* keep neigh pointing at the match */
+		}
+
+		if (!found) {
+			pr_err("%s cannot delete neigh mac %pM\n",
+			       ifname, emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		list_del(&neigh->list);
+		kfree(neigh);
+
+		goto out;
+	}
+
+err_no_cmd:
+	pr_err("%s USAGE: (-|+)ifname emac imac\n", DRV_NAME);
+	ret = -EPERM;
+
+out:
+	return ret;
+}
+
+static ssize_t parent_store_neighs(struct device *d,
+				   struct device_attribute *attr,
+				   const char *buffer, size_t count)
+{
+	struct parent *parent = to_parent(d);
+	ssize_t rc;
+
+	write_lock_bh(&parent->lock);
+	rc = __parent_store_neighs(d, attr, buffer, count);
+	write_unlock_bh(&parent->lock);
+
+	return rc;
+}
+
+static DEVICE_ATTR(neighs, S_IRUGO | S_IWUSR, parent_show_neighs,
+		   parent_store_neighs);
+
+static ssize_t parent_show_vifs(struct device *d,
+				struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave) {
+		if (is_zero_ether_addr(slave->emac)) {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%-17s "
+				      "VLAN=%s\n", slave->dev->name,
+				      MOD_NA_STRING, MOD_NA_STRING);
+		} else if (slave->vlan == VLAN_N_VID) {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%pM VLAN=%s\n",
+				      slave->dev->name,
+				      slave->emac,
+				      MOD_NA_STRING);
+		} else {
+			p += _sprintf(p, buf, "SLAVE=%-10s MAC=%pM VLAN=%d\n",
+				      slave->dev->name,
+				      slave->emac,
+				      slave->vlan);
+		}
+	}
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+static ssize_t parent_store_vifs(struct device *d,
+				 struct device_attribute *attr,
+				 const char *buffer, size_t count)
+{
+	char command[IFNAMSIZ + 1] = { 0, };
+	char mac_str[ETH_ALEN * 3] = { 0, };
+	char *ifname;
+	u8 mac[ETH_ALEN];
+	int found = 0, ret = count;
+	struct slave *slave = NULL, *slave_tmp;
+	struct parent *parent = to_parent(d);
+
+	sscanf(buffer, "%16s %17s", command, mac_str);
+
+	write_lock_bh(&parent->lock);
+
+	/* check ifname */
+	ifname = command + 1;
+	if ((strlen(command) <= 1) || !dev_valid_name(ifname) ||
+	    (command[0] != '+' && command[0] != '-'))
+		goto err_no_cmd;
+
+	/* check if ifname exists */
+	parent_for_each_slave(parent, slave_tmp) {
+		if (!strcmp(slave_tmp->dev->name, ifname)) {
+			found = 1;
+			slave = slave_tmp;
+		}
+	}
+
+	if (!found) {
+		pr_err("%s could not find slave\n", ifname);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* process command */
+	if (command[0] == '+') {
+		if (get_emac(mac, mac_str) || !is_valid_ether_addr(mac)) {
+			pr_err("%s invalid mac input\n", ifname);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (!is_zero_ether_addr(slave->emac)) {
+			pr_err("%s slave %s mac already set to %pM\n",
+			       ifname, slave->dev->name, slave->emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		/* check whether another slave already has this mac/vlan */
+		found = 0;
+		parent_for_each_slave(parent, slave_tmp) {
+			if (!memcmp(slave_tmp->emac, mac, ETH_ALEN) &&
+			    slave_tmp->vlan == slave->vlan) {
+				pr_err("cannot update %s, slave %s already has"
+				       " vlan 0x%x mac %pM\n",
+				       parent->dev->name, slave->dev->name,
+				       slave_tmp->vlan,
+				       mac);
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+
+		/* ready to go */
+		pr_info("slave %s mac is set to %pM\n",
+			ifname, mac);
+
+		memcpy(slave->emac, mac, ETH_ALEN);
+		goto out;
+	}
+
+	if (command[0] == '-') {
+		if (is_zero_ether_addr(slave->emac)) {
+			pr_err("%s slave mac already unset %pM\n",
+			       ifname, slave->emac);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		pr_info("slave %s mac is unset (was %pM)\n",
+			ifname, slave->emac);
+
+		goto out;
+	}
+
+err_no_cmd:
+	pr_err("%s USAGE: (-|+)ifname [mac]\n", DRV_NAME);
+	ret = -EPERM;
+
+out:
+	write_unlock_bh(&parent->lock);
+
+	return ret;
+}
+
+static DEVICE_ATTR(vifs, S_IRUGO | S_IWUSR, parent_show_vifs,
+		   parent_store_vifs);
+
+static ssize_t parent_show_slaves(struct device *d,
+				  struct device_attribute *attr, char *buf)
+{
+	struct slave *slave;
+	struct parent *parent = to_parent(d);
+	char *p = buf;
+
+	read_lock_bh(&parent->lock);
+	parent_for_each_slave(parent, slave)
+		p += _sprintf(p, buf, "%s\n", slave->dev->name);
+	read_unlock_bh(&parent->lock);
+
+	_end_of_line(p, buf);
+
+	return (ssize_t)(p - buf);
+}
+
+static ssize_t parent_store_slaves(struct device *d,
+				   struct device_attribute *attr,
+				   const char *buffer, size_t count)
+{
+	char command[IFNAMSIZ + 1] = { 0, };
+	char *ifname;
+	int res, ret = count;
+	struct slave *slave;
+	struct net_device *dev = NULL;
+	struct parent *parent = to_parent(d);
+
+	/* Quick sanity check -- is the parent interface up? */
+	if (!(parent->dev->flags & IFF_UP)) {
+		pr_warn("%s: doing slave updates when "
+			"interface is down.\n", parent->dev->name);
+	}
+
+	if (!rtnl_trylock()) /* because __dev_get_by_name */
+		return restart_syscall();
+
+	sscanf(buffer, "%16s", command);
+
+	ifname = command + 1;
+	if ((strlen(command) <= 1) || !dev_valid_name(ifname))
+		goto err_no_cmd;
+
+	if (command[0] == '+') {
+		/* Got a slave name in ifname. Is it already in the list? */
+		dev = __dev_get_by_name(&init_net, ifname);
+		if (!dev) {
+			pr_warn("%s: Interface %s does not exist!\n",
+				parent->dev->name, ifname);
+			ret = -EINVAL;
+			goto out;
+		}
+
+		read_lock_bh(&parent->lock);
+		parent_for_each_slave(parent, slave) {
+			if (slave->dev == dev) {
+				pr_err("%s ERR- Interface %s is already enslaved!\n",
+				       parent->dev->name, dev->name);
+				ret = -EPERM;
+			}
+		}
+		read_unlock_bh(&parent->lock);
+
+		if (ret < 0)
+			goto out;
+
+		pr_info("%s: adding slave %s\n",
+			parent->dev->name, ifname);
+
+		res = parent_enslave(parent->dev, dev);
+		if (res)
+			ret = res;
+
+		goto out;
+	}
+
+	if (command[0] == '-') {
+		dev = NULL;
+		parent_for_each_slave(parent, slave)
+			if (strnicmp(slave->dev->name, ifname, IFNAMSIZ) == 0) {
+				dev = slave->dev;
+				break;
+			}
+
+		if (dev) {
+			pr_info("%s: removing slave %s\n",
+				parent->dev->name, dev->name);
+			res = parent_release_slave(parent->dev, dev);
+			if (res) {
+				ret = res;
+				goto out;
+			}
+		} else {
+			pr_warn("%s: unable to remove non-existent "
+				"slave for parent %s.\n",
+				ifname, parent->dev->name);
+			ret = -ENODEV;
+		}
+		goto out;
+	}
+
+err_no_cmd:
+	pr_err("%s USAGE: (-|+)ifname\n", DRV_NAME);
+	ret = -EPERM;
+
+out:
+	rtnl_unlock();
+	return ret;
+}
+
+static DEVICE_ATTR(slaves, S_IRUGO | S_IWUSR, parent_show_slaves,
+		   parent_store_slaves);
+
+/* sysfs create/destroy functions */
+static struct attribute *per_parent_attrs[] = {
+	&dev_attr_slaves.attr, /* DEVICE_ATTR(slaves..) */
+	&dev_attr_vifs.attr,
+	&dev_attr_neighs.attr,
+	NULL,
+};
+
+/* namespace support */
+static const void *eipoib_namespace(struct class *cls,
+				    const struct class_attribute *attr)
+{
+	const struct eipoib_net *eipoib_n =
+		container_of(attr,
+			     struct eipoib_net, class_attr_eipoib_interfaces);
+	return eipoib_n->net;
+}
+
+static struct attribute_group parent_group = {
+	/* per parent sysfs files under: /sys/class/net/<IF>/eth/.. */
+	.name = "eth",
+	.attrs = per_parent_attrs
+};
+
+int create_slave_symlinks(struct net_device *master,
+			  struct net_device *slave)
+{
+	char linkname[IFNAMSIZ+7];
+	int ret = 0;
+
+	ret = sysfs_create_link(&(slave->dev.kobj), &(master->dev.kobj),
+				"eth_parent");
+	if (ret)
+		return ret;
+
+	sprintf(linkname, "slave_%s", slave->name);
+	ret = sysfs_create_link(&(master->dev.kobj), &(slave->dev.kobj),
+				linkname);
+	return ret;
+
+}
+
+void destroy_slave_symlinks(struct net_device *master,
+			    struct net_device *slave)
+{
+	char linkname[IFNAMSIZ+7];
+
+	sysfs_remove_link(&(slave->dev.kobj), "eth_parent");
+	sprintf(linkname, "slave_%s", slave->name);
+	sysfs_remove_link(&(master->dev.kobj), linkname);
+}
+
+static struct class_attribute class_attr_eth_ipoib_interfaces = {
+	.attr = {
+		.name = "eth_ipoib_interfaces",
+		.mode = S_IWUSR | S_IRUGO,
+	},
+	.show = show_parents,
+	.namespace = eipoib_namespace,
+};
+
+/* per module sysfs file under: /sys/class/net/eth_ipoib_interfaces */
+int mod_create_sysfs(struct eipoib_net *eipoib_n)
+{
+	int rc;
+	/* defined in CLASS_ATTR(eth_ipoib_interfaces..) */
+	eipoib_n->class_attr_eipoib_interfaces =
+		class_attr_eth_ipoib_interfaces;
+
+	sysfs_attr_init(&eipoib_n->class_attr_eipoib_interfaces.attr);
+
+	rc = netdev_class_create_file(&eipoib_n->class_attr_eipoib_interfaces);
+	if (rc)
+		pr_err("%s failed to create sysfs (rc %d)\n",
+		       eipoib_n->class_attr_eipoib_interfaces.attr.name, rc);
+
+	return rc;
+}
+
+void mod_destroy_sysfs(struct eipoib_net *eipoib_n)
+{
+	netdev_class_remove_file(&eipoib_n->class_attr_eipoib_interfaces);
+}
+
+int parent_create_sysfs_entry(struct parent *parent)
+{
+	struct net_device *dev = parent->dev;
+	int rc;
+
+	rc = sysfs_create_group(&(dev->dev.kobj), &parent_group);
+	if (rc)
+		pr_err("failed to create sysfs group (rc %d)\n", rc);
+
+	return rc;
+}
+
+void parent_destroy_sysfs_entry(struct parent *parent)
+{
+	struct net_device *dev = parent->dev;
+
+	sysfs_remove_group(&(dev->dev.kobj), &parent_group);
+}
-- 
1.7.1

* [PATCH net-next V1 4/9] net/eipoib: Add private header file
From: Or Gerlitz @ 2012-07-18 10:59 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, ali, sean.hefty, shlomop, Erez Shitrit,
	Or Gerlitz
In-Reply-To: <1342609202-32427-1-git-send-email-ogerlitz@mellanox.com>

From: Erez Shitrit <erezsh@mellanox.co.il>

The header file includes all structures, macros and non-static
functions which are of use by the driver.

Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/eipoib/eth_ipoib.h |  227 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 227 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/eipoib/eth_ipoib.h

diff --git a/drivers/net/eipoib/eth_ipoib.h b/drivers/net/eipoib/eth_ipoib.h
new file mode 100644
index 0000000..408cef5
--- /dev/null
+++ b/drivers/net/eipoib/eth_ipoib.h
@@ -0,0 +1,227 @@
+/*
+ * Copyright (c) 2012 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * openfabric.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _LINUX_ETH_IPOIB_H
+#define _LINUX_ETH_IPOIB_H
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <net/arp.h>
+#include <linux/if_vlan.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <linux/if_infiniband.h>
+#include <rdma/ib_verbs.h>
+
+#include <rdma/e_ipoib.h>
+
+/* macros and definitions */
+#define DRV_VERSION		"1.0.0"
+#define DRV_RELDATE		"June 1, 2012"
+#define DRV_NAME		"eth_ipoib"
+#define SDRV_NAME		"ipoib"
+#define DRV_DESCRIPTION		"IP-over-InfiniBand Para Virtualized Driver"
+#define EIPOIB_ABI_VER	1
+
+#undef  pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#define GID_LEN			16
+#define GUID_LEN		8
+
+#define PARENT_VLAN_FEATURES \
+	(NETIF_F_HW_VLAN_RX | NETIF_F_HW_VLAN_TX | \
+	 NETIF_F_HW_VLAN_FILTER)
+
+#define parent_for_each_slave(_parent, slave)		\
+		list_for_each_entry(slave, &(_parent)->slave_list, list)
+
+#define PARENT_IS_OK(_parent)				\
+		(((_parent)->dev->flags & IFF_UP) &&	\
+		 netif_running((_parent)->dev)    &&	\
+		 ((_parent)->slave_cnt > 0))
+
+#define IS_E_IPOIB_PROTO(_proto)			\
+		 (((_proto) == htons(ETH_P_ARP)) ||	\
+		 ((_proto) == htons(ETH_P_RARP)) ||	\
+		 ((_proto) == htons(ETH_P_IP)))
+
+enum eipoib_emac_guest_info {
+	VALID,
+	MIGRATED_OUT,
+	INVALID,
+};
+
+/* structs */
+struct eth_arp_data {
+	u8 arp_sha[ETH_ALEN];
+	__be32 arp_sip;
+	u8 arp_dha[ETH_ALEN];
+	__be32 arp_dip;
+} __packed;
+
+struct ipoib_arp_data {
+	u8 arp_sha[INFINIBAND_ALEN];
+	__be32 arp_sip;
+	u8 arp_dha[INFINIBAND_ALEN];
+	__be32 arp_dip;
+} __packed;
+
+/* live migration support structures: */
+struct ip_member {
+	__be32 ip;
+	struct list_head list;
+};
+
+/*
+ * For each slave (emac), save all the IP addresses seen over that MAC.
+ * The parent keeps this list for live migration.
+ */
+struct guest_emac_info {
+	u8 emac[ETH_ALEN];
+	u16 vlan;
+	struct list_head ip_list;
+	struct list_head list;
+	enum eipoib_emac_guest_info rec_state;
+	int num_of_retries;
+};
+
+struct neigh {
+	struct list_head list;
+	u8 emac[ETH_ALEN];
+	u8 imac[INFINIBAND_ALEN];
+	/* this part is used for neigh_add_list */
+	char cmd[PAGE_SIZE];
+};
+
+struct slave {
+	struct net_device *dev;
+	struct slave *next;
+	struct slave *prev;
+	int    index;
+	struct list_head list;
+	unsigned long jiffies;
+	s8     link;
+	s8     state;
+	u16    pkey;
+	u16    vlan;
+	u8     emac[ETH_ALEN];
+	u8     imac[INFINIBAND_ALEN];
+	struct list_head neigh_list;
+	/* this part is used for vif_add_list */
+	char cmd[PAGE_SIZE];
+};
+
+struct port_stats {
+	/* update PORT_STATS_LEN (number of stat fields) accordingly */
+	unsigned long tx_parent_dropped;
+	unsigned long tx_vif_miss;
+	unsigned long tx_neigh_miss;
+	unsigned long tx_vlan;
+	unsigned long tx_shared;
+	unsigned long tx_proto_errors;
+	unsigned long tx_skb_errors;
+	unsigned long tx_slave_err;
+
+	unsigned long rx_parent_dropped;
+	unsigned long rx_vif_miss;
+	unsigned long rx_neigh_miss;
+	unsigned long rx_vlan;
+	unsigned long rx_shared;
+	unsigned long rx_proto_errors;
+	unsigned long rx_skb_errors;
+	unsigned long rx_slave_err;
+};
+
+struct parent {
+	struct   net_device *dev;
+	int      index;
+	struct   neigh_parms nparms;
+	struct   list_head slave_list;
+	/* never change this value outside the attach/detach wrappers */
+	s32      slave_cnt;
+	rwlock_t lock;
+	struct   net_device_stats stats;
+	struct   port_stats port_stats;
+	struct   list_head parent_list;
+	struct   dev_mc_list *mc_list;
+	u16      flags;
+	struct   list_head vlan_list;
+	struct   workqueue_struct *wq;
+	s8       kill_timers;
+	struct   delayed_work neigh_learn_work;
+	struct   delayed_work vif_learn_work;
+	struct   list_head neigh_add_list;
+	union    ib_gid gid;
+	char     ipoib_main_interface[IFNAMSIZ];
+	struct   list_head emac_ip_list;
+	struct   delayed_work emac_ip_work;
+	struct   delayed_work migrate_out_work;
+};
+
+#define eipoib_slave_get_rcu(dev) \
+	((struct slave *) rcu_dereference(dev->rx_handler_data))
+
+/* namespace support for sysfs */
+struct eipoib_net {
+	struct net	*net;	/* Associated network namespace */
+	struct class_attribute class_attr_eipoib_interfaces;
+};
+
+/* exported from main.c */
+extern int eipoib_net_id;
+extern struct list_head parent_dev_list;
+
+/* function prototypes */
+int mod_create_sysfs(struct eipoib_net *eipoib_n);
+void mod_destroy_sysfs(struct eipoib_net *eipoib_n);
+void parent_destroy_sysfs_entry(struct parent *parent);
+int parent_create_sysfs_entry(struct parent *parent);
+int create_slave_symlinks(struct net_device *master,
+			  struct net_device *slave);
+void destroy_slave_symlinks(struct net_device *master,
+			    struct net_device *slave);
+int parent_enslave(struct net_device *parent_dev,
+		   struct net_device *slave_dev);
+int parent_release_slave(struct net_device *parent_dev,
+			 struct net_device *slave_dev);
+struct neigh *parent_get_neigh_cmd(char op, char *ifname,
+				   u8 *remac, u8 *rimac);
+struct slave *parent_get_vif_cmd(char op, char *ifname, u8 *lemac);
+ssize_t __parent_store_neighs(struct device *d,
+			      struct device_attribute *attr,
+			      const char *buffer, size_t count);
+void parent_set_ethtool_ops(struct net_device *dev);
+
+#endif /* _LINUX_ETH_IPOIB_H */
-- 
1.7.1
