Netdev List


* Re: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: Joe Perches @ 2012-07-18 13:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <1342613334.2626.2504.camel@edumazet-glaptop>

On Wed, 2012-07-18 at 14:08 +0200, Eric Dumazet wrote:
> Introduce ipv6_addr_hash() helper doing a XOR on all bits
> of an IPv6 address, with an optimized x86_64 version.
[]
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
[]
> @@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
>  #endif
>  }
>  
> +static inline u32 ipv6_addr_hash(const struct in6_addr *a)
> +{
> +#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
> +	const unsigned long *ul = (const unsigned long *)a;
> +	unsigned long x = ul[0] ^ ul[1];
> +
> +	return x ^ (x >> 32);

Thanks Eric.

Perhaps this would be better with an explicit rather
than implicit cast.
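
A minimal userspace sketch of what the explicit cast could look like,
assuming uint64_t/uint32_t stand in for the kernel's unsigned long/u32 on
a 64-bit build. The names addr128, addr_hash_fold and addr_hash_xor32 are
hypothetical and not part of the patch; memcpy() replaces the pointer cast
so the sketch does not depend on unaligned access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct in6_addr: 16 raw bytes. */
struct addr128 {
	uint8_t bytes[16];
};

/* 64-bit fold with the truncation to 32 bits spelled out. */
static uint32_t addr_hash_fold(const struct addr128 *a)
{
	uint64_t ul[2];
	uint64_t x;

	assert(a != NULL);
	memcpy(ul, a->bytes, sizeof(ul));
	x = ul[0] ^ ul[1];

	return (uint32_t)(x ^ (x >> 32));
}

/* Reference: plain XOR of the four 32-bit words of the address. */
static uint32_t addr_hash_xor32(const struct addr128 *a)
{
	uint32_t w[4];

	assert(a != NULL);
	memcpy(w, a->bytes, sizeof(w));

	return w[0] ^ w[1] ^ w[2] ^ w[3];
}
```

Both functions return the same value for any address, since XOR does not
care about the order of the words; the cast only makes the truncation
visible to the reader.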


* RE: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 14:06 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B6F93@saturn3.aculab.com>

From: Eric Dumazet <edumazet@google.com>

On Wed, 2012-07-18 at 13:28 +0100, David Laight wrote:
> >  #define HASH_SIZE  32
> > 
> > -#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
> > -		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
> > -		    (HASH_SIZE - 1))
> > +#define HASH(addr) (ipv6_addr_hash(addr) & (HASH_SIZE - 1))
> 
> That hash doesn't seem to include many variable bits at all!
> Especially on LE systems where it doesn't contain any of
> the low bits of a mac address based IPv6 address.
> 

Good point.

Apparently nobody uses a lot of ipv6 tunnels ;)

Thanks

[PATCH net-next v2] ipv6: add ipv6_addr_hash() helper

Introduce ipv6_addr_hash() helper doing a XOR on all bits
of an IPv6 address, with an optimized x86_64 version.

Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)

Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as existing hash was ignoring most of them.

Use it in sunrpc and use more bit shuffling, using hash_32().

As cleanup, use it in net/ipv4/tcp_metrics.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
---
 include/net/ipv6.h        |   13 +++++++++++++
 net/core/flow_dissector.c |    5 +++--
 net/ipv4/tcp_metrics.c    |   15 +++------------
 net/ipv6/ip6_tunnel.c     |   20 ++++++++++++--------
 net/sunrpc/svcauth_unix.c |   22 ++++------------------
 5 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f695f39..56ff725 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
 #endif
 }
 
+static inline u32 ipv6_addr_hash(const struct in6_addr *a)
+{
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
+	const unsigned long *ul = (const unsigned long *)a;
+	unsigned long x = ul[0] ^ ul[1];
+
+	return x ^ (x >> 32);
+#else
+	return (__force u32)(a->s6_addr32[0] ^ a->s6_addr32[1] ^
+			     a->s6_addr32[2] ^ a->s6_addr32[3]);
+#endif
+}
+
 static inline bool ipv6_addr_loopback(const struct in6_addr *a)
 {
 	return (a->s6_addr32[0] | a->s6_addr32[1] |
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a225089..466820b 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -4,6 +4,7 @@
 #include <linux/ipv6.h>
 #include <linux/if_vlan.h>
 #include <net/ip.h>
+#include <net/ipv6.h>
 #include <linux/if_tunnel.h>
 #include <linux/if_pppox.h>
 #include <linux/ppp_defs.h>
@@ -55,8 +56,8 @@ ipv6:
 			return false;
 
 		ip_proto = iph->nexthdr;
-		flow->src = iph->saddr.s6_addr32[3];
-		flow->dst = iph->daddr.s6_addr32[3];
+		flow->src = (__force __be32)ipv6_addr_hash(&iph->saddr);
+		flow->dst = (__force __be32)ipv6_addr_hash(&iph->daddr);
 		nhoff += sizeof(struct ipv6hdr);
 		break;
 	}
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 5a38a2d..1a115b6 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -211,10 +211,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_rsk(req)->rmt_addr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_rsk(req)->rmt_addr);
 		break;
 	default:
 		return NULL;
@@ -251,10 +248,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	case AF_INET6:
 		tw6 = inet6_twsk((struct sock *)tw);
 		*(struct in6_addr *)addr.addr.a6 = tw6->tw_v6_daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&tw6->tw_v6_daddr);
 		break;
 	default:
 		return NULL;
@@ -291,10 +285,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_sk(sk)->daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_sk(sk)->daddr);
 		break;
 	default:
 		return NULL;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index db32846..9a1d5fe 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -40,6 +40,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/netfilter_ipv6.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
 
 #include <asm/uaccess.h>
 #include <linux/atomic.h>
@@ -70,11 +71,15 @@ MODULE_ALIAS_NETDEV("ip6tnl0");
 #define IPV6_TCLASS_MASK (IPV6_FLOWINFO_MASK & ~IPV6_FLOWLABEL_MASK)
 #define IPV6_TCLASS_SHIFT 20
 
-#define HASH_SIZE  32
+#define HASH_SIZE_SHIFT  5
+#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
-#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
-		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
-		    (HASH_SIZE - 1))
+static u32 HASH(const struct in6_addr *addr1, const struct in6_addr *addr2)
+{
+	u32 hash = ipv6_addr_hash(addr1) ^ ipv6_addr_hash(addr2);
+
+	return hash_32(hash, HASH_SIZE_SHIFT);
+}
 
 static int ip6_tnl_dev_init(struct net_device *dev);
 static void ip6_tnl_dev_setup(struct net_device *dev);
@@ -166,12 +171,11 @@ static inline void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
 static struct ip6_tnl *
 ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_addr *local)
 {
-	unsigned int h0 = HASH(remote);
-	unsigned int h1 = HASH(local);
+	unsigned int hash = HASH(remote, local);
 	struct ip6_tnl *t;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
-	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[h0 ^ h1]) {
+	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
 		if (ipv6_addr_equal(local, &t->parms.laddr) &&
 		    ipv6_addr_equal(remote, &t->parms.raddr) &&
 		    (t->dev->flags & IFF_UP))
@@ -205,7 +209,7 @@ ip6_tnl_bucket(struct ip6_tnl_net *ip6n, const struct ip6_tnl_parm *p)
 
 	if (!ipv6_addr_any(remote) || !ipv6_addr_any(local)) {
 		prio = 1;
-		h = HASH(remote) ^ HASH(local);
+		h = HASH(remote, local);
 	}
 	return &ip6n->tnls[prio][h];
 }
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 2777fa8..4d01292 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -104,23 +104,9 @@ static void ip_map_put(struct kref *kref)
 	kfree(im);
 }
 
-#if IP_HASHBITS == 8
-/* hash_long on a 64 bit machine is currently REALLY BAD for
- * IP addresses in reverse-endian (i.e. on a little-endian machine).
- * So use a trivial but reliable hash instead
- */
-static inline int hash_ip(__be32 ip)
-{
-	int hash = (__force u32)ip ^ ((__force u32)ip>>16);
-	return (hash ^ (hash>>8)) & 0xff;
-}
-#endif
-static inline int hash_ip6(struct in6_addr ip)
+static inline int hash_ip6(const struct in6_addr *ip)
 {
-	return (hash_ip(ip.s6_addr32[0]) ^
-		hash_ip(ip.s6_addr32[1]) ^
-		hash_ip(ip.s6_addr32[2]) ^
-		hash_ip(ip.s6_addr32[3]));
+	return hash_32(ipv6_addr_hash(ip), IP_HASHBITS);
 }
 static int ip_map_match(struct cache_head *corig, struct cache_head *cnew)
 {
@@ -301,7 +287,7 @@ static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
 	ip.m_addr = *addr;
 	ch = sunrpc_cache_lookup(cd, &ip.h,
 				 hash_str(class, IP_HASHBITS) ^
-				 hash_ip6(*addr));
+				 hash_ip6(addr));
 
 	if (ch)
 		return container_of(ch, struct ip_map, h);
@@ -331,7 +317,7 @@ static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
 	ip.h.expiry_time = expiry;
 	ch = sunrpc_cache_update(cd, &ip.h, &ipm->h,
 				 hash_str(ipm->m_class, IP_HASHBITS) ^
-				 hash_ip6(ipm->m_addr));
+				 hash_ip6(&ipm->m_addr));
 	if (!ch)
 		return -ENOMEM;
 	cache_put(ch, cd);
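
To illustrate the bucket reduction discussed above, here is a userspace
sketch that contrasts masking the low bits with a hash_32()-style multiply.
It is an approximation under stated assumptions: bucket_mask and bucket_mul
are hypothetical names, and the multiplier is recalled from linux/hash.h of
that period rather than taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: multiplier recalled from linux/hash.h of that era. */
#define SKETCH_GOLDEN_RATIO_PRIME_32 0x9e370001u

/* Old style: keep only the low `bits` bits of the XOR hash. */
static uint32_t bucket_mask(uint32_t hash, unsigned int bits)
{
	assert(bits > 0 && bits < 32);
	return hash & ((1u << bits) - 1);
}

/* hash_32()-style: multiply, then keep the top `bits` bits. */
static uint32_t bucket_mul(uint32_t hash, unsigned int bits)
{
	assert(bits > 0 && bits < 32);
	return (uint32_t)(hash * SKETCH_GOLDEN_RATIO_PRIME_32) >> (32 - bits);
}
```

With 5 bits, the inputs 0x20 and 0x40 both fall into bucket 0 under the
mask, because they differ only above bit 4, while the multiplicative form
places them in different buckets.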


* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: John Fastabend @ 2012-07-18 14:10 UTC (permalink / raw)
  To: Neil Horman
  Cc: Gao feng, eric.dumazet, linux-kernel, netdev, davem, Eric Dumazet,
	Rustad, Mark D
In-Reply-To: <20120718122106.GB25563@hmsreliant.think-freely.org>

On 7/18/2012 5:21 AM, Neil Horman wrote:
> On Tue, Jul 17, 2012 at 01:47:25PM -0700, John Fastabend wrote:
>> On 7/12/2012 12:50 AM, Gao feng wrote:
>>> There are some out-of-bounds accesses in netprio cgroup.
>>>
>>> Currently, before accessing the dev->priomap.priomap array, we only check
>>> whether dev->priomap exists. Because we don't want additional bounds
>>> checks in the fast path, we should make sure that dev->priomap is
>>> null or that the array size of dev->priomap.priomap is equal to
>>> max_prioidx + 1.
>>>
>>> So in the write_priomap logic, we should call extend_netdev_table when
>>> dev->priomap is null and dev->priomap.priomap_len < max_len.
>>> And in the cgrp_create->update_netdev_tables logic, we should call
>>> extend_netdev_table only when dev->priomap exists and
>>> dev->priomap.priomap_len < max_len.
>>>
>>> It is not necessary to call update_netdev_tables in write_priomap;
>>> we need only allocate the priomap of the net device that we change through
>>> net_prio.ifpriomap.
>>>
>>> This patch also adds a return value to update_netdev_tables and
>>> extend_netdev_table, so that when allocating new_priomap fails,
>>> write_priomap stops accessing the priomap and returns -ENOMEM
>>> to userspace to tell the user what happened.
>>>
>>> Changes from v3:
>>> 1. add rtnl protection when reading max_prioidx in write_priomap.
>>>
>>> 2. only call extend_netdev_table when map->priomap_len < max_len;
>>>     this makes sure the array size of dev->map->priomap is always
>>>     bigger than any prioidx.
>>>
>>> 3. add a function write_update_netdev_table to make the code clearer.
>>>
>>> Changes from v2:
>>> 1. protect extend_netdev_table by RTNL.
>>> 2. when extend_netdev_table fails, call dev_put to drop the device's refcount.
>>>
>>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>>> Cc: Neil Horman <nhorman@tuxdriver.com>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> ---
>>>   net/core/netprio_cgroup.c |   71 ++++++++++++++++++++++++++++++++++-----------
>>>   1 files changed, 54 insertions(+), 17 deletions(-)
>>>
>>
>> [...]
>>
>>> +
>>> +static int update_netdev_tables(void)
>>> +{
>>> +	int ret = 0;
>>>   	struct net_device *dev;
>>> -	u32 max_len = atomic_read(&max_prioidx) + 1;
>>> +	u32 max_len;
>>>   	struct netprio_map *map;
>>
>>
>> need to check if net subsystem is initialized before we try
>> to use it here...
>>
>> 	if (some_check)     -> need to lookup what this check is
>> 		return ret;
>>
>>>
>>>   	rtnl_lock();
>>> +	max_len = atomic_read(&max_prioidx) + 1;
>>>   	for_each_netdev(&init_net, dev) {
>>>   		map = rtnl_dereference(dev->priomap);
>>> -		if ((!map) ||
>>> -		    (map->priomap_len < max_len))
>>> -			extend_netdev_table(dev, max_len);
>>> +		/*
>>> +		 * don't allocate priomap if we didn't
>>> +		 * change net_prio.ifpriomap (map == NULL),
>>> +		 * this will speed up skb_update_prio.
>>> +		 */
>>> +		if (map && map->priomap_len < max_len) {
>>> +			ret = extend_netdev_table(dev, max_len);
>>> +			if (ret < 0)
>>> +				break;
>>> +		}
>>>   	}
>>>   	rtnl_unlock();
>>> +	return ret;
>>>   }
>>>
>>>   static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
>>>   {
>>>   	struct cgroup_netprio_state *cs;
>>> -	int ret;
>>> +	int ret = -EINVAL;
>>>
>>>   	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
>>>   	if (!cs)
>>>   		return ERR_PTR(-ENOMEM);
>>>
>>> -	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
>>> -		kfree(cs);
>>> -		return ERR_PTR(-EINVAL);
>>> -	}
>>> +	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
>>> +		goto out;
>>>
>>>   	ret = get_prioidx(&cs->prioidx);
>>> -	if (ret != 0) {
>>> +	if (ret < 0) {
>>>   		pr_warn("No space in priority index array\n");
>>> -		kfree(cs);
>>> -		return ERR_PTR(ret);
>>> +		goto out;
>>> +	}
>>> +
>>> +	ret = update_netdev_tables();
>>> +	if (ret < 0) {
>>> +		put_prioidx(cs->prioidx);
>>> +		goto out;
>>>   	}
>>
>> Gao,
>>
>> This introduces a null ptr dereference when netprio_cgroup is built
>> into the kernel because update_netdev_tables() depends on init_net.
>> However cgrp_create is being called by cgroup_init before
>> do_initcalls() is called and before net_dev_init().
>>
>> .John
>>
> Not sure I follow here John.  Shouldn't init_net be initialized prior to any
> network devices getting registered?  In other words, shouldn't for_each_netdev
> just result in zero iterations through the loop?
> Neil
>

init_net _is_ initialized prior to any network devices getting
registered, but not before cgrp_create is called via cgroup_init.

#define for_each_netdev(net, d)         \
                 list_for_each_entry(d, &(net)->dev_base_head, dev_list)

but dev_base_head is zeroed at this time. In netdev_init we have,

         INIT_LIST_HEAD(&net->dev_base_head);

but we haven't got that far yet because cgroup_init is called
before do_initcalls().
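
The difference between a zeroed and an initialized list head can be shown
with a small userspace sketch. It assumes a simplified copy of the kernel's
struct list_head; init_list_head and loop_body_would_run are hypothetical
helpers written for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Same effect as INIT_LIST_HEAD(): an empty list points at itself. */
static void init_list_head(struct list_head *head)
{
	assert(head != NULL);
	head->next = head;
	head->prev = head;
}

/*
 * list_for_each_entry() starts at head->next and stops once it is back
 * at head. Returns true if the loop body would run at least once.
 */
static bool loop_body_would_run(const struct list_head *head)
{
	assert(head != NULL);
	return head->next != head;
}
```

For a zeroed head, next is NULL, which is not equal to the head itself, so
the loop does not terminate immediately: the body runs with an entry
computed from a NULL pointer. An initialized head gives zero iterations, as
Neil expected.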


* Re: [PATCH net-next] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 14:14 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <1342619879.9551.14.camel@joe2Laptop>

On Wed, 2012-07-18 at 06:57 -0700, Joe Perches wrote:
> On Wed, 2012-07-18 at 14:08 +0200, Eric Dumazet wrote:
> > Introduce ipv6_addr_hash() helper doing a XOR on all bits
> > of an IPv6 address, with an optimized x86_64 version.
> []
> > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> []
> > @@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
> >  #endif
> >  }
> >  
> > +static inline u32 ipv6_addr_hash(const struct in6_addr *a)
> > +{
> > +#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
> > +	const unsigned long *ul = (const unsigned long *)a;
> > +	unsigned long x = ul[0] ^ ul[1];
> > +
> > +	return x ^ (x >> 32);
> 
> Thanks Eric.
> 
> Perhaps this would be better with an explicit rather
> than implicit cast.

In fact, returning an "unsigned long" here might give more shuffling
capabilities on 64bit arches, thanks to hash_long()

but hash_long() on 64bit sounds a bit expensive for our needs...


* [PATCH net-next 1/4] net/mlx4: Move MAC_MASK to a common place
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Define this macro in one common place instead of duplicating it across the code.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    6 +++---
 drivers/net/ethernet/mellanox/mlx4/mcg.c           |    1 -
 drivers/net/ethernet/mellanox/mlx4/port.c          |    1 -
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    3 +--
 include/linux/mlx4/driver.h                        |    2 ++
 5 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index dd6a77b..9d0b88e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -34,12 +34,12 @@
 #include <linux/kernel.h>
 #include <linux/ethtool.h>
 #include <linux/netdevice.h>
+#include <linux/mlx4/driver.h>
 
 #include "mlx4_en.h"
 #include "en_port.h"
 
 #define EN_ETHTOOL_QP_ATTACH (1ull << 63)
-#define EN_ETHTOOL_MAC_MASK 0xffffffffffffULL
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0xffff)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0xffffffff)
 
@@ -751,7 +751,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	struct ethhdr *eth_spec;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_spec_list *spec_l2;
-	__be64 mac_msk = cpu_to_be64(EN_ETHTOOL_MAC_MASK << 16);
+	__be64 mac_msk = cpu_to_be64(MLX4_MAC_MASK << 16);
 
 	err = mlx4_en_validate_flow(dev, cmd);
 	if (err)
@@ -761,7 +761,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	if (!spec_l2)
 		return -ENOMEM;
 
-	mac = priv->mac & EN_ETHTOOL_MAC_MASK;
+	mac = priv->mac & MLX4_MAC_MASK;
 	be_mac = cpu_to_be64(mac << 16);
 
 	spec_l2->id = MLX4_NET_TRANS_RULE_ID_ETH;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index 5bac0df..4ec3835 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -41,7 +41,6 @@
 
 #define MGM_QPN_MASK       0x00FFFFFF
 #define MGM_BLCK_LB_BIT    30
-#define MLX4_MAC_MASK	   0xffffffffffffULL
 
 static const u8 zero_gid[16];	/* automatically initialized to 0 */
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index a51d1b9..028833f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -39,7 +39,6 @@
 #include "mlx4.h"
 
 #define MLX4_MAC_VALID		(1ull << 63)
-#define MLX4_MAC_MASK		0xffffffffffffULL
 
 #define MLX4_VLAN_VALID		(1u << 31)
 #define MLX4_VLAN_MASK		0xfff
diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index c3fa919..94ceddd 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -41,13 +41,12 @@
 #include <linux/slab.h>
 #include <linux/mlx4/cmd.h>
 #include <linux/mlx4/qp.h>
+#include <linux/if_ether.h>
 
 #include "mlx4.h"
 #include "fw.h"
 
 #define MLX4_MAC_VALID		(1ull << 63)
-#define MLX4_MAC_MASK		0x7fffffffffffffffULL
-#define ETH_ALEN		6
 
 struct mac_res {
 	struct list_head list;
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 5f1298b..8dc485f 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -37,6 +37,8 @@
 
 struct mlx4_dev;
 
+#define MLX4_MAC_MASK	   0xffffffffffffULL
+
 enum mlx4_dev_event {
 	MLX4_DEV_EVENT_CATASTROPHIC_ERROR,
 	MLX4_DEV_EVENT_PORT_UP,
-- 
1.7.1


* [PATCH net-next 3/4] {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c          |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/eq.c    |   12 +++++++++++-
 include/linux/mlx4/device.h                |    4 +++-
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 8a3a203..a07b774 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1159,7 +1159,8 @@ static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
 			sprintf(name, "mlx4-ib-%d-%d@%s",
 				i, j, dev->pdev->bus->name);
 			/* Set IRQ for specific name (per ring) */
-			if (mlx4_assign_eq(dev, name, &ibdev->eq_table[eq])) {
+			if (mlx4_assign_eq(dev, name, NULL,
+					   &ibdev->eq_table[eq])) {
 				/* Use legacy (same as mlx4_en driver) */
 				pr_warn("Can't allocate EQ %d; reverting to legacy\n", eq);
 				ibdev->eq_table[eq] =
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 908a460..0ef6156 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -91,7 +91,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 				sprintf(name, "%s-%d", priv->dev->name,
 					cq->ring);
 				/* Set IRQ for specific name (per ring) */
-				if (mlx4_assign_eq(mdev->dev, name, &cq->vector)) {
+				if (mlx4_assign_eq(mdev->dev, name, NULL,
+						   &cq->vector)) {
 					cq->vector = (cq->ring + 1 + priv->port)
 					    % mdev->dev->caps.num_comp_vectors;
 					mlx4_warn(mdev, "Failed Assigning an EQ to "
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index bce98d9..12c3ed2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -39,6 +39,7 @@
 #include <linux/dma-mapping.h>
 
 #include <linux/mlx4/cmd.h>
+#include <linux/cpu_rmap.h>
 
 #include "mlx4.h"
 #include "fw.h"
@@ -1060,7 +1061,8 @@ int mlx4_test_interrupts(struct mlx4_dev *dev)
 }
 EXPORT_SYMBOL(mlx4_test_interrupts);
 
-int mlx4_assign_eq(struct mlx4_dev *dev, char* name, int * vector)
+int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
+		   int *vector)
 {
 
 	struct mlx4_priv *priv = mlx4_priv(dev);
@@ -1074,6 +1076,14 @@ int mlx4_assign_eq(struct mlx4_dev *dev, char* name, int * vector)
 			snprintf(priv->eq_table.irq_names +
 					vec * MLX4_IRQNAME_SIZE,
 					MLX4_IRQNAME_SIZE, "%s", name);
+#ifdef CONFIG_CPU_RMAP
+			if (rmap) {
+				err = irq_cpu_rmap_add(rmap,
+						       priv->eq_table.eq[vec].irq);
+				if (err)
+					mlx4_warn(dev, "Failed adding irq rmap\n");
+			}
+#endif
 			err = request_irq(priv->eq_table.eq[vec].irq,
 					  mlx4_msi_x_interrupt, 0,
 					  &priv->eq_table.irq_names[vec<<5],
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 6f0d133..4d7761f 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -36,6 +36,7 @@
 #include <linux/pci.h>
 #include <linux/completion.h>
 #include <linux/radix-tree.h>
+#include <linux/cpu_rmap.h>
 
 #include <linux/atomic.h>
 
@@ -784,7 +785,8 @@ void mlx4_fmr_unmap(struct mlx4_dev *dev, struct mlx4_fmr *fmr,
 int mlx4_fmr_free(struct mlx4_dev *dev, struct mlx4_fmr *fmr);
 int mlx4_SYNC_TPT(struct mlx4_dev *dev);
 int mlx4_test_interrupts(struct mlx4_dev *dev);
-int mlx4_assign_eq(struct mlx4_dev *dev, char* name , int* vector);
+int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
+		   int *vector);
 void mlx4_release_eq(struct mlx4_dev *dev, int vec);
 
 int mlx4_wol_read(struct mlx4_dev *dev, u64 *config, int port);
-- 
1.7.1


* [PATCH net-next 4/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Use RFS infrastructure and flow steering in HW to keep CPU
affinity of rx interrupts and application per TCP stream.

A flow steering filter is added to the HW whenever the RFS
ndo callback is invoked by core networking code.

Because the invocation takes place in interrupt context, the
actual setup of HW is done using a workqueue. Whenever a new filter
is added, the driver checks for expiry of existing filters.

Since there is a window of time between the point where the core
RFS code invokes the ndo callback and the point where the HW
is configured from the workqueue context, the 2nd, 3rd, etc.
packets from that stream will cause the net core to invoke
the callback again and again.

To prevent inefficient/double configuration of the HW, the filters
are kept in a database which is indexed using a hash function to enable
fast access.
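
The find-before-add idea can be sketched in userspace C as below. This is
an illustration under assumptions, not the driver code: flow_key,
flow_filter, flow_table and flow_find_or_add are hypothetical names, the
key mixing is an arbitrary choice, and locking is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_HASH_BITS 4
#define SKETCH_HASH_SIZE (1u << SKETCH_HASH_BITS)

struct flow_key {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
};

struct flow_filter {
	struct flow_key key;
	struct flow_filter *next;	/* next filter in the same bucket */
};

struct flow_table {
	struct flow_filter *buckets[SKETCH_HASH_SIZE];
	unsigned int count;		/* number of filters stored */
};

/* Mix the 4-tuple, multiply by an arbitrary odd constant, keep top bits. */
static unsigned int flow_bucket(const struct flow_key *k)
{
	uint32_t h = k->src_ip ^ k->dst_ip ^
		     (((uint32_t)k->src_port << 16) | k->dst_port);

	return (uint32_t)(h * 0x9e370001u) >> (32 - SKETCH_HASH_BITS);
}

static int flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
	return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/*
 * Return the filter already stored for this flow, or add a new one.
 * Returns NULL only if allocation fails.
 */
static struct flow_filter *flow_find_or_add(struct flow_table *t,
					    const struct flow_key *k)
{
	struct flow_filter **head, *f;

	assert(t != NULL && k != NULL);
	head = &t->buckets[flow_bucket(k)];

	for (f = *head; f; f = f->next)
		if (flow_key_equal(&f->key, k))
			return f;	/* repeated callback: nothing to add */

	f = calloc(1, sizeof(*f));
	if (!f)
		return NULL;
	f->key = *k;
	f->next = *head;
	*head = f;
	t->count++;
	return f;
}
```

Calling flow_find_or_add() twice with the same 4-tuple returns the same
entry and leaves count at 1, which is the property that keeps repeated
callbacks for one stream from configuring the HW twice.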

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c     |    8 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  316 ++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |    3 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   16 ++
 4 files changed, 342 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 0ef6156..2d6f1ba 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -77,6 +77,12 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 	struct mlx4_en_dev *mdev = priv->mdev;
 	int err = 0;
 	char name[25];
+	struct cpu_rmap *rmap =
+#ifdef CONFIG_CPU_RMAP
+		priv->dev->rx_cpu_rmap;
+#else
+		NULL;
+#endif
 
 	cq->dev = mdev->pndev[priv->port];
 	cq->mcq.set_ci_db  = cq->wqres.db.db;
@@ -91,7 +97,7 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
 				sprintf(name, "%s-%d", priv->dev->name,
 					cq->ring);
 				/* Set IRQ for specific name (per ring) */
-				if (mlx4_assign_eq(mdev->dev, name, NULL,
+				if (mlx4_assign_eq(mdev->dev, name, rmap,
 						   &cq->vector)) {
 					cq->vector = (cq->ring + 1 + priv->port)
 					    % mdev->dev->caps.num_comp_vectors;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 4ce5ca8..8864d8b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -36,6 +36,8 @@
 #include <linux/if_vlan.h>
 #include <linux/delay.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
+#include <net/ip.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/device.h>
@@ -66,6 +68,299 @@ static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 	return 0;
 }
 
+#ifdef CONFIG_RFS_ACCEL
+
+struct mlx4_en_filter {
+	struct list_head next;
+	struct work_struct work;
+
+	__be32 src_ip;
+	__be32 dst_ip;
+	__be16 src_port;
+	__be16 dst_port;
+
+	int rxq_index;
+	struct mlx4_en_priv *priv;
+	u32 flow_id;			/* RFS infrastructure id */
+	int id;				/* mlx4_en driver id */
+	u64 reg_id;			/* Flow steering API id */
+	u8 activated;			/* Used to prevent expiry before filter
+					 * is attached
+					 */
+	struct hlist_node filter_chain;
+};
+
+static void mlx4_en_filter_rfs_expire(struct mlx4_en_priv *priv);
+
+static void mlx4_en_filter_work(struct work_struct *work)
+{
+	struct mlx4_en_filter *filter = container_of(work,
+						     struct mlx4_en_filter,
+						     work);
+	struct mlx4_en_priv *priv = filter->priv;
+	struct mlx4_spec_list spec_tcp = {
+		.id = MLX4_NET_TRANS_RULE_ID_TCP,
+		{
+			.tcp_udp = {
+				.dst_port = filter->dst_port,
+				.dst_port_msk = (__force __be16)-1,
+				.src_port = filter->src_port,
+				.src_port_msk = (__force __be16)-1,
+			},
+		},
+	};
+	struct mlx4_spec_list spec_ip = {
+		.id = MLX4_NET_TRANS_RULE_ID_IPV4,
+		{
+			.ipv4 = {
+				.dst_ip = filter->dst_ip,
+				.dst_ip_msk = (__force __be32)-1,
+				.src_ip = filter->src_ip,
+				.src_ip_msk = (__force __be32)-1,
+			},
+		},
+	};
+	struct mlx4_spec_list spec_eth = {
+		.id = MLX4_NET_TRANS_RULE_ID_ETH,
+	};
+	struct mlx4_net_trans_rule rule = {
+		.list = LIST_HEAD_INIT(rule.list),
+		.queue_mode = MLX4_NET_TRANS_Q_LIFO,
+		.exclusive = 1,
+		.allow_loopback = 1,
+		.promisc_mode = MLX4_FS_PROMISC_NONE,
+		.port = priv->port,
+		.priority = MLX4_DOMAIN_RFS,
+	};
+	int rc;
+	__be64 mac;
+	__be64 mac_mask = cpu_to_be64(MLX4_MAC_MASK << 16);
+
+	list_add_tail(&spec_eth.list, &rule.list);
+	list_add_tail(&spec_ip.list, &rule.list);
+	list_add_tail(&spec_tcp.list, &rule.list);
+
+	mac = cpu_to_be64((priv->mac & MLX4_MAC_MASK) << 16);
+
+	rule.qpn = priv->rss_map.qps[filter->rxq_index].qpn;
+	memcpy(spec_eth.eth.dst_mac, &mac, ETH_ALEN);
+	memcpy(spec_eth.eth.dst_mac_msk, &mac_mask, ETH_ALEN);
+
+	filter->activated = 0;
+
+	if (filter->reg_id) {
+		rc = mlx4_flow_detach(priv->mdev->dev, filter->reg_id);
+		if (rc && rc != -ENOENT)
+			en_err(priv, "Error detaching flow. rc = %d\n", rc);
+	}
+
+	rc = mlx4_flow_attach(priv->mdev->dev, &rule, &filter->reg_id);
+	if (rc)
+		en_err(priv, "Error attaching flow. err = %d\n", rc);
+
+	mlx4_en_filter_rfs_expire(priv);
+
+	filter->activated = 1;
+}
+
+static inline struct hlist_head *
+filter_hash_bucket(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
+		   __be16 src_port, __be16 dst_port)
+{
+	unsigned long l;
+	int bucket_idx;
+
+	l = (__force unsigned long)src_port |
+	    ((__force unsigned long)dst_port << 2);
+	l ^= (__force unsigned long)(src_ip ^ dst_ip);
+
+	bucket_idx = hash_long(l, MLX4_EN_FILTER_HASH_SHIFT);
+
+	return &priv->filter_hash[bucket_idx];
+}
+
+static struct mlx4_en_filter *
+mlx4_en_filter_alloc(struct mlx4_en_priv *priv, int rxq_index, __be32 src_ip,
+		     __be32 dst_ip, __be16 src_port, __be16 dst_port,
+		     u32 flow_id)
+{
+	struct mlx4_en_filter *filter = NULL;
+
+	filter = kzalloc(sizeof(struct mlx4_en_filter), GFP_ATOMIC);
+	if (!filter)
+		return NULL;
+
+	filter->priv = priv;
+	filter->rxq_index = rxq_index;
+	INIT_WORK(&filter->work, mlx4_en_filter_work);
+
+	filter->src_ip = src_ip;
+	filter->dst_ip = dst_ip;
+	filter->src_port = src_port;
+	filter->dst_port = dst_port;
+
+	filter->flow_id = flow_id;
+
+	filter->id = priv->last_filter_id++;
+
+	list_add_tail(&filter->next, &priv->filters);
+	hlist_add_head(&filter->filter_chain,
+		       filter_hash_bucket(priv, src_ip, dst_ip, src_port,
+					  dst_port));
+
+	return filter;
+}
+
+static void mlx4_en_filter_free(struct mlx4_en_filter *filter)
+{
+	struct mlx4_en_priv *priv = filter->priv;
+	int rc;
+
+	list_del(&filter->next);
+
+	rc = mlx4_flow_detach(priv->mdev->dev, filter->reg_id);
+	if (rc && rc != -ENOENT)
+		en_err(priv, "Error detaching flow. rc = %d\n", rc);
+
+	kfree(filter);
+}
+
+static inline struct mlx4_en_filter *
+mlx4_en_filter_find(struct mlx4_en_priv *priv, __be32 src_ip, __be32 dst_ip,
+		    __be16 src_port, __be16 dst_port)
+{
+	struct hlist_node *elem;
+	struct mlx4_en_filter *filter;
+	struct mlx4_en_filter *ret = NULL;
+
+	hlist_for_each_entry(filter, elem,
+			     filter_hash_bucket(priv, src_ip, dst_ip,
+						src_port, dst_port),
+			     filter_chain) {
+		if (filter->src_ip == src_ip &&
+		    filter->dst_ip == dst_ip &&
+		    filter->src_port == src_port &&
+		    filter->dst_port == dst_port) {
+			ret = filter;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int
+mlx4_en_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+		   u16 rxq_index, u32 flow_id)
+{
+	struct mlx4_en_priv *priv = netdev_priv(net_dev);
+	struct mlx4_en_filter *filter;
+	const struct iphdr *ip;
+	const __be16 *ports;
+	__be32 src_ip;
+	__be32 dst_ip;
+	__be16 src_port;
+	__be16 dst_port;
+	int nhoff = skb_network_offset(skb);
+	int ret = 0;
+
+	if (skb->protocol != htons(ETH_P_IP))
+		return -EPROTONOSUPPORT;
+
+	ip = (const struct iphdr *)(skb->data + nhoff);
+	if (ip_is_fragment(ip))
+		return -EPROTONOSUPPORT;
+
+	ports = (const __be16 *)(skb->data + nhoff + 4 * ip->ihl);
+
+	src_ip = ip->saddr;
+	dst_ip = ip->daddr;
+	src_port = ports[0];
+	dst_port = ports[1];
+
+	if (ip->protocol != IPPROTO_TCP)
+		return -EPROTONOSUPPORT;
+
+	spin_lock_bh(&priv->filters_lock);
+	filter = mlx4_en_filter_find(priv, src_ip, dst_ip, src_port, dst_port);
+	if (filter) {
+		if (filter->rxq_index == rxq_index)
+			goto out;
+
+		filter->rxq_index = rxq_index;
+	} else {
+		filter = mlx4_en_filter_alloc(priv, rxq_index,
+					      src_ip, dst_ip,
+					      src_port, dst_port, flow_id);
+		if (!filter) {
+			ret = -ENOMEM;
+			goto err;
+		}
+	}
+
+	queue_work(priv->mdev->workqueue, &filter->work);
+
+out:
+	ret = filter->id;
+err:
+	spin_unlock_bh(&priv->filters_lock);
+
+	return ret;
+}
+
+void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
+			     struct mlx4_en_rx_ring *rx_ring)
+{
+	struct mlx4_en_filter *filter, *tmp;
+	LIST_HEAD(del_list);
+
+	spin_lock_bh(&priv->filters_lock);
+	list_for_each_entry_safe(filter, tmp, &priv->filters, next) {
+		list_move(&filter->next, &del_list);
+		hlist_del(&filter->filter_chain);
+	}
+	spin_unlock_bh(&priv->filters_lock);
+
+	list_for_each_entry_safe(filter, tmp, &del_list, next) {
+		cancel_work_sync(&filter->work);
+		mlx4_en_filter_free(filter);
+	}
+}
+
+static void mlx4_en_filter_rfs_expire(struct mlx4_en_priv *priv)
+{
+	struct mlx4_en_filter *filter = NULL, *tmp, *last_filter = NULL;
+	LIST_HEAD(del_list);
+	int i = 0;
+
+	spin_lock_bh(&priv->filters_lock);
+	list_for_each_entry_safe(filter, tmp, &priv->filters, next) {
+		if (i > MLX4_EN_FILTER_EXPIRY_QUOTA)
+			break;
+
+		if (filter->activated &&
+		    !work_pending(&filter->work) &&
+		    rps_may_expire_flow(priv->dev,
+					filter->rxq_index, filter->flow_id,
+					filter->id)) {
+			list_move(&filter->next, &del_list);
+			hlist_del(&filter->filter_chain);
+		} else
+			last_filter = filter;
+
+		i++;
+	}
+
+	if (last_filter && (&last_filter->next != priv->filters.next))
+		list_move(&priv->filters, &last_filter->next);
+
+	spin_unlock_bh(&priv->filters_lock);
+
+	list_for_each_entry_safe(filter, tmp, &del_list, next)
+		mlx4_en_filter_free(filter);
+}
+#endif
+
 static int mlx4_en_vlan_rx_add_vid(struct net_device *dev, unsigned short vid)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1079,6 +1374,11 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 {
 	int i;
 
+#ifdef CONFIG_RFS_ACCEL
+	free_irq_cpu_rmap(priv->dev->rx_cpu_rmap);
+	priv->dev->rx_cpu_rmap = NULL;
+#endif
+
 	for (i = 0; i < priv->tx_ring_num; i++) {
 		if (priv->tx_ring[i].tx_info)
 			mlx4_en_destroy_tx_ring(priv, &priv->tx_ring[i]);
@@ -1134,6 +1434,15 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
 			goto err;
 	}
 
+#ifdef CONFIG_RFS_ACCEL
+	priv->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(priv->rx_ring_num);
+	if (!priv->dev->rx_cpu_rmap)
+		goto err;
+
+	INIT_LIST_HEAD(&priv->filters);
+	spin_lock_init(&priv->filters_lock);
+#endif
+
 	return 0;
 
 err:
@@ -1241,6 +1550,9 @@ static const struct net_device_ops mlx4_netdev_ops = {
 #endif
 	.ndo_set_features	= mlx4_en_set_features,
 	.ndo_setup_tc		= mlx4_en_setup_tc,
+#ifdef CONFIG_RFS_ACCEL
+	.ndo_rx_flow_steer	= mlx4_en_filter_rfs,
+#endif
 };
 
 int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
@@ -1358,6 +1670,10 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 			NETIF_F_HW_VLAN_FILTER;
 	dev->hw_features |= NETIF_F_LOOPBACK;
 
+	if (mdev->dev->caps.steering_mode ==
+	    MLX4_STEERING_MODE_DEVICE_MANAGED)
+		dev->hw_features |= NETIF_F_NTUPLE;
+
 	mdev->pndev[port] = dev;
 
 	netif_carrier_off(dev);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index a04cbf7..796cd58 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -389,6 +389,9 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
 	vfree(ring->rx_info);
 	ring->rx_info = NULL;
+#ifdef CONFIG_RFS_ACCEL
+	mlx4_en_cleanup_filters(priv, ring);
+#endif
 }
 
 void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index a126321..af34c98 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -43,6 +43,7 @@
 #ifdef CONFIG_MLX4_EN_DCB
 #include <linux/dcbnl.h>
 #endif
+#include <linux/cpu_rmap.h>
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/qp.h>
@@ -77,6 +78,9 @@
 #define STATS_DELAY		(HZ / 4)
 #define MAX_NUM_OF_FS_RULES	256
 
+#define MLX4_EN_FILTER_HASH_SHIFT 4
+#define MLX4_EN_FILTER_EXPIRY_QUOTA 60
+
 /* Typical TSO descriptor with 16 gather entries is 352 bytes... */
 #define MAX_DESC_SIZE		512
 #define MAX_DESC_TXBBS		(MAX_DESC_SIZE / TXBB_SIZE)
@@ -523,6 +527,13 @@ struct mlx4_en_priv {
 	struct ieee_ets ets;
 	u16 maxrate[IEEE_8021QAZ_MAX_TCS];
 #endif
+#ifdef CONFIG_RFS_ACCEL
+	spinlock_t filters_lock;
+	int last_filter_id;
+	struct list_head filters;
+	struct hlist_head filter_hash[1 << MLX4_EN_FILTER_HASH_SHIFT];
+#endif
+
 };
 
 enum mlx4_en_wol {
@@ -602,6 +613,11 @@ int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 extern const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops;
 #endif
 
+#ifdef CONFIG_RFS_ACCEL
+void mlx4_en_cleanup_filters(struct mlx4_en_priv *priv,
+			     struct mlx4_en_rx_ring *rx_ring);
+#endif
+
 #define MLX4_EN_NUM_SELF_TEST	5
 void mlx4_en_ex_selftest(struct net_device *dev, u32 *flags, u64 *buf);
 u64 mlx4_en_mac_to_u64(u8 *addr);
-- 
1.7.1

^ permalink raw reply related

* [PATCH net-next 0/4] net/mlx4_en: Add accelerated RFS support
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Or Gerlitz, Amir Vadai

Hi Dave, 

So now a pure Ethernet post from us...

This series from Amir Vadai adds support for Accelerated RFS
to the mlx4_en Ethernet driver.

The code uses the Accelerated RFS infrastructure and HW flow steering
to keep CPU affinity of rx interrupts and applications per TCP stream.

To do so, we had to add a small guard to cpu_rmap.h against double
inclusion. We also added linking between CPUs and IRQs using rmap in the
mlx4_core driver.

Or.

Amir Vadai (4):
  net/mlx4: Move MAC_MASK to a common place
  net/rps: Protect cpu_rmap.h from double inclusion
  {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
  net/mlx4_en: Add accelerated RFS support

 drivers/infiniband/hw/mlx4/main.c                  |    3 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c         |    9 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    6 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |  316 ++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |    3 +
 drivers/net/ethernet/mellanox/mlx4/eq.c            |   12 +-
 drivers/net/ethernet/mellanox/mlx4/mcg.c           |    1 -
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |   16 +
 drivers/net/ethernet/mellanox/mlx4/port.c          |    1 -
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |    3 +-
 include/linux/cpu_rmap.h                           |    4 +
 include/linux/mlx4/device.h                        |    4 +-
 include/linux/mlx4/driver.h                        |    2 +
 13 files changed, 369 insertions(+), 11 deletions(-)

CC: Amir Vadai <amirv@mellanox.com>
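
For readers who want the shape of the flow table without reading the whole
driver patch, here is a minimal userspace sketch, assuming a fixed array of
hash buckets keyed by the TCP 4-tuple and a find-or-create step that records
the RX queue. The names (flow_key, flow_filter, flow_table, flow_bucket,
flow_find, flow_steer) and the key mixing are hypothetical and are not taken
from the patch; locking, the deferred work item, and filter expiry are left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FLOW_HASH_SHIFT 4	/* 16 buckets, mirroring MLX4_EN_FILTER_HASH_SHIFT */

struct flow_key {
	uint32_t src_ip, dst_ip;
	uint16_t src_port, dst_port;
};

struct flow_filter {
	struct flow_key key;
	uint16_t rxq_index;		/* RX queue the flow is steered to */
	struct flow_filter *next;	/* bucket chain */
};

struct flow_table {
	struct flow_filter *buckets[1u << FLOW_HASH_SHIFT];
};

/* Hypothetical key mixing; the driver's own mixing is not reproduced here. */
static unsigned int flow_bucket(const struct flow_key *k)
{
	uint32_t h = k->src_ip ^ k->dst_ip ^
		     (((uint32_t)k->src_port << 16) | k->dst_port);

	/* Multiply by an odd constant and keep the top FLOW_HASH_SHIFT bits. */
	return (uint32_t)(h * UINT32_C(0x9e370001)) >> (32 - FLOW_HASH_SHIFT);
}

static int flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
	return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/* Walk one bucket chain and return the matching filter, or NULL. */
static struct flow_filter *flow_find(const struct flow_table *t,
				     const struct flow_key *k)
{
	struct flow_filter *f;

	assert(t != NULL && k != NULL);
	for (f = t->buckets[flow_bucket(k)]; f != NULL; f = f->next)
		if (flow_key_equal(&f->key, k))
			return f;
	return NULL;
}

/*
 * Find-or-create, then record the queue. Returns the filter, or NULL if
 * the allocation fails.
 */
static struct flow_filter *flow_steer(struct flow_table *t,
				      const struct flow_key *k,
				      uint16_t rxq_index)
{
	struct flow_filter *f = flow_find(t, k);

	if (f == NULL) {
		unsigned int b = flow_bucket(k);

		f = calloc(1, sizeof(*f));
		if (f == NULL)
			return NULL;
		f->key = *k;
		f->next = t->buckets[b];
		t->buckets[b] = f;
	}
	f->rxq_index = rxq_index;
	return f;
}
```

Steering the same 4-tuple a second time updates the existing entry instead of
allocating a new one, which is the behaviour a per-stream affinity table needs.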

^ permalink raw reply

* [PATCH net-next 2/4] net/rps: Protect cpu_rmap.h from double inclusion
From: Or Gerlitz @ 2012-07-18 14:19 UTC (permalink / raw)
  To: davem; +Cc: roland, netdev, oren, yevgenyp, Amir Vadai, Or Gerlitz
In-Reply-To: <1342621162-18498-1-git-send-email-ogerlitz@mellanox.com>

From: Amir Vadai <amirv@mellanox.com>

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 include/linux/cpu_rmap.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/cpu_rmap.h b/include/linux/cpu_rmap.h
index 473771a..ac3bbb5 100644
--- a/include/linux/cpu_rmap.h
+++ b/include/linux/cpu_rmap.h
@@ -1,3 +1,6 @@
+#ifndef __LINUX_CPU_RMAP_H
+#define __LINUX_CPU_RMAP_H
+
 /*
  * cpu_rmap.c: CPU affinity reverse-map support
  * Copyright 2011 Solarflare Communications Inc.
@@ -71,3 +74,4 @@ extern void free_irq_cpu_rmap(struct cpu_rmap *rmap);
 extern int irq_cpu_rmap_add(struct cpu_rmap *rmap, int irq);
 
 #endif
+#endif /* __LINUX_CPU_RMAP_H */
-- 
1.7.1

^ permalink raw reply related

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: John Fastabend @ 2012-07-18 14:21 UTC (permalink / raw)
  To: Neil Horman; +Cc: davem, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <20120718124539.GC25563@hmsreliant.think-freely.org>

On 7/18/2012 5:45 AM, Neil Horman wrote:
> On Tue, Jul 17, 2012 at 05:33:16PM -0700, John Fastabend wrote:
>> When the netprio cgroup is built in the kernel cgroup_init will call
>> cgrp_create which eventually calls update_netdev_tables. This is
>> being called before do_initcalls() so a null ptr dereference occurs
>> on init_net.
>>
>> This patch adds a check on init_net.count to verify the structure
>> has been initialized. The failure was introduced here,
>>
>> commit ef209f15980360f6945873df3cd710c5f62f2a3e
>> Author: Gao feng <gaofeng@cn.fujitsu.com>
>> Date:   Wed Jul 11 21:50:15 2012 +0000
>>
>>      net: cgroup: fix access the unallocated memory in netprio cgroup
>>
>> Tested with ping with netprio_cgroup as a module and built in.
>>
>> Marked RFC for now I think DaveM might have a reason why this needs
>> some improvement.
>>
>> Reported-by: Mark Rustad <mark.d.rustad@intel.com>
>> Cc: Neil Horman <nhorman@tuxdriver.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Gao feng <gaofeng@cn.fujitsu.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>
>>   net/core/netprio_cgroup.c |    3 +++
>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
>> index b2e9caa..e9fd7fd 100644
>> --- a/net/core/netprio_cgroup.c
>> +++ b/net/core/netprio_cgroup.c
>> @@ -116,6 +116,9 @@ static int update_netdev_tables(void)
>>   	u32 max_len;
>>   	struct netprio_map *map;
>>
>> +	if (!atomic_read(&init_net.count))
>> +		return ret;
>> +
>>   	rtnl_lock();
>>   	max_len = atomic_read(&max_prioidx) + 1;
>>   	for_each_netdev(&init_net, dev) {
>>
>>
>
> John, do you have a stack trace of this?  I'm having a hard time seeing how we
> get into this path prior to the network stack being initialized.

Mark had a partial trace:

[    0.003455] Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
[    0.005550] Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
[    0.007165] Mount-cache hash table entries: 256
[    0.010289] Initializing cgroup subsys net_cls
[    0.010947] Initializing cgroup subsys net_prio
[    0.011039] BUG: unable to handle kernel NULL pointer dereference at 0000000000000828
[    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0


>
> It also brings up another point.  If this is happening, and we're creating the
> root cgroup from start_kernel, then we're actually initializing some cgroups
> twice, because a few cgroups register themselves via cgroup_load_subsys in
> module_init specified routines.  So if you're building netprio_cgroup or
> net_cls_cgroup as part of the monolithic kernel, you'll get cgroup_create called
> prior to your module_init() call.  That's not good.

Well, your module_init() wouldn't be called in this case, right? I think
netprio has a bug where we only register a netdevice notifier when
it's built as a module.

Is it the same issue with cls_cgroup and register_tcf_proto_ops?

>
> In fact, the cgroup_subsys struct has an early_init flag that cgroup_init
> appears to use to skip the initialization of subsystems that don't need to be
> initialized that early in boot (assuming that's the path we're going down to get
> to this oops).

Do you mean ss->early_init? I'm not sure that helps us: either we get called
by cgroup_init because we don't have an early_init callback, or we get
called via cgroup_init_early even earlier.

>
> If you can post the call stack, I'd appreciate it, I'd like to dig a bit deeper
> into this.

Yes I'll do this shortly.

> Neil
>

^ permalink raw reply

* Re: [PATCH v4] net: cgroup: fix access the unallocated memory in netprio cgroup
From: Neil Horman @ 2012-07-18 14:26 UTC (permalink / raw)
  To: John Fastabend
  Cc: Gao feng, eric.dumazet, linux-kernel, netdev, davem, Eric Dumazet,
	Rustad, Mark D
In-Reply-To: <5006C3CA.3010007@intel.com>

On Wed, Jul 18, 2012 at 07:10:18AM -0700, John Fastabend wrote:
> On 7/18/2012 5:21 AM, Neil Horman wrote:
> >On Tue, Jul 17, 2012 at 01:47:25PM -0700, John Fastabend wrote:
> >>On 7/12/2012 12:50 AM, Gao feng wrote:
> >>>There are some out-of-bounds accesses in the netprio cgroup.
> >>>
> >>>Currently, before accessing the dev->priomap.priomap array, we only
> >>>check whether dev->priomap exists. Because we don't want additional
> >>>bounds checks in the fast path, we should make sure that either
> >>>dev->priomap is NULL or the array size of dev->priomap.priomap
> >>>is equal to max_prioidx + 1.
> >>>
> >>>So in the write_priomap logic, we should call extend_netdev_table when
> >>>dev->priomap is NULL or dev->priomap.priomap_len < max_len.
> >>>In the cgrp_create->update_netdev_tables logic, we should call
> >>>extend_netdev_table only when dev->priomap exists and
> >>>dev->priomap.priomap_len < max_len.
> >>>
> >>>There is no need to call update_netdev_tables in write_priomap;
> >>>we only need to allocate the priomap of the net device that we change
> >>>through net_prio.ifpriomap.
> >>>
> >>>This patch also adds a return value to update_netdev_tables and
> >>>extend_netdev_table, so when allocating new_priomap fails,
> >>>write_priomap stops accessing the priomap and returns -ENOMEM
> >>>to userspace to tell the user what happened.
> >>>
> >>>Change From v3:
> >>>1. add rtnl protect when reading max_prioidx in write_priomap.
> >>>
> >>>2. only call extend_netdev_table when map->priomap_len < max_len,
> >>>    this will make sure array size of dev->map->priomap always
> >>>    bigger than any prioidx.
> >>>
> >>>3. add a function write_update_netdev_table to make codes clear.
> >>>
> >>>Change From v2:
> >>>1. protect extend_netdev_table by RTNL.
> >>>2. when extend_netdev_table failed,call dev_put to reduce device's refcount.
> >>>
> >>>Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> >>>Cc: Neil Horman <nhorman@tuxdriver.com>
> >>>Cc: Eric Dumazet <edumazet@google.com>
> >>>---
> >>>  net/core/netprio_cgroup.c |   71 ++++++++++++++++++++++++++++++++++-----------
> >>>  1 files changed, 54 insertions(+), 17 deletions(-)
> >>>
> >>
> >>[...]
> >>
> >>>+
> >>>+static int update_netdev_tables(void)
> >>>+{
> >>>+	int ret = 0;
> >>>  	struct net_device *dev;
> >>>-	u32 max_len = atomic_read(&max_prioidx) + 1;
> >>>+	u32 max_len;
> >>>  	struct netprio_map *map;
> >>
> >>
> >>need to check if net subsystem is initialized before we try
> >>to use it here...
> >>
> >>	if (some_check)     -> need to lookup what this check is
> >>		return ret;
> >>
> >>>
> >>>  	rtnl_lock();
> >>>+	max_len = atomic_read(&max_prioidx) + 1;
> >>>  	for_each_netdev(&init_net, dev) {
> >>>  		map = rtnl_dereference(dev->priomap);
> >>>-		if ((!map) ||
> >>>-		    (map->priomap_len < max_len))
> >>>-			extend_netdev_table(dev, max_len);
> >>>+		/*
> >>>+		 * don't allocate priomap if we didn't
> >>>+		 * change net_prio.ifpriomap (map == NULL),
> >>>+		 * this will speed up skb_update_prio.
> >>>+		 */
> >>>+		if (map && map->priomap_len < max_len) {
> >>>+			ret = extend_netdev_table(dev, max_len);
> >>>+			if (ret < 0)
> >>>+				break;
> >>>+		}
> >>>  	}
> >>>  	rtnl_unlock();
> >>>+	return ret;
> >>>  }
> >>>
> >>>  static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
> >>>  {
> >>>  	struct cgroup_netprio_state *cs;
> >>>-	int ret;
> >>>+	int ret = -EINVAL;
> >>>
> >>>  	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> >>>  	if (!cs)
> >>>  		return ERR_PTR(-ENOMEM);
> >>>
> >>>-	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
> >>>-		kfree(cs);
> >>>-		return ERR_PTR(-EINVAL);
> >>>-	}
> >>>+	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
> >>>+		goto out;
> >>>
> >>>  	ret = get_prioidx(&cs->prioidx);
> >>>-	if (ret != 0) {
> >>>+	if (ret < 0) {
> >>>  		pr_warn("No space in priority index array\n");
> >>>-		kfree(cs);
> >>>-		return ERR_PTR(ret);
> >>>+		goto out;
> >>>+	}
> >>>+
> >>>+	ret = update_netdev_tables();
> >>>+	if (ret < 0) {
> >>>+		put_prioidx(cs->prioidx);
> >>>+		goto out;
> >>>  	}
> >>
> >>Gao,
> >>
> >>This introduces a null ptr dereference when netprio_cgroup is built
> >>into the kernel because update_netdev_tables() depends on init_net.
> >>However cgrp_create is being called by cgroup_init before
> >>do_initcalls() is called and before net_dev_init().
> >>
> >>.John
> >>
> >Not sure I follow here John.  Shouldn't init_net be initialized prior to any
> >network devices getting registered?  In other words, shouldn't for_each_netdev
> >just result in zero iterations through the loop?
> >Neil
> >
> 
> init_net _is_ initialized prior to any network devices getting
> registered but not before cgrp_create called via cgroup_init.
> 
> #define for_each_netdev(net, d)         \
>                 list_for_each_entry(d, &(net)->dev_base_head, dev_list)
> 
> but dev_base_head is zeroed at this time. In netdev_init we have,
> 
>         INIT_LIST_HEAD(&net->dev_base_head);
> 
> but we haven't got that far yet because cgroup_init is called
> before do_initcalls().
> 
OK, I see that, and it makes sense, but at this point I'm more concerned with
cgroups getting initialized twice.  The early_init flag is clear in the
cgroup_subsystem for netprio, so we really shouldn't be getting initialized from
cgroup_init.  We should be getting initialized from the module_init() call that
we register.
Neil
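
For anyone following along, the failure mode John describes (walking a list
head that is still all zeroes) can be illustrated with a small userspace
sketch. It assumes a circular doubly linked list in the style of the kernel's
list_head; the names list_node, list_init, list_add_tail_node,
list_head_ready and list_count are made up for this example. John's RFC patch
tests init_net.count instead; the NULL test below is only a stand-in for
"has this head been initialized yet".

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
	struct list_node *next, *prev;
};

/* An initialized empty circular list points at itself. */
static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* A head that is still all-zero has never been through list_init(). */
static int list_head_ready(const struct list_node *head)
{
	return head->next != NULL;
}

static void list_add_tail_node(struct list_node *head, struct list_node *n)
{
	assert(list_head_ready(head));
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Count entries, refusing to walk an uninitialized head. Without the
 * guard, the loop would start at head->next == NULL, which never equals
 * head, and the first p->next would dereference NULL.
 * Returns -1 when the head is not ready.
 */
static int list_count(const struct list_node *head)
{
	const struct list_node *p;
	int n = 0;

	if (!list_head_ready(head))
		return -1;
	for (p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

With an initialized head the loop runs zero times on an empty list, which is
the behaviour Neil expected; the zeroed head is the case that needs the early
return.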

> 
> 
> 
> 

^ permalink raw reply

* [PATCH net-next v3] ipv6: add ipv6_addr_hash() helper
From: Eric Dumazet @ 2012-07-18 14:27 UTC (permalink / raw)
  To: Joe Perches; +Cc: David Miller, netdev, Andrew McGregor, Dave Taht, Tom Herbert
In-Reply-To: <1342620841.2626.2786.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

Introduce ipv6_addr_hash() helper doing a XOR on all bits
of an IPv6 address, with an optimized x86_64 version.

Use it in flow dissector, as suggested by Andrew McGregor,
to reduce hash collision probabilities in fq_codel (and other
users of flow dissector)

Use it in ip6_tunnel.c and use more bit shuffling, as suggested
by David Laight, as existing hash was ignoring most of them.

Use it in sunrpc and use more bit shuffling, using hash_32().

As a cleanup, use it in net/ipv4/tcp_metrics.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew McGregor <andrewmcgr@gmail.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Joe Perches <joe@perches.com>
---
v3: use the explicit cast in ipv6_addr_hash() as suggested by Joe

 include/net/ipv6.h        |   13 +++++++++++++
 net/core/flow_dissector.c |    5 +++--
 net/ipv4/tcp_metrics.c    |   15 +++------------
 net/ipv6/ip6_tunnel.c     |   20 ++++++++++++--------
 net/sunrpc/svcauth_unix.c |   22 ++++------------------
 5 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index f695f39..01c34b3 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -419,6 +419,19 @@ static inline bool ipv6_addr_any(const struct in6_addr *a)
 #endif
 }
 
+static inline u32 ipv6_addr_hash(const struct in6_addr *a)
+{
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
+	const unsigned long *ul = (const unsigned long *)a;
+	unsigned long x = ul[0] ^ ul[1];
+
+	return (u32)(x ^ (x >> 32));
+#else
+	return (__force u32)(a->s6_addr32[0] ^ a->s6_addr32[1] ^
+			     a->s6_addr32[2] ^ a->s6_addr32[3]);
+#endif
+}
+
 static inline bool ipv6_addr_loopback(const struct in6_addr *a)
 {
 	return (a->s6_addr32[0] | a->s6_addr32[1] |
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a225089..466820b 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -4,6 +4,7 @@
 #include <linux/ipv6.h>
 #include <linux/if_vlan.h>
 #include <net/ip.h>
+#include <net/ipv6.h>
 #include <linux/if_tunnel.h>
 #include <linux/if_pppox.h>
 #include <linux/ppp_defs.h>
@@ -55,8 +56,8 @@ ipv6:
 			return false;
 
 		ip_proto = iph->nexthdr;
-		flow->src = iph->saddr.s6_addr32[3];
-		flow->dst = iph->daddr.s6_addr32[3];
+		flow->src = (__force __be32)ipv6_addr_hash(&iph->saddr);
+		flow->dst = (__force __be32)ipv6_addr_hash(&iph->daddr);
 		nhoff += sizeof(struct ipv6hdr);
 		break;
 	}
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 5a38a2d..1a115b6 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -211,10 +211,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_rsk(req)->rmt_addr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_rsk(req)->rmt_addr);
 		break;
 	default:
 		return NULL;
@@ -251,10 +248,7 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 	case AF_INET6:
 		tw6 = inet6_twsk((struct sock *)tw);
 		*(struct in6_addr *)addr.addr.a6 = tw6->tw_v6_daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&tw6->tw_v6_daddr);
 		break;
 	default:
 		return NULL;
@@ -291,10 +285,7 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		break;
 	case AF_INET6:
 		*(struct in6_addr *)addr.addr.a6 = inet6_sk(sk)->daddr;
-		hash = ((__force unsigned int) addr.addr.a6[0] ^
-			(__force unsigned int) addr.addr.a6[1] ^
-			(__force unsigned int) addr.addr.a6[2] ^
-			(__force unsigned int) addr.addr.a6[3]);
+		hash = ipv6_addr_hash(&inet6_sk(sk)->daddr);
 		break;
 	default:
 		return NULL;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index db32846..9a1d5fe 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -40,6 +40,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/netfilter_ipv6.h>
 #include <linux/slab.h>
+#include <linux/hash.h>
 
 #include <asm/uaccess.h>
 #include <linux/atomic.h>
@@ -70,11 +71,15 @@ MODULE_ALIAS_NETDEV("ip6tnl0");
 #define IPV6_TCLASS_MASK (IPV6_FLOWINFO_MASK & ~IPV6_FLOWLABEL_MASK)
 #define IPV6_TCLASS_SHIFT 20
 
-#define HASH_SIZE  32
+#define HASH_SIZE_SHIFT  5
+#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
 
-#define HASH(addr) ((__force u32)((addr)->s6_addr32[0] ^ (addr)->s6_addr32[1] ^ \
-		     (addr)->s6_addr32[2] ^ (addr)->s6_addr32[3]) & \
-		    (HASH_SIZE - 1))
+static u32 HASH(const struct in6_addr *addr1, const struct in6_addr *addr2)
+{
+	u32 hash = ipv6_addr_hash(addr1) ^ ipv6_addr_hash(addr2);
+
+	return hash_32(hash, HASH_SIZE_SHIFT);
+}
 
 static int ip6_tnl_dev_init(struct net_device *dev);
 static void ip6_tnl_dev_setup(struct net_device *dev);
@@ -166,12 +171,11 @@ static inline void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
 static struct ip6_tnl *
 ip6_tnl_lookup(struct net *net, const struct in6_addr *remote, const struct in6_addr *local)
 {
-	unsigned int h0 = HASH(remote);
-	unsigned int h1 = HASH(local);
+	unsigned int hash = HASH(remote, local);
 	struct ip6_tnl *t;
 	struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
 
-	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[h0 ^ h1]) {
+	for_each_ip6_tunnel_rcu(ip6n->tnls_r_l[hash]) {
 		if (ipv6_addr_equal(local, &t->parms.laddr) &&
 		    ipv6_addr_equal(remote, &t->parms.raddr) &&
 		    (t->dev->flags & IFF_UP))
@@ -205,7 +209,7 @@ ip6_tnl_bucket(struct ip6_tnl_net *ip6n, const struct ip6_tnl_parm *p)
 
 	if (!ipv6_addr_any(remote) || !ipv6_addr_any(local)) {
 		prio = 1;
-		h = HASH(remote) ^ HASH(local);
+		h = HASH(remote, local);
 	}
 	return &ip6n->tnls[prio][h];
 }
diff --git a/net/sunrpc/svcauth_unix.c b/net/sunrpc/svcauth_unix.c
index 2777fa8..4d01292 100644
--- a/net/sunrpc/svcauth_unix.c
+++ b/net/sunrpc/svcauth_unix.c
@@ -104,23 +104,9 @@ static void ip_map_put(struct kref *kref)
 	kfree(im);
 }
 
-#if IP_HASHBITS == 8
-/* hash_long on a 64 bit machine is currently REALLY BAD for
- * IP addresses in reverse-endian (i.e. on a little-endian machine).
- * So use a trivial but reliable hash instead
- */
-static inline int hash_ip(__be32 ip)
-{
-	int hash = (__force u32)ip ^ ((__force u32)ip>>16);
-	return (hash ^ (hash>>8)) & 0xff;
-}
-#endif
-static inline int hash_ip6(struct in6_addr ip)
+static inline int hash_ip6(const struct in6_addr *ip)
 {
-	return (hash_ip(ip.s6_addr32[0]) ^
-		hash_ip(ip.s6_addr32[1]) ^
-		hash_ip(ip.s6_addr32[2]) ^
-		hash_ip(ip.s6_addr32[3]));
+	return hash_32(ipv6_addr_hash(ip), IP_HASHBITS);
 }
 static int ip_map_match(struct cache_head *corig, struct cache_head *cnew)
 {
@@ -301,7 +287,7 @@ static struct ip_map *__ip_map_lookup(struct cache_detail *cd, char *class,
 	ip.m_addr = *addr;
 	ch = sunrpc_cache_lookup(cd, &ip.h,
 				 hash_str(class, IP_HASHBITS) ^
-				 hash_ip6(*addr));
+				 hash_ip6(addr));
 
 	if (ch)
 		return container_of(ch, struct ip_map, h);
@@ -331,7 +317,7 @@ static int __ip_map_update(struct cache_detail *cd, struct ip_map *ipm,
 	ip.h.expiry_time = expiry;
 	ch = sunrpc_cache_update(cd, &ip.h, &ipm->h,
 				 hash_str(ipm->m_class, IP_HASHBITS) ^
-				 hash_ip6(ipm->m_addr));
+				 hash_ip6(&ipm->m_addr));
 	if (!ch)
 		return -ENOMEM;
 	cache_put(ch, cd);
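
As a rough userspace illustration of the helper in the patch above, the sketch
below XORs a 128-bit address down to 32 bits in two ways (four 32-bit words,
or two 64-bit words followed by a fold) and then reduces the result to a
bucket index with a multiplicative step. The type addr128 and the functions
addr_hash_words, addr_hash_wide and bucket_of are names invented here, and
the multiplier is simply an odd constant picked for the sketch; memcpy is
used so the example does not depend on unaligned access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct in6_addr: 16 raw address bytes. */
struct addr128 {
	uint8_t bytes[16];
};

/* Portable variant: XOR the four 32-bit words of the address. */
static uint32_t addr_hash_words(const struct addr128 *a)
{
	uint32_t w[4];

	memcpy(w, a->bytes, sizeof(w));
	return w[0] ^ w[1] ^ w[2] ^ w[3];
}

/* 64-bit variant: XOR the two halves, then fold 64 bits down to 32. */
static uint32_t addr_hash_wide(const struct addr128 *a)
{
	uint64_t q[2];
	uint64_t x;

	memcpy(q, a->bytes, sizeof(q));
	x = q[0] ^ q[1];
	return (uint32_t)(x ^ (x >> 32));
}

/*
 * Multiplicative bucket reduction: multiply by an odd constant and keep
 * the top 'bits' bits, so every input bit can influence the bucket index.
 * The constant is an assumption of this sketch.
 */
static uint32_t bucket_of(uint32_t hash, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(hash * UINT32_C(0x9e370001)) >> (32 - bits);
}
```

Both variants give the same 32-bit value for a given address on either byte
order, because the fold XORs the same four words together. Taking the top
bits after a multiply, rather than masking the low bits, is one way to address
the concern raised earlier in the thread about masked XOR hashes ignoring most
of the address.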

^ permalink raw reply related

* Re: RFC: (now non Base64) replace packets already in queue
From: Erdt, Ralph @ 2012-07-18 14:50 UTC (permalink / raw)
  To: Nicolas de Pesloüan; +Cc: netdev@vger.kernel.org, Eric Dumazet, Rick Jones
In-Reply-To: <4FF4A873.1000001@gmail.com>

Hello.

I'm sorry for the very late answer; I had top-priority family issues.

> I suggest you try and send a properly formatted patch with your code, so
> that people here can have a look at it and evaluate the interest of
> integrating it into main line kernel.

Attached at the bottom of the email.


> That being said, I really think you should try to manage a userspace
> queue, [..] you can
> add many nice features into userspace to enhance the speed/quality 
> [..]
> And I really see your packet replacement system as one of those nice
> features and cannot imagine a good reason not to put it in userspace.

All these features are done already; e.g., we are using RoHC.
But we also want to use the TC infrastructure: it's already there, so why reimplement it?

Here is the patch. I couldn't find which git tree I should use, so this patch is against Linux-2.6. Sorry about that. Can you tell me which tree I should use?
-------------
From 52f27fa2b0867de821af38c731c2ebc763afb1f1 Mon Sep 17 00:00:00 2001
From: Ralph Erdt <Ralph.Robert.Erdt@fkie.fraunhofer.de>
Date: Wed, 18 Jul 2012 16:43:44 +0200
Subject: [PATCH] RFC: TC qdisc "replace packet in queue"

This adds a new TC qdisc, which replaces packets in the queue. It
compares every incoming packet with all of the packets in the queue.
If the incoming and the compared packet meet all these conditions:
 - UDPv4
 - not fragmented
 - TOS like the given value(s)
 - same TOS
 - same source IP
 - same destination IP
 - same destination port
the packet in the queue will be replaced with the incoming packet.

The variable "overlimit" is the counter of replaced packets

Background:
In very low bandwidth networks (<=9.6Kbps, shared, etc.) it's hard
(rather: impossible) to get all packets sent.
But some of the packets contain information that becomes obsolete over
time, e.g. (GPS) positions, which are sent periodically. If the
application sends a new packet while an old position packet is still in
the queue, the old packet is obsolete and can be dropped. But just
dropping the old packet and queuing the new packet will result in never
sending a packet of this type. So this qdisc replaces the old packet with
the new one. The information gets the chance to be sent, with the
newest available information.
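
To make the intended behaviour concrete, here is a small userspace sketch of
the enqueue rule, assuming the replacement overwrites the queued packet where
it sits so the stream keeps its place in the FIFO (that is my reading of the
description, not code taken from sch_pr.c). The types pkt and pr_queue and
the functions same_stream and pr_enqueue are hypothetical; the UDPv4,
fragment and TOS mask checks are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 8

struct pkt {
	uint32_t src_ip, dst_ip;
	uint16_t dst_port;
	uint8_t tos;
	int payload;		/* stand-in for the packet data */
};

struct pr_queue {
	struct pkt slots[QUEUE_CAP];
	size_t len;
	unsigned long replaced;	/* number of in-place replacements */
};

/* Same stream: same TOS, source IP, destination IP and destination port. */
static int same_stream(const struct pkt *a, const struct pkt *b)
{
	return a->tos == b->tos && a->src_ip == b->src_ip &&
	       a->dst_ip == b->dst_ip && a->dst_port == b->dst_port;
}

/*
 * Enqueue: if a queued packet belongs to the same stream, overwrite it
 * where it sits; otherwise append at the tail.
 * Returns 0 on success, -1 if the queue is full.
 */
static int pr_enqueue(struct pr_queue *q, const struct pkt *p)
{
	size_t i;

	assert(q->len <= QUEUE_CAP);
	for (i = 0; i < q->len; i++) {
		if (same_stream(&q->slots[i], p)) {
			q->slots[i] = *p;
			q->replaced++;
			return 0;
		}
	}
	if (q->len == QUEUE_CAP)
		return -1;
	q->slots[q->len++] = *p;
	return 0;
}
```

The linear scan on every enqueue is the CPU cost the patch description
mentions; it only stays cheap while the queue is short.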

Code-Status:
RFC for discussion.
The configuration via debugfs is ... not optimal. But following the
"little steps" rule this is a first option. Configuration with tc will
be done later (if this patch becomes official).

Signed-off-by: Ralph Erdt <Ralph.Robert.Erdt@fkie.fraunhofer.de>
---
 net/sched/Kconfig  |   16 +++
 net/sched/Makefile |    1 +
 net/sched/sch_pr.c |  264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 281 insertions(+), 0 deletions(-)
 create mode 100644 net/sched/sch_pr.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e7a8976..e29ad48 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -308,6 +308,22 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_PR
+	tristate "Packet Replace"
+	help
+	  Say Y here if you want to use the "Packet Replace"
+	  packet scheduling algorithm.
+
+	  This qdisc will replace packets in the queue, if this is a packet
+	  from the same UDP stream (IP/Port).
+
+	  See the top of <file:net/sched/sch_pr.c> for more details.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_pr.
+
+	  If unsure, say N.
+
 comment "Classification"
 
 config NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 5940a19..ef669ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -39,6 +39,7 @@ obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_PR)	+= sch_pr.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_pr.c b/net/sched/sch_pr.c
new file mode 100644
index 0000000..5cbf8d8
--- /dev/null
+++ b/net/sched/sch_pr.c
@@ -0,0 +1,264 @@
+/*
+ * net/sched/sch_pr.c	"packet replace"
+ *
+ * Copyright (c) 2012 Fraunhofer FKIE, all rights reserved.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Ralph Erdt (Fraunhofer FKIE),
+ *                                 <ralph.robert.erdt@fkie.fraunhofer.de>
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+
+#include <linux/ip.h>
+#include <net/ip.h>
+#include <linux/debugfs.h>
+
+/*
+ * replace packet in queue
+ * ==========================
+ * This is a modified fifo queue (fifo by Alexey Kuznetsov).
+ *
+ * This qdisc compares every incoming packet with all of the packets in the
+ * queue.
+ * If the incoming and the compared packet meet all these conditions:
+ *  - UDPv4
+ *  - not fragmented
+ *  - TOS like the given value(s)
+ *  - same TOS
+ *  - same source IP
+ *  - same destination IP
+ *  - same destination port
+ * the packet in the queue will be replaced with the incoming packet.
+ *
+ * The variable "overlimit" is the counter of replaced packets
+ *
+ * Background:
+ * In very low bandwidth networks (<=9.6Kbps, shared, etc.) it's hard
+ * (rather: impossible) to get all packets sent.
+ * But some packets carry information that becomes obsolete over time,
+ * e.g. (GPS) positions, which are sent periodically. If the application
+ * sends a new packet while an old position packet is still in the queue, the
+ * old packet is obsolete and can be dropped. But just dropping the old
+ * packet and queuing the new one at the tail can result in never sending a
+ * packet of this type, because each new packet restarts the wait.
+ * So this qdisc replaces the old packet in place with the new one. The stream
+ * keeps its queue position and is sent with the newest available information.
+ *
+ * DRAWBACKS:
+ * It is not CPU efficient: every enqueue walks the whole queue. But on very
+ * low bandwidth networks applications have to be careful with sending packets
+ * anyway, and with a proper configuration this will be OK.
+ */
+
+static struct dentry *dgdir, *dgfile;
+
+#define TOSBITMASK 0
+#define TOSCOMPARE 1
+/* tos Flag. 1.: BitMask. 2.: Compare with */
+static u8 tos[] = {0xFF, 0xFF};
+
+static bool pr_packet_to_work_with(struct sk_buff *pkt)
+{
+	struct iphdr *hdr;
+
+	if (unlikely(pkt->protocol != htons(ETH_P_IP)))
+		return false;
+
+	/* Only compare UDP - Layer 4 must be there */
+	if (unlikely(pkt->network_header == NULL))
+		return false;
+
+	hdr = ip_hdr(pkt);
+
+	/* Check for UDPv4 */
+	if (unlikely(hdr->protocol != IPPROTO_UDP))
+		return false;
+
+	/* no fragmented packets */
+	if (unlikely(ip_is_fragment(hdr)))
+		return false;
+
+	/* Correct TOS ? */
+	if ((hdr->tos & tos[TOSBITMASK]) != tos[TOSCOMPARE])
+		return false;
+
+	return true;
+}
+
+static bool comp(struct sk_buff *a, struct sk_buff *b)
+{
+	struct iphdr *ah = NULL;
+	struct iphdr *bh = NULL;
+	u32 ipA, ipB;
+	u16 portsA, portsB;
+	int poff;
+	/* The packet has a header
+	 *  - the existence was already checked by "pr_packet_to_work_with" */
+	ah = ip_hdr(a);
+	bh = ip_hdr(b);
+
+	/* TOS must be the same */
+	if (ah->tos != bh->tos)
+		return false;
+
+	/* IP and Port must be the same */
+	ipA = (__force u32)ah->daddr;
+	ipB = (__force u32)bh->daddr;
+	if ((ipA != ipB))
+		return false;
+	ipA = (__force u32)ah->saddr;
+	ipB = (__force u32)bh->saddr;
+	if ((ipA != ipB))
+		return false;
+
+	poff = proto_ports_offset(IPPROTO_UDP);
+	if (unlikely(poff < 0))
+		/* This should be impossible.. */
+		return false;
+
+	/* Src Ports are always different - just compare destination ports */
+	portsA = *(u16 *)((void *)ah + ah->ihl * 4 + poff + 2);
+	portsB = *(u16 *)((void *)bh + bh->ihl * 4 + poff + 2);
+	if ((portsA != portsB))
+		return false;
+
+	return true;
+}
+
+static int pr_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct sk_buff *replace = NULL;
+
+	/* Search, if there is a packet with same IDs */
+	/* Only search, if this packet is worth it */
+	if (pr_packet_to_work_with(skb)) {
+		struct sk_buff *it;
+		skb_queue_walk((&(sch->q)), it) {
+			/* If the other packet is worth it? */
+			if (pr_packet_to_work_with(it)) {
+				if (comp(skb, it)) {
+					replace = it;
+					break;
+				}
+			}
+		}
+	}
+
+	if (replace == NULL) {
+		/* a new kind of packet. Just enqueue */
+		if (likely(skb_queue_len(&sch->q) < sch->limit))
+			return qdisc_enqueue_tail(skb, sch);
+		return qdisc_reshape_fail(skb, sch);
+	} else {
+		/* replace the packet */
+		sch->qstats.overlimits++;
+		/* There is no drop nor replace. So do the replace myself */
+		skb->next = replace->next;
+		skb->prev = replace->prev;
+		if (replace->next != NULL)
+			replace->next->prev = skb;
+		if (replace->prev != NULL)
+			replace->prev->next = skb;
+		kfree_skb(replace);
+		return NET_XMIT_SUCCESS;
+	}
+}
+
+static int pr_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	sch->flags |= TCQ_F_CAN_BYPASS; /* sounds good, but what? */
+	sch->limit = qdisc_dev(sch)->tx_queue_len ? : 1;
+	return 0;
+}
+
+static int pr_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct tc_fifo_qopt opt = { .limit = sch->limit };
+
+	if (nla_put(skb, TCA_OPTIONS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	return skb->len;
+
+nla_put_failure:
+	return -1;
+}
+
+struct Qdisc_ops pr_qdisc_ops __read_mostly = {
+	.id		=	"pr",
+	.priv_size	=	0,
+	.enqueue	=	pr_enqueue,
+	.dequeue	=	qdisc_dequeue_head,
+	.peek		=	qdisc_peek_head,
+	.drop		=	qdisc_queue_drop,
+	.init		=	pr_init,
+	.reset		=	qdisc_reset_queue,
+	.change		=	pr_init,
+	.dump		=	pr_dump,
+	.owner		=	THIS_MODULE,
+};
+EXPORT_SYMBOL(pr_qdisc_ops);
+
+/* DebugFS interface as first shot configuration */
+static ssize_t dg_read_file(struct file *file, char __user *userbuf,
+					size_t count, loff_t *ppos)
+{
+	return simple_read_from_buffer(userbuf, count, ppos, tos, 2);
+}
+
+static ssize_t dg_write_file(struct file *file, const char __user *buf,
+					size_t count, loff_t *ppos)
+{
+	u8 tmp[] = {0xFF, 0xFF};
+	int res;
+	if (count != 2)
+		return -EINVAL;
+
+	res = copy_from_user(tmp, buf, count);
+	if (res != 0)
+		return -EFAULT;
+
+	/* Two bytes to copy.. for this a memcpy with errorhandling?!? */
+	tos[0] = tmp[0];
+	tos[1] = tmp[1];
+
+	return count;
+}
+
+static const struct file_operations dgfops = {
+	.read = dg_read_file,
+	.write = dg_write_file,
+};
+
+static int __init pr_module_init(void)
+{
+	int ret = register_qdisc(&pr_qdisc_ops);
+	if (!ret) {
+		/* open Communication channel */
+		dgdir = debugfs_create_dir("sch_pr", NULL);
+		dgfile = debugfs_create_file("tos", 0644, dgdir, tos, &dgfops);
+	}
+	return ret;
+}
+
+static void __exit pr_module_exit(void)
+{
+	debugfs_remove(dgfile);
+	debugfs_remove(dgdir);
+	unregister_qdisc(&pr_qdisc_ops);
+}
+
+module_init(pr_module_init);
+module_exit(pr_module_exit);
+MODULE_LICENSE("GPL");
-- 
1.7.7
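
The replace-on-enqueue idea described in the sch_pr.c header comment can be
sketched in plain userspace C. This is only an illustration under stated
assumptions, not the code from the patch: struct flow_key, struct pkt,
struct pr_queue, QUEUE_LIMIT and pr_enqueue_sketch() are hypothetical names,
the queue is a fixed array rather than an sk_buff list, and the TOS mask
check and fragment handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key: the fields sch_pr compares between two packets. */
struct flow_key {
	uint32_t saddr;	/* source IPv4 address */
	uint32_t daddr;	/* destination IPv4 address */
	uint16_t dport;	/* UDP destination port */
	uint8_t  tos;	/* IPv4 TOS byte */
};

struct pkt {
	struct flow_key key;
	int payload;	/* stand-in for the packet data */
};

#define QUEUE_LIMIT 8	/* assumption: plays the role of sch->limit */

struct pr_queue {
	struct pkt slots[QUEUE_LIMIT];
	size_t len;
	unsigned replaced;	/* plays the role of the overlimits counter */
};

static bool same_flow(const struct flow_key *a, const struct flow_key *b)
{
	return a->tos == b->tos && a->saddr == b->saddr &&
	       a->daddr == b->daddr && a->dport == b->dport;
}

/*
 * Enqueue with replacement: if a packet of the same flow is already queued,
 * overwrite it in place so the flow keeps its position; otherwise append.
 * Returns false only when the queue is full and nothing was replaced.
 */
static bool pr_enqueue_sketch(struct pr_queue *q, const struct pkt *p)
{
	assert(q && p);

	for (size_t i = 0; i < q->len; i++) {
		if (same_flow(&q->slots[i].key, &p->key)) {
			q->slots[i] = *p;
			q->replaced++;
			return true;
		}
	}
	if (q->len >= QUEUE_LIMIT)
		return false;
	q->slots[q->len++] = *p;
	return true;
}
```

The linear scan makes the cost per enqueue proportional to the queue length,
which is the drawback the header comment mentions.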

^ permalink raw reply related

* Re: [PATCH] mlx4_en: map entire pages to increase throughput
From: Or Gerlitz @ 2012-07-18 14:59 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo, Yevgeny Petrilin
  Cc: Or Gerlitz, Rick Jones, davem@davemloft.net,
	netdev@vger.kernel.org, amirv@mellanox.com,
	brking@linux.vnet.ibm.com, leitao@linux.vnet.ibm.com,
	klebers@linux.vnet.ibm.com, linuxppc-dev@lists.ozlabs.org,
	anton@samba.org
In-Reply-To: <20120716205708.GB16137@oc1711230544.ibm.com>

On 7/16/2012 11:57 PM, Thadeu Lima de Souza Cascardo wrote:
> On Mon, Jul 16, 2012 at 11:43:33PM +0300, Or Gerlitz wrote:
>>
>>
>> TCP_STREAM from this setup before the patch would be good to know as well
>>
>
> Does the stream test that I did with uperf using messages of 64000 bytes fit?

netperf/TCP_STREAM is very common, and it would help to better compare
the numbers you get on your systems before/after the patch with runs
done here. As for review of the patch itself and the related discussion,
Yevgeny Petrilin should be looking at your patch; he'll be in by early
next week.

Or.

^ permalink raw reply

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: John Fastabend @ 2012-07-18 15:14 UTC (permalink / raw)
  To: Neil Horman; +Cc: davem, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <5006C679.2040605@intel.com>

On 7/18/2012 7:21 AM, John Fastabend wrote:
> On 7/18/2012 5:45 AM, Neil Horman wrote:
>> On Tue, Jul 17, 2012 at 05:33:16PM -0700, John Fastabend wrote:
>>> When the netprio cgroup is built in the kernel cgroup_init will call
>>> cgrp_create which eventually calls update_netdev_tables. This is
>>> being called before do_initcalls() so a null ptr dereference occurs
>>> on init_net.
>>>
>>> This patch adds a check on init_net.count to verify the structure
>>> has been initialized. The failure was introduced here,
>>>
>>> commit ef209f15980360f6945873df3cd710c5f62f2a3e
>>> Author: Gao feng <gaofeng@cn.fujitsu.com>
>>> Date:   Wed Jul 11 21:50:15 2012 +0000
>>>
>>>      net: cgroup: fix access the unallocated memory in netprio cgroup
>>>
>>> Tested with ping with netprio_cgroup as a module and built in.
>>>
>>> Marked RFC for now I think DaveM might have a reason why this needs
>>> some improvement.
>>>
>>> Reported-by: Mark Rustad <mark.d.rustad@intel.com>
>>> Cc: Neil Horman <nhorman@tuxdriver.com>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Cc: Gao feng <gaofeng@cn.fujitsu.com>
>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>> ---
>>>
>>>   net/core/netprio_cgroup.c |    3 +++
>>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
>>> index b2e9caa..e9fd7fd 100644
>>> --- a/net/core/netprio_cgroup.c
>>> +++ b/net/core/netprio_cgroup.c
>>> @@ -116,6 +116,9 @@ static int update_netdev_tables(void)
>>>       u32 max_len;
>>>       struct netprio_map *map;
>>>
>>> +    if (!atomic_read(&init_net.count))
>>> +        return ret;
>>> +
>>>       rtnl_lock();
>>>       max_len = atomic_read(&max_prioidx) + 1;
>>>       for_each_netdev(&init_net, dev) {
>>>
>>>
>>
>> John, do you have a stack trace of this?  I'm having a hard time
>> seeing how we get into this path prior to the network stack being
>> initialized.
>
> Mark had a partial trace
>
> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
> 2097152 bytes)
> [    0.005550] Inode-cache hash table entries: 131072 (order: 8, 1048576
> bytes)
> [    0.007165] Mount-cache hash table entries: 256
> [    0.010289] Initializing cgroup subsys net_cls
> [    0.010947] Initializing cgroup subsys net_prio
> [    0.011039] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000828
> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
>
>
>>
>> It also brings up another point.  If this is happening, and we're
>> creating the root cgroup from start_kernel, then we're actually
>> initializing some cgroups twice, because a few cgroups register
>> themselves via cgroup_load_subsys in module_init specified routines.
>> So if you're building netprio_cgroup or net_cls_cgroup as part of the
>> monolithic kernel, you'll get cgroup_create called prior to your
>> module_init() call.  That's not good.
>
> Well your module_init() wouldn't be called in this case right? I think
> netprio has a bug where we only register a netdevice notifier when
> it's built as a module.
>
> same issue with cls_cgroup and register_tcf_proto_ops?
>

Neil, I was very unclear in the above. What I meant here was that
cgroup_load_subsys() checks ss->module, so you should _not_
get two create calls. It also returns 0, so the register calls for
netdev notifiers should still get set up.

I missed the return 0 part and so I thought we might abort before
this occurs but it looks ok to me on second glance.
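
The guard in the proposed patch (return early while init_net.count is still
zero) can be illustrated with a small userspace analogue. This is a sketch
under assumptions, not kernel code: struct fake_net, fake_init_net,
fake_net_setup() and update_tables_sketch() are hypothetical stand-ins, and
the reference count is modelled with a C11 atomic_int.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a namespace object whose reference count
 * stays zero until its setup routine has run. */
struct fake_net {
	atomic_int count;
	int ndevs;
};

/* Static storage: zero-initialised, like a structure during early boot. */
static struct fake_net fake_init_net;

static void fake_net_setup(struct fake_net *net, int ndevs)
{
	assert(net);
	net->ndevs = ndevs;
	atomic_store(&net->count, 1);	/* mark the structure as usable */
}

/* Returns the number of devices that would be visited, 0 if called too early. */
static int update_tables_sketch(struct fake_net *net)
{
	assert(net);
	if (!atomic_load(&net->count))
		return 0;	/* not set up yet: nothing to walk */
	return net->ndevs;
}
```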

^ permalink raw reply

* r8169: link up, link down
From: J. Christopher Pereira @ 2012-07-18 15:21 UTC (permalink / raw)
  To: netdev

Hi:

An old FC11 server died recently so we changed the hardware.
The new hardware has a r8169 nic with the following driver version (dmesg
output):

        r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
        r8169 0000:02:01.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
        r8169 0000:02:01.0: no PCI Express capability

Once in a while, the nic stops working with many "link up" and "link down"
messages:

        r8169: eth1: link up
        r8169: eth1: link down

The problem is solved by resetting the server.

I found some bug reports out there, but couldn't find any clear info about
the problem or if it has been fixed in newer driver versions.
The server is using the latest FC11 kernel
(kernel-2.6.30.10-105.2.23.fc11.x86_64) and FC11 reached its end of life.
Replacing the nic with another model is obviously the simplest solution, but
is there any other solution or workaround?

^ permalink raw reply

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: Neil Horman @ 2012-07-18 15:25 UTC (permalink / raw)
  To: John Fastabend; +Cc: davem, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <5006C679.2040605@intel.com>

On Wed, Jul 18, 2012 at 07:21:45AM -0700, John Fastabend wrote:
> On 7/18/2012 5:45 AM, Neil Horman wrote:
> >On Tue, Jul 17, 2012 at 05:33:16PM -0700, John Fastabend wrote:
> >>When the netprio cgroup is built in the kernel cgroup_init will call
> >>cgrp_create which eventually calls update_netdev_tables. This is
> >>being called before do_initcalls() so a null ptr dereference occurs
> >>on init_net.
> >>
> >>This patch adds a check on init_net.count to verify the structure
> >>has been initialized. The failure was introduced here,
> >>
> >>commit ef209f15980360f6945873df3cd710c5f62f2a3e
> >>Author: Gao feng <gaofeng@cn.fujitsu.com>
> >>Date:   Wed Jul 11 21:50:15 2012 +0000
> >>
> >>     net: cgroup: fix access the unallocated memory in netprio cgroup
> >>
> >>Tested with ping with netprio_cgroup as a module and built in.
> >>
> >>Marked RFC for now I think DaveM might have a reason why this needs
> >>some improvement.
> >>
> >>Reported-by: Mark Rustad <mark.d.rustad@intel.com>
> >>Cc: Neil Horman <nhorman@tuxdriver.com>
> >>Cc: Eric Dumazet <edumazet@google.com>
> >>Cc: Gao feng <gaofeng@cn.fujitsu.com>
> >>Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> >>---
> >>
> >>  net/core/netprio_cgroup.c |    3 +++
> >>  1 files changed, 3 insertions(+), 0 deletions(-)
> >>
> >>diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> >>index b2e9caa..e9fd7fd 100644
> >>--- a/net/core/netprio_cgroup.c
> >>+++ b/net/core/netprio_cgroup.c
> >>@@ -116,6 +116,9 @@ static int update_netdev_tables(void)
> >>  	u32 max_len;
> >>  	struct netprio_map *map;
> >>
> >>+	if (!atomic_read(&init_net.count))
> >>+		return ret;
> >>+
> >>  	rtnl_lock();
> >>  	max_len = atomic_read(&max_prioidx) + 1;
> >>  	for_each_netdev(&init_net, dev) {
> >>
> >>
> >
> >John, do you have a stack trace of this?  I'm having a hard time seeing how we
> >get into this path prior to the network stack being initialized.
> 
> Mark had a partial trace
> 
> [    0.003455] Dentry cache hash table entries: 262144 (order: 9,
> 2097152 bytes)
> [    0.005550] Inode-cache hash table entries: 131072 (order: 8,
> 1048576 bytes)
> [    0.007165] Mount-cache hash table entries: 256
> [    0.010289] Initializing cgroup subsys net_cls
> [    0.010947] Initializing cgroup subsys net_prio
> [    0.011039] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000828
> [    0.011998] IP: [<ffffffff814202c8>] update_netdev_tables+0x68/0xe0
> 
> 
Well, I was really hoping to see what call path got us there, so this doesn't
really help.  I'll try to setup a system here to reproduce later today.

> >
> >It also brings up another point.  If this is happening, and we're creating the
> >root cgroup from start_kernel, then we're actually initializing some cgroups
> >twice, because a few cgroups register themselves via cgroup_load_subsys in
> >module_init specified routines.  So if you're building netprio_cgroup or
> >net_cls_cgroup as part of the monolithic kernel, you'll get cgroup_create called
> >prior to your module_init() call.  That's not good.
> 
> Well your module_init() wouldn't be called in this case right? I think
> netprio has a bug where we only register a netdevice notifier when
> it's built as a module.
> 
> same issue with cls_cgroup and register_tcf_proto_ops?
> 
No.  When built monolithically, module_init is defined as __initcall, so it
still gets called during the boot process.

> >
> >In fact, the cgroup_subsys struct has an early_init flag that cgroup_init
> >appears to use to skip the initialization of subsystems that don't need to be
> >initialized that early in boot (assuming that's the path we're going down to get
> >to this oops).
> 
> Do you mean ss->early_init? Not sure that helps us: either we get called
> by cgroup_init because we don't have an early_init callback or we get
> called via cgroup_init_early even earlier.
> 
Yeah, I see what you mean.  Seems like what we need is to either:
1) move cgroup_init to later in the boot process.  If you're not early_init,
then I don't see why the subsystem can't wait until later in the boot process
(i.e. make cgroup_init a late_initcall or some such).

or

2) Allow module based cgroups to flag themselves as needing late init after the
rest of the kernel has booted.


Neil

> >
> >If you can post the call stack, I'd appreciate it, I'd like to dig a bit deeper
> >into this.
> 
> Yes I'll do this shortly.
> 
> >Neil
> >
> 

^ permalink raw reply

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: David Miller @ 2012-07-18 15:53 UTC (permalink / raw)
  To: nhorman; +Cc: john.r.fastabend, gaofeng, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <20120718152520.GG25563@hmsreliant.think-freely.org>

From: Neil Horman <nhorman@tuxdriver.com>
Date: Wed, 18 Jul 2012 11:25:20 -0400

> Yeah, I see what you mean.  Seems like what we need is to either:
> 1) move cgroup_init to later in the boot process.  If you're not early_init,
> then I don't see why the subsystem can't wait until later in the boot process
> (i.e. make cgroup_init a late_initcall or some such).
> 
> or
> 
> 2) Allow module based cgroups to flag themselves as needing late init after the
> rest of the kernel has booted.

These are way too complicated compared to John's currently proposed
fix for this recently introduced regression.

I want a one liner which I can prove is going to remove the crash.

All of this talk of rearranging initcall ordering for cgroup stuff
is too ambitious this late in the -rc.

^ permalink raw reply

* Re: [PATCH net-next] ipv6: fix inet6_csk_xmit()
From: David Miller @ 2012-07-18 16:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng
In-Reply-To: <1342597084.2626.1851.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jul 2012 09:38:04 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> We should provide to inet6_csk_route_socket a struct flowi6 pointer,
> so that inet6_csk_xmit() works correctly instead of sending garbage.
> 
> Also add some consts 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Yuchung Cheng <ycheng@google.com>

Thanks a lot for fixing this Eric.

Applied.

^ permalink raw reply

* Re: [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: David Miller @ 2012-07-18 16:02 UTC (permalink / raw)
  To: pmoore; +Cc: netdev
In-Reply-To: <20120717210738.22790.23522.stgit@sifl>

From: Paul Moore <pmoore@redhat.com>
Date: Tue, 17 Jul 2012 17:07:47 -0400

> As reported by Alan Cox, and verified by Lin Ming, when a user
> attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
> tag the kernel dies a terrible death when it attempts to follow a NULL
> pointer (the skb argument to cipso_v4_validate() is NULL when called via
> the setsockopt() syscall).
> 
> This patch fixes this by first checking to ensure that the skb is
> non-NULL before using it to find the incoming network interface.  In
> the unlikely case where the skb is NULL and the user attempts to add
> a CIPSO option with the _TAG_LOCAL tag we return an error as this is
> not something we want to allow.
> 
> A simple reproducer, kindly supplied by Lin Ming, although you must
> have the CIPSO DOI #3 configured on the system first or you will be
> caught early in cipso_v4_validate():
 ...
> CC: Lin Ming <mlin@ss.pku.edu.cn>
> Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
> Signed-off-by: Paul Moore <pmoore@redhat.com>

Applied and queued up for -stable, thanks Paul.

^ permalink raw reply

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: David Miller @ 2012-07-18 16:07 UTC (permalink / raw)
  To: ja; +Cc: eric.dumazet, netdev
In-Reply-To: <alpine.LFD.2.00.1207181105320.2154@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Wed, 18 Jul 2012 11:36:08 +0300 (EEST)

> 	Is the cost of read_seqbegin a problem? Here is a
> 2nd version, I still keep this first check for now.

No, the read side of a seqlock is extremely cheap, it's just a plain
read and compare of a read-mostly integer.

> Subject: [PATCH v2] ipv4: use seqlock for nh_exceptions
> 
> From: Julian Anastasov <ja@ssi.bg>
> 
> 	Use global seqlock for the nh_exceptions. Call
> fnhe_oldest with the right hash chain. Correct the diff
> value for dst_set_expires.
> 
> v2: after suggestions from Eric Dumazet:
> * get rid of spin lock fnhe_lock, rearrange update_or_create_fnhe
> * continue daddr search in rt_bind_exception
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

I think if you get a seqlock mis-compare, you will need to branch back
to rescan the hash chain from the beginning.

Otherwise I like these changes a lot.

We should perhaps consider doing something similar in the TCP metrics
code.

Thanks!
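
The point about rescanning from the head after a sequence mis-compare can be
sketched with a userspace seqlock built on C11 atomics. This is a minimal
sketch, not the kernel's seqlock_t API and not the code from the patch: it
assumes a single writer, and struct entry, struct chain, CHAIN_LEN,
chain_update() and chain_lookup() are hypothetical names standing in for a
hash chain of exceptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CHAIN_LEN 4

struct entry {
	_Atomic uint32_t daddr;
	_Atomic uint32_t pmtu;
};

struct chain {
	atomic_uint seq;	/* even: stable, odd: write in progress */
	struct entry e[CHAIN_LEN];
};

/* Single-writer update: bump seq to odd, write the data, bump to even. */
static void chain_update(struct chain *c, int slot, uint32_t daddr,
			 uint32_t pmtu)
{
	unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	assert(slot >= 0 && slot < CHAIN_LEN);
	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->e[slot].daddr, daddr, memory_order_relaxed);
	atomic_store_explicit(&c->e[slot].pmtu, pmtu, memory_order_relaxed);
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Lockless lookup: any change of seq restarts the scan at the chain head. */
static bool chain_lookup(struct chain *c, uint32_t daddr, uint32_t *pmtu)
{
	for (;;) {
		unsigned start = atomic_load_explicit(&c->seq,
						      memory_order_acquire);
		bool found = false;
		uint32_t val = 0;

		if (start & 1)
			continue;	/* writer active, try again */

		for (int i = 0; i < CHAIN_LEN; i++) {
			if (atomic_load_explicit(&c->e[i].daddr,
						 memory_order_relaxed) == daddr) {
				val = atomic_load_explicit(&c->e[i].pmtu,
							   memory_order_relaxed);
				found = true;
				break;
			}
		}

		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&c->seq, memory_order_relaxed) != start)
			continue;	/* mis-compare: rescan from the head */

		if (found)
			*pmtu = val;
		return found;
	}
}
```

Without contention the reader only pays for two loads and a compare of the
sequence counter around its scan.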

^ permalink raw reply

* Re: pull request: sfc-next 2012-07-17
From: David Miller @ 2012-07-18 16:10 UTC (permalink / raw)
  To: bhutchings; +Cc: linux-net-drivers, netdev
In-Reply-To: <1342544740.2698.13.camel@bwh-desktop.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 17 Jul 2012 18:05:40 +0100

> The following changes since commit 141e369de698f2e17bf716b83fcc647ddcb2220c:
> 
>   xfrm: Initialize the struct xfrm_dst behind the dst_enty field (2012-07-14 00:29:12 -0700)
> 
> are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next.git for-davem
> 
> (commit c2dbab39db1c3c2ccbdbb2c6bac6f07cc7a7c1f6)
> 
> 1. Fix potential badness when running a self-test with SR-IOV enabled.
> 2. Fix calculation of some interface statistics that could run backward.
> 3. Miscellaneous cleanup.

Looks good, pulled, thanks Ben.

Out of curiosity, why the conversion to the generic DMA interfaces?
Do you plan on using something uniquely provided by them vs. the PCI
specific DMA interfaces (ability to specify GFP flags, stuff like
that) or do you really plan on having non-PCI devices in the future?

^ permalink raw reply

* Re: getsockopt/setsockopt with SO_RCVBUF and SO_SNDBUF "non-standard" behaviour
From: Eric Dumazet @ 2012-07-18 16:11 UTC (permalink / raw)
  To: Eugen Dedu; +Cc: linux-kernel, netdev
In-Reply-To: <5006DD6B.3030300@pu-pm.univ-fcomte.fr>

On Wed, 2012-07-18 at 17:59 +0200, Eugen Dedu wrote:
> Any idea?
> 
> On 17/07/12 11:27, Eugen Dedu wrote:
> > Hi all,
> >
> > I looked on Internet and at the old thread
> > http://lkml.indiana.edu/hypermail/linux/kernel/0108.0/0275.html, but the
> > issue is still not settled as far as I see.
> >
> > I need to have the highest memory available for snd/rcv buffer and I
> > need to know/confirm how much it allocated for my process (how much I
> > can use).
> >
> > So with Linux we need to do something like:
> > setsockopt (..., SO_RCVBUF, 256000, ...)
> > getsockopt (..., SO_RCVBUF, &i, ...)
> > i /= 2;
> >
> > where i is the size I am looking for.
> >
> > Now, to make this code work for other OSes it should be changed to:
> > setsockopt (..., SO_RCVBUF, 256000, ...)
> > getsockopt (..., SO_RCVBUF, &i, ...)
> > #ifdef LINUX
> > i /= 2;
> > #endif
> >
> > First question, is this code correct? If not, what code gives the amount
> > of memory useable for my process?
> >
> > Second, it seems to me that linux is definitely "non-standard" here.
> > Saying that linux uses twice as much memory has nothing to do with that,
> > since getsockopt should return what the application can count on, not
> > what is the internal use. It is like a hypothetical malloc (10) would
> > return not 10, but 20 (including meta-information). Is that right?
> >
> > Cheers,

That's the way it has been done on Linux since day 0.

You can probably find a lot of pages on the web explaining the
rationale.

If your application handles UDP frames, what should SO_RCVBUF count?

If it's the amount of payload bytes, you could have a pathological
situation where an attacker sends 1-byte UDP frames fast enough and
could consume a lot of kernel memory.

Each frame consumes a fair amount of kernel memory (between 512 bytes
and 8 Kbytes depending on the driver).

So Linux says: if the user expects to receive XXXX bytes, set a limit on
the _kernel_ memory used to store these bytes, and assume an overhead of
100%. That is: allow 2*XXXX bytes to be allocated for socket
receive buffers.
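
The set-then-query sequence from the question can be written out as a small C
sketch. It uses only the standard setsockopt()/getsockopt() calls with
SO_RCVBUF; the helper names set_and_query_rcvbuf() and
rcvbuf_payload_estimate() are hypothetical, and halving the reported value is
just the 100% overhead estimate described above, not a guarantee of how much
payload will fit. On Linux the requested value is also capped by the
net.core.rmem_max sysctl before it is doubled.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Ask for `requested` bytes and return what getsockopt() reports, or -1. */
static int set_and_query_rcvbuf(int fd, int requested)
{
	int reported = 0;
	socklen_t len = sizeof(reported);

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
		       &requested, sizeof(requested)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &reported, &len) < 0)
		return -1;
	return reported;
}

/*
 * Hypothetical helper: rough payload estimate. When the kernel doubles the
 * requested size to account for bookkeeping overhead (as Linux does), half
 * of the reported value corresponds to the original request.
 */
static int rcvbuf_payload_estimate(int reported, int kernel_doubles)
{
	assert(reported >= 0);
	return kernel_doubles ? reported / 2 : reported;
}
```

A caller would create a socket, call set_and_query_rcvbuf(fd, 256000), and
pass the result to rcvbuf_payload_estimate() with kernel_doubles set only on
Linux.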

^ permalink raw reply

* Re: [PATCH net-next 0/2] Pull request for 'davem-next.r8169' branch
From: David Miller @ 2012-07-18 16:11 UTC (permalink / raw)
  To: romieu; +Cc: netdev, hayeswang
In-Reply-To: <cover.1342562326.git.romieu@fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Wed, 18 Jul 2012 00:09:42 +0200

> Please pull from branch 'davem-next.r8169' in repository
> 
> git://violet.fr.zoreil.com/romieu/linux davem-next.r8169
> 
> to get the changes below.

Pulled, thanks Francois.

^ permalink raw reply
