Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: Gao feng @ 2012-07-18  1:59 UTC (permalink / raw)
  To: John Fastabend; +Cc: davem, nhorman, mark.d.rustad, netdev, eric.dumazet
In-Reply-To: <20120718003316.2979.49278.stgit@jf-dev1-dcblab>

于 2012年07月18日 08:33, John Fastabend 写道:
> When the netprio cgroup is built in the kernel cgroup_init will call
> cgrp_create which eventually calls update_netdev_tables. This is
> being called before do_initcalls() so a null ptr dereference occurs
> on init_net.
> 
> This patch adds a check on init_net.count to verify the structure
> has been initialized. The failure was introduced here,
> 
> commit ef209f15980360f6945873df3cd710c5f62f2a3e
> Author: Gao feng <gaofeng@cn.fujitsu.com>
> Date:   Wed Jul 11 21:50:15 2012 +0000
> 
>     net: cgroup: fix access the unallocated memory in netprio cgroup
> 
> Tested with ping with netprio_cgroup as a module and built in.
> 
> Marked RFC for now I think DaveM might have a reason why this needs
> some improvement.
> 
> Reported-by: Mark Rustad <mark.d.rustad@intel.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Gao feng <gaofeng@cn.fujitsu.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 
>  net/core/netprio_cgroup.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index b2e9caa..e9fd7fd 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -116,6 +116,9 @@ static int update_netdev_tables(void)
>  	u32 max_len;
>  	struct netprio_map *map;


Thanks John.
It's my mistake.

Can we make sure init_net.count is zero here?
I can't find some places to initialize it to zero.

>  
> +	if (!atomic_read(&init_net.count))
> +		return ret;
> +
>  	rtnl_lock();
>  	max_len = atomic_read(&max_prioidx) + 1;
>  	for_each_netdev(&init_net, dev) {
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: Julian Anastasov @ 2012-07-18  1:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120717.150920.1324071045620152376.davem@davemloft.net>


	Hello,

On Tue, 17 Jul 2012, David Miller wrote:

> From: Julian Anastasov <ja@ssi.bg>
> Date: Wed, 18 Jul 2012 01:14:04 +0300 (EEST)
> 
> > 	Aha, I see. Something around fnhe_oldest() and its
> > daddr arg does not look good. If the goal is to hijack
> > some entry, probably for another daddr and comparing it with
> > tcpm_new(), may be we should remove this daddr arg and fully
> > reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
> > because the find_or_create_fnhe() callers modify only specific
> > fields, we should not end up with wrong gateway inherited from
> > another daddr, for example.
> 
> Better would be to use a seqlock when reading it's values.
> 
> Either way, patches welcome :-)

	I created patch with seqlock usage. This version
is with global seqlock because I'm not sure if 2048 locks
per NH are good idea. This is only compile tested.
After comments may be I have to resubmit in separate message.


Subject: [PATCH] ipv4: use seqlock for nh_exceptions

From: Julian Anastasov <ja@ssi.bg>

	Use global seqlock for the nh_exceptions. Call
fnhe_oldest with the right hash chain. Correct the diff
value for dst_set_expires.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_fib.h |    2 +-
 net/ipv4/route.c     |  117 ++++++++++++++++++++++++++++++++------------------
 2 files changed, 76 insertions(+), 43 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index e9ee1ca..2daf096 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -51,7 +51,7 @@ struct fib_nh_exception {
 	struct fib_nh_exception __rcu	*fnhe_next;
 	__be32				fnhe_daddr;
 	u32				fnhe_pmtu;
-	u32				fnhe_gw;
+	__be32				fnhe_gw;
 	unsigned long			fnhe_expires;
 	unsigned long			fnhe_stamp;
 };
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f67e702..e037c73 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1334,8 +1334,9 @@ static void ip_rt_build_flow_key(struct flowi4 *fl4, const struct sock *sk,
 }
 
 static DEFINE_SPINLOCK(fnhe_lock);
+static DEFINE_SEQLOCK(fnhe_seqlock);
 
-static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash, __be32 daddr)
+static struct fib_nh_exception *fnhe_oldest(struct fnhe_hash_bucket *hash)
 {
 	struct fib_nh_exception *fnhe, *oldest;
 
@@ -1358,47 +1359,76 @@ static inline u32 fnhe_hashfun(__be32 daddr)
 	return hval & (FNHE_HASH_SIZE - 1);
 }
 
-static struct fib_nh_exception *find_or_create_fnhe(struct fib_nh *nh, __be32 daddr)
+static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
+				  u32 pmtu, unsigned long expires)
 {
 	struct fnhe_hash_bucket *hash = nh->nh_exceptions;
 	struct fib_nh_exception *fnhe;
 	int depth;
 	u32 hval;
 
-	if (!hash) {
-		hash = nh->nh_exceptions = kzalloc(FNHE_HASH_SIZE * sizeof(*hash),
-						   GFP_ATOMIC);
-		if (!hash)
-			return NULL;
-	}
+	if (!hash)
+		goto start;
 
+repeat:
 	hval = fnhe_hashfun(daddr);
 	hash += hval;
 
 	depth = 0;
+	write_seqlock_bh(&fnhe_seqlock);
 	for (fnhe = rcu_dereference(hash->chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
 		if (fnhe->fnhe_daddr == daddr)
-			goto out;
+			break;
 		depth++;
 	}
 
-	if (depth > FNHE_RECLAIM_DEPTH) {
-		fnhe = fnhe_oldest(hash + hval, daddr);
-		goto out_daddr;
-	}
-	fnhe = kzalloc(sizeof(*fnhe), GFP_ATOMIC);
-	if (!fnhe)
-		return NULL;
+	if (fnhe) {
+		if (gw)
+			fnhe->fnhe_gw = gw;
+		if (pmtu) {
+			fnhe->fnhe_pmtu = pmtu;
+			fnhe->fnhe_expires = expires;
+		}
+	} else {
+		if (depth > FNHE_RECLAIM_DEPTH)
+			fnhe = fnhe_oldest(hash);
+		else {
+			fnhe = kzalloc(sizeof(*fnhe), GFP_ATOMIC);
+			if (!fnhe) {
+				write_sequnlock_bh(&fnhe_seqlock);
+				return;
+			}
 
-	fnhe->fnhe_next = hash->chain;
-	rcu_assign_pointer(hash->chain, fnhe);
+			fnhe->fnhe_next = hash->chain;
+			rcu_assign_pointer(hash->chain, fnhe);
+		}
+		fnhe->fnhe_daddr = daddr;
+		fnhe->fnhe_gw = gw;
+		fnhe->fnhe_pmtu = pmtu;
+		fnhe->fnhe_expires = expires;
+	}
 
-out_daddr:
-	fnhe->fnhe_daddr = daddr;
-out:
 	fnhe->fnhe_stamp = jiffies;
-	return fnhe;
+	write_sequnlock_bh(&fnhe_seqlock);
+	return;
+
+start:
+	spin_lock_bh(&fnhe_lock);
+	hash = nh->nh_exceptions;
+	if (hash) {
+		spin_unlock_bh(&fnhe_lock);
+		goto repeat;
+	}
+	nh->nh_exceptions = kzalloc(FNHE_HASH_SIZE * sizeof(*hash),
+				    GFP_ATOMIC);
+	if (!nh->nh_exceptions) {
+		spin_unlock_bh(&fnhe_lock);
+		return;
+	}
+	hash = nh->nh_exceptions;
+	spin_unlock_bh(&fnhe_lock);
+	goto repeat;
 }
 
 static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flowi4 *fl4)
@@ -1452,13 +1482,9 @@ static void __ip_do_redirect(struct rtable *rt, struct sk_buff *skb, struct flow
 		} else {
 			if (fib_lookup(net, fl4, &res) == 0) {
 				struct fib_nh *nh = &FIB_RES_NH(res);
-				struct fib_nh_exception *fnhe;
 
-				spin_lock_bh(&fnhe_lock);
-				fnhe = find_or_create_fnhe(nh, fl4->daddr);
-				if (fnhe)
-					fnhe->fnhe_gw = new_gw;
-				spin_unlock_bh(&fnhe_lock);
+				update_or_create_fnhe(nh, fl4->daddr, new_gw,
+						      0, 0);
 			}
 			rt->rt_gateway = new_gw;
 			rt->rt_flags |= RTCF_REDIRECTED;
@@ -1663,15 +1689,9 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 
 	if (fib_lookup(dev_net(rt->dst.dev), fl4, &res) == 0) {
 		struct fib_nh *nh = &FIB_RES_NH(res);
-		struct fib_nh_exception *fnhe;
 
-		spin_lock_bh(&fnhe_lock);
-		fnhe = find_or_create_fnhe(nh, fl4->daddr);
-		if (fnhe) {
-			fnhe->fnhe_pmtu = mtu;
-			fnhe->fnhe_expires = jiffies + ip_rt_mtu_expires;
-		}
-		spin_unlock_bh(&fnhe_lock);
+		update_or_create_fnhe(nh, fl4->daddr, 0, mtu,
+				      jiffies + ip_rt_mtu_expires);
 	}
 	rt->rt_pmtu = mtu;
 	dst_set_expires(&rt->dst, ip_rt_mtu_expires);
@@ -1898,6 +1918,7 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 {
 	struct fnhe_hash_bucket *hash = nh->nh_exceptions;
 	struct fib_nh_exception *fnhe;
+	unsigned int seq;
 	u32 hval;
 
 	hval = fnhe_hashfun(daddr);
@@ -1905,17 +1926,29 @@ static void rt_bind_exception(struct rtable *rt, struct fib_nh *nh, __be32 daddr
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
 		if (fnhe->fnhe_daddr == daddr) {
-			if (fnhe->fnhe_pmtu) {
-				unsigned long expires = fnhe->fnhe_expires;
-				unsigned long diff = jiffies - expires;
+			__be32 fnhe_daddr, gw;
+			u32 pmtu;
+			unsigned long expires;
+
+			do {
+				seq = read_seqbegin(&fnhe_seqlock);
+				fnhe_daddr = fnhe->fnhe_daddr;
+				gw = fnhe->fnhe_gw;
+				pmtu = fnhe->fnhe_pmtu;
+				expires = fnhe->fnhe_expires;
+			} while (read_seqretry(&fnhe_seqlock, seq));
+			if (daddr != fnhe_daddr)
+				break;
+			if (pmtu) {
+				unsigned long diff = expires - jiffies;
 
 				if (time_before(jiffies, expires)) {
-					rt->rt_pmtu = fnhe->fnhe_pmtu;
+					rt->rt_pmtu = pmtu;
 					dst_set_expires(&rt->dst, diff);
 				}
 			}
-			if (fnhe->fnhe_gw)
-				rt->rt_gateway = fnhe->fnhe_gw;
+			if (gw)
+				rt->rt_gateway = gw;
 			fnhe->fnhe_stamp = jiffies;
 			break;
 		}
-- 
1.7.3.4

^ permalink raw reply related

* [RFC PATCH] net: cgroup: null ptr dereference in netprio cgroup during init
From: John Fastabend @ 2012-07-18  0:33 UTC (permalink / raw)
  To: davem, gaofeng, nhorman; +Cc: mark.d.rustad, netdev, eric.dumazet

When the netprio cgroup is built in the kernel cgroup_init will call
cgrp_create which eventually calls update_netdev_tables. This is
being called before do_initcalls() so a null ptr dereference occurs
on init_net.

This patch adds a check on init_net.count to verify the structure
has been initialized. The failure was introduced here,

commit ef209f15980360f6945873df3cd710c5f62f2a3e
Author: Gao feng <gaofeng@cn.fujitsu.com>
Date:   Wed Jul 11 21:50:15 2012 +0000

    net: cgroup: fix access the unallocated memory in netprio cgroup

Tested with ping with netprio_cgroup as a module and built in.

Marked RFC for now I think DaveM might have a reason why this needs
some improvement.

Reported-by: Mark Rustad <mark.d.rustad@intel.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 net/core/netprio_cgroup.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b2e9caa..e9fd7fd 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -116,6 +116,9 @@ static int update_netdev_tables(void)
 	u32 max_len;
 	struct netprio_map *map;
 
+	if (!atomic_read(&init_net.count))
+		return ret;
+
 	rtnl_lock();
 	max_len = atomic_read(&max_prioidx) + 1;
 	for_each_netdev(&init_net, dev) {

^ permalink raw reply related

* Re: [GIT PULL net] IPVS
From: Simon Horman @ 2012-07-18  0:01 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: lvs-devel, netdev, netfilter-devel, Wensong Zhang,
	Julian Anastasov, Hans Schillstrom, Jesper Dangaard Brouer
In-Reply-To: <20120717101531.GD3812@1984>

On Tue, Jul 17, 2012 at 12:15:31PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Jul 10, 2012 at 03:05:03PM +0200, Pablo Neira Ayuso wrote:
> > Hi Simon,
> > 
> > On Tue, Jul 10, 2012 at 06:20:03PM +0900, Simon Horman wrote:
> > > On Mon, Apr 30, 2012 at 11:27:22AM +0200, Pablo Neira Ayuso wrote:
> > > > On Fri, Apr 27, 2012 at 09:53:54AM +0900, Simon Horman wrote:
> > > > > Hi Pablo,
> > > > > 
> > > > > please consider the following 5 changes for 3.4, they are all bug fixes.
> > > > > I would also like these changes considered for stable.
> > > > 
> > > > Please, ping me again once these have hit Linus tree to ask for
> > > > -stable submission.
> > > 
> > > Sorry for letting this slip through the cracks.
> > > 
> > > Please consider the following commits which are in Linus's tree for stable.
> > > Or I can submit them directly if that is easier.
> > > 
> > > There are 7 patches listed below. The first 5 were the patches in this
> > > pull request. The last two were patches in a git pull request
> > > a few days earlier.
> > 
> > That's fine, I can make it, but you have to include what stable
> > releases this will be applied, eg. patch 1 to releases 3.4 and 3.2.
> > 
> > I think -stable maintainers will ask for that.
> 
> Ping?

Sorry, I haven't got to this yet.

> > > commit 8537de8a7ab6681cc72fb0411ab1ba7fdba62dd0
> > > Author: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > > Date:   Thu Apr 26 07:47:44 2012 +0200
> > > 
> > >     ipvs: kernel oops - do_ip_vs_get_ctl
> > >     
> > >     Change order of init so netns init is ready
> > >     when register ioctl and netlink.
> > >     
> > >     Ver2
> > >     	Whitespace fixes and __init added.
> > >     
> > >     Reported-by: "Ryan O'Hara" <rohara@redhat.com>
> > >     Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > >     Acked-by: Julian Anastasov <ja@ssi.bg>
> > >     Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > >     Signed-off-by: Simon Horman <horms@verge.net.au>
> > > 
> > > commit 582b8e3eadaec77788c1aa188081a8d5059c42a6
> > > Author: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > > Date:   Thu Apr 26 09:45:35 2012 +0200
> > > 
> > >     ipvs: take care of return value from protocol init_netns
> > >     
> > >     ip_vs_create_timeout_table() can return NULL
> > >     All functions protocol init_netns is affected of this patch.
> > >     
> > >     Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > >     Acked-by: Julian Anastasov <ja@ssi.bg>
> > >     Signed-off-by: Simon Horman <horms@verge.net.au>
> > > 
> > > commit 4b984cd50bc1b6d492175cd77bfabb78e76ffa67
> > > Author: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > > Date:   Thu Apr 26 09:45:34 2012 +0200
> > > 
> > >     ipvs: null check of net->ipvs in lblc(r) shedulers
> > >     
> > >     Avoid crash when registering shedulers after
> > >     the IPVS core initialization for netns fails. Do this by
> > >     checking for present core (net->ipvs).
> > >     
> > >     Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > >     Acked-by: Julian Anastasov <ja@ssi.bg>
> > >     Signed-off-by: Simon Horman <horms@verge.net.au>
> > > 
> > > commit 39f618b4fd95ae243d940ec64c961009c74e3333
> > > Author: Julian Anastasov <ja@ssi.bg>
> > > Date:   Wed Apr 25 00:29:58 2012 +0300
> > > 
> > >     ipvs: reset ipvs pointer in netns
> > >     
> > >     	Make sure net->ipvs is reset on netns cleanup or failed
> > >     initialization. It is needed for IPVS applications to know that
> > >     IPVS core is not loaded in netns.
> > >     
> > >     Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > >     Acked-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > >     Signed-off-by: Simon Horman <horms@verge.net.au>
> > > 
> > > commit 8d08d71ce59438a6ef06be5db07966e0c144b74e
> > > Author: Julian Anastasov <ja@ssi.bg>
> > > Date:   Wed Apr 25 00:29:59 2012 +0300
> > > 
> > >     ipvs: add check in ftp for initialized core
> > >     
> > >     	Avoid crash when registering ip_vs_ftp after
> > >     the IPVS core initialization for netns fails. Do this by
> > >     checking for present core (net->ipvs).
> > >     
> > >     Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > >     Acked-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
> > >     Signed-off-by: Simon Horman <horms@verge.net.au>
> > > 
> > > commit 8f9b9a2fad47af27e14b037395e03cd8278d96d7
> > > Author: Julian Anastasov <ja@ssi.bg>
> > > Date:   Fri Apr 13 18:08:43 2012 +0300
> > > 
> > >     ipvs: fix crash in ip_vs_control_net_cleanup on unload
> > >     
> > >     	commit 14e405461e664b777e2a5636e10b2ebf36a686ec (2.6.39)
> > >     ("Add __ip_vs_control_{init,cleanup}_sysctl()")
> > >     introduced regression due to wrong __net_init for
> > >     __ip_vs_control_cleanup_sysctl. This leads to crash when
> > >     the ip_vs module is unloaded.
> > >     
> > >     	Fix it by changing __net_init to __net_exit for
> > >     the function that is already renamed to ip_vs_control_net_cleanup_sysctl.
> > >     
> > >     Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > >     Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
> > >     Signed-off-by: Simon Horman <horms@verge.net.au>
> > >     Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > > 
> > > commit 7118c07a844d367560ee91adb2071bde2fabcdbf
> > > Author: Sasha Levin <levinsasha928@gmail.com>
> > > Date:   Sat Apr 14 12:37:46 2012 -0400
> > > 
> > >     ipvs: Verify that IP_VS protocol has been registered
> > >     
> > >     The registration of a protocol might fail, there were no checks
> > >     and all registrations were assumed to be correct. This lead to
> > >     NULL ptr dereferences when apps tried registering.
> > >     
> > >     For example:
> > >     
> > >     [ 1293.226051] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> > >     [ 1293.227038] IP: [<ffffffff822aacb0>] tcp_register_app+0x60/0xb0
> > >     [ 1293.227038] PGD 391de067 PUD 6c20b067 PMD 0
> > >     [ 1293.227038] Oops: 0000 [#1] PREEMPT SMP
> > >     [ 1293.227038] CPU 1
> > >     [ 1293.227038] Pid: 19609, comm: trinity Tainted: G        W    3.4.0-rc1-next-20120405-sasha-dirty #57
> > >     [ 1293.227038] RIP: 0010:[<ffffffff822aacb0>]  [<ffffffff822aacb0>] tcp_register_app+0x60/0xb0
> > >     [ 1293.227038] RSP: 0018:ffff880038c1dd18  EFLAGS: 00010286
> > >     [ 1293.227038] RAX: ffffffffffffffc0 RBX: 0000000000001500 RCX: 0000000000010000
> > >     [ 1293.227038] RDX: 0000000000000000 RSI: ffff88003a2d5888 RDI: 0000000000000282
> > >     [ 1293.227038] RBP: ffff880038c1dd48 R08: 0000000000000000 R09: 0000000000000000
> > >     [ 1293.227038] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a2d5668
> > >     [ 1293.227038] R13: ffff88003a2d5988 R14: ffff8800696a8ff8 R15: 0000000000000000
> > >     [ 1293.227038] FS:  00007f01930d9700(0000) GS:ffff88007ce00000(0000) knlGS:0000000000000000
> > >     [ 1293.227038] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > >     [ 1293.227038] CR2: 0000000000000018 CR3: 0000000065dfc000 CR4: 00000000000406e0
> > >     [ 1293.227038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >     [ 1293.227038] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > >     [ 1293.227038] Process trinity (pid: 19609, threadinfo ffff880038c1c000, task ffff88002dc73000)
> > >     [ 1293.227038] Stack:
> > >     [ 1293.227038]  ffff880038c1dd48 00000000fffffff4 ffff8800696aada0 ffff8800694f5580
> > >     [ 1293.227038]  ffffffff8369f1e0 0000000000001500 ffff880038c1dd98 ffffffff822a716b
> > >     [ 1293.227038]  0000000000000000 ffff8800696a8ff8 0000000000000015 ffff8800694f5580
> > >     [ 1293.227038] Call Trace:
> > >     [ 1293.227038]  [<ffffffff822a716b>] ip_vs_app_inc_new+0xdb/0x180
> > >     [ 1293.227038]  [<ffffffff822a7258>] register_ip_vs_app_inc+0x48/0x70
> > >     [ 1293.227038]  [<ffffffff822b2fea>] __ip_vs_ftp_init+0xba/0x140
> > >     [ 1293.227038]  [<ffffffff821c9060>] ops_init+0x80/0x90
> > >     [ 1293.227038]  [<ffffffff821c90cb>] setup_net+0x5b/0xe0
> > >     [ 1293.227038]  [<ffffffff821c9416>] copy_net_ns+0x76/0x100
> > >     [ 1293.227038]  [<ffffffff810dc92b>] create_new_namespaces+0xfb/0x190
> > >     [ 1293.227038]  [<ffffffff810dca21>] unshare_nsproxy_namespaces+0x61/0x80
> > >     [ 1293.227038]  [<ffffffff810afd1f>] sys_unshare+0xff/0x290
> > >     [ 1293.227038]  [<ffffffff8187622e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > >     [ 1293.227038]  [<ffffffff82665539>] system_call_fastpath+0x16/0x1b
> > >     [ 1293.227038] Code: 89 c7 e8 34 91 3b 00 89 de 66 c1 ee 04 31 de 83 e6 0f 48 83 c6 22 48 c1 e6 04 4a 8b 14 26 49 8d 34 34 48 8d 42 c0 48 39 d6 74 13 <66> 39 58 58 74 22 48 8b 48 40 48 8d 41 c0 48 39 ce 75 ed 49 8d
> > >     [ 1293.227038] RIP  [<ffffffff822aacb0>] tcp_register_app+0x60/0xb0
> > >     [ 1293.227038]  RSP <ffff880038c1dd18>
> > >     [ 1293.227038] CR2: 0000000000000018
> > >     [ 1293.379284] ---[ end trace 364ab40c7011a009 ]---
> > >     [ 1293.381182] Kernel panic - not syncing: Fatal exception in interrupt
> > >     
> > >     Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
> > >     Acked-by: Julian Anastasov <ja@ssi.bg>
> > >     Signed-off-by: Simon Horman <horms@verge.net.au>
> > >     Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > > 
> 

^ permalink raw reply

* Re: [RFC] r8169 : why SG / TX checksum are default disabled
From: Francois Romieu @ 2012-07-17 23:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Hayes Wang
In-Reply-To: <1342564781.2626.1264.camel@edumazet-glaptop>

Eric Dumazet <eric.dumazet@gmail.com> :
[...]
> I was wondering why Scatter Gather (NETIF_F_SG) and TX checksum
> (NETIF_F_IP_CSUM) were disabled by default on r8169.

It was not unconditionally stable. For a number of reasons:
- incompatibility with jumbo frames.
  rtl_chip_infos now allows to take care of it
- lack of support for chipset dependent Tx descriptor 
  rtl_tx_desc_info handles it. There are yet a few unused ipv6 bits. No idea
  how useable they are.
- some false alarms due to different issues and probably some others
  I could hunt for in the archives.

> Couldnt we enable them by default, maybe for a whitelist of "good"
> nics ?

Sure.

Hayes, should we not add into the kernel driver something similar to
the rtl8168_start_xmit::skb_checksum_help stuff in Realtek's 8168 driver ?
There seems to be a bug for (skb->len < 60 && RTL_GIGA_MAC_VER_34.

> (I found that activating them with ethtool automatically enables GSO,
>  and performance with GSO is not good)

It's still an improvement though, isn't it ?

-- 
Ueimor

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 23:27 UTC (permalink / raw)
  To: David Miller
  Cc: mark.d.rustad-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20120717.151832.1306978935355646723.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On 7/17/2012 3:18 PM, David Miller wrote:
> From: John Fastabend <john.r.fastabend-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Date: Tue, 17 Jul 2012 15:13:36 -0700
>
>> Perhaps the easiest way is to check net->count this should be zero
>> until setup_net is called.
>>
>> if (!atomic_read(&init_net.count))
>> 	return ret;
>>
>
> Won't work, setup_net() runs via a pure_initcall().
>

Why not must have missed something? cgroup_init() and
cgroup_early_init() both run before _initcall() routines are
called via kernel_init() so this will stop the update in
netprio from occurring.

And I don't see any race elsewhere for this.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 0/2] Pull request for 'davem-next.r8169' branch
From: Joe Perches @ 2012-07-17 22:46 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev, David Miller, Hayes Wang
In-Reply-To: <cover.1342562326.git.romieu@fr.zoreil.com>

On Wed, 2012-07-18 at 00:09 +0200, Francois Romieu wrote:
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c

Hello Francois, just a bit of trivia:

> @@ -865,7 +865,8 @@ static bool rtl_loop_wait(struct rtl8169_private *tp, const struct rtl_cond *c,
>  		if (c->check(tp) == high)
>  			return true;
>  	}
> -	netif_err(tp, drv, tp->dev, c->msg);
> +	netif_err(tp, drv, tp->dev, "%s == %d (loop: %d, delay: %d).\n",
> +		  c->msg, !high, n, d);

Please avoid adding the period to a message before a newline.

$ git grep -E "[^\.]\\\\n\"" drivers/net/ethernet/realtek | wc -l
113
$ git grep -E "\.\\\\n\"" drivers/net/ethernet/realtek | wc -l
12

^ permalink raw reply

* [RFC] r8169 : why SG / TX checksum are default disabled
From: Eric Dumazet @ 2012-07-17 22:39 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev, Hayes Wang

Hi Francois

I was wondering why Scatter Gather (NETIF_F_SG) and TX checksum
(NETIF_F_IP_CSUM) were disabled by default on r8169.

Couldnt we enable them by default, maybe for a whitelist of "good"
nics ?

(I found that activating them with ethtool automatically enables GSO,
 and performance with GSO is not good)

Thanks

^ permalink raw reply

* [PATCH net-next] tcp: refine SYN handling in tcp_validate_incoming
From: Eric Dumazet @ 2012-07-17 22:29 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <1342563706.2626.1228.camel@edumazet-glaptop>

From: Eric Dumazet <edumazet@google.com>

Followup of commit 0c24604b68fc (tcp: implement RFC 5961 4.2)

As reported by Vijay Subramanian, we should send a challenge ACK
instead of a dup ack if a SYN flag is set on a packet received out of
window.

This permits the ratelimiting to work as intended, and to increase
correct SNMP counters.

Suggested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Kiran Kumar Kella <kkiran@broadcom.com>
---
 net/ipv4/tcp_input.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8aaec55..fdd49f1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5296,8 +5296,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		 * an acknowledgment should be sent in reply (unless the RST
 		 * bit is set, if so drop the segment and return)".
 		 */
-		if (!th->rst)
+		if (!th->rst) {
+			if (th->syn)
+				goto syn_challenge;
 			tcp_send_dupack(sk, skb);
+		}
 		goto discard;
 	}
 
@@ -5327,6 +5330,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	 * RFC 5691 4.2 : Send a challenge ack
 	 */
 	if (th->syn) {
+syn_challenge:
 		if (syn_inerr)
 			TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);

^ permalink raw reply related

* [PATCH net-next 2/2] r8169: verbose error message.
From: Francois Romieu @ 2012-07-17 22:09 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Hayes Wang
In-Reply-To: <cover.1342562326.git.romieu@fr.zoreil.com>

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Cc: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/ethernet/realtek/r8169.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 1f27318..be4e00f 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -865,7 +865,8 @@ static bool rtl_loop_wait(struct rtl8169_private *tp, const struct rtl_cond *c,
 		if (c->check(tp) == high)
 			return true;
 	}
-	netif_err(tp, drv, tp->dev, c->msg);
+	netif_err(tp, drv, tp->dev, "%s == %d (loop: %d, delay: %d).\n",
+		  c->msg, !high, n, d);
 	return false;
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next 1/2] r8169: remove rtl_ocpdr_cond.
From: Francois Romieu @ 2012-07-17 22:09 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Hayes Wang
In-Reply-To: <cover.1342562326.git.romieu@fr.zoreil.com>

From: Hayes Wang <hayeswang@realtek.com>

It is not needed for mac_ocp_{write / read}. Actually bit 31 of OCPDR
does not change and r8168_mac_ocp_read always returns ~0.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Tested-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/r8169.c |   12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index c29c5fb..1f27318 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1043,13 +1043,6 @@ static void rtl_w1w0_phy_ocp(struct rtl8169_private *tp, int reg, int p, int m)
 	r8168_phy_ocp_write(tp, reg, (val | p) & ~m);
 }
 
-DECLARE_RTL_COND(rtl_ocpdr_cond)
-{
-	void __iomem *ioaddr = tp->mmio_addr;
-
-	return RTL_R32(OCPDR) & OCPAR_FLAG;
-}
-
 static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
 {
 	void __iomem *ioaddr = tp->mmio_addr;
@@ -1058,8 +1051,6 @@ static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
 		return;
 
 	RTL_W32(OCPDR, OCPAR_FLAG | (reg << 15) | data);
-
-	rtl_udelay_loop_wait_low(tp, &rtl_ocpdr_cond, 25, 10);
 }
 
 static u16 r8168_mac_ocp_read(struct rtl8169_private *tp, u32 reg)
@@ -1071,8 +1062,7 @@ static u16 r8168_mac_ocp_read(struct rtl8169_private *tp, u32 reg)
 
 	RTL_W32(OCPDR, reg << 15);
 
-	return rtl_udelay_loop_wait_high(tp, &rtl_ocpdr_cond, 25, 10) ?
-		RTL_R32(OCPDR) : ~0;
+	return RTL_R32(OCPDR);
 }
 
 #define OCP_STD_PHY_BASE	0xa400
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net-next 0/2] Pull request for 'davem-next.r8169' branch
From: Francois Romieu @ 2012-07-17 22:09 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Hayes Wang

Please pull from branch 'davem-next.r8169' in repository

git://violet.fr.zoreil.com/romieu/linux davem-next.r8169

to get the changes below.

Distance from 'davem-next' (5abf7f7e0f6bdbfcac737f636497d7016d9507eb)
---------------------------------------------------------------------

82e316efbd1c68946c8760f930b81d73e9c4425a
3a83ad12b850c3c5b89fa9008bdd0c0782f0cf68

Diffstat
--------

 drivers/net/ethernet/realtek/r8169.c |   15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

Shortlog
--------

Francois Romieu (1):
      r8169: verbose error message.

Hayes Wang (1):
      r8169: remove rtl_ocpdr_cond.

Patch
-----

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index c29c5fb..be4e00f 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -865,7 +865,8 @@ static bool rtl_loop_wait(struct rtl8169_private *tp, const struct rtl_cond *c,
 		if (c->check(tp) == high)
 			return true;
 	}
-	netif_err(tp, drv, tp->dev, c->msg);
+	netif_err(tp, drv, tp->dev, "%s == %d (loop: %d, delay: %d).\n",
+		  c->msg, !high, n, d);
 	return false;
 }
 
@@ -1043,13 +1044,6 @@ static void rtl_w1w0_phy_ocp(struct rtl8169_private *tp, int reg, int p, int m)
 	r8168_phy_ocp_write(tp, reg, (val | p) & ~m);
 }
 
-DECLARE_RTL_COND(rtl_ocpdr_cond)
-{
-	void __iomem *ioaddr = tp->mmio_addr;
-
-	return RTL_R32(OCPDR) & OCPAR_FLAG;
-}
-
 static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
 {
 	void __iomem *ioaddr = tp->mmio_addr;
@@ -1058,8 +1052,6 @@ static void r8168_mac_ocp_write(struct rtl8169_private *tp, u32 reg, u32 data)
 		return;
 
 	RTL_W32(OCPDR, OCPAR_FLAG | (reg << 15) | data);
-
-	rtl_udelay_loop_wait_low(tp, &rtl_ocpdr_cond, 25, 10);
 }
 
 static u16 r8168_mac_ocp_read(struct rtl8169_private *tp, u32 reg)
@@ -1071,8 +1063,7 @@ static u16 r8168_mac_ocp_read(struct rtl8169_private *tp, u32 reg)
 
 	RTL_W32(OCPDR, reg << 15);
 
-	return rtl_udelay_loop_wait_high(tp, &rtl_ocpdr_cond, 25, 10) ?
-		RTL_R32(OCPDR) : ~0;
+	return RTL_R32(OCPDR);
 }
 
 #define OCP_STD_PHY_BASE	0xa400
-- 
Ueimor

^ permalink raw reply related

* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Eric Dumazet @ 2012-07-17 22:21 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <CAGK4HS8MvP6L5Rvy4wJx2KhdTSSHfP7YPT44e9mrV_vsBZJ9jQ@mail.gmail.com>

On Tue, 2012-07-17 at 15:10 -0700, Vijay Subramanian wrote:

> Yes. This is what I had in mind (along with a change to make
> tcp_sequence() return bool). I am not sure if this patch is official
> (or will be picked up by patchwork) but
> for what its worth
> 
> Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>

Well, I am going to send an official patch with all credits ASAP,

Thanks !

^ permalink raw reply

* [PATCH net-next] bonding: refine IFF_XMIT_DST_RELEASE capability
From: Eric Dumazet @ 2012-07-17 22:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jay Vosburgh, Andy Gospodarek, Tom Herbert

From: Eric Dumazet <edumazet@google.com>

Some workloads greatly benefit of IFF_XMIT_DST_RELEASE capability
on output net device, avoiding dirtying dst refcount.

bonding currently disables IFF_XMIT_DST_RELEASE unconditionally.

If all slaves have the IFF_XMIT_DST_RELEASE bit set, then
bonding master can also have it in its priv_flags

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Tom Herbert <therbert@google.com>
---
 drivers/net/bonding/bond_main.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 1eb3979..3960b1b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1382,6 +1382,7 @@ static void bond_compute_features(struct bonding *bond)
 	netdev_features_t vlan_features = BOND_VLAN_FEATURES;
 	unsigned short max_hard_header_len = ETH_HLEN;
 	int i;
+	unsigned int flags, dst_release_flag = IFF_XMIT_DST_RELEASE;
 
 	read_lock(&bond->lock);
 
@@ -1392,6 +1393,7 @@ static void bond_compute_features(struct bonding *bond)
 		vlan_features = netdev_increment_features(vlan_features,
 			slave->dev->vlan_features, BOND_VLAN_FEATURES);
 
+		dst_release_flag &= slave->dev->priv_flags;
 		if (slave->dev->hard_header_len > max_hard_header_len)
 			max_hard_header_len = slave->dev->hard_header_len;
 	}
@@ -1400,6 +1402,9 @@ done:
 	bond_dev->vlan_features = vlan_features;
 	bond_dev->hard_header_len = max_hard_header_len;
 
+	flags = bond_dev->priv_flags & ~IFF_XMIT_DST_RELEASE;
+	bond_dev->priv_flags = flags | dst_release_flag;
+
 	read_unlock(&bond->lock);
 
 	netdev_change_features(bond_dev);

^ permalink raw reply related

* Re: That's pretty much it for 3.5.0
From: David Miller @ 2012-07-17 22:18 UTC (permalink / raw)
  To: john.r.fastabend; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <5005E390.7020706@intel.com>

From: John Fastabend <john.r.fastabend@intel.com>
Date: Tue, 17 Jul 2012 15:13:36 -0700

> Perhaps the easiest way is to check net->count this should be zero
> until setup_net is called.
> 
> if (!atomic_read(&init_net.count))
> 	return ret;
> 

Won't work, setup_net() runs via a pure_initcall().

^ permalink raw reply

* Re: That's pretty much it for 3.5.0
From: John Fastabend @ 2012-07-17 22:13 UTC (permalink / raw)
  To: David Miller; +Cc: mark.d.rustad, netdev, linux-wireless, netfilter-devel
In-Reply-To: <20120717.140241.1599386555723262095.davem@davemloft.net>

On 7/17/2012 2:02 PM, David Miller wrote:
> From: John Fastabend <john.r.fastabend@intel.com>
> Date: Tue, 17 Jul 2012 13:50:16 -0700
>
>> On 7/17/2012 12:24 PM, David Miller wrote:
>>> From: John Fastabend <john.r.fastabend@intel.com>
>>> Date: Tue, 17 Jul 2012 12:09:53 -0700
>>>
>>>> although we don't have an early_init hook for netprio_cgroup so this
>>>> is probably not correct.
>>>
>>> The dependency is actually on net_dev_init (a subsys_initcall) rather
>>> than a pure_initcall.
>>>
>>> net_dev_init is what registers the netdev_net_ops, which in turn
>>> initializes the netdev list in namespaces such as &init_net
>>>
>>
>> Ah right thanks sorry for the thrash. I guess we need to check if the
>> netdev list in the init_net namespace is initialized.
>
> It's a hack, but we could export and then test dev_boot_phase == 0,
> and if that test is true then skip the init_net device walk in the
> cgroup code.
>
> But I don't like that very much.
>
> The things this code cares about can't even be an issue until
> net_dev_init() runs.
>
> There is a comment warning not to do this in linux/init.h, but we
> could change the module_init() in netprio_cgroup.c to some level which
> runs after subsys_inticall().  When built as a module, linux/init.h
> will translate this into module_init() which is basically the behavior
> we want.
>

Perhaps the easiest way is to check net->count this should be zero
until setup_net is called.

if (!atomic_read(&init_net.count))
	return ret;

^ permalink raw reply

* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Vijay Subramanian @ 2012-07-17 22:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <1342560769.2626.1165.camel@edumazet-glaptop>

> But you probably are right, we could test th->syn here as well.
>
> Something like that ?

> -               if (!th->rst)
> +               if (!th->rst) {
> +                       if (th->syn)
> +                               goto syn_challenge;
>                         tcp_send_dupack(sk, skb);
> +               }
>                 goto discard;
>         }
>
> @@ -5327,6 +5330,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>          * RFC 5691 4.2 : Send a challenge ack
>          */
>         if (th->syn) {
> +syn_challenge:
>                 if (syn_inerr)
>                         TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
>                 NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
>

Yes. This is what I had in mind (along with a change to make
tcp_sequence() return bool). I am not sure if this patch is official
(or will be picked up by patchwork) but
for what its worth

Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>

I will send a separate patch to make tcp_sequence() return bool.
Thanks!
Vijay

^ permalink raw reply

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: David Miller @ 2012-07-17 22:09 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <alpine.LFD.2.00.1207180052420.2128@ja.ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Wed, 18 Jul 2012 01:14:04 +0300 (EEST)

> 	Aha, I see. Something around fnhe_oldest() and its
> daddr arg does not look good. If the goal is to hijack
> some entry, probably for another daddr and comparing it with
> tcpm_new(), may be we should remove this daddr arg and fully
> reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
> because the find_or_create_fnhe() callers modify only specific
> fields, we should not end up with wrong gateway inherited from
> another daddr, for example.

Better would be to use a seqlock when reading it's values.

Either way, patches welcome :-)

^ permalink raw reply

* Re: [PATCH 0/5] Long term PMTU/redirect storage in ipv4.
From: Julian Anastasov @ 2012-07-17 22:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120717.134651.562831385960975623.davem@davemloft.net>


	Hello,

On Tue, 17 Jul 2012, David Miller wrote:

> > 	IIRC, struct fib_info was shared by different
> > prefixes. It saves a lot of memory when thousands of
> > routes are created to same GW. Now if we end up with 1 or
> > 2 fib_info structures for default routes, the nh_exceptions list
> > can become very long. May be fib_info is not a good place
> > to hide such data.
> 
> Your analysis of what fib_info is and how it's intended to
> work is accurate.
> 
> But we don't use a linked list for the exceptions in the final
> version, we use a reclaiming RCU'd hash table like we use for TCP
> metrics.
> 
> See the updated version of patch #5 and what I actually committed to
> net-next.

	Aha, I see. Something around fnhe_oldest() and its
daddr arg does not look good. If the goal is to hijack
some entry, probably for another daddr and comparing it with
tcpm_new(), may be we should remove this daddr arg and fully
reset all parameters such as fnhe_pmtu, fnhe_gw, fnhe_expires
because the find_or_create_fnhe() callers modify only specific
fields, we should not end up with wrong gateway inherited from
another daddr, for example.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net-next] tcp: implement RFC 5961 4.2
From: Eric Dumazet @ 2012-07-17 21:32 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, netdev, Kiran Kumar Kella
In-Reply-To: <CAGK4HS-AMWZ0Ef7G1pvWJ7XADsvxqsiX3tMCe2=oq+_46won6g@mail.gmail.com>

On Tue, 2012-07-17 at 14:02 -0700, Vijay Subramanian wrote:
> On 17 July 2012 04:41, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Implement the RFC 5691 mitigation against Blind
> > Reset attack using SYN bit.
> >
> > Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
> > incoming packet, instead of resetting the session.
> 
> Eric,
> Section 4.2 has this to say:
> "If the SYN bit is set, irrespective of the sequence number, TCP
>       MUST send an ACK (also referred to as challenge ACK) to the remote
>       peer:"
> 
> I believe your patch only sends challenge acks for in-window SYN packets.
> After this patch, the code for out of window packets is like this:
> 
>         /* Step 1: check sequence number */
>         if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
>                 /* RFC793, page 37: "In all states except SYN-SENT, all reset
>                  * (RST) segments are validated by checking their SEQ-fields."
>                  * And page 69: "If an incoming segment is not acceptable,
>                  * an acknowledgment should be sent in reply (unless the RST
>                  * bit is set, if so drop the segment and return)".
>                  */
>                 if (!th->rst)
>                         tcp_send_dupack(sk, skb);
>                 goto discard;
>         }
> 
> 
> For SYN packets that are not in window, we do end up calling
> tcp_send_dupack() but not tcp_send_challenge_ack().  Will it be more
> appropriate to call the latter so that
> we do proper rate limiting of challenge acks and update SNMP counters correctly?

Well, I only wanted to avoid RST ;)

But you probably are right, we could test th->syn here as well.

Something like that ?

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8aaec55..fdd49f1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5296,8 +5296,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		 * an acknowledgment should be sent in reply (unless the RST
 		 * bit is set, if so drop the segment and return)".
 		 */
-		if (!th->rst)
+		if (!th->rst) {
+			if (th->syn)
+				goto syn_challenge;
 			tcp_send_dupack(sk, skb);
+		}
 		goto discard;
 	}
 
@@ -5327,6 +5330,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	 * RFC 5691 4.2 : Send a challenge ack
 	 */
 	if (th->syn) {
+syn_challenge:
 		if (syn_inerr)
 			TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);

^ permalink raw reply related

* Re: [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: David Miller @ 2012-07-17 21:31 UTC (permalink / raw)
  To: pmoore; +Cc: netdev
In-Reply-To: <16005876.GSA3ToWriD@sifl>

From: Paul Moore <pmoore@redhat.com>
Date: Tue, 17 Jul 2012 17:24:50 -0400

> David, if you don't queue this up for them, let me know and I'll resend it to 
> stable once it hits Linus' tree.

I will be sure to queue it up to -stable when I apply it, as I always
do for appropriate patches.

^ permalink raw reply

* Re: [PATCH] cipso: don't follow a NULL pointer when setsockopt() is called
From: Paul Moore @ 2012-07-17 21:24 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20120717210738.22790.23522.stgit@sifl>

On Tuesday, July 17, 2012 05:07:47 PM Paul Moore wrote:
> As reported by Alan Cox, and verified by Lin Ming, when a user
> attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
> tag the kernel dies a terrible death when it attempts to follow a NULL
> pointer (the skb argument to cipso_v4_validate() is NULL when called via
> the setsockopt() syscall).
> 
> This patch fixes this by first checking to ensure that the skb is
> non-NULL before using it to find the incoming network interface.  In
> the unlikely case where the skb is NULL and the user attempts to add
> a CIPSO option with the _TAG_LOCAL tag we return an error as this is
> not something we want to allow.

...

> CC: Lin Ming <mlin@ss.pku.edu.cn>
> Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
> Signed-off-by: Paul Moore <pmoore@redhat.com>

Argh, I just realized I forgot to CC the stable folks.

David, if you don't queue this up for them, let me know and I'll resend it to 
stable once it hits Linus' tree.

> ---
>  net/ipv4/cipso_ipv4.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
> index c48adc5..667c1d4 100644
> --- a/net/ipv4/cipso_ipv4.c
> +++ b/net/ipv4/cipso_ipv4.c
> @@ -1725,8 +1725,10 @@ int cipso_v4_validate(const struct sk_buff *skb,
> unsigned char **option) case CIPSO_V4_TAG_LOCAL:
>  			/* This is a non-standard tag that we only allow for
>  			 * local connections, so if the incoming interface is
> -			 * not the loopback device drop the packet. */
> -			if (!(skb->dev->flags & IFF_LOOPBACK)) {
> +			 * not the loopback device drop the packet. Further,
> +			 * there is no legitimate reason for setting this from
> +			 * userspace so reject it if skb is NULL. */
> +			if (skb == NULL || !(skb->dev->flags & IFF_LOOPBACK)) {
>  				err_offset = opt_iter;
>  				goto validate_return_locked;
>  			}
-- 
paul moore
security and virtualization @ redhat

^ permalink raw reply

* [PATCH net-next] net: qmi_wwan: make dynamic device IDs work
From: Bjørn Mork @ 2012-07-17 21:14 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-usb-u79uwXL29TY76Z2rM5mHXA, Bjørn Mork

The usbnet API use the device ID table to store a pointer to
a minidriver. Setting a generic pointer for dynamic device
IDs will in most cases make them work as expected.  usbnet
will otherwise treat the dynamic IDs as blacklisted. That is
rarely useful.

There is no standard class describing devices supported by
this driver, and most vendors don't even provide enough
information to allow vendor specific wildcard matching. The
result is that most of the supported devices must be
explicitly listed in the device table.  Allowing dynamic IDs
to work both simplifies testing and verification of new
devices, and provides a way for end users to use a device
before the ID is added to the driver.

Signed-off-by: Bjørn Mork <bjorn-yOkvZcmFvRU@public.gmane.org>
---

I gave up on the generic solution to this problem.  Different
drivers will have different needs, and even the different usbnet
minidrivers are very different.  Some do mostly class based
wildcard matching and rarely need an ID table update, while
others, like qmi_wwan, will need frequent updates as new devices
hit the market.

My conclusion was that it was best to add support like this
in the few drivers actually needing it.  Each driver will have
to make a policy decision wrt which data they will use for
dynamic devices.  It just can't be done wholesale.

 drivers/net/usb/qmi_wwan.c |   19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index bc0469e..2ea126a 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -609,10 +609,27 @@ static const struct usb_device_id products[] = {
 };
 MODULE_DEVICE_TABLE(usb, products);

+static int qmi_wwan_probe(struct usb_interface *intf, const struct usb_device_id *prod)
+{
+	struct usb_device_id *id = (struct usb_device_id *)prod;
+
+	/* Workaround to enable dynamic IDs.  This disables usbnet
+	 * blacklisting functionality.  Which, if required, can be
+	 * reimplemented here by using a magic "blacklist" value
+	 * instead of 0 in the static device id table
+	 */
+	if (!id->driver_info) {
+		dev_dbg(&intf->dev, "setting defaults for dynamic device id\n");
+		id->driver_info = (unsigned long)&qmi_wwan_shared;
+	}
+
+	return usbnet_probe(intf, id);
+}
+
 static struct usb_driver qmi_wwan_driver = {
 	.name		      = "qmi_wwan",
 	.id_table	      = products,
-	.probe		      =	usbnet_probe,
+	.probe		      = qmi_wwan_probe,
 	.disconnect	      = usbnet_disconnect,
 	.suspend	      = qmi_wwan_suspend,
 	.resume		      =	qmi_wwan_resume,
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: New commands to configure IOV features
From: David Miller @ 2012-07-17 21:11 UTC (permalink / raw)
  To: chris.friesen
  Cc: ddutile, yuvalmin, bhutchings, gregory.v.rose, netdev, linux-pci
In-Reply-To: <5005D45D.1040302@genband.com>

From: Chris Friesen <chris.friesen@genband.com>
Date: Tue, 17 Jul 2012 15:08:45 -0600

> From that perspective a sysfs-based interface is ideal since it is
> directly scriptable.

As is anything ethtool or netlink based, since we have 'ethtool'
and 'ip' for scripting.

^ permalink raw reply

* Re: New commands to configure IOV features
From: Chris Friesen @ 2012-07-17 21:08 UTC (permalink / raw)
  To: Don Dutile
  Cc: Yuval Mintz, davem@davemloft.net, Ben Hutchings, Greg Rose,
	netdev@vger.kernel.org, linux-pci
In-Reply-To: <5005BD00.4090106@redhat.com>

On 07/17/2012 01:29 PM, Don Dutile wrote:

> WRT SRIOV-nic devices, the thinking goes that protocol-level
> parameters associated with VFs should use protocol-specific interfaces,
> e.g., ethtool, ip link set, etc. for Ethernet VFs.
> Thus, the various protocol control functions/tools should
> be used to control VF parameters, as one would for a physical device
> of that protocol/class.

It seems to me that the mere act of creating one or more VFs is 
something generic, applicable to all devices that are capable of it. 
The details of configuring those VFs can then be handled by 
protocol-specific interfaces.

I'm not too worried about the exact mechanism of doing it, as long as 
it's ultimately scriptable--that is, if it's a C API then it would be 
appreciated if there was a standard tool to call that implements it. 
 From that perspective a sysfs-based interface is ideal since it is 
directly scriptable.

Chris

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox