Netdev List
 help / color / mirror / Atom feed
* [PATCH v2] tcp: fix outsegs stat for TSO segments
From: Tom Herbert @ 2010-04-22 17:00 UTC (permalink / raw)
  To: davem, netdev

Account for TSO segments of an skb in TCP_MIB_OUTSEGS counter.  Without
doing this, the counter can be off by orders of magnitude from the
actual number of segments sent.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/net/snmp.h b/include/net/snmp.h
index 884fdbb..92456f1 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -133,6 +133,8 @@ struct linux_xfrm_mib {
 			__this_cpu_add(mib[0]->mibs[field], addend)
 #define SNMP_ADD_STATS_USER(mib, field, addend)	\
 			this_cpu_add(mib[1]->mibs[field], addend)
+#define SNMP_ADD_STATS(mib, field, addend)	\
+			this_cpu_add(mib[0]->mibs[field], addend)
 /*
  * Use "__typeof__(*mib[0]) *ptr" instead of "__typeof__(mib[0]) ptr"
  * to make @ptr a non-percpu pointer.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70c5159..91640fe 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -294,6 +294,7 @@ extern struct proto tcp_prot;
 #define TCP_INC_STATS_BH(net, field)	SNMP_INC_STATS_BH((net)->mib.tcp_statistics, field)
 #define TCP_DEC_STATS(net, field)	SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
 #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+#define TCP_ADD_STATS(net, field, val)	SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
 
 extern void			tcp_v4_err(struct sk_buff *skb, u32);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2b7d71f..8ce0f99 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -888,7 +888,8 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		tcp_event_data_sent(tp, skb, sk);
 
 	if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
-		TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
+		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
+			      tcp_skb_pcount(skb));
 
 	err = icsk->icsk_af_ops->queue_xmit(skb);
 	if (likely(err <= 0))
@@ -2503,7 +2504,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	th->window = htons(min(req->rcv_wnd, 65535U));
 	tcp_options_write((__be32 *)(th + 1), tp, &opts);
 	th->doff = (tcp_header_size >> 2);
-	TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
+	TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS, tcp_skb_pcount(skb));
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Okay, we have all we need - do the md5 hash if needed */

^ permalink raw reply related

* Re: DDoS attack causing bad effect on conntrack searches
From: Paul E. McKenney @ 2010-04-22 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <1271952128.7895.5851.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 06:02:08PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> > On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > > 
> > > > If one hash slot is under attack, then there is a bug somewhere.
> > > > 
> > > > If we cannot avoid this, we can fallback to a secure mode at the second
> > > > retry, and take the spinlock.
> > > > 
> > > > Tis way, most of lookups stay lockless (one pass), and some might take
> > > > the slot lock to avoid the possibility of a loop.
> > > > 
> > > > I suspect a bug elsewhere, quite frankly !
> > > > 
> > > > We have a chain that have an end pointer that doesnt match the expected
> > > > one.
> > > > 
> > > 
> > > On normal situation, we always finish the lookup :
> > > 
> > > 1) If we found the thing we were looking at.
> > > 
> > > 2) We get the list end (item not found), we then check if it is the
> > > expected end.
> > > 
> > > It is _not_ the expected end only if some writer deleted/inserted an
> > > element in _this_ chain during our lookup.
> > 
> > So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> > elements?  (Not obvious from the code, but my ignorance of the networking
> > code is such that many things in that part of the kernel are not obvious
> > to me, I am afraid.)
> 
> Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

OK, that will do it!!!  ;-)

One way of throttling the bad effects of updates on readers is to
periodically force updates through a grace period.  But this seems
to be a very big hammer, and likely to have little practical effect.

Another approach would be to have multiple list pointers per element,
so that a given element could be reused a small number of times without
messing up concurrent readers (sort of like Herbert's resizable hash
table).

But as you say, if some other bug is really behind this, better to fix
that bug than to work around it.

> > Otherwise, of course you would simply allow deleted elements to continue
> > pointing where they did previously, so that concurrent readers would not
> > miss anything.
> 
> > Of course, the same potential might arise on insertion, but it is usually
> > OK to miss an element that was inserted after you started searching.
> > 
> > > Because our lookup is lockless, we then have to redo it because we might
> > > miss the object we are looking for.
> > 
> > Ah...  Is there also a resize operation?  Herbert did do a resizable
> > hash table recently, but I was under the impression that (1) it was in
> > some other part of the networking stack and (2) it avoided the need to
> > restart readers.
> > 
> > > If we can do the 'retry' a 10 times, it means the attacker was really
> > > clever enough to inject new packets (new conntracks) at the right
> > > moment, in the right hash chain, and this sounds so higly incredible
> > > that I cannot believe it at all :)
> > 
> > Or maybe the DoS attack is injecting so many new conntracks that a large
> > fraction of the hash chains are being modified at any given time?
> 
> maybe hash table has one slot :)

;-) ;-) ;-)

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/3] IPv6: Generic TTL Security Mechanism (original version)
From: Stephen Hemminger @ 2010-04-22 16:23 UTC (permalink / raw)
  To: davem; +Cc: Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard, netdev
In-Reply-To: <20100403232922.489187907@vyatta.com>

On Sat, 03 Apr 2010 16:21:04 -0700
Stephen Hemminger <shemminger@vyatta.com> wrote:

> This patch adds IPv6 support for RFC5082 Generalized TTL
> Security Mechanism.  
> 
> The original proposed code; the IPV6 and IPV4 socket options are seperate.
> With this method, the server does have to deal with both IPv4 and IPv6
> socket options and the client has to handle the different for each
> family.
> 
> On client:
> 	int ttl = 255;
> 	getaddrinfo(argv[1], argv[2], &hint, &result);
> 
> 	for (rp = result; rp != NULL; rp = rp->ai_next) {
> 		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
> 		if (s < 0) continue;
> 
> 		if (rp->ai_family == AF_INET) {
> 			setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
> 		} else if (rp->ai_family == AF_INET6) {
> 			setsockopt(s, IPPROTO_IPV6,  IPV6_UNICAST_HOPS, 
> 					&ttl, sizeof(ttl)))
> 		}
> 			
> 		if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
> 		   ...
> 
> On server:
> 	int minttl = 255 - maxhops;
>    
> 	getaddrinfo(NULL, port, &hints, &result);
> 	for (rp = result; rp != NULL; rp = rp->ai_next) {
> 		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
> 		if (s < 0) continue;
> 
> 		if (rp->ai_family == AF_INET6)
> 			setsockopt(s, IPPROTO_IPV6,  IPV6_MINHOPCOUNT,
> 					&minttl, sizeof(minttl));
> 		setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
> 			
> 		if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
> 			break
> ..
> 
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Dave: Yoshifuji and I agree this is the best solution, how come the patch
hasn't been applied?

^ permalink raw reply

* Re: [PATCH] NIU support for skb->rxhash
From: Stephen Hemminger @ 2010-04-22 16:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100422.042157.99869295.davem@davemloft.net>

On Thu, 22 Apr 2010 04:21:57 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> 
> But it turns out using it is largely pointless since the only way to
> get the hash value(s) is through a structure which is prepended to the
> packet data (so we take a cache miss on the packet data anyways)
> instead of being able to fetch it out of the RX descriptors :-/
> 
> If anyone out there is trying to design sane hardware, please put the
> following into your RX descriptors:
> 
> 1) ethernet protocol type (u16)
> 2) a flag bit indicating if the packet destination matched one
>    of the programmed unicast MAC addresses
> 3) a flag bit indicating "multicast"
> 4) a flag bit indicating "broadcast"
> 5) at least 32-bits of the computed flow hash (u32)
> 
> kthx, bye!
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Could you make configurable via ethtool like I did for sky2.

P.s: where is that patch seems lost in patchwork

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Stephen Hemminger @ 2010-04-22 16:17 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Herbert Xu, David Miller, yoshfuji, netdev
In-Reply-To: <20100422154908.GA31568@midget.suse.cz>

On Thu, 22 Apr 2010 17:49:08 +0200
Jiri Bohac <jbohac@suse.cz> wrote:

> On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> > This patch fixes this by using the DADFAILED bit to synchronise
> > the two paths while holding the ifp lock.  It relies on the fact
> > that the TENTATIVE bit is always set during DAD, and that the
> > DADFAILED bit is only set on failure.
> 
> But the addr_dad_failure()->...->ipv6_del_addr() path will
> still race with any other path calling ipv6_del_addr() (e.g. a
> manual address removal). Won't it?
> 
> I still don't see why __ipv6_ifa_notify() needs to call
> dst_free(). Shouldn't that be dst_release() instead, to drop the
> reference obtained by dst_hold(&ifp->rt->u.dst)?

Yes, some more locking and race condition management is needed.

Something like the following (untested):

--- a/net/ipv6/addrconf.c	2010-04-22 09:11:54.594827858 -0700
+++ b/net/ipv6/addrconf.c	2010-04-22 09:15:59.224631752 -0700
@@ -720,13 +720,18 @@ static void ipv6_del_addr(struct inet6_i
 
 	hash = ipv6_addr_hash(&ifp->addr);
 
+	write_lock_bh(&idev->lock);
+	if (ifp->dead) {
+		write_unlock(&idev->lock); /* lost race with DAD */
+		return;
+	}
+
 	ifp->dead = 1;
 
-	spin_lock_bh(&addrconf_hash_lock);
+	spin_lock(&addrconf_hash_lock);
 	hlist_del_init_rcu(&ifp->addr_lst);
-	spin_unlock_bh(&addrconf_hash_lock);
+	spin_unlock(&addrconf_hash_lock);
 
-	write_lock_bh(&idev->lock);
 #ifdef CONFIG_IPV6_PRIVACY
 	if (ifp->flags&IFA_F_TEMPORARY) {
 		list_del(&ifp->tmp_list);




^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 16:02 UTC (permalink / raw)
  To: paulmck
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <20100422155123.GA2524@linux.vnet.ibm.com>

Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > 
> > > If one hash slot is under attack, then there is a bug somewhere.
> > > 
> > > If we cannot avoid this, we can fallback to a secure mode at the second
> > > retry, and take the spinlock.
> > > 
> > > Tis way, most of lookups stay lockless (one pass), and some might take
> > > the slot lock to avoid the possibility of a loop.
> > > 
> > > I suspect a bug elsewhere, quite frankly !
> > > 
> > > We have a chain that have an end pointer that doesnt match the expected
> > > one.
> > > 
> > 
> > On normal situation, we always finish the lookup :
> > 
> > 1) If we found the thing we were looking at.
> > 
> > 2) We get the list end (item not found), we then check if it is the
> > expected end.
> > 
> > It is _not_ the expected end only if some writer deleted/inserted an
> > element in _this_ chain during our lookup.
> 
> So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> elements?  (Not obvious from the code, but my ignorance of the networking
> code is such that many things in that part of the kernel are not obvious
> to me, I am afraid.)
> 

Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

> Otherwise, of course you would simply allow deleted elements to continue
> pointing where they did previously, so that concurrent readers would not
> miss anything.
> 




> Of course, the same potential might arise on insertion, but it is usually
> OK to miss an element that was inserted after you started searching.
> 
> > Because our lookup is lockless, we then have to redo it because we might
> > miss the object we are looking for.
> 
> Ah...  Is there also a resize operation?  Herbert did do a resizable
> hash table recently, but I was under the impression that (1) it was in
> some other part of the networking stack and (2) it avoided the need to
> restart readers.
> 
> > If we can do the 'retry' a 10 times, it means the attacker was really
> > clever enough to inject new packets (new conntracks) at the right
> > moment, in the right hash chain, and this sounds so higly incredible
> > that I cannot believe it at all :)
> 
> Or maybe the DoS attack is injecting so many new conntracks that a large
> fraction of the hash chains are being modified at any given time?
> 
> 							Thanx, Paul

maybe hash table has one slot :)


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-22 16:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100422145640.GB3228@redhat.com>

On Thu, Apr 22, 2010 at 10:56:40AM -0400, Vivek Goyal wrote:
> On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:
> 
> [..]
> > > [    3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > [    3.116754] ---------------------------------------------------
> > > [    3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check()
> > > without protection!
> > > [    3.116754]
> > > [    3.116754] other info that might help us debug this:
> > > [    3.116754]
> > > [    3.116754]
> > > [    3.116754] rcu_scheduler_active = 1, debug_locks = 1
> > > [    3.116754] 2 locks held by async/1/666:
> > > [    3.116754]  #0:  (&shost->scan_mutex){+.+.+.}, at:
> > > [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> > > [    3.116754]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> > > [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> > > [    3.116754]
> > > [    3.116754] stack backtrace:
> > > [    3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> > > [    3.116754] Call Trace:
> > > [    3.116754]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > > [    3.116754]  [<ffffffff8107f9b1>] css_id+0x3f/0x51
> > > [    3.116754]  [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> > > [    3.116754]  [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> > > [    3.116754]  [<ffffffff811e3445>] elevator_init+0xba/0xf5
> > > [    3.116754]  [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> > > [    3.116754]  [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> > > [    3.116754]  [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> > > [    3.116754]  [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> > > [    3.116754]  [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> > > [    3.116754]  [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> > > [    3.116754]  [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> > > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > > [    3.116754]  [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> > > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > > [    3.116754]  [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> > > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > > [    3.116754]  [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> > > [    3.116754]  [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> > > [    3.116754]  [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> > > [    3.116754]  [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> > > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > > [    3.116754]  [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> > > [    3.116754]  [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> > > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > > [    3.116754]  [<ffffffff8105713e>] kthread+0x89/0x91
> > > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > > [    3.116754]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> > > [    3.116754]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> > > [    3.116754]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> > > [    3.116754]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
> > 
> > I cannot convince myself that the above access is safe.  Vivek, Nauman,
> > thoughts?
> 
> Hi Paul,
> 
> blkiocg_add_blkio_group() is called from two paths.
> 
> First one is following. This path should be safe as it takes rcu read
> lock.
> 
> cfq_get_cfqg()
> 	rcu_read_lock()
> 	cfq_find_alloc_cfqg()
> 		blkiocg_add_blkio_group()
> 	rcu_read_unlock()
> 
> Second one is as shown in above backtrace.
> 
> cfq_init_queue()
> 	blkiocg_add_blkio_group().
> 
> This path is called at request queue and cfq initialization time and
> we access only root cgroup (root blkio_cgroup). As root cgroup can't
> go away, do we have to protect that call also using rcu_read_lock()?

You are correct, if the root cgroup cannot go away and if we only access
the root cgroup, then rcu_read_lock() is not required.

> So I guess it is not unsafe but propably we need to fix the warning, I
> should wrap second call to blkiocg_add_blkio_group() with
> rcu_read_lock/unlock pair?

That would work very well!

							Thanx, Paul

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Paul E. McKenney @ 2010-04-22 15:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <1271948029.7895.5707.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> 
> > If one hash slot is under attack, then there is a bug somewhere.
> > 
> > If we cannot avoid this, we can fallback to a secure mode at the second
> > retry, and take the spinlock.
> > 
> > Tis way, most of lookups stay lockless (one pass), and some might take
> > the slot lock to avoid the possibility of a loop.
> > 
> > I suspect a bug elsewhere, quite frankly !
> > 
> > We have a chain that have an end pointer that doesnt match the expected
> > one.
> > 
> 
> On normal situation, we always finish the lookup :
> 
> 1) If we found the thing we were looking at.
> 
> 2) We get the list end (item not found), we then check if it is the
> expected end.
> 
> It is _not_ the expected end only if some writer deleted/inserted an
> element in _this_ chain during our lookup.

So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
elements?  (Not obvious from the code, but my ignorance of the networking
code is such that many things in that part of the kernel are not obvious
to me, I am afraid.)

Otherwise, of course you would simply allow deleted elements to continue
pointing where they did previously, so that concurrent readers would not
miss anything.

Of course, the same potential might arise on insertion, but it is usually
OK to miss an element that was inserted after you started searching.

> Because our lookup is lockless, we then have to redo it because we might
> miss the object we are looking for.

Ah...  Is there also a resize operation?  Herbert did do a resizable
hash table recently, but I was under the impression that (1) it was in
some other part of the networking stack and (2) it avoided the need to
restart readers.

> If we can do the 'retry' a 10 times, it means the attacker was really
> clever enough to inject new packets (new conntracks) at the right
> moment, in the right hash chain, and this sounds so higly incredible
> that I cannot believe it at all :)

Or maybe the DoS attack is injecting so many new conntracks that a large
fraction of the hash chains are being modified at any given time?

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-22 15:49 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422142506.GA15858@gondor.apana.org.au>

On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> This patch fixes this by using the DADFAILED bit to synchronise
> the two paths while holding the ifp lock.  It relies on the fact
> that the TENTATIVE bit is always set during DAD, and that the
> DADFAILED bit is only set on failure.

But the addr_dad_failure()->...->ipv6_del_addr() path will
still race with any other path calling ipv6_del_addr() (e.g. a
manual address removal). Won't it?

I still don't see why __ipv6_ifa_notify() needs to call
dst_free(). Shouldn't that be dst_release() instead, to drop the
reference obtained by dst_hold(&ifp->rt->u.dst)?

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Ben Hutchings @ 2010-04-22 15:41 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: tglx@linutronix.de, davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <Pine.WNT.4.64.1004220459110.3324@PPWASKIE-MOBL2.amr.corp.intel.com>

On Thu, 2010-04-22 at 05:11 -0700, Peter P Waskiewicz Jr wrote:
> On Wed, 21 Apr 2010, Ben Hutchings wrote:
> 
> > On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> >> This patch adds a callback function pointer to the irq_desc
> >> structure, along with a registration function and a read-only
> >> proc entry for each interrupt.
> >>
> >> This affinity_hint handle for each interrupt can be used by
> >> underlying drivers that need a better mechanism to control
> >> interrupt affinity.  The underlying driver can register a
> >> callback for the interrupt, which will allow the driver to
> >> provide the CPU mask for the interrupt to anything that
> >> requests it.  The intent is to extend the userspace daemon,
> >> irqbalance, to help hint to it a preferred CPU mask to balance
> >> the interrupt into.
> >
> > Doesn't it make more sense to have the driver follow affinity decisions
> > made from user-space?  I realise that reallocating queues is disruptive
> > and we probably don't want irqbalance to trigger that, but there should
> > be a mechanism for the administrator to trigger it.
> 
> The driver here would be assisting userspace (irqbalance) to provide 
> better details how the HW is laid out with respect to flows.  As it stands 
> today, irqbalance is almost guaranteed to move interrups to CPUs that are 
> not aligned with where applications are running for network adapters. 
> This is very apparent when running at speeds in the 10 Gigabit range, or 
> even multiple 1 Gigabit ports running at the same time.

I'm well aware that irqbalance isn't making good decisions at the
moment.  The question is whether this will really help irqbalance to do
better.

[...]
> > This just assigns IRQs to the first n CPU threads.  Depending on the
> > enumeration order, this might result in assigning an IRQ to each of 2
> > threads on a core while leaving other cores unused!
> 
> This ixgbe patch is only meant to be an example of how you could use it. 
> I didn't hammer out all the corner cases of interrupt alignment in it yet. 
> However, ixgbe is already aligning Tx flows onto the CPU/queue pair the Tx 
> occurred (i.e. Tx session from CPU 4 will be queued on Tx queue 4),
[...]

OK, now I remember ixgbe has this odd select_queue() implementation.
But this behaviour can result in reordering whenever a user thread
migrates, and in any case Dave discourages people from setting
select_queue().  So I see that these changes would be useful for ixgbe
(together with an update to irqbalance), but they don't seem to fit the
general direction of multiqueue networking on Linux.

(Actually, the hints seem to be incomplete.  If there are more than 16
CPU threads then multiple CPU threads can map to the same queues, but it
looks like you only include the first in the queue's hint.)

An alternate approach is to use the RX queue index to drive TX queue
selection.  I posted a patch to do that earlier this week.  However I
haven't yet had a chance to try that on a suitably large system.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Vivek Goyal @ 2010-04-22 14:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100421213543.GO2563@linux.vnet.ibm.com>

On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:

[..]
> > [    3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [    3.116754] ---------------------------------------------------
> > [    3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check()
> > without protection!
> > [    3.116754]
> > [    3.116754] other info that might help us debug this:
> > [    3.116754]
> > [    3.116754]
> > [    3.116754] rcu_scheduler_active = 1, debug_locks = 1
> > [    3.116754] 2 locks held by async/1/666:
> > [    3.116754]  #0:  (&shost->scan_mutex){+.+.+.}, at:
> > [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> > [    3.116754]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> > [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> > [    3.116754]
> > [    3.116754] stack backtrace:
> > [    3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> > [    3.116754] Call Trace:
> > [    3.116754]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > [    3.116754]  [<ffffffff8107f9b1>] css_id+0x3f/0x51
> > [    3.116754]  [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> > [    3.116754]  [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> > [    3.116754]  [<ffffffff811e3445>] elevator_init+0xba/0xf5
> > [    3.116754]  [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> > [    3.116754]  [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> > [    3.116754]  [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> > [    3.116754]  [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> > [    3.116754]  [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> > [    3.116754]  [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> > [    3.116754]  [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [    3.116754]  [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [    3.116754]  [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [    3.116754]  [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> > [    3.116754]  [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> > [    3.116754]  [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> > [    3.116754]  [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [    3.116754]  [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> > [    3.116754]  [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [    3.116754]  [<ffffffff8105713e>] kthread+0x89/0x91
> > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [    3.116754]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> > [    3.116754]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> > [    3.116754]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> > [    3.116754]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
> 
> I cannot convince myself that the above access is safe.  Vivek, Nauman,
> thoughts?

Hi Paul,

blkiocg_add_blkio_group() is called from two paths.

First one is following. This path should be safe as it takes rcu read
lock.

cfq_get_cfqg()
	rcu_read_lock()
	cfq_find_alloc_cfqg()
		blkiocg_add_blkio_group()
	rcu_read_unlock()

Second one is as shown in above backtrace.

cfq_init_queue()
	blkiocg_add_blkio_group().

This path is called at request queue and cfq initialization time and
we access only root cgroup (root blkio_cgroup). As root cgroup can't
go away, do we have to protect that call also using rcu_read_lock()?

So I guess it is not unsafe but propably we need to fix the warning, I
should wrap second call to blkiocg_add_blkio_group() with
rcu_read_lock/unlock pair?

Thanks
Vivek

^ permalink raw reply

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22 14:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <1271946805.7895.5658.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 10:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> It does make a difference, Damn it.
>
> I really really start to think you dont read what I wrote, or you dont
> care.

I misunderstood it. Sorry.

>
> Damn, cant you update all the things at once, taking this lock only
> once ?
>
> You focus having an ultra precise count of pkt_queue.len, but we dont
> care at all ! We only want a _limit_, or else the box can be killed by
> DOS.
>
> If in practice this limit can be 2*limit, thats OK.
>
> Cant you understand this ?
>
>
> We need one limit. Not two limits.
>
> I already told you how to do it, but you ignored me and started yet
> another convoluted thing.
>
>
> process_backlog() transfert the queue to its own queue and reset pkt_len
> to 0 (Only once)
>
> End of story.
>
> Maximum packet queued to this cpu softnet_data will be 2 * old_limit.
>
> So what ?
>

Now, I think I really understand. We don't need a precious limit. So
only a additional queue is enough.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:53 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney
In-Reply-To: <1271946961.7895.5665.camel@edumazet-laptop>

Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :

> If one hash slot is under attack, then there is a bug somewhere.
> 
> If we cannot avoid this, we can fallback to a secure mode at the second
> retry, and take the spinlock.
> 
> Tis way, most of lookups stay lockless (one pass), and some might take
> the slot lock to avoid the possibility of a loop.
> 
> I suspect a bug elsewhere, quite frankly !
> 
> We have a chain that have an end pointer that doesnt match the expected
> one.
> 

On normal situation, we always finish the lookup :

1) If we found the thing we were looking at.

2) We get the list end (item not found), we then check if it is the
expected end.

It is _not_ the expected end only if some writer deleted/inserted an
element in _this_ chain during our lookup.

Because our lookup is lockless, we then have to redo it because we might
miss the object we are looking for.

If we can do the 'retry' a 10 times, it means the attacker was really
clever enough to inject new packets (new conntracks) at the right
moment, in the right hash chain, and this sounds so higly incredible
that I cannot believe it at all :)




^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:36 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney
In-Reply-To: <4BD04C74.9020402@trash.net>

Le jeudi 22 avril 2010 à 15:17 +0200, Patrick McHardy a écrit :
> Changli Gao wrote:
> > On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
> >> At an unnamed ISP, we experienced a DDoS attack against one of our
> >> customers.  This attack also caused problems for one of our Linux
> >> based routers.
> >>
> >> The attack was "only" generating 300 kpps (packets per sec), which
> >> usually isn't a problem for this (fairly old) Linux Router.  But the
> >> conntracking system chocked and reduced pps processing power to
> >> 40kpps.
> >>
> >> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
> >> searches in the period exploded, to a stunning 700.000 searches per
> >> sec.
> >>
> >> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
> >>
> >> First I though it might be caused by bad hashing, but after reading
> >> the kernel code (func: __nf_conntrack_find()), I think its caused by
> >> the loop restart (goto begin) of the conntrack search, running under
> >> local_bh_disable().  These RCU changes to conntrack were introduced in
> >> ea781f19 by Eric Dumazet.
> >>
> >> Code: net/netfilter/nf_conntrack_core.c
> >> Func: __nf_conntrack_find()
> >>
> >> struct nf_conntrack_tuple_hash *
> >> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> >> {
> >>        struct nf_conntrack_tuple_hash *h;
> >>        struct hlist_nulls_node *n;
> >>        unsigned int hash = hash_conntrack(tuple);
> >>
> >>        /* Disable BHs the entire time since we normally need to disable them
> >>         * at least once for the stats anyway.
> >>         */
> >>        local_bh_disable();
> >> begin:
> >>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
> >>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
> >>                        NF_CT_STAT_INC(net, found);
> >>                        local_bh_enable();
> >>                        return h;
> >>                }
> >>                NF_CT_STAT_INC(net, searched);
> >>        }
> >>        /*
> >>         * if the nulls value we got at the end of this lookup is
> >>         * not the expected one, we must restart lookup.
> >>         * We probably met an item that was moved to another chain.
> >>         */
> >>        if (get_nulls_value(n) != hash)
> >>                goto begin;
> >>        local_bh_enable();
> >>
> > 
> > We should add a retry limit there.
> 
> We can't do that since that would allow false negatives.

If one hash slot is under attack, then there is a bug somewhere.

If we cannot avoid this, we can fallback to a secure mode at the second
retry, and take the spinlock.

Tis way, most of lookups stay lockless (one pass), and some might take
the slot lock to avoid the possibility of a loop.

I suspect a bug elsewhere, quite frankly !

We have a chain that have an end pointer that doesnt match the expected
one.




--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22 14:33 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <y2n412e6f7f1004220606id324dc9bj2cc04cfbad50a101@mail.gmail.com>

Le jeudi 22 avril 2010 à 21:06 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 7:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Please reorder things better.
> >
> > Most likely this function is called for one packet.
> >
> > In your version you take twice the rps_lock()/rps_unlock() path, so
> > it'll be slower.
> >
> > Once to 'transfert' one list to process list
> >
> > Once to be able to do the 'label out:' post processing.
> >
> 
> It doesn't make any difference. We have to hold rps_lock to update
> input_pkt_queue_len, if we don't use another variable to record the
> length of the process queue, or atomic variable.
> 

It does make a difference, Damn it.

I really really start to think you dont read what I wrote, or you dont
care.

Damn, cant you update all the things at once, taking this lock only
once ?

You focus having an ultra precise count of pkt_queue.len, but we dont
care at all ! We only want a _limit_, or else the box can be killed by
DOS.

If in practice this limit can be 2*limit, thats OK. 

Cant you understand this ?

> I think it is better that using another variable to record the length
> of the process queue, and updating it before process_backlog()
> returns. For one packet, there is only one locking/unlocking. There is
> only one issue you concerned: cache miss due to sum the two queues'
> length. I'll change softnet_data to:
> 
> struct softnet_data {
>         struct Qdisc            *output_queue;
>         struct list_head        poll_list;
>         struct sk_buff          *completion_queue;
>         struct sk_buff_head     process_queue;
> 
> #ifdef CONFIG_RPS
>         struct softnet_data     *rps_ipi_list;
> 
>         /* Elements below can be accessed between CPUs for RPS */
>         struct call_single_data csd ____cacheline_aligned_in_smp;
>         struct softnet_data     *rps_ipi_next;
>         unsigned int            cpu;
>         unsigned int            input_queue_head;
> #endif
>         unsigned int            process_queue_len;
>         struct sk_buff_head     input_pkt_queue;
>         struct napi_struct      backlog;
> };
> 
> For one packets, we have to update process_queue_len in any way. For
> more packets, we only change process_queue_len just before
> process_backlog() returns. It means that process_queue_len change is
> batched.
> 

We need one limit. Not two limits.

I already told you how to do it, but you ignored me and started yet
another convoluted thing.


process_backlog() transfert the queue to its own queue and reset pkt_len
to 0 (Only once)

End of story.

Maximum packet queued to this cpu softnet_data will be 2 * old_limit.

So what ?



^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Herbert Xu @ 2010-04-22 14:25 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422.004324.67422011.davem@davemloft.net>

On Thu, Apr 22, 2010 at 12:43:24AM -0700, David Miller wrote:
>
> Thanks Herbert.

No worries :)

BTW similar races exist in other NDISC receive functions, but
it's too late today so I'll look at this tomorrow unless someone
else wants to have a go at this.

ipv6: Prevent DAD races

The NDISC receive path has not been written in a way that handles
unexpected packets properly.

For example, if we get two identical simultaneous NA/NS packets
that result in a DAD failure, we may try to delete the same address
twice.

A similar problem occurs when we get a DAD failure just as we're
about to mark an address as having passed DAD.

This patch fixes this by using the DADFAILED bit to synchronise
the two paths while holding the ifp lock.  It relies on the fact
that the TENTATIVE bit is always set during DAD, and that the
DADFAILED bit is only set on failure.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index de7a194..1d15d5e 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1401,6 +1401,16 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp, int dad_failed)
 void addrconf_dad_failure(struct inet6_ifaddr *ifp)
 {
 	struct inet6_dev *idev = ifp->idev;
+	int ignore;
+
+	spin_lock(&ifp->lock);
+	ignore = (ifp->flags & (IFA_F_DADFAILED|IFA_F_TENTATIVE)) ^
+		 IFA_F_TENTATIVE;
+	ifp->flags |= IFA_F_DADFAILED;
+	spin_unlock(&ifp->lock);
+
+	if (ignore)
+		return;
 
 	if (net_ratelimit())
 		printk(KERN_INFO "%s: IPv6 duplicate address %pI6c detected!\n",
@@ -2789,7 +2799,10 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 	read_lock_bh(&idev->lock);
 	if (ifp->dead)
 		goto out;
+
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED)
+		goto unlock_ifp;
 
 	if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) ||
 	    idev->cnf.accept_dad < 1 ||
@@ -2824,6 +2837,7 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 		ip6_ins_rt(ifp->rt);
 
 	addrconf_dad_kick(ifp);
+unlock_ifp:
 	spin_unlock_bh(&ifp->lock);
 out:
 	read_unlock_bh(&idev->lock);
@@ -2841,6 +2855,11 @@ static void addrconf_dad_timer(unsigned long data)
 		goto out;
 	}
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED) {
+		spin_unlock_bh(&ifp->lock);
+		read_unlock_bh(&idev->lock);
+		goto out;
+	}
 	if (ifp->probes == 0) {
 		/*
 		 * DAD was successful

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22 14:24 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, hadi, therbert, netdev
In-Reply-To: <t2t412e6f7f1004220527mb5340bebye1d1fda0963dcf2a@mail.gmail.com>

Le jeudi 22 avril 2010 à 20:27 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 5:43 PM, David Miller <davem@davemloft.net> wrote:
> > From: Changli Gao <xiaosuo@gmail.com>
> > Date: Thu, 22 Apr 2010 17:09:17 +0800
> >
> >> +     unsigned int            input_pkt_queue_len;
> >> +     struct sk_buff          *input_pkt_queue_head;
> >> +     struct sk_buff          **input_pkt_queue_tailp;
> >> +
> >
> > Please do not ignore Stephen Hemminger's feedback.
> >
> > We already have enough odd SKB queue implementations, we
> > do not need yet another one in a core location.  This makes
> > it harder and harder to eventually convert sk_buff to use
> > "struct list_head".
> >
> > Instead, use "struct sk_buff_head" and the lockless accessors
> > (__skb_insert, etc.) and initializer (__skb_queue_head_init).
> >
> 
> If I want to keep softnet_data small, I have to access the internal
> fields of sk_buff_head, and modify them in a hack way. It doesn't
> sound good. If not, softnet_data will become:
> 
> struct softnet_data {
>         struct Qdisc            *output_queue;
>         struct list_head        poll_list;
>         struct sk_buff          *completion_queue;
>         struct sk_buff_head     process_queue;
> 
> #ifdef CONFIG_RPS
>         struct softnet_data     *rps_ipi_list;
> 
>         /* Elements below can be accessed between CPUs for RPS */
>         struct call_single_data csd ____cacheline_aligned_in_smp;
>         struct softnet_data     *rps_ipi_next;
>         unsigned int            cpu;
>         unsigned int            input_queue_head;
> #endif
>         unsigned int            input_pkt_queue_len;
>         struct sk_buff_head     input_pkt_queue;
>         struct napi_struct      backlog;
> };
> 
> Eric, do you think it is too fat?

No it is not, as long as we are careful to separate cache lines.

By the way, input_pkt_queue_len is already in input_ptk_queue (.len)

Thanks



^ permalink raw reply

* Re: Bug#577640: linux-image-2.6.33-2-amd64: Kernel warnings in netns thread
From: Martín Ferrari @ 2010-04-22 14:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Hutchings, 577640, netdev, Eric W. Biederman, Alexey Dobriyan,
	Mathieu Lacage
In-Reply-To: <m1sk6oky96.fsf@fess.ebiederm.org>

On Thu, Apr 22, 2010 at 04:38, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> > $ sudo ./startns bash
>>> > # ip l a type veth
>>> > # ip l s veth0 netns 1
>>> > # exit

> Then I should ask what is startns?

That's just a simple C program that calls unshare(CLONE_NEWNET) and
execs a shell.

> Either that is doing something different from my equivalent program, or I have
> patches to fix this, that just haven't been merged yet.

I have just downloaded and compiled 2.6.32-2 and 2.6.34-rc5 from
kernel.org using the .config from the debian package, and the oops is
reproducible in both.

This small C file reproduces the error every time:

$ cat netnsoops.c
#include <stdio.h>
#include <stdlib.h>
#define _GNU_SOURCE
#include <sched.h>

int main(int argc, char *argv[])
{	
	int c;
	unsigned long flags = CLONE_NEWNET;

	if(unshare(flags) == -1) {
		perror("unshare");
		return 1;
	}
	system("ip link add name FOO type veth peer name BAR");
	system("ip link set FOO netns 1");
	system("ip link show");
	return 0;
}


-- 
Martín Ferrari

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Jesper Dangaard Brouer @ 2010-04-22 13:31 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Patrick McHardy, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney
In-Reply-To: <q2h412e6f7f1004220613m488c2ee4r6d24a8d1e65997d4@mail.gmail.com>


I have added a stats counter to prove my case, which I think we should add to the kernel (to detect the case in the future).
The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.



[PATCH] net: netfilter conntrack extended with extra stat counter.

From: Jesper Dangaard Brouer <hawk@comx.dk>

I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.

Adding a stats counter to see if the search is restarted too often.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---

 include/linux/netfilter/nf_conntrack_common.h      |    1 +
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   |    7 ++++---
 net/netfilter/nf_conntrack_core.c                  |    4 +++-
 net/netfilter/nf_conntrack_standalone.c            |    7 ++++---
 4 files changed, 12 insertions(+), 7 deletions(-)


diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index a8248ee..57ff312 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -93,6 +93,7 @@ struct ip_conntrack_stat
 	unsigned int expect_new;
 	unsigned int expect_create;
 	unsigned int expect_delete;
+	unsigned int search_restart;
 };
 
 /* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 8668a3d..b94f510 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -358,7 +358,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->search_restart
 		);
 	return 0;
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 4299db7..5ca8286 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -315,8 +315,10 @@ begin:
 	 * not the expected one, we must restart lookup.
 	 * We probably met an item that was moved to another chain.
 	 */
-	if (get_nulls_value(n) != hash)
+	if (get_nulls_value(n) != hash) {
+		NF_CT_STAT_INC(net, search_restart);
 		goto begin;
+	}
 	local_bh_enable();
 
 	return NULL;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 1935153..4812750 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -245,12 +245,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -267,7 +267,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->search_restart
 		);
 	return 0;
 }



-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related

* [PATCHv2] Socket filter ancilliary data access for skb->dev->type
From: Paul LeoNerd Evans @ 2010-04-22 13:32 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1712 bytes --]

Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving
access to skb->dev->type, as reported in the sll_hatype field.

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on. This patch adds such ability.

This patch also handles the case where skb->dev can be NULL, such as on
netlink sockets.

Signed-off-by: Paul Evans <leonerd@leonerd.org.uk>

---

diff -ur linux-2.6.33.2.orig/include/linux/filter.h linux-2.6.33.2/include/linux/filter.h
--- linux-2.6.33.2.orig/include/linux/filter.h	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/include/linux/filter.h	2010-04-20 22:40:25.000000000 +0100
@@ -123,7 +123,8 @@
 #define SKF_AD_NLATTR_NEST	16
 #define SKF_AD_MARK 	20
 #define SKF_AD_QUEUE	24
-#define SKF_AD_MAX	28
+#define SKF_AD_HATYPE	28
+#define SKF_AD_MAX	32
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
diff -ur linux-2.6.33.2.orig/net/core/filter.c linux-2.6.33.2/net/core/filter.c
--- linux-2.6.33.2.orig/net/core/filter.c	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/net/core/filter.c	2010-04-22 14:19:24.000000000 +0100
@@ -301,6 +301,8 @@
 			A = skb->pkt_type;
 			continue;
 		case SKF_AD_IFINDEX:
+			if (!skb->dev)
+				return 0;
 			A = skb->dev->ifindex;
 			continue;
 		case SKF_AD_MARK:
@@ -309,6 +311,11 @@
 		case SKF_AD_QUEUE:
 			A = skb->queue_mapping;
 			continue;
+		case SKF_AD_HATYPE:
+			if (!skb->dev)
+				return 0;
+			A = skb->dev->type;
+			continue;
 		case SKF_AD_NLATTR: {
 			struct nlattr *nla;
 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-04-22 13:17 UTC (permalink / raw)
  To: Changli Gao
  Cc: hawk, Eric Dumazet, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney
In-Reply-To: <q2h412e6f7f1004220613m488c2ee4r6d24a8d1e65997d4@mail.gmail.com>

Changli Gao wrote:
> On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
>> At an unnamed ISP, we experienced a DDoS attack against one of our
>> customers.  This attack also caused problems for one of our Linux
>> based routers.
>>
>> The attack was "only" generating 300 kpps (packets per sec), which
>> usually isn't a problem for this (fairly old) Linux Router.  But the
>> conntracking system chocked and reduced pps processing power to
>> 40kpps.
>>
>> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
>> searches in the period exploded, to a stunning 700.000 searches per
>> sec.
>>
>> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
>>
>> First I though it might be caused by bad hashing, but after reading
>> the kernel code (func: __nf_conntrack_find()), I think its caused by
>> the loop restart (goto begin) of the conntrack search, running under
>> local_bh_disable().  These RCU changes to conntrack were introduced in
>> ea781f19 by Eric Dumazet.
>>
>> Code: net/netfilter/nf_conntrack_core.c
>> Func: __nf_conntrack_find()
>>
>> struct nf_conntrack_tuple_hash *
>> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
>> {
>>        struct nf_conntrack_tuple_hash *h;
>>        struct hlist_nulls_node *n;
>>        unsigned int hash = hash_conntrack(tuple);
>>
>>        /* Disable BHs the entire time since we normally need to disable them
>>         * at least once for the stats anyway.
>>         */
>>        local_bh_disable();
>> begin:
>>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
>>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
>>                        NF_CT_STAT_INC(net, found);
>>                        local_bh_enable();
>>                        return h;
>>                }
>>                NF_CT_STAT_INC(net, searched);
>>        }
>>        /*
>>         * if the nulls value we got at the end of this lookup is
>>         * not the expected one, we must restart lookup.
>>         * We probably met an item that was moved to another chain.
>>         */
>>        if (get_nulls_value(n) != hash)
>>                goto begin;
>>        local_bh_enable();
>>
> 
> We should add a retry limit there.

We can't do that since that would allow false negatives.

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Changli Gao @ 2010-04-22 13:13 UTC (permalink / raw)
  To: hawk
  Cc: Eric Dumazet, Patrick McHardy, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney
In-Reply-To: <1271941082.14501.189.camel@jdb-workstation>

On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
>
> At an unnamed ISP, we experienced a DDoS attack against one of our
> customers.  This attack also caused problems for one of our Linux
> based routers.
>
> The attack was "only" generating 300 kpps (packets per sec), which
> usually isn't a problem for this (fairly old) Linux Router.  But the
> conntracking system chocked and reduced pps processing power to
> 40kpps.
>
> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
> searches in the period exploded, to a stunning 700.000 searches per
> sec.
>
> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
>
> First I though it might be caused by bad hashing, but after reading
> the kernel code (func: __nf_conntrack_find()), I think its caused by
> the loop restart (goto begin) of the conntrack search, running under
> local_bh_disable().  These RCU changes to conntrack were introduced in
> ea781f19 by Eric Dumazet.
>
> Code: net/netfilter/nf_conntrack_core.c
> Func: __nf_conntrack_find()
>
> struct nf_conntrack_tuple_hash *
> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> {
>        struct nf_conntrack_tuple_hash *h;
>        struct hlist_nulls_node *n;
>        unsigned int hash = hash_conntrack(tuple);
>
>        /* Disable BHs the entire time since we normally need to disable them
>         * at least once for the stats anyway.
>         */
>        local_bh_disable();
> begin:
>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
>                        NF_CT_STAT_INC(net, found);
>                        local_bh_enable();
>                        return h;
>                }
>                NF_CT_STAT_INC(net, searched);
>        }
>        /*
>         * if the nulls value we got at the end of this lookup is
>         * not the expected one, we must restart lookup.
>         * We probably met an item that was moved to another chain.
>         */
>        if (get_nulls_value(n) != hash)
>                goto begin;
>        local_bh_enable();
>

We should add a retry limit there.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] Socket filter ancilliary data access for skb->dev->type
From: Patrick McHardy @ 2010-04-22 13:13 UTC (permalink / raw)
  To: Paul LeoNerd Evans; +Cc: netdev
In-Reply-To: <20100422131105.GS19334@cel.leo>

Paul LeoNerd Evans wrote:
> On Thu, Apr 22, 2010 at 02:28:46PM +0200, Patrick McHardy wrote:
>> I think we should be adding a check whether skb->dev is non-NULL here
>> since filters can also be attached to netlink sockets. The same applies
>> to SKF_AD_IFINDEX.
> 
> What should the appropriate behaviour be here? Set A to some rogue value
> - 0 or -1 seem appropriate? Or, abort the filter entirely (such as in
> e.g. divide-by-zero, or invalid memory buffer access)?
> 
> Either way that sounds simple enough, I can hack that in and resubmit.

I'd say we should abort execution.

^ permalink raw reply

* Re: [PATCH] Socket filter ancilliary data access for skb->dev->type
From: Paul LeoNerd Evans @ 2010-04-22 13:11 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev
In-Reply-To: <4BD040FE.3000809@trash.net>

[-- Attachment #1: Type: text/plain, Size: 644 bytes --]

On Thu, Apr 22, 2010 at 02:28:46PM +0200, Patrick McHardy wrote:
> I think we should be adding a check whether skb->dev is non-NULL here
> since filters can also be attached to netlink sockets. The same applies
> to SKF_AD_IFINDEX.

What should the appropriate behaviour be here? Set A to some rogue value
- 0 or -1 seem appropriate? Or, abort the filter entirely (such as in
e.g. divide-by-zero, or invalid memory buffer access)?

Either way that sounds simple enough, I can hack that in and resubmit.

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22 13:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <1271936227.7895.5285.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 7:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Please reorder things better.
>
> Most likely this function is called for one packet.
>
> In your version you take twice the rps_lock()/rps_unlock() path, so
> it'll be slower.
>
> Once to 'transfert' one list to process list
>
> Once to be able to do the 'label out:' post processing.
>

It doesn't make any difference. We have to hold rps_lock to update
input_pkt_queue_len, if we don't use another variable to record the
length of the process queue, or atomic variable.

I think it is better that using another variable to record the length
of the process queue, and updating it before process_backlog()
returns. For one packet, there is only one locking/unlocking. There is
only one issue you concerned: cache miss due to sum the two queues'
length. I'll change softnet_data to:

struct softnet_data {
        struct Qdisc            *output_queue;
        struct list_head        poll_list;
        struct sk_buff          *completion_queue;
        struct sk_buff_head     process_queue;

#ifdef CONFIG_RPS
        struct softnet_data     *rps_ipi_list;

        /* Elements below can be accessed between CPUs for RPS */
        struct call_single_data csd ____cacheline_aligned_in_smp;
        struct softnet_data     *rps_ipi_next;
        unsigned int            cpu;
        unsigned int            input_queue_head;
#endif
        unsigned int            process_queue_len;
        struct sk_buff_head     input_pkt_queue;
        struct napi_struct      backlog;
};

For one packets, we have to update process_queue_len in any way. For
more packets, we only change process_queue_len just before
process_backlog() returns. It means that process_queue_len change is
batched.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox