Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] WAN: flush tx_queue in hdlc_ppp to prevent panic on rmmod hw_driver.
From: Michael Barkowski @ 2010-04-22 19:17 UTC (permalink / raw)
  To: Krzysztof Halasa; +Cc: David Miller, netdev
In-Reply-To: <m3mxx5mv8v.fsf@intrepid.localdomain>

Krzysztof Halasa wrote:
> tx_queue is used as a temporary queue when not allowed to queue skb
> directly to the hw device driver (which may sleep). Most paths flush
> it before returning, but ppp_start() currently cannot. Make sure we
> don't leave skbs pointing to a non-existent device.
> 
> Thanks to Michael Barkowski for reporting this problem.

Great - thanks.  Will this be going into -stable?

-- 
Michael Barkowski

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-22 19:02 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev, Mitch Williams
In-Reply-To: <201004222048.53950.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Thursday 22 April 2010 19:47:29 Chris Wright wrote:
> > OK, wasn't clear if you meant that or simply 100% dedicating the interface
> > via something like virtio.  The add_vf() idea, while neat, doesn't really
> > match how VF's are allocated.
> 
> But we still need something like that for allocating queues in VMDq
> and similar cases where we do not have pass-through, right?

Iff we care about VMDq w/out SR-IOV (since SR-IOV hardware is VMDq
capable and already has a queue-pair + interrupt + net_dev), yes.

And it's not just VMDq, it's any multi-queue card that can do mac/vlan
filter in hw + header/data split (for direct data DMA to guest buffers).

> As far as I can tell we don't have an interface for that yet, but
> we have drivers for a number of cards that could do this.
> 
> > > I don't have an SR-IOV card available for testing yet. How is this
> > > configured now?
> > 
> > The device shows up in the host as a normal network device, so mgmt tools
> > currently treat it as if it's no different from a PF.  So that's just
> > plain old:
> > 
> > SIOCSIFHWADDR or RTM_SETLINK (i.e. normal ->ndo_set_mac_addr)
> 
> Ok, but that only works for a fixed number of VFs and you can only
> configure the VF before it's assigned to the guest, right?

Depends on assign.

Assign meaning it's still visible in host, but only one guest is using
it via virtio (e.g. vhost-net)....then no, can change anytime (although
it's not typically changed during VM lifecycle).

Assign meaning direct PCI device assignment of the VF to the guest,
then yes, only while the device has driver in host.

> Both are not serious limitations, but it would be nice to
> have an easy way around them. In particular, for assigning
> the mac address and vlan id (VF in access mode), there needs
> to be some interface that allows the host but not the guest
> to change the settings after assigning the card to the guest.
> 
> This is a fundamental requirement for VEPA, because the switch
> applied its forwarding rules based on the mac address and trusts
> the hypervisor to make sure it cannot be faked by the guest.

Sure, but the VF (when directly assigned to the guest) is going to (at
least it should, for security reasons) always trap to a privileged code if
the guest tries to do something like set mac or vlan id.  All the SR-IOV
cards I've seen do this.  The "set VF mac addr" is really a message to
the PF.

thanks,
-chris

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-22 18:48 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev, Mitch Williams
In-Reply-To: <20100422174729.GK28829@x200.localdomain>

On Thursday 22 April 2010 19:47:29 Chris Wright wrote:
> OK, wasn't clear if you meant that or simply 100% dedicating the interface
> via something like virtio.  The add_vf() idea, while neat, doesn't really
> match how VF's are allocated.

But we still need something like that for allocating queues in VMDq
and similar cases where we do not have pass-through, right?

As far as I can tell we don't have an interface for that yet, but
we have drivers for a number of cards that could do this.

> > I don't have an SR-IOV card available for testing yet. How is this
> > configured now?
> 
> The device shows up in the host as a normal network device, so mgmt tools
> currently treat it as if it's no different from a PF.  So that's just
> plain old:
> 
> SIOCSIFHWADDR or RTM_SETLINK (i.e. normal ->ndo_set_mac_addr)

Ok, but that only works for a fixed number of VFs and you can only
configure the VF before it's assigned to the guest, right?

Both are not serious limitations, but it would be nice to
have an easy way around them. In particular, for assigning
the mac address and vlan id (VF in access mode), there needs
to be some interface that allows the host but not the guest
to change the settings after assigning the card to the guest.

This is a fundamental requirement for VEPA, because the switch
applied its forwarding rules based on the mac address and trusts
the hypervisor to make sure it cannot be faked by the guest.

	Arnd

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-22 17:47 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev, Mitch Williams
In-Reply-To: <201004220851.05756.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Thursday 22 April 2010, Chris Wright wrote:
> > > 
> > > ip link add link eth0 type macvlan    # for a container
> > > ip link add link eth0 type macvtap    # for qemu/vhost
> > > ip link add link eth0 type vf         # for device assignment
> > 
> > BTW, what do you mean by device assignment?
> 
> I mean giving an SR-IOV VF to the guest as a native PCI device
> rather than having qemu or vhost present a virtio-net to the
> guest.

OK, wasn't clear if you meant that or simply 100% dedicating the interface
via something like virtio.  The add_vf() idea, while neat, doesn't really
match how VF's are allocated.

> > > There are obviously significant differences between these three, but
> > > they also share enough of their properties to let us treat them
> > > in similar ways.
> > > 
> > > If we integrate the iovnl client into iproute2, the sequence for setting
> > > up an enic VF and associating it to the port profile could be
> > > 
> > > # create vf0, pass mac and vlan id to HW, no association yet
> > > ip link add link eth0 name vf0 type vf mac fe:dc:ba:12:34:56 vlan 78
> > 
> > Just to clarify...right now, the normal SR-IOV VF is already there.
> > And, or course, can have its mac addr/vlan set already.
> 
> I don't have an SR-IOV card available for testing yet. How is this
> configured now?

The device shows up in the host as a normal network device, so mgmt tools
currently treat it as if it's no different from a PF.  So that's just
plain old:

SIOCSIFHWADDR or RTM_SETLINK (i.e. normal ->ndo_set_mac_addr)

There's also the possiblity of configuring through the PF (although
this isn't really widely used ATM, and has the disadvantage of exposing
the VF number to userspace in a way that's difficult to use).  This is
also done via RTM_SETLINK (on the PF this time), and will result in
->ndo_set_vf_mac().

> > > # associate vf with port profile, mac address must match the one assigned
> > > #  to the interface before.
> > > ip iov assoc eth0 port-profile "general" host-uuid "dcf2a873-f5ee-41dd-a7ad-802a544e48c2" \
> > >        mac fe:dc:ba:12:34:56
> > 
> > At that point you could just do s/mac fe:.*/link vf0/
> 
> My point was that this information should be irrelevant to the code doing the
> association with the switch. It sort of makes sense when the receiver is enic,
> but when we send the same data to lldpad, it doesn't care about the slave device
> name but only about the mac address. Especially since the slave device might not
> be in the root name space any more, meaning we have no way to find it.

Yeah, w/ namespace I think you'd normally do all setup before handing
into a new namespace.

thanks,
-chris

^ permalink raw reply

* Re: [patch] rdma: potential ERR_PTR dereference
From: Andy Grover @ 2010-04-22 17:28 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: David S. Miller, rds-devel, netdev, kernel-janitors
In-Reply-To: <20100422095527.GQ29647@bicker>

Dan Carpenter wrote:
> In the original code, the "goto out" calls "rdma_destroy_id(cm_id);"
> That isn't needed here and would cause problems because "cm_id" is an 
> ERR_PTR.  The new code just returns directly.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Thanks, Dan.

Acked-by: Andy Grover <andy.grover@oracle.com>

-- Andy

^ permalink raw reply

* [PATCH v2] tcp: fix outsegs stat for TSO segments
From: Tom Herbert @ 2010-04-22 17:00 UTC (permalink / raw)
  To: davem, netdev

Account for TSO segments of an skb in TCP_MIB_OUTSEGS counter.  Without
doing this, the counter can be off by orders of magnitude from the
actual number of segments sent.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/net/snmp.h b/include/net/snmp.h
index 884fdbb..92456f1 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -133,6 +133,8 @@ struct linux_xfrm_mib {
 			__this_cpu_add(mib[0]->mibs[field], addend)
 #define SNMP_ADD_STATS_USER(mib, field, addend)	\
 			this_cpu_add(mib[1]->mibs[field], addend)
+#define SNMP_ADD_STATS(mib, field, addend)	\
+			this_cpu_add(mib[0]->mibs[field], addend)
 /*
  * Use "__typeof__(*mib[0]) *ptr" instead of "__typeof__(mib[0]) ptr"
  * to make @ptr a non-percpu pointer.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70c5159..91640fe 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -294,6 +294,7 @@ extern struct proto tcp_prot;
 #define TCP_INC_STATS_BH(net, field)	SNMP_INC_STATS_BH((net)->mib.tcp_statistics, field)
 #define TCP_DEC_STATS(net, field)	SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
 #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+#define TCP_ADD_STATS(net, field, val)	SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
 
 extern void			tcp_v4_err(struct sk_buff *skb, u32);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2b7d71f..8ce0f99 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -888,7 +888,8 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		tcp_event_data_sent(tp, skb, sk);
 
 	if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
-		TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
+		TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS,
+			      tcp_skb_pcount(skb));
 
 	err = icsk->icsk_af_ops->queue_xmit(skb);
 	if (likely(err <= 0))
@@ -2503,7 +2504,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	th->window = htons(min(req->rcv_wnd, 65535U));
 	tcp_options_write((__be32 *)(th + 1), tp, &opts);
 	th->doff = (tcp_header_size >> 2);
-	TCP_INC_STATS(sock_net(sk), TCP_MIB_OUTSEGS);
+	TCP_ADD_STATS(sock_net(sk), TCP_MIB_OUTSEGS, tcp_skb_pcount(skb));
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Okay, we have all we need - do the md5 hash if needed */

^ permalink raw reply related

* Re: DDoS attack causing bad effect on conntrack searches
From: Paul E. McKenney @ 2010-04-22 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <1271952128.7895.5851.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 06:02:08PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> > On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > > 
> > > > If one hash slot is under attack, then there is a bug somewhere.
> > > > 
> > > > If we cannot avoid this, we can fallback to a secure mode at the second
> > > > retry, and take the spinlock.
> > > > 
> > > > Tis way, most of lookups stay lockless (one pass), and some might take
> > > > the slot lock to avoid the possibility of a loop.
> > > > 
> > > > I suspect a bug elsewhere, quite frankly !
> > > > 
> > > > We have a chain that have an end pointer that doesnt match the expected
> > > > one.
> > > > 
> > > 
> > > On normal situation, we always finish the lookup :
> > > 
> > > 1) If we found the thing we were looking at.
> > > 
> > > 2) We get the list end (item not found), we then check if it is the
> > > expected end.
> > > 
> > > It is _not_ the expected end only if some writer deleted/inserted an
> > > element in _this_ chain during our lookup.
> > 
> > So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> > elements?  (Not obvious from the code, but my ignorance of the networking
> > code is such that many things in that part of the kernel are not obvious
> > to me, I am afraid.)
> 
> Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

OK, that will do it!!!  ;-)

One way of throttling the bad effects of updates on readers is to
periodically force updates through a grace period.  But this seems
to be a very big hammer, and likely to have little practical effect.

Another approach would be to have multiple list pointers per element,
so that a given element could be reused a small number of times without
messing up concurrent readers (sort of like Herbert's resizable hash
table).

But as you say, if some other bug is really behind this, better to fix
that bug than to work around it.

> > Otherwise, of course you would simply allow deleted elements to continue
> > pointing where they did previously, so that concurrent readers would not
> > miss anything.
> 
> > Of course, the same potential might arise on insertion, but it is usually
> > OK to miss an element that was inserted after you started searching.
> > 
> > > Because our lookup is lockless, we then have to redo it because we might
> > > miss the object we are looking for.
> > 
> > Ah...  Is there also a resize operation?  Herbert did do a resizable
> > hash table recently, but I was under the impression that (1) it was in
> > some other part of the networking stack and (2) it avoided the need to
> > restart readers.
> > 
> > > If we can do the 'retry' a 10 times, it means the attacker was really
> > > clever enough to inject new packets (new conntracks) at the right
> > > moment, in the right hash chain, and this sounds so higly incredible
> > > that I cannot believe it at all :)
> > 
> > Or maybe the DoS attack is injecting so many new conntracks that a large
> > fraction of the hash chains are being modified at any given time?
> 
> maybe hash table has one slot :)

;-) ;-) ;-)

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/3] IPv6: Generic TTL Security Mechanism (original version)
From: Stephen Hemminger @ 2010-04-22 16:23 UTC (permalink / raw)
  To: davem; +Cc: Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard, netdev
In-Reply-To: <20100403232922.489187907@vyatta.com>

On Sat, 03 Apr 2010 16:21:04 -0700
Stephen Hemminger <shemminger@vyatta.com> wrote:

> This patch adds IPv6 support for RFC5082 Generalized TTL
> Security Mechanism.  
> 
> The original proposed code; the IPV6 and IPV4 socket options are seperate.
> With this method, the server does have to deal with both IPv4 and IPv6
> socket options and the client has to handle the different for each
> family.
> 
> On client:
> 	int ttl = 255;
> 	getaddrinfo(argv[1], argv[2], &hint, &result);
> 
> 	for (rp = result; rp != NULL; rp = rp->ai_next) {
> 		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
> 		if (s < 0) continue;
> 
> 		if (rp->ai_family == AF_INET) {
> 			setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
> 		} else if (rp->ai_family == AF_INET6) {
> 			setsockopt(s, IPPROTO_IPV6,  IPV6_UNICAST_HOPS, 
> 					&ttl, sizeof(ttl)))
> 		}
> 			
> 		if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
> 		   ...
> 
> On server:
> 	int minttl = 255 - maxhops;
>    
> 	getaddrinfo(NULL, port, &hints, &result);
> 	for (rp = result; rp != NULL; rp = rp->ai_next) {
> 		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
> 		if (s < 0) continue;
> 
> 		if (rp->ai_family == AF_INET6)
> 			setsockopt(s, IPPROTO_IPV6,  IPV6_MINHOPCOUNT,
> 					&minttl, sizeof(minttl));
> 		setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
> 			
> 		if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
> 			break
> ..
> 
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Dave: Yoshifuji and I agree this is the best solution, how come the patch
hasn't been applied?

^ permalink raw reply

* Re: [PATCH] NIU support for skb->rxhash
From: Stephen Hemminger @ 2010-04-22 16:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100422.042157.99869295.davem@davemloft.net>

On Thu, 22 Apr 2010 04:21:57 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> 
> But it turns out using it is largely pointless since the only way to
> get the hash value(s) is through a structure which is prepended to the
> packet data (so we take a cache miss on the packet data anyways)
> instead of being able to fetch it out of the RX descriptors :-/
> 
> If anyone out there is trying to design sane hardware, please put the
> following into your RX descriptors:
> 
> 1) ethernet protocol type (u16)
> 2) a flag bit indicating if the packet destination matched one
>    of the programmed unicast MAC addresses
> 3) a flag bit indicating "multicast"
> 4) a flag bit indicating "broadcast"
> 5) at least 32-bits of the computed flow hash (u32)
> 
> kthx, bye!
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Could you make configurable via ethtool like I did for sky2.

P.s: where is that patch seems lost in patchwork

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Stephen Hemminger @ 2010-04-22 16:17 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Herbert Xu, David Miller, yoshfuji, netdev
In-Reply-To: <20100422154908.GA31568@midget.suse.cz>

On Thu, 22 Apr 2010 17:49:08 +0200
Jiri Bohac <jbohac@suse.cz> wrote:

> On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> > This patch fixes this by using the DADFAILED bit to synchronise
> > the two paths while holding the ifp lock.  It relies on the fact
> > that the TENTATIVE bit is always set during DAD, and that the
> > DADFAILED bit is only set on failure.
> 
> But the addr_dad_failure()->...->ipv6_del_addr() path will
> still race with any other path calling ipv6_del_addr() (e.g. a
> manual address removal). Won't it?
> 
> I still don't see why __ipv6_ifa_notify() needs to call
> dst_free(). Shouldn't that be dst_release() instead, to drop the
> reference obtained by dst_hold(&ifp->rt->u.dst)?

Yes, some more locking and race condition management is needed.

Something like the following (untested):

--- a/net/ipv6/addrconf.c	2010-04-22 09:11:54.594827858 -0700
+++ b/net/ipv6/addrconf.c	2010-04-22 09:15:59.224631752 -0700
@@ -720,13 +720,18 @@ static void ipv6_del_addr(struct inet6_i
 
 	hash = ipv6_addr_hash(&ifp->addr);
 
+	write_lock_bh(&idev->lock);
+	if (ifp->dead) {
+		write_unlock(&idev->lock); /* lost race with DAD */
+		return;
+	}
+
 	ifp->dead = 1;
 
-	spin_lock_bh(&addrconf_hash_lock);
+	spin_lock(&addrconf_hash_lock);
 	hlist_del_init_rcu(&ifp->addr_lst);
-	spin_unlock_bh(&addrconf_hash_lock);
+	spin_unlock(&addrconf_hash_lock);
 
-	write_lock_bh(&idev->lock);
 #ifdef CONFIG_IPV6_PRIVACY
 	if (ifp->flags&IFA_F_TEMPORARY) {
 		list_del(&ifp->tmp_list);




^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 16:02 UTC (permalink / raw)
  To: paulmck
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <20100422155123.GA2524@linux.vnet.ibm.com>

Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > 
> > > If one hash slot is under attack, then there is a bug somewhere.
> > > 
> > > If we cannot avoid this, we can fallback to a secure mode at the second
> > > retry, and take the spinlock.
> > > 
> > > Tis way, most of lookups stay lockless (one pass), and some might take
> > > the slot lock to avoid the possibility of a loop.
> > > 
> > > I suspect a bug elsewhere, quite frankly !
> > > 
> > > We have a chain that have an end pointer that doesnt match the expected
> > > one.
> > > 
> > 
> > On normal situation, we always finish the lookup :
> > 
> > 1) If we found the thing we were looking at.
> > 
> > 2) We get the list end (item not found), we then check if it is the
> > expected end.
> > 
> > It is _not_ the expected end only if some writer deleted/inserted an
> > element in _this_ chain during our lookup.
> 
> So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> elements?  (Not obvious from the code, but my ignorance of the networking
> code is such that many things in that part of the kernel are not obvious
> to me, I am afraid.)
> 

Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

> Otherwise, of course you would simply allow deleted elements to continue
> pointing where they did previously, so that concurrent readers would not
> miss anything.
> 




> Of course, the same potential might arise on insertion, but it is usually
> OK to miss an element that was inserted after you started searching.
> 
> > Because our lookup is lockless, we then have to redo it because we might
> > miss the object we are looking for.
> 
> Ah...  Is there also a resize operation?  Herbert did do a resizable
> hash table recently, but I was under the impression that (1) it was in
> some other part of the networking stack and (2) it avoided the need to
> restart readers.
> 
> > If we can do the 'retry' a 10 times, it means the attacker was really
> > clever enough to inject new packets (new conntracks) at the right
> > moment, in the right hash chain, and this sounds so higly incredible
> > that I cannot believe it at all :)
> 
> Or maybe the DoS attack is injecting so many new conntracks that a large
> fraction of the hash chains are being modified at any given time?
> 
> 							Thanx, Paul

maybe hash table has one slot :)


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-22 16:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100422145640.GB3228@redhat.com>

On Thu, Apr 22, 2010 at 10:56:40AM -0400, Vivek Goyal wrote:
> On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:
> 
> [..]
> > > [    3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > [    3.116754] ---------------------------------------------------
> > > [    3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check()
> > > without protection!
> > > [    3.116754]
> > > [    3.116754] other info that might help us debug this:
> > > [    3.116754]
> > > [    3.116754]
> > > [    3.116754] rcu_scheduler_active = 1, debug_locks = 1
> > > [    3.116754] 2 locks held by async/1/666:
> > > [    3.116754]  #0:  (&shost->scan_mutex){+.+.+.}, at:
> > > [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> > > [    3.116754]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> > > [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> > > [    3.116754]
> > > [    3.116754] stack backtrace:
> > > [    3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> > > [    3.116754] Call Trace:
> > > [    3.116754]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > > [    3.116754]  [<ffffffff8107f9b1>] css_id+0x3f/0x51
> > > [    3.116754]  [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> > > [    3.116754]  [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> > > [    3.116754]  [<ffffffff811e3445>] elevator_init+0xba/0xf5
> > > [    3.116754]  [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> > > [    3.116754]  [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> > > [    3.116754]  [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> > > [    3.116754]  [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> > > [    3.116754]  [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> > > [    3.116754]  [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> > > [    3.116754]  [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> > > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > > [    3.116754]  [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> > > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > > [    3.116754]  [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> > > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > > [    3.116754]  [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> > > [    3.116754]  [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> > > [    3.116754]  [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> > > [    3.116754]  [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> > > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > > [    3.116754]  [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> > > [    3.116754]  [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> > > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > > [    3.116754]  [<ffffffff8105713e>] kthread+0x89/0x91
> > > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > > [    3.116754]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> > > [    3.116754]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> > > [    3.116754]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> > > [    3.116754]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
> > 
> > I cannot convince myself that the above access is safe.  Vivek, Nauman,
> > thoughts?
> 
> Hi Paul,
> 
> blkiocg_add_blkio_group() is called from two paths.
> 
> First one is following. This path should be safe as it takes rcu read
> lock.
> 
> cfq_get_cfqg()
> 	rcu_read_lock()
> 	cfq_find_alloc_cfqg()
> 		blkiocg_add_blkio_group()
> 	rcu_read_unlock()
> 
> Second one is as shown in above backtrace.
> 
> cfq_init_queue()
> 	blkiocg_add_blkio_group().
> 
> This path is called at request queue and cfq initialization time and
> we access only root cgroup (root blkio_cgroup). As root cgroup can't
> go away, do we have to protect that call also using rcu_read_lock()?

You are correct, if the root cgroup cannot go away and if we only access
the root cgroup, then rcu_read_lock() is not required.

> So I guess it is not unsafe but propably we need to fix the warning, I
> should wrap second call to blkiocg_add_blkio_group() with
> rcu_read_lock/unlock pair?

That would work very well!

							Thanx, Paul

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Paul E. McKenney @ 2010-04-22 15:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel
In-Reply-To: <1271948029.7895.5707.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> 
> > If one hash slot is under attack, then there is a bug somewhere.
> > 
> > If we cannot avoid this, we can fallback to a secure mode at the second
> > retry, and take the spinlock.
> > 
> > Tis way, most of lookups stay lockless (one pass), and some might take
> > the slot lock to avoid the possibility of a loop.
> > 
> > I suspect a bug elsewhere, quite frankly !
> > 
> > We have a chain that have an end pointer that doesnt match the expected
> > one.
> > 
> 
> On normal situation, we always finish the lookup :
> 
> 1) If we found the thing we were looking at.
> 
> 2) We get the list end (item not found), we then check if it is the
> expected end.
> 
> It is _not_ the expected end only if some writer deleted/inserted an
> element in _this_ chain during our lookup.

So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
elements?  (Not obvious from the code, but my ignorance of the networking
code is such that many things in that part of the kernel are not obvious
to me, I am afraid.)

Otherwise, of course you would simply allow deleted elements to continue
pointing where they did previously, so that concurrent readers would not
miss anything.

Of course, the same potential might arise on insertion, but it is usually
OK to miss an element that was inserted after you started searching.

> Because our lookup is lockless, we then have to redo it because we might
> miss the object we are looking for.

Ah...  Is there also a resize operation?  Herbert did do a resizable
hash table recently, but I was under the impression that (1) it was in
some other part of the networking stack and (2) it avoided the need to
restart readers.

> If we can do the 'retry' a 10 times, it means the attacker was really
> clever enough to inject new packets (new conntracks) at the right
> moment, in the right hash chain, and this sounds so higly incredible
> that I cannot believe it at all :)

Or maybe the DoS attack is injecting so many new conntracks that a large
fraction of the hash chains are being modified at any given time?

							Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-22 15:49 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422142506.GA15858@gondor.apana.org.au>

On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> This patch fixes this by using the DADFAILED bit to synchronise
> the two paths while holding the ifp lock.  It relies on the fact
> that the TENTATIVE bit is always set during DAD, and that the
> DADFAILED bit is only set on failure.

But the addr_dad_failure()->...->ipv6_del_addr() path will
still race with any other path calling ipv6_del_addr() (e.g. a
manual address removal). Won't it?

I still don't see why __ipv6_ifa_notify() needs to call
dst_free(). Shouldn't that be dst_release() instead, to drop the
reference obtained by dst_hold(&ifp->rt->u.dst)?

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply

* Re: [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Ben Hutchings @ 2010-04-22 15:41 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: tglx@linutronix.de, davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <Pine.WNT.4.64.1004220459110.3324@PPWASKIE-MOBL2.amr.corp.intel.com>

On Thu, 2010-04-22 at 05:11 -0700, Peter P Waskiewicz Jr wrote:
> On Wed, 21 Apr 2010, Ben Hutchings wrote:
> 
> > On Tue, 2010-04-20 at 11:01 -0700, Peter P Waskiewicz Jr wrote:
> >> This patch adds a callback function pointer to the irq_desc
> >> structure, along with a registration function and a read-only
> >> proc entry for each interrupt.
> >>
> >> This affinity_hint handle for each interrupt can be used by
> >> underlying drivers that need a better mechanism to control
> >> interrupt affinity.  The underlying driver can register a
> >> callback for the interrupt, which will allow the driver to
> >> provide the CPU mask for the interrupt to anything that
> >> requests it.  The intent is to extend the userspace daemon,
> >> irqbalance, to help hint to it a preferred CPU mask to balance
> >> the interrupt into.
> >
> > Doesn't it make more sense to have the driver follow affinity decisions
> > made from user-space?  I realise that reallocating queues is disruptive
> > and we probably don't want irqbalance to trigger that, but there should
> > be a mechanism for the administrator to trigger it.
> 
> The driver here would be assisting userspace (irqbalance) to provide 
> better details how the HW is laid out with respect to flows.  As it stands 
> today, irqbalance is almost guaranteed to move interrups to CPUs that are 
> not aligned with where applications are running for network adapters. 
> This is very apparent when running at speeds in the 10 Gigabit range, or 
> even multiple 1 Gigabit ports running at the same time.

I'm well aware that irqbalance isn't making good decisions at the
moment.  The question is whether this will really help irqbalance to do
better.

[...]
> > This just assigns IRQs to the first n CPU threads.  Depending on the
> > enumeration order, this might result in assigning an IRQ to each of 2
> > threads on a core while leaving other cores unused!
> 
> This ixgbe patch is only meant to be an example of how you could use it. 
> I didn't hammer out all the corner cases of interrupt alignment in it yet. 
> However, ixgbe is already aligning Tx flows onto the CPU/queue pair the Tx 
> occurred (i.e. Tx session from CPU 4 will be queued on Tx queue 4),
[...]

OK, now I remember ixgbe has this odd select_queue() implementation.
But this behaviour can result in reordering whenever a user thread
migrates, and in any case Dave discourages people from setting
select_queue().  So I see that these changes would be useful for ixgbe
(together with an update to irqbalance), but they don't seem to fit the
general direction of multiqueue networking on Linux.

(Actually, the hints seem to be incomplete.  If there are more than 16
CPU threads then multiple CPU threads can map to the same queues, but it
looks like you only include the first in the queue's hint.)

An alternate approach is to use the RX queue index to drive TX queue
selection.  I posted a patch to do that earlier this week.  However I
haven't yet had a chance to try that on a suitably large system.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Vivek Goyal @ 2010-04-22 14:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100421213543.GO2563@linux.vnet.ibm.com>

On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:

[..]
> > [    3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [    3.116754] ---------------------------------------------------
> > [    3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check()
> > without protection!
> > [    3.116754]
> > [    3.116754] other info that might help us debug this:
> > [    3.116754]
> > [    3.116754]
> > [    3.116754] rcu_scheduler_active = 1, debug_locks = 1
> > [    3.116754] 2 locks held by async/1/666:
> > [    3.116754]  #0:  (&shost->scan_mutex){+.+.+.}, at:
> > [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> > [    3.116754]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> > [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> > [    3.116754]
> > [    3.116754] stack backtrace:
> > [    3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> > [    3.116754] Call Trace:
> > [    3.116754]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > [    3.116754]  [<ffffffff8107f9b1>] css_id+0x3f/0x51
> > [    3.116754]  [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> > [    3.116754]  [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> > [    3.116754]  [<ffffffff811e3445>] elevator_init+0xba/0xf5
> > [    3.116754]  [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> > [    3.116754]  [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> > [    3.116754]  [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> > [    3.116754]  [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> > [    3.116754]  [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> > [    3.116754]  [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> > [    3.116754]  [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [    3.116754]  [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [    3.116754]  [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> > [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> > [    3.116754]  [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> > [    3.116754]  [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> > [    3.116754]  [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> > [    3.116754]  [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [    3.116754]  [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> > [    3.116754]  [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> > [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> > [    3.116754]  [<ffffffff8105713e>] kthread+0x89/0x91
> > [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [    3.116754]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> > [    3.116754]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> > [    3.116754]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> > [    3.116754]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10
> 
> I cannot convince myself that the above access is safe.  Vivek, Nauman,
> thoughts?

Hi Paul,

blkiocg_add_blkio_group() is called from two paths.

First one is following. This path should be safe as it takes rcu read
lock.

cfq_get_cfqg()
	rcu_read_lock()
	cfq_find_alloc_cfqg()
		blkiocg_add_blkio_group()
	rcu_read_unlock()

Second one is as shown in above backtrace.

cfq_init_queue()
	blkiocg_add_blkio_group().

This path is called at request queue and cfq initialization time and
we access only root cgroup (root blkio_cgroup). As root cgroup can't
go away, do we have to protect that call also using rcu_read_lock()?

So I guess it is not unsafe but propably we need to fix the warning, I
should wrap second call to blkiocg_add_blkio_group() with
rcu_read_lock/unlock pair?

Thanks
Vivek

^ permalink raw reply

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-22 14:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <1271946805.7895.5658.camel@edumazet-laptop>

On Thu, Apr 22, 2010 at 10:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> It does make a difference, Damn it.
>
> I really really start to think you dont read what I wrote, or you dont
> care.

I misunderstood it. Sorry.

>
> Damn, cant you update all the things at once, taking this lock only
> once ?
>
> You focus having an ultra precise count of pkt_queue.len, but we dont
> care at all ! We only want a _limit_, or else the box can be killed by
> DOS.
>
> If in practice this limit can be 2*limit, thats OK.
>
> Cant you understand this ?
>
>
> We need one limit. Not two limits.
>
> I already told you how to do it, but you ignored me and started yet
> another convoluted thing.
>
>
> process_backlog() transfert the queue to its own queue and reset pkt_len
> to 0 (Only once)
>
> End of story.
>
> Maximum packet queued to this cpu softnet_data will be 2 * old_limit.
>
> So what ?
>

Now, I think I really understand. We don't need a precious limit. So
only a additional queue is enough.

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:53 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney
In-Reply-To: <1271946961.7895.5665.camel@edumazet-laptop>

Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :

> If one hash slot is under attack, then there is a bug somewhere.
> 
> If we cannot avoid this, we can fallback to a secure mode at the second
> retry, and take the spinlock.
> 
> Tis way, most of lookups stay lockless (one pass), and some might take
> the slot lock to avoid the possibility of a loop.
> 
> I suspect a bug elsewhere, quite frankly !
> 
> We have a chain that have an end pointer that doesnt match the expected
> one.
> 

On normal situation, we always finish the lookup :

1) If we found the thing we were looking at.

2) We get the list end (item not found), we then check if it is the
expected end.

It is _not_ the expected end only if some writer deleted/inserted an
element in _this_ chain during our lookup.

Because our lookup is lockless, we then have to redo it because we might
miss the object we are looking for.

If we can do the 'retry' a 10 times, it means the attacker was really
clever enough to inject new packets (new conntracks) at the right
moment, in the right hash chain, and this sounds so higly incredible
that I cannot believe it at all :)




^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-04-22 14:36 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney
In-Reply-To: <4BD04C74.9020402@trash.net>

Le jeudi 22 avril 2010 à 15:17 +0200, Patrick McHardy a écrit :
> Changli Gao wrote:
> > On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
> >> At an unnamed ISP, we experienced a DDoS attack against one of our
> >> customers.  This attack also caused problems for one of our Linux
> >> based routers.
> >>
> >> The attack was "only" generating 300 kpps (packets per sec), which
> >> usually isn't a problem for this (fairly old) Linux Router.  But the
> >> conntracking system chocked and reduced pps processing power to
> >> 40kpps.
> >>
> >> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
> >> searches in the period exploded, to a stunning 700.000 searches per
> >> sec.
> >>
> >> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
> >>
> >> First I though it might be caused by bad hashing, but after reading
> >> the kernel code (func: __nf_conntrack_find()), I think its caused by
> >> the loop restart (goto begin) of the conntrack search, running under
> >> local_bh_disable().  These RCU changes to conntrack were introduced in
> >> ea781f19 by Eric Dumazet.
> >>
> >> Code: net/netfilter/nf_conntrack_core.c
> >> Func: __nf_conntrack_find()
> >>
> >> struct nf_conntrack_tuple_hash *
> >> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> >> {
> >>        struct nf_conntrack_tuple_hash *h;
> >>        struct hlist_nulls_node *n;
> >>        unsigned int hash = hash_conntrack(tuple);
> >>
> >>        /* Disable BHs the entire time since we normally need to disable them
> >>         * at least once for the stats anyway.
> >>         */
> >>        local_bh_disable();
> >> begin:
> >>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
> >>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
> >>                        NF_CT_STAT_INC(net, found);
> >>                        local_bh_enable();
> >>                        return h;
> >>                }
> >>                NF_CT_STAT_INC(net, searched);
> >>        }
> >>        /*
> >>         * if the nulls value we got at the end of this lookup is
> >>         * not the expected one, we must restart lookup.
> >>         * We probably met an item that was moved to another chain.
> >>         */
> >>        if (get_nulls_value(n) != hash)
> >>                goto begin;
> >>        local_bh_enable();
> >>
> > 
> > We should add a retry limit there.
> 
> We can't do that since that would allow false negatives.

If one hash slot is under attack, then there is a bug somewhere.

If we cannot avoid this, we can fallback to a secure mode at the second
retry, and take the spinlock.

Tis way, most of lookups stay lockless (one pass), and some might take
the slot lock to avoid the possibility of a loop.

I suspect a bug elsewhere, quite frankly !

We have a chain that have an end pointer that doesnt match the expected
one.




--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22 14:33 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, jamal, Tom Herbert, netdev
In-Reply-To: <y2n412e6f7f1004220606id324dc9bj2cc04cfbad50a101@mail.gmail.com>

Le jeudi 22 avril 2010 à 21:06 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 7:37 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Please reorder things better.
> >
> > Most likely this function is called for one packet.
> >
> > In your version you take twice the rps_lock()/rps_unlock() path, so
> > it'll be slower.
> >
> > Once to 'transfert' one list to process list
> >
> > Once to be able to do the 'label out:' post processing.
> >
> 
> It doesn't make any difference. We have to hold rps_lock to update
> input_pkt_queue_len, if we don't use another variable to record the
> length of the process queue, or atomic variable.
> 

It does make a difference, Damn it.

I really really start to think you dont read what I wrote, or you dont
care.

Damn, cant you update all the things at once, taking this lock only
once ?

You focus having an ultra precise count of pkt_queue.len, but we dont
care at all ! We only want a _limit_, or else the box can be killed by
DOS.

If in practice this limit can be 2*limit, thats OK. 

Cant you understand this ?

> I think it is better that using another variable to record the length
> of the process queue, and updating it before process_backlog()
> returns. For one packet, there is only one locking/unlocking. There is
> only one issue you concerned: cache miss due to sum the two queues'
> length. I'll change softnet_data to:
> 
> struct softnet_data {
>         struct Qdisc            *output_queue;
>         struct list_head        poll_list;
>         struct sk_buff          *completion_queue;
>         struct sk_buff_head     process_queue;
> 
> #ifdef CONFIG_RPS
>         struct softnet_data     *rps_ipi_list;
> 
>         /* Elements below can be accessed between CPUs for RPS */
>         struct call_single_data csd ____cacheline_aligned_in_smp;
>         struct softnet_data     *rps_ipi_next;
>         unsigned int            cpu;
>         unsigned int            input_queue_head;
> #endif
>         unsigned int            process_queue_len;
>         struct sk_buff_head     input_pkt_queue;
>         struct napi_struct      backlog;
> };
> 
> For one packets, we have to update process_queue_len in any way. For
> more packets, we only change process_queue_len just before
> process_backlog() returns. It means that process_queue_len change is
> batched.
> 

We need one limit. Not two limits.

I already told you how to do it, but you ignored me and started yet
another convoluted thing.

process_backlog() transfert the queue to its own queue and reset pkt_len
to 0 (Only once)

End of story.

Maximum packet queued to this cpu softnet_data will be 2 * old_limit.

So what ?

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Herbert Xu @ 2010-04-22 14:25 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger
In-Reply-To: <20100422.004324.67422011.davem@davemloft.net>

On Thu, Apr 22, 2010 at 12:43:24AM -0700, David Miller wrote:
>
> Thanks Herbert.

No worries :)

BTW similar races exist in other NDISC receive functions, but
it's too late today so I'll look at this tomorrow unless someone
else wants to have a go at this.

ipv6: Prevent DAD races

The NDISC receive path has not been written in a way that handles
unexpected packets properly.

For example, if we get two identical simultaneous NA/NS packets
that result in a DAD failure, we may try to delete the same address
twice.

A similar problem occurs when we get a DAD failure just as we're
about to mark an address as having passed DAD.

This patch fixes this by using the DADFAILED bit to synchronise
the two paths while holding the ifp lock.  It relies on the fact
that the TENTATIVE bit is always set during DAD, and that the
DADFAILED bit is only set on failure.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index de7a194..1d15d5e 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1401,6 +1401,16 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp, int dad_failed)
 void addrconf_dad_failure(struct inet6_ifaddr *ifp)
 {
 	struct inet6_dev *idev = ifp->idev;
+	int ignore;
+
+	spin_lock(&ifp->lock);
+	ignore = (ifp->flags & (IFA_F_DADFAILED|IFA_F_TENTATIVE)) ^
+		 IFA_F_TENTATIVE;
+	ifp->flags |= IFA_F_DADFAILED;
+	spin_unlock(&ifp->lock);
+
+	if (ignore)
+		return;

 	if (net_ratelimit())
 		printk(KERN_INFO "%s: IPv6 duplicate address %pI6c detected!\n",
@@ -2789,7 +2799,10 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 	read_lock_bh(&idev->lock);
 	if (ifp->dead)
 		goto out;
+
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED)
+		goto unlock_ifp;

 	if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) ||
 	    idev->cnf.accept_dad < 1 ||
@@ -2824,6 +2837,7 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 		ip6_ins_rt(ifp->rt);

 	addrconf_dad_kick(ifp);
+unlock_ifp:
 	spin_unlock_bh(&ifp->lock);
 out:
 	read_unlock_bh(&idev->lock);
@@ -2841,6 +2855,11 @@ static void addrconf_dad_timer(unsigned long data)
 		goto out;
 	}
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED) {
+		spin_unlock_bh(&ifp->lock);
+		read_unlock_bh(&idev->lock);
+		goto out;
+	}
 	if (ifp->probes == 0) {
 		/*
 		 * DAD was successful

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: [PATCH v5] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-22 14:24 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, hadi, therbert, netdev
In-Reply-To: <t2t412e6f7f1004220527mb5340bebye1d1fda0963dcf2a@mail.gmail.com>

Le jeudi 22 avril 2010 à 20:27 +0800, Changli Gao a écrit :
> On Thu, Apr 22, 2010 at 5:43 PM, David Miller <davem@davemloft.net> wrote:
> > From: Changli Gao <xiaosuo@gmail.com>
> > Date: Thu, 22 Apr 2010 17:09:17 +0800
> >
> >> +     unsigned int            input_pkt_queue_len;
> >> +     struct sk_buff          *input_pkt_queue_head;
> >> +     struct sk_buff          **input_pkt_queue_tailp;
> >> +
> >
> > Please do not ignore Stephen Hemminger's feedback.
> >
> > We already have enough odd SKB queue implementations, we
> > do not need yet another one in a core location.  This makes
> > it harder and harder to eventually convert sk_buff to use
> > "struct list_head".
> >
> > Instead, use "struct sk_buff_head" and the lockless accessors
> > (__skb_insert, etc.) and initializer (__skb_queue_head_init).
> >
> 
> If I want to keep softnet_data small, I have to access the internal
> fields of sk_buff_head, and modify them in a hack way. It doesn't
> sound good. If not, softnet_data will become:
> 
> struct softnet_data {
>         struct Qdisc            *output_queue;
>         struct list_head        poll_list;
>         struct sk_buff          *completion_queue;
>         struct sk_buff_head     process_queue;
> 
> #ifdef CONFIG_RPS
>         struct softnet_data     *rps_ipi_list;
> 
>         /* Elements below can be accessed between CPUs for RPS */
>         struct call_single_data csd ____cacheline_aligned_in_smp;
>         struct softnet_data     *rps_ipi_next;
>         unsigned int            cpu;
>         unsigned int            input_queue_head;
> #endif
>         unsigned int            input_pkt_queue_len;
>         struct sk_buff_head     input_pkt_queue;
>         struct napi_struct      backlog;
> };
> 
> Eric, do you think it is too fat?

No it is not, as long as we are careful to separate cache lines.

By the way, input_pkt_queue_len is already in input_ptk_queue (.len)

Thanks



^ permalink raw reply

* Re: Bug#577640: linux-image-2.6.33-2-amd64: Kernel warnings in netns thread
From: Martín Ferrari @ 2010-04-22 14:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ben Hutchings, 577640, netdev, Eric W. Biederman, Alexey Dobriyan,
	Mathieu Lacage
In-Reply-To: <m1sk6oky96.fsf@fess.ebiederm.org>

On Thu, Apr 22, 2010 at 04:38, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>> > $ sudo ./startns bash
>>> > # ip l a type veth
>>> > # ip l s veth0 netns 1
>>> > # exit

> Then I should ask what is startns?

That's just a simple C program that calls unshare(CLONE_NEWNET) and
execs a shell.

> Either that is doing something different from my equivalent program, or I have
> patches to fix this, that just haven't been merged yet.

I have just downloaded and compiled 2.6.32-2 and 2.6.34-rc5 from
kernel.org using the .config from the debian package, and the oops is
reproducible in both.

This small C file reproduces the error every time:

$ cat netnsoops.c
#include <stdio.h>
#include <stdlib.h>
#define _GNU_SOURCE
#include <sched.h>

int main(int argc, char *argv[])
{	
	int c;
	unsigned long flags = CLONE_NEWNET;

	if(unshare(flags) == -1) {
		perror("unshare");
		return 1;
	}
	system("ip link add name FOO type veth peer name BAR");
	system("ip link set FOO netns 1");
	system("ip link show");
	return 0;
}


-- 
Martín Ferrari

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Jesper Dangaard Brouer @ 2010-04-22 13:31 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Patrick McHardy, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney
In-Reply-To: <q2h412e6f7f1004220613m488c2ee4r6d24a8d1e65997d4@mail.gmail.com>


I have added a stats counter to prove my case, which I think we should add to the kernel (to detect the case in the future).
The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.



[PATCH] net: netfilter conntrack extended with extra stat counter.

From: Jesper Dangaard Brouer <hawk@comx.dk>

I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.

Adding a stats counter to see if the search is restarted too often.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---

 include/linux/netfilter/nf_conntrack_common.h      |    1 +
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   |    7 ++++---
 net/netfilter/nf_conntrack_core.c                  |    4 +++-
 net/netfilter/nf_conntrack_standalone.c            |    7 ++++---
 4 files changed, 12 insertions(+), 7 deletions(-)


diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index a8248ee..57ff312 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -93,6 +93,7 @@ struct ip_conntrack_stat
 	unsigned int expect_new;
 	unsigned int expect_create;
 	unsigned int expect_delete;
+	unsigned int search_restart;
 };
 
 /* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 8668a3d..b94f510 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -358,7 +358,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->search_restart
 		);
 	return 0;
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 4299db7..5ca8286 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -315,8 +315,10 @@ begin:
 	 * not the expected one, we must restart lookup.
 	 * We probably met an item that was moved to another chain.
 	 */
-	if (get_nulls_value(n) != hash)
+	if (get_nulls_value(n) != hash) {
+		NF_CT_STAT_INC(net, search_restart);
 		goto begin;
+	}
 	local_bh_enable();
 
 	return NULL;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 1935153..4812750 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -245,12 +245,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -267,7 +267,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->search_restart
 		);
 	return 0;
 }



-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related

* [PATCHv2] Socket filter ancilliary data access for skb->dev->type
From: Paul LeoNerd Evans @ 2010-04-22 13:32 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1712 bytes --]

Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving
access to skb->dev->type, as reported in the sll_hatype field.

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on. This patch adds such ability.

This patch also handles the case where skb->dev can be NULL, such as on
netlink sockets.

Signed-off-by: Paul Evans <leonerd@leonerd.org.uk>

---

diff -ur linux-2.6.33.2.orig/include/linux/filter.h linux-2.6.33.2/include/linux/filter.h
--- linux-2.6.33.2.orig/include/linux/filter.h	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/include/linux/filter.h	2010-04-20 22:40:25.000000000 +0100
@@ -123,7 +123,8 @@
 #define SKF_AD_NLATTR_NEST	16
 #define SKF_AD_MARK 	20
 #define SKF_AD_QUEUE	24
-#define SKF_AD_MAX	28
+#define SKF_AD_HATYPE	28
+#define SKF_AD_MAX	32
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
diff -ur linux-2.6.33.2.orig/net/core/filter.c linux-2.6.33.2/net/core/filter.c
--- linux-2.6.33.2.orig/net/core/filter.c	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/net/core/filter.c	2010-04-22 14:19:24.000000000 +0100
@@ -301,6 +301,8 @@
 			A = skb->pkt_type;
 			continue;
 		case SKF_AD_IFINDEX:
+			if (!skb->dev)
+				return 0;
 			A = skb->dev->ifindex;
 			continue;
 		case SKF_AD_MARK:
@@ -309,6 +311,11 @@
 		case SKF_AD_QUEUE:
 			A = skb->queue_mapping;
 			continue;
+		case SKF_AD_HATYPE:
+			if (!skb->dev)
+				return 0;
+			A = skb->dev->type;
+			continue;
 		case SKF_AD_NLATTR: {
 			struct nlattr *nla;
 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox