Netdev List
 help / color / mirror / Atom feed
* [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Torsten Schmidt @ 2009-10-19 19:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Skip loopback checksum in ip_rcv() for speed optimisation.

Signed-off-by: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
---
 net/ipv4/ip_input.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 6c98b43..dc72286 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -406,7 +406,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 *
 	 *	1.	Length at least the size of an ip header
 	 *	2.	Version of 4
-	 *	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
+	 *	3.	Checksums correctly. [Speed optimisation: skip loopback checksums]
 	 *	4.	Doesn't have a bogus length
 	 */
 
@@ -418,8 +418,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	iph = ip_hdr(skb);
 
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
+	if (!ipv4_is_loopback(iph->daddr))
+		if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
+			goto inhdr_error;
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-- 

^ permalink raw reply related

* Re: [PATCH] tcp: fix TCP_DEFER_ACCEPT retrans calculation
From: Eric Dumazet @ 2009-10-19 20:16 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David S. Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910192304060.2971@u.domain.uli>

Julian Anastasov a écrit :
> 	Fix TCP_DEFER_ACCEPT conversion between seconds and
> retransmission to match the TCP SYN-ACK retransmission periods
> because the time is converted to such retransmissions. The old
> algorithm selects one more retransmission in some cases. Allow
> up to 255 retransmissions.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


Note: 255 retransmits translates to 30069 seconds.

^ permalink raw reply

* Re: TCP_DEFER_ACCEPT is missing counter update
From: Eric Dumazet @ 2009-10-19 20:17 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Julian Anastasov, David Miller, netdev
In-Reply-To: <20091019201122.GC21437@1wt.eu>

Willy Tarreau a écrit :
> On Mon, Oct 19, 2009 at 10:01:15PM +0200, Eric Dumazet wrote:
>> David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea
>> (tcp: fix tcp_defer_accept to consider the timeout) 
>> since we know its broken.
> 
> yes I agree with Eric, please revert it and wait for Julian's
> patch which gets it right. Sorry for the wrong fix, at least
> it will have brought the discussion on the table, but without
> a faulty patch it would have been better :-/
> 

Thanks again Willy, we made progress and this is nice :)


^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Eric Dumazet @ 2009-10-19 20:42 UTC (permalink / raw)
  To: Torsten Schmidt; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <200910192134.02125.schmto@hrz.tu-chemnitz.de>

Torsten Schmidt a écrit :
> Skip loopback checksum in ip_rcv() for speed optimisation.
> 
> Signed-off-by: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
> ---
>  net/ipv4/ip_input.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index 6c98b43..dc72286 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -406,7 +406,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
>  	 *
>  	 *	1.	Length at least the size of an ip header
>  	 *	2.	Version of 4
> -	 *	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
> +	 *	3.	Checksums correctly. [Speed optimisation: skip loopback checksums]
>  	 *	4.	Doesn't have a bogus length
>  	 */
>  
> @@ -418,8 +418,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
>  
>  	iph = ip_hdr(skb);
>  
> -	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> -		goto inhdr_error;
> +	if (!ipv4_is_loopback(iph->daddr))
> +		if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> +			goto inhdr_error;
>  
>  	len = ntohs(iph->tot_len);
>  	if (skb->len < len) {

This is bogus IMHO.

One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
we then accept a bogus frame. This is a RFC violation.

This also slows down non loopback devices, adding an extra test.

ip_fast_csum() is really fast (about 16 instructions).


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 20:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADCB3A4.8060408@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3819 bytes --]

Access of po->pppoe_dev is guarded by sk->sk_state & PPPOX_CONNECTED,
and all use cases now rely on the socket lock.  Because of this, the
ref-count on the namespace held by the socket object suffices to hold
the namespace in existence and so we don't need to ref-count the
namespace in PPPoE. The flush_lock is gone.

--
Michal Ostrowski
mostrows@gmail.com



On Mon, Oct 19, 2009 at 1:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Michal Ostrowski a écrit :
>> On Mon, Oct 19, 2009 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> Michal Ostrowski a écrit :
>>>> Here's a bigger patch that just gets rid of flush_lock altogether.
>>>>
>>>> We were seeing oopses due to net namespaces going away while we were using
>>>> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
>>>> counts properly.
>>>>
>>>> Fixing this requires that adding and removing entries to the per-net hash-table
>>>> requires incrementing and decrementing the ref count.  This also allows us to
>>>> get rid of the flush_lock since we can now depend on the existence of
>>>> "pn->hash_lock".
>>>>
>>>> We also have to be careful when flushing devices that removal of a hash table
>>>> entry may bring the net namespace refcount to 0.
>>>>
>>> Your patch is mangled (tabulation -> white spaces),
>>
>> Patch mangling was due to mailer interactions, I'll attach a clean
>> version here, no more inlining.
>>
>>> and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
>>> it would be a bug from core network code.
>>>
>>
>> From the original oops I was able to deduce that the namespace somehow
>> managed to get destroyed during the interval where we dropped locks.
>> If that's not due to the release_sock() call in pppoe_flush_dev()
>> triggering a cleanup then I'd have to assume that that it's due to a
>> secondary actor closing the socket in parallel, but that in turn would
>> point to issues with the flush_lock.  Having said that the thrust of
>> this patch remains valid; it just means I don't need to inc the ref
>> count in pppoe_flush_dev().
>>
>> Do you agree?
>>
>
> Not really :)
>
> I dont believe you should care of namespace, and/or mess with its refcount at all.
>
> Please dont use maybe_get_net() : This function should not ever be used in drivers/net
>
> You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if this
> assertion is false, this is not because of pppoe.
>
>
>        lock_sock(sk);
> @@ -653,10 +642,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>        if (stage_session(po->pppoe_pa.sid)) {
>                pppox_unbind_sock(sk);
>                if (po->pppoe_dev) {
> -                       pn = pppoe_pernet(dev_net(po->pppoe_dev));
> +                       struct net *old = dev_net(po->pppoe_dev);
> +                       pn = pppoe_pernet(old);
>                        delete_item(pn, po->pppoe_pa.sid,
>                                po->pppoe_pa.remote, po->pppoe_ifindex);
>                        dev_put(po->pppoe_dev);
> +                       put_net(old);
>                }
>                memset(sk_pppox(po) + 1, 0,
>                       sizeof(struct pppox_sock) - sizeof(struct sock));
>
>
> There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held
>
> So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
>
> In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
> all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
>
> pppoe_rcv_core() is not safe
> pppoe_ioctl() is not safe
> pppoe_sendmsg() is not safe
> __pppoe_xmit() is not safe
>
>

[-- Attachment #2: 0001-PPPoE-Fix-flush-close-races.patch --]
[-- Type: application/octet-stream, Size: 6458 bytes --]

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 7cbf6f9..1d0b569 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -111,9 +111,6 @@ struct pppoe_net {
 	rwlock_t hash_lock;
 };
 
-/* to eliminate a race btw pppoe_flush_dev and pppoe_release */
-static DEFINE_SPINLOCK(flush_lock);
-
 /*
  * PPPoE could be in the following stages:
  * 1) Discovery stage (to obtain remote MAC and Session ID)
@@ -303,48 +300,46 @@ static void pppoe_flush_dev(struct net_device *dev)
 	write_lock_bh(&pn->hash_lock);
 	for (i = 0; i < PPPOE_HASH_SIZE; i++) {
 		struct pppox_sock *po = pn->hash_table[i];
+		struct sock *sk;
 
-		while (po != NULL) {
-			struct sock *sk;
-			if (po->pppoe_dev != dev) {
-				po = po->next;
-				continue;
-			}
-			sk = sk_pppox(po);
-			spin_lock(&flush_lock);
-			po->pppoe_dev = NULL;
-			spin_unlock(&flush_lock);
-			dev_put(dev);
-
-			/* We always grab the socket lock, followed by the
-			 * hash_lock, in that order.  Since we should
-			 * hold the sock lock while doing any unbinding,
-			 * we need to release the lock we're holding.
-			 * Hold a reference to the sock so it doesn't disappear
-			 * as we're jumping between locks.
-			 */
+		while (po && po->pppoe_dev != dev) {
+			po = po->next;
+		}
 
-			sock_hold(sk);
+		if (po == NULL) {
+			continue;
+		}
 
-			write_unlock_bh(&pn->hash_lock);
-			lock_sock(sk);
+		sk = sk_pppox(po);
 
-			if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
-				pppox_unbind_sock(sk);
-				sk->sk_state = PPPOX_ZOMBIE;
-				sk->sk_state_change(sk);
-			}
+		/* We always grab the socket lock, followed by the hash_lock,
+		 * in that order.  Since we should hold the sock lock while
+		 * doing any unbinding, we need to release the lock we're
+		 * holding.  Hold a reference to the sock so it doesn't
+		 * disappear as we're jumping between locks.
+		 */
 
-			release_sock(sk);
-			sock_put(sk);
+		sock_hold(sk);
+		write_unlock_bh(&pn->hash_lock);
+		lock_sock(sk);
 
-			/* Restart scan at the beginning of this hash chain.
-			 * While the lock was dropped the chain contents may
-			 * have changed.
-			 */
-			write_lock_bh(&pn->hash_lock);
-			po = pn->hash_table[i];
+		if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+			pppox_unbind_sock(sk);
+			sk->sk_state = PPPOX_ZOMBIE;
+			sk->sk_state_change(sk);
+			po->pppoe_dev = NULL;
+			dev_put(dev);
 		}
+
+		release_sock(sk);
+		sock_put(sk);
+
+		/* Restart the process from the start of the current hash
+		 * chain. We dropped locks so the world may have change from
+		 * underneath us.
+		 */
+		write_lock_bh(&pn->hash_lock);
+		po = pn->hash_table[i];
 	}
 	write_unlock_bh(&pn->hash_lock);
 }
@@ -391,8 +386,8 @@ static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
 	if (sk->sk_state & PPPOX_BOUND) {
 		ppp_input(&po->chan, skb);
 	} else if (sk->sk_state & PPPOX_RELAY) {
-		relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
-						&po->pppoe_relay);
+		relay_po = get_item_by_addr(sock_net(sk),
+					    &po->pppoe_relay);
 		if (relay_po == NULL)
 			goto abort_kfree;
 
@@ -561,6 +556,7 @@ static int pppoe_release(struct socket *sock)
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po;
 	struct pppoe_net *pn;
+	struct net *net = NULL;
 
 	if (!sk)
 		return 0;
@@ -571,24 +567,19 @@ static int pppoe_release(struct socket *sock)
 		return -EBADF;
 	}
 
+	po = pppox_sk(sk);
+
+	if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+		dev_put(po->pppoe_dev);
+	}
+
 	pppox_unbind_sock(sk);
 
 	/* Signal the death of the socket. */
 	sk->sk_state = PPPOX_DEAD;
 
-	/*
-	 * pppoe_flush_dev could lead to a race with
-	 * this routine so we use flush_lock to eliminate
-	 * such a case (we only need per-net specific data)
-	 */
-	spin_lock(&flush_lock);
-	po = pppox_sk(sk);
-	if (!po->pppoe_dev) {
-		spin_unlock(&flush_lock);
-		goto out;
-	}
-	pn = pppoe_pernet(dev_net(po->pppoe_dev));
-	spin_unlock(&flush_lock);
+	net = sock_net(sk);
+	pn = pppoe_pernet(net);
 
 	/*
 	 * protect "po" from concurrent updates
@@ -596,19 +587,12 @@ static int pppoe_release(struct socket *sock)
 	 */
 	write_lock_bh(&pn->hash_lock);
 
-	po = pppox_sk(sk);
 	if (stage_session(po->pppoe_pa.sid))
 		__delete_item(pn, po->pppoe_pa.sid, po->pppoe_pa.remote,
 				po->pppoe_ifindex);
 
-	if (po->pppoe_dev) {
-		dev_put(po->pppoe_dev);
-		po->pppoe_dev = NULL;
-	}
-
 	write_unlock_bh(&pn->hash_lock);
 
-out:
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -625,8 +609,9 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	struct sock *sk = sock->sk;
 	struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev;
 	struct pppoe_net *pn;
+	struct net_device *dev = NULL;
+	struct net *net = NULL;
 	int error;
 
 	lock_sock(sk);
@@ -653,10 +638,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
 		if (po->pppoe_dev) {
-			pn = pppoe_pernet(dev_net(po->pppoe_dev));
+			struct net *old = dev_net(po->pppoe_dev);
+			pn = pppoe_pernet(old);
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
 		}
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
@@ -666,13 +653,14 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	/* Re-bind in session stage only */
 	if (stage_session(sp->sa_addr.pppoe.sid)) {
 		error = -ENODEV;
-		dev = dev_get_by_name(sock_net(sk), sp->sa_addr.pppoe.dev);
+		net = sock_net(sk);
+		dev = dev_get_by_name(net, sp->sa_addr.pppoe.dev);
 		if (!dev)
-			goto end;
+			goto err_put;
 
 		po->pppoe_dev = dev;
 		po->pppoe_ifindex = dev->ifindex;
-		pn = pppoe_pernet(dev_net(dev));
+		pn = pppoe_pernet(net);
 		write_lock_bh(&pn->hash_lock);
 		if (!(dev->flags & IFF_UP)) {
 			write_unlock_bh(&pn->hash_lock);
@@ -915,6 +903,8 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 	struct pppoe_hdr *ph;
 	int data_len = skb->len;
 
+	lock_sock(sk);
+
 	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
 		goto abort;
 
@@ -944,10 +934,11 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 			po->pppoe_pa.remote, NULL, data_len);
 
 	dev_queue_xmit(skb);
-
+	release_sock(sk);
 	return 1;
 
 abort:
+	release_sock(sk);
 	kfree_skb(skb);
 	return 1;
 }

^ permalink raw reply related

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Torsten Schmidt @ 2009-10-19 20:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List
In-Reply-To: <4ADCCF1A.7070301@gmail.com>

Eric Dumazet wrotes:
> This is bogus IMHO.
> 
> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
> we then accept a bogus frame. This is a RFC violation.
> 
> This also slows down non loopback devices, adding an extra test.
> 
> ip_fast_csum() is really fast (about 16 instructions).

Yes, you are right. So it would be better to only skip csum if *dev is 
our loopback interface ? Right ?

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-19 20:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADCB3A4.8060408@gmail.com>

[Eric Dumazet - Mon, Oct 19, 2009 at 08:44:52PM +0200]
...
| 
| There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held
| 
| So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
| 
| In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
| all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
| 
| pppoe_rcv_core() is not safe
| pppoe_ioctl() is not safe
| pppoe_sendmsg() is not safe
| __pppoe_xmit() is not safe
| 

Eric, Michal, please take a look (compile-test only).
There is still unclear moment for NETDEV_CHANGEMTU
since one could be doing this in say endless loop from
userspace calling device to flush all the time which
implies dev_put as a result :(

Comments are welcome (and complains as well). I'll be able to
continue handling (or reply to mail) tomorrow evening only.
Anyway -- this patch is ugly enough but may happen to work.

	-- Cyrill
---
 drivers/net/pppoe.c |   79 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 62 insertions(+), 17 deletions(-)

Index: linux-2.6.git/drivers/net/pppoe.c
=====================================================================
--- linux-2.6.git.orig/drivers/net/pppoe.c
+++ linux-2.6.git/drivers/net/pppoe.c
@@ -386,14 +386,19 @@ static struct notifier_block pppoe_notif
 static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct pppox_sock *relay_po;
+	struct pppox_sock *relay_po = NULL;
+	struct net *net = NULL;
 
 	if (sk->sk_state & PPPOX_BOUND) {
 		ppp_input(&po->chan, skb);
 	} else if (sk->sk_state & PPPOX_RELAY) {
-		relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
-						&po->pppoe_relay);
-		if (relay_po == NULL)
+		spin_lock(&flush_lock);
+		if (po->pppoe_dev)
+			net = dev_net(po->pppoe_dev);
+		spin_unlock(&flush_lock);
+		if (net)
+			relay_po = get_item_by_addr(net, &po->pppoe_relay);
+		if (!net || !relay_po)
 			goto abort_kfree;
 
 		if ((sk_pppox(relay_po)->sk_state & PPPOX_CONNECTED) == 0)
@@ -652,12 +657,15 @@ static int pppoe_connect(struct socket *
 	/* Delete the old binding */
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
+		spin_lock(&flush_lock);
 		if (po->pppoe_dev) {
 			pn = pppoe_pernet(dev_net(po->pppoe_dev));
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
 		}
+		spin_unlock(&flush_lock);
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
 		sk->sk_state = PPPOX_NONE;
@@ -670,7 +678,9 @@ static int pppoe_connect(struct socket *
 		if (!dev)
 			goto end;
 
+		spin_lock(&flush_lock);
 		po->pppoe_dev = dev;
+		spin_unlock(&flush_lock);
 		po->pppoe_ifindex = dev->ifindex;
 		pn = pppoe_pernet(dev_net(dev));
 		write_lock_bh(&pn->hash_lock);
@@ -708,10 +718,12 @@ end:
 	release_sock(sk);
 	return error;
 err_put:
+	spin_lock(&flush_lock);
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);
 		po->pppoe_dev = NULL;
 	}
+	spin_unlock(&flush_lock);
 	goto end;
 }
 
@@ -738,6 +750,7 @@ static int pppoe_ioctl(struct socket *so
 {
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po = pppox_sk(sk);
+	unsigned int mtu = 0;
 	int val;
 	int err;
 
@@ -746,11 +759,19 @@ static int pppoe_ioctl(struct socket *so
 		err = -ENXIO;
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
-
+		err = -EBUSY;
+		if (!spin_trylock(&flush_lock))
+			break;
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		spin_unlock(&flush_lock);
+		if (err)
+			break;
 		err = -EFAULT;
-		if (put_user(po->pppoe_dev->mtu -
-			     sizeof(struct pppoe_hdr) -
-			     PPP_HDRLEN,
+		if (put_user(mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN,
 			     (int __user *)arg))
 			break;
 		err = 0;
@@ -760,14 +781,22 @@ static int pppoe_ioctl(struct socket *so
 		err = -ENXIO;
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
-
+		err = -EBUSY;
+		if (!spin_trylock(&flush_lock))
+			break;
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		spin_unlock(&flush_lock);
+		if (err)
+			break;
 		err = -EFAULT;
 		if (get_user(val, (int __user *)arg))
 			break;
 
-		if (val < (po->pppoe_dev->mtu
-			   - sizeof(struct pppoe_hdr)
-			   - PPP_HDRLEN))
+		if (val < (mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN))
 			err = 0;
 		else
 			err = -EINVAL;
@@ -842,7 +871,7 @@ static int pppoe_sendmsg(struct kiocb *i
 	int error;
 	struct pppoe_hdr hdr;
 	struct pppoe_hdr *ph;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	char *start;
 
 	lock_sock(sk);
@@ -856,7 +885,15 @@ static int pppoe_sendmsg(struct kiocb *i
 	hdr.code = 0;
 	hdr.sid = po->num;
 
+	spin_lock(&flush_lock);
 	dev = po->pppoe_dev;
+	if (!dev) {
+		spin_unlock(&flush_lock);
+		error = -ENODEV;
+		goto end;
+	}
+	dev_hold(dev);
+	spin_unlock(&flush_lock);
 
 	error = -EMSGSIZE;
 	if (total_len > (dev->mtu + dev->hard_header_len))
@@ -899,6 +936,8 @@ static int pppoe_sendmsg(struct kiocb *i
 	dev_queue_xmit(skb);
 
 end:
+	if (dev)
+		dev_put(dev);
 	release_sock(sk);
 	return error;
 }
@@ -911,15 +950,19 @@ end:
 static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev = po->pppoe_dev;
+	struct net_device *dev;
 	struct pppoe_hdr *ph;
 	int data_len = skb->len;
 
 	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
 		goto abort;
 
-	if (!dev)
-		goto abort;
+	spin_lock(&flush_lock);
+	if (!po->pppoe_dev)
+		goto abort_unlock;
+	dev = po->pppoe_dev;
+	dev_hold(dev);
+	spin_unlock(&flush_lock);
 
 	/* Copy the data if there is no space for the header or if it's
 	 * read-only.
@@ -942,11 +985,13 @@ static int __pppoe_xmit(struct sock *sk,
 
 	dev_hard_header(skb, dev, ETH_P_PPP_SES,
 			po->pppoe_pa.remote, NULL, data_len);
-
 	dev_queue_xmit(skb);
+	dev_put(dev);
 
 	return 1;
 
+abort_unlock:
+	spin_unlock(&flush_lock);
 abort:
 	kfree_skb(skb);
 	return 1;

^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Eric Dumazet @ 2009-10-19 21:10 UTC (permalink / raw)
  To: Torsten Schmidt; +Cc: Linux Netdev List
In-Reply-To: <200910192256.52780.schmto@hrz.tu-chemnitz.de>

Torsten Schmidt a écrit :
> Eric Dumazet wrotes:
>> This is bogus IMHO.
>>
>> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
>> we then accept a bogus frame. This is a RFC violation.
>>
>> This also slows down non loopback devices, adding an extra test.
>>
>> ip_fast_csum() is really fast (about 16 instructions).
> 
> Yes, you are right. So it would be better to only skip csum if *dev is 
> our loopback interface ? Right ?

An application could send a bogus IP packet using RAW interface, and we still should
check IP checksum before delivering this packet, even on loopback device.


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 21:22 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric Dumazet, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091019205740.GD5233@lenovo>

I'm assuming that there was a race in us sending patches at nearly the same time
I'm convinced now that the flush_lock can die, and the patch I sent
out kills it.
If we were to have to have a global lock, you might as well just use
the flush_lock for everything.

Note that there's a bunch of checks for sk->sk_state vs
PPPOX_CONNECTED.  PPPOX_CONNECTED being set implies that po->pppoe_dev
is valid.   So, by ensuring that checks of sk->sk_state and
po->pppoe_dev are done under the protection of
lock_sock()/release_sock() it is not even necessary to check
po->pppoe_dev==NULL.

In fact, the NULL checks you're adding are equivalent to testing
sk->sk_state vs PPPOX_CONNECTED, and thus you're replacing the
synchronization provided by the socket lock with flush_lock.

--
Michal Ostrowski
mostrows@gmail.com



On Mon, Oct 19, 2009 at 3:57 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> [Eric Dumazet - Mon, Oct 19, 2009 at 08:44:52PM +0200]
> ...
> |
> | There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held
> |
> | So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
> |
> | In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
> | all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
> |
> | pppoe_rcv_core() is not safe
> | pppoe_ioctl() is not safe
> | pppoe_sendmsg() is not safe
> | __pppoe_xmit() is not safe
> |
>
> Eric, Michal, please take a look (compile-test only).
> There is still unclear moment for NETDEV_CHANGEMTU
> since one could be doing this in say endless loop from
> userspace calling device to flush all the time which
> implies dev_put as a result :(
>
> Comments are welcome (and complains as well). I'll be able to
> continue handling (or reply to mail) tomorrow evening only.
> Anyway -- this patch is ugly enough but may happen to work.
>
>        -- Cyrill
> ---
>  drivers/net/pppoe.c |   79 ++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 62 insertions(+), 17 deletions(-)
>
> Index: linux-2.6.git/drivers/net/pppoe.c
> =====================================================================
> --- linux-2.6.git.orig/drivers/net/pppoe.c
> +++ linux-2.6.git/drivers/net/pppoe.c
> @@ -386,14 +386,19 @@ static struct notifier_block pppoe_notif
>  static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
>  {
>        struct pppox_sock *po = pppox_sk(sk);
> -       struct pppox_sock *relay_po;
> +       struct pppox_sock *relay_po = NULL;
> +       struct net *net = NULL;
>
>        if (sk->sk_state & PPPOX_BOUND) {
>                ppp_input(&po->chan, skb);
>        } else if (sk->sk_state & PPPOX_RELAY) {
> -               relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
> -                                               &po->pppoe_relay);
> -               if (relay_po == NULL)
> +               spin_lock(&flush_lock);
> +               if (po->pppoe_dev)
> +                       net = dev_net(po->pppoe_dev);
> +               spin_unlock(&flush_lock);
> +               if (net)
> +                       relay_po = get_item_by_addr(net, &po->pppoe_relay);
> +               if (!net || !relay_po)
>                        goto abort_kfree;
>
>                if ((sk_pppox(relay_po)->sk_state & PPPOX_CONNECTED) == 0)
> @@ -652,12 +657,15 @@ static int pppoe_connect(struct socket *
>        /* Delete the old binding */
>        if (stage_session(po->pppoe_pa.sid)) {
>                pppox_unbind_sock(sk);
> +               spin_lock(&flush_lock);
>                if (po->pppoe_dev) {
>                        pn = pppoe_pernet(dev_net(po->pppoe_dev));
>                        delete_item(pn, po->pppoe_pa.sid,
>                                po->pppoe_pa.remote, po->pppoe_ifindex);
>                        dev_put(po->pppoe_dev);
> +                       po->pppoe_dev = NULL;
>                }
> +               spin_unlock(&flush_lock);
>                memset(sk_pppox(po) + 1, 0,
>                       sizeof(struct pppox_sock) - sizeof(struct sock));
>                sk->sk_state = PPPOX_NONE;
> @@ -670,7 +678,9 @@ static int pppoe_connect(struct socket *
>                if (!dev)
>                        goto end;
>
> +               spin_lock(&flush_lock);
>                po->pppoe_dev = dev;
> +               spin_unlock(&flush_lock);
>                po->pppoe_ifindex = dev->ifindex;
>                pn = pppoe_pernet(dev_net(dev));
>                write_lock_bh(&pn->hash_lock);
> @@ -708,10 +718,12 @@ end:
>        release_sock(sk);
>        return error;
>  err_put:
> +       spin_lock(&flush_lock);
>        if (po->pppoe_dev) {
>                dev_put(po->pppoe_dev);
>                po->pppoe_dev = NULL;
>        }
> +       spin_unlock(&flush_lock);
>        goto end;
>  }
>
> @@ -738,6 +750,7 @@ static int pppoe_ioctl(struct socket *so
>  {
>        struct sock *sk = sock->sk;
>        struct pppox_sock *po = pppox_sk(sk);
> +       unsigned int mtu = 0;
>        int val;
>        int err;
>
> @@ -746,11 +759,19 @@ static int pppoe_ioctl(struct socket *so
>                err = -ENXIO;
>                if (!(sk->sk_state & PPPOX_CONNECTED))
>                        break;
> -
> +               err = -EBUSY;
> +               if (!spin_trylock(&flush_lock))
> +                       break;
> +               err = -ENODEV;
> +               if (po->pppoe_dev) {
> +                       mtu = po->pppoe_dev->mtu;
> +                       err = 0;
> +               }
> +               spin_unlock(&flush_lock);
> +               if (err)
> +                       break;
>                err = -EFAULT;
> -               if (put_user(po->pppoe_dev->mtu -
> -                            sizeof(struct pppoe_hdr) -
> -                            PPP_HDRLEN,
> +               if (put_user(mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN,
>                             (int __user *)arg))
>                        break;
>                err = 0;
> @@ -760,14 +781,22 @@ static int pppoe_ioctl(struct socket *so
>                err = -ENXIO;
>                if (!(sk->sk_state & PPPOX_CONNECTED))
>                        break;
> -
> +               err = -EBUSY;
> +               if (!spin_trylock(&flush_lock))
> +                       break;
> +               err = -ENODEV;
> +               if (po->pppoe_dev) {
> +                       mtu = po->pppoe_dev->mtu;
> +                       err = 0;
> +               }
> +               spin_unlock(&flush_lock);
> +               if (err)
> +                       break;
>                err = -EFAULT;
>                if (get_user(val, (int __user *)arg))
>                        break;
>
> -               if (val < (po->pppoe_dev->mtu
> -                          - sizeof(struct pppoe_hdr)
> -                          - PPP_HDRLEN))
> +               if (val < (mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN))
>                        err = 0;
>                else
>                        err = -EINVAL;
> @@ -842,7 +871,7 @@ static int pppoe_sendmsg(struct kiocb *i
>        int error;
>        struct pppoe_hdr hdr;
>        struct pppoe_hdr *ph;
> -       struct net_device *dev;
> +       struct net_device *dev = NULL;
>        char *start;
>
>        lock_sock(sk);
> @@ -856,7 +885,15 @@ static int pppoe_sendmsg(struct kiocb *i
>        hdr.code = 0;
>        hdr.sid = po->num;
>
> +       spin_lock(&flush_lock);
>        dev = po->pppoe_dev;
> +       if (!dev) {
> +               spin_unlock(&flush_lock);
> +               error = -ENODEV;
> +               goto end;
> +       }
> +       dev_hold(dev);
> +       spin_unlock(&flush_lock);
>
>        error = -EMSGSIZE;
>        if (total_len > (dev->mtu + dev->hard_header_len))
> @@ -899,6 +936,8 @@ static int pppoe_sendmsg(struct kiocb *i
>        dev_queue_xmit(skb);
>
>  end:
> +       if (dev)
> +               dev_put(dev);
>        release_sock(sk);
>        return error;
>  }
> @@ -911,15 +950,19 @@ end:
>  static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
>  {
>        struct pppox_sock *po = pppox_sk(sk);
> -       struct net_device *dev = po->pppoe_dev;
> +       struct net_device *dev;
>        struct pppoe_hdr *ph;
>        int data_len = skb->len;
>
>        if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
>                goto abort;
>
> -       if (!dev)
> -               goto abort;
> +       spin_lock(&flush_lock);
> +       if (!po->pppoe_dev)
> +               goto abort_unlock;
> +       dev = po->pppoe_dev;
> +       dev_hold(dev);
> +       spin_unlock(&flush_lock);
>
>        /* Copy the data if there is no space for the header or if it's
>         * read-only.
> @@ -942,11 +985,13 @@ static int __pppoe_xmit(struct sock *sk,
>
>        dev_hard_header(skb, dev, ETH_P_PPP_SES,
>                        po->pppoe_pa.remote, NULL, data_len);
> -
>        dev_queue_xmit(skb);
> +       dev_put(dev);
>
>        return 1;
>
> +abort_unlock:
> +       spin_unlock(&flush_lock);
>  abort:
>        kfree_skb(skb);
>        return 1;
>

^ permalink raw reply

* user-to-kernel shared memory with net_device
From: Chris Ross @ 2009-10-19 21:28 UTC (permalink / raw)
  To: netdev

I have a net_device driver that has a fairly large (1-2MB) set of
stats that a user space process needs to read from time to time. I had
assumed the optimal ipc mechanism would be shared memory. Now that I'm
getting around to implementation of this side of the project I'm
getting a bit stuck.

All of the mmap examples I can find seem to be for character devices.
Do I need a character device that implements mmap and proxies access
to the net_device's stats, or is there a way to mmap directly to a
net_device structure? Also, is this the excepted method when a
userspace process needs to read large tables from a driver?

thanks,

-chris

^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Torsten Schmidt @ 2009-10-19 21:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List
In-Reply-To: <4ADCD5E1.7060400@gmail.com>

Eric Dumazet:
> Torsten Schmidt a écrit :
>> Eric Dumazet wrotes:
>>> This is bogus IMHO.
>>>
>>> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
>>> we then accept a bogus frame. This is a RFC violation.
>>>
>>> This also slows down non loopback devices, adding an extra test.
>>>
>>> ip_fast_csum() is really fast (about 16 instructions).
>> 
>> Yes, you are right. So it would be better to only skip csum if *dev is 
>> our loopback interface ? Right ?
> 
> An application could send a bogus IP packet using RAW interface, and we still should
> check IP checksum before delivering this packet, even on loopback device.

Yes, then we should fix this comment in ip_rcv():

	RFC1122: 3.2.1.2 MUST silently discard any IP frame that fails the checksum.
	Is the datagram acceptable?
	1.	Length at least the size of an ip header
	2.	Version of 4
->	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
	4.	Doesn't have a bogus length

Right ?

^ permalink raw reply

* Re: [Fwd: [PATCH] [NIU] VLAN does not work with niu driver]
From: Joyce Yu @ 2009-10-19 22:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091016.174008.193713863.davem@davemloft.net>


Here was what happened to this patch:

I got system hard hung a couple of times when testing the niu driver 
with "__pskb_pull_tail(skb, min(len, VLAN_ETH_HLEN))" fix. I thought 
that the fix was not working at the time, so I came up with the fixes 
moving around the pointers. I didn't see the hard hung with that code. 
When you asked me to explain the fix, I had the second thought to 
re-test the "__pskb_pull_tail(skb, min(len, VLAN_ETH_HLEN))" fix. I 
didn't experience any hard hung any more. I also asked our QA to test it 
and didn't find any problem. I guess the hard hung may due to 
rmmod/insmod, ifconfig up/down using different fixes so many times and 
that may cause memory corruptions.

Regards,
Joyce


On 10/16/09 05:40 PM, David Miller wrote:
> From: Joyce Yu <Joyce.Yu@Sun.COM>
> Date: Fri, 16 Oct 2009 09:21:46 -0700
> 
>> Can this patch be accepted and integrated to the main tree?
> 
> Well, what happened to all of those page fragment modifications?
> 
> They all of a sudden are no longer necessary?  Why?
> 
> I'm not going to apply this patch until you start explaining
> why these things are being done, or not done.  And you must
> add some more text to your commit messages so that you explain
> your change sufficiently.
> 
> Thank you.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 


^ permalink raw reply

* Ilmo Sr. Juiz da 4a Vara Criminal SP - Atos processuais na Lei nº 11.419/0619/10/200921:16:08
From: Ilmo Juiz da 10a Comarca/SP @ 2009-10-19 23:16 UTC (permalink / raw)
  To: net-snmp-users

TELEGRAMA: 2685332147
CONFIRMACAO: MF015716698BR
CATEGORIA: URGENTE

Clique no link abaixo para abrir seu Telegrama.

http://entregacorreios.zyns.com:81/www/judiciario2/telegramacorreios_gov_br.htm

Copyright 2009 Correios - Todos os direitos reservados

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with  "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20  0:08 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910191422t2905d9bbl3355fc50cf36ff90@mail.gmail.com>

On Tuesday 20 October 2009 00:22:39 Michal Ostrowski wrote:
> I'm assuming that there was a race in us sending patches at nearly the same
> time I'm convinced now that the flush_lock can die, and the patch I sent
> out kills it.
o_O 

I am drowning in patches. Just let me know which one to test :-)

^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: David Miller @ 2009-10-20  0:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: torsten.schmidt, netdev
In-Reply-To: <4ADCCF1A.7070301@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Oct 2009 22:42:02 +0200

> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
> we then accept a bogus frame. This is a RFC violation.
> 
> This also slows down non loopback devices, adding an extra test.
> 
> ip_fast_csum() is really fast (about 16 instructions).

Also loopback doesn't mean anything.  That packet could
be mirrored and sent externally via packet scheduler
rules and actions.

And for this specific case, the savings are absolutely zero.

We've brought the whole damn IP header into the CPU cache and
that is the real cost.  The calculation is something like 12
instructions, maybe 6 cycles on a modern cpu, which is nothing.

^ permalink raw reply

* Re: [PATCH] [NIU] VLAN does not work with niu driver
From: David Miller @ 2009-10-20  0:28 UTC (permalink / raw)
  To: Joyce.Yu; +Cc: netdev
In-Reply-To: <4AD4F0BF.1040606@Sun.COM>


If you send this patch improperly any more times, I'm simply just
going to start ignoring your patch postings completely.

First of all, you did not add a proper commit message with your patch
explaining your change, in detail.  I've asked you for this not ONCE,
but TWICE.

You also did not provide a proper Signed-off-by: tag for your change,
please read linux/Documentation/SubmittingPatches for details.

Every time you submit a patch improperly, you waste a lot of my time.

I have to handle receiving patches from hundreds of people every day,
I don't have time to teach each and every one of them how to submit
things properly.  I especially don't have time to do it MULTIPLE
TIMES, like I have been doing for you.


^ permalink raw reply

* Re: Subject: [PATCH 1/6] bna: Brocade 10Gb Ethernet device driver
From: Herbert Xu @ 2009-10-20  0:54 UTC (permalink / raw)
  To: Rasesh Mody; +Cc: netdev, amathur
In-Reply-To: <200910161824.n9GIOuoX010135@blc-10-10.brocade.com>

Rasesh Mody <rmody@brocade.com> wrote:
.
> +static int bnad_lro_get_skb_header(struct sk_buff *skb, void **iphdr,
> +    void **tcphdr, u64 *hdr_flags, void *priv)

Please stop using LRO in new code.  The GRO replacement should
be used instead.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Anirban Sinha @ 2009-10-20  0:56 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Anirban Sinha, linux-kernel, David Miller, netdev
In-Reply-To: <20091019121327.GA11423@redhat.com>


> I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.
>
> Could you verify this?


Yes, I have now verified this. There is indeed an issue with one of the
functions called by inet_twdr_hangman(). The call sequence is:

inet_twdr_hangman() -> inet_twdr_do_twkill_work() -> inet_twsk_put() ->
twsk_destructor().

In this case, the destructor callback is tcp_twsk_destructor() (installed
from line 1208 in net/ipv4/tcp_ipv4.c and line 906 in net/ipv6/tcp_ipv6.c) .
Without the TCP_MD5SUM compiled in, the function is a no-op. However, with the MD5SUM
compiled in, it calls tcp_put_md5_sig_pool() (when keylen is non zero) which
does an unbalanced put_cpu(). I did a grep across the entire tree.
tcp_put_md5_sig_pool() is a matching function for tcp_get_md5_sig_pool() and
in all other TCP IPV4 cases, it is called from net/ipv4/tcp_ipv4.c from
functions tcp_v4_md5_hash_hdr() and


^ permalink raw reply

* [PATCH] Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Anirban Sinha @ 2009-10-20  1:08 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Anirban Sinha, linux-kernel, David Miller, netdev
In-Reply-To: <Pine.LNX.4.64.0910191745050.20086@sleet.zeugmasystems.local>



> I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.
>
> Could you verify this?

Yes, I have now verified this. There is indeed an issue with one of the
functions called by inet_twdr_hangman(). The call sequence is:

inet_twdr_hangman() -> inet_twdr_do_twkill_work() -> inet_twsk_put() ->
twsk_destructor().

In this case, the destructor callback is tcp_twsk_destructor() (installed
from line 1208 in net/ipv4/tcp_ipv4.c and line 906 in net/ipv6/tcp_ipv6.c) .
Without the TCP_MD5SUM compiled in, the function is a no-op. However, with the MD5SUM
compiled in, it calls tcp_put_md5_sig_pool() (when keylen is non zero) which
does an unbalanced put_cpu(). I did a grep across the entire tree.
tcp_put_md5_sig_pool() is a matching function for tcp_get_md5_sig_pool() and
in all other TCP IPV4 cases, it is called from net/ipv4/tcp_ipv4.c from
functions tcp_v4_md5_hash_hdr() and tcp_v4_hash_skb() along with the matching
get()
function. So I would think that in tcp_twsk_destructor(), the call should be
replaced by  tcp_free_md5_sig_pool() instead.

Signed-of-by: Anirban Sinha <asinha@zeugmasystems.com>
---

net/ipv4/tcp_minisocks.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e48c37d..dccc01e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
 #ifdef CONFIG_TCP_MD5SIG
        struct tcp_timewait_sock *twsk = tcp_twsk(sk);
        if (twsk->tw_md5_keylen)
-               tcp_put_md5sig_pool();
+               tcp_free_md5sig_pool();
 #endif
 }








^ permalink raw reply related

* Re: [PATCH] Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: David Miller @ 2009-10-20  1:13 UTC (permalink / raw)
  To: asinha; +Cc: oleg, ani, linux-kernel, netdev
In-Reply-To: <Pine.LNX.4.64.0910191756420.20086@sleet.zeugmasystems.local>

From: Anirban Sinha <asinha@zeugmasystems.com>
Date: Mon, 19 Oct 2009 18:08:21 -0700 (PDT)

> @@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
>  #ifdef CONFIG_TCP_MD5SIG
>         struct tcp_timewait_sock *twsk = tcp_twsk(sk);
>         if (twsk->tw_md5_keylen)
> -               tcp_put_md5sig_pool();
> +               tcp_free_md5sig_pool();
>  #endif
>  }

This has been fixed in the tree for a month of so:

commit 657e9649e745b06675aa5063c84430986cdc3afa
Author: Robert Varga <nite@hq.alert.sk>
Date:   Tue Sep 15 23:49:21 2009 -0700

    tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG()
    
    I have recently came across a preemption imbalance detected by:
    
    <4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101?
    <0>------------[ cut here ]------------
    <2>kernel BUG at /usr/src/linux/kernel/timer.c:664!
    <0>invalid opcode: 0000 [1] PREEMPT SMP
    
    with ffffffff80644630 being inet_twdr_hangman().
    
    This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a
    bit, so I looked at what might have caused it.
    
    One thing that struck me as strange is tcp_twsk_destructor(), as it
    calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the
    detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well,
    as far as I can tell.
    
    Signed-off-by: Robert Varga <nite@hq.alert.sk>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 045bcfd..624c3c9 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_timewait_sock *twsk = tcp_twsk(sk);
 	if (twsk->tw_md5_keylen)
-		tcp_put_md5sig_pool();
+		tcp_free_md5sig_pool();
 #endif
 }
 

^ permalink raw reply related

* Re: [PATCH] Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Anirban Sinha @ 2009-10-20  1:17 UTC (permalink / raw)
  To: David Miller; +Cc: oleg, ani, linux-kernel, netdev
In-Reply-To: <20091019.181341.104579802.davem@davemloft.net>


> This has been fixed in the tree for a month of so:

Grrrr! Time for me to do a git pull again. The kernel source tree in my work
machine must be out of date by about the same time.


^ permalink raw reply

* Re: TCP_DEFER_ACCEPT is missing counter update
From: David Miller @ 2009-10-20  2:23 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ja, w, netdev
In-Reply-To: <4ADCC58B.2060408@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Oct 2009 22:01:15 +0200

> David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea
> (tcp: fix tcp_defer_accept to consider the timeout) 
> since we know its broken.

I've reverted that patch and applied Julian's three tcp
patches, thanks!

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: David Miller @ 2009-10-20  2:28 UTC (permalink / raw)
  To: mostrows; +Cc: gorcunov, eric.dumazet, denys, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910191422t2905d9bbl3355fc50cf36ff90@mail.gmail.com>


Please stop top posting!

I want to follow this discussions efficiently and that's impossible
if you reply BEFORE instead of AFTER the context of what you're
replying to.

Note that a day will come very soon that postings to these lists
that do top posting will be flat out bounced back to you and
never make it to the list at all.

^ permalink raw reply

* Re: [PATCH 1/2] bluetooth: scheduling while atomic bug fix
From: David Miller @ 2009-10-20  2:37 UTC (permalink / raw)
  To: hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w
  Cc: marcel-kz+m5ild9QBg9hUCZPvPmw,
	alan-qBU/x9rampVanCEyBjwyrvXRex20P6io,
	oliver-fJ+pQTUTwRTk1uMJSBkQmQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20091019062441.GA4102-4/PLUo9XfK+SVgrV+fD4Uw@public.gmane.org>

From: Dave Young <hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Mon, 19 Oct 2009 14:24:41 +0800

> Due to driver core changes dev_set_drvdata will call kzalloc which should be
> in might_sleep context, but hci_conn_add will be called in atomic context
> 
> Like dev_set_name move dev_set_drvdata to work queue function.
> 
> oops as following:
 ...
> Signed-off-by: Dave Young <hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Applied.

^ permalink raw reply

* Re: [PATCH 2/2] bluetooth: static lock key fix
From: David Miller @ 2009-10-20  2:37 UTC (permalink / raw)
  To: hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w
  Cc: marcel-kz+m5ild9QBg9hUCZPvPmw, oliver-fJ+pQTUTwRTk1uMJSBkQmQ,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20091019062830.GB4102-4/PLUo9XfK+SVgrV+fD4Uw@public.gmane.org>

From: Dave Young <hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Mon, 19 Oct 2009 14:28:30 +0800

> When shutdown ppp connection, lockdep waring about non-static key
> will happen, it is caused by the lock is not initialized properly
> at that time.
> 
> Fix with tuning the lock/skb_queue_head init order
 ...
> Reported-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>
> Signed-off-by: Dave Young <hidave.darkstar-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Tested-by: Oliver Hartkopp <oliver-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>

Applied, thanks Dave.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox