Netdev List
 help / color / mirror / Atom feed
* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-19 18:44 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910191107h899a4ffs588f2413093dfb4b@mail.gmail.com>

Michal Ostrowski a écrit :
> On Mon, Oct 19, 2009 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Michal Ostrowski a écrit :
>>> Here's a bigger patch that just gets rid of flush_lock altogether.
>>>
>>> We were seeing oopses due to net namespaces going away while we were using
>>> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
>>> counts properly.
>>>
>>> Fixing this requires that adding and removing entries to the per-net hash-table
>>> requires incrementing and decrementing the ref count.  This also allows us to
>>> get rid of the flush_lock since we can now depend on the existence of
>>> "pn->hash_lock".
>>>
>>> We also have to be careful when flushing devices that removal of a hash table
>>> entry may bring the net namespace refcount to 0.
>>>
>> Your patch is mangled (tabulation -> white spaces),
> 
> Patch mangling was due to mailer interactions, I'll attach a clean
> version here, no more inlining.
> 
>> and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
>> it would be a bug from core network code.
>>
> 
> From the original oops I was able to deduce that the namespace somehow
> managed to get destroyed during the interval where we dropped locks.
> If that's not due to the release_sock() call in pppoe_flush_dev()
> triggering a cleanup then I'd have to assume that that it's due to a
> secondary actor closing the socket in parallel, but that in turn would
> point to issues with the flush_lock.  Having said that the thrust of
> this patch remains valid; it just means I don't need to inc the ref
> count in pppoe_flush_dev().
> 
> Do you agree?
> 

Not really :)

I dont believe you should care of namespace, and/or mess with its refcount at all.

Please dont use maybe_get_net() : This function should not ever be used in drivers/net

You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if this
assertion is false, this is not because of pppoe.


 	lock_sock(sk);
@@ -653,10 +642,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
 		if (po->pppoe_dev) {
-			pn = pppoe_pernet(dev_net(po->pppoe_dev));
+			struct net *old = dev_net(po->pppoe_dev);
+			pn = pppoe_pernet(old);
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			put_net(old);
 		}
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));


There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held

So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.

In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.

pppoe_rcv_core() is not safe
pppoe_ioctl() is not safe
pppoe_sendmsg() is not safe
__pppoe_xmit() is not safe


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-19 19:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADCB3A4.8060408@gmail.com>

[Eric Dumazet - Mon, Oct 19, 2009 at 08:44:52PM +0200]
...
| 
| Not really :)
| 
| I dont believe you should care of namespace, and/or mess with its refcount at all.
| 
| Please dont use maybe_get_net() : This function should not ever be used in drivers/net
| 
| You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if this
| assertion is false, this is not because of pppoe.
| 
...
| So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
| 
| In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
| all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
| 
| pppoe_rcv_core() is not safe
| pppoe_ioctl() is not safe
| pppoe_sendmsg() is not safe
| __pppoe_xmit() is not safe
| 

Sigh... seem so (which is mostly my fault not Michal). Every time we touch pppoe_dev we
should dev_hold on it and dev_put as only done all we need. Async nature
of notifier seem to be a key here.

	-- Cyrill

^ permalink raw reply

* PATCH [net-next-2.6] IP: Cleanups
From: John Dykstra @ 2009-10-19 19:31 UTC (permalink / raw)
  To: netdev

I've been sitting on these waiting to accumulate more, but that isn't
likely to happen short-term.

---
Use symbols instead of magic constants while checking PMTU discovery
setsockopt.

Remove redundant test in ip_rt_frag_needed() (done by caller).

Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
---
 net/ipv4/ip_sockglue.c   |    2 +-
 net/ipv4/route.c         |    3 ---
 net/ipv6/ipv6_sockglue.c |    2 +-
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c602d98..2445fed 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -575,7 +575,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		inet->hdrincl = val ? 1 : 0;
 		break;
 	case IP_MTU_DISCOVER:
-		if (val < 0 || val > 3)
+		if (val < IP_PMTUDISC_DONT || val > IP_PMTUDISC_PROBE)
 			goto e_inval;
 		inet->pmtudisc = val;
 		break;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 39e10ac..68566de 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -662,7 +662,7 @@ done:
 	case IPV6_MTU_DISCOVER:
 		if (optlen < sizeof(int))
 			goto e_inval;
-		if (val<0 || val>3)
+		if (val < IP_PMTUDISC_DONT || val > IP_PMTUDISC_PROBE)
 			goto e_inval;
 		np->pmtudisc = val;
 		retv = 0;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bb41992..68fb227 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1628,9 +1628,6 @@ unsigned short ip_rt_frag_needed(struct net *net, struct iphdr *iph,
 	__be32  daddr = iph->daddr;
 	unsigned short est_mtu = 0;
 
-	if (ipv4_config.no_pmtu_disc)
-		return 0;
-
 	for (k = 0; k < 2; k++) {
 		for (i = 0; i < 2; i++) {
 			unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],-- 
1.5.4.3




^ permalink raw reply related

* [PATCH] tcp: accept socket after TCP_DEFER_ACCEPT period
From: Julian Anastasov @ 2009-10-19 20:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev


	Willy Tarreau and many other folks in recent years
were concerned what happens when the TCP_DEFER_ACCEPT period
expires for clients which sent ACK packet. They prefer clients
that actively resend ACK on our SYN-ACK retransmissions to be
converted from open requests to sockets and queued to the
listener for accepting after the deferring period is finished.
Then application server can decide to wait longer for data
or to properly terminate the connection with FIN if read()
returns EAGAIN which is an indication for accepting after
the deferring period. This change still can have side effects
for applications that expect always to see data on the accepted
socket. Others can be prepared to work in both modes (with or
without TCP_DEFER_ACCEPT period) and their data processing can
ignore the read=EAGAIN notification and to allocate resources for
clients which proved to have no data to send during the deferring
period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
as a timeout) to wait for data will notice clients that didn't
send data for 3 seconds but that still resend ACKs.
Thanks to Willy Tarreau for the initial idea and to
Eric Dumazet for the review and testing the change.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urp v2.6.31/linux/net/ipv4/tcp_minisocks.c linux/net/ipv4/tcp_minisocks.c
--- v2.6.31/linux/net/ipv4/tcp_minisocks.c	2009-09-11 10:27:17.000000000 +0300
+++ linux/net/ipv4/tcp_minisocks.c	2009-10-16 10:29:19.000000000 +0300
@@ -641,8 +641,8 @@ struct sock *tcp_check_req(struct sock *
 	if (!(flg & TCP_FLAG_ACK))
 		return NULL;
 
-	/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
-	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+	/* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
+	if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
 	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
 		inet_rsk(req)->acked = 1;
 		return NULL;

^ permalink raw reply

* Re: TCP_DEFER_ACCEPT is missing counter update
From: Eric Dumazet @ 2009-10-19 20:01 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910171709190.10440@u.domain.uli>

Julian Anastasov a écrit :
> 	Hello,
> 
> On Sat, 17 Oct 2009, Eric Dumazet wrote:
> 
>> I really like this, but please define helper functions out of do_tcp_setsockopt()
>>
>> /* Translate value in seconds to number of SYN-ACK retransmits */
>> static u8 secs_to_retrans(int seconds)
>> {
>> 	u8 res = 0;
>>
>> 	if (seconds > 0) {
>> 		int timeout = TCP_TIMEOUT_INIT / HZ;
>> 		int period = timeout;
>>
>> 		res = 1;
>> 		while (res < 255 && seconds > period) {
>> 			res++;
>> 			timeout <<= 1;
>> 			if (timeout > TCP_RTO_MAX / HZ)
>> 				timeout = TCP_RTO_MAX / HZ;
>> 			period += timeout;
>> 		}
>> 	}
>> 	return res;
>> }
>>
>> You also need the reverse function for getsockopt()...
> 
> 	Yes, I forgot that. Here is what I tested, should
> I split it later to 2 patches? May be it should go somewhere
> in net/core/sock.c as extern funcs with EXPORT_SYMBOL to
> allow other protocols one day to use it?
> 

No need to EXPORT_SYMBOL it for the moment. We can add it later if really needed.

David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea
(tcp: fix tcp_defer_accept to consider the timeout) 
since we know its broken.



^ permalink raw reply

* [PATCH] tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT
From: Julian Anastasov @ 2009-10-19 20:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev


	Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
users to not retransmit SYN-ACKs during the deferring period if
ACK from client was received. The goal is to reduce traffic
during the deferring period. When the period is finished
we continue with sending SYN-ACKs (at least one) but this time
any traffic from client will change the request to established
socket allowing application to terminate it properly.
Also, do not drop acked request if sending of SYN-ACK fails.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urp v2.6.31/linux/net/ipv4/inet_connection_sock.c linux/net/ipv4/inet_connection_sock.c
--- v2.6.31/linux/net/ipv4/inet_connection_sock.c	2009-06-13 10:53:58.000000000 +0300
+++ linux/net/ipv4/inet_connection_sock.c	2009-10-16 11:35:52.000000000 +0300
@@ -446,6 +446,28 @@ extern int sysctl_tcp_synack_retries;
 
 EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_hash_add);
 
+/* Decide when to expire the request and when to resend SYN-ACK */
+static inline void syn_ack_recalc(struct request_sock *req, const int thresh,
+				  const int max_retries,
+				  const u8 rskq_defer_accept,
+				  int *expire, int *resend)
+{
+	if (!rskq_defer_accept) {
+		*expire = req->retrans >= thresh;
+		*resend = 1;
+		return;
+	}
+	*expire = req->retrans >= thresh &&
+		  (!inet_rsk(req)->acked || req->retrans >= max_retries);
+	/*
+	 * Do not resend while waiting for data after ACK,
+	 * start to resend on end of deferring period to give
+	 * last chance for data or ACK to create established socket.
+	 */
+	*resend = !inet_rsk(req)->acked ||
+		  req->retrans >= rskq_defer_accept - 1;
+}
+
 void inet_csk_reqsk_queue_prune(struct sock *parent,
 				const unsigned long interval,
 				const unsigned long timeout,
@@ -501,9 +523,15 @@ void inet_csk_reqsk_queue_prune(struct s
 		reqp=&lopt->syn_table[i];
 		while ((req = *reqp) != NULL) {
 			if (time_after_eq(now, req->expires)) {
-				if ((req->retrans < thresh ||
-				     (inet_rsk(req)->acked && req->retrans < max_retries))
-				    && !req->rsk_ops->rtx_syn_ack(parent, req)) {
+				int expire = 0, resend = 0;
+
+				syn_ack_recalc(req, thresh, max_retries,
+					       queue->rskq_defer_accept,
+					       &expire, &resend);
+				if (!expire &&
+				    (!resend ||
+				     !req->rsk_ops->rtx_syn_ack(parent, req) ||
+				     inet_rsk(req)->acked)) {
 					unsigned long timeo;
 
 					if (req->retrans++ == 0)

^ permalink raw reply

* Re: [PATCH] tcp: accept socket after TCP_DEFER_ACCEPT period
From: Eric Dumazet @ 2009-10-19 20:04 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David S. Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910192259280.2971@u.domain.uli>

Julian Anastasov a écrit :
> 	Willy Tarreau and many other folks in recent years
> were concerned what happens when the TCP_DEFER_ACCEPT period
> expires for clients which sent ACK packet. They prefer clients
> that actively resend ACK on our SYN-ACK retransmissions to be
> converted from open requests to sockets and queued to the
> listener for accepting after the deferring period is finished.
> Then application server can decide to wait longer for data
> or to properly terminate the connection with FIN if read()
> returns EAGAIN which is an indication for accepting after
> the deferring period. This change still can have side effects
> for applications that expect always to see data on the accepted
> socket. Others can be prepared to work in both modes (with or
> without TCP_DEFER_ACCEPT period) and their data processing can
> ignore the read=EAGAIN notification and to allocate resources for
> clients which proved to have no data to send during the deferring
> period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
> as a timeout) to wait for data will notice clients that didn't
> send data for 3 seconds but that still resend ACKs.
> Thanks to Willy Tarreau for the initial idea and to
> Eric Dumazet for the review and testing the change.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
> 
> diff -urp v2.6.31/linux/net/ipv4/tcp_minisocks.c linux/net/ipv4/tcp_minisocks.c
> --- v2.6.31/linux/net/ipv4/tcp_minisocks.c	2009-09-11 10:27:17.000000000 +0300
> +++ linux/net/ipv4/tcp_minisocks.c	2009-10-16 10:29:19.000000000 +0300
> @@ -641,8 +641,8 @@ struct sock *tcp_check_req(struct sock *
>  	if (!(flg & TCP_FLAG_ACK))
>  		return NULL;
>  
> -	/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
> -	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
> +	/* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
> +	if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
>  	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
>  		inet_rsk(req)->acked = 1;
>  		return NULL;
> --

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks Julian

^ permalink raw reply

* [PATCH] tcp: fix TCP_DEFER_ACCEPT retrans calculation
From: Julian Anastasov @ 2009-10-19 20:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev


	Fix TCP_DEFER_ACCEPT conversion between seconds and
retransmission to match the TCP SYN-ACK retransmission periods
because the time is converted to such retransmissions. The old
algorithm selects one more retransmission in some cases. Allow
up to 255 retransmissions.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

	Not sure if this is the right place to put both
functions ...

diff -urp v2.6.31/linux/net/ipv4/tcp.c linux/net/ipv4/tcp.c
--- v2.6.31/linux/net/ipv4/tcp.c	2009-09-11 10:27:17.000000000 +0300
+++ linux/net/ipv4/tcp.c	2009-10-17 16:37:48.000000000 +0300
@@ -326,6 +326,43 @@ void tcp_enter_memory_pressure(struct so
 
 EXPORT_SYMBOL(tcp_enter_memory_pressure);
 
+/* Convert seconds to retransmits based on initial and max timeout */
+static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
+{
+	u8 res = 0;
+
+	if (seconds > 0) {
+		int period = timeout;
+
+		res = 1;
+		while (seconds > period && res < 255) {
+			res++;
+			timeout <<= 1;
+			if (timeout > rto_max)
+				timeout = rto_max;
+			period += timeout;
+		}
+	}
+	return res;
+}
+
+/* Convert retransmits to seconds based on initial and max timeout */
+static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+{
+	int period = 0;
+
+	if (retrans > 0) {
+		period = timeout;
+		while (--retrans) {
+			timeout <<= 1;
+			if (timeout > rto_max)
+				timeout = rto_max;
+			period += timeout;
+		}
+	}
+	return period;
+}
+
 /*
  *	Wait for a TCP event.
  *
@@ -2163,16 +2200,10 @@ static int do_tcp_setsockopt(struct sock
 		break;
 
 	case TCP_DEFER_ACCEPT:
-		icsk->icsk_accept_queue.rskq_defer_accept = 0;
-		if (val > 0) {
-			/* Translate value in seconds to number of
-			 * retransmits */
-			while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
-			       val > ((TCP_TIMEOUT_INIT / HZ) <<
-				       icsk->icsk_accept_queue.rskq_defer_accept))
-				icsk->icsk_accept_queue.rskq_defer_accept++;
-			icsk->icsk_accept_queue.rskq_defer_accept++;
-		}
+		/* Translate value in seconds to number of retransmits */
+		icsk->icsk_accept_queue.rskq_defer_accept =
+			secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+					TCP_RTO_MAX / HZ);
 		break;
 
 	case TCP_WINDOW_CLAMP:
@@ -2353,8 +2384,8 @@ static int do_tcp_getsockopt(struct sock
 			val = (val ? : sysctl_tcp_fin_timeout) / HZ;
 		break;
 	case TCP_DEFER_ACCEPT:
-		val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 :
-			((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1));
+		val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
+				      TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
 		break;
 	case TCP_WINDOW_CLAMP:
 		val = tp->window_clamp;

^ permalink raw reply

* Re: TCP_DEFER_ACCEPT is missing counter update
From: Willy Tarreau @ 2009-10-19 20:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Julian Anastasov, David Miller, netdev
In-Reply-To: <4ADCC58B.2060408@gmail.com>

On Mon, Oct 19, 2009 at 10:01:15PM +0200, Eric Dumazet wrote:
> David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea
> (tcp: fix tcp_defer_accept to consider the timeout) 
> since we know its broken.

yes I agree with Eric, please revert it and wait for Julian's
patch which gets it right. Sorry for the wrong fix, at least
it will have brought the discussion on the table, but without
a faulty patch it would have been better :-/

Thanks,
Willy


^ permalink raw reply

* Re: [PATCH] tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT
From: Eric Dumazet @ 2009-10-19 20:11 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David S. Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910192302000.2971@u.domain.uli>

Julian Anastasov a écrit :
> 	Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
> users to not retransmit SYN-ACKs during the deferring period if
> ACK from client was received. The goal is to reduce traffic
> during the deferring period. When the period is finished
> we continue with sending SYN-ACKs (at least one) but this time
> any traffic from client will change the request to established
> socket allowing application to terminate it properly.
> Also, do not drop acked request if sending of SYN-ACK fails.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
> 

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


^ permalink raw reply

* [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Torsten Schmidt @ 2009-10-19 19:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Skip loopback checksum in ip_rcv() for speed optimisation.

Signed-off-by: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
---
 net/ipv4/ip_input.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 6c98b43..dc72286 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -406,7 +406,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 *
 	 *	1.	Length at least the size of an ip header
 	 *	2.	Version of 4
-	 *	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
+	 *	3.	Checksums correctly. [Speed optimisation: skip loopback checksums]
 	 *	4.	Doesn't have a bogus length
 	 */
 
@@ -418,8 +418,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	iph = ip_hdr(skb);
 
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
+	if (!ipv4_is_loopback(iph->daddr))
+		if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
+			goto inhdr_error;
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-- 

^ permalink raw reply related

* Re: [PATCH] tcp: fix TCP_DEFER_ACCEPT retrans calculation
From: Eric Dumazet @ 2009-10-19 20:16 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David S. Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910192304060.2971@u.domain.uli>

Julian Anastasov a écrit :
> 	Fix TCP_DEFER_ACCEPT conversion between seconds and
> retransmission to match the TCP SYN-ACK retransmission periods
> because the time is converted to such retransmissions. The old
> algorithm selects one more retransmission in some cases. Allow
> up to 255 retransmissions.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


Note: 255 retransmits translates to 30069 seconds.

^ permalink raw reply

* Re: TCP_DEFER_ACCEPT is missing counter update
From: Eric Dumazet @ 2009-10-19 20:17 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Julian Anastasov, David Miller, netdev
In-Reply-To: <20091019201122.GC21437@1wt.eu>

Willy Tarreau a écrit :
> On Mon, Oct 19, 2009 at 10:01:15PM +0200, Eric Dumazet wrote:
>> David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea
>> (tcp: fix tcp_defer_accept to consider the timeout) 
>> since we know its broken.
> 
> yes I agree with Eric, please revert it and wait for Julian's
> patch which gets it right. Sorry for the wrong fix, at least
> it will have brought the discussion on the table, but without
> a faulty patch it would have been better :-/
> 

Thanks again Willy, we made progress and this is nice :)


^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Eric Dumazet @ 2009-10-19 20:42 UTC (permalink / raw)
  To: Torsten Schmidt; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <200910192134.02125.schmto@hrz.tu-chemnitz.de>

Torsten Schmidt a écrit :
> Skip loopback checksum in ip_rcv() for speed optimisation.
> 
> Signed-off-by: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
> ---
>  net/ipv4/ip_input.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index 6c98b43..dc72286 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -406,7 +406,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
>  	 *
>  	 *	1.	Length at least the size of an ip header
>  	 *	2.	Version of 4
> -	 *	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
> +	 *	3.	Checksums correctly. [Speed optimisation: skip loopback checksums]
>  	 *	4.	Doesn't have a bogus length
>  	 */
>  
> @@ -418,8 +418,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
>  
>  	iph = ip_hdr(skb);
>  
> -	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> -		goto inhdr_error;
> +	if (!ipv4_is_loopback(iph->daddr))
> +		if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
> +			goto inhdr_error;
>  
>  	len = ntohs(iph->tot_len);
>  	if (skb->len < len) {

This is bogus IMHO.

One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
we then accept a bogus frame. This is a RFC violation.

This also slows down non loopback devices, adding an extra test.

ip_fast_csum() is really fast (about 16 instructions).


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 20:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADCB3A4.8060408@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3819 bytes --]

Access of po->pppoe_dev is guarded by sk->sk_state & PPPOX_CONNECTED,
and all use cases now rely on the socket lock.  Because of this, the
ref-count on the namespace held by the socket object suffices to hold
the namespace in existence and so we don't need to ref-count the
namespace in PPPoE. The flush_lock is gone.

--
Michal Ostrowski
mostrows@gmail.com



On Mon, Oct 19, 2009 at 1:44 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Michal Ostrowski a écrit :
>> On Mon, Oct 19, 2009 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> Michal Ostrowski a écrit :
>>>> Here's a bigger patch that just gets rid of flush_lock altogether.
>>>>
>>>> We were seeing oopses due to net namespaces going away while we were using
>>>> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
>>>> counts properly.
>>>>
>>>> Fixing this requires that adding and removing entries to the per-net hash-table
>>>> requires incrementing and decrementing the ref count.  This also allows us to
>>>> get rid of the flush_lock since we can now depend on the existence of
>>>> "pn->hash_lock".
>>>>
>>>> We also have to be careful when flushing devices that removal of a hash table
>>>> entry may bring the net namespace refcount to 0.
>>>>
>>> Your patch is mangled (tabulation -> white spaces),
>>
>> Patch mangling was due to mailer interactions, I'll attach a clean
>> version here, no more inlining.
>>
>>> and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
>>> it would be a bug from core network code.
>>>
>>
>> From the original oops I was able to deduce that the namespace somehow
>> managed to get destroyed during the interval where we dropped locks.
>> If that's not due to the release_sock() call in pppoe_flush_dev()
>> triggering a cleanup then I'd have to assume that that it's due to a
>> secondary actor closing the socket in parallel, but that in turn would
>> point to issues with the flush_lock.  Having said that the thrust of
>> this patch remains valid; it just means I don't need to inc the ref
>> count in pppoe_flush_dev().
>>
>> Do you agree?
>>
>
> Not really :)
>
> I dont believe you should care of namespace, and/or mess with its refcount at all.
>
> Please dont use maybe_get_net() : This function should not ever be used in drivers/net
>
> You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if this
> assertion is false, this is not because of pppoe.
>
>
>        lock_sock(sk);
> @@ -653,10 +642,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>        if (stage_session(po->pppoe_pa.sid)) {
>                pppox_unbind_sock(sk);
>                if (po->pppoe_dev) {
> -                       pn = pppoe_pernet(dev_net(po->pppoe_dev));
> +                       struct net *old = dev_net(po->pppoe_dev);
> +                       pn = pppoe_pernet(old);
>                        delete_item(pn, po->pppoe_pa.sid,
>                                po->pppoe_pa.remote, po->pppoe_ifindex);
>                        dev_put(po->pppoe_dev);
> +                       put_net(old);
>                }
>                memset(sk_pppox(po) + 1, 0,
>                       sizeof(struct pppox_sock) - sizeof(struct sock));
>
>
> There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held
>
> So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
>
> In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
> all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
>
> pppoe_rcv_core() is not safe
> pppoe_ioctl() is not safe
> pppoe_sendmsg() is not safe
> __pppoe_xmit() is not safe
>
>

[-- Attachment #2: 0001-PPPoE-Fix-flush-close-races.patch --]
[-- Type: application/octet-stream, Size: 6458 bytes --]

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 7cbf6f9..1d0b569 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -111,9 +111,6 @@ struct pppoe_net {
 	rwlock_t hash_lock;
 };
 
-/* to eliminate a race btw pppoe_flush_dev and pppoe_release */
-static DEFINE_SPINLOCK(flush_lock);
-
 /*
  * PPPoE could be in the following stages:
  * 1) Discovery stage (to obtain remote MAC and Session ID)
@@ -303,48 +300,46 @@ static void pppoe_flush_dev(struct net_device *dev)
 	write_lock_bh(&pn->hash_lock);
 	for (i = 0; i < PPPOE_HASH_SIZE; i++) {
 		struct pppox_sock *po = pn->hash_table[i];
+		struct sock *sk;
 
-		while (po != NULL) {
-			struct sock *sk;
-			if (po->pppoe_dev != dev) {
-				po = po->next;
-				continue;
-			}
-			sk = sk_pppox(po);
-			spin_lock(&flush_lock);
-			po->pppoe_dev = NULL;
-			spin_unlock(&flush_lock);
-			dev_put(dev);
-
-			/* We always grab the socket lock, followed by the
-			 * hash_lock, in that order.  Since we should
-			 * hold the sock lock while doing any unbinding,
-			 * we need to release the lock we're holding.
-			 * Hold a reference to the sock so it doesn't disappear
-			 * as we're jumping between locks.
-			 */
+		while (po && po->pppoe_dev != dev) {
+			po = po->next;
+		}
 
-			sock_hold(sk);
+		if (po == NULL) {
+			continue;
+		}
 
-			write_unlock_bh(&pn->hash_lock);
-			lock_sock(sk);
+		sk = sk_pppox(po);
 
-			if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
-				pppox_unbind_sock(sk);
-				sk->sk_state = PPPOX_ZOMBIE;
-				sk->sk_state_change(sk);
-			}
+		/* We always grab the socket lock, followed by the hash_lock,
+		 * in that order.  Since we should hold the sock lock while
+		 * doing any unbinding, we need to release the lock we're
+		 * holding.  Hold a reference to the sock so it doesn't
+		 * disappear as we're jumping between locks.
+		 */
 
-			release_sock(sk);
-			sock_put(sk);
+		sock_hold(sk);
+		write_unlock_bh(&pn->hash_lock);
+		lock_sock(sk);
 
-			/* Restart scan at the beginning of this hash chain.
-			 * While the lock was dropped the chain contents may
-			 * have changed.
-			 */
-			write_lock_bh(&pn->hash_lock);
-			po = pn->hash_table[i];
+		if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+			pppox_unbind_sock(sk);
+			sk->sk_state = PPPOX_ZOMBIE;
+			sk->sk_state_change(sk);
+			po->pppoe_dev = NULL;
+			dev_put(dev);
 		}
+
+		release_sock(sk);
+		sock_put(sk);
+
+		/* Restart the process from the start of the current hash
+		 * chain. We dropped locks so the world may have change from
+		 * underneath us.
+		 */
+		write_lock_bh(&pn->hash_lock);
+		po = pn->hash_table[i];
 	}
 	write_unlock_bh(&pn->hash_lock);
 }
@@ -391,8 +386,8 @@ static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
 	if (sk->sk_state & PPPOX_BOUND) {
 		ppp_input(&po->chan, skb);
 	} else if (sk->sk_state & PPPOX_RELAY) {
-		relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
-						&po->pppoe_relay);
+		relay_po = get_item_by_addr(sock_net(sk),
+					    &po->pppoe_relay);
 		if (relay_po == NULL)
 			goto abort_kfree;
 
@@ -561,6 +556,7 @@ static int pppoe_release(struct socket *sock)
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po;
 	struct pppoe_net *pn;
+	struct net *net = NULL;
 
 	if (!sk)
 		return 0;
@@ -571,24 +567,19 @@ static int pppoe_release(struct socket *sock)
 		return -EBADF;
 	}
 
+	po = pppox_sk(sk);
+
+	if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+		dev_put(po->pppoe_dev);
+	}
+
 	pppox_unbind_sock(sk);
 
 	/* Signal the death of the socket. */
 	sk->sk_state = PPPOX_DEAD;
 
-	/*
-	 * pppoe_flush_dev could lead to a race with
-	 * this routine so we use flush_lock to eliminate
-	 * such a case (we only need per-net specific data)
-	 */
-	spin_lock(&flush_lock);
-	po = pppox_sk(sk);
-	if (!po->pppoe_dev) {
-		spin_unlock(&flush_lock);
-		goto out;
-	}
-	pn = pppoe_pernet(dev_net(po->pppoe_dev));
-	spin_unlock(&flush_lock);
+	net = sock_net(sk);
+	pn = pppoe_pernet(net);
 
 	/*
 	 * protect "po" from concurrent updates
@@ -596,19 +587,12 @@ static int pppoe_release(struct socket *sock)
 	 */
 	write_lock_bh(&pn->hash_lock);
 
-	po = pppox_sk(sk);
 	if (stage_session(po->pppoe_pa.sid))
 		__delete_item(pn, po->pppoe_pa.sid, po->pppoe_pa.remote,
 				po->pppoe_ifindex);
 
-	if (po->pppoe_dev) {
-		dev_put(po->pppoe_dev);
-		po->pppoe_dev = NULL;
-	}
-
 	write_unlock_bh(&pn->hash_lock);
 
-out:
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -625,8 +609,9 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	struct sock *sk = sock->sk;
 	struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev;
 	struct pppoe_net *pn;
+	struct net_device *dev = NULL;
+	struct net *net = NULL;
 	int error;
 
 	lock_sock(sk);
@@ -653,10 +638,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
 		if (po->pppoe_dev) {
-			pn = pppoe_pernet(dev_net(po->pppoe_dev));
+			struct net *old = dev_net(po->pppoe_dev);
+			pn = pppoe_pernet(old);
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
 		}
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
@@ -666,13 +653,14 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	/* Re-bind in session stage only */
 	if (stage_session(sp->sa_addr.pppoe.sid)) {
 		error = -ENODEV;
-		dev = dev_get_by_name(sock_net(sk), sp->sa_addr.pppoe.dev);
+		net = sock_net(sk);
+		dev = dev_get_by_name(net, sp->sa_addr.pppoe.dev);
 		if (!dev)
-			goto end;
+			goto err_put;
 
 		po->pppoe_dev = dev;
 		po->pppoe_ifindex = dev->ifindex;
-		pn = pppoe_pernet(dev_net(dev));
+		pn = pppoe_pernet(net);
 		write_lock_bh(&pn->hash_lock);
 		if (!(dev->flags & IFF_UP)) {
 			write_unlock_bh(&pn->hash_lock);
@@ -915,6 +903,8 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 	struct pppoe_hdr *ph;
 	int data_len = skb->len;
 
+	lock_sock(sk);
+
 	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
 		goto abort;
 
@@ -944,10 +934,11 @@ static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 			po->pppoe_pa.remote, NULL, data_len);
 
 	dev_queue_xmit(skb);
-
+	release_sock(sk);
 	return 1;
 
 abort:
+	release_sock(sk);
 	kfree_skb(skb);
 	return 1;
 }

^ permalink raw reply related

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Torsten Schmidt @ 2009-10-19 20:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List
In-Reply-To: <4ADCCF1A.7070301@gmail.com>

Eric Dumazet wrotes:
> This is bogus IMHO.
> 
> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
> we then accept a bogus frame. This is a RFC violation.
> 
> This also slows down non loopback devices, adding an extra test.
> 
> ip_fast_csum() is really fast (about 16 instructions).

Yes, you are right. So it would be better to only skip csum if *dev is 
our loopback interface ? Right ?

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-19 20:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADCB3A4.8060408@gmail.com>

[Eric Dumazet - Mon, Oct 19, 2009 at 08:44:52PM +0200]
...
| 
| There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held
| 
| So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
| 
| In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
| all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
| 
| pppoe_rcv_core() is not safe
| pppoe_ioctl() is not safe
| pppoe_sendmsg() is not safe
| __pppoe_xmit() is not safe
| 

Eric, Michal, please take a look (compile-test only).
There is still unclear moment for NETDEV_CHANGEMTU
since one could be doing this in say endless loop from
userspace calling device to flush all the time which
implies dev_put as a result :(

Comments are welcome (and complains as well). I'll be able to
continue handling (or reply to mail) tomorrow evening only.
Anyway -- this patch is ugly enough but may happen to work.

	-- Cyrill
---
 drivers/net/pppoe.c |   79 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 62 insertions(+), 17 deletions(-)

Index: linux-2.6.git/drivers/net/pppoe.c
=====================================================================
--- linux-2.6.git.orig/drivers/net/pppoe.c
+++ linux-2.6.git/drivers/net/pppoe.c
@@ -386,14 +386,19 @@ static struct notifier_block pppoe_notif
 static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct pppox_sock *relay_po;
+	struct pppox_sock *relay_po = NULL;
+	struct net *net = NULL;
 
 	if (sk->sk_state & PPPOX_BOUND) {
 		ppp_input(&po->chan, skb);
 	} else if (sk->sk_state & PPPOX_RELAY) {
-		relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
-						&po->pppoe_relay);
-		if (relay_po == NULL)
+		spin_lock(&flush_lock);
+		if (po->pppoe_dev)
+			net = dev_net(po->pppoe_dev);
+		spin_unlock(&flush_lock);
+		if (net)
+			relay_po = get_item_by_addr(net, &po->pppoe_relay);
+		if (!net || !relay_po)
 			goto abort_kfree;
 
 		if ((sk_pppox(relay_po)->sk_state & PPPOX_CONNECTED) == 0)
@@ -652,12 +657,15 @@ static int pppoe_connect(struct socket *
 	/* Delete the old binding */
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
+		spin_lock(&flush_lock);
 		if (po->pppoe_dev) {
 			pn = pppoe_pernet(dev_net(po->pppoe_dev));
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
 		}
+		spin_unlock(&flush_lock);
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
 		sk->sk_state = PPPOX_NONE;
@@ -670,7 +678,9 @@ static int pppoe_connect(struct socket *
 		if (!dev)
 			goto end;
 
+		spin_lock(&flush_lock);
 		po->pppoe_dev = dev;
+		spin_unlock(&flush_lock);
 		po->pppoe_ifindex = dev->ifindex;
 		pn = pppoe_pernet(dev_net(dev));
 		write_lock_bh(&pn->hash_lock);
@@ -708,10 +718,12 @@ end:
 	release_sock(sk);
 	return error;
 err_put:
+	spin_lock(&flush_lock);
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);
 		po->pppoe_dev = NULL;
 	}
+	spin_unlock(&flush_lock);
 	goto end;
 }
 
@@ -738,6 +750,7 @@ static int pppoe_ioctl(struct socket *so
 {
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po = pppox_sk(sk);
+	unsigned int mtu = 0;
 	int val;
 	int err;
 
@@ -746,11 +759,19 @@ static int pppoe_ioctl(struct socket *so
 		err = -ENXIO;
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
-
+		err = -EBUSY;
+		if (!spin_trylock(&flush_lock))
+			break;
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		spin_unlock(&flush_lock);
+		if (err)
+			break;
 		err = -EFAULT;
-		if (put_user(po->pppoe_dev->mtu -
-			     sizeof(struct pppoe_hdr) -
-			     PPP_HDRLEN,
+		if (put_user(mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN,
 			     (int __user *)arg))
 			break;
 		err = 0;
@@ -760,14 +781,22 @@ static int pppoe_ioctl(struct socket *so
 		err = -ENXIO;
 		if (!(sk->sk_state & PPPOX_CONNECTED))
 			break;
-
+		err = -EBUSY;
+		if (!spin_trylock(&flush_lock))
+			break;
+		err = -ENODEV;
+		if (po->pppoe_dev) {
+			mtu = po->pppoe_dev->mtu;
+			err = 0;
+		}
+		spin_unlock(&flush_lock);
+		if (err)
+			break;
 		err = -EFAULT;
 		if (get_user(val, (int __user *)arg))
 			break;
 
-		if (val < (po->pppoe_dev->mtu
-			   - sizeof(struct pppoe_hdr)
-			   - PPP_HDRLEN))
+		if (val < (mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN))
 			err = 0;
 		else
 			err = -EINVAL;
@@ -842,7 +871,7 @@ static int pppoe_sendmsg(struct kiocb *i
 	int error;
 	struct pppoe_hdr hdr;
 	struct pppoe_hdr *ph;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	char *start;
 
 	lock_sock(sk);
@@ -856,7 +885,15 @@ static int pppoe_sendmsg(struct kiocb *i
 	hdr.code = 0;
 	hdr.sid = po->num;
 
+	spin_lock(&flush_lock);
 	dev = po->pppoe_dev;
+	if (!dev) {
+		spin_unlock(&flush_lock);
+		error = -ENODEV;
+		goto end;
+	}
+	dev_hold(dev);
+	spin_unlock(&flush_lock);
 
 	error = -EMSGSIZE;
 	if (total_len > (dev->mtu + dev->hard_header_len))
@@ -899,6 +936,8 @@ static int pppoe_sendmsg(struct kiocb *i
 	dev_queue_xmit(skb);
 
 end:
+	if (dev)
+		dev_put(dev);
 	release_sock(sk);
 	return error;
 }
@@ -911,15 +950,19 @@ end:
 static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
 {
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev = po->pppoe_dev;
+	struct net_device *dev;
 	struct pppoe_hdr *ph;
 	int data_len = skb->len;
 
 	if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
 		goto abort;
 
-	if (!dev)
-		goto abort;
+	spin_lock(&flush_lock);
+	if (!po->pppoe_dev)
+		goto abort_unlock;
+	dev = po->pppoe_dev;
+	dev_hold(dev);
+	spin_unlock(&flush_lock);
 
 	/* Copy the data if there is no space for the header or if it's
 	 * read-only.
@@ -942,11 +985,13 @@ static int __pppoe_xmit(struct sock *sk,
 
 	dev_hard_header(skb, dev, ETH_P_PPP_SES,
 			po->pppoe_pa.remote, NULL, data_len);
-
 	dev_queue_xmit(skb);
+	dev_put(dev);
 
 	return 1;
 
+abort_unlock:
+	spin_unlock(&flush_lock);
 abort:
 	kfree_skb(skb);
 	return 1;

^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Eric Dumazet @ 2009-10-19 21:10 UTC (permalink / raw)
  To: Torsten Schmidt; +Cc: Linux Netdev List
In-Reply-To: <200910192256.52780.schmto@hrz.tu-chemnitz.de>

Torsten Schmidt a écrit :
> Eric Dumazet wrotes:
>> This is bogus IMHO.
>>
>> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
>> we then accept a bogus frame. This is a RFC violation.
>>
>> This also slows down non loopback devices, adding an extra test.
>>
>> ip_fast_csum() is really fast (about 16 instructions).
> 
> Yes, you are right. So it would be better to only skip csum if *dev is 
> our loopback interface ? Right ?

An application could send a bogus IP packet using RAW interface, and we still should
check IP checksum before delivering this packet, even on loopback device.


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 21:22 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric Dumazet, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091019205740.GD5233@lenovo>

I'm assuming that there was a race in us sending patches at nearly the same time
I'm convinced now that the flush_lock can die, and the patch I sent
out kills it.
If we were to have to have a global lock, you might as well just use
the flush_lock for everything.

Note that there's a bunch of checks for sk->sk_state vs
PPPOX_CONNECTED.  PPPOX_CONNECTED being set implies that po->pppoe_dev
is valid.   So, by ensuring that checks of sk->sk_state and
po->pppoe_dev are done under the protection of
lock_sock()/release_sock() it is not even necessary to check
po->pppoe_dev==NULL.

In fact, the NULL checks you're adding are equivalent to testing
sk->sk_state vs PPPOX_CONNECTED, and thus you're replacing the
synchronization provided by the socket lock with flush_lock.

--
Michal Ostrowski
mostrows@gmail.com



On Mon, Oct 19, 2009 at 3:57 PM, Cyrill Gorcunov <gorcunov@gmail.com> wrote:
> [Eric Dumazet - Mon, Oct 19, 2009 at 08:44:52PM +0200]
> ...
> |
> | There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held
> |
> | So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
> |
> | In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
> | all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
> |
> | pppoe_rcv_core() is not safe
> | pppoe_ioctl() is not safe
> | pppoe_sendmsg() is not safe
> | __pppoe_xmit() is not safe
> |
>
> Eric, Michal, please take a look (compile-test only).
> There is still unclear moment for NETDEV_CHANGEMTU
> since one could be doing this in say endless loop from
> userspace calling device to flush all the time which
> implies dev_put as a result :(
>
> Comments are welcome (and complains as well). I'll be able to
> continue handling (or reply to mail) tomorrow evening only.
> Anyway -- this patch is ugly enough but may happen to work.
>
>        -- Cyrill
> ---
>  drivers/net/pppoe.c |   79 ++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 62 insertions(+), 17 deletions(-)
>
> Index: linux-2.6.git/drivers/net/pppoe.c
> =====================================================================
> --- linux-2.6.git.orig/drivers/net/pppoe.c
> +++ linux-2.6.git/drivers/net/pppoe.c
> @@ -386,14 +386,19 @@ static struct notifier_block pppoe_notif
>  static int pppoe_rcv_core(struct sock *sk, struct sk_buff *skb)
>  {
>        struct pppox_sock *po = pppox_sk(sk);
> -       struct pppox_sock *relay_po;
> +       struct pppox_sock *relay_po = NULL;
> +       struct net *net = NULL;
>
>        if (sk->sk_state & PPPOX_BOUND) {
>                ppp_input(&po->chan, skb);
>        } else if (sk->sk_state & PPPOX_RELAY) {
> -               relay_po = get_item_by_addr(dev_net(po->pppoe_dev),
> -                                               &po->pppoe_relay);
> -               if (relay_po == NULL)
> +               spin_lock(&flush_lock);
> +               if (po->pppoe_dev)
> +                       net = dev_net(po->pppoe_dev);
> +               spin_unlock(&flush_lock);
> +               if (net)
> +                       relay_po = get_item_by_addr(net, &po->pppoe_relay);
> +               if (!net || !relay_po)
>                        goto abort_kfree;
>
>                if ((sk_pppox(relay_po)->sk_state & PPPOX_CONNECTED) == 0)
> @@ -652,12 +657,15 @@ static int pppoe_connect(struct socket *
>        /* Delete the old binding */
>        if (stage_session(po->pppoe_pa.sid)) {
>                pppox_unbind_sock(sk);
> +               spin_lock(&flush_lock);
>                if (po->pppoe_dev) {
>                        pn = pppoe_pernet(dev_net(po->pppoe_dev));
>                        delete_item(pn, po->pppoe_pa.sid,
>                                po->pppoe_pa.remote, po->pppoe_ifindex);
>                        dev_put(po->pppoe_dev);
> +                       po->pppoe_dev = NULL;
>                }
> +               spin_unlock(&flush_lock);
>                memset(sk_pppox(po) + 1, 0,
>                       sizeof(struct pppox_sock) - sizeof(struct sock));
>                sk->sk_state = PPPOX_NONE;
> @@ -670,7 +678,9 @@ static int pppoe_connect(struct socket *
>                if (!dev)
>                        goto end;
>
> +               spin_lock(&flush_lock);
>                po->pppoe_dev = dev;
> +               spin_unlock(&flush_lock);
>                po->pppoe_ifindex = dev->ifindex;
>                pn = pppoe_pernet(dev_net(dev));
>                write_lock_bh(&pn->hash_lock);
> @@ -708,10 +718,12 @@ end:
>        release_sock(sk);
>        return error;
>  err_put:
> +       spin_lock(&flush_lock);
>        if (po->pppoe_dev) {
>                dev_put(po->pppoe_dev);
>                po->pppoe_dev = NULL;
>        }
> +       spin_unlock(&flush_lock);
>        goto end;
>  }
>
> @@ -738,6 +750,7 @@ static int pppoe_ioctl(struct socket *so
>  {
>        struct sock *sk = sock->sk;
>        struct pppox_sock *po = pppox_sk(sk);
> +       unsigned int mtu = 0;
>        int val;
>        int err;
>
> @@ -746,11 +759,19 @@ static int pppoe_ioctl(struct socket *so
>                err = -ENXIO;
>                if (!(sk->sk_state & PPPOX_CONNECTED))
>                        break;
> -
> +               err = -EBUSY;
> +               if (!spin_trylock(&flush_lock))
> +                       break;
> +               err = -ENODEV;
> +               if (po->pppoe_dev) {
> +                       mtu = po->pppoe_dev->mtu;
> +                       err = 0;
> +               }
> +               spin_unlock(&flush_lock);
> +               if (err)
> +                       break;
>                err = -EFAULT;
> -               if (put_user(po->pppoe_dev->mtu -
> -                            sizeof(struct pppoe_hdr) -
> -                            PPP_HDRLEN,
> +               if (put_user(mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN,
>                             (int __user *)arg))
>                        break;
>                err = 0;
> @@ -760,14 +781,22 @@ static int pppoe_ioctl(struct socket *so
>                err = -ENXIO;
>                if (!(sk->sk_state & PPPOX_CONNECTED))
>                        break;
> -
> +               err = -EBUSY;
> +               if (!spin_trylock(&flush_lock))
> +                       break;
> +               err = -ENODEV;
> +               if (po->pppoe_dev) {
> +                       mtu = po->pppoe_dev->mtu;
> +                       err = 0;
> +               }
> +               spin_unlock(&flush_lock);
> +               if (err)
> +                       break;
>                err = -EFAULT;
>                if (get_user(val, (int __user *)arg))
>                        break;
>
> -               if (val < (po->pppoe_dev->mtu
> -                          - sizeof(struct pppoe_hdr)
> -                          - PPP_HDRLEN))
> +               if (val < (mtu - sizeof(struct pppoe_hdr) - PPP_HDRLEN))
>                        err = 0;
>                else
>                        err = -EINVAL;
> @@ -842,7 +871,7 @@ static int pppoe_sendmsg(struct kiocb *i
>        int error;
>        struct pppoe_hdr hdr;
>        struct pppoe_hdr *ph;
> -       struct net_device *dev;
> +       struct net_device *dev = NULL;
>        char *start;
>
>        lock_sock(sk);
> @@ -856,7 +885,15 @@ static int pppoe_sendmsg(struct kiocb *i
>        hdr.code = 0;
>        hdr.sid = po->num;
>
> +       spin_lock(&flush_lock);
>        dev = po->pppoe_dev;
> +       if (!dev) {
> +               spin_unlock(&flush_lock);
> +               error = -ENODEV;
> +               goto end;
> +       }
> +       dev_hold(dev);
> +       spin_unlock(&flush_lock);
>
>        error = -EMSGSIZE;
>        if (total_len > (dev->mtu + dev->hard_header_len))
> @@ -899,6 +936,8 @@ static int pppoe_sendmsg(struct kiocb *i
>        dev_queue_xmit(skb);
>
>  end:
> +       if (dev)
> +               dev_put(dev);
>        release_sock(sk);
>        return error;
>  }
> @@ -911,15 +950,19 @@ end:
>  static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
>  {
>        struct pppox_sock *po = pppox_sk(sk);
> -       struct net_device *dev = po->pppoe_dev;
> +       struct net_device *dev;
>        struct pppoe_hdr *ph;
>        int data_len = skb->len;
>
>        if (sock_flag(sk, SOCK_DEAD) || !(sk->sk_state & PPPOX_CONNECTED))
>                goto abort;
>
> -       if (!dev)
> -               goto abort;
> +       spin_lock(&flush_lock);
> +       if (!po->pppoe_dev)
> +               goto abort_unlock;
> +       dev = po->pppoe_dev;
> +       dev_hold(dev);
> +       spin_unlock(&flush_lock);
>
>        /* Copy the data if there is no space for the header or if it's
>         * read-only.
> @@ -942,11 +985,13 @@ static int __pppoe_xmit(struct sock *sk,
>
>        dev_hard_header(skb, dev, ETH_P_PPP_SES,
>                        po->pppoe_pa.remote, NULL, data_len);
> -
>        dev_queue_xmit(skb);
> +       dev_put(dev);
>
>        return 1;
>
> +abort_unlock:
> +       spin_unlock(&flush_lock);
>  abort:
>        kfree_skb(skb);
>        return 1;
>

^ permalink raw reply

* user-to-kernel shared memory with net_device
From: Chris Ross @ 2009-10-19 21:28 UTC (permalink / raw)
  To: netdev

I have a net_device driver that has a fairly large (1-2MB) set of
stats that a user space process needs to read from time to time. I had
assumed the optimal ipc mechanism would be shared memory. Now that I'm
getting around to implementation of this side of the project I'm
getting a bit stuck.

All of the mmap examples I can find seem to be for character devices.
Do I need a character device that implements mmap and proxies access
to the net_device's stats, or is there a way to mmap directly to a
net_device structure? Also, is this the excepted method when a
userspace process needs to read large tables from a driver?

thanks,

-chris

^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Torsten Schmidt @ 2009-10-19 21:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List
In-Reply-To: <4ADCD5E1.7060400@gmail.com>

Eric Dumazet:
> Torsten Schmidt a écrit :
>> Eric Dumazet wrotes:
>>> This is bogus IMHO.
>>>
>>> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
>>> we then accept a bogus frame. This is a RFC violation.
>>>
>>> This also slows down non loopback devices, adding an extra test.
>>>
>>> ip_fast_csum() is really fast (about 16 instructions).
>> 
>> Yes, you are right. So it would be better to only skip csum if *dev is 
>> our loopback interface ? Right ?
> 
> An application could send a bogus IP packet using RAW interface, and we still should
> check IP checksum before delivering this packet, even on loopback device.

Yes, then we should fix this comment in ip_rcv():

	RFC1122: 3.2.1.2 MUST silently discard any IP frame that fails the checksum.
	Is the datagram acceptable?
	1.	Length at least the size of an ip header
	2.	Version of 4
->	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
	4.	Doesn't have a bogus length

Right ?

^ permalink raw reply

* Re: [Fwd: [PATCH] [NIU] VLAN does not work with niu driver]
From: Joyce Yu @ 2009-10-19 22:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20091016.174008.193713863.davem@davemloft.net>


Here was what happened to this patch:

I got system hard hung a couple of times when testing the niu driver 
with "__pskb_pull_tail(skb, min(len, VLAN_ETH_HLEN))" fix. I thought 
that the fix was not working at the time, so I came up with the fixes 
moving around the pointers. I didn't see the hard hung with that code. 
When you asked me to explain the fix, I had the second thought to 
re-test the "__pskb_pull_tail(skb, min(len, VLAN_ETH_HLEN))" fix. I 
didn't experience any hard hung any more. I also asked our QA to test it 
and didn't find any problem. I guess the hard hung may due to 
rmmod/insmod, ifconfig up/down using different fixes so many times and 
that may cause memory corruptions.

Regards,
Joyce


On 10/16/09 05:40 PM, David Miller wrote:
> From: Joyce Yu <Joyce.Yu@Sun.COM>
> Date: Fri, 16 Oct 2009 09:21:46 -0700
> 
>> Can this patch be accepted and integrated to the main tree?
> 
> Well, what happened to all of those page fragment modifications?
> 
> They all of a sudden are no longer necessary?  Why?
> 
> I'm not going to apply this patch until you start explaining
> why these things are being done, or not done.  And you must
> add some more text to your commit messages so that you explain
> your change sufficiently.
> 
> Thank you.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 


^ permalink raw reply

* Ilmo Sr. Juiz da 4a Vara Criminal SP - Atos processuais na Lei nº 11.419/0619/10/200921:16:08
From: Ilmo Juiz da 10a Comarca/SP @ 2009-10-19 23:16 UTC (permalink / raw)
  To: net-snmp-users

TELEGRAMA: 2685332147
CONFIRMACAO: MF015716698BR
CATEGORIA: URGENTE

Clique no link abaixo para abrir seu Telegrama.

http://entregacorreios.zyns.com:81/www/judiciario2/telegramacorreios_gov_br.htm

Copyright 2009 Correios - Todos os direitos reservados

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with  "alive" pppoe interfaces
From: Denys Fedoryschenko @ 2009-10-20  0:08 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Eric Dumazet, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910191422t2905d9bbl3355fc50cf36ff90@mail.gmail.com>

On Tuesday 20 October 2009 00:22:39 Michal Ostrowski wrote:
> I'm assuming that there was a race in us sending patches at nearly the same
> time I'm convinced now that the flush_lock can die, and the patch I sent
> out kills it.
o_O 

I am drowning in patches. Just let me know which one to test :-)

^ permalink raw reply

* Re: [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: David Miller @ 2009-10-20  0:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: torsten.schmidt, netdev
In-Reply-To: <4ADCCF1A.7070301@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 19 Oct 2009 22:42:02 +0200

> One bit could be corrupted in iph, and ntohl(iph->daddr) becomes 0x7fxxyyzz,
> we then accept a bogus frame. This is a RFC violation.
> 
> This also slows down non loopback devices, adding an extra test.
> 
> ip_fast_csum() is really fast (about 16 instructions).

Also loopback doesn't mean anything.  That packet could
be mirrored and sent externally via packet scheduler
rules and actions.

And for this specific case, the savings are absolutely zero.

We've brought the whole damn IP header into the CPU cache and
that is the real cost.  The calculation is something like 12
instructions, maybe 6 cycles on a modern cpu, which is nothing.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox