Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH] IPv4: skip loopback checksums in ip_rcv()
From: Torsten Schmidt @ 2009-10-19 19:34 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Skip loopback checksum in ip_rcv() for speed optimisation.

Signed-off-by: Torsten Schmidt <schmto@hrz.tu-chemnitz.de>
---
 net/ipv4/ip_input.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 6c98b43..dc72286 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -406,7 +406,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 *
 	 *	1.	Length at least the size of an ip header
 	 *	2.	Version of 4
-	 *	3.	Checksums correctly. [Speed optimisation for later, skip loopback checksums]
+	 *	3.	Checksums correctly. [Speed optimisation: skip loopback checksums]
 	 *	4.	Doesn't have a bogus length
 	 */
 
@@ -418,8 +418,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	iph = ip_hdr(skb);
 
-	if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
-		goto inhdr_error;
+	if (!ipv4_is_loopback(iph->daddr))
+		if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
+			goto inhdr_error;
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-- 

^ permalink raw reply related

* Re: [PATCH] tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT
From: Eric Dumazet @ 2009-10-19 20:11 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David S. Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910192302000.2971@u.domain.uli>

Julian Anastasov a écrit :
> 	Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
> users to not retransmit SYN-ACKs during the deferring period if
> ACK from client was received. The goal is to reduce traffic
> during the deferring period. When the period is finished
> we continue with sending SYN-ACKs (at least one) but this time
> any traffic from client will change the request to established
> socket allowing application to terminate it properly.
> Also, do not drop acked request if sending of SYN-ACK fails.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
> 

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>


^ permalink raw reply

* Re: TCP_DEFER_ACCEPT is missing counter update
From: Willy Tarreau @ 2009-10-19 20:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Julian Anastasov, David Miller, netdev
In-Reply-To: <4ADCC58B.2060408@gmail.com>

On Mon, Oct 19, 2009 at 10:01:15PM +0200, Eric Dumazet wrote:
> David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea
> (tcp: fix tcp_defer_accept to consider the timeout) 
> since we know its broken.

yes I agree with Eric, please revert it and wait for Julian's
patch which gets it right. Sorry for the wrong fix, at least
it will have brought the discussion on the table, but without
a faulty patch it would have been better :-/

Thanks,
Willy

^ permalink raw reply

* [PATCH] tcp: fix TCP_DEFER_ACCEPT retrans calculation
From: Julian Anastasov @ 2009-10-19 20:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev


	Fix TCP_DEFER_ACCEPT conversion between seconds and
retransmission to match the TCP SYN-ACK retransmission periods
because the time is converted to such retransmissions. The old
algorithm selects one more retransmission in some cases. Allow
up to 255 retransmissions.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

	Not sure if this is the right place to put both
functions ...

diff -urp v2.6.31/linux/net/ipv4/tcp.c linux/net/ipv4/tcp.c
--- v2.6.31/linux/net/ipv4/tcp.c	2009-09-11 10:27:17.000000000 +0300
+++ linux/net/ipv4/tcp.c	2009-10-17 16:37:48.000000000 +0300
@@ -326,6 +326,43 @@ void tcp_enter_memory_pressure(struct so
 
 EXPORT_SYMBOL(tcp_enter_memory_pressure);
 
+/* Convert seconds to retransmits based on initial and max timeout */
+static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
+{
+	u8 res = 0;
+
+	if (seconds > 0) {
+		int period = timeout;
+
+		res = 1;
+		while (seconds > period && res < 255) {
+			res++;
+			timeout <<= 1;
+			if (timeout > rto_max)
+				timeout = rto_max;
+			period += timeout;
+		}
+	}
+	return res;
+}
+
+/* Convert retransmits to seconds based on initial and max timeout */
+static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+{
+	int period = 0;
+
+	if (retrans > 0) {
+		period = timeout;
+		while (--retrans) {
+			timeout <<= 1;
+			if (timeout > rto_max)
+				timeout = rto_max;
+			period += timeout;
+		}
+	}
+	return period;
+}
+
 /*
  *	Wait for a TCP event.
  *
@@ -2163,16 +2200,10 @@ static int do_tcp_setsockopt(struct sock
 		break;
 
 	case TCP_DEFER_ACCEPT:
-		icsk->icsk_accept_queue.rskq_defer_accept = 0;
-		if (val > 0) {
-			/* Translate value in seconds to number of
-			 * retransmits */
-			while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
-			       val > ((TCP_TIMEOUT_INIT / HZ) <<
-				       icsk->icsk_accept_queue.rskq_defer_accept))
-				icsk->icsk_accept_queue.rskq_defer_accept++;
-			icsk->icsk_accept_queue.rskq_defer_accept++;
-		}
+		/* Translate value in seconds to number of retransmits */
+		icsk->icsk_accept_queue.rskq_defer_accept =
+			secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+					TCP_RTO_MAX / HZ);
 		break;
 
 	case TCP_WINDOW_CLAMP:
@@ -2353,8 +2384,8 @@ static int do_tcp_getsockopt(struct sock
 			val = (val ? : sysctl_tcp_fin_timeout) / HZ;
 		break;
 	case TCP_DEFER_ACCEPT:
-		val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 :
-			((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1));
+		val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
+				      TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
 		break;
 	case TCP_WINDOW_CLAMP:
 		val = tp->window_clamp;

^ permalink raw reply

* Re: [PATCH] tcp: accept socket after TCP_DEFER_ACCEPT period
From: Eric Dumazet @ 2009-10-19 20:04 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: David S. Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910192259280.2971@u.domain.uli>

Julian Anastasov a écrit :
> 	Willy Tarreau and many other folks in recent years
> were concerned what happens when the TCP_DEFER_ACCEPT period
> expires for clients which sent ACK packet. They prefer clients
> that actively resend ACK on our SYN-ACK retransmissions to be
> converted from open requests to sockets and queued to the
> listener for accepting after the deferring period is finished.
> Then application server can decide to wait longer for data
> or to properly terminate the connection with FIN if read()
> returns EAGAIN which is an indication for accepting after
> the deferring period. This change still can have side effects
> for applications that expect always to see data on the accepted
> socket. Others can be prepared to work in both modes (with or
> without TCP_DEFER_ACCEPT period) and their data processing can
> ignore the read=EAGAIN notification and to allocate resources for
> clients which proved to have no data to send during the deferring
> period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
> as a timeout) to wait for data will notice clients that didn't
> send data for 3 seconds but that still resend ACKs.
> Thanks to Willy Tarreau for the initial idea and to
> Eric Dumazet for the review and testing the change.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>
> ---
> 
> diff -urp v2.6.31/linux/net/ipv4/tcp_minisocks.c linux/net/ipv4/tcp_minisocks.c
> --- v2.6.31/linux/net/ipv4/tcp_minisocks.c	2009-09-11 10:27:17.000000000 +0300
> +++ linux/net/ipv4/tcp_minisocks.c	2009-10-16 10:29:19.000000000 +0300
> @@ -641,8 +641,8 @@ struct sock *tcp_check_req(struct sock *
>  	if (!(flg & TCP_FLAG_ACK))
>  		return NULL;
>  
> -	/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
> -	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
> +	/* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
> +	if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
>  	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
>  		inet_rsk(req)->acked = 1;
>  		return NULL;
> --

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks Julian

^ permalink raw reply

* [PATCH] tcp: reduce SYN-ACK retrans for TCP_DEFER_ACCEPT
From: Julian Anastasov @ 2009-10-19 20:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev


	Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
users to not retransmit SYN-ACKs during the deferring period if
ACK from client was received. The goal is to reduce traffic
during the deferring period. When the period is finished
we continue with sending SYN-ACKs (at least one) but this time
any traffic from client will change the request to established
socket allowing application to terminate it properly.
Also, do not drop acked request if sending of SYN-ACK fails.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urp v2.6.31/linux/net/ipv4/inet_connection_sock.c linux/net/ipv4/inet_connection_sock.c
--- v2.6.31/linux/net/ipv4/inet_connection_sock.c	2009-06-13 10:53:58.000000000 +0300
+++ linux/net/ipv4/inet_connection_sock.c	2009-10-16 11:35:52.000000000 +0300
@@ -446,6 +446,28 @@ extern int sysctl_tcp_synack_retries;
 
 EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_hash_add);
 
+/* Decide when to expire the request and when to resend SYN-ACK */
+static inline void syn_ack_recalc(struct request_sock *req, const int thresh,
+				  const int max_retries,
+				  const u8 rskq_defer_accept,
+				  int *expire, int *resend)
+{
+	if (!rskq_defer_accept) {
+		*expire = req->retrans >= thresh;
+		*resend = 1;
+		return;
+	}
+	*expire = req->retrans >= thresh &&
+		  (!inet_rsk(req)->acked || req->retrans >= max_retries);
+	/*
+	 * Do not resend while waiting for data after ACK,
+	 * start to resend on end of deferring period to give
+	 * last chance for data or ACK to create established socket.
+	 */
+	*resend = !inet_rsk(req)->acked ||
+		  req->retrans >= rskq_defer_accept - 1;
+}
+
 void inet_csk_reqsk_queue_prune(struct sock *parent,
 				const unsigned long interval,
 				const unsigned long timeout,
@@ -501,9 +523,15 @@ void inet_csk_reqsk_queue_prune(struct s
 		reqp=&lopt->syn_table[i];
 		while ((req = *reqp) != NULL) {
 			if (time_after_eq(now, req->expires)) {
-				if ((req->retrans < thresh ||
-				     (inet_rsk(req)->acked && req->retrans < max_retries))
-				    && !req->rsk_ops->rtx_syn_ack(parent, req)) {
+				int expire = 0, resend = 0;
+
+				syn_ack_recalc(req, thresh, max_retries,
+					       queue->rskq_defer_accept,
+					       &expire, &resend);
+				if (!expire &&
+				    (!resend ||
+				     !req->rsk_ops->rtx_syn_ack(parent, req) ||
+				     inet_rsk(req)->acked)) {
 					unsigned long timeo;
 
 					if (req->retrans++ == 0)

^ permalink raw reply

* Re: TCP_DEFER_ACCEPT is missing counter update
From: Eric Dumazet @ 2009-10-19 20:01 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <Pine.LNX.4.58.0910171709190.10440@u.domain.uli>

Julian Anastasov a écrit :
> 	Hello,
> 
> On Sat, 17 Oct 2009, Eric Dumazet wrote:
> 
>> I really like this, but please define helper functions out of do_tcp_setsockopt()
>>
>> /* Translate value in seconds to number of SYN-ACK retransmits */
>> static u8 secs_to_retrans(int seconds)
>> {
>> 	u8 res = 0;
>>
>> 	if (seconds > 0) {
>> 		int timeout = TCP_TIMEOUT_INIT / HZ;
>> 		int period = timeout;
>>
>> 		res = 1;
>> 		while (res < 255 && seconds > period) {
>> 			res++;
>> 			timeout <<= 1;
>> 			if (timeout > TCP_RTO_MAX / HZ)
>> 				timeout = TCP_RTO_MAX / HZ;
>> 			period += timeout;
>> 		}
>> 	}
>> 	return res;
>> }
>>
>> You also need the reverse function for getsockopt()...
> 
> 	Yes, I forgot that. Here is what I tested, should
> I split it later to 2 patches? May be it should go somewhere
> in net/core/sock.c as extern funcs with EXPORT_SYMBOL to
> allow other protocols one day to use it?
> 

No need to EXPORT_SYMBOL it for the moment. We can add it later if really needed.

David, I think we should revert 6d01a026b7d3009a418326bdcf313503a314f1ea
(tcp: fix tcp_defer_accept to consider the timeout) 
since we know its broken.



^ permalink raw reply

* [PATCH] tcp: accept socket after TCP_DEFER_ACCEPT period
From: Julian Anastasov @ 2009-10-19 20:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev


	Willy Tarreau and many other folks in recent years
were concerned what happens when the TCP_DEFER_ACCEPT period
expires for clients which sent ACK packet. They prefer clients
that actively resend ACK on our SYN-ACK retransmissions to be
converted from open requests to sockets and queued to the
listener for accepting after the deferring period is finished.
Then application server can decide to wait longer for data
or to properly terminate the connection with FIN if read()
returns EAGAIN which is an indication for accepting after
the deferring period. This change still can have side effects
for applications that expect always to see data on the accepted
socket. Others can be prepared to work in both modes (with or
without TCP_DEFER_ACCEPT period) and their data processing can
ignore the read=EAGAIN notification and to allocate resources for
clients which proved to have no data to send during the deferring
period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
as a timeout) to wait for data will notice clients that didn't
send data for 3 seconds but that still resend ACKs.
Thanks to Willy Tarreau for the initial idea and to
Eric Dumazet for the review and testing the change.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

diff -urp v2.6.31/linux/net/ipv4/tcp_minisocks.c linux/net/ipv4/tcp_minisocks.c
--- v2.6.31/linux/net/ipv4/tcp_minisocks.c	2009-09-11 10:27:17.000000000 +0300
+++ linux/net/ipv4/tcp_minisocks.c	2009-10-16 10:29:19.000000000 +0300
@@ -641,8 +641,8 @@ struct sock *tcp_check_req(struct sock *
 	if (!(flg & TCP_FLAG_ACK))
 		return NULL;
 
-	/* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
-	if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+	/* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
+	if (req->retrans < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
 	    TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
 		inet_rsk(req)->acked = 1;
 		return NULL;

^ permalink raw reply

* PATCH [net-next-2.6] IP: Cleanups
From: John Dykstra @ 2009-10-19 19:31 UTC (permalink / raw)
  To: netdev

I've been sitting on these waiting to accumulate more, but that isn't
likely to happen short-term.

---
Use symbols instead of magic constants while checking PMTU discovery
setsockopt.

Remove redundant test in ip_rt_frag_needed() (done by caller).

Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
---
 net/ipv4/ip_sockglue.c   |    2 +-
 net/ipv4/route.c         |    3 ---
 net/ipv6/ipv6_sockglue.c |    2 +-
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c602d98..2445fed 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -575,7 +575,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 		inet->hdrincl = val ? 1 : 0;
 		break;
 	case IP_MTU_DISCOVER:
-		if (val < 0 || val > 3)
+		if (val < IP_PMTUDISC_DONT || val > IP_PMTUDISC_PROBE)
 			goto e_inval;
 		inet->pmtudisc = val;
 		break;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 39e10ac..68566de 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -662,7 +662,7 @@ done:
 	case IPV6_MTU_DISCOVER:
 		if (optlen < sizeof(int))
 			goto e_inval;
-		if (val<0 || val>3)
+		if (val < IP_PMTUDISC_DONT || val > IP_PMTUDISC_PROBE)
 			goto e_inval;
 		np->pmtudisc = val;
 		retv = 0;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bb41992..68fb227 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1628,9 +1628,6 @@ unsigned short ip_rt_frag_needed(struct net *net, struct iphdr *iph,
 	__be32  daddr = iph->daddr;
 	unsigned short est_mtu = 0;
 
-	if (ipv4_config.no_pmtu_disc)
-		return 0;
-
 	for (k = 0; k < 2; k++) {
 		for (i = 0; i < 2; i++) {
 			unsigned hash = rt_hash(daddr, skeys[i], ikeys[k],-- 
1.5.4.3




^ permalink raw reply related

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-19 19:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADCB3A4.8060408@gmail.com>

[Eric Dumazet - Mon, Oct 19, 2009 at 08:44:52PM +0200]
...
| 
| Not really :)
| 
| I dont believe you should care of namespace, and/or mess with its refcount at all.
| 
| Please dont use maybe_get_net() : This function should not ever be used in drivers/net
| 
| You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if this
| assertion is false, this is not because of pppoe.
| 
...
| So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
| 
| In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
| all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
| 
| pppoe_rcv_core() is not safe
| pppoe_ioctl() is not safe
| pppoe_sendmsg() is not safe
| __pppoe_xmit() is not safe
| 

Sigh... seem so (which is mostly my fault not Michal). Every time we touch pppoe_dev we
should dev_hold on it and dev_put as only done all we need. Async nature
of notifier seem to be a key here.

	-- Cyrill

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-19 18:44 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910191107h899a4ffs588f2413093dfb4b@mail.gmail.com>

Michal Ostrowski a écrit :
> On Mon, Oct 19, 2009 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Michal Ostrowski a écrit :
>>> Here's a bigger patch that just gets rid of flush_lock altogether.
>>>
>>> We were seeing oopses due to net namespaces going away while we were using
>>> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
>>> counts properly.
>>>
>>> Fixing this requires that adding and removing entries to the per-net hash-table
>>> requires incrementing and decrementing the ref count.  This also allows us to
>>> get rid of the flush_lock since we can now depend on the existence of
>>> "pn->hash_lock".
>>>
>>> We also have to be careful when flushing devices that removal of a hash table
>>> entry may bring the net namespace refcount to 0.
>>>
>> Your patch is mangled (tabulation -> white spaces),
> 
> Patch mangling was due to mailer interactions, I'll attach a clean
> version here, no more inlining.
> 
>> and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
>> it would be a bug from core network code.
>>
> 
> From the original oops I was able to deduce that the namespace somehow
> managed to get destroyed during the interval where we dropped locks.
> If that's not due to the release_sock() call in pppoe_flush_dev()
> triggering a cleanup then I'd have to assume that that it's due to a
> secondary actor closing the socket in parallel, but that in turn would
> point to issues with the flush_lock.  Having said that the thrust of
> this patch remains valid; it just means I don't need to inc the ref
> count in pppoe_flush_dev().
> 
> Do you agree?
> 

Not really :)

I dont believe you should care of namespace, and/or mess with its refcount at all.

Please dont use maybe_get_net() : This function should not ever be used in drivers/net

You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if this
assertion is false, this is not because of pppoe.


 	lock_sock(sk);
@@ -653,10 +642,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
 		if (po->pppoe_dev) {
-			pn = pppoe_pernet(dev_net(po->pppoe_dev));
+			struct net *old = dev_net(po->pppoe_dev);
+			pn = pppoe_pernet(old);
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			put_net(old);
 		}
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));


There is still a race here, since you do a dev_put(po->ppoe_dev); without any lock held

So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.

In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.

pppoe_rcv_core() is not safe
pppoe_ioctl() is not safe
pppoe_sendmsg() is not safe
__pppoe_xmit() is not safe


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 18:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADC9DE2.5010308@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1653 bytes --]

On Mon, Oct 19, 2009 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Michal Ostrowski a écrit :
>> Here's a bigger patch that just gets rid of flush_lock altogether.
>>
>> We were seeing oopses due to net namespaces going away while we were using
>> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
>> counts properly.
>>
>> Fixing this requires that adding and removing entries to the per-net hash-table
>> requires incrementing and decrementing the ref count.  This also allows us to
>> get rid of the flush_lock since we can now depend on the existence of
>> "pn->hash_lock".
>>
>> We also have to be careful when flushing devices that removal of a hash table
>> entry may bring the net namespace refcount to 0.
>>
>
> Your patch is mangled (tabulation -> white spaces),

Patch mangling was due to mailer interactions, I'll attach a clean
version here, no more inlining.

>
> and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
> it would be a bug from core network code.
>

>From the original oops I was able to deduce that the namespace somehow
managed to get destroyed during the interval where we dropped locks.
If that's not due to the release_sock() call in pppoe_flush_dev()
triggering a cleanup then I'd have to assume that that it's due to a
secondary actor closing the socket in parallel, but that in turn would
point to issues with the flush_lock.  Having said that the thrust of
this patch remains valid; it just means I don't need to inc the ref
count in pppoe_flush_dev().

Do you agree?

--
Michal Ostrowski
mostrows@gmail.com

[-- Attachment #2: 0001-PPPoE-Fix-ref-counts-on-net-namespaces.patch --]
[-- Type: application/octet-stream, Size: 5483 bytes --]

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 7cbf6f9..60bbeb2 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -111,9 +111,6 @@ struct pppoe_net {
 	rwlock_t hash_lock;
 };
 
-/* to eliminate a race btw pppoe_flush_dev and pppoe_release */
-static DEFINE_SPINLOCK(flush_lock);
-
 /*
  * PPPoE could be in the following stages:
  * 1) Discovery stage (to obtain remote MAC and Session ID)
@@ -303,48 +300,49 @@ static void pppoe_flush_dev(struct net_device *dev)
 	write_lock_bh(&pn->hash_lock);
 	for (i = 0; i < PPPOE_HASH_SIZE; i++) {
 		struct pppox_sock *po = pn->hash_table[i];
+		struct sock *sk;
 
-		while (po != NULL) {
-			struct sock *sk;
-			if (po->pppoe_dev != dev) {
-				po = po->next;
-				continue;
-			}
-			sk = sk_pppox(po);
-			spin_lock(&flush_lock);
-			po->pppoe_dev = NULL;
-			spin_unlock(&flush_lock);
-			dev_put(dev);
-
-			/* We always grab the socket lock, followed by the
-			 * hash_lock, in that order.  Since we should
-			 * hold the sock lock while doing any unbinding,
-			 * we need to release the lock we're holding.
-			 * Hold a reference to the sock so it doesn't disappear
-			 * as we're jumping between locks.
-			 */
+		while (po && po->pppoe_dev != dev) {
+			po = po->next;
+		}
 
-			sock_hold(sk);
+		if (po == NULL) {
+			continue;
+		}
 
-			write_unlock_bh(&pn->hash_lock);
-			lock_sock(sk);
+		sk = sk_pppox(po);
 
-			if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
-				pppox_unbind_sock(sk);
-				sk->sk_state = PPPOX_ZOMBIE;
-				sk->sk_state_change(sk);
-			}
+		if (po->pppoe_dev) {
+			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
+		}
 
-			release_sock(sk);
-			sock_put(sk);
+		/* We always grab the socket lock, followed by the hash_lock,
+		 * in that order.  Since we should hold the sock lock while
+		 * doing any unbinding, we need to release the lock we're
+		 * holding.  Hold a reference to the sock so it doesn't
+		 * disappear as we're jumping between locks.
+		 */
 
-			/* Restart scan at the beginning of this hash chain.
-			 * While the lock was dropped the chain contents may
-			 * have changed.
-			 */
-			write_lock_bh(&pn->hash_lock);
-			po = pn->hash_table[i];
+		sock_hold(sk);
+		write_unlock_bh(&pn->hash_lock);
+		lock_sock(sk);
+
+		if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+			pppox_unbind_sock(sk);
+			sk->sk_state = PPPOX_ZOMBIE;
+			sk->sk_state_change(sk);
 		}
+
+		release_sock(sk);
+		sock_put(sk);
+
+		/* Restart the process from the start of the current hash
+		 * chain. We dropped locks so the world may have change from
+		 * underneath us.
+		 */
+		write_unlock_bh(&pn->hash_lock);
+		po = pn->hash_table[i];
 	}
 	write_unlock_bh(&pn->hash_lock);
 }
@@ -561,6 +559,7 @@ static int pppoe_release(struct socket *sock)
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po;
 	struct pppoe_net *pn;
+	struct net *net = NULL;
 
 	if (!sk)
 		return 0;
@@ -576,19 +575,8 @@ static int pppoe_release(struct socket *sock)
 	/* Signal the death of the socket. */
 	sk->sk_state = PPPOX_DEAD;
 
-	/*
-	 * pppoe_flush_dev could lead to a race with
-	 * this routine so we use flush_lock to eliminate
-	 * such a case (we only need per-net specific data)
-	 */
-	spin_lock(&flush_lock);
-	po = pppox_sk(sk);
-	if (!po->pppoe_dev) {
-		spin_unlock(&flush_lock);
-		goto out;
-	}
-	pn = pppoe_pernet(dev_net(po->pppoe_dev));
-	spin_unlock(&flush_lock);
+	net = sock_net(sk);
+	pn = pppoe_pernet(net);
 
 	/*
 	 * protect "po" from concurrent updates
@@ -607,8 +595,8 @@ static int pppoe_release(struct socket *sock)
 	}
 
 	write_unlock_bh(&pn->hash_lock);
+	put_net(net);
 
-out:
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -625,8 +613,9 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	struct sock *sk = sock->sk;
 	struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev;
 	struct pppoe_net *pn;
+	struct net_device *dev = NULL;
+	struct net *net = NULL;
 	int error;
 
 	lock_sock(sk);
@@ -653,10 +642,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
 		if (po->pppoe_dev) {
-			pn = pppoe_pernet(dev_net(po->pppoe_dev));
+			struct net *old = dev_net(po->pppoe_dev);
+			pn = pppoe_pernet(old);
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			put_net(old);
 		}
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
@@ -666,13 +657,17 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	/* Re-bind in session stage only */
 	if (stage_session(sp->sa_addr.pppoe.sid)) {
 		error = -ENODEV;
-		dev = dev_get_by_name(sock_net(sk), sp->sa_addr.pppoe.dev);
-		if (!dev)
+		net = maybe_get_net(dev_net(dev));
+		if (!net)
 			goto end;
 
+		dev = dev_get_by_name(net, sp->sa_addr.pppoe.dev);
+		if (!dev)
+			goto err_put_net;
+
 		po->pppoe_dev = dev;
 		po->pppoe_ifindex = dev->ifindex;
-		pn = pppoe_pernet(dev_net(dev));
+		pn = pppoe_pernet(net);
 		write_lock_bh(&pn->hash_lock);
 		if (!(dev->flags & IFF_UP)) {
 			write_unlock_bh(&pn->hash_lock);
@@ -707,6 +702,10 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 end:
 	release_sock(sk);
 	return error;
+err_put_net:
+	if (net)
+		put_net(net);
+
 err_put:
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);

^ permalink raw reply related

* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: Jarek Poplawski @ 2009-10-19 18:07 UTC (permalink / raw)
  To: David Miller
  Cc: tomoki.sekiyama.qu, linux-kernel, netdev, alan, satoshi.oshima.fk,
	hidehiro.kawai.ez, hideo.aoki.tk
In-Reply-To: <20091019.061459.193694892.davem@davemloft.net>

On Mon, Oct 19, 2009 at 06:14:59AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 19 Oct 2009 11:57:13 +0000
> 
> > Isn't the shutdown call expected to change sk_state to TCP_CLOSE?
> 
> No, because the send side is still up and operational, it's
> only a half duplex close.

OK, thanks for the explanation,
Jarek P.

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-19 17:12 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910190905x382bfc23w2987c84aa0837609@mail.gmail.com>

Michal Ostrowski a écrit :
> Here's a bigger patch that just gets rid of flush_lock altogether.
> 
> We were seeing oopses due to net namespaces going away while we were using
> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
> counts properly.
> 
> Fixing this requires that adding and removing entries to the per-net hash-table
> requires incrementing and decrementing the ref count.  This also allows us to
> get rid of the flush_lock since we can now depend on the existence of
> "pn->hash_lock".
> 
> We also have to be careful when flushing devices that removal of a hash table
> entry may bring the net namespace refcount to 0.
> 

Your patch is mangled (tabulation -> white spaces),

and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
it would be a bug from core network code.



^ permalink raw reply

* Re: [PATCH 1/4 v3] net: Introduce sk_tx_queue_mapping
From: Eric Dumazet @ 2009-10-19 16:45 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, netdev, herbert
In-Reply-To: <20091018130740.3960.96469.sendpatchset@localhost.localdomain>

Krishna Kumar a écrit :
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> Introduce sk_tx_queue_mapping; and functions that set, test and
> get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
> cache is set/reset, and in socket alloc. Setting txq to -1 and
> using valid txq=<0 to n-1> allows the tx path to use the value
> of sk_tx_queue_mapping directly instead of subtracting 1 on every
> tx.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply

* [PATCH] net: Fix IP_MULTICAST_IF
From: Eric Dumazet @ 2009-10-19 16:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

This function should be called only with RTNL or dev_base_lock held, or reader
could see a corrupt hash chain and eventually enter an endless loop.

Fix is to call dev_get_by_index()/dev_put().

If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching dev refcount.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_sockglue.c   |    7 +++----
 net/ipv6/ipv6_sockglue.c |    6 +++++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 0c0b6e3..e982b5c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -634,17 +634,16 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 				break;
 			}
 			dev = ip_dev_find(sock_net(sk), mreq.imr_address.s_addr);
-			if (dev) {
+			if (dev)
 				mreq.imr_ifindex = dev->ifindex;
-				dev_put(dev);
-			}
 		} else
-			dev = __dev_get_by_index(sock_net(sk), mreq.imr_ifindex);
+			dev = dev_get_by_index(sock_net(sk), mreq.imr_ifindex);
 
 
 		err = -EADDRNOTAVAIL;
 		if (!dev)
 			break;
+		dev_put(dev);
 
 		err = -EINVAL;
 		if (sk->sk_bound_dev_if &&
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 14f54eb..4f7aaf6 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -496,13 +496,17 @@ done:
 			goto e_inval;
 
 		if (val) {
+			struct net_device *dev;
+
 			if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != val)
 				goto e_inval;
 
-			if (__dev_get_by_index(net, val) == NULL) {
+			dev = dev_get_by_index(net, val);
+			if (!dev) {
 				retv = -ENODEV;
 				break;
 			}
+			dev_put(dev);
 		}
 		np->mcast_oif = val;
 		retv = 0;

^ permalink raw reply related

* Re: PATCH: Network Device Naming mechanism and policy
From: Bryan Kadzban @ 2009-10-19 16:14 UTC (permalink / raw)
  To: Narendra_K
  Cc: dannf, bhutchings, netdev, linux-hotplug, Matt_Domsch,
	Jordan_Hargrave, Charles_Rose
In-Reply-To: <EDA0A4495861324DA2618B4C45DCB3EE5895AF@blrx3m08.blr.amer.dell.com>

[-- Attachment #1: Type: text/plain, Size: 2371 bytes --]

Narendra_K@Dell.com wrote:
>>>>> And how would the regular file look like in terms of holding
>>>>>  ifindex of the interface, which can be passed to
>>>>> libnetdevname.
>>>> I can't think of anything we need to store in the regular file.
>>>> If we have the kernel name for the device, we can look up the
>>>> ifindex in /sys. Correct me if I'm wrong, but storing it
>>>> ourselves seems redundant.
>>> But the name of a netdev can change whereas its ifindex never
>>> does. Identifying netdevs by name would require additional work
>>> to update the links when a netdev is renamed and would still be
>>> prone to race conditions.  This is why Narendra and Matt were
>>> proposing to
>> store the
>>> ifindex in the node all along...
>> Matt, Ben and I talked about a few other possibilities on IRC. The
>> one I like the most at the moment is an idea Ben had to creat dummy
>> files named after the ifindex. Then, use symlinks for the kernel
>> name and the various by-$property subdirectories. This means the
>> KOBJ events will need to expose the ifindex.
>> 
> 
> I suppose the KOBJ events already expose the ifindex of a network 
> interface. The file "/sys/class/net/ethN/uevent" contains
> INTERFACE=ethN and IFINDEX=n already. But it looks like udev doesn't
> use it in any way.

Right; it could simply do the equivalent of:

touch /dev/netdev/$env{IFINDEX}

instead of its normal mknod(2), and then do normal SYMLINK processing.
That last part is what would link /dev/netdev/by-name/$env{INTERFACE} to
that device, along with /dev/netdev/by-mac/*, /dev/netdev/by-path/*,
etc., etc., in as many different ways as people want to add rules.

(Or /dev/net/by-* instead of netdev; I'm mostly ambivalent about the
first-level directory under /dev.  Looks like libnetdevname requires
/dev/netdev though.)

> For example, with the kernel patch the "/sys/class/net/ethN/uevent" 
> contains in addition to the above details, MAJOR=M and MINOR=m which
> the udev knows how to make use of with a rule like
> 
> SUBSYSTEM=="net", KERNEL!="tun", NAME="netdev/%k", MODE="0600".

And if the only point is to get the ifindex via stat(2) on the resulting
symlinks, but people don't like device files, then why not get the
ifindex via readlink(2) (and a bit of string parsing, and a strtol(3) or
strtoul(3) call) instead?  :-)


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 260 bytes --]

^ permalink raw reply

* Re: [PATCHv2 2/4] Implement loss counting on TFRC-SP receiver
From: Leandro Sales @ 2009-10-19 16:04 UTC (permalink / raw)
  To: Gerrit Renker, Ivo Calado, dccp, netdev
In-Reply-To: <20091019052612.GE3366@gerrit.erg.abdn.ac.uk>

Hi Gerrit,

On Mon, Oct 19, 2009 at 2:26 AM, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
>
> | --- dccp_tree_work03.orig/net/dccp/ccids/lib/packet_history_sp.c      2009-10-08 22:58:21.418908270 -0300
> | +++ dccp_tree_work03/net/dccp/ccids/lib/packet_history_sp.c   2009-10-08 22:59:07.442411383 -0300
> | @@ -243,6 +243,7 @@
> |  {
> |       u64 s0 = tfrc_rx_hist_loss_prev(h)->tfrchrx_seqno,
> |           s1 = tfrc_rx_hist_entry(h, 1)->tfrchrx_seqno,
> | +         n1 = tfrc_rx_hist_entry(h, 1)->tfrchrx_ndp,
> |           s2 = tfrc_rx_hist_entry(h, 2)->tfrchrx_seqno,
> |           s3 = DCCP_SKB_CB(skb)->dccpd_seq;
> I have removed the old definition of n1, which was further below and which caused this warning.
>
> net/dccp/ccids/lib/packet_history_sp.c:276:7: warning: symbol 'n1' shadows an earlier
> net/dccp/ccids/lib/packet_history_sp.c:247:6: originally declared here
>
>

Well done!

> I thought again about the earlier suggestion to make 'num_losses' u64. Since li_losses sums the values
> stored in num_losses, it needs to have the same size (currently it is u32). But then another thought is
> that if there are so many losses that u32 overflows, then the performance is so bad anyway that it is
> better to turn off the receiver. Hence I have reverted it to u32, as per your original patch.
>

OK

> Please find attached a patch of the changes I made. As per posting, I have separated out the dccp.h part,
> since it is also useful in general.

OK, agreed!

Thank you,

BR,
Leandro.

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 16:05 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric Dumazet, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091019155034.GA5233@lenovo>

Here's a bigger patch that just gets rid of flush_lock altogether.

We were seeing oopses due to net namespaces going away while we were using
them, which turns out is simply due to the fact that pppoew wasn't claiming ref
counts properly.

Fixing this requires that adding and removing entries to the per-net hash-table
requires incrementing and decrementing the ref count.  This also allows us to
get rid of the flush_lock since we can now depend on the existence of
"pn->hash_lock".

We also have to be careful when flushing devices that removal of a hash table
entry may bring the net namespace refcount to 0.

---
 drivers/net/pppoe.c |  152 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 83 insertions(+), 69 deletions(-)

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 7cbf6f9..140a196 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -111,9 +111,6 @@ struct pppoe_net {
       rwlock_t hash_lock;
 };

-/* to eliminate a race btw pppoe_flush_dev and pppoe_release */
-static DEFINE_SPINLOCK(flush_lock);
-
 /*
 * PPPoE could be in the following stages:
 * 1) Discovery stage (to obtain remote MAC and Session ID)
@@ -292,61 +289,77 @@ static inline struct pppox_sock
*delete_item(struct pppoe_net *pn, __be16 sid,
 static void pppoe_flush_dev(struct net_device *dev)
 {
       struct pppoe_net *pn;
+        struct net * net;
       int i;

       BUG_ON(dev == NULL);

+        /* We have to drop and re-acquire locks.  So we'll grab a ref-count on
+         * the net namespace to ensure it is valid throughout this function.
+         */
+
+        net = maybe_get_net(dev_net(dev));
+        if (!net)
+                return;
+
       pn = pppoe_pernet(dev_net(dev));
-       if (!pn) /* already freed */
+       if (!pn) { /* already freed */
+                put_net(net);
               return;
+        }

       write_lock_bh(&pn->hash_lock);
       for (i = 0; i < PPPOE_HASH_SIZE; i++) {
               struct pppox_sock *po = pn->hash_table[i];
-
-               while (po != NULL) {
-                       struct sock *sk;
-                       if (po->pppoe_dev != dev) {
-                               po = po->next;
-                               continue;
-                       }
-                       sk = sk_pppox(po);
-                       spin_lock(&flush_lock);
-                       po->pppoe_dev = NULL;
-                       spin_unlock(&flush_lock);
-                       dev_put(dev);
-
-                       /* We always grab the socket lock, followed by the
-                        * hash_lock, in that order.  Since we should
-                        * hold the sock lock while doing any unbinding,
-                        * we need to release the lock we're holding.
-                        * Hold a reference to the sock so it doesn't disappear
-                        * as we're jumping between locks.
-                        */
-
-                       sock_hold(sk);
-
-                       write_unlock_bh(&pn->hash_lock);
-                       lock_sock(sk);
-
-                       if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
-                               pppox_unbind_sock(sk);
-                               sk->sk_state = PPPOX_ZOMBIE;
-                               sk->sk_state_change(sk);
-                       }
-
-                       release_sock(sk);
-                       sock_put(sk);
-
-                       /* Restart scan at the beginning of this hash chain.
-                        * While the lock was dropped the chain contents may
-                        * have changed.
-                        */
-                       write_lock_bh(&pn->hash_lock);
-                       po = pn->hash_table[i];
-               }
+                struct sock *sk;
+
+                while (po && po->pppoe_dev != dev) {
+                        po = po->next;
+                }
+
+                if (po == NULL) {
+                        continue;
+                }
+
+                sk = sk_pppox(po);
+
+                if (po->pppoe_dev) {
+                        dev_put(po->pppoe_dev);
+                        po->pppoe_dev = NULL;
+                }
+
+                /* We always grab the socket lock, followed by the
+                 * hash_lock, in that order.  Since we should
+                 * hold the sock lock while doing any unbinding,
+                 * we need to release the lock we're holding.
+                 * Hold a reference to the sock so it doesn't disappear
+                 * as we're jumping between locks.
+                 */
+
+                sock_hold(sk);
+
+                write_unlock_bh(&pn->hash_lock);
+                lock_sock(sk);
+
+                if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+                        pppox_unbind_sock(sk);
+                        sk->sk_state = PPPOX_ZOMBIE;
+                        sk->sk_state_change(sk);
+                }
+
+                release_sock(sk);
+                sock_put(sk);
+
+                /* Restart the process from the start of the current hash
+                 * chain. We dropped locks so the world may have change from
+                 * underneath us, but we know "pn" is still good because we
+                 * grabbed a ref count on "net".
+                 */
+                write_unlock_bh(&pn->hash_lock);
+                po = pn->hash_table[i];
       }
       write_unlock_bh(&pn->hash_lock);
+        put_net(net);
 }

 static int pppoe_device_event(struct notifier_block *this,
@@ -561,6 +574,7 @@ static int pppoe_release(struct socket *sock)
       struct sock *sk = sock->sk;
       struct pppox_sock *po;
       struct pppoe_net *pn;
+        struct net *net = NULL;

       if (!sk)
               return 0;
@@ -576,19 +590,8 @@ static int pppoe_release(struct socket *sock)
       /* Signal the death of the socket. */
       sk->sk_state = PPPOX_DEAD;

-       /*
-        * pppoe_flush_dev could lead to a race with
-        * this routine so we use flush_lock to eliminate
-        * such a case (we only need per-net specific data)
-        */
-       spin_lock(&flush_lock);
-       po = pppox_sk(sk);
-       if (!po->pppoe_dev) {
-               spin_unlock(&flush_lock);
-               goto out;
-       }
-       pn = pppoe_pernet(dev_net(po->pppoe_dev));
-       spin_unlock(&flush_lock);
+        net = sock_net(sk);
+       pn = pppoe_pernet(net);

       /*
        * protect "po" from concurrent updates
@@ -601,14 +604,14 @@ static int pppoe_release(struct socket *sock)
               __delete_item(pn, po->pppoe_pa.sid, po->pppoe_pa.remote,
                               po->pppoe_ifindex);

-       if (po->pppoe_dev) {
-               dev_put(po->pppoe_dev);
-               po->pppoe_dev = NULL;
-       }
+        if (po->pppoe_dev) {
+                dev_put(po->pppoe_dev);
+                po->pppoe_dev = NULL;
+        }

       write_unlock_bh(&pn->hash_lock);
+        put_net(net);

-out:
       sock_orphan(sk);
       sock->sk = NULL;

@@ -625,8 +628,9 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
       struct sock *sk = sock->sk;
       struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
       struct pppox_sock *po = pppox_sk(sk);
-       struct net_device *dev;
       struct pppoe_net *pn;
+       struct net_device *dev = NULL;
+        struct net *net = NULL;
       int error;

       lock_sock(sk);
@@ -653,10 +657,12 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
       if (stage_session(po->pppoe_pa.sid)) {
               pppox_unbind_sock(sk);
               if (po->pppoe_dev) {
-                       pn = pppoe_pernet(dev_net(po->pppoe_dev));
+                        struct net *old = dev_net(po->pppoe_dev);
+                       pn = pppoe_pernet(old);
                       delete_item(pn, po->pppoe_pa.sid,
                               po->pppoe_pa.remote, po->pppoe_ifindex);
                       dev_put(po->pppoe_dev);
+                        put_net(old);
               }
               memset(sk_pppox(po) + 1, 0,
                      sizeof(struct pppox_sock) - sizeof(struct sock));
@@ -666,13 +672,17 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
       /* Re-bind in session stage only */
       if (stage_session(sp->sa_addr.pppoe.sid)) {
               error = -ENODEV;
-               dev = dev_get_by_name(sock_net(sk), sp->sa_addr.pppoe.dev);
+                net = maybe_get_net(dev_net(dev));
+                if (!net)
+                        goto end;
+
+               dev = dev_get_by_name(net, sp->sa_addr.pppoe.dev);
               if (!dev)
-                       goto end;
+                       goto err_put_net;

               po->pppoe_dev = dev;
               po->pppoe_ifindex = dev->ifindex;
-               pn = pppoe_pernet(dev_net(dev));
+               pn = pppoe_pernet(net);
               write_lock_bh(&pn->hash_lock);
               if (!(dev->flags & IFF_UP)) {
                       write_unlock_bh(&pn->hash_lock);
@@ -707,6 +717,10 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
 end:
       release_sock(sk);
       return error;
+err_put_net:
+        if (net) {
+                put_net(net);
+        }
 err_put:
       if (po->pppoe_dev) {
               dev_put(po->pppoe_dev);
--
1.6.3.3

^ permalink raw reply related

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Anirban Sinha @ 2009-10-19 16:01 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, David Miller, netdev, Anirban Sinha
In-Reply-To: <20091019153618.GA20967@redhat.com>


> I meant: yes sure it _is legal_ to schedule a work from within a timer
> callback (in fact it is legal from any context).

ok, then that part of the function looks fine.

Ani

^ permalink raw reply

* Re: MAINTAINERS drivers/net cleanups?
From: David Dillow @ 2009-10-19 15:24 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev
In-Reply-To: <1255729936.2267.179.camel@Joe-Laptop.home>

On Fri, 2009-10-16 at 14:52 -0700, Joe Perches wrote:
> Ben Hutchings suggested adding a facility to
> scripts/get_maintainer.pl to print the role of
> each maintainer.  I added a bit more to print
> statistics about the maintainers and "-by:" lines
> from each git commit as well.
[snip script]
> There are many people described as MAINTAINERS for
> these files that have not had a single sign-off or
> commit in git history.

Uhm, I think your script is broken, as I've patched typhoon this year.
I've also ACKed a few patches that have been sent to me, and tried to
watch the list for patches that weren't.

> Should these individuals be removed from MAINTAINERS
> and added to CREDITS if not already there?

I think you should try to contact the ones listed and see if they are
still actively watching for their drivers. I've not had to do a lot to
typhoon, as there is no new hardware coming out, and while I've sent in
several Acked-by's, not all of them have made it to the final commit.

Just because there are few patches from the person listed as maintainer
does not mean they are not keeping an eye on things.

> ------> drivers/net/typhoon.c
> David Dillow <dave@thedillows.org> (maintainer)
> 
> 3CR990 NETWORK DRIVER
> M: David Dillow <dave@thedillows.org>
> L: netdev@vger.kernel.org
> S: Maintained
> F: drivers/net/typhoon*

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-19 15:50 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Eric Dumazet, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910190619t3e009e1by49cc8f7307eb7cdb@mail.gmail.com>

[Michal Ostrowski - Mon, Oct 19, 2009 at 08:19:23AM -0500]
|
| The entire scheme for managing net namespaces seems unsafe.  We depend
| on synchronization via pn->hash_lock, but have no guarantee of the
| existence of the "net" object -- hence no way to ensure the existence
| of the lock itself.  This should be relatively easy to fix though as
| we should be able to get/put the net namespace as we add remove
| objects to/from the pppoe hash.
|

Hmm... it seems not. The only possible scenario I see (for such nonexistence
namespace is that when it was cached via RCU and returned before grace period
elapsed, so perhaps we need to call synchronize_net somewhere).

| 
| Once you solve this existence issue, the flush_lock can be eliminated
| altogether since all of the relevant code paths already depend on a
| write_lock_bh(&pn->hash_lock), and that's the lock that should be use
| to protect the pppoe_dev field.
| 
| Another patch to follow later...
| 
| --
| Michal Ostrowski
| mostrows@gmail.com
| 
| 
| 
| On Mon, Oct 19, 2009 at 7:36 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
| > Michal Ostrowski a écrit :
| >> Here's my theory on this after an inital look...
| >>
| >> Looking at the oops report and disassembly of the actual module binary
| >> that caused the oops, one can deduce that:
| >>
| >> Execution was in pppoe_flush_dev().  %ebx contained the pointer "struct
| >> pppox_sock *po", which is what we faulted on, excuting "cmp %eax, 0x190(%ebx)".
| >> %ebx value was 0xffffffff (hence we got "NULL pointer dereference at 0x18f").
| >>
| >> At this point "i" (stored in %esi) is 15 (valid), meaning that we got a value
| >> of 0xffffffff in pn->hash_table[i].
| >>
| >>>From this I'd hypothesize that the combination of dev_put() and release_sock()
| >> may have allowed us to free "pn".  At the bottom of the loop we alreayd
| >> recognize that since locks are dropped we're responsible for handling
| >> invalidation of objects, and perhaps that should be extended to "pn" as well.
| >> --
| >> Michal Ostrowski
| >> mostrows@gmail.com
| >>
| >>
| >
| > Looking at this stuff, I do believe flush_lock protection is not
| > properly done.
| >
| > At the end of pppoe_connect() for example we can find :
| >
| > err_put:
| >        if (po->pppoe_dev) {
| >                dev_put(po->pppoe_dev);
| >                po->pppoe_dev = NULL;
| >        }

Yep, this is unsafe, thanks!

| >
| > This is done without any protection, and can therefore clash with
| > pppoe_flush_dev() :
| >
| >        spin_lock(&flush_lock);
| >        po->pppoe_dev = NULL; /* ppoe_dev can already be NULL before this point */
| >        spin_unlock(&flush_lock);
| >
| >        dev_put(dev);    /* oops */
| >
| 

Denys, could you check if the patch below help?

	-- Cyrill
---
 drivers/net/pppoe.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.git/drivers/net/pppoe.c
=====================================================================
--- linux-2.6.git.orig/drivers/net/pppoe.c
+++ linux-2.6.git/drivers/net/pppoe.c
@@ -312,9 +312,9 @@ static void pppoe_flush_dev(struct net_d
 			}
 			sk = sk_pppox(po);
 			spin_lock(&flush_lock);
+			dev_put(po->pppoe_dev);
 			po->pppoe_dev = NULL;
 			spin_unlock(&flush_lock);
-			dev_put(dev);
 
 			/* We always grab the socket lock, followed by the
 			 * hash_lock, in that order.  Since we should
@@ -708,10 +708,12 @@ end:
 	release_sock(sk);
 	return error;
 err_put:
+	spin_lock(&flush_lock);
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);
 		po->pppoe_dev = NULL;
 	}
+	spin_unlock(&flush_lock);
 	goto end;
 }
 

^ permalink raw reply

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Oleg Nesterov @ 2009-10-19 15:36 UTC (permalink / raw)
  To: Anirban Sinha; +Cc: linux-kernel, David Miller, netdev, Anirban Sinha
In-Reply-To: <4ADC8677.7000607@anirban.org>

On 10/19, Anirban Sinha wrote:
>
> Once upon a time, like on 09-10-19 5:13 AM, Oleg Nesterov wrote:
>
> >> Is is
> >> it illegal to schedule a work function from within a timer callback?
> >
> > Yes sure.
>
> hmm. may be in that case, that function needs to be re-written.

OOPS!!!! I misread your question, didn't notice "il" above...

I meant: yes sure it _is legal_ to schedule a work from within a timer
callback (in fact it is legal from any context).

Sorry for confusion.

Oleg.


^ permalink raw reply

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Anirban Sinha @ 2009-10-19 15:32 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, David Miller, netdev, Anirban Sinha
In-Reply-To: <20091019121327.GA11423@redhat.com>



Once upon a time, like on 09-10-19 5:13 AM, Oleg Nesterov wrote:
> Hi Anirban,
> 
> On 10/18, Anirban Sinha wrote:
>>
>> I have a question for you. The queue_work() routine which is called from
>> schedule_work() does a put_cpu() which in turn does a enable_preempt(). Is
>> this an attempt to trigger the scheduler?
> 
> No. please note that queue_work() does get_cpu() + put_cpu() to protect
> against cpu_down() in between.

grrr! Ah yes, my eyes failed me (or it saw what I wanted it to see :)). You do have a get_cpu() and put_cpu() together in the same code path. I guess I will have to keep looking at inet_twdr_hangman(). 

>> Is is
>> it illegal to schedule a work function from within a timer callback?
> 
> Yes sure.

hmm. may be in that case, that function needs to be re-written. 

> I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.
> 
> Could you verify this?

I'll keep looking. Thanks for the help Oleg. 

Ani


^ permalink raw reply

* Re: [PATCH] myri10ge: improve port type reporting in ethtool output
From: Ben Hutchings @ 2009-10-19 14:17 UTC (permalink / raw)
  To: Andrew Gallatin
  Cc: Brice Goglin, David S. Miller, Linux Network Development list
In-Reply-To: <4ADC69FF.5000109@myri.com>

On Mon, 2009-10-19 at 09:30 -0400, Andrew Gallatin wrote:
> Ben Hutchings wrote:
> > On Mon, 2009-10-19 at 08:34 -0400, Andrew Gallatin wrote:
> >> Ben Hutchings wrote:
> >>
> >>> Lying about link modes is not an improvement.
> >> OK, so we're probably doing something wrong. I suspect we're not
> >> alone.  At least we don't set SUPPORTED_TP for CX4, like I've
> >> seen some NICs do.
> >>
> >> Can somebody suggest how we can tell ethtool that
> >> the NIC supports 10Gb only (no autoneg down to 1Gb or lower)
> >> for copper (10Gbase-CX4)?   How about for fiber (10Gbase-{S,L})R?
> > 
> > What's wrong with what you already do?  Customers expect to see
> > something on the supported line?
> 
> Exactly.  One has complained because drivers for
> other vendors NICs show this, even if they are fibre NICs
> or CX4 NICs, and don't actually support 10GbaseT.

Let's fix the other drivers then.  Labelling these NICs as supporting
10GBASE-T is liable to confuse more people (and tools) in the long run.

> I'm happy to back this part out, and resubmit the patch without
> it. There is still some fairly valuable stuff in the patch
> -- mainly updating the NIC detection logic for new NICs to
> detect fibre vs copper.

Sure.

You should also set port = PORT_OTHER for CX4 or KX4.  Currently it
looks like you don't set port, so it appears as 0 == PORT_TP.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox