Netdev List
 help / color / mirror / Atom feed
* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-20 22:31 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <g2z46c8cb3e1004201517i5641a75cze2ec5bd33e81fb0f@mail.gmail.com>

Le mercredi 21 avril 2010 à 03:17 +0500, Gaspar Chilingarov a écrit :
> [1.] Large amount of outgoing tcp connections fail to bind properly to
> their ip/ports
> 
> [2.] Full description of the problem/report:
> 
> I'm trying to establish huge amount of outgoing tcp connections (over
> several 100000-s) on a single machine. I need to test load a server,
> which could process that amount of connections :)
> 
> The number of connections which are possible to establish from single
> ip is regulated by
> net.ipv4.ip_local_port_range = 32768    61000, which gives 28232 connections.
> 
> Good. I expect that each socket is identified on a local side as
> unique pair of local_ip:local_port .
> Thus I've added some more IP addresses (say 10) to the machine
> (aliases to the same network interface).
> I expect to be able to establish 10 times more connections than before
> (I know about file descriptor limits, system limit of total number of
> file descriptors and  so on - which are tuned to high values already).
> 
> And the fun part begins -
> I have 28232 on a first source IP (all in established state, say
> 10.0.0.10) and now I'm trying to establish one more connection with
> nc, specifying 10.0.0.11 as a source IP -- and getting "unable to bind
> error"
> 
> Notes about example;
> 10.0.0.1:8192 is a server which just accepts a connections and listens
> forever on them. It's in erlang and it can handle great loads -- so
> there is not problems on that side.
> Using the same script I was able to establish more than 20.000
> connections without any problems (having a standard local port range
> set)
> 
> 
> To make experiment easily reproducible I've done the following things:
> 
> Decrease number of local ports available to 1001 -
> net.ipv4.ip_local_port_range = 60000    61000
> 
> I have script like this (writing from memory)
> 
> #!/bin/sh
> 
> I=0
> 
> IP=10.0.0.10
> 
> # connection stats before run
> netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l
> 
> while [ $I -le 1000 ]; do
> 
> # run nc in background, supress any output
> nc -s $IP 10.0.0.1 8192 > /dev/null 2>&1 &
> 
> I=$(($I + 1))
> 
> done
> 
> # connection stats after run
> netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l
> 
> 
> EVEN on the first run I get only 990 successful connections! something
> fails, strange ....
> 
> nc 10.0.0.1 8192 fails with error "unable to bind" and establishes
> connection only from 5-10 try.
> 
> Ook, well, run this script again, get all possible 1001 connections
> and than change source IP to 10.0.0.11
> 
> If you run in several times you will get the following numbers of
> established connections about each run (for given source IP)
> ~650, ~870, 950,980,990,995,995, 1000 and several runs to get 1001.
> 
> Then if you change IP to the next available and run it again - you
> will get practically the same numbers and this continues for 3-th,
> 4th, 5-th and other IP's.
> 
> 
> As a programmer, I feel that there is some hash table for
> local_ip:local_port pairs in the kernel (may be also incorporating
> PID), which has a collisions and
> in case of collision it just fails to reserve/bind this pair for the
> socket.  I hope I'm right, but I've failed to find where the
> allocation is done :)
> In case if PID does not change (i've tried to run tests from primitive
> client in erlang as well -- you get much more worse picture and
> getting new socket becomes just impossible).
> 
> I think that even in case if there is one port available for that IP
> -- it should be possible to bind (even if the kernel should do the
> full scan on local port range to find that unused port).
> 
> 
> I would be grateful for hints where to look in the source -- may be I
> can produce some working patches for it.
> 
> 
> [3.] Keywords (i.e., modules, networking, kernel):
> does not matter, i think.
> 
> [4.] Kernel version (from /proc/version):
> Ubuntu Karmic Koala on amd64 with latest shipped kernel.
> Linux version 2.6.31-21-generic (buildd@yellow) (gcc version 4.4.1
> (Ubuntu 4.4.1-4ubuntu9) ) #59-Ubuntu SMP Wed Mar 24 07:28:27 UTC 2010
> 
> 
> [5.] Output of Oops.. message (if applicable) with symbolic information
>     resolved (see Documentation/oops-tracing.txt)
> n/a
> 
> [6.] A small shell script or example program which triggers the
>     problem (if possible)
> [7.] Environment
> 
> 
> Thanks in advance,
> Gaspar
> 

Its doable, only if you bind() your sockets before connect()

For each socket, you'll need to chose an (local IP, local port) not
already in use.

kernel wont magically select one source IP from the pool you have.




^ permalink raw reply

* PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-20 22:17 UTC (permalink / raw)
  To: netdev

[1.] Large amount of outgoing tcp connections fail to bind properly to
their ip/ports

[2.] Full description of the problem/report:

I'm trying to establish huge amount of outgoing tcp connections (over
several 100000-s) on a single machine. I need to test load a server,
which could process that amount of connections :)

The number of connections which are possible to establish from single
ip is regulated by
net.ipv4.ip_local_port_range = 32768    61000, which gives 28232 connections.

Good. I expect that each socket is identified on a local side as
unique pair of local_ip:local_port .
Thus I've added some more IP addresses (say 10) to the machine
(aliases to the same network interface).
I expect to be able to establish 10 times more connections than before
(I know about file descriptor limits, system limit of total number of
file descriptors and  so on - which are tuned to high values already).

And the fun part begins -
I have 28232 on a first source IP (all in established state, say
10.0.0.10) and now I'm trying to establish one more connection with
nc, specifying 10.0.0.11 as a source IP -- and getting "unable to bind
error"

Notes about example;
10.0.0.1:8192 is a server which just accepts a connections and listens
forever on them. It's in erlang and it can handle great loads -- so
there is not problems on that side.
Using the same script I was able to establish more than 20.000
connections without any problems (having a standard local port range
set)


To make experiment easily reproducible I've done the following things:

Decrease number of local ports available to 1001 -
net.ipv4.ip_local_port_range = 60000    61000

I have script like this (writing from memory)

#!/bin/sh

I=0

IP=10.0.0.10

# connection stats before run
netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l

while [ $I -le 1000 ]; do

# run nc in background, supress any output
nc -s $IP 10.0.0.1 8192 > /dev/null 2>&1 &

I=$(($I + 1))

done

# connection stats after run
netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l


EVEN on the first run I get only 990 successful connections! something
fails, strange ....

nc 10.0.0.1 8192 fails with error "unable to bind" and establishes
connection only from 5-10 try.

Ook, well, run this script again, get all possible 1001 connections
and than change source IP to 10.0.0.11

If you run in several times you will get the following numbers of
established connections about each run (for given source IP)
~650, ~870, 950,980,990,995,995, 1000 and several runs to get 1001.

Then if you change IP to the next available and run it again - you
will get practically the same numbers and this continues for 3-th,
4th, 5-th and other IP's.


As a programmer, I feel that there is some hash table for
local_ip:local_port pairs in the kernel (may be also incorporating
PID), which has a collisions and
in case of collision it just fails to reserve/bind this pair for the
socket.  I hope I'm right, but I've failed to find where the
allocation is done :)
In case if PID does not change (i've tried to run tests from primitive
client in erlang as well -- you get much more worse picture and
getting new socket becomes just impossible).

I think that even in case if there is one port available for that IP
-- it should be possible to bind (even if the kernel should do the
full scan on local port range to find that unused port).


I would be grateful for hints where to look in the source -- may be I
can produce some working patches for it.


[3.] Keywords (i.e., modules, networking, kernel):
does not matter, i think.

[4.] Kernel version (from /proc/version):
Ubuntu Karmic Koala on amd64 with latest shipped kernel.
Linux version 2.6.31-21-generic (buildd@yellow) (gcc version 4.4.1
(Ubuntu 4.4.1-4ubuntu9) ) #59-Ubuntu SMP Wed Mar 24 07:28:27 UTC 2010


[5.] Output of Oops.. message (if applicable) with symbolic information
    resolved (see Documentation/oops-tracing.txt)
n/a

[6.] A small shell script or example program which triggers the
    problem (if possible)
[7.] Environment


Thanks in advance,
Gaspar



--
Gaspar Chilingarov

tel +37493 419763 (mobile - leave voice mail message)
icq 63174784
skype://gasparch
e mailto:nm@web.am mailto:gasparch@gmail.com
w http://gasparchilingarov.com/

^ permalink raw reply

* [RFC PATCH] net: Set sock's TX queue to match the RX queue
From: Ben Hutchings @ 2010-04-20 22:10 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers

We generally expect TX work for a connected socket to be done on the
same CPU (or rather hardware thread) as RX work, so the RX and TX
queue selection should match.  Therefore, when a sock with a cached
destination receives an skb from a multi-RX-queue net device, set the
sock's TX queue index to the RX queue index of the skb, modulo the
number of TX queues on the cached output device.

This changes the representation of TX queue mapping to record how the
current queue selection (if any) was made.  We can make the initial
queue selection based on the current simple hash, then replace it once
we know which RX queue is selected.  For TCP and other connected
protocols, this should all be settled during the initial handshake
so the change of TX queue will not cause packet reordering.

I haven't given this a whole lot of testing yet.  I would be interested
to hear how this affects performance on different systems and whether
it's actually worth doing.

Ben.
---
 include/net/sock.h |   37 +++++++++++++++++++++++++++++--------
 net/core/dev.c     |    3 ++-
 net/core/sock.c    |   18 +++++++++++++++++-
 3 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 56df440..3cd2b17 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -108,7 +108,8 @@ struct net;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_refcnt: reference count
- *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_tx_queue_source: source for @skc_tx_queue_index
+ *	@skc_tx_queue_index: tx queue index for this connection
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
  *	@skc_family: network address family
@@ -132,7 +133,8 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	atomic_t		skc_refcnt;
-	int			skc_tx_queue_mapping;
+	u16			skc_tx_queue_source;
+	u16			skc_tx_queue_index;
 
 	union  {
 		unsigned int	skc_hash;
@@ -152,6 +154,12 @@ struct sock_common {
 #endif
 };
 
+enum sock_tx_queue_source {
+	SK_TX_QUEUE_SOURCE_NONE,
+	SK_TX_QUEUE_SOURCE_HASH,
+	SK_TX_QUEUE_SOURCE_RX_QUEUE
+};
+
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -226,7 +234,8 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
-#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#define sk_tx_queue_source	__sk_common.skc_tx_queue_source
+#define sk_tx_queue_index	__sk_common.skc_tx_queue_index
 
 #define sk_copy_start		__sk_common.skc_hash
 #define sk_hash			__sk_common.skc_hash
@@ -1134,24 +1143,29 @@ static inline void sock_put(struct sock *sk)
 extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			  const int nested);
 
-static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
+static inline void
+sk_tx_queue_set(struct sock *sk, u16 index, enum sock_tx_queue_source source)
 {
-	sk->sk_tx_queue_mapping = tx_queue;
+	sk->sk_tx_queue_source = source;
+	sk->sk_tx_queue_index = index;
 }
 
+extern void
+__sk_tx_queue_set_from_rx_queue(struct sock *sk, const struct sk_buff *skb);
+
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-	sk->sk_tx_queue_mapping = -1;
+	sk->sk_tx_queue_source = SK_TX_QUEUE_SOURCE_NONE;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	return sk->sk_tx_queue_mapping;
+	return sk->sk_tx_queue_index;
 }
 
 static inline bool sk_tx_queue_recorded(const struct sock *sk)
 {
-	return (sk && sk->sk_tx_queue_mapping >= 0);
+	return (sk && sk->sk_tx_queue_source != SK_TX_QUEUE_SOURCE_NONE);
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
@@ -1423,6 +1437,13 @@ static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
 	skb->destructor = sock_rfree;
 	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 	sk_mem_charge(sk, skb->truesize);
+	/*
+	 * If we're receiving through a multiqueue device, set the
+	 * TX queue to match the RX queue.
+	 */
+	if (sk->sk_dst_cache && skb_rx_queue_recorded(skb) &&
+	    unlikely(sk->sk_tx_queue_source != SK_TX_QUEUE_SOURCE_RX_QUEUE))
+		__sk_tx_queue_set_from_rx_queue(sk, skb);
 }
 
 extern void sk_reset_timer(struct sock *sk, struct timer_list* timer,
diff --git a/net/core/dev.c b/net/core/dev.c
index b31d5d6..576bb28 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2016,7 +2016,8 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 				queue_index = skb_tx_hash(dev, skb);
 
 			if (sk && rcu_dereference_check(sk->sk_dst_cache, 1))
-				sk_tx_queue_set(sk, queue_index);
+				sk_tx_queue_set(sk, queue_index,
+						SK_TX_QUEUE_SOURCE_HASH);
 		}
 	}
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 7effa1e..66719a7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -362,6 +362,21 @@ void sk_reset_txq(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_reset_txq);
 
+void __sk_tx_queue_set_from_rx_queue(struct sock *sk, const struct sk_buff *skb)
+{
+	struct dst_entry *dst = __sk_dst_get(sk);
+	u16 queue_index = skb_get_rx_queue(skb);
+	struct net_device *output_dev;
+
+	if (!dst)
+		return;
+	output_dev = dst->dev;
+	if (unlikely(queue_index >= output_dev->real_num_tx_queues))
+		queue_index %= output_dev->real_num_tx_queues;
+	sk_tx_queue_set(sk, queue_index, SK_TX_QUEUE_SOURCE_RX_QUEUE);
+}
+EXPORT_SYMBOL_GPL(__sk_tx_queue_set_from_rx_queue);
+
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = __sk_dst_get(sk);
@@ -966,7 +981,8 @@ static void sock_copy(struct sock *nsk, const struct sock *osk)
 #endif
 	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
 		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
-		     sizeof(osk->sk_tx_queue_mapping));
+		     sizeof(osk->sk_tx_queue_source) +
+		     sizeof(osk->sk_tx_queue_index));
 	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
 	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
-- 
1.6.2.5

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Socket filter access to hatype
From: Paul LeoNerd Evans @ 2010-04-20 22:00 UTC (permalink / raw)
  To: netdev


[-- Attachment #1.1: Type: text/plain, Size: 879 bytes --]

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on, such as is reported by the sll_hatype field of the struct sockaddr_ll
when the packet is sent up to userland.

Unless I've managed to miss a trick somewhere, this would seem to put a
fairly fundamental blocker on actually being able to filter in such
packets. Granted there's the SKF_OFF_NET area to inspect at the e.g. IPv4
level, but this makes it impossible to do anything on e.g. the Ethernet
level.

See attached for a patch to add an SKF_AD_HATYPE field, up among the
other special access fields around SKF_AD_OFF.

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #1.2: linux-SKF_AD_HATYPE.patch --]
[-- Type: text/x-patch, Size: 949 bytes --]

diff -ur linux-2.6.33.2.orig/include/linux/filter.h linux-2.6.33.2/include/linux/filter.h
--- linux-2.6.33.2.orig/include/linux/filter.h	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/include/linux/filter.h	2010-04-20 22:40:25.000000000 +0100
@@ -123,7 +123,8 @@
 #define SKF_AD_NLATTR_NEST	16
 #define SKF_AD_MARK 	20
 #define SKF_AD_QUEUE	24
-#define SKF_AD_MAX	28
+#define SKF_AD_HATYPE	28
+#define SKF_AD_MAX	32
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)
 
diff -ur linux-2.6.33.2.orig/net/core/filter.c linux-2.6.33.2/net/core/filter.c
--- linux-2.6.33.2.orig/net/core/filter.c	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/net/core/filter.c	2010-04-20 22:41:01.000000000 +0100
@@ -309,6 +309,9 @@
 		case SKF_AD_QUEUE:
 			A = skb->queue_mapping;
 			continue;
+		case SKF_AD_HATYPE:
+			A = skb->dev->type;
+			continue;
 		case SKF_AD_NLATTR: {
 			struct nlattr *nla;
 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: David Miller @ 2010-04-20 21:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: franco, xiaosuo, therbert, netdev
In-Reply-To: <1271775421.7895.19.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 16:57:01 +0200

> I know many applications using TCP on loopback, they are real :)

This is all true and I support your hashing patch and all of that.

But if we really want TCP over loopback to go fast, there are much
better ways to do this.

Eric, do you remember that "TCP friends" rough patch I sent you last
year that essentailly made TCP sockets over loopback behave like
AF_UNIX ones and just queue the SKBs directly to the destination
socket without doing any protocol work?

If we ever got that working, tbench performance would become
impressive :)

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-20 21:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <20100420141655.3a66b8e8@nehalam>

On Tue, Apr 20, 2010 at 02:16:55PM -0700, Stephen Hemminger wrote:
> Is this problem new to net-next where we hold onto addresses, or was
> the issue there before?

Before. I am seeing it on 2.6.32.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: A possible bug in reqsk_queue_hash_req()
From: Eric Dumazet @ 2010-04-20 21:29 UTC (permalink / raw)
  To: David Miller; +Cc: raise.sail, netdev, linux-kernel
In-Reply-To: <20100420.142405.228805796.davem@davemloft.net>

Le mardi 20 avril 2010 à 14:24 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Apr 2010 13:06:51 +0200
> 
> > I believe its not really necessary, because we are the only possible
> > writer at this stage.
> > 
> > The write_lock() ... write_unlock() is there only to enforce a
> > synchronisation with readers.
> > 
> > All callers of this reqsk_queue_hash_req() must have the socket locked
> 
> Right.
> 
> In fact there are quite a few snippets around the networking
> where we use this trick of only locking around the single
> pointer assignment that puts the object into the list.
> 
> And they were all written by Alexey Kuznetsov, so they must
> be correct :-)

We could convert to RCU, but given this read_lock is only taken by
inet_diag, its not really necessary.

Something like the following, but I have no plan to complete this work.


diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index b6d3b55..5f3373a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -278,7 +278,9 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk,
 
 static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
 {
-	return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
+	struct listen_sock * lopt = reqsk_lopt_get(&inet_csk(sk)->icsk_accept_queue);
+
+	return reqsk_queue_len(lopt);
 }
 
 static inline int inet_csk_reqsk_queue_young(const struct sock *sk)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 99e6e19..b665103 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -65,6 +65,7 @@ struct request_sock {
 	struct sock			*sk;
 	u32				secid;
 	u32				peer_secid;
+	struct rcu_head			rcu;
 };
 
 static inline struct request_sock *reqsk_alloc(const struct request_sock_ops *ops)
@@ -77,10 +78,7 @@ static inline struct request_sock *reqsk_alloc(const struct request_sock_ops *op
 	return req;
 }
 
-static inline void __reqsk_free(struct request_sock *req)
-{
-	kmem_cache_free(req->rsk_ops->slab, req);
-}
+extern void __reqsk_free(struct request_sock *req);
 
 static inline void reqsk_free(struct request_sock *req)
 {
@@ -110,23 +108,13 @@ struct listen_sock {
  * @rskq_accept_head - FIFO head of established children
  * @rskq_accept_tail - FIFO tail of established children
  * @rskq_defer_accept - User waits for some data after accept()
- * @syn_wait_lock - serializer
- *
- * %syn_wait_lock is necessary only to avoid proc interface having to grab the main
- * lock sock while browsing the listening hash (otherwise it's deadlock prone).
  *
- * This lock is acquired in read mode only from listening_get_next() seq_file
- * op and it's acquired in write mode _only_ from code that is actively
- * changing rskq_accept_head. All readers that are holding the master sock lock
- * don't need to grab this lock in read mode too as rskq_accept_head. writes
- * are always protected from the main sock lock.
  */
 struct request_sock_queue {
 	struct request_sock	*rskq_accept_head;
 	struct request_sock	*rskq_accept_tail;
-	rwlock_t		syn_wait_lock;
 	u8			rskq_defer_accept;
-	/* 3 bytes hole, try to pack */
+	/* 3-7 bytes hole, try to pack */
 	struct listen_sock	*listen_opt;
 };
 
@@ -154,9 +142,7 @@ static inline void reqsk_queue_unlink(struct request_sock_queue *queue,
 				      struct request_sock *req,
 				      struct request_sock **prev_req)
 {
-	write_lock(&queue->syn_wait_lock);
-	*prev_req = req->dl_next;
-	write_unlock(&queue->syn_wait_lock);
+	rcu_assign_pointer(*prev_req, req->dl_next);
 }
 
 static inline void reqsk_queue_add(struct request_sock_queue *queue,
@@ -167,13 +153,13 @@ static inline void reqsk_queue_add(struct request_sock_queue *queue,
 	req->sk = child;
 	sk_acceptq_added(parent);
 
+	req->dl_next = NULL;
 	if (queue->rskq_accept_head == NULL)
 		queue->rskq_accept_head = req;
 	else
 		queue->rskq_accept_tail->dl_next = req;
 
 	queue->rskq_accept_tail = req;
-	req->dl_next = NULL;
 }
 
 static inline struct request_sock *reqsk_queue_remove(struct request_sock_queue *queue)
@@ -223,9 +209,16 @@ static inline int reqsk_queue_added(struct request_sock_queue *queue)
 	return prev_qlen;
 }
 
-static inline int reqsk_queue_len(const struct request_sock_queue *queue)
+static inline struct listen_sock *reqsk_lopt_get(const struct request_sock_queue *queue)
 {
-	return queue->listen_opt != NULL ? queue->listen_opt->qlen : 0;
+	return rcu_dereference_check(queue->listen_opt,
+		rcu_read_lock_bh_held() ||
+		sock_owned_by_user(sk));
+}
+
+static inline int reqsk_queue_len(struct listen_sock *lopt)
+{
+	return lopt != NULL ? lopt->qlen : 0;
 }
 
 static inline int reqsk_queue_len_young(const struct request_sock_queue *queue)
@@ -247,11 +240,9 @@ static inline void reqsk_queue_hash_req(struct request_sock_queue *queue,
 	req->expires = jiffies + timeout;
 	req->retrans = 0;
 	req->sk = NULL;
-	req->dl_next = lopt->syn_table[hash];
 
-	write_lock(&queue->syn_wait_lock);
-	lopt->syn_table[hash] = req;
-	write_unlock(&queue->syn_wait_lock);
+	req->dl_next = lopt->syn_table[hash];
+	rcu_assign_pointer(lopt->syn_table[hash], req);
 }
 
 #endif /* _REQUEST_SOCK_H */
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 7552495..e4c333e 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -58,13 +58,10 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
 	     lopt->max_qlen_log++);
 
 	get_random_bytes(&lopt->hash_rnd, sizeof(lopt->hash_rnd));
-	rwlock_init(&queue->syn_wait_lock);
 	queue->rskq_accept_head = NULL;
 	lopt->nr_table_entries = nr_table_entries;
 
-	write_lock_bh(&queue->syn_wait_lock);
-	queue->listen_opt = lopt;
-	write_unlock_bh(&queue->syn_wait_lock);
+	rcu_assign_pointer(queue->listen_opt, lopt);
 
 	return 0;
 }
@@ -94,10 +91,8 @@ static inline struct listen_sock *reqsk_queue_yank_listen_sk(
 {
 	struct listen_sock *lopt;
 
-	write_lock_bh(&queue->syn_wait_lock);
 	lopt = queue->listen_opt;
 	queue->listen_opt = NULL;
-	write_unlock_bh(&queue->syn_wait_lock);
 
 	return lopt;
 }
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8da6429..2b2cb80 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -49,6 +49,19 @@ void inet_get_local_port_range(int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
+static void reqsk_rcu_free(struct rcu_head *head)
+{
+	struct request_sock *req = container_of(head, struct request_sock, rcu);
+
+	kmem_cache_free(req->rsk_ops->slab, req);
+}
+
+void __reqsk_free(struct request_sock *req)
+{
+	call_rcu_bh(&req->rcu, reqsk_rcu_free);
+}
+EXPORT_SYMBOL(__reqsk_free);
+
 int inet_csk_bind_conflict(const struct sock *sk,
 			   const struct inet_bind_bucket *tb)
 {
@@ -420,7 +433,7 @@ struct request_sock *inet_csk_search_req(const struct sock *sk,
 		    ireq->loc_addr == laddr &&
 		    AF_INET_FAMILY(req->rsk_ops->family)) {
 			WARN_ON(req->sk);
-			*prevp = prev;
+			rcu_assign_pointer(*prevp, prev);
 			break;
 		}
 	}
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index e5fa2dd..4da8365 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -632,9 +632,9 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
 
 	entry.family = sk->sk_family;
 
-	read_lock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+	rcu_read_lock_bh();
 
-	lopt = icsk->icsk_accept_queue.listen_opt;
+	lopt = rcu_dereference_bh(icsk->icsk_accept_queue.listen_opt);
 	if (!lopt || !lopt->qlen)
 		goto out;
 
@@ -648,7 +648,7 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
 		struct request_sock *req, *head = lopt->syn_table[j];
 
 		reqnum = 0;
-		for (req = head; req; reqnum++, req = req->dl_next) {
+		for (req = head; req; reqnum++, req = rcu_dereference_bh(req->dl_next)) {
 			struct inet_request_sock *ireq = inet_rsk(req);
 
 			if (reqnum < s_reqnum)
@@ -691,7 +691,7 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
 	}
 
 out:
-	read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+	rcu_read_unlock_bh();
 
 	return err;
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad08392..cd6a042 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1985,6 +1985,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct inet_listen_hashbucket *ilb;
 	struct tcp_iter_state *st = seq->private;
 	struct net *net = seq_file_net(seq);
+	struct listen_sock *lopt;
 
 	if (!sk) {
 		st->bucket = 0;
@@ -2009,20 +2010,21 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 				}
 				req = req->dl_next;
 			}
-			if (++st->sbucket >= icsk->icsk_accept_queue.listen_opt->nr_table_entries)
+			if (++st->sbucket >= lopt->nr_table_entries)
 				break;
 get_req:
-			req = icsk->icsk_accept_queue.listen_opt->syn_table[st->sbucket];
+			req = lopt->syn_table[st->sbucket];
 		}
 		sk	  = sk_next(st->syn_wait_sk);
 		st->state = TCP_SEQ_STATE_LISTENING;
-		read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+		rcu_read_unlock_bh();
 	} else {
 		icsk = inet_csk(sk);
-		read_lock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
-		if (reqsk_queue_len(&icsk->icsk_accept_queue))
+		rcu_read_lock_bh();
+		lopt = rcu_dereference_bh(icsk->icsk_accept_queue.listen_opt);
+		if (reqsk_queue_len(lopt))
 			goto start_req;
-		read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+		rcu_read_unlock_bh();
 		sk = sk_next(sk);
 	}
 get_sk:
@@ -2032,8 +2034,9 @@ get_sk:
 			goto out;
 		}
 		icsk = inet_csk(sk);
-		read_lock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
-		if (reqsk_queue_len(&icsk->icsk_accept_queue)) {
+		rcu_read_lock_bh();
+		lopt = rcu_dereference_bh(icsk->icsk_accept_queue.listen_opt);
+		if (reqsk_queue_len(lopt)) {
 start_req:
 			st->uid		= sock_i_uid(sk);
 			st->syn_wait_sk = sk;
@@ -2041,7 +2044,7 @@ start_req:
 			st->sbucket	= 0;
 			goto get_req;
 		}
-		read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+		rcu_read_unlock_bh();
 	}
 	spin_unlock_bh(&ilb->lock);
 	if (++st->bucket < INET_LHTABLE_SIZE) {
@@ -2235,10 +2238,8 @@ static void tcp_seq_stop(struct seq_file *seq, void *v)
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_OPENREQ:
-		if (v) {
-			struct inet_connection_sock *icsk = inet_csk(st->syn_wait_sk);
-			read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
-		}
+		if (v)
+			rcu_read_unlock_bh();
 	case TCP_SEQ_STATE_LISTENING:
 		if (v != SEQ_START_TOKEN)
 			spin_unlock_bh(&tcp_hashinfo.listening_hash[st->bucket].lock);



^ permalink raw reply related

* Re: A possible bug in reqsk_queue_hash_req()
From: David Miller @ 2010-04-20 21:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: raise.sail, netdev, linux-kernel
In-Reply-To: <1271761611.3845.223.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 13:06:51 +0200

> I believe its not really necessary, because we are the only possible
> writer at this stage.
> 
> The write_lock() ... write_unlock() is there only to enforce a
> synchronisation with readers.
> 
> All callers of this reqsk_queue_hash_req() must have the socket locked

Right.

In fact there are quite a few snippets around the networking
where we use this trick of only locking around the single
pointer assignment that puts the object into the list.

And they were all written by Alexey Kuznetsov, so they must
be correct :-)

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Stephen Hemminger @ 2010-04-20 21:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <1271797043.7895.320.camel@edumazet-laptop>

On Tue, 20 Apr 2010 22:57:23 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> > On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > > >  			addrconf_leave_anycast(ifp);
> > > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > > >  		dst_hold(&ifp->rt->u.dst);
> > > > -		if (ip6_del_rt(ifp->rt))
> > > > -			dst_free(&ifp->rt->u.dst);
> > > > +		ip6_del_rt(ifp->rt);
> > > >  		break;
> > > >  	}
> > > >  }
> > > > 
> > > 
> > > 
> > > I dont understand the problem Jiri.
> > > 
> > > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > > dst_free(), or we leak a refcount.
> > 
> > Well, no ... dst_free() does not decrement the refcount.
> > The "opposite" of dst_hold() is dst_release(). And it does not
> > automatically call dst_free() when the refcount drops to 0.
> > dst_free() needs to be called explicitly and it apparently
> > expects the caller to ensure that two dst_free()s won't be called
> > simultaneously. But my bonding example shows this is not the
> > case.
> > 
> > 
> 
> Ah yes you're right
> 
> Maybe ask Stephen ?
> 
> commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
> Author: stephen hemminger <shemminger@vyatta.com>
> Date:   Mon Apr 12 05:41:31 2010 +0000
> 
>     IPv6: keep route for tentative address
>     
>     Recent changes preserve IPv6 address when link goes down (good).
>     But would cause address to point to dead dst entry (bad).
>     The simplest fix is to just not delete route if address is
>     being held for later use.
>     
>     Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 1b00bfe..a9913d2 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
> inet6_ifaddr *ifp)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
> -               if (ip6_del_rt(ifp->rt))
> +
> +               if (ifp->dead && ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;


Is this problem new to net-next where we hold onto addresses, or was
the issue there before?

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Eric Dumazet @ 2010-04-20 20:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev, Hideaki YOSHIFUJI, David Miller, Stephen Hemminger
In-Reply-To: <20100420204939.GA15354@smudla-wifi.bakulak.kosire.czf>

Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > >  			addrconf_leave_anycast(ifp);
> > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > >  		dst_hold(&ifp->rt->u.dst);
> > > -		if (ip6_del_rt(ifp->rt))
> > > -			dst_free(&ifp->rt->u.dst);
> > > +		ip6_del_rt(ifp->rt);
> > >  		break;
> > >  	}
> > >  }
> > > 
> > 
> > 
> > I dont understand the problem Jiri.
> > 
> > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > dst_free(), or we leak a refcount.
> 
> Well, no ... dst_free() does not decrement the refcount.
> The "opposite" of dst_hold() is dst_release(). And it does not
> automatically call dst_free() when the refcount drops to 0.
> dst_free() needs to be called explicitly and it apparently
> expects the caller to ensure that two dst_free()s won't be called
> simultaneously. But my bonding example shows this is not the
> case.
> 
> 

Ah yes you're right

Maybe ask Stephen ?

commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
Author: stephen hemminger <shemminger@vyatta.com>
Date:   Mon Apr 12 05:41:31 2010 +0000

    IPv6: keep route for tentative address
    
    Recent changes preserve IPv6 address when link goes down (good).
    But would cause address to point to dead dst entry (bad).
    The simplest fix is to just not delete route if address is
    being held for later use.
    
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1b00bfe..a9913d2 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
inet6_ifaddr *ifp)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
-               if (ip6_del_rt(ifp->rt))
+
+               if (ifp->dead && ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;
        }



^ permalink raw reply related

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-20 20:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <1271786247.7895.130.camel@edumazet-laptop>

On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> >  			addrconf_leave_anycast(ifp);
> >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> >  		dst_hold(&ifp->rt->u.dst);
> > -		if (ip6_del_rt(ifp->rt))
> > -			dst_free(&ifp->rt->u.dst);
> > +		ip6_del_rt(ifp->rt);
> >  		break;
> >  	}
> >  }
> > 
> 
> 
> I dont understand the problem Jiri.
> 
> We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> dst_free(), or we leak a refcount.

Well, no ... dst_free() does not decrement the refcount.
The "opposite" of dst_hold() is dst_release(). And it does not
automatically call dst_free() when the refcount drops to 0.
dst_free() needs to be called explicitly and it apparently
expects the caller to ensure that two dst_free()s won't be called
simultaneously. But my bonding example shows this is not the
case.


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Scott Feldman @ 2010-04-20 20:26 UTC (permalink / raw)
  To: Arnd Bergmann, Chris Wright; +Cc: davem, netdev
In-Reply-To: <201004201819.32970.arnd@arndb.de>

On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

>>> But that's only the case if the NIC itself is in VEPA mode. If that
>>> were the case, there would be no need for a kernel interface at all,
>>> because then we could just drive the port profile selection from user
>>> space.
>>> 
>>> The proposed interface only seems to make sense if you use it to
>>> configure the NIC itself! Why should it care about the port profile
>>> otherwise?
>> 
>> In the case of devices that can do adjacent switch negotiations directly.
> 
> I thought the idea to deal with those devices was to beat sense into
> the respective developers until they do the negotiation in software 8-)

When the device can do the negotiation directly with the switch, why does it
make sense to bypass that and use software on the host?  I don't think we'd
want to give up on link speed/duplex auto-negotiation and punt those setting
back to the user/host like in the old days.

-scott


^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Scott Feldman @ 2010-04-20 19:56 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: davem, netdev, chrisw
In-Reply-To: <201004201548.26609.arnd@arndb.de>

On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Monday 19 April 2010, Scott Feldman wrote:
> 
>> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
>> device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
>> control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
>> design allows for the case where master and slave are the
>> same netdev interface.
> 
> What is the reason for controlling the slave device through the master,
> rather than talking to the slave directly? The kernel always knows
> the master for each slave, so it seems to me that this information
> is redundant.

The interface would allow talking to the slave directly. In fact, that's the
example with enic port-profile in patch 2/2.  But, it would be nice not to
rule out the case where the master proxies slave control and the master is
under exclusively controlled by hypervisor.
 
> Is this new interface only for the case that you have a switch integrated
> in the NIC, or also for the case where you do an LLDP and EDP exchange
> with an adjacent bridge and put the device into VEPA mode?

All of the above.  Basing this on netlink give us flexibility to work with
user-space mgmt tools or directly with kernel netdev as in the enic case.
Not trying to make assumptions about where (user-space, kernel) and by which
entity sources or sinks the netlink msg.
 
>> One control setting example is MAC/VLAN settings for a VF.  Another
>> example control setting is a port-profile for a VF.  A port-profile is an
>> identifier that defines policy-based settings on the network port
>> backing the VF.  The network port settings examples are VLAN membership,
>> QoS settings, and L2 security settings, typical of a data center network.
>> 
>> This patch adds the iovnl interface definitions and an iovnl module.
> 
> How does this relate to the existing DCB netlink interface? My feeling
> is that there is some overlap in how it would get used, and some parts
> that are very distinct. In particular, I'd guess that you'd want to
> be able to set DCB parameters for each VF, but not all DCB adapters
> would support SR-IOV.
>
> Did you consider making this code an extension to the DCB interface
> instead of a separate one? What was the reason for your decision
> to keep it separate?

Considered it but DCB interface is well defined for DCB and it didn't seem
right gluing on interfaces not specified within DCB.  I agree that there is
some overlap in the sense that both interface are used to configure a netdev
with some properties interesting for the data center, but the DCB interface
is for local setting of the properties on the host whereas iovnl is about
pushing the setting of those properties to the network for policy-based
control.
 
> Also, do you expect your interface to be supported by dcbd/lldpad,
> or is there a good reason to create a new tool for iovnl?

Lldpad supporting this interface would seem right, for those cases where
lldpad is responsible for configuring the netdev.
 
>> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
>> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> 
> Can you elaborate more on what these do? Who is the 'client' and the 'host'
> in this case, and why do you need to identify them?

Those are optional and useful, for example, by the network mgmt tool for
presenting a view such as:

    - blade 1/2                     // know by host uuid
        - vm-rhel5-eth0             // client name
            - port-profile: xyz

Something like that.
 
>> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> 
> Just one mac address? What happens if we want to assign multiple mac
> addresses to the VF later? Also, how is this defined specifically?
> Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> later, or is this just the default value?

Depends on how the VF wants to handle this.  For our use-case with enic we
only need the port-profile op so I'm not sure what the best design is for
mac+vlan on a VF.  Looking for advise from folks like yourself.  If it's not
needed, let's scratch it.

-scott


^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Wolfgang Grandegger @ 2010-04-20 19:47 UTC (permalink / raw)
  To: Hans J. Koch
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100420135538.GA1994-hikPBsva6T+Nj9Bq2fkWzw@public.gmane.org>

Hans J. Koch wrote:
> In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...
> 
> This patch replaces dev_err() with printk() to avoid this.
> 
> Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> ---
>  drivers/net/can/usb/ems_usb.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.34-rc/drivers/net/can/usb/ems_usb.c
> ===================================================================
> --- linux-2.6.34-rc.orig/drivers/net/can/usb/ems_usb.c	2010-04-20 15:32:25.000000000 +0200
> +++ linux-2.6.34-rc/drivers/net/can/usb/ems_usb.c	2010-04-20 15:33:20.000000000 +0200
> @@ -1006,7 +1006,7 @@
>  
>  	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
>  	if (!netdev) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
> +		printk(KERN_ERR "ems_usb: Couldn't alloc candev\n");
>  		return -ENOMEM;
>  	}

I think "dev_err(&intf->dev, ...)" should be used before
SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
calls which need to be fixed.

Wolfgang.

^ permalink raw reply

* pull request: wireless-2.6 2010-04-20
From: John W. Linville @ 2010-04-20 18:30 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Dave,

Here are a few more intended for 2.6.34, mostly from the iwlwifi team.
They fix a couple of potential crashes, an incorrect EEPROM offset
related to regulatory information, and a connectivity failure relating
to 802.11n and the 4965.  The "iwlwifi: fix scan races" seems a little
long, but much of it is the fallout from renaming a goto label.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit b91ecb0027c7171c83d7cf443a22c39b1fde6d83:
  Tilman Schmidt (1):
        gigaset: include cleanup cleanup

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git master

Johannes Berg (2):
      iwlwifi: fix scan races
      mac80211: remove bogus TX agg state assignment

Reinette Chatre (1):
      mac80211: pass HT changes to driver when off channel

Shanyu Zhao (1):
      iwlwifi: correct 6000 EEPROM regulatory address

 drivers/net/wireless/iwlwifi/iwl-6000.c   |    4 +-
 drivers/net/wireless/iwlwifi/iwl-agn.c    |    1 +
 drivers/net/wireless/iwlwifi/iwl-core.c   |    1 -
 drivers/net/wireless/iwlwifi/iwl-core.h   |    2 +-
 drivers/net/wireless/iwlwifi/iwl-dev.h    |    1 +
 drivers/net/wireless/iwlwifi/iwl-eeprom.h |    4 +++
 drivers/net/wireless/iwlwifi/iwl-scan.c   |   31 ++++++++++++++++++----------
 net/mac80211/agg-tx.c                     |    1 -
 net/mac80211/mlme.c                       |    2 +
 9 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl-6000.c b/drivers/net/wireless/iwlwifi/iwl-6000.c
index c4844ad..92b3e64 100644
--- a/drivers/net/wireless/iwlwifi/iwl-6000.c
+++ b/drivers/net/wireless/iwlwifi/iwl-6000.c
@@ -259,7 +259,7 @@ static struct iwl_lib_ops iwl6000_lib = {
 			EEPROM_5000_REG_BAND_3_CHANNELS,
 			EEPROM_5000_REG_BAND_4_CHANNELS,
 			EEPROM_5000_REG_BAND_5_CHANNELS,
-			EEPROM_5000_REG_BAND_24_HT40_CHANNELS,
+			EEPROM_6000_REG_BAND_24_HT40_CHANNELS,
 			EEPROM_5000_REG_BAND_52_HT40_CHANNELS
 		},
 		.verify_signature  = iwlcore_eeprom_verify_signature,
@@ -323,7 +323,7 @@ static struct iwl_lib_ops iwl6050_lib = {
 			EEPROM_5000_REG_BAND_3_CHANNELS,
 			EEPROM_5000_REG_BAND_4_CHANNELS,
 			EEPROM_5000_REG_BAND_5_CHANNELS,
-			EEPROM_5000_REG_BAND_24_HT40_CHANNELS,
+			EEPROM_6000_REG_BAND_24_HT40_CHANNELS,
 			EEPROM_5000_REG_BAND_52_HT40_CHANNELS
 		},
 		.verify_signature  = iwlcore_eeprom_verify_signature,
diff --git a/drivers/net/wireless/iwlwifi/iwl-agn.c b/drivers/net/wireless/iwlwifi/iwl-agn.c
index e4c2e1e..ba0fdba 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn.c
@@ -3330,6 +3330,7 @@ static void iwl_cancel_deferred_work(struct iwl_priv *priv)
 
 	cancel_delayed_work_sync(&priv->init_alive_start);
 	cancel_delayed_work(&priv->scan_check);
+	cancel_work_sync(&priv->start_internal_scan);
 	cancel_delayed_work(&priv->alive_start);
 	cancel_work_sync(&priv->beacon_update);
 	del_timer_sync(&priv->statistics_periodic);
diff --git a/drivers/net/wireless/iwlwifi/iwl-core.c b/drivers/net/wireless/iwlwifi/iwl-core.c
index 894bcb8..1459cdb 100644
--- a/drivers/net/wireless/iwlwifi/iwl-core.c
+++ b/drivers/net/wireless/iwlwifi/iwl-core.c
@@ -3357,7 +3357,6 @@ static void iwl_force_rf_reset(struct iwl_priv *priv)
 	 */
 	IWL_DEBUG_INFO(priv, "perform radio reset.\n");
 	iwl_internal_short_hw_scan(priv);
-	return;
 }
 
 
diff --git a/drivers/net/wireless/iwlwifi/iwl-core.h b/drivers/net/wireless/iwlwifi/iwl-core.h
index 732590f..36940a9 100644
--- a/drivers/net/wireless/iwlwifi/iwl-core.h
+++ b/drivers/net/wireless/iwlwifi/iwl-core.h
@@ -506,7 +506,7 @@ void iwl_init_scan_params(struct iwl_priv *priv);
 int iwl_scan_cancel(struct iwl_priv *priv);
 int iwl_scan_cancel_timeout(struct iwl_priv *priv, unsigned long ms);
 int iwl_mac_hw_scan(struct ieee80211_hw *hw, struct cfg80211_scan_request *req);
-int iwl_internal_short_hw_scan(struct iwl_priv *priv);
+void iwl_internal_short_hw_scan(struct iwl_priv *priv);
 int iwl_force_reset(struct iwl_priv *priv, int mode);
 u16 iwl_fill_probe_req(struct iwl_priv *priv, struct ieee80211_mgmt *frame,
 		       const u8 *ie, int ie_len, int left);
diff --git a/drivers/net/wireless/iwlwifi/iwl-dev.h b/drivers/net/wireless/iwlwifi/iwl-dev.h
index 6054c5f..ef1720a 100644
--- a/drivers/net/wireless/iwlwifi/iwl-dev.h
+++ b/drivers/net/wireless/iwlwifi/iwl-dev.h
@@ -1296,6 +1296,7 @@ struct iwl_priv {
 	struct work_struct tt_work;
 	struct work_struct ct_enter;
 	struct work_struct ct_exit;
+	struct work_struct start_internal_scan;
 
 	struct tasklet_struct irq_tasklet;
 
diff --git a/drivers/net/wireless/iwlwifi/iwl-eeprom.h b/drivers/net/wireless/iwlwifi/iwl-eeprom.h
index 4e1ba82..8171c70 100644
--- a/drivers/net/wireless/iwlwifi/iwl-eeprom.h
+++ b/drivers/net/wireless/iwlwifi/iwl-eeprom.h
@@ -203,6 +203,10 @@ struct iwl_eeprom_enhanced_txpwr {
 #define EEPROM_5000_REG_BAND_52_HT40_CHANNELS  ((0x92)\
 		| INDIRECT_ADDRESS | INDIRECT_REGULATORY)   /* 22  bytes */
 
+/* 6000 regulatory - indirect access */
+#define EEPROM_6000_REG_BAND_24_HT40_CHANNELS  ((0x80)\
+		| INDIRECT_ADDRESS | INDIRECT_REGULATORY)   /* 14  bytes */
+
 /* 6000 and up regulatory tx power - indirect access */
 /* max. elements per section */
 #define EEPROM_MAX_TXPOWER_SECTION_ELEMENTS	(8)
diff --git a/drivers/net/wireless/iwlwifi/iwl-scan.c b/drivers/net/wireless/iwlwifi/iwl-scan.c
index bd2f7c4..5062f4e 100644
--- a/drivers/net/wireless/iwlwifi/iwl-scan.c
+++ b/drivers/net/wireless/iwlwifi/iwl-scan.c
@@ -469,6 +469,8 @@ EXPORT_SYMBOL(iwl_init_scan_params);
 
 static int iwl_scan_initiate(struct iwl_priv *priv)
 {
+	WARN_ON(!mutex_is_locked(&priv->mutex));
+
 	IWL_DEBUG_INFO(priv, "Starting scan...\n");
 	set_bit(STATUS_SCANNING, &priv->status);
 	priv->is_internal_short_scan = false;
@@ -546,24 +548,31 @@ EXPORT_SYMBOL(iwl_mac_hw_scan);
  * internal short scan, this function should only been called while associated.
  * It will reset and tune the radio to prevent possible RF related problem
  */
-int iwl_internal_short_hw_scan(struct iwl_priv *priv)
+void iwl_internal_short_hw_scan(struct iwl_priv *priv)
 {
-	int ret = 0;
+	queue_work(priv->workqueue, &priv->start_internal_scan);
+}
+
+static void iwl_bg_start_internal_scan(struct work_struct *work)
+{
+	struct iwl_priv *priv =
+		container_of(work, struct iwl_priv, start_internal_scan);
+
+	mutex_lock(&priv->mutex);
 
 	if (!iwl_is_ready_rf(priv)) {
-		ret = -EIO;
 		IWL_DEBUG_SCAN(priv, "not ready or exit pending\n");
-		goto out;
+		goto unlock;
 	}
+
 	if (test_bit(STATUS_SCANNING, &priv->status)) {
 		IWL_DEBUG_SCAN(priv, "Scan already in progress.\n");
-		ret = -EAGAIN;
-		goto out;
+		goto unlock;
 	}
+
 	if (test_bit(STATUS_SCAN_ABORTING, &priv->status)) {
 		IWL_DEBUG_SCAN(priv, "Scan request while abort pending\n");
-		ret = -EAGAIN;
-		goto out;
+		goto unlock;
 	}
 
 	priv->scan_bands = 0;
@@ -576,9 +585,8 @@ int iwl_internal_short_hw_scan(struct iwl_priv *priv)
 	set_bit(STATUS_SCANNING, &priv->status);
 	priv->is_internal_short_scan = true;
 	queue_work(priv->workqueue, &priv->request_scan);
-
-out:
-	return ret;
+ unlock:
+	mutex_unlock(&priv->mutex);
 }
 EXPORT_SYMBOL(iwl_internal_short_hw_scan);
 
@@ -964,6 +972,7 @@ void iwl_setup_scan_deferred_work(struct iwl_priv *priv)
 	INIT_WORK(&priv->scan_completed, iwl_bg_scan_completed);
 	INIT_WORK(&priv->request_scan, iwl_bg_request_scan);
 	INIT_WORK(&priv->abort_scan, iwl_bg_abort_scan);
+	INIT_WORK(&priv->start_internal_scan, iwl_bg_start_internal_scan);
 	INIT_DELAYED_WORK(&priv->scan_check, iwl_bg_scan_check);
 }
 EXPORT_SYMBOL(iwl_setup_scan_deferred_work);
diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
index 5538e1b..944a8a9 100644
--- a/net/mac80211/agg-tx.c
+++ b/net/mac80211/agg-tx.c
@@ -183,7 +183,6 @@ static void sta_addba_resp_timer_expired(unsigned long data)
 		       HT_AGG_STATE_REQ_STOP_BA_MSK)) !=
 						HT_ADDBA_REQUESTED_MSK) {
 		spin_unlock_bh(&sta->lock);
-		*state = HT_AGG_STATE_IDLE;
 #ifdef CONFIG_MAC80211_HT_DEBUG
 		printk(KERN_DEBUG "timer expired on tid %d but we are not "
 				"(or no longer) expecting addBA response there",
diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
index be5f723..8a96503 100644
--- a/net/mac80211/mlme.c
+++ b/net/mac80211/mlme.c
@@ -167,6 +167,8 @@ static u32 ieee80211_enable_ht(struct ieee80211_sub_if_data *sdata,
 	ht_changed = conf_is_ht(&local->hw.conf) != enable_ht ||
 		     channel_type != local->hw.conf.channel_type;
 
+	if (local->tmp_channel)
+		local->tmp_channel_type = channel_type;
 	local->oper_channel_type = channel_type;
 
 	if (ht_changed) {
-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH] net: ipv6 bind to device issue
From: Brian Haley @ 2010-04-20 18:13 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet,
	netdev
In-Reply-To: <1271767572-5282-1-git-send-email-jolsa@redhat.com>

Jiri Olsa wrote:
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index c2438e8..7bf7717 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>  {
>  	int flags = 0;
>  
> -	if (rt6_need_strict(&fl->fl6_dst))
> +	if (rt6_need_strict(&fl->fl6_dst) || fl->oif)
>  		flags |= RT6_LOOKUP_F_IFACE;
>  
>  	if (!ipv6_addr_any(&fl->fl6_src))

Actually, looking at this again, we might want to swap the order
here since fl->oif should be filled-in for most link-local and
multicast requests calling this:

	if (fl->oif || rt6_need_strict(&fl->fl6_dst))

Just a thought, but it potentially saves a call to determine
the scope of the address.

-Brian

^ permalink raw reply

* [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Peter P Waskiewicz Jr @ 2010-04-20 18:01 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel

This patch adds a callback function pointer to the irq_desc
structure, along with a registration function and a read-only
proc entry for each interrupt.

This affinity_hint handle for each interrupt can be used by
underlying drivers that need a better mechanism to control
interrupt affinity.  The underlying driver can register a
callback for the interrupt, which will allow the driver to
provide the CPU mask for the interrupt to anything that
requests it.  The intent is to extend the userspace daemon,
irqbalance, to help hint to it a preferred CPU mask to balance
the interrupt into.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/interrupt.h |   27 +++++++++++++++++++++++++++
 include/linux/irq.h       |    1 +
 kernel/irq/manage.c       |   39 +++++++++++++++++++++++++++++++++++++++
 kernel/irq/proc.c         |   41 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 75f3f00..f2a7d0b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -78,6 +78,8 @@ enum {
 };
 
 typedef irqreturn_t (*irq_handler_t)(int, void *);
+typedef unsigned int (*irq_affinity_hint_t)(cpumask_var_t, unsigned int,
+                                            void *);
 
 /**
  * struct irqaction - per interrupt action descriptor
@@ -105,6 +107,18 @@ struct irqaction {
 	unsigned long thread_flags;
 };
 
+/**
+ * struct irqaffinityhint - per interrupt affinity helper
+ * @callback:	device driver callback function
+ * @dev:	reference for the affected device
+ * @irq:	interrupt number
+ */
+struct irqaffinityhint {
+	irq_affinity_hint_t callback;
+	void *dev;
+	int irq;
+};
+
 extern irqreturn_t no_action(int cpl, void *dev_id);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
@@ -209,6 +223,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
 extern int irq_can_set_affinity(unsigned int irq);
 extern int irq_select_affinity(unsigned int irq);
 
+extern int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                      irq_affinity_hint_t callback);
+extern int irq_unregister_affinity_hint(unsigned int irq);
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -223,6 +240,16 @@ static inline int irq_can_set_affinity(unsigned int irq)
 
 static inline int irq_select_affinity(unsigned int irq)  { return 0; }
 
+static inline int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                             irq_affinity_hint_t callback)
+{
+	return -EINVAL;
+}
+
+static inline int irq_unregister_affinity_hint(unsigned int irq);
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SMP && CONFIG_GENERIC_HARDIRQS */
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 707ab12..bd73e9b 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -206,6 +206,7 @@ struct irq_desc {
 	struct proc_dir_entry	*dir;
 #endif
 	const char		*name;
+	struct irqaffinityhint	*hint;
 } ____cacheline_internodealigned_in_smp;
 
 extern void arch_init_copy_chip_data(struct irq_desc *old_desc,
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 704e488..3674b6a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -138,6 +138,42 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
 	return 0;
 }
 
+int irq_register_affinity_hint(unsigned int irq, void *dev,
+                               irq_affinity_hint_t callback)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	if (!desc->hint)
+		desc->hint = kmalloc(sizeof(struct irqaffinityhint),
+		                     GFP_KERNEL);
+	desc->hint->callback = callback;
+	desc->hint->dev = dev;
+	desc->hint->irq = irq;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_register_affinity_hint);
+
+int irq_unregister_affinity_hint(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	kfree(desc->hint);
+	desc->hint = NULL;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_unregister_affinity_hint);
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -916,6 +952,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->chip->disable(irq);
 	}
 
+	/* make sure affinity_hint callback is cleaned up */
+	kfree(desc->hint);
+
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
 	unregister_handler_proc(irq, action);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7a6eb04..59110a3 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -32,6 +32,23 @@ static int irq_affinity_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
+{
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+	struct cpumask mask;
+	unsigned int ret = 0;
+
+	if (desc->hint && desc->hint->callback) {
+		ret = desc->hint->callback(&mask, (long)m->private,
+		                           desc->hint->dev);
+		if (!ret)
+			seq_cpumask(m, &mask);
+	}
+
+	seq_putc(m, '\n');
+	return ret;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -79,11 +96,23 @@ free_cpumask:
 	return err;
 }
 
+static ssize_t irq_affinity_hint_proc_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *pos)
+{
+	/* affinity_hint is read-only from proc */
+	return -EOPNOTSUPP;
+}
+
 static int irq_affinity_proc_open(struct inode *inode, struct file *file)
 {
 	return single_open(file, irq_affinity_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -92,6 +121,14 @@ static const struct file_operations irq_affinity_proc_fops = {
 	.write		= irq_affinity_proc_write,
 };
 
+static const struct file_operations irq_affinity_hint_proc_fops = {
+	.open		= irq_affinity_hint_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= irq_affinity_hint_proc_write,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -231,6 +268,10 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/smp_affinity */
 	proc_create_data("smp_affinity", 0600, desc->dir,
 			 &irq_affinity_proc_fops, (void *)(long)irq);
+
+	/* create /proc/irq/<irq>/affinity_hint */
+	proc_create_data("affinity_hint", 0400, desc->dir,
+			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
 #endif
 
 	proc_create_data("spurious", 0444, desc->dir,

^ permalink raw reply related

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Eric Dumazet @ 2010-04-20 17:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <20100420174401.GB1334@midget.suse.cz>

Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> Hi,
> 
> I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
> to dst_free().
> 
> __ipv6_ifa_notify() contains:
> 
>         case RTM_DELADDR:
>                 if (ifp->idev->cnf.forwarding)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
>                 if (ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;
> 
> AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
> deletes the route: 
> 	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
> 	-> rt6_release() -> dst_free()
> 
> If it fails (like when it races with another invocation of ip6_del_rt()), it
> will return nonzero and this will cause the above code to call dst_free() on its own.
> 
> dst_free() has no protection against concurrent invocation and if

Sorry ? of course dst_free() has a protection...

By definition, the dst_destroy() is called only by the last thread with
the final refcount on object.

> two invocations make it through the "if (dst->obsolete > 1)"
> check before one of them calls __dst_free(), the same dst_entry
> may end up either:
> 	1) dst_destroy()ed and put on the dst_garbage.list, or
> 	2) put on the dst_garbage.list twice
> both resulting in trouble once the GC is run.
> 
> One possible code path leading to two invocations of __ipv6_ifa_notify() seems
> to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
> the bonding master is in the DAD phase with a tentative address:
> 
> netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
> ... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
>  -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
>  -> __ipv6_ifa_notify
> 
> 
> What is the reason __ipv6_ifa_notify() calls dst_free() when
> ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> with the dst still needing to be freed.
> 
> I am just testing whether the following will help:
> 
> --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
>  			addrconf_leave_anycast(ifp);
>  		addrconf_leave_solict(ifp->idev, &ifp->addr);
>  		dst_hold(&ifp->rt->u.dst);
> -		if (ip6_del_rt(ifp->rt))
> -			dst_free(&ifp->rt->u.dst);
> +		ip6_del_rt(ifp->rt);
>  		break;
>  	}
>  }
> 


I dont understand the problem Jiri.

We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
dst_free(), or we leak a refcount.




^ permalink raw reply

* [PATCH linux-next 2/2] ixgbe: Example usage of the new IRQ affinity_hint callback
From: Peter P Waskiewicz Jr @ 2010-04-20 18:01 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel
In-Reply-To: <20100420180112.1276.11906.stgit@ppwaskie-hc2.jf.intel.com>

This patch uses the new IRQ affinity_hint callback mechanism.
It serves purely as an example of how a low-level driver can
utilize this new interface.

An official ixgbe patch will be pushed through netdev once the
IRQ patches have been accepted and merged.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe.h      |    2 ++
 drivers/net/ixgbe/ixgbe_main.c |   51 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..c220b9f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -32,6 +32,7 @@
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/aer.h>
+#include <linux/cpumask.h>
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
@@ -236,6 +237,7 @@ struct ixgbe_q_vector {
 	u8 tx_itr;
 	u8 rx_itr;
 	u32 eitr;
+	cpumask_var_t affinity_mask;
 };
 
 /* Helper macros to switch between ints/sec and what the register uses.
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..3e00d41 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1031,6 +1031,36 @@ next_desc:
 	return cleaned;
 }
 
+static unsigned int ixgbe_irq_affinity_callback(cpumask_var_t mask,
+                                                unsigned int irq, void *dev)
+{
+	struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)dev;
+	int i, q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
+
+	if (test_bit(__IXGBE_DOWN, &adapter->state))
+		return -EINVAL;
+
+	/*
+	 * Loop through the msix_entries array, looking for the vector that
+	 * matches the irq passed to us.  Once we find it, use that index to
+	 * grab the corresponding q_vector (1 to 1 mapping), and grab the
+	 * cpumask from that q_vector.
+	 */
+	for (i = 0; i < q_vectors; i++)
+		if (adapter->msix_entries[i].vector == irq)
+			break;
+
+	if (i == q_vectors)
+		return -EINVAL;
+
+	if (adapter->q_vector[i]->affinity_mask)
+		cpumask_copy(mask, adapter->q_vector[i]->affinity_mask);
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
 static int ixgbe_clean_rxonly(struct napi_struct *, int);
 /**
  * ixgbe_configure_msix - Configure MSI-X hardware
@@ -1083,6 +1113,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
 			q_vector->eitr = adapter->rx_eitr_param;
 
 		ixgbe_write_eitr(q_vector);
+
+		/*
+		 * Allocate the affinity_hint cpumask, assign the mask for
+		 * this vector, and register our affinity_hint callback.
+		 */
+		alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL);
+		cpumask_set_cpu(v_idx, q_vector->affinity_mask);
+		irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
+		                           adapter,
+		                           &ixgbe_irq_affinity_callback);
 	}
 
 	if (adapter->hw.mac.type == ixgbe_mac_82598EB)
@@ -3218,7 +3258,7 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
 	u32 txdctl;
-	int i, j;
+	int i, j, num_q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
@@ -3251,6 +3291,14 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 
 	ixgbe_napi_disable_all(adapter);
 
+	for (i = 0; i < num_q_vectors; i++) {
+		struct ixgbe_q_vector *q_vector = adapter->q_vector[i];
+		/* unregister the affinity_hint callback */
+		irq_unregister_affinity_hint(adapter->msix_entries[i].vector);
+		/* release the CPU mask memory */
+		free_cpumask_var(q_vector->affinity_mask);
+	}
+
 	clear_bit(__IXGBE_SFP_MODULE_NOT_FOUND, &adapter->state);
 	del_timer_sync(&adapter->sfp_timer);
 	del_timer_sync(&adapter->watchdog_timer);
@@ -4052,6 +4100,7 @@ static void ixgbe_free_q_vectors(struct ixgbe_adapter *adapter)
 		struct ixgbe_q_vector *q_vector = adapter->q_vector[q_idx];
 		adapter->q_vector[q_idx] = NULL;
 		netif_napi_del(&q_vector->napi);
+		free_cpumask_var(q_vector->affinity_mask);
 		kfree(q_vector);
 	}
 }


^ permalink raw reply related

* IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-20 17:44 UTC (permalink / raw)
  To: netdev; +Cc: Hideaki YOSHIFUJI, David Miller

Hi,

I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
to dst_free().

__ipv6_ifa_notify() contains:

        case RTM_DELADDR:
                if (ifp->idev->cnf.forwarding)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
                if (ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;

AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
deletes the route: 
	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
	-> rt6_release() -> dst_free()

If it fails (like when it races with another invocation of ip6_del_rt()), it
will return nonzero and this will cause the above code to call dst_free() on its own.

dst_free() has no protection against concurrent invocation and if
two invocations make it through the "if (dst->obsolete > 1)"
check before one of them calls __dst_free(), the same dst_entry
may end up either:
	1) dst_destroy()ed and put on the dst_garbage.list, or
	2) put on the dst_garbage.list twice
both resulting in trouble once the GC is run.

One possible code path leading to two invocations of __ipv6_ifa_notify() seems
to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
the bonding master is in the DAD phase with a tentative address:

netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
 -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
 -> __ipv6_ifa_notify


What is the reason __ipv6_ifa_notify() calls dst_free() when
ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
with the dst still needing to be freed.

I am just testing whether the following will help:

--- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
+++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
@@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
 			addrconf_leave_anycast(ifp);
 		addrconf_leave_solict(ifp->idev, &ifp->addr);
 		dst_hold(&ifp->rt->u.dst);
-		if (ip6_del_rt(ifp->rt))
-			dst_free(&ifp->rt->u.dst);
+		ip6_del_rt(ifp->rt);
 		break;
 	}
 }

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [PATCH net-next-2.6 v4 10/14] l2tp: Convert rwlock to RCU
From: Eric Dumazet @ 2010-04-20 16:55 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev
In-Reply-To: <20100402161916.11367.32252.stgit@bert.katalix.com>

Le vendredi 02 avril 2010 à 17:19 +0100, James Chapman a écrit :
> Reader/write locks are discouraged because they are slower than spin
> locks. So this patch converts the rwlocks used in the per_net structs
> to rcu.
> 
> Signed-off-by: James Chapman <jchapman@katalix.com>

>  static inline struct l2tp_net *l2tp_pernet(struct net *net)
> @@ -139,14 +140,14 @@ static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
>  	struct l2tp_session *session;
>  	struct hlist_node *walk;
>  
> -	read_lock_bh(&pn->l2tp_session_hlist_lock);
> -	hlist_for_each_entry(session, walk, session_list, global_hlist) {
> +	rcu_read_lock_bh();
> +	hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
>  		if (session->session_id == session_id) {
> -			read_unlock_bh(&pn->l2tp_session_hlist_lock);
> +			rcu_read_unlock_bh();
>  			return session;
>  		}
>  	}
> -	read_unlock_bh(&pn->l2tp_session_hlist_lock);
> +	rcu_read_unlock_bh();
>  

Hi James

I started a while ago patching l2tp but I wont be able to finish and
test the thing...

There is a fundamental problem with this kind of construct :
(this was wrong even better your RCU conversion)

rcu_read_lock_bh()
hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
	if (session->session_id == session_id) {
		rcu_read_unlock_bh();
		return session;
	}
}
rcu_read_unlock_bh();


While the lookup _is_ protected, the result is not.

As soon as you call rcu_read_unlock_bh(); and before the "return
session;", current thread could be preempted and an other thread frees
session under first thread. Unexpected things can then happen.

Therefore, you need either to :

1) Take a refcount on session (or tunnel) before the return
2) Or move the rcu_read_lock_bh()/rcu_read_unlock_bh() at callers.
3) Or all callers use a stronger lock. But then, why use RCU ;)

Here is a preliminary patch, obviously not finished, nor compiled, nor
tested, to give possible ways to handle this problem.

(I added the ref parameter to make sure to change function signatures,
maybe its not necessary and we should always take references)

Thanks

 net/l2tp/l2tp_core.c    |   25 ++++++++++++++-----------
 net/l2tp/l2tp_core.h    |    6 +++---
 net/l2tp/l2tp_debugfs.c |    2 +-
 net/l2tp/l2tp_eth.c     |    4 ++--
 net/l2tp/l2tp_ip.c      |    8 +++++---
 net/l2tp/l2tp_netlink.c |   26 +++++++++++++-------------
 6 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index ecc7aea..e662659 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -132,7 +132,7 @@ l2tp_session_id_hash_2(struct l2tp_net *pn, u32 session_id)
 
 /* Lookup a session by id in the global session list
  */
-static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
+static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id, int ref)
 {
 	struct l2tp_net *pn = l2tp_pernet(net);
 	struct hlist_head *session_list =
@@ -143,6 +143,8 @@ static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
 	rcu_read_lock_bh();
 	hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
 		if (session->session_id == session_id) {
+			if (ref)
+				l2tp_session_inc_refcount(session);
 			rcu_read_unlock_bh();
 			return session;
 		}
@@ -166,7 +168,7 @@ l2tp_session_id_hash(struct l2tp_tunnel *tunnel, u32 session_id)
 
 /* Lookup a session by id
  */
-struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id)
+struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id, int ref)
 {
 	struct hlist_head *session_list;
 	struct l2tp_session *session;
@@ -177,12 +179,14 @@ struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunn
 	 * tunnel.
 	 */
 	if (tunnel == NULL)
-		return l2tp_session_find_2(net, session_id);
+		return l2tp_session_find_2(net, session_id, ref);
 
 	session_list = l2tp_session_id_hash(tunnel, session_id);
 	read_lock_bh(&tunnel->hlist_lock);
 	hlist_for_each_entry(session, walk, session_list, hlist) {
 		if (session->session_id == session_id) {
+			if (ref)
+				l2tp_session_inc_refcount(session);
 			read_unlock_bh(&tunnel->hlist_lock);
 			return session;
 		}
@@ -193,7 +197,7 @@ struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunn
 }
 EXPORT_SYMBOL_GPL(l2tp_session_find);
 
-struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth)
+struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth, int ref)
 {
 	int hash;
 	struct hlist_node *walk;
@@ -204,6 +208,8 @@ struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth)
 	for (hash = 0; hash < L2TP_HASH_SIZE; hash++) {
 		hlist_for_each_entry(session, walk, &tunnel->session_hlist[hash], hlist) {
 			if (++count > nth) {
+				if (ref)
+					l2tp_session_inc_refcount(session);
 				read_unlock_bh(&tunnel->hlist_lock);
 				return session;
 			}
@@ -244,7 +250,7 @@ EXPORT_SYMBOL_GPL(l2tp_session_find_by_ifname);
 
 /* Lookup a tunnel by id
  */
-struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id)
+struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id, int ref)
 {
 	struct l2tp_tunnel *tunnel;
 	struct l2tp_net *pn = l2tp_pernet(net);
@@ -252,6 +258,8 @@ struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id)
 	rcu_read_lock_bh();
 	list_for_each_entry_rcu(tunnel, &pn->l2tp_tunnel_list, list) {
 		if (tunnel->tunnel_id == tunnel_id) {
+			if (ref)
+				l2tp_tunnel_inc_refcount(tunnel);
 			rcu_read_unlock_bh();
 			return tunnel;
 		}
@@ -500,11 +508,6 @@ void l2tp_recv_common(struct l2tp_session *session, struct sk_buff *skb,
 	int offset;
 	u32 ns, nr;
 
-	/* The ref count is increased since we now hold a pointer to
-	 * the session. Take care to decrement the refcnt when exiting
-	 * this function from now on...
-	 */
-	l2tp_session_inc_refcount(session);
 	if (session->ref)
 		(*session->ref)(session);
 
@@ -785,7 +788,7 @@ int l2tp_udp_recv_core(struct l2tp_tunnel *tunnel, struct sk_buff *skb,
 	}
 
 	/* Find the session context */
-	session = l2tp_session_find(tunnel->l2tp_net, tunnel, session_id);
+	session = l2tp_session_find(tunnel->l2tp_net, tunnel, session_id, 1);
 	if (!session || !session->recv_skb) {
 		/* Not found? Pass to userspace to deal with */
 		PRINTK(tunnel->debug, L2TP_MSG_DATA, KERN_INFO,
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index f0f318e..51df82e 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -221,10 +221,10 @@ out:
 	return tunnel;
 }
 
-extern struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id);
-extern struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth);
+extern struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id, int ref);
+extern struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth, int ref);
 extern struct l2tp_session *l2tp_session_find_by_ifname(struct net *net, char *ifname);
-extern struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id);
+extern struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id, int ref);
 extern struct l2tp_tunnel *l2tp_tunnel_find_nth(struct net *net, int nth);
 
 extern int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg, struct l2tp_tunnel **tunnelp);
diff --git a/net/l2tp/l2tp_debugfs.c b/net/l2tp/l2tp_debugfs.c
index 104ec3b..a7ecda2 100644
--- a/net/l2tp/l2tp_debugfs.c
+++ b/net/l2tp/l2tp_debugfs.c
@@ -51,7 +51,7 @@ static void l2tp_dfs_next_tunnel(struct l2tp_dfs_seq_data *pd)
 
 static void l2tp_dfs_next_session(struct l2tp_dfs_seq_data *pd)
 {
-	pd->session = l2tp_session_find_nth(pd->tunnel, pd->session_idx);
+	pd->session = l2tp_session_find_nth(pd->tunnel, pd->session_idx, 0);
 	pd->session_idx++;
 
 	if (pd->session == NULL) {
diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index ca1164a..b069447 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -194,13 +194,13 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 p
 	int rc;
 	struct l2tp_eth_net *pn;
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (!tunnel) {
 		rc = -ENODEV;
 		goto out;
 	}
 
-	session = l2tp_session_find(net, tunnel, session_id);
+	session = l2tp_session_find(net, tunnel, session_id, 0);
 	if (session) {
 		rc = -EEXIST;
 		goto out;
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 0852512..dbd7b7f 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -126,7 +126,7 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 	u32 session_id;
 	u32 tunnel_id;
 	unsigned char *ptr, *optr;
-	struct l2tp_session *session;
+	struct l2tp_session *session = NULL;
 	struct l2tp_tunnel *tunnel = NULL;
 	int length;
 	int offset;
@@ -150,7 +150,7 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 	}
 
 	/* Ok, this is a data packet. Lookup the session. */
-	session = l2tp_session_find(&init_net, NULL, session_id);
+	session = l2tp_session_find(&init_net, NULL, session_id, 1);
 	if (session == NULL)
 		goto discard;
 
@@ -187,7 +187,7 @@ pass_up:
 		goto discard;
 
 	tunnel_id = ntohl(*(__be32 *) &skb->data[4]);
-	tunnel = l2tp_tunnel_find(&init_net, tunnel_id);
+	tunnel = l2tp_tunnel_find(&init_net, tunnel_id, 0);
 	if (tunnel != NULL)
 		sk = tunnel->sock;
 	else {
@@ -214,6 +214,8 @@ discard_put:
 	sock_put(sk);
 
 discard:
+	if (session)
+		l2tp_session_dec_refcount(session);
 	kfree_skb(skb);
 	return 0;
 }
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 4c1e540..5f42d48 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -40,7 +40,7 @@ static struct genl_family l2tp_nl_family = {
 /* Accessed under genl lock */
 static const struct l2tp_nl_cmd_ops *l2tp_nl_cmd_ops[__L2TP_PWTYPE_MAX];
 
-static struct l2tp_session *l2tp_nl_session_find(struct genl_info *info)
+static struct l2tp_session *l2tp_nl_session_find(struct genl_info *info, int ref)
 {
 	u32 tunnel_id;
 	u32 session_id;
@@ -56,9 +56,9 @@ static struct l2tp_session *l2tp_nl_session_find(struct genl_info *info)
 		   (info->attrs[L2TP_ATTR_CONN_ID])) {
 		tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 		session_id = nla_get_u32(info->attrs[L2TP_ATTR_SESSION_ID]);
-		tunnel = l2tp_tunnel_find(net, tunnel_id);
+		tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 		if (tunnel)
-			session = l2tp_session_find(net, tunnel, session_id);
+			session = l2tp_session_find(net, tunnel, session_id, ref);
 	}
 
 	return session;
@@ -148,7 +148,7 @@ static int l2tp_nl_cmd_tunnel_create(struct sk_buff *skb, struct genl_info *info
 	if (info->attrs[L2TP_ATTR_DEBUG])
 		cfg.debug = nla_get_u32(info->attrs[L2TP_ATTR_DEBUG]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel != NULL) {
 		ret = -EEXIST;
 		goto out;
@@ -180,7 +180,7 @@ static int l2tp_nl_cmd_tunnel_delete(struct sk_buff *skb, struct genl_info *info
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -205,7 +205,7 @@ static int l2tp_nl_cmd_tunnel_modify(struct sk_buff *skb, struct genl_info *info
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -292,7 +292,7 @@ static int l2tp_nl_cmd_tunnel_get(struct sk_buff *skb, struct genl_info *info)
 
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -359,7 +359,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		goto out;
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (!tunnel) {
 		ret = -ENODEV;
 		goto out;
@@ -370,7 +370,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		goto out;
 	}
 	session_id = nla_get_u32(info->attrs[L2TP_ATTR_SESSION_ID]);
-	session = l2tp_session_find(net, tunnel, session_id);
+	session = l2tp_session_find(net, tunnel, session_id, 0);
 	if (session) {
 		ret = -EEXIST;
 		goto out;
@@ -495,7 +495,7 @@ static int l2tp_nl_cmd_session_delete(struct sk_buff *skb, struct genl_info *inf
 	struct l2tp_session *session;
 	u16 pw_type;
 
-	session = l2tp_nl_session_find(info);
+	session = l2tp_nl_session_find(info, 0);
 	if (session == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -515,7 +515,7 @@ static int l2tp_nl_cmd_session_modify(struct sk_buff *skb, struct genl_info *inf
 	int ret = 0;
 	struct l2tp_session *session;
 
-	session = l2tp_nl_session_find(info);
+	session = l2tp_nl_session_find(info, 1);
 	if (session == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -615,7 +615,7 @@ static int l2tp_nl_cmd_session_get(struct sk_buff *skb, struct genl_info *info)
 	struct sk_buff *msg;
 	int ret;
 
-	session = l2tp_nl_session_find(info);
+	session = l2tp_nl_session_find(info, 0);
 	if (session == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -656,7 +656,7 @@ static int l2tp_nl_cmd_session_dump(struct sk_buff *skb, struct netlink_callback
 				goto out;
 		}
 
-		session = l2tp_session_find_nth(tunnel, si);
+		session = l2tp_session_find_nth(tunnel, si, 0);
 		if (session == NULL) {
 			ti++;
 			tunnel = NULL;








^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-20 16:19 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev
In-Reply-To: <20100420152206.GA24942@x200.localdomain>

On Tuesday 20 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Chris Wright wrote:
> >
> > After thinking some more about this case, I now believe we should do
> > it the other way around, and have lldpad in control of this interface
> > from the user space side, and letting user programs (lldptool, libvirt,
> > ...) talk to lldpad in order to set it up.
> 
> lldpad won't be involved in all cases, yet a mgmt tool like libvirt will.
> so this seems backwards.

Well, that part is still the matter of this discussion, as far as I can tell ;-)

> > But that's only the case if the NIC itself is in VEPA mode. If that
> > were the case, there would be no need for a kernel interface at all,
> > because then we could just drive the port profile selection from user
> > space.
> > 
> > The proposed interface only seems to make sense if you use it to
> > configure the NIC itself! Why should it care about the port profile
> > otherwise?
> 
> In the case of devices that can do adjacent switch negotiations directly.

I thought the idea to deal with those devices was to beat sense into
the respective developers until they do the negotiation in software 8-)

> > > > Same here: Should you be able to set multiple MAC addresses, or
> > > > trunk mode? Can the VF override it?
> > > > Also, for the new multi-channel VEPA, I'd guess that you also need
> > > > to supply an 802.1ad S-VLAN ID.
> > > 
> > > Something like set_port_profile() would initiate the negotiation for the
> > > s-vlan id for a particular channel, not sure it's needed as part of the
> > > netlink interface or not.
> > 
> > Well, you have to set up the s-vlan ID in order to have something to
> > set the port profile in.
> 
> Right, depends if the use the port profile to establish the channel and
> negotiate the s-vlan ID.  I don't recall the order there.

I'm pretty sure that setting up the channel (for 802.1bg) is done before
any port profile comes in.

	Arnd

^ permalink raw reply

* Re: [PATCH] NET: Fix an RCU warning in dev_pick_tx()
From: Eric Dumazet @ 2010-04-20 16:03 UTC (permalink / raw)
  To: David Howells; +Cc: netdev
In-Reply-To: <20100420102558.31923.91.stgit@warthog.procyon.org.uk>

Le mardi 20 avril 2010 à 11:25 +0100, David Howells a écrit :
> Fix the following RCU warning in dev_pick_tx():
> 
> ===================================================
> [ INFO: suspicious rcu_dereference_check() usage. ]
> ---------------------------------------------------
> net/core/dev.c:1993 invoked rcu_dereference_check() without protection!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 1, debug_locks = 0
> 2 locks held by swapper/0:
>  #0:  (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278
>  #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc
> 
> stack backtrace:
> Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4
> Call Trace:
>  <IRQ>  [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2
>  [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc
>  [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc
>  [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1
>  [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c
>  [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4
>  [<ffffffff81350c34>] ip6_output2+0x256/0x261
>  [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb
>  [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d
>  [<ffffffff81368acb>] mld_sendpack+0x273/0x39d
>  [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d
>  [<ffffffff81052099>] ? mark_held_locks+0x52/0x70
>  [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288
>  [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278
>  [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278
>  [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288
>  [<ffffffff81035531>] ? __do_softirq+0x69/0x140
>  [<ffffffff8103556a>] __do_softirq+0xa2/0x140
>  [<ffffffff81002e0c>] call_softirq+0x1c/0x28
>  [<ffffffff81004b54>] do_softirq+0x38/0x80
>  [<ffffffff81034f06>] irq_exit+0x45/0x47
>  [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96
>  [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20
>  <EOI>  [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86
>  [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78
>  [<ffffffff810096b6>] ? mwait_idle+0x65/0x78
>  [<ffffffff810011cb>] cpu_idle+0x4d/0x83
>  [<ffffffff81380b05>] rest_init+0xb9/0xc0
>  [<ffffffff81380a4c>] ? rest_init+0x0/0xc0
>  [<ffffffff8168dcf0>] start_kernel+0x392/0x39d
>  [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7
>  [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb
> 
> An rcu_dereference() should be an rcu_dereference_bh().
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 

Absolutely right, thanks David

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

This might conflict with following commit in net-next-2.6, where I chose
the rcu_dereference_check() alternative

commit b6c6712a42ca3f9fa7f4a3d7c40e3a9dd1fd9e03
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu Apr 8 23:03:29 2010 +0000

    net: sk_dst_cache RCUification
    
    With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
    work.
    
    sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
    
    This rwlock is readlocked for a very small amount of time, and dst
    entries are already freed after RCU grace period. This calls for RCU
    again :)
    
    This patch converts sk_dst_lock to a spinlock, and use RCU for readers.
    
    __sk_dst_get() is supposed to be called with rcu_read_lock() or if
    socket locked by user, so use appropriate rcu_dereference_check()
    condition (rcu_read_lock_held() || sock_owned_by_user(sk))
    
    This patch avoids two atomic ops per tx packet on UDP connected sockets,
    for example, and permits sk_dst_lock to be much less dirtied.
    
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/include/net/dst.h b/include/net/dst.h
index ce078cd..aac5a5f 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -225,21 +225,6 @@ static inline void dst_confirm(struct dst_entry *dst)
 		neigh_confirm(dst->neighbour);
 }
 
-static inline void dst_negative_advice(struct dst_entry **dst_p,
-				       struct sock *sk)
-{
-	struct dst_entry * dst = *dst_p;
-	if (dst && dst->ops->negative_advice) {
-		*dst_p = dst->ops->negative_advice(dst);
-
-		if (dst != *dst_p) {
-			extern void sk_reset_txq(struct sock *sk);
-
-			sk_reset_txq(sk);
-		}
-	}
-}
-
 static inline void dst_link_failure(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 68f6783..278312c 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -152,9 +152,9 @@ static inline void __ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 				 struct in6_addr *daddr, struct in6_addr *saddr)
 {
-	write_lock(&sk->sk_dst_lock);
+	spin_lock(&sk->sk_dst_lock);
 	__ip6_dst_store(sk, dst, daddr, saddr);
-	write_unlock(&sk->sk_dst_lock);
+	spin_unlock(&sk->sk_dst_lock);
 }
 
 static inline int ipv6_unicast_destination(struct sk_buff *skb)
diff --git a/include/net/sock.h b/include/net/sock.h
index b4603cd..56df440 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -262,7 +262,7 @@ struct sock {
 #ifdef CONFIG_XFRM
 	struct xfrm_policy	*sk_policy[2];
 #endif
-	rwlock_t		sk_dst_lock;
+	spinlock_t		sk_dst_lock;
 	atomic_t		sk_rmem_alloc;
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
@@ -1192,7 +1192,8 @@ extern unsigned long sock_i_ino(struct sock *sk);
 static inline struct dst_entry *
 __sk_dst_get(struct sock *sk)
 {
-	return sk->sk_dst_cache;
+	return rcu_dereference_check(sk->sk_dst_cache, rcu_read_lock_held() ||
+						       sock_owned_by_user(sk));
 }
 
 static inline struct dst_entry *
@@ -1200,50 +1201,62 @@ sk_dst_get(struct sock *sk)
 {
 	struct dst_entry *dst;
 
-	read_lock(&sk->sk_dst_lock);
-	dst = sk->sk_dst_cache;
+	rcu_read_lock();
+	dst = rcu_dereference(sk->sk_dst_cache);
 	if (dst)
 		dst_hold(dst);
-	read_unlock(&sk->sk_dst_lock);
+	rcu_read_unlock();
 	return dst;
 }
 
+extern void sk_reset_txq(struct sock *sk);
+
+static inline void dst_negative_advice(struct sock *sk)
+{
+	struct dst_entry *ndst, *dst = __sk_dst_get(sk);
+
+	if (dst && dst->ops->negative_advice) {
+		ndst = dst->ops->negative_advice(dst);
+
+		if (ndst != dst) {
+			rcu_assign_pointer(sk->sk_dst_cache, ndst);
+			sk_reset_txq(sk);
+		}
+	}
+}
+
 static inline void
 __sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
 	struct dst_entry *old_dst;
 
 	sk_tx_queue_clear(sk);
-	old_dst = sk->sk_dst_cache;
-	sk->sk_dst_cache = dst;
+	old_dst = rcu_dereference_check(sk->sk_dst_cache,
+					lockdep_is_held(&sk->sk_dst_lock));
+	rcu_assign_pointer(sk->sk_dst_cache, dst);
 	dst_release(old_dst);
 }
 
 static inline void
 sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	write_lock(&sk->sk_dst_lock);
+	spin_lock(&sk->sk_dst_lock);
 	__sk_dst_set(sk, dst);
-	write_unlock(&sk->sk_dst_lock);
+	spin_unlock(&sk->sk_dst_lock);
 }
 
 static inline void
 __sk_dst_reset(struct sock *sk)
 {
-	struct dst_entry *old_dst;
-
-	sk_tx_queue_clear(sk);
-	old_dst = sk->sk_dst_cache;
-	sk->sk_dst_cache = NULL;
-	dst_release(old_dst);
+	__sk_dst_set(sk, NULL);
 }
 
 static inline void
 sk_dst_reset(struct sock *sk)
 {
-	write_lock(&sk->sk_dst_lock);
+	spin_lock(&sk->sk_dst_lock);
 	__sk_dst_reset(sk);
-	write_unlock(&sk->sk_dst_lock);
+	spin_unlock(&sk->sk_dst_lock);
 }
 
 extern struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
diff --git a/net/core/dev.c b/net/core/dev.c
index 0eb79e3..ca4cdef 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2015,7 +2015,7 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 			if (dev->real_num_tx_queues > 1)
 				queue_index = skb_tx_hash(dev, skb);
 
-			if (sk && sk->sk_dst_cache)
+			if (sk && rcu_dereference_check(sk->sk_dst_cache, 1))
 				sk_tx_queue_set(sk, queue_index);
 		}
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..7effa1e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -364,11 +364,11 @@ EXPORT_SYMBOL(sk_reset_txq);
 
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-	struct dst_entry *dst = sk->sk_dst_cache;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
 		sk_tx_queue_clear(sk);
-		sk->sk_dst_cache = NULL;
+		rcu_assign_pointer(sk->sk_dst_cache, NULL);
 		dst_release(dst);
 		return NULL;
 	}
@@ -1157,7 +1157,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		skb_queue_head_init(&newsk->sk_async_wait_queue);
 #endif
 
-		rwlock_init(&newsk->sk_dst_lock);
+		spin_lock_init(&newsk->sk_dst_lock);
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1898,7 +1898,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	} else
 		sk->sk_sleep	=	NULL;
 
-	rwlock_init(&sk->sk_dst_lock);
+	spin_lock_init(&sk->sk_dst_lock);
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,
diff --git a/net/dccp/timer.c b/net/dccp/timer.c
index bbfeb5e..1a9aa05 100644
--- a/net/dccp/timer.c
+++ b/net/dccp/timer.c
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct sock *sk)
 
 	if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
 		if (icsk->icsk_retransmits != 0)
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		retry_until = icsk->icsk_syn_retries ?
 			    : sysctl_dccp_request_retries;
 	} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct sock *sk)
 			   Golden words :-).
 		   */
 
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		}
 
 		retry_until = sysctl_dccp_retries2;
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 2b494fa..55e3b6b 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -446,7 +446,7 @@ static void dn_destruct(struct sock *sk)
 	skb_queue_purge(&scp->other_xmit_queue);
 	skb_queue_purge(&scp->other_receive_queue);
 
-	dst_release(xchg(&sk->sk_dst_cache, NULL));
+	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
 }
 
 static int dn_memory_pressure;
@@ -1105,7 +1105,7 @@ static int dn_accept(struct socket *sock, struct socket *newsock, int flags)
 	release_sock(sk);
 
 	dst = skb_dst(skb);
-	dst_release(xchg(&newsk->sk_dst_cache, dst));
+	sk_dst_set(newsk, dst);
 	skb_dst_set(skb, NULL);
 
 	DN_SK(newsk)->state        = DN_CR;
@@ -1956,7 +1956,7 @@ static int dn_sendmsg(struct kiocb *iocb, struct socket *sock,
 	}
 
 	if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
-		dst_negative_advice(&sk->sk_dst_cache, sk);
+		dst_negative_advice(sk);
 
 	mss = scp->segsize_rem;
 	fctype = scp->services_rem & NSP_FC_MASK;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a0beb32..193dcd6 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -154,7 +154,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON(sk->sk_forward_alloc);
 
 	kfree(inet->opt);
-	dst_release(sk->sk_dst_cache);
+	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
 	sk_refcnt_debug_dec(sk);
 }
 EXPORT_SYMBOL(inet_sock_destruct);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4000b10..ae3ec15 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3710,7 +3710,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	}
 
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
-		dst_confirm(sk->sk_dst_cache);
+		dst_confirm(__sk_dst_get(sk));
 
 	return 1;
 
@@ -5833,7 +5833,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 			if (tp->snd_una == tp->write_seq) {
 				tcp_set_state(sk, TCP_FIN_WAIT2);
 				sk->sk_shutdown |= SEND_SHUTDOWN;
-				dst_confirm(sk->sk_dst_cache);
+				dst_confirm(__sk_dst_get(sk));
 
 				if (!sock_flag(sk, SOCK_DEAD))
 					/* Wake up lingering close() */
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 8a0ab29..c732be0 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -172,14 +172,14 @@ static int tcp_write_timeout(struct sock *sk)
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		}
 
 		retry_until = sysctl_tcp_retries2;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 33f60fc..1160400 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -114,9 +114,9 @@ struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
 		}
 		opt = xchg(&inet6_sk(sk)->opt, opt);
 	} else {
-		write_lock(&sk->sk_dst_lock);
+		spin_lock(&sk->sk_dst_lock);
 		opt = xchg(&inet6_sk(sk)->opt, opt);
-		write_unlock(&sk->sk_dst_lock);
+		spin_unlock(&sk->sk_dst_lock);
 	}
 	sk_dst_reset(sk);
 
@@ -971,14 +971,13 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 	case IPV6_MTU:
 	{
 		struct dst_entry *dst;
+
 		val = 0;
-		lock_sock(sk);
-		dst = sk_dst_get(sk);
-		if (dst) {
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
+		if (dst)
 			val = dst_mtu(dst);
-			dst_release(dst);
-		}
-		release_sock(sk);
+		rcu_read_unlock();
 		if (!val)
 			return -ENOTCONN;
 		break;
@@ -1066,12 +1065,14 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 		else
 			val = np->mcast_hops;
 
-		dst = sk_dst_get(sk);
-		if (dst) {
-			if (val < 0)
+		if (val < 0) {
+			rcu_read_lock();
+			dst = __sk_dst_get(sk);
+			if (dst)
 				val = ip6_dst_hoplimit(dst);
-			dst_release(dst);
+			rcu_read_unlock();
 		}
+
 		if (val < 0)
 			val = sock_net(sk)->ipv6.devconf_all->hop_limit;
 		break;





^ permalink raw reply related

* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: Timur Tabi @ 2010-04-20 15:59 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Andy Fleming, davem, netdev
In-Reply-To: <o2oed82fe3e1004200801n98683cdalb2898c879b93f8fd@mail.gmail.com>

On Tue, Apr 20, 2010 at 10:01 AM, Timur Tabi <timur.tabi@gmail.com> wrote:
> On Mon, Apr 19, 2010 at 11:43 PM, Kumar Gala <galak@kernel.crashing.org> wrote:
>
>> spin_event_timeout doesn't make sense for this.  The patch is fine.
>
> Can you please elaborate on that?  I don't understand why you think
> that.  spin_event_timeout() takes an expression and a timeout, and
> loops over the expression calling cpu_relax(), just like this loop
> does.

I haven't tested it, but I think this should work:

	spin_event_timeout((gfar_read(&regs->ievent) &
		(IEVENT_GRSC | IEVENT_GTSC)) ==
		(IEVENT_GRSC | IEVENT_GTSC), -1, 0);

Ideally, Andy should use a timeout value other than -1, and then test
the result like this:

	u32 ret;

	ret = spin_event_timeout((gfar_read(&regs->ievent) &
		(IEVENT_GRSC | IEVENT_GTSC)) ==
		(IEVENT_GRSC | IEVENT_GTSC), 1000, 0);

	if (!ret)
		/* timeout has occurred */

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [PATCH] net: ipv6 bind to device issue
From: Jiri Olsa @ 2010-04-20 15:42 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet; +Cc: netdev
In-Reply-To: <1271767572-5282-1-git-send-email-jolsa@redhat.com>

On Tue, Apr 20, 2010 at 02:46:12PM +0200, Jiri Olsa wrote:
> hi,
> 
> The issue raises when having 2 NICs both assigned the same
> IPv6 global address.
> 
> If a sender binds to a particular NIC (SO_BINDTODEVICE),
> the outgoing traffic is being sent via the first found.
> The bonded device is thus not taken into an account during the
> routing.
> 
> 
> From the ip6_route_output function:
> 
> If the binding address is multicast, linklocal or loopback,
> the RT6_LOOKUP_F_IFACE bit is set, but not for global address.
> 
> So binding global address will neglect SO_BINDTODEVICE-binded device,
> because the fib6_rule_lookup function path won't check for the
> flowi::oif field and take first route that fits.
> 
> Following patch should handle the issue.
> 
> wbr,
> jirka
> 
> 
> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Scott Otto <scott.otto@alcatel-lucent.com>

> ---
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index c2438e8..7bf7717 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>  {
>  	int flags = 0;
>  
> -	if (rt6_need_strict(&fl->fl6_dst))
> +	if (rt6_need_strict(&fl->fl6_dst) || fl->oif)
>  		flags |= RT6_LOOKUP_F_IFACE;
>  
>  	if (!ipv6_addr_any(&fl->fl6_src))

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox