Netdev List
 help / color / mirror / Atom feed
* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-20 14:57 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev
In-Reply-To: <20100420143433.GF11712@x200.localdomain>

On Tuesday 20 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Monday 19 April 2010, Scott Feldman wrote:
> > 
> > What is the reason for controlling the slave device through the master,
> > rather than talking to the slave directly? The kernel always knows
> > the master for each slave, so it seems to me that this information
> > is redundant.
> 
> Not all devices have this relationship explicit (i.e. not all are pure
> sr-iov devices).  If there's always a way to discover the master from
> the device, then I agree we only need the slave.

Hmm, is there an actual example of a card where the relationship is not
known to the kernel?

> > Is this new interface only for the case that you have a switch integrated
> > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > with an adjacent bridge and put the device into VEPA mode?
> 
> It should be useful for both.  That's part of the reason for using
> netlink, a userspace daemon running the VDP state machine (like lldpad)
> can listen for these messages and see a set_port_profile request when
> the user starts up a VM.

After thinking some more about this case, I now believe we should do
it the other way around, and have lldpad in control of this interface
from the user space side, and letting user programs (lldptool, libvirt,
...) talk to lldpad in order to set it up.
 
> > Also, do you expect your interface to be supported by dcbd/lldpad,
> > or is there a good reason to create a new tool for iovnl?
> 
> lldpad would listen, I don't see why iproute2 couldn't send, and libvirt
> will send as well.

Not sure. We need lldpad to do this exchange for the case of VEPA with
VDP, so always using lldpad would let us unify the user interface for
both cases. We can of course have iproute2 talk to lldpad, in the
same way that libvirt does.

> > > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > > + *   (NLA_NUL_STRING)
> > 
> > How does the definition of the port profile get into the NIC's switch?
> > Is there any way to list the available port profiles?
> 
> The port profile is a concept external to the NIC's switch.  It's a value
> that exists in the external physical layer 2 switching infrastructure.
> So an admin knows this value and is informing the adjacent switch that a
> new virutal interface is coming up and needs some particular port profile.

But that's only the case if the NIC itself is in VEPA mode. If that
were the case, there would be no need for a kernel interface at all,
because then we could just drive the port profile selection from user
space.

The proposed interface only seems to make sense if you use it to
configure the NIC itself! Why should it care about the port profile
otherwise?

> > Same here: Should you be able to set multiple MAC addresses, or
> > trunk mode? Can the VF override it?
> > Also, for the new multi-channel VEPA, I'd guess that you also need
> > to supply an 802.1ad S-VLAN ID.
> 
> Something like set_port_profile() would initiate the negotiation for the
> s-vlan id for a particular channel, not sure it's needed as part of the
> netlink interface or not.

Well, you have to set up the s-vlan ID in order to have something to
set the port profile in.

	Arnd

^ permalink raw reply

* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: Timur Tabi @ 2010-04-20 15:01 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Andy Fleming, davem, netdev
In-Reply-To: <DD217C72-156B-403D-B83B-EB365C782F2B@kernel.crashing.org>

On Mon, Apr 19, 2010 at 11:43 PM, Kumar Gala <galak@kernel.crashing.org> wrote:

> spin_event_timeout doesn't make sense for this.  The patch is fine.

Can you please elaborate on that?  I don't understand why you think
that.  spin_event_timeout() takes an expression and a timeout, and
loops over the expression calling cpu_relax(), just like this loop
does.

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Tom Herbert @ 2010-04-20 15:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, David Miller, netdev
In-Reply-To: <1271743164.3845.128.camel@edumazet-laptop>

On Mon, Apr 19, 2010 at 10:59 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 20 avril 2010 à 07:38 +0800, Changli Gao a écrit :
>
>> Does this problem has relationship with your patch? No. If the rxhash
>> isn't provided by hardware, we can get more throughput from you patch,
>> and on the other side, we don't lose anything but potential more hash
>> collision.
>>
>
> I am not sure what you call hash collision. There is no hash chain here.
>
> This 32bit hash is a jhash one, and we only need 1 to 12 bits in it, I
> am pretty sure its OK.
>
Maybe for the purposes of RPS, but hash collisions could definitely be
an issue in RFS.  If two active connections hit the same rps_flow
entry this may cause thrashing of those connections between CPUs.  I
think your patch may increase the probability of this happening.

>
>
>

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Tom Herbert @ 2010-04-20 15:09 UTC (permalink / raw)
  To: Franco Fichtner; +Cc: Eric Dumazet, Changli Gao, David Miller, netdev
In-Reply-To: <4BCDA2B5.4060609@lastsummer.de>

> I thought about this for some time...
>
> Do we really need the port numbers here at all? A simple
> addr1^addr2 can provide a good enough pointer for
> distribution amongst CPUs.
>

What about a server behind a TCP proxy?  Also, need to minimize
collisions for RPS to be effective.

Tom

> The real connection tracking is better done locally at the
> corresponding CPU. That way a potential cache miss can be
> avoided and the still needed hash calculation for
> connection tracking will be offloaded.
>
>
> Franco
>
>

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-20 15:22 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev
In-Reply-To: <201004201657.02873.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Tuesday 20 April 2010, Chris Wright wrote:
> > * Arnd Bergmann (arnd@arndb.de) wrote:
> > > On Monday 19 April 2010, Scott Feldman wrote:
> > > 
> > > What is the reason for controlling the slave device through the master,
> > > rather than talking to the slave directly? The kernel always knows
> > > the master for each slave, so it seems to me that this information
> > > is redundant.
> > 
> > Not all devices have this relationship explicit (i.e. not all are pure
> > sr-iov devices).  If there's always a way to discover the master from
> > the device, then I agree we only need the slave.
> 
> Hmm, is there an actual example of a card where the relationship is not
> known to the kernel?
> 
> > > Is this new interface only for the case that you have a switch integrated
> > > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > > with an adjacent bridge and put the device into VEPA mode?
> > 
> > It should be useful for both.  That's part of the reason for using
> > netlink, a userspace daemon running the VDP state machine (like lldpad)
> > can listen for these messages and see a set_port_profile request when
> > the user starts up a VM.
> 
> After thinking some more about this case, I now believe we should do
> it the other way around, and have lldpad in control of this interface
> from the user space side, and letting user programs (lldptool, libvirt,
> ...) talk to lldpad in order to set it up.

lldpad won't be involved in all cases, yet a mgmt tool like libvirt will.
so this seems backwards.

> > > Also, do you expect your interface to be supported by dcbd/lldpad,
> > > or is there a good reason to create a new tool for iovnl?
> > 
> > lldpad would listen, I don't see why iproute2 couldn't send, and libvirt
> > will send as well.
> 
> Not sure. We need lldpad to do this exchange for the case of VEPA with
> VDP, so always using lldpad would let us unify the user interface for
> both cases. We can of course have iproute2 talk to lldpad, in the
> same way that libvirt does.
> 
> > > > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > > > + *   (NLA_NUL_STRING)
> > > 
> > > How does the definition of the port profile get into the NIC's switch?
> > > Is there any way to list the available port profiles?
> > 
> > The port profile is a concept external to the NIC's switch.  It's a value
> > that exists in the external physical layer 2 switching infrastructure.
> > So an admin knows this value and is informing the adjacent switch that a
> > new virutal interface is coming up and needs some particular port profile.
> 
> But that's only the case if the NIC itself is in VEPA mode. If that
> were the case, there would be no need for a kernel interface at all,
> because then we could just drive the port profile selection from user
> space.
> 
> The proposed interface only seems to make sense if you use it to
> configure the NIC itself! Why should it care about the port profile
> otherwise?

In the case of devices that can do adjacent switch negotiations directly.

> > > Same here: Should you be able to set multiple MAC addresses, or
> > > trunk mode? Can the VF override it?
> > > Also, for the new multi-channel VEPA, I'd guess that you also need
> > > to supply an 802.1ad S-VLAN ID.
> > 
> > Something like set_port_profile() would initiate the negotiation for the
> > s-vlan id for a particular channel, not sure it's needed as part of the
> > netlink interface or not.
> 
> Well, you have to set up the s-vlan ID in order to have something to
> set the port profile in.

Right, depends if the use the port profile to establish the channel and
negotiate the s-vlan ID.  I don't recall the order there.

thanks,
-chris

^ permalink raw reply

* Re: [PATCH v5] rfs: Receive Flow Steering
From: Eric Dumazet @ 2010-04-20 15:39 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Changli Gao, David Miller, netdev
In-Reply-To: <t2y65634d661004200804r15427f19k9d58862f05df94aa@mail.gmail.com>

Le mardi 20 avril 2010 à 08:04 -0700, Tom Herbert a écrit :

> Maybe for the purposes of RPS, but hash collisions could definitely be
> an issue in RFS.  If two active connections hit the same rps_flow
> entry this may cause thrashing of those connections between CPUs.  I
> think your patch may increase the probability of this happening.
> 

Good point.

I'll make a gathering of tcp tuples on a busy server over a day and try
to compute number of clashes we can get with and without the addr/port
swapping.



^ permalink raw reply

* Re: [PATCH] net: ipv6 bind to device issue
From: Brian Haley @ 2010-04-20 15:41 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet,
	netdev
In-Reply-To: <1271767572-5282-1-git-send-email-jolsa@redhat.com>

Jiri Olsa wrote:
> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
> ---
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index c2438e8..7bf7717 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>  {
>  	int flags = 0;
>  
> -	if (rt6_need_strict(&fl->fl6_dst))
> +	if (rt6_need_strict(&fl->fl6_dst) || fl->oif)
>  		flags |= RT6_LOOKUP_F_IFACE;
>  
>  	if (!ipv6_addr_any(&fl->fl6_src))

Acked-by: Brian Haley <brian.haley@hp.com>

Saw this within the past month here too and have been testing
this same fix without problems.

^ permalink raw reply

* Re: [PATCH] net: ipv6 bind to device issue
From: Jiri Olsa @ 2010-04-20 15:42 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet; +Cc: netdev
In-Reply-To: <1271767572-5282-1-git-send-email-jolsa@redhat.com>

On Tue, Apr 20, 2010 at 02:46:12PM +0200, Jiri Olsa wrote:
> hi,
> 
> The issue raises when having 2 NICs both assigned the same
> IPv6 global address.
> 
> If a sender binds to a particular NIC (SO_BINDTODEVICE),
> the outgoing traffic is being sent via the first found.
> The bonded device is thus not taken into an account during the
> routing.
> 
> 
> From the ip6_route_output function:
> 
> If the binding address is multicast, linklocal or loopback,
> the RT6_LOOKUP_F_IFACE bit is set, but not for global address.
> 
> So binding global address will neglect SO_BINDTODEVICE-binded device,
> because the fib6_rule_lookup function path won't check for the
> flowi::oif field and take first route that fits.
> 
> Following patch should handle the issue.
> 
> wbr,
> jirka
> 
> 
> Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Scott Otto <scott.otto@alcatel-lucent.com>

> ---
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index c2438e8..7bf7717 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>  {
>  	int flags = 0;
>  
> -	if (rt6_need_strict(&fl->fl6_dst))
> +	if (rt6_need_strict(&fl->fl6_dst) || fl->oif)
>  		flags |= RT6_LOOKUP_F_IFACE;
>  
>  	if (!ipv6_addr_any(&fl->fl6_src))

^ permalink raw reply

* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: Timur Tabi @ 2010-04-20 15:59 UTC (permalink / raw)
  To: Kumar Gala; +Cc: Andy Fleming, davem, netdev
In-Reply-To: <o2oed82fe3e1004200801n98683cdalb2898c879b93f8fd@mail.gmail.com>

On Tue, Apr 20, 2010 at 10:01 AM, Timur Tabi <timur.tabi@gmail.com> wrote:
> On Mon, Apr 19, 2010 at 11:43 PM, Kumar Gala <galak@kernel.crashing.org> wrote:
>
>> spin_event_timeout doesn't make sense for this.  The patch is fine.
>
> Can you please elaborate on that?  I don't understand why you think
> that.  spin_event_timeout() takes an expression and a timeout, and
> loops over the expression calling cpu_relax(), just like this loop
> does.

I haven't tested it, but I think this should work:

	spin_event_timeout((gfar_read(&regs->ievent) &
		(IEVENT_GRSC | IEVENT_GTSC)) ==
		(IEVENT_GRSC | IEVENT_GTSC), -1, 0);

Ideally, Andy should use a timeout value other than -1, and then test
the result like this:

	u32 ret;

	ret = spin_event_timeout((gfar_read(&regs->ievent) &
		(IEVENT_GRSC | IEVENT_GTSC)) ==
		(IEVENT_GRSC | IEVENT_GTSC), 1000, 0);

	if (!ret)
		/* timeout has occurred */

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [PATCH] NET: Fix an RCU warning in dev_pick_tx()
From: Eric Dumazet @ 2010-04-20 16:03 UTC (permalink / raw)
  To: David Howells; +Cc: netdev
In-Reply-To: <20100420102558.31923.91.stgit@warthog.procyon.org.uk>

Le mardi 20 avril 2010 à 11:25 +0100, David Howells a écrit :
> Fix the following RCU warning in dev_pick_tx():
> 
> ===================================================
> [ INFO: suspicious rcu_dereference_check() usage. ]
> ---------------------------------------------------
> net/core/dev.c:1993 invoked rcu_dereference_check() without protection!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 1, debug_locks = 0
> 2 locks held by swapper/0:
>  #0:  (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81039e65>] run_timer_softirq+0x17b/0x278
>  #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff812ea3eb>] dev_queue_xmit+0x14e/0x4dc
> 
> stack backtrace:
> Pid: 0, comm: swapper Not tainted 2.6.34-rc5-cachefs #4
> Call Trace:
>  <IRQ>  [<ffffffff810516c4>] lockdep_rcu_dereference+0xaa/0xb2
>  [<ffffffff812ea4f6>] dev_queue_xmit+0x259/0x4dc
>  [<ffffffff812ea3eb>] ? dev_queue_xmit+0x14e/0x4dc
>  [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff81035362>] ? local_bh_enable_ip+0xbc/0xc1
>  [<ffffffff812f0954>] neigh_resolve_output+0x24b/0x27c
>  [<ffffffff8134f673>] ip6_output_finish+0x7c/0xb4
>  [<ffffffff81350c34>] ip6_output2+0x256/0x261
>  [<ffffffff81052324>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff813517fb>] ip6_output+0xbbc/0xbcb
>  [<ffffffff8135bc5d>] ? fib6_force_start_gc+0x2b/0x2d
>  [<ffffffff81368acb>] mld_sendpack+0x273/0x39d
>  [<ffffffff81368858>] ? mld_sendpack+0x0/0x39d
>  [<ffffffff81052099>] ? mark_held_locks+0x52/0x70
>  [<ffffffff813692fc>] mld_ifc_timer_expire+0x24f/0x288
>  [<ffffffff81039ed6>] run_timer_softirq+0x1ec/0x278
>  [<ffffffff81039e65>] ? run_timer_softirq+0x17b/0x278
>  [<ffffffff813690ad>] ? mld_ifc_timer_expire+0x0/0x288
>  [<ffffffff81035531>] ? __do_softirq+0x69/0x140
>  [<ffffffff8103556a>] __do_softirq+0xa2/0x140
>  [<ffffffff81002e0c>] call_softirq+0x1c/0x28
>  [<ffffffff81004b54>] do_softirq+0x38/0x80
>  [<ffffffff81034f06>] irq_exit+0x45/0x47
>  [<ffffffff810177c3>] smp_apic_timer_interrupt+0x88/0x96
>  [<ffffffff810028d3>] apic_timer_interrupt+0x13/0x20
>  <EOI>  [<ffffffff810488dd>] ? __atomic_notifier_call_chain+0x0/0x86
>  [<ffffffff810096bf>] ? mwait_idle+0x6e/0x78
>  [<ffffffff810096b6>] ? mwait_idle+0x65/0x78
>  [<ffffffff810011cb>] cpu_idle+0x4d/0x83
>  [<ffffffff81380b05>] rest_init+0xb9/0xc0
>  [<ffffffff81380a4c>] ? rest_init+0x0/0xc0
>  [<ffffffff8168dcf0>] start_kernel+0x392/0x39d
>  [<ffffffff8168d2a3>] x86_64_start_reservations+0xb3/0xb7
>  [<ffffffff8168d38b>] x86_64_start_kernel+0xe4/0xeb
> 
> An rcu_dereference() should be an rcu_dereference_bh().
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 

Absolutely right, thanks David

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

This might conflict with following commit in net-next-2.6, where I chose
the rcu_dereference_check() alternative

commit b6c6712a42ca3f9fa7f4a3d7c40e3a9dd1fd9e03
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Thu Apr 8 23:03:29 2010 +0000

    net: sk_dst_cache RCUification
    
    With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
    work.
    
    sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
    
    This rwlock is readlocked for a very small amount of time, and dst
    entries are already freed after RCU grace period. This calls for RCU
    again :)
    
    This patch converts sk_dst_lock to a spinlock, and use RCU for readers.
    
    __sk_dst_get() is supposed to be called with rcu_read_lock() or if
    socket locked by user, so use appropriate rcu_dereference_check()
    condition (rcu_read_lock_held() || sock_owned_by_user(sk))
    
    This patch avoids two atomic ops per tx packet on UDP connected sockets,
    for example, and permits sk_dst_lock to be much less dirtied.
    
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/include/net/dst.h b/include/net/dst.h
index ce078cd..aac5a5f 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -225,21 +225,6 @@ static inline void dst_confirm(struct dst_entry *dst)
 		neigh_confirm(dst->neighbour);
 }
 
-static inline void dst_negative_advice(struct dst_entry **dst_p,
-				       struct sock *sk)
-{
-	struct dst_entry * dst = *dst_p;
-	if (dst && dst->ops->negative_advice) {
-		*dst_p = dst->ops->negative_advice(dst);
-
-		if (dst != *dst_p) {
-			extern void sk_reset_txq(struct sock *sk);
-
-			sk_reset_txq(sk);
-		}
-	}
-}
-
 static inline void dst_link_failure(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 68f6783..278312c 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -152,9 +152,9 @@ static inline void __ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 				 struct in6_addr *daddr, struct in6_addr *saddr)
 {
-	write_lock(&sk->sk_dst_lock);
+	spin_lock(&sk->sk_dst_lock);
 	__ip6_dst_store(sk, dst, daddr, saddr);
-	write_unlock(&sk->sk_dst_lock);
+	spin_unlock(&sk->sk_dst_lock);
 }
 
 static inline int ipv6_unicast_destination(struct sk_buff *skb)
diff --git a/include/net/sock.h b/include/net/sock.h
index b4603cd..56df440 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -262,7 +262,7 @@ struct sock {
 #ifdef CONFIG_XFRM
 	struct xfrm_policy	*sk_policy[2];
 #endif
-	rwlock_t		sk_dst_lock;
+	spinlock_t		sk_dst_lock;
 	atomic_t		sk_rmem_alloc;
 	atomic_t		sk_wmem_alloc;
 	atomic_t		sk_omem_alloc;
@@ -1192,7 +1192,8 @@ extern unsigned long sock_i_ino(struct sock *sk);
 static inline struct dst_entry *
 __sk_dst_get(struct sock *sk)
 {
-	return sk->sk_dst_cache;
+	return rcu_dereference_check(sk->sk_dst_cache, rcu_read_lock_held() ||
+						       sock_owned_by_user(sk));
 }
 
 static inline struct dst_entry *
@@ -1200,50 +1201,62 @@ sk_dst_get(struct sock *sk)
 {
 	struct dst_entry *dst;
 
-	read_lock(&sk->sk_dst_lock);
-	dst = sk->sk_dst_cache;
+	rcu_read_lock();
+	dst = rcu_dereference(sk->sk_dst_cache);
 	if (dst)
 		dst_hold(dst);
-	read_unlock(&sk->sk_dst_lock);
+	rcu_read_unlock();
 	return dst;
 }
 
+extern void sk_reset_txq(struct sock *sk);
+
+static inline void dst_negative_advice(struct sock *sk)
+{
+	struct dst_entry *ndst, *dst = __sk_dst_get(sk);
+
+	if (dst && dst->ops->negative_advice) {
+		ndst = dst->ops->negative_advice(dst);
+
+		if (ndst != dst) {
+			rcu_assign_pointer(sk->sk_dst_cache, ndst);
+			sk_reset_txq(sk);
+		}
+	}
+}
+
 static inline void
 __sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
 	struct dst_entry *old_dst;
 
 	sk_tx_queue_clear(sk);
-	old_dst = sk->sk_dst_cache;
-	sk->sk_dst_cache = dst;
+	old_dst = rcu_dereference_check(sk->sk_dst_cache,
+					lockdep_is_held(&sk->sk_dst_lock));
+	rcu_assign_pointer(sk->sk_dst_cache, dst);
 	dst_release(old_dst);
 }
 
 static inline void
 sk_dst_set(struct sock *sk, struct dst_entry *dst)
 {
-	write_lock(&sk->sk_dst_lock);
+	spin_lock(&sk->sk_dst_lock);
 	__sk_dst_set(sk, dst);
-	write_unlock(&sk->sk_dst_lock);
+	spin_unlock(&sk->sk_dst_lock);
 }
 
 static inline void
 __sk_dst_reset(struct sock *sk)
 {
-	struct dst_entry *old_dst;
-
-	sk_tx_queue_clear(sk);
-	old_dst = sk->sk_dst_cache;
-	sk->sk_dst_cache = NULL;
-	dst_release(old_dst);
+	__sk_dst_set(sk, NULL);
 }
 
 static inline void
 sk_dst_reset(struct sock *sk)
 {
-	write_lock(&sk->sk_dst_lock);
+	spin_lock(&sk->sk_dst_lock);
 	__sk_dst_reset(sk);
-	write_unlock(&sk->sk_dst_lock);
+	spin_unlock(&sk->sk_dst_lock);
 }
 
 extern struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
diff --git a/net/core/dev.c b/net/core/dev.c
index 0eb79e3..ca4cdef 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2015,7 +2015,7 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 			if (dev->real_num_tx_queues > 1)
 				queue_index = skb_tx_hash(dev, skb);
 
-			if (sk && sk->sk_dst_cache)
+			if (sk && rcu_dereference_check(sk->sk_dst_cache, 1))
 				sk_tx_queue_set(sk, queue_index);
 		}
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index c5812bb..7effa1e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -364,11 +364,11 @@ EXPORT_SYMBOL(sk_reset_txq);
 
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
-	struct dst_entry *dst = sk->sk_dst_cache;
+	struct dst_entry *dst = __sk_dst_get(sk);
 
 	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
 		sk_tx_queue_clear(sk);
-		sk->sk_dst_cache = NULL;
+		rcu_assign_pointer(sk->sk_dst_cache, NULL);
 		dst_release(dst);
 		return NULL;
 	}
@@ -1157,7 +1157,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 		skb_queue_head_init(&newsk->sk_async_wait_queue);
 #endif
 
-		rwlock_init(&newsk->sk_dst_lock);
+		spin_lock_init(&newsk->sk_dst_lock);
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1898,7 +1898,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	} else
 		sk->sk_sleep	=	NULL;
 
-	rwlock_init(&sk->sk_dst_lock);
+	spin_lock_init(&sk->sk_dst_lock);
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,
diff --git a/net/dccp/timer.c b/net/dccp/timer.c
index bbfeb5e..1a9aa05 100644
--- a/net/dccp/timer.c
+++ b/net/dccp/timer.c
@@ -38,7 +38,7 @@ static int dccp_write_timeout(struct sock *sk)
 
 	if (sk->sk_state == DCCP_REQUESTING || sk->sk_state == DCCP_PARTOPEN) {
 		if (icsk->icsk_retransmits != 0)
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		retry_until = icsk->icsk_syn_retries ?
 			    : sysctl_dccp_request_retries;
 	} else {
@@ -63,7 +63,7 @@ static int dccp_write_timeout(struct sock *sk)
 			   Golden words :-).
 		   */
 
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		}
 
 		retry_until = sysctl_dccp_retries2;
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 2b494fa..55e3b6b 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -446,7 +446,7 @@ static void dn_destruct(struct sock *sk)
 	skb_queue_purge(&scp->other_xmit_queue);
 	skb_queue_purge(&scp->other_receive_queue);
 
-	dst_release(xchg(&sk->sk_dst_cache, NULL));
+	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
 }
 
 static int dn_memory_pressure;
@@ -1105,7 +1105,7 @@ static int dn_accept(struct socket *sock, struct socket *newsock, int flags)
 	release_sock(sk);
 
 	dst = skb_dst(skb);
-	dst_release(xchg(&newsk->sk_dst_cache, dst));
+	sk_dst_set(newsk, dst);
 	skb_dst_set(skb, NULL);
 
 	DN_SK(newsk)->state        = DN_CR;
@@ -1956,7 +1956,7 @@ static int dn_sendmsg(struct kiocb *iocb, struct socket *sock,
 	}
 
 	if ((flags & MSG_TRYHARD) && sk->sk_dst_cache)
-		dst_negative_advice(&sk->sk_dst_cache, sk);
+		dst_negative_advice(sk);
 
 	mss = scp->segsize_rem;
 	fctype = scp->services_rem & NSP_FC_MASK;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a0beb32..193dcd6 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -154,7 +154,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON(sk->sk_forward_alloc);
 
 	kfree(inet->opt);
-	dst_release(sk->sk_dst_cache);
+	dst_release(rcu_dereference_check(sk->sk_dst_cache, 1));
 	sk_refcnt_debug_dec(sk);
 }
 EXPORT_SYMBOL(inet_sock_destruct);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4000b10..ae3ec15 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3710,7 +3710,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	}
 
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
-		dst_confirm(sk->sk_dst_cache);
+		dst_confirm(__sk_dst_get(sk));
 
 	return 1;
 
@@ -5833,7 +5833,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
 			if (tp->snd_una == tp->write_seq) {
 				tcp_set_state(sk, TCP_FIN_WAIT2);
 				sk->sk_shutdown |= SEND_SHUTDOWN;
-				dst_confirm(sk->sk_dst_cache);
+				dst_confirm(__sk_dst_get(sk));
 
 				if (!sock_flag(sk, SOCK_DEAD))
 					/* Wake up lingering close() */
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 8a0ab29..c732be0 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -172,14 +172,14 @@ static int tcp_write_timeout(struct sock *sk)
 
 	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
 		if (icsk->icsk_retransmits)
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
 		if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
-			dst_negative_advice(&sk->sk_dst_cache, sk);
+			dst_negative_advice(sk);
 		}
 
 		retry_until = sysctl_tcp_retries2;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 33f60fc..1160400 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -114,9 +114,9 @@ struct ipv6_txoptions *ipv6_update_options(struct sock *sk,
 		}
 		opt = xchg(&inet6_sk(sk)->opt, opt);
 	} else {
-		write_lock(&sk->sk_dst_lock);
+		spin_lock(&sk->sk_dst_lock);
 		opt = xchg(&inet6_sk(sk)->opt, opt);
-		write_unlock(&sk->sk_dst_lock);
+		spin_unlock(&sk->sk_dst_lock);
 	}
 	sk_dst_reset(sk);
 
@@ -971,14 +971,13 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 	case IPV6_MTU:
 	{
 		struct dst_entry *dst;
+
 		val = 0;
-		lock_sock(sk);
-		dst = sk_dst_get(sk);
-		if (dst) {
+		rcu_read_lock();
+		dst = __sk_dst_get(sk);
+		if (dst)
 			val = dst_mtu(dst);
-			dst_release(dst);
-		}
-		release_sock(sk);
+		rcu_read_unlock();
 		if (!val)
 			return -ENOTCONN;
 		break;
@@ -1066,12 +1065,14 @@ static int do_ipv6_getsockopt(struct sock *sk, int level, int optname,
 		else
 			val = np->mcast_hops;
 
-		dst = sk_dst_get(sk);
-		if (dst) {
-			if (val < 0)
+		if (val < 0) {
+			rcu_read_lock();
+			dst = __sk_dst_get(sk);
+			if (dst)
 				val = ip6_dst_hoplimit(dst);
-			dst_release(dst);
+			rcu_read_unlock();
 		}
+
 		if (val < 0)
 			val = sock_net(sk)->ipv6.devconf_all->hop_limit;
 		break;





^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-20 16:19 UTC (permalink / raw)
  To: Chris Wright; +Cc: Scott Feldman, davem, netdev
In-Reply-To: <20100420152206.GA24942@x200.localdomain>

On Tuesday 20 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Chris Wright wrote:
> >
> > After thinking some more about this case, I now believe we should do
> > it the other way around, and have lldpad in control of this interface
> > from the user space side, and letting user programs (lldptool, libvirt,
> > ...) talk to lldpad in order to set it up.
> 
> lldpad won't be involved in all cases, yet a mgmt tool like libvirt will.
> so this seems backwards.

Well, that part is still the matter of this discussion, as far as I can tell ;-)

> > But that's only the case if the NIC itself is in VEPA mode. If that
> > were the case, there would be no need for a kernel interface at all,
> > because then we could just drive the port profile selection from user
> > space.
> > 
> > The proposed interface only seems to make sense if you use it to
> > configure the NIC itself! Why should it care about the port profile
> > otherwise?
> 
> In the case of devices that can do adjacent switch negotiations directly.

I thought the idea to deal with those devices was to beat sense into
the respective developers until they do the negotiation in software 8-)

> > > > Same here: Should you be able to set multiple MAC addresses, or
> > > > trunk mode? Can the VF override it?
> > > > Also, for the new multi-channel VEPA, I'd guess that you also need
> > > > to supply an 802.1ad S-VLAN ID.
> > > 
> > > Something like set_port_profile() would initiate the negotiation for the
> > > s-vlan id for a particular channel, not sure it's needed as part of the
> > > netlink interface or not.
> > 
> > Well, you have to set up the s-vlan ID in order to have something to
> > set the port profile in.
> 
> Right, depends if the use the port profile to establish the channel and
> negotiate the s-vlan ID.  I don't recall the order there.

I'm pretty sure that setting up the channel (for 802.1bg) is done before
any port profile comes in.

	Arnd

^ permalink raw reply

* Re: [PATCH net-next-2.6 v4 10/14] l2tp: Convert rwlock to RCU
From: Eric Dumazet @ 2010-04-20 16:55 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev
In-Reply-To: <20100402161916.11367.32252.stgit@bert.katalix.com>

Le vendredi 02 avril 2010 à 17:19 +0100, James Chapman a écrit :
> Reader/write locks are discouraged because they are slower than spin
> locks. So this patch converts the rwlocks used in the per_net structs
> to rcu.
> 
> Signed-off-by: James Chapman <jchapman@katalix.com>

>  static inline struct l2tp_net *l2tp_pernet(struct net *net)
> @@ -139,14 +140,14 @@ static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
>  	struct l2tp_session *session;
>  	struct hlist_node *walk;
>  
> -	read_lock_bh(&pn->l2tp_session_hlist_lock);
> -	hlist_for_each_entry(session, walk, session_list, global_hlist) {
> +	rcu_read_lock_bh();
> +	hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
>  		if (session->session_id == session_id) {
> -			read_unlock_bh(&pn->l2tp_session_hlist_lock);
> +			rcu_read_unlock_bh();
>  			return session;
>  		}
>  	}
> -	read_unlock_bh(&pn->l2tp_session_hlist_lock);
> +	rcu_read_unlock_bh();
>  

Hi James

I started a while ago patching l2tp but I wont be able to finish and
test the thing...

There is a fundamental problem with this kind of construct :
(this was wrong even better your RCU conversion)

rcu_read_lock_bh()
hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
	if (session->session_id == session_id) {
		rcu_read_unlock_bh();
		return session;
	}
}
rcu_read_unlock_bh();


While the lookup _is_ protected, the result is not.

As soon as you call rcu_read_unlock_bh(); and before the "return
session;", current thread could be preempted and an other thread frees
session under first thread. Unexpected things can then happen.

Therefore, you need either to :

1) Take a refcount on session (or tunnel) before the return
2) Or move the rcu_read_lock_bh()/rcu_read_unlock_bh() at callers.
3) Or all callers use a stronger lock. But then, why use RCU ;)

Here is a preliminary patch, obviously not finished, nor compiled, nor
tested, to give possible ways to handle this problem.

(I added the ref parameter to make sure to change function signatures,
maybe its not necessary and we should always take references)

Thanks

 net/l2tp/l2tp_core.c    |   25 ++++++++++++++-----------
 net/l2tp/l2tp_core.h    |    6 +++---
 net/l2tp/l2tp_debugfs.c |    2 +-
 net/l2tp/l2tp_eth.c     |    4 ++--
 net/l2tp/l2tp_ip.c      |    8 +++++---
 net/l2tp/l2tp_netlink.c |   26 +++++++++++++-------------
 6 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index ecc7aea..e662659 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -132,7 +132,7 @@ l2tp_session_id_hash_2(struct l2tp_net *pn, u32 session_id)
 
 /* Lookup a session by id in the global session list
  */
-static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
+static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id, int ref)
 {
 	struct l2tp_net *pn = l2tp_pernet(net);
 	struct hlist_head *session_list =
@@ -143,6 +143,8 @@ static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
 	rcu_read_lock_bh();
 	hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
 		if (session->session_id == session_id) {
+			if (ref)
+				l2tp_session_inc_refcount(session);
 			rcu_read_unlock_bh();
 			return session;
 		}
@@ -166,7 +168,7 @@ l2tp_session_id_hash(struct l2tp_tunnel *tunnel, u32 session_id)
 
 /* Lookup a session by id
  */
-struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id)
+struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id, int ref)
 {
 	struct hlist_head *session_list;
 	struct l2tp_session *session;
@@ -177,12 +179,14 @@ struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunn
 	 * tunnel.
 	 */
 	if (tunnel == NULL)
-		return l2tp_session_find_2(net, session_id);
+		return l2tp_session_find_2(net, session_id, ref);
 
 	session_list = l2tp_session_id_hash(tunnel, session_id);
 	read_lock_bh(&tunnel->hlist_lock);
 	hlist_for_each_entry(session, walk, session_list, hlist) {
 		if (session->session_id == session_id) {
+			if (ref)
+				l2tp_session_inc_refcount(session);
 			read_unlock_bh(&tunnel->hlist_lock);
 			return session;
 		}
@@ -193,7 +197,7 @@ struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunn
 }
 EXPORT_SYMBOL_GPL(l2tp_session_find);
 
-struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth)
+struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth, int ref)
 {
 	int hash;
 	struct hlist_node *walk;
@@ -204,6 +208,8 @@ struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth)
 	for (hash = 0; hash < L2TP_HASH_SIZE; hash++) {
 		hlist_for_each_entry(session, walk, &tunnel->session_hlist[hash], hlist) {
 			if (++count > nth) {
+				if (ref)
+					l2tp_session_inc_refcount(session);
 				read_unlock_bh(&tunnel->hlist_lock);
 				return session;
 			}
@@ -244,7 +250,7 @@ EXPORT_SYMBOL_GPL(l2tp_session_find_by_ifname);
 
 /* Lookup a tunnel by id
  */
-struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id)
+struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id, int ref)
 {
 	struct l2tp_tunnel *tunnel;
 	struct l2tp_net *pn = l2tp_pernet(net);
@@ -252,6 +258,8 @@ struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id)
 	rcu_read_lock_bh();
 	list_for_each_entry_rcu(tunnel, &pn->l2tp_tunnel_list, list) {
 		if (tunnel->tunnel_id == tunnel_id) {
+			if (ref)
+				l2tp_tunnel_inc_refcount(tunnel);
 			rcu_read_unlock_bh();
 			return tunnel;
 		}
@@ -500,11 +508,6 @@ void l2tp_recv_common(struct l2tp_session *session, struct sk_buff *skb,
 	int offset;
 	u32 ns, nr;
 
-	/* The ref count is increased since we now hold a pointer to
-	 * the session. Take care to decrement the refcnt when exiting
-	 * this function from now on...
-	 */
-	l2tp_session_inc_refcount(session);
 	if (session->ref)
 		(*session->ref)(session);
 
@@ -785,7 +788,7 @@ int l2tp_udp_recv_core(struct l2tp_tunnel *tunnel, struct sk_buff *skb,
 	}
 
 	/* Find the session context */
-	session = l2tp_session_find(tunnel->l2tp_net, tunnel, session_id);
+	session = l2tp_session_find(tunnel->l2tp_net, tunnel, session_id, 1);
 	if (!session || !session->recv_skb) {
 		/* Not found? Pass to userspace to deal with */
 		PRINTK(tunnel->debug, L2TP_MSG_DATA, KERN_INFO,
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index f0f318e..51df82e 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -221,10 +221,10 @@ out:
 	return tunnel;
 }
 
-extern struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id);
-extern struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth);
+extern struct l2tp_session *l2tp_session_find(struct net *net, struct l2tp_tunnel *tunnel, u32 session_id, int ref);
+extern struct l2tp_session *l2tp_session_find_nth(struct l2tp_tunnel *tunnel, int nth, int ref);
 extern struct l2tp_session *l2tp_session_find_by_ifname(struct net *net, char *ifname);
-extern struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id);
+extern struct l2tp_tunnel *l2tp_tunnel_find(struct net *net, u32 tunnel_id, int ref);
 extern struct l2tp_tunnel *l2tp_tunnel_find_nth(struct net *net, int nth);
 
 extern int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg, struct l2tp_tunnel **tunnelp);
diff --git a/net/l2tp/l2tp_debugfs.c b/net/l2tp/l2tp_debugfs.c
index 104ec3b..a7ecda2 100644
--- a/net/l2tp/l2tp_debugfs.c
+++ b/net/l2tp/l2tp_debugfs.c
@@ -51,7 +51,7 @@ static void l2tp_dfs_next_tunnel(struct l2tp_dfs_seq_data *pd)
 
 static void l2tp_dfs_next_session(struct l2tp_dfs_seq_data *pd)
 {
-	pd->session = l2tp_session_find_nth(pd->tunnel, pd->session_idx);
+	pd->session = l2tp_session_find_nth(pd->tunnel, pd->session_idx, 0);
 	pd->session_idx++;
 
 	if (pd->session == NULL) {
diff --git a/net/l2tp/l2tp_eth.c b/net/l2tp/l2tp_eth.c
index ca1164a..b069447 100644
--- a/net/l2tp/l2tp_eth.c
+++ b/net/l2tp/l2tp_eth.c
@@ -194,13 +194,13 @@ static int l2tp_eth_create(struct net *net, u32 tunnel_id, u32 session_id, u32 p
 	int rc;
 	struct l2tp_eth_net *pn;
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (!tunnel) {
 		rc = -ENODEV;
 		goto out;
 	}
 
-	session = l2tp_session_find(net, tunnel, session_id);
+	session = l2tp_session_find(net, tunnel, session_id, 0);
 	if (session) {
 		rc = -EEXIST;
 		goto out;
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 0852512..dbd7b7f 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -126,7 +126,7 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 	u32 session_id;
 	u32 tunnel_id;
 	unsigned char *ptr, *optr;
-	struct l2tp_session *session;
+	struct l2tp_session *session = NULL;
 	struct l2tp_tunnel *tunnel = NULL;
 	int length;
 	int offset;
@@ -150,7 +150,7 @@ static int l2tp_ip_recv(struct sk_buff *skb)
 	}
 
 	/* Ok, this is a data packet. Lookup the session. */
-	session = l2tp_session_find(&init_net, NULL, session_id);
+	session = l2tp_session_find(&init_net, NULL, session_id, 1);
 	if (session == NULL)
 		goto discard;
 
@@ -187,7 +187,7 @@ pass_up:
 		goto discard;
 
 	tunnel_id = ntohl(*(__be32 *) &skb->data[4]);
-	tunnel = l2tp_tunnel_find(&init_net, tunnel_id);
+	tunnel = l2tp_tunnel_find(&init_net, tunnel_id, 0);
 	if (tunnel != NULL)
 		sk = tunnel->sock;
 	else {
@@ -214,6 +214,8 @@ discard_put:
 	sock_put(sk);
 
 discard:
+	if (session)
+		l2tp_session_dec_refcount(session);
 	kfree_skb(skb);
 	return 0;
 }
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index 4c1e540..5f42d48 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -40,7 +40,7 @@ static struct genl_family l2tp_nl_family = {
 /* Accessed under genl lock */
 static const struct l2tp_nl_cmd_ops *l2tp_nl_cmd_ops[__L2TP_PWTYPE_MAX];
 
-static struct l2tp_session *l2tp_nl_session_find(struct genl_info *info)
+static struct l2tp_session *l2tp_nl_session_find(struct genl_info *info, int ref)
 {
 	u32 tunnel_id;
 	u32 session_id;
@@ -56,9 +56,9 @@ static struct l2tp_session *l2tp_nl_session_find(struct genl_info *info)
 		   (info->attrs[L2TP_ATTR_CONN_ID])) {
 		tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 		session_id = nla_get_u32(info->attrs[L2TP_ATTR_SESSION_ID]);
-		tunnel = l2tp_tunnel_find(net, tunnel_id);
+		tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 		if (tunnel)
-			session = l2tp_session_find(net, tunnel, session_id);
+			session = l2tp_session_find(net, tunnel, session_id, ref);
 	}
 
 	return session;
@@ -148,7 +148,7 @@ static int l2tp_nl_cmd_tunnel_create(struct sk_buff *skb, struct genl_info *info
 	if (info->attrs[L2TP_ATTR_DEBUG])
 		cfg.debug = nla_get_u32(info->attrs[L2TP_ATTR_DEBUG]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel != NULL) {
 		ret = -EEXIST;
 		goto out;
@@ -180,7 +180,7 @@ static int l2tp_nl_cmd_tunnel_delete(struct sk_buff *skb, struct genl_info *info
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -205,7 +205,7 @@ static int l2tp_nl_cmd_tunnel_modify(struct sk_buff *skb, struct genl_info *info
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -292,7 +292,7 @@ static int l2tp_nl_cmd_tunnel_get(struct sk_buff *skb, struct genl_info *info)
 
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
 
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (tunnel == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -359,7 +359,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		goto out;
 	}
 	tunnel_id = nla_get_u32(info->attrs[L2TP_ATTR_CONN_ID]);
-	tunnel = l2tp_tunnel_find(net, tunnel_id);
+	tunnel = l2tp_tunnel_find(net, tunnel_id, 0);
 	if (!tunnel) {
 		ret = -ENODEV;
 		goto out;
@@ -370,7 +370,7 @@ static int l2tp_nl_cmd_session_create(struct sk_buff *skb, struct genl_info *inf
 		goto out;
 	}
 	session_id = nla_get_u32(info->attrs[L2TP_ATTR_SESSION_ID]);
-	session = l2tp_session_find(net, tunnel, session_id);
+	session = l2tp_session_find(net, tunnel, session_id, 0);
 	if (session) {
 		ret = -EEXIST;
 		goto out;
@@ -495,7 +495,7 @@ static int l2tp_nl_cmd_session_delete(struct sk_buff *skb, struct genl_info *inf
 	struct l2tp_session *session;
 	u16 pw_type;
 
-	session = l2tp_nl_session_find(info);
+	session = l2tp_nl_session_find(info, 0);
 	if (session == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -515,7 +515,7 @@ static int l2tp_nl_cmd_session_modify(struct sk_buff *skb, struct genl_info *inf
 	int ret = 0;
 	struct l2tp_session *session;
 
-	session = l2tp_nl_session_find(info);
+	session = l2tp_nl_session_find(info, 1);
 	if (session == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -615,7 +615,7 @@ static int l2tp_nl_cmd_session_get(struct sk_buff *skb, struct genl_info *info)
 	struct sk_buff *msg;
 	int ret;
 
-	session = l2tp_nl_session_find(info);
+	session = l2tp_nl_session_find(info, 0);
 	if (session == NULL) {
 		ret = -ENODEV;
 		goto out;
@@ -656,7 +656,7 @@ static int l2tp_nl_cmd_session_dump(struct sk_buff *skb, struct netlink_callback
 				goto out;
 		}
 
-		session = l2tp_session_find_nth(tunnel, si);
+		session = l2tp_session_find_nth(tunnel, si, 0);
 		if (session == NULL) {
 			ti++;
 			tunnel = NULL;








^ permalink raw reply related

* IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-20 17:44 UTC (permalink / raw)
  To: netdev; +Cc: Hideaki YOSHIFUJI, David Miller

Hi,

I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
to dst_free().

__ipv6_ifa_notify() contains:

        case RTM_DELADDR:
                if (ifp->idev->cnf.forwarding)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
                if (ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;

AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
deletes the route: 
	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
	-> rt6_release() -> dst_free()

If it fails (like when it races with another invocation of ip6_del_rt()), it
will return nonzero and this will cause the above code to call dst_free() on its own.

dst_free() has no protection against concurrent invocation and if
two invocations make it through the "if (dst->obsolete > 1)"
check before one of them calls __dst_free(), the same dst_entry
may end up either:
	1) dst_destroy()ed and put on the dst_garbage.list, or
	2) put on the dst_garbage.list twice
both resulting in trouble once the GC is run.

One possible code path leading to two invocations of __ipv6_ifa_notify() seems
to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
the bonding master is in the DAD phase with a tentative address:

netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
 -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
 -> __ipv6_ifa_notify


What is the reason __ipv6_ifa_notify() calls dst_free() when
ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
with the dst still needing to be freed.

I am just testing whether the following will help:

--- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
+++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
@@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
 			addrconf_leave_anycast(ifp);
 		addrconf_leave_solict(ifp->idev, &ifp->addr);
 		dst_hold(&ifp->rt->u.dst);
-		if (ip6_del_rt(ifp->rt))
-			dst_free(&ifp->rt->u.dst);
+		ip6_del_rt(ifp->rt);
 		break;
 	}
 }

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* [PATCH linux-next 2/2] ixgbe: Example usage of the new IRQ affinity_hint callback
From: Peter P Waskiewicz Jr @ 2010-04-20 18:01 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel
In-Reply-To: <20100420180112.1276.11906.stgit@ppwaskie-hc2.jf.intel.com>

This patch uses the new IRQ affinity_hint callback mechanism.
It serves purely as an example of how a low-level driver can
utilize this new interface.

An official ixgbe patch will be pushed through netdev once the
IRQ patches have been accepted and merged.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 drivers/net/ixgbe/ixgbe.h      |    2 ++
 drivers/net/ixgbe/ixgbe_main.c |   51 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 52 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 79c35ae..c220b9f 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -32,6 +32,7 @@
 #include <linux/pci.h>
 #include <linux/netdevice.h>
 #include <linux/aer.h>
+#include <linux/cpumask.h>
 
 #include "ixgbe_type.h"
 #include "ixgbe_common.h"
@@ -236,6 +237,7 @@ struct ixgbe_q_vector {
 	u8 tx_itr;
 	u8 rx_itr;
 	u32 eitr;
+	cpumask_var_t affinity_mask;
 };
 
 /* Helper macros to switch between ints/sec and what the register uses.
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 1b1419c..3e00d41 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1031,6 +1031,36 @@ next_desc:
 	return cleaned;
 }
 
+static unsigned int ixgbe_irq_affinity_callback(cpumask_var_t mask,
+                                                unsigned int irq, void *dev)
+{
+	struct ixgbe_adapter *adapter = (struct ixgbe_adapter *)dev;
+	int i, q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
+
+	if (test_bit(__IXGBE_DOWN, &adapter->state))
+		return -EINVAL;
+
+	/*
+	 * Loop through the msix_entries array, looking for the vector that
+	 * matches the irq passed to us.  Once we find it, use that index to
+	 * grab the corresponding q_vector (1 to 1 mapping), and grab the
+	 * cpumask from that q_vector.
+	 */
+	for (i = 0; i < q_vectors; i++)
+		if (adapter->msix_entries[i].vector == irq)
+			break;
+
+	if (i == q_vectors)
+		return -EINVAL;
+
+	if (adapter->q_vector[i]->affinity_mask)
+		cpumask_copy(mask, adapter->q_vector[i]->affinity_mask);
+	else
+		return -EINVAL;
+
+	return 0;
+}
+
 static int ixgbe_clean_rxonly(struct napi_struct *, int);
 /**
  * ixgbe_configure_msix - Configure MSI-X hardware
@@ -1083,6 +1113,16 @@ static void ixgbe_configure_msix(struct ixgbe_adapter *adapter)
 			q_vector->eitr = adapter->rx_eitr_param;
 
 		ixgbe_write_eitr(q_vector);
+
+		/*
+		 * Allocate the affinity_hint cpumask, assign the mask for
+		 * this vector, and register our affinity_hint callback.
+		 */
+		alloc_cpumask_var(&q_vector->affinity_mask, GFP_KERNEL);
+		cpumask_set_cpu(v_idx, q_vector->affinity_mask);
+		irq_register_affinity_hint(adapter->msix_entries[v_idx].vector,
+		                           adapter,
+		                           &ixgbe_irq_affinity_callback);
 	}
 
 	if (adapter->hw.mac.type == ixgbe_mac_82598EB)
@@ -3218,7 +3258,7 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 rxctrl;
 	u32 txdctl;
-	int i, j;
+	int i, j, num_q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;
 
 	/* signal that we are down to the interrupt handler */
 	set_bit(__IXGBE_DOWN, &adapter->state);
@@ -3251,6 +3291,14 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 
 	ixgbe_napi_disable_all(adapter);
 
+	for (i = 0; i < num_q_vectors; i++) {
+		struct ixgbe_q_vector *q_vector = adapter->q_vector[i];
+		/* unregister the affinity_hint callback */
+		irq_unregister_affinity_hint(adapter->msix_entries[i].vector);
+		/* release the CPU mask memory */
+		free_cpumask_var(q_vector->affinity_mask);
+	}
+
 	clear_bit(__IXGBE_SFP_MODULE_NOT_FOUND, &adapter->state);
 	del_timer_sync(&adapter->sfp_timer);
 	del_timer_sync(&adapter->watchdog_timer);
@@ -4052,6 +4100,7 @@ static void ixgbe_free_q_vectors(struct ixgbe_adapter *adapter)
 		struct ixgbe_q_vector *q_vector = adapter->q_vector[q_idx];
 		adapter->q_vector[q_idx] = NULL;
 		netif_napi_del(&q_vector->napi);
+		free_cpumask_var(q_vector->affinity_mask);
 		kfree(q_vector);
 	}
 }


^ permalink raw reply related

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Eric Dumazet @ 2010-04-20 17:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <20100420174401.GB1334@midget.suse.cz>

Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> Hi,
> 
> I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
> to dst_free().
> 
> __ipv6_ifa_notify() contains:
> 
>         case RTM_DELADDR:
>                 if (ifp->idev->cnf.forwarding)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
>                 if (ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;
> 
> AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
> deletes the route: 
> 	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
> 	-> rt6_release() -> dst_free()
> 
> If it fails (like when it races with another invocation of ip6_del_rt()), it
> will return nonzero and this will cause the above code to call dst_free() on its own.
> 
> dst_free() has no protection against concurrent invocation and if

Sorry ? of course dst_free() has a protection...

By definition, the dst_destroy() is called only by the last thread with
the final refcount on object.

> two invocations make it through the "if (dst->obsolete > 1)"
> check before one of them calls __dst_free(), the same dst_entry
> may end up either:
> 	1) dst_destroy()ed and put on the dst_garbage.list, or
> 	2) put on the dst_garbage.list twice
> both resulting in trouble once the GC is run.
> 
> One possible code path leading to two invocations of __ipv6_ifa_notify() seems
> to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
> the bonding master is in the DAD phase with a tentative address:
> 
> netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
> ... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
>  -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
>  -> __ipv6_ifa_notify
> 
> 
> What is the reason __ipv6_ifa_notify() calls dst_free() when
> ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> with the dst still needing to be freed.
> 
> I am just testing whether the following will help:
> 
> --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
>  			addrconf_leave_anycast(ifp);
>  		addrconf_leave_solict(ifp->idev, &ifp->addr);
>  		dst_hold(&ifp->rt->u.dst);
> -		if (ip6_del_rt(ifp->rt))
> -			dst_free(&ifp->rt->u.dst);
> +		ip6_del_rt(ifp->rt);
>  		break;
>  	}
>  }
> 


I dont understand the problem Jiri.

We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
dst_free(), or we leak a refcount.




^ permalink raw reply

* [PATCH linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Peter P Waskiewicz Jr @ 2010-04-20 18:01 UTC (permalink / raw)
  To: tglx, davem, arjan; +Cc: netdev, linux-kernel

This patch adds a callback function pointer to the irq_desc
structure, along with a registration function and a read-only
proc entry for each interrupt.

This affinity_hint handle for each interrupt can be used by
underlying drivers that need a better mechanism to control
interrupt affinity.  The underlying driver can register a
callback for the interrupt, which will allow the driver to
provide the CPU mask for the interrupt to anything that
requests it.  The intent is to extend the userspace daemon,
irqbalance, to help hint to it a preferred CPU mask to balance
the interrupt into.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
---

 include/linux/interrupt.h |   27 +++++++++++++++++++++++++++
 include/linux/irq.h       |    1 +
 kernel/irq/manage.c       |   39 +++++++++++++++++++++++++++++++++++++++
 kernel/irq/proc.c         |   41 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 75f3f00..f2a7d0b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -78,6 +78,8 @@ enum {
 };
 
 typedef irqreturn_t (*irq_handler_t)(int, void *);
+typedef unsigned int (*irq_affinity_hint_t)(cpumask_var_t, unsigned int,
+                                            void *);
 
 /**
  * struct irqaction - per interrupt action descriptor
@@ -105,6 +107,18 @@ struct irqaction {
 	unsigned long thread_flags;
 };
 
+/**
+ * struct irqaffinityhint - per interrupt affinity helper
+ * @callback:	device driver callback function
+ * @dev:	reference for the affected device
+ * @irq:	interrupt number
+ */
+struct irqaffinityhint {
+	irq_affinity_hint_t callback;
+	void *dev;
+	int irq;
+};
+
 extern irqreturn_t no_action(int cpl, void *dev_id);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
@@ -209,6 +223,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
 extern int irq_can_set_affinity(unsigned int irq);
 extern int irq_select_affinity(unsigned int irq);
 
+extern int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                      irq_affinity_hint_t callback);
+extern int irq_unregister_affinity_hint(unsigned int irq);
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -223,6 +240,16 @@ static inline int irq_can_set_affinity(unsigned int irq)
 
 static inline int irq_select_affinity(unsigned int irq)  { return 0; }
 
+static inline int irq_register_affinity_hint(unsigned int irq, void *dev,
+                                             irq_affinity_hint_t callback)
+{
+	return -EINVAL;
+}
+
+static inline int irq_unregister_affinity_hint(unsigned int irq);
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_SMP && CONFIG_GENERIC_HARDIRQS */
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 707ab12..bd73e9b 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -206,6 +206,7 @@ struct irq_desc {
 	struct proc_dir_entry	*dir;
 #endif
 	const char		*name;
+	struct irqaffinityhint	*hint;
 } ____cacheline_internodealigned_in_smp;
 
 extern void arch_init_copy_chip_data(struct irq_desc *old_desc,
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 704e488..3674b6a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -138,6 +138,42 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)
 	return 0;
 }
 
+int irq_register_affinity_hint(unsigned int irq, void *dev,
+                               irq_affinity_hint_t callback)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	if (!desc->hint)
+		desc->hint = kmalloc(sizeof(struct irqaffinityhint),
+		                     GFP_KERNEL);
+	desc->hint->callback = callback;
+	desc->hint->dev = dev;
+	desc->hint->irq = irq;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_register_affinity_hint);
+
+int irq_unregister_affinity_hint(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&desc->lock, flags);
+
+	kfree(desc->hint);
+	desc->hint = NULL;
+
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+	return 0;
+}
+EXPORT_SYMBOL(irq_unregister_affinity_hint);
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -916,6 +952,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->chip->disable(irq);
 	}
 
+	/* make sure affinity_hint callback is cleaned up */
+	kfree(desc->hint);
+
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
 	unregister_handler_proc(irq, action);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 7a6eb04..59110a3 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -32,6 +32,23 @@ static int irq_affinity_proc_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
+{
+	struct irq_desc *desc = irq_to_desc((long)m->private);
+	struct cpumask mask;
+	unsigned int ret = 0;
+
+	if (desc->hint && desc->hint->callback) {
+		ret = desc->hint->callback(&mask, (long)m->private,
+		                           desc->hint->dev);
+		if (!ret)
+			seq_cpumask(m, &mask);
+	}
+
+	seq_putc(m, '\n');
+	return ret;
+}
+
 #ifndef is_affinity_mask_valid
 #define is_affinity_mask_valid(val) 1
 #endif
@@ -79,11 +96,23 @@ free_cpumask:
 	return err;
 }
 
+static ssize_t irq_affinity_hint_proc_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *pos)
+{
+	/* affinity_hint is read-only from proc */
+	return -EOPNOTSUPP;
+}
+
 static int irq_affinity_proc_open(struct inode *inode, struct file *file)
 {
 	return single_open(file, irq_affinity_proc_show, PDE(inode)->data);
 }
 
+static int irq_affinity_hint_proc_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, irq_affinity_hint_proc_show, PDE(inode)->data);
+}
+
 static const struct file_operations irq_affinity_proc_fops = {
 	.open		= irq_affinity_proc_open,
 	.read		= seq_read,
@@ -92,6 +121,14 @@ static const struct file_operations irq_affinity_proc_fops = {
 	.write		= irq_affinity_proc_write,
 };
 
+static const struct file_operations irq_affinity_hint_proc_fops = {
+	.open		= irq_affinity_hint_proc_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+	.write		= irq_affinity_hint_proc_write,
+};
+
 static int default_affinity_show(struct seq_file *m, void *v)
 {
 	seq_cpumask(m, irq_default_affinity);
@@ -231,6 +268,10 @@ void register_irq_proc(unsigned int irq, struct irq_desc *desc)
 	/* create /proc/irq/<irq>/smp_affinity */
 	proc_create_data("smp_affinity", 0600, desc->dir,
 			 &irq_affinity_proc_fops, (void *)(long)irq);
+
+	/* create /proc/irq/<irq>/affinity_hint */
+	proc_create_data("affinity_hint", 0400, desc->dir,
+			 &irq_affinity_hint_proc_fops, (void *)(long)irq);
 #endif
 
 	proc_create_data("spurious", 0444, desc->dir,

^ permalink raw reply related

* Re: [PATCH] net: ipv6 bind to device issue
From: Brian Haley @ 2010-04-20 18:13 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, eric.dumazet,
	netdev
In-Reply-To: <1271767572-5282-1-git-send-email-jolsa@redhat.com>

Jiri Olsa wrote:
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index c2438e8..7bf7717 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -815,7 +815,7 @@ struct dst_entry * ip6_route_output(struct net *net, struct sock *sk,
>  {
>  	int flags = 0;
>  
> -	if (rt6_need_strict(&fl->fl6_dst))
> +	if (rt6_need_strict(&fl->fl6_dst) || fl->oif)
>  		flags |= RT6_LOOKUP_F_IFACE;
>  
>  	if (!ipv6_addr_any(&fl->fl6_src))

Actually, looking at this again, we might want to swap the order
here since fl->oif should be filled-in for most link-local and
multicast requests calling this:

	if (fl->oif || rt6_need_strict(&fl->fl6_dst))

Just a thought, but it potentially saves a call to determine
the scope of the address.

-Brian

^ permalink raw reply

* pull request: wireless-2.6 2010-04-20
From: John W. Linville @ 2010-04-20 18:30 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Dave,

Here are a few more intended for 2.6.34, mostly from the iwlwifi team.
They fix a couple of potential crashes, an incorrect EEPROM offset
related to regulatory information, and a connectivity failure relating
to 802.11n and the 4965.  The "iwlwifi: fix scan races" seems a little
long, but much of it is the fallout from renaming a goto label.

Please let me know if there are problems!

Thanks,

John

---

The following changes since commit b91ecb0027c7171c83d7cf443a22c39b1fde6d83:
  Tilman Schmidt (1):
        gigaset: include cleanup cleanup

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git master

Johannes Berg (2):
      iwlwifi: fix scan races
      mac80211: remove bogus TX agg state assignment

Reinette Chatre (1):
      mac80211: pass HT changes to driver when off channel

Shanyu Zhao (1):
      iwlwifi: correct 6000 EEPROM regulatory address

 drivers/net/wireless/iwlwifi/iwl-6000.c   |    4 +-
 drivers/net/wireless/iwlwifi/iwl-agn.c    |    1 +
 drivers/net/wireless/iwlwifi/iwl-core.c   |    1 -
 drivers/net/wireless/iwlwifi/iwl-core.h   |    2 +-
 drivers/net/wireless/iwlwifi/iwl-dev.h    |    1 +
 drivers/net/wireless/iwlwifi/iwl-eeprom.h |    4 +++
 drivers/net/wireless/iwlwifi/iwl-scan.c   |   31 ++++++++++++++++++----------
 net/mac80211/agg-tx.c                     |    1 -
 net/mac80211/mlme.c                       |    2 +
 9 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl-6000.c b/drivers/net/wireless/iwlwifi/iwl-6000.c
index c4844ad..92b3e64 100644
--- a/drivers/net/wireless/iwlwifi/iwl-6000.c
+++ b/drivers/net/wireless/iwlwifi/iwl-6000.c
@@ -259,7 +259,7 @@ static struct iwl_lib_ops iwl6000_lib = {
 			EEPROM_5000_REG_BAND_3_CHANNELS,
 			EEPROM_5000_REG_BAND_4_CHANNELS,
 			EEPROM_5000_REG_BAND_5_CHANNELS,
-			EEPROM_5000_REG_BAND_24_HT40_CHANNELS,
+			EEPROM_6000_REG_BAND_24_HT40_CHANNELS,
 			EEPROM_5000_REG_BAND_52_HT40_CHANNELS
 		},
 		.verify_signature  = iwlcore_eeprom_verify_signature,
@@ -323,7 +323,7 @@ static struct iwl_lib_ops iwl6050_lib = {
 			EEPROM_5000_REG_BAND_3_CHANNELS,
 			EEPROM_5000_REG_BAND_4_CHANNELS,
 			EEPROM_5000_REG_BAND_5_CHANNELS,
-			EEPROM_5000_REG_BAND_24_HT40_CHANNELS,
+			EEPROM_6000_REG_BAND_24_HT40_CHANNELS,
 			EEPROM_5000_REG_BAND_52_HT40_CHANNELS
 		},
 		.verify_signature  = iwlcore_eeprom_verify_signature,
diff --git a/drivers/net/wireless/iwlwifi/iwl-agn.c b/drivers/net/wireless/iwlwifi/iwl-agn.c
index e4c2e1e..ba0fdba 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn.c
@@ -3330,6 +3330,7 @@ static void iwl_cancel_deferred_work(struct iwl_priv *priv)
 
 	cancel_delayed_work_sync(&priv->init_alive_start);
 	cancel_delayed_work(&priv->scan_check);
+	cancel_work_sync(&priv->start_internal_scan);
 	cancel_delayed_work(&priv->alive_start);
 	cancel_work_sync(&priv->beacon_update);
 	del_timer_sync(&priv->statistics_periodic);
diff --git a/drivers/net/wireless/iwlwifi/iwl-core.c b/drivers/net/wireless/iwlwifi/iwl-core.c
index 894bcb8..1459cdb 100644
--- a/drivers/net/wireless/iwlwifi/iwl-core.c
+++ b/drivers/net/wireless/iwlwifi/iwl-core.c
@@ -3357,7 +3357,6 @@ static void iwl_force_rf_reset(struct iwl_priv *priv)
 	 */
 	IWL_DEBUG_INFO(priv, "perform radio reset.\n");
 	iwl_internal_short_hw_scan(priv);
-	return;
 }
 
 
diff --git a/drivers/net/wireless/iwlwifi/iwl-core.h b/drivers/net/wireless/iwlwifi/iwl-core.h
index 732590f..36940a9 100644
--- a/drivers/net/wireless/iwlwifi/iwl-core.h
+++ b/drivers/net/wireless/iwlwifi/iwl-core.h
@@ -506,7 +506,7 @@ void iwl_init_scan_params(struct iwl_priv *priv);
 int iwl_scan_cancel(struct iwl_priv *priv);
 int iwl_scan_cancel_timeout(struct iwl_priv *priv, unsigned long ms);
 int iwl_mac_hw_scan(struct ieee80211_hw *hw, struct cfg80211_scan_request *req);
-int iwl_internal_short_hw_scan(struct iwl_priv *priv);
+void iwl_internal_short_hw_scan(struct iwl_priv *priv);
 int iwl_force_reset(struct iwl_priv *priv, int mode);
 u16 iwl_fill_probe_req(struct iwl_priv *priv, struct ieee80211_mgmt *frame,
 		       const u8 *ie, int ie_len, int left);
diff --git a/drivers/net/wireless/iwlwifi/iwl-dev.h b/drivers/net/wireless/iwlwifi/iwl-dev.h
index 6054c5f..ef1720a 100644
--- a/drivers/net/wireless/iwlwifi/iwl-dev.h
+++ b/drivers/net/wireless/iwlwifi/iwl-dev.h
@@ -1296,6 +1296,7 @@ struct iwl_priv {
 	struct work_struct tt_work;
 	struct work_struct ct_enter;
 	struct work_struct ct_exit;
+	struct work_struct start_internal_scan;
 
 	struct tasklet_struct irq_tasklet;
 
diff --git a/drivers/net/wireless/iwlwifi/iwl-eeprom.h b/drivers/net/wireless/iwlwifi/iwl-eeprom.h
index 4e1ba82..8171c70 100644
--- a/drivers/net/wireless/iwlwifi/iwl-eeprom.h
+++ b/drivers/net/wireless/iwlwifi/iwl-eeprom.h
@@ -203,6 +203,10 @@ struct iwl_eeprom_enhanced_txpwr {
 #define EEPROM_5000_REG_BAND_52_HT40_CHANNELS  ((0x92)\
 		| INDIRECT_ADDRESS | INDIRECT_REGULATORY)   /* 22  bytes */
 
+/* 6000 regulatory - indirect access */
+#define EEPROM_6000_REG_BAND_24_HT40_CHANNELS  ((0x80)\
+		| INDIRECT_ADDRESS | INDIRECT_REGULATORY)   /* 14  bytes */
+
 /* 6000 and up regulatory tx power - indirect access */
 /* max. elements per section */
 #define EEPROM_MAX_TXPOWER_SECTION_ELEMENTS	(8)
diff --git a/drivers/net/wireless/iwlwifi/iwl-scan.c b/drivers/net/wireless/iwlwifi/iwl-scan.c
index bd2f7c4..5062f4e 100644
--- a/drivers/net/wireless/iwlwifi/iwl-scan.c
+++ b/drivers/net/wireless/iwlwifi/iwl-scan.c
@@ -469,6 +469,8 @@ EXPORT_SYMBOL(iwl_init_scan_params);
 
 static int iwl_scan_initiate(struct iwl_priv *priv)
 {
+	WARN_ON(!mutex_is_locked(&priv->mutex));
+
 	IWL_DEBUG_INFO(priv, "Starting scan...\n");
 	set_bit(STATUS_SCANNING, &priv->status);
 	priv->is_internal_short_scan = false;
@@ -546,24 +548,31 @@ EXPORT_SYMBOL(iwl_mac_hw_scan);
  * internal short scan, this function should only been called while associated.
  * It will reset and tune the radio to prevent possible RF related problem
  */
-int iwl_internal_short_hw_scan(struct iwl_priv *priv)
+void iwl_internal_short_hw_scan(struct iwl_priv *priv)
 {
-	int ret = 0;
+	queue_work(priv->workqueue, &priv->start_internal_scan);
+}
+
+static void iwl_bg_start_internal_scan(struct work_struct *work)
+{
+	struct iwl_priv *priv =
+		container_of(work, struct iwl_priv, start_internal_scan);
+
+	mutex_lock(&priv->mutex);
 
 	if (!iwl_is_ready_rf(priv)) {
-		ret = -EIO;
 		IWL_DEBUG_SCAN(priv, "not ready or exit pending\n");
-		goto out;
+		goto unlock;
 	}
+
 	if (test_bit(STATUS_SCANNING, &priv->status)) {
 		IWL_DEBUG_SCAN(priv, "Scan already in progress.\n");
-		ret = -EAGAIN;
-		goto out;
+		goto unlock;
 	}
+
 	if (test_bit(STATUS_SCAN_ABORTING, &priv->status)) {
 		IWL_DEBUG_SCAN(priv, "Scan request while abort pending\n");
-		ret = -EAGAIN;
-		goto out;
+		goto unlock;
 	}
 
 	priv->scan_bands = 0;
@@ -576,9 +585,8 @@ int iwl_internal_short_hw_scan(struct iwl_priv *priv)
 	set_bit(STATUS_SCANNING, &priv->status);
 	priv->is_internal_short_scan = true;
 	queue_work(priv->workqueue, &priv->request_scan);
-
-out:
-	return ret;
+ unlock:
+	mutex_unlock(&priv->mutex);
 }
 EXPORT_SYMBOL(iwl_internal_short_hw_scan);
 
@@ -964,6 +972,7 @@ void iwl_setup_scan_deferred_work(struct iwl_priv *priv)
 	INIT_WORK(&priv->scan_completed, iwl_bg_scan_completed);
 	INIT_WORK(&priv->request_scan, iwl_bg_request_scan);
 	INIT_WORK(&priv->abort_scan, iwl_bg_abort_scan);
+	INIT_WORK(&priv->start_internal_scan, iwl_bg_start_internal_scan);
 	INIT_DELAYED_WORK(&priv->scan_check, iwl_bg_scan_check);
 }
 EXPORT_SYMBOL(iwl_setup_scan_deferred_work);
diff --git a/net/mac80211/agg-tx.c b/net/mac80211/agg-tx.c
index 5538e1b..944a8a9 100644
--- a/net/mac80211/agg-tx.c
+++ b/net/mac80211/agg-tx.c
@@ -183,7 +183,6 @@ static void sta_addba_resp_timer_expired(unsigned long data)
 		       HT_AGG_STATE_REQ_STOP_BA_MSK)) !=
 						HT_ADDBA_REQUESTED_MSK) {
 		spin_unlock_bh(&sta->lock);
-		*state = HT_AGG_STATE_IDLE;
 #ifdef CONFIG_MAC80211_HT_DEBUG
 		printk(KERN_DEBUG "timer expired on tid %d but we are not "
 				"(or no longer) expecting addBA response there",
diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
index be5f723..8a96503 100644
--- a/net/mac80211/mlme.c
+++ b/net/mac80211/mlme.c
@@ -167,6 +167,8 @@ static u32 ieee80211_enable_ht(struct ieee80211_sub_if_data *sdata,
 	ht_changed = conf_is_ht(&local->hw.conf) != enable_ht ||
 		     channel_type != local->hw.conf.channel_type;
 
+	if (local->tmp_channel)
+		local->tmp_channel_type = channel_type;
 	local->oper_channel_type = channel_type;
 
 	if (ht_changed) {
-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: Wolfgang Grandegger @ 2010-04-20 19:47 UTC (permalink / raw)
  To: Hans J. Koch
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100420135538.GA1994-hikPBsva6T+Nj9Bq2fkWzw@public.gmane.org>

Hans J. Koch wrote:
> In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...
> 
> This patch replaces dev_err() with printk() to avoid this.
> 
> Signed-off-by: "Hans J. Koch" <hjk-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
> ---
>  drivers/net/can/usb/ems_usb.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6.34-rc/drivers/net/can/usb/ems_usb.c
> ===================================================================
> --- linux-2.6.34-rc.orig/drivers/net/can/usb/ems_usb.c	2010-04-20 15:32:25.000000000 +0200
> +++ linux-2.6.34-rc/drivers/net/can/usb/ems_usb.c	2010-04-20 15:33:20.000000000 +0200
> @@ -1006,7 +1006,7 @@
>  
>  	netdev = alloc_candev(sizeof(struct ems_usb), MAX_TX_URBS);
>  	if (!netdev) {
> -		dev_err(netdev->dev.parent, "Couldn't alloc candev\n");
> +		printk(KERN_ERR "ems_usb: Couldn't alloc candev\n");
>  		return -ENOMEM;
>  	}

I think "dev_err(&intf->dev, ...)" should be used before
SET_NETDEV_DEV(netdev, &intf->dev) is called. I see two "dev_err()"
calls which need to be fixed.

Wolfgang.

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Scott Feldman @ 2010-04-20 19:56 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: davem, netdev, chrisw
In-Reply-To: <201004201548.26609.arnd@arndb.de>

On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Monday 19 April 2010, Scott Feldman wrote:
> 
>> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
>> device (MD) netdev interface.  The MD (e.g. SR-IOV PF) will set/get
>> control settings on behalf of a slave netdevice (e.g. SR-IOV VF).  The
>> design allows for the case where master and slave are the
>> same netdev interface.
> 
> What is the reason for controlling the slave device through the master,
> rather than talking to the slave directly? The kernel always knows
> the master for each slave, so it seems to me that this information
> is redundant.

The interface would allow talking to the slave directly. In fact, that's the
example with enic port-profile in patch 2/2.  But, it would be nice not to
rule out the case where the master proxies slave control and the master is
under exclusively controlled by hypervisor.
 
> Is this new interface only for the case that you have a switch integrated
> in the NIC, or also for the case where you do an LLDP and EDP exchange
> with an adjacent bridge and put the device into VEPA mode?

All of the above.  Basing this on netlink give us flexibility to work with
user-space mgmt tools or directly with kernel netdev as in the enic case.
Not trying to make assumptions about where (user-space, kernel) and by which
entity sources or sinks the netlink msg.
 
>> One control setting example is MAC/VLAN settings for a VF.  Another
>> example control setting is a port-profile for a VF.  A port-profile is an
>> identifier that defines policy-based settings on the network port
>> backing the VF.  The network port settings examples are VLAN membership,
>> QoS settings, and L2 security settings, typical of a data center network.
>> 
>> This patch adds the iovnl interface definitions and an iovnl module.
> 
> How does this relate to the existing DCB netlink interface? My feeling
> is that there is some overlap in how it would get used, and some parts
> that are very distinct. In particular, I'd guess that you'd want to
> be able to set DCB parameters for each VF, but not all DCB adapters
> would support SR-IOV.
>
> Did you consider making this code an extension to the DCB interface
> instead of a separate one? What was the reason for your decision
> to keep it separate?

Considered it but DCB interface is well defined for DCB and it didn't seem
right gluing on interfaces not specified within DCB.  I agree that there is
some overlap in the sense that both interface are used to configure a netdev
with some properties interesting for the data center, but the DCB interface
is for local setting of the properties on the host whereas iovnl is about
pushing the setting of those properties to the network for policy-based
control.
 
> Also, do you expect your interface to be supported by dcbd/lldpad,
> or is there a good reason to create a new tool for iovnl?

Lldpad supporting this interface would seem right, for those cases where
lldpad is responsible for configuring the netdev.
 
>> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
>> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
> 
> Can you elaborate more on what these do? Who is the 'client' and the 'host'
> in this case, and why do you need to identify them?

Those are optional and useful, for example, by the network mgmt tool for
presenting a view such as:

    - blade 1/2                     // know by host uuid
        - vm-rhel5-eth0             // client name
            - port-profile: xyz

Something like that.
 
>> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> 
> Just one mac address? What happens if we want to assign multiple mac
> addresses to the VF later? Also, how is this defined specifically?
> Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> later, or is this just the default value?

Depends on how the VF wants to handle this.  For our use-case with enic we
only need the port-profile op so I'm not sure what the best design is for
mac+vlan on a VF.  Looking for advise from folks like yourself.  If it's not
needed, let's scratch it.

-scott


^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Scott Feldman @ 2010-04-20 20:26 UTC (permalink / raw)
  To: Arnd Bergmann, Chris Wright; +Cc: davem, netdev
In-Reply-To: <201004201819.32970.arnd@arndb.de>

On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

>>> But that's only the case if the NIC itself is in VEPA mode. If that
>>> were the case, there would be no need for a kernel interface at all,
>>> because then we could just drive the port profile selection from user
>>> space.
>>> 
>>> The proposed interface only seems to make sense if you use it to
>>> configure the NIC itself! Why should it care about the port profile
>>> otherwise?
>> 
>> In the case of devices that can do adjacent switch negotiations directly.
> 
> I thought the idea to deal with those devices was to beat sense into
> the respective developers until they do the negotiation in software 8-)

When the device can do the negotiation directly with the switch, why does it
make sense to bypass that and use software on the host?  I don't think we'd
want to give up on link speed/duplex auto-negotiation and punt those setting
back to the user/host like in the old days.

-scott


^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-20 20:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <1271786247.7895.130.camel@edumazet-laptop>

On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> >  			addrconf_leave_anycast(ifp);
> >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> >  		dst_hold(&ifp->rt->u.dst);
> > -		if (ip6_del_rt(ifp->rt))
> > -			dst_free(&ifp->rt->u.dst);
> > +		ip6_del_rt(ifp->rt);
> >  		break;
> >  	}
> >  }
> > 
> 
> 
> I dont understand the problem Jiri.
> 
> We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> dst_free(), or we leak a refcount.

Well, no ... dst_free() does not decrement the refcount.
The "opposite" of dst_hold() is dst_release(). And it does not
automatically call dst_free() when the refcount drops to 0.
dst_free() needs to be called explicitly and it apparently
expects the caller to ensure that two dst_free()s won't be called
simultaneously. But my bonding example shows this is not the
case.


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Eric Dumazet @ 2010-04-20 20:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev, Hideaki YOSHIFUJI, David Miller, Stephen Hemminger
In-Reply-To: <20100420204939.GA15354@smudla-wifi.bakulak.kosire.czf>

Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > >  			addrconf_leave_anycast(ifp);
> > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > >  		dst_hold(&ifp->rt->u.dst);
> > > -		if (ip6_del_rt(ifp->rt))
> > > -			dst_free(&ifp->rt->u.dst);
> > > +		ip6_del_rt(ifp->rt);
> > >  		break;
> > >  	}
> > >  }
> > > 
> > 
> > 
> > I dont understand the problem Jiri.
> > 
> > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > dst_free(), or we leak a refcount.
> 
> Well, no ... dst_free() does not decrement the refcount.
> The "opposite" of dst_hold() is dst_release(). And it does not
> automatically call dst_free() when the refcount drops to 0.
> dst_free() needs to be called explicitly and it apparently
> expects the caller to ensure that two dst_free()s won't be called
> simultaneously. But my bonding example shows this is not the
> case.
> 
> 

Ah yes you're right

Maybe ask Stephen ?

commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
Author: stephen hemminger <shemminger@vyatta.com>
Date:   Mon Apr 12 05:41:31 2010 +0000

    IPv6: keep route for tentative address
    
    Recent changes preserve IPv6 address when link goes down (good).
    But would cause address to point to dead dst entry (bad).
    The simplest fix is to just not delete route if address is
    being held for later use.
    
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1b00bfe..a9913d2 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
inet6_ifaddr *ifp)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
-               if (ip6_del_rt(ifp->rt))
+
+               if (ifp->dead && ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;
        }



^ permalink raw reply related

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Stephen Hemminger @ 2010-04-20 21:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <1271797043.7895.320.camel@edumazet-laptop>

On Tue, 20 Apr 2010 22:57:23 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> > On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > > >  			addrconf_leave_anycast(ifp);
> > > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > > >  		dst_hold(&ifp->rt->u.dst);
> > > > -		if (ip6_del_rt(ifp->rt))
> > > > -			dst_free(&ifp->rt->u.dst);
> > > > +		ip6_del_rt(ifp->rt);
> > > >  		break;
> > > >  	}
> > > >  }
> > > > 
> > > 
> > > 
> > > I dont understand the problem Jiri.
> > > 
> > > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > > dst_free(), or we leak a refcount.
> > 
> > Well, no ... dst_free() does not decrement the refcount.
> > The "opposite" of dst_hold() is dst_release(). And it does not
> > automatically call dst_free() when the refcount drops to 0.
> > dst_free() needs to be called explicitly and it apparently
> > expects the caller to ensure that two dst_free()s won't be called
> > simultaneously. But my bonding example shows this is not the
> > case.
> > 
> > 
> 
> Ah yes you're right
> 
> Maybe ask Stephen ?
> 
> commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
> Author: stephen hemminger <shemminger@vyatta.com>
> Date:   Mon Apr 12 05:41:31 2010 +0000
> 
>     IPv6: keep route for tentative address
>     
>     Recent changes preserve IPv6 address when link goes down (good).
>     But would cause address to point to dead dst entry (bad).
>     The simplest fix is to just not delete route if address is
>     being held for later use.
>     
>     Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 1b00bfe..a9913d2 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
> inet6_ifaddr *ifp)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
> -               if (ip6_del_rt(ifp->rt))
> +
> +               if (ifp->dead && ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;


Is this problem new to net-next where we hold onto addresses, or was
the issue there before?

^ permalink raw reply

* Re: A possible bug in reqsk_queue_hash_req()
From: David Miller @ 2010-04-20 21:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: raise.sail, netdev, linux-kernel
In-Reply-To: <1271761611.3845.223.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 13:06:51 +0200

> I believe its not really necessary, because we are the only possible
> writer at this stage.
> 
> The write_lock() ... write_unlock() is there only to enforce a
> synchronisation with readers.
> 
> All callers of this reqsk_queue_hash_req() must have the socket locked

Right.

In fact there are quite a few snippets around the networking
where we use this trick of only locking around the single
pointer assignment that puts the object into the list.

And they were all written by Alexey Kuznetsov, so they must
be correct :-)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox