Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: small cleanup of lib8390
From: David Miller @ 2010-04-21 23:23 UTC (permalink / raw)
  To: knikanth; +Cc: p_gortmaker, netdev, viro, jeff
In-Reply-To: <201004151751.23899.knikanth@suse.de>

From: Nikanth Karthikesan <knikanth@suse.de>
Date: Thu, 15 Apr 2010 17:51:23 +0530

> Remove the always true #if 1. Also the unecessary re-test of ei_local->irqlock
> and the unreachable printk format string.
> 
> Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

Applied to net-next-2.6, thanks.

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Tom Herbert @ 2010-04-21 23:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, David S. Miller, netdev, jamal
In-Reply-To: <1271891149.7895.3751.camel@edumazet-laptop>

>        do {
> -               struct sk_buff *skb;
> +               lim = quota - work;
> +               if (lim > ARRAY_SIZE(sd->batch))
> +                       lim = ARRAY_SIZE(sd->batch);
> +               /* batch at most 16 buffers */
>

How about just using two input_pkt_queue's (define
input_pkt_queue[2])?  One that is used to enqueue from RPS, and one
that is being processed by process_backlog.  Then the only thing that
needs to be done under lock in process_backlog is to switch the
queues;  something like sd->current_input_pkt_queue ^= 1

Tom

> -               local_irq_disable();
>                rps_lock(sd);
> -               skb = __skb_dequeue(&sd->input_pkt_queue);
> -               if (!skb) {
> +               for (n = 0; n < lim; n++) {
> +                       sd->batch[n] = __skb_dequeue(&sd->input_pkt_queue);
> +                       if (!sd->batch[n])
> +                               break;
> +               }
> +               if (!sd->input_pkt_queue.qlen) {
>                        __napi_complete(napi);
> -                       rps_unlock(sd);
> -                       local_irq_enable();
> -                       break;
> +                       quota = 0;
>                }
> -               input_queue_head_incr(sd);
>                rps_unlock(sd);
> -               local_irq_enable();
>
> -               __netif_receive_skb(skb);
> -       } while (++work < quota);
> +               /* Now process our batch */
> +               for (i = 0; i < n; i++) {
> +                       skb = sd->batch[i];
> +                       /* flush_backlog() might have stolen this skb */
> +                       input_queue_head_incr(sd);
> +                       if (likely(skb)) {
> +                               sd->batch[i] = NULL;
> +                               local_irq_enable();
> +                               __netif_receive_skb(skb);
> +                               local_irq_disable();
> +                       }
> +               }
> +               work += n;
> +       } while (work < quota);
>
> +       local_irq_enable();
>        return work;
>  }
>
>
>
>

^ permalink raw reply

* Re: [PATCH] Infiniband: Randomize local port allocation.
From: Roland Dreier @ 2010-04-21 23:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sean.hefty, amwang, opurdila, eric.dumazet, netdev, nhorman,
	davem, ebiederm, linux-kernel, rolandd
In-Reply-To: <adabpdco0lb.fsf@roland-alpha.cisco.com>

 > Thanks, applied this part of the patch -- I preferred this one since the

err, not "part of the patch" -- I meant "this version of the patch".
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply

* Re: [PATCH net-next-2.6 v3] fasync: RCU and fine grained locking
From: David Miller @ 2010-04-21 23:20 UTC (permalink / raw)
  To: eric.dumazet; +Cc: paulmck, netdev, linux-kernel, laijs
In-Reply-To: <1271274935.16881.1746.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 14 Apr 2010 21:55:35 +0200

> [PATCH net-next-2.6 v3] fasync: RCU and fine grained locking

BTW, just to be clear I made sure to apply V3 of this
patch not the initial submission :-)

^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: Herbert Xu @ 2010-04-21 23:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, cratiu
In-Reply-To: <20100421.142841.191672280.davem@davemloft.net>

On Wed, Apr 21, 2010 at 02:28:41PM -0700, David Miller wrote:
> 
> See?  We were computing the checksum over the TCP header twice now :-)
> That's why my patch fixed things for dataless packets, whose
> ->csum would evaluate to zero.

Oh I see, as send_response packets are always dataless this is
corect.

Thanks!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net-next-2.6] fasync: RCU locking
From: David Miller @ 2010-04-21 23:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: paulmck, netdev, linux-kernel
In-Reply-To: <1271230961.16881.630.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 14 Apr 2010 09:42:41 +0200

> [PATCH net-next-2.6] fasync: RCU locking
> 
> kill_fasync() uses a central rwlock, candidate for RCU conversion.
> 
> We can remove __kill_fasync() direct use in net, and rename it to
> kill_fasync_rcu()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

This looks good to me, applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH] Infiniband: Randomize local port allocation.
From: Roland Dreier @ 2010-04-21 23:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sean.hefty, amwang, opurdila, eric.dumazet, netdev, nhorman,
	davem, ebiederm, linux-kernel, rolandd
In-Reply-To: <201004150229.o3F2T4dZ054768@www262.sakura.ne.jp>

Thanks, applied this part of the patch -- I preferred this one since the
goto into the middle of a loop seemed worse than a goto out of the loop...
-- 
Roland Dreier <rolandd@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

^ permalink raw reply

* Re: [PATCH] tg3: Fix INTx fallback when MSI fails
From: David Miller @ 2010-04-21 23:17 UTC (permalink / raw)
  To: adetsch; +Cc: netdev, mcarlson
In-Reply-To: <201004161015.12089.adetsch@br.ibm.com>

From: Andre Detsch <adetsch@br.ibm.com>
Date: Fri, 16 Apr 2010 10:15:11 -0300

> tg3: Fix INTx fallback when MSI fails
> 
> MSI setup changes the value of some key attributes of struct tg3 *tp.
> These attributes must be taken into account and restored before
> we try to do a new request_irq for INTx fallback.
> 
> In powerpc, the original code was leading to an EINVAL return within
> request_irq, because the driver was trying to use the disabled MSI
> virtual irq number instead of tp->pdev->irq.
> 
> Signed-off-by: Andre Detsch <adetsch@br.ibm.com>

Broadcom folks, please review, thanks.

^ permalink raw reply

* Re: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
From: David Miller @ 2010-04-21 23:15 UTC (permalink / raw)
  To: wg; +Cc: hjk, socketcan-core, netdev
In-Reply-To: <4BCED326.3060000@grandegger.com>

From: Wolfgang Grandegger <wg@grandegger.com>
Date: Wed, 21 Apr 2010 12:27:50 +0200

>> Subject: [PATCH] can: Fix possible NULL pointer dereference in ems_usb.c
>> 
>> In ems_usb_probe(), a pointer is dereferenced after making sure it is NULL...
>> 
>> This patch replaces netdev->dev.parent with &intf->dev in dev_err() calls to
>> avoid this.
>> 
>> Signed-off-by: "Hans J. Koch" <hjk@linutronix.de>
 ...
> Acked-by: Wolfgang Grandegger <wg@grandegger.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v3] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-21 23:05 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev, Tom Herbert, jamal
In-Reply-To: <1271238738-8386-1-git-send-email-xiaosuo@gmail.com>

Le mercredi 14 avril 2010 à 17:52 +0800, Changli Gao a écrit :
> batch skb dequeueing from softnet input_pkt_queue
> 
> batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
> contention and irq disabling/enabling.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> ----

lock contention _is_ a problem, Jamal tests can show it.

irq disabling/enabling is not, and force to use stop_machine() killer.

I suggest something very simple, like a small buffer (16 slots), so that
process_backlog() can batch 16 buffers at once.

Following patch not tested, but its late here and I need to sleep ;)

This is a RFC, not for inclusion, and based on current net-next-2.6 tree

[RFC] net: introduce a batch mode in process_backlog()

We see a lock contention on input_pkt_queue.lock in RPS benches.

As suggested by Changli Gao, we can batch several skbs at once in
process_backlog(), so that we dirty input_pkt_queue less often.

I chose to batch at most 16 skbs per round, and place them in
softnet_data zone where flush_backlog() can find them and eventually
free this skbs at device dismantle.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   48 +++++++++++++++++++++++++++---------
 2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3c5ed5f..16da8db 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1383,11 +1383,13 @@ static inline int unregister_gifconf(unsigned int family)
 /*
  * Incoming packets are placed on per-cpu queues
  */
+#define SD_BATCH_SZ 16
 struct softnet_data {
 	struct Qdisc		*output_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
+	struct sk_buff		*batch[SD_BATCH_SZ]; /* process_backlog() & flush_backlog() */
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index e904c47..2673ce0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2932,6 +2932,7 @@ static void flush_backlog(void *arg)
 	struct net_device *dev = arg;
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
 	struct sk_buff *skb, *tmp;
+	int i;
 
 	rps_lock(sd);
 	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp)
@@ -2941,6 +2942,13 @@ static void flush_backlog(void *arg)
 			input_queue_head_incr(sd);
 		}
 	rps_unlock(sd);
+	for (i = 0; i < ARRAY_SIZE(sd->batch); i++) {
+		skb = sd->batch[i];
+		if (skb && skb->dev == dev) {
+			kfree_skb(skb);
+			sd->batch[i] = NULL;
+		}
+	}
 }
 
 static int napi_gro_complete(struct sk_buff *skb)
@@ -3245,29 +3253,47 @@ EXPORT_SYMBOL(napi_gro_frags);
 
 static int process_backlog(struct napi_struct *napi, int quota)
 {
-	int work = 0;
+	int i, n, lim, work = 0;
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
+	struct sk_buff *skb;
 
 	napi->weight = weight_p;
+	local_irq_disable();
+
 	do {
-		struct sk_buff *skb;
+		lim = quota - work;
+		if (lim > ARRAY_SIZE(sd->batch))
+			lim = ARRAY_SIZE(sd->batch);
+		/* batch at most 16 buffers */
 
-		local_irq_disable();
 		rps_lock(sd);
-		skb = __skb_dequeue(&sd->input_pkt_queue);
-		if (!skb) {
+		for (n = 0; n < lim; n++) {
+			sd->batch[n] = __skb_dequeue(&sd->input_pkt_queue);
+			if (!sd->batch[n])
+				break;
+		}
+		if (!sd->input_pkt_queue.qlen) {
 			__napi_complete(napi);
-			rps_unlock(sd);
-			local_irq_enable();
-			break;
+			quota = 0;
 		}
-		input_queue_head_incr(sd);
 		rps_unlock(sd);
-		local_irq_enable();
 
-		__netif_receive_skb(skb);
-	} while (++work < quota);
+		/* Now process our batch */
+		for (i = 0; i < n; i++) {
+			skb = sd->batch[i];
+			/* flush_backlog() might have stolen this skb */
+			input_queue_head_incr(sd);
+			if (likely(skb)) {
+				sd->batch[i] = NULL;
+				local_irq_enable();
+				__netif_receive_skb(skb);
+				local_irq_disable();
+			}
+		}
+		work += n;
+	} while (work < quota);
 
+	local_irq_enable();
 	return work;
 }
 



^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-21 22:48 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Scott Feldman, Chris Wright, davem, netdev
In-Reply-To: <201004212313.05060.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Wednesday 21 April 2010, Scott Feldman wrote:
> > On 4/21/10 12:39 PM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > 
> > >>> 1. Setting up the slave device
> > >>>  a) create an SR-IOV VF to assign to a guest
> > >>>  b) create a macvtap device to pass to qemu or vhost
> > >>>  c) attach a tap device to a bridge
> > >>>  d) create a macvlan device and put it into a container
> > >>>  e) create a virtual interface for a VMDq adapter
> > >> 
> > >> OK, but iovnl isn't doing this.
> > > 
> > > The set_mac_vlan that Scott's patch adds seems to implement 1a), as far
> > > as I can tell. Interestingly, this is not actually implemented in
> > > the enic driver in patch 2/2. So if we all agree that this is out of the
> > > scope of iovnl, let's just remove it from the interface and find another
> > > way for it (ethtool, iplink, ..., as listed above).
> > 
> > You're right, not needed for enic since mac addr is included with
> > port-profile push and vlan membership is implied by port-profile.  So I put
> > set_mac_vlan in there basically to elicit feedback.
> 
> Ok. Two points though:
> 
> - when you say that the mac address is included in the port-profile push,
>   does that imply that the VF does not have a mac address prior to this?
>   This would again mix the NIC configuration phase with the switch
>   association, which I think we really need to avoid if we want to be
>   able to implement the association in user space!
> 
> - The VLAN ID being implied in the port profile seems to be another
>   difference between what enic is doing and the current draft VDP
>   that will eventually become 802.1Qbg, and I fear that this difference
>   will be visible in the iovnl protocol.
> 
> > There really wouldn't be much different between iplink and iovnl since
> > they're both rtnetlink...seems we should keep IOV-related APIs in one place.
> > Maybe there are other IOV APIs to add to iovnl in the future like:
> > 
> >     vf <- add_vf(pf)
> >     del_vf(pf, vf)
> > 
> > Ethtool doesn't seem the right place for this.
> 
> Right. My preference would probably be make these a subcategory of
> the if_link, and use the existing RTM_NEWLINK/RTM_DELLINK commands.
> This would make it resemble the existing interfaces and mean you can
> use
> 
> ip link add link eth0 type macvlan    # for a container
> ip link add link eth0 type macvtap    # for qemu/vhost
> ip link add link eth0 type vf         # for device assignment

BTW, what do you mean by device assignment?

> There are obviously significant differences between these three, but
> they also share enough of their properties to let us treat them
> in similar ways.
> 
> If we integrate the iovnl client into iproute2, the sequence for setting
> up an enic VF and associating it to the port profile could be
> 
> # create vf0, pass mac and vlan id to HW, no association yet
> ip link add link eth0 name vf0 type vf mac fe:dc:ba:12:34:56 vlan 78

Just to clarify...right now, the normal SR-IOV VF is already there.
And, or course, can have its mac addr/vlan set already.

> # associate vf with port profile, mac address must match the one assigned
> #  to the interface before.
> ip iov assoc eth0 port-profile "general" host-uuid "dcf2a873-f5ee-41dd-a7ad-802a544e48c2" \
> 	 mac fe:dc:ba:12:34:56

At that point you could just do s/mac fe:.*/link vf0/

thanks,
-chris

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Chris Wright @ 2010-04-21 22:18 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Chris Wright, Scott Feldman, davem, netdev
In-Reply-To: <201004212139.22421.arnd@arndb.de>

* Arnd Bergmann (arnd@arndb.de) wrote:
> On Wednesday 21 April 2010, Chris Wright wrote:
> > * Arnd Bergmann (arnd@arndb.de) wrote:
> > > On Wednesday 21 April 2010, Chris Wright wrote:
> > > > * Arnd Bergmann (arnd@arndb.de) wrote:
> > > > > Since it seems what you really want to do is to do the exchange with the
> > > > > switch from here, maybe the hardware configuration part should be moved
> > > > > the DCB interface?
> > > > 
> > > > I suppose this would work  (although it's a bit odd being out of scope
> > > > of DCB spec).
> > > 
> > > It could be anywhere, it doesn't have to be the DCB interface, but could
> > > be anything ranging from ethtool to iplink I guess. And we should define
> > > it in a way that works for any SR-IOV card, whether it's using Cisco's
> > > protocol in firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or
> > > none of the above and just provides an internal switch like all the
> > > existing NICs.
> > 
> > Heh, that's exactly what iovnl does ;-)
> 
> No, according to what you write below, it's exactly what iovnl does *not* do,
> i.e. part 1 in my list.

OK, I see...in this case to me hw setup was the port profile for the
enic to initiate host<->switch negotiation, sorry for confusion.

> > > 1. Setting up the slave device
> > >  a) create an SR-IOV VF to assign to a guest
> > >  b) create a macvtap device to pass to qemu or vhost
> > >  c) attach a tap device to a bridge
> > >  d) create a macvlan device and put it into a container
> > >  e) create a virtual interface for a VMDq adapter
> > 
> > OK, but iovnl isn't doing this.
> 
> The set_mac_vlan that Scott's patch adds seems to implement 1a), as far
> as I can tell. Interestingly, this is not actually implemented in
> the enic driver in patch 2/2. So if we all agree that this is out of the
> scope of iovnl, let's just remove it from the interface and find another
> way for it (ethtool, iplink, ..., as listed above).

Scott, any objection?  At least a way to keep moving forward on the port
profile bit.

> Note that we still need to pass the MAC address and VLAN ID (or a list
> of these) to the external switch, my point is just that this should be
> separate from enforcing it in the hypervisor.

Yup, we should focus on reconciling the diff of enic vs vpd port profile
needs.

> > > 2) Registering the slave with the switch
> > >  a) use Cisco protocol in enic firmware (see patch 2/2)
> > >  b) use standard VDP in lldpad
> > >  c) use reverse-engineered cisco protocol in some user tool for
> > >     non-enic adapters.
> > >  d) use standard VDP in firmware (hopefully this never happens)
> > >  e) do nothing at all (as we do today)
> > 
> > And this is the step that is the main purpose of iovnl.
> > 
> > Here's the simplest snippet of libvirt to show this.  It sends
> > set_port_profile netlink messages and then creates macvtap.  As simple
> > as it gets...
> > 
> > --- a/src/qemu/qemu_conf.c
> > +++ b/src/qemu/qemu_conf.c
> > @@ -1470,6 +1470,11 @@ qemudPhysIfaceConnect(virConnectPtr conn,
> >          net->model && STREQ(net->model, "virtio"))
> >          vnet_hdr = 1;
> >  
> > +    setPortProfileId(net->data.direct.linkdev,
> > +                      net->data.direct.mode,
> > +                      net->data.direct.profileid,
> > +                      net->mac);
> > +
> >      rc = openMacvtapTap(net->ifname, net->mac, linkdev, brmode,
> >                          &res_ifname, vnet_hdr);
> 
> Ok. In case of VDP, I guess this needs to be extended with the vlan ID
> that has been configured, and possibly with a UUID, because we need to
> pass the same one on the target machine if we migrate it.
> 
> Alternatively, the setPortProfileId could figure out the MAC address and
> VLAN ID from the device itself, but then we don't need to pass either of
> them.

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-21 22:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, vgoyal, nauman, netdev, Eric W. Biederman
In-Reply-To: <1271887029.7895.3535.camel@edumazet-laptop>

On Wed, Apr 21, 2010 at 11:57:09PM +0200, Eric Dumazet wrote:
> Le mercredi 21 avril 2010 à 14:35 -0700, Paul E. McKenney a écrit :
> 
> > > [   33.425087] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > [   33.425090] ---------------------------------------------------
> > > [   33.425094] net/core/dev.c:1993 invoked rcu_dereference_check()
> > > without protection!
> > > [   33.425098]
> > > [   33.425098] other info that might help us debug this:
> > > [   33.425100]
> > > [   33.425103]
> > > [   33.425104] rcu_scheduler_active = 1, debug_locks = 1
> > > [   33.425108] 2 locks held by canberra-gtk-pl/4208:
> > > [   33.425111]  #0:  (sk_lock-AF_INET){+.+.+.}, at:
> > > [<ffffffff81394ffd>] inet_stream_connect+0x3a/0x24d
> > > [   33.425125]  #1:  (rcu_read_lock_bh){.+....}, at:
> > > [<ffffffff8134a809>] dev_queue_xmit+0x14e/0x4b8
> > > [   33.425137]
> > > [   33.425138] stack backtrace:
> > > [   33.425142] Pid: 4208, comm: canberra-gtk-pl Not tainted 2.6.34-rc5 #18
> > > [   33.425146] Call Trace:
> > > [   33.425154]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > > [   33.425161]  [<ffffffff8134a914>] dev_queue_xmit+0x259/0x4b8
> > > [   33.425167]  [<ffffffff8134a809>] ? dev_queue_xmit+0x14e/0x4b8
> > > [   33.425173]  [<ffffffff81041c52>] ? _local_bh_enable_ip+0xcd/0xda
> > > [   33.425180]  [<ffffffff8135375a>] neigh_resolve_output+0x234/0x285
> > > [   33.425188]  [<ffffffff8136f71f>] ip_finish_output2+0x257/0x28c
> > > [   33.425193]  [<ffffffff8136f7bc>] ip_finish_output+0x68/0x6a
> > > [   33.425198]  [<ffffffff813704b3>] T.866+0x52/0x59
> > > [   33.425203]  [<ffffffff813706fe>] ip_output+0xaa/0xb4
> > > [   33.425209]  [<ffffffff8136ebb8>] ip_local_out+0x20/0x24
> > > [   33.425215]  [<ffffffff8136f204>] ip_queue_xmit+0x309/0x368
> > > [   33.425223]  [<ffffffff810e41e6>] ? __kmalloc_track_caller+0x111/0x155
> > > [   33.425230]  [<ffffffff813831ef>] ? tcp_connect+0x223/0x3d3
> > > [   33.425236]  [<ffffffff81381971>] tcp_transmit_skb+0x707/0x745
> > > [   33.425243]  [<ffffffff81383342>] tcp_connect+0x376/0x3d3
> > > [   33.425250]  [<ffffffff81268ac3>] ? secure_tcp_sequence_number+0x55/0x6f
> > > [   33.425256]  [<ffffffff813872f0>] tcp_v4_connect+0x3df/0x455
> > > [   33.425263]  [<ffffffff8133cbd9>] ? lock_sock_nested+0xf3/0x102
> > > [   33.425269]  [<ffffffff81395067>] inet_stream_connect+0xa4/0x24d
> > > [   33.425276]  [<ffffffff8133b418>] sys_connect+0x90/0xd0
> > > [   33.425283]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> > > [   33.425289]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > > [   33.425296]  [<ffffffff813ced00>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > > [   33.425303]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b
> > 
> > This looks like an rcu_dereference() needs to instead be
> > rcu_dereference_bh(), but the line numbering in my version of
> > net/core/dev.c does not match yours.  CCing netdev, hopefully
> > someone there will know which rcu_dereference() is indicated.
> 
> This is already sorted out in David trees

Very good!!!  ;-)

> > > [   85.939528] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > [   85.939531] ---------------------------------------------------
> > > [   85.939535] include/net/inet_timewait_sock.h:227 invoked
> > > rcu_dereference_check() without protection!
> > > [   85.939539]
> > > [   85.939540] other info that might help us debug this:
> > > [   85.939541]
> > > [   85.939544]
> > > [   85.939545] rcu_scheduler_active = 1, debug_locks = 1
> > > [   85.939549] 2 locks held by gwibber-service/4798:
> > > [   85.939552]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff811034b2>]
> > > seq_read+0x37/0x381
> > > [   85.939566]  #1:  (&(&hashinfo->ehash_locks[i])->rlock){+.-...},
> > > at: [<ffffffff81386355>] established_get_next+0xc4/0x132
> > > [   85.939579]
> > > [   85.939580] stack backtrace:
> > > [   85.939585] Pid: 4798, comm: gwibber-service Not tainted 2.6.34-rc5 #18
> > > [   85.939588] Call Trace:
> > > [   85.939598]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > > [   85.939604]  [<ffffffff81385018>] twsk_net+0x4f/0x57
> > > [   85.939610]  [<ffffffff813862e5>] established_get_next+0x54/0x132
> > > [   85.939615]  [<ffffffff813864c7>] tcp_seq_next+0x5d/0x6a
> > > [   85.939621]  [<ffffffff81103701>] seq_read+0x286/0x381
> > > [   85.939627]  [<ffffffff8110347b>] ? seq_read+0x0/0x381
> > > [   85.939633]  [<ffffffff81133240>] proc_reg_read+0x8d/0xac
> > > [   85.939640]  [<ffffffff810ea110>] vfs_read+0xa6/0x103
> > > [   85.939645]  [<ffffffff810ea223>] sys_read+0x45/0x69
> > > [   85.939652]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b
> > 
> > This one appears to be a case of missing rcu_read_lock(), but it is
> > not clear to me at what level it needs to go.
> > 
> > Eric, any enlightenment on this one and the next one?
> 
> Coming from commit b099ce2602d806deb41caaa578731848995cdb2a
> >From Eric Biederman (CCed)
> 
> Apparently he added rcu to twsk_net(), but Changelog doesnt mention it.

Thank you for chasing this down, Eric Dumazet!

Eric Biederman, any enlightment?

							Thanx, Paul

> > > [   87.296366] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > [   87.296369] ---------------------------------------------------
> > > [   87.296373] include/net/inet_timewait_sock.h:227 invoked
> > > rcu_dereference_check() without protection!
> > > [   87.296377]
> > > [   87.296377] other info that might help us debug this:
> > > [   87.296379]
> > > [   87.296382]
> > > [   87.296383] rcu_scheduler_active = 1, debug_locks = 1
> > > [   87.296386] no locks held by gwibber-service/4803.
> > > [   87.296389]
> > > [   87.296390] stack backtrace:
> > > [   87.296395] Pid: 4803, comm: gwibber-service Not tainted 2.6.34-rc5 #18
> > > [   87.296398] Call Trace:
> > > [   87.296411]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > > [   87.296419]  [<ffffffff813733d3>] twsk_net+0x4f/0x57
> > > [   87.296424]  [<ffffffff813737f3>] __inet_twsk_hashdance+0x50/0x158
> > > [   87.296431]  [<ffffffff81389239>] tcp_time_wait+0x1c1/0x24b
> > > [   87.296437]  [<ffffffff8137c417>] tcp_fin+0x83/0x162
> > > [   87.296443]  [<ffffffff8137cda7>] tcp_data_queue+0x1ff/0xa1e
> > > [   87.296450]  [<ffffffff810495c6>] ? mod_timer+0x1e/0x20
> > > [   87.296456]  [<ffffffff813809e3>] tcp_rcv_state_process+0x89d/0x8f2
> > > [   87.296463]  [<ffffffff8133ca0b>] ? release_sock+0x30/0x10b
> > > [   87.296468]  [<ffffffff81386df2>] tcp_v4_do_rcv+0x2de/0x33f
> > > [   87.296475]  [<ffffffff8133ca5d>] release_sock+0x82/0x10b
> > > [   87.296481]  [<ffffffff81376ef5>] tcp_close+0x1b5/0x37e
> > > [   87.296487]  [<ffffffff81395437>] inet_release+0x50/0x57
> > > [   87.296493]  [<ffffffff8133a134>] sock_release+0x1a/0x66
> > > [   87.296498]  [<ffffffff8133a1a2>] sock_close+0x22/0x26
> > > [   87.296505]  [<ffffffff810eb003>] __fput+0x120/0x1cd
> > > [   87.296510]  [<ffffffff810eb0c5>] fput+0x15/0x17
> > > [   87.296516]  [<ffffffff810e7f3d>] filp_close+0x63/0x6d
> > > [   87.296521]  [<ffffffff810e801e>] sys_close+0xd7/0x111
> > > [   87.296528]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b
> > 
> > commit d3b8ba1bde9afb7d50cf0712f9d95317ea66c06f
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Wed Apr 21 14:04:56 2010 -0700
> > 
> >     sched: protect __sched_setscheduler() access to cgroups
> >     
> >     A given task's cgroups structures must remain while that task is running
> >     due to reference counting, so this is presumably a false positive.
> >     
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 14c44ec..1d43c1a 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4575,9 +4575,11 @@ recheck:
> >  		 * Do not allow realtime tasks into groups that have no runtime
> >  		 * assigned.
> >  		 */
> > +		rcu_read_lock();
> >  		if (rt_bandwidth_enabled() && rt_policy(policy) &&
> >  				task_group(p)->rt_bandwidth.rt_runtime == 0)
> >  			return -EPERM;
> > +		rcu_read_unlock();
> >  #endif
> >  
> >  		retval = security_task_setscheduler(p, policy, param);
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 
> 

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Eric Dumazet @ 2010-04-21 21:57 UTC (permalink / raw)
  To: paulmck
  Cc: Miles Lane, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, vgoyal, nauman, netdev, Eric W. Biederman
In-Reply-To: <20100421213543.GO2563@linux.vnet.ibm.com>

Le mercredi 21 avril 2010 à 14:35 -0700, Paul E. McKenney a écrit :

> > [   33.425087] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [   33.425090] ---------------------------------------------------
> > [   33.425094] net/core/dev.c:1993 invoked rcu_dereference_check()
> > without protection!
> > [   33.425098]
> > [   33.425098] other info that might help us debug this:
> > [   33.425100]
> > [   33.425103]
> > [   33.425104] rcu_scheduler_active = 1, debug_locks = 1
> > [   33.425108] 2 locks held by canberra-gtk-pl/4208:
> > [   33.425111]  #0:  (sk_lock-AF_INET){+.+.+.}, at:
> > [<ffffffff81394ffd>] inet_stream_connect+0x3a/0x24d
> > [   33.425125]  #1:  (rcu_read_lock_bh){.+....}, at:
> > [<ffffffff8134a809>] dev_queue_xmit+0x14e/0x4b8
> > [   33.425137]
> > [   33.425138] stack backtrace:
> > [   33.425142] Pid: 4208, comm: canberra-gtk-pl Not tainted 2.6.34-rc5 #18
> > [   33.425146] Call Trace:
> > [   33.425154]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > [   33.425161]  [<ffffffff8134a914>] dev_queue_xmit+0x259/0x4b8
> > [   33.425167]  [<ffffffff8134a809>] ? dev_queue_xmit+0x14e/0x4b8
> > [   33.425173]  [<ffffffff81041c52>] ? _local_bh_enable_ip+0xcd/0xda
> > [   33.425180]  [<ffffffff8135375a>] neigh_resolve_output+0x234/0x285
> > [   33.425188]  [<ffffffff8136f71f>] ip_finish_output2+0x257/0x28c
> > [   33.425193]  [<ffffffff8136f7bc>] ip_finish_output+0x68/0x6a
> > [   33.425198]  [<ffffffff813704b3>] T.866+0x52/0x59
> > [   33.425203]  [<ffffffff813706fe>] ip_output+0xaa/0xb4
> > [   33.425209]  [<ffffffff8136ebb8>] ip_local_out+0x20/0x24
> > [   33.425215]  [<ffffffff8136f204>] ip_queue_xmit+0x309/0x368
> > [   33.425223]  [<ffffffff810e41e6>] ? __kmalloc_track_caller+0x111/0x155
> > [   33.425230]  [<ffffffff813831ef>] ? tcp_connect+0x223/0x3d3
> > [   33.425236]  [<ffffffff81381971>] tcp_transmit_skb+0x707/0x745
> > [   33.425243]  [<ffffffff81383342>] tcp_connect+0x376/0x3d3
> > [   33.425250]  [<ffffffff81268ac3>] ? secure_tcp_sequence_number+0x55/0x6f
> > [   33.425256]  [<ffffffff813872f0>] tcp_v4_connect+0x3df/0x455
> > [   33.425263]  [<ffffffff8133cbd9>] ? lock_sock_nested+0xf3/0x102
> > [   33.425269]  [<ffffffff81395067>] inet_stream_connect+0xa4/0x24d
> > [   33.425276]  [<ffffffff8133b418>] sys_connect+0x90/0xd0
> > [   33.425283]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> > [   33.425289]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [   33.425296]  [<ffffffff813ced00>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [   33.425303]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b
> 
> This looks like an rcu_dereference() needs to instead be
> rcu_dereference_bh(), but the line numbering in my version of
> net/core/dev.c does not match yours.  CCing netdev, hopefully
> someone there will know which rcu_dereference() is indicated.
> 

This is already sorted out in David trees



> > [   85.939528] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [   85.939531] ---------------------------------------------------
> > [   85.939535] include/net/inet_timewait_sock.h:227 invoked
> > rcu_dereference_check() without protection!
> > [   85.939539]
> > [   85.939540] other info that might help us debug this:
> > [   85.939541]
> > [   85.939544]
> > [   85.939545] rcu_scheduler_active = 1, debug_locks = 1
> > [   85.939549] 2 locks held by gwibber-service/4798:
> > [   85.939552]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff811034b2>]
> > seq_read+0x37/0x381
> > [   85.939566]  #1:  (&(&hashinfo->ehash_locks[i])->rlock){+.-...},
> > at: [<ffffffff81386355>] established_get_next+0xc4/0x132
> > [   85.939579]
> > [   85.939580] stack backtrace:
> > [   85.939585] Pid: 4798, comm: gwibber-service Not tainted 2.6.34-rc5 #18
> > [   85.939588] Call Trace:
> > [   85.939598]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > [   85.939604]  [<ffffffff81385018>] twsk_net+0x4f/0x57
> > [   85.939610]  [<ffffffff813862e5>] established_get_next+0x54/0x132
> > [   85.939615]  [<ffffffff813864c7>] tcp_seq_next+0x5d/0x6a
> > [   85.939621]  [<ffffffff81103701>] seq_read+0x286/0x381
> > [   85.939627]  [<ffffffff8110347b>] ? seq_read+0x0/0x381
> > [   85.939633]  [<ffffffff81133240>] proc_reg_read+0x8d/0xac
> > [   85.939640]  [<ffffffff810ea110>] vfs_read+0xa6/0x103
> > [   85.939645]  [<ffffffff810ea223>] sys_read+0x45/0x69
> > [   85.939652]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b
> 
> This one appears to be a case of missing rcu_read_lock(), but it is
> not clear to me at what level it needs to go.
> 
> Eric, any enlightenment on this one and the next one?
> 

Coming from commit b099ce2602d806deb41caaa578731848995cdb2a
>From Eric Biederman (CCed)

Apparently he added rcu to twsk_net(), but Changelog doesnt mention it.

> > [   87.296366] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [   87.296369] ---------------------------------------------------
> > [   87.296373] include/net/inet_timewait_sock.h:227 invoked
> > rcu_dereference_check() without protection!
> > [   87.296377]
> > [   87.296377] other info that might help us debug this:
> > [   87.296379]
> > [   87.296382]
> > [   87.296383] rcu_scheduler_active = 1, debug_locks = 1
> > [   87.296386] no locks held by gwibber-service/4803.
> > [   87.296389]
> > [   87.296390] stack backtrace:
> > [   87.296395] Pid: 4803, comm: gwibber-service Not tainted 2.6.34-rc5 #18
> > [   87.296398] Call Trace:
> > [   87.296411]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> > [   87.296419]  [<ffffffff813733d3>] twsk_net+0x4f/0x57
> > [   87.296424]  [<ffffffff813737f3>] __inet_twsk_hashdance+0x50/0x158
> > [   87.296431]  [<ffffffff81389239>] tcp_time_wait+0x1c1/0x24b
> > [   87.296437]  [<ffffffff8137c417>] tcp_fin+0x83/0x162
> > [   87.296443]  [<ffffffff8137cda7>] tcp_data_queue+0x1ff/0xa1e
> > [   87.296450]  [<ffffffff810495c6>] ? mod_timer+0x1e/0x20
> > [   87.296456]  [<ffffffff813809e3>] tcp_rcv_state_process+0x89d/0x8f2
> > [   87.296463]  [<ffffffff8133ca0b>] ? release_sock+0x30/0x10b
> > [   87.296468]  [<ffffffff81386df2>] tcp_v4_do_rcv+0x2de/0x33f
> > [   87.296475]  [<ffffffff8133ca5d>] release_sock+0x82/0x10b
> > [   87.296481]  [<ffffffff81376ef5>] tcp_close+0x1b5/0x37e
> > [   87.296487]  [<ffffffff81395437>] inet_release+0x50/0x57
> > [   87.296493]  [<ffffffff8133a134>] sock_release+0x1a/0x66
> > [   87.296498]  [<ffffffff8133a1a2>] sock_close+0x22/0x26
> > [   87.296505]  [<ffffffff810eb003>] __fput+0x120/0x1cd
> > [   87.296510]  [<ffffffff810eb0c5>] fput+0x15/0x17
> > [   87.296516]  [<ffffffff810e7f3d>] filp_close+0x63/0x6d
> > [   87.296521]  [<ffffffff810e801e>] sys_close+0xd7/0x111
> > [   87.296528]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b
> 
> commit d3b8ba1bde9afb7d50cf0712f9d95317ea66c06f
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Wed Apr 21 14:04:56 2010 -0700
> 
>     sched: protect __sched_setscheduler() access to cgroups
>     
>     A given task's cgroups structures must remain while that task is running
>     due to reference counting, so this is presumably a false positive.
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 14c44ec..1d43c1a 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4575,9 +4575,11 @@ recheck:
>  		 * Do not allow realtime tasks into groups that have no runtime
>  		 * assigned.
>  		 */
> +		rcu_read_lock();
>  		if (rt_bandwidth_enabled() && rt_policy(policy) &&
>  				task_group(p)->rt_bandwidth.rt_runtime == 0)
>  			return -EPERM;
> +		rcu_read_unlock();
>  #endif
>  
>  		retval = security_task_setscheduler(p, policy, param);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply

* Re: rps perfomance WAS(Re: rps: question
From: Rick Jones @ 2010-04-21 21:53 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, Changli Gao, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1271853570.4032.21.camel@bigi>

> Let me explain my test setup (which some app types may gasp at;->):
> 
> SUT(system under test) was a nehalem single processor (4 cores, 2 SMT
> threads per core). 
> SUT runs a udp sink server i wrote (with apologies to Rick Jones[1])
 > ...
> 
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf

No need to apologize,  if you like I'd be happy to discuss netperf usage tips 
offline.  That offer stands for everyone.

happy benchmarking,

rick jones

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-21 21:48 UTC (permalink / raw)
  To: Miles Lane
  Cc: Eric Paris, Lai Jiangshan, Ingo Molnar, Peter Zijlstra, LKML,
	vgoyal, nauman, eric.dumazet, netdev
In-Reply-To: <20100421213543.GO2563@linux.vnet.ibm.com>

On Wed, Apr 21, 2010 at 02:35:43PM -0700, Paul E. McKenney wrote:
> On Tue, Apr 20, 2010 at 11:38:28AM -0400, Miles Lane wrote:
> > Excellent.  Here are the results on my machine.  .config appended.
> 
> First, thank you very much for testing this, Miles!

And as Tetsuo Handa pointed out privately, my patch was way broken.

Here is an updated version.

							Thanx, Paul

commit b15e561ed91b7a366c3cc635026f3b9ce6483070
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed Apr 21 14:04:56 2010 -0700

    sched: protect __sched_setscheduler() access to cgroups
    
    A given task's cgroups structures must remain while that task is running
    due to reference counting, so this is presumably a false positive.
    Updated to reflect feedback from Tetsuo Handa.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index 14c44ec..f425a2b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4575,9 +4575,13 @@ recheck:
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
 		 */
+		rcu_read_lock();
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
-				task_group(p)->rt_bandwidth.rt_runtime == 0)
+				task_group(p)->rt_bandwidth.rt_runtime == 0) {
+			rcu_read_unlock();
 			return -EPERM;
+		}
+		rcu_read_unlock();
 #endif
 
 		retval = security_task_setscheduler(p, policy, param);

^ permalink raw reply related

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-21 21:45 UTC (permalink / raw)
  To: Borislav Petkov, Miles Lane, Eric Paris, Lai Jiangshan,
	Ingo Molnar, Peter Zijlstra <
  Cc: eric.dumazet, netdev
In-Reply-To: <20100421060428.GA3839@liondog.tnic>

On Wed, Apr 21, 2010 at 08:04:28AM +0200, Borislav Petkov wrote:
> Hi,
> 
> a plain -rc5 triggers at net/core/dev.c:1993 here too:

Thank you for testing this, Borislav!  A prototype patch may be
found below, CCing the netdev guys for their thoughts on this.
(And if this patch is valid, it is probably best for it to go up with
the networking patches, since this part of the code looks to be under
active development.)

							Thanx, Paul

> [   12.889090] ===================================================
> [   12.889387] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   12.889533] ---------------------------------------------------
> [   12.889679] net/core/dev.c:1993 invoked rcu_dereference_check() without protection!
> [   12.889929] 
> [   12.889929] other info that might help us debug this:
> [   12.889930] 
> [   12.890368] 
> [   12.890369] rcu_scheduler_active = 1, debug_locks = 0
> [   12.890659] 2 locks held by swapper/0:
> [   12.890803]  #0:  (&idev->mc_ifc_timer){+.-...}, at: [<ffffffff81045f4a>] run_timer_softirq+0x266/0x503
> [   12.891227]  #1:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81397eb4>] dev_queue_xmit+0x153/0x512
> [   12.891647] 
> [   12.891648] stack backtrace:
> [   12.891934] Pid: 0, comm: swapper Not tainted 2.6.34-rc5 #1
> [   12.892085] Call Trace:
> [   12.892231]  <IRQ>  [<ffffffff81065d8f>] lockdep_rcu_dereference+0xaa/0xb2
> [   12.892430]  [<ffffffff81397fbf>] dev_queue_xmit+0x25e/0x512
> [   12.892576]  [<ffffffff81397eb4>] ? dev_queue_xmit+0x153/0x512
> [   12.892723]  [<ffffffff81066a4a>] ? trace_hardirqs_on+0xd/0xf
> [   12.892871]  [<ffffffff8103f4fb>] ? local_bh_enable_ip+0xbc/0xda
> [   12.893024]  [<ffffffff8139ea67>] neigh_resolve_output+0x323/0x36a
> [   12.893183]  [<ffffffffa00ae6b7>] ? ipv6_chk_mcast_addr+0x0/0x1fa [ipv6]
> [   12.893338]  [<ffffffffa0094860>] ip6_output_finish+0x81/0xb9 [ipv6]
> [   12.893492]  [<ffffffffa0096067>] ip6_output2+0x2a9/0x2b4 [ipv6]
> [   12.893644]  [<ffffffffa0096c33>] ip6_output+0xbc1/0xbd0 [ipv6]
> [   12.893797]  [<ffffffffa00a2a06>] ? fib6_force_start_gc+0x30/0x32 [ipv6]
> [   12.893951]  [<ffffffffa00b04e8>] mld_sendpack+0x30b/0x435 [ipv6]
> [   12.894109]  [<ffffffffa00b01dd>] ? mld_sendpack+0x0/0x435 [ipv6]
> [   12.894264]  [<ffffffff8106676d>] ? mark_held_locks+0x52/0x70
> [   12.894418]  [<ffffffffa00b0d2d>] mld_ifc_timer_expire+0x254/0x28d [ipv6]
> [   12.894570]  [<ffffffff81046065>] run_timer_softirq+0x381/0x503
> [   12.894717]  [<ffffffff81045f4a>] ? run_timer_softirq+0x266/0x503
> [   12.894870]  [<ffffffffa00b0ad9>] ? mld_ifc_timer_expire+0x0/0x28d [ipv6]
> [   12.895024]  [<ffffffff8103f708>] ? __do_softirq+0x79/0x2f5
> [   12.895174]  [<ffffffff8103f80f>] __do_softirq+0x180/0x2f5
> [   12.895323]  [<ffffffff810030cc>] call_softirq+0x1c/0x28
> [   12.895472]  [<ffffffff81004d91>] do_softirq+0x3d/0x85
> [   12.895619]  [<ffffffff8103f2b5>] irq_exit+0x4a/0x95
> [   12.895766]  [<ffffffff81413a3d>] smp_apic_timer_interrupt+0x8c/0x9a
> [   12.895913]  [<ffffffff81002b93>] apic_timer_interrupt+0x13/0x20
> [   12.896065]  <EOI>  [<ffffffff81412e1e>] ? _raw_spin_unlock_irqrestore+0x38/0x69
> [   12.896363]  [<ffffffff8100a189>] ? default_idle+0xd8/0x10a
> [   12.896512]  [<ffffffff8100a187>] ? default_idle+0xd6/0x10a
> [   12.896658]  [<ffffffff8100a5b3>] c1e_idle+0xcd/0xf4
> [   12.896805]  [<ffffffff8100138c>] cpu_idle+0x5e/0xb5
> [   12.896952]  [<ffffffff813ff0eb>] rest_init+0xff/0x106
> [   12.897104]  [<ffffffff813fefec>] ? rest_init+0x0/0x106
> [   12.897260]  [<ffffffff818e7c2a>] start_kernel+0x30f/0x31a
> [   12.897409]  [<ffffffff818e726d>] x86_64_start_reservations+0x7d/0x81
> [   12.897560]  [<ffffffff818e7355>] x86_64_start_kernel+0xe4/0xeb



commit df3f17af2d26d1451a3d23d5c7b7a6423a38569e
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed Apr 21 14:40:37 2010 -0700

    net: fix dev_pick_tx() to use rcu_dereference_bh()
    
    dev_pick_tx() is called in an RCU-bh read-side critical section, but
    uses rcu_dereference().  This patch changes to rcu_dereference_bh() in
    order to suppress the RCU lockdep splat.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/net/core/dev.c b/net/core/dev.c
index 92584bf..f769098 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1990,7 +1990,7 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 				queue_index = skb_tx_hash(dev, skb);
 
 			if (sk) {
-				struct dst_entry *dst = rcu_dereference(sk->sk_dst_cache);
+				struct dst_entry *dst = rcu_dereference_bh(sk->sk_dst_cache);
 
 				if (dst && skb_dst(skb) == dst)
 					sk_tx_queue_set(sk, queue_index);

^ permalink raw reply related

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-21 21:35 UTC (permalink / raw)
  To: Miles Lane
  Cc: Eric Paris, Lai Jiangshan, Ingo Molnar, Peter Zijlstra, LKML,
	vgoyal, nauman, eric.dumazet, netdev
In-Reply-To: <t2la44ae5cd1004200838w87256e80v9dcde91342b321db@mail.gmail.com>

On Tue, Apr 20, 2010 at 11:38:28AM -0400, Miles Lane wrote:
> Excellent.  Here are the results on my machine.  .config appended.

First, thank you very much for testing this, Miles!

> [    0.177300] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    0.177428] ---------------------------------------------------
> [    0.177557] include/linux/cgroup.h:533 invoked
> rcu_dereference_check() without protection!
> [    0.177760]
> [    0.177761] other info that might help us debug this:
> [    0.177762]
> [    0.178123]
> [    0.178124] rcu_scheduler_active = 1, debug_locks = 1
> [    0.178369] no locks held by watchdog/0/5.
> [    0.178493]
> [    0.178494] stack backtrace:
> [    0.178735] Pid: 5, comm: watchdog/0 Not tainted 2.6.34-rc5 #18
> [    0.178863] Call Trace:
> [    0.178994]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> [    0.179127]  [<ffffffff8102d667>] task_subsys_state+0x48/0x60
> [    0.179259]  [<ffffffff810328e5>] __sched_setscheduler+0x19d/0x300
> [    0.179392]  [<ffffffff8102b477>] ? need_resched+0x1e/0x28
> [    0.179523]  [<ffffffff813cd501>] ? schedule+0x643/0x66e
> [    0.179653]  [<ffffffff81091903>] ? watchdog+0x0/0x8c
> [    0.179783]  [<ffffffff81032a63>] sched_setscheduler+0xe/0x10
> [    0.179913]  [<ffffffff8109192d>] watchdog+0x2a/0x8c
> [    0.180010]  [<ffffffff81091903>] ? watchdog+0x0/0x8c
> [    0.180142]  [<ffffffff8105713e>] kthread+0x89/0x91
> [    0.180272]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    0.180405]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> [    0.180537]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> [    0.180667]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> [    0.180796]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10

I have a prototype patch for this way down below, but someone who knows
more about CONFIG_RT_GROUP_SCHED than I do should look it over.  In the
meantime, could you please see if it helps?

> [    3.116754] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    3.116754] ---------------------------------------------------
> [    3.116754] kernel/cgroup.c:4432 invoked rcu_dereference_check()
> without protection!
> [    3.116754]
> [    3.116754] other info that might help us debug this:
> [    3.116754]
> [    3.116754]
> [    3.116754] rcu_scheduler_active = 1, debug_locks = 1
> [    3.116754] 2 locks held by async/1/666:
> [    3.116754]  #0:  (&shost->scan_mutex){+.+.+.}, at:
> [<ffffffff812df0a0>] __scsi_add_device+0x83/0xe4
> [    3.116754]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> [<ffffffff811f2e8d>] blkiocg_add_blkio_group+0x29/0x7f
> [    3.116754]
> [    3.116754] stack backtrace:
> [    3.116754] Pid: 666, comm: async/1 Not tainted 2.6.34-rc5 #18
> [    3.116754] Call Trace:
> [    3.116754]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> [    3.116754]  [<ffffffff8107f9b1>] css_id+0x3f/0x51
> [    3.116754]  [<ffffffff811f2e9c>] blkiocg_add_blkio_group+0x38/0x7f
> [    3.116754]  [<ffffffff811f4e64>] cfq_init_queue+0xdf/0x2dc
> [    3.116754]  [<ffffffff811e3445>] elevator_init+0xba/0xf5
> [    3.116754]  [<ffffffff812dc02a>] ? scsi_request_fn+0x0/0x451
> [    3.116754]  [<ffffffff811e696b>] blk_init_queue_node+0x12f/0x135
> [    3.116754]  [<ffffffff811e697d>] blk_init_queue+0xc/0xe
> [    3.116754]  [<ffffffff812dc49c>] __scsi_alloc_queue+0x21/0x111
> [    3.116754]  [<ffffffff812dc5a4>] scsi_alloc_queue+0x18/0x64
> [    3.116754]  [<ffffffff812de5a0>] scsi_alloc_sdev+0x19e/0x256
> [    3.116754]  [<ffffffff812de73e>] scsi_probe_and_add_lun+0xe6/0x9c5
> [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    3.116754]  [<ffffffff813ce0d6>] ? __mutex_lock_common+0x3e4/0x43a
> [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> [    3.116754]  [<ffffffff812d0a5c>] ? transport_setup_classdev+0x0/0x17
> [    3.116754]  [<ffffffff812df0a0>] ? __scsi_add_device+0x83/0xe4
> [    3.116754]  [<ffffffff812df0d5>] __scsi_add_device+0xb8/0xe4
> [    3.116754]  [<ffffffff812ea9c5>] ata_scsi_scan_host+0x74/0x16e
> [    3.116754]  [<ffffffff81057685>] ? autoremove_wake_function+0x0/0x34
> [    3.116754]  [<ffffffff812e8e64>] async_port_probe+0xab/0xb7
> [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> [    3.116754]  [<ffffffff8105e2ba>] async_thread+0x105/0x1f4
> [    3.116754]  [<ffffffff81033d79>] ? default_wake_function+0x0/0xf
> [    3.116754]  [<ffffffff8105e1b5>] ? async_thread+0x0/0x1f4
> [    3.116754]  [<ffffffff8105713e>] kthread+0x89/0x91
> [    3.116754]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    3.116754]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> [    3.116754]  [<ffffffff813cfcc0>] ? restore_args+0x0/0x30
> [    3.116754]  [<ffffffff810570b5>] ? kthread+0x0/0x91
> [    3.116754]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10

I cannot convince myself that the above access is safe.  Vivek, Nauman,
thoughts?

> [   33.425087] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   33.425090] ---------------------------------------------------
> [   33.425094] net/core/dev.c:1993 invoked rcu_dereference_check()
> without protection!
> [   33.425098]
> [   33.425098] other info that might help us debug this:
> [   33.425100]
> [   33.425103]
> [   33.425104] rcu_scheduler_active = 1, debug_locks = 1
> [   33.425108] 2 locks held by canberra-gtk-pl/4208:
> [   33.425111]  #0:  (sk_lock-AF_INET){+.+.+.}, at:
> [<ffffffff81394ffd>] inet_stream_connect+0x3a/0x24d
> [   33.425125]  #1:  (rcu_read_lock_bh){.+....}, at:
> [<ffffffff8134a809>] dev_queue_xmit+0x14e/0x4b8
> [   33.425137]
> [   33.425138] stack backtrace:
> [   33.425142] Pid: 4208, comm: canberra-gtk-pl Not tainted 2.6.34-rc5 #18
> [   33.425146] Call Trace:
> [   33.425154]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> [   33.425161]  [<ffffffff8134a914>] dev_queue_xmit+0x259/0x4b8
> [   33.425167]  [<ffffffff8134a809>] ? dev_queue_xmit+0x14e/0x4b8
> [   33.425173]  [<ffffffff81041c52>] ? _local_bh_enable_ip+0xcd/0xda
> [   33.425180]  [<ffffffff8135375a>] neigh_resolve_output+0x234/0x285
> [   33.425188]  [<ffffffff8136f71f>] ip_finish_output2+0x257/0x28c
> [   33.425193]  [<ffffffff8136f7bc>] ip_finish_output+0x68/0x6a
> [   33.425198]  [<ffffffff813704b3>] T.866+0x52/0x59
> [   33.425203]  [<ffffffff813706fe>] ip_output+0xaa/0xb4
> [   33.425209]  [<ffffffff8136ebb8>] ip_local_out+0x20/0x24
> [   33.425215]  [<ffffffff8136f204>] ip_queue_xmit+0x309/0x368
> [   33.425223]  [<ffffffff810e41e6>] ? __kmalloc_track_caller+0x111/0x155
> [   33.425230]  [<ffffffff813831ef>] ? tcp_connect+0x223/0x3d3
> [   33.425236]  [<ffffffff81381971>] tcp_transmit_skb+0x707/0x745
> [   33.425243]  [<ffffffff81383342>] tcp_connect+0x376/0x3d3
> [   33.425250]  [<ffffffff81268ac3>] ? secure_tcp_sequence_number+0x55/0x6f
> [   33.425256]  [<ffffffff813872f0>] tcp_v4_connect+0x3df/0x455
> [   33.425263]  [<ffffffff8133cbd9>] ? lock_sock_nested+0xf3/0x102
> [   33.425269]  [<ffffffff81395067>] inet_stream_connect+0xa4/0x24d
> [   33.425276]  [<ffffffff8133b418>] sys_connect+0x90/0xd0
> [   33.425283]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> [   33.425289]  [<ffffffff81068922>] ? trace_hardirqs_on_caller+0x114/0x13f
> [   33.425296]  [<ffffffff813ced00>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   33.425303]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

This looks like an rcu_dereference() needs to instead be
rcu_dereference_bh(), but the line numbering in my version of
net/core/dev.c does not match yours.  CCing netdev, hopefully
someone there will know which rcu_dereference() is indicated.

> [   52.869375] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   52.869378] ---------------------------------------------------
> [   52.869382] net/mac80211/sta_info.c:886 invoked
> rcu_dereference_check() without protection!
> [   52.869386]
> [   52.869387] other info that might help us debug this:
> [   52.869389]
> [   52.869392]
> [   52.869392] rcu_scheduler_active = 1, debug_locks = 1
> [   52.869397] 1 lock held by Xorg/4051:
> [   52.869399]  #0:  (&dev->struct_mutex){+.+.+.}, at:
> [<ffffffff812afdc4>] i915_gem_do_execbuffer+0xf4c/0xfda
> [   52.869414]
> [   52.869415] stack backtrace:
> [   52.869420] Pid: 4051, comm: Xorg Not tainted 2.6.34-rc5 #18
> [   52.869423] Call Trace:
> [   52.869426]  <IRQ>  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> [   52.869454]  [<ffffffffa01289ae>]
> ieee80211_find_sta_by_hw+0x46/0x10f [mac80211]
> [   52.869467]  [<ffffffffa0128a8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
> [   52.869483]  [<ffffffffa017a0f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
> [   52.869490]  [<ffffffff8106842f>] ? mark_lock+0x2d/0x235
> [   52.869501]  [<ffffffffa01a2f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
> [   52.869508]  [<ffffffff8120a3d3>] ? is_swiotlb_buffer+0x2e/0x3b
> [   52.869518]  [<ffffffffa019bbf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
> [   52.869524]  [<ffffffff81068908>] ? trace_hardirqs_on_caller+0xfa/0x13f
> [   52.869534]  [<ffffffffa019c3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
> [   52.869540]  [<ffffffff810411df>] tasklet_action+0xa7/0x10f
> [   52.869546]  [<ffffffff810421f1>] __do_softirq+0x144/0x252
> [   52.869553]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
> [   52.869559]  [<ffffffff810050e4>] do_softirq+0x38/0x80
> [   52.869564]  [<ffffffff81041cbe>] irq_exit+0x45/0x94
> [   52.869569]  [<ffffffff81004829>] do_IRQ+0xad/0xc4
> [   52.869576]  [<ffffffff813cfc13>] ret_from_intr+0x0/0xf
> [   52.869580]  <EOI>  [<ffffffff81068765>] ? lockdep_trace_alloc+0xbe/0xc2
> [   52.869592]  [<ffffffff810bca55>] __alloc_pages_nodemask+0x8f/0x6a5
> [   52.869598]  [<ffffffff810b70f5>] ? rcu_read_lock+0x0/0x35
> [   52.869604]  [<ffffffff810b70f5>] ? rcu_read_lock+0x0/0x35
> [   52.869610]  [<ffffffff810c33cb>] ? kmap_atomic+0x16/0x4b
> [   52.869615]  [<ffffffff810b71ad>] ? rcu_read_unlock+0x21/0x23
> [   52.869621]  [<ffffffff810b6c3c>] __page_cache_alloc+0x14/0x16
> [   52.869627]  [<ffffffff810b836d>] do_read_cache_page+0x43/0x121
> [   52.869632]  [<ffffffff810c54bd>] ? shmem_readpage+0x0/0x3c
> [   52.869638]  [<ffffffff810b8464>] read_cache_page_gfp+0x19/0x23
> [   52.869644]  [<ffffffff812aac10>] i915_gem_object_get_pages+0xa1/0x115
> [   52.869651]  [<ffffffff812ad23e>] i915_gem_object_bind_to_gtt+0x16d/0x2ce
> [   52.869657]  [<ffffffff812ad3c6>] i915_gem_object_pin+0x27/0x88
> [   52.869663]  [<ffffffff812af316>] i915_gem_do_execbuffer+0x49e/0xfda
> [   52.869670]  [<ffffffff810cbb93>] ? might_fault+0x63/0xb3
> [   52.869676]  [<ffffffff810cbbdc>] ? might_fault+0xac/0xb3
> [   52.869681]  [<ffffffff810cbb93>] ? might_fault+0x63/0xb3
> [   52.869687]  [<ffffffff812b010d>] i915_gem_execbuffer+0x192/0x221
> [   52.869694]  [<ffffffff812900d0>] drm_ioctl+0x25a/0x36e
> [   52.869700]  [<ffffffff812aff7b>] ? i915_gem_execbuffer+0x0/0x221
> [   52.869707]  [<ffffffff810e9ad1>] ? do_sync_read+0xc6/0x103
> [   52.869714]  [<ffffffff810f6dcd>] vfs_ioctl+0x2d/0xa1
> [   52.869720]  [<ffffffff810f7343>] do_vfs_ioctl+0x48b/0x4d1
> [   52.869726]  [<ffffffff810f73da>] sys_ioctl+0x51/0x74
> [   52.869733]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

This one looks to be an update-side reference protected by dev->struct_mutex,
but there is no obvious way to get that information to the pair
of rcu_dereference() calls in for_each_sta_info().  Besides which,
I am not 100% certain that this one is really only a false positive.
Especially given that the next one looks similar, but uses a different
lock.

Eric, and enlightenment?

> [   52.884563] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   52.884566] ---------------------------------------------------
> [   52.884571] net/mac80211/sta_info.c:886 invoked
> rcu_dereference_check() without protection!
> [   52.884574]
> [   52.884575] other info that might help us debug this:
> [   52.884577]
> [   52.884580]
> [   52.884581] rcu_scheduler_active = 1, debug_locks = 1
> [   52.884585] 1 lock held by rsyslogd/3854:
> [   52.884588]  #0:  (&sb->s_type->i_mutex_key#10){+.+.+.}, at:
> [<ffffffff810b7f97>] generic_file_aio_write+0x47/0xa8
> [   52.884604]
> [   52.884605] stack backtrace:
> [   52.884610] Pid: 3854, comm: rsyslogd Not tainted 2.6.34-rc5 #18
> [   52.884613] Call Trace:
> [   52.884617]  <IRQ>  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> [   52.884645]  [<ffffffffa01289fe>]
> ieee80211_find_sta_by_hw+0x96/0x10f [mac80211]
> [   52.884658]  [<ffffffffa0128a8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
> [   52.884675]  [<ffffffffa017a0f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
> [   52.884681]  [<ffffffff8106842f>] ? mark_lock+0x2d/0x235
> [   52.884693]  [<ffffffffa01a2f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
> [   52.884701]  [<ffffffff8120a3d3>] ? is_swiotlb_buffer+0x2e/0x3b
> [   52.884710]  [<ffffffffa019bbf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
> [   52.884717]  [<ffffffff81068908>] ? trace_hardirqs_on_caller+0xfa/0x13f
> [   52.884726]  [<ffffffffa019c3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
> [   52.884733]  [<ffffffff810411df>] tasklet_action+0xa7/0x10f
> [   52.884739]  [<ffffffff810421f1>] __do_softirq+0x144/0x252
> [   52.884746]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
> [   52.884752]  [<ffffffff810050e4>] do_softirq+0x38/0x80
> [   52.884757]  [<ffffffff81041cbe>] irq_exit+0x45/0x94
> [   52.884762]  [<ffffffff81004829>] do_IRQ+0xad/0xc4
> [   52.884769]  [<ffffffff813cfc13>] ret_from_intr+0x0/0xf
> [   52.884773]  <EOI>  [<ffffffff810e3509>] ? kmem_cache_free+0xb0/0x134
> [   52.884789]  [<ffffffff811913dc>] ? jbd2_journal_stop+0x32c/0x33e
> [   52.884796]  [<ffffffff811913dc>] jbd2_journal_stop+0x32c/0x33e
> [   52.884804]  [<ffffffff8115e689>] ? ext4_dirty_inode+0x40/0x45
> [   52.884811]  [<ffffffff81105fdb>] ? __mark_inode_dirty+0x2f/0x12e
> [   52.884819]  [<ffffffff81170a65>] __ext4_journal_stop+0x6f/0x75
> [   52.884825]  [<ffffffff81162949>] ext4_da_write_end+0x25c/0x2fc
> [   52.884833]  [<ffffffff810b6b2e>] generic_file_buffered_write+0x161/0x25b
> [   52.884840]  [<ffffffff810b7f1b>] __generic_file_aio_write+0x24a/0x27f
> [   52.884845]  [<ffffffff810b7f97>] ? generic_file_aio_write+0x47/0xa8
> [   52.884852]  [<ffffffff810b7faa>] generic_file_aio_write+0x5a/0xa8
> [   52.884858]  [<ffffffff8115ab2a>] ext4_file_write+0x8c/0x96
> [   52.884864]  [<ffffffff810e99ce>] do_sync_write+0xc6/0x103
> [   52.884871]  [<ffffffff810eac6d>] ? rcu_read_lock+0x0/0x35
> [   52.884878]  [<ffffffff811c17db>] ? selinux_file_permission+0x57/0xaf
> [   52.884885]  [<ffffffff811bb085>] ? security_file_permission+0x11/0x13
> [   52.884893]  [<ffffffff810e9f33>] vfs_write+0xa9/0x106
> [   52.884898]  [<ffffffff810ea046>] sys_write+0x45/0x69
> [   52.884905]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

Ditto!

> [   85.939528] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   85.939531] ---------------------------------------------------
> [   85.939535] include/net/inet_timewait_sock.h:227 invoked
> rcu_dereference_check() without protection!
> [   85.939539]
> [   85.939540] other info that might help us debug this:
> [   85.939541]
> [   85.939544]
> [   85.939545] rcu_scheduler_active = 1, debug_locks = 1
> [   85.939549] 2 locks held by gwibber-service/4798:
> [   85.939552]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff811034b2>]
> seq_read+0x37/0x381
> [   85.939566]  #1:  (&(&hashinfo->ehash_locks[i])->rlock){+.-...},
> at: [<ffffffff81386355>] established_get_next+0xc4/0x132
> [   85.939579]
> [   85.939580] stack backtrace:
> [   85.939585] Pid: 4798, comm: gwibber-service Not tainted 2.6.34-rc5 #18
> [   85.939588] Call Trace:
> [   85.939598]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> [   85.939604]  [<ffffffff81385018>] twsk_net+0x4f/0x57
> [   85.939610]  [<ffffffff813862e5>] established_get_next+0x54/0x132
> [   85.939615]  [<ffffffff813864c7>] tcp_seq_next+0x5d/0x6a
> [   85.939621]  [<ffffffff81103701>] seq_read+0x286/0x381
> [   85.939627]  [<ffffffff8110347b>] ? seq_read+0x0/0x381
> [   85.939633]  [<ffffffff81133240>] proc_reg_read+0x8d/0xac
> [   85.939640]  [<ffffffff810ea110>] vfs_read+0xa6/0x103
> [   85.939645]  [<ffffffff810ea223>] sys_read+0x45/0x69
> [   85.939652]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

This one appears to be a case of missing rcu_read_lock(), but it is
not clear to me at what level it needs to go.

Eric, any enlightenment on this one and the next one?

> [   87.296366] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   87.296369] ---------------------------------------------------
> [   87.296373] include/net/inet_timewait_sock.h:227 invoked
> rcu_dereference_check() without protection!
> [   87.296377]
> [   87.296377] other info that might help us debug this:
> [   87.296379]
> [   87.296382]
> [   87.296383] rcu_scheduler_active = 1, debug_locks = 1
> [   87.296386] no locks held by gwibber-service/4803.
> [   87.296389]
> [   87.296390] stack backtrace:
> [   87.296395] Pid: 4803, comm: gwibber-service Not tainted 2.6.34-rc5 #18
> [   87.296398] Call Trace:
> [   87.296411]  [<ffffffff81067fc2>] lockdep_rcu_dereference+0x9d/0xa5
> [   87.296419]  [<ffffffff813733d3>] twsk_net+0x4f/0x57
> [   87.296424]  [<ffffffff813737f3>] __inet_twsk_hashdance+0x50/0x158
> [   87.296431]  [<ffffffff81389239>] tcp_time_wait+0x1c1/0x24b
> [   87.296437]  [<ffffffff8137c417>] tcp_fin+0x83/0x162
> [   87.296443]  [<ffffffff8137cda7>] tcp_data_queue+0x1ff/0xa1e
> [   87.296450]  [<ffffffff810495c6>] ? mod_timer+0x1e/0x20
> [   87.296456]  [<ffffffff813809e3>] tcp_rcv_state_process+0x89d/0x8f2
> [   87.296463]  [<ffffffff8133ca0b>] ? release_sock+0x30/0x10b
> [   87.296468]  [<ffffffff81386df2>] tcp_v4_do_rcv+0x2de/0x33f
> [   87.296475]  [<ffffffff8133ca5d>] release_sock+0x82/0x10b
> [   87.296481]  [<ffffffff81376ef5>] tcp_close+0x1b5/0x37e
> [   87.296487]  [<ffffffff81395437>] inet_release+0x50/0x57
> [   87.296493]  [<ffffffff8133a134>] sock_release+0x1a/0x66
> [   87.296498]  [<ffffffff8133a1a2>] sock_close+0x22/0x26
> [   87.296505]  [<ffffffff810eb003>] __fput+0x120/0x1cd
> [   87.296510]  [<ffffffff810eb0c5>] fput+0x15/0x17
> [   87.296516]  [<ffffffff810e7f3d>] filp_close+0x63/0x6d
> [   87.296521]  [<ffffffff810e801e>] sys_close+0xd7/0x111
> [   87.296528]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

commit d3b8ba1bde9afb7d50cf0712f9d95317ea66c06f
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Wed Apr 21 14:04:56 2010 -0700

    sched: protect __sched_setscheduler() access to cgroups
    
    A given task's cgroups structures must remain while that task is running
    due to reference counting, so this is presumably a false positive.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/sched.c b/kernel/sched.c
index 14c44ec..1d43c1a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4575,9 +4575,11 @@ recheck:
 		 * Do not allow realtime tasks into groups that have no runtime
 		 * assigned.
 		 */
+		rcu_read_lock();
 		if (rt_bandwidth_enabled() && rt_policy(policy) &&
 				task_group(p)->rt_bandwidth.rt_runtime == 0)
 			return -EPERM;
+		rcu_read_unlock();
 #endif
 
 		retval = security_task_setscheduler(p, policy, param);

^ permalink raw reply related

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-21 21:34 UTC (permalink / raw)
  To: Hideaki YOSHIFUJI, Herbert Xu; +Cc: netdev, David Miller, Stephen Hemminger
In-Reply-To: <20100420174401.GB1334@midget.suse.cz>

Hi,

On Tue, Apr 20, 2010 at 07:44:01PM +0200, Jiri Bohac wrote:
> What is the reason __ipv6_ifa_notify() calls dst_free() when
> ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> with the dst still needing to be freed.

checked again and I still think that if ip6_del_rt() fails,
ifp->rt must have been freed already. Anybody with a
counterexample?

> I am just testing whether the following will help:
> 
> --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
>  			addrconf_leave_anycast(ifp);
>  		addrconf_leave_solict(ifp->idev, &ifp->addr);
>  		dst_hold(&ifp->rt->u.dst);
> -		if (ip6_del_rt(ifp->rt))
> -			dst_free(&ifp->rt->u.dst);
> +		ip6_del_rt(ifp->rt);
>  		break;
>  	}
>  }

not surprisingly, it helps my case -- the race condition does not
happen and the resulting oopses disappears. Of course, this does
not prove the patch is correct.

Could anybody with more insight into the dst refcounting please
take a look? The code originally came from:

	Author: Hideaki Yoshifuji <yoshfuji@linux-ipv6.org>
	Date:   Wed Aug 18 05:25:16 2004 +0900

	    [IPV6] refer inet6 device via corresponding local route from address structure.

And has been modified later by:
	commit 4641e7a334adf6856300a98e7296dfc886c446af
	Author: Herbert Xu <herbert@gondor.apana.org.au>
	Date:   Thu Feb 2 16:55:45 2006 -0800

	    [IPV6]: Don't hold extra ref count in ipv6_ifa_notify

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply

* Re: [PATCH] gianfar: Wait for both RX and TX to stop
From: David Miller @ 2010-04-21 21:33 UTC (permalink / raw)
  To: galak; +Cc: timur.tabi, afleming, netdev
In-Reply-To: <CD8C73BA-0E18-4F85-A90B-C2FA5FFC8689@kernel.crashing.org>

From: Kumar Gala <galak@kernel.crashing.org>
Date: Wed, 21 Apr 2010 14:13:00 -0500

> I'm not opposed, I'm just asking if we are saying we shouldn't be using cpu_relax() for spinning on HW status registers ever.
> 
> If we are suggesting that cpu_relax() shouldn't be used in these scenarios going forward I'm ok w/the change you suggest and starting to convert other cpu_relax() calls to use spin_event_timeout()


Kumar this isn't an either-or thing.

In both cases we're using cpu_relax().

But by using spin_event_timeout() you're getting both the cpu_relax()
and a break-out in case the hardware gets wedged for some reason.

^ permalink raw reply

* Re: ipv6: Fix tcp_v6_send_response checksum
From: David Miller @ 2010-04-21 21:28 UTC (permalink / raw)
  To: herbert; +Cc: netdev, cratiu
In-Reply-To: <20100421130920.GA31300@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 21 Apr 2010 21:09:20 +0800

> On Wed, Apr 21, 2010 at 01:58:23AM -0700, David Miller wrote:
>> 
>> If we're going to use CHECKSUM_PARTIAL for these things (which we are
>> since commit 2e8e18ef52e7dd1af0a3bd1f7d990a1d0b249586 "tcp: Set
>> CHECKSUM_UNNECESSARY in tcp_init_nondata_skb"), we can't be setting
>> buff->csum as we always have been here in tcp_v6_send_response.  We
>> need to leave it at zero.
>> 
>> Kill that line and checksums are good again.
>> 
>> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> That line is needed for the CHECKSUM_NONE case.  In fact, if we're
> in the CHECKSUM_PARTIAL case, then that buff->csum calculation
> should be a noop because it immediately gets overwritten when we
> subsequently set csum_start/csum_offset.

Ok, but the checksums were broken.  Can you see why?

I think we changed semantics here when we did this:

-	t1->check = csum_ipv6_magic(&fl.fl6_src, &fl.fl6_dst,
-				    tot_len, IPPROTO_TCP,
-				    buff->csum);
+	__tcp_v6_send_check(buff, &fl.fl6_src, &fl.fl6_dst);

which changes what ->csum needs to be.  It used to need to be computed
over the header as well as any data afterwards, but now it has to
cover only data.  Because now it evaluates to:

		th->check = tcp_v6_check(skb->len, saddr, daddr,
					 csum_partial(th, th->doff << 2,
						      skb->csum));

See?  We were computing the checksum over the TCP header twice now :-)
That's why my patch fixed things for dataless packets, whose
->csum would evaluate to zero.

So to make things work fully as-is, we would need to compute ->csum
over the post-header data only.

But an even easier change is to just be consistent with the rest
of the TCP packet generation code and use CHECKSUM_PARTIAL here.
And that's how I'll fix this.

What confused me here last night is the fact that CHECKSUM_NONE
is zero and the implicit zero initialization of new SKBs givs us
that :-)

Anyways, thanks for pointing this out!

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 21:22 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw
In-Reply-To: <201004201548.26609.arnd@arndb.de>

On Tuesday 20 April 2010, Arnd Bergmann wrote:
> > + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING)
> > + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING)
> 
> As mentioned above, why not drop one of these, and just pass the VF's IFNAME?
> 

Coming back to this point, I now think it would be ideal if we could actually
leave out IOV_ATTR_VF_IFNAME and just pass the master IFNAME and the slave
MAC address. Since we're not actually doing anything with the slave itself
but really talking the switch, it should not be needed at all.

That would solve all problems with the slave having moved to another namespace
already, and make it totally clear that this is not about configuring the
slave but about registering it.

Scott, would that still work with your driver?

	Arnd

^ permalink raw reply

* Re: [net-next,1/2] add iovnl netlink support
From: Arnd Bergmann @ 2010-04-21 21:13 UTC (permalink / raw)
  To: Scott Feldman; +Cc: Chris Wright, davem, netdev
In-Reply-To: <C7F4AD4F.2A34D%scofeldm@cisco.com>

On Wednesday 21 April 2010, Scott Feldman wrote:
> On 4/21/10 12:39 PM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> 
> >>> 1. Setting up the slave device
> >>>  a) create an SR-IOV VF to assign to a guest
> >>>  b) create a macvtap device to pass to qemu or vhost
> >>>  c) attach a tap device to a bridge
> >>>  d) create a macvlan device and put it into a container
> >>>  e) create a virtual interface for a VMDq adapter
> >> 
> >> OK, but iovnl isn't doing this.
> > 
> > The set_mac_vlan that Scott's patch adds seems to implement 1a), as far
> > as I can tell. Interestingly, this is not actually implemented in
> > the enic driver in patch 2/2. So if we all agree that this is out of the
> > scope of iovnl, let's just remove it from the interface and find another
> > way for it (ethtool, iplink, ..., as listed above).
> 
> You're right, not needed for enic since mac addr is included with
> port-profile push and vlan membership is implied by port-profile.  So I put
> set_mac_vlan in there basically to elicit feedback.

Ok. Two points though:

- when you say that the mac address is included in the port-profile push,
  does that imply that the VF does not have a mac address prior to this?
  This would again mix the NIC configuration phase with the switch
  association, which I think we really need to avoid if we want to be
  able to implement the association in user space!

- The VLAN ID being implied in the port profile seems to be another
  difference between what enic is doing and the current draft VDP
  that will eventually become 802.1Qbg, and I fear that this difference
  will be visible in the iovnl protocol.

> There really wouldn't be much different between iplink and iovnl since
> they're both rtnetlink...seems we should keep IOV-related APIs in one place.
> Maybe there are other IOV APIs to add to iovnl in the future like:
> 
>     vf <- add_vf(pf)
>     del_vf(pf, vf)
> 
> Ethtool doesn't seem the right place for this.

Right. My preference would probably be make these a subcategory of
the if_link, and use the existing RTM_NEWLINK/RTM_DELLINK commands.
This would make it resemble the existing interfaces and mean you can
use

ip link add link eth0 type macvlan    # for a container
ip link add link eth0 type macvtap    # for qemu/vhost
ip link add link eth0 type vf         # for device assignment

There are obviously significant differences between these three, but
they also share enough of their properties to let us treat them
in similar ways.

If we integrate the iovnl client into iproute2, the sequence for setting
up an enic VF and associating it to the port profile could be

# create vf0, pass mac and vlan id to HW, no association yet
ip link add link eth0 name vf0 type vf mac fe:dc:ba:12:34:56 vlan 78

# associate vf with port profile, mac address must match the one assigned
#  to the interface before.
ip iov assoc eth0 port-profile "general" host-uuid "dcf2a873-f5ee-41dd-a7ad-802a544e48c2" \
	 mac fe:dc:ba:12:34:56

	Arnd

^ permalink raw reply

* [PATCH net-next-2.6] rps: immediate send IPI in process_backlog()
From: Eric Dumazet @ 2010-04-21 21:04 UTC (permalink / raw)
  To: David Miller; +Cc: Tom Herbert, Changli Gao, netdev

If some skb are queued to our backlog, we are delaying IPI sending at
the end of net_rx_action(), increasing latencies. This defeats the
queueing, since we want to quickly dispatch packets to the pool of
worker cpus, then eventually deeply process our packets.

It's better to send IPI before processing our packets in upper layers,
from process_backlog().

Change the _and_disable_irq suffix to _and_enable_irq(), since we enable
local irq in net_rps_action(), sorry for the confusion.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

 net/core/dev.c |   76 +++++++++++++++++++++++++----------------------
 1 file changed, 42 insertions(+), 34 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b31d5d6..9871660 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3239,11 +3239,48 @@ gro_result_t napi_gro_frags(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(napi_gro_frags);
 
+/*
+ * net_rps_action sends any pending IPI's for rps.
+ * Note: called with local irq disabled, but exits with local irq enabled.
+ */
+static void net_rps_action_and_irq_enable(struct softnet_data *sd)
+{
+#ifdef CONFIG_RPS
+	struct softnet_data *remsd = sd->rps_ipi_list;
+
+	if (remsd) {
+		sd->rps_ipi_list = NULL;
+
+		local_irq_enable();
+
+		/* Send pending IPI's to kick RPS processing on remote cpus. */
+		while (remsd) {
+			struct softnet_data *next = remsd->rps_ipi_next;
+
+			if (cpu_online(remsd->cpu))
+				__smp_call_function_single(remsd->cpu,
+							   &remsd->csd, 0);
+			remsd = next;
+		}
+	} else
+#endif
+		local_irq_enable();
+}
+
 static int process_backlog(struct napi_struct *napi, int quota)
 {
 	int work = 0;
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
 
+#ifdef CONFIG_RPS
+	/* Check if we have pending ipi, its better to send them now,
+	 * not waiting net_rx_action() end.
+	 */
+	if (sd->rps_ipi_list) {
+		local_irq_disable();
+		net_rps_action_and_irq_enable(sd);
+	}
+#endif
 	napi->weight = weight_p;
 	do {
 		struct sk_buff *skb;
@@ -3350,45 +3387,16 @@ void netif_napi_del(struct napi_struct *napi)
 }
 EXPORT_SYMBOL(netif_napi_del);
 
-/*
- * net_rps_action sends any pending IPI's for rps.
- * Note: called with local irq disabled, but exits with local irq enabled.
- */
-static void net_rps_action_and_irq_disable(void)
-{
-#ifdef CONFIG_RPS
-	struct softnet_data *sd = &__get_cpu_var(softnet_data);
-	struct softnet_data *remsd = sd->rps_ipi_list;
-
-	if (remsd) {
-		sd->rps_ipi_list = NULL;
-
-		local_irq_enable();
-
-		/* Send pending IPI's to kick RPS processing on remote cpus. */
-		while (remsd) {
-			struct softnet_data *next = remsd->rps_ipi_next;
-
-			if (cpu_online(remsd->cpu))
-				__smp_call_function_single(remsd->cpu,
-							   &remsd->csd, 0);
-			remsd = next;
-		}
-	} else
-#endif
-		local_irq_enable();
-}
-
 static void net_rx_action(struct softirq_action *h)
 {
-	struct list_head *list = &__get_cpu_var(softnet_data).poll_list;
+	struct softnet_data *sd = &__get_cpu_var(softnet_data);
 	unsigned long time_limit = jiffies + 2;
 	int budget = netdev_budget;
 	void *have;
 
 	local_irq_disable();
 
-	while (!list_empty(list)) {
+	while (!list_empty(&sd->poll_list)) {
 		struct napi_struct *n;
 		int work, weight;
 
@@ -3406,7 +3414,7 @@ static void net_rx_action(struct softirq_action *h)
 		 * entries to the tail of this list, and only ->poll()
 		 * calls can remove this head entry from the list.
 		 */
-		n = list_first_entry(list, struct napi_struct, poll_list);
+		n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);
 
 		have = netpoll_poll_lock(n);
 
@@ -3441,13 +3449,13 @@ static void net_rx_action(struct softirq_action *h)
 				napi_complete(n);
 				local_irq_disable();
 			} else
-				list_move_tail(&n->poll_list, list);
+				list_move_tail(&n->poll_list, &sd->poll_list);
 		}
 
 		netpoll_poll_unlock(have);
 	}
 out:
-	net_rps_action_and_irq_disable();
+	net_rps_action_and_irq_enable(sd);
 
 #ifdef CONFIG_NET_DMA
 	/*



^ permalink raw reply related

* Re: [net-next,1/2] add iovnl netlink support
From: Scott Feldman @ 2010-04-21 20:25 UTC (permalink / raw)
  To: Arnd Bergmann, Chris Wright; +Cc: davem, netdev
In-Reply-To: <201004212139.22421.arnd@arndb.de>

On 4/21/10 12:39 PM, "Arnd Bergmann" <arnd@arndb.de> wrote:

>>> 1. Setting up the slave device
>>>  a) create an SR-IOV VF to assign to a guest
>>>  b) create a macvtap device to pass to qemu or vhost
>>>  c) attach a tap device to a bridge
>>>  d) create a macvlan device and put it into a container
>>>  e) create a virtual interface for a VMDq adapter
>> 
>> OK, but iovnl isn't doing this.
> 
> The set_mac_vlan that Scott's patch adds seems to implement 1a), as far
> as I can tell. Interestingly, this is not actually implemented in
> the enic driver in patch 2/2. So if we all agree that this is out of the
> scope of iovnl, let's just remove it from the interface and find another
> way for it (ethtool, iplink, ..., as listed above).

You're right, not needed for enic since mac addr is included with
port-profile push and vlan membership is implied by port-profile.  So I put
set_mac_vlan in there basically to elicit feedback.

There really wouldn't be much different between iplink and iovnl since
they're both rtnetlink...seems we should keep IOV-related APIs in one place.
Maybe there are other IOV APIs to add to iovnl in the future like:

    vf <- add_vf(pf)
    del_vf(pf, vf)

Ethtool doesn't seem the right place for this.

-scott


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox