Netdev List
 help / color / mirror / Atom feed
* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-06-08  8:36 UTC (permalink / raw)
  To: David Miller
  Cc: andy, fubar, fbl, linux-kernel, mpm, netdev, bridge, gospo,
	nhorman, jmoyer, shemminger, bonding-devel
In-Reply-To: <20100607.030108.235696592.davem@davemloft.net>

On 06/07/10 18:01, David Miller wrote:
> From: Cong Wang<amwang@redhat.com>
> Date: Mon, 07 Jun 2010 17:57:49 +0800
>
>> Hmm, I still feel like this way is ugly, although it may work.
>> I guess David doesn't like it either.
>
> Of course I don't like it. :-)
>
> I suspect the locking scheme will need to be changed.
>
> Besides, if we're going to hack this up and do write lock attempts in
> the read locking paths, there is no point in using a rwlock any more.
> And I'm personally in disfavor of all rwlock usage anyways (it dirties
> the cacheline for readers just as equally for writers, and if the
> critically protected code path is short enough, that shared cache
> line atomic operation will be the predominant cost).
>
> So I'd say, 1) make this a spinlock and 2) try to use RCU for the
> read path.
>
> That would fix everything.

Yeah, agreed. Even not talking about netconsole, bonding code
does have locking problems, netconsole just makes this problem
clear.

I will try your suggestions above.

Thanks!

^ permalink raw reply

* [PATCH net-next-2.6] icmp: RCU conversion in icmp_address_reply()
From: Eric Dumazet @ 2010-06-08  8:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

- rcu_read_lock() already held by caller
- use __in_dev_get_rcu() instead of in_dev_get() / in_dev_put()
- remove goto out;

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/icmp.c |   12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index d65e921..bdb6c71 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -925,6 +925,7 @@ static void icmp_address(struct sk_buff *skb)
 /*
  * RFC1812 (4.3.3.9).	A router SHOULD listen all replies, and complain
  *			loudly if an inconsistency is found.
+ * called with rcu_read_lock()
  */
 
 static void icmp_address_reply(struct sk_buff *skb)
@@ -935,12 +936,12 @@ static void icmp_address_reply(struct sk_buff *skb)
 	struct in_ifaddr *ifa;
 
 	if (skb->len < 4 || !(rt->rt_flags&RTCF_DIRECTSRC))
-		goto out;
+		return;
 
-	in_dev = in_dev_get(dev);
+	in_dev = __in_dev_get_rcu(dev);
 	if (!in_dev)
-		goto out;
-	rcu_read_lock();
+		return;
+
 	if (in_dev->ifa_list &&
 	    IN_DEV_LOG_MARTIANS(in_dev) &&
 	    IN_DEV_FORWARD(in_dev)) {
@@ -958,9 +959,6 @@ static void icmp_address_reply(struct sk_buff *skb)
 			       mp, dev->name, &rt->rt_src);
 		}
 	}
-	rcu_read_unlock();
-	in_dev_put(in_dev);
-out:;
 }
 
 static void icmp_discard(struct sk_buff *skb)



^ permalink raw reply related

* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-06-08  8:38 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Jay Vosburgh, Flavio Leitner, linux-kernel, Matt Mackall, netdev,
	bridge, Andy Gospodarek, Neil Horman, Jeff Moyer,
	Stephen Hemminger, bonding-devel, David Miller
In-Reply-To: <20100607130357.GN7497@gospo.rdu.redhat.com>

On 06/07/10 21:03, Andy Gospodarek wrote:
> On Mon, Jun 07, 2010 at 05:57:49PM +0800, Cong Wang wrote:
>> On 06/05/10 03:18, Andy Gospodarek wrote:
>>> On Wed, Jun 02, 2010 at 06:04:45PM +0800, Cong Wang wrote:
>>>> On 06/02/10 02:42, Jay Vosburgh wrote:
>>>>> Cong Wang<amwang@redhat.com>    wrote:
>>>>>
>>>>>> On 06/01/10 03:08, Flavio Leitner wrote:
>>>>>>> On Mon, May 31, 2010 at 01:56:52PM +0800, Cong Wang wrote:
>>>>>>>> Hi, Flavio,
>>>>>>>>
>>>>>>>> Please use the attached patch instead, try to see if it solves
>>>>>>>> all your problems.
>>>>>>>
>>>>>>> I tried and it hangs. No backtraces this time.
>>>>>>> The bond_change_active_slave() prints before NETDEV_BONDING_FAILOVER
>>>>>>> notification, so I think it won't work.
>>>>>>
>>>>>> Ah, I thought the same.
>>>>>>
>>>>>>>
>>>>>>> Please, correct if I'm wrong, but when a failover happens with your
>>>>>>> patch applied, the netconsole would be disabled forever even with
>>>>>>> another healthy slave, right?
>>>>>>>
>>>>>>
>>>>>> Yes, this is an easy solution, because bonding has several modes,
>>>>>> it is complex to make netpoll work in different modes.
>>>>>
>>>>> 	If I understand correctly, the root cause of the problem with
>>>>> netconsole and bonding is that bonding is, ultimately, performing
>>>>> printks with a write lock held, and when netconsole recursively calls
>>>>> into bonding to send the printk over the netconsole, there is a deadlock
>>>>> (when the bonding xmit function attempts to acquire the same lock for
>>>>> read).
>>>>
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>> 	You're trying to avoid the deadlock by shutting off netconsole
>>>>> (permanently, it looks like) for one problem case: a failover, which
>>>>> does some printks with a write lock held.
>>>>>
>>>>> 	This doesn't look to me like a complete solution, there are
>>>>> other cases in bonding that will do printk with write locks held.  I
>>>>> suspect those will also hang netconsole as things exist today, and won't
>>>>> be affected by your patch below.
>>>>
>>>>
>>>> I can expect that, bonding modes are complex.
>>>>
>>>>>
>>>>> 	For example:
>>>>>
>>>>> 	The sysfs functions to set the primary (bonding_store_primary)
>>>>> or active (bonding_store_active_slave) options: a pr_info is called to
>>>>> provide a log message of the results.  These could be tested by setting
>>>>> the primary or active options via sysfs, e.g.,
>>>>>
>>>>> echo eth0>    /sys/class/net/bond0/bonding/primary
>>>>> echo eth0>    /sys/class/net/bond0/bonding/active
>>>>>
>>>>> 	If the kernel is defined with DEBUG, there are a few pr_debug
>>>>> calls within write_locks (bond_del_vlan, for example).
>>>>>
>>>>> 	If the slave's underlying device driver's ndo_vlan_rx_register
>>>>> or ndo_vlan_rx_kill_vid functions call printk (and it looks like some do
>>>>> for error cases, e.g., igbvf, ehea, enic), those would also presumably
>>>>> deadlock (because bonding holds its write_lock when calling the ndo_
>>>>> vlan functions).
>>>>>
>>>>> 	It also appears that (with the patch below) some nominally
>>>>> normal usage patterns will immediately disable netconsole.  The one that
>>>>> comes to mind is if the primary= option is set (to "eth1" for this
>>>>> example), but that slave not enslaved first (the slaves are added, say,
>>>>> eth0 then eth1).  In that situation, when the primary slave (eth1 here)
>>>>> is added, the first thing that will happen is a failover, and that will
>>>>> disable netconsole.
>>>>>
>>>>
>>>> Thanks for your detailed explanation!
>>>>
>>>> This is why I said bonding is complex. I guess we would have to adjust
>>>> netpoll code for different bonding cases, one solution seems not fix all.
>>>> I am not sure how much work to do, since I am not familiar with bonding
>>>> code. Maybe Andy can help?
>>>>
>>>
>>> Sorry I've been silent until now.  This does seem quite similar to a
>>> problem I've previously encountered when dealing with bonding+netpoll on
>>> some old 2.6.9-based kernels.  There is no guarantee the methods used
>>> there will apply here, but I'll talk about them anyway.
>>>
>>> As Flavio noticed, recursive calls into the bond transmit routines were
>>> not a good idea.  I discovered the same and worked around this issue by
>>> checking to see if we could take the bond->lock for writing before
>>> continuing.  If we could not get, I wanted to signal that this should be
>>> queued for transmission later.  Based on the flow of netpoll_send_skb
>>> (or possibly for another reason that is escaping me right now) I added
>>> one of these checks in bond_poll_controller too.  These aren't the
>>> prettiest fixes, but seemed to work well for me when I did this work in
>>> the past.  I realize the differences are not that great compared to some
>>> of the patches posted by Flavio, but I think they are worth trying.
>>
>>
>> Hmm, I still feel like this way is ugly, although it may work.
>> I guess David doesn't like it either.
>>
>
> Notice how I referred to it as a work-around? :)
>
> It certainly isn't a great way to resolve the issue, but I wanted to
> offer my opinon on the issue since you asked.

Sorry for my misunderstanding.

^ permalink raw reply

* Re: 2.6.35-rc2-git1 - include/linux/cgroup.h:534 invoked rcu_dereference_check() without protection!
From: Peter Zijlstra @ 2010-06-08  8:40 UTC (permalink / raw)
  To: Miles Lane
  Cc: paulmck, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	LKML, nauman, eric.dumazet, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg
In-Reply-To: <AANLkTikl13cJNuU32JzRarrUnoh5DYHr-Tbnv9B-UjsG@mail.gmail.com>

On Tue, 2010-06-08 at 00:16 -0400, Miles Lane wrote:
> On Mon, Jun 7, 2010 at 8:19 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Jun 07, 2010 at 02:14:25PM -0400, Miles Lane wrote:
> >> Hi All,
> >>
> >> I just reproduced a warning I reported quite a while ago.  Is a patch
> >> for this in the pipeline?
> >
> > I proposed a patch, thinking that it was a false positive.  Peter Zijlstra
> > pointed out that there was a real race, and proposed an alternative patch,
> > which may be found at http://lkml.org/lkml/2010/4/22/603.
> >
> > Could you please test Peter's patch and let us know if it cures the problem?
> >

Gah, this task_group() stuff is annoying, how about something like the
below which teaches task_group() about the task_rq()->lock rule?

---
 include/linux/cgroup.h |   20 +++++++++++----
 kernel/sched.c         |   61 +++++++++++++++++++++++++----------------------
 2 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0c62160..1efd212 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -525,13 +525,21 @@ static inline struct cgroup_subsys_state *cgroup_subsys_state(
 	return cgrp->subsys[subsys_id];
 }
 
-static inline struct cgroup_subsys_state *task_subsys_state(
-	struct task_struct *task, int subsys_id)
+/*
+ * function to get the cgroup_subsys_state which allows for extra
+ * rcu_dereference_check() conditions, such as locks used during the
+ * cgroup_subsys::attach() methods.
+ */
+#define task_subsys_state_check(task, subsys_id, __c) 			\
+	rcu_dereference_check(task->cgroups->subsys[subsys_id],		\
+			      rcu_read_lock_held() ||			\
+			      lockdep_is_held(&task->alloc_lock) || 	\
+			      cgroup_lock_is_held() || (__c))
+
+static inline struct cgroup_subsys_state *
+task_subsys_state(struct task_struct *task, int subsys_id)
 {
-	return rcu_dereference_check(task->cgroups->subsys[subsys_id],
-				     rcu_read_lock_held() ||
-				     lockdep_is_held(&task->alloc_lock) ||
-				     cgroup_lock_is_held());
+	return task_subsys_state_check(task, subsys_id, false);
 }
 
 static inline struct cgroup* task_cgroup(struct task_struct *task,
diff --git a/kernel/sched.c b/kernel/sched.c
index f8b8996..e01bb45 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -306,32 +306,26 @@ static int init_task_group_load = INIT_TASK_GROUP_LOAD;
  */
 struct task_group init_task_group;
 
-/* return group to which a task belongs */
+/*
+ * Return the group to which this tasks belongs.
+ *
+ * We use task_subsys_state_check() and extend the RCU verification
+ * with lockdep_is_held(&task_rq(p)->lock) because cpu_cgroup_attach()
+ * holds that lock for each task it moves into the cgroup. Therefore
+ * by holding that lock, we pin the task to the current cgroup.
+ */
 static inline struct task_group *task_group(struct task_struct *p)
 {
-	struct task_group *tg;
+	struct cgroup_subsys_state *css;
 
-#ifdef CONFIG_CGROUP_SCHED
-	tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
-				struct task_group, css);
-#else
-	tg = &init_task_group;
-#endif
-	return tg;
+	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
+			lockdep_is_held(&task_rq(p)->lock));
+	return container_of(css, struct task_group, css);
 }
 
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 {
-	/*
-	 * Strictly speaking this rcu_read_lock() is not needed since the
-	 * task_group is tied to the cgroup, which in turn can never go away
-	 * as long as there are tasks attached to it.
-	 *
-	 * However since task_group() uses task_subsys_state() which is an
-	 * rcu_dereference() user, this quiets CONFIG_PROVE_RCU.
-	 */
-	rcu_read_lock();
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
 	p->se.parent = task_group(p)->se[cpu];
@@ -341,7 +335,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 	p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
 	p->rt.parent = task_group(p)->rt_se[cpu];
 #endif
-	rcu_read_unlock();
 }
 
 #else
@@ -4465,16 +4458,6 @@ recheck:
 	}
 
 	if (user) {
-#ifdef CONFIG_RT_GROUP_SCHED
-		/*
-		 * Do not allow realtime tasks into groups that have no runtime
-		 * assigned.
-		 */
-		if (rt_bandwidth_enabled() && rt_policy(policy) &&
-				task_group(p)->rt_bandwidth.rt_runtime == 0)
-			return -EPERM;
-#endif
-
 		retval = security_task_setscheduler(p, policy, param);
 		if (retval)
 			return retval;
@@ -4490,6 +4473,26 @@ recheck:
 	 * runqueue lock must be held.
 	 */
 	rq = __task_rq_lock(p);
+
+	retval = 0;
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (user) {
+		/*
+		 * Do not allow realtime tasks into groups that have no runtime
+		 * assigned.
+		 */
+		if (rt_bandwidth_enabled() && rt_policy(policy) &&
+				task_group(p)->rt_bandwidth.rt_runtime == 0)
+			retval = -EPERM;
+
+		if (retval) {
+			__task_rq_unlock(rq);
+			raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+			return retval;
+		}
+	}
+#endif
+
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;

^ permalink raw reply related

* Re: [PATCH] netconsole: queue console messages to send later
From: Cong Wang @ 2010-06-08  8:59 UTC (permalink / raw)
  To: Flavio Leitner
  Cc: David Miller, netdev, fubar, mpm, gospo, nhorman, jmoyer,
	shemminger, linux-kernel, bridge, bonding-devel
In-Reply-To: <20100608003707.GA30604@sysclose.org>


Thanks for your fix, Flavio!

On 06/08/10 08:37, Flavio Leitner wrote:
>> There may not be another timer or workqueue able to execute after the
>> printk() we're trying to emit.  We may never get to that point.
>
> What if in the netpoll, before we push the skb to the driver, we check
> for a bit saying that it's already pushing another skb. In this case,
> queue the new skb inside of netpoll and soon as the first call returns
> and try to clear the bit, it will send the next skb?
>
> printk("message 1")
> ...
> netconsole called
> netpoll sets the flag bit
> pushes to the bonding driver which does another printk("message 2")
> netconsole called again
> netpoll checks for the flag, queue the message, returns.
> so, bonding can finish up to send the first message
> netpoll is about to return, checks for new queued messages, and pushes them.
> bonding finishes up to send the second message
> ....
>
> No deadlocks, skbs are ordered and still under the same opportunity
> to send something. Does it sound acceptable?
> It's off the top of my head, so probably this idea has some problems.
>


I am not a net expert, I am not sure if this solution really addresses
David's concern, but it makes sense for me.

>
>> Fix the locking in the drivers or layers that cause the issue instead
>> of breaking netconsole.
>
> Someday, somewhere, I know because I did this before, someone will
> use a debugging printk() and will see the entire box hanging with
> absolutely no message in any console because of this problem.
> I'm not saying that fixing driver isn't the right way to go but
> it seems not enough to me.

Well, I think netconsole is not alone, other console drivers could
have the same problem, printk() is not always available in some
situation like this.

Thanks.

^ permalink raw reply

* RE: [PATCH 11/12] sfc: Only count bad packets in rx_errors
From: Dmitry Kravkov @ 2010-06-08  9:20 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: Ben Hutchings
In-Reply-To: <1275427265.2114.35.camel@achroite.uk.solarflarecom.com>

Hi,

Different drivers include different errors in stats->rx_errors:

  bxn2	  bnx2x	  ixgbx	  sfc		  dnet	  atlx
----------------------------------------------------------------------
rx_length	rx_length	rx_length	rx_length	rx_length	rx_length
rx_over	rx_over							rx_over
rx_frame	rx_frame			rx_frame	rx_frame	rx_frame
rx_crc	rx_crc	rx_crc	rx_crc			rx_crc
		rx_fifo							rx_fifo
		rx_missed							rx_missed
						rx_symbol
								unsup_opcd
----------------------------------------------------------------------
Is there any documentation which defines what should this statistic include?

Thanks
Dmitry


-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Ben Hutchings
Sent: Wednesday, June 02, 2010 12:21 AM
To: David Miller
Cc: netdev@vger.kernel.org; linux-net-drivers@solarflare.com
Subject: [PATCH 11/12] sfc: Only count bad packets in rx_errors

rx_errors is defined as 'bad packets received', but we are currently
including various overflow errors as well.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/sfc/efx.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index aae3347..26b0cc2 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -1518,11 +1518,8 @@ static struct net_device_stats *efx_net_stats(struct net_device *net_dev)
 	stats->tx_window_errors = mac_stats->tx_late_collision;
 
 	stats->rx_errors = (stats->rx_length_errors +
-			    stats->rx_over_errors +
 			    stats->rx_crc_errors +
 			    stats->rx_frame_errors +
-			    stats->rx_fifo_errors +
-			    stats->rx_missed_errors +
 			    mac_stats->rx_symbol_error);
 	stats->tx_errors = (stats->tx_window_errors +
 			    mac_stats->tx_bad);
-- 
1.6.2.5


-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply related

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-08  9:27 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTilqB2sAyfBTLhYT_a2_TySGay4FFG6SGJUZ1o_4@mail.gmail.com>

On Mon, 2010-06-07 at 14:54 +0200, Kay Sievers wrote:

> > I see you remove the driver. Does this mean that in sysfs these devices
> > wouldn't have a driver symlink? ISTR that we needed that for userspace,
> > but I'm not entirely sure, and I don't have all the relevant userspace
> > in my test setup.
> 
> Yeah, if you need the driver, you need to use bus device, as you do
> now. The driver needs to be registered with the core, and the probe
> function needs to tell to bind to the devices while registered. You
> can not manually assign this.

So I spoke to Jouni and he says that he doesn't remember needing the
driver, so I guess we don't care. Can I get you to submit that patch
properly? Send it to linux-wireless@vger.kernel.org please.

johannes


^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-08  9:30 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275989260.3706.115.camel@jlt3.sipsolutions.net>

On Tue, Jun 8, 2010 at 11:27, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Mon, 2010-06-07 at 14:54 +0200, Kay Sievers wrote:
>
>> > I see you remove the driver. Does this mean that in sysfs these devices
>> > wouldn't have a driver symlink? ISTR that we needed that for userspace,
>> > but I'm not entirely sure, and I don't have all the relevant userspace
>> > in my test setup.
>>
>> Yeah, if you need the driver, you need to use bus device, as you do
>> now. The driver needs to be registered with the core, and the probe
>> function needs to tell to bind to the devices while registered. You
>> can not manually assign this.
>
> So I spoke to Jouni and he says that he doesn't remember needing the
> driver, so I guess we don't care. Can I get you to submit that patch
> properly? Send it to linux-wireless@vger.kernel.org please.

Sure, you mean the patch that changes the hwsim code, or the one I
talked about to properly create a virtual bus parent?

Kay

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-08  9:45 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTinMauFsJTKS3eT-pT44KQ8PSf3HJLS8CZIeHmxP@mail.gmail.com>

On Tue, 2010-06-08 at 11:30 +0200, Kay Sievers wrote:
> On Tue, Jun 8, 2010 at 11:27, Johannes Berg <johannes@sipsolutions.net> wrote:
> > On Mon, 2010-06-07 at 14:54 +0200, Kay Sievers wrote:
> >
> >> > I see you remove the driver. Does this mean that in sysfs these devices
> >> > wouldn't have a driver symlink? ISTR that we needed that for userspace,
> >> > but I'm not entirely sure, and I don't have all the relevant userspace
> >> > in my test setup.
> >>
> >> Yeah, if you need the driver, you need to use bus device, as you do
> >> now. The driver needs to be registered with the core, and the probe
> >> function needs to tell to bind to the devices while registered. You
> >> can not manually assign this.
> >
> > So I spoke to Jouni and he says that he doesn't remember needing the
> > driver, so I guess we don't care. Can I get you to submit that patch
> > properly? Send it to linux-wireless@vger.kernel.org please.
> 
> Sure, you mean the patch that changes the hwsim code, or the one I
> talked about to properly create a virtual bus parent?

The hwsim one you pasted earlier.

johannes


^ permalink raw reply

* Re: [PATCHv2 1/2] net: Enable 64-bit net device statistics on 32-bit architectures
From: Ben Hutchings @ 2010-06-08 11:35 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, arnd, netdev, linux-net-drivers
In-Reply-To: <20100607.211829.63035366.davem@davemloft.net>

On Mon, 2010-06-07 at 21:18 -0700, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 04 Jun 2010 23:04:53 +0100
> 
> > +#if BITS_PER_LONG == 64
> > +#define NET_DEVICE_STATS_DEFINE(name)	u64 name
> > +#elif defined(__LITTLE_ENDIAN)
> > +#define NET_DEVICE_STATS_DEFINE(name)	u32 name, pad_ ## name
> > +#else
> > +#define NET_DEVICE_STATS_DEFINE(name)	u32 pad_ ## name, name
> > +#endif
> > +
>  ...
> > +	NET_DEVICE_STATS_DEFINE(rx_packets);
> > +	NET_DEVICE_STATS_DEFINE(tx_packets);
> > +	NET_DEVICE_STATS_DEFINE(rx_bytes);
>  ...
> >  	static const char fmt[] = "%30s %12lu\n";
> > +	static const char fmt64[] = "%30s %12llu\n";
>  ...
> > +	seq_printf(seq, fmt64, "total frames received", stats->rx_packets);
> > +	seq_printf(seq, fmt64, "total bytes received", stats->rx_bytes);
> > +	seq_printf(seq, fmt64, "Broadcast/Multicast Rcvd", stats->multicast);
> 
> I guess you only built this on a 64-bit platform that defines u64 as a
> long long type.

There was some disussion of formatting u64 / unsigned long long in the
past and I thought the outcome of that was that u64 should always be
defined as unsigned long long.  (See commits fe33332 and 9018113.)

[...]
> And the whole tree needs to be inspected to make sure there isn't going
> to be fallout in areas your patch didn't take care of wrt. printf format
> strings and the like.
> 
> What was always "unsigned long" is now a variable type, therefore using
> a fixed printf format string is impossible unless you always cast these
> things when passed in as printf arguments.

Yes, that's true if there are drivers out there printing members of
net_device_stats.  I admit I haven't checked for that.  (Hmm, might be
time to try Coccinelle.)

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: 2.6.35-rc2-git1 - net/mac80211/sta_info.c:125 invoked rcu_dereference_check() without protection!
From: Miles Lane @ 2010-06-08 11:29 UTC (permalink / raw)
  To: Johannes Berg
  Cc: paulmck, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <1275981969.3706.1.camel@jlt3.sipsolutions.net>

On Tue, Jun 8, 2010 at 3:26 AM, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Mon, 2010-06-07 at 16:59 -0700, Paul E. McKenney wrote:
>> On Mon, Jun 07, 2010 at 02:25:44PM -0400, Miles Lane wrote:
>> > [   43.478812] [ INFO: suspicious rcu_dereference_check() usage. ]
>> > [   43.478815] ---------------------------------------------------
>> > [   43.478820] net/mac80211/sta_info.c:125 invoked
>> > rcu_dereference_check() without protection!
>> > [   43.478824]
>> > [   43.478824] other info that might help us debug this:
>> > [   43.478826]
>> > [   43.478829]
>> > [   43.478830] rcu_scheduler_active = 1, debug_locks = 1
>> > [   43.478834] no locks held by NetworkManager/4017.
>>
>> Hmmm...  Johannes's update has been merged, and it requires that callers
>> either be in an RCU read-side critical section or hold either the
>> ->sta_lock or the ->sta_mtx, and this thread does none of this.
>>
>> Johannes, any thoughts?
>>
>>                                                       Thanx, Paul
>>
>> > [   43.478837] stack backtrace:
>> > [   43.478842] Pid: 4017, comm: NetworkManager Not tainted 2.6.35-rc2-git1 #8
>> > [   43.478846] Call Trace:
>> > [   43.478849]  <IRQ>  [<ffffffff81064e9c>] lockdep_rcu_dereference+0x9d/0xa5
>> > [   43.478876]  [<ffffffffa010cb3c>] sta_info_get_bss+0x71/0x12d [mac80211]
>> > [   43.478889]  [<ffffffffa010cc0d>] ieee80211_find_sta+0x15/0x2f [mac80211]
>> > [   43.478902]  [<ffffffffa019ae16>] iwlagn_tx_queue_reclaim+0xe7/0x1bb [iwlagn]
>
> iwlwifi wasn't using rcu protection here -- already sent a patch fixing
> it. My mistake, I  think. Thanks for checking :)
>
> johannes
>
>

Yes, your patch works for me.

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Kay Sievers @ 2010-06-08 11:55 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275990325.3706.116.camel@jlt3.sipsolutions.net>

On Tue, 2010-06-08 at 11:45 +0200, Johannes Berg wrote:
> On Tue, 2010-06-08 at 11:30 +0200, Kay Sievers wrote:
> > On Tue, Jun 8, 2010 at 11:27, Johannes Berg <johannes@sipsolutions.net> wrote:
> > > On Mon, 2010-06-07 at 14:54 +0200, Kay Sievers wrote:
> > >
> > >> > I see you remove the driver. Does this mean that in sysfs these devices
> > >> > wouldn't have a driver symlink? ISTR that we needed that for userspace,
> > >> > but I'm not entirely sure, and I don't have all the relevant userspace
> > >> > in my test setup.
> > >>
> > >> Yeah, if you need the driver, you need to use bus device, as you do
> > >> now. The driver needs to be registered with the core, and the probe
> > >> function needs to tell to bind to the devices while registered. You
> > >> can not manually assign this.
> > >
> > > So I spoke to Jouni and he says that he doesn't remember needing the
> > > driver, so I guess we don't care. Can I get you to submit that patch
> > > properly? Send it to linux-wireless@vger.kernel.org please.
> > 
> > Sure, you mean the patch that changes the hwsim code, or the one I
> > talked about to properly create a virtual bus parent?
> 
> The hwsim one you pasted earlier.

Ok, this needs testing.

Please let me know, if that works for you so far.

Thanks,
Kay


From: Kay Sievers <kay.sievers@vrfy.org>
Subject: mac80211_hwsim - convert class devices to bus devices

Class devices must never have child devices of a different class,
so use bus devices.

Embed the 'struct device' into mac80211_hwsim_data, to do proper
ressource lifetime management at device release time.

Registers a driver, to binds devices to the bus.

Use a virtual root as the parent in /sys/devices/.

Puts the monitor device under the virtual parent.

This creates the following devices (udevadm monitor):
  /module/mac80211_hwsim (module)
  /bus/mac80211_hwsim (bus)
  /bus/mac80211_hwsim/drivers/mac80211_hwsim (drivers)
  /devices/mac80211_hwsim/hwsim0 (mac80211_hwsim)
  /devices/mac80211_hwsim/hwsim0/ieee80211/phy17 (ieee80211)
  /devices/mac80211_hwsim/hwsim0/ieee80211/phy17/rfkill19 (rfkill)
  /devices/mac80211_hwsim/hwsim0/net/wlan9 (net)
  /devices/mac80211_hwsim/hwsim0/net/wlan9/queues/rx-0 (queues)
  /devices/mac80211_hwsim/hwsim0/net/wlan9/queues/rx-1 (queues)
  /devices/mac80211_hwsim/hwsim0/net/wlan9/queues/rx-2 (queues)
  /devices/mac80211_hwsim/hwsim0/net/wlan9/queues/rx-3 (queues)
  /devices/mac80211_hwsim/hwsim1 (mac80211_hwsim)
  /devices/mac80211_hwsim/hwsim1/ieee80211/phy18 (ieee80211)
  /devices/mac80211_hwsim/hwsim1/ieee80211/phy18/rfkill20 (rfkill)
  /devices/mac80211_hwsim/hwsim1/net/wlan10 (net)
  /devices/mac80211_hwsim/hwsim1/net/wlan10/queues/rx-0 (queues)
  /devices/mac80211_hwsim/hwsim1/net/wlan10/queues/rx-1 (queues)
  /devices/mac80211_hwsim/hwsim1/net/wlan10/queues/rx-2 (queues)
  /devices/mac80211_hwsim/hwsim1/net/wlan10/queues/rx-3 (queues)
  /devices/mac80211_hwsim/net/hwsim0 (net)
  /devices/mac80211_hwsim/net/hwsim0/queues/rx-0 (queues)

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
---

diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c
index 6f8cb3e..2543e74 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -194,7 +194,16 @@ static inline void hwsim_clear_sta_magic(struct ieee80211_sta *sta)
 	sp->magic = 0;
 }
 
-static struct class *hwsim_class;
+static struct bus_type hwsim_bus = {
+	.name = "mac80211_hwsim",
+};
+
+static struct device_driver hwsim_driver = {
+	.name = "mac80211_hwsim",
+	.bus = &hwsim_bus,
+};
+
+static struct device *hwsim_virtual_parent;
 
 static struct net_device *hwsim_mon; /* global monitor netdev */
 
@@ -280,7 +289,7 @@ static struct list_head hwsim_radios;
 struct mac80211_hwsim_data {
 	struct list_head list;
 	struct ieee80211_hw *hw;
-	struct device *dev;
+	struct device dev;
 	struct ieee80211_supported_band bands[2];
 	struct ieee80211_channel channels_2ghz[ARRAY_SIZE(hwsim_channels_2ghz)];
 	struct ieee80211_channel channels_5ghz[ARRAY_SIZE(hwsim_channels_5ghz)];
@@ -1063,22 +1072,13 @@ static void mac80211_hwsim_free(void)
 		list_move(i, &tmplist);
 	spin_unlock_bh(&hwsim_radio_lock);
 
-	list_for_each_entry_safe(data, tmpdata, &tmplist, list) {
-		debugfs_remove(data->debugfs_group);
-		debugfs_remove(data->debugfs_ps);
-		debugfs_remove(data->debugfs);
-		ieee80211_unregister_hw(data->hw);
-		device_unregister(data->dev);
-		ieee80211_free_hw(data->hw);
-	}
-	class_destroy(hwsim_class);
+	list_for_each_entry_safe(data, tmpdata, &tmplist, list)
+		device_unregister(&data->dev);
+	device_unregister(hwsim_virtual_parent);
+	driver_unregister(&hwsim_driver);
+	bus_unregister(&hwsim_bus);
 }
 
-
-static struct device_driver mac80211_hwsim_driver = {
-	.name = "mac80211_hwsim"
-};
-
 static const struct net_device_ops hwsim_netdev_ops = {
 	.ndo_start_xmit 	= hwsim_mon_xmit,
 	.ndo_change_mtu		= eth_change_mtu,
@@ -1095,6 +1095,7 @@ static void hwsim_mon_setup(struct net_device *dev)
 	dev->type = ARPHRD_IEEE80211_RADIOTAP;
 	memset(dev->dev_addr, 0, ETH_ALEN);
 	dev->dev_addr[0] = 0x12;
+	dev->dev.parent = hwsim_virtual_parent;
 }
 
 
@@ -1231,6 +1232,17 @@ DEFINE_SIMPLE_ATTRIBUTE(hwsim_fops_group,
 			hwsim_fops_group_read, hwsim_fops_group_write,
 			"%llx\n");
 
+static void hwsim_dev_release(struct device* hwsim_dev) {
+	struct mac80211_hwsim_data *data =
+	  container_of(hwsim_dev, struct mac80211_hwsim_data, dev);
+
+	debugfs_remove(data->debugfs_group);
+	debugfs_remove(data->debugfs_ps);
+	debugfs_remove(data->debugfs);
+	ieee80211_unregister_hw(data->hw);
+	ieee80211_free_hw(data->hw);
+}
+
 static int __init init_mac80211_hwsim(void)
 {
 	int i, err = 0;
@@ -1251,9 +1263,22 @@ static int __init init_mac80211_hwsim(void)
 	spin_lock_init(&hwsim_radio_lock);
 	INIT_LIST_HEAD(&hwsim_radios);
 
-	hwsim_class = class_create(THIS_MODULE, "mac80211_hwsim");
-	if (IS_ERR(hwsim_class))
-		return PTR_ERR(hwsim_class);
+	err = bus_register(&hwsim_bus);
+	if (err)
+		return err;
+
+	err = driver_register(&hwsim_driver);
+	if (err) {
+		bus_unregister(&hwsim_bus);
+		return err;
+	}
+
+	hwsim_virtual_parent = root_device_register("mac80211_hwsim");
+	if (IS_ERR(hwsim_virtual_parent)) {
+		driver_unregister(&hwsim_driver);
+		bus_unregister(&hwsim_bus);
+		return PTR_ERR(hwsim_virtual_parent);
+	}
 
 	memset(addr, 0, ETH_ALEN);
 	addr[0] = 0x02;
@@ -1271,18 +1296,19 @@ static int __init init_mac80211_hwsim(void)
 		data = hw->priv;
 		data->hw = hw;
 
-		data->dev = device_create(hwsim_class, NULL, 0, hw,
-					  "hwsim%d", i);
-		if (IS_ERR(data->dev)) {
+		dev_set_name(&data->dev, "hwsim%d", i);
+		data->dev.parent = hwsim_virtual_parent;
+		data->dev.bus = &hwsim_bus;
+		data->dev.release = hwsim_dev_release;
+		err = device_register(&data->dev);
+		if (err) {
 			printk(KERN_DEBUG
-			       "mac80211_hwsim: device_create "
-			       "failed (%ld)\n", PTR_ERR(data->dev));
-			err = -ENOMEM;
+			       "mac80211_hwsim: device_register failed (%d)\n",
+			       err);
 			goto failed_drvdata;
 		}
-		data->dev->driver = &mac80211_hwsim_driver;
 
-		SET_IEEE80211_DEV(hw, data->dev);
+		SET_IEEE80211_DEV(hw, &data->dev);
 		addr[3] = i >> 8;
 		addr[4] = i;
 		memcpy(data->addresses[0].addr, addr, ETH_ALEN);
@@ -1513,7 +1539,7 @@ failed_mon:
 	return err;
 
 failed_hw:
-	device_unregister(data->dev);
+	device_unregister(&data->dev);
 failed_drvdata:
 	ieee80211_free_hw(hw);
 failed:



^ permalink raw reply related

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-08 12:03 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275998127.1899.38.camel@yio.site>

On Tue, 2010-06-08 at 13:55 +0200, Kay Sievers wrote:

> Ok, this needs testing.
> 
> Please let me know, if that works for you so far.

Will do. Just a question: none of this seems to pin the module? So can I
be sure if I rmmod the module that the release function will still be
around etc.?

johannes


^ permalink raw reply

* Re: [PATCH nf-next-2.6] xt_rateest: Better struct xt_rateest layout
From: Patrick McHardy @ 2010-06-08 12:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Netfilter Development Mailinglist
In-Reply-To: <1275941837.2775.7.camel@edumazet-laptop>

On 07.06.2010 22:17, Eric Dumazet wrote:
> We currently dirty two cache lines in struct xt_rateest, this hurts SMP
> performance.
> 
> This patch moves lock/bstats/rstats at beginning of structure so that
> they share a single cache line.

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] pkt_sched: gen_estimator: kill est_lock rwlock
From: Jarek Poplawski @ 2010-06-08 12:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <1275929761.2545.159.camel@edumazet-laptop>

Eric Dumazet wrote:
...
> [PATCH net-next-2.6] pkt_sched: gen_kill_estimator() rcu fixes
> 
> gen_kill_estimator() API is fundamentaly wrong, since caller should make
> sure an RCU grace period is respected before freeing bstats or lock.
> 
> This was partially addressed in commit 5d944c640b4 (gen_estimator:
> deadlock fix), but same problem exist for all gen_kill_estimator()
> users.
> 
> Change its name to gen_kill_estimator_rcu() and change all callers to
> use RCU grace period before freeing bstats container.
> 
> As a bonus, est_lock rwlock can be killed for good.
> 
> Special thanks to Changli Gao for commenting on a previous patch, this
> made this bug clear to me.
> 

Actually, I guess, Changli meant the bug introduced by your previous
patch by removing the est_lock. With this lock (and your commit 5d944)
bstats (and API) seem "fundamentaly" safe.

Jarek P.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] pkt_sched: gen_estimator: kill est_lock rwlock
From: Eric Dumazet @ 2010-06-08 12:27 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <20100608121546.GA9392@ff.dom.local>

Le mardi 08 juin 2010 à 12:15 +0000, Jarek Poplawski a écrit :

> 
> Actually, I guess, Changli meant the bug introduced by your previous
> patch by removing the est_lock. With this lock (and your commit 5d944)
> bstats (and API) seem "fundamentaly" safe.
> 

Sorry, I have no idea of what you want to say, I cant find commit 5d944.




^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] pkt_sched: gen_estimator: kill est_lock rwlock
From: Jarek Poplawski @ 2010-06-08 12:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David Miller, netdev, Stephen Hemminger,
	Patrick McHardy
In-Reply-To: <1276000052.2475.307.camel@edumazet-laptop>

On Tue, Jun 08, 2010 at 02:27:32PM +0200, Eric Dumazet wrote:
> Le mardi 08 juin 2010 ?? 12:15 +0000, Jarek Poplawski a écrit :
> 
> > 
> > Actually, I guess, Changli meant the bug introduced by your previous
> > patch by removing the est_lock. With this lock (and your commit 5d944)
> > bstats (and API) seem "fundamentaly" safe.
> > 
> 
> Sorry, I have no idea of what you want to say, I cant find commit 5d944.
> 

Sorry, I meant the commit mentioned in your changelog which was quoted.

> This was partially addressed in commit 5d944c640b4 (gen_estimator:
> deadlock fix), but same problem exist for all gen_kill_estimator()
> users.

Jarek P.

^ permalink raw reply

* AF_PACKET meets AF_UNIX
From: Sebastian Krahmer @ 2010-06-08 12:43 UTC (permalink / raw)
  To: netdev

Hi

I made a small fix for 2.6.34 to have something like
"tcpdump -i unix". For debugging purposes (DBUS etc)
it would be really helpfull. The sockaddr_ll struct
contains the socket's ino and the senders PID in
ifindex, so all necessary infos can be found
out via /proc/net/unix in userland.

The patch goes here: http://www.suse.de/~krahmer/unix-monitor-2.6.34.tgz

If you dont like any constructions in it please let me know,
so I can fix it and it can be accepted upstream :)

cya,
Sebastian

-- 
~
~ perl self.pl
~ $_='print"\$_=\47$_\47;eval"';eval
~ krahmer@suse.de - SuSE Security Team
~ SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nuernberg)


^ permalink raw reply

* Re: 2.6.35-rc2-git1 - include/linux/cgroup.h:534 invoked rcu_dereference_check() without protection!
From: Miles Lane @ 2010-06-08 13:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	LKML, nauman, eric.dumazet, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg
In-Reply-To: <1275986441.5408.111.camel@twins>

On Tue, Jun 8, 2010 at 4:40 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, 2010-06-08 at 00:16 -0400, Miles Lane wrote:
>> On Mon, Jun 7, 2010 at 8:19 PM, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Mon, Jun 07, 2010 at 02:14:25PM -0400, Miles Lane wrote:
>> >> Hi All,
>> >>
>> >> I just reproduced a warning I reported quite a while ago.  Is a patch
>> >> for this in the pipeline?
>> >
>> > I proposed a patch, thinking that it was a false positive.  Peter Zijlstra
>> > pointed out that there was a real race, and proposed an alternative patch,
>> > which may be found at http://lkml.org/lkml/2010/4/22/603.
>> >
>> > Could you please test Peter's patch and let us know if it cures the problem?
>> >
>
> Gah, this task_group() stuff is annoying, how about something like the
> below which teaches task_group() about the task_rq()->lock rule?
>
> ---
>  include/linux/cgroup.h |   20 +++++++++++----
>  kernel/sched.c         |   61 +++++++++++++++++++++++++----------------------
>  2 files changed, 46 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 0c62160..1efd212 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -525,13 +525,21 @@ static inline struct cgroup_subsys_state *cgroup_subsys_state(
>        return cgrp->subsys[subsys_id];
>  }
>
> -static inline struct cgroup_subsys_state *task_subsys_state(
> -       struct task_struct *task, int subsys_id)
> +/*
> + * function to get the cgroup_subsys_state which allows for extra
> + * rcu_dereference_check() conditions, such as locks used during the
> + * cgroup_subsys::attach() methods.
> + */
> +#define task_subsys_state_check(task, subsys_id, __c)                  \
> +       rcu_dereference_check(task->cgroups->subsys[subsys_id],         \
> +                             rcu_read_lock_held() ||                   \
> +                             lockdep_is_held(&task->alloc_lock) ||     \
> +                             cgroup_lock_is_held() || (__c))
> +
> +static inline struct cgroup_subsys_state *
> +task_subsys_state(struct task_struct *task, int subsys_id)
>  {
> -       return rcu_dereference_check(task->cgroups->subsys[subsys_id],
> -                                    rcu_read_lock_held() ||
> -                                    lockdep_is_held(&task->alloc_lock) ||
> -                                    cgroup_lock_is_held());
> +       return task_subsys_state_check(task, subsys_id, false);
>  }
>
>  static inline struct cgroup* task_cgroup(struct task_struct *task,
> diff --git a/kernel/sched.c b/kernel/sched.c
> index f8b8996..e01bb45 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -306,32 +306,26 @@ static int init_task_group_load = INIT_TASK_GROUP_LOAD;
>  */
>  struct task_group init_task_group;
>
> -/* return group to which a task belongs */
> +/*
> + * Return the group to which this tasks belongs.
> + *
> + * We use task_subsys_state_check() and extend the RCU verification
> + * with lockdep_is_held(&task_rq(p)->lock) because cpu_cgroup_attach()
> + * holds that lock for each task it moves into the cgroup. Therefore
> + * by holding that lock, we pin the task to the current cgroup.
> + */
>  static inline struct task_group *task_group(struct task_struct *p)
>  {
> -       struct task_group *tg;
> +       struct cgroup_subsys_state *css;
>
> -#ifdef CONFIG_CGROUP_SCHED
> -       tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
> -                               struct task_group, css);
> -#else
> -       tg = &init_task_group;
> -#endif
> -       return tg;
> +       css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
> +                       lockdep_is_held(&task_rq(p)->lock));
> +       return container_of(css, struct task_group, css);
>  }
>
>  /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
>  static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
>  {
> -       /*
> -        * Strictly speaking this rcu_read_lock() is not needed since the
> -        * task_group is tied to the cgroup, which in turn can never go away
> -        * as long as there are tasks attached to it.
> -        *
> -        * However since task_group() uses task_subsys_state() which is an
> -        * rcu_dereference() user, this quiets CONFIG_PROVE_RCU.
> -        */
> -       rcu_read_lock();
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>        p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
>        p->se.parent = task_group(p)->se[cpu];
> @@ -341,7 +335,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
>        p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
>        p->rt.parent = task_group(p)->rt_se[cpu];
>  #endif
> -       rcu_read_unlock();
>  }
>
>  #else
> @@ -4465,16 +4458,6 @@ recheck:
>        }
>
>        if (user) {
> -#ifdef CONFIG_RT_GROUP_SCHED
> -               /*
> -                * Do not allow realtime tasks into groups that have no runtime
> -                * assigned.
> -                */
> -               if (rt_bandwidth_enabled() && rt_policy(policy) &&
> -                               task_group(p)->rt_bandwidth.rt_runtime == 0)
> -                       return -EPERM;
> -#endif
> -
>                retval = security_task_setscheduler(p, policy, param);
>                if (retval)
>                        return retval;
> @@ -4490,6 +4473,26 @@ recheck:
>         * runqueue lock must be held.
>         */
>        rq = __task_rq_lock(p);
> +
> +       retval = 0;
> +#ifdef CONFIG_RT_GROUP_SCHED
> +       if (user) {
> +               /*
> +                * Do not allow realtime tasks into groups that have no runtime
> +                * assigned.
> +                */
> +               if (rt_bandwidth_enabled() && rt_policy(policy) &&
> +                               task_group(p)->rt_bandwidth.rt_runtime == 0)
> +                       retval = -EPERM;
> +
> +               if (retval) {
> +                       __task_rq_unlock(rq);
> +                       raw_spin_unlock_irqrestore(&p->pi_lock, flags);
> +                       return retval;
> +               }
> +       }
> +#endif
> +
>        /* recheck policy now with rq lock held */
>        if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
>                policy = oldpolicy = -1;
>
>

  CC      kernel/sched.o
kernel/sched.c: In function ‘task_group’:
kernel/sched.c:321: error: implicit declaration of function ‘task_rq’
kernel/sched.c:321: error: invalid type argument of ‘->’ (have ‘int’)
make[1]: *** [kernel/sched.o] Error 1

I had to apply with fuzz.  Did it mess up?

static inline struct task_group *task_group(struct task_struct *p)
{
       struct cgroup_subsys_state *css;

       css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
                       lockdep_is_held(&task_rq(p)->lock));
       return container_of(css, struct task_group, css);
}

^ permalink raw reply

* Re: BUG: double spinlock in "drivers/net/3c505.c"
From: Christoph Fritz @ 2010-06-08 13:26 UTC (permalink / raw)
  To: Alexander Strakh
  Cc: Philip Blundell, Craig Southeren, Andrew Tridgell, Alan Cox,
	netdev, linux-kernel
In-Reply-To: <201006071517.28744.strakh@ispras.ru>

On Mon, 2010-06-07 at 15:17 +0400, Alexander Strakh wrote: 
> 	KERNEL_VERSION: 2.6.35-rc1
>         SUBJECT: duble spinlock  in function elp_start_xmit

Not only in elp_start_xmit. This driver is for a pretty old and slow isa
ethernet card and I think nobody cares. To quote a comment from the
source: "[...] the concurrency protection is particularly awful".

Thanks, 
   Christoph

^ permalink raw reply

* Re: 2.6.35-rc2-git1 - include/linux/cgroup.h:534 invoked rcu_dereference_check() without protection!
From: Peter Zijlstra @ 2010-06-08 13:34 UTC (permalink / raw)
  To: Miles Lane
  Cc: paulmck, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	LKML, nauman, eric.dumazet, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg
In-Reply-To: <AANLkTimXVlgzHivCQberLdrFP8BH_nJzt8_ozUNHG_aI@mail.gmail.com>

On Tue, 2010-06-08 at 09:14 -0400, Miles Lane wrote:

>   CC      kernel/sched.o
> kernel/sched.c: In function ‘task_group’:
> kernel/sched.c:321: error: implicit declaration of function ‘task_rq’
> kernel/sched.c:321: error: invalid type argument of ‘->’ (have ‘int’)
> make[1]: *** [kernel/sched.o] Error 1
> 
> I had to apply with fuzz.  Did it mess up?


No, I probably did.. task_rq() is defined on line 636 or thereabouts,
and this function landed around line 320.

Ahh, and it compiled here because I have CGROUP_SCHED=y, but
PROVE_RCU=n, so that whole check expression disappears and is never
evaluated...

/me fixes

---
Subject: sched: PROVE_RCU vs cpu_cgroup
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Jun 08 11:40:42 CEST 2010

PROVE_RCU has a few issues with the cpu_cgroup because the scheduler
typically holds rq->lock around the css rcu derefs but the generic
cgroup code doesn't (and can't) know about that lock.

Provide means to add extra checks to the css dereference and use that
in the scheduler to annotate its users.

The addition of rq->lock to these checks is correct because the
cgroup_subsys::attach() method takes the rq->lock for each task it
moves, therefore by holding that lock, we ensure the task is pinned to
the current cgroup and the RCU dereference is valid.

That leaves one genuine race in __sched_setscheduler() where we used
task_group() without holding any of the required locks and thus raced
with the cgroup code. Solve this by moving the check under the rq->lock.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/cgroup.h |   20 +++++---
 kernel/sched.c         |  115 +++++++++++++++++++++++++------------------------
 2 files changed, 73 insertions(+), 62 deletions(-)

Index: linux-2.6/include/linux/cgroup.h
===================================================================
--- linux-2.6.orig/include/linux/cgroup.h
+++ linux-2.6/include/linux/cgroup.h
@@ -525,13 +525,21 @@ static inline struct cgroup_subsys_state
 	return cgrp->subsys[subsys_id];
 }
 
-static inline struct cgroup_subsys_state *task_subsys_state(
-	struct task_struct *task, int subsys_id)
+/*
+ * function to get the cgroup_subsys_state which allows for extra
+ * rcu_dereference_check() conditions, such as locks used during the
+ * cgroup_subsys::attach() methods.
+ */
+#define task_subsys_state_check(task, subsys_id, __c)			\
+	rcu_dereference_check(task->cgroups->subsys[subsys_id],		\
+			      rcu_read_lock_held() ||			\
+			      lockdep_is_held(&task->alloc_lock) ||	\
+			      cgroup_lock_is_held() || (__c))
+
+static inline struct cgroup_subsys_state *
+task_subsys_state(struct task_struct *task, int subsys_id)
 {
-	return rcu_dereference_check(task->cgroups->subsys[subsys_id],
-				     rcu_read_lock_held() ||
-				     lockdep_is_held(&task->alloc_lock) ||
-				     cgroup_lock_is_held());
+	return task_subsys_state_check(task, subsys_id, false);
 }
 
 static inline struct cgroup* task_cgroup(struct task_struct *task,
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -306,52 +306,6 @@ static int init_task_group_load = INIT_T
  */
 struct task_group init_task_group;
 
-/* return group to which a task belongs */
-static inline struct task_group *task_group(struct task_struct *p)
-{
-	struct task_group *tg;
-
-#ifdef CONFIG_CGROUP_SCHED
-	tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
-				struct task_group, css);
-#else
-	tg = &init_task_group;
-#endif
-	return tg;
-}
-
-/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
-static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
-{
-	/*
-	 * Strictly speaking this rcu_read_lock() is not needed since the
-	 * task_group is tied to the cgroup, which in turn can never go away
-	 * as long as there are tasks attached to it.
-	 *
-	 * However since task_group() uses task_subsys_state() which is an
-	 * rcu_dereference() user, this quiets CONFIG_PROVE_RCU.
-	 */
-	rcu_read_lock();
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
-	p->se.parent = task_group(p)->se[cpu];
-#endif
-
-#ifdef CONFIG_RT_GROUP_SCHED
-	p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
-	p->rt.parent = task_group(p)->rt_se[cpu];
-#endif
-	rcu_read_unlock();
-}
-
-#else
-
-static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
-static inline struct task_group *task_group(struct task_struct *p)
-{
-	return NULL;
-}
-
 #endif	/* CONFIG_CGROUP_SCHED */
 
 /* CFS-related fields in a runqueue */
@@ -644,6 +598,49 @@ static inline int cpu_of(struct rq *rq)
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+#ifdef CONFIG_CGROUP_SCHED
+
+/*
+ * Return the group to which this tasks belongs.
+ *
+ * We use task_subsys_state_check() and extend the RCU verification
+ * with lockdep_is_held(&task_rq(p)->lock) because cpu_cgroup_attach()
+ * holds that lock for each task it moves into the cgroup. Therefore
+ * by holding that lock, we pin the task to the current cgroup.
+ */
+static inline struct task_group *task_group(struct task_struct *p)
+{
+	struct cgroup_subsys_state *css;
+
+	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
+			lockdep_is_held(&task_rq(p)->lock));
+	return container_of(css, struct task_group, css);
+}
+
+/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
+static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
+	p->se.parent = task_group(p)->se[cpu];
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	p->rt.rt_rq  = task_group(p)->rt_rq[cpu];
+	p->rt.parent = task_group(p)->rt_se[cpu];
+#endif
+}
+
+#else /* CONFIG_CGROUP_SCHED */
+
+static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
+static inline struct task_group *task_group(struct task_struct *p)
+{
+	return NULL;
+}
+
+#endif /* CONFIG_CGROUP_SCHED */
+
 inline void update_rq_clock(struct rq *rq)
 {
 	if (!rq->skip_clock_update)
@@ -4465,16 +4462,6 @@ recheck:
 	}
 
 	if (user) {
-#ifdef CONFIG_RT_GROUP_SCHED
-		/*
-		 * Do not allow realtime tasks into groups that have no runtime
-		 * assigned.
-		 */
-		if (rt_bandwidth_enabled() && rt_policy(policy) &&
-				task_group(p)->rt_bandwidth.rt_runtime == 0)
-			return -EPERM;
-#endif
-
 		retval = security_task_setscheduler(p, policy, param);
 		if (retval)
 			return retval;
@@ -4490,6 +4477,22 @@ recheck:
 	 * runqueue lock must be held.
 	 */
 	rq = __task_rq_lock(p);
+
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (user) {
+		/*
+		 * Do not allow realtime tasks into groups that have no runtime
+		 * assigned.
+		 */
+		if (rt_bandwidth_enabled() && rt_policy(policy) &&
+				task_group(p)->rt_bandwidth.rt_runtime == 0) {
+			__task_rq_unlock(rq);
+			raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+			return -EPERM;
+		}
+	}
+#endif
+
 	/* recheck policy now with rq lock held */
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-08 13:54 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275998603.3706.118.camel@jlt3.sipsolutions.net>

On Tue, Jun 8, 2010 at 14:03, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Tue, 2010-06-08 at 13:55 +0200, Kay Sievers wrote:
>
>> Ok, this needs testing.
>>
>> Please let me know, if that works for you so far.
>
> Will do. Just a question: none of this seems to pin the module? So can I
> be sure if I rmmod the module that the release function will still be
> around etc.?

Oh, sorry, that's something very specific to network devices, that the
module can be unloaded while the devices it has created are still in
use. I am not sure right now what's needed to make this work in a
single module.

Kay

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-08 14:06 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTinev96yxFCFSdDdT89jjgYR4ChUCMU2-aSj8kkh@mail.gmail.com>

On Tue, 2010-06-08 at 15:54 +0200, Kay Sievers wrote:
> On Tue, Jun 8, 2010 at 14:03, Johannes Berg <johannes@sipsolutions.net> wrote:
> > On Tue, 2010-06-08 at 13:55 +0200, Kay Sievers wrote:
> >
> >> Ok, this needs testing.
> >>
> >> Please let me know, if that works for you so far.
> >
> > Will do. Just a question: none of this seems to pin the module? So can I
> > be sure if I rmmod the module that the release function will still be
> > around etc.?
> 
> Oh, sorry, that's something very specific to network devices, that the
> module can be unloaded while the devices it has created are still in
> use. I am not sure right now what's needed to make this work in a
> single module.

Well they will be unregistered and everything, but once all the netdevs
are gone etc. the devices you create in the patch might stick around
because somebody has them open in sysfs, and I see nothing that would
pin the module in that case?

johannes


^ permalink raw reply

* Re: [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit
From: Patrick McHardy @ 2010-06-08 14:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Netfilter Developers, netdev
In-Reply-To: <1275668732.2482.201.camel@edumazet-laptop>

On 04.06.2010 18:25, Eric Dumazet wrote:
> [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit
> 
> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> twice per packet. This is bad for performance.
> __read_mostly annotation is also a bad choice.
> 
> This patch introduces IPS_UNTRACKED bit so that we can use later a
> per_cpu untrack structure more easily.
> 
> A new helper, nf_ct_untracked_get() returns a pointer to
> nf_conntrack_untracked.
> 
> Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
> IPS_NAT_DONE_MASK bits to untracked status.
> 
> nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

Applied, thanks Eric.

^ permalink raw reply

* Re: RFS seems to have incompatiblities with bridged vlans
From: Eric Dumazet @ 2010-06-08 14:18 UTC (permalink / raw)
  To: John Fastabend
  Cc: Stephen Hemminger, Peter Lieven, davem@davemloft.net,
	netdev@vger.kernel.org
In-Reply-To: <4C0D7312.1020300@intel.com>

Le lundi 07 juin 2010 à 15:30 -0700, John Fastabend a écrit :

> There is always a possibility that the underlying device sets the 
> queue_mapping to be greater then num_cpus.  Also I suspect the same 
> issue exists with bonding devices.  Maybe something like the following 
> is worth while? compile tested only,
> 
> [PATCH] 8021q: vlan reassigns dev without check queue_mapping
> 
> recv path reassigns skb->dev without sanity checking the
> queue_mapping field.  This can result in the queue_mapping
> field being set incorrectly if the new dev supports less
> queues then the underlying device.
> 
> This patch just resets the queue_mapping to 0 which should
> resolve this issue?  Any thoughts?
> 
> The same issue could happen on bonding devices as well.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 
>   net/8021q/vlan_core.c |    6 ++++++
>   1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index bd537fc..ad309f8 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -21,6 +21,9 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct 
> vlan_group *grp,
>   	if (!skb->dev)
>   		goto drop;
> 
> +	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
> +		skb_set_queue_mapping(skb, 0);
> +
>   	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
> 
>   drop:
> @@ -93,6 +96,9 @@ vlan_gro_common(struct napi_struct *napi, struct 
> vlan_group *grp,
>   	if (!skb->dev)
>   		goto drop;
> 
> +	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
> +		skb_set_queue_mapping(skb, 0);
> +
>   	for (p = napi->gro_list; p; p = p->next) {
>   		NAPI_GRO_CB(p)->same_flow =
>   			p->dev == skb->dev && !compare_ether_header(
> --

Only a workaround, added in hot path in a otherwise 'good' driver
(multiqueue enabled and ready)

eth0  -------> bond / bridge ---------> vlan.id
(nbtxq=8)      (ntxbq=1)        (nbtxq=X)

X is capped to 1 because of bond/bridge, while bond has no "queue"
(LLTX driver)

Solutions :

1) queue_mapping could be silently tested in get_rps_cpu()...

diff --git a/net/core/dev.c b/net/core/dev.c
index 6f330ce..3a3f7f6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2272,14 +2272,11 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 	if (skb_rx_queue_recorded(skb)) {
 		u16 index = skb_get_rx_queue(skb);
-		if (unlikely(index >= dev->num_rx_queues)) {
-			if (net_ratelimit()) {
-				pr_warning("%s received packet on queue "
-					"%u, but number of RX queues is %u\n",
-					dev->name, index, dev->num_rx_queues);
-			}
-			goto done;
-		}
+		if (WARN_ONCE(index >= dev->num_rx_queues,
+				KERN_WARNING "%s received packet on queue %u, "
+				"but number of RX queues is %u\n",
+				dev->name, index, dev->num_rx_queues))
+			index %= dev->num_rx_queues;
 		rxqueue = dev->_rx + index;
 	} else
 		rxqueue = dev->_rx;



2) bond/bridge should setup more queues, just in case.
   We probably need to be able to make things more dynamic,
   (propagate nbtxq between layers) but not for 2.6.35



diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5e12462..ce813dd 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5012,8 +5012,8 @@ int bond_create(struct net *net, const char *name)
 
 	rtnl_lock();
 
-	bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : "",
-				bond_setup);
+	bond_dev = alloc_netdev_mq(sizeof(struct bonding), name ? name : "",
+				   bond_setup, max(64, nr_cpu_ids));
 	if (!bond_dev) {
 		pr_err("%s: eek! can't alloc netdev!\n", name);
 		rtnl_unlock();



^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox