Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Patrick McHardy @ 2014-10-25  0:44 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Cong Wang, netdev, David Miller, Wang Bo,
	John Fastabend, Eric Dumazet, Terry Lam
In-Reply-To: <CAHA+R7M3y=Mj_L3yL0UM41mfGa+go+Kq4wYga7D6epY3uq_yWQ@mail.gmail.com>

On Fri, Oct 24, 2014 at 05:38:48PM -0700, Cong Wang wrote:
> On Fri, Oct 24, 2014 at 5:36 PM, Patrick McHardy <kaber@trash.net> wrote:
> >
> > Again, the correct fix is to make qdisc_create_dflt() not call
> > qdisc_destroy() but clean up the qdisc manually as done in
> > qdisc_create().
> 
> I kindly wish you a good luck with fixing all callers of qdisc_create_dflt().
> Go ahead. :)

Here you go:

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index fc04fe9..3a71e51 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -590,18 +590,21 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 	struct Qdisc *sch;
 
 	if (!try_module_get(ops->owner))
-		goto errout;
+		goto err1;
 
 	sch = qdisc_alloc(dev_queue, ops);
 	if (IS_ERR(sch))
-		goto errout;
+		goto err2;
 	sch->parent = parentid;
 
 	if (!ops->init || ops->init(sch, NULL) == 0)
 		return sch;
 
-	qdisc_destroy(sch);
-errout:
+	dev_put(sch->dev_queue->dev);
+	kfree((char *)sch - sch->padded);
+err2:
+	module_put(ops->owner);
+err1:
 	return NULL;
 }
 EXPORT_SYMBOL(qdisc_create_dflt);

^ permalink raw reply related

* Re: [Patch net 1/2] sch_pie: schedule the timer after all init succeed
From: Cong Wang @ 2014-10-25  0:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, netdev, David Miller, Vijay Subramanian
In-Reply-To: <1414197563.20845.50.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Oct 24, 2014 at 5:39 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> You don't need to add one line in the changelog.
>
> Why would it be needed ? Think about it.


Simply because git-send-email reads Cc: so that I don't have to
specify in cmdline.
That's all.

OMG, first time to hear people complaining on Cc'ing too much! lol :)

^ permalink raw reply

* Re: [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Eric Dumazet @ 2014-10-25  0:42 UTC (permalink / raw)
  To: Cong Wang
  Cc: Cong Wang, netdev, David Miller, Wang Bo, John Fastabend,
	Eric Dumazet, Patrick McHardy, Terry Lam
In-Reply-To: <CAHA+R7N3jJCgOJjnDUxsTaH7ZCvPydDT60qkcqSPsEyFbp=UQg@mail.gmail.com>

On Fri, 2014-10-24 at 17:36 -0700, Cong Wang wrote:

> You missed what Wang Bo reported, we could double free
> mq qdisc on failure path, one in mq_init() and one in qdisc_create_dflt().

So fix mq/mqprio ?

I am sorry, but sending patches on Friday afternoon should be forbidden,
too much beer lying around ;)

^ permalink raw reply

* Re: [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Eric Dumazet @ 2014-10-25  0:40 UTC (permalink / raw)
  To: Cong Wang
  Cc: Cong Wang, netdev, David Miller, Wang Bo, John Fastabend,
	Eric Dumazet, Patrick McHardy, Terry Lam
In-Reply-To: <CAHA+R7N3jJCgOJjnDUxsTaH7ZCvPydDT60qkcqSPsEyFbp=UQg@mail.gmail.com>

On Fri, 2014-10-24 at 17:36 -0700, Cong Wang wrote:

> >>  static int multiq_dump(struct Qdisc *sch, struct sk_buff *skb)
> >> diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
> >> index 3ec7e88..55ac6a4 100644
> >> --- a/net/sched/sch_qfq.c
> >> +++ b/net/sched/sch_qfq.c
> >> @@ -522,7 +522,6 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
> >>       return 0;
> >>
> >>  destroy_class:
> >> -     qdisc_destroy(cl->qdisc);
> >>       kfree(cl);
> >>       return err;
> >>  }
> >
> >
> > Really ? I am calling this a leak of cl->qdisc.
> >
> >
> 
> How? I already added a comment on qdisc_create_dflt(), callers don't
> need to clean up
> on failure path.

Seriously, can you read the function again ?

What happens if gen_new_estimator() fails ?

^ permalink raw reply

* Re: [Patch net 1/2] sch_pie: schedule the timer after all init succeed
From: Eric Dumazet @ 2014-10-25  0:39 UTC (permalink / raw)
  To: Cong Wang; +Cc: Cong Wang, netdev, David Miller, Vijay Subramanian
In-Reply-To: <CAHA+R7PFSCACHyrU8fa4PVLOF1OtJiFiw4peLmFPzEu6-adaeQ@mail.gmail.com>

On Fri, 2014-10-24 at 17:28 -0700, Cong Wang wrote:
> On Fri, Oct 24, 2014 at 5:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Not sure why you use "Cc: David S. Miller <davem@davemloft.net>",
> > given that all networking patches have "Signed-off-by: David S. Miller
> > <davem@davemloft.net>" ???
> >
> 
> I don't understand what you are talking about:
> 
> If you are suggesting DaveM is too obvious to Cc, well, Cc'ing him
> doesn't harm anything. I know he is on netdev list, but just in case
> vger.kernel.org is broken or etc...
> 
> If you are suggesting to Cc his another email address, please update
> MAINTAINERS.


You send the patch to David Miller, as a recipient, so that he can read
it, and apply it.

You don't need to add one line in the changelog.

Why would it be needed ? Think about it.

^ permalink raw reply

* Re: [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Cong Wang @ 2014-10-25  0:38 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Eric Dumazet, Cong Wang, netdev, David Miller, Wang Bo,
	John Fastabend, Eric Dumazet, Terry Lam
In-Reply-To: <20141025003644.GB11289@acer.localdomain>

On Fri, Oct 24, 2014 at 5:36 PM, Patrick McHardy <kaber@trash.net> wrote:
>
> Again, the correct fix is to make qdisc_create_dflt() not call
> qdisc_destroy() but clean up the qdisc manually as done in
> qdisc_create().

I kindly wish you a good luck with fixing all callers of qdisc_create_dflt().
Go ahead. :)

^ permalink raw reply

* Re: [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Patrick McHardy @ 2014-10-25  0:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cong Wang, netdev, davem, Wang Bo, John Fastabend, Eric Dumazet,
	Terry Lam
In-Reply-To: <1414196053.20845.45.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Oct 24, 2014 at 05:14:13PM -0700, Eric Dumazet wrote:
> On Fri, 2014-10-24 at 16:55 -0700, Cong Wang wrote:
> > In qdisc_create(), when ->init() exists and it fails, we should
> > call ->destroy() to clean up the potentially partially initialized
> > qdisc's. This will also make the ->init() implementation easier.
> > 
> 
> Why is this patch needed ?
> 
> You are adding bugs, its not clear what bug you are fixing.
> 
> I really do not like the idea of ->init() relying on a ->destroy() to
> cleanup a failed ->init().
> 
> This is not what most management functions do in our stack.
> 
> I very much prefer that a function returning an error has no side
> effect, like if it hadnt be called at all.

Absolutely.

Again, the correct fix is to make qdisc_create_dflt() not call
qdisc_destroy() but clean up the qdisc manually as done in
qdisc_create().

^ permalink raw reply

* Re: [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Cong Wang @ 2014-10-25  0:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cong Wang, netdev, David Miller, Wang Bo, John Fastabend,
	Eric Dumazet, Patrick McHardy, Terry Lam
In-Reply-To: <1414196053.20845.45.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Oct 24, 2014 at 5:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-10-24 at 16:55 -0700, Cong Wang wrote:
>> In qdisc_create(), when ->init() exists and it fails, we should
>> call ->destroy() to clean up the potentially partially initialized
>> qdisc's. This will also make the ->init() implementation easier.
>>
>
> Why is this patch needed ?
>
> You are adding bugs, its not clear what bug you are fixing.


You missed what Wang Bo reported, we could double free
mq qdisc on failure path, one in mq_init() and one in qdisc_create_dflt().


>
> I really do not like the idea of ->init() relying on a ->destroy() to
> cleanup a failed ->init().
>
> This is not what most management functions do in our stack.
>
> I very much prefer that a function returning an error has no side
> effect, like if it hadnt be called at all.


If you would like to fix all qdisc_create_dflt() callers, yes, and good
luck with even more complex error handling there.

Read the diffstat, if we could fix a bug with less code, why not....

>
>> qdisc_create_dflt() already does that. Most callers of qdisc_create_dflt()
>> simply use noop_qdisc when it fails.
>>
>> And, most of the ->destroy() implementations are already able to clean up
>> partially initialization, we don't need to worry.
>>
>> The following ->init()'s need to catch up:
>> fq_codel_init(), hhf_init(), multiq_init(),  sfq_init(), mq_init(),
>> mqprio_init().
>>
>
> ...
>
>>
>>  static int multiq_dump(struct Qdisc *sch, struct sk_buff *skb)
>> diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
>> index 3ec7e88..55ac6a4 100644
>> --- a/net/sched/sch_qfq.c
>> +++ b/net/sched/sch_qfq.c
>> @@ -522,7 +522,6 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
>>       return 0;
>>
>>  destroy_class:
>> -     qdisc_destroy(cl->qdisc);
>>       kfree(cl);
>>       return err;
>>  }
>
>
> Really ? I am calling this a leak of cl->qdisc.
>
>

How? I already added a comment on qdisc_create_dflt(), callers don't
need to clean up
on failure path.

^ permalink raw reply

* Re: [PATCH net] net/sched: Fix use of wild pointer in mq_destroy() when qdisc_alloc fail
From: Patrick McHardy @ 2014-10-25  0:33 UTC (permalink / raw)
  To: Cong Wang; +Cc: John Fastabend, wang.bo116, David Miller, netdev, cui.yunfeng
In-Reply-To: <CAHA+R7PxvW6umBJMxGk0h=ht4RhoDRcrxr5WSTEUBa80TxU5GQ@mail.gmail.com>

On Fri, Oct 24, 2014 at 03:17:41PM -0700, Cong Wang wrote:
> On Fri, Oct 24, 2014 at 2:45 PM, Patrick McHardy <kaber@trash.net> wrote:
> >> >
> >> > I would argue that the qdisc_destroy() call in qdisc_create_dflt()
> >> > is wrong, it should instead free the qdisc and release the module
> >> > reference manually as done in qdisc_create().
> >> >
> >> > qdisc_destroy() should only be called for fully initialized qdiscs.
> >>
> >> Probably, but at least ->destroy() should be called, looking at
> >> those calling qdisc_watchdog_init(), they are supposed to call
> >> qdisc_watchdog_cancel() when >init() fails after that.
> >
> > In which cases does it actually fail after that? Usually this is
> > called once initialization is complete.
> 
> How about tbf_change() in tbf_init()? If tbf_change() fails,
> watchdog is still there if we don't call ->destroy(). Yes,
> I know the timer is started, the point is we do miss something
> clean up, even trivial.
> 
> tbf is not the only one who calls xxx_change() in xxx_init().

Then these are bugs. On failure we exit with the same state as we
entered, there's nothing new about that.

This in fact is *no* bug though since qdisc_watchdog_init() merely
initializes the timer and doesn't require cleanup.

> > Its simply symetrical, as everywhere else in the kernel. If a sub-init
> > funtion fails, it should clean up and return an error. We don't
> > destroy things we've never successfully initialized, they're supposed
> > to clean up after themselves.
> 
> Most (if not all) ->destroy() are able to clean partially initialized qdisc,
> I don't see why it  could be a problem here.
> 
> We don't have to keep with other kernel subsystem as long as it makes
> sense, net_sched subsystem is pretty much self-contained.

Its about having a sane API.

^ permalink raw reply

* Re: [Patch net 1/2] sch_pie: schedule the timer after all init succeed
From: Cong Wang @ 2014-10-25  0:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Cong Wang, netdev, David Miller, Vijay Subramanian
In-Reply-To: <1414196222.20845.48.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Oct 24, 2014 at 5:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Not sure why you use "Cc: David S. Miller <davem@davemloft.net>",
> given that all networking patches have "Signed-off-by: David S. Miller
> <davem@davemloft.net>" ???
>

I don't understand what you are talking about:

If you are suggesting DaveM is too obvious to Cc, well, Cc'ing him
doesn't harm anything. I know he is on netdev list, but just in case
vger.kernel.org is broken or etc...

If you are suggesting to Cc his another email address, please update
MAINTAINERS.

^ permalink raw reply

* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Jay Vosburgh @ 2014-10-25  0:20 UTC (permalink / raw)
  To: paulmck
  Cc: Yanko Kaneti, Josh Boyer, Eric W. Biederman, Cong Wang,
	Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org, mroos, tj
In-Reply-To: <20141024230524.GA16023@linux.vnet.ibm.com>

Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

>On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
[...]
>> Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
>> need to take a hard look at the sleep/wakeup code.
>> 
>> Thank you for running this!!!
>
>Could you please try the following patch?  If no joy, could you please
>add rcu:rcu_nocb_wake to the list of ftrace events?

	I tried the patch, it did not change the behavior.

	I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
and ran it again (with this patch and the first patch from earlier
today); the trace output is a bit on the large side so I put it and the
dmesg log at:

http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt

http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

	-J


>							Thanx, Paul
>
>------------------------------------------------------------------------
>
>rcu: Kick rcuo kthreads after their CPU goes offline
>
>If a no-CBs CPU were to post an RCU callback with interrupts disabled
>after it entered the idle loop for the last time, there might be no
>deferred wakeup for the corresponding rcuo kthreads.  This commit
>therefore adds a set of calls to do_nocb_deferred_wakeup() after the
>CPU has gone completely offline.
>
>Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
>diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>index 84b41b3c6ebd..f6880052b917 100644
>--- a/kernel/rcu/tree.c
>+++ b/kernel/rcu/tree.c
>@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
> 	case CPU_DEAD_FROZEN:
> 	case CPU_UP_CANCELED:
> 	case CPU_UP_CANCELED_FROZEN:
>-		for_each_rcu_flavor(rsp)
>+		for_each_rcu_flavor(rsp) {
> 			rcu_cleanup_dead_cpu(cpu, rsp);
>+			do_nocb_deferred_wakeup(per_cpu_ptr(rsp->rda, cpu));
>+		}
> 		break;
> 	default:
> 		break;
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* Re: [Patch net 1/2] sch_pie: schedule the timer after all init succeed
From: Eric Dumazet @ 2014-10-25  0:17 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, davem, Vijay Subramanian
In-Reply-To: <1414194959-28006-1-git-send-email-xiyou.wangcong@gmail.com>

On Fri, 2014-10-24 at 16:55 -0700, Cong Wang wrote:
> Cc: Vijay Subramanian <vijaynsu@cisco.com>
> Cc: David S. Miller <davem@davemloft.net>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/sched/sch_pie.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Eric Dumazet <edumazet@google.com>

Not sure why you use "Cc: David S. Miller <davem@davemloft.net>",
given that all networking patches have "Signed-off-by: David S. Miller
<davem@davemloft.net>" ???

^ permalink raw reply

* Re: [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Eric Dumazet @ 2014-10-25  0:14 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, davem, Wang Bo, John Fastabend, Eric Dumazet,
	Patrick McHardy, Terry Lam
In-Reply-To: <1414194959-28006-2-git-send-email-xiyou.wangcong@gmail.com>

On Fri, 2014-10-24 at 16:55 -0700, Cong Wang wrote:
> In qdisc_create(), when ->init() exists and it fails, we should
> call ->destroy() to clean up the potentially partially initialized
> qdisc's. This will also make the ->init() implementation easier.
> 

Why is this patch needed ?

You are adding bugs, its not clear what bug you are fixing.

I really do not like the idea of ->init() relying on a ->destroy() to
cleanup a failed ->init().

This is not what most management functions do in our stack.

I very much prefer that a function returning an error has no side
effect, like if it hadnt be called at all.

> qdisc_create_dflt() already does that. Most callers of qdisc_create_dflt()
> simply use noop_qdisc when it fails.
> 
> And, most of the ->destroy() implementations are already able to clean up
> partially initialization, we don't need to worry.
> 
> The following ->init()'s need to catch up:
> fq_codel_init(), hhf_init(), multiq_init(),  sfq_init(), mq_init(),
> mqprio_init().
> 

...

>  
>  static int multiq_dump(struct Qdisc *sch, struct sk_buff *skb)
> diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
> index 3ec7e88..55ac6a4 100644
> --- a/net/sched/sch_qfq.c
> +++ b/net/sched/sch_qfq.c
> @@ -522,7 +522,6 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
>  	return 0;
>  
>  destroy_class:
> -	qdisc_destroy(cl->qdisc);
>  	kfree(cl);
>  	return err;
>  }


Really ? I am calling this a leak of cl->qdisc.

^ permalink raw reply

* [Patch net 2/2] net_sched: always call ->destroy when ->init() fails
From: Cong Wang @ 2014-10-24 23:55 UTC (permalink / raw)
  To: netdev
  Cc: davem, Cong Wang, Wang Bo, John Fastabend, Eric Dumazet,
	Patrick McHardy, Terry Lam
In-Reply-To: <1414194959-28006-1-git-send-email-xiyou.wangcong@gmail.com>

In qdisc_create(), when ->init() exists and it fails, we should
call ->destroy() to clean up the potentially partially initialized
qdisc's. This will also make the ->init() implementation easier.

qdisc_create_dflt() already does that. Most callers of qdisc_create_dflt()
simply use noop_qdisc when it fails.

And, most of the ->destroy() implementations are already able to clean up
partially initialization, we don't need to worry.

The following ->init()'s need to catch up:
fq_codel_init(), hhf_init(), multiq_init(),  sfq_init(), mq_init(),
mqprio_init().

Reported-by: Wang Bo <wang.bo116@zte.com.cn>
Cc: Wang Bo <wang.bo116@zte.com.cn>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Terry Lam <vtlam@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_api.c      |  2 ++
 net/sched/sch_drr.c      |  1 -
 net/sched/sch_fq_codel.c |  6 +++---
 net/sched/sch_generic.c  |  1 +
 net/sched/sch_hhf.c      |  8 ++------
 net/sched/sch_mq.c       |  6 +-----
 net/sched/sch_mqprio.c   | 18 +++++-------------
 net/sched/sch_multiq.c   |  9 ++-------
 net/sched/sch_qfq.c      |  1 -
 net/sched/sch_sfq.c      |  7 ++++---
 10 files changed, 20 insertions(+), 39 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 76f402e..7c308c9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -990,6 +990,8 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 		qdisc_list_add(sch);
 
 		return sch;
+	} else {
+		goto err_out4;
 	}
 err_out3:
 	dev_put(dev);
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
index 3387060..6c69b88 100644
--- a/net/sched/sch_drr.c
+++ b/net/sched/sch_drr.c
@@ -121,7 +121,6 @@ static int drr_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 					    qdisc_root_sleeping_lock(sch),
 					    tca[TCA_RATE]);
 		if (err) {
-			qdisc_destroy(cl->qdisc);
 			kfree(cl);
 			return err;
 		}
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index b9ca32e..05c725f 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -406,11 +406,11 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
 					   sizeof(struct fq_codel_flow));
 		if (!q->flows)
 			return -ENOMEM;
+
 		q->backlogs = fq_codel_zalloc(q->flows_cnt * sizeof(u32));
-		if (!q->backlogs) {
-			fq_codel_free(q->flows);
+		if (!q->backlogs)
 			return -ENOMEM;
-		}
+
 		for (i = 0; i < q->flows_cnt; i++) {
 			struct fq_codel_flow *flow = q->flows + i;
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6efca30..d1e2ed6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -622,6 +622,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	return ERR_PTR(err);
 }
 
+/* Callers don't need to clean up on failure, we do here. */
 struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 				const struct Qdisc_ops *ops,
 				unsigned int parentid)
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index 15d3aab..8f94bb9 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -639,10 +639,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
 			q->hhf_arrays[i] = hhf_zalloc(HHF_ARRAYS_LEN *
 						      sizeof(u32));
-			if (!q->hhf_arrays[i]) {
-				hhf_destroy(sch);
+			if (!q->hhf_arrays[i])
 				return -ENOMEM;
-			}
 		}
 		q->hhf_arrays_reset_timestamp = hhf_time_stamp();
 
@@ -650,10 +648,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
 			q->hhf_valid_bits[i] = hhf_zalloc(HHF_ARRAYS_LEN /
 							  BITS_PER_BYTE);
-			if (!q->hhf_valid_bits[i]) {
-				hhf_destroy(sch);
+			if (!q->hhf_valid_bits[i])
 				return -ENOMEM;
-			}
 		}
 
 		/* Initialize Weighted DRR buckets. */
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index f3cbaec..8f009c9 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -61,17 +61,13 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 					  TC_H_MAKE(TC_H_MAJ(sch->handle),
 						    TC_H_MIN(ntx + 1)));
 		if (qdisc == NULL)
-			goto err;
+			return -ENOMEM;
 		priv->qdiscs[ntx] = qdisc;
 		qdisc->flags |= TCQ_F_ONETXQUEUE;
 	}
 
 	sch->flags |= TCQ_F_MQROOT;
 	return 0;
-
-err:
-	mq_destroy(sch);
-	return -ENOMEM;
 }
 
 static void mq_attach(struct Qdisc *sch)
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 3811a74..d172819 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -117,20 +117,16 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 	/* pre-allocate qdisc, attachment can't fail */
 	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
 			       GFP_KERNEL);
-	if (priv->qdiscs == NULL) {
-		err = -ENOMEM;
-		goto err;
-	}
+	if (priv->qdiscs == NULL)
+		return -ENOMEM;
 
 	for (i = 0; i < dev->num_tx_queues; i++) {
 		dev_queue = netdev_get_tx_queue(dev, i);
 		qdisc = qdisc_create_dflt(dev_queue, default_qdisc_ops,
 					  TC_H_MAKE(TC_H_MAJ(sch->handle),
 						    TC_H_MIN(i + 1)));
-		if (qdisc == NULL) {
-			err = -ENOMEM;
-			goto err;
-		}
+		if (qdisc == NULL)
+			return -ENOMEM;
 		priv->qdiscs[i] = qdisc;
 		qdisc->flags |= TCQ_F_ONETXQUEUE;
 	}
@@ -143,7 +139,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 		priv->hw_owned = 1;
 		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
 		if (err)
-			goto err;
+			return err;
 	} else {
 		netdev_set_num_tc(dev, qopt->num_tc);
 		for (i = 0; i < qopt->num_tc; i++)
@@ -157,10 +153,6 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 
 	sch->flags |= TCQ_F_MQROOT;
 	return 0;
-
-err:
-	mqprio_destroy(sch);
-	return err;
 }
 
 static void mqprio_attach(struct Qdisc *sch)
diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 42dd218..31dd2d2 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -252,7 +252,7 @@ static int multiq_tune(struct Qdisc *sch, struct nlattr *opt)
 static int multiq_init(struct Qdisc *sch, struct nlattr *opt)
 {
 	struct multiq_sched_data *q = qdisc_priv(sch);
-	int i, err;
+	int i;
 
 	q->queues = NULL;
 
@@ -267,12 +267,7 @@ static int multiq_init(struct Qdisc *sch, struct nlattr *opt)
 	for (i = 0; i < q->max_bands; i++)
 		q->queues[i] = &noop_qdisc;
 
-	err = multiq_tune(sch, opt);
-
-	if (err)
-		kfree(q->queues);
-
-	return err;
+	return multiq_tune(sch, opt);
 }
 
 static int multiq_dump(struct Qdisc *sch, struct sk_buff *skb)
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index 3ec7e88..55ac6a4 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -522,7 +522,6 @@ static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
 	return 0;
 
 destroy_class:
-	qdisc_destroy(cl->qdisc);
 	kfree(cl);
 	return err;
 }
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index b877140..7ad2879 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -761,11 +761,12 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 	}
 
 	q->ht = sfq_alloc(sizeof(q->ht[0]) * q->divisor);
+	if (!q->ht)
+		return -ENOMEM;
 	q->slots = sfq_alloc(sizeof(q->slots[0]) * q->maxflows);
-	if (!q->ht || !q->slots) {
-		sfq_destroy(sch);
+	if (!q->slots)
 		return -ENOMEM;
-	}
+
 	for (i = 0; i < q->divisor; i++)
 		q->ht[i] = SFQ_EMPTY_SLOT;
 
-- 
1.8.3.1

^ permalink raw reply related

* [Patch net 1/2] sch_pie: schedule the timer after all init succeed
From: Cong Wang @ 2014-10-24 23:55 UTC (permalink / raw)
  To: netdev; +Cc: davem, Cong Wang, Vijay Subramanian

Cc: Vijay Subramanian <vijaynsu@cisco.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_pie.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index 33d7a98..b783a44 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -445,7 +445,6 @@ static int pie_init(struct Qdisc *sch, struct nlattr *opt)
 	sch->limit = q->params.limit;
 
 	setup_timer(&q->adapt_timer, pie_timer, (unsigned long)sch);
-	mod_timer(&q->adapt_timer, jiffies + HZ / 2);
 
 	if (opt) {
 		int err = pie_change(sch, opt);
@@ -454,6 +453,7 @@ static int pie_init(struct Qdisc *sch, struct nlattr *opt)
 			return err;
 	}
 
+	mod_timer(&q->adapt_timer, jiffies + HZ / 2);
 	return 0;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Paul E. McKenney @ 2014-10-24 23:05 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Yanko Kaneti, Josh Boyer, Eric W. Biederman, Cong Wang,
	Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org, mroos, tj
In-Reply-To: <20141024225931.GC4977@linux.vnet.ibm.com>

On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 24, 2014 at 03:34:07PM -0700, Jay Vosburgh wrote:
> > Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> > 
> > >On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> > >> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> > >> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> > >> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> > >
> > >[ . . . ]
> > >
> > >> > > > Well, if you are feeling aggressive, give the following patch a spin.
> > >> > > > I am doing sanity tests on it in the meantime.
> > >> > > 
> > >> > > Doesn't seem to make a difference here
> > >> > 
> > >> > OK, inspection isn't cutting it, so time for tracing.  Does the system
> > >> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
> > >> > the problem occurs, then dump the trace buffer after the problem occurs.
> > >> 
> > >> Sorry for being unresposive here, but I know next to nothing about tracing
> > >> or most things about the kernel, so I have some cathing up to do.
> > >> 
> > >> In the meantime some layman observations while I tried to find what exactly
> > >> triggers the problem.
> > >> - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
> > >> - libvirtd seems to be very active in using all sorts of kernel facilities
> > >>   that are modules on fedora so it seems to cause many simultaneous kworker 
> > >>   calls to modprobe
> > >> - there are 8 kworker/u16 from 0 to 7
> > >> - one of these kworkers always deadlocks, while there appear to be two
> > >>   kworker/u16:6 - the seventh
> > >
> > >Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
> > >
> > >>   6 vs 8 as in 6 rcuos where before they were always 8
> > >> 
> > >> Just observations from someone who still doesn't know what the u16
> > >> kworkers are..
> > >
> > >Could you please run the following diagnostic patch?  This will help
> > >me see if I have managed to miswire the rcuo kthreads.  It should
> > >print some information at task-hang time.
> > 
> > 	Here's the output of the patch; I let it sit through two hang
> > cycles.
> > 
> > 	-J
> > 
> > 
> > [  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
> > [  240.354878]       Not tainted 3.17.0-testola+ #4
> > [  240.359481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  240.367285] ovs-vswitchd    D ffff88013fc94600     0   902    901 0x00000004
> > [  240.367290]  ffff8800ab20f7b8 0000000000000002 ffff8800b3304b00 ffff8800ab20ffd8
> > [  240.367293]  0000000000014600 0000000000014600 ffff8800b0810000 ffff8800b3304b00
> > [  240.367296]  ffff8800b3304b00 ffffffff81c59850 ffffffff81c59858 7fffffffffffffff
> > [  240.367300] Call Trace:
> > [  240.367307]  [<ffffffff81722b99>] schedule+0x29/0x70
> > [  240.367310]  [<ffffffff81725b6c>] schedule_timeout+0x1dc/0x260
> > [  240.367313]  [<ffffffff81722f69>] ? _cond_resched+0x29/0x40
> > [  240.367316]  [<ffffffff81723818>] ? wait_for_completion+0x28/0x160
> > [  240.367321]  [<ffffffff811081a7>] ? queue_stop_cpus_work+0xc7/0xe0
> > [  240.367324]  [<ffffffff81723896>] wait_for_completion+0xa6/0x160
> > [  240.367328]  [<ffffffff81099980>] ? wake_up_state+0x20/0x20
> > [  240.367331]  [<ffffffff810d0ecc>] _rcu_barrier+0x20c/0x480
> > [  240.367334]  [<ffffffff810d1195>] rcu_barrier+0x15/0x20
> > [  240.367338]  [<ffffffff81625010>] netdev_run_todo+0x60/0x300
> > [  240.367341]  [<ffffffff8162f9ee>] rtnl_unlock+0xe/0x10
> > [  240.367349]  [<ffffffffa01ffcc5>] internal_dev_destroy+0x55/0x80 [openvswitch]
> > [  240.367354]  [<ffffffffa01ff622>] ovs_vport_del+0x32/0x40 [openvswitch]
> > [  240.367358]  [<ffffffffa01f8dd0>] ovs_dp_detach_port+0x30/0x40 [openvswitch]
> > [  240.367363]  [<ffffffffa01f8ea5>] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
> > [  240.367367]  [<ffffffff81651d75>] genl_family_rcv_msg+0x1a5/0x3c0
> > [  240.367370]  [<ffffffff81651f90>] ? genl_family_rcv_msg+0x3c0/0x3c0
> > [  240.367372]  [<ffffffff81652021>] genl_rcv_msg+0x91/0xd0
> > [  240.367376]  [<ffffffff81650091>] netlink_rcv_skb+0xc1/0xe0
> > [  240.367378]  [<ffffffff816505bc>] genl_rcv+0x2c/0x40
> > [  240.367381]  [<ffffffff8164f626>] netlink_unicast+0xf6/0x200
> > [  240.367383]  [<ffffffff8164fa4d>] netlink_sendmsg+0x31d/0x780
> > [  240.367387]  [<ffffffff8164ca74>] ? netlink_rcv_wake+0x44/0x60
> > [  240.367391]  [<ffffffff81606a53>] sock_sendmsg+0x93/0xd0
> > [  240.367395]  [<ffffffff81337700>] ? apparmor_capable+0x60/0x60
> > [  240.367399]  [<ffffffff81614f27>] ? verify_iovec+0x47/0xd0
> > [  240.367402]  [<ffffffff81606e79>] ___sys_sendmsg+0x399/0x3b0
> > [  240.367406]  [<ffffffff812598a2>] ? kernfs_seq_stop_active+0x32/0x40
> > [  240.367410]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> > [  240.367413]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> > [  240.367416]  [<ffffffff8101c3e9>] ? sched_clock+0x9/0x10
> > [  240.367420]  [<ffffffff811277fc>] ? acct_account_cputime+0x1c/0x20
> > [  240.367424]  [<ffffffff8109ce6b>] ? account_user_time+0x8b/0xa0
> > [  240.367428]  [<ffffffff81200bd5>] ? __fget_light+0x25/0x70
> > [  240.367431]  [<ffffffff81607c02>] __sys_sendmsg+0x42/0x80
> > [  240.367433]  [<ffffffff81607c52>] SyS_sendmsg+0x12/0x20
> > [  240.367436]  [<ffffffff81727464>] tracesys_phase2+0xd8/0xdd
> > [  240.367439] rcu_show_nocb_setup(): rcu_sched nocb state:
> > [  240.372734]   0: ffff88013fc0e600 l:ffff88013fc0e600 n:ffff88013fc8e600 .G.
> > [  240.379673]   1: ffff88013fc8e600 l:ffff88013fc0e600 n:          (null) .G.
> > [  240.386611]   2: ffff88013fd0e600 l:ffff88013fd0e600 n:ffff88013fd8e600 N..
> > [  240.393550]   3: ffff88013fd8e600 l:ffff88013fd0e600 n:          (null) N..
> > [  240.400489] rcu_show_nocb_setup(): rcu_bh nocb state:
> > [  240.405525]   0: ffff88013fc0e3c0 l:ffff88013fc0e3c0 n:ffff88013fc8e3c0 ...
> > [  240.412463]   1: ffff88013fc8e3c0 l:ffff88013fc0e3c0 n:          (null) ...
> > [  240.419401]   2: ffff88013fd0e3c0 l:ffff88013fd0e3c0 n:ffff88013fd8e3c0 ...
> > [  240.426339]   3: ffff88013fd8e3c0 l:ffff88013fd0e3c0 n:          (null) ...
> > [  360.432020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
> > [  360.438881]       Not tainted 3.17.0-testola+ #4
> > [  360.443484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  360.451289] ovs-vswitchd    D ffff88013fc94600     0   902    901 0x00000004
> > [  360.451293]  ffff8800ab20f7b8 0000000000000002 ffff8800b3304b00 ffff8800ab20ffd8
> > [  360.451297]  0000000000014600 0000000000014600 ffff8800b0810000 ffff8800b3304b00
> > [  360.451300]  ffff8800b3304b00 ffffffff81c59850 ffffffff81c59858 7fffffffffffffff
> > [  360.451303] Call Trace:
> > [  360.451311]  [<ffffffff81722b99>] schedule+0x29/0x70
> > [  360.451314]  [<ffffffff81725b6c>] schedule_timeout+0x1dc/0x260
> > [  360.451317]  [<ffffffff81722f69>] ? _cond_resched+0x29/0x40
> > [  360.451320]  [<ffffffff81723818>] ? wait_for_completion+0x28/0x160
> > [  360.451325]  [<ffffffff811081a7>] ? queue_stop_cpus_work+0xc7/0xe0
> > [  360.451327]  [<ffffffff81723896>] wait_for_completion+0xa6/0x160
> > [  360.451331]  [<ffffffff81099980>] ? wake_up_state+0x20/0x20
> > [  360.451335]  [<ffffffff810d0ecc>] _rcu_barrier+0x20c/0x480
> > [  360.451338]  [<ffffffff810d1195>] rcu_barrier+0x15/0x20
> > [  360.451342]  [<ffffffff81625010>] netdev_run_todo+0x60/0x300
> > [  360.451345]  [<ffffffff8162f9ee>] rtnl_unlock+0xe/0x10
> > [  360.451353]  [<ffffffffa01ffcc5>] internal_dev_destroy+0x55/0x80 [openvswitch]
> > [  360.451358]  [<ffffffffa01ff622>] ovs_vport_del+0x32/0x40 [openvswitch]
> > [  360.451362]  [<ffffffffa01f8dd0>] ovs_dp_detach_port+0x30/0x40 [openvswitch]
> > [  360.451366]  [<ffffffffa01f8ea5>] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
> > [  360.451370]  [<ffffffff81651d75>] genl_family_rcv_msg+0x1a5/0x3c0
> > [  360.451373]  [<ffffffff81651f90>] ? genl_family_rcv_msg+0x3c0/0x3c0
> > [  360.451376]  [<ffffffff81652021>] genl_rcv_msg+0x91/0xd0
> > [  360.451379]  [<ffffffff81650091>] netlink_rcv_skb+0xc1/0xe0
> > [  360.451381]  [<ffffffff816505bc>] genl_rcv+0x2c/0x40
> > [  360.451384]  [<ffffffff8164f626>] netlink_unicast+0xf6/0x200
> > [  360.451387]  [<ffffffff8164fa4d>] netlink_sendmsg+0x31d/0x780
> > [  360.451390]  [<ffffffff8164ca74>] ? netlink_rcv_wake+0x44/0x60
> > [  360.451394]  [<ffffffff81606a53>] sock_sendmsg+0x93/0xd0
> > [  360.451399]  [<ffffffff81337700>] ? apparmor_capable+0x60/0x60
> > [  360.451402]  [<ffffffff81614f27>] ? verify_iovec+0x47/0xd0
> > [  360.451406]  [<ffffffff81606e79>] ___sys_sendmsg+0x399/0x3b0
> > [  360.451410]  [<ffffffff812598a2>] ? kernfs_seq_stop_active+0x32/0x40
> > [  360.451414]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> > [  360.451417]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> > [  360.451419]  [<ffffffff8101c3e9>] ? sched_clock+0x9/0x10
> > [  360.451424]  [<ffffffff811277fc>] ? acct_account_cputime+0x1c/0x20
> > [  360.451427]  [<ffffffff8109ce6b>] ? account_user_time+0x8b/0xa0
> > [  360.451431]  [<ffffffff81200bd5>] ? __fget_light+0x25/0x70
> > [  360.451434]  [<ffffffff81607c02>] __sys_sendmsg+0x42/0x80
> > [  360.451437]  [<ffffffff81607c52>] SyS_sendmsg+0x12/0x20
> > [  360.451440]  [<ffffffff81727464>] tracesys_phase2+0xd8/0xdd
> > [  360.451442] rcu_show_nocb_setup(): rcu_sched nocb state:
> > [  360.456737]   0: ffff88013fc0e600 l:ffff88013fc0e600 n:ffff88013fc8e600 ...
> > [  360.463676]   1: ffff88013fc8e600 l:ffff88013fc0e600 n:          (null) ...
> > [  360.470614]   2: ffff88013fd0e600 l:ffff88013fd0e600 n:ffff88013fd8e600 N..
> > [  360.477554]   3: ffff88013fd8e600 l:ffff88013fd0e600 n:          (null) N..
> 
> Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
> need to take a hard look at the sleep/wakeup code.
> 
> Thank you for running this!!!

Could you please try the following patch?  If no joy, could you please
add rcu:rcu_nocb_wake to the list of ftrace events?

							Thanx, Paul

------------------------------------------------------------------------

rcu: Kick rcuo kthreads after their CPU goes offline

If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads.  This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 84b41b3c6ebd..f6880052b917 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
 	case CPU_DEAD_FROZEN:
 	case CPU_UP_CANCELED:
 	case CPU_UP_CANCELED_FROZEN:
-		for_each_rcu_flavor(rsp)
+		for_each_rcu_flavor(rsp) {
 			rcu_cleanup_dead_cpu(cpu, rsp);
+			do_nocb_deferred_wakeup(per_cpu_ptr(rsp->rda, cpu));
+		}
 		break;
 	default:
 		break;

^ permalink raw reply related

* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Paul E. McKenney @ 2014-10-24 22:59 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Yanko Kaneti, Josh Boyer, Eric W. Biederman, Cong Wang,
	Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org, mroos, tj
In-Reply-To: <8915.1414190047@famine>

On Fri, Oct 24, 2014 at 03:34:07PM -0700, Jay Vosburgh wrote:
> Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> 
> >On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
> >> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
> >> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
> >> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
> >
> >[ . . . ]
> >
> >> > > > Well, if you are feeling aggressive, give the following patch a spin.
> >> > > > I am doing sanity tests on it in the meantime.
> >> > > 
> >> > > Doesn't seem to make a difference here
> >> > 
> >> > OK, inspection isn't cutting it, so time for tracing.  Does the system
> >> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
> >> > the problem occurs, then dump the trace buffer after the problem occurs.
> >> 
> >> Sorry for being unresposive here, but I know next to nothing about tracing
> >> or most things about the kernel, so I have some cathing up to do.
> >> 
> >> In the meantime some layman observations while I tried to find what exactly
> >> triggers the problem.
> >> - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
> >> - libvirtd seems to be very active in using all sorts of kernel facilities
> >>   that are modules on fedora so it seems to cause many simultaneous kworker 
> >>   calls to modprobe
> >> - there are 8 kworker/u16 from 0 to 7
> >> - one of these kworkers always deadlocks, while there appear to be two
> >>   kworker/u16:6 - the seventh
> >
> >Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
> >
> >>   6 vs 8 as in 6 rcuos where before they were always 8
> >> 
> >> Just observations from someone who still doesn't know what the u16
> >> kworkers are..
> >
> >Could you please run the following diagnostic patch?  This will help
> >me see if I have managed to miswire the rcuo kthreads.  It should
> >print some information at task-hang time.
> 
> 	Here's the output of the patch; I let it sit through two hang
> cycles.
> 
> 	-J
> 
> 
> [  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
> [  240.354878]       Not tainted 3.17.0-testola+ #4
> [  240.359481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  240.367285] ovs-vswitchd    D ffff88013fc94600     0   902    901 0x00000004
> [  240.367290]  ffff8800ab20f7b8 0000000000000002 ffff8800b3304b00 ffff8800ab20ffd8
> [  240.367293]  0000000000014600 0000000000014600 ffff8800b0810000 ffff8800b3304b00
> [  240.367296]  ffff8800b3304b00 ffffffff81c59850 ffffffff81c59858 7fffffffffffffff
> [  240.367300] Call Trace:
> [  240.367307]  [<ffffffff81722b99>] schedule+0x29/0x70
> [  240.367310]  [<ffffffff81725b6c>] schedule_timeout+0x1dc/0x260
> [  240.367313]  [<ffffffff81722f69>] ? _cond_resched+0x29/0x40
> [  240.367316]  [<ffffffff81723818>] ? wait_for_completion+0x28/0x160
> [  240.367321]  [<ffffffff811081a7>] ? queue_stop_cpus_work+0xc7/0xe0
> [  240.367324]  [<ffffffff81723896>] wait_for_completion+0xa6/0x160
> [  240.367328]  [<ffffffff81099980>] ? wake_up_state+0x20/0x20
> [  240.367331]  [<ffffffff810d0ecc>] _rcu_barrier+0x20c/0x480
> [  240.367334]  [<ffffffff810d1195>] rcu_barrier+0x15/0x20
> [  240.367338]  [<ffffffff81625010>] netdev_run_todo+0x60/0x300
> [  240.367341]  [<ffffffff8162f9ee>] rtnl_unlock+0xe/0x10
> [  240.367349]  [<ffffffffa01ffcc5>] internal_dev_destroy+0x55/0x80 [openvswitch]
> [  240.367354]  [<ffffffffa01ff622>] ovs_vport_del+0x32/0x40 [openvswitch]
> [  240.367358]  [<ffffffffa01f8dd0>] ovs_dp_detach_port+0x30/0x40 [openvswitch]
> [  240.367363]  [<ffffffffa01f8ea5>] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
> [  240.367367]  [<ffffffff81651d75>] genl_family_rcv_msg+0x1a5/0x3c0
> [  240.367370]  [<ffffffff81651f90>] ? genl_family_rcv_msg+0x3c0/0x3c0
> [  240.367372]  [<ffffffff81652021>] genl_rcv_msg+0x91/0xd0
> [  240.367376]  [<ffffffff81650091>] netlink_rcv_skb+0xc1/0xe0
> [  240.367378]  [<ffffffff816505bc>] genl_rcv+0x2c/0x40
> [  240.367381]  [<ffffffff8164f626>] netlink_unicast+0xf6/0x200
> [  240.367383]  [<ffffffff8164fa4d>] netlink_sendmsg+0x31d/0x780
> [  240.367387]  [<ffffffff8164ca74>] ? netlink_rcv_wake+0x44/0x60
> [  240.367391]  [<ffffffff81606a53>] sock_sendmsg+0x93/0xd0
> [  240.367395]  [<ffffffff81337700>] ? apparmor_capable+0x60/0x60
> [  240.367399]  [<ffffffff81614f27>] ? verify_iovec+0x47/0xd0
> [  240.367402]  [<ffffffff81606e79>] ___sys_sendmsg+0x399/0x3b0
> [  240.367406]  [<ffffffff812598a2>] ? kernfs_seq_stop_active+0x32/0x40
> [  240.367410]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> [  240.367413]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> [  240.367416]  [<ffffffff8101c3e9>] ? sched_clock+0x9/0x10
> [  240.367420]  [<ffffffff811277fc>] ? acct_account_cputime+0x1c/0x20
> [  240.367424]  [<ffffffff8109ce6b>] ? account_user_time+0x8b/0xa0
> [  240.367428]  [<ffffffff81200bd5>] ? __fget_light+0x25/0x70
> [  240.367431]  [<ffffffff81607c02>] __sys_sendmsg+0x42/0x80
> [  240.367433]  [<ffffffff81607c52>] SyS_sendmsg+0x12/0x20
> [  240.367436]  [<ffffffff81727464>] tracesys_phase2+0xd8/0xdd
> [  240.367439] rcu_show_nocb_setup(): rcu_sched nocb state:
> [  240.372734]   0: ffff88013fc0e600 l:ffff88013fc0e600 n:ffff88013fc8e600 .G.
> [  240.379673]   1: ffff88013fc8e600 l:ffff88013fc0e600 n:          (null) .G.
> [  240.386611]   2: ffff88013fd0e600 l:ffff88013fd0e600 n:ffff88013fd8e600 N..
> [  240.393550]   3: ffff88013fd8e600 l:ffff88013fd0e600 n:          (null) N..
> [  240.400489] rcu_show_nocb_setup(): rcu_bh nocb state:
> [  240.405525]   0: ffff88013fc0e3c0 l:ffff88013fc0e3c0 n:ffff88013fc8e3c0 ...
> [  240.412463]   1: ffff88013fc8e3c0 l:ffff88013fc0e3c0 n:          (null) ...
> [  240.419401]   2: ffff88013fd0e3c0 l:ffff88013fd0e3c0 n:ffff88013fd8e3c0 ...
> [  240.426339]   3: ffff88013fd8e3c0 l:ffff88013fd0e3c0 n:          (null) ...
> [  360.432020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
> [  360.438881]       Not tainted 3.17.0-testola+ #4
> [  360.443484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  360.451289] ovs-vswitchd    D ffff88013fc94600     0   902    901 0x00000004
> [  360.451293]  ffff8800ab20f7b8 0000000000000002 ffff8800b3304b00 ffff8800ab20ffd8
> [  360.451297]  0000000000014600 0000000000014600 ffff8800b0810000 ffff8800b3304b00
> [  360.451300]  ffff8800b3304b00 ffffffff81c59850 ffffffff81c59858 7fffffffffffffff
> [  360.451303] Call Trace:
> [  360.451311]  [<ffffffff81722b99>] schedule+0x29/0x70
> [  360.451314]  [<ffffffff81725b6c>] schedule_timeout+0x1dc/0x260
> [  360.451317]  [<ffffffff81722f69>] ? _cond_resched+0x29/0x40
> [  360.451320]  [<ffffffff81723818>] ? wait_for_completion+0x28/0x160
> [  360.451325]  [<ffffffff811081a7>] ? queue_stop_cpus_work+0xc7/0xe0
> [  360.451327]  [<ffffffff81723896>] wait_for_completion+0xa6/0x160
> [  360.451331]  [<ffffffff81099980>] ? wake_up_state+0x20/0x20
> [  360.451335]  [<ffffffff810d0ecc>] _rcu_barrier+0x20c/0x480
> [  360.451338]  [<ffffffff810d1195>] rcu_barrier+0x15/0x20
> [  360.451342]  [<ffffffff81625010>] netdev_run_todo+0x60/0x300
> [  360.451345]  [<ffffffff8162f9ee>] rtnl_unlock+0xe/0x10
> [  360.451353]  [<ffffffffa01ffcc5>] internal_dev_destroy+0x55/0x80 [openvswitch]
> [  360.451358]  [<ffffffffa01ff622>] ovs_vport_del+0x32/0x40 [openvswitch]
> [  360.451362]  [<ffffffffa01f8dd0>] ovs_dp_detach_port+0x30/0x40 [openvswitch]
> [  360.451366]  [<ffffffffa01f8ea5>] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
> [  360.451370]  [<ffffffff81651d75>] genl_family_rcv_msg+0x1a5/0x3c0
> [  360.451373]  [<ffffffff81651f90>] ? genl_family_rcv_msg+0x3c0/0x3c0
> [  360.451376]  [<ffffffff81652021>] genl_rcv_msg+0x91/0xd0
> [  360.451379]  [<ffffffff81650091>] netlink_rcv_skb+0xc1/0xe0
> [  360.451381]  [<ffffffff816505bc>] genl_rcv+0x2c/0x40
> [  360.451384]  [<ffffffff8164f626>] netlink_unicast+0xf6/0x200
> [  360.451387]  [<ffffffff8164fa4d>] netlink_sendmsg+0x31d/0x780
> [  360.451390]  [<ffffffff8164ca74>] ? netlink_rcv_wake+0x44/0x60
> [  360.451394]  [<ffffffff81606a53>] sock_sendmsg+0x93/0xd0
> [  360.451399]  [<ffffffff81337700>] ? apparmor_capable+0x60/0x60
> [  360.451402]  [<ffffffff81614f27>] ? verify_iovec+0x47/0xd0
> [  360.451406]  [<ffffffff81606e79>] ___sys_sendmsg+0x399/0x3b0
> [  360.451410]  [<ffffffff812598a2>] ? kernfs_seq_stop_active+0x32/0x40
> [  360.451414]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> [  360.451417]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
> [  360.451419]  [<ffffffff8101c3e9>] ? sched_clock+0x9/0x10
> [  360.451424]  [<ffffffff811277fc>] ? acct_account_cputime+0x1c/0x20
> [  360.451427]  [<ffffffff8109ce6b>] ? account_user_time+0x8b/0xa0
> [  360.451431]  [<ffffffff81200bd5>] ? __fget_light+0x25/0x70
> [  360.451434]  [<ffffffff81607c02>] __sys_sendmsg+0x42/0x80
> [  360.451437]  [<ffffffff81607c52>] SyS_sendmsg+0x12/0x20
> [  360.451440]  [<ffffffff81727464>] tracesys_phase2+0xd8/0xdd
> [  360.451442] rcu_show_nocb_setup(): rcu_sched nocb state:
> [  360.456737]   0: ffff88013fc0e600 l:ffff88013fc0e600 n:ffff88013fc8e600 ...
> [  360.463676]   1: ffff88013fc8e600 l:ffff88013fc0e600 n:          (null) ...
> [  360.470614]   2: ffff88013fd0e600 l:ffff88013fd0e600 n:ffff88013fd8e600 N..
> [  360.477554]   3: ffff88013fd8e600 l:ffff88013fd0e600 n:          (null) N..

Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
need to take a hard look at the sleep/wakeup code.

Thank you for running this!!!

							Thanx, Paul

> [  360.484494] rcu_show_nocb_setup(): rcu_bh nocb state:
> [  360.489529]   0: ffff88013fc0e3c0 l:ffff88013fc0e3c0 n:ffff88013fc8e3c0 ...
> [  360.496469]   1: ffff88013fc8e3c0 l:ffff88013fc0e3c0 n:          (null) .G.
> [  360.503407]   2: ffff88013fd0e3c0 l:ffff88013fd0e3c0 n:ffff88013fd8e3c0 ...
> [  360.510346]   3: ffff88013fd8e3c0 l:ffff88013fd0e3c0 n:          (null) ...
> 
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply

* Re: [PATCH net-next 2/4] net: bcmgenet: return number of packets completed in TX reclaim
From: Florian Fainelli @ 2014-10-24 22:52 UTC (permalink / raw)
  To: Petri Gynther; +Cc: netdev, David Miller
In-Reply-To: <CAGXr9JHNRFsXXT+_tfM=ye5k5YwjmJ6s5YP+qaTdGMOLKZxPwA@mail.gmail.com>

On 10/24/2014 03:20 PM, Petri Gynther wrote:
> Hi Florian,
> 
> On Fri, Oct 24, 2014 at 1:02 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> In preparation for reclaiming transmitted buffers in NAPI context,
>> update __bcmgenet_tx_reclaim() and bcmgenet_tx_reclaim() to return the
>> number of packets completed per call.
>>
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>> ---
>>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 20 +++++++++++++-------
>>  1 file changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> index ee4d5baf09b6..70f2fb366375 100644
>> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
>> @@ -876,14 +876,14 @@ static inline void bcmgenet_tx_ring_int_disable(struct bcmgenet_priv *priv,
>>  }
>>
>>  /* Unlocked version of the reclaim routine */
>> -static void __bcmgenet_tx_reclaim(struct net_device *dev,
>> -                                 struct bcmgenet_tx_ring *ring)
>> +static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
>> +                                         struct bcmgenet_tx_ring *ring)
>>  {
>>         struct bcmgenet_priv *priv = netdev_priv(dev);
>>         int last_tx_cn, last_c_index, num_tx_bds;
>>         struct enet_cb *tx_cb_ptr;
>>         struct netdev_queue *txq;
>> -       unsigned int bds_compl;
>> +       unsigned int bds_compl, pkts_compl = 0;
> 
> bcmgenet_desc_rx() uses "rxpktprocessed", so I'd go with "txpktprocessed" here.

Sure.

> 
>>         unsigned int c_index;
>>
>>         /* Compute how many buffers are transmitted since last xmit call */
>> @@ -928,6 +928,7 @@ static void __bcmgenet_tx_reclaim(struct net_device *dev,
>>                 }
>>                 dev->stats.tx_packets++;
>>                 ring->free_bds += bds_compl;
>> +               pkts_compl += bds_compl;
> 
> This change doesn't look correct. The number of cleaned packets is not
> necessarily equal to the number of cleaned TxBDs.

Right, that should be a +1 increment, it does not matter because what we
want to know is 0 versus anything else.

> 
> I think that the while-loop should be:
> 
>         while (last_tx_cn-- > 0) {
>                 tx_cb_ptr = ring->cbs + last_c_index;
> 
>                 if (tx_cb_ptr->skb) {
>                         pkts_compl++;
>                         dev->stats.tx_packets++;
>                         dev->stats.tx_bytes += tx_cb_ptr->skb->len;
>                         dma_unmap_single(&dev->dev,
>                                          dma_unmap_addr(tx_cb_ptr, dma_addr),
>                                          tx_cb_ptr->skb->len,
>                                          DMA_TO_DEVICE);
>                         bcmgenet_free_cb(tx_cb_ptr);
>                 } else if (dma_unmap_addr(tx_cb_ptr, dma_addr)) {
>                         dev->stats.tx_bytes +=
>                                 dma_unmap_len(tx_cb_ptr, dma_len);
>                         dma_unmap_page(&dev->dev,
>                                        dma_unmap_addr(tx_cb_ptr, dma_addr),
>                                        dma_unmap_len(tx_cb_ptr, dma_len),
>                                        DMA_TO_DEVICE);
>                         dma_unmap_addr_set(tx_cb_ptr, dma_addr, 0);
>                 }
>                 ring->free_bds++;
>                 last_c_index++;
>                 last_c_index &= (num_tx_bds - 1);
>         }
> 
>>
>>                 last_c_index++;
>>                 last_c_index &= (num_tx_bds - 1);
>> @@ -936,20 +937,25 @@ static void __bcmgenet_tx_reclaim(struct net_device *dev,
>>         if (ring->free_bds > (MAX_SKB_FRAGS + 1))
>>                 ring->int_disable(priv, ring);
>>
>> -       if (netif_tx_queue_stopped(txq))
>> +       if (netif_tx_queue_stopped(txq) && pkts_compl)
> 
> bcmgenet_xmit() stops the Tx queue when:
> ring->free_bds <= (MAX_SKB_FRAGS + 1)
> 
> So, shouldn't this check be:
> netif_tx_queue_stopped(txq) && (ring->free_bds > (MAX_SKB_FRAGS + 1))
> 
> Having said that --
> Why does the driver stop the Tx queue when there are still
> (MAX_SKB_FRAGS + 1) TxBDs left?
> It leaves 17 or so TxBDs unused on the Tx ring, and most of the
> packets are not fragmented.

Right, this is something I am still trying to understand. One one hand
it does make sense to stop the queue on MAX_SKB_FRAGS + 1 just to be
safe and allow for a full-size fragment to be pushed on the next xmit()
call. But by the same token, bcmgenet_xmit() will check for the actual
fragment size early on.

I think the only one which is valid is the one at the end of
bcmgenet_xmit(), this one in bcmgenet_tx_reclaim() should just verify
that we have released some packets and that the queue was previous stopped.

> 
> I feel that bcmgenet_xmit() should stop the Tx queue only when there
> is 1 TxBD left after putting a new packet on the ring.
> 
>>                 netif_tx_wake_queue(txq);
>>
>>         ring->c_index = c_index;
>> +
>> +       return pkts_compl;
>>  }
>>
>> -static void bcmgenet_tx_reclaim(struct net_device *dev,
>> -                               struct bcmgenet_tx_ring *ring)
>> +static unsigned int bcmgenet_tx_reclaim(struct net_device *dev,
>> +                                       struct bcmgenet_tx_ring *ring)
>>  {
>>         unsigned long flags;
>> +       unsigned int pkts_compl;
>>
>>         spin_lock_irqsave(&ring->lock, flags);
>> -       __bcmgenet_tx_reclaim(dev, ring);
>> +       pkts_compl = __bcmgenet_tx_reclaim(dev, ring);
>>         spin_unlock_irqrestore(&ring->lock, flags);
>> +
>> +       return pkts_compl;
>>  }
>>
>>  static void bcmgenet_tx_reclaim_all(struct net_device *dev)
>> --
>> 1.9.1
>>

^ permalink raw reply

* Re: [PATCH v2 2/2] dsa: mv88e6171: Fix tagging protocol/Kconfig
From: Florian Fainelli @ 2014-10-24 22:45 UTC (permalink / raw)
  To: Andrew Lunn, davem; +Cc: netdev
In-Reply-To: <1414187045-8326-3-git-send-email-andrew@lunn.ch>

On 10/24/2014 02:44 PM, Andrew Lunn wrote:
> The mv88e6171 can support two different tagging protocols, DSA and
> EDSA. The switch driver structure only allows one protocol to be
> enumerated, and DSA was chosen. However the Kconfig entry ensures the
> EDSA tagging code is built. With a minimal configuration, we then end
> up with a mismatch. The probe is successful, EDSA tagging is used, but
> the switch is configured for DSA, resulting in mangled packets.
> 
> Change the switch driver structure to enumerate EDSA, fixing the
> mismatch.
> 
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> Fixes: 42f272539487 ("net: DSA: Marvell mv88e6171 switch driver")

Acked-by: Florian Fainelli <f.fainelli@gmail.com>

> ---
>  drivers/net/dsa/mv88e6171.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/dsa/mv88e6171.c b/drivers/net/dsa/mv88e6171.c
> index 1020a7af67cf..78d8e876f3aa 100644
> --- a/drivers/net/dsa/mv88e6171.c
> +++ b/drivers/net/dsa/mv88e6171.c
> @@ -395,7 +395,7 @@ static int mv88e6171_get_sset_count(struct dsa_switch *ds)
>  }
>  
>  struct dsa_switch_driver mv88e6171_switch_driver = {
> -	.tag_protocol		= DSA_TAG_PROTO_DSA,
> +	.tag_protocol		= DSA_TAG_PROTO_EDSA,
>  	.priv_size		= sizeof(struct mv88e6xxx_priv_state),
>  	.probe			= mv88e6171_probe,
>  	.setup			= mv88e6171_setup,
> 

^ permalink raw reply

* Re: [PATCH v2 1/2] net: dsa: Error out on tagging protocol mismatches
From: Florian Fainelli @ 2014-10-24 22:42 UTC (permalink / raw)
  To: Andrew Lunn, davem; +Cc: netdev, Alexander Duyck
In-Reply-To: <1414187045-8326-2-git-send-email-andrew@lunn.ch>

On 10/24/2014 02:44 PM, Andrew Lunn wrote:
> If there is a mismatch between enabled tagging protocols and the
> protocol the switch supports, error out, rather than continue with a
> situation which is unlikely to work.
> 
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Acked-by: Florian Fainelli <f.fainelli@gmail.com>

> cc: alexander.h.duyck@intel.com
> ---
> 
> v2: Handle the use case of DSA_TAG_PROTO_NONE
> 
>  net/dsa/dsa.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index 22f34cf4cb27..6317b41c99b0 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -174,8 +174,11 @@ dsa_switch_setup(struct dsa_switch_tree *dst, int index,
>  			dst->rcv = brcm_netdev_ops.rcv;
>  			break;
>  #endif
> -		default:
> +		case DSA_TAG_PROTO_NONE:
>  			break;
> +		default:
> +			ret = -ENOPROTOOPT;
> +			goto out;
>  		}
>  
>  		dst->tag_protocol = drv->tag_protocol;
> 

^ permalink raw reply

* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Jay Vosburgh @ 2014-10-24 22:41 UTC (permalink / raw)
  To: paulmck
  Cc: Yanko Kaneti, Josh Boyer, Eric W. Biederman, Cong Wang,
	Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org, mroos, tj
In-Reply-To: <20141024221602.GB4977@linux.vnet.ibm.com>

Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

>On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
>> 
[...]
>> 	I've got an ftrace capture from unmodified -net, it looks like
>> this:
>> 
>>     ovs-vswitchd-902   [000] ....   471.778441: rcu_barrier: rcu_sched Begin cpu -1 remaining 0 # 0
>>     ovs-vswitchd-902   [000] ....   471.778452: rcu_barrier: rcu_sched Check cpu -1 remaining 0 # 0
>>     ovs-vswitchd-902   [000] ....   471.778452: rcu_barrier: rcu_sched Inc1 cpu -1 remaining 0 # 1
>>     ovs-vswitchd-902   [000] ....   471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 0 remaining 1 # 1
>>     ovs-vswitchd-902   [000] ....   471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 1 remaining 2 # 1
>>     ovs-vswitchd-902   [000] ....   471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 2 remaining 3 # 1
>>     ovs-vswitchd-902   [000] ....   471.778454: rcu_barrier: rcu_sched OnlineNoCB cpu 3 remaining 4 # 1
>
>OK, so it looks like your system has four CPUs, and rcu_barrier() placed
>callbacks on them all.

	No, the system has only two CPUs.  It's an Intel Core 2 Duo
E8400, and /proc/cpuinfo agrees that there are only 2.  There is a
potentially relevant-sounding message early in dmesg that says:

[    0.000000] smpboot: Allowing 4 CPUs, 2 hotplug CPUs

>>     ovs-vswitchd-902   [000] ....   471.778454: rcu_barrier: rcu_sched Inc2 cpu -1 remaining 4 # 2
>
>The above removes the extra count used to avoid races between posting new
>callbacks and completion of previously posted callbacks.
>
>>          rcuos/0-9     [000] ..s.   471.793150: rcu_barrier: rcu_sched CB cpu -1 remaining 3 # 2
>>          rcuos/1-18    [001] ..s.   471.793308: rcu_barrier: rcu_sched CB cpu -1 remaining 2 # 2
>
>Two of the four callbacks fired, but the other two appear to be AWOL.
>And rcu_barrier() won't return until they all fire.
>
>> 	I let it sit through several "hung task" cycles but that was all
>> there was for rcu:rcu_barrier.
>> 
>> 	I should have ftrace with the patch as soon as the kernel is
>> done building, then I can try the below patch (I'll start it building
>> now).
>
>Sounds very good, looking forward to hearing of the results.

	Going to bounce it for ftrace now, but the cpu count mismatch
seemed important enough to mention separately.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* Re: [PATCH net-next 4/4] net: bcmgenet: update ring producer index and buffer count in xmit
From: Petri Gynther @ 2014-10-24 22:39 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, David Miller
In-Reply-To: <1414180942-2132-5-git-send-email-f.fainelli@gmail.com>

Looks good.

On Fri, Oct 24, 2014 at 1:02 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> There is no need to have both bcmgenet_xmit_single() and
> bcmgenet_xmit_frag() perform a free_bds decrement and a prod_index
> increment by one. In case one of these functions fails to map a SKB or
> fragment for transmit, we will return and exit bcmgenet_xmit() with an
> error.
>
> We can therefore safely use our local copy of nr_frags to know by how
> much we should decrement the number of free buffers available, and by
> how much the producer count must be incremented and do this in the tail
> of bcmgenet_xmit().
>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Acked-by: Petri Gynther <pgynther@google.com>

> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index d6f4a7ace05e..3cd8c25a1120 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -1012,11 +1012,6 @@ static int bcmgenet_xmit_single(struct net_device *dev,
>
>         dmadesc_set(priv, tx_cb_ptr->bd_addr, mapping, length_status);
>
> -       /* Decrement total BD count and advance our write pointer */
> -       ring->free_bds -= 1;
> -       ring->prod_index += 1;
> -       ring->prod_index &= DMA_P_INDEX_MASK;
> -
>         return 0;
>  }
>
> @@ -1054,11 +1049,6 @@ static int bcmgenet_xmit_frag(struct net_device *dev,
>                     (frag->size << DMA_BUFLENGTH_SHIFT) | dma_desc_flags |
>                     (priv->hw_params->qtag_mask << DMA_TX_QTAG_SHIFT));
>
> -
> -       ring->free_bds -= 1;
> -       ring->prod_index += 1;
> -       ring->prod_index &= DMA_P_INDEX_MASK;
> -
>         return 0;
>  }
>
> @@ -1202,9 +1192,11 @@ static netdev_tx_t bcmgenet_xmit(struct sk_buff *skb, struct net_device *dev)
>
>         skb_tx_timestamp(skb);
>
> -       /* we kept a software copy of how much we should advance the TDMA
> -        * producer index, now write it down to the hardware
> -        */
> +       /* Decrement total BD count and advance our write pointer */
> +       ring->free_bds -= nr_frags + 1;
> +       ring->prod_index += nr_frags + 1;
> +       ring->prod_index &= DMA_P_INDEX_MASK;
> +
>         bcmgenet_tdma_ring_writel(priv, ring->index,
>                                   ring->prod_index, TDMA_PROD_INDEX);
>
> --
> 1.9.1
>

^ permalink raw reply

* Re: localed stuck in recent 3.18 git in copy_net_ns?
From: Jay Vosburgh @ 2014-10-24 22:34 UTC (permalink / raw)
  To: paulmck
  Cc: Yanko Kaneti, Josh Boyer, Eric W. Biederman, Cong Wang,
	Kevin Fenzi, netdev, Linux-Kernel@Vger. Kernel. Org, mroos, tj
In-Reply-To: <20141024214927.GA4977@linux.vnet.ibm.com>

Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:

>On Sat, Oct 25, 2014 at 12:25:57AM +0300, Yanko Kaneti wrote:
>> On Fri-10/24/14-2014 11:32, Paul E. McKenney wrote:
>> > On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
>> > > On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
>
>[ . . . ]
>
>> > > > Well, if you are feeling aggressive, give the following patch a spin.
>> > > > I am doing sanity tests on it in the meantime.
>> > > 
>> > > Doesn't seem to make a difference here
>> > 
>> > OK, inspection isn't cutting it, so time for tracing.  Does the system
>> > respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
>> > the problem occurs, then dump the trace buffer after the problem occurs.
>> 
>> Sorry for being unresposive here, but I know next to nothing about tracing
>> or most things about the kernel, so I have some cathing up to do.
>> 
>> In the meantime some layman observations while I tried to find what exactly
>> triggers the problem.
>> - Even in runlevel 1 I can reliably trigger the problem by starting libvirtd
>> - libvirtd seems to be very active in using all sorts of kernel facilities
>>   that are modules on fedora so it seems to cause many simultaneous kworker 
>>   calls to modprobe
>> - there are 8 kworker/u16 from 0 to 7
>> - one of these kworkers always deadlocks, while there appear to be two
>>   kworker/u16:6 - the seventh
>
>Adding Tejun on CC in case this duplication of kworker/u16:6 is important.
>
>>   6 vs 8 as in 6 rcuos where before they were always 8
>> 
>> Just observations from someone who still doesn't know what the u16
>> kworkers are..
>
>Could you please run the following diagnostic patch?  This will help
>me see if I have managed to miswire the rcuo kthreads.  It should
>print some information at task-hang time.

	Here's the output of the patch; I let it sit through two hang
cycles.

	-J


[  240.348020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
[  240.354878]       Not tainted 3.17.0-testola+ #4
[  240.359481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.367285] ovs-vswitchd    D ffff88013fc94600     0   902    901 0x00000004
[  240.367290]  ffff8800ab20f7b8 0000000000000002 ffff8800b3304b00 ffff8800ab20ffd8
[  240.367293]  0000000000014600 0000000000014600 ffff8800b0810000 ffff8800b3304b00
[  240.367296]  ffff8800b3304b00 ffffffff81c59850 ffffffff81c59858 7fffffffffffffff
[  240.367300] Call Trace:
[  240.367307]  [<ffffffff81722b99>] schedule+0x29/0x70
[  240.367310]  [<ffffffff81725b6c>] schedule_timeout+0x1dc/0x260
[  240.367313]  [<ffffffff81722f69>] ? _cond_resched+0x29/0x40
[  240.367316]  [<ffffffff81723818>] ? wait_for_completion+0x28/0x160
[  240.367321]  [<ffffffff811081a7>] ? queue_stop_cpus_work+0xc7/0xe0
[  240.367324]  [<ffffffff81723896>] wait_for_completion+0xa6/0x160
[  240.367328]  [<ffffffff81099980>] ? wake_up_state+0x20/0x20
[  240.367331]  [<ffffffff810d0ecc>] _rcu_barrier+0x20c/0x480
[  240.367334]  [<ffffffff810d1195>] rcu_barrier+0x15/0x20
[  240.367338]  [<ffffffff81625010>] netdev_run_todo+0x60/0x300
[  240.367341]  [<ffffffff8162f9ee>] rtnl_unlock+0xe/0x10
[  240.367349]  [<ffffffffa01ffcc5>] internal_dev_destroy+0x55/0x80 [openvswitch]
[  240.367354]  [<ffffffffa01ff622>] ovs_vport_del+0x32/0x40 [openvswitch]
[  240.367358]  [<ffffffffa01f8dd0>] ovs_dp_detach_port+0x30/0x40 [openvswitch]
[  240.367363]  [<ffffffffa01f8ea5>] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
[  240.367367]  [<ffffffff81651d75>] genl_family_rcv_msg+0x1a5/0x3c0
[  240.367370]  [<ffffffff81651f90>] ? genl_family_rcv_msg+0x3c0/0x3c0
[  240.367372]  [<ffffffff81652021>] genl_rcv_msg+0x91/0xd0
[  240.367376]  [<ffffffff81650091>] netlink_rcv_skb+0xc1/0xe0
[  240.367378]  [<ffffffff816505bc>] genl_rcv+0x2c/0x40
[  240.367381]  [<ffffffff8164f626>] netlink_unicast+0xf6/0x200
[  240.367383]  [<ffffffff8164fa4d>] netlink_sendmsg+0x31d/0x780
[  240.367387]  [<ffffffff8164ca74>] ? netlink_rcv_wake+0x44/0x60
[  240.367391]  [<ffffffff81606a53>] sock_sendmsg+0x93/0xd0
[  240.367395]  [<ffffffff81337700>] ? apparmor_capable+0x60/0x60
[  240.367399]  [<ffffffff81614f27>] ? verify_iovec+0x47/0xd0
[  240.367402]  [<ffffffff81606e79>] ___sys_sendmsg+0x399/0x3b0
[  240.367406]  [<ffffffff812598a2>] ? kernfs_seq_stop_active+0x32/0x40
[  240.367410]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
[  240.367413]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
[  240.367416]  [<ffffffff8101c3e9>] ? sched_clock+0x9/0x10
[  240.367420]  [<ffffffff811277fc>] ? acct_account_cputime+0x1c/0x20
[  240.367424]  [<ffffffff8109ce6b>] ? account_user_time+0x8b/0xa0
[  240.367428]  [<ffffffff81200bd5>] ? __fget_light+0x25/0x70
[  240.367431]  [<ffffffff81607c02>] __sys_sendmsg+0x42/0x80
[  240.367433]  [<ffffffff81607c52>] SyS_sendmsg+0x12/0x20
[  240.367436]  [<ffffffff81727464>] tracesys_phase2+0xd8/0xdd
[  240.367439] rcu_show_nocb_setup(): rcu_sched nocb state:
[  240.372734]   0: ffff88013fc0e600 l:ffff88013fc0e600 n:ffff88013fc8e600 .G.
[  240.379673]   1: ffff88013fc8e600 l:ffff88013fc0e600 n:          (null) .G.
[  240.386611]   2: ffff88013fd0e600 l:ffff88013fd0e600 n:ffff88013fd8e600 N..
[  240.393550]   3: ffff88013fd8e600 l:ffff88013fd0e600 n:          (null) N..
[  240.400489] rcu_show_nocb_setup(): rcu_bh nocb state:
[  240.405525]   0: ffff88013fc0e3c0 l:ffff88013fc0e3c0 n:ffff88013fc8e3c0 ...
[  240.412463]   1: ffff88013fc8e3c0 l:ffff88013fc0e3c0 n:          (null) ...
[  240.419401]   2: ffff88013fd0e3c0 l:ffff88013fd0e3c0 n:ffff88013fd8e3c0 ...
[  240.426339]   3: ffff88013fd8e3c0 l:ffff88013fd0e3c0 n:          (null) ...
[  360.432020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
[  360.438881]       Not tainted 3.17.0-testola+ #4
[  360.443484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.451289] ovs-vswitchd    D ffff88013fc94600     0   902    901 0x00000004
[  360.451293]  ffff8800ab20f7b8 0000000000000002 ffff8800b3304b00 ffff8800ab20ffd8
[  360.451297]  0000000000014600 0000000000014600 ffff8800b0810000 ffff8800b3304b00
[  360.451300]  ffff8800b3304b00 ffffffff81c59850 ffffffff81c59858 7fffffffffffffff
[  360.451303] Call Trace:
[  360.451311]  [<ffffffff81722b99>] schedule+0x29/0x70
[  360.451314]  [<ffffffff81725b6c>] schedule_timeout+0x1dc/0x260
[  360.451317]  [<ffffffff81722f69>] ? _cond_resched+0x29/0x40
[  360.451320]  [<ffffffff81723818>] ? wait_for_completion+0x28/0x160
[  360.451325]  [<ffffffff811081a7>] ? queue_stop_cpus_work+0xc7/0xe0
[  360.451327]  [<ffffffff81723896>] wait_for_completion+0xa6/0x160
[  360.451331]  [<ffffffff81099980>] ? wake_up_state+0x20/0x20
[  360.451335]  [<ffffffff810d0ecc>] _rcu_barrier+0x20c/0x480
[  360.451338]  [<ffffffff810d1195>] rcu_barrier+0x15/0x20
[  360.451342]  [<ffffffff81625010>] netdev_run_todo+0x60/0x300
[  360.451345]  [<ffffffff8162f9ee>] rtnl_unlock+0xe/0x10
[  360.451353]  [<ffffffffa01ffcc5>] internal_dev_destroy+0x55/0x80 [openvswitch]
[  360.451358]  [<ffffffffa01ff622>] ovs_vport_del+0x32/0x40 [openvswitch]
[  360.451362]  [<ffffffffa01f8dd0>] ovs_dp_detach_port+0x30/0x40 [openvswitch]
[  360.451366]  [<ffffffffa01f8ea5>] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
[  360.451370]  [<ffffffff81651d75>] genl_family_rcv_msg+0x1a5/0x3c0
[  360.451373]  [<ffffffff81651f90>] ? genl_family_rcv_msg+0x3c0/0x3c0
[  360.451376]  [<ffffffff81652021>] genl_rcv_msg+0x91/0xd0
[  360.451379]  [<ffffffff81650091>] netlink_rcv_skb+0xc1/0xe0
[  360.451381]  [<ffffffff816505bc>] genl_rcv+0x2c/0x40
[  360.451384]  [<ffffffff8164f626>] netlink_unicast+0xf6/0x200
[  360.451387]  [<ffffffff8164fa4d>] netlink_sendmsg+0x31d/0x780
[  360.451390]  [<ffffffff8164ca74>] ? netlink_rcv_wake+0x44/0x60
[  360.451394]  [<ffffffff81606a53>] sock_sendmsg+0x93/0xd0
[  360.451399]  [<ffffffff81337700>] ? apparmor_capable+0x60/0x60
[  360.451402]  [<ffffffff81614f27>] ? verify_iovec+0x47/0xd0
[  360.451406]  [<ffffffff81606e79>] ___sys_sendmsg+0x399/0x3b0
[  360.451410]  [<ffffffff812598a2>] ? kernfs_seq_stop_active+0x32/0x40
[  360.451414]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
[  360.451417]  [<ffffffff8101c385>] ? native_sched_clock+0x35/0x90
[  360.451419]  [<ffffffff8101c3e9>] ? sched_clock+0x9/0x10
[  360.451424]  [<ffffffff811277fc>] ? acct_account_cputime+0x1c/0x20
[  360.451427]  [<ffffffff8109ce6b>] ? account_user_time+0x8b/0xa0
[  360.451431]  [<ffffffff81200bd5>] ? __fget_light+0x25/0x70
[  360.451434]  [<ffffffff81607c02>] __sys_sendmsg+0x42/0x80
[  360.451437]  [<ffffffff81607c52>] SyS_sendmsg+0x12/0x20
[  360.451440]  [<ffffffff81727464>] tracesys_phase2+0xd8/0xdd
[  360.451442] rcu_show_nocb_setup(): rcu_sched nocb state:
[  360.456737]   0: ffff88013fc0e600 l:ffff88013fc0e600 n:ffff88013fc8e600 ...
[  360.463676]   1: ffff88013fc8e600 l:ffff88013fc0e600 n:          (null) ...
[  360.470614]   2: ffff88013fd0e600 l:ffff88013fd0e600 n:ffff88013fd8e600 N..
[  360.477554]   3: ffff88013fd8e600 l:ffff88013fd0e600 n:          (null) N..
[  360.484494] rcu_show_nocb_setup(): rcu_bh nocb state:
[  360.489529]   0: ffff88013fc0e3c0 l:ffff88013fc0e3c0 n:ffff88013fc8e3c0 ...
[  360.496469]   1: ffff88013fc8e3c0 l:ffff88013fc0e3c0 n:          (null) .G.
[  360.503407]   2: ffff88013fd0e3c0 l:ffff88013fd0e3c0 n:ffff88013fd8e3c0 ...
[  360.510346]   3: ffff88013fd8e3c0 l:ffff88013fd0e3c0 n:          (null) ...

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* Re: [PATCH net-next 3/4] net: bcmgenet: reclaim transmitted buffers in NAPI context
From: Petri Gynther @ 2014-10-24 22:30 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, David Miller
In-Reply-To: <1414180942-2132-4-git-send-email-f.fainelli@gmail.com>

Hi Florian,

On Fri, Oct 24, 2014 at 1:02 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> The GENET driver is currently reclaiming transmitted buffers from hard
> interrupt context in bcmgenet_isr0 as well as NAPI context in
> bcmgenet_poll, which is not consistent and not ideal. Instead, update
> the driver to reclaim transmitted buffers in NAPI context only and
> properly switch the TX path to use interrupt mitigation based on NAPI.
>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 29 +++++++++-----------------
>  1 file changed, 10 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index 70f2fb366375..d6f4a7ace05e 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -934,9 +934,6 @@ static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
>                 last_c_index &= (num_tx_bds - 1);
>         }
>
> -       if (ring->free_bds > (MAX_SKB_FRAGS + 1))
> -               ring->int_disable(priv, ring);
> -
>         if (netif_tx_queue_stopped(txq) && pkts_compl)
>                 netif_tx_wake_queue(txq);
>
> @@ -1211,10 +1208,8 @@ static netdev_tx_t bcmgenet_xmit(struct sk_buff *skb, struct net_device *dev)
>         bcmgenet_tdma_ring_writel(priv, ring->index,
>                                   ring->prod_index, TDMA_PROD_INDEX);
>
> -       if (ring->free_bds <= (MAX_SKB_FRAGS + 1)) {
> +       if (ring->free_bds <= (MAX_SKB_FRAGS + 1))

Like I mentioned in patch 2/4 comments, I feel that bcmgenet_xmit()
stops the Tx queue too early.
Too many TxBDs sit unused that non-fragmented frames could use.
I think this should be: if (ring->free_bds <= 1)


>                 netif_tx_stop_queue(txq);
> -               ring->int_enable(priv, ring);
> -       }
>
>  out:
>         spin_unlock_irqrestore(&ring->lock, flags);
> @@ -1553,9 +1548,9 @@ static int init_umac(struct bcmgenet_priv *priv)
>
>         bcmgenet_intr_disable(priv);
>
> -       cpu_mask_clear = UMAC_IRQ_RXDMA_MASK;
> +       cpu_mask_clear = UMAC_IRQ_RX_TX_MASK;
>
> -       dev_dbg(kdev, "%s:Enabling RXDMA interrupts\n", __func__);
> +       dev_dbg(kdev, "%s:Enabling RXDMA & TXDMA interrupts\n", __func__);
>
>         /* Monitor cable plug/unplugged event for internal PHY */
>         if (phy_is_internal(priv->phydev)) {
> @@ -1879,10 +1874,10 @@ static int bcmgenet_poll(struct napi_struct *napi, int budget)
>  {
>         struct bcmgenet_priv *priv = container_of(napi,
>                         struct bcmgenet_priv, napi);
> -       unsigned int work_done;
> +       unsigned int work_done, tx_work;
>
>         /* tx reclaim */
> -       bcmgenet_tx_reclaim(priv->dev, &priv->tx_rings[DESC_INDEX]);
> +       tx_work = bcmgenet_tx_reclaim(priv->dev, &priv->tx_rings[DESC_INDEX]);
>
>         work_done = bcmgenet_desc_rx(priv, budget);
>
> @@ -1891,9 +1886,9 @@ static int bcmgenet_poll(struct napi_struct *napi, int budget)
>         priv->rx_c_index &= DMA_C_INDEX_MASK;
>         bcmgenet_rdma_ring_writel(priv, DESC_INDEX,
>                                   priv->rx_c_index, RDMA_CONS_INDEX);
> -       if (work_done < budget) {
> +       if (work_done < budget || tx_work == 0) {

Imagine interface with a lot of Rx traffic but no Tx (e.g. UDP receiver).
For that case, work_done == budget and tx_work == 0.

Your change ends up completing NAPI when it shouldn't.

>                 napi_complete(napi);
> -               bcmgenet_intrl2_0_writel(priv, UMAC_IRQ_RXDMA_MASK,
> +               bcmgenet_intrl2_0_writel(priv, UMAC_IRQ_RX_TX_MASK,
>                                          INTRL2_CPU_MASK_CLEAR);
>         }
>
> @@ -1968,22 +1963,18 @@ static irqreturn_t bcmgenet_isr0(int irq, void *dev_id)
>         netif_dbg(priv, intr, priv->dev,
>                   "IRQ=0x%x\n", priv->irq0_stat);
>
> -       if (priv->irq0_stat & UMAC_IRQ_RXDMA_MASK) {
> +       if (priv->irq0_stat & UMAC_IRQ_RX_TX_MASK) {
>                 /* We use NAPI(software interrupt throttling, if
>                  * Rx Descriptor throttling is not used.
>                  * Disable interrupt, will be enabled in the poll method.
>                  */
>                 if (likely(napi_schedule_prep(&priv->napi))) {
> -                       bcmgenet_intrl2_0_writel(priv, UMAC_IRQ_RXDMA_MASK,
> +                       bcmgenet_intrl2_0_writel(priv, UMAC_IRQ_RX_TX_MASK,
>                                                  INTRL2_CPU_MASK_SET);
>                         __napi_schedule(&priv->napi);
>                 }
>         }
> -       if (priv->irq0_stat &
> -                       (UMAC_IRQ_TXDMA_BDONE | UMAC_IRQ_TXDMA_PDONE)) {
> -               /* Tx reclaim */
> -               bcmgenet_tx_reclaim(priv->dev, &priv->tx_rings[DESC_INDEX]);
> -       }
> +
>         if (priv->irq0_stat & (UMAC_IRQ_PHY_DET_R |
>                                 UMAC_IRQ_PHY_DET_F |
>                                 UMAC_IRQ_LINK_UP |
> --
> 1.9.1
>

^ permalink raw reply

* Re: [PATCH net-next 2/4] net: bcmgenet: return number of packets completed in TX reclaim
From: Petri Gynther @ 2014-10-24 22:20 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, David Miller
In-Reply-To: <1414180942-2132-3-git-send-email-f.fainelli@gmail.com>

Hi Florian,

On Fri, Oct 24, 2014 at 1:02 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> In preparation for reclaiming transmitted buffers in NAPI context,
> update __bcmgenet_tx_reclaim() and bcmgenet_tx_reclaim() to return the
> number of packets completed per call.
>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 20 +++++++++++++-------
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index ee4d5baf09b6..70f2fb366375 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -876,14 +876,14 @@ static inline void bcmgenet_tx_ring_int_disable(struct bcmgenet_priv *priv,
>  }
>
>  /* Unlocked version of the reclaim routine */
> -static void __bcmgenet_tx_reclaim(struct net_device *dev,
> -                                 struct bcmgenet_tx_ring *ring)
> +static unsigned int __bcmgenet_tx_reclaim(struct net_device *dev,
> +                                         struct bcmgenet_tx_ring *ring)
>  {
>         struct bcmgenet_priv *priv = netdev_priv(dev);
>         int last_tx_cn, last_c_index, num_tx_bds;
>         struct enet_cb *tx_cb_ptr;
>         struct netdev_queue *txq;
> -       unsigned int bds_compl;
> +       unsigned int bds_compl, pkts_compl = 0;

bcmgenet_desc_rx() uses "rxpktprocessed", so I'd go with "txpktprocessed" here.

>         unsigned int c_index;
>
>         /* Compute how many buffers are transmitted since last xmit call */
> @@ -928,6 +928,7 @@ static void __bcmgenet_tx_reclaim(struct net_device *dev,
>                 }
>                 dev->stats.tx_packets++;
>                 ring->free_bds += bds_compl;
> +               pkts_compl += bds_compl;

This change doesn't look correct. The number of cleaned packets is not
necessarily equal to the number of cleaned TxBDs.

I think that the while-loop should be:

        while (last_tx_cn-- > 0) {
                tx_cb_ptr = ring->cbs + last_c_index;

                if (tx_cb_ptr->skb) {
                        pkts_compl++;
                        dev->stats.tx_packets++;
                        dev->stats.tx_bytes += tx_cb_ptr->skb->len;
                        dma_unmap_single(&dev->dev,
                                         dma_unmap_addr(tx_cb_ptr, dma_addr),
                                         tx_cb_ptr->skb->len,
                                         DMA_TO_DEVICE);
                        bcmgenet_free_cb(tx_cb_ptr);
                } else if (dma_unmap_addr(tx_cb_ptr, dma_addr)) {
                        dev->stats.tx_bytes +=
                                dma_unmap_len(tx_cb_ptr, dma_len);
                        dma_unmap_page(&dev->dev,
                                       dma_unmap_addr(tx_cb_ptr, dma_addr),
                                       dma_unmap_len(tx_cb_ptr, dma_len),
                                       DMA_TO_DEVICE);
                        dma_unmap_addr_set(tx_cb_ptr, dma_addr, 0);
                }
                ring->free_bds++;
                last_c_index++;
                last_c_index &= (num_tx_bds - 1);
        }

>
>                 last_c_index++;
>                 last_c_index &= (num_tx_bds - 1);
> @@ -936,20 +937,25 @@ static void __bcmgenet_tx_reclaim(struct net_device *dev,
>         if (ring->free_bds > (MAX_SKB_FRAGS + 1))
>                 ring->int_disable(priv, ring);
>
> -       if (netif_tx_queue_stopped(txq))
> +       if (netif_tx_queue_stopped(txq) && pkts_compl)

bcmgenet_xmit() stops the Tx queue when:
ring->free_bds <= (MAX_SKB_FRAGS + 1)

So, shouldn't this check be:
netif_tx_queue_stopped(txq) && (ring->free_bds > (MAX_SKB_FRAGS + 1))

Having said that --
Why does the driver stop the Tx queue when there are still
(MAX_SKB_FRAGS + 1) TxBDs left?
It leaves 17 or so TxBDs unused on the Tx ring, and most of the
packets are not fragmented.

I feel that bcmgenet_xmit() should stop the Tx queue only when there
is 1 TxBD left after putting a new packet on the ring.

>                 netif_tx_wake_queue(txq);
>
>         ring->c_index = c_index;
> +
> +       return pkts_compl;
>  }
>
> -static void bcmgenet_tx_reclaim(struct net_device *dev,
> -                               struct bcmgenet_tx_ring *ring)
> +static unsigned int bcmgenet_tx_reclaim(struct net_device *dev,
> +                                       struct bcmgenet_tx_ring *ring)
>  {
>         unsigned long flags;
> +       unsigned int pkts_compl;
>
>         spin_lock_irqsave(&ring->lock, flags);
> -       __bcmgenet_tx_reclaim(dev, ring);
> +       pkts_compl = __bcmgenet_tx_reclaim(dev, ring);
>         spin_unlock_irqrestore(&ring->lock, flags);
> +
> +       return pkts_compl;
>  }
>
>  static void bcmgenet_tx_reclaim_all(struct net_device *dev)
> --
> 1.9.1
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox