* [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
@ 2008-08-11 20:53 Jarek Poplawski
2008-08-12 1:12 ` David Miller
2008-08-12 22:02 ` [PATCH take 2] pkt_sched: Protect gen estimators under est_lock Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-11 20:53 UTC (permalink / raw)
To: David Miller; +Cc: netdev
pkt_sched: Destroy gen estimators under rtnl_lock().
gen_kill_estimator() requires rtnl_lock() protection, and since it is
called in qdisc ->destroy() too, this has to go back from RCU callback
to qdisc_destroy().
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
net/sched/sch_generic.c | 17 +++++++++--------
1 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7cf83b3..336dc88 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -524,18 +524,10 @@ EXPORT_SYMBOL(qdisc_reset);
 static void __qdisc_destroy(struct rcu_head *head)
 {
 	struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
-	const struct Qdisc_ops *ops = qdisc->ops;
 
 #ifdef CONFIG_NET_SCHED
 	qdisc_put_stab(qdisc->stab);
 #endif
-	gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
-	if (ops->reset)
-		ops->reset(qdisc);
-	if (ops->destroy)
-		ops->destroy(qdisc);
-
-	module_put(ops->owner);
 	dev_put(qdisc_dev(qdisc));
 
 	kfree_skb(qdisc->gso_skb);
@@ -547,6 +539,8 @@ static void __qdisc_destroy(struct rcu_head *head)
 
 void qdisc_destroy(struct Qdisc *qdisc)
 {
+	const struct Qdisc_ops *ops = qdisc->ops;
+
 	if (qdisc->flags & TCQ_F_BUILTIN ||
 	    !atomic_dec_and_test(&qdisc->refcnt))
 		return;
@@ -554,6 +548,13 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	if (qdisc->parent)
 		list_del(&qdisc->list);
 
+	gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
+	if (ops->reset)
+		ops->reset(qdisc);
+	if (ops->destroy)
+		ops->destroy(qdisc);
+
+	module_put(ops->owner);
 	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
 }
 EXPORT_SYMBOL(qdisc_destroy);
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-11 20:53 [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock() Jarek Poplawski
@ 2008-08-12 1:12 ` David Miller
2008-08-12 5:20 ` Jarek Poplawski
2008-08-12 22:02 ` [PATCH take 2] pkt_sched: Protect gen estimators under est_lock Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-12 1:12 UTC (permalink / raw)
To: jarkao2; +Cc: netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 11 Aug 2008 22:53:57 +0200
> pkt_sched: Destroy gen estimators under rtnl_lock().
>
> gen_kill_estimator() requires rtnl_lock() protection, and since it is
> called in qdisc ->destroy() too, this has to go back from RCU callback
> to qdisc_destroy().
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
We can't do this. And at a minimum, the final ->reset() must
occur in the RCU callback, otherwise asynchronous threads of
execution could queue packets into this dying qdisc and
such packets would leak forever.
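To make the leak concrete, a rough timeline of the race being described
(an illustrative sketch using names from this thread, not code from any
patch here):

	/* 1) The control path publishes the replacement qdisc:	*/
	dev_queue->qdisc = &noop_qdisc;
	qdisc_destroy(old);	/* queues __qdisc_destroy() via call_rcu() */

	/* 2) A late sender that did q = rcu_dereference(dev_queue->qdisc)
	 *    before step 1 may still run q->enqueue(skb, q) against the
	 *    old qdisc until its RCU read-side section ends.		*/

	/* 3) Only after the grace period does __qdisc_destroy() run
	 *    ops->reset(old), freeing any such late skb; running
	 *    ->reset() earlier, outside the callback, would leak it.	*/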
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-12 1:12 ` David Miller
@ 2008-08-12 5:20 ` Jarek Poplawski
2008-08-12 5:40 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-12 5:20 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, Aug 11, 2008 at 06:12:35PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 11 Aug 2008 22:53:57 +0200
>
> > pkt_sched: Destroy gen estimators under rtnl_lock().
> >
> > gen_kill_estimator() requires rtnl_lock() protection, and since it is
> > called in qdisc ->destroy() too, this has to go back from RCU callback
> > to qdisc_destroy().
> >
> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> We can't do this. And at a minimum, the final ->reset() must
> occur in the RCU callback, otherwise asynchronous threads of
> execution could queue packets into this dying qdisc and
> such packets would leak forever.
Could you explain this more? I thought this synchronize_rcu() is there
precisely to prevent this (and isn't that what these comments talk
about?):
void dev_deactivate(struct net_device *dev)
{
	bool running;

	netdev_for_each_tx_queue(dev, dev_deactivate_queue, &noop_qdisc);
	dev_deactivate_queue(dev, &dev->rx_queue, &noop_qdisc);

	dev_watchdog_down(dev);

	/* Wait for outstanding qdisc-less dev_queue_xmit calls. */
	synchronize_rcu();

	do {
		while (some_qdisc_is_running(dev, 0))
			yield();

		/*
		 * Double-check inside queue lock to ensure that all effects
		 * of the queue run are visible when we return.
		 */
		running = some_qdisc_is_running(dev, 1);

		/*
		 * The running flag should never be set at this point because
		 * we've already set dev->qdisc to noop_qdisc *inside* the same
		 * pair of spin locks. That is, if any qdisc_run starts after
		 * our initial test it should see the noop_qdisc and then
		 * clear the RUNNING bit before dropping the queue lock. So
		 * if it is set here then we've found a bug.
		 */
	} while (WARN_ON_ONCE(running));
}
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-12 5:20 ` Jarek Poplawski
@ 2008-08-12 5:40 ` David Miller
2008-08-12 7:00 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-12 5:40 UTC (permalink / raw)
To: jarkao2; +Cc: netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 12 Aug 2008 05:20:48 +0000
> On Mon, Aug 11, 2008 at 06:12:35PM -0700, David Miller wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Mon, 11 Aug 2008 22:53:57 +0200
> >
> > > pkt_sched: Destroy gen estimators under rtnl_lock().
> > >
> > > gen_kill_estimator() requires rtnl_lock() protection, and since it is
> > > called in qdisc ->destroy() too, this has to go back from RCU callback
> > > to qdisc_destroy().
> > >
> > > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> >
> > We can't do this. And at a minimum, the final ->reset() must
> > occur in the RCU callback, otherwise asynchronous threads of
> > execution could queue packets into this dying qdisc and
> > such packets would leak forever.
>
> Could you explain this more? I thought this synchronize_rcu() is
> there precisely to prevent this (and isn't that what these comments
> talk about?):
Those comments are out of date and I need to update them.
In fact this whole loop is now largely pointless.
The rcu_dereference() on dev_queue->qdisc happens before the
QDISC_RUNNING bit is set.
We no longer resample the qdisc under any kind of lock, because we no
longer have a top-level lock that synchronizes the setting of
dev_queue->qdisc.
Rather, the lock we use for calling ->enqueue() and ->dequeue() is
inside of the root qdisc itself.
That's why all of the real destruction has to occur in the RCU handler.
Anyways, this is part of the problem I think is causing the crash the
Intel folks are triggering.
We sample the qdisc in dev_queue_xmit() or wherever, then we attach
that to the per-cpu ->output_queue to process it via qdisc_run()
in the software interrupt handler.
The RCU quiescent period extends to the next scheduling point, and this
is enough if we do normal direct softirq processing of this qdisc.
But if it gets postponed into ksoftirqd... the grace period will pass
too early.
I'm still thinking about how to fix this without avoiding RCU
and without adding new synchronization primitives.
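For reference, the path being described -- parking the qdisc pointer on
the per-cpu ->output_queue -- looked roughly like this in the dev.c of
the time (abridged from the very code a patch later in this thread
touches):

	void __netif_schedule(struct Qdisc *q)
	{
		if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
			struct softnet_data *sd;
			unsigned long flags;

			local_irq_save(flags);
			sd = &__get_cpu_var(softnet_data);
			q->next_sched = sd->output_queue;
			sd->output_queue = q;	/* qdisc pointer parked here */
			raise_softirq_irqoff(NET_TX_SOFTIRQ);
			local_irq_restore(flags);
		}
	}

The pointer is consumed later by net_tx_action(), possibly from
ksoftirqd, i.e. after a scheduling point -- and hence possibly after
the RCU grace period has already ended.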
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-12 5:40 ` David Miller
@ 2008-08-12 7:00 ` Jarek Poplawski
2008-08-12 8:15 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-12 7:00 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, Aug 11, 2008 at 10:40:47PM -0700, David Miller wrote:
...
> Those comments are out of date and I need to update them.
> In fact this whole loop is now largely pointless.
>
> The rcu_dereference() on dev_queue->qdisc happens before the
> QDISC_RUNNING bit is set.
>
> We no longer resample the qdisc under any kind of lock, because we no
> longer have a top-level lock that synchronizes the setting of
> dev_queue->qdisc.
>
> Rather, the lock we use for calling ->enqueue() and ->dequeue() is
> inside of the root qdisc itself.
>
> That's why all of the real destruction has to occur in the RCU handler.
>
> Anyways, this is part of the problem I think is causing the crash the
> Intel folks are triggering.
>
> We sample the qdisc in dev_queue_xmit() or wherever, then we attach
> that to the per-cpu ->output_queue to process it via qdisc_run()
> in the software interrupt handler.
>
> The RCU quiescent period extends to the next scheduling point, and
> this is enough if we do normal direct softirq processing of this
> qdisc.
>
> But if it gets postponed into ksoftirqd... the grace period will pass
> too early.
>
> I'm still thinking about how to fix this without avoiding RCU
> and without adding new synchronization primitives.
Of course I must be missing something, but I still don't get it: after
synchronize_rcu() in dev_deactivate() we are sure anyone in the
dev_queue_xmit() RCU block has to see the change to noop_qdisc,
so it can only lose packets and not really ->enqueue() them. IMHO the
only problem is this __netif_schedule(), which could be done with
dev_queues instead of Qdiscs, with proper dereferencing there.
(BTW, I think we need rcu_read_lock() instead of the _bh() version in
dev_queue_xmit() to match this with call_rcu() or synchronize_rcu().)
Thanks,
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-12 7:00 ` Jarek Poplawski
@ 2008-08-12 8:15 ` David Miller
2008-08-12 10:38 ` Jarek Poplawski
2008-08-13 4:30 ` Herbert Xu
0 siblings, 2 replies; 209+ messages in thread
From: David Miller @ 2008-08-12 8:15 UTC (permalink / raw)
To: jarkao2; +Cc: netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 12 Aug 2008 07:00:05 +0000
> Of course I must be missing something, but I still don't get it: after
> synchronize_rcu() in dev_deactivate() we are sure anyone in the
> dev_queue_xmit() RCU block has to see the change to noop_qdisc,
> so it can only lose packets and not really ->enqueue() them.
The qdisc pointer traverses to the softirq handler, which can be run
in a process context (via ksoftirqd), and this pointer gets there
via the per-cpu ->output_queue.
> IMHO the only problem is this __netif_schedule(), which could be
> done with dev_queues instead of Qdiscs, with proper dereferencing
> there. (BTW, I think we need rcu_read_lock() instead of the _bh()
> version in dev_queue_xmit() to match this with call_rcu() or
> synchronize_rcu().)
I didn't see a way to keep scheduling the netdev_queues, as the
qdiscs can be shared by multiple queues.
Qdisc "are we running?" and other state pieces are now inside of the
Qdisc itself. And all of the qdisc_run() and netif_schedule logic is,
as a result, Qdisc centric.
The synchronization object is the qdisc. So we can't resample the
qdisc after scheduling it, because then the qdisc attached to the
netdev_queue can change and we'd be holding the root lock for
the wrong qdisc object.
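Concretely, the sample-then-lock pattern being described looked roughly
like this in the dev_queue_xmit() of that kernel (abridged and
paraphrased from memory, so treat the details as approximate):

	rcu_read_lock_bh();
	txq = dev_pick_tx(dev, skb);
	q = rcu_dereference(txq->qdisc);	/* sampled exactly once */

	if (q->enqueue) {
		spinlock_t *root_lock = qdisc_lock(q);

		spin_lock(root_lock);		/* must be the lock of the  */
		rc = qdisc_enqueue_root(skb, q);/* same q sampled above --  */
		qdisc_run(q);			/* resampling could hand us */
		spin_unlock(root_lock);		/* the wrong qdisc's lock   */
	}
	rcu_read_unlock_bh();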
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-12 8:15 ` David Miller
@ 2008-08-12 10:38 ` Jarek Poplawski
2008-08-13 4:30 ` Herbert Xu
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-12 10:38 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Tue, Aug 12, 2008 at 01:15:10AM -0700, David Miller wrote:
...
> The synchronization object is the qdisc. So we can't resample the
> qdisc after scheduling it, because then the qdisc attached to the
> netdev_queue can change and we'd be holding the root lock for
> the wrong qdisc object.
If you mean net_tx_action(), it looks like we would get the root lock
of the current qdisc, just as dev_queue_xmit() does at the moment,
so I'm still looking for a clue as to what could be wrong with this...
Jarek P.
* [PATCH take 2] pkt_sched: Protect gen estimators under est_lock.
2008-08-11 20:53 [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock() Jarek Poplawski
2008-08-12 1:12 ` David Miller
@ 2008-08-12 22:02 ` Jarek Poplawski
2008-08-13 22:20 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-12 22:02 UTC (permalink / raw)
To: David Miller; +Cc: netdev
On Mon, Aug 11, 2008 at 10:53:57PM +0200, Jarek Poplawski wrote:
> pkt_sched: Destroy gen estimators under rtnl_lock().
>
> gen_kill_estimator() requires rtnl_lock() protection, and since it is
> called in qdisc ->destroy() too, this has to go back from RCU callback
> to qdisc_destroy().
So, since that's currently impossible, here is an alternative solution.
Jarek P.
------------>
pkt_sched: Protect gen estimators under est_lock.
gen_kill_estimator() required rtnl_lock() protection, but since it is
now called from the RCU callback __qdisc_destroy(), let's use est_lock
instead.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
net/core/gen_estimator.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c
index 57abe82..a89f32f 100644
--- a/net/core/gen_estimator.c
+++ b/net/core/gen_estimator.c
@@ -99,7 +99,7 @@ struct gen_estimator_head
 
 static struct gen_estimator_head elist[EST_MAX_INTERVAL+1];
 
-/* Protects against NULL dereference */
+/* Protects against NULL dereference and RCU write-side */
 static DEFINE_RWLOCK(est_lock);
 
 static void est_timer(unsigned long arg)
@@ -185,6 +185,7 @@ int gen_new_estimator(struct gnet_stats_basic *bstats,
 	est->last_packets = bstats->packets;
 	est->avpps = rate_est->pps<<10;
 
+	write_lock_bh(&est_lock);
 	if (!elist[idx].timer.function) {
 		INIT_LIST_HEAD(&elist[idx].list);
 		setup_timer(&elist[idx].timer, est_timer, idx);
@@ -194,6 +195,7 @@ int gen_new_estimator(struct gnet_stats_basic *bstats,
 		mod_timer(&elist[idx].timer, jiffies + ((HZ/4) << idx));
 
 	list_add_rcu(&est->list, &elist[idx].list);
+	write_unlock_bh(&est_lock);
 
 	return 0;
 }
@@ -212,7 +214,6 @@ static void __gen_kill_estimator(struct rcu_head *head)
  * Removes the rate estimator specified by &bstats and &rate_est
  * and deletes the timer.
  *
- * NOTE: Called under rtnl_mutex
  */
 void gen_kill_estimator(struct gnet_stats_basic *bstats,
 	struct gnet_stats_rate_est *rate_est)
@@ -226,17 +227,17 @@ void gen_kill_estimator(struct gnet_stats_basic *bstats,
 		if (!elist[idx].timer.function)
 			continue;
 
+		write_lock_bh(&est_lock);
 		list_for_each_entry_safe(e, n, &elist[idx].list, list) {
 			if (e->rate_est != rate_est || e->bstats != bstats)
 				continue;
 
-			write_lock_bh(&est_lock);
 			e->bstats = NULL;
-			write_unlock_bh(&est_lock);
 
 			list_del_rcu(&e->list);
 			call_rcu(&e->e_rcu, __gen_kill_estimator);
 		}
+		write_unlock_bh(&est_lock);
 	}
 }
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-12 8:15 ` David Miller
2008-08-12 10:38 ` Jarek Poplawski
@ 2008-08-13 4:30 ` Herbert Xu
2008-08-13 5:11 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-13 4:30 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
David Miller <davem@davemloft.net> wrote:
>
> The qdisc pointer traverses to the softirq handler, which can be run
> in a process context (via ksoftirqd), and this pointer gets there
> via the per-cpu ->output_queue.
Here are two possible solutions:
1) The active way: smp_call_function and forcibly remove the qdiscs
in question from each output_queue.
2) The passive way: Make dev_deactivate call yield() until no qdiscs
are on an output_queue (sketched below). This assumes there is some
sort of dead-flag detection on the output_queue side so it doesn't
keep going forever.
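A rough sketch of option 2 -- every name below is hypothetical, purely
to illustrate the idea (no such flag or helper existed at the time):

	/* dev_deactivate() side: */
	set_bit(__QDISC_STATE_DEAD, &qdisc->state);	/* hypothetical flag */
	while (qdisc_on_some_output_queue(qdisc))	/* hypothetical test */
		yield();

	/* net_tx_action() side, before running a dequeued qdisc: */
	if (test_bit(__QDISC_STATE_DEAD, &q->state))	/* hypothetical */
		continue;				/* drop, don't run */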
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 4:30 ` Herbert Xu
@ 2008-08-13 5:11 ` David Miller
2008-08-13 5:31 ` Herbert Xu
2008-08-13 6:13 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: David Miller @ 2008-08-13 5:11 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 13 Aug 2008 14:30:19 +1000
> David Miller <davem@davemloft.net> wrote:
> >
> > The qdisc pointer traverses to the softirq handler, which can be run
> > in a process context (via ksoftirqd), and this pointer gets there
> > via the per-cpu ->output_queue.
>
> Here are two possible solutions:
>
> 1) The active way: smp_call_function and forcibly remove the qdiscs
> in question from each output_queue.
>
> 2) The passive way: Make dev_deactivate call yield() until no qdiscs
> are on an output_queue. This assumes there is some sort of dead-flag
> detection on the output_queue side so it doesn't keep going
> forever.
Yes, we'll need some kind of dead flag it seems.
Another thing we can do is grab the __QDISC_STATE_RUNNING bit in the
yield loop.
But actually, I think Jarek has a point.
The existing loop there in dev_deactivate() should work _iff_ we make
it look at the proper qdisc.
This is another case where I didn't transform things correctly. The
old code worked on dev->state since that's where we kept what used
to be __LINK_STATE_QDISC_RUNNING.
But now that state is kept in the qdisc itself, and we just zapped
the active qdisc, so the old one is in ->qdisc_sleeping, not ->qdisc.
So, just like one of Jarek's patches, we should simply change
dev_queue->qdisc into dev_queue->qdisc_sleeping and that should
take care of the bulk of the issues.
Shouldn't it?
Hmmm... maybe we have to sample __QDISC_STATE_SCHED too.
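In code, the check being floated would amount to something like this
sketch (the patch later in this thread implements exactly this test in
some_qdisc_is_busy()):

	q = dev_queue->qdisc_sleeping;		/* not dev_queue->qdisc */
	busy = test_bit(__QDISC_STATE_RUNNING, &q->state) ||
	       test_bit(__QDISC_STATE_SCHED, &q->state);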
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 5:11 ` David Miller
@ 2008-08-13 5:31 ` Herbert Xu
2008-08-13 9:30 ` David Miller
2008-08-13 6:13 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-13 5:31 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Tue, Aug 12, 2008 at 10:11:03PM -0700, David Miller wrote:
>
> So, just like one of Jarek's patches, we should simply change
> dev_queue->qdisc into dev_queue->qdisc_sleeping and that should
> take care of the bulk of the issues.
>
> Shouldn't it?
Yes you're absolutely right, this takes care of the running qdiscs.
> Hmmm... maybe we have to sample __QDISC_STATE_SCHED too.
You need this too for the ones which aren't running but sitting
on output_queue. Either you'll have to forcibly remove them as
I outlined earlier, or just wait for them to expire naturally.
The latter would seem more suitable since we're waiting anyway.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 5:11 ` David Miller
2008-08-13 5:31 ` Herbert Xu
@ 2008-08-13 6:13 ` Jarek Poplawski
2008-08-13 6:16 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-13 6:13 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Tue, Aug 12, 2008 at 10:11:03PM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 13 Aug 2008 14:30:19 +1000
>
> > David Miller <davem@davemloft.net> wrote:
> > >
> > > The qdisc pointer traverses to the softirq handler, which can be run
> > > in a process context (via ksoftirqd), and this pointer gets there
> > > via the per-cpu ->output_queue.
> >
> > Here are two possible solutions:
> >
> > 1) The active way: smp_call_function and forcibly remove the qdiscs
> > in question from each output_queue.
> >
> > 2) The passive way: Make dev_deactivate call yield() until no qdiscs
> > are on an output_queue. This assumes there is some sort of dead-flag
> > detection on the output_queue side so it doesn't keep going
> > forever.
>
> Yes, we'll need some kind of dead flag it seems.
>
> Another thing we can do is grab the __QDISC_STATE_RUNNING bit in the
> yield loop.
>
> But actually, I think Jarek has a point.
>
> The existing loop there in dev_deactivate() should work _iff_ we make
> it look at the proper qdisc.
>
> This is another case where I didn't transform things correctly. The
> old code worked on dev->state since that's where we kept what used
> to be __LINK_STATE_QDISC_RUNNING.
>
> But now that state is kept in the qdisc itself, and we just zapped
> the active qdisc, so the old one is in ->qdisc_sleeping, not ->qdisc.
>
> So, just like one of Jarek's patches, we should simply change
> dev_queue->qdisc into dev_queue->qdisc_sleeping and that should
> take care of the bulk of the issues.
>
> Shouldn't it?
If we don't change anything in __netif_schedule() I doubt it's enough.
And if this old way of waiting for "outstanding qdisc-less" calls was
really needed we should probably wait for both qdisc and qdisc_sleeping
then.
>
> Hmmm... maybe we have to sample __QDISC_STATE_SCHED too.
We could probably even think of using this flag the napi_disable() way.
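For comparison, the napi_disable() pattern being alluded to, transposed
onto the qdisc SCHED bit (a hedged sketch only; nothing like this was
actually proposed as a patch in this thread):

	/* napi_disable() spins until it owns NAPI_STATE_SCHED; the same
	 * trick on __QDISC_STATE_SCHED would look like: */
	while (test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
		msleep(1);
	/* once we own the bit, net_tx_action() can no longer pick up
	 * this qdisc */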
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 6:13 ` Jarek Poplawski
@ 2008-08-13 6:16 ` David Miller
2008-08-13 6:53 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-13 6:16 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 13 Aug 2008 06:13:17 +0000
> On Tue, Aug 12, 2008 at 10:11:03PM -0700, David Miller wrote:
> > So, just like one of Jarek's patches, we should simply change
> > dev_queue->qdisc into dev_queue->qdisc_sleeping and that should
> > take care of the bulk of the issues.
> >
> > Shouldn't it?
>
> If we don't change anything in __netif_schedule() I doubt it's enough.
> And if this old way of waiting for "outstanding qdisc-less" calls was
> really needed we should probably wait for both qdisc and qdisc_sleeping
> then.
I think if we check both RUNNING and SCHED bits, we'll be OK.
> > Hmmm... maybe we have to sample __QDISC_STATE_SCHED too.
>
> We could probably even think of using this flag the napi_disable() way.
Here (in dev_deactivate), ->qdisc is now going to be &noop_qdisc or
similar. Asynchronous contexts can run into that thing as much as
they want :-)
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 6:16 ` David Miller
@ 2008-08-13 6:53 ` Jarek Poplawski
2008-08-13 7:31 ` Jarek Poplawski
2008-08-13 9:25 ` David Miller
0 siblings, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-13 6:53 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Tue, Aug 12, 2008 at 11:16:33PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 13 Aug 2008 06:13:17 +0000
>
> > On Tue, Aug 12, 2008 at 10:11:03PM -0700, David Miller wrote:
> > > So, just like one of Jarek's patches, we should simply change
> > > dev_queue->qdisc into dev_queue->qdisc_sleeping and that should
> > > take care of the bulk of the issues.
> > >
> > > Shouldn't it?
> >
> > If we don't change anything in __netif_schedule() I doubt it's enough.
> > And if this old way of waiting for "outstanding qdisc-less" calls was
> > really needed we should probably wait for both qdisc and qdisc_sleeping
> > then.
>
> I think if we check both RUNNING and SCHED bits, we'll be OK.
>
> > > Hmmm... maybe we have to sample __QDISC_STATE_SCHED too.
> >
> > We could probably even think of using this flag the napi_disable() way.
>
> Here (in dev_deactivate), ->qdisc is now going to be &noop_qdisc or
> similar. Asynchronous contexts can run into that thing as much as
> they want :-)
That's why I still think a "common" RCU with rcu_dereference() (from
the dev_queue pointer) in net_tx_action() should be enough: after
synchronize_rcu() in dev_deactivate() we are sure any qdisc_run(),
from dev_queue_xmit() or net_tx_action(), can only see and lock
noop_qdisc. Any activity on qdisc_sleeping can't happen, so there is
no need to wait for it. There could be some skbs enqueued just before
the synchronize, and they could be ->reset() and ->destroy()ed just
after, even without call_rcu().
Otherwise, I think you had better send some code example with these
flags, so we can be sure there is no misunderstanding around this.
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 6:53 ` Jarek Poplawski
@ 2008-08-13 7:31 ` Jarek Poplawski
2008-08-13 9:25 ` David Miller
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-13 7:31 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Wed, Aug 13, 2008 at 06:53:02AM +0000, Jarek Poplawski wrote:
...
> That's why I still think a "common" RCU with rcu_dereference() (from
> the dev_queue pointer)
I hope nobody read this literally: I mean using the dev_queue pointer
to dereference the Qdisc pointer...
> in net_tx_action() should be enough: after
> synchronize_rcu() in dev_deactivate() we are sure any qdisc_run(),
> from dev_queue_xmit() or net_tx_action(), can only see and lock
> noop_qdisc. Any activity on qdisc_sleeping can't happen, so there is
> no need to wait for it. There could be some skbs enqueued just before
> the synchronize, and they could be ->reset() and ->destroy()ed just
> after, even without call_rcu().
BTW, this is all easy to verify: simply add debug wait loops that
check the state of qdisc_sleeping after synchronize_rcu().
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 6:53 ` Jarek Poplawski
2008-08-13 7:31 ` Jarek Poplawski
@ 2008-08-13 9:25 ` David Miller
2008-08-13 9:58 ` Herbert Xu
2008-08-13 10:27 ` Jarek Poplawski
1 sibling, 2 replies; 209+ messages in thread
From: David Miller @ 2008-08-13 9:25 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 13 Aug 2008 06:53:03 +0000
> Otherwise, I think you had better send some code example with these
> flags, so we can be sure there is no misunderstanding around this.
Here is my concrete proposal for a fix.
pkt_sched: Fix queue quiescence testing in dev_deactivate().
Based upon discussions with Jarek P. and Herbert Xu.
First, we're testing the wrong qdisc. We just reset the device
queue qdiscs to &noop_qdisc, and checking its state is completely
pointless here.
We want to wait until the previous qdisc that was sitting at
the ->qdisc pointer is not busy any more. And that would be
->qdisc_sleeping.
Because of how we propagate the sampled qdisc pointer down into
qdisc_run and friends via the per-cpu ->output_queue and
netif_schedule, we also have to wait for the __QDISC_STATE_SCHED bit
to clear.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7cf83b3..4685746 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -647,7 +647,7 @@ static void dev_deactivate_queue(struct net_device *dev,
 	}
 }
 
-static bool some_qdisc_is_running(struct net_device *dev, int lock)
+static bool some_qdisc_is_busy(struct net_device *dev, int lock)
 {
 	unsigned int i;
 
@@ -658,13 +658,14 @@ static bool some_qdisc_is_running(struct net_device *dev, int lock)
 		int val;
 
 		dev_queue = netdev_get_tx_queue(dev, i);
-		q = dev_queue->qdisc;
+		q = dev_queue->qdisc_sleeping;
 		root_lock = qdisc_lock(q);
 
 		if (lock)
 			spin_lock_bh(root_lock);
 
-		val = test_bit(__QDISC_STATE_RUNNING, &q->state);
+		val = (test_bit(__QDISC_STATE_RUNNING, &q->state) ||
+		       test_bit(__QDISC_STATE_SCHED, &q->state));
 
 		if (lock)
 			spin_unlock_bh(root_lock);
@@ -689,14 +690,14 @@ void dev_deactivate(struct net_device *dev)
 
 	/* Wait for outstanding qdisc_run calls. */
 	do {
-		while (some_qdisc_is_running(dev, 0))
+		while (some_qdisc_is_busy(dev, 0))
 			yield();
 
 		/*
 		 * Double-check inside queue lock to ensure that all effects
 		 * of the queue run are visible when we return.
 		 */
-		running = some_qdisc_is_running(dev, 1);
+		running = some_qdisc_is_busy(dev, 1);
 
 		/*
 		 * The running flag should never be set at this point because
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 5:31 ` Herbert Xu
@ 2008-08-13 9:30 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-13 9:30 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 13 Aug 2008 15:31:18 +1000
> On Tue, Aug 12, 2008 at 10:11:03PM -0700, David Miller wrote:
> > Hmmm... maybe we have to sample __QDISC_STATE_SCHED too.
>
> You need this too for the ones which aren't running but sitting
> on output_queue. Either you'll have to forcibly remove them as
> I outlined earlier, or just wait for them to expire naturally.
>
> The latter would seem more suitable since we're waiting anyway.
This is what I do in the patch I just posted.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 9:25 ` David Miller
@ 2008-08-13 9:58 ` Herbert Xu
2008-08-13 10:27 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-13 9:58 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Wed, Aug 13, 2008 at 02:25:49AM -0700, David Miller wrote:
>
> pkt_sched: Fix queue quiescence testing in dev_deactivate().
Looks good to me. Thanks!
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 9:25 ` David Miller
2008-08-13 9:58 ` Herbert Xu
@ 2008-08-13 10:27 ` Jarek Poplawski
2008-08-13 10:42 ` Jarek Poplawski
2008-08-13 10:42 ` Herbert Xu
1 sibling, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-13 10:27 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Wed, Aug 13, 2008 at 02:25:49AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 13 Aug 2008 06:53:03 +0000
>
> > Otherwise, I think you had better send some code example with these
> > flags, so we can be sure there is no misunderstanding around this.
>
> Here is my concrete proposal for a fix.
>
> pkt_sched: Fix queue quiescence testing in dev_deactivate().
>
> Based upon discussions with Jarek P. and Herbert Xu.
>
> First, we're testing the wrong qdisc. We just reset the device
> queue qdiscs to &noop_qdisc, and checking its state is completely
> pointless here.
>
> We want to wait until the previous qdisc that was sitting at
> the ->qdisc pointer is not busy any more. And that would be
> ->qdisc_sleeping.
>
> Because of how we propagate the sampled qdisc pointer down into
> qdisc_run and friends via the per-cpu ->output_queue and
> netif_schedule, we also have to wait for the __QDISC_STATE_SCHED bit
> to clear.
Of course, checking this needs more time, but it looks like it could
work; only two little doubts:
- in net_tx_action() we can hit a window just after clear_bit() where
neither of these bits is set (sketched below). Of course, hitting this
twice in a row seems very improbable, yet possible, and a lock isn't
helpful here, so some change around this would probably make this
nicer.
- isn't some longer ping-pong possible between qdisc_run() and
net_tx_action(), when dev_requeue_skb() would get it back to
__netif_schedule() and so on (with NETDEV_TX_BUSY)?
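To make the first doubt concrete, the window sits here in the
net_tx_action() of that kernel (abridged):

	smp_mb__before_clear_bit();
	clear_bit(__QDISC_STATE_SCHED, &q->state);
						/* window: SCHED already    */
	root_lock = qdisc_lock(q);		/* clear, RUNNING not set   */
	if (spin_trylock(root_lock)) {		/* until qdisc_run() below, */
		qdisc_run(q);			/* so some_qdisc_is_busy()  */
		spin_unlock(root_lock);		/* can see both bits clear  */
	} else {
		__netif_schedule(q);
	}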
Otherwise, this patch looks OK to me.
Jarek P.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 7cf83b3..4685746 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -647,7 +647,7 @@ static void dev_deactivate_queue(struct net_device *dev,
>  	}
>  }
>
> -static bool some_qdisc_is_running(struct net_device *dev, int lock)
> +static bool some_qdisc_is_busy(struct net_device *dev, int lock)
>  {
>  	unsigned int i;
>
> @@ -658,13 +658,14 @@ static bool some_qdisc_is_running(struct net_device *dev, int lock)
>  		int val;
>
>  		dev_queue = netdev_get_tx_queue(dev, i);
> -		q = dev_queue->qdisc;
> +		q = dev_queue->qdisc_sleeping;
>  		root_lock = qdisc_lock(q);
>
>  		if (lock)
>  			spin_lock_bh(root_lock);
>
> -		val = test_bit(__QDISC_STATE_RUNNING, &q->state);
> +		val = (test_bit(__QDISC_STATE_RUNNING, &q->state) ||
> +		       test_bit(__QDISC_STATE_SCHED, &q->state));
>
>  		if (lock)
>  			spin_unlock_bh(root_lock);
>
> @@ -689,14 +690,14 @@ void dev_deactivate(struct net_device *dev)
>
>  	/* Wait for outstanding qdisc_run calls. */
>  	do {
> -		while (some_qdisc_is_running(dev, 0))
> +		while (some_qdisc_is_busy(dev, 0))
>  			yield();
>
>  		/*
>  		 * Double-check inside queue lock to ensure that all effects
>  		 * of the queue run are visible when we return.
>  		 */
> -		running = some_qdisc_is_running(dev, 1);
> +		running = some_qdisc_is_busy(dev, 1);
>
>  		/*
>  		 * The running flag should never be set at this point because
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 10:27 ` Jarek Poplawski
@ 2008-08-13 10:42 ` Jarek Poplawski
2008-08-13 10:42 ` Herbert Xu
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-13 10:42 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Wed, Aug 13, 2008 at 10:27:01AM +0000, Jarek Poplawski wrote:
...
> - isn't some longer ping-pong possible between qdisc_run() and
> net_tx_action(), when dev_requeue_skb() would get it back to
> __netif_schedule() and so on (with NETDEV_TX_BUSY)?
Hmm... I see: after qdisc_reset() this doesn't seem possible anymore.
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 10:27 ` Jarek Poplawski
2008-08-13 10:42 ` Jarek Poplawski
@ 2008-08-13 10:42 ` Herbert Xu
2008-08-13 10:50 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-13 10:42 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Wed, Aug 13, 2008 at 10:27:01AM +0000, Jarek Poplawski wrote:
>
> - in net_tx_action() we can hit a window just after clear_bit() where
> neither of these bits is set. Of course, hitting this twice in a row
> seems very improbable, yet possible, and a lock isn't helpful
> here, so some change around this would probably make this nicer.
>
> - isn't some longer ping-pong possible between qdisc_run() and
> net_tx_action(), when dev_requeue_skb() would get it back to
> __netif_schedule() and so on (with NETDEV_TX_BUSY)?
Good point. I think we should add an aliveness check in both
net_tx_action and qdisc_run. In fact the net_tx_action problem
existed previously as well. But it is pretty darn unlikely.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 10:42 ` Herbert Xu
@ 2008-08-13 10:50 ` Jarek Poplawski
2008-08-13 22:19 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-13 10:50 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Wed, Aug 13, 2008 at 08:42:38PM +1000, Herbert Xu wrote:
> On Wed, Aug 13, 2008 at 10:27:01AM +0000, Jarek Poplawski wrote:
> >
> > - in net_tx_action() we can hit a window just after clear_bit() where
> > neither of these bits is set. Of course, hitting this twice in a row
> > seems very improbable, yet possible, and a lock isn't helpful
> > here, so some change around this would probably make this nicer.
> >
> > - isn't some longer ping-pong possible between qdisc_run() and
> > net_tx_action(), when dev_requeue_skb() would get it back to
> > __netif_schedule() and so on (with NETDEV_TX_BUSY)?
>
> Good point. I think we should add an aliveness check in both
> net_tx_action and qdisc_run. In fact the net_tx_action problem
> existed previously as well. But it is pretty darn unlikely.
Yes, it seems qdisc_reset() doesn't have to help with this, so
probably some requeue counter or something is needed...
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 10:50 ` Jarek Poplawski
@ 2008-08-13 22:19 ` David Miller
2008-08-14 7:59 ` Jarek Poplawski
` (2 more replies)
0 siblings, 3 replies; 209+ messages in thread
From: David Miller @ 2008-08-13 22:19 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 13 Aug 2008 10:50:52 +0000
> On Wed, Aug 13, 2008 at 08:42:38PM +1000, Herbert Xu wrote:
> > On Wed, Aug 13, 2008 at 10:27:01AM +0000, Jarek Poplawski wrote:
> > >
> > > - in net_tx_action() we can hit a window just after clear_bit() where
> > > neither of these bits is set. Of course, hitting this twice in a row
> > > seems very improbable, yet possible, and a lock isn't helpful
> > > here, so some change around this would probably make this nicer.
> > >
> > > - isn't some longer ping-pong possible between qdisc_run() and
> > > net_tx_action(), when dev_requeue_skb() would get it back to
> > > __netif_schedule() and so on (with NETDEV_TX_BUSY)?
> >
> > Good point. I think we should add an aliveness check in both
> > net_tx_action and qdisc_run. In fact the net_tx_action problem
> > existed previously as well. But it is pretty darn unlikely.
>
> Yes, it seems qdisc_reset() doesn't have to help with this, so
> probably some requeue counter or something is needed...
Ok, so what I'm going to do is check in my patch and then try
to figure out how to resolve this "both bits clear" scenario.
Thanks.
* Re: [PATCH take 2] pkt_sched: Protect gen estimators under est_lock.
2008-08-12 22:02 ` [PATCH take 2] pkt_sched: Protect gen estimators under est_lock Jarek Poplawski
@ 2008-08-13 22:20 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-13 22:20 UTC (permalink / raw)
To: jarkao2; +Cc: netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 13 Aug 2008 00:02:18 +0200
> pkt_sched: Protect gen estimators under est_lock.
>
> gen_kill_estimator() required rtnl_lock() protection, but since it is
> now called from the RCU callback __qdisc_destroy(), let's use est_lock
> instead.
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Looks good, applied, thanks Jarek.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 22:19 ` David Miller
@ 2008-08-14 7:59 ` Jarek Poplawski
2008-08-14 8:16 ` Herbert Xu
2008-08-14 8:17 ` [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock() Jarek Poplawski
2008-08-14 11:24 ` Jarek Poplawski
2 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-14 7:59 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 13 Aug 2008 10:50:52 +0000
>
> > On Wed, Aug 13, 2008 at 08:42:38PM +1000, Herbert Xu wrote:
> > > On Wed, Aug 13, 2008 at 10:27:01AM +0000, Jarek Poplawski wrote:
> > > >
> > > > - in net_tx_action() we can hit a window just after clear_bit() where
> > > > neither of these bits is set. Of course, hitting this twice in a row
> > > > seems very improbable, yet possible, and a lock isn't helpful
> > > > here, so some change around this would probably make this nicer.
> > > >
> > > > - isn't some longer ping-pong possible between qdisc_run() and
> > > > net_tx_action(), when dev_requeue_skb() would get it back to
> > > > __netif_schedule() and so on (with NETDEV_TX_BUSY)?
> > >
> > > Good point. I think we should add an aliveness check in both
> > > net_tx_action and qdisc_run. In fact the net_tx_action problem
> > > existed previously as well. But it is pretty darn unlikely.
> >
> > Yes, it seems qdisc_reset() doesn't have to help with this, so
> > probably some requeue counter or something is needed...
>
> Ok, so what I'm going to do is check in my patch and then try
> to figure out how to resolve this "both bits clear" scenario.
Here is my proposal.
Jarek P.
----------->
net: Change handling of the __QDISC_STATE_SCHED flag in net_tx_action().
Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
enable proper control in dev_deactivate_queue(). Now, if this flag is
seen as unset under root_lock, it means the qdisc can't be
netif_scheduled.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
net/core/dev.c | 34 +++++++++++++++++++---------------
1 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..f67581b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1339,19 +1339,23 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 }
 
-void __netif_schedule(struct Qdisc *q)
+static inline void __netif_reschedule(struct Qdisc *q)
 {
-	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
-		struct softnet_data *sd;
-		unsigned long flags;
+	struct softnet_data *sd;
+	unsigned long flags;
 
-		local_irq_save(flags);
-		sd = &__get_cpu_var(softnet_data);
-		q->next_sched = sd->output_queue;
-		sd->output_queue = q;
-		raise_softirq_irqoff(NET_TX_SOFTIRQ);
-		local_irq_restore(flags);
-	}
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	q->next_sched = sd->output_queue;
+	sd->output_queue = q;
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void __netif_schedule(struct Qdisc *q)
+{
+	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
+		__netif_reschedule(q);
 }
 EXPORT_SYMBOL(__netif_schedule);
 
@@ -1974,15 +1978,15 @@ static void net_tx_action(struct softirq_action *h)
 
 			head = head->next_sched;
 
-			smp_mb__before_clear_bit();
-			clear_bit(__QDISC_STATE_SCHED, &q->state);
-
 			root_lock = qdisc_lock(q);
 			if (spin_trylock(root_lock)) {
+				smp_mb__before_clear_bit();
+				clear_bit(__QDISC_STATE_SCHED,
+					  &q->state);
 				qdisc_run(q);
 				spin_unlock(root_lock);
 			} else {
-				__netif_schedule(q);
+				__netif_reschedule(q);
 			}
 		}
 	}
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-14 7:59 ` Jarek Poplawski
@ 2008-08-14 8:16 ` Herbert Xu
2008-08-14 8:31 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-14 8:16 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Thu, Aug 14, 2008 at 07:59:07AM +0000, Jarek Poplawski wrote:
>
> net: Change handling of the __QDISC_STATE_SCHED flag in net_tx_action().
>
> Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
> enable proper control in dev_deactivate_queue(). Now, if this flag is
> seen as unset under root_lock, it means the qdisc can't be
> netif_scheduled.
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Well, this probably works in practice, but at least on paper it
is vulnerable to live-lock if the net_tx_action side always gets
to the trylock stage and loses to the waiting side.
An aliveness flag would be the safest.
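A hedged sketch of the aliveness-flag idea in net_tx_action() -- the
flag name is hypothetical here, just to show the shape of the check:

	root_lock = qdisc_lock(q);
	if (spin_trylock(root_lock)) {
		smp_mb__before_clear_bit();
		clear_bit(__QDISC_STATE_SCHED, &q->state);
		if (!test_bit(__QDISC_STATE_DEACTIVATED,	/* hypothetical */
			      &q->state))
			qdisc_run(q);
		spin_unlock(root_lock);
	}

dev_deactivate() would set the flag under the root lock before waiting,
so a deactivated qdisc could never be run or rescheduled again and the
wait could not live-lock.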
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 22:19 ` David Miller
2008-08-14 7:59 ` Jarek Poplawski
@ 2008-08-14 8:17 ` Jarek Poplawski
2008-08-14 11:24 ` Jarek Poplawski
2 siblings, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-14 8:17 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:
...
> Ok, so what I'm going to do is check in my patch and then try
> to figure out how to resolve this "both bits clear" scenerio.
Here is my proposal again...
Jarek P.
-----------> (resend with changelog fixed only)
net: Change handling of the __QDISC_STATE_SCHED flag in net_tx_action().
Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
enable proper control in dev_deactivate(). Now, if this flag is seen
as unset under root_lock, it means the qdisc can't be netif_scheduled.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
net/core/dev.c | 34 +++++++++++++++++++---------------
1 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..f67581b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1339,19 +1339,23 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 }
 
-void __netif_schedule(struct Qdisc *q)
+static inline void __netif_reschedule(struct Qdisc *q)
 {
-	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
-		struct softnet_data *sd;
-		unsigned long flags;
+	struct softnet_data *sd;
+	unsigned long flags;
 
-		local_irq_save(flags);
-		sd = &__get_cpu_var(softnet_data);
-		q->next_sched = sd->output_queue;
-		sd->output_queue = q;
-		raise_softirq_irqoff(NET_TX_SOFTIRQ);
-		local_irq_restore(flags);
-	}
+	local_irq_save(flags);
+	sd = &__get_cpu_var(softnet_data);
+	q->next_sched = sd->output_queue;
+	sd->output_queue = q;
+	raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	local_irq_restore(flags);
+}
+
+void __netif_schedule(struct Qdisc *q)
+{
+	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
+		__netif_reschedule(q);
 }
 EXPORT_SYMBOL(__netif_schedule);
 
@@ -1974,15 +1978,15 @@ static void net_tx_action(struct softirq_action *h)
 
 			head = head->next_sched;
 
-			smp_mb__before_clear_bit();
-			clear_bit(__QDISC_STATE_SCHED, &q->state);
-
 			root_lock = qdisc_lock(q);
 			if (spin_trylock(root_lock)) {
+				smp_mb__before_clear_bit();
+				clear_bit(__QDISC_STATE_SCHED,
+					  &q->state);
 				qdisc_run(q);
 				spin_unlock(root_lock);
 			} else {
-				__netif_schedule(q);
+				__netif_reschedule(q);
 			}
 		}
 	}
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-14 8:16 ` Herbert Xu
@ 2008-08-14 8:31 ` Jarek Poplawski
2008-08-14 8:33 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-14 8:31 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Thu, Aug 14, 2008 at 06:16:32PM +1000, Herbert Xu wrote:
> On Thu, Aug 14, 2008 at 07:59:07AM +0000, Jarek Poplawski wrote:
> >
> > net: Change handling of the __QDISC_STATE_SCHED flag in net_tx_action().
> >
> > Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to
> > enable proper control in dev_deactivate_queue(). Now, if this flag
> > is seen as unset under root_lock, it means the qdisc can't be
> > netif_scheduled.
> >
> >
> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> Well this probably works in practice but at least on paper it
> is vulnerable to live-lock if the net_tx_action side always gets
> to the trylock stage and loses to the waiting side.
>
> An aliveness flag would be the safest.
I'm not sure of your point... This patch is only to fix my yesterday's
doubt #1, and it doesn't introduce, I hope, any new live-lock
vulnerability. So, if you mean doubt #2, a separate patch is needed,
but I'm not sure there is a need to add a flag. I've thought
about a counter in a Qdisc for consecutive requeues with
netif_schedule, so we could break after some limit. Of course, your
idea could be simpler and better, but if I could only see some code...
Cheers,
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-14 8:31 ` Jarek Poplawski
@ 2008-08-14 8:33 ` Herbert Xu
2008-08-14 8:44 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-14 8:33 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Thu, Aug 14, 2008 at 08:31:27AM +0000, Jarek Poplawski wrote:
>
> I'm not sure of your point... This patch is only to fix my yesterday's
> doubt #1, and it doesn't introduce, I hope, any new live-lock
> vulnerability. So, if you mean doubt #2, a separate patch is needed,
> but I'm not sure there is a need to add a flag. I've thought
> about a counter in a Qdisc for consecutive requeues with
> netif_schedule, so we could break after some limit. Of course, your
> idea could be simpler and better, but if I could only see some code...
What I mean is the extremely unlikely scenario of net_tx_action
always failing on trylock because dev_deactivate has grabbed the
lock to check whether net_tx_action has completed.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-14 8:33 ` Herbert Xu
@ 2008-08-14 8:44 ` Jarek Poplawski
2008-08-14 8:52 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-14 8:44 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Thu, Aug 14, 2008 at 06:33:36PM +1000, Herbert Xu wrote:
> On Thu, Aug 14, 2008 at 08:31:27AM +0000, Jarek Poplawski wrote:
> >
> > I'm not sure of your point... This patch is only to fix my yesterday's
> > doubt #1, and it doesn't introduce, I hope, any new live-lock
> > vulnerability. So, if you mean doubt #2, a separate patch is needed,
> > but I'm not sure there is a need to add a flag. I've thought
> > about a counter in a Qdisc for consecutive requeues with
> > netif_schedule, so we could break after some limit. Of course, your
> > idea could be simpler and better, but if I could only see some code...
>
> What I mean is the extremely unlikely scenario of net_tx_action
> always failing on trylock because dev_deactivate has grabbed the
> lock to check whether net_tx_action has completed.
Of course! I got it myself after responding and re-reading, sorry. So,
this is yet another doubt, but I still wonder why you don't attach
any code... (I'm currently trying to re-think this.)
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-14 8:44 ` Jarek Poplawski
@ 2008-08-14 8:52 ` Jarek Poplawski
2008-08-17 22:57 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-14 8:52 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Thu, Aug 14, 2008 at 08:44:29AM +0000, Jarek Poplawski wrote:
> On Thu, Aug 14, 2008 at 06:33:36PM +1000, Herbert Xu wrote:
> > On Thu, Aug 14, 2008 at 08:31:27AM +0000, Jarek Poplawski wrote:
> > >
> > > I'm not sure of your point... This patch is only to fix my yesterday's
> > > doubt #1, and it doesn't introduce, I hope, any new live-lock
> > > vulnerability. So, if you mean doubt #2, a separate patch is needed,
> > > but I'm not sure there is a need to add a flag. I've thought
> > > about a counter in a Qdisc for consecutive requeues with
> > > netif_schedule, so we could break after some limit. Of course, your
> > > idea could be simpler and better, but if I could only see some code...
> >
> > What I mean is the extremely unlikely scenario of net_tx_action
> > always failing on trylock because dev_deactivate has grabbed the
> > lock to check whether net_tx_action has completed.
>
> Of course! I got it myself after responding and re-reading, sorry. So,
> this is yet another doubt, but I still wonder why you don't attach
> any code... (I'm currently trying to re-think this.)
On the other hand... such a flag would probably be for one thing only.
And if we had a "netif_scheduled_without_xmitting" counter, this
could probably make 2 in 1?
Jarek P.
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-13 22:19 ` David Miller
2008-08-14 7:59 ` Jarek Poplawski
2008-08-14 8:17 ` [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock() Jarek Poplawski
@ 2008-08-14 11:24 ` Jarek Poplawski
2008-08-17 13:42 ` Jarek Poplawski
2 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-14 11:24 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:
...
> Ok, so what I'm going to do is check in my patch and then try
>> to figure out how to resolve this "both bits clear" scenario.
BTW, here is my older doubt revisited, which I hope can be
re-considered (or I can be re-convinced), if possible...
Thanks,
Jarek P.
---------------->
pkt_sched: Destroy qdiscs under rtnl_lock again.
We don't need to trigger __qdisc_destroy() as an RCU callback because
the use of a qdisc isn't controlled by RCU alone: after querying RCU
with synchronize_rcu() in dev_deactivate() we additionally wait in a
loop checking some flags. After the loop is done there can be no
outstanding use of the qdisc, so call_rcu() doesn't make any sense.
On the other hand, currently calling a Qdisc's ->destroy() from
softirq context without locking (rtnl) can break various things like:
qdisc_put_rtab(), tcf_destroy_chain() (e.g. u32_destroy()), and
probably more.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
net/sched/sch_generic.c | 8 ++------
1 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4685746..e7379d2 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -518,12 +518,8 @@ void qdisc_reset(struct Qdisc *qdisc)
 }
 EXPORT_SYMBOL(qdisc_reset);
 
-/* this is the rcu callback function to clean up a qdisc when there
- * are no further references to it */
-
-static void __qdisc_destroy(struct rcu_head *head)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
-	struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
 	const struct Qdisc_ops *ops = qdisc->ops;
 
 #ifdef CONFIG_NET_SCHED
@@ -554,7 +550,7 @@ void qdisc_destroy(struct Qdisc *qdisc)
 	if (qdisc->parent)
 		list_del(&qdisc->list);
 
-	call_rcu(&qdisc->q_rcu, __qdisc_destroy);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_destroy);
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-14 11:24 ` Jarek Poplawski
@ 2008-08-17 13:42 ` Jarek Poplawski
2008-08-17 21:34 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-17 13:42 UTC (permalink / raw)
Cc: David Miller, herbert, netdev, Denys Fedoryshchenko
Jarek Poplawski wrote, On 08/14/2008 01:24 PM:
> On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:
> ...
>> Ok, so what I'm going to do is check in my patch and then try
>> to figure out how to resolve this "both bits clear" scenario.
>
> BTW, here is my older doubt revisited, which I hope can be
> re-considered (or I can be re-convinced), if possible...
After the problems Denys hit while testing this in another thread,
I withdraw this patch.
Thanks,
Jarek P.
> ---------------->
>
> pkt_sched: Destroy qdiscs under rtnl_lock again.
>
> We don't need to trigger __qdisc_destroy() as an RCU callback because
> the use of a qdisc isn't controlled by RCU alone: after querying RCU
> with synchronize_rcu() in dev_deactivate() we additionally wait in a
> loop checking some flags. After the loop is done there can be no
> outstanding use of the qdisc, so call_rcu() doesn't make any sense.
>
> On the other hand, currently calling a Qdisc's ->destroy() from
> softirq context without locking (rtnl) can break various things like:
> qdisc_put_rtab(), tcf_destroy_chain() (e.g. u32_destroy()), and
> probably more.
>
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> ---
>
> net/sched/sch_generic.c | 8 ++------
> 1 files changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 4685746..e7379d2 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -518,12 +518,8 @@ void qdisc_reset(struct Qdisc *qdisc)
> }
> EXPORT_SYMBOL(qdisc_reset);
>
> -/* this is the rcu callback function to clean up a qdisc when there
> - * are no further references to it */
> -
> -static void __qdisc_destroy(struct rcu_head *head)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
> {
> - struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);
> const struct Qdisc_ops *ops = qdisc->ops;
>
> #ifdef CONFIG_NET_SCHED
> @@ -554,7 +550,7 @@ void qdisc_destroy(struct Qdisc *qdisc)
> if (qdisc->parent)
> list_del(&qdisc->list);
>
> - call_rcu(&qdisc->q_rcu, __qdisc_destroy);
> + __qdisc_destroy(qdisc);
> }
> EXPORT_SYMBOL(qdisc_destroy);
>
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-17 13:42 ` Jarek Poplawski
@ 2008-08-17 21:34 ` David Miller
2008-08-17 22:22 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-17 21:34 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 17 Aug 2008 15:42:54 +0200
> Jarek Poplawski wrote, On 08/14/2008 01:24 PM:
>
> > On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:
> > ...
> >> Ok, so what I'm going to do is check in my patch and then try
> >> to figure out how to resolve this "both bits clear" scenario.
> >
> > BTW, here is my older doubt revisited; I hope it can be re-considered,
> > or that I can be re-convinced, if possible...
>
>
> After the problems Denys hit while testing this in another thread,
> I withdraw this patch.
Well, I knew it was completely wrong from the beginning, sorry
to say :-)
This stuff can't be done outside of RCU, period. I moved all of this
work into RCU for a reason, I really meant it, and none of the reasons
for that move have changed :-)
If we want to do it under RTNL we have to do something like schedule a
workqueue from the RCU handler and then take the RTNL there.
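A minimal sketch of that idea (assuming a new destroy_work member
added to struct Qdisc; hypothetical code, not a real patch):

static void qdisc_destroy_work(struct work_struct *work)
{
	struct Qdisc *qdisc = container_of(work, struct Qdisc,
					   destroy_work);

	/* Process context: taking rtnl_lock() is legal here. */
	rtnl_lock();
	gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
	if (qdisc->ops->destroy)
		qdisc->ops->destroy(qdisc);
	rtnl_unlock();

	module_put(qdisc->ops->owner);
	kfree((char *)qdisc - qdisc->padded);
}

static void __qdisc_destroy(struct rcu_head *head)
{
	struct Qdisc *qdisc = container_of(head, struct Qdisc, q_rcu);

	/* Softirq context: we cannot sleep or take RTNL here,
	 * so hand the real work off to process context.
	 */
	INIT_WORK(&qdisc->destroy_work, qdisc_destroy_work);
	schedule_work(&qdisc->destroy_work);
}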
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-17 21:34 ` David Miller
@ 2008-08-17 22:22 ` Jarek Poplawski
2008-08-17 22:32 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-17 22:22 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
On Sun, Aug 17, 2008 at 02:34:44PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 17 Aug 2008 15:42:54 +0200
>
> > Jarek Poplawski wrote, On 08/14/2008 01:24 PM:
> >
> > > On Wed, Aug 13, 2008 at 03:19:18PM -0700, David Miller wrote:
> > > ...
> > >> Ok, so what I'm going to do is check in my patch and then try
> > >> to figure out how to resolve this "both bits clear" scenario.
> > >
> > > BTW, here is my older doubt revisited; I hope it can be re-considered,
> > > or that I can be re-convinced, if possible...
> >
> >
> > After the problems Denys hit while testing this in another thread,
> > I withdraw this patch.
>
> Well, I knew it was completely wrong from the beginning, sorry
> to say :-)
>
> This stuff can't be done outside of RCU, period. I moved all of this
> work into RCU for a reason, I really meant it, and none of the reasons
> for that move have changed :-)
>
> If we want to do it under RTNL we have to do something like schedule a
> workqueue from the RCU handler and then take the RTNL there.
Actually, I've only asked you to withdraw this patch for now, but I'm
still not convinced you're right. You'd better first show me the
place where this can make a difference. (I think this test broke for
some other non-RCU reason.) So, maybe you're right, but I have to
check this more.
BTW, I guess you've seen this other thread: "panic 2.6.27-rc3-git2,
qdisc_dequeue_head" where Denys and I are fighting with this new
locking. Alas, it looks to me like a real mess, and I'm currently
trying the previous idea of netdev_queue->qdisc_lock, which you didn't
like either. But, after looking at the current bugs shown by debugging,
I really think we'll have bugs here all the time without simplifying
this. I think my concept should work soon, but if you don't agree with
this at all we can stop and wait for better ideas.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-17 22:22 ` Jarek Poplawski
@ 2008-08-17 22:32 ` David Miller
2008-08-18 20:12 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-17 22:32 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 18 Aug 2008 00:22:07 +0200
> Actually, I've only asked you to withdraw this patch for now, but I'm
> still not convinced you're right. You'd better first show me the
> place where this can make a difference. (I think this test broke for
> some other non-RCU reason.) So, maybe you're right, but I have to
> check this more.
It _might_ be ok once we are done sorting out the synchronization
sequence dev_deactivate() uses. But we aren't there yet.
> BTW, I guess you've seen this other thread: "panic 2.6.27-rc3-git2,
> qdisc_dequeue_head" where Denys and I are fighting with this new
> locking. Alas, it looks to me like a real mess, and I'm currently
> trying the previous idea of netdev_queue->qdisc_lock, which you didn't
> like either. But, after looking at the current bugs shown by debugging,
> I really think we'll have bugs here all the time without simplifying
> this. I think my concept should work soon, but if you don't agree with
> this at all we can stop and wait for better ideas.
I can't even follow your flurry of patches, and neither can the
tester :-) I deleted the entire thread to be honest, hoping you
would come back with a simple analysis once you've worked things
out with the tester.
What is the real problem besides the correct notify_and_destroy()
issue you discovered?
The locking we have now is very simple:
1) Only under RTNL can qdisc roots change.
2) Therefore, sch_tree_lock() and tcf_tree_lock() are fully valid
and lock the entire qdisc tree state, if and only if used under
RTNL lock.
3) Before modifying a qdisc, we dev_deactivate(), which synchronizes
with asynchronous TX/RX packet handling contexts.
4) The qdisc root and all children are protected by the root qdisc's
lock, which is taken when asynchronous contexts need to be blocked
while modifying some root or inner qdisc's state.
Yes, of course, if you apply a hammer and add a bit lock at the
top of all of this it will fix whatever bugs remain, but as you
know I don't think that's the solution.
The only substance I've seen is that you've found a violation of #4 in
notify_and_destroy(), so great let's test the fix for that.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-14 8:52 ` Jarek Poplawski
@ 2008-08-17 22:57 ` David Miller
2008-08-17 23:03 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-17 22:57 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 14 Aug 2008 08:52:50 +0000
> On the other hand... such a flag would probably be for one thing only.
> And if we had a "netif_scheduled_without_xmitting" counter this
> could probably make 2 in 1?
Here is how I propose to plug this hole. The issue is that there is
this potential gap where both bits are clear while there is a context
that can reschedule the qdisc.
And I'm pretty sure the most desirable fix is to get rid of that gap
if it can easily be done.
It seems like it can be. What we do in the patch below is something
like this:
1) The one code path that can leave both bits clear then reschedule
is changed to only leave that state visible while holding the
root qdisc's lock.
All other code paths will not reschedule once they clear the bits.
2) dev_deactivate() unconditionally takes the root qdisc spinlock
and just waits for both bits to clear. This is now entirely
sufficient to ensure that no pending runs of the qdisc remain.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..896393d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1338,20 +1338,27 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
rcu_read_unlock();
}
+/* We own the __QDISC_STATE_SCHED state bit, and know that
+ * we are the only entity which can schedule this qdisc on
+ * the output queue.
+ */
+static void __qdisc_schedule(struct Qdisc *q)
+{
+ struct softnet_data *sd;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sd = &__get_cpu_var(softnet_data);
+ q->next_sched = sd->output_queue;
+ sd->output_queue = q;
+ raise_softirq_irqoff(NET_TX_SOFTIRQ);
+ local_irq_restore(flags);
+}
void __netif_schedule(struct Qdisc *q)
{
- if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
- struct softnet_data *sd;
- unsigned long flags;
-
- local_irq_save(flags);
- sd = &__get_cpu_var(softnet_data);
- q->next_sched = sd->output_queue;
- sd->output_queue = q;
- raise_softirq_irqoff(NET_TX_SOFTIRQ);
- local_irq_restore(flags);
- }
+ if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
+ __qdisc_schedule(q);
}
EXPORT_SYMBOL(__netif_schedule);
@@ -1974,15 +1981,13 @@ static void net_tx_action(struct softirq_action *h)
head = head->next_sched;
- smp_mb__before_clear_bit();
- clear_bit(__QDISC_STATE_SCHED, &q->state);
-
root_lock = qdisc_lock(q);
if (spin_trylock(root_lock)) {
+ clear_bit(__QDISC_STATE_SCHED, &q->state);
qdisc_run(q);
spin_unlock(root_lock);
} else {
- __netif_schedule(q);
+ __qdisc_schedule(q);
}
}
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4685746..d6ac170 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -647,7 +647,7 @@ static void dev_deactivate_queue(struct net_device *dev,
}
}
-static bool some_qdisc_is_busy(struct net_device *dev, int lock)
+static bool some_qdisc_is_busy(struct net_device *dev)
{
unsigned int i;
@@ -661,14 +661,12 @@ static bool some_qdisc_is_busy(struct net_device *dev, int lock)
q = dev_queue->qdisc_sleeping;
root_lock = qdisc_lock(q);
- if (lock)
- spin_lock_bh(root_lock);
+ spin_lock_bh(root_lock);
val = (test_bit(__QDISC_STATE_RUNNING, &q->state) ||
test_bit(__QDISC_STATE_SCHED, &q->state));
- if (lock)
- spin_unlock_bh(root_lock);
+ spin_unlock_bh(root_lock);
if (val)
return true;
@@ -689,25 +687,8 @@ void dev_deactivate(struct net_device *dev)
synchronize_rcu();
/* Wait for outstanding qdisc_run calls. */
- do {
- while (some_qdisc_is_busy(dev, 0))
- yield();
-
- /*
- * Double-check inside queue lock to ensure that all effects
- * of the queue run are visible when we return.
- */
- running = some_qdisc_is_busy(dev, 1);
-
- /*
- * The running flag should never be set at this point because
- * we've already set dev->qdisc to noop_qdisc *inside* the same
- * pair of spin locks. That is, if any qdisc_run starts after
- * our initial test it should see the noop_qdisc and then
- * clear the RUNNING bit before dropping the queue lock. So
- * if it is set here then we've found a bug.
- */
- } while (WARN_ON_ONCE(running));
+ while (some_qdisc_is_busy(dev))
+ yield();
}
static void dev_init_scheduler_queue(struct net_device *dev,
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-17 22:57 ` David Miller
@ 2008-08-17 23:03 ` David Miller
2008-08-18 1:25 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-17 23:03 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: David Miller <davem@davemloft.net>
Date: Sun, 17 Aug 2008 15:57:23 -0700 (PDT)
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Thu, 14 Aug 2008 08:52:50 +0000
>
> > On the other hand... such a flag would probably be for one thing only.
> > And if we had a "netif_scheduled_without_xmitting" counter this
> > could probably make 2 in 1?
>
> Here is how I propose to plug this hole. The issue is that there is
> this potential gap where both bits are clear while there is a context
> that can reschedule the qdisc.
My apologies Jarek, this is largely identical to a patch you posted
already. Sorry :(
I thought about it some more, but there is still the dev_queue_xmit()
context that can have a reference to the old qdisc, be about to take
the qdisc lock and feed packets into its ->enqueue().
That's why we need the RCU destruction still. dev_deactivate() really
doesn't clear out all references to the qdisc fully.
It might be a better approach to work on making that simply not
matter: just rely on RCU or similar to defer destruction, and any real
state changes internal to the qdisc tree will simply use the root
qdisc's lock to block out enqueue/dequeue/requeue calls.
In that kind of scheme dev_deactivate() just sets the RCU pointer and
doesn't sit around waiting for anything.
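Per queue, that would look roughly like this (a sketch only, with a
hypothetical helper name):

static void dev_deactivate_queue_rcu(struct netdev_queue *dev_queue)
{
	struct Qdisc *old = dev_queue->qdisc;

	/* Publish the noop qdisc; new senders see it immediately. */
	rcu_assign_pointer(dev_queue->qdisc, &noop_qdisc);

	/* Defer destruction to RCU instead of yield()-looping
	 * until the RUNNING/SCHED bits clear.
	 */
	call_rcu(&old->q_rcu, __qdisc_destroy);
}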
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-17 23:03 ` David Miller
@ 2008-08-18 1:25 ` Herbert Xu
2008-08-18 1:35 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-18 1:25 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Sun, Aug 17, 2008 at 04:03:29PM -0700, David Miller wrote:
>
> I thought about it some more, but there is still the dev_queue_xmit()
> context that can have a reference to the old qdisc, be about to take
> the qdisc lock and feed packets into its ->enqueue().
>
> That's why we need the RCU destruction still. dev_deactivate() really
> doesn't clear out all references to the qdisc fully.
>
> It might be a better approach to work on making that simply not
> matter: just rely on RCU or similar to defer destruction, and any real
> state changes internal to the qdisc tree will simply use the root
> qdisc's lock to block out enqueue/dequeue/requeue calls.
>
> In that kind of scheme dev_deactivate() just sets the RCU pointer and
> doesn't sit around waiting for anything.
Well one of the advantages of it being synchronous is that the
driver may be relying on this so that it knows all transmissions
have ceased in dev_close. If we stick with RCU then drivers would
have to implement their own synchronisation anyway.
So what's the issue with dev_queue_xmit? That should be taken care
of by something like rcu_barrier_bh, no?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 1:25 ` Herbert Xu
@ 2008-08-18 1:35 ` David Miller
2008-08-18 1:36 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-18 1:35 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 18 Aug 2008 11:25:16 +1000
> So what's the issue with dev_queue_xmit? That should be taken care
> of by something like rcu_barrier_bh, no?
The code in dev_queue_xmit() used to resample the pointer. It
relied upon the fact that we always used the same top-level
spinlock to set the qdisc.
But now that the lock is in the qdisc itself instead of the
netdevice or netdev_queue, that no longer works.
That's why I got rid of the "resample" code in these places
and tried to move everything into RCU.
I think I see another way out of this:
1) Add __QDISC_STATE_DEACTIVATE.
2) Set it right before dev_deactivate() swaps/resets the qdisc
pointer.
3) Test it in dev_queue_xmit() et al. once the qdisc root lock is
acquired, and drop lock and resample ->qdisc if
__QDISC_STATE_DEACTIVATE is set.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 1:35 ` David Miller
@ 2008-08-18 1:36 ` Herbert Xu
2008-08-18 1:49 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-18 1:36 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Sun, Aug 17, 2008 at 06:35:05PM -0700, David Miller wrote:
>
> That's why I got rid of the "resample" code in these places
> and tried to move everything into RCU.
Right.
> I think I see another way out of this:
>
> 1) Add __QDISC_STATE_DEACTIVATE.
>
> 2) Set it right before dev_deactivate() swaps/resets the qdisc
> pointer.
>
> 3) Test it in dev_queue_xmit() et al. once the qdisc root lock is
> acquired, and drop lock and resample ->qdisc if
> __QDISC_STATE_DEACTIVATE is set.
Yep this sounds good to me.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 1:36 ` Herbert Xu
@ 2008-08-18 1:49 ` David Miller
2008-08-18 4:27 ` Herbert Xu
` (2 more replies)
0 siblings, 3 replies; 209+ messages in thread
From: David Miller @ 2008-08-18 1:49 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 18 Aug 2008 11:36:33 +1000
> On Sun, Aug 17, 2008 at 06:35:05PM -0700, David Miller wrote:
> > I think I see another way out of this:
> >
> > 1) Add __QDISC_STATE_DEACTIVATE.
> >
> > 2) Set it right before dev_deactivate() swaps/resets the qdisc
> > pointer.
> >
> > 3) Test it in dev_queue_xmit() et al. once the qdisc root lock is
> > acquired, and drop lock and resample ->qdisc if
> > __QDISC_STATE_DEACTIVATE is set.
>
> Yep this sounds good to me.
Here it is as a patch below.
The test being added to __netif_schedule() is just a reminder that
we have to address the "both bits clear" case somehow, likely with
Jarek's patch which I unintentionally reimplemented :)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a7abfda..757ab08 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -27,6 +27,7 @@ enum qdisc_state_t
{
__QDISC_STATE_RUNNING,
__QDISC_STATE_SCHED,
+ __QDISC_STATE_DEACTIVATED,
};
struct qdisc_size_table {
diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..b88f669 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1341,6 +1341,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
void __netif_schedule(struct Qdisc *q)
{
+ if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
+ return;
+
if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
struct softnet_data *sd;
unsigned long flags;
@@ -1790,6 +1793,8 @@ gso:
rcu_read_lock_bh();
txq = dev_pick_tx(dev, skb);
+
+resample_qdisc:
q = rcu_dereference(txq->qdisc);
#ifdef CONFIG_NET_CLS_ACT
@@ -1800,6 +1805,11 @@ gso:
spin_lock(root_lock);
+ if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+ spin_unlock(root_lock);
+ goto resample_qdisc;
+ }
+
rc = qdisc_enqueue_root(skb, q);
qdisc_run(q);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4685746..ff1c455 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -597,6 +597,9 @@ static void transition_one_qdisc(struct net_device *dev,
struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
int *need_watchdog_p = _need_watchdog;
+ if (!(new_qdisc->flags & TCQ_F_BUILTIN))
+ clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
+
rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
if (need_watchdog_p && new_qdisc != &noqueue_qdisc)
*need_watchdog_p = 1;
@@ -640,6 +643,9 @@ static void dev_deactivate_queue(struct net_device *dev,
if (qdisc) {
spin_lock_bh(qdisc_lock(qdisc));
+ if (!(qdisc->flags & TCQ_F_BUILTIN))
+ set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
+
dev_queue->qdisc = qdisc_default;
qdisc_reset(qdisc);
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 1:49 ` David Miller
@ 2008-08-18 4:27 ` Herbert Xu
2008-08-18 4:31 ` David Miller
2008-08-18 6:27 ` Jarek Poplawski
2008-08-18 21:29 ` Jarek Poplawski
2 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-18 4:27 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Sun, Aug 17, 2008 at 06:49:21PM -0700, David Miller wrote:
>
> The test being added to __netif_schedule() is just a reminder that
> we have to address the "both bits clear" case somehow, likely with
> Jarek's patch which I unintentionally reimplemented :)
The patch looks good to me.
> @@ -1790,6 +1793,8 @@ gso:
> rcu_read_lock_bh();
>
> txq = dev_pick_tx(dev, skb);
> +
> +resample_qdisc:
> q = rcu_dereference(txq->qdisc);
>
> #ifdef CONFIG_NET_CLS_ACT
> @@ -1800,6 +1805,11 @@ gso:
>
> spin_lock(root_lock);
>
> + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> + spin_unlock(root_lock);
> + goto resample_qdisc;
> + }
We could just drop the packet.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 4:27 ` Herbert Xu
@ 2008-08-18 4:31 ` David Miller
2008-08-18 4:36 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-18 4:31 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 18 Aug 2008 14:27:31 +1000
> On Sun, Aug 17, 2008 at 06:49:21PM -0700, David Miller wrote:
> > @@ -1800,6 +1805,11 @@ gso:
> >
> > spin_lock(root_lock);
> >
> > + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> > + spin_unlock(root_lock);
> > + goto resample_qdisc;
> > + }
>
> We could just drop the packet.
True, that would be simpler and have the same effect.
I just noticed that ingress qdisc needs similar code.
Ok, here is what I'll do as a rough plan:
1) Integrate this patch with your drop suggestion and ingress
fixed up.
2) Integrate Jarek's patch that closes the "both bits clear"
hole.
3) Add my bit that makes dev_deactivate() always take the spinlock
so there is just one simple yield() loop.
And, finally, we can be at the point where we can get rid of the
RCU deferral of qdisc_destroy().
Should be fun evening :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 4:31 ` David Miller
@ 2008-08-18 4:36 ` Herbert Xu
2008-08-18 5:13 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-18 4:36 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Sun, Aug 17, 2008 at 09:31:31PM -0700, David Miller wrote:
>
> Ok, here is what I'll do as a rough plan:
>
> 1) Integrate this patch with your drop suggestion and ingress
> fixed up.
>
> 2) Integrate Jarek's patch that closes the "both bits clear"
> hole.
>
> 3) Add my bit that makes dev_deactivate() always take the spinlock
> so there is just one simple yield() loop.
>
> And, finally, we can be at the point where we can get rid of the
> RCU deferral of qdisc_destroy().
>
> Should be fun evening :-)
Worthy of a gold medal :)
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 4:36 ` Herbert Xu
@ 2008-08-18 5:13 ` David Miller
2008-08-18 6:08 ` Denys Fedoryshchenko
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-18 5:13 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 18 Aug 2008 14:36:45 +1000
> On Sun, Aug 17, 2008 at 09:31:31PM -0700, David Miller wrote:
> >
> > Ok, here is what I'll do as a rough plan:
> >
> > 1) Integrate this patch with your drop suggestion and ingress
> > fixed up.
> >
> > 2) Integrate Jarek's patch that closes the "both bits clear"
> > hole.
> >
> > 3) Add my bit that makes dev_deactivate() always take the spinlock
> > so there is just one simple yield() loop.
> >
> > And, finally, we can be at the point where we can get rid of the
> > RCU deferral of qdisc_destroy().
> >
> > Should be fun evening :-)
>
> Worthy of a gold medal :)
:-)
All of this is now pushed to net-2.6 on kernel.org
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 5:13 ` David Miller
@ 2008-08-18 6:08 ` Denys Fedoryshchenko
2008-08-18 6:13 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Denys Fedoryshchenko @ 2008-08-18 6:08 UTC (permalink / raw)
To: David Miller; +Cc: herbert, jarkao2, netdev
On Monday 18 August 2008, David Miller wrote:
> All of this is now pushed to net-2.6 on kernel.org
So I can build & test net-2.6 from git? :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 6:08 ` Denys Fedoryshchenko
@ 2008-08-18 6:13 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-18 6:13 UTC (permalink / raw)
To: denys; +Cc: herbert, jarkao2, netdev
From: Denys Fedoryshchenko <denys@visp.net.lb>
Date: Mon, 18 Aug 2008 09:08:12 +0300
> On Monday 18 August 2008, David Miller wrote:
> > All of this is now pushed to net-2.6 on kernel.org
> So I can build & test net-2.6 from git? :-)
Should be able to, yes :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 1:49 ` David Miller
2008-08-18 4:27 ` Herbert Xu
@ 2008-08-18 6:27 ` Jarek Poplawski
2008-08-18 6:38 ` David Miller
2008-08-18 21:29 ` Jarek Poplawski
2 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-18 6:27 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Sun, Aug 17, 2008 at 06:49:21PM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 18 Aug 2008 11:36:33 +1000
>
> > On Sun, Aug 17, 2008 at 06:35:05PM -0700, David Miller wrote:
> > > I think I see another way out of this:
> > >
> > > 1) Add __QDISC_STATE_DEACTIVATE.
> > >
> > > 2) Set it right before dev_deactivate() swaps/resets the qdisc
> > > pointer.
> > >
> > > 3) Test it in dev_queue_xmit() et al. once the qdisc root lock is
> > > acquired, and drop lock and resample ->qdisc if
> > > __QDISC_STATE_DEACTIVATE is set.
> >
> > Yep this sounds good to me.
>
> Here it is as a patch below.
>
> The test being added to __netif_schedule() is just a reminder that
> we have to address the "both bits clear" case somehow, likely with
> Jarek's patch which I unintentionally reimplemented :)
You shouldn't bother with this at all. I'm really pleased if I
sometimes think similarly to you, and this wasn't the most complex
idea to find, btw.
But, there is probably something else to worry about here. I didn't
get the final version of this patch, nor can I see it on the list, but
in your git tree there is this change to "goto out_kfree_skb", which
seems to skip rcu_read_unlock_bh().
Otherwise, hmm.. I'm still half asleep, and only after one coffee, so
maybe I'll change my mind soon, but for now I have some doubts about
these last changes.
Jarek P.
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index a7abfda..757ab08 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -27,6 +27,7 @@ enum qdisc_state_t
> {
> __QDISC_STATE_RUNNING,
> __QDISC_STATE_SCHED,
> + __QDISC_STATE_DEACTIVATED,
> };
>
> struct qdisc_size_table {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 600bb23..b88f669 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1341,6 +1341,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
>
> void __netif_schedule(struct Qdisc *q)
> {
> + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> + return;
> +
> if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
> struct softnet_data *sd;
> unsigned long flags;
> @@ -1790,6 +1793,8 @@ gso:
> rcu_read_lock_bh();
>
> txq = dev_pick_tx(dev, skb);
> +
> +resample_qdisc:
> q = rcu_dereference(txq->qdisc);
>
> #ifdef CONFIG_NET_CLS_ACT
> @@ -1800,6 +1805,11 @@ gso:
>
> spin_lock(root_lock);
>
> + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> + spin_unlock(root_lock);
> + goto resample_qdisc;
> + }
> +
> rc = qdisc_enqueue_root(skb, q);
> qdisc_run(q);
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 4685746..ff1c455 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -597,6 +597,9 @@ static void transition_one_qdisc(struct net_device *dev,
> struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
> int *need_watchdog_p = _need_watchdog;
>
> + if (!(new_qdisc->flags & TCQ_F_BUILTIN))
> + clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
> +
> rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
> if (need_watchdog_p && new_qdisc != &noqueue_qdisc)
> *need_watchdog_p = 1;
> @@ -640,6 +643,9 @@ static void dev_deactivate_queue(struct net_device *dev,
> if (qdisc) {
> spin_lock_bh(qdisc_lock(qdisc));
>
> + if (!(qdisc->flags & TCQ_F_BUILTIN))
> + set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> +
> dev_queue->qdisc = qdisc_default;
> qdisc_reset(qdisc);
>
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 6:27 ` Jarek Poplawski
@ 2008-08-18 6:38 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-18 6:38 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 18 Aug 2008 06:27:47 +0000
> But, there is probably something else to worry about here. I didn't
> get the final version of this patch, nor can I see it on the list, but
> in your git tree there is this change to "goto out_kfree_skb", which
> seems to skip rcu_read_unlock_bh().
That's a bug I added when I implemented Herbert's suggestion
to just drop the packet. Good spotting.
I've just pushed the following fix, thanks!
pkt_sched: Fix missed RCU unlock in dev_queue_xmit()
Noticed by Jarek Poplawski.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/core/dev.c | 10 ++++------
1 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 819f017..8d13380 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1805,14 +1805,12 @@ gso:
spin_lock(root_lock);
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
- spin_unlock(root_lock);
+ kfree_skb(skb);
rc = NET_XMIT_DROP;
- goto out_kfree_skb;
+ } else {
+ rc = qdisc_enqueue_root(skb, q);
+ qdisc_run(q);
}
-
- rc = qdisc_enqueue_root(skb, q);
- qdisc_run(q);
-
spin_unlock(root_lock);
goto out;
--
1.5.6.5.GIT
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-17 22:32 ` David Miller
@ 2008-08-18 20:12 ` Jarek Poplawski
2008-08-18 23:54 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-18 20:12 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
David Miller wrote, On 08/18/2008 12:32 AM:
...
> What is the real problem besides the correct notify_and_destroy()
> issue you discovered?
>
> The locking we have now is very simple:
Probably simple for you, but I'm not even sure of that. If it were
true, this thread and a few others concerned with it wouldn't exist.
>
> 1) Only under RTNL can qdisc roots change.
>
> 2) Therefore, sch_tree_lock() and tcf_tree_lock() are fully valid
> and lock the entire qdisc tree state, if and only if used under
> RTNL lock.
>
> 3) Before modifying a qdisc, we dev_deactivate(), which synchronizes
> with asynchronous TX/RX packet handling contexts.
>
> > 4) The qdisc root and all children are protected by the root qdisc's
> > lock, which is taken when asynchronous contexts need to be blocked
> > while modifying some root or inner qdisc's state.
>
> Yes, of course, if you apply a hammer and add a bit lock at the
> top of all of this it will fix whatever bugs remain, but as you
> know I don't think that's the solution.
I don't think that's true. What matters here is the qdisc lock. And
within the qdisc's tree only one such lock can matter, because the
current code doesn't use qdisc locks on lower levels. So we need to
take this one top lock. But as we have seen in notify_and_destroy() or
qdisc_watchdog(), it's easy to get this wrong, because you always have
to analyze the tree plus the activated/deactivated state. My proposal
is to still have this one lock, but easily accessible at some
well-known place (actually, with an exception for builtin qdiscs).
> The only substance I've seen is that you've found a violation of #4 in
> notify_and_destroy(), so great let's test the fix for that.
Actually, after partly fixing this we have currently messed this up
again. We can't destroy a qdisc without RCU or some other delayed
method while we're holding a lock inside it - and this is the case
for root qdiscs... It's quite simple too, but why did we miss this
so easily?
As mentioned, there was also the tricky qdisc_watchdog() (or other
direct __netif_schedule() use), but this is not all. Take a look at
qdisc_create() and its gen_new_estimator() call. It passes
qdisc_root_lock(sch) somewhere. This is similar to the
qdisc_watchdog() problem, but are you really sure we always save the
right spinlock here? I don't think so.
So, of course, if you think such problems are exceptional, and will
not return after the current fixes, then we can continue, no problem
for me, but I'm not sure about Denys or other testers and users.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 1:49 ` David Miller
2008-08-18 4:27 ` Herbert Xu
2008-08-18 6:27 ` Jarek Poplawski
@ 2008-08-18 21:29 ` Jarek Poplawski
2008-08-18 23:47 ` David Miller
2 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-18 21:29 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
David Miller wrote, On 08/18/2008 03:49 AM:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 18 Aug 2008 11:36:33 +1000
>
>> On Sun, Aug 17, 2008 at 06:35:05PM -0700, David Miller wrote:
>>> I think I see another way out of this:
>>>
>>> 1) Add __QDISC_STATE_DEACTIVATE.
>>>
>>> 2) Set it right before dev_deactivate() swaps/resets the qdisc
>>> pointer.
>>>
>>> 3) Test it in dev_queue_xmit() et al. once the qdisc root lock is
>>> acquired, and drop lock and resample ->qdisc if
>>> __QDISC_STATE_DEACTIVATE is set.
Two little doubts below:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 600bb23..b88f669 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1341,6 +1341,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
>
> void __netif_schedule(struct Qdisc *q)
> {
> + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> + return;
> +
Why can't I see this code in net-2.6? BTW, I guess it should now be
moved to the current __netif_reschedule()?
> if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
> struct softnet_data *sd;
> unsigned long flags;
> @@ -1790,6 +1793,8 @@ gso:
> rcu_read_lock_bh();
>
> txq = dev_pick_tx(dev, skb);
> +
> +resample_qdisc:
> q = rcu_dereference(txq->qdisc);
>
> #ifdef CONFIG_NET_CLS_ACT
> @@ -1800,6 +1805,11 @@ gso:
>
> spin_lock(root_lock);
>
> + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> + spin_unlock(root_lock);
> + goto resample_qdisc;
> + }
> +
OK, we now have this kfree_skb() with NET_XMIT_DROP here, but how is
it better than qdisc_enqueue_root() on noop_qdisc? Or how could we
have anything else here under both the rcu lock and spin_lock() while
this __QDISC_STATE_DEACTIVATED bit is set?
Jarek P.
> rc = qdisc_enqueue_root(skb, q);
> qdisc_run(q);
>
...
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 21:29 ` Jarek Poplawski
@ 2008-08-18 23:47 ` David Miller
2008-08-19 10:31 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-18 23:47 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 18 Aug 2008 23:29:46 +0200
> Two little doubts below:
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 600bb23..b88f669 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1341,6 +1341,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
> >
> > void __netif_schedule(struct Qdisc *q)
> > {
> > + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> > + return;
> > +
>
> Why can't I see this code in net-2.6? BTW, I guess it should now be
> moved to the current __netif_reschedule()?
I deleted it, it's unnecessary with your "both bits clear" fix
which I also added.
> OK, we now have this kfree_skb() with NET_XMIT_DROP here, but how is
> it better than qdisc_enqueue_root() on noop_qdisc? Or how could we
> have anything else here under both the rcu lock and spin_lock() while
> this __QDISC_STATE_DEACTIVATED bit is set?
Both operations are equivalent, the choice is arbitrary in my
opinion. It's not like the user can even see this in noop_qdisc's
stats or something like that.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 20:12 ` Jarek Poplawski
@ 2008-08-18 23:54 ` David Miller
2008-08-19 0:05 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-18 23:54 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 18 Aug 2008 22:12:36 +0200
> David Miller wrote, On 08/18/2008 12:32 AM:
> > 1) Only under RTNL can qdisc roots change.
> >
> > 2) Therefore, sch_tree_lock() and tcf_tree_lock() are fully valid
> > and lock the entire qdisc tree state, if and only if used under
> > RTNL lock.
> >
> > 3) Before modifying a qdisc, we dev_deactivate(), which synchronizes
> > with asynchronous TX/RX packet handling contexts.
> >
> > > 4) The qdisc root and all children are protected by the root qdisc's
> > > lock, which is taken when asynchronous contexts need to be blocked
> > > while modifying some root or inner qdisc's state.
> >
> > Yes, of course, if you apply a hammer and add a bit lock at the
> > top of all of this it will fix whatever bugs remain, but as you
> > know I don't think that's the solution.
>
> I don't think that's true. What matters here is the qdisc lock. And
> within the qdisc's tree only one such lock can matter, because the
> current code doesn't use qdisc locks on lower levels. So we need to
> take this one top lock. But as we have seen in notify_and_destroy() or
> qdisc_watchdog(), it's easy to get this wrong, because you always have
> to analyze the tree plus the activated/deactivated state. My proposal
> is to still have this one lock, but easily accessible at some
> well-known place (actually, with an exception for builtin qdiscs).
It sounds good in theory, but you cannot make this place be netdev_queue,
because multiple netdev_queue objects can point to the same qdisc.
That's why the lock isn't in netdev_queue any more, there is no longer
a 1 to 1 relationship.
> Actually, after partly fixing this we have currently messed this up
> again. We can't destroy a qdisc without RCU or some other delayed
> method while we're holding a lock inside it - and this is the case
> for root qdiscs... It's quite simple too, but why did we miss this
> so easily?
Yep, and that's the lock debugging thing which is triggering now.
Likely what I'll do is simply reinstate RCU but only for the freeing
of the memory, nothing more.
This keeps everything doing destruction under RTNL as desired,
yet fixes the "we're holding a lock that's being freed" problem.
> As mentioned, there was also the tricky qdisc_watchdog() (or other
> direct __netif_schedule() use), but this is not all. Take a look at
> qdisc_create() and its gen_new_estimator() call. It passes
> qdisc_root_lock(sch) somewhere. This is similar to the
> qdisc_watchdog() problem, but are you really sure we always save the
> right spinlock here? I don't think so.
If RTNL is held, we must be saving the correct lock.
The root qdisc and other aspects of qdisc configuration cannot be
changing when RTNL is held. That is why I put the RTNL assertion in
qdisc_root_lock(), as this is the only valid situation where it may
be used.
-------------------- sch_generic.h --------------------
static inline spinlock_t *qdisc_root_lock(struct Qdisc *qdisc)
{
struct Qdisc *root = qdisc_root(qdisc);
ASSERT_RTNL();
return qdisc_lock(root);
}
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 23:54 ` David Miller
@ 2008-08-19 0:05 ` Herbert Xu
2008-08-19 0:11 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 0:05 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, denys
On Mon, Aug 18, 2008 at 04:54:11PM -0700, David Miller wrote:
>
> Yep, and that's the lock debugging thing which is triggering now.
> Likely what I'll do is simply reinstate RCU but only for the freeing
> of the memory, nothing more.
>
> This keeps everything doing destruction under RTNL as desired,
> yet fixes the "we're holding a lock that's being freed" problem.
Is this the problem with dev_shutdown freeing the qdisc while
holding its lock?
Couldn't we just drop the lock before calling qdisc_destroy?
dev_shutdown can only be called after dev_deactivate, as such
all qdisc users should have disappeared by now (and if they
haven't, our waiting loop in dev_deactivate didn't work and
we should fix that).
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 0:05 ` Herbert Xu
@ 2008-08-19 0:11 ` David Miller
2008-08-19 4:07 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-19 0:11 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, denys
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 19 Aug 2008 10:05:51 +1000
> On Mon, Aug 18, 2008 at 04:54:11PM -0700, David Miller wrote:
> >
> > Yep, and that's the lock debugging thing which is triggering now.
> > Likely what I'll do is simply reinstate RCU but only for the freeing
> > of the memory, nothing more.
> >
> > This keeps everything doing destruction under RTNL as desired,
> > yet fixes the "we're holding a lock that's being freed" problem.
>
> Is this the problem with dev_shutdown freeing the qdisc while
> holding its lock?
>
> Couldn't we just drop the lock before calling qdisc_destroy?
>
> dev_shutdown can only be called after dev_deactivate, as such
> all qdisc users should have disappeared by now (and if they
> haven't, our waiting loop in dev_deactivate didn't work and
> we should fix that).
Yep, that's another possible approach.
Especially since we've bullet-proofed dev_deactivate(), there
should be absolutely no references to the qdisc any longer.
I'll look into doing it this way, thanks.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 0:11 ` David Miller
@ 2008-08-19 4:07 ` David Miller
2008-08-19 5:27 ` Ilpo Järvinen
2008-08-19 6:46 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: David Miller @ 2008-08-19 4:07 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, denys
From: David Miller <davem@davemloft.net>
Date: Mon, 18 Aug 2008 17:11:24 -0700 (PDT)
> Especially since we've bullet-proofed dev_deactivate(), there
> should be absolutely no references to the qdisc any longer.
>
> I'll look into doing it this way, thanks.
Ok, here it is. I'll push this out to net-2.6 after I do
some testing here.
pkt_sched: Don't hold qdisc lock over qdisc_destroy().
Based upon reports by Denys Fedoryshchenko, and feedback
and help from Jarek Poplawski and Herbert Xu.
We always either:
1) Never made an external reference to this qdisc.
or
2) Did a dev_deactivate() which purged all asynchronous
references.
So do not lock the qdisc when we call qdisc_destroy();
it's illegal anyway, as once we drop the lock this is
freed memory.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/sched/sch_api.c | 13 ++-----------
net/sched/sch_generic.c | 6 ------
2 files changed, 2 insertions(+), 17 deletions(-)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 7d7070b..d91a233 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -638,11 +638,8 @@ static void notify_and_destroy(struct sk_buff *skb, struct nlmsghdr *n, u32 clid
if (new || old)
qdisc_notify(skb, n, clid, old, new);
- if (old) {
- sch_tree_lock(old);
+ if (old)
qdisc_destroy(old);
- sch_tree_unlock(old);
- }
}
/* Graft qdisc "new" to class "classid" of qdisc "parent" or
@@ -1092,16 +1089,10 @@ create_n_graft:
graft:
if (1) {
- spinlock_t *root_lock;
-
err = qdisc_graft(dev, p, skb, n, clid, q, NULL);
if (err) {
- if (q) {
- root_lock = qdisc_root_lock(q);
- spin_lock_bh(root_lock);
+ if (q)
qdisc_destroy(q);
- spin_unlock_bh(root_lock);
- }
return err;
}
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6f96b7b..c3ed4d4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -518,8 +518,6 @@ void qdisc_reset(struct Qdisc *qdisc)
}
EXPORT_SYMBOL(qdisc_reset);
-/* Under qdisc_lock(qdisc) and BH! */
-
void qdisc_destroy(struct Qdisc *qdisc)
{
const struct Qdisc_ops *ops = qdisc->ops;
@@ -712,14 +710,10 @@ static void shutdown_scheduler_queue(struct net_device *dev,
struct Qdisc *qdisc_default = _qdisc_default;
if (qdisc) {
- spinlock_t *root_lock = qdisc_lock(qdisc);
-
dev_queue->qdisc = qdisc_default;
dev_queue->qdisc_sleeping = qdisc_default;
- spin_lock_bh(root_lock);
qdisc_destroy(qdisc);
- spin_unlock_bh(root_lock);
}
}
--
1.5.6.5.GIT
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 4:07 ` David Miller
@ 2008-08-19 5:27 ` Ilpo Järvinen
2008-08-19 5:30 ` David Miller
2008-08-19 6:46 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Ilpo Järvinen @ 2008-08-19 5:27 UTC (permalink / raw)
To: David Miller; +Cc: Herbert Xu, jarkao2, Netdev, denys
On Mon, 18 Aug 2008, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Mon, 18 Aug 2008 17:11:24 -0700 (PDT)
>
> Ok, here it is. I'll push this out to net-2.6 after I do
> some testing here.
>
> pkt_sched: Don't hold qdisc lock over qdisc_destroy().
...snip...
> @@ -1092,16 +1089,10 @@ create_n_graft:
>
> graft:
> if (1) {
> - spinlock_t *root_lock;
> -
> err = qdisc_graft(dev, p, skb, n, clid, q, NULL);
After this the block became unnecessary...
--
i.
--
[PATCH] pkt_sched: remove bogus block (cleanup)
...The block's last local variable just got deleted.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/sched/sch_api.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index d91a233..9372ec4 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1088,14 +1088,13 @@ create_n_graft:
}
graft:
- if (1) {
- err = qdisc_graft(dev, p, skb, n, clid, q, NULL);
- if (err) {
- if (q)
- qdisc_destroy(q);
- return err;
- }
+ err = qdisc_graft(dev, p, skb, n, clid, q, NULL);
+ if (err) {
+ if (q)
+ qdisc_destroy(q);
+ return err;
}
+
return 0;
}
--
1.5.2.2
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 5:27 ` Ilpo Järvinen
@ 2008-08-19 5:30 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-19 5:30 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: herbert, jarkao2, netdev, denys
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 19 Aug 2008 08:27:37 +0300 (EEST)
> pkt_sched: remove bogus block (cleanup)
>
> ...The block's last local variable just got deleted.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Applied, thanks!
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 4:07 ` David Miller
2008-08-19 5:27 ` Ilpo Järvinen
@ 2008-08-19 6:46 ` Jarek Poplawski
2008-08-19 7:03 ` David Miller
2008-08-19 7:23 ` Herbert Xu
1 sibling, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 6:46 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
On Mon, Aug 18, 2008 at 09:07:01PM -0700, David Miller wrote:
...
> pkt_sched: Don't hold qdisc lock over qdisc_destroy().
>
> Based upon reports by Denys Fedoryshchenko, and feedback
> and help from Jarek Poplawski and Herbert Xu.
>
> We always either:
>
> 1) Never made an external reference to this qdisc.
>
> or
>
> 2) Did a dev_deactivate() which purged all asynchronous
> references.
3) Read below, please...
>
> So do not lock the qdisc when we call qdisc_destroy();
> it's illegal anyway, as once we drop the lock this is
> freed memory.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> net/sched/sch_api.c | 13 ++-----------
> net/sched/sch_generic.c | 6 ------
> 2 files changed, 2 insertions(+), 17 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 7d7070b..d91a233 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -638,11 +638,8 @@ static void notify_and_destroy(struct sk_buff *skb, struct nlmsghdr *n, u32 clid
> if (new || old)
> qdisc_notify(skb, n, clid, old, new);
>
> - if (old) {
> - sch_tree_lock(old);
> + if (old)
> qdisc_destroy(old);
> - sch_tree_unlock(old);
> - }
> }
Actually, I, and earlier Herbert, have written about destroying root
qdiscs without sch_tree_lock(). I don't know about Herbert, but I'd
prefer to keep this lock here for child qdiscs: they can remove some
common structures, so this needs more checking, and even if they don't
do this currently, there is no need to remove the possibility here.
Similarly, I'm not sure if removing BH protection is really needed
here.
And, btw., this is neither case 1) nor case 2) according to the
changelog, I guess.
>
> /* Graft qdisc "new" to class "classid" of qdisc "parent" or
> @@ -1092,16 +1089,10 @@ create_n_graft:
>
> graft:
> if (1) {
> - spinlock_t *root_lock;
> -
> err = qdisc_graft(dev, p, skb, n, clid, q, NULL);
> if (err) {
> - if (q) {
> - root_lock = qdisc_root_lock(q);
> - spin_lock_bh(root_lock);
> + if (q)
> qdisc_destroy(q);
> - spin_unlock_bh(root_lock);
> - }
Probably, as above.
> return err;
> }
> }
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 6f96b7b..c3ed4d4 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -518,8 +518,6 @@ void qdisc_reset(struct Qdisc *qdisc)
> }
> EXPORT_SYMBOL(qdisc_reset);
>
> -/* Under qdisc_lock(qdisc) and BH! */
> -
??
+ /* Under BH for all, and qdisc_lock(qdisc) for child qdiscs only */
Jarek P.
> void qdisc_destroy(struct Qdisc *qdisc)
> {
> const struct Qdisc_ops *ops = qdisc->ops;
> @@ -712,14 +710,10 @@ static void shutdown_scheduler_queue(struct net_device *dev,
> struct Qdisc *qdisc_default = _qdisc_default;
>
> if (qdisc) {
> - spinlock_t *root_lock = qdisc_lock(qdisc);
> -
> dev_queue->qdisc = qdisc_default;
> dev_queue->qdisc_sleeping = qdisc_default;
>
> - spin_lock_bh(root_lock);
> qdisc_destroy(qdisc);
> - spin_unlock_bh(root_lock);
> }
> }
>
> --
> 1.5.6.5.GIT
>
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 6:46 ` Jarek Poplawski
@ 2008-08-19 7:03 ` David Miller
2008-08-19 7:23 ` Jarek Poplawski
2008-08-19 7:23 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-19 7:03 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 19 Aug 2008 06:46:09 +0000
> Actually, I, and earlier Herbert, have written about destroying root
> qdiscs without sch_tree_lock(). I don't know about Herbert, but I'd
> prefer to keep this lock here for child qdiscs: they can remove some
> common structures, so this needs more checking, and even if they don't
> do this currently, there is no need to remove the possibility here.
> Similarly, I'm not sure if removing BH protection is really needed
> here.
Well, you don't really know for sure whether this happens or not,
do you? :-)
Why don't you go make sure of this and report back what you
find? I see no reason to account for something that cannot
happen.
It's better to have a consistent rule for qdisc_destroy()
rather than a bunch of special cases that are hard to audit.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 6:46 ` Jarek Poplawski
2008-08-19 7:03 ` David Miller
@ 2008-08-19 7:23 ` Herbert Xu
2008-08-19 7:35 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 7:23 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 06:46:09AM +0000, Jarek Poplawski wrote:
>
> Actually, I, and earlier Herbert, have written about destroying root
> qdiscs without sch_tree_lock(). I don't know about Herbert, but I'd
> prefer to keep this lock here for child qdiscs: they can remove some
> common structures, so this needs more checking, and even if they don't
> do this currently, there is no need to remove the possibility here.
> Similarly, I'm not sure if removing BH protection is really needed
> here.
Qdiscs can die in two ways: when the underlying device dies, or
when the user removes/replaces the qdisc. In the first case we're
being called from dev_shutdown, so all users of the qdisc should
have ceased, or dev_deactivate is buggy. The other case is again
divided into two subcases. First, if we're removing the root qdisc,
then again dev_deactivate gets called and the same reasoning applies.
If we're removing a non-root qdisc, then we will first grab the
root qdisc's lock, kill the child, and release the root lock. By
convention, any user of a child qdisc must have acquired the root
qdisc's lock because the child is only reachable through the root.
Therefore once we have released the root qdisc lock after killing
the child, we're guaranteed that no further references to that
child can be made.
If for whatever reason the code does not reflect the reasoning
above, please feel free to fix the code :)
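In other words, removing a non-root qdisc should look something like
this (sketch only; example_kill_child is a made-up name):

static void example_kill_child(struct Qdisc *root, struct Qdisc *child)
{
	spinlock_t *root_lock = qdisc_lock(root);

	/* The child is only reachable through the root, so unlink
	 * and purge it while holding the root qdisc's lock.
	 */
	spin_lock_bh(root_lock);
	qdisc_reset(child);
	spin_unlock_bh(root_lock);

	/* No further references can be made; destruction is safe. */
	qdisc_destroy(child);
}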
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 7:03 ` David Miller
@ 2008-08-19 7:23 ` Jarek Poplawski
0 siblings, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 7:23 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
On Tue, Aug 19, 2008 at 12:03:07AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 19 Aug 2008 06:46:09 +0000
>
> > Actually, I, and earlier Herbert, have written about destroying root
> > qdiscs without sch_tree_lock(). I don't know about Herbert, but I'd
> > prefer to keep this lock here for child qdiscs: they can remove some
> > common structures, so this needs more checking, and even if they don't
> > do this currently, there is no need to remove the possibility here.
> > Similarly, I'm not sure if removing BH protection is really needed
> > here.
>
> Well, you don't really know for sure whether this happens or not,
> do you? :-)
Actually, you are the author, and I'm only here to spread some FUD...
> Why don't you go make sure of this and report back what you
> find? I see no reason to account for something that cannot
> happen.
Sure, this can be done, but it needs some time. And removing such old
locking could be treacherous, so I'd say the opposite: let's leave it
where we are not sure (and where removing it isn't necessary) until
there has been more checking. Currently I'm thinking mostly of something
like cls_u32(), and it's not the easiest piece of code to analyze.
> It's better to have a consistent rule for qdisc_destroy()
> rather than a bunch of special cases that are hard to audit.
I don't agree with this: doing total destruction is a different
case - there you are sure the proper order doesn't matter, and
nobody will ever read the structures after this.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 7:23 ` Herbert Xu
@ 2008-08-19 7:35 ` Jarek Poplawski
2008-08-19 7:46 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 7:35 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 05:23:16PM +1000, Herbert Xu wrote:
> On Tue, Aug 19, 2008 at 06:46:09AM +0000, Jarek Poplawski wrote:
> >
> > Actually, I, and earlier Herbert, have written about destroying root
> > qdiscs without sch_tree_lock(). I don't know about Herbert, but I'd
> > prefer to keep this lock for child qdiscs: they can remove some
> > common structures, so this needs more checking, and even if they don't
> > do this currently, there is no need to remove this possibility here.
> > Similarly, I'm not sure whether removing BH protection is really needed
> > here.
>
> Qdiscs can die in two ways: when the underlying device dies, or
> when the user removes/replaces the qdisc. In the first case
> we're being called from dev_shutdown, so all users of the qdisc
> should have ceased, or dev_deactivate is buggy. The other case
> again divides into two subcases. First of all, if we're removing
> the root qdisc, then dev_deactivate gets called again and the same
> reasoning applies.
This case is quite clear.
>
> If we're removing a non-root qdisc, then we will first grab the
> root qdisc's lock, kill the child, and release the root lock. By
> convention, any user of a child qdisc must have acquired the root
> qdisc's lock because the child is only reachable through the root.
> Therefore once we have released the root qdisc lock after killing
> the child, we're guaranteed that no further references to that
> child can be made.
By convention, there was always a comment saying that destroy runs under
sch_tree_lock(), so it was legal to depend on this. I'm not afraid
of somebody using a qdisc that is being destroyed - it's that code
inside this qdisc could refer to a shared part that is not yet destroyed.
> If for whatever reason the code does not reflect the reasoning
> above, please feel free to fix the code :)
Sure, I'll try to look for such problems.
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 7:35 ` Jarek Poplawski
@ 2008-08-19 7:46 ` Herbert Xu
2008-08-19 7:56 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 7:46 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 07:35:58AM +0000, Jarek Poplawski wrote:
>
> > If we're removing a non-root qdisc, then we will first grab the
> > root qdisc's lock, kill the child, and release the root lock. By
> > convention, any user of a child qdisc must have acquired the root
> > qdisc's lock because the child is only reachable through the root.
> > Therefore once we have released the root qdisc lock after killing
> > the child, we're guaranteed that no further references to that
> > child can be made.
>
> By convention, there was always a comment saying that destroy runs under
> sch_tree_lock(), so it was legal to depend on this. I'm not afraid
> of somebody using a qdisc that is being destroyed - it's that code
> inside this qdisc could refer to a shared part that is not yet destroyed.
No no no, it's not about qdisc_destroy at all. If you're relying
on the lock around qdisc_destroy, then you're already too late.
The qdisc should have been removed before we get to qdisc_destroy.
It's the act of removal that's protected by the root lock, and
still is. For example, in htb_graft we do sch_tree_lock before
killing any children; this is the lock that I was referring to.
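The pattern looks roughly like this (a simplified sketch with made-up
names, not the actual htb code):

static int example_graft(struct Qdisc *sch, unsigned long arg,
                         struct Qdisc *new, struct Qdisc **old)
{
        struct example_class *cl = (struct example_class *)arg;

        if (new == NULL)
                new = &noop_qdisc;

        sch_tree_lock(sch);             /* the root qdisc's lock */
        *old = cl->qdisc;               /* unlink the child ... */
        cl->qdisc = new;                /* ... and graft the new one */
        qdisc_tree_decrease_qlen(*old, (*old)->q.qlen);
        qdisc_reset(*old);
        sch_tree_unlock(sch);           /* no new references after this */

        return 0;
}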
> > If for whatever reason the code does not reflect the reasoning
> > above, please feel free to fix the code :)
>
> Sure, I'll try to look for such problems.
Of course if you find any classful qdisc that does not hold the
tree lock when killing children, then please send patches.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 7:46 ` Herbert Xu
@ 2008-08-19 7:56 ` Jarek Poplawski
2008-08-19 8:05 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 7:56 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 05:46:06PM +1000, Herbert Xu wrote:
> On Tue, Aug 19, 2008 at 07:35:58AM +0000, Jarek Poplawski wrote:
> >
> > > If we're removing a non-root qdisc, then we will first grab the
> > > root qdisc's lock, kill the child, and release the root lock. By
> > > convention, any user of a child qdisc must have acquired the root
> > > qdisc's lock because the child is only reachable through the root.
> > > Therefore once we have released the root qdisc lock after killing
> > > the child, we're guaranteed that no further references to that
> > > child can be made.
> >
> > By convention, there was always a comment saying that destroy runs under
> > sch_tree_lock(), so it was legal to depend on this. I'm not afraid
> > of somebody using a qdisc that is being destroyed - it's that code
> > inside this qdisc could refer to a shared part that is not yet destroyed.
>
> No no no, it's not about qdisc_destroy at all. If you're relying
> on the lock around qdisc_destroy, then you're already too late.
> The qdisc should have been removed before we get to qdisc_destroy.
>
> It's the act of removal that's protected by the root lock, and
> still is. For example, in htb_graft we do sch_tree_lock before
> killing any children; this is the lock that I was referring to.
I'm not sure I understand you: could you look at htb_destroy()
instead and think of it as a child qdisc of prio or of another htb?
Holding a top level "queue" lock guarantees there is no activity
anywhere in the tree at that moment.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 7:56 ` Jarek Poplawski
@ 2008-08-19 8:05 ` Herbert Xu
2008-08-19 8:17 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 8:05 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 07:56:23AM +0000, Jarek Poplawski wrote:
>
> I'm not sure I understand you: could you look at htb_destroy()
> instead and think of it as a child qdisc of prio or of another htb?
> Holding a top level "queue" lock guarantees there is no activity
> anywhere in the tree at that moment.
htb_destroy can either be called by qdisc_destroy or when a brand
new HTB qdisc fails construction. The latter case is trivial since
the qdisc has never been used.
In the first case, as you have seen from my previous email, the
entire branch containing the HTB qdisc (that is, either the HTB
qdisc itself if it's being deleted directly, or the branch stemming
from its ancestor that's being deleted) must no longer have any
references to it at all apart from this thread of execution.
As such we can do whatever we want with it, including freeing it.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:05 ` Herbert Xu
@ 2008-08-19 8:17 ` Jarek Poplawski
2008-08-19 8:23 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 8:17 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 06:05:57PM +1000, Herbert Xu wrote:
> On Tue, Aug 19, 2008 at 07:56:23AM +0000, Jarek Poplawski wrote:
> >
> > I'm not sure I understand you: could you look at htb_destroy()
> > instead and think of it as a child qdisc of prio or of another htb?
> > Holding a top level "queue" lock guarantees there is no activity
> > anywhere in the tree at that moment.
>
> htb_destroy can either be called by qdisc_destroy or when a brand
> new HTB qdisc fails construction. The latter case is trivial since
> the qdisc has never been used.
>
> In the first case, as you have seen from my previous email, the
> entire branch containing the HTB qdisc (that is, either the HTB
> qdisc itself if it's being deleted directly, or the branch stemming
> from its ancestor that's being deleted) must no longer have any
> references to it at all apart from this thread of execution.
>
> As such we can do whatever we want with it, including freeing it.
>
As I've written before, I'm mainly concerned with things like
tcf_destroy_chain(), especially wrt. cls_u32, but I may be wrong about
this. So, if you don't share these concerns, let's forget it for now,
and after I look at this more maybe we'll get back to this discussion.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:17 ` Jarek Poplawski
@ 2008-08-19 8:23 ` Herbert Xu
2008-08-19 8:32 ` David Miller
2008-08-19 8:39 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 8:23 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 08:17:13AM +0000, Jarek Poplawski wrote:
>
> As I've written before, I'm mainly concerned with things like
> tcf_destroy_chain(), especially wrt. cls_u32, but I may be wrong about
> this. So, if you don't share these concerns, let's forget it for now,
> and after I look at this more maybe we'll get back to this discussion.
Well I can't vouch for every single qdisc in the tree. However,
what I can say is that as long as they respect the rules I outlined
earlier with regards to holding the root qdisc lock when deleting
or using children, then they'll work as expected.
You're definitely welcome to audit the qdiscs to make sure that
they are obeying the rules.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:23 ` Herbert Xu
@ 2008-08-19 8:32 ` David Miller
2008-08-19 8:41 ` Jarek Poplawski
2008-08-19 8:50 ` Herbert Xu
2008-08-19 8:39 ` Jarek Poplawski
1 sibling, 2 replies; 209+ messages in thread
From: David Miller @ 2008-08-19 8:32 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, denys
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 19 Aug 2008 18:23:55 +1000
> On Tue, Aug 19, 2008 at 08:17:13AM +0000, Jarek Poplawski wrote:
> >
> > As I've written before, I'm mainly concerned with things like
> > tcf_destroy_chain(), especially wrt. cls_u32, but I may be wrong about
> > this. So, if you don't share these concerns, let's forget it for now,
> > and after I look at this more maybe we'll get back to this discussion.
>
> Well I can't vouch for every single qdisc in the tree. However,
> what I can say is that as long as they respect the rules I outlined
> earlier with regards to holding the root qdisc lock when deleting
> or using children, then they'll work as expected.
>
> You're definitely welcome to audit the qdiscs to make sure that
> they are obeying the rules.
Jarek may have a point about the u32 classifier. So we
should think about it.
The hash tables and tc_u_common objects are shared, and
u32 does non-atomic refcounting during destruction; see
u32_destroy().
However, this all might be OK because all of this management
is performed only under the RTNL semaphore.
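Schematically (a toy sketch of the pattern, not the cls_u32 code):

/* A plain, non-atomic refcount is fine when every update runs with
 * the RTNL semaphore held: RTNL serializes all the writers. */
struct example_shared {
        unsigned int refcnt;            /* protected by RTNL, not atomic_t */
};

static void example_shared_put(struct example_shared *obj)
{
        ASSERT_RTNL();                  /* caller must hold rtnl_lock() */
        if (--obj->refcnt == 0)
                kfree(obj);
}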
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:23 ` Herbert Xu
2008-08-19 8:32 ` David Miller
@ 2008-08-19 8:39 ` Jarek Poplawski
2008-08-19 8:55 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 8:39 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 06:23:55PM +1000, Herbert Xu wrote:
> On Tue, Aug 19, 2008 at 08:17:13AM +0000, Jarek Poplawski wrote:
> >
> > As I've written before, I'm mainly concerned with things like
> > tcf_destroy_chain(), especially wrt. cls_u32, but I may be wrong about
> > this. So, if you don't share these concerns, let's forget it for now,
> > and after I look at this more maybe we'll get back to this discussion.
>
> Well I can't vouch for every single qdisc in the tree. However,
> what I can say is that as long as they respect the rules I outlined
> earlier with regards to holding the root qdisc lock when deleting
> or using children, then they'll work as expected.
>
> You're definitely welcome to audit the qdiscs to make sure that
> they are obeying the rules.
That's my point - is there really a reason to do this change without such
an audit if we are not forced to at the moment? (I'd remind you that this
way of doing things was entirely legal according to the comments.) I doubt
I'm the right person for auditing this, but as I said I'll have a look,
especially once those fascinating oopses and warnings are no longer
around.
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:32 ` David Miller
@ 2008-08-19 8:41 ` Jarek Poplawski
2008-08-19 8:48 ` David Miller
2008-08-19 8:50 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 8:41 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
On Tue, Aug 19, 2008 at 01:32:05AM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 19 Aug 2008 18:23:55 +1000
>
> > On Tue, Aug 19, 2008 at 08:17:13AM +0000, Jarek Poplawski wrote:
> > >
> > > As I've written before, I'm mainly concerned with things like
> > > tcf_destroy_chain(), especially wrt. cls_u32, but I may be wrong about
> > > this. So, if you don't share these concerns, let's forget it for now,
> > > and after I look at this more maybe we'll get back to this discussion.
> >
> > Well I can't vouch for every single qdisc in the tree. However,
> > what I can say is that as long as they respect the rules I outlined
> > earlier with regards to holding the root qdisc lock when deleting
> > or using children, then they'll work as expected.
> >
> > You're definitely welcome to audit the qdiscs to make sure that
> > they are obeying the rules.
>
> Jarek may have a point about the u32 classifier. So we
> should think about it.
>
> The hash tables and tc_u_common objects are shared, and
> u32 does non-atomic refcounting during destruction; see
> u32_destroy().
>
> However, this all might be OK because all of this management
> is performed only under the RTNL semaphore.
Sure, this all should be write protected. I'm concerned only about
the read side here.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:41 ` Jarek Poplawski
@ 2008-08-19 8:48 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-19 8:48 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 19 Aug 2008 08:41:48 +0000
> On Tue, Aug 19, 2008 at 01:32:05AM -0700, David Miller wrote:
> > Jarek may have a point about the u32 classifier. So we
> > should think about it.
> >
> > The hash tables and tc_u_common objects are shared, and
> > u32 does non-atomic refcounting during destruction; see
> > u32_destroy().
> >
> > However, this all might be OK because all of this management
> > is performed only under the RTNL semaphore.
>
> Sure, this all should be write protected. I'm concerned only about
> the read side here.
If the reference count hits zero, nobody can reference the object.
Reference counts only change under RTNL, which is my point.
The old qdisc object is taken away from global visibility inside of
the cops->graft() call done by qdisc_graft(). This handler already
must do whatever locking is necessary while removing the qdisc from
visibility from the packet path.
And by virtue of that, the old qdisc and anything it references will
no longer be visible to the packet processing path after cops->graft()
returns.
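Something like this, in outline (a simplified sketch of the ordering,
not the actual qdisc_graft() source):

static int example_graft_and_destroy(struct Qdisc *parent,
                                     unsigned long cl,
                                     struct Qdisc *new)
{
        const struct Qdisc_class_ops *cops = parent->ops->cl_ops;
        struct Qdisc *old = NULL;
        int err;

        /* ->graft() unlinks the old child under the root lock; after
         * it returns, the packet path can no longer reach it. */
        err = cops->graft(parent, cl, new, &old);
        if (err)
                return err;

        if (old)
                qdisc_destroy(old);     /* safe: no longer visible to TX */
        return 0;
}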
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:32 ` David Miller
2008-08-19 8:41 ` Jarek Poplawski
@ 2008-08-19 8:50 ` Herbert Xu
1 sibling, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 8:50 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, denys
On Tue, Aug 19, 2008 at 01:32:05AM -0700, David Miller wrote:
>
> Jarek may have a point about the u32 classifier. So we
> should think about it.
>
> The hash tables and tc_u_common objects are shared, and
> u32 does non-atomic refcounting during destruction; see
> u32_destroy().
I had a look and it seems OK to me. Essentially we have
two sides in all this: the read side, which is the path that
transmits packets, and the write side, which is the control path.
As with qdisc_destroy, u32_destroy can only be called once all
read sides have exited, so we won't be racing against that. In
fact, if we were able to race against it, then holding the lock
would be no good anyway, because it would imply the u32 object is
still referenced by a qdisc tree, and once we release the lock
the read side could get back to a destroyed u32 object.
> However, this all might be OK because all of this management
> is performed only under the RTNL semaphore.
Right, all other writers have been excluded by RTNL so we should
be the only thread with a reference to the u32 and we can decrement
the ref count non-atomically.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:39 ` Jarek Poplawski
@ 2008-08-19 8:55 ` Herbert Xu
2008-08-19 9:16 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 8:55 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 08:39:09AM +0000, Jarek Poplawski wrote:
>
> > Well I can't vouch for every single qdisc in the tree. However,
> > what I can say is that as long as they respect the rules I outlined
> > earlier with regards to holding the root qdisc lock when deleting
> > or using children, then they'll work as expected.
> >
> > You're definitely welcome to audit the qdiscs to make sure that
> > they are obeying the rules.
>
> That's my point - is there really a reason to do this change without such
> an audit if we are not forced to at the moment? (I'd remind you that this
> way of doing things was entirely legal according to the comments.) I doubt
> I'm the right person for auditing this, but as I said I'll have a look,
> especially once those fascinating oopses and warnings are no longer
> around.
No, you misunderstood my point. I wasn't saying that I'm not confident
that our qdiscs obey the rules, but rather that if any of them didn't,
then they're buggy and should be fixed.
In fact we're not really adding anything new here, the qdiscs were
not accessed under RCU uniformly. If you go back in the tree prior
to the multi-qdisc stuff, you'll find that only dev_queue_xmit works
under RCU. qdisc_restart does not and therefore deferring the
destruction to RCU is pointless anyway.
So in fact we've already been relying on the fact that by the time
qdisc_destroy comes about nobody on the read side (i.e., the packet
transmission path) should have a reference to it.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 8:55 ` Herbert Xu
@ 2008-08-19 9:16 ` Jarek Poplawski
2008-08-21 10:01 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 9:16 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 06:55:04PM +1000, Herbert Xu wrote:
> On Tue, Aug 19, 2008 at 08:39:09AM +0000, Jarek Poplawski wrote:
> >
> > > Well I can't vouch for every single qdisc in the tree. However,
> > > what I can say is that as long as they respect the rules I outlined
> > > earlier with regards to holding the root qdisc lock when deleting
> > > or using children, then they'll work as expected.
> > >
> > > You're definitely welcome to audit the qdiscs to make sure that
> > > they are obeying the rules.
> >
> > That's my point - is there really a reason to do this change without such
> > an audit if we are not forced to at the moment? (I'd remind you that this
> > way of doing things was entirely legal according to the comments.) I doubt
> > I'm the right person for auditing this, but as I said I'll have a look,
> > especially once those fascinating oopses and warnings are no longer
> > around.
>
> No, you misunderstood my point. I wasn't saying that I'm not confident
> that our qdiscs obey the rules, but rather that if any of them didn't,
> then they're buggy and should be fixed.
What difference does it make? You're not sure things will not break
after this change.
>
> In fact we're not really adding anything new here, the qdiscs were
> not accessed under RCU uniformly. If you go back in the tree prior
> to the multi-qdisc stuff, you'll find that only dev_queue_xmit works
> under RCU. qdisc_restart does not and therefore deferring the
> destruction to RCU is pointless anyway.
>
> So in fact we've already been relying on the fact that by the time
> qdisc_destroy comes about nobody on the read side (i.e., the packet
> transmission path) should have a reference to it.
Let's not discuss others using such a qdisc, but rather the possibility
that some common lists could be broken for readers from upper qdiscs.
(They were not deactivated.) Of course, if it's done properly, with
the proper references taken before qdisc_destroy, then it's all right.
I'd still prefer to check this later.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-18 23:47 ` David Miller
@ 2008-08-19 10:31 ` Jarek Poplawski
2008-08-19 10:51 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 10:31 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Mon, Aug 18, 2008 at 04:47:48PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 18 Aug 2008 23:29:46 +0200
>
> > Two little doubts below:
> >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 600bb23..b88f669 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -1341,6 +1341,9 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
> > >
> > > void __netif_schedule(struct Qdisc *q)
> > > {
> > > + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> > > + return;
> > > +
> >
> > Why can't I see this code in net-2.6? BTW, I guess it should now be
> > moved to the current __netif_reschedule()?
>
> I deleted it, it's unnecessary with your "both bits clear" fix
> which I also added.
Herbert was concerned earlier with this:
"What I mean is the extremely unlikely scenario of net_tx_action
always failing on trylock because dev_deactivate has grabbed the
lock to check whether net_tx_action has completed."
So, I guess this could help here.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 10:31 ` Jarek Poplawski
@ 2008-08-19 10:51 ` Herbert Xu
2008-08-19 10:54 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 10:51 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Tue, Aug 19, 2008 at 10:31:51AM +0000, Jarek Poplawski wrote:
>
> > > > void __netif_schedule(struct Qdisc *q)
> > > > {
> > > > + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> > > > + return;
> > >
> > > Why can't I see this code in net-2.6? BTW, I guess it should now be
> > > moved to the current __netif_reschedule()?
> >
> > I deleted it, it's unnecessary with your "both bits clear" fix
> > which I also added.
>
> Herbert was concerned earlier with this:
> "What I mean is the extremely unlikely scenario of net_tx_action
> always failing on trylock because dev_deactivate has grabbed the
> lock to check whether net_tx_action has completed."
>
> So, I guess this could help here.
Right. Even though the live-lock is an extremely unlikely event,
as the aliveness flag check isn't on the fast path anyway I think
we should keep it.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 10:51 ` Herbert Xu
@ 2008-08-19 10:54 ` David Miller
2008-08-19 10:55 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-19 10:54 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 19 Aug 2008 20:51:33 +1000
> On Tue, Aug 19, 2008 at 10:31:51AM +0000, Jarek Poplawski wrote:
> >
> > > > > void __netif_schedule(struct Qdisc *q)
> > > > > {
> > > > > + if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> > > > > + return;
> > > >
> > > > Why can't I see this code in net-2.6? BTW, I guess it should now be
> > > > moved to the current __netif_reschedule()?
> > >
> > > I deleted it, it's unnecessary with your "both bits clear" fix
> > > which I also added.
> >
> > Herbert was concerned earlier with this:
> > "What I mean is the extremely unlikely scenario of net_tx_action
> > always failing on trylock because dev_deactivate has grabbed the
> > lock to check whether net_tx_action has completed."
> >
> > So, I guess this could help here.
>
> Right. Even though the live-lock is an extremely unlikely event,
> as the aliveness flag check isn't on the fast path anyway I think
> we should keep it.
Every qdisc_run() will invoke __netif_schedule() so it is
a fast path I think :-)
But anyway, all of these paths diddle with these state bits
already, so they're in the cache and it's a cheap test.
I can add it back once we're all sure we're talking about
the same thing. So you're saying that we should add the
__QDISC_STATE_DEACTIVATED test to __netif_schedule(), right?
I was confused because you said "we should keep it": it was never
in the net-2.6 tree and only existed in my RFC patch posting, so
I'm trying to figure out what you meant. :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 10:54 ` David Miller
@ 2008-08-19 10:55 ` Herbert Xu
2008-08-19 10:58 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 10:55 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Tue, Aug 19, 2008 at 03:54:06AM -0700, David Miller wrote:
>
> Every qdisc_run() will invoke __netif_schedule() so it is
> a fast path I think :-)
Argh, I meant __netif_reschedule which shouldn't be the fast path.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 10:55 ` Herbert Xu
@ 2008-08-19 10:58 ` Herbert Xu
2008-08-19 11:02 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 10:58 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Tue, Aug 19, 2008 at 08:55:51PM +1000, Herbert Xu wrote:
> On Tue, Aug 19, 2008 at 03:54:06AM -0700, David Miller wrote:
> >
> > Every qdisc_run() will invoke __netif_schedule() so it is
> > a fast path I think :-)
>
> Argh, I meant __netif_reschedule which shouldn't be the fast path.
Nevermind, both paths call __netif_reschedule :)
OK, how about just moving it to the else clause in net_tx_action?
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 10:58 ` Herbert Xu
@ 2008-08-19 11:02 ` David Miller
2008-08-19 11:11 ` Herbert Xu
2008-08-19 16:48 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: David Miller @ 2008-08-19 11:02 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 19 Aug 2008 20:58:15 +1000
> On Tue, Aug 19, 2008 at 08:55:51PM +1000, Herbert Xu wrote:
> > On Tue, Aug 19, 2008 at 03:54:06AM -0700, David Miller wrote:
> > >
> > > Every qdisc_run() will invoke __netif_schedule() so it is
> > > a fast path I think :-)
> >
> > Argh, I meant __netif_reschedule which shouldn't be the fast path.
>
> Nevermind, both paths call __netif_reschedule :)
>
> OK, how about just moving it to the else clause in net_tx_action?
I just checked the following into net-2.6
pkt_sched: Prevent livelock in TX queue running.
If dev_deactivate() is trying to quiesce the queue, it
is theoretically possible for another cpu to livelock
trying to process that queue. This happens because
dev_deactivate() grabs the queue spinlock as it checks
the queue state, whereas net_tx_action() does a trylock
and reschedules the qdisc if it hits the lock.
This breaks the livelock by adding a check on
__QDISC_STATE_DEACTIVATED to net_tx_action() when
the trylock fails.
Based upon feedback from Herbert Xu and Jarek Poplawski.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/core/dev.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 8d13380..60c51f7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1990,7 +1990,9 @@ static void net_tx_action(struct softirq_action *h)
qdisc_run(q);
spin_unlock(root_lock);
} else {
- __netif_reschedule(q);
+ if (!test_bit(__QDISC_STATE_DEACTIVATED,
+ &q->state))
+ __netif_reschedule(q);
}
}
}
--
1.5.6.5.GIT
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 11:02 ` David Miller
@ 2008-08-19 11:11 ` Herbert Xu
2008-08-19 16:48 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 11:11 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Tue, Aug 19, 2008 at 04:02:00AM -0700, David Miller wrote:
>
> I just checked the following into net-2.6
Thanks Dave!
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 11:02 ` David Miller
2008-08-19 11:11 ` Herbert Xu
@ 2008-08-19 16:48 ` Jarek Poplawski
2008-08-19 22:23 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-19 16:48 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
David Miller wrote, On 08/19/2008 01:02 PM:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 19 Aug 2008 20:58:15 +1000
>
>> On Tue, Aug 19, 2008 at 08:55:51PM +1000, Herbert Xu wrote:
>>> On Tue, Aug 19, 2008 at 03:54:06AM -0700, David Miller wrote:
>>>> Every qdisc_run() will invoke __netif_schedule() so it is
>>>> a fast path I think :-)
>>> Argh, I meant __netif_reschedule which shouldn't be the fast path.
>> Nevermind, both paths call __netif_reschedule :)
>>
I may be missing something, but probably it's still needed in __netif_reschedule().
Here is a scenario:
cpu1                              cpu2
dev_deactivate()
  dev_deactivate_queue()
    qdisc_reset()
                                  qdisc_run()
                                  qdisc_watchdog_schedule()
                                  (or hrtimer_restart in cbq)
  while (some_qdisc_is_busy())
    return (qdisc not busy)
                                  hrtimer triggered
                                  __netif_schedule()
qdisc_destroy()                   qdisc_run()
Jarek P.
>> OK, how about just moving it to the else clause in net_tx_action?
>
> I just checked the following into net-2.6
>
> pkt_sched: Prevent livelock in TX queue running.
>
> If dev_deactivate() is trying to quiesce the queue, it
> is theoretically possible for another cpu to livelock
> trying to process that queue. This happens because
> dev_deactivate() grabs the queue spinlock as it checks
> the queue state, whereas net_tx_action() does a trylock
> and reschedules the qdisc if it hits the lock.
>
> This breaks the livelock by adding a check on
> __QDISC_STATE_DEACTIVATED to net_tx_action() when
> the trylock fails.
>
> Based upon feedback from Herbert Xu and Jarek Poplawski.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
> net/core/dev.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8d13380..60c51f7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1990,7 +1990,9 @@ static void net_tx_action(struct softirq_action *h)
> qdisc_run(q);
> spin_unlock(root_lock);
> } else {
> - __netif_reschedule(q);
> + if (!test_bit(__QDISC_STATE_DEACTIVATED,
> + &q->state))
> + __netif_reschedule(q);
> }
> }
> }
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 16:48 ` Jarek Poplawski
@ 2008-08-19 22:23 ` Herbert Xu
2008-08-20 11:56 ` [PATCH] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race Jarek Poplawski
2008-08-21 5:49 ` [PATCH take 2] " Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-19 22:23 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Tue, Aug 19, 2008 at 06:48:31PM +0200, Jarek Poplawski wrote:
>
> I may be missing something, but probably it's still needed in __netif_reschedule().
No, this is a genuine bug. However, it's also an old bug :) Even
with the locking and RCU barrier it could have occurred.
> Here is a scenario:
>
> cpu1                              cpu2
> dev_deactivate()
>   dev_deactivate_queue()
>     qdisc_reset()
>                                   qdisc_run()
>                                   qdisc_watchdog_schedule()
>                                   (or hrtimer_restart in cbq)
We should add an aliveness check before scheduling the timer. This
is the slow path so adding a check shouldn't hurt.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* [PATCH] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-19 22:23 ` Herbert Xu
@ 2008-08-20 11:56 ` Jarek Poplawski
2008-08-20 12:16 ` Herbert Xu
2008-08-21 5:17 ` Jarek Poplawski
2008-08-21 5:49 ` [PATCH take 2] " Jarek Poplawski
1 sibling, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-20 11:56 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Wed, Aug 20, 2008 at 08:23:29AM +1000, Herbert Xu wrote:
> On Tue, Aug 19, 2008 at 06:48:31PM +0200, Jarek Poplawski wrote:
> >
> > I may be missing something, but probably it's still needed in __netif_reschedule().
>
> No, this is a genuine bug. However, it's also an old bug :) Even
> with the locking and RCU barrier it could have occurred.
>
> > Here is a scenario:
> >
> > cpu1                              cpu2
> > dev_deactivate()
> >   dev_deactivate_queue()
> >     qdisc_reset()
> >                                   qdisc_run()
> >                                   qdisc_watchdog_schedule()
> >                                   (or hrtimer_restart in cbq)
>
> We should add an aliveness check before scheduling the timer. This
> is the slow path so adding a check shouldn't hurt.
Thanks,
Jarek P.
--------------->
pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
dev_deactivate() can miss a qdisc rescheduled by qdisc_watchdog()
or another timer calling netif_schedule() after dev_deactivate_queue().
We prevent this by checking aliveness before scheduling the timer.
With feedback from Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
net/sched/sch_api.c | 3 +++
net/sched/sch_cbq.c | 3 +++
2 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ef0efec..6f2bc7f 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -444,6 +444,9 @@ void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires)
{
ktime_t time;
+ if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root(wd->qdisc)->state))
+ return;
+
wd->qdisc->flags |= TCQ_F_THROTTLED;
time = ktime_set(0, 0);
time = ktime_add_ns(time, PSCHED_US2NS(expires));
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 47ef492..c04d335 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -521,6 +521,9 @@ static void cbq_ovl_delay(struct cbq_class *cl)
struct cbq_sched_data *q = qdisc_priv(cl->qdisc);
psched_tdiff_t delay = cl->undertime - q->now;
+ if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root(cl->qdisc)->state))
+ return;
+
if (!cl->delayed) {
psched_time_t sched = q->now;
ktime_t expires;
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-20 11:56 ` [PATCH] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race Jarek Poplawski
@ 2008-08-20 12:16 ` Herbert Xu
2008-08-21 5:17 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-20 12:16 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Wed, Aug 20, 2008 at 11:56:56AM +0000, Jarek Poplawski wrote:
>
> pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
>
> dev_deactivate() can miss a qdisc rescheduled by qdisc_watchdog()
> or another timer calling netif_schedule() after dev_deactivate_queue().
> We prevent this by checking aliveness before scheduling the timer.
>
> With feedback from Herbert Xu <herbert@gondor.apana.org.au>
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Looks good to me.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-20 11:56 ` [PATCH] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race Jarek Poplawski
2008-08-20 12:16 ` Herbert Xu
@ 2008-08-21 5:17 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 5:17 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Wed, Aug 20, 2008 at 11:56:56AM +0000, Jarek Poplawski wrote:
...
> pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
>
> dev_deactivate() can miss a qdisc rescheduled by qdisc_watchdog()
> or another timer calling netif_schedule() after dev_deactivate_queue().
> We prevent this by checking aliveness before scheduling the timer.
>
> With feedback from Herbert Xu <herbert@gondor.apana.org.au>
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
Actually, this patch is "no good" (wrong root qdisc).
David, please don't apply - I'll send a redone version.
Thanks,
Jarek P.
> ---
>
> net/sched/sch_api.c | 3 +++
> net/sched/sch_cbq.c | 3 +++
> 2 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index ef0efec..6f2bc7f 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -444,6 +444,9 @@ void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires)
> {
> ktime_t time;
>
> + if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root(wd->qdisc)->state))
> + return;
> +
> wd->qdisc->flags |= TCQ_F_THROTTLED;
> time = ktime_set(0, 0);
> time = ktime_add_ns(time, PSCHED_US2NS(expires));
> diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
> index 47ef492..c04d335 100644
> --- a/net/sched/sch_cbq.c
> +++ b/net/sched/sch_cbq.c
> @@ -521,6 +521,9 @@ static void cbq_ovl_delay(struct cbq_class *cl)
> struct cbq_sched_data *q = qdisc_priv(cl->qdisc);
> psched_tdiff_t delay = cl->undertime - q->now;
>
> + if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc_root(cl->qdisc)->state))
> + return;
> +
> if (!cl->delayed) {
> psched_time_t sched = q->now;
> ktime_t expires;
^ permalink raw reply [flat|nested] 209+ messages in thread
* [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-19 22:23 ` Herbert Xu
2008-08-20 11:56 ` [PATCH] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race Jarek Poplawski
@ 2008-08-21 5:49 ` Jarek Poplawski
2008-08-21 6:10 ` Herbert Xu
2008-08-21 12:11 ` David Miller
1 sibling, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 5:49 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
---------------> (take 2)
pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
dev_deactivate() can miss a qdisc rescheduled by qdisc_watchdog()
or another timer calling netif_schedule() after dev_deactivate_queue().
We prevent this by checking aliveness before scheduling the timer. Since
during deactivation the root qdisc is available only as qdisc_sleeping,
an additional accessor qdisc_root_sleeping() is created.
With feedback from Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
include/net/sch_generic.h | 5 +++++
net/sched/sch_api.c | 4 ++++
net/sched/sch_cbq.c | 4 ++++
3 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 84d25f2..b1d2cfe 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -193,6 +193,11 @@ static inline struct Qdisc *qdisc_root(struct Qdisc *qdisc)
return qdisc->dev_queue->qdisc;
}
+static inline struct Qdisc *qdisc_root_sleeping(struct Qdisc *qdisc)
+{
+ return qdisc->dev_queue->qdisc_sleeping;
+}
+
/* The qdisc root lock is a mechanism by which to top level
* of a qdisc tree can be locked from any qdisc node in the
* forest. This allows changing the configuration of some
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ef0efec..45f442d 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -444,6 +444,10 @@ void qdisc_watchdog_schedule(struct qdisc_watchdog *wd, psched_time_t expires)
{
ktime_t time;
+ if (test_bit(__QDISC_STATE_DEACTIVATED,
+ &qdisc_root_sleeping(wd->qdisc)->state))
+ return;
+
wd->qdisc->flags |= TCQ_F_THROTTLED;
time = ktime_set(0, 0);
time = ktime_add_ns(time, PSCHED_US2NS(expires));
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 47ef492..8fa90d6 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -521,6 +521,10 @@ static void cbq_ovl_delay(struct cbq_class *cl)
struct cbq_sched_data *q = qdisc_priv(cl->qdisc);
psched_tdiff_t delay = cl->undertime - q->now;
+ if (test_bit(__QDISC_STATE_DEACTIVATED,
+ &qdisc_root_sleeping(cl->qdisc)->state))
+ return;
+
if (!cl->delayed) {
psched_time_t sched = q->now;
ktime_t expires;
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 5:49 ` [PATCH take 2] " Jarek Poplawski
@ 2008-08-21 6:10 ` Herbert Xu
2008-08-21 6:49 ` Jarek Poplawski
2008-08-21 12:11 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 6:10 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Thu, Aug 21, 2008 at 05:49:11AM +0000, Jarek Poplawski wrote:
> ---------------> (take 2)
>
> pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
>
> dev_deactivate() can miss a qdisc rescheduled by qdisc_watchdog()
> or another timer calling netif_schedule() after dev_deactivate_queue().
> We prevent this by checking aliveness before scheduling the timer. Since
> during deactivation the root qdisc is available only as qdisc_sleeping,
> an additional accessor qdisc_root_sleeping() is created.
>
> With feedback from Herbert Xu <herbert@gondor.apana.org.au>
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Good catch!
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 84d25f2..b1d2cfe 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -193,6 +193,11 @@ static inline struct Qdisc *qdisc_root(struct Qdisc *qdisc)
> return qdisc->dev_queue->qdisc;
> }
>
> +static inline struct Qdisc *qdisc_root_sleeping(struct Qdisc *qdisc)
> +{
> + return qdisc->dev_queue->qdisc_sleeping;
> +}
When would we actually want the non-sleeping variant?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 6:10 ` Herbert Xu
@ 2008-08-21 6:49 ` Jarek Poplawski
2008-08-21 7:16 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 6:49 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Thu, Aug 21, 2008 at 04:10:24PM +1000, Herbert Xu wrote:
...
> > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > index 84d25f2..b1d2cfe 100644
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -193,6 +193,11 @@ static inline struct Qdisc *qdisc_root(struct Qdisc *qdisc)
> > return qdisc->dev_queue->qdisc;
> > }
> >
> > +static inline struct Qdisc *qdisc_root_sleeping(struct Qdisc *qdisc)
> > +{
> > + return qdisc->dev_queue->qdisc_sleeping;
> > +}
>
> When would we actually want the non-sleeping variant?
>
We need to check whether something depends on &noop_qdisc being returned
in a similar state. Otherwise, there are a bit too many possibilities
here, so it would be nice to simplify all this.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 6:49 ` Jarek Poplawski
@ 2008-08-21 7:16 ` Herbert Xu
2008-08-21 7:52 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 7:16 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Thu, Aug 21, 2008 at 06:49:41AM +0000, Jarek Poplawski wrote:
>
> We need to check whether something depends on &noop_qdisc being returned
> in a similar state. Otherwise, there are a bit too many possibilities
> here, so it would be nice to simplify all this.
Actually, why do we even keep a netdev_queue pointer in a qdisc?
A given qdisc can be used by multiple queues (which is why the
lock was moved into the qdisc in the first place).
How about keeping a pointer directly to the root qdisc plus a
pointer to the netdev (which seems to be the only other use for
qdisc->dev_queue)? That way there won't be any confusion as to
whether we want the sleeping or non-sleeping qdisc.
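Something like this, say (hypothetical fields, just to sketch the
idea - today both accessors go through qdisc->dev_queue):

/* The qdisc would carry its root and device directly, so there is
 * no sleeping vs. non-sleeping ambiguity when walking up the tree. */
static inline struct Qdisc *qdisc_root(struct Qdisc *qdisc)
{
        return qdisc->root;             /* assumed new field */
}

static inline struct net_device *qdisc_dev(struct Qdisc *qdisc)
{
        return qdisc->dev;              /* assumed new field */
}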
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 7:16 ` Herbert Xu
@ 2008-08-21 7:52 ` David Miller
2008-08-21 8:00 ` Herbert Xu
` (2 more replies)
0 siblings, 3 replies; 209+ messages in thread
From: David Miller @ 2008-08-21 7:52 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 21 Aug 2008 17:16:34 +1000
> Actually, why do we even keep a netdev_queue pointer in a qdisc?
> A given qdisc can be used by multiple queues (which is why the
> lock was moved into the qdisc in the first place).
>
> How about keeping a pointer directly to the root qdisc plus a
> pointer to the netdev (which seems to be the only other use for
> qdisc->dev_queue)? That way there won't be any confusion as to
> whether we want the sleeping or non-sleeping qdisc.
Not a bad idea at all.
The reason it's there is a left-over from earlier designs of my
multiqueue stuff: I thought we'd always multiplex the qdiscs to be
per-queue. But once Patrick showed me we couldn't do that, we ended
up with shared qdiscs.
If I get a chance I'll work on this. But to be honest, given
Linus's temperament right now we're only going to be able to
merge one-liners to fix these kinds of problems, and anything
more serious is going to be for net-next-2.6 only.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 7:52 ` David Miller
@ 2008-08-21 8:00 ` Herbert Xu
2008-08-21 8:27 ` Jarek Poplawski
2008-09-11 10:39 ` David Miller
2 siblings, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 8:00 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Thu, Aug 21, 2008 at 12:52:50AM -0700, David Miller wrote:
>
> If I get a chance I'll work on this. But to be honest, given
> Linus's temperament right now we're only going to be able to
> merge one-liners to fix these kinds of problems, and anything
> more serious is going to be for net-next-2.6 only.
Oh yeah we don't need this right away. Jarek's patch should be
quite safe for the time being.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 7:52 ` David Miller
2008-08-21 8:00 ` Herbert Xu
@ 2008-08-21 8:27 ` Jarek Poplawski
2008-08-21 8:35 ` Jarek Poplawski
2008-09-11 10:39 ` David Miller
2 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 8:27 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Thu, Aug 21, 2008 at 12:52:50AM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Thu, 21 Aug 2008 17:16:34 +1000
>
> > Actually, why do we even keep a netdev_queue pointer in a qdisc?
> > A given qdisc can be used by multiple queues (which is why the
> > lock was moved into the qdisc in the first place).
> >
> > How about keeping a pointer directly to the root qdisc plus a
> > pointer to the netdev (which seems to be the only other use for
> > qdisc->dev_queue)? That way there won't be any confusion as to
> > whether we want the sleeping or non-sleeping qdisc.
Probably it's too early in the morning, so I would prefer to see some
code... It seems that swapping the qdisc for &noop_qdisc needs such
direct root dev_queue access, e.g. we want dev_queue_xmit() to see
&noop_qdisc as fast as possible during deactivation. On the other hand,
we need to check __QDISC_STATE_DEACTIVATED to know the state of
a "real" qdisc.
BTW, it looks like we currently do this wrong: if it's really
deactivated we're checking &noop_qdisc for this...
>
> Not a bad idea at all.
>
> The reason it's there is a left-over from earlier designs of my
> multiqueue stuff, I thought we'd always multiplex the qdiscs to be
> per-queue. But once Patrick showed me we couldn't do that, we now
> have shared qdiscs.
>
> If I get a chance I'll work on this. But to be honest given
> Linus's temperment right now we're only going to be able to
> merge one-liners to fix these kinds of problems and anything
> more serious is going to be for net-next-2.6 only.
I definitely think some changes are needed here for the future,
but of course there's no need to experiment with our stable -rc now.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 8:27 ` Jarek Poplawski
@ 2008-08-21 8:35 ` Jarek Poplawski
2008-08-21 8:47 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 8:35 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Thu, Aug 21, 2008 at 08:27:21AM +0000, Jarek Poplawski wrote:
...
> Probably it's too early in the morning,
It's for sure...
> BTW, it looks like we currently do this wrong: if it's really
> deactivated we're checking &noop_qdisc for this...
...and it's actually OK.
Sorry,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 8:35 ` Jarek Poplawski
@ 2008-08-21 8:47 ` Jarek Poplawski
0 siblings, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 8:47 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev
On Thu, Aug 21, 2008 at 08:35:06AM +0000, Jarek Poplawski wrote:
> On Thu, Aug 21, 2008 at 08:27:21AM +0000, Jarek Poplawski wrote:
> ...
> > Probably it's too early in the morning,
>
> It's for sure...
>
> > BTW, it looks like we currently do this wrong: if it's really
> > deactivated we're checking &noop_qdisc for this...
>
> ...and it's actually OK.
Anyway, it's tricky: we have 2 deactivation cases to care for:
1) a "real" (sleeping) qdisc and __QDISC_STATE_DEACTIVATED
2) a &noop_qdisc and not __QDISC_STATE_DEACTIVATED
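So a combined check would look something like this (an untested
sketch, assuming we can see both qdisc pointers of the tx queue):

static bool example_queue_deactivated(struct netdev_queue *txq)
{
        /* case 2: the fast-path pointer was swapped to &noop_qdisc */
        if (txq->qdisc == &noop_qdisc)
                return true;
        /* case 1: the sleeping qdisc carries the DEACTIVATED bit */
        return test_bit(__QDISC_STATE_DEACTIVATED,
                        &txq->qdisc_sleeping->state);
}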
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-19 9:16 ` Jarek Poplawski
@ 2008-08-21 10:01 ` Jarek Poplawski
2008-08-21 10:05 ` David Miller
2008-08-21 10:18 ` Herbert Xu
0 siblings, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 10:01 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Tue, Aug 19, 2008 at 09:16:42AM +0000, Jarek Poplawski wrote:
> On Tue, Aug 19, 2008 at 06:55:04PM +1000, Herbert Xu wrote:
...
> > In fact we're not really adding anything new here, the qdiscs were
> > not accessed under RCU uniformly. If you go back in the tree prior
> > to the multi-qdisc stuff, you'll find that only dev_queue_xmit works
> > under RCU. qdisc_restart does not and therefore deferring the
> > destruction to RCU is pointless anyway.
> >
> > So in fact we've already been relying on the fact that by the time
> > qdisc_destroy comes about nobody on the read side (i.e., the packet
> > transmission path) should have a reference to it.
>
> Let's not discuss using such a qdisc by others but a possibility
> that some common lists could be broken for readers from upper qdiscs.
> (They were not deactivated.) Of course, if it's done properly with
> some references before qdisc_destroy then it's all right. I'd prefer
> to check this later yet.
So, what I was most suspicious of, cls_u32, looks safe wrt. this.
Congratulations on estimating this correctly.
But how about this part in qdisc_destroy():
if (qdisc->parent)
list_del(&qdisc->list);
If we do this with a child qdisc from qdisc_graft() it's without
deactivation. The rest of the tree can be dequeued in the meantime
and call qdisc_tree_decrease_qlen() (like hfsc, tbf, netem), which
uses qdisc_lookup() to access this list. We list_del() under rtnl
lock only, they lookup under sch_tree_lock(). Is it a bit unsafe
or am I missing something?
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:01 ` Jarek Poplawski
@ 2008-08-21 10:05 ` David Miller
2008-08-21 10:11 ` Jarek Poplawski
2008-08-21 10:18 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-21 10:05 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 21 Aug 2008 10:01:55 +0000
> If we do this with a child qdisc from qdisc_graft() it's without
> deactivation. The rest of the tree can be dequeued in the meantime
> and call qdisc_tree_decrease_qlen() (like hfsc, tbf, netem), which
> uses qdisc_lookup() to access this list. We list_del() under rtnl
> lock only, they lookup under sch_tree_lock(). Is it a bit unsafe
> or am I missing something?
They hold RTNL as well.
Remember, sch_tree_lock() uses qdisc_root_lock() which as I've
told you at least twice now asserts that RTNL is held :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:05 ` David Miller
@ 2008-08-21 10:11 ` Jarek Poplawski
2008-08-21 10:18 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 10:11 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
On Thu, Aug 21, 2008 at 03:05:23AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Thu, 21 Aug 2008 10:01:55 +0000
>
> > If we do this with a child qdisc from qdisc_graft() it's without
> > deactivation. The rest of the tree can be dequeued in the meantime
> > and call qdisc_tree_decrease_qlen() (like hfsc, tbf, netem), which
> > uses qdisc_lookup() to access this list. We list_del() under rtnl
> > lock only, they lookup under sch_tree_lock(). Is it a bit unsafe
> > or am I missing something?
>
> They hold RTNL as well.
>
> Remember, sch_tree_lock() uses qdisc_root_lock() which as I've
> told you at least twice now asserts that RTNL is held :-)
Actually, I've called it wrong, they hold qdisc root lock, but they
certainly can't have rtnl_lock() at the moment!
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:01 ` Jarek Poplawski
2008-08-21 10:05 ` David Miller
@ 2008-08-21 10:18 ` Herbert Xu
1 sibling, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 10:18 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 10:01:55AM +0000, Jarek Poplawski wrote:
>
> But how about this part in qdisc_destroy():
> if (qdisc->parent)
> list_del(&qdisc->list);
This list should only be used by things like qdisc_lookup which
occurs under the RTNL. The actual packet processing certainly
should not be walking a linked list.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:11 ` Jarek Poplawski
@ 2008-08-21 10:18 ` Jarek Poplawski
2008-08-21 10:21 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 10:18 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
On Thu, Aug 21, 2008 at 10:11:45AM +0000, Jarek Poplawski wrote:
> On Thu, Aug 21, 2008 at 03:05:23AM -0700, David Miller wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Thu, 21 Aug 2008 10:01:55 +0000
> >
> > > If we do this with a child qdisc from qdisc_graft() it's without
> > > deactivation. The rest of the tree can be dequeued in the meantime
> > > and call qdisc_tree_decrease_qlen() (like hfsc, tbf, netem), which
> > > uses qdisc_lookup() to access this list. We list_del() under rtnl
> > > lock only, they lookup under sch_tree_lock(). Is it a bit unsafe
> > > or am I missing something?
> >
> > They hold RTNL as well.
> >
> > Remember, sch_tree_lock() uses qdisc_root_lock() which as I've
> > told you at least twice now asserts that RTNL is held :-)
>
> Actually, I've called it wrong, they hold qdisc root lock, but they
> certainly can't have rtnl_lock() at the moment!
I mean here: hfsc_dequeue(), netem_dequeue() and tbf_dequeue().
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:18 ` Jarek Poplawski
@ 2008-08-21 10:21 ` Herbert Xu
2008-08-21 10:23 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 10:21 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 10:18:30AM +0000, Jarek Poplawski wrote:
>
> > > Remember, sch_tree_lock() uses qdisc_root_lock() which as I've
> > > told you at least twice now asserts that RTNL is held :-)
> >
> > Actually, I've called it wrong, they hold qdisc root lock, but they
> > certainly can't have rtnl_lock() at the moment!
>
> I mean here: hfsc_dequeue(), netem_dequeue() and tbf_dequeue().
Where do they use qdisc->list?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:21 ` Herbert Xu
@ 2008-08-21 10:23 ` Herbert Xu
2008-08-21 10:33 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 10:23 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 08:21:53PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 10:18:30AM +0000, Jarek Poplawski wrote:
> >
> > > > Remember, sch_tree_lock() uses qdisc_root_lock() which as I've
> > > > told you at least twice now asserts that RTNL is held :-)
> > >
> > > Actually, I've called it wrong, they hold qdisc root lock, but they
> > > certainly can't have rtnl_lock() at the moment!
> >
> > I mean here: hfsc_dequeue(), netem_dequeue() and tbf_dequeue().
>
> Where do they use qdisc->list?
For the next development tree we could even add an ASSERT_RTNL to
qdisc_lookup and co to ensure that nobody abuses this interface.
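Something as simple as this at the top of qdisc_lookup() should be
enough (untested):

        ASSERT_RTNL();  /* anyone calling this without the RTNL is buggy */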
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:23 ` Herbert Xu
@ 2008-08-21 10:33 ` Jarek Poplawski
2008-08-21 10:51 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 10:33 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 08:23:38PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 08:21:53PM +1000, Herbert Xu wrote:
> > On Thu, Aug 21, 2008 at 10:18:30AM +0000, Jarek Poplawski wrote:
> > >
> > > > > Remember, sch_tree_lock() uses qdisc_root_lock() which as I've
> > > > > told you at least twice now asserts that RTNL is held :-)
> > > >
> > > > Actually, I've called it wrong, they hold qdisc root lock, but they
> > > > certainly can't have rtnl_lock() at the moment!
> > >
> > > I mean here: hfsc_dequeue(), netem_dequeue() and tbf_dequeue().
> >
> > Where do they use qdisc->list?
>
> For the next development tree we could even add an ASSERT_RTNL to
> qdisc_lookup and co to ensure that nobody abuses this interface.
But if I get the above right, you acknowledge the problem currently exists?
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:33 ` Jarek Poplawski
@ 2008-08-21 10:51 ` Herbert Xu
2008-08-21 11:20 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 10:51 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 10:33:54AM +0000, Jarek Poplawski wrote:
>
> > > Where do they use qdisc->list?
> >
> > For the next development tree we could even add an ASSERT_RTNL to
> > qdisc_lookup and co to ensure that nobody abuses this interface.
>
> But if I get the above right, you acknowledge the problem currently exists?
Why don't you just point out the problems that you see rather
than hypothesising :)
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 10:51 ` Herbert Xu
@ 2008-08-21 11:20 ` Jarek Poplawski
2008-08-21 11:26 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 11:20 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 08:51:10PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 10:33:54AM +0000, Jarek Poplawski wrote:
> >
> > > > Where do they use qdisc->list?
> > >
> > > For the next development tree we could even add an ASSERT_RTNL to
> > > qdisc_lookup and co to ensure that nobody abuses this interface.
> >
> > But if I get the above right, you acknowledge the problem currently exists?
>
> Why don't you just point out the problems that you see rather
> than hypothesising :)
I thought I described the problem quite clearly, except this error in
the lock name. I certainly can miss many more things than you. And,
after your answer I'm simply not sure if the question is still valid.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 11:20 ` Jarek Poplawski
@ 2008-08-21 11:26 ` Herbert Xu
2008-08-21 11:55 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 11:26 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 11:20:56AM +0000, Jarek Poplawski wrote:
>
> I thought I described the problem quite clearly, except this error in
> the lock name. I certainly can miss many more things than you. And,
> after your answer I'm simply not sure if the question is still valid.
OK I've lost track of what you were trying to say. Could you please
just restate the problem you saw?
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 11:26 ` Herbert Xu
@ 2008-08-21 11:55 ` Jarek Poplawski
2008-08-21 12:01 ` Herbert Xu
2008-08-21 12:06 ` David Miller
0 siblings, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 11:55 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 09:26:09PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 11:20:56AM +0000, Jarek Poplawski wrote:
> >
> > I thought I described the problem quite clearly, except this error in
> > the lock name. I certainly can miss many more things than you. And,
> > after your answer I'm simply not sure if the question is still valid.
>
> OK I've lost track of what you were trying to say. Could you please
> just restate the problem you saw?
>
Sure, here is a scenario:
cpu1                                  cpu2
rtnl_lock()
qdisc_graft()
// parent != NULL
->cops->graft()
notify_and_destroy()                  qdisc_run()
                                      spin_lock(root_lock)
qdisc_destroy(old)                    dequeue_skb()
                                      tbf_dequeue()
                                      qdisc_tree_decrease_qlen()
                                      qdisc_lookup()
//deleting from qdisc_sleeping->list  //walking qdisc_sleeping->list
//under rtnl_lock() only              //under qdisc root_lock only
list_del(qdisc->list)                 list_for_each_entry(txq_root)
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 11:55 ` Jarek Poplawski
@ 2008-08-21 12:01 ` Herbert Xu
2008-08-21 12:19 ` Jarek Poplawski
2008-08-21 12:06 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 12:01 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 11:55:22AM +0000, Jarek Poplawski wrote:
>
> Sure, here is a scenario:
>
> cpu1                                  cpu2
> rtnl_lock()
> qdisc_graft()
> // parent != NULL
> ->cops->graft()
> notify_and_destroy()                  qdisc_run()
>                                       spin_lock(root_lock)
> qdisc_destroy(old)                    dequeue_skb()
>                                       tbf_dequeue()
>                                       qdisc_tree_decrease_qlen()
>                                       qdisc_lookup()
> //deleting from qdisc_sleeping->list  //walking qdisc_sleeping->list
> //under rtnl_lock() only              //under qdisc root_lock only
> list_del(qdisc->list)                 list_for_each_entry(txq_root)
Good catch. Longer term we should fix it so that it doesn't do
the silly lookup at run-time. In fact we'll be getting rid of
requeue so there will be no need to do this in TBF at all.
However, for now please create a patch to put a pair of root locks
around the list_del in qdisc_destroy.
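Roughly something like this (untested sketch):

        spinlock_t *root_lock = qdisc_root_lock(qdisc);

        spin_lock_bh(root_lock);
        if (qdisc->parent)
                list_del(&qdisc->list);
        spin_unlock_bh(root_lock);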
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 11:55 ` Jarek Poplawski
2008-08-21 12:01 ` Herbert Xu
@ 2008-08-21 12:06 ` David Miller
1 sibling, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-21 12:06 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 21 Aug 2008 11:55:22 +0000
> Sure, here is a scenario:
>
> cpu1                                  cpu2
> rtnl_lock()
> qdisc_graft()
> // parent != NULL
> ->cops->graft()
> notify_and_destroy()                  qdisc_run()
>                                       spin_lock(root_lock)
> qdisc_destroy(old)                    dequeue_skb()
>                                       tbf_dequeue()
>                                       qdisc_tree_decrease_qlen()
>                                       qdisc_lookup()
> //deleting from qdisc_sleeping->list  //walking qdisc_sleeping->list
> //under rtnl_lock() only              //under qdisc root_lock only
> list_del(qdisc->list)                 list_for_each_entry(txq_root)
Grrr... :-)
Note that this only happens when my arch nemesis, ->requeue(), fails.
Same applies to the netem case, and hfsc's "peek".
All other qdisc_tree_decrease_qlen() users hold RTNL.
Really, it proves ->requeue() should die, and be replaced with "peek"
and "unlink" methods.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 5:49 ` [PATCH take 2] " Jarek Poplawski
2008-08-21 6:10 ` Herbert Xu
@ 2008-08-21 12:11 ` David Miller
1 sibling, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-21 12:11 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 21 Aug 2008 05:49:11 +0000
> pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
>
> dev_deactivate() can skip rescheduling of a qdisc by qdisc_watchdog()
> or other timer calling netif_schedule() after dev_queue_deactivate().
> We prevent this by checking aliveness before scheduling the timer. Since
> during deactivation the root qdisc is available only as qdisc_sleeping,
> an additional accessor qdisc_root_sleeping() is created.
>
> With feedback from Herbert Xu <herbert@gondor.apana.org.au>
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Applied, thanks a lot Jarek.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:01 ` Herbert Xu
@ 2008-08-21 12:19 ` Jarek Poplawski
2008-08-21 12:22 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 12:19 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 10:01:12PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 11:55:22AM +0000, Jarek Poplawski wrote:
> >
> > Sure, here is a scenario:
> >
> > cpu1                                  cpu2
> > rtnl_lock()
> > qdisc_graft()
> > // parent != NULL
> > ->cops->graft()
> > notify_and_destroy()                  qdisc_run()
> >                                       spin_lock(root_lock)
> > qdisc_destroy(old)                    dequeue_skb()
> >                                       tbf_dequeue()
> >                                       qdisc_tree_decrease_qlen()
> >                                       qdisc_lookup()
> > //deleting from qdisc_sleeping->list  //walking qdisc_sleeping->list
> > //under rtnl_lock() only              //under qdisc root_lock only
> > list_del(qdisc->list)                 list_for_each_entry(txq_root)
>
> Good catch. Longer term we should fix it so that it doesn't do
> the silly lookup at run-time. In fact we'll be getting rid of
> requeue so there will be no need to do this in TBF at all.
>
> However, for now please create a patch to put a pair of root locks
> around the list_del in qdisc_destroy.
Since qdisc_destroy() is used for destroying root qdisc too, isn't it
better to get this lock back in notify_and_destroy() like this:
if (old) {
        if (parent)
                sch_tree_lock(old);
        qdisc_destroy(old);
        if (parent)
                sch_tree_unlock(old);
}
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:19 ` Jarek Poplawski
@ 2008-08-21 12:22 ` Herbert Xu
2008-08-21 12:27 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 12:22 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 12:19:20PM +0000, Jarek Poplawski wrote:
>
> Since qdisc_destroy() is used for destroying root qdisc too, isn't it
> better to get this lock back in notify_and_destroy() like this:
>
> if (old) {
>         if (parent)
>                 sch_tree_lock(old);
>         qdisc_destroy(old);
>         if (parent)
>                 sch_tree_unlock(old);
> }
Ugly as it may look, I think this is probably the best fix for now.
For 2.6.28 we can fix it properly and remove this eyesore :)
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:22 ` Herbert Xu
@ 2008-08-21 12:27 ` David Miller
2008-08-21 12:35 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-21 12:27 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, denys
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 21 Aug 2008 22:22:45 +1000
> On Thu, Aug 21, 2008 at 12:19:20PM +0000, Jarek Poplawski wrote:
> >
> > Since qdisc_destroy() is used for destroying root qdisc too, isn't it
> > better to get this lock back in notify_and_destroy() like this:
> >
> > if (old) {
> >         if (parent)
> >                 sch_tree_lock(old);
> >         qdisc_destroy(old);
> >         if (parent)
> >                 sch_tree_unlock(old);
> > }
>
> Ugly as it may look, I think this is probably the best fix for now.
>
> For 2.6.28 we can fix it properly and remove this eyesore :)
This looks even worse, actually.
If we just unlinked this thing, we don't want anyone finding
it, even before we grab this lock, to adjust queue counts.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:27 ` David Miller
@ 2008-08-21 12:35 ` Herbert Xu
2008-08-21 12:48 ` Herbert Xu
2008-08-21 12:49 ` [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock() Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 12:35 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, denys
On Thu, Aug 21, 2008 at 05:27:54AM -0700, David Miller wrote:
>
> This looks even worse, actually.
>
> If we just unlinked this thing, we don't want anyone finding
> it, even before we grab this lock, to adjust queue counts.
You're right, this doesn't work at all. In fact it's been broken
even before we removed the root lock. The problem is that we used
to have one big linked list for each device. That was protected
by the device qdisc lock. Now we have one list for each txq and
qdisc_lookup walks every single txq. This means that no single
qdisc root lock can protect this anymore.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:35 ` Herbert Xu
@ 2008-08-21 12:48 ` Herbert Xu
2008-08-21 12:55 ` Jarek Poplawski
2008-08-21 20:40 ` Jarek Poplawski
2008-08-21 12:49 ` [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock() Jarek Poplawski
1 sibling, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 12:48 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, denys
On Thu, Aug 21, 2008 at 10:35:38PM +1000, Herbert Xu wrote:
>
> You're right, this doesn't work at all. In fact it's been broken
> even before we removed the root lock. The problem is that we used
> to have one big linked list for each device. That was protected
> by the device qdisc lock. Now we have one list for each txq and
> qdisc_lookup walks every single txq. This means that no single
> qdisc root lock can protect this anymore.
How about going back to a single list per-device again? This list
is only used on the slow path (well anything that tries to walk
a potentially unbounded linked list is slow :), and qdisc_lookup
walks through everything anyway.
We'll need to then add a new lock to protect this list, until we
remove requeue.
Actually just doing the locking will be sufficient. Something like
this totally untested patch (I've abused your tx global lock):
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ef0efec..3f5f9b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -202,16 +202,25 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
{
unsigned int i;
+ struct Qdisc *q;
+
+ spin_lock_bh(&dev->tx_global_lock);
for (i = 0; i < dev->num_tx_queues; i++) {
struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
- struct Qdisc *q, *txq_root = txq->qdisc_sleeping;
+ struct Qdisc *txq_root = txq->qdisc_sleeping;
q = qdisc_match_from_root(txq_root, handle);
if (q)
- return q;
+ goto unlock;
}
- return qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+ q = qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+unlock:
+ spin_unlock_bh(&dev->tx_global_lock);
+
+ return q;
}
static struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index c3ed4d4..292a373 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -526,8 +526,10 @@ void qdisc_destroy(struct Qdisc *qdisc)
!atomic_dec_and_test(&qdisc->refcnt))
return;
+ spin_lock_bh(&dev->tx_global_lock);
if (qdisc->parent)
list_del(&qdisc->list);
+ spin_unlock_bh(&dev->tx_global_lock);
#ifdef CONFIG_NET_SCHED
qdisc_put_stab(qdisc->stab);
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:35 ` Herbert Xu
2008-08-21 12:48 ` Herbert Xu
@ 2008-08-21 12:49 ` Jarek Poplawski
2008-08-21 12:51 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 12:49 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 10:35:38PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 05:27:54AM -0700, David Miller wrote:
> >
> > This looks even worse, actually.
> >
> > If we just unlinked this thing, we don't want anyone finding
> > it, even before we grab this lock, to adjust queue counts.
>
> You're right, this doesn't work at all. In fact it's been broken
> even before we removed the root lock. The problem is that we used
> to have one big linked list for each device. That was protected
> by the device qdisc lock. Now we have one list for each txq and
> qdisc_lookup walks every single txq. This means that no single
> qdisc root lock can protect this anymore.
>
I don't think there could be such a problem, since nobody should look
for such a destroyed qdisc: they look for their ancestors only.
Anyway, I can't do this patch before evening, so I'll wait for
suggestions, or you could simply do it as you wish, no problem.
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:49 ` [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock() Jarek Poplawski
@ 2008-08-21 12:51 ` Herbert Xu
0 siblings, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 12:51 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 12:49:36PM +0000, Jarek Poplawski wrote:
>
> I don't think there could be such a problem, since nobody should look
> for such a destroyed qdisc: they look for their ancestors only.
> Anyway, I can't do this patch before evening, so I'll wait for
> suggestions, or you could simply do it as you wish, no problem.
It's not about looking for the destroyed qdisc, it's about walking
the list and accidentally hitting the destroyed qdisc and wandering
into lalaland.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:48 ` Herbert Xu
@ 2008-08-21 12:55 ` Jarek Poplawski
2008-08-21 13:12 ` Herbert Xu
2008-08-21 20:40 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 12:55 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 10:48:34PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 10:35:38PM +1000, Herbert Xu wrote:
> >
> > You're right, this doesn't work at all. In fact it's been broken
> > even before we removed the root lock. The problem is that we used
> > to have one big linked list for each device. That was protected
> > by the device qdisc lock. Now we have one list for each txq and
> > qdisc_lookup walks every single txq. This means that no single
> > qdisc root lock can protect this anymore.
>
> How about going back to a single list per-device again? This list
> is only used on the slow path (well anything that tries to walk
> a potentially unbounded linked list is slow :), and qdisc_lookup
> walks through everything anyway.
>
> We'll need to then add a new lock to protect this list, until we
> remove requeue.
>
> Actually just doing the locking will be sufficient. Something like
> this totally untested patch (I've abused your tx global lock):
Alas, I have to take a break now; anyway I think for now "my" proposal
should be safer since it worked like this before... But of course
after proper checking your patch should be better.
Jarek P.
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index ef0efec..3f5f9b9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -202,16 +202,25 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
> struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
> {
> unsigned int i;
> + struct Qdisc *q;
> +
> + spin_lock_bh(&dev->tx_global_lock);
>
> for (i = 0; i < dev->num_tx_queues; i++) {
> struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
> - struct Qdisc *q, *txq_root = txq->qdisc_sleeping;
> + struct Qdisc *txq_root = txq->qdisc_sleeping;
>
> q = qdisc_match_from_root(txq_root, handle);
> if (q)
> - return q;
> + goto unlock;
> }
> - return qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
> +
> + q = qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
> +
> +unlock:
> + spin_unlock_bh(&dev->tx_global_lock);
> +
> + return q;
> }
>
> static struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index c3ed4d4..292a373 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -526,8 +526,10 @@ void qdisc_destroy(struct Qdisc *qdisc)
> !atomic_dec_and_test(&qdisc->refcnt))
> return;
>
> + spin_lock_bh(&dev->tx_global_lock);
> if (qdisc->parent)
> list_del(&qdisc->list);
> + spin_unlock_bh(&dev->tx_global_lock);
>
> #ifdef CONFIG_NET_SCHED
> qdisc_put_stab(qdisc->stab);
>
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:55 ` Jarek Poplawski
@ 2008-08-21 13:12 ` Herbert Xu
2008-08-21 18:58 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 13:12 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 12:55:01PM +0000, Jarek Poplawski wrote:
>
> Alas, I have to take a break now; anyway I think for now "my" proposal
> should be safer since it worked like this before... But of course
> after proper checking your patch should be better.
Well your patch doesn't work and my patch doesn't compile :)
The problem is that qdisc_lookup walks through every txq's list while
the sch_tree_lock protects a single txq only.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 13:12 ` Herbert Xu
@ 2008-08-21 18:58 ` Jarek Poplawski
2008-08-21 21:14 ` Jarek Poplawski
` (2 more replies)
0 siblings, 3 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 18:58 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 11:12:18PM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 12:55:01PM +0000, Jarek Poplawski wrote:
> >
> > Alas, I have to take a break now; anyway I think for now "my" proposal
> > should be safer since it worked like this before... But of course
> > after proper checking your patch should be better.
>
> Well your patch doesn't work and my patch doesn't compile :)
>
> The problem is that qdisc_lookup walks through every txq's list while
> the sch_tree_lock protects a single txq only.
I don't think there is such a problem. I thought you and David were
concerned with something trying to find and use this qdisc: that's why
I wrote it's nobody's ancestor at the moment. sch_tree_lock() should be
enough for now because in the current implementation we have only one
root qdisc with pointers copied to every dev_queue. At least I can't
see anything more in qdisc_create() and qdisc_graft(). So,
qdisc_lookup() seems to be designed for the future (or to do this
lookup more exactly with additional loops...).
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 12:48 ` Herbert Xu
2008-08-21 12:55 ` Jarek Poplawski
@ 2008-08-21 20:40 ` Jarek Poplawski
2008-08-21 22:24 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 20:40 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
Herbert Xu wrote, On 08/21/2008 02:48 PM:
> On Thu, Aug 21, 2008 at 10:35:38PM +1000, Herbert Xu wrote:
>> You're right, this doesn't work at all. In fact it's been broken
>> even before we removed the root lock. The problem is that we used
>> to have one big linked list for each device. That was protected
>> by the device qdisc lock. Now we have one list for each txq and
>> qdisc_lookup walks every single txq. This means that no single
>> qdisc root lock can protect this anymore.
As I wrote earlier, I don't think it's like this, at least with the
current implementation, and this fix seems to be temporary.
> How about going back to a single list per-device again? This list
> is only used on the slow path (well anything that tries to walk
> a potentially unbounded linked list is slow :), and qdisc_lookup
> walks through everything anyway.
>
> We'll need to then add a new lock to protect this list, until we
> remove requeue.
>
> Actually just doing the locking will be sufficient. Something like
> this totally untested patch (I've abused your tx global lock):
If it's really needed, then OK with me, but tx_global_lock doesn't
look like the best choice, considering it can be used here with
qdisc root lock, and this comment from sch_generic:
" * qdisc_lock(q) and netif_tx_lock are mutually exclusive,
* if one is grabbed, another must be free."
IMHO, since it's probably not for something very busy, we can even
create a global lock to avoid dependencies, or maybe use this
qdisc_stab_lock (after changing its spin_locks to _bh), which is BTW
used in qdisc_destroy() already.
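E.g. something as simple as (untested):

        static DEFINE_SPINLOCK(qdisc_list_lock);

taken around both the list_add_tail() and the list_del().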
Thanks,
Jarek P.
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index ef0efec..3f5f9b9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -202,16 +202,25 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
> struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
> {
> unsigned int i;
> + struct Qdisc *q;
> +
> + spin_lock_bh(&dev->tx_global_lock);
>
> for (i = 0; i < dev->num_tx_queues; i++) {
> struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
> - struct Qdisc *q, *txq_root = txq->qdisc_sleeping;
> + struct Qdisc *txq_root = txq->qdisc_sleeping;
>
> q = qdisc_match_from_root(txq_root, handle);
> if (q)
> - return q;
> + goto unlock;
> }
> - return qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
> +
> + q = qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
> +
> +unlock:
> + spin_unlock_bh(&dev->tx_global_lock);
> +
> + return q;
> }
>
> static struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index c3ed4d4..292a373 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -526,8 +526,10 @@ void qdisc_destroy(struct Qdisc *qdisc)
> !atomic_dec_and_test(&qdisc->refcnt))
> return;
>
> + spin_lock_bh(&dev->tx_global_lock);
> if (qdisc->parent)
> list_del(&qdisc->list);
> + spin_unlock_bh(&dev->tx_global_lock);
>
> #ifdef CONFIG_NET_SCHED
> qdisc_put_stab(qdisc->stab);
>
> Cheers,
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 18:58 ` Jarek Poplawski
@ 2008-08-21 21:14 ` Jarek Poplawski
2008-08-21 22:23 ` Herbert Xu
2008-08-23 12:15 ` David Miller
2 siblings, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-21 21:14 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:
...
> sch_tree_lock() should be
> enough for now because in the current implementation we have only one
> root qdisc with pointers copied to every dev_queue.
Of course, except the state before qdisc_create() and qdisc_graft()
with default qdiscs, but it doesn't seem to matter here.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 18:58 ` Jarek Poplawski
2008-08-21 21:14 ` Jarek Poplawski
@ 2008-08-21 22:23 ` Herbert Xu
2008-08-22 8:49 ` Jarek Poplawski
2008-08-22 11:38 ` Jarek Poplawski
2008-08-23 12:15 ` David Miller
2 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 22:23 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:
>
> root qdisc with pointers copied to every dev_queue. At least I can't
> see anything more in qdisc_create() and qdisc_graft(). So,
> qdisc_lookup() seems to be designed for the future (or to do this
> lookup more exactly with additional loops...).
We've got at least the RX and TX queues. That makes two locks and
two lists.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 20:40 ` Jarek Poplawski
@ 2008-08-21 22:24 ` Herbert Xu
2008-08-22 8:41 ` [PATCH] pkt_sched: Fix qdisc list locking Jarek Poplawski
2008-08-22 9:27 ` [PATCH take 2] " Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-21 22:24 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Thu, Aug 21, 2008 at 10:40:53PM +0200, Jarek Poplawski wrote:
>
> If it's really needed, then OK with me, but tx_global_lock doesn't
> look like the best choice, considering it can be used here with
> qdisc root lock, and this comment from sch_generic:
Sure, that was just an illustration of what I meant. It doesn't
even compile :)
Feel free to fix this up into a real patch with a new lock.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* [PATCH] pkt_sched: Fix qdisc list locking
2008-08-21 22:24 ` Herbert Xu
@ 2008-08-22 8:41 ` Jarek Poplawski
2008-08-22 10:14 ` Herbert Xu
2008-08-22 9:27 ` [PATCH take 2] " Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-22 8:41 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Fri, Aug 22, 2008 at 08:24:47AM +1000, Herbert Xu wrote:
...
> Feel free to fix this up into a real patch with a new lock.
It looks like adding to the list needs similar protection, but if I
exaggerated here let me know.
Thanks,
Jarek P.
--------------->
pkt_sched: Fix qdisc list locking
Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
without rtnl_lock(), adding and deleting from a qdisc list needs
additional locking. This patch adds global spinlock qdisc_list_lock
and wrapper functions for modifying the list. It is considered as a
temporary solution until hfsc_dequeue(), netem_dequeue() and
tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.
With feedback from Herbert Xu and David S. Miller.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
include/net/pkt_sched.h | 1 +
net/sched/sch_api.c | 44 +++++++++++++++++++++++++++++++++++++++-----
net/sched/sch_generic.c | 5 ++---
3 files changed, 42 insertions(+), 8 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 853fe83..b786a5b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -78,6 +78,7 @@ extern struct Qdisc *fifo_create_dflt(struct Qdisc *sch, struct Qdisc_ops *ops,
extern int register_qdisc(struct Qdisc_ops *qops);
extern int unregister_qdisc(struct Qdisc_ops *qops);
+extern void qdisc_list_del(struct Qdisc *q);
extern struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle);
extern struct Qdisc *qdisc_lookup_class(struct net_device *dev, u32 handle);
extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 45f442d..e35b8d8 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -199,19 +199,53 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
return NULL;
}
+/*
+ * This lock is needed until some qdiscs stop calling qdisc_tree_decrease_qlen()
+ * without rtnl_lock(); currently hfsc_dequeue(), netem_dequeue(), tbf_dequeue()
+ */
+static DEFINE_SPINLOCK(qdisc_lookup_lock);
+
+static void qdisc_list_add(struct Qdisc *q)
+{
+ if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
+ spin_lock_bh(&qdisc_lookup_lock);
+ list_add_tail(&q->list, &qdisc_root_sleeping(q)->list);
+ spin_unlock_bh(&qdisc_lookup_lock);
+ }
+}
+
+void qdisc_list_del(struct Qdisc *q)
+{
+ if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
+ spin_lock_bh(&qdisc_lookup_lock);
+ list_del(&q->list);
+ spin_unlock_bh(&qdisc_lookup_lock);
+ }
+}
+EXPORT_SYMBOL(qdisc_list_del);
+
struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
{
unsigned int i;
+ struct Qdisc *q;
+
+ spin_lock_bh(&qdisc_lookup_lock);
for (i = 0; i < dev->num_tx_queues; i++) {
struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
- struct Qdisc *q, *txq_root = txq->qdisc_sleeping;
+ struct Qdisc *txq_root = txq->qdisc_sleeping;
q = qdisc_match_from_root(txq_root, handle);
if (q)
- return q;
+ goto unlock;
}
- return qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+ q = qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+unlock:
+ spin_unlock_bh(&qdisc_lookup_lock);
+
+ return q;
}
static struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
@@ -810,8 +844,8 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
goto err_out3;
}
}
- if ((parent != TC_H_ROOT) && !(sch->flags & TCQ_F_INGRESS))
- list_add_tail(&sch->list, &dev_queue->qdisc_sleeping->list);
+
+ qdisc_list_add(sch);
return sch;
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index c3ed4d4..5f0ade7 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -526,10 +526,9 @@ void qdisc_destroy(struct Qdisc *qdisc)
!atomic_dec_and_test(&qdisc->refcnt))
return;
- if (qdisc->parent)
- list_del(&qdisc->list);
-
#ifdef CONFIG_NET_SCHED
+ qdisc_list_del(qdisc);
+
qdisc_put_stab(qdisc->stab);
#endif
gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 22:23 ` Herbert Xu
@ 2008-08-22 8:49 ` Jarek Poplawski
2008-08-22 8:55 ` David Miller
2008-08-22 11:38 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-22 8:49 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Fri, Aug 22, 2008 at 08:23:30AM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:
> >
> > root qdisc with pointers copied to every dev_queue. At least I can't
> > see anything more in qdisc_create() and qdisc_graft(). So,
> > qdisc_lookup() seems to be designed for the future (or to do this
> > lookup more exactly with additional loops...).
>
> We've got at least the RX and TX queues. That makes two locks and
> two lists.
RX currently doesn't add anything to the list.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 8:49 ` Jarek Poplawski
@ 2008-08-22 8:55 ` David Miller
2008-08-22 10:07 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-22 8:55 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 22 Aug 2008 08:49:31 +0000
> On Fri, Aug 22, 2008 at 08:23:30AM +1000, Herbert Xu wrote:
> > On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:
> > >
> > > root qdisc with pointers copied to every dev_queue. At least I can't
> > > see nothing more in qdisc_create() and qdisc_graft(). So,
> > > qdisc_lookup() seems to be designed for the future (or to do this
> > > lookup more exactly with additional loops...).
> >
> > We've got at least the RX and TX queues. That makes two locks and
> > two lists.
>
> RX currently doesn't add anything to the list.
That's correct, currently ingress qdiscs only support a hierarchy of
single root and that's it.
^ permalink raw reply [flat|nested] 209+ messages in thread
* [PATCH take 2] pkt_sched: Fix qdisc list locking
2008-08-21 22:24 ` Herbert Xu
2008-08-22 8:41 ` [PATCH] pkt_sched: Fix qdisc list locking Jarek Poplawski
@ 2008-08-22 9:27 ` Jarek Poplawski
2008-08-22 10:15 ` Herbert Xu
2008-08-22 10:23 ` David Miller
1 sibling, 2 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-22 9:27 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
I made an error in the name of this new lock in the changelog,
so I decided to fix this in ...the patch.
Sorry,
Jarek P.
---------------> (take 2)
pkt_sched: Fix qdisc list locking
Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
without rtnl_lock(), adding and deleting from a qdisc list needs
additional locking. This patch adds global spinlock qdisc_list_lock
and wrapper functions for modifying the list. It is considered as a
temporary solution until hfsc_dequeue(), netem_dequeue() and
tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.
With feedback from Herbert Xu and David S. Miller.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
include/net/pkt_sched.h | 1 +
net/sched/sch_api.c | 44 +++++++++++++++++++++++++++++++++++++++-----
net/sched/sch_generic.c | 5 ++---
3 files changed, 42 insertions(+), 8 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 853fe83..b786a5b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -78,6 +78,7 @@ extern struct Qdisc *fifo_create_dflt(struct Qdisc *sch, struct Qdisc_ops *ops,
extern int register_qdisc(struct Qdisc_ops *qops);
extern int unregister_qdisc(struct Qdisc_ops *qops);
+extern void qdisc_list_del(struct Qdisc *q);
extern struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle);
extern struct Qdisc *qdisc_lookup_class(struct net_device *dev, u32 handle);
extern struct qdisc_rate_table *qdisc_get_rtab(struct tc_ratespec *r,
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 45f442d..e7fb9e0 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -199,19 +199,53 @@ struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
return NULL;
}
+/*
+ * This lock is needed until some qdiscs stop calling qdisc_tree_decrease_qlen()
+ * without rtnl_lock(); currently hfsc_dequeue(), netem_dequeue(), tbf_dequeue()
+ */
+static DEFINE_SPINLOCK(qdisc_list_lock);
+
+static void qdisc_list_add(struct Qdisc *q)
+{
+ if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
+ spin_lock_bh(&qdisc_list_lock);
+ list_add_tail(&q->list, &qdisc_root_sleeping(q)->list);
+ spin_unlock_bh(&qdisc_list_lock);
+ }
+}
+
+void qdisc_list_del(struct Qdisc *q)
+{
+ if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
+ spin_lock_bh(&qdisc_list_lock);
+ list_del(&q->list);
+ spin_unlock_bh(&qdisc_list_lock);
+ }
+}
+EXPORT_SYMBOL(qdisc_list_del);
+
struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
{
unsigned int i;
+ struct Qdisc *q;
+
+ spin_lock_bh(&qdisc_list_lock);
for (i = 0; i < dev->num_tx_queues; i++) {
struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
- struct Qdisc *q, *txq_root = txq->qdisc_sleeping;
+ struct Qdisc *txq_root = txq->qdisc_sleeping;
q = qdisc_match_from_root(txq_root, handle);
if (q)
- return q;
+ goto unlock;
}
- return qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+ q = qdisc_match_from_root(dev->rx_queue.qdisc_sleeping, handle);
+
+unlock:
+ spin_unlock_bh(&qdisc_list_lock);
+
+ return q;
}
static struct Qdisc *qdisc_leaf(struct Qdisc *p, u32 classid)
@@ -810,8 +844,8 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
goto err_out3;
}
}
- if ((parent != TC_H_ROOT) && !(sch->flags & TCQ_F_INGRESS))
- list_add_tail(&sch->list, &dev_queue->qdisc_sleeping->list);
+
+ qdisc_list_add(sch);
return sch;
}
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index c3ed4d4..5f0ade7 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -526,10 +526,9 @@ void qdisc_destroy(struct Qdisc *qdisc)
!atomic_dec_and_test(&qdisc->refcnt))
return;
- if (qdisc->parent)
- list_del(&qdisc->list);
-
#ifdef CONFIG_NET_SCHED
+ qdisc_list_del(qdisc);
+
qdisc_put_stab(qdisc->stab);
#endif
gen_kill_estimator(&qdisc->bstats, &qdisc->rate_est);
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 8:55 ` David Miller
@ 2008-08-22 10:07 ` Herbert Xu
2008-08-22 10:27 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-22 10:07 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, denys
On Fri, Aug 22, 2008 at 01:55:13AM -0700, David Miller wrote:
>
> > > We've got at least the RX and TX queues. That makes two locks and
> > > two lists.
> >
> > RX currently doesn't add anything to the list.
>
> That's correct, currently ingress qdiscs only support a hierarchy of
> single root and that's it.
We seem to be talking about different things.
Yes the ingress hierarchy has a single root, i.e., it's a tree. But
that has nothing to do with what I was talking about. I'm talking
about the list at dev->rx_queue.qdisc_sleeping->list which is
certainly not guaranteed to be empty.
If you look at qdisc_create you'll find that every time we create
a non-root ingress qdisc we add it to that list (we have to,
otherwise qdisc_lookup doesn't work at all for ingress qdiscs).
So when somebody on the TX side does a qdisc_lookup they may be
walking the RX list without any protection. Similarly, if somebody
on the ingress side does qdisc_lookup they may walk the TX lists
without protection.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Fix qdisc list locking
2008-08-22 8:41 ` [PATCH] pkt_sched: Fix qdisc list locking Jarek Poplawski
@ 2008-08-22 10:14 ` Herbert Xu
0 siblings, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-22 10:14 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Fri, Aug 22, 2008 at 08:41:50AM +0000, Jarek Poplawski wrote:
>
> It looks like adding to the list needs similar protection, but if I
> exaggerated here let me know.
Yes I agree. List addition is certainly not safe for walkers
without memory barriers.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc list locking
2008-08-22 9:27 ` [PATCH take 2] " Jarek Poplawski
@ 2008-08-22 10:15 ` Herbert Xu
2008-08-22 10:28 ` David Miller
2008-08-22 10:23 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-22 10:15 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Fri, Aug 22, 2008 at 09:27:30AM +0000, Jarek Poplawski wrote:
>
> pkt_sched: Fix qdisc list locking
>
> Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
> without rtnl_lock(), adding and deleting from a qdisc list needs
> additional locking. This patch adds global spinlock qdisc_list_lock
> and wrapper functions for modifying the list. It is considered as a
> temporary solution until hfsc_dequeue(), netem_dequeue() and
> tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.
>
> With feedback from Herbert Xu and David S. Miller.
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
> +void qdisc_list_del(struct Qdisc *q)
> +{
> + if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
Good catch! I'm not sure whether this would actually break but
it certainly makes me feel a lot better :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc list locking
2008-08-22 9:27 ` [PATCH take 2] " Jarek Poplawski
2008-08-22 10:15 ` Herbert Xu
@ 2008-08-22 10:23 ` David Miller
1 sibling, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-22 10:23 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 22 Aug 2008 09:27:30 +0000
> pkt_sched: Fix qdisc list locking
>
> Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
> without rtnl_lock(), adding and deleting from a qdisc list needs
> additional locking. This patch adds global spinlock qdisc_list_lock
> and wrapper functions for modifying the list. It is considered as a
> temporary solution until hfsc_dequeue(), netem_dequeue() and
> tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.
>
> With feedback from Herbert Xu and David S. Miller.
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Looks good Jarek, thanks!
Since these requeue failure cases are slow paths, this shouldn't have
any performance impact either.
I'll apply this, thanks!
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 10:07 ` Herbert Xu
@ 2008-08-22 10:27 ` David Miller
2008-08-22 11:02 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-22 10:27 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, denys
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 22 Aug 2008 20:07:08 +1000
> Yes the ingress hierachy has a single root, i.e., it's a tree. But
> that has nothing to do with what I was talking about. I'm talking
> about the list at dev->rx_queue.qdisc_sleeping->list which is
> certainly not guaranteed to be empty.
It is guaranteed to be empty.
Only root qdiscs go to rx_queue.qdisc_sleeping, and such qdiscs
will always have an empty list.
> If you look at qdisc_create you'll find that every time we create
> a non-root ingress qdisc we add it to that list (we have to,
> otherwise qdisc_lookup doesn't work at all for ingress qdiscs).
We don't allow non-root ingress qdiscs. All ingress qdiscs
always take the device graft path, and always have a parent
of TC_H_INGRESS, and always operate on rx_queue.
> So when somebody on the TX side does a qdisc_lookup they may be
> walking the RX list without any protection. Similarly, if somebody
> on the ingress side does qdisc_lookup they may walk the TX lists
> without protection.
Ingress data paths do not do qdisc_lookup(). Only a root qdisc is
allowed for ingress, and thus for rx_queue's non-default qdiscs.
Add some assertions and run some test tc commands if you don't believe
me :)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc list locking
2008-08-22 10:15 ` Herbert Xu
@ 2008-08-22 10:28 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-22 10:28 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, denys
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 22 Aug 2008 20:15:26 +1000
> On Fri, Aug 22, 2008 at 09:27:30AM +0000, Jarek Poplawski wrote:
> >
> > pkt_sched: Fix qdisc list locking
> >
> > Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup())
> > without rtnl_lock(), adding and deleting from a qdisc list needs
> > additional locking. This patch adds a global spinlock, qdisc_list_lock,
> > and wrapper functions for modifying the list. It is considered a
> > temporary solution until hfsc_dequeue(), netem_dequeue() and
> > tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone.
> >
> > With feedback from Herbert Xu and David S. Miller.
> >
> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
I'll add your ACK, thanks Herbert.
> Good catch! I'm not sure whether this would actually break but
> it certainly makes me feel a lot better :)
Thankfully list_del() on an empty list works, or we'd have tons of
reports :)
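(It works because qdisc_alloc() does INIT_LIST_HEAD(&sch->list), so a
never-added entry points at itself and the unlink is a pair of
self-assignments. A minimal sketch of the helper from
include/linux/list.h:)

/* after INIT_LIST_HEAD(&sch->list), next == prev == &sch->list,
 * so both stores below just rewrite the entry's own pointers */
static inline void __list_del(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}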
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 10:27 ` David Miller
@ 2008-08-22 11:02 ` Herbert Xu
0 siblings, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-22 11:02 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, denys
On Fri, Aug 22, 2008 at 03:27:33AM -0700, David Miller wrote:
>
> We don't allow non-root ingress qdiscs. All ingress qdiscs
> always take the device graft path, and always have a parent
> of TC_H_INGRESS, and always operate on rx_queue.
Ah yes, I forgot that the ingress qdisc is special, and that's
why we have the ifb device.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 22:23 ` Herbert Xu
2008-08-22 8:49 ` Jarek Poplawski
@ 2008-08-22 11:38 ` Jarek Poplawski
2008-08-22 11:42 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-22 11:38 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Fri, Aug 22, 2008 at 08:23:30AM +1000, Herbert Xu wrote:
> On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:
> >
> > root qdisc with pointers copied to every dev_queue. At least I can't
> > see anything more in qdisc_create() and qdisc_graft(). So,
> > qdisc_lookup() seems to be designed for the future (or to do this
> > lookup more exactly with additional loops...).
>
> We've got at least the RX and TX queues. That makes two locks and
> two lists.
As a matter of fact, your doubts around this only now made me realize
there is something "wrong" here... This qdisc_lookup(), even if
all these multi RX and TX things were implemented, still
shouldn't matter, because qdisc_tree_decrease_qlen() is only
interested in a single qdisc tree. So it looks like the current
implementation of qdisc_lookup() is overkill for this anyway.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 11:38 ` Jarek Poplawski
@ 2008-08-22 11:42 ` David Miller
2008-08-22 12:09 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-22 11:42 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 22 Aug 2008 11:38:33 +0000
> On Fri, Aug 22, 2008 at 08:23:30AM +1000, Herbert Xu wrote:
> > On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:
> > >
> > > root qdisc with pointers copied to every dev_queue. At least I can't
> > > see anything more in qdisc_create() and qdisc_graft(). So,
> > > qdisc_lookup() seems to be designed for the future (or to do this
> > > lookup more exactly with additional loops...).
> >
> > We've got at least the RX and TX queues. That makes two locks and
> > two lists.
>
> As a matter of fact, your doubts around this only now made me realize
> there is something "wrong" here... This qdisc_lookup(), even if
> all these multi RX and TX things were implemented, still
> shouldn't matter, because qdisc_tree_decrease_qlen() is only
> interested in a single qdisc tree. So it looks like the current
> implementation of qdisc_lookup() is overkill for this anyway.
Yes, that is true.
We could add true parent backpointers for this. Speaking of which,
look at the existing __parent hack that's there in struct Qdisc for
CBQ :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 11:42 ` David Miller
@ 2008-08-22 12:09 ` Jarek Poplawski
2008-08-22 12:11 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-22 12:09 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, denys
On Fri, Aug 22, 2008 at 04:42:48AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Fri, 22 Aug 2008 11:38:33 +0000
>
> > On Fri, Aug 22, 2008 at 08:23:30AM +1000, Herbert Xu wrote:
> > > On Thu, Aug 21, 2008 at 08:58:57PM +0200, Jarek Poplawski wrote:
> > > >
> > > > root qdisc with pointers copied to every dev_queue. At least I can't
> > > > see anything more in qdisc_create() and qdisc_graft(). So,
> > > > qdisc_lookup() seems to be designed for the future (or to do this
> > > > lookup more exactly with additional loops...).
> > >
> > > We've got at least the RX and TX queues. That makes two locks and
> > > two lists.
> >
> > As a matter of fact, your doubts around this only now made me realize
> > there is something "wrong" here... This qdisc_lookup(), even if
> > all these multi RX and TX things were implemented, still
> > shouldn't matter, because qdisc_tree_decrease_qlen() is only
> > interested in a single qdisc tree. So it looks like the current
> > implementation of qdisc_lookup() is overkill for this anyway.
>
> Yes, that is true.
>
> We could add true parent backpointers for this. Speaking of which,
> look at the existing __parent hack that's there in struct Qdisc for
> CBQ :-)
I guess we need to establish first how much it's needed on fast
paths. But anyway, it seems something like qdisc_match_from_root()
should be enough for qdisc_tree_decrease_qlen().
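Something like this rough sketch:

/* rough sketch: look up a handle within one qdisc tree only, instead
 * of scanning the lists of every tx_queue on the device */
static struct Qdisc *qdisc_match_from_root(struct Qdisc *root, u32 handle)
{
	struct Qdisc *q;

	if (root->handle == handle)
		return root;

	list_for_each_entry(q, &root->list, list) {
		if (q->handle == handle)
			return q;
	}
	return NULL;
}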
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 12:09 ` Jarek Poplawski
@ 2008-08-22 12:11 ` Herbert Xu
2008-08-22 12:18 ` David Miller
2008-08-22 12:25 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-22 12:11 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, denys
On Fri, Aug 22, 2008 at 12:09:18PM +0000, Jarek Poplawski wrote:
>
> I guess we need to establish first how much it's needed on fast
> paths. But anyway, it seems something like qdisc_match_from_root()
> should be enough for qdisc_tree_decrease_qlen().
Or we could just replace requeue with a peek interface.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 12:11 ` Herbert Xu
@ 2008-08-22 12:18 ` David Miller
2008-08-22 12:45 ` Herbert Xu
2008-08-22 12:25 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-22 12:18 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, denys
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 22 Aug 2008 22:11:44 +1000
> On Fri, Aug 22, 2008 at 12:09:18PM +0000, Jarek Poplawski wrote:
> >
> > I guess we need to establish first how much it's needed on fast
> > paths. But anyway, it seems something like qdisc_match_from_root()
> > should be enough for qdisc_tree_decrease_qlen().
>
> Or we could just replace requeue with a peek interface.
Yes, and as per discussions over the past week or so, that would
allow us to eliminate basically everything except the netem usage.
Netem is trying to create packet reordering by requeueing (which is
logically like a head insert) instead of enqueueing (which is
logically like a tail insert).
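The generic helpers make the head/tail distinction concrete;
simplified from include/net/sch_generic.h:

/* normal arrival: append at the tail */
static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
				       struct sk_buff_head *list)
{
	__skb_queue_tail(list, skb);		/* enqueue: tail insert */
	sch->qstats.backlog += qdisc_pkt_len(skb);
	return NET_XMIT_SUCCESS;
}

/* requeue: the packet must go out first again */
static inline int __qdisc_requeue(struct sk_buff *skb, struct Qdisc *sch,
				  struct sk_buff_head *list)
{
	__skb_queue_head(list, skb);		/* requeue: head insert */
	sch->qstats.requeues++;
	return NET_XMIT_SUCCESS;
}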
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 12:11 ` Herbert Xu
2008-08-22 12:18 ` David Miller
@ 2008-08-22 12:25 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-08-22 12:25 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, denys
On Fri, Aug 22, 2008 at 10:11:44PM +1000, Herbert Xu wrote:
> On Fri, Aug 22, 2008 at 12:09:18PM +0000, Jarek Poplawski wrote:
> >
> > I guess we need to establish first how much it's needed on fast
> > paths. But anyway, it seems something like qdisc_match_from_root()
> > should be enough for qdisc_tree_decrease_qlen().
>
> Or we could just replace requeue with a peek interface.
...Or with a kfree_skb() interface ;) That's just to establish.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 12:18 ` David Miller
@ 2008-08-22 12:45 ` Herbert Xu
2008-08-24 23:26 ` Stephen Hemminger
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-22 12:45 UTC (permalink / raw)
To: David Miller, Stephen Hemminger; +Cc: jarkao2, netdev, denys
On Fri, Aug 22, 2008 at 05:18:41AM -0700, David Miller wrote:
>
> Netem is trying to create packet reordering by requeueing (which is
> logically like a head insert) instead of enqueueing (which is
> logically like a tail insert).
Why does it use requeue when tfifo's enqueue will insert this
at the front of the queue anyway, based on time_to_send?
Perhaps it's trying to force reordering when the user replaces
tfifo with some other qdisc? But that doesn't make sense since
netem relies on tfifo to implement all the other features such
as delay and jitter.
So how about
1) Change netem_graft to disallow the replacement of tfifo;
2) Simply use enqueue instead of requeue in netem_enqueue?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-21 18:58 ` Jarek Poplawski
2008-08-21 21:14 ` Jarek Poplawski
2008-08-21 22:23 ` Herbert Xu
@ 2008-08-23 12:15 ` David Miller
2 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-08-23 12:15 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, denys
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 21 Aug 2008 20:58:57 +0200
> sch_tree_lock() should be enough for now because in the current
> implementation we have only one root qdisc with pointers copied to
> every dev_queue. At least I can't see anything more in qdisc_create()
> and qdisc_graft(). So, qdisc_lookup() seems to be designed for the
> future (or to do this lookup more exactly with additional loops...).
Yes, I designed it for the future.
Nothing really uses this capability currently.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-22 12:45 ` Herbert Xu
@ 2008-08-24 23:26 ` Stephen Hemminger
2008-08-24 23:49 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Stephen Hemminger @ 2008-08-24 23:26 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, jarkao2, netdev, denys
On Fri, 22 Aug 2008 22:45:57 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Aug 22, 2008 at 05:18:41AM -0700, David Miller wrote:
> >
> > Netem is trying to create packet reordering by requeueing (which is
> > logically like a head insert) instead of enqueueing (which is
> > logically like a tail insert).
>
> Why does it use requeue when tfifo's enqueue will insert this
> at the front of the queue anyway, based on time_to_send?
>
> Perhaps it's trying to force reordering when the user replaces
> tfifo with some other qdisc? But that doesn't make sense since
> netem relies on tfifo to implement all the other features such
> as delay and jitter.
>
> So how about
>
> 1) Change netem_graft to disallow the replacement of tfifo;
> 2) Simply use enqueue instead of requeue in netem_enqueue?
>
> Cheers,
Tfifo is there only to add the jitter based reordering. Netem has other
better kinds of reordering as well.
Netem has to be able to put TBF in as a child qdisc. This is how loss
plus rate control is done.
Requeue was the natural way to do this based on the APIs available
at the time.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-24 23:26 ` Stephen Hemminger
@ 2008-08-24 23:49 ` Herbert Xu
2008-08-25 0:29 ` Stephen Hemminger
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-24 23:49 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, jarkao2, netdev, denys
On Sun, Aug 24, 2008 at 07:26:39PM -0400, Stephen Hemminger wrote:
>
> Tfifo is there only to add the jitter based reordering. Netem has other
> better kinds of reordering as well.
>
> Netem has to be able to put TBF in as a child qdisc. This is how loss
> plus rate control is done.
What about making TBF a child of tfifo instead? Alternatively
we could merge tfifo into netem.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-24 23:49 ` Herbert Xu
@ 2008-08-25 0:29 ` Stephen Hemminger
2008-08-26 7:35 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Stephen Hemminger @ 2008-08-25 0:29 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, jarkao2, netdev, denys
On Mon, 25 Aug 2008 09:49:47 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Sun, Aug 24, 2008 at 07:26:39PM -0400, Stephen Hemminger wrote:
> >
> > Tfifo is there only to add the jitter based reordering. Netem has other
> > better kinds of reordering as well.
> >
> > Netem has to be able to put TBF in as a child qdisc. This is how loss
> > plus rate control is done.
>
> What about making TBF a child of tfifo instead? Alternatively
> we could merge tfifo into netem.
>
But then you couldn't replace tfifo with pfifo or tbf??
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-25 0:29 ` Stephen Hemminger
@ 2008-08-26 7:35 ` Herbert Xu
2008-08-26 7:47 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-26 7:35 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, jarkao2, netdev, denys
On Sun, Aug 24, 2008 at 08:29:22PM -0400, Stephen Hemminger wrote:
>
> But then you couldn't replace tfifo with pfifo or tbf??
Couldn't the user install TBF as a parent of netem instead?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-26 7:35 ` Herbert Xu
@ 2008-08-26 7:47 ` Herbert Xu
2008-08-26 12:24 ` Stephen Hemminger
2008-08-27 9:32 ` David Miller
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-26 7:47 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, jarkao2, netdev, denys, Jussi Kivilinna
On Tue, Aug 26, 2008 at 05:35:08PM +1000, Herbert Xu wrote:
> On Sun, Aug 24, 2008 at 08:29:22PM -0400, Stephen Hemminger wrote:
> >
> > But then you couldn't replace tfifo with pfifo or tbf??
>
> Couldn't the user install TBF as a parent of netem instead?
In fact, having tfifo there all the time gives us something
that we couldn't do before. Conceptually, whether TBF is above
or below netem corresponds to a network topology where the shaping
occurs after or before the segment that netem is simulating,
respectively.
If you actually replaced tfifo with TBF (as we do now), then the
shaping always occurs after the segment simulated by netem. That
is, this is pretty much the same as having TBF as the parent.
BTW, the use of the CB area conflicts with the new pkt_len stuff
so either netem or the pkt_len code needs to be fixed.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-26 7:47 ` Herbert Xu
@ 2008-08-26 12:24 ` Stephen Hemminger
2008-08-26 12:41 ` Herbert Xu
2008-08-27 9:32 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Stephen Hemminger @ 2008-08-26 12:24 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, jarkao2, netdev, denys, Jussi Kivilinna
On Tue, 26 Aug 2008 17:47:02 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Tue, Aug 26, 2008 at 05:35:08PM +1000, Herbert Xu wrote:
> > On Sun, Aug 24, 2008 at 08:29:22PM -0400, Stephen Hemminger wrote:
> > >
> > > But then you couldn't replace tfifo with pfifo or tbf??
> >
> > Couldn't the user install TBF as a parent of netem instead?
>
> In fact, having tfifo there all the time gives us something
> that we couldn't do before. Conceptually, whether TBF is above
> or below netem corresponds to a network topology where the shaping
> occurs after or before the segment that netem is simulating,
> respectively.
>
> If you actually replaced tfifo with TBF (as we do now), then the
> shaping always occurs after the segment simulated by netem. That
> is, this is pretty much the same as having TBF as the parent.
>
> BTW, the use of the CB area conflicts with the new pkt_len stuff
> so either netem or the pkt_len code needs to be fixed.
>
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
The problem with netem as child of TBF is that TBF counts the number
of packets in the queue to determine the rate. Therefore TBF gets confused
about the rate because of the large number of packets that are held in
netem when delaying.
In an earlier version, I did rate control in netem but jamal thought
doing layering was better and it has worked until the redesign.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-26 12:24 ` Stephen Hemminger
@ 2008-08-26 12:41 ` Herbert Xu
2008-08-26 12:50 ` Stephen Hemminger
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-08-26 12:41 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, jarkao2, netdev, denys, Jussi Kivilinna, jamal
(Adding Jamal to the cc)
On Tue, Aug 26, 2008 at 08:24:55AM -0400, Stephen Hemminger wrote:
>
> The problem with netem as child of TBF is that TBF counts the number
> of packets in the queue to determine the rate. Therefore TBF gets confused
> about the rate because of the large number of packets that are held in
> netem when delaying.
OK I'm probably missing something. I can't find any code in TBF
that looks at the number of packets held in the queue. All it
does is look at the dequeued packet and whether we have enough
tokens to send it right now.
In any case, looking at the number of packets in the queue sounds
broken for TBF as the packets could be held in upper-level queues
which are invisible.
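For reference, the decision in tbf_dequeue() boils down to this
condensed sketch (the separate peak-rate bucket is omitted); only the
head packet's size and the saved tokens matter, never a queue length:

/* condensed sketch of tbf_dequeue(), not the verbatim kernel code */
static struct sk_buff *tbf_dequeue(struct Qdisc *sch)
{
	struct tbf_sched_data *q = qdisc_priv(sch);
	struct sk_buff *skb = q->qdisc->dequeue(q->qdisc);

	if (skb) {
		psched_time_t now = psched_get_time();
		long toks = min_t(long, now - q->t_c, q->buffer) + q->tokens;

		if (toks > q->buffer)
			toks = q->buffer;
		toks -= (long) L2T(q, qdisc_pkt_len(skb));

		if (toks >= 0) {		/* enough tokens: send it */
			q->t_c = now;
			q->tokens = toks;
			sch->q.qlen--;
			return skb;
		}
		/* not enough: put it back and sleep until the bucket refills */
		q->qdisc->ops->requeue(skb, q->qdisc);
		qdisc_watchdog_schedule(&q->watchdog, now + (-toks));
	}
	return NULL;
}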
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-26 12:41 ` Herbert Xu
@ 2008-08-26 12:50 ` Stephen Hemminger
2008-08-26 12:56 ` Herbert Xu
2008-08-27 12:17 ` Bastian Bloessl
0 siblings, 2 replies; 209+ messages in thread
From: Stephen Hemminger @ 2008-08-26 12:50 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, jarkao2, netdev, denys, Jussi Kivilinna, jamal
On Tue, 26 Aug 2008 22:41:53 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:
> (Adding Jamal to the cc)
>
> On Tue, Aug 26, 2008 at 08:24:55AM -0400, Stephen Hemminger wrote:
> >
> > The problem with netem as child of TBF is that TBF counts the number
> > of packets in the queue to determine the rate. Therefore TBF gets confused
> > about the rate because of the large number of packets that are held in
> > netem when delaying.
>
> OK I'm probably missing something. I can't find any code in TBF
> that looks at the number of packets held in the queue. All it
> does is look at the dequeued packet and whether we have enough
> tokens to send it right now.
>
> In any case, looking at the number of packets in the queue sounds
> broken for TBF as the packets could be held in upper-level queues
> which are invisible.
>
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Last time I tried TBF(100kbit) { netem(+100ms) } it gave different answers
than netem(+100ms) { TBF(100kbit) }.
I would prefer a peek() to the current dequeue/requeue.
An alternative would be to have netem keep a parallel data structure with
the time to send for all packets, but that would be assuming the underlying
qdiscs were work-conserving.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-26 12:50 ` Stephen Hemminger
@ 2008-08-26 12:56 ` Herbert Xu
2008-08-27 12:17 ` Bastian Bloessl
1 sibling, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-26 12:56 UTC (permalink / raw)
To: Stephen Hemminger
Cc: David Miller, jarkao2, netdev, denys, Jussi Kivilinna, jamal
On Tue, Aug 26, 2008 at 08:50:29AM -0400, Stephen Hemminger wrote:
>
> Last time I tried TBF(100kbit) { netem(+100ms) } it gave different answers
> than netem(+100ms) { TBF(100kbit) }.
In what ways were the answers different?
> I would prefer a peek() to the current dequeue/requeue.
> An alternative would be to have netem keep a parallel data structure with
> the time to send for all packets, but that would be assuming the underlying
> qdisc's were work conserving.
The peek() interface isn't really applicable for netem since the
packet that it's requeueing wasn't dequeued in the first place.
In any case, what I'm trying to say is that netem should really
have its own queue (e.g., just fold tfifo in) to implement the
reordering and delays.
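The queue tfifo keeps is simply one sorted by time_to_send; condensed
from net/sched/sch_netem.c, with the queue-limit check dropped:

/* condensed sketch of tfifo_enqueue(): an insertion sort by send time */
static int tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
{
	struct fifo_sched_data *q = qdisc_priv(sch);
	struct sk_buff_head *list = &sch->q;
	psched_time_t tnext = netem_skb_cb(nskb)->time_to_send;
	struct sk_buff *skb;

	if (likely(skb_queue_empty(list) || tnext >= q->oldest)) {
		q->oldest = tnext;		/* common case: append */
		return qdisc_enqueue_tail(nskb, sch);
	}

	skb_queue_reverse_walk(list, skb) {
		/* equal timestamps keep arrival order */
		if (tnext >= netem_skb_cb(skb)->time_to_send)
			break;
	}
	__skb_queue_after(list, skb, nskb);	/* may be a head insert */

	sch->qstats.backlog += qdisc_pkt_len(nskb);
	return NET_XMIT_SUCCESS;
}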
This does not prevent the user from creating children of netem
such as TBF to simulate a network environment where you have
loss/delay/jitter after traffic goes through a shaper.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-26 7:47 ` Herbert Xu
2008-08-26 12:24 ` Stephen Hemminger
@ 2008-08-27 9:32 ` David Miller
2008-08-27 9:56 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: David Miller @ 2008-08-27 9:32 UTC (permalink / raw)
To: herbert; +Cc: shemminger, jarkao2, netdev, denys, jussi.kivilinna
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 26 Aug 2008 17:47:02 +1000
> BTW, the use of the CB area conflicts with the new pkt_len stuff
> so either netem or the pkt_len code needs to be fixed.
Netem should be just fine, it's using the opaque ->data[]
blob at the end of qdisc_skb_cb which was created exactly
for this purpose I imagine. :)
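Roughly, from include/net/sch_generic.h and net/sched/sch_netem.c:

/* pkt_len lives in the fixed part, netem's state in the opaque blob */
struct qdisc_skb_cb {
	unsigned int	pkt_len;
	char		data[];
};

struct netem_skb_cb {
	psched_time_t	time_to_send;
};

static inline struct qdisc_skb_cb *qdisc_skb_cb(struct sk_buff *skb)
{
	return (struct qdisc_skb_cb *)skb->cb;
}

static inline struct netem_skb_cb *netem_skb_cb(struct sk_buff *skb)
{
	return (struct netem_skb_cb *)qdisc_skb_cb(skb)->data;
}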
Let me know if I missed something.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-27 9:32 ` David Miller
@ 2008-08-27 9:56 ` Herbert Xu
0 siblings, 0 replies; 209+ messages in thread
From: Herbert Xu @ 2008-08-27 9:56 UTC (permalink / raw)
To: David Miller; +Cc: shemminger, jarkao2, netdev, denys, jussi.kivilinna
On Wed, Aug 27, 2008 at 02:32:40AM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 26 Aug 2008 17:47:02 +1000
>
> > BTW, the use of the CB area conflicts with the new pkt_len stuff
> > so either netem or the pkt_len code needs to be fixed.
>
> Netem should be just fine, it's using the opaque ->data[]
> blob at the end of qdisc_skb_cb which was created exactly
> for this purpose I imagine. :)
>
> Let me know if I missed something.
You're right I missed that bit :)
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock().
2008-08-26 12:50 ` Stephen Hemminger
2008-08-26 12:56 ` Herbert Xu
@ 2008-08-27 12:17 ` Bastian Bloessl
1 sibling, 0 replies; 209+ messages in thread
From: Bastian Bloessl @ 2008-08-27 12:17 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Herbert Xu, David Miller, jarkao2, netdev, denys, Jussi Kivilinna,
jamal
Stephen Hemminger wrote:
>
> Last time I tried TBF(100kbit) { netem(+100ms) } it gave different answers
> than netem(+100ms) { TBF(100kbit) }.
This might be because in the 2nd case you waste tokens when netem
requeues to TBF.
But TBF's requeue() can't restore tokens because, as David said, netem
uses it to enqueue reordered packets.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-08-21 7:52 ` David Miller
2008-08-21 8:00 ` Herbert Xu
2008-08-21 8:27 ` Jarek Poplawski
@ 2008-09-11 10:39 ` David Miller
2008-09-11 10:45 ` Herbert Xu
2 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-11 10:39 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: David Miller <davem@davemloft.net>
Date: Thu, 21 Aug 2008 00:52:50 -0700 (PDT)
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Thu, 21 Aug 2008 17:16:34 +1000
>
> > Actually, why do we even keep a netdev_queue pointer in a qdisc?
> > A given qdisc can be used by multiple queues (which is why the
> > lock was moved into the qdisc in the first place).
> >
> > How about keeping a pointer directly to the root qdisc plus a
> > pointer to the netdev (which seems to be the only other use for
> > qdisc->dev_queue)? That way there won't be any confusion as to
> > whether we want the sleeping or non-sleeping qdisc.
>
> Not a bad idea at all.
>
> The reason it's there is a left-over from earlier designs of my
> multiqueue stuff, I thought we'd always multiplex the qdiscs to be
> per-queue. But once Patrick showed me we couldn't do that, we now
> have shared qdiscs.
I got to looking into this and we do need the qdisc->dev_queue member,
see qdisc_run(). So it's not like we can simply get rid of it by
replacing it with ->netdev and adding a ->root_qdisc backpointer.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 10:39 ` David Miller
@ 2008-09-11 10:45 ` Herbert Xu
2008-09-11 10:49 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-11 10:45 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Thu, Sep 11, 2008 at 03:39:56AM -0700, David Miller wrote:
>
> I got to looking into this and we do need the qdisc->dev_queue member,
> see qdisc_run(). So it's not like we can simply get rid of it by
> replacing it with ->netdev and adding a ->root_qdisc backpointer.
That can't be right. Let's say I've got a single qdisc shared by
n queues. It makes no sense for qdisc_run to decide whether
it should process the qdisc depending on the status of a single
one out of the n queues.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 10:45 ` Herbert Xu
@ 2008-09-11 10:49 ` David Miller
2008-09-11 11:00 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-11 10:49 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 11 Sep 2008 20:45:31 +1000
> On Thu, Sep 11, 2008 at 03:39:56AM -0700, David Miller wrote:
> >
> > I got to looking into this and we do need the qdisc->dev_queue member,
> > see qdisc_run(). So it's not like we can simply get rid of it by
> > replacing it with ->netdev and adding a ->root_qdisc backpointer.
>
> That can't be right. Let's say I've got a single qdisc shared by
> n queues. It makes no sense for qdisc_run to decide whether
> it should process the qdisc depending on the status of a single
> one out of the n queues.
Well some kind of check has to be there.
I _did_ remove it during my initial implementation, and that
turned into a reported performance regression.
See:
commit 83f36f3f35f4f83fa346bfff58a5deabc78370e5
Author: David S. Miller <davem@davemloft.net>
Date: Wed Aug 13 02:13:34 2008 -0700
pkt_sched: Add queue stopped test back to qdisc_run().
Based upon a bug report by Andrew Gallatin on netdev
with subject "CPU utilization increased in 2.6.27rc"
In commit 37437bb2e1ae8af470dfcd5b4ff454110894ccaf
("pkt_sched: Schedule qdiscs instead of netdev_queue.")
the test of the queue being stopped was erroneously
removed from qdisc_run().
When the TX queue of the device fills up, this omission
causes lots of extraneous useless work to be queued up
to softirq context, where we'll just return immediately
because the device is still stuffed up.
Signed-off-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 10:49 ` David Miller
@ 2008-09-11 11:00 ` Herbert Xu
2008-09-11 11:42 ` David Miller
2008-09-11 11:51 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-09-11 11:00 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Thu, Sep 11, 2008 at 03:49:55AM -0700, David Miller wrote:
>
> Well some kind of check has to be there.
>
> I _did_ remove it during my initial implementation, and that
> turned into a reported performance regression.
I see. How about looking at the queue that the head-of-qdisc
packet maps to? That should be fairly cheap to compute.
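Something like this sketch, assuming qdiscs grow a ->peek() op:

/* sketch only: is the tx queue the head packet maps to stopped? */
static inline int qdisc_head_queue_stopped(struct Qdisc *q)
{
	struct sk_buff *skb = q->ops->peek(q);
	struct netdev_queue *txq;

	if (!skb)
		return 0;
	txq = netdev_get_tx_queue(qdisc_dev(q), skb_get_queue_mapping(skb));
	return netif_tx_queue_stopped(txq);
}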
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:00 ` Herbert Xu
@ 2008-09-11 11:42 ` David Miller
2008-09-11 11:45 ` Herbert Xu
2008-09-11 11:51 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-11 11:42 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 11 Sep 2008 21:00:35 +1000
> On Thu, Sep 11, 2008 at 03:49:55AM -0700, David Miller wrote:
> >
> > Well some kind of check has to be there.
> >
> > I _did_ remove it during my initial implementation, and that
> > turned into a reported performance regression.
>
> I see. How about looking at the queue that the head-of-qdisc
> packet maps to? That should be fairly cheap to compute.
This gets us back to the whole qdisc->ops->peek() discussion :)
And we don't have the qdisc lock here, taking it is undesirable,
and if we do take it we have to transfer that lock down into
__qdisc_run() which means adjusting all the other __qdisc_run()
callers.
It's very clumsy at best.
I therefore don't think it's wise peeking into the qdisc here.
But I do realize we have to do something about this, hmmm...
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:42 ` David Miller
@ 2008-09-11 11:45 ` Herbert Xu
2008-09-11 11:47 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-11 11:45 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev
On Thu, Sep 11, 2008 at 04:42:55AM -0700, David Miller wrote:
>
> This gets us back to the whole qdisc->ops->peek() discussion :)
That just proves it's a good idea :)
> And we don't have the qdisc lock here, taking it is undesirable,
> and if we do take it we have to transfer that lock down into
> __qdisc_run() which means adjusting all the other __qdisc_run()
> callers.
Last I checked qdisc_run did run under the root lock...
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:45 ` Herbert Xu
@ 2008-09-11 11:47 ` David Miller
2008-09-12 4:49 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-11 11:47 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 11 Sep 2008 21:45:02 +1000
> On Thu, Sep 11, 2008 at 04:42:55AM -0700, David Miller wrote:
> > And we don't have the qdisc lock here, taking it is undesirable,
> > and if we do take it we have to transfer that lock down into
> > __qdisc_run() which means adjusting all the other __qdisc_run()
> > callers.
>
> Last I checked qdisc_run did run under the root lock...
That certainly changes things :)
Ok, so implementing ->peek() is the first step in dealing
with this.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:00 ` Herbert Xu
2008-09-11 11:42 ` David Miller
@ 2008-09-11 11:51 ` Jarek Poplawski
2008-09-11 11:54 ` Herbert Xu
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-11 11:51 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Thu, Sep 11, 2008 at 09:00:35PM +1000, Herbert Xu wrote:
> On Thu, Sep 11, 2008 at 03:49:55AM -0700, David Miller wrote:
> >
> > Well some kind of check has to be there.
> >
> > I _did_ remove it during my initial implementation, and that
> > turned into a reported performance regression.
>
> I see. How about looking at the queue that the head-of-qdisc
> packet maps to? That should be fairly cheap to compute.
>
Very good point: "That can't be right"! But I'm not sure I got the
above idea: e.g. this new sch_multiq can dequeue different packets
depending on which tx_queues happen to be unstopped.
IMHO, the most reasonable test here is for all tx_queues of a qdisc
being stopped, but since this is quite heavy, probably we need an
additional qdisc flag for such an occasion.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:51 ` Jarek Poplawski
@ 2008-09-11 11:54 ` Herbert Xu
2008-09-11 12:10 ` Jarek Poplawski
2008-09-11 12:34 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: Herbert Xu @ 2008-09-11 11:54 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev
On Thu, Sep 11, 2008 at 11:51:32AM +0000, Jarek Poplawski wrote:
>
> IMHO, the most reasonable test here is for all tx_queues of a qdisc
> > being stopped, but since this is quite heavy, probably we need an
> additional qdisc flag for such an occasion.
Yes we could do that too. Although the head-of-qdisc approach
will eventually lead to the same result. That is, as you pop
things off the head eventually you'll hit a packet that belongs
to the stopped queue and that'll then block the whole qdisc.
So I don't think there's anything inherently advantageous in
checking all the queues.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:54 ` Herbert Xu
@ 2008-09-11 12:10 ` Jarek Poplawski
2008-09-11 12:34 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-11 12:10 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Thu, Sep 11, 2008 at 09:54:49PM +1000, Herbert Xu wrote:
> On Thu, Sep 11, 2008 at 11:51:32AM +0000, Jarek Poplawski wrote:
> >
> > IMHO, the most reasonable test here is for all tx_queues of a qdisc
> > being stopped, but since this is quite heavy, probably we need an
> > additional qdisc flag for such an occasion.
>
> Yes we could do that too. Although the head-of-qdisc approach
> will eventually lead to the same result. That is, as you pop
> things off the head eventually you'll hit a packet that belongs
> to the stopped queue and that'll then block the whole qdisc.
>
> So I don't think there's anything inherently advantageous in
> checking all the queues.
Yes, but this is only because the current behaviour of blocking all
transmission on one stopped tx_queue is wrong (IMHO), and with
sch_multiq there should be a real advantage.
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:54 ` Herbert Xu
2008-09-11 12:10 ` Jarek Poplawski
@ 2008-09-11 12:34 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-11 12:34 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev
On Thu, Sep 11, 2008 at 09:54:49PM +1000, Herbert Xu wrote:
> On Thu, Sep 11, 2008 at 11:51:32AM +0000, Jarek Poplawski wrote:
> >
> > IMHO, the most reasonable test here is for all tx_queues of a qdisc
> > being stopped, but since this is quite heavy, probably we need an
> > additional qdisc flag for such an occasion.
...
> So I don't think there's anything inherently advantageous in
> checking all the queues.
BTW, I hope I wasn't misunderstood here. I don't mean any additional
checking of all the queues anywhere. It's only about setting such
a flag at the end of netif_stop_all_queues() and netif_stop_queue().
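Roughly like this purely illustrative sketch (the flag name is
invented):

/* purely illustrative: record "all tx queues stopped" once, at stop
 * time, so the hot path only has to test a single bit */
static inline void netif_stop_all_queues(struct net_device *dev)
{
	unsigned int i;

	for (i = 0; i < dev->num_tx_queues; i++)
		netif_tx_stop_queue(netdev_get_tx_queue(dev, i));

	set_bit(__ALL_TX_STOPPED, &dev->state);	/* invented flag */
}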
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-11 11:47 ` David Miller
@ 2008-09-12 4:49 ` David Miller
2008-09-12 8:02 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-12 4:49 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev
From: David Miller <davem@davemloft.net>
Date: Thu, 11 Sep 2008 04:47:17 -0700 (PDT)
> Ok, so implementing ->peek() is the first step in dealing
> with this.
Ok, here's a first cut at this.
Most of it is simple and straightforward.
As usual, though, CBQ, HFSC, and HTB are complicated.
Most of the peek implementations just skb_peek() in their
downstream queue or iterate over their prio array doing
the same looking for a non-empty list.
But CBQ, HFSC, and HTB have complicated class iterators and
internal time state machine things. I tried to do my best
in these cases.
I didn't want these things firing off class watchdog timers and
stuff like this. Just see if there is any packet ready now
and return it.
The one exception is that I allow CBQ to advance its time
state machine.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e556962..c4eb6e5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -41,6 +41,7 @@ struct Qdisc
{
int (*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
struct sk_buff * (*dequeue)(struct Qdisc *dev);
+ struct sk_buff * (*peek)(struct Qdisc *dev);
unsigned flags;
#define TCQ_F_BUILTIN 1
#define TCQ_F_THROTTLED 2
@@ -110,6 +111,7 @@ struct Qdisc_ops
int (*enqueue)(struct sk_buff *, struct Qdisc *);
struct sk_buff * (*dequeue)(struct Qdisc *);
+ struct sk_buff * (*peek)(struct Qdisc *);
int (*requeue)(struct sk_buff *, struct Qdisc *);
unsigned int (*drop)(struct Qdisc *);
@@ -431,6 +433,28 @@ static inline struct sk_buff *qdisc_dequeue_tail(struct Qdisc *sch)
return __qdisc_dequeue_tail(sch, &sch->q);
}
+static inline struct sk_buff *__qdisc_peek_head(struct Qdisc *sch,
+ struct sk_buff_head *list)
+{
+ return skb_peek(list);
+}
+
+static inline struct sk_buff *qdisc_peek_head(struct Qdisc *sch)
+{
+ return __qdisc_peek_head(sch, &sch->q);
+}
+
+static inline struct sk_buff *__qdisc_peek_tail(struct Qdisc *sch,
+ struct sk_buff_head *list)
+{
+ return skb_peek_tail(list);
+}
+
+static inline struct sk_buff *qdisc_peek_tail(struct Qdisc *sch)
+{
+ return __qdisc_peek_tail(sch, &sch->q);
+}
+
static inline int __qdisc_requeue(struct sk_buff *skb, struct Qdisc *sch,
struct sk_buff_head *list)
{
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index 43d3725..cfe44c0 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -522,6 +522,13 @@ static struct sk_buff *atm_tc_dequeue(struct Qdisc *sch)
return skb;
}
+static struct sk_buff *atm_tc_peek(struct Qdisc *sch)
+{
+ struct atm_qdisc_data *p = qdisc_priv(sch);
+
+ return p->link.q->peek(p->link.q);
+}
+
static int atm_tc_requeue(struct sk_buff *skb, struct Qdisc *sch)
{
struct atm_qdisc_data *p = qdisc_priv(sch);
@@ -694,6 +701,7 @@ static struct Qdisc_ops atm_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct atm_qdisc_data),
.enqueue = atm_tc_enqueue,
.dequeue = atm_tc_dequeue,
+ .peek = atm_tc_peek,
.requeue = atm_tc_requeue,
.drop = atm_tc_drop,
.init = atm_tc_init,
diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
index 507fb48..094a874 100644
--- a/net/sched/sch_blackhole.c
+++ b/net/sched/sch_blackhole.c
@@ -33,6 +33,7 @@ static struct Qdisc_ops blackhole_qdisc_ops __read_mostly = {
.priv_size = 0,
.enqueue = blackhole_enqueue,
.dequeue = blackhole_dequeue,
+ .peek = blackhole_dequeue,
.owner = THIS_MODULE,
};
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 8b06fa9..bf1a5d9 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -851,7 +851,7 @@ cbq_under_limit(struct cbq_class *cl)
}
static __inline__ struct sk_buff *
-cbq_dequeue_prio(struct Qdisc *sch, int prio)
+cbq_dequeue_prio(struct Qdisc *sch, int prio, int peek)
{
struct cbq_sched_data *q = qdisc_priv(sch);
struct cbq_class *cl_tail, *cl_prev, *cl;
@@ -881,7 +881,13 @@ cbq_dequeue_prio(struct Qdisc *sch, int prio)
goto next_class;
}
- skb = cl->q->dequeue(cl->q);
+ if (peek) {
+ skb = cl->q->peek(cl->q);
+ if (skb)
+ return skb;
+ } else {
+ skb = cl->q->dequeue(cl->q);
+ }
/* Class did not give us any skb :-(
It could occur even if cl->q->q.qlen != 0
@@ -964,23 +970,34 @@ cbq_dequeue_1(struct Qdisc *sch)
while (activemask) {
int prio = ffz(~activemask);
activemask &= ~(1<<prio);
- skb = cbq_dequeue_prio(sch, prio);
+ skb = cbq_dequeue_prio(sch, prio, 0);
if (skb)
return skb;
}
return NULL;
}
-static struct sk_buff *
-cbq_dequeue(struct Qdisc *sch)
+static __inline__ struct sk_buff *
+cbq_peek_1(struct Qdisc *sch)
{
- struct sk_buff *skb;
struct cbq_sched_data *q = qdisc_priv(sch);
- psched_time_t now;
- psched_tdiff_t incr;
+ struct sk_buff *skb;
+ unsigned activemask;
- now = psched_get_time();
- incr = now - q->now_rt;
+ activemask = q->activemask&0xFF;
+ while (activemask) {
+ int prio = ffz(~activemask);
+ activemask &= ~(1<<prio);
+ skb = cbq_dequeue_prio(sch, prio, 1);
+ if (skb)
+ return skb;
+ }
+ return NULL;
+}
+
+static void cbq_update_clock(struct cbq_sched_data *q, psched_time_t now)
+{
+ psched_tdiff_t incr = now - q->now_rt;
if (q->tx_class) {
psched_tdiff_t incr2;
@@ -999,7 +1016,18 @@ cbq_dequeue(struct Qdisc *sch)
}
q->now += incr;
q->now_rt = now;
+}
+static struct sk_buff *
+cbq_dequeue(struct Qdisc *sch)
+{
+ struct cbq_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+ psched_time_t now;
+
+ now = psched_get_time();
+
+ cbq_update_clock(q, now);
for (;;) {
q->wd_expires = 0;
@@ -1048,6 +1076,30 @@ cbq_dequeue(struct Qdisc *sch)
return NULL;
}
+static struct sk_buff *
+cbq_peek(struct Qdisc *sch)
+{
+ struct cbq_sched_data *q = qdisc_priv(sch);
+ psched_time_t now;
+
+ now = psched_get_time();
+
+ cbq_update_clock(q, now);
+ for (;;) {
+ struct sk_buff *skb = cbq_peek_1(sch);
+ if (skb)
+ return skb;
+
+ if (q->toplevel == TC_CBQ_MAXLEVEL &&
+ q->link.undertime == PSCHED_PASTPERFECT)
+ break;
+
+ q->toplevel = TC_CBQ_MAXLEVEL;
+ q->link.undertime = PSCHED_PASTPERFECT;
+ }
+ return NULL;
+}
+
/* CBQ class maintanance routines */
static void cbq_adjust_levels(struct cbq_class *this)
@@ -2065,6 +2117,7 @@ static struct Qdisc_ops cbq_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct cbq_sched_data),
.enqueue = cbq_enqueue,
.dequeue = cbq_dequeue,
+ .peek = cbq_peek,
.requeue = cbq_requeue,
.drop = cbq_drop,
.init = cbq_init,
diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c
index edd1298..7213faa 100644
--- a/net/sched/sch_dsmark.c
+++ b/net/sched/sch_dsmark.c
@@ -313,6 +313,13 @@ static struct sk_buff *dsmark_dequeue(struct Qdisc *sch)
return skb;
}
+static struct sk_buff *dsmark_peek(struct Qdisc *sch)
+{
+ struct dsmark_qdisc_data *p = qdisc_priv(sch);
+
+ return p->q->ops->peek(p->q);
+}
+
static int dsmark_requeue(struct sk_buff *skb, struct Qdisc *sch)
{
struct dsmark_qdisc_data *p = qdisc_priv(sch);
@@ -496,6 +503,7 @@ static struct Qdisc_ops dsmark_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct dsmark_qdisc_data),
.enqueue = dsmark_enqueue,
.dequeue = dsmark_dequeue,
+ .peek = dsmark_peek,
.requeue = dsmark_requeue,
.drop = dsmark_drop,
.init = dsmark_init,
diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index 23d258b..8825e88 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -83,6 +83,7 @@ struct Qdisc_ops pfifo_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct fifo_sched_data),
.enqueue = pfifo_enqueue,
.dequeue = qdisc_dequeue_head,
+ .peek = qdisc_peek_head,
.requeue = qdisc_requeue,
.drop = qdisc_queue_drop,
.init = fifo_init,
@@ -98,6 +99,7 @@ struct Qdisc_ops bfifo_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct fifo_sched_data),
.enqueue = bfifo_enqueue,
.dequeue = qdisc_dequeue_head,
+ .peek = qdisc_peek_head,
.requeue = qdisc_requeue,
.drop = qdisc_queue_drop,
.init = fifo_init,
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..148c117 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -313,6 +313,7 @@ struct Qdisc_ops noop_qdisc_ops __read_mostly = {
.priv_size = 0,
.enqueue = noop_enqueue,
.dequeue = noop_dequeue,
+ .peek = noop_dequeue,
.requeue = noop_requeue,
.owner = THIS_MODULE,
};
@@ -324,6 +325,7 @@ static struct netdev_queue noop_netdev_queue = {
struct Qdisc noop_qdisc = {
.enqueue = noop_enqueue,
.dequeue = noop_dequeue,
+ .peek = noop_dequeue,
.flags = TCQ_F_BUILTIN,
.ops = &noop_qdisc_ops,
.list = LIST_HEAD_INIT(noop_qdisc.list),
@@ -337,6 +339,7 @@ static struct Qdisc_ops noqueue_qdisc_ops __read_mostly = {
.priv_size = 0,
.enqueue = noop_enqueue,
.dequeue = noop_dequeue,
+ .peek = noop_dequeue,
.requeue = noop_requeue,
.owner = THIS_MODULE,
};
@@ -349,6 +352,7 @@ static struct netdev_queue noqueue_netdev_queue = {
static struct Qdisc noqueue_qdisc = {
.enqueue = NULL,
.dequeue = noop_dequeue,
+ .peek = noop_dequeue,
.flags = TCQ_F_BUILTIN,
.ops = &noqueue_qdisc_ops,
.list = LIST_HEAD_INIT(noqueue_qdisc.list),
@@ -400,6 +404,19 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc* qdisc)
return NULL;
}
+static struct sk_buff *pfifo_fast_peek(struct Qdisc* qdisc)
+{
+ int prio;
+ struct sk_buff_head *list = qdisc_priv(qdisc);
+
+ for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+ if (!skb_queue_empty(list + prio))
+ return __qdisc_peek_head(qdisc, list + prio);
+ }
+
+ return NULL;
+}
+
static int pfifo_fast_requeue(struct sk_buff *skb, struct Qdisc* qdisc)
{
qdisc->q.qlen++;
@@ -446,6 +463,7 @@ static struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.priv_size = PFIFO_FAST_BANDS * sizeof(struct sk_buff_head),
.enqueue = pfifo_fast_enqueue,
.dequeue = pfifo_fast_dequeue,
+ .peek = pfifo_fast_peek,
.requeue = pfifo_fast_requeue,
.init = pfifo_fast_init,
.reset = pfifo_fast_reset,
@@ -476,6 +494,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
sch->ops = ops;
sch->enqueue = ops->enqueue;
sch->dequeue = ops->dequeue;
+ sch->peek = ops->peek;
sch->dev_queue = dev_queue;
dev_hold(qdisc_dev(sch));
atomic_set(&sch->refcnt, 1);
diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c
index c1ad6b8..cb20ee3 100644
--- a/net/sched/sch_gred.c
+++ b/net/sched/sch_gred.c
@@ -602,6 +602,7 @@ static struct Qdisc_ops gred_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct gred_sched),
.enqueue = gred_enqueue,
.dequeue = gred_dequeue,
+ .peek = qdisc_peek_head,
.requeue = gred_requeue,
.drop = gred_drop,
.init = gred_init,
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index c1e77da..c9e80de 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1674,6 +1674,39 @@ hfsc_dequeue(struct Qdisc *sch)
return skb;
}
+static struct sk_buff *
+hfsc_peek(struct Qdisc *sch)
+{
+ struct hfsc_sched *q = qdisc_priv(sch);
+ struct hfsc_class *cl;
+ struct sk_buff *skb;
+ u64 cur_time;
+
+ if (sch->q.qlen == 0)
+ return NULL;
+ if ((skb = skb_peek(&q->requeue)))
+ return skb;
+
+ cur_time = psched_get_time();
+
+ /*
+ * if there are eligible classes, use real-time criteria.
+ * find the class with the minimum deadline among
+ * the eligible classes.
+ */
+ if ((cl = eltree_get_mindl(q, cur_time)) == NULL) {
+ /*
+ * use link-sharing criteria
+ * get the class with the minimum vt in the hierarchy
+ */
+ cl = vttree_get_minvt(&q->root, cur_time);
+ if (cl == NULL)
+ return NULL;
+ }
+
+ return cl->qdisc->peek(cl->qdisc);
+}
+
static int
hfsc_requeue(struct sk_buff *skb, struct Qdisc *sch)
{
@@ -1735,6 +1768,7 @@ static struct Qdisc_ops hfsc_qdisc_ops __read_mostly = {
.dump = hfsc_dump_qdisc,
.enqueue = hfsc_enqueue,
.dequeue = hfsc_dequeue,
+ .peek = hfsc_peek,
.requeue = hfsc_requeue,
.drop = hfsc_drop,
.cl_ops = &hfsc_class_ops,
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index d14f020..adc0264 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -803,7 +803,7 @@ static struct htb_class *htb_lookup_leaf(struct rb_root *tree, int prio,
/* dequeues packet at given priority and level; call only if
you are sure that there is active class at prio/level */
static struct sk_buff *htb_dequeue_tree(struct htb_sched *q, int prio,
- int level)
+ int level, int peek)
{
struct sk_buff *skb = NULL;
struct htb_class *cl, *start;
@@ -840,7 +840,10 @@ next:
goto next;
}
- skb = cl->un.leaf.q->dequeue(cl->un.leaf.q);
+ if (peek)
+ skb = cl->un.leaf.q->peek(cl->un.leaf.q);
+ else
+ skb = cl->un.leaf.q->dequeue(cl->un.leaf.q);
if (likely(skb != NULL))
break;
if (!cl->warned) {
@@ -858,7 +861,7 @@ next:
} while (cl != start);
- if (likely(skb != NULL)) {
+ if (likely(skb != NULL && !peek)) {
cl->un.leaf.deficit[level] -= qdisc_pkt_len(skb);
if (cl->un.leaf.deficit[level] < 0) {
cl->un.leaf.deficit[level] += cl->un.leaf.quantum;
@@ -915,7 +918,7 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
while (m != (int)(-1)) {
int prio = ffz(m);
m |= 1 << prio;
- skb = htb_dequeue_tree(q, prio, level);
+ skb = htb_dequeue_tree(q, prio, level, 0);
if (likely(skb != NULL)) {
sch->q.qlen--;
sch->flags &= ~TCQ_F_THROTTLED;
@@ -929,6 +932,52 @@ fin:
return skb;
}
+static struct sk_buff *htb_peek(struct Qdisc *sch)
+{
+ struct htb_sched *q = qdisc_priv(sch);
+ psched_time_t next_event;
+ struct sk_buff *skb;
+ int level;
+
+ /* try to peek direct packets as high prio (!) to minimize cpu work */
+ if ((skb = skb_peek(&q->direct_queue)))
+ return skb;
+
+ if (!sch->q.qlen)
+ return NULL;
+
+ q->now = psched_get_time();
+
+ next_event = q->now + 5 * PSCHED_TICKS_PER_SEC;
+ q->nwc_hit = 0;
+ for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
+ /* common case optimization - skip event handler quickly */
+ psched_time_t event;
+ int m;
+
+ if (q->now >= q->near_ev_cache[level]) {
+ event = htb_do_events(q, level);
+ if (!event)
+ event = q->now + PSCHED_TICKS_PER_SEC;
+ q->near_ev_cache[level] = event;
+ } else
+ event = q->near_ev_cache[level];
+
+ if (event && next_event > event)
+ next_event = event;
+
+ m = ~q->row_mask[level];
+ while (m != (int)(-1)) {
+ int prio = ffz(m);
+ m |= 1 << prio;
+ skb = htb_dequeue_tree(q, prio, level, 1);
+ if (likely(skb != NULL))
+ return skb;
+ }
+ }
+ return NULL;
+}
+
/* try to drop from each class (by prio) until one succeed */
static unsigned int htb_drop(struct Qdisc *sch)
{
@@ -1565,6 +1614,7 @@ static struct Qdisc_ops htb_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct htb_sched),
.enqueue = htb_enqueue,
.dequeue = htb_dequeue,
+ .peek = htb_peek,
.requeue = htb_requeue,
.drop = htb_drop,
.init = htb_init,
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index a119599..528cd55 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -308,6 +308,28 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
return NULL;
}
+static struct sk_buff *netem_peek(struct Qdisc *sch)
+{
+ struct netem_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+
+ smp_mb();
+ if (sch->flags & TCQ_F_THROTTLED)
+ return NULL;
+
+ skb = q->qdisc->peek(q->qdisc);
+ if (skb) {
+ const struct netem_skb_cb *cb = netem_skb_cb(skb);
+ psched_time_t now = psched_get_time();
+
+ /* is it time to send this packet yet? */
+ if (cb->time_to_send <= now)
+ return skb;
+ }
+
+ return NULL;
+}
+
static void netem_reset(struct Qdisc *sch)
{
struct netem_sched_data *q = qdisc_priv(sch);
@@ -541,6 +563,7 @@ static struct Qdisc_ops tfifo_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct fifo_sched_data),
.enqueue = tfifo_enqueue,
.dequeue = qdisc_dequeue_head,
+ .peek = qdisc_peek_head,
.requeue = qdisc_requeue,
.drop = qdisc_queue_drop,
.init = tfifo_init,
@@ -716,6 +739,7 @@ static struct Qdisc_ops netem_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct netem_sched_data),
.enqueue = netem_enqueue,
.dequeue = netem_dequeue,
+ .peek = netem_peek,
.requeue = netem_requeue,
.drop = netem_drop,
.init = netem_init,
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 504a78c..7ae2226 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -138,6 +138,21 @@ static struct sk_buff *prio_dequeue(struct Qdisc* sch)
}
+static struct sk_buff *prio_peek(struct Qdisc* sch)
+{
+ struct prio_sched_data *q = qdisc_priv(sch);
+ int prio;
+
+ for (prio = 0; prio < q->bands; prio++) {
+ struct Qdisc *qdisc = q->queues[prio];
+ struct sk_buff *skb = qdisc->peek(qdisc);
+ if (skb)
+ return skb;
+ }
+ return NULL;
+
+}
+
static unsigned int prio_drop(struct Qdisc* sch)
{
struct prio_sched_data *q = qdisc_priv(sch);
@@ -421,6 +436,7 @@ static struct Qdisc_ops prio_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct prio_sched_data),
.enqueue = prio_enqueue,
.dequeue = prio_dequeue,
+ .peek = prio_peek,
.requeue = prio_requeue,
.drop = prio_drop,
.init = prio_init,
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 5da0583..f1b6465 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -140,6 +140,14 @@ static struct sk_buff * red_dequeue(struct Qdisc* sch)
return skb;
}
+static struct sk_buff * red_peek(struct Qdisc* sch)
+{
+ struct red_sched_data *q = qdisc_priv(sch);
+ struct Qdisc *child = q->qdisc;
+
+ return child->peek(child);
+}
+
static unsigned int red_drop(struct Qdisc* sch)
{
struct red_sched_data *q = qdisc_priv(sch);
@@ -361,6 +369,7 @@ static struct Qdisc_ops red_qdisc_ops __read_mostly = {
.cl_ops = &red_class_ops,
.enqueue = red_enqueue,
.dequeue = red_dequeue,
+ .peek = red_peek,
.requeue = red_requeue,
.drop = red_drop,
.init = red_init,
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 6e041d1..16102da 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -431,6 +431,21 @@ sfq_dequeue(struct Qdisc *sch)
return skb;
}
+static struct sk_buff *
+sfq_peek(struct Qdisc *sch)
+{
+ struct sfq_sched_data *q = qdisc_priv(sch);
+ sfq_index a;
+
+ /* No active slots */
+ if (q->tail == SFQ_DEPTH)
+ return NULL;
+
+ a = q->next[q->tail];
+
+ return skb_peek(&q->qs[a]);
+}
+
static void
sfq_reset(struct Qdisc *sch)
{
@@ -624,6 +639,7 @@ static struct Qdisc_ops sfq_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct sfq_sched_data),
.enqueue = sfq_enqueue,
.dequeue = sfq_dequeue,
+ .peek = sfq_peek,
.requeue = sfq_requeue,
.drop = sfq_drop,
.init = sfq_init,
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 94c6159..876a0a7 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -225,6 +225,13 @@ static struct sk_buff *tbf_dequeue(struct Qdisc* sch)
return NULL;
}
+static struct sk_buff *tbf_peek(struct Qdisc* sch)
+{
+ struct tbf_sched_data *q = qdisc_priv(sch);
+
+ return q->qdisc->peek(q->qdisc);
+}
+
static void tbf_reset(struct Qdisc* sch)
{
struct tbf_sched_data *q = qdisc_priv(sch);
@@ -469,6 +476,7 @@ static struct Qdisc_ops tbf_qdisc_ops __read_mostly = {
.priv_size = sizeof(struct tbf_sched_data),
.enqueue = tbf_enqueue,
.dequeue = tbf_dequeue,
+ .peek = tbf_peek,
.requeue = tbf_requeue,
.drop = tbf_drop,
.init = tbf_init,
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index d35ef05..8d7acd8 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -123,6 +123,14 @@ teql_dequeue(struct Qdisc* sch)
return skb;
}
+static struct sk_buff *
+teql_peek(struct Qdisc* sch)
+{
+ struct teql_sched_data *dat = qdisc_priv(sch);
+
+ return skb_peek(&dat->q);
+}
+
static __inline__ void
teql_neigh_release(struct neighbour *n)
{
@@ -433,6 +441,7 @@ static __init void teql_master_setup(struct net_device *dev)
ops->enqueue = teql_enqueue;
ops->dequeue = teql_dequeue;
+ ops->peek = teql_peek;
ops->requeue = teql_requeue;
ops->init = teql_qdisc_init;
ops->reset = teql_reset;
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-12 4:49 ` David Miller
@ 2008-09-12 8:02 ` Jarek Poplawski
2008-09-12 23:10 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-12 8:02 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, Patrick McHardy
On Thu, Sep 11, 2008 at 09:49:29PM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 11 Sep 2008 04:47:17 -0700 (PDT)
>
> > Ok, so implementing ->peek() is the first step in dealing
> > with this.
>
> Ok, here's a first cut at this.
>
> Most of it is simple and straightforward.
>
> As usual, though, CBQ, HFSC, and HTB are complicated.
I guess not for Patrick... (so CCed)
>
> Most of the peek implementations just skb_peek() in their
> downstream queue or iterate over their prio array doing
> the same looking for a non-empty list.
>
> But CBQ, HFSC, and HTB have complicated class iterators and
> internal time state machine things. I tried to do my best
> in these cases.
>
> I didn't want these things firing off class watchdog timers and
> stuff like this. Just see if there is any packet ready now
> and return it.
>
> The one exception is that I allow CBQ to advance its time
> state machine.
Alas I'm still not sure how this whole peek idea is going to be
implemented (dequeuing after peeking doesn't have to give us the
same skb since in the meantime e.g. in HTB some other class with
higher prio can get enough tokens etc., and if there is a break
for transmit in the meantime with possible enqueuing, or we can
deal with something like sch_multiq, which depends on the current
state of the tx_queues, this all looks even more interesting).
So, until there is some example, even in pseudocode, of the qdisc_run()
vs. some_classful_sched interaction, I don't think I'm able to give
more feedback now, except maybe one doubt: is this wrapper below
really needed, since skb_peek() is more readable and it's used
directly in a few places anyway?
> +static inline struct sk_buff *__qdisc_peek_head(struct Qdisc *sch,
> + struct sk_buff_head *list)
> +{
> + return skb_peek(list);
> +}
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-12 8:02 ` Jarek Poplawski
@ 2008-09-12 23:10 ` David Miller
2008-09-13 1:10 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-12 23:10 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, kaber
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 12 Sep 2008 08:02:49 +0000
> Alas I'm still not sure how this whole peek idea is going to be
> implemented (dequeuing after peeking doesn't have to give us the
> same skb since in the meantime e.g. in HTB some other class with
> higher prio can get enough tokens etc., and if there is a break
> for transmit in the meantime with possible enqueuing, or we can
> deal with something like sch_multiq, which depends on the current
> state of the tx_queues, this all looks even more interesting).
That's a good point.
Well, once that is discussed and resolved, at least the patch I
posted can be used as a base for implementation :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-12 23:10 ` David Miller
@ 2008-09-13 1:10 ` Herbert Xu
2008-09-13 1:22 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-13 1:10 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, kaber
On Fri, Sep 12, 2008 at 04:10:05PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Fri, 12 Sep 2008 08:02:49 +0000
>
> > Alas I'm still not sure how this whole peek idea is going to be
> > implemented (dequeuing after peeking doesn't have to give us the
> > same skb since in the meantime e.g. in HTB some other class with
> > higher prio can get enough tokens etc., and if there is a break
> > for transmit in the meantime with possible enqueuing, or we can
> > deal with something like sch_multiq, which depends on the current
> > state of the tx_queues, this all looks even more interesting).
>
> That's a good point.
>
> Well, once that is discussed and resolved, at least the patch I
> posted can be used as a base for implementation :-)
Well we should remember the skb returned by peek, and then return
it on the next dequeue. If peek hasn't been called since the
last dequeue then it's equivalent to the current dequeue.
This works for the two scenarios where we're planning to use
peek since they'll both call dequeue immediately.
It'd even work if there was a large gap (i.e., a sleep) after
the peek since we'd just peek again after waking up.
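Something like this minimal sketch, say (where the 'peeked' cache
field is hypothetical, it doesn't exist in struct Qdisc today):

static struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
{
        /* hand out whatever the last ->peek() cached, if anything */
        struct sk_buff *skb = sch->peeked;

        if (skb) {
                sch->peeked = NULL;
                return skb;
        }
        /* no peek since the last dequeue: plain dequeue */
        return sch->dequeue(sch);
}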
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-13 1:10 ` Herbert Xu
@ 2008-09-13 1:22 ` David Miller
2008-09-13 1:27 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-13 1:22 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, kaber
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 12 Sep 2008 18:10:18 -0700
> On Fri, Sep 12, 2008 at 04:10:05PM -0700, David Miller wrote:
> > Well, once that is discussed and resolved, at least the patch I
> > posted can be used as a base for implementation :-)
>
> Well we should remember the skb returned by peek, and then return
> it on the next dequeue. If peek hasn't been called since the
> last dequeue then it's equivalent to the current dequeue.
>
> This works for the two scenarios where we're planning to use
> peek since they'll both call dequeue immediately.
>
> It'd even work if there was a large gap (i.e., a sleep) after
> the peek since we'd just peek again after waking up.
This requires state, for the unlink. And that unlink point could be
several levels deep into the tree.
There is also currently no restriction on how a packet scheduler
maintains its queue of SKBs.
So, if the idea is to keep track of the skb_queue_head pointer to
unlink from at the root, I don't think that's a good idea.
And then there are all of these complicated pieces of state the
classful qdiscs modify when a SKB is removed from their visibility.
So, the unlink isn't just a simple list delete operation. There
are queue lengths that need to be updated, watchdog timers to
manage, etc.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-13 1:22 ` David Miller
@ 2008-09-13 1:27 ` Herbert Xu
2008-09-13 1:40 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-13 1:27 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, kaber
On Fri, Sep 12, 2008 at 06:22:59PM -0700, David Miller wrote:
>
> This requires state, for the unlink. And that unlink point could be
> several levels deep into the tree.
>
> There is also currently no restriction on how a packet scheduler
> maintains its queue of SKBs.
>
> So, if the idea is to keep track of the skb_queue_head pointer to
> unlink from at the root, I don't think that's a good idea.
No, non-leaf qdiscs would remember the child qdisc
they peeked from.
> And then there are all of these complicated pieces of state the
> classful qdiscs modify when a SKB is removed from their visibility.
> So, the unlink isn't just a simple list delete operation. There
> are queue lengths that need to be updated, watchdog timers to
> manage, etc.
The decision really comes down to whether it's harder to requeue
a seemingly arbitrary packet or whether it's harder to dequeue the
packet that was last peeked.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-13 1:27 ` Herbert Xu
@ 2008-09-13 1:40 ` David Miller
2008-09-13 1:48 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-13 1:40 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, kaber
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 12 Sep 2008 18:27:58 -0700
> On Fri, Sep 12, 2008 at 06:22:59PM -0700, David Miller wrote:
> > And then there are all of these complicated pieces of state the
> > classful qdiscs modify when a SKB is removed from their visibility.
> > So, the unlink isn't just a simple list delete operation. There
> > are queue lengths that need to be updated, watchdog timers to
> > manage, etc.
>
> The decision really comes down to whether it's harder to requeue
> a seemingly arbitrary packet or whether it's harder to dequeue the
> packet that was last peeked.
My current opinion is that both operations are equally difficult.
With the slight advantage for ->requeue() because all the complicated
logic is already implemented :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-13 1:40 ` David Miller
@ 2008-09-13 1:48 ` Herbert Xu
2008-09-13 20:54 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-13 1:48 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, kaber
On Fri, Sep 12, 2008 at 06:40:08PM -0700, David Miller wrote:
>
> My current opinion is that both operations are equally difficult.
> With the slight advantage for ->requeue() because all the complicated
> logic is already implemented :-)
I'd agree with you if you hadn't written the peek stuff :)
Now that peek exists, the dequeue stuff would be a lot simpler
than requeue because the only non-trivial logic would be in the
leaf qdiscs. All the complex/classful qdiscs would be trivial
as they'd just write down the child qdisc that was peeked and
then call dequeue on that child.
Compare that to requeue where the classful qdiscs have to do
loads of work to figure out which child the packet should be
sent to.
In fact it looks like CBQ has taken the easy way out by remembering
the class the last packet was dequeued from so it's essentially
doing what I'm proposing here :)
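For prio, e.g., the peek/dequeue pair could look roughly like this
sketch (the 'peeked_child' field is hypothetical, and qlen/stats
bookkeeping is omitted):

static struct sk_buff *prio_peek_sketch(struct Qdisc *sch)
{
        struct prio_sched_data *q = qdisc_priv(sch);
        int prio;

        for (prio = 0; prio < q->bands; prio++) {
                struct Qdisc *child = q->queues[prio];
                struct sk_buff *skb = child->peek(child);

                if (skb) {
                        /* remember which child to dequeue from next */
                        q->peeked_child = child;
                        return skb;
                }
        }
        return NULL;
}

static struct sk_buff *prio_dequeue_sketch(struct Qdisc *sch)
{
        struct prio_sched_data *q = qdisc_priv(sch);
        struct Qdisc *child = q->peeked_child;

        if (child) {
                q->peeked_child = NULL;
                /* dequeue exactly what was peeked */
                return child->dequeue(child);
        }
        /* no peek since the last dequeue: normal band iteration */
        return prio_dequeue(sch);
}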
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-13 1:48 ` Herbert Xu
@ 2008-09-13 20:54 ` Jarek Poplawski
2008-09-14 6:16 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-13 20:54 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, kaber
On Fri, Sep 12, 2008 at 06:48:00PM -0700, Herbert Xu wrote:
> On Fri, Sep 12, 2008 at 06:40:08PM -0700, David Miller wrote:
> >
> > My current opinion is that both operations are equally difficult.
> > With the slight advantage for ->requeue() because all the complicated
> > logic is already implemented :-)
>
> I'd agree with you if you hadn't written the peek stuff :)
>
> Now that peek exists, the dequeue stuff would be a lot simpler
> than requeue because the only non-trivial logic would be in the
> leaf qdiscs. All the complex/classful qdiscs would be trivial
> as they'd just write down the child qdisc that was peeked and
> then call dequeue on that child.
If I get it right peek + dequeue should do all current dequeue logic
plus additionally write down the child qdisc or skb (leaves) info,
plus, probably, some ifs btw., which looks like a bit of overhead,
if we consider requeuing as something exceptional. Unless we don't -
then of course something like this could be useful.
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-13 20:54 ` Jarek Poplawski
@ 2008-09-14 6:16 ` Herbert Xu
2008-09-14 10:31 ` Alexander Duyck
` (2 more replies)
0 siblings, 3 replies; 209+ messages in thread
From: Herbert Xu @ 2008-09-14 6:16 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, kaber
On Sat, Sep 13, 2008 at 10:54:08PM +0200, Jarek Poplawski wrote:
>
> If I get it right peek + dequeue should do all current dequeue logic
> plus additionally write down the child qdisc or skb (leaves) info,
> plus, probably, some ifs btw., which looks like a bit of overhead,
> if we consider requeuing as something exceptional. Unless we don't -
> then of course something like this could be useful.
I don't see the overhead in writing down something that we already
have. In any case, do you have an alternative solution to the
current problem that qdisc_run looks at an arbitrary queue's
status to decide whether it should process a qdisc that empties
into n queues?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-14 6:16 ` Herbert Xu
@ 2008-09-14 10:31 ` Alexander Duyck
2008-09-14 21:43 ` Jarek Poplawski
2008-09-14 11:56 ` jamal
2008-09-14 20:27 ` Jarek Poplawski
2 siblings, 1 reply; 209+ messages in thread
From: Alexander Duyck @ 2008-09-14 10:31 UTC (permalink / raw)
To: Herbert Xu; +Cc: Jarek Poplawski, David Miller, netdev, kaber
On Sat, Sep 13, 2008 at 11:16 PM, Herbert Xu <herbert@gondor.apana.org.au>
wrote:
> On Sat, Sep 13, 2008 at 10:54:08PM +0200, Jarek Poplawski wrote:
>>
>> If I get it right peek + dequeue should do all current dequeue logic
>> plus additionally write down the child qdisc or skb (leaves) info,
>> plus, probably, some ifs btw., which looks like a bit of overhead,
>> if we consider requeuing as something exceptional. Unless we don't -
>> then of course something like this could be useful.
>
> I don't see the overhead in writing down something that we already
> have. In any case, do you have an alternative solution to the
> current problem that qdisc_run looks at an arbitrary queue's
> status to decide whether it should process a qdisc that empties
> into n queues?
What if we were to push the check for netif_tx_queue_stopped further down into
the qdisc layer so that the basic qdiscs such as pfifo_fast did their own peek
and in the event that a queue is stopped it just returned NULL and set a flag
bit? Basically this would mimic how we are currently handling throttled
queues (TCQ_F_THROTTLED). Then in turn each parent could do a check on skb ==
NULL and set the same flag, or they could act like multiq and just skip over
that leaf and move on to the next because it is stopped.
I might try putting together a patch for this on Monday, but in the meantime
here are a couple of code snippets to demonstrate what I am thinking. It seems
like this would provide most of what you are looking for, because the first
thing that happens in qdisc_restart() is that a packet is dequeued, and if that
fails the routine exits.
Thanks,
Alex
static inline struct sk_buff *__qdisc_dequeue_head(struct Qdisc *sch,
struct sk_buff_head *list)
{
struct sk_buff *skb = skb_peek(list);
struct netdev_queue *txq;
if (skb == NULL)
return NULL;
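/* don't unlink the skb if its hw tx queue can't take it; flag the qdisc instead */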
txq = netdev_get_tx_queue(sch->dev, skb_get_queue_mapping(skb));
if (netif_tx_queue_stopped(txq) || netif_tx_queue_frozen(txq)) {
sch->flags |= TCQ_F_STOPPED;
return NULL;
}
__skb_unlink(skb, list);
sch->qstats.backlog -= qdisc_pkt_len(skb);
sch->flags &= ~TCQ_F_STOPPED;
return skb;
}
static struct sk_buff *prio_dequeue(struct Qdisc* sch)
{
struct prio_sched_data *q = qdisc_priv(sch);
int prio;
for (prio = 0; prio < q->bands; prio++) {
struct Qdisc *qdisc = q->queues[prio];
struct sk_buff *skb = qdisc->dequeue(qdisc);
if (skb) {
sch->q.qlen--;
sch->flags &= ~TCQ_F_STOPPED;
return skb;
}
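/* a stopped band propagates TCQ_F_STOPPED up instead of trying lower bands */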
if (qdisc->flags & TCQ_F_STOPPED) {
sch->flags |= TCQ_F_STOPPED;
return NULL;
}
}
sch->flags &= ~TCQ_F_STOPPED;
return NULL;
}
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-14 6:16 ` Herbert Xu
2008-09-14 10:31 ` Alexander Duyck
@ 2008-09-14 11:56 ` jamal
2008-09-14 20:27 ` Jarek Poplawski
2 siblings, 0 replies; 209+ messages in thread
From: jamal @ 2008-09-14 11:56 UTC (permalink / raw)
To: Herbert Xu; +Cc: Jarek Poplawski, David Miller, netdev, kaber
On Sat, 2008-13-09 at 23:16 -0700, Herbert Xu wrote:
> In any case, do you have an alternative solution to the
> current problem that qdisc_run looks at an arbitrary queue's
> status to decide whether it should process a qdisc that empties
> into n queues?
What about something that would be cheaper than a peek - just check some
driver per-hwq variable? The variable is set at netif_stop and unset at
netif_wake, and should tell you whether the driver can in fact send if you
gave it a packet (way before you dequeue from the qdisc, assuming you know
which hardware queue it is going to).
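Roughly, as an illustration only (the 'hwq_stopped' bitmap is a
hypothetical driver field, not an existing API):

struct my_driver_priv {
        unsigned long hwq_stopped[1];   /* hypothetical per-hwq stop bitmap */
};

static inline bool my_hwq_can_xmit(struct my_driver_priv *priv, u16 hwq)
{
        /* set on netif_stop_subqueue(), cleared on netif_wake_subqueue() */
        return !test_bit(hwq, priv->hwq_stopped);
}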
cheers,
jamal
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-14 6:16 ` Herbert Xu
2008-09-14 10:31 ` Alexander Duyck
2008-09-14 11:56 ` jamal
@ 2008-09-14 20:27 ` Jarek Poplawski
2008-09-20 7:21 ` David Miller
2 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-14 20:27 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, kaber
On Sat, Sep 13, 2008 at 11:16:10PM -0700, Herbert Xu wrote:
> On Sat, Sep 13, 2008 at 10:54:08PM +0200, Jarek Poplawski wrote:
> >
> > If I get it right peek + dequeue should do all current dequeue logic
> > plus additionally write down the child qdisc or skb (leaves) info,
> > plus, probably, some ifs btw., which looks like a bit of overhead,
> > if we consider requeuing as something exceptional. Unless we don't -
> > then of course something like this could be useful.
>
> I don't see the overhead in writing down something that we already
> have. In any case, do you have an alternative solution to the
> current problem that qdisc_run looks at an arbitrary queue's
> status to decide whether it should process a qdisc that empties
> into n queues?
If it's only for this initial check I still think my earlier proposal
should be enough:
http://marc.info/?l=linux-netdev&m=122113717013988&w=2
Anyway, the main problem here was high cpu load despite a stopped
queue. Are you sure this peek, which is almost a full dequeue, can
really help with this? BTW, since after the current fix there were no
later complaints, I guess it's just about a full netif_stop or a non-mq
device.
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-14 10:31 ` Alexander Duyck
@ 2008-09-14 21:43 ` Jarek Poplawski
2008-09-14 22:13 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-14 21:43 UTC (permalink / raw)
To: Alexander Duyck; +Cc: Herbert Xu, David Miller, netdev, kaber
On Sun, Sep 14, 2008 at 03:31:35AM -0700, Alexander Duyck wrote:
...
> What if we were to push the check for netif_tx_queue_stopped further down into
> the qdisc layer so that the basic qdiscs such as pfifo_fast did their own peek
> and in the event that a queue is stopped it just returned NULL and set a flag
> bit? Basically this would mimic how we are currently handling throttled
> queues (TCQ_F_THROTTLED). Then in turn each parent could do a check on skb ==
> NULL and set the same flag, or they could act like multiq and just skip over
> that leaf and move on to the next because it is stopped.
IMHO it's a very interesting idea. Probably it could be better
evaluated if we had any stats on how often this cause of requeuing is
a problem with mq devices.
On the other hand, I wondered about the possibility of rehashing to other
queues in some cases during requeuing, which would be impossible after
such a change.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-14 21:43 ` Jarek Poplawski
@ 2008-09-14 22:13 ` Herbert Xu
2008-09-15 6:07 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-14 22:13 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Alexander Duyck, David Miller, netdev, kaber
On Sun, Sep 14, 2008 at 11:43:31PM +0200, Jarek Poplawski wrote:
>
> On the other hand, I wondered about the possibility of rehashing to other
> queues in some cases during requeuing, which would be impossible after
> such a change.
Why would you want to do that? Just because people have abused
requeue in the past doesn't mean that we need to support such
abuses in perpetuity.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-14 22:13 ` Herbert Xu
@ 2008-09-15 6:07 ` Jarek Poplawski
2008-09-15 6:19 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-15 6:07 UTC (permalink / raw)
To: Herbert Xu; +Cc: Alexander Duyck, David Miller, netdev, kaber
On Sun, Sep 14, 2008 at 03:13:41PM -0700, Herbert Xu wrote:
> On Sun, Sep 14, 2008 at 11:43:31PM +0200, Jarek Poplawski wrote:
> >
> > On the other hand, I wondered about the possibility of rehashing to other
> > queues in some cases during requeuing, which would be impossible after
> > such a change.
>
> Why would you want to do that? Just because people have abused
> requeue in the past doesn't mean that we need to support such
> abuses in perpetuity.
Well, I was only wondering, and probably you are right that this is wrong.
On the other hand, simple_tx_hash() choices are "probabilistic": the user
doesn't care if it goes through tx_queue #1 or #11. And here, in some
cases, some tx_queues could be always full while others are always empty,
so some dynamic rehashing could be thought of, but I understand it's
not trivial.
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-15 6:07 ` Jarek Poplawski
@ 2008-09-15 6:19 ` Herbert Xu
2008-09-15 7:20 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-15 6:19 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Alexander Duyck, David Miller, netdev, kaber
On Mon, Sep 15, 2008 at 06:07:58AM +0000, Jarek Poplawski wrote:
>
> Well, I was only wondering, and probably you are right that this is wrong.
> On the other hand, simple_tx_hash() choices are "probabilistic": the user
> doesn't care if it goes through tx_queue #1 or #11. And here, in some
> cases, some tx_queues could be always full while others are always empty,
> so some dynamic rehashing could be thought of, but I understand it's
> not trivial.
No, that would be totally wrong. One of the important constraints
on a TX hashing mechanism is to preserve packet ordering within
a flow. If simple_tx_hash started placing the same packet in
different queues, then it would do the same thing to flows, which
is unacceptable.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-15 6:19 ` Herbert Xu
@ 2008-09-15 7:20 ` Jarek Poplawski
2008-09-15 7:45 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-15 7:20 UTC (permalink / raw)
To: Herbert Xu; +Cc: Alexander Duyck, David Miller, netdev, kaber
On Sun, Sep 14, 2008 at 11:19:22PM -0700, Herbert Xu wrote:
> On Mon, Sep 15, 2008 at 06:07:58AM +0000, Jarek Poplawski wrote:
> >
> > Well, it was only wondering, and probably you are right this is wrong.
> > On the other hand, simple_tx_hash() choices are "probabilistic": user
> > doesn't care if it goes through tx_queue #1 or #11. And here, in some
> > cases, some tx_queues could be always full while other always empty,
> > so some dynamic rehashing could be thought of, but I understand it's
> > not trivial.
>
> No, that would be totally wrong. One of the important constraints
> on a TX hashing mechanism is to preserve packet ordering within
> a flow. If simple_tx_hash started placing the same packet in
> different queues, then it would do the same thing to flows, which
> is unacceptable.
Of course preserving flow consistency is a must here, but I think
there are rehashing algorithms used in similar cases (sch_sfq) which
take care of this. As a matter of fact, I've thought of requeuing as
the best place to detect possible problems, but now I see that
Alexander's proposal lets us do this simply by observing this
TCQ_F_STOPPED flag, so I withdraw my objection.
Cheers,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-15 7:20 ` Jarek Poplawski
@ 2008-09-15 7:45 ` Jarek Poplawski
2008-09-15 23:44 ` Duyck, Alexander H
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-15 7:45 UTC (permalink / raw)
To: Herbert Xu; +Cc: Alexander Duyck, David Miller, netdev, kaber
On Mon, Sep 15, 2008 at 07:20:08AM +0000, Jarek Poplawski wrote:
...
> Of course preserving flow consistency is a must here, but I think
> there are rehashing algorithms used in similar cases (sch_sfq) which
> take care of this. As a matter of fact, I've thought of requeuing as
> the best place to detect possible problems, but now I see that
> Alexander's proposal lets us do this simply by observing this
> TCQ_F_STOPPED flag [...]
Hmm... or maybe it doesn't? Since this is a qdisc flag, we don't know at
the top which tx_queue is the problem at the bottom...
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* RE: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-15 7:45 ` Jarek Poplawski
@ 2008-09-15 23:44 ` Duyck, Alexander H
2008-09-16 10:47 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Duyck, Alexander H @ 2008-09-15 23:44 UTC (permalink / raw)
To: Jarek Poplawski, Herbert Xu
Cc: Alexander Duyck, David Miller, netdev@vger.kernel.org,
kaber@trash.net
Jarek Poplawski wrote:
> On Mon, Sep 15, 2008 at 07:20:08AM +0000, Jarek Poplawski wrote:
> ...
>> Of course preserving flow consistency is a must here, but I think
>> there are rehashing algorithms used in similar cases (sch_sfq) which
>> take care of this. As a matter of fact, I've thought of requeuing as
>> the best place to detect possible problems, but now I see that
>> Alexander's proposal lets us do this simply by observing this
>> TCQ_F_STOPPED flag [...]
>
> Hmm... or maybe it doesn't? Since this is a qdisc flag, we don't know at
> the top which tx_queue is the problem at the bottom...
>
The problem is with the nature of a qdisc. For example, let's say we have a
prio qdisc with a packet in the lowest priority band that fails to be
transmitted due to a stopped subqueue. If we add an skb for a non-stopped
queue to a higher prio band, then the qdisc should no longer be stopped, since
we can dequeue from the ring and transmit. Thus, keeping a memory of which
queue is stopped may not be useful in a situation such as this.
The only thing I really prefer about my solution as opposed to the solution
Dave implemented was that it would mean only one dequeue instead of a peek
followed by a dequeue. I figure the important thing is to push the
discovery that we are stopped as early as possible in the process.
It will probably be a few days before I have a patch with my approach ready.
I didn't realize how complex it would be to resolve this issue for CBQ, HTB,
HFSC, etc. Also it is starting to look like I will probably need to implement
another function to support this since it seems like the dequeue operations
would need to be split into a multiqueue safe version, and a standard version
to support some workarounds like those found in qdisc_peek_len() for HFSC.
Thanks,
Alex
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-15 23:44 ` Duyck, Alexander H
@ 2008-09-16 10:47 ` Jarek Poplawski
2008-09-17 2:31 ` Alexander Duyck
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-16 10:47 UTC (permalink / raw)
To: Duyck, Alexander H
Cc: Herbert Xu, Alexander Duyck, David Miller, netdev@vger.kernel.org,
kaber@trash.net
On Mon, Sep 15, 2008 at 04:44:08PM -0700, Duyck, Alexander H wrote:
...
> The only thing I really prefer about my solution as opposed to the solution
> Dave implemented was that it would mean only one dequeue instead of a peek
> followed by a dequeue. I figure the important thing is to push the
> discovery that we are stopped as early as possible in the process.
>
> It will probably be a few days before I have a patch with my approach ready.
> I didn't realize how complex it would be to resolve this issue for CBQ, HTB,
> HFSC, etc. Also it is starting to look like I will probably need to implement
> another function to support this since it seems like the dequeue operations
> would need to be split into a multiqueue safe version, and a standard version
> to support some workarounds like those found in qdisc_peek_len() for HFSC.
Actually, looking at this HFSC now, I start to doubt we need to
complicate these things so much. If HFSC is OK with its simple
hfsc_requeue() I doubt other qdiscs need much more, and we should
reconsider David's idea to do the same on top, in dev_requeue_skb().
Qdiscs like multiq would probably never use this, and the
above-mentioned (not mq-optimized) qdiscs could be used with multiq if
needed. Then, it seems, it would be enough to improve multiq as a
"leaf", adding these dedicated operations and/or flags.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-16 10:47 ` Jarek Poplawski
@ 2008-09-17 2:31 ` Alexander Duyck
0 siblings, 0 replies; 209+ messages in thread
From: Alexander Duyck @ 2008-09-17 2:31 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Duyck, Alexander H, Herbert Xu, David Miller,
netdev@vger.kernel.org, kaber@trash.net
On Tue, Sep 16, 2008 at 3:47 AM, Jarek Poplawski <jarkao2@gmail.com> wrote:
> On Mon, Sep 15, 2008 at 04:44:08PM -0700, Duyck, Alexander H wrote:
> ...
>> The only thing I really prefer about my solution as opposed to the solution
>> Dave implemented was that it would mean only one dequeue instead of a peek
>> followed by a dequeue. I figure the important thing is to push the
>> discovery that we are stopped as early as possible in the process.
>>
>> It will probably be a few days before I have a patch with my approach ready.
>> I didn't realize how complex it would be to resolve this issue for CBQ, HTB,
>> HFSC, etc. Also it is starting to look like I will probably need to implement
>> another function to support this since it seems like the dequeue operations
>> would need to be split into a multiqueue safe version, and a standard version
>> to support some workarounds like those found in qdisc_peek_len() for HFSC.
>
> Actually, looking at this HFSC now, I start to doubt we need to
> complicate these things so much. If HFSC is OK with its simple
> hfsc_requeue() I doubt other qdiscs need much more, and we should
> reconsider David's idea to do the same on top, in dev_requeue_skb().
> Qdiscs like multiq would probably never use this, and the
> above-mentioned (not mq-optimized) qdiscs could be used with multiq if
> needed. Then, it seems, it would be enough to improve multiq as a
> "leaf", adding these dedicated operations and/or flags.
>
> Thanks,
> Jarek P.
>
I am just not convinced that the requeue approach will work very well. I am
just starting to test my patch today and the cpu savings were pretty significant
against the current configuration when using just the standard prio qdisc on a
multiqueue device.
I set up a simple test running a netperf UDP_STREAM test from my test system
to one of my clients, sending 1460 byte UDP messages at line rate on an 82575
with 4 tx queues enabled. The current dequeue/requeue approach used 2.5% cpu
whenever the test was run through queue 0, but if I ended up with packets going
out one of the other queues the cpu utilization would jump to ~12.5%. The same
test done using my patch showed ~2.5% for every queue I tested.
I will hopefully have the patch ready to submit for comments tomorrow. I just
need to run a few tests with the patch on versus the patch off to verify that
I didn't break any of the qdiscs and that there isn't any negative performance
impact.
Thanks,
Alex
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-14 20:27 ` Jarek Poplawski
@ 2008-09-20 7:21 ` David Miller
2008-09-20 7:25 ` Herbert Xu
2008-09-20 23:48 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: David Miller @ 2008-09-20 7:21 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, kaber
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 14 Sep 2008 22:27:15 +0200
[ I'm just picking one posting out of several in this thread. ]
> Anyway, the main problem here was high cpu load despite a stopped
> queue. Are you sure this peek, which is almost a full dequeue, can
> really help with this? BTW, since after the current fix there were no
> later complaints, I guess it's just about a full netif_stop or a non-mq
> device.
I think we are overengineering this situation.
Let's look at what actually matters for cpu utilization. These
__qdisc_run() things are invoked in two situations where we might
block on the hw queue being stopped:
1) When feeding packets into the qdisc in dev_queue_xmit().
Guess what? We _know_ the queue this packet is going to
hit.
The only new thing we can possibly trigger and be interested
in at this specific point is if _this_ packet can be sent at
this time.
And we can check that queue mapping after the qdisc_enqueue_root()
call, so that multiq aware qdiscs can have made their changes.
2) When waking up a queue. And here we should schedule the qdisc_run
_unconditionally_.
If the queue was full, it is extremely likely that new packets
are bound for that device queue. There are no real savings to
be had by doing this peek/requeue/dequeue stuff.
The cpu utilization savings exist for case #1 only, and we can
implement the bypass logic _perfectly_ as described above.
For #2 there is nothing to check, just do it and see what comes
out of the qdisc.
I would suggest adding an skb pointer argument to qdisc_run().
If it's NULL, unconditionally schedule __qdisc_run(). Else,
only schedule if the TX queue indicated by skb_get_queue_mapping()
is not stopped.
dev_queue_xmit() will use the "pass the skb" case, but only if
qdisc_enqueue_root()'s return value doesn't indicate that there
is a potential drop. On potential drop, we'll pass NULL to
make sure we don't potentially reference a free'd SKB.
The other case in net_tx_action() can always pass NULL to qdisc_run().
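In rough, untested sketch form, that suggestion might look like:

static inline void qdisc_run(struct Qdisc *q, struct sk_buff *skb)
{
        if (skb) {
                /* bypass if this packet's TX queue can't take it now */
                int map = skb_get_queue_mapping(skb);
                struct netdev_queue *txq;

                txq = netdev_get_tx_queue(qdisc_dev(q), map);
                if (netif_tx_queue_stopped(txq))
                        return;
        }
        if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
                __qdisc_run(q);
}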
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-20 7:21 ` David Miller
@ 2008-09-20 7:25 ` Herbert Xu
2008-09-20 7:28 ` David Miller
2008-09-20 23:48 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-20 7:25 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, kaber
On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
>
> The cpu utilization savings exist for case #1 only, and we can
> implement the bypass logic _perfectly_ as described above.
>
> For #2 there is nothing to check, just do it and see what comes
> out of the qdisc.
Your analysis sounds perfect to me. And I'm sure happy to see
this thread die as it's starting to take up a significant amount
of space in my mailbox :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-20 7:25 ` Herbert Xu
@ 2008-09-20 7:28 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-09-20 7:28 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, kaber
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 20 Sep 2008 16:25:42 +0900
> On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
> >
> > The cpu utilization savings exist for case #1 only, and we can
> > implement the bypass logic _perfectly_ as described above.
> >
> > For #2 there is nothing to check, just do it and see what comes
> > out of the qdisc.
>
> Your analysis sounds perfect to me. And I'm sure happy to see
> this thread die as it's starting to take up a significant amount
> of space in my mailbox :)
It's not really dead until someone writes a patch :-)
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-20 7:21 ` David Miller
2008-09-20 7:25 ` Herbert Xu
@ 2008-09-20 23:48 ` Jarek Poplawski
2008-09-21 5:35 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-20 23:48 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, kaber
On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
...
> Let's look at what actually matters for cpu utilization. These
> __qdisc_run() things are invoked in two situations where we might
> block on the hw queue being stopped:
>
> 1) When feeding packets into the qdisc in dev_queue_xmit().
>
> Guess what? We _know_ the queue this packet is going to
> hit.
>
> > > The only new thing we can possibly trigger and be interested
> in at this specific point is if _this_ packet can be sent at
> this time.
>
> And we can check that queue mapping after the qdisc_enqueue_root()
> call, so that multiq aware qdiscs can have made their changes.
>
> 2) When waking up a queue. And here we should schedule the qdisc_run
> _unconditionally_.
>
> If the queue was full, it is extremely likely that new packets
> > are bound for that device queue. There are no real savings to
> be had by doing this peek/requeue/dequeue stuff.
>
> The cpu utilization savings exist for case #1 only, and we can
> implement the bypass logic _perfectly_ as described above.
>
> For #2 there is nothing to check, just do it and see what comes
> out of the qdisc.
Right, unless __netif_schedule() wasn't done when waking up. I've
thought about this because of another thread/patch around this
problem, and got misled by dev_requeue_skb() scheduling. Now, I think
this could be the main reason for this high load. Anyway, if we want
to skip this check for #2 I think something like the patch below is
needed.
> I would suggest adding an skb pointer argument to qdisc_run().
> If it's NULL, unconditionally schedule __qdisc_run(). Else,
> only schedule if the TX queue indicated by skb_get_queue_mapping()
> is not stopped.
>
> dev_queue_xmit() will use the "pass the skb" case, but only if
> qdisc_enqueue_root()'s return value doesn't indicate that there
> is a potential drop. On potential drop, we'll pass NULL to
> make sure we don't potentially reference a free'd SKB.
>
> The other case in net_tx_action() can always pass NULL to qdisc_run().
I'm not convinced this #1 is useful for us: this could be an skb #1000
in a queue; the tx status could change many times before this packet
would be #1; why worry? This adds additional checks on the fast path
for something which is unlikely even if this skb would be #1, but for
any later skbs it's only a guess. IMHO, if we can't check for the next
skb to be xmitted it's better to skip this test entirely (which seems
to be safe with the patch below).
Jarek P.
--------------->
pkt_sched: dev_requeue_skb: Don't schedule if a queue is stopped
Doing __netif_schedule() while requeuing because of a stopped tx queue
and skipping such a test in qdisc_run() can cause a requeuing loop with
high cpu use until the queue is woken up.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
net/sched/sch_generic.c | 23 +++++++++++++++--------
1 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..bae2eb8 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -42,14 +42,17 @@ static inline int qdisc_qlen(struct Qdisc *q)
return q->q.qlen;
}
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q,
+ bool stopped)
{
if (unlikely(skb->next))
q->gso_skb = skb;
else
q->ops->requeue(skb, q);
- __netif_schedule(q);
+ if (!stopped)
+ __netif_schedule(q);
+
return 0;
}
@@ -89,7 +92,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
* some time.
*/
__get_cpu_var(netdev_rx_stat).cpu_collision++;
- ret = dev_requeue_skb(skb, q);
+ ret = dev_requeue_skb(skb, q, false);
}
return ret;
@@ -121,6 +124,7 @@ static inline int qdisc_restart(struct Qdisc *q)
struct net_device *dev;
spinlock_t *root_lock;
struct sk_buff *skb;
+ bool stopped;
/* Dequeue packet */
if (unlikely((skb = dequeue_skb(q)) == NULL))
@@ -135,9 +139,13 @@ static inline int qdisc_restart(struct Qdisc *q)
txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
HARD_TX_LOCK(dev, txq, smp_processor_id());
- if (!netif_tx_queue_stopped(txq) &&
- !netif_tx_queue_frozen(txq))
+ if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq)) {
ret = dev_hard_start_xmit(skb, dev, txq);
+ stopped = netif_tx_queue_stopped(txq) ||
+ netif_tx_queue_frozen(txq);
+ } else {
+ stopped = true;
+ }
HARD_TX_UNLOCK(dev, txq);
spin_lock(root_lock);
@@ -159,12 +167,11 @@ static inline int qdisc_restart(struct Qdisc *q)
printk(KERN_WARNING "BUG %s code %d qlen %d\n",
dev->name, ret, q->q.qlen);
- ret = dev_requeue_skb(skb, q);
+ ret = dev_requeue_skb(skb, q, stopped);
break;
}
- if (ret && (netif_tx_queue_stopped(txq) ||
- netif_tx_queue_frozen(txq)))
+ if (ret && stopped)
ret = 0;
return ret;
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-20 23:48 ` Jarek Poplawski
@ 2008-09-21 5:35 ` David Miller
2008-09-21 5:50 ` David Miller
2008-09-21 9:57 ` Jarek Poplawski
0 siblings, 2 replies; 209+ messages in thread
From: David Miller @ 2008-09-21 5:35 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, kaber
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 21 Sep 2008 01:48:43 +0200
> On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
> ...
> > Let's look at what actually matters for cpu utilization. These
> > __qdisc_run() things are invoked in two situations where we might
> > block on the hw queue being stopped:
> >
> > 1) When feeding packets into the qdisc in dev_queue_xmit().
...
> > 2) When waking up a queue. And here we should schedule the qdisc_run
> > _unconditionally_.
...
> > The cpu utilization savings exist for case #1 only, and we can
> > implement the bypass logic _perfectly_ as described above.
> >
> > For #2 there is nothing to check, just do it and see what comes
> > out of the qdisc.
>
> Right, unless __netif_schedule() wasn't done when waking up. I've
> thought about this because of another thread/patch around this
> problem, and got misled by dev_requeue_skb() scheduling. Now, I think
> this could be the main reason for this high load. Anyway, if we want
> to skip this check for #2 I think something like the patch below is
> needed.
Hmmm, looking at your patch....
It's only doing something new when the driver returns NETDEV_TX_BUSY
from ->hard_start_xmit().
That _never_ happens in any sane driver. That case is for buggy
devices that do not maintain their TX queue state properly. And
in fact it's a case for which I advocate we just drop the packet
instead of requeueing. :-)
Oh I see, you're concerned about the case where qdisc_restart() ends
up using the default initialization of the 'ret' variable.
Really, for the case where the driver actually returns NETDEV_TX_BUSY
we _do_ want to unconditionally __netif_schedule(), since the device
doesn't maintain its queue state in the normal way.
Therefore it seems logical that what really needs to happen is that
we simply pick some new local special token value for 'ret' so that
we can handle that case. "-1" would probably work fine.
So I'm dropping your patch.
I also think the qdisc_run() test needs to be there. When the TX
queue fills up, we will be doing tons of completely useless work going:
1) ->dequeue
2) qdisc unlock
3) TXQ lock
4) test state
5) TXQ unlock
6) qdisc lock
7) ->requeue
for EVERY SINGLE packet that is generated towards that device.
That has to be expensive, and I am still very much convinced that
this was the original regression cause that made me put that TXQ
state test back into qdisc_run().
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 5:35 ` David Miller
@ 2008-09-21 5:50 ` David Miller
2008-09-21 6:38 ` Herbert Xu
2008-09-21 15:25 ` Jarek Poplawski
2008-09-21 9:57 ` Jarek Poplawski
1 sibling, 2 replies; 209+ messages in thread
From: David Miller @ 2008-09-21 5:50 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, kaber
From: David Miller <davem@davemloft.net>
Date: Sat, 20 Sep 2008 22:35:38 -0700 (PDT)
> Therefore it seems logical that what really needs to happen is that
> we simply pick some new local special token value for 'ret' so that
> we can handle that case. "-1" would probably work fine.
...
> I also think the qdisc_run() test needs to be there. When the TX
> queue fills up, we will be doing tons of completely useless work going:
Ok, here is the kind of thing I'm suggesting in all of this.
It gets rid of bouncing unnecessarily into __qdisc_run() when
dev_queue_xmit()'s finally selected TXQ is stopped.
It also gets rid of the dev_requeue_skb() looping case Jarek
discovered.
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b786a5b..4082f39 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
static inline void qdisc_run(struct Qdisc *q)
{
- struct netdev_queue *txq = q->dev_queue;
-
- if (!netif_tx_queue_stopped(txq) &&
- !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+ if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
__qdisc_run(q);
}
diff --git a/net/core/dev.c b/net/core/dev.c
index fdfc4b6..4654127 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1809,7 +1809,15 @@ gso:
rc = NET_XMIT_DROP;
} else {
rc = qdisc_enqueue_root(skb, q);
- qdisc_run(q);
+
+ txq = NULL;
+ if (rc == NET_XMIT_SUCCESS) {
+ int map = skb_get_queue_mapping(skb);
+ txq = netdev_get_tx_queue(dev, map);
+ }
+
+ if (!txq || !netif_tx_queue_stopped(txq))
+ qdisc_run(q);
}
spin_unlock(root_lock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ec0a083..b6e6926 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -117,7 +117,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
static inline int qdisc_restart(struct Qdisc *q)
{
struct netdev_queue *txq;
- int ret = NETDEV_TX_BUSY;
+ int ret = -2;
struct net_device *dev;
spinlock_t *root_lock;
struct sk_buff *skb;
@@ -158,7 +158,7 @@ static inline int qdisc_restart(struct Qdisc *q)
if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
printk(KERN_WARNING "BUG %s code %d qlen %d\n",
dev->name, ret, q->q.qlen);
-
+ case -2:
ret = dev_requeue_skb(skb, q);
break;
}
^ permalink raw reply related [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 5:50 ` David Miller
@ 2008-09-21 6:38 ` Herbert Xu
2008-09-21 7:03 ` David Miller
2008-09-21 15:25 ` Jarek Poplawski
1 sibling, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-21 6:38 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, kaber, alexander.h.duyck
On Sat, Sep 20, 2008 at 10:50:33PM -0700, David Miller wrote:
>
> Ok, here is the kind of thing I'm suggesting in all of this.
>
> It gets rid of bouncing unnecessarily into __qdisc_run() when
> dev_queue_xmit()'s finally selected TXQ is stopped.
>
> It also gets rid of the dev_requeue_skb() looping case Jarek
> discovered.
Looks good to me.
However, I've changed my mind about letting this thread die :)
Let's go back to the basic requirements. I claim that there
are exactly two different purposes for which multiple TX queues are
useful: SMP scaling and QOS. In the first case we stuff different
flows into different queues to reduce contention between CPUs.
In the latter case we put packets of different priorities into
different queues in order to prevent a storm of lower priority
packets from starving higher priority ones that arrive later.
Despite the different motivations, these two scenarios have
one thing in common: we can structure it such that there is
a one-to-one correspondence between the qdisc/software queues
and the hardware queues. I know that this isn't currently the
case for prio but I'll get to that later in the message.
What I'm trying to say is that we shouldn't ever support cases
where a single qdisc empties into multiple TX queues. It just
doesn't make sense.
For example, if you were using a qdisc like TBF, multiple queues
buy you absolutely nothing. All they do is give you a longer queue
to stuff packets into, but you can already get that in software.
Why am I saying all this? It's because a lot of the complexity
in the current code comes from supporting the case of one qdisc
queue mapping onto n hardware queues. If we didn't do that then
handling stopped queues becomes trivial (or at least as easy as
it was before).
Put another way, it makes absolutely no sense to map packets
onto different queues after you've dequeued them from a single
qdisc queue. The mapping by hashing is for SMP scalability only
and if you've already gone through a single qdisc queue you can
stop worrying about it because it will suck on SMP :)
Going back to the case of prio, I think what we should do is to
create a separate qdisc queue for each band. The qdisc selection
should be done before the packet is queued, just as we do in the
TX hashing case.
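As a purely illustrative sketch (the band-to-queue mapping below is
made up; it only shows the selection happening before enqueue):

static u16 prio_select_queue(struct net_device *dev, struct sk_buff *skb)
{
        u16 band = skb->priority & TC_PRIO_MAX;

        /* one hardware queue per band, chosen before the packet is queued */
        return min_t(u16, band, dev->real_num_tx_queues - 1);
}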
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 6:38 ` Herbert Xu
@ 2008-09-21 7:03 ` David Miller
2008-09-23 6:23 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-21 7:03 UTC (permalink / raw)
To: herbert; +Cc: jarkao2, netdev, kaber, alexander.h.duyck
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 21 Sep 2008 15:38:21 +0900
> Going back to the case of prio, I think what we should do is to
> create a separate qdisc queue for each band. The qdisc selection
> should be done before the packet is queued, just as we do in the
> TX hashing case.
That's a very interesting idea.
This works if you want it at the root, but what if you only wanted to
prio at a leaf? I think that case has value too.
I also tend to disagree with another assertion mentioned: the one
that having a shared qdisc sucks on SMP. It doesn't. The TX queue
lock is held much longer than the qdisc lock.
The ->hard_start_xmit() TXQ lock has to be held while:
1) DMA mapping the SKB
2) building TX descriptors
3) doing at least one PIO to the hardware
These operations, each, can be on the order of few thousands of
cycles.
Whereas a qdisc dequeue is perhaps a few hundred, maybe on the order
of a thousand, except in very elaborate classful qdisc configs.
^ permalink raw reply [flat|nested] 209+ messages in thread
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 5:35 ` David Miller
2008-09-21 5:50 ` David Miller
@ 2008-09-21 9:57 ` Jarek Poplawski
2008-09-21 10:18 ` David Miller
1 sibling, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-21 9:57 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, kaber
On Sat, Sep 20, 2008 at 10:35:38PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 21 Sep 2008 01:48:43 +0200
>
> > On Sat, Sep 20, 2008 at 12:21:37AM -0700, David Miller wrote:
> > ...
> > > Let's look at what actually matters for cpu utilization. These
> > > __qdisc_run() things are invoked in two situations where we might
> > > block on the hw queue being stopped:
> > >
> > > 1) When feeding packets into the qdisc in dev_queue_xmit().
> ...
> > > 2) When waking up a queue. And here we should schedule the qdisc_run
> > > _unconditionally_.
> ...
> > > The cpu utilization savings exist for case #1 only, and we can
> > > implement the bypass logic _perfectly_ as described above.
> > >
> > > For #2 there is nothing to check, just do it and see what comes
> > > out of the qdisc.
> >
> > Right, unless __netif_schedule() wasn't done when waking up. I've
> > thought about this because of another thread/patch around this
> > problem, and got misled by dev_requeue_skb() scheduling. Now, I think
> > this could be the main reason for this high load. Anyway, if we want
> > to skip this check for #2 I think something like the patch below is
> > needed.
>
> Hmmm, looking at your patch....
>
> It's only doing something new when the driver returns NETDEV_TX_BUSY
> from ->hard_start_xmit().
>
> That _never_ happens in any sane driver. That case is for buggy
> devices that do not maintain their TX queue state properly. And
> in fact it's a case for which I advocate we just drop the packet
> instead of requeueing. :-)
OK, then let's do it! Why can't I see this in your new patch?
>
> Oh I see, you're concerned about the case where qdisc_restart() ends
> up using the default initialization of the 'ret' variable.
Yes, this is my main concern.
>
> Really, for the case where the driver actually returns NETDEV_TX_BUSY
> we _do_ want to unconditionally __netif_schedule(), since the device
> doesn't maintain its queue state in the normal way.
So, do you advocate both dropping the packet and unconditionally
doing __netif_schedule()?!
>
> Therefore it seems logical that what really needs to happen is that
> we simply pick some new local special token value for 'ret' so that
> we can handle that case. "-1" would probably work fine.
>
> So I'm dropping your patch.
>
> I also think the qdisc_run() test needs to be there. When the TX
> queue fills up, we will be doing tons of completely useless work going:
>
> 1) ->dequeue
> 2) qdisc unlock
> 3) TXQ lock
> 4) test state
> 5) TXQ unlock
> 6) qdisc lock
> 7) ->requeue
>
> for EVERY SINGLE packet that is generated towards that device.
>
> That has to be expensive,
I agree this useless work should be avoided, but only with a reliable
(and not too expensive) test. Your test might be done for the last
packet in the queue, while all the previous packets (and especially
the first one) saw a different queue state. This should work well
for single-queue devices and for multiqueues with dedicated qdiscs,
but is doubtful for multiqueues with one qdisc, where it actually
should be needed most, because of potentially complex multiclass
configs with this new problem of head-of-line blocking (Alexander's
main concern).
BTW, since this problem is strongly connected with the requeuing
policy, I wonder why you seemingly lost interest in this. I tried to
advocate for your simple, one-level requeuing, but Herbert's peek
and Alexander's early detection should also, after some polish(!),
make this initial test meaningless.
> and I am still very much convinced that
> this was the original regression cause that made me put that TXQ
> state test back into qdisc_run().
I doubt this: I've just looked at Andrew Gallatin's report, and
there is really a lot of net_tx_action, __netif_schedule, and guess
what: pfifo_fast_requeue in this oprofile...
Jarek P.
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 9:57 ` Jarek Poplawski
@ 2008-09-21 10:18 ` David Miller
2008-09-21 11:15 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-21 10:18 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, kaber
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 21 Sep 2008 11:57:06 +0200
> On Sat, Sep 20, 2008 at 10:35:38PM -0700, David Miller wrote:
> > That _never_ happens in any sane driver. That case is for buggy
> > devices that do not maintain their TX queue state properly. And
> > in fact it's a case for which I advocate we just drop the packet
> > instead of requeueing. :-)
>
> OK, then let's do it! Why can't I see this in your new patch?
I'm trying to address one thing at a time. I really want to
also encourage an audit of the drivers that trigger that condition,
and I fear if I put the packet drop in there it might not happen :)
> BTW, since this problem is strongly connected with the requeuing
> policy, I wonder why you seemingly lost interest in this. I tried to
> advocate for your simple, one-level requeuing, but Herbert's peek
> and Alexander's early detection should also, after some polish(!),
> make this initial test meaningless.
Yes, thanks for reminding me about the multiq qdisc head-of-line
blocking thing.
I really don't like the requeue/peek patches, because they resulted in
so much code duplication in the CBQ and other classful qdiscs.
Alexander's patch has similar code duplication issues.
Since I've seen the code duplication happen twice, I begin to suspect
we're attacking the implementation (not the idea) from the wrong
angle.
It might make review easier if we first attack the classful qdiscs and
restructure their internal implementation into separate "pick" and
"remove" operations. Of course, initially it'll just be that
->dequeue is implemented as pick+remove.
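A minimal standalone C sketch of what such a pick/remove split could
look like (hypothetical types and names, not the actual Qdisc_ops):

#include <stddef.h>

struct pkt {
	struct pkt *next;
};

struct fifo {
	struct pkt *head;
};

/* "pick": look at the head packet without removing it. */
static struct pkt *fifo_pick(struct fifo *q)
{
	return q->head;
}

/* "remove": commit to taking the picked packet out. */
static void fifo_remove(struct fifo *q)
{
	if (q->head)
		q->head = q->head->next;
}

/* Initially ->dequeue would just be pick + remove, as suggested. */
static struct pkt *fifo_dequeue(struct fifo *q)
{
	struct pkt *p = fifo_pick(q);

	if (p)
		fifo_remove(q);
	return p;
}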
On a similar note I think all of the ->requeue() uses can die
trivially except for the netem usage.
> > and I am still very much convinced that
> > this was the original regression cause that made me put that TXQ
> > state test back into qdisc_run().
>
> I doubt this: I've just looked at Andrew Gallatin's report, and
> there is really a lot of net_tx_action, __netif_schedule, and guess
> what: pfifo_fast_requeue in this oprofile...
I see.
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 10:18 ` David Miller
@ 2008-09-21 11:15 ` Jarek Poplawski
2008-09-23 5:16 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-21 11:15 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, kaber
On Sun, Sep 21, 2008 at 03:18:29AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 21 Sep 2008 11:57:06 +0200
...
> > BTW, since this problem is strongly connected with the requeuing
> > policy, I wonder why you seemingly lost interest in this. I tried to
> > advocate for your simple, one-level requeuing, but Herbert's peek
> > and Alexander's early detection should also, after some polish(!),
> > make this initial test meaningless.
>
> Yes, thanks for reminding me about the the multiq qdisc head of line
> blocking thing.
>
> I really don't like the requeue/peek patches, because they resulted in
> so much code duplication in the CBQ and other classful qdiscs.
>
> Alexander's patch has similar code duplication issues.
That's why I think you should reconsider this simple solution for now,
until somebody proves this is wrong or something else is better.
Jarek P.
[RESEND]
From: Jarek Poplawski <jarkao2@gmail.com>
Newsgroups: gmane.linux.network
Subject: Re: [RFC PATCH] sched: only dequeue if packet can be queued to
hardware queue.
Date: Fri, 19 Sep 2008 09:17:53 +0000
Archived-At: <http://permalink.gmane.org/gmane.linux.network/106324>
On Thu, Sep 18, 2008 at 06:11:45PM -0700, Alexander Duyck wrote:
> Once again if you have a suggestion on approach feel free to modify
> the patch and see how it works for you. My only concern is that there
> are several qdiscs which won't give you the same packet twice and so
> you don't know what is going to pop out until you go in and check.
...
Actually, here is my suggestion: I think that your obviously more
complex solution should be compared with something simple which IMHO
has a similar feature, i.e. limited overhead of requeuing.
My proposal is to partly use David's idea of killing ->ops->requeue(),
for now only the first two patches, so we don't requeue back to the
qdiscs, but leave qdisc->ops->requeue() for internal usage (like in
HFSC's qdisc_peek_len() hack). Additionally, I use your idea of
checking the state of the tx queue early to make this even lighter
after possibly removing the entry check from qdisc_run().
I attach my patch at the end, after David's two original patches.
Thanks,
Jarek P.
---------------> patch 1/3
Subject: [PATCH 1/9]: pkt_sched: Make qdisc->gso_skb a list.
Date: Mon, 18 Aug 2008 01:36:47 -0700 (PDT)
From: David Miller <davem@davemloft.net>
To: netdev@vger.kernel.org
CC: jarkao2@gmail.com
Newsgroups: gmane.linux.network
pkt_sched: Make qdisc->gso_skb a list.
The idea is that we can use this to get rid of
->requeue().
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/sch_generic.h | 2 +-
net/sched/sch_generic.c | 12 +++++++-----
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 84d25f2..140c48b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -52,7 +52,7 @@ struct Qdisc
u32 parent;
atomic_t refcnt;
unsigned long state;
- struct sk_buff *gso_skb;
+ struct sk_buff_head requeue;
struct sk_buff_head q;
struct netdev_queue *dev_queue;
struct Qdisc *next_sched;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6f96b7b..39d969e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -45,7 +45,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
{
if (unlikely(skb->next))
- q->gso_skb = skb;
+ __skb_queue_head(&q->requeue, skb);
else
q->ops->requeue(skb, q);
@@ -57,9 +57,8 @@ static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
struct sk_buff *skb;
- if ((skb = q->gso_skb))
- q->gso_skb = NULL;
- else
+ skb = __skb_dequeue(&q->requeue);
+ if (!skb)
skb = q->dequeue(q);
return skb;
@@ -328,6 +327,7 @@ struct Qdisc noop_qdisc = {
.flags = TCQ_F_BUILTIN,
.ops = &noop_qdisc_ops,
.list = LIST_HEAD_INIT(noop_qdisc.list),
+ .requeue.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
.q.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
.dev_queue = &noop_netdev_queue,
};
@@ -353,6 +353,7 @@ static struct Qdisc noqueue_qdisc = {
.flags = TCQ_F_BUILTIN,
.ops = &noqueue_qdisc_ops,
.list = LIST_HEAD_INIT(noqueue_qdisc.list),
+ .requeue.lock = __SPIN_LOCK_UNLOCKED(noqueue_qdisc.q.lock),
.q.lock = __SPIN_LOCK_UNLOCKED(noqueue_qdisc.q.lock),
.dev_queue = &noqueue_netdev_queue,
};
@@ -473,6 +474,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
sch->padded = (char *) sch - (char *) p;
INIT_LIST_HEAD(&sch->list);
+ skb_queue_head_init(&sch->requeue);
skb_queue_head_init(&sch->q);
sch->ops = ops;
sch->enqueue = ops->enqueue;
@@ -543,7 +545,7 @@ void qdisc_destroy(struct Qdisc *qdisc)
module_put(ops->owner);
dev_put(qdisc_dev(qdisc));
- kfree_skb(qdisc->gso_skb);
+ __skb_queue_purge(&qdisc->requeue);
kfree((char *) qdisc - qdisc->padded);
}
-------------> patch 2/3
Subject: [PATCH 2/9]: pkt_sched: Always use q->requeue in dev_requeue_skb().
Date: Mon, 18 Aug 2008 01:36:50 -0700 (PDT)
From: David Miller <davem@davemloft.net>
To: netdev@vger.kernel.org
CC: jarkao2@gmail.com
Newsgroups: gmane.linux.network
pkt_sched: Always use q->requeue in dev_requeue_skb().
There is no reason to call into the complicated qdiscs
just to remember the last SKB where we found the device
blocked.
The SKB is outside of the qdiscs' realm at this point.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/sched/sch_generic.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 39d969e..96d7d08 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,10 +44,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
{
- if (unlikely(skb->next))
- __skb_queue_head(&q->requeue, skb);
- else
- q->ops->requeue(skb, q);
+ __skb_queue_head(&q->requeue, skb);
__netif_schedule(q);
return 0;
----------------> patch 3/3
pkt_sched: Check the state of tx_queue in dequeue_skb()
Check the state of the tx queue in dequeue_skb() for a requeued skb,
to save on locking and re-requeuing, and to possibly allow removing
the current check in qdisc_run(). Based on an idea by Alexander Duyck.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
diff -Nurp net-next-2.6-/net/sched/sch_generic.c net-next-2.6+/net/sched/sch_generic.c
--- net-next-2.6-/net/sched/sch_generic.c 2008-09-19 07:21:44.000000000 +0000
+++ net-next-2.6+/net/sched/sch_generic.c 2008-09-19 07:49:15.000000000 +0000
@@ -52,11 +52,21 @@ static inline int dev_requeue_skb(struct
static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
- struct sk_buff *skb;
+ struct sk_buff *skb = skb_peek(&q->requeue);
- skb = __skb_dequeue(&q->requeue);
- if (!skb)
+ if (unlikely(skb)) {
+ struct net_device *dev = qdisc_dev(q);
+ struct netdev_queue *txq;
+
+ /* check the reason of requeuing without tx lock first */
+ txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
+ if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq))
+ __skb_unlink(skb, &q->requeue);
+ else
+ skb = NULL;
+ } else {
skb = q->dequeue(q);
+ }
return skb;
}
--
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 5:50 ` David Miller
2008-09-21 6:38 ` Herbert Xu
@ 2008-09-21 15:25 ` Jarek Poplawski
1 sibling, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-21 15:25 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, kaber
On Sat, Sep 20, 2008 at 10:50:33PM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 20 Sep 2008 22:35:38 -0700 (PDT)
>
> > Therefore it seems logical that what really needs to happen is that
> > we simply pick some new local special token value for 'ret' so that
> > we can handle that case. "-1" would probably work fine.
> ...
> > I also think the qdisc_run() test needs to be there. When the TX
> > queue fills up, we will be doing tons of completely useless work going:
>
> Ok, here is the kind of thing I'm suggesting in all of this.
>
> It gets rid of bouncing unnecessarily into __qdisc_run() when
> dev_queue_xmit()'s finally selected TXQ is stopped.
>
> It also gets rid of the dev_requeue_skb() looping case Jarek
> discovered.
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index b786a5b..4082f39 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
>
> static inline void qdisc_run(struct Qdisc *q)
> {
> - struct netdev_queue *txq = q->dev_queue;
> -
> - if (!netif_tx_queue_stopped(txq) &&
> - !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> + if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> __qdisc_run(q);
> }
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fdfc4b6..4654127 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1809,7 +1809,15 @@ gso:
> rc = NET_XMIT_DROP;
> } else {
> rc = qdisc_enqueue_root(skb, q);
> - qdisc_run(q);
> +
> + txq = NULL;
> + if (rc == NET_XMIT_SUCCESS) {
> + int map = skb_get_queue_mapping(skb);
Deja vu? How about e.g. blackhole_enqueue()?
> + txq = netdev_get_tx_queue(dev, map);
> + }
> +
> + if (!txq || !netif_tx_queue_stopped(txq))
> + qdisc_run(q);
> }
> spin_unlock(root_lock);
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index ec0a083..b6e6926 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -117,7 +117,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
> static inline int qdisc_restart(struct Qdisc *q)
> {
> struct netdev_queue *txq;
> - int ret = NETDEV_TX_BUSY;
> + int ret = -2;
> struct net_device *dev;
> spinlock_t *root_lock;
> struct sk_buff *skb;
> @@ -158,7 +158,7 @@ static inline int qdisc_restart(struct Qdisc *q)
> if (unlikely (ret != NETDEV_TX_BUSY && net_ratelimit()))
> printk(KERN_WARNING "BUG %s code %d qlen %d\n",
> dev->name, ret, q->q.qlen);
> -
> + case -2:
> ret = dev_requeue_skb(skb, q);
Hmm... I really can't see any difference - except getting rid of this
if sometimes.
> break;
> }
Jarek P.
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 11:15 ` Jarek Poplawski
@ 2008-09-23 5:16 ` David Miller
2008-09-23 8:02 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: David Miller @ 2008-09-23 5:16 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, kaber
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 21 Sep 2008 13:15:51 +0200
> That's why I think you should reconsider this simple solution for now,
> until somebody proves this is wrong or something else is better.
Ok, that sounds reasonable. I've added those three patches to net-next-2.6
and will push those out after some build tests.
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-21 7:03 ` David Miller
@ 2008-09-23 6:23 ` Herbert Xu
2008-09-24 7:15 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-23 6:23 UTC (permalink / raw)
To: David Miller; +Cc: jarkao2, netdev, kaber, alexander.h.duyck
On Sun, Sep 21, 2008 at 12:03:01AM -0700, David Miller wrote:
>
> This works if you want it at the root, but what if you only wanted to
> prio at a leaf? I think that case has value too.
Good question :)
I think what we should do is to pass some token that represents
the TX queue that's being run down into the dequeue function.
Then each qdisc can decide which child to recursively dequeue
based on that token (or ignore it for non-prio qdiscs such as
HTB). When the token reaches the leaf then we have two cases:
1) A prio-like qdisc that has separate queues based on priorities.
In this case we dequeue the respective queue based on the token.
2) Any other qdisc. We dequeue the first packet that hashes
into the queue given by the token. Ideally these qdiscs should
have separate queues already so that this would be trivial.
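For illustration, a standalone C sketch of the token idea, under the
assumption that the token is simply the index of the TX queue being
serviced; all structures and names below are hypothetical:

#include <stddef.h>

struct pkt;

struct qdisc {
	struct pkt *(*dequeue)(struct qdisc *q, unsigned int txq_token);
	struct qdisc *child[4];		/* e.g. one child per prio band */
	unsigned int nr_children;	/* assumed non-zero */
};

/* A prio-like qdisc: the token names the band, so recurse into the
 * child that feeds this TX queue. A non-prio qdisc could ignore the
 * token and pick a child by its own criteria. */
static struct pkt *prio_dequeue(struct qdisc *q, unsigned int txq_token)
{
	struct qdisc *child = q->child[txq_token % q->nr_children];

	return child->dequeue(child, txq_token);
}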
> I also tend to disagree with another assertion mentioned above, the
> one where having a shared qdisc sucks on SMP. It doesn't. The TX
> queue lock is held much longer than the qdisc lock.
Yes I was exaggerating :)
However, after answering your question above I'm even more convinced
that we should be separating the traffic at the point of enqueueing,
and not after we dequeue it in qdisc_run.
The only reason to do the separation after dequeueing would be to
allow the TX queue selection to change in the meantime. However,
since I see absolutely no reason why we'd need that, it's just so
much simpler to separate them at qdisc_enqueue, and actually have
the same number of software queues as there are hardware queues.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-23 5:16 ` David Miller
@ 2008-09-23 8:02 ` Jarek Poplawski
2008-09-23 8:06 ` David Miller
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-23 8:02 UTC (permalink / raw)
To: David Miller; +Cc: herbert, netdev, kaber
On Mon, Sep 22, 2008 at 10:16:58PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 21 Sep 2008 13:15:51 +0200
>
> > That's why I think you should reconsider this simple solution for now,
> > until somebody proves this is wrong or something else is better.
>
> Ok, that sounds reasonable. I've added those three patches to net-next-2.6
> and will push those out after some build tests.
OK, then we have to say B and try all of this. BTW, I guess after
this change we could see a similar effect as reported by Alexander
Duyck while testing his solution for this problem, namely a higher
drop rate in some cases, which I can only explain as: less time in
requeuing means more time for new enqueuing. Of course, if I'm right,
this "bug" should rather be "fixed" with longer queues or some other
throttle mechanism.
Thanks,
Jarek P.
-------------------->
pkt_sched: Remove the tx queue state check in qdisc_run()
The current check wrongly uses the state of one (currently the first)
tx queue for all tx queues in the case of non-default qdiscs. This
check mainly prevented a requeuing loop with __netif_schedule(), but
that is now controlled inside __qdisc_run(), while dequeuing. The
wrongness of this check was first noticed by Herbert Xu.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
include/net/pkt_sched.h | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index b786a5b..4082f39 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -90,10 +90,7 @@ extern void __qdisc_run(struct Qdisc *q);
static inline void qdisc_run(struct Qdisc *q)
{
- struct netdev_queue *txq = q->dev_queue;
-
- if (!netif_tx_queue_stopped(txq) &&
- !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+ if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
__qdisc_run(q);
}
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-23 8:02 ` Jarek Poplawski
@ 2008-09-23 8:06 ` David Miller
0 siblings, 0 replies; 209+ messages in thread
From: David Miller @ 2008-09-23 8:06 UTC (permalink / raw)
To: jarkao2; +Cc: herbert, netdev, kaber
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 23 Sep 2008 08:02:40 +0000
> OK, then we have to say B and try all of this. BTW, I guess after
> this change we could see a similar effect as reported by Alexander
> Duyck while testing his solution for this problem, namely a higher
> drop rate in some cases, which I can only explain as: less time in
> requeuing means more time for new enqueuing. Of course, if I'm right,
> this "bug" should rather be "fixed" with longer queues or some other
> throttle mechanism.
...
> pkt_sched: Remove the tx queue state check in qdisc_run()
>
> The current check wrongly uses the state of one (currently the first)
> tx queue for all tx queues in the case of non-default qdiscs. This
> check mainly prevented a requeuing loop with __netif_schedule(), but
> that is now controlled inside __qdisc_run(), while dequeuing. The
> wrongness of this check was first noticed by Herbert Xu.
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Agreed and applied, thanks Jarek.
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-23 6:23 ` Herbert Xu
@ 2008-09-24 7:15 ` Jarek Poplawski
2008-09-24 8:04 ` Herbert Xu
0 siblings, 1 reply; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-24 7:15 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, kaber, alexander.h.duyck
On Tue, Sep 23, 2008 at 02:23:33PM +0800, Herbert Xu wrote:
> On Sun, Sep 21, 2008 at 12:03:01AM -0700, David Miller wrote:
> >
> > This works if you want it at the root, but what if you only wanted to
> > prio at a leaf? I think that case has value too.
>
> Good question :)
>
> I think what we should do is to pass some token that represents
> the TX queue that's being run down into the dequeue function.
>
> Then each qdisc can decide which child to recursively dequeue
> based on that token (or ignore it for non-prio qdiscs such as
> HTB).
I don't think HTB could be considered a non-prio qdisc.
> When the token reaches the leaf then we have two cases:
> 1) A prio-like qdisc that has separate queues based on priorities.
> In this case we dequeue the respective queue based on the token.
As a matter of fact, I can't figure out this idea of a prio at the
root or leaf either. Could you explain at which point you expect the
gain? If it's about the locks, what kind of synchronization would be
used to ensure packets from lower prio queues (or qdiscs?) aren't
sent to free tx queues while higher prio ones wait on stopped ones?
Thanks,
Jarek P.
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-24 7:15 ` Jarek Poplawski
@ 2008-09-24 8:04 ` Herbert Xu
2008-09-24 8:28 ` Jarek Poplawski
0 siblings, 1 reply; 209+ messages in thread
From: Herbert Xu @ 2008-09-24 8:04 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: David Miller, netdev, kaber, alexander.h.duyck
On Wed, Sep 24, 2008 at 07:15:21AM +0000, Jarek Poplawski wrote:
>
> > Then each qdisc can decide which child to recursively dequeue
> > based on that token (or ignore it for non-prio qdiscs such as
> > HTB).
>
> I don't think HTB could be considered a non-prio qdisc.
It is non-prio in the sense that it has other criteria for deciding
which child qdisc to enqueue into.
> As a matter of fact, I can't figure out this idea of a prio at the
> root or leaf either. Could you explain at which point you expect the
> gain? If it's about the locks, what kind of synchronization would be
> used to ensure packets from lower prio queues (or qdiscs?) aren't
> sent to free tx queues while higher prio ones wait on stopped ones?
It's very simple really. For a non-leaf prio you determine which
child qdisc to enqueue into using the priority. For a leaf prio
you determine which software queue to enqueue into based on the
priority.
To put it another way, what I'm saying is that instead of duplicating
the qdiscs as we do now for pfifo_fast, we should make the leaf
qdiscs duplicate their software queues to match the hardware queues
they're feeding into.
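A minimal standalone C model of that layout - one software FIFO per
hardware TX queue inside a single leaf qdisc; the names and fields
here are purely illustrative:

#include <stddef.h>

struct pkt {
	struct pkt *next;
	unsigned int hash;	/* or a priority, for a prio-like leaf */
};

struct fifo {
	struct pkt *head;
	struct pkt *tail;
};

struct leaf_qdisc {
	struct fifo *swq;	/* nr_txq software queues */
	unsigned int nr_txq;	/* matches the hardware queue count */
};

/* Enqueue picks the software queue feeding this packet's TX queue. */
static void leaf_enqueue(struct leaf_qdisc *q, struct pkt *p)
{
	struct fifo *f = &q->swq[p->hash % q->nr_txq];

	p->next = NULL;
	if (f->tail)
		f->tail->next = p;
	else
		f->head = p;
	f->tail = p;
}

/* Dequeue drains only the software queue matching the TX queue token,
 * so a stopped hardware queue never blocks the other queues. */
static struct pkt *leaf_dequeue(struct leaf_qdisc *q, unsigned int txq)
{
	struct fifo *f = &q->swq[txq % q->nr_txq];
	struct pkt *p = f->head;

	if (p) {
		f->head = p->next;
		if (!f->head)
			f->tail = NULL;
	}
	return p;
}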
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH take 2] pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race
2008-09-24 8:04 ` Herbert Xu
@ 2008-09-24 8:28 ` Jarek Poplawski
0 siblings, 0 replies; 209+ messages in thread
From: Jarek Poplawski @ 2008-09-24 8:28 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, netdev, kaber, alexander.h.duyck
On Wed, Sep 24, 2008 at 04:04:27PM +0800, Herbert Xu wrote:
> On Wed, Sep 24, 2008 at 07:15:21AM +0000, Jarek Poplawski wrote:
> >
> > > Then each qdisc can decide which child to recursively dequeue
> > > based on that token (or ignore it for non-prio qdiscs such as
> > > HTB).
> >
> > I don't think HTB could be considered a non-prio qdisc.
>
> It is non-prio in the sense that it has other criteria for deciding
> which child qdisc to enqueue into.
>
> > As a matter of fact, I can't figure out this idea of a prio at the
> > root or leaf either. Could you explain at which point you expect the
> > gain? If it's about the locks, what kind of synchronization would be
> > used to ensure packets from lower prio queues (or qdiscs?) aren't
> > sent to free tx queues while higher prio ones wait on stopped ones?
>
> It's very simple really. For a non-leaf prio you determine which
> child qdisc to enqueue into using the priority. For a leaf prio
> you determine which software queue to enqueue into based on the
> priority.
OK, it's too simple then. Could you make this more complex and
show me the gain?
> To put it another way, what I'm saying is that instead of duplicating
> the qdiscs as we do now for pfifo_fast, we should make the leaf
> qdiscs duplicate their software queues to match the hardware queues
> they're feeding into.
It looks like sch_multiq, but you probably mean something else...
Cheers,
Jarek P.