* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-09 22:53 ` David Miller
@ 2014-03-09 23:17 ` Ben Hutchings
2014-03-09 23:28 ` David Lang
2014-03-10 0:07 ` Stanislav Meduna
2014-03-31 21:49 ` [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb Thomas Gleixner
2 siblings, 1 reply; 18+ messages in thread
From: Ben Hutchings @ 2014-03-09 23:17 UTC (permalink / raw)
To: David Miller; +Cc: mkl, linux-rt-users, linux-kernel, netdev, kernel
[-- Attachment #1: Type: text/plain, Size: 2688 bytes --]
On Sun, 2014-03-09 at 18:53 -0400, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Sun, 09 Mar 2014 19:09:20 +0000
>
> > On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
> >> From: Marc Kleine-Budde <mkl@pengutronix.de>
> >> Date: Wed, 5 Mar 2014 00:49:47 +0100
> >>
> >> > @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
> >> > /* Wait for outstanding qdisc_run calls. */
> >> > list_for_each_entry(dev, head, unreg_list)
> >> > while (some_qdisc_is_busy(dev))
> >> > - yield();
> >> > + msleep(1)
> >> > }
> >>
> >> I don't understand this.
> >>
> >> yield() should really _mean_ yield.
> >>
> >> The intent of a yield() call, like this one here, is unambiguously
> >> that the current thread cannot do anything until some other thread
> >> gets onto the cpu and makes forward progress.
> >>
> >> Therefore it should allow lower priority threads to run, not just
> >> equal or higher priority ones.
> >
> > Until when?
> >
> > yield() is not a sensible operation in a preemptive multitasking system,
> > regardless of RT.
>
> To me it means "I've got nothing to do if other tasks want to run right
> now" Yes, I even see it having this meaning when an RT task executes
> it.
>
> How else can you interpret the intent above?
The problem is that 'I've got nothing to do ... now' is information
about a *point* in time, which gives the scheduler no clue as to when
this task might be ready again. So far as task *state* goes, it never
ceases to be ready.
> If you change it to msleep(1), you're assigning an extra completely
> arbitrary time limit to the yield. The code doesn't want to sleep
> for 1ms, that's not what it's asking for.
[...]
I think you want to give up a 'time slice' to any task that's available.
But that's not a meaningful concept for all schedulers. If your task is
highest priority and is ready, it must run. You could drop priority
temporarily, but then you need it to somehow be bumped up again at some
time in the future. Well, sleeping effectively does that.
I do understand that unconditionally sleeping also isn't ideal - the
task that unblocks this one might already be running on another CPU and
able to finish sooner than the sleep timeout.
Maybe the answer is a yield_for(time) which sleeps for the given time or
until there's an idle CPU, whichever is sooner. But I don't know how
hard that would be to implement or how widely useful it would be.
Ben.
--
Ben Hutchings
I say we take off; nuke the site from orbit. It's the only way to be sure.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-09 23:17 ` Ben Hutchings
@ 2014-03-09 23:28 ` David Lang
0 siblings, 0 replies; 18+ messages in thread
From: David Lang @ 2014-03-09 23:28 UTC (permalink / raw)
To: Ben Hutchings
Cc: David Miller, mkl, linux-rt-users, linux-kernel, netdev, kernel
On Sun, 9 Mar 2014, Ben Hutchings wrote:
> On Sun, 2014-03-09 at 18:53 -0400, David Miller wrote:
>> From: Ben Hutchings <ben@decadent.org.uk>
>> Date: Sun, 09 Mar 2014 19:09:20 +0000
>>
>>> On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
>>>> From: Marc Kleine-Budde <mkl@pengutronix.de>
>>>> Date: Wed, 5 Mar 2014 00:49:47 +0100
>>>>
>>>>> @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
>>>>> /* Wait for outstanding qdisc_run calls. */
>>>>> list_for_each_entry(dev, head, unreg_list)
>>>>> while (some_qdisc_is_busy(dev))
>>>>> - yield();
>>>>> + msleep(1)
>>>>> }
>>>>
>>>> I don't understand this.
>>>>
>>>> yield() should really _mean_ yield.
>>>>
>>>> The intent of a yield() call, like this one here, is unambiguously
>>>> that the current thread cannot do anything until some other thread
>>>> gets onto the cpu and makes forward progress.
>>>>
>>>> Therefore it should allow lower priority threads to run, not just
>>>> equal or higher priority ones.
>>>
>>> Until when?
>>>
>>> yield() is not a sensible operation in a preemptive multitasking system,
>>> regardless of RT.
>>
>> To me it means "I've got nothing to do if other tasks want to run right
>> now" Yes, I even see it having this meaning when an RT task executes
>> it.
>>
>> How else can you interpret the intent above?
>
> The problem is that 'I've got nothing to do ... now' is information
> about a *point* in time, which gives the scheduler no clue as to when
> this task might be ready again. So far as task *state* goes, it never
> ceases to be ready.
>
>> If you change it to msleep(1), you're assigning an extra completely
>> arbitrary time limit to the yield. The code doesn't want to sleep
>> for 1ms, that's not what it's asking for.
> [...]
>
> I think you want to give up a 'time slice' to any task that's available.
> But that's not a meaningful concept for all schedulers. If your task is
> highest priority and is ready, it must run. You could drop priority
> temporarily, but then you need it to somehow be bumped up again at some
> time in the future. Well, sleeping effectively does that.
>
> I do understand that unconditionally sleeping also isn't ideal - the
> task that unblocks this one might already be running on another CPU and
> able to finish sooner than the sleep timeout.
>
> Maybe the answer is a yield_for(time) which sleeps for the given time or
> until there's an idle CPU, whichever is sooner. But I don't know how
> hard that would be to implement or how widely useful it would be.
what is msleep(0) defined to do? would it be any better?
David Lang
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-09 22:53 ` David Miller
2014-03-09 23:17 ` Ben Hutchings
@ 2014-03-10 0:07 ` Stanislav Meduna
2014-03-31 21:49 ` [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb Thomas Gleixner
2 siblings, 0 replies; 18+ messages in thread
From: Stanislav Meduna @ 2014-03-10 0:07 UTC (permalink / raw)
To: David Miller, ben; +Cc: mkl, linux-rt-users, linux-kernel, netdev, kernel
On 09.03.2014 23:53, David Miller wrote:
> To me it means "I've got nothing to do if other tasks want to run right
> now" Yes, I even see it having this meaning when an RT task executes
> it.
http://www.kernel.org/doc/htmldocs/device-drivers/API-yield.html
lists this exact "while (!event) yield;" pattern as a broken usage
and states "Never use yield as a progress guarantee!!".
> How else can you interpret the intent above?
IMNSHO there is no way to make the yield() honor this intent if it is
called from a SCHED_FIFO or SCHED_RR task and the task unblocking
it is running at a lower priority. On the PREEMPT_RT systems where
the interrupt handlers are threads an application thread running
at higher priority than some interrupt handlers is a common situation.
The semantics of the scheduler here is "run the highest priority
runnable task until it blocks (FIFO) or its time slice is over
and there are more with the _same_ priority (RR)". This scenario
then collapses into a busy loop never making progress.
> If you change it to msleep(1), you're assigning an extra completely
> arbitrary time limit to the yield. The code doesn't want to sleep
> for 1ms, that's not what it's asking for.
The problem is that there is no way to formulate what it is asking
for in scheduler terms only.
Regards
--
Stano
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-03-09 22:53 ` David Miller
2014-03-09 23:17 ` Ben Hutchings
2014-03-10 0:07 ` Stanislav Meduna
@ 2014-03-31 21:49 ` Thomas Gleixner
2014-04-02 11:07 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2014-03-31 21:49 UTC (permalink / raw)
To: David Miller
Cc: ben, mkl, linux-rt-users, LKML, netdev, kernel, Peter Zijlstra
On Sun, 9 Mar 2014, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Sun, 09 Mar 2014 19:09:20 +0000
>
> > On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
> >> From: Marc Kleine-Budde <mkl@pengutronix.de>
> >> Date: Wed, 5 Mar 2014 00:49:47 +0100
> >>
> >> > @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
> >> > /* Wait for outstanding qdisc_run calls. */
> >> > list_for_each_entry(dev, head, unreg_list)
> >> > while (some_qdisc_is_busy(dev))
> >> > - yield();
> >> > + msleep(1)
> >> > }
> >>
> >> I don't understand this.
> >>
> >> yield() should really _mean_ yield.
> >>
> >> The intent of a yield() call, like this one here, is unambiguously
> >> that the current thread cannot do anything until some other thread
> >> gets onto the cpu and makes forward progress.
> >>
> >> Therefore it should allow lower priority threads to run, not just
> >> equal or higher priority ones.
> >
> > Until when?
> >
> > yield() is not a sensible operation in a preemptive multitasking system,
> > regardless of RT.
>
> To me it means "I've got nothing to do if other tasks want to run right
> now" Yes, I even see it having this meaning when an RT task executes
> it.
>
> How else can you interpret the intent above?
The problem is that the semantics of yield() are a complete disaster.
yield() only works by some definiton of "works" when the involved
parties are in the same scheduling class.
You can yield from one SCHED_OTHER task to another SCHED_OTHER task,
but yield() does not guarantee any progress. It might work for some
scenarios of yield from one SCHED_FIFO task to another SCHED_FIFO task
even better than in the SCHED_OTHER case, but there is no guarantee
that it works at all.
But yielding across scheduling classes cannot work ever. Not even in
mainline.
That's not a RT sepcific issue, it's just one of these problems which
are unearthed by RT because it does not follow the bog standard
scheduling class and priority assignements of your favourite distro
setup.
Once you tune a few knobs on a mainline kernel, you can provoke the
same issue. It's that simple.
yield() should die both in kernel and user space.
Thanks,
tglx
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-03-31 21:49 ` [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb Thomas Gleixner
@ 2014-04-02 11:07 ` Peter Zijlstra
2014-04-02 11:17 ` Eric Dumazet
0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-02 11:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: David Miller, ben, mkl, linux-rt-users, LKML, netdev, kernel
On Mon, Mar 31, 2014 at 11:49:16PM +0200, Thomas Gleixner wrote:
> > >> The intent of a yield() call, like this one here, is unambiguously
> > >> that the current thread cannot do anything until some other thread
> > >> gets onto the cpu and makes forward progress.
Yeah; like Thomas said; that's not what yield() does -- or can do.
yield() only says 'I don't know what to do, you figure it out', but then
fails to provide enough information to actually do something sensible.
It cannot put the task to sleep; because it doesn't pair with a wakeup;
and therefore the task stays an eligible/runnable task from a scheduler
pov.
The scheduler is therefore entirely in its right to pick this task
again; and pretty much has to under many circumstances.
yield() does not, and can not, guarantee forward progress - ever.
Use wait_event() and assorted bits to wait for an actual event. That is
a sleep paired with a wakeup and thus has progress guarantees.
Also; see the comment near yield:
/**
* yield - yield the current processor to other threads.
*
* Do not ever use this function, there's a 99% chance you're doing it wrong.
*
* The scheduler is at all times free to pick the calling task as the most
* eligible task to run, if removing the yield() call from your code breaks
* it, its already broken.
*
* Typical broken usage is:
*
* while (!event)
* yield();
*
* where one assumes that yield() will let 'the other' process run that will
* make event true. If the current task is a SCHED_FIFO task that will never
* happen. Never use yield() as a progress guarantee!!
*
* If you want to use yield() to wait for something, use wait_event().
* If you want to use yield() to be 'nice' for others, use cond_resched().
* If you still want to use yield(), do not!
*/
void __sched yield(void)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-02 11:07 ` Peter Zijlstra
@ 2014-04-02 11:17 ` Eric Dumazet
2014-04-04 15:19 ` Peter Zijlstra
0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2014-04-02 11:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, David Miller, ben, mkl, linux-rt-users, LKML,
netdev, kernel
On Wed, 2014-04-02 at 13:07 +0200, Peter Zijlstra wrote:
> On Mon, Mar 31, 2014 at 11:49:16PM +0200, Thomas Gleixner wrote:
> > > >> The intent of a yield() call, like this one here, is unambiguously
> > > >> that the current thread cannot do anything until some other thread
> > > >> gets onto the cpu and makes forward progress.
>
> Yeah; like Thomas said; that's not what yield() does -- or can do.
>
> yield() only says 'I don't know what to do, you figure it out', but then
> fails to provide enough information to actually do something sensible.
>
> It cannot put the task to sleep; because it doesn't pair with a wakeup;
> and therefore the task stays an eligible/runnable task from a scheduler
> pov.
>
> The scheduler is therefore entirely in its right to pick this task
> again; and pretty much has to under many circumstances.
>
> yield() does not, and can not, guarantee forward progress - ever.
>
> Use wait_event() and assorted bits to wait for an actual event. That is
> a sleep paired with a wakeup and thus has progress guarantees.
Why wouldn't it be safe to redefine yield as msleep(1) ?
net/ipv4/tcp_output.c seems to also use yield().
/* Socket is locked, keep trying until memory is available. */
for (;;) {
skb = alloc_skb_fclone(MAX_TCP_HEADER,
sk->sk_allocation);
if (skb)
break;
yield();
}
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-02 11:17 ` Eric Dumazet
@ 2014-04-04 15:19 ` Peter Zijlstra
2014-04-04 15:26 ` David Miller
2014-04-04 15:28 ` David Miller
0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-04 15:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Gleixner, David Miller, ben, mkl, linux-rt-users, LKML,
netdev, kernel
On Wed, Apr 02, 2014 at 04:17:39AM -0700, Eric Dumazet wrote:
> On Wed, 2014-04-02 at 13:07 +0200, Peter Zijlstra wrote:
> > On Mon, Mar 31, 2014 at 11:49:16PM +0200, Thomas Gleixner wrote:
> > > > >> The intent of a yield() call, like this one here, is unambiguously
> > > > >> that the current thread cannot do anything until some other thread
> > > > >> gets onto the cpu and makes forward progress.
> >
> > Yeah; like Thomas said; that's not what yield() does -- or can do.
> >
> > yield() only says 'I don't know what to do, you figure it out', but then
> > fails to provide enough information to actually do something sensible.
> >
> > It cannot put the task to sleep; because it doesn't pair with a wakeup;
> > and therefore the task stays an eligible/runnable task from a scheduler
> > pov.
> >
> > The scheduler is therefore entirely in its right to pick this task
> > again; and pretty much has to under many circumstances.
> >
> > yield() does not, and can not, guarantee forward progress - ever.
> >
> > Use wait_event() and assorted bits to wait for an actual event. That is
> > a sleep paired with a wakeup and thus has progress guarantees.
>
> Why wouldn't it be safe to redefine yield as msleep(1) ?
Because it violates what POSIX says what yield() should do. There is a
semi valid use-case for yield between RR/FIFO tasks of the same
priority. Changing yield to do msleep(1) would break that.
The kernel actually used to rely on that particular semantics; but I
think we took that out because it was fragile as heck.
> net/ipv4/tcp_output.c seems to also use yield().
>
>
> /* Socket is locked, keep trying until memory is available. */
> for (;;) {
> skb = alloc_skb_fclone(MAX_TCP_HEADER,
> sk->sk_allocation);
> if (skb)
> break;
> yield();
> }
Yeah, that's crap code too.
Furthermore, changing the network code to use msleep(1) instead of
yield() doesn't make it less crap than it was, just functional crap
instead of broken crap.
The proper way to fix the dev_deactivate_many() is to use wait_event(),
polling for that state is just daft. Afaict there is no reason the qdisc
code could not do a wakeup whenever that condition changes.
And the above loop in tcp_output() is a plain memory deadlock, you
should not loop on allocations like that. If the allocation fails;
propagate the error.
Given that this function must not fail; one should pre-allocate proper
reserves, not simply loop until it works. The loop might be holding up
the work required to release memory.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:19 ` Peter Zijlstra
@ 2014-04-04 15:26 ` David Miller
2014-04-07 11:19 ` Peter Zijlstra
2014-04-04 15:28 ` David Miller
1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2014-04-04 15:26 UTC (permalink / raw)
To: peterz
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 4 Apr 2014 17:19:42 +0200
> The proper way to fix the dev_deactivate_many() is to use wait_event(),
> polling for that state is just daft. Afaict there is no reason the qdisc
> code could not do a wakeup whenever that condition changes.
I actually looked into this, and it's going to add expensive checks
to the fast paths of packet processing.
If it was so easy we'd be doing it that way already.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:26 ` David Miller
@ 2014-04-07 11:19 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-07 11:19 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
On Fri, Apr 04, 2014 at 11:26:28AM -0400, David Miller wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 4 Apr 2014 17:19:42 +0200
>
> > The proper way to fix the dev_deactivate_many() is to use wait_event(),
> > polling for that state is just daft. Afaict there is no reason the qdisc
> > code could not do a wakeup whenever that condition changes.
>
> I actually looked into this, and it's going to add expensive checks
> to the fast paths of packet processing.
>
> If it was so easy we'd be doing it that way already.
Fair enough; and I suppose this waiting side is 'rare' compared to the
packet processing side of things?
In that case you could add a comment with the msleep() and just leave it
at that I suppose.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:19 ` Peter Zijlstra
2014-04-04 15:26 ` David Miller
@ 2014-04-04 15:28 ` David Miller
2014-04-07 11:24 ` Peter Zijlstra
1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2014-04-04 15:28 UTC (permalink / raw)
To: peterz
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 4 Apr 2014 17:19:42 +0200
> And the above loop in tcp_output() is a plain memory deadlock, you
> should not loop on allocations like that. If the allocation fails;
> propagate the error.
There is nothing to "propagate" it to. We have no "event" that will
trigger so that we have a second chance to send the FIN out, it really
must go out before we progress any further at this stage.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:28 ` David Miller
@ 2014-04-07 11:24 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-07 11:24 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
On Fri, Apr 04, 2014 at 11:28:50AM -0400, David Miller wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 4 Apr 2014 17:19:42 +0200
>
> > And the above loop in tcp_output() is a plain memory deadlock, you
> > should not loop on allocations like that. If the allocation fails;
> > propagate the error.
>
> There is nothing to "propagate" it to. We have no "event" that will
> trigger so that we have a second chance to send the FIN out, it really
> must go out before we progress any further at this stage.
>From what I remember of TCP/IP the FIN packet is used once per
connection; to terminate said connection. So if we pre-allocate one per
connection, at establishment time (where we can still easily fail) we
should be good and safe.
Now, I suppose there's valid arguments against this -- wasted memory
being one I suspect.
Also, seeing how this code has basically been this way forever and has
not caused (afaik) any actual memory deadlocks (they're hard anyway). We
might just get away with not touching it.
That said, at the very least we want a comment here stating that this is
a potential memory deadlock and why the alternatives aren't any good, so
that future readers (possibly us again) know this has been considered.
^ permalink raw reply [flat|nested] 18+ messages in thread