* [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
@ 2014-03-04 23:49 Marc Kleine-Budde
2014-03-06 21:06 ` David Miller
0 siblings, 1 reply; 18+ messages in thread
From: Marc Kleine-Budde @ 2014-03-04 23:49 UTC (permalink / raw)
To: linux-rt-users; +Cc: linux-kernel, netdev, kernel, Marc Kleine-Budde
On PREEMPT_RT enabled systems the interrupt handler run as threads at prio 50
(by default). If a high priority userspace process tries to shut down a busy
network interface it might spin in a yield loop waiting for the device to
become idle. With the interrupt thread having a lower priority than the
looping process it might never be scheduled and so result in a deadlock on UP
systems.
With Magic SysRq the following backtrace can be produced:
> test_app R running 0 174 168 0x00000000
> [<c02c7070>] (__schedule+0x220/0x3fc) from [<c02c7870>] (preempt_schedule_irq+0x48/0x80)
> [<c02c7870>] (preempt_schedule_irq+0x48/0x80) from [<c0008fa8>] (svc_preempt+0x8/0x20)
> [<c0008fa8>] (svc_preempt+0x8/0x20) from [<c001a984>] (local_bh_enable+0x18/0x88)
> [<c001a984>] (local_bh_enable+0x18/0x88) from [<c025316c>] (dev_deactivate_many+0x220/0x264)
> [<c025316c>] (dev_deactivate_many+0x220/0x264) from [<c023be04>] (__dev_close_many+0x64/0xd4)
> [<c023be04>] (__dev_close_many+0x64/0xd4) from [<c023be9c>] (__dev_close+0x28/0x3c)
> [<c023be9c>] (__dev_close+0x28/0x3c) from [<c023f7f0>] (__dev_change_flags+0x88/0x130)
> [<c023f7f0>] (__dev_change_flags+0x88/0x130) from [<c023f904>] (dev_change_flags+0x10/0x48)
> [<c023f904>] (dev_change_flags+0x10/0x48) from [<c024c140>] (do_setlink+0x370/0x7ec)
> [<c024c140>] (do_setlink+0x370/0x7ec) from [<c024d2f0>] (rtnl_newlink+0x2b4/0x450)
> [<c024d2f0>] (rtnl_newlink+0x2b4/0x450) from [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4)
> [<c024cfa0>] (rtnetlink_rcv_msg+0x158/0x1f4) from [<c0256740>] (netlink_rcv_skb+0xac/0xc0)
> [<c0256740>] (netlink_rcv_skb+0xac/0xc0) from [<c024bbd8>] (rtnetlink_rcv+0x18/0x24)
> [<c024bbd8>] (rtnetlink_rcv+0x18/0x24) from [<c02561b8>] (netlink_unicast+0x13c/0x198)
> [<c02561b8>] (netlink_unicast+0x13c/0x198) from [<c025651c>] (netlink_sendmsg+0x264/0x2e0)
> [<c025651c>] (netlink_sendmsg+0x264/0x2e0) from [<c022af98>] (sock_sendmsg+0x78/0x98)
> [<c022af98>] (sock_sendmsg+0x78/0x98) from [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278)
> [<c022bb50>] (___sys_sendmsg.part.25+0x268/0x278) from [<c022cf08>] (__sys_sendmsg+0x48/0x78)
> [<c022cf08>] (__sys_sendmsg+0x48/0x78) from [<c0009320>] (ret_fast_syscall+0x0/0x2c)
This patch works around the problem by replacing yield() by msleep(1), giving
the interrupt thread time to finish, similar to other changes contained in the
rt patch set. Using wait_for_completion() instead would probably be a better
solution.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
Hello,
this patch is intended for -rt, as the problem probably doesn't occur on non rt
systems. However, I put netdev on Cc, as they probably can come up with a
proper solution.
regards,
Marc
net/sched/sch_generic.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index a7f838b..d5b00ad 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
/* Wait for outstanding qdisc_run calls. */
list_for_each_entry(dev, head, unreg_list)
while (some_qdisc_is_busy(dev))
- yield();
+ msleep(1)
}
void dev_deactivate(struct net_device *dev)
--
1.9.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-04 23:49 [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls Marc Kleine-Budde
@ 2014-03-06 21:06 ` David Miller
2014-03-06 21:39 ` Marc Kleine-Budde
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: David Miller @ 2014-03-06 21:06 UTC (permalink / raw)
To: mkl; +Cc: linux-rt-users, linux-kernel, netdev, kernel
From: Marc Kleine-Budde <mkl@pengutronix.de>
Date: Wed, 5 Mar 2014 00:49:47 +0100
> @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
> /* Wait for outstanding qdisc_run calls. */
> list_for_each_entry(dev, head, unreg_list)
> while (some_qdisc_is_busy(dev))
> - yield();
> + msleep(1)
> }
I don't understand this.
yield() should really _mean_ yield.
The intent of a yield() call, like this one here, is unambiguously
that the current thread cannot do anything until some other thread
gets onto the cpu and makes forward progress.
Therefore it should allow lower priority threads to run, not just
equal or higher priority ones.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-06 21:06 ` David Miller
@ 2014-03-06 21:39 ` Marc Kleine-Budde
2014-03-07 15:47 ` Sebastian Andrzej Siewior
2014-03-07 4:26 ` Mike Galbraith
2014-03-09 19:09 ` Ben Hutchings
2 siblings, 1 reply; 18+ messages in thread
From: Marc Kleine-Budde @ 2014-03-06 21:39 UTC (permalink / raw)
To: David Miller; +Cc: linux-rt-users, linux-kernel, netdev, kernel
[-- Attachment #1: Type: text/plain, Size: 1735 bytes --]
On 03/06/2014 10:06 PM, David Miller wrote:
> From: Marc Kleine-Budde <mkl@pengutronix.de>
> Date: Wed, 5 Mar 2014 00:49:47 +0100
>
>> @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
>> /* Wait for outstanding qdisc_run calls. */
>> list_for_each_entry(dev, head, unreg_list)
>> while (some_qdisc_is_busy(dev))
>> - yield();
>> + msleep(1)
>> }
>
> I don't understand this.
>
> yield() should really _mean_ yield.
>
> The intent of a yield() call, like this one here, is unambiguously
> that the current thread cannot do anything until some other thread
> gets onto the cpu and makes forward progress.
>
> Therefore it should allow lower priority threads to run, not just
> equal or higher priority ones.
Yes, we need a call that does what you described, however I'm not sure
if yield() really does that. According to:
http://lxr.free-electrons.com/source/kernel/sched/core.c#L3599
> * Typical broken usage is:
> *
> * while (!event)
> * yield();
> *
> * where one assumes that yield() will let 'the other' process run that will
> * make event true. If the current task is a SCHED_FIFO task that will never
> * happen. Never use yield() as a progress guarantee!!
My Process runs with SCHED_FIFO and prio > 50, with IRQ at default prio,
which is 50.
Maybe the RT guys can comment on this. I found another interesting
function in the RT patch set: cpu_chill().
Marc
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 242 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-06 21:06 ` David Miller
2014-03-06 21:39 ` Marc Kleine-Budde
@ 2014-03-07 4:26 ` Mike Galbraith
2014-03-09 19:09 ` Ben Hutchings
2 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2014-03-07 4:26 UTC (permalink / raw)
To: David Miller; +Cc: mkl, linux-rt-users, linux-kernel, netdev, kernel
On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
> From: Marc Kleine-Budde <mkl@pengutronix.de>
> Date: Wed, 5 Mar 2014 00:49:47 +0100
>
> > @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
> > /* Wait for outstanding qdisc_run calls. */
> > list_for_each_entry(dev, head, unreg_list)
> > while (some_qdisc_is_busy(dev))
> > - yield();
> > + msleep(1)
> > }
>
> I don't understand this.
>
> yield() should really _mean_ yield.
It does, but yield() semantics make it useless for what you want to do..
and pretty much undefined for anything other than SCHED_FIFO. If you
really want to give up the CPU to any old body, you have to sleep.
-Mike
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-06 21:39 ` Marc Kleine-Budde
@ 2014-03-07 15:47 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 18+ messages in thread
From: Sebastian Andrzej Siewior @ 2014-03-07 15:47 UTC (permalink / raw)
To: Marc Kleine-Budde
Cc: David Miller, linux-rt-users, linux-kernel, netdev, kernel
* Marc Kleine-Budde | 2014-03-06 22:39:58 [+0100]:
>> Therefore it should allow lower priority threads to run, not just
>> equal or higher priority ones.
>
>Yes, we need a call that does what you described, however I'm not sure
>if yield() really does that. According to:
>
>http://lxr.free-electrons.com/source/kernel/sched/core.c#L3599
>
>> * Typical broken usage is:
>> *
>> * while (!event)
>> * yield();
>> *
>> * where one assumes that yield() will let 'the other' process run that will
>> * make event true. If the current task is a SCHED_FIFO task that will never
>> * happen. Never use yield() as a progress guarantee!!
>
>My Process runs with SCHED_FIFO and prio > 50, with IRQ at default prio,
>which is 50.
>
>Maybe the RT guys can comment on this. I found another interesting
>function in the RT patch set: cpu_chill().
If you boot mainline without -RT, use threadirqs, start your application
do the same prio thing then you should end up with exactly the same
outcome. Please say so :)
msleep() is safe as long as it is used outside of the softirq. Nice that
you found cpu_chill() but on non-RT it turns to cpu_relax() and you do
not want this here.
wait_event() would be nice in the end to have. For now I take that patch
for -RT.
>Marc
Sebastian
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-06 21:06 ` David Miller
2014-03-06 21:39 ` Marc Kleine-Budde
2014-03-07 4:26 ` Mike Galbraith
@ 2014-03-09 19:09 ` Ben Hutchings
2014-03-09 22:53 ` David Miller
2 siblings, 1 reply; 18+ messages in thread
From: Ben Hutchings @ 2014-03-09 19:09 UTC (permalink / raw)
To: David Miller; +Cc: mkl, linux-rt-users, linux-kernel, netdev, kernel
[-- Attachment #1: Type: text/plain, Size: 1055 bytes --]
On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
> From: Marc Kleine-Budde <mkl@pengutronix.de>
> Date: Wed, 5 Mar 2014 00:49:47 +0100
>
> > @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
> > /* Wait for outstanding qdisc_run calls. */
> > list_for_each_entry(dev, head, unreg_list)
> > while (some_qdisc_is_busy(dev))
> > - yield();
> > + msleep(1)
> > }
>
> I don't understand this.
>
> yield() should really _mean_ yield.
>
> The intent of a yield() call, like this one here, is unambiguously
> that the current thread cannot do anything until some other thread
> gets onto the cpu and makes forward progress.
>
> Therefore it should allow lower priority threads to run, not just
> equal or higher priority ones.
Until when?
yield() is not a sensible operation in a preemptive multitasking system,
regardless of RT.
Ben.
--
Ben Hutchings
I say we take off; nuke the site from orbit. It's the only way to be sure.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-09 19:09 ` Ben Hutchings
@ 2014-03-09 22:53 ` David Miller
2014-03-09 23:17 ` Ben Hutchings
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: David Miller @ 2014-03-09 22:53 UTC (permalink / raw)
To: ben; +Cc: mkl, linux-rt-users, linux-kernel, netdev, kernel
From: Ben Hutchings <ben@decadent.org.uk>
Date: Sun, 09 Mar 2014 19:09:20 +0000
> On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
>> From: Marc Kleine-Budde <mkl@pengutronix.de>
>> Date: Wed, 5 Mar 2014 00:49:47 +0100
>>
>> > @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
>> > /* Wait for outstanding qdisc_run calls. */
>> > list_for_each_entry(dev, head, unreg_list)
>> > while (some_qdisc_is_busy(dev))
>> > - yield();
>> > + msleep(1)
>> > }
>>
>> I don't understand this.
>>
>> yield() should really _mean_ yield.
>>
>> The intent of a yield() call, like this one here, is unambiguously
>> that the current thread cannot do anything until some other thread
>> gets onto the cpu and makes forward progress.
>>
>> Therefore it should allow lower priority threads to run, not just
>> equal or higher priority ones.
>
> Until when?
>
> yield() is not a sensible operation in a preemptive multitasking system,
> regardless of RT.
To me it means "I've got nothing to do if other tasks want to run right
now" Yes, I even see it having this meaning when an RT task executes
it.
How else can you interpret the intent above?
If you change it to msleep(1), you're assigning an extra completely
arbitrary time limit to the yield. The code doesn't want to sleep
for 1ms, that's not what it's asking for.
On the other hand, I do completely agree with other replies stating
that it would be better if we found a way for this code to wait on
something explicitly via a wait queue.
Unfortunately, that's undesirable from another perspective, in that it
would probably require making the fast paths of the qdiscs more
expensive.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-09 22:53 ` David Miller
@ 2014-03-09 23:17 ` Ben Hutchings
2014-03-09 23:28 ` David Lang
2014-03-10 0:07 ` Stanislav Meduna
2014-03-31 21:49 ` [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb Thomas Gleixner
2 siblings, 1 reply; 18+ messages in thread
From: Ben Hutchings @ 2014-03-09 23:17 UTC (permalink / raw)
To: David Miller; +Cc: mkl, linux-rt-users, linux-kernel, netdev, kernel
[-- Attachment #1: Type: text/plain, Size: 2688 bytes --]
On Sun, 2014-03-09 at 18:53 -0400, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Sun, 09 Mar 2014 19:09:20 +0000
>
> > On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
> >> From: Marc Kleine-Budde <mkl@pengutronix.de>
> >> Date: Wed, 5 Mar 2014 00:49:47 +0100
> >>
> >> > @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
> >> > /* Wait for outstanding qdisc_run calls. */
> >> > list_for_each_entry(dev, head, unreg_list)
> >> > while (some_qdisc_is_busy(dev))
> >> > - yield();
> >> > + msleep(1)
> >> > }
> >>
> >> I don't understand this.
> >>
> >> yield() should really _mean_ yield.
> >>
> >> The intent of a yield() call, like this one here, is unambiguously
> >> that the current thread cannot do anything until some other thread
> >> gets onto the cpu and makes forward progress.
> >>
> >> Therefore it should allow lower priority threads to run, not just
> >> equal or higher priority ones.
> >
> > Until when?
> >
> > yield() is not a sensible operation in a preemptive multitasking system,
> > regardless of RT.
>
> To me it means "I've got nothing to do if other tasks want to run right
> now" Yes, I even see it having this meaning when an RT task executes
> it.
>
> How else can you interpret the intent above?
The problem is that 'I've got nothing to do ... now' is information
about a *point* in time, which gives the scheduler no clue as to when
this task might be ready again. So far as task *state* goes, it never
ceases to be ready.
> If you change it to msleep(1), you're assigning an extra completely
> arbitrary time limit to the yield. The code doesn't want to sleep
> for 1ms, that's not what it's asking for.
[...]
I think you want to give up a 'time slice' to any task that's available.
But that's not a meaningful concept for all schedulers. If your task is
highest priority and is ready, it must run. You could drop priority
temporarily, but then you need it to somehow be bumped up again at some
time in the future. Well, sleeping effectively does that.
I do understand that unconditionally sleeping also isn't ideal - the
task that unblocks this one might already be running on another CPU and
able to finish sooner than the sleep timeout.
Maybe the answer is a yield_for(time) which sleeps for the given time or
until there's an idle CPU, whichever is sooner. But I don't know how
hard that would be to implement or how widely useful it would be.
Ben.
--
Ben Hutchings
I say we take off; nuke the site from orbit. It's the only way to be sure.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-09 23:17 ` Ben Hutchings
@ 2014-03-09 23:28 ` David Lang
0 siblings, 0 replies; 18+ messages in thread
From: David Lang @ 2014-03-09 23:28 UTC (permalink / raw)
To: Ben Hutchings
Cc: David Miller, mkl, linux-rt-users, linux-kernel, netdev, kernel
On Sun, 9 Mar 2014, Ben Hutchings wrote:
> On Sun, 2014-03-09 at 18:53 -0400, David Miller wrote:
>> From: Ben Hutchings <ben@decadent.org.uk>
>> Date: Sun, 09 Mar 2014 19:09:20 +0000
>>
>>> On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
>>>> From: Marc Kleine-Budde <mkl@pengutronix.de>
>>>> Date: Wed, 5 Mar 2014 00:49:47 +0100
>>>>
>>>>> @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
>>>>> /* Wait for outstanding qdisc_run calls. */
>>>>> list_for_each_entry(dev, head, unreg_list)
>>>>> while (some_qdisc_is_busy(dev))
>>>>> - yield();
>>>>> + msleep(1)
>>>>> }
>>>>
>>>> I don't understand this.
>>>>
>>>> yield() should really _mean_ yield.
>>>>
>>>> The intent of a yield() call, like this one here, is unambiguously
>>>> that the current thread cannot do anything until some other thread
>>>> gets onto the cpu and makes forward progress.
>>>>
>>>> Therefore it should allow lower priority threads to run, not just
>>>> equal or higher priority ones.
>>>
>>> Until when?
>>>
>>> yield() is not a sensible operation in a preemptive multitasking system,
>>> regardless of RT.
>>
>> To me it means "I've got nothing to do if other tasks want to run right
>> now" Yes, I even see it having this meaning when an RT task executes
>> it.
>>
>> How else can you interpret the intent above?
>
> The problem is that 'I've got nothing to do ... now' is information
> about a *point* in time, which gives the scheduler no clue as to when
> this task might be ready again. So far as task *state* goes, it never
> ceases to be ready.
>
>> If you change it to msleep(1), you're assigning an extra completely
>> arbitrary time limit to the yield. The code doesn't want to sleep
>> for 1ms, that's not what it's asking for.
> [...]
>
> I think you want to give up a 'time slice' to any task that's available.
> But that's not a meaningful concept for all schedulers. If your task is
> highest priority and is ready, it must run. You could drop priority
> temporarily, but then you need it to somehow be bumped up again at some
> time in the future. Well, sleeping effectively does that.
>
> I do understand that unconditionally sleeping also isn't ideal - the
> task that unblocks this one might already be running on another CPU and
> able to finish sooner than the sleep timeout.
>
> Maybe the answer is a yield_for(time) which sleeps for the given time or
> until there's an idle CPU, whichever is sooner. But I don't know how
> hard that would be to implement or how widely useful it would be.
what is msleep(0) defined to do? would it be any better?
David Lang
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls
2014-03-09 22:53 ` David Miller
2014-03-09 23:17 ` Ben Hutchings
@ 2014-03-10 0:07 ` Stanislav Meduna
2014-03-31 21:49 ` [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb Thomas Gleixner
2 siblings, 0 replies; 18+ messages in thread
From: Stanislav Meduna @ 2014-03-10 0:07 UTC (permalink / raw)
To: David Miller, ben; +Cc: mkl, linux-rt-users, linux-kernel, netdev, kernel
On 09.03.2014 23:53, David Miller wrote:
> To me it means "I've got nothing to do if other tasks want to run right
> now" Yes, I even see it having this meaning when an RT task executes
> it.
http://www.kernel.org/doc/htmldocs/device-drivers/API-yield.html
lists this exact "while (!event) yield;" pattern as a broken usage
and states "Never use yield as a progress guarantee!!".
> How else can you interpret the intent above?
IMNSHO there is no way to make the yield() honor this intent if it is
called from a SCHED_FIFO or SCHED_RR task and the task unblocking
it is running at a lower priority. On the PREEMPT_RT systems where
the interrupt handlers are threads an application thread running
at higher priority than some interrupt handlers is a common situation.
The semantics of the scheduler here is "run the highest priority
runnable task until it blocks (FIFO) or its time slice is over
and there are more with the _same_ priority (RR)". This scenario
then collapses into a busy loop never making progress.
> If you change it to msleep(1), you're assigning an extra completely
> arbitrary time limit to the yield. The code doesn't want to sleep
> for 1ms, that's not what it's asking for.
The problem is that there is no way to formulate what it is asking
for in scheduler terms only.
Regards
--
Stano
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-03-09 22:53 ` David Miller
2014-03-09 23:17 ` Ben Hutchings
2014-03-10 0:07 ` Stanislav Meduna
@ 2014-03-31 21:49 ` Thomas Gleixner
2014-04-02 11:07 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2014-03-31 21:49 UTC (permalink / raw)
To: David Miller
Cc: ben, mkl, linux-rt-users, LKML, netdev, kernel, Peter Zijlstra
On Sun, 9 Mar 2014, David Miller wrote:
> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Sun, 09 Mar 2014 19:09:20 +0000
>
> > On Thu, 2014-03-06 at 16:06 -0500, David Miller wrote:
> >> From: Marc Kleine-Budde <mkl@pengutronix.de>
> >> Date: Wed, 5 Mar 2014 00:49:47 +0100
> >>
> >> > @@ -839,7 +839,7 @@ void dev_deactivate_many(struct list_head *head)
> >> > /* Wait for outstanding qdisc_run calls. */
> >> > list_for_each_entry(dev, head, unreg_list)
> >> > while (some_qdisc_is_busy(dev))
> >> > - yield();
> >> > + msleep(1)
> >> > }
> >>
> >> I don't understand this.
> >>
> >> yield() should really _mean_ yield.
> >>
> >> The intent of a yield() call, like this one here, is unambiguously
> >> that the current thread cannot do anything until some other thread
> >> gets onto the cpu and makes forward progress.
> >>
> >> Therefore it should allow lower priority threads to run, not just
> >> equal or higher priority ones.
> >
> > Until when?
> >
> > yield() is not a sensible operation in a preemptive multitasking system,
> > regardless of RT.
>
> To me it means "I've got nothing to do if other tasks want to run right
> now" Yes, I even see it having this meaning when an RT task executes
> it.
>
> How else can you interpret the intent above?
The problem is that the semantics of yield() are a complete disaster.
yield() only works by some definiton of "works" when the involved
parties are in the same scheduling class.
You can yield from one SCHED_OTHER task to another SCHED_OTHER task,
but yield() does not guarantee any progress. It might work for some
scenarios of yield from one SCHED_FIFO task to another SCHED_FIFO task
even better than in the SCHED_OTHER case, but there is no guarantee
that it works at all.
But yielding across scheduling classes cannot work ever. Not even in
mainline.
That's not a RT sepcific issue, it's just one of these problems which
are unearthed by RT because it does not follow the bog standard
scheduling class and priority assignements of your favourite distro
setup.
Once you tune a few knobs on a mainline kernel, you can provoke the
same issue. It's that simple.
yield() should die both in kernel and user space.
Thanks,
tglx
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-03-31 21:49 ` [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb Thomas Gleixner
@ 2014-04-02 11:07 ` Peter Zijlstra
2014-04-02 11:17 ` Eric Dumazet
0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-02 11:07 UTC (permalink / raw)
To: Thomas Gleixner
Cc: David Miller, ben, mkl, linux-rt-users, LKML, netdev, kernel
On Mon, Mar 31, 2014 at 11:49:16PM +0200, Thomas Gleixner wrote:
> > >> The intent of a yield() call, like this one here, is unambiguously
> > >> that the current thread cannot do anything until some other thread
> > >> gets onto the cpu and makes forward progress.
Yeah; like Thomas said; that's not what yield() does -- or can do.
yield() only says 'I don't know what to do, you figure it out', but then
fails to provide enough information to actually do something sensible.
It cannot put the task to sleep; because it doesn't pair with a wakeup;
and therefore the task stays an eligible/runnable task from a scheduler
pov.
The scheduler is therefore entirely in its right to pick this task
again; and pretty much has to under many circumstances.
yield() does not, and can not, guarantee forward progress - ever.
Use wait_event() and assorted bits to wait for an actual event. That is
a sleep paired with a wakeup and thus has progress guarantees.
Also; see the comment near yield:
/**
* yield - yield the current processor to other threads.
*
* Do not ever use this function, there's a 99% chance you're doing it wrong.
*
* The scheduler is at all times free to pick the calling task as the most
* eligible task to run, if removing the yield() call from your code breaks
* it, its already broken.
*
* Typical broken usage is:
*
* while (!event)
* yield();
*
* where one assumes that yield() will let 'the other' process run that will
* make event true. If the current task is a SCHED_FIFO task that will never
* happen. Never use yield() as a progress guarantee!!
*
* If you want to use yield() to wait for something, use wait_event().
* If you want to use yield() to be 'nice' for others, use cond_resched().
* If you still want to use yield(), do not!
*/
void __sched yield(void)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-02 11:07 ` Peter Zijlstra
@ 2014-04-02 11:17 ` Eric Dumazet
2014-04-04 15:19 ` Peter Zijlstra
0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2014-04-02 11:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, David Miller, ben, mkl, linux-rt-users, LKML,
netdev, kernel
On Wed, 2014-04-02 at 13:07 +0200, Peter Zijlstra wrote:
> On Mon, Mar 31, 2014 at 11:49:16PM +0200, Thomas Gleixner wrote:
> > > >> The intent of a yield() call, like this one here, is unambiguously
> > > >> that the current thread cannot do anything until some other thread
> > > >> gets onto the cpu and makes forward progress.
>
> Yeah; like Thomas said; that's not what yield() does -- or can do.
>
> yield() only says 'I don't know what to do, you figure it out', but then
> fails to provide enough information to actually do something sensible.
>
> It cannot put the task to sleep; because it doesn't pair with a wakeup;
> and therefore the task stays an eligible/runnable task from a scheduler
> pov.
>
> The scheduler is therefore entirely in its right to pick this task
> again; and pretty much has to under many circumstances.
>
> yield() does not, and can not, guarantee forward progress - ever.
>
> Use wait_event() and assorted bits to wait for an actual event. That is
> a sleep paired with a wakeup and thus has progress guarantees.
Why wouldn't it be safe to redefine yield as msleep(1) ?
net/ipv4/tcp_output.c seems to also use yield().
/* Socket is locked, keep trying until memory is available. */
for (;;) {
skb = alloc_skb_fclone(MAX_TCP_HEADER,
sk->sk_allocation);
if (skb)
break;
yield();
}
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-02 11:17 ` Eric Dumazet
@ 2014-04-04 15:19 ` Peter Zijlstra
2014-04-04 15:26 ` David Miller
2014-04-04 15:28 ` David Miller
0 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-04 15:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Gleixner, David Miller, ben, mkl, linux-rt-users, LKML,
netdev, kernel
On Wed, Apr 02, 2014 at 04:17:39AM -0700, Eric Dumazet wrote:
> On Wed, 2014-04-02 at 13:07 +0200, Peter Zijlstra wrote:
> > On Mon, Mar 31, 2014 at 11:49:16PM +0200, Thomas Gleixner wrote:
> > > > >> The intent of a yield() call, like this one here, is unambiguously
> > > > >> that the current thread cannot do anything until some other thread
> > > > >> gets onto the cpu and makes forward progress.
> >
> > Yeah; like Thomas said; that's not what yield() does -- or can do.
> >
> > yield() only says 'I don't know what to do, you figure it out', but then
> > fails to provide enough information to actually do something sensible.
> >
> > It cannot put the task to sleep; because it doesn't pair with a wakeup;
> > and therefore the task stays an eligible/runnable task from a scheduler
> > pov.
> >
> > The scheduler is therefore entirely in its right to pick this task
> > again; and pretty much has to under many circumstances.
> >
> > yield() does not, and can not, guarantee forward progress - ever.
> >
> > Use wait_event() and assorted bits to wait for an actual event. That is
> > a sleep paired with a wakeup and thus has progress guarantees.
>
> Why wouldn't it be safe to redefine yield as msleep(1) ?
Because it violates what POSIX says what yield() should do. There is a
semi valid use-case for yield between RR/FIFO tasks of the same
priority. Changing yield to do msleep(1) would break that.
The kernel actually used to rely on that particular semantics; but I
think we took that out because it was fragile as heck.
> net/ipv4/tcp_output.c seems to also use yield().
>
>
> /* Socket is locked, keep trying until memory is available. */
> for (;;) {
> skb = alloc_skb_fclone(MAX_TCP_HEADER,
> sk->sk_allocation);
> if (skb)
> break;
> yield();
> }
Yeah, that's crap code too.
Furthermore, changing the network code to use msleep(1) instead of
yield() doesn't make it less crap than it was, just functional crap
instead of broken crap.
The proper way to fix the dev_deactivate_many() is to use wait_event(),
polling for that state is just daft. Afaict there is no reason the qdisc
code could not do a wakeup whenever that condition changes.
And the above loop in tcp_output() is a plain memory deadlock, you
should not loop on allocations like that. If the allocation fails;
propagate the error.
Given that this function must not fail; one should pre-allocate proper
reserves, not simply loop until it works. The loop might be holding up
the work required to release memory.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:19 ` Peter Zijlstra
@ 2014-04-04 15:26 ` David Miller
2014-04-07 11:19 ` Peter Zijlstra
2014-04-04 15:28 ` David Miller
1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2014-04-04 15:26 UTC (permalink / raw)
To: peterz
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 4 Apr 2014 17:19:42 +0200
> The proper way to fix the dev_deactivate_many() is to use wait_event(),
> polling for that state is just daft. Afaict there is no reason the qdisc
> code could not do a wakeup whenever that condition changes.
I actually looked into this, and it's going to add expensive checks
to the fast paths of packet processing.
If it was so easy we'd be doing it that way already.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:19 ` Peter Zijlstra
2014-04-04 15:26 ` David Miller
@ 2014-04-04 15:28 ` David Miller
2014-04-07 11:24 ` Peter Zijlstra
1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2014-04-04 15:28 UTC (permalink / raw)
To: peterz
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 4 Apr 2014 17:19:42 +0200
> And the above loop in tcp_output() is a plain memory deadlock, you
> should not loop on allocations like that. If the allocation fails;
> propagate the error.
There is nothing to "propagate" it to. We have no "event" that will
trigger so that we have a second chance to send the FIN out, it really
must go out before we progress any further at this stage.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:26 ` David Miller
@ 2014-04-07 11:19 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-07 11:19 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
On Fri, Apr 04, 2014 at 11:26:28AM -0400, David Miller wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 4 Apr 2014 17:19:42 +0200
>
> > The proper way to fix the dev_deactivate_many() is to use wait_event(),
> > polling for that state is just daft. Afaict there is no reason the qdisc
> > code could not do a wakeup whenever that condition changes.
>
> I actually looked into this, and it's going to add expensive checks
> to the fast paths of packet processing.
>
> If it was so easy we'd be doing it that way already.
Fair enough; and I suppose this waiting side is 'rare' compared to the
packet processing side of things?
In that case you could add a comment with the msleep() and just leave it
at that I suppose.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb
2014-04-04 15:28 ` David Miller
@ 2014-04-07 11:24 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2014-04-07 11:24 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, tglx, ben, mkl, linux-rt-users, linux-kernel,
netdev, kernel
On Fri, Apr 04, 2014 at 11:28:50AM -0400, David Miller wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Fri, 4 Apr 2014 17:19:42 +0200
>
> > And the above loop in tcp_output() is a plain memory deadlock, you
> > should not loop on allocations like that. If the allocation fails;
> > propagate the error.
>
> There is nothing to "propagate" it to. We have no "event" that will
> trigger so that we have a second chance to send the FIN out, it really
> must go out before we progress any further at this stage.
>From what I remember of TCP/IP the FIN packet is used once per
connection; to terminate said connection. So if we pre-allocate one per
connection, at establishment time (where we can still easily fail) we
should be good and safe.
Now, I suppose there's valid arguments against this -- wasted memory
being one I suspect.
Also, seeing how this code has basically been this way forever and has
not caused (afaik) any actual memory deadlocks (they're hard anyway). We
might just get away with not touching it.
That said, at the very least we want a comment here stating that this is
a potential memory deadlock and why the alternatives aren't any good, so
that future readers (possibly us again) know this has been considered.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2014-04-07 11:24 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-04 23:49 [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run calls Marc Kleine-Budde
2014-03-06 21:06 ` David Miller
2014-03-06 21:39 ` Marc Kleine-Budde
2014-03-07 15:47 ` Sebastian Andrzej Siewior
2014-03-07 4:26 ` Mike Galbraith
2014-03-09 19:09 ` Ben Hutchings
2014-03-09 22:53 ` David Miller
2014-03-09 23:17 ` Ben Hutchings
2014-03-09 23:28 ` David Lang
2014-03-10 0:07 ` Stanislav Meduna
2014-03-31 21:49 ` [PATCH] net: sched: dev_deactivate_many(): use msleep(1) instead of yield() to wait for outstanding qdisc_run callsb Thomas Gleixner
2014-04-02 11:07 ` Peter Zijlstra
2014-04-02 11:17 ` Eric Dumazet
2014-04-04 15:19 ` Peter Zijlstra
2014-04-04 15:26 ` David Miller
2014-04-07 11:19 ` Peter Zijlstra
2014-04-04 15:28 ` David Miller
2014-04-07 11:24 ` Peter Zijlstra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).