* [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
@ 2025-01-17 10:58 K Prateek Nayak
2025-01-17 13:25 ` Vincent Guittot
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: K Prateek Nayak @ 2025-01-17 10:58 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal,
K Prateek Nayak
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.
Consider the following scenario:
root
/ \
A B (*) delayed since B is no longer eligible on root
| |
Task0 Task1 <--- dequeue_task_fair() - task blocks
When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too, leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.
A SCHED_WARN_ON() that inspects h_nr_runnable after its update in
dequeue_entities(), like the one below:
cfs_rq->h_nr_runnable -= h_nr_runnable;
SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.
This error is self-correcting since cfs_rqs are per-CPU and cannot
migrate. The entity is either picked for a full dequeue or is requeued
when a task wakes up below it. Both of those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering whether the entity corresponds to a task or not.
h_nr_runnable will eventually reflect the correct value; however, in
the interim, the incorrect values can still influence the PELT
calculations that use se->runnable_weight or cfs_rq->h_nr_runnable.
Since only delayed tasks take the early-return path in
dequeue_entities() and enqueue_task_fair(), and that path skips the
h_nr_* update loops, adjust h_nr_runnable in {set,clear}_delayed() only
when the delayed entity is a task.
For entities corresponding to cfs_rq, the h_nr_* update loop in the
caller will do the right thing.
Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98ac49ce78ea..0fe6c6e65e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5481,6 +5481,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
static void set_delayed(struct sched_entity *se)
{
se->sched_delayed = 1;
+
+ /*
+ * Delayed se of cfs_rq have no tasks queued on them.
+ * Do not adjust h_nr_runnable since dequeue_entities()
+ * will account it for blocked tasks.
+ */
+ if (!entity_is_task(se))
+ return;
+
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -5493,6 +5502,16 @@ static void set_delayed(struct sched_entity *se)
static void clear_delayed(struct sched_entity *se)
{
se->sched_delayed = 0;
+
+ /*
+ * Delayed se of cfs_rq have no tasks queued on them.
+ * Do not adjust h_nr_runnable since a dequeue has
+ * already accounted for it or an enqueue of a task
+ * below it will account for it in enqueue_task_fair().
+ */
+ if (!entity_is_task(se))
+ return;
+
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
base-commit: 7d9da040575b343085287686fa902a5b2d43c7ca
--
2.34.1
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue 2025-01-17 10:58 [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue K Prateek Nayak @ 2025-01-17 13:25 ` Vincent Guittot 2025-01-17 15:59 ` K Prateek Nayak 2025-01-20 5:06 ` Madadi Vineeth Reddy 2025-01-21 12:21 ` [tip: sched/urgent] " tip-bot2 for K Prateek Nayak 2 siblings, 1 reply; 11+ messages in thread From: Vincent Guittot @ 2025-01-17 13:25 UTC (permalink / raw) To: K Prateek Nayak Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal Hi Prateek, On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > > set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an > entity is delayed irrespective of whether the entity corresponds to a > task or a cfs_rq. > > Consider the following scenario: > > root > / \ > A B (*) delayed since B is no longer eligible on root > | | > Task0 Task1 <--- dequeue_task_fair() - task blocks > > When Task1 blocks (dequeue_entity() for task's se returns true), > dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the > hierarchy of Task1. However, when the sched_entity corresponding to > cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the > hierarchy too leading to both dequeue_entity() and set_delayed() > decrementing h_nr_runnable for the dequeue of the same task. > > A SCHED_WARN_ON() to inspect h_nr_runnable post its update in > dequeue_entities() like below: > > cfs_rq->h_nr_runnable -= h_nr_runnable; > SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0); > > is consistently tripped when running wakeup intensive workloads like > hackbench in a cgroup. > > This error is self correcting since cfs_rq are per-cpu and cannot > migrate. 
The entitiy is either picked for full dequeue or is requeued > when a task wakes up below it. Both those paths call clear_delayed() > which again increments h_nr_runnable of the hierarchy without > considering if the entity corresponds to a task or not. > > h_nr_runnable will eventually reflect the correct value however in the > interim, the incorrect values can still influence PELT calculation which > uses se->runnable_weight or cfs_rq->h_nr_runnable. > > Since only delayed tasks take the early return path in > dequeue_entities() and enqueue_task_fair(), adjust the > h_nr_runnable in {set,clear}_delayed() only when a task is delayed as > this path skips the h_nr_* update loops and returns early. > > For entities corresponding to cfs_rq, the h_nr_* update loop in the > caller will do the right thing. > > Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable") Before we were tracking the opposite h_nr_delayed. Did you see the problem only on tip/sched/core or also before the rework which added h_nr_runnable and removed h_nr_delayed I'm going to have a closer look > Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> > Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> > --- > kernel/sched/fair.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 98ac49ce78ea..0fe6c6e65e55 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5481,6 +5481,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq); > static void set_delayed(struct sched_entity *se) > { > se->sched_delayed = 1; > + > + /* > + * Delayed se of cfs_rq have no tasks queued on them. > + * Do not adjust h_nr_runnable since dequeue_entities() > + * will account it for blocked tasks. 
> + */ > + if (!entity_is_task(se)) > + return; > + > for_each_sched_entity(se) { > struct cfs_rq *cfs_rq = cfs_rq_of(se); > > @@ -5493,6 +5502,16 @@ static void set_delayed(struct sched_entity *se) > static void clear_delayed(struct sched_entity *se) > { > se->sched_delayed = 0; > + > + /* > + * Delayed se of cfs_rq have no tasks queued on them. > + * Do not adjust h_nr_runnable since a dequeue has > + * already accounted for it or an enqueue of a task > + * below it will account for it in enqueue_task_fair(). > + */ > + if (!entity_is_task(se)) > + return; > + > for_each_sched_entity(se) { > struct cfs_rq *cfs_rq = cfs_rq_of(se); > > > base-commit: 7d9da040575b343085287686fa902a5b2d43c7ca > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue 2025-01-17 13:25 ` Vincent Guittot @ 2025-01-17 15:59 ` K Prateek Nayak 2025-01-18 8:30 ` Vincent Guittot 0 siblings, 1 reply; 11+ messages in thread From: K Prateek Nayak @ 2025-01-17 15:59 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal Hello Vincent, On 1/17/2025 6:55 PM, Vincent Guittot wrote: > Hi Prateek, > > On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: >> >> set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an >> entity is delayed irrespective of whether the entity corresponds to a >> task or a cfs_rq. >> >> Consider the following scenario: >> >> root >> / \ >> A B (*) delayed since B is no longer eligible on root >> | | >> Task0 Task1 <--- dequeue_task_fair() - task blocks >> >> When Task1 blocks (dequeue_entity() for task's se returns true), >> dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the >> hierarchy of Task1. However, when the sched_entity corresponding to >> cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the >> hierarchy too leading to both dequeue_entity() and set_delayed() >> decrementing h_nr_runnable for the dequeue of the same task. >> >> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in >> dequeue_entities() like below: >> >> cfs_rq->h_nr_runnable -= h_nr_runnable; >> SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0); >> >> is consistently tripped when running wakeup intensive workloads like >> hackbench in a cgroup. >> >> This error is self correcting since cfs_rq are per-cpu and cannot >> migrate. The entitiy is either picked for full dequeue or is requeued >> when a task wakes up below it. 
Both those paths call clear_delayed() >> which again increments h_nr_runnable of the hierarchy without >> considering if the entity corresponds to a task or not. >> >> h_nr_runnable will eventually reflect the correct value however in the >> interim, the incorrect values can still influence PELT calculation which >> uses se->runnable_weight or cfs_rq->h_nr_runnable. >> >> Since only delayed tasks take the early return path in >> dequeue_entities() and enqueue_task_fair(), adjust the >> h_nr_runnable in {set,clear}_delayed() only when a task is delayed as >> this path skips the h_nr_* update loops and returns early. >> >> For entities corresponding to cfs_rq, the h_nr_* update loop in the >> caller will do the right thing. >> >> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") > > You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable") You are right! I had done a git blame on set_delayed() ad landed at commit 76f2f783294d but you are right, it should be c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was inverted to account runnable tasks. Thank you for pointing that out. > Before we were tracking the opposite h_nr_delayed. Did you see the > problem only on tip/sched/core or also before the rework which added > h_nr_runnable and removed h_nr_delayed The problem is on tip:sched/core. I did not encounter any anomalies on 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep && !delayed" which would imply migration or a save + restore type operation and the whole "h_nr_delayed" adjusting was contained in {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and requeue. > > I'm going to have a closer look Thank you! > > >> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> >> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> >> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> >> --- >> >> [..snip..] 
>> -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue 2025-01-17 15:59 ` K Prateek Nayak @ 2025-01-18 8:30 ` Vincent Guittot 2025-01-21 8:09 ` K Prateek Nayak 0 siblings, 1 reply; 11+ messages in thread From: Vincent Guittot @ 2025-01-18 8:30 UTC (permalink / raw) To: K Prateek Nayak Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > > Hello Vincent, > > On 1/17/2025 6:55 PM, Vincent Guittot wrote: > > Hi Prateek, > > > > On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > >> > >> set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an > >> entity is delayed irrespective of whether the entity corresponds to a > >> task or a cfs_rq. > >> > >> Consider the following scenario: > >> > >> root > >> / \ > >> A B (*) delayed since B is no longer eligible on root > >> | | > >> Task0 Task1 <--- dequeue_task_fair() - task blocks > >> > >> When Task1 blocks (dequeue_entity() for task's se returns true), > >> dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the > >> hierarchy of Task1. However, when the sched_entity corresponding to > >> cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the > >> hierarchy too leading to both dequeue_entity() and set_delayed() > >> decrementing h_nr_runnable for the dequeue of the same task. > >> > >> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in > >> dequeue_entities() like below: > >> > >> cfs_rq->h_nr_runnable -= h_nr_runnable; > >> SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0); > >> > >> is consistently tripped when running wakeup intensive workloads like > >> hackbench in a cgroup. > >> > >> This error is self correcting since cfs_rq are per-cpu and cannot > >> migrate. 
The entitiy is either picked for full dequeue or is requeued > >> when a task wakes up below it. Both those paths call clear_delayed() > >> which again increments h_nr_runnable of the hierarchy without > >> considering if the entity corresponds to a task or not. > >> > >> h_nr_runnable will eventually reflect the correct value however in the > >> interim, the incorrect values can still influence PELT calculation which > >> uses se->runnable_weight or cfs_rq->h_nr_runnable. > >> > >> Since only delayed tasks take the early return path in > >> dequeue_entities() and enqueue_task_fair(), adjust the > >> h_nr_runnable in {set,clear}_delayed() only when a task is delayed as > >> this path skips the h_nr_* update loops and returns early. > >> > >> For entities corresponding to cfs_rq, the h_nr_* update loop in the > >> caller will do the right thing. > >> > >> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") > > > > You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable") > > You are right! I had done a git blame on set_delayed() ad landed at > commit 76f2f783294d but you are right, it should be c2a295bffeaf > ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was > inverted to account runnable tasks. Thank you for pointing that out. > > > Before we were tracking the opposite h_nr_delayed. Did you see the > > problem only on tip/sched/core or also before the rework which added > > h_nr_runnable and removed h_nr_delayed > > The problem is on tip:sched/core. I did not encounter any anomalies on > 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") > > "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep && > !delayed" which would imply migration or a save + restore type operation > and the whole "h_nr_delayed" adjusting was contained in > {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and > requeue. > > > > > I'm going to have a closer look Your fix looks good to me. 
I also run some tests after re-adding h_nr_delayed and checking that h_nr_queued = h_nr_runnable + h_nr_delayed after each update and I didn't get any warning with your patch whereas I got one during boot without it (but none after that during my tests) Thanks for catching this Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> > > Thank you! > > > > > > >> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> > >> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> > >> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> > >> --- > >> > >> [..snip..] > >> > > -- > Thanks and Regards, > Prateek > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue 2025-01-18 8:30 ` Vincent Guittot @ 2025-01-21 8:09 ` K Prateek Nayak 2025-01-21 9:53 ` Vincent Guittot 0 siblings, 1 reply; 11+ messages in thread From: K Prateek Nayak @ 2025-01-21 8:09 UTC (permalink / raw) To: Vincent Guittot Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal Hello Vincent, On 1/18/2025 2:00 PM, Vincent Guittot wrote: > On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: >> >> Hello Vincent, >> >> On 1/17/2025 6:55 PM, Vincent Guittot wrote: >>> Hi Prateek, >>> >>> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: >>>> >>>> set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an >>>> entity is delayed irrespective of whether the entity corresponds to a >>>> task or a cfs_rq. >>>> >>>> Consider the following scenario: >>>> >>>> root >>>> / \ >>>> A B (*) delayed since B is no longer eligible on root >>>> | | >>>> Task0 Task1 <--- dequeue_task_fair() - task blocks >>>> >>>> When Task1 blocks (dequeue_entity() for task's se returns true), >>>> dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the >>>> hierarchy of Task1. However, when the sched_entity corresponding to >>>> cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the >>>> hierarchy too leading to both dequeue_entity() and set_delayed() >>>> decrementing h_nr_runnable for the dequeue of the same task. >>>> >>>> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in >>>> dequeue_entities() like below: >>>> >>>> cfs_rq->h_nr_runnable -= h_nr_runnable; >>>> SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0); >>>> >>>> is consistently tripped when running wakeup intensive workloads like >>>> hackbench in a cgroup. 
>>>> >>>> This error is self correcting since cfs_rq are per-cpu and cannot >>>> migrate. The entitiy is either picked for full dequeue or is requeued >>>> when a task wakes up below it. Both those paths call clear_delayed() >>>> which again increments h_nr_runnable of the hierarchy without >>>> considering if the entity corresponds to a task or not. >>>> >>>> h_nr_runnable will eventually reflect the correct value however in the >>>> interim, the incorrect values can still influence PELT calculation which >>>> uses se->runnable_weight or cfs_rq->h_nr_runnable. >>>> >>>> Since only delayed tasks take the early return path in >>>> dequeue_entities() and enqueue_task_fair(), adjust the >>>> h_nr_runnable in {set,clear}_delayed() only when a task is delayed as >>>> this path skips the h_nr_* update loops and returns early. >>>> >>>> For entities corresponding to cfs_rq, the h_nr_* update loop in the >>>> caller will do the right thing. >>>> >>>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") >>> >>> You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable") >> >> You are right! I had done a git blame on set_delayed() ad landed at >> commit 76f2f783294d but you are right, it should be c2a295bffeaf >> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was >> inverted to account runnable tasks. Thank you for pointing that out. >> >>> Before we were tracking the opposite h_nr_delayed. Did you see the >>> problem only on tip/sched/core or also before the rework which added >>> h_nr_runnable and removed h_nr_delayed >> >> The problem is on tip:sched/core. 
I did not encounter any anomalies on >> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") >> >> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep && >> !delayed" which would imply migration or a save + restore type operation >> and the whole "h_nr_delayed" adjusting was contained in >> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and >> requeue. So I was looking at it wrong when I was investigating on commit 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") h_nr_delayed can never be larger than h_nr_running (h_nr_queued upstream) since the number of delayed tasks can never cross number of tasks queued below the given cfs_rq but with the following: diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 97ee48c8bf5e..8e713f241483 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7145,6 +7145,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running; cfs_rq->h_nr_delayed -= h_nr_delayed; + SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running); if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running = h_nr_running; @@ -7185,6 +7186,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) cfs_rq->idle_h_nr_running -= idle_h_nr_running; cfs_rq->h_nr_delayed -= h_nr_delayed; + SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running); + if (cfs_rq_is_idle(cfs_rq)) idle_h_nr_running = h_nr_running; -- I can again consistently hit the warning without the fix on 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") I think that the original "Fixes:" tag is indeed right. >> >>> >>> I'm going to have a closer look > > Your fix looks good to me. 
I also run some tests after re-adding > h_nr_delayed and checking that h_nr_queued = h_nr_runnable + > h_nr_delayed after each update and I didn't get any warning with your > patch whereas I got one during boot without it (but none after that > during my tests) Could it be the case that h_nr_delayed counts a tiny bit higher than the actual number and h_nr_runnable counts a tiny bit lower by the same amount and they both correct each other to give the correct h_nr_queued? > > Thanks for catching this > > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Thank you for reviewing the patch! > >> >> Thank you! >> >>> >>> >>>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> >>>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> >>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> >>>> --- >>>> >>>> [..snip..] >>>> >> >> -- >> Thanks and Regards, >> Prateek >> -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue 2025-01-21 8:09 ` K Prateek Nayak @ 2025-01-21 9:53 ` Vincent Guittot 0 siblings, 0 replies; 11+ messages in thread From: Vincent Guittot @ 2025-01-21 9:53 UTC (permalink / raw) To: K Prateek Nayak Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal On Tue, 21 Jan 2025 at 09:09, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > > Hello Vincent, > > On 1/18/2025 2:00 PM, Vincent Guittot wrote: > > On Fri, 17 Jan 2025 at 16:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > >> > >> Hello Vincent, > >> > >> On 1/17/2025 6:55 PM, Vincent Guittot wrote: > >>> Hi Prateek, > >>> > >>> On Fri, 17 Jan 2025 at 11:59, K Prateek Nayak <kprateek.nayak@amd.com> wrote: > >>>> > >>>> set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an > >>>> entity is delayed irrespective of whether the entity corresponds to a > >>>> task or a cfs_rq. > >>>> > >>>> Consider the following scenario: > >>>> > >>>> root > >>>> / \ > >>>> A B (*) delayed since B is no longer eligible on root > >>>> | | > >>>> Task0 Task1 <--- dequeue_task_fair() - task blocks > >>>> > >>>> When Task1 blocks (dequeue_entity() for task's se returns true), > >>>> dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the > >>>> hierarchy of Task1. However, when the sched_entity corresponding to > >>>> cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the > >>>> hierarchy too leading to both dequeue_entity() and set_delayed() > >>>> decrementing h_nr_runnable for the dequeue of the same task. 
> >>>> > >>>> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in > >>>> dequeue_entities() like below: > >>>> > >>>> cfs_rq->h_nr_runnable -= h_nr_runnable; > >>>> SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0); > >>>> > >>>> is consistently tripped when running wakeup intensive workloads like > >>>> hackbench in a cgroup. > >>>> > >>>> This error is self correcting since cfs_rq are per-cpu and cannot > >>>> migrate. The entitiy is either picked for full dequeue or is requeued > >>>> when a task wakes up below it. Both those paths call clear_delayed() > >>>> which again increments h_nr_runnable of the hierarchy without > >>>> considering if the entity corresponds to a task or not. > >>>> > >>>> h_nr_runnable will eventually reflect the correct value however in the > >>>> interim, the incorrect values can still influence PELT calculation which > >>>> uses se->runnable_weight or cfs_rq->h_nr_runnable. > >>>> > >>>> Since only delayed tasks take the early return path in > >>>> dequeue_entities() and enqueue_task_fair(), adjust the > >>>> h_nr_runnable in {set,clear}_delayed() only when a task is delayed as > >>>> this path skips the h_nr_* update loops and returns early. > >>>> > >>>> For entities corresponding to cfs_rq, the h_nr_* update loop in the > >>>> caller will do the right thing. > >>>> > >>>> Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") > >>> > >>> You probably mean c2a295bffeaf ("sched/fair: Add new cfs_rq.h_nr_runnable") > >> > >> You are right! I had done a git blame on set_delayed() ad landed at > >> commit 76f2f783294d but you are right, it should be c2a295bffeaf > >> ("sched/fair: Add new cfs_rq.h_nr_runnable") when the accounting was > >> inverted to account runnable tasks. Thank you for pointing that out. > >> > >>> Before we were tracking the opposite h_nr_delayed. 
Did you see the > >>> problem only on tip/sched/core or also before the rework which added > >>> h_nr_runnable and removed h_nr_delayed > >> > >> The problem is on tip:sched/core. I did not encounter any anomalies on > >> 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") > >> > >> "h_nr_delayed" was only adjusted in dequeue_entities() for "!seep && > >> !delayed" which would imply migration or a save + restore type operation > >> and the whole "h_nr_delayed" adjusting was contained in > >> {set,clear}_delayed() for delayed dequeue, finish delayed dequeue, and > >> requeue. > > So I was looking at it wrong when I was investigating on commit > 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") h_nr_delayed > can never be larger than h_nr_running (h_nr_queued upstream) since the > number of delayed tasks can never cross number of tasks queued below > the given cfs_rq but with the following: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 97ee48c8bf5e..8e713f241483 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7145,6 +7145,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) > cfs_rq->h_nr_running -= h_nr_running; > cfs_rq->idle_h_nr_running -= idle_h_nr_running; > cfs_rq->h_nr_delayed -= h_nr_delayed; > + SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running); > > if (cfs_rq_is_idle(cfs_rq)) > idle_h_nr_running = h_nr_running; > @@ -7185,6 +7186,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) > cfs_rq->idle_h_nr_running -= idle_h_nr_running; > cfs_rq->h_nr_delayed -= h_nr_delayed; > > + SCHED_WARN_ON(cfs_rq->h_nr_delayed > cfs_rq->h_nr_running); > + > if (cfs_rq_is_idle(cfs_rq)) > idle_h_nr_running = h_nr_running; > > -- > > I can again consistently hit the warning without the fix on 76f2f783294d > ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") I updated my warning conditions with h_nr_queued != h_nr_delayed + h_nr_runnable h_nr_delayed < 0 
or > h_nr_queued h_nr_runnable < 0 or > h_nr_queued In addition to h_nr_runnable < 0, h_nr_delayed > h_nr_queued has also triggered a warning so I confirm that the fix is on 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE") No warning are triggered with your fix > > I think that the original "Fixes:" tag is indeed right. > > >> > >>> > >>> I'm going to have a closer look > > > > Your fix looks good to me. I also run some tests after re-adding > > h_nr_delayed and checking that h_nr_queued = h_nr_runnable + > > h_nr_delayed after each update and I didn't get any warning with your > > patch whereas I got one during boot without it (but none after that > > during my tests) > > Could it be the case that h_nr_delayed counts a tiny bit higher than > the actual number and h_nr_runnable counts a tiny bit lower by the > same amount and they both correct each other to give the correct > h_nr_queued? > > > > > Thanks for catching this > > > > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> > > Thank you for reviewing the patch! > > > > >> > >> Thank you! > >> > >>> > >>> > >>>> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> > >>>> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com> > >>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> > >>>> --- > >>>> > >>>> [..snip..] > >>>> > >> > >> -- > >> Thanks and Regards, > >> Prateek > >> > > -- > Thanks and Regards, > Prateek > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue 2025-01-17 10:58 [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue K Prateek Nayak 2025-01-17 13:25 ` Vincent Guittot @ 2025-01-20 5:06 ` Madadi Vineeth Reddy 2025-01-20 8:44 ` Madadi Vineeth Reddy 2025-01-20 9:12 ` K Prateek Nayak 2025-01-21 12:21 ` [tip: sched/urgent] " tip-bot2 for K Prateek Nayak 2 siblings, 2 replies; 11+ messages in thread From: Madadi Vineeth Reddy @ 2025-01-20 5:06 UTC (permalink / raw) To: kprateek.nayak Cc: bsegall, dietmar.eggemann, gautham.shenoy, juri.lelli, linux-kernel, mgorman, mingo, peterz, rostedt, swapnil.sapkal, vincent.guittot, vschneid, vineethr Hi Prateek, >A SCHED_WARN_ON() to inspect h_nr_runnable post its update in >dequeue_entities() like below: > > cfs_rq->h_nr_runnable -= h_nr_runnable; > SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0); > >is consistently tripped when running wakeup intensive workloads like >hackbench in a cgroup. I observed that the WARN_ON is triggered during the boot process without the patch, and the patch resolves the issue. However, I was unable to trigger the WARN_ON by running hackbench in a cgroup without the patch. Could you please share the specific test scenario or configuration you used to reproduce it? For the boot process scenario: Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Thanks, Madadi Vineeth Reddy ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue 2025-01-20 5:06 ` Madadi Vineeth Reddy @ 2025-01-20 8:44 ` Madadi Vineeth Reddy 2025-01-20 9:14 ` K Prateek Nayak 2025-01-20 9:12 ` K Prateek Nayak 1 sibling, 1 reply; 11+ messages in thread From: Madadi Vineeth Reddy @ 2025-01-20 8:44 UTC (permalink / raw) To: vineethr Cc: bsegall, dietmar.eggemann, gautham.shenoy, juri.lelli, kprateek.nayak, linux-kernel, mgorman, mingo, peterz, rostedt, swapnil.sapkal, vincent.guittot, vschneid > However, I was unable to trigger the WARN_ON by running hackbench in a > cgroup without the patch. Could you please share the specific test > scenario or configuration you used to reproduce it? My mistake: SCHED_WARN_ON is a WARN_ONCE, so I had to clear it before running hackbench to ensure the WARN_ON could be triggered again. Now I am able to trigger the WARN_ON in the unpatched kernel while running hackbench after executing `echo 1 > /sys/kernel/debug/clear_warn_once` following the initial warning during the boot process. I have verified that the patched kernel fixes the issue. Thanks, Madadi Vineeth Reddy ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

From: K Prateek Nayak @ 2025-01-20  9:14 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: bsegall, dietmar.eggemann, gautham.shenoy, juri.lelli,
    linux-kernel, mgorman, mingo, peterz, rostedt, swapnil.sapkal,
    vincent.guittot, vschneid

On 1/20/2025 2:14 PM, Madadi Vineeth Reddy wrote:
>> However, I was unable to trigger the WARN_ON by running hackbench in a
>> cgroup without the patch. Could you please share the specific test
>> scenario or configuration you used to reproduce it?
>
> My mistake: SCHED_WARN_ON is a WARN_ONCE, so I had to clear it before
> running hackbench to ensure the WARN_ON could be triggered again.

Yup! A WARN_ON() is hit very consistently.

> Now I am able to trigger the WARN_ON in the unpatched kernel while
> running hackbench after executing `echo 1 > /sys/kernel/debug/clear_warn_once`
> following the initial warning during the boot process. I have verified
> that the patched kernel fixes the issue.

Thank you again for testing the fix. Much appreciated!

> Thanks,
> Madadi Vineeth Reddy

--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

From: K Prateek Nayak @ 2025-01-20  9:12 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: bsegall, dietmar.eggemann, gautham.shenoy, juri.lelli,
    linux-kernel, mgorman, mingo, peterz, rostedt, swapnil.sapkal,
    vincent.guittot, vschneid

Hello Madadi Vineeth Reddy,

Thank you for the review and test.

On 1/20/2025 10:36 AM, Madadi Vineeth Reddy wrote:
> Hi Prateek,
>
>> A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
>> dequeue_entities() like below:
>>
>>     cfs_rq->h_nr_runnable -= h_nr_runnable;
>>     SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
>>
>> is consistently tripped when running wakeup intensive workloads like
>> hackbench in a cgroup.
>
> I observed that the WARN_ON is triggered during the boot process without
> the patch, and the patch resolves the issue.
>
> However, I was unable to trigger the WARN_ON by running hackbench in a
> cgroup without the patch. Could you please share the specific test
> scenario or configuration you used to reproduce it?

Can you try converting the SCHED_WARN_ON() to a plain WARN_ON() and try
again? I can consistently hit it, to the point that it floods my
console. With autogroup enabled on Ubuntu, I believe it is trivial to
hit this issue.

Following is the exact diff I'm using on top of tip:sched/core that
floods my console:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98ac49ce78ea..7bc2c57601b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7160,6 +7160,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
 		cfs_rq->h_nr_queued -= h_nr_queued;
+		WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
 		cfs_rq->h_nr_idle -= h_nr_idle;
 
 		if (cfs_rq_is_idle(cfs_rq))
@@ -7199,6 +7200,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		cfs_rq->h_nr_runnable -= h_nr_runnable;
 		cfs_rq->h_nr_queued -= h_nr_queued;
+		WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);
 		cfs_rq->h_nr_idle -= h_nr_idle;
 
 		if (cfs_rq_is_idle(cfs_rq))
--

I tested this on a 32 vCPU VM and I get similar floods.

> For the boot process scenario:
> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks a ton for testing!

> Thanks,
> Madadi Vineeth Reddy

--
Thanks and Regards,
Prateek
* [tip: sched/urgent] sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

From: tip-bot2 for K Prateek Nayak @ 2025-01-21 12:21 UTC (permalink / raw)
To: linux-tip-commits
Cc: K Prateek Nayak, Peter Zijlstra (Intel), Gautham R. Shenoy,
    Swapnil Sapkal, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     3429dd57f0deb1a602c2624a1dd7c4c11b6c4734
Gitweb:        https://git.kernel.org/tip/3429dd57f0deb1a602c2624a1dd7c4c11b6c4734
Author:        K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate:    Fri, 17 Jan 2025 10:58:52
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Jan 2025 13:13:36 +01:00

sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue

set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed, irrespective of whether the entity corresponds to a
task or a cfs_rq.

Consider the following scenario:

        root
       /    \
      A      B (*) delayed since B is no longer eligible on root
      |      |
    Task0  Task1 <--- dequeue_task_fair() - task blocks

When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for
the hierarchy too, leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.

A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:

    cfs_rq->h_nr_runnable -= h_nr_runnable;
    SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);

is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.

This error is self-correcting since cfs_rq are per-cpu and cannot
migrate. The entity is either picked for full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed(),
which again increments h_nr_runnable of the hierarchy without
considering if the entity corresponds to a task or not.

h_nr_runnable will eventually reflect the correct value; however, in
the interim, the incorrect values can still influence PELT calculation,
which uses se->runnable_weight or cfs_rq->h_nr_runnable.

Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the h_nr_runnable
in {set,clear}_delayed() only when a task is delayed, as this path
skips the h_nr_* update loops and returns early. For entities
corresponding to cfs_rq, the h_nr_* update loop in the caller will do
the right thing.

Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
---
 kernel/sched/fair.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2695843..f4e4d3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5372,6 +5372,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 static void set_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 1;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since dequeue_entities()
+	 * will account it for blocked tasks.
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -5384,6 +5393,16 @@ static void set_delayed(struct sched_entity *se)
 static void clear_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 0;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since a dequeue has
+	 * already accounted for it or an enqueue of a task
+	 * below it will account for it in enqueue_task_fair().
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);