* [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
@ 2025-02-20 9:32 K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header K Prateek Nayak
` (22 more replies)
0 siblings, 23 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Hello everyone,
tl;dr
This is a spiritual successor to Valentin's series to defer CFS
throttling to user entry [1]; however, this one does things differently:
instead of moving to per-task throttling, it retains throttling at the
cfs_rq level while tracking the tasks queued on the cfs_rq that were
preempted in kernel mode.

On a throttled hierarchy, only the kernel mode preempted entities are
considered runnable, and pick_task_fair() will only select among these
entities in EEVDF fashion using a min-heap approach (Patches 6-8).

This is an early prototype with a few issues (please see the "Broken
bits and disclaimer" section), but I'm putting it out to get some early
feedback.
This series is based on
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit d34e798094ca ("sched/fair: Refactor can_migrate_task() to
elimate looping")
With that quick tl;dr sorted, let us jump right into it.
Problem statement
=================
Following are Valentin's words taken from [1] with a few changes to the
reference commits now that PREEMPT_RT has landed upstream:
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as the one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.
This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
For PREEMPT_RT, the upstream kernel added commit 49a17639508c ("softirq:
Use a dedicated thread for timer wakeups on PREEMPT_RT.") which helped
this scenario for non-rwlock locks by ensuring the throttled task would
get PI'd to FIFO1 (ktimers' default priority). Unfortunately, rwlocks
cannot sanely do PI as they allow multiple readers.
Throttle deferral was discussed at OSPM'24 [2] and at LPC'24 [3], with
recordings for both available on YouTube.
Proposed approach
=================
This approach builds on Ben Segall's proposal in [4], which marked the
task in schedule() when exiting to usermode by setting an
"in_return_to_user" flag, except this prototype takes it a step further
and marks a "kernel critical section" within the syscall boundary using
a per-task "kernel_cs_count".
The rationale behind this approach is that a task can only hold
kernel resources when running in kernel mode in preemptible context. In
this PoC, the entire syscall boundary is marked as a kernel critical
section, but in the future, the API can be used to mark fine grained
boundaries such as between a down_read(), up_read() or down_write(),
up_write() pair.
Within a kernel critical section, throttling events are deferred until
the task's "kernel_cs_count" hits 0. Currently this count is an integer
to catch any cases where the count turns negative as a result of
oversights on my part, but it could be changed to a preempt_count()-like
mechanism to request a resched.
cfs_rq throttled picked again
v v
----|*********| (preempted by tick / wakeup) |***********| (full throttle)
^ ^
critical section cfs_rq is throttled partially critical section
entry since the task is still exit
runnable as it was preempted in
kernel critical section
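As a rough sketch of the counting scheme described above, here is a
userspace model. The names mirror the series ("kernel_cs_count",
sched_notify_critical_section_*()), but the struct layout, the
"resched_deferred" flag, and try_throttle() are illustrative
assumptions, not the actual kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the relevant task_struct bits (assumption). */
struct task {
	int kernel_cs_count;    /* >0: inside a kernel critical section */
	bool resched_deferred;  /* throttle arrived while in a CS */
};

static void sched_notify_critical_section_entry(struct task *p)
{
	p->kernel_cs_count++;
}

static void sched_notify_critical_section_exit(struct task *p)
{
	/* Integer count catches underflow oversights, per the text. */
	assert(p->kernel_cs_count > 0);
	if (--p->kernel_cs_count == 0 && p->resched_deferred) {
		p->resched_deferred = false;
		/* the series would call resched_curr() here (Patch 15) */
	}
}

/*
 * A throttle event hitting a task inside a critical section is
 * deferred; outside of one, it can take effect immediately.
 */
static bool try_throttle(struct task *p)
{
	if (p->kernel_cs_count > 0) {
		p->resched_deferred = true;
		return false;	/* deferred */
	}
	return true;		/* throttle now */
}
```

Usage follows the timeline diagram: a throttle inside the critical
section is deferred, and the deferred resched fires on critical section
exit.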
The EEVDF infrastructure is extended to track the avg_vruntime and the
avg_load of only those entities preempted in kernel mode. When a cfs_rq
is throttled, it uses these metrics to select among the kernel mode
preempted tasks and run them until they exit to user mode.
pick_eevdf() is made aware that it is operating on a throttled hierarchy
so that it only selects among the tasks that were preempted in kernel
mode (and the sched entities of the cfs_rq(s) that lead to them). When a
throttled entity's "kernel_cs_count" hits 0, the entire hierarchy is
frozen, but the hierarchy remains accessible for picking until that
point.
root
/ \
A B * (throttled)
... / | \
0 1* 2*
(*) Preempted in kernel mode
o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
o avg_kcs_load = load(1) + load(2)
o throttled_vruntime_eligible:
entity preempted in kernel mode &&
entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
o rbtree is augmented with a "min_kcs_vruntime" field in the sched
entity that propagates the smallest vruntime of all the entities in
the subtree that are preempted in kernel mode. For entities that were
executing in usermode when preempted, this will be set to LLONG_MAX.
This is used to construct a min-heap and select through the
entities. Consider rbtree of B to look like:
1*
/ \
2* 0
se->min_kcs_vruntime = se_in_kernel(se) ? se->vruntime : LLONG_MAX;
se->min_kcs_vruntime = min3(se->min_kcs_vruntime,
                            __node_2_se(rb_left)->min_kcs_vruntime,
                            __node_2_se(rb_right)->min_kcs_vruntime);
pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
tree to first check the left subtree for eligibility, then the node
itself, and then the right subtree.
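The traversal above can be sketched in userspace C. This is an
illustrative model only: "struct se" is flattened, the entity key is
taken to be the vruntime directly, and pick_kcs()/eligible() are made-up
names standing in for the series' pick_eevdf() changes:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Simplified stand-in for a sched_entity in the rbtree (assumption). */
struct se {
	long long vruntime;            /* used as the entity key here */
	long long min_kcs_vruntime;    /* LLONG_MAX: no in-kernel entity below */
	int in_kernel;                 /* preempted while in kernel mode */
	struct se *left, *right;
};

/* throttled_vruntime_eligible from the text:
 * in kernel mode && key * avg_kcs_load <= avg_kcs_vruntime */
static int eligible(const struct se *se, long long avg_kcs_vruntime,
		    long long avg_kcs_load)
{
	return se->in_kernel &&
	       se->vruntime * avg_kcs_load <= avg_kcs_vruntime;
}

/*
 * Min-heap style pick: prune any subtree whose min_kcs_vruntime is
 * LLONG_MAX (no kernel-mode preempted entity), otherwise check the
 * left subtree first, then the node itself, then the right subtree.
 */
static struct se *pick_kcs(struct se *se, long long avg_vr, long long avg_ld)
{
	struct se *found;

	if (!se || se->min_kcs_vruntime == LLONG_MAX)
		return NULL;
	found = pick_kcs(se->left, avg_vr, avg_ld);
	if (found)
		return found;
	if (eligible(se, avg_vr, avg_ld))
		return se;
	return pick_kcs(se->right, avg_vr, avg_ld);
}
```

With the example tree for B (nodes 1* and 2* in kernel mode, node 0 in
user mode), the pick lands on the eligible kernel-mode entity with the
earliest position in the deadline-sorted order.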
With proactive tracking, throttle deferral can retain throttling at per
cfs_rq granularity instead of moving to a per-task model.
On the throttling side, a third throttle state is introduced,
CFS_THROTTLED_PARTIAL, which indicates that the cfs_rq has run out of
its bandwidth but contains runnable (kernel mode preempted) entities. A
partially throttled cfs_rq is added to the throttled list so it can be
unthrottled in a timely manner during distribution, making all the tasks
queued on it runnable again. The throttle status can be promoted or
demoted depending on whether a kernel mode preempted task exits to user
mode, blocks, or wakes up on the hierarchy.
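The three states and the promote/demote decision can be sketched as
follows. determine_throttle_state() does exist in the series (Patch 20),
but this body and the "struct cfs_rq_model" fields are a guessed
userspace model, not the actual implementation:

```c
#include <assert.h>

enum cfs_throttle_state {
	CFS_UNTHROTTLED       = 0,
	CFS_THROTTLED_PARTIAL = 1, /* out of bandwidth, kernel-mode entities runnable */
	CFS_THROTTLED         = 2, /* out of bandwidth, nothing runnable */
};

/* Minimal stand-in for the relevant cfs_rq bits (assumption). */
struct cfs_rq_model {
	int runtime_remaining;
	int nr_kernel_cs;	/* kernel-mode preempted entities queued */
};

static enum cfs_throttle_state
determine_throttle_state(const struct cfs_rq_model *cfs_rq)
{
	if (cfs_rq->runtime_remaining > 0)
		return CFS_UNTHROTTLED;
	/* No bandwidth left: partial while in-kernel entities remain. */
	return cfs_rq->nr_kernel_cs ? CFS_THROTTLED_PARTIAL : CFS_THROTTLED;
}
```

The numeric values match the (0)/(1)/(2) transitions visible in the
throttle_cfs_rq/unthrottle_cfs_rq trace lines in the Testing section.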
The trickiest part of the series is the handling of the current task,
which is kept out of kernel mode status tracking since a running task
can make multiple syscalls and propagating those signals up the cgroup
hierarchy can be expensive. Instead, the status of the current task is
only looked at during schedule() to make throttling decisions. This
leads to some interesting considerations in some paths (see
curr_h_is_throttled() in Patch 8 for the EEVDF handling).
Testing
=======
The series was tested for correctness using a single hackbench instance
in a bandwidth controlled cgroup to observe the following set of events:
o Throttle into partial state followed by a full throttle
hackbench-1602 [044] dN.2. 55.684108: throttle_cfs_rq: cfs_rq throttled: (0) -> (1)
hackbench-1602 [044] dN.2. 55.684111: pick_task_fair: Task picked on throttled hierarchy: hackbench(1602)
hackbench-1602 [044] d.... 55.684117: sched_notify_critical_section_exit: Pick on throttled requesting resched
hackbench-1602 [044] dN.2. 55.684119: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
...
<idle>-0 [044] dNh2. 55.689677: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (0)
o Full throttle demoted to partial and then promoted back
<idle>-0 [006] dN.2. 55.592145: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (1)
<idle>-0 [006] dN.2. 55.592147: pick_task_fair: Task picked on throttled hierarchy: hackbench(1584)
...
hackbench-1584 [006] d.... 55.592154: sched_notify_critical_section_exit: Pick on throttled requesting resched
...
hackbench-1584 [006] dN.2. 55.592157: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
[ Note: (1) corresponds to CFS_THROTTLED_PARTIAL (throttled, but with
kernel mode preempted entities that are still runnable) and (2)
corresponds to CFS_THROTTLED (full throttle with no runnable tasks) ]
The stability testing was done by running the following script:
#!/bin/bash
mkdir /sys/fs/cgroup/CG0
echo "+cpu" > /sys/fs/cgroup/CG0/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/cgroup.subtree_control
mkdir /sys/fs/cgroup/CG0/CG00
mkdir /sys/fs/cgroup/CG0/CG01
echo "+cpu" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
mkdir /sys/fs/cgroup/CG0/CG00/CG000
mkdir /sys/fs/cgroup/CG0/CG00/CG001
mkdir /sys/fs/cgroup/CG0/CG01/CG000
mkdir /sys/fs/cgroup/CG0/CG01/CG001
echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
echo "3200000 100000" > /sys/fs/cgroup/CG0/cpu.max
echo "1600000 100000" > /sys/fs/cgroup/CG0/CG00/cpu.max
echo "1600000 100000" > /sys/fs/cgroup/CG0/CG01/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG000/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG001/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG000/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG001/cpu.max
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
time wait;
---
The above script was stress tested with 1, 2, 4, and 8 groups in a
64 vCPU VM, and with up to 64 groups on a 3rd Generation EPYC system
(2 x 64C/128T) without experiencing any obvious issues.
Since I had run into various issues initially with this prototype, I've
retained a debug patch towards the end that dumps the rbtree state when
pick_eevdf() fails to select a task. Since it is a debug patch, all the
checkpatch warnings and errors have been ignored. I apologize in advance
for any inconvenience there.
Broken bits and disclaimer
==========================
In ascending order of priority:
- A few of the added SCHED_WARN_ON() checks assume only the syscall
boundary is a critical section. These can go if this infrastructure is
used for more fine-grained deferral.
- !CONFIG_CFS_BANDWIDTH and !CONFIG_SCHED_AUTOGROUP have only been build
tested. Most (almost all) of the testing was done with
CONFIG_CFS_BANDWIDTH enabled on an x86 system and a VM.
- The current PoC only supports architectures that select GENERIC_ENTRY.
Other architectures need to use the sched_notify_critical_section*()
API to mark the kernel critical section boundaries.
- Some abstractions are not very clean. For example, the generic layer
uses se_in_kernel(), often passing &p->se as the argument. Instead, the
task_struct could be passed directly to a wrapper around
se_in_kernel().
- There is currently no way to opt out of throttle deferral; however, it
is possible to introduce a sysctl hook to enable / disable the feature
on demand by only allowing propagation of "kernel_cs_count" down the
cgroup hierarchy when the feature is enabled, and a static key to guard
the key decision points that can lead to a partial throttle. A simpler
option is a boot-time switch.
- On a partially throttled hierarchy, only kernel mode preempted tasks
and the entities that lead to them can make progress. This can
theoretically let them accumulate an unbounded amount of negative lag,
which can lead to long wait times for the other entities when the
hierarchy is unthrottled.
- PELT requires a closer look since the number of runnable entities on a
throttled hierarchy is not the same as "h_nr_runnable". Audit all
references to "cfs_rq->h_nr_runnable" and "se->runnable_sum" and
address those paths accordingly.
- Partially throttled hierarchies no longer set the hierarchical
indicators, namely "cfs_rq->throttled_count". Audit all uses of
throttled_hierarchy() to see if they need adjusting. At least two
cases, one in throttled_lb_pair(), and the other in the
class->task_is_throttled() callback used by core scheduling, require
auditing.
Clarifications required
=======================
- A kernel thread on a throttled hierarchy will always partially
unthrottle it and make itself runnable. Is this possible? (at least
one case with osnoise was found) Is this desired?
- A new task always starts off in kernel mode and can unthrottle the
hierarchy to make itself runnable. Is this a desired behavior?
- There are a few "XXX" and "TODO" notes along the way that I'll be
revisiting. Any comments there are greatly appreciated.
- Patch 21 seems to solve a heap integrity issue that I have
encountered. Late into stress testing I occasionally came across a
scenario where the rbroot does not have the correct min_kcs_vruntime
set. An example dump is as follows:
(Sorry in advance for the size of this dump)
CPU43: ----- current task ----
CPU43: pid(0) comm(swapper/43) task_cpu(43) task_on_rq_queued(1) task_on_rq_migrating(0) normal_policy(1) idle_policy(0)
CPU43: ----- current task done ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(1) h_nr_queued(24) h_nr_runnable(24) nr_queued(24) gse->kernel_cs_count(1)
CPU43: cfs_rq EEVDF: avg_vruntime(3043328) avg_load(24576) avg_kcs_vruntime(125952) avg_kcs_load(1024)
CPU43: ----- cfs_rq done ----
CPU43: ----- rbtree traversal: root ----
CPU43: se: load(1024) vruntime(6057209863) entity_key(12716) deadline(6059998121) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057198486) entity_key(1339) deadline(6059991924) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057128111) entity_key(-69036) deadline(6059919846) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057202668) entity_key(5521) deadline(6059894986) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057124320) entity_key(-72827) deadline(6059915644) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057196261) entity_key(-886) deadline(6059984139) min_vruntime(6057176590) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057176590) entity_key(-20557) deadline(6059967663) min_vruntime(6057176590) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057198406) entity_key(1259) deadline(6059990321) min_vruntime(6057198406) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057206114) entity_key(8967) deadline(6059995624) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057202212) entity_key(5065) deadline(6059994879) min_vruntime(6057197431) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057201766) entity_key(4619) deadline(6059993029) min_vruntime(6057197431) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057197431) entity_key(284) deadline(6059992361) min_vruntime(6057197431) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057201845) entity_key(4698) deadline(6059994381) min_vruntime(6057201845) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057201956) entity_key(4809) deadline(6059995233) min_vruntime(6057201956) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057203193) entity_key(6046) deadline(6059996320) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057199892) entity_key(2745) deadline(6059995624) min_vruntime(6057199892) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057205914) entity_key(8767) deadline(6059996907) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057197270) entity_key(123) deadline(6059997270) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(1) min_kcs_vruntime(6057197270) pick_entity(1)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057210834) entity_key(13687) deadline(6060002729) min_vruntime(6057209600) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057209600) entity_key(12453) deadline(6060000733) min_vruntime(6057209600) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057215347) entity_key(18200) deadline(6060005989) min_vruntime(6057215103) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057215103) entity_key(17956) deadline(6060004643) min_vruntime(6057215103) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057215276) entity_key(18129) deadline(6060007381) min_vruntime(6057215276) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057216042) entity_key(18895) deadline(6060008258) min_vruntime(6057216042) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- rbtree done ----
CPU43: ----- parent cfs_rq ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(312) avg_kcs_vruntime(0) avg_kcs_load(312)
CPU43: ----- cfs_rq done ----
CPU43: ----- parent cfs_rq ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(206) avg_kcs_vruntime(0) avg_kcs_load(206)
CPU43: ----- cfs_rq done ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(0) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(-1)
CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(156) avg_kcs_vruntime(0) avg_kcs_load(156)
CPU43: ----- cfs_rq done ----
The vruntime of the entity corresponding to "pick_entity(1)" should have
been propagated to the root as min_kcs_vruntime, but that does not seem
to happen. This happens unpredictably, and the logs are too overwhelming
to make sense of which series of rbtree operations leads up to it, but
grouping all the min_* members and setting them as RBAUGMENT, as
implemented in Patch 21, seems to make the issue go away.
If folks think this is a genuine bugfix and not me masking a bigger
issue, I can send this fix separately for min_vruntime and min_slice
propagation and rebase my series on top of it.
Credits
=======
This series is based on the former approaches from:
Valentin Schneider <vschneid@redhat.com>
Ben Segall <bsegall@google.com>
All the credit for this idea really is due to them - while the mistakes
in implementing it are likely mine :)
Thanks to Swapnil, Dhananjay, Ravi, and Gautham for helping devise the
test strategy, discussions around the intricacies of CPU controllers,
and their help at tackling multiple subtle bugs I encountered along the
way.
References
==========
[1]: https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
[2]: https://youtu.be/YAVxQKQABLw
[3]: https://youtu.be/_N-nXJHiDNo
[4]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com
---
K Prateek Nayak (22):
kernel/entry/common: Move syscall_enter_from_user_mode_work() out of
header
sched/fair: Convert "se->runnable_weight" to unsigned int and pack the
struct
[PoC] kernel/entry/common: Mark syscall as a kernel critical section
[PoC] kernel/sched: Initialize "kernel_cs_count" for new tasks
sched/fair: Track EEVDF stats for entities preempted in kernel mode
sched/fair: Propagate the min_vruntime of kernel mode preempted entity
sched/fair: Propagate preempted entity information up cgroup hierarchy
sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled
hierarchy
sched/fair: Introduce cfs_rq throttled states in preparation for
partial throttling
sched/fair: Prepare throttle_cfs_rq() to allow partial throttling
sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status
sched/fair: Prepare bandwidth distribution to unthrottle partial
throttles right away
sched/fair: Correct the throttle status supplied to pick_eevdf()
sched/fair: Mark a task if it was picked on a partially throttled
hierarchy
sched/fair: Call resched_curr() from sched_notify_syscall_exit()
sched/fair: Prepare enqueue to partially unthrottle cfs_rq
sched/fair: Clear pick on throttled indicator when task leave fair
class
sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled
hierarchy
sched/fair: Ignore in-kernel indicators for running task outside of
schedule()
sched/fair: Implement determine_throttle_state() for partial throttle
[MAYBE BUGFIX] sched/fair: Group all the se->min_* members together
for propagation
[DEBUG] sched/fair: Debug pick_eevdf() returning NULL!
include/linux/entry-common.h | 10 +-
include/linux/sched.h | 45 +-
init/init_task.c | 5 +-
kernel/entry/common.c | 17 +
kernel/entry/common.h | 4 +
kernel/sched/core.c | 8 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 935 ++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 8 +-
9 files changed, 950 insertions(+), 84 deletions(-)
base-commit: d34e798094ca7be935b629a42f8b237d4d5b7f1d
--
2.43.0
^ permalink raw reply [flat|nested] 36+ messages in thread
* [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 10:43 ` Peter Zijlstra
2025-02-20 9:32 ` [RFC PATCH 02/22] sched/fair: Convert "se->runnable_weight" to unsigned int and pack the struct K Prateek Nayak
` (21 subsequent siblings)
22 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Retain the prototype of syscall_enter_from_user_mode_work() in
linux/entry-common.h and move the function definition to
kernel/entry/common.c in preparation for notifying the scheduler of
tasks entering and exiting kernel mode for syscalls. The two
architectures that use it directly (x86, s390) and the four that call it
via syscall_enter_from_user_mode() (x86, riscv, loongarch, s390) all end
up selecting GENERIC_ENTRY; hence, no functional changes are intended.
syscall_enter_from_user_mode_work() will end up calling a function whose
visibility needs to be limited to kernel/* use only for CFS throttling
deferral.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/entry-common.h | 10 +---------
kernel/entry/common.c | 10 ++++++++++
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index fc61d0205c97..7569a49cf7a0 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -161,15 +161,7 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
* ptrace_report_syscall_entry(), __secure_computing(), trace_sys_enter()
* 2) Invocation of audit_syscall_entry()
*/
-static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
-{
- unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
-
- if (work & SYSCALL_WORK_ENTER)
- syscall = syscall_trace_enter(regs, syscall, work);
-
- return syscall;
-}
+long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
/**
* syscall_enter_from_user_mode - Establish state and check and handle work
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e33691d5adf7..cc93cdcc36d0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -79,6 +79,16 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
instrumentation_end();
}
+__always_inline long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
+{
+ unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
+
+ if (work & SYSCALL_WORK_ENTER)
+ syscall = syscall_trace_enter(regs, syscall, work);
+
+ return syscall;
+}
+
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
base-commit: d34e798094ca7be935b629a42f8b237d4d5b7f1d
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 02/22] sched/fair: Convert "se->runnable_weight" to unsigned int and pack the struct
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 03/22] [PoC] kernel/entry/common: Mark syscall as a kernel critical section K Prateek Nayak
` (20 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
se->runnable_weight tracks group_cfs_rq(se)->h_nr_running, which is an
unsigned int. Convert runnable_weight to an unsigned int to match the
types.
pahole notes that there is a 4 byte hole after se's "depth" member
since depth is an integer. Move depth below runnable_weight to close
the 4 byte hole and reclaim 8 bytes of space in total.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..34862d904ea3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -569,14 +569,14 @@ struct sched_entity {
u64 nr_migrations;
#ifdef CONFIG_FAIR_GROUP_SCHED
- int depth;
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
/* cached value of my_q->h_nr_running */
- unsigned long runnable_weight;
+ unsigned int runnable_weight;
+ int depth;
#endif
#ifdef CONFIG_SMP
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 03/22] [PoC] kernel/entry/common: Mark syscall as a kernel critical section
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 02/22] sched/fair: Convert "se->runnable_weight" to unsigned int and pack the struct K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 04/22] [PoC] kernel/sched: Initialize "kernel_cs_count" for new tasks K Prateek Nayak
` (19 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Mark the syscall boundary as a kernel critical section. Use a per-task
"kernel_cs_count" to track the task's entry from userspace and exit to
userspace. When "kernel_cs_count" is non-zero, the task is executing in
kernel mode.
For this Proof-of-Concept, "kernel_cs_count" can only be 1 or 0 for a
task and the implementation will run with the same assumption. The
critical section is defined as an integer count to allow fine grained
control in the future where certain boundaries within the kernel can be
marked as resource holding critical sections.
For the sake of simplicity, the whole kernel mode is marked as a
critical section in this PoC. For future extensibility,
sched_notify_critical_section_{entry,exit}() helpers are defined to mark
the boundaries of a kernel critical section, similar to the
preempt_count() mechanism.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 19 ++++++++++++++++++-
kernel/entry/common.c | 7 +++++++
kernel/entry/common.h | 4 ++++
kernel/sched/fair.c | 20 ++++++++++++++++++++
4 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 34862d904ea3..63f3f235a5c1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -577,7 +577,24 @@ struct sched_entity {
/* cached value of my_q->h_nr_running */
unsigned int runnable_weight;
int depth;
-#endif
+
+#ifdef CONFIG_CFS_BANDWIDTH
+ /*
+ * Keep track of tasks, and cfs_rq(s) that contains tasks
+ * running in kernel mode. Any throttling event for the
+ * cfs_rq will be deferred until this count hits 0.
+ *
+ * Semantics:
+ *
+ * - For task: It represents if the task is currently
+ * running in kernel mode. It is always 0 or 1.
+ *
+ * TODO: Describe for sched_entity when implementing.
+ */
+ int kernel_cs_count;
+ /* hole */
+#endif /* CONFIG_CFS_BANDWIDTH */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_SMP
/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index cc93cdcc36d0..b132b96e2b96 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -83,6 +83,8 @@ __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *regs, lon
{
unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
+ sched_notify_critical_section_entry();
+
if (work & SYSCALL_WORK_ENTER)
syscall = syscall_trace_enter(regs, syscall, work);
@@ -214,6 +216,11 @@ static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *reg
{
syscall_exit_to_user_mode_prepare(regs);
local_irq_disable_exit_to_user();
+ /*
+ * Notify scheduler that the task is exiting to userspace after a
+ * syscall. Must be called before checking for NEED_RESCHED work.
+ */
+ sched_notify_critical_section_exit();
exit_to_user_mode_prepare(regs);
}
diff --git a/kernel/entry/common.h b/kernel/entry/common.h
index f6e6d02f07fe..73e699a4c3e9 100644
--- a/kernel/entry/common.h
+++ b/kernel/entry/common.h
@@ -4,4 +4,8 @@
bool syscall_user_dispatch(struct pt_regs *regs);
+/* sched notifiers for CFS bandwidth deferral */
+extern void sched_notify_critical_section_entry(void);
+extern void sched_notify_critical_section_exit(void);
+
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 857808da23d8..becf2d35f35a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -58,6 +58,8 @@
#include "stats.h"
#include "autogroup.h"
+#include "../entry/common.h" /* critical section entry / exit notifiers */
+
/*
* The initial- and re-scaling of tunables is configurable
*
@@ -6704,6 +6706,20 @@ bool cfs_task_bw_constrained(struct task_struct *p)
return false;
}
+__always_inline void sched_notify_critical_section_entry(void)
+{
+ current->se.kernel_cs_count++;
+ /*
+ * Post this point, the task is considered to be in a kernel
+ * critical section and will defer bandwidth throttling.
+ */
+}
+
+__always_inline void sched_notify_critical_section_exit(void)
+{
+ current->se.kernel_cs_count--;
+}
+
#ifdef CONFIG_NO_HZ_FULL
/* called from pick_next_task_fair() */
static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
@@ -6772,6 +6788,10 @@ bool cfs_task_bw_constrained(struct task_struct *p)
return false;
}
#endif
+
+__always_inline void sched_notify_critical_section_entry(void) {}
+__always_inline void sched_notify_critical_section_exit(void) {}
+
#endif /* CONFIG_CFS_BANDWIDTH */
#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 04/22] [PoC] kernel/sched: Initialize "kernel_cs_count" for new tasks
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (2 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 03/22] [PoC] kernel/entry/common: Mark syscall as a kernel critical section K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 05/22] sched/fair: Track EEVDF stats for entities preempted in kernel mode K Prateek Nayak
` (18 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Since only archs that select GENERIC_ENTRY can track syscall entry and
exit for userspace tasks, set "kernel_cs_count" appropriately depending
on the arch.
For archs that select GENERIC_ENTRY, "kernel_cs_count" is initialized
to 1 since the task starts running by exiting out of a syscall without
a matching syscall entry.
For any future fine-grained tracking, the initial count must be adjusted
appropriately.
XXX: A kernel thread will always appear to be running a kernel critical
section. Is this desirable?
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
init/init_task.c | 5 ++++-
kernel/sched/core.c | 6 +++++-
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..90abbd248c6a 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -88,7 +88,10 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.fn = do_no_restart_syscall,
},
.se = {
- .group_node = LIST_HEAD_INIT(init_task.se.group_node),
+ .group_node = LIST_HEAD_INIT(init_task.se.group_node),
+#ifdef CONFIG_CFS_BANDWIDTH
+ .kernel_cs_count = (IS_ENABLED(CONFIG_GENERIC_ENTRY)) ? 1 : 0,
+#endif
},
.rt = {
.run_list = LIST_HEAD_INIT(init_task.rt.run_list),
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 165c90ba64ea..0851cdad9242 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4493,7 +4493,11 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
-#endif
+#ifdef CONFIG_CFS_BANDWIDTH
+ /* Only the arch that select GENERIC_ENTRY can defer throttling */
+ p->se.kernel_cs_count = (IS_ENABLED(CONFIG_GENERIC_ENTRY)) ? 1 : 0;
+#endif /* CONFIG_CFS_BANDWIDTH */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 05/22] sched/fair: Track EEVDF stats for entities preempted in kernel mode
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (3 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 04/22] [PoC] kernel/sched: Initialize "kernel_cs_count" for new tasks K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 06/22] sched/fair: Propagate the min_vruntime of kernel mode preempted entity K Prateek Nayak
` (17 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
With throttle deferral, throttled hierarchies will require picking only
among the kernel mode preempted entities queued on them.
Track EEVDF stats of kernel mode preempted entities in avg_kcs_vruntime
and avg_kcs_load, which are the same as avg_vruntime and avg_load
respectively, but only contain stats for the kernel mode preempted
entities queued on the rbtree.
Since all the eligibility checks are entity_key() based, also update
avg_kcs_vruntime when the min_vruntime of the cfs_rq changes.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 62 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 6 +++++
2 files changed, 68 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index becf2d35f35a..cbb7a227afe7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -523,6 +523,9 @@ static int se_is_idle(struct sched_entity *se)
static __always_inline
void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
+static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se);
+static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se);
+static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta);
/**************************************************************
* Scheduling class tree data structure manipulation methods:
@@ -630,6 +633,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->avg_vruntime += key * weight;
cfs_rq->avg_load += weight;
+ avg_kcs_vruntime_add(cfs_rq, se);
}
static void
@@ -640,6 +644,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
cfs_rq->avg_vruntime -= key * weight;
cfs_rq->avg_load -= weight;
+ avg_kcs_vruntime_sub(cfs_rq, se);
}
static inline
@@ -649,6 +654,7 @@ void avg_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
* v' = v + d ==> avg_vruntime' = avg_runtime - d*avg_load
*/
cfs_rq->avg_vruntime -= cfs_rq->avg_load * delta;
+ avg_kcs_vruntime_update(cfs_rq, delta);
}
/*
@@ -6720,6 +6726,58 @@ __always_inline void sched_notify_critical_section_exit(void)
current->se.kernel_cs_count--;
}
+static inline int se_in_kernel(struct sched_entity *se)
+{
+ return se->kernel_cs_count;
+}
+
+/*
+ * Same as avg_vruntime_add() except avg_kcs_vruntime_add() only adjusts the avg_kcs_vruntime
+ * and avg_kcs_load of kernel mode preempted entity when it joins the rbtree.
+ */
+static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ unsigned long weight;
+ s64 key;
+
+ if (!se_in_kernel(se))
+ return;
+
+ weight = scale_load_down(se->load.weight);
+ key = entity_key(cfs_rq, se);
+
+ cfs_rq->avg_kcs_vruntime += key * weight;
+ cfs_rq->avg_kcs_load += weight;
+}
+
+/*
+ * Same as avg_vruntime_sub() except avg_kcs_vruntime_sub() only adjusts the avg_kcs_vruntime
+ * and avg_kcs_load of kernel mode preempted entity when it leaves the rbtree.
+ */
+static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ unsigned long weight;
+ s64 key;
+
+ if (!se_in_kernel(se))
+ return;
+
+ weight = scale_load_down(se->load.weight);
+ key = entity_key(cfs_rq, se);
+
+ cfs_rq->avg_kcs_vruntime -= key * weight;
+ cfs_rq->avg_kcs_load -= weight;
+}
+
+/*
+ * Same as avg_vruntime_update() except it adjusts avg_kcs_vruntime based on avg_kcs_load
+ * when min_vruntime of the cfs_rq changes.
+ */
+static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
+{
+ cfs_rq->avg_kcs_vruntime -= cfs_rq->avg_kcs_load * delta;
+}
+
#ifdef CONFIG_NO_HZ_FULL
/* called from pick_next_task_fair() */
static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
@@ -6792,6 +6850,10 @@ bool cfs_task_bw_constrained(struct task_struct *p)
__always_inline void sched_notify_critical_section_entry(void) {}
__always_inline void sched_notify_critical_section_exit(void) {}
+static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) {}
+
#endif /* CONFIG_CFS_BANDWIDTH */
#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab16d3d0e51c..22567d236f82 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -658,6 +658,12 @@ struct cfs_rq {
s64 avg_vruntime;
u64 avg_load;
+#ifdef CONFIG_CFS_BANDWIDTH
+ /* EEVDF stats of entities preempted in kernel mode */
+ s64 avg_kcs_vruntime;
+ u64 avg_kcs_load;
+#endif
+
u64 min_vruntime;
#ifdef CONFIG_SCHED_CORE
unsigned int forceidle_seq;
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 06/22] sched/fair: Propagate the min_vruntime of kernel mode preempted entity
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (4 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 05/22] sched/fair: Track EEVDF stats for entities preempted in kernel mode K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 07/22] sched/fair: Propagate preempted entity information up cgroup hierarchy K Prateek Nayak
` (16 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Propagate the min_vruntime of kernel mode preempted entities to the
root of the cfs_rq's rbtree. This will soon be used to pick among the
kernel mode entities on a throttled hierarchy using a min-heap approach
similar to the one pick_eevdf() currently implements.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 6 ++++++
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63f3f235a5c1..4bb7e45758f4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -593,6 +593,12 @@ struct sched_entity {
*/
int kernel_cs_count;
/* hole */
+
+ /*
+ * min_vruntime of the kernel mode preempted entities
+ * in the subtree of this sched entity.
+ */
+ s64 min_kcs_vruntime;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cbb7a227afe7..ba1bd60ce433 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -828,6 +828,9 @@ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *n
}
}
+static __always_inline void init_se_kcs_stats(struct sched_entity *se);
+static inline bool min_kcs_vruntime_update(struct sched_entity *se);
+
/*
* se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
*/
@@ -836,6 +839,7 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
u64 old_min_vruntime = se->min_vruntime;
u64 old_min_slice = se->min_slice;
struct rb_node *node = &se->run_node;
+ bool kcs_stats_unchanged = min_kcs_vruntime_update(se);
se->min_vruntime = se->vruntime;
__min_vruntime_update(se, node->rb_right);
@@ -846,7 +850,8 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
__min_slice_update(se, node->rb_left);
return se->min_vruntime == old_min_vruntime &&
- se->min_slice == old_min_slice;
+ se->min_slice == old_min_slice &&
+ kcs_stats_unchanged;
}
RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
@@ -858,6 +863,7 @@ RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
avg_vruntime_add(cfs_rq, se);
+ init_se_kcs_stats(se);
se->min_vruntime = se->vruntime;
se->min_slice = se->slice;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
@@ -6778,6 +6784,39 @@ static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 d
cfs_rq->avg_kcs_vruntime -= cfs_rq->avg_kcs_load * delta;
}
+static __always_inline void init_se_kcs_stats(struct sched_entity *se)
+{
+ /*
+ * With the introduction of EEVDF, the vruntime of entities can go negative when
+ * a lagging entity joins a runqueue with avg_vruntime < vlag. Use LLONG_MAX as
+ * the upper bound to differentiate the case where no kernel mode preempted
+ * entities are queued on the subtree.
+ */
+ se->min_kcs_vruntime = (se_in_kernel(se)) ? se->vruntime : LLONG_MAX;
+}
+
+static inline void __min_kcs_vruntime_update(struct sched_entity *se, struct rb_node *node)
+{
+ if (node) {
+ struct sched_entity *rse = __node_2_se(node);
+
+ if (rse->min_kcs_vruntime < se->min_kcs_vruntime)
+ se->min_kcs_vruntime = rse->min_kcs_vruntime;
+ }
+}
+
+static inline bool min_kcs_vruntime_update(struct sched_entity *se)
+{
+ u64 old_min_kcs_vruntime = se->min_kcs_vruntime;
+ struct rb_node *node = &se->run_node;
+
+ init_se_kcs_stats(se);
+ __min_kcs_vruntime_update(se, node->rb_right);
+ __min_kcs_vruntime_update(se, node->rb_left);
+
+ return se->min_kcs_vruntime == old_min_kcs_vruntime;
+}
+
#ifdef CONFIG_NO_HZ_FULL
/* called from pick_next_task_fair() */
static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
@@ -6853,6 +6892,12 @@ __always_inline void sched_notify_critical_section_exit(void) {}
static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) {}
+static __always_inline void init_se_kcs_stats(struct sched_entity *se) {}
+
+static inline bool min_kcs_vruntime_update(struct sched_entity *se)
+{
+ return true;
+}
#endif /* CONFIG_CFS_BANDWIDTH */
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 07/22] sched/fair: Propagate preempted entity information up cgroup hierarchy
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (5 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 06/22] sched/fair: Propagate the min_vruntime of kernel mode preempted entity K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 08/22] sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled hierarchy K Prateek Nayak
` (15 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Use the "kernel_cs_count" in a cfs_rq's sched entity to track the
number of kernel mode preempted entities queued on the cfs_rq. Since
the current task can frequently switch in and out of kernel mode,
"cfs_rq->curr" and the whole sched entity hierarchy of the current task
are treated specially.
This is similar to "runnable_sum" except that "kernel_cs_count" does
not have a corresponding count in struct cfs_rq.
The counts are managed at enqueue, dequeue, and task pick boundaries
since the entities on the rbtree have a stable "kernel_cs_count". Use
min_vruntime_cb_propagate(), generated from RB_DECLARE_CALLBACKS(), to
propagate the adjustments to the root of the rbtree.
Since propagation requires the state of the task being enqueued /
dequeued / put / set, expose se_in_kernel() to generic code to
streamline the propagation.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 7 ++-
kernel/sched/fair.c | 121 ++++++++++++++++++++++++++++++++++++++++--
2 files changed, 124 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4bb7e45758f4..48115de839a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -589,7 +589,12 @@ struct sched_entity {
* - For task: It represents if the task is currently
* running in kernel mode. It is always 0 or 1.
*
- * TODO: Describe for sched_entity when implementing.
+ * - For cfs_rq: It represents the sum of the
+ * kernel_cs_count of the entities queued below it
+ * except for the "current". Since the current task can
+ * frequently switch in and out of kernel mode, its
+ * hierarchy is treated special and its contribution
+ * is accounted at key decision points.
*/
int kernel_cs_count;
/* hole */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba1bd60ce433..9f28624e4442 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6732,7 +6732,7 @@ __always_inline void sched_notify_critical_section_exit(void)
current->se.kernel_cs_count--;
}
-static inline int se_in_kernel(struct sched_entity *se)
+static __always_inline int se_in_kernel(struct sched_entity *se)
{
return se->kernel_cs_count;
}
@@ -6817,6 +6817,76 @@ static inline bool min_kcs_vruntime_update(struct sched_entity *se)
return se->min_kcs_vruntime == old_min_kcs_vruntime;
}
+static inline void account_kcs_enqueue(struct cfs_rq *gcfs_rq, bool in_kernel)
+{
+ struct sched_entity *se;
+ struct cfs_rq *cfs_rq;
+
+ if (!in_kernel)
+ return;
+
+ se = gcfs_rq->tg->se[cpu_of(rq_of(gcfs_rq))];
+ if (!se)
+ return;
+
+ cfs_rq = cfs_rq_of(se);
+ se->kernel_cs_count++;
+
+ /* se has transitioned into a kernel mode preempted entity */
+ if (se->kernel_cs_count == 1 && se != cfs_rq->curr && se->on_rq) {
+ /*
+ * Must be done after "kernel_cs_count" has been
+ * incremented since avg_kcs_vruntime_add() only
+ * adjusts the stats if it believes the entity is in
+ * the kernel mode.
+ */
+ avg_kcs_vruntime_add(cfs_rq, se);
+
+ /* Propagate min_kcs_vruntime_update() till rb_root */
+ min_vruntime_cb_propagate(&se->run_node, NULL);
+ }
+
+ /* Sanity check */
+ SCHED_WARN_ON(se->kernel_cs_count > gcfs_rq->h_nr_queued);
+}
+
+static inline void account_kcs_dequeue(struct cfs_rq *gcfs_rq, bool in_kernel)
+{
+ struct sched_entity *se;
+ struct cfs_rq *cfs_rq;
+ bool transition_out;
+
+ if (!in_kernel)
+ return;
+
+ se = gcfs_rq->tg->se[cpu_of(rq_of(gcfs_rq))];
+ if (!se)
+ return;
+
+ cfs_rq = cfs_rq_of(se);
+ transition_out = se->kernel_cs_count == 1;
+
+ /*
+ * Discount the load and avg_kcs_vruntime contribution if the
+ * entity is transitioning out. Must be done before
+ * "kernel_cs_count" is decremented as avg_kcs_vruntime_sub()
+ * should still consider it to be in kernel mode to adjust
+ * the stats.
+ */
+ if (transition_out && se != cfs_rq->curr && se->on_rq)
+ avg_kcs_vruntime_sub(cfs_rq, se);
+
+ se->kernel_cs_count--;
+
+ /* Propagate min_kcs_vruntime_update() till rb_root */
+ if (transition_out && se != cfs_rq->curr && se->on_rq)
+ min_vruntime_cb_propagate(&se->run_node, NULL);
+
+ /* Sanity checks */
+ SCHED_WARN_ON(se->kernel_cs_count > gcfs_rq->h_nr_queued);
+ SCHED_WARN_ON(se->kernel_cs_count < 0);
+}
+
#ifdef CONFIG_NO_HZ_FULL
/* called from pick_next_task_fair() */
static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
@@ -6889,6 +6959,11 @@ bool cfs_task_bw_constrained(struct task_struct *p)
__always_inline void sched_notify_critical_section_entry(void) {}
__always_inline void sched_notify_critical_section_exit(void) {}
+static __always_inline int se_in_kernel(struct sched_entity *se)
+{
+ return false;
+}
+
static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) {}
@@ -6899,6 +6974,9 @@ static inline bool min_kcs_vruntime_update(struct sched_entity *se)
return true;
}
+static inline void account_kcs_enqueue(struct cfs_rq *gcfs_rq, bool in_kernel) {}
+static inline void account_kcs_dequeue(struct cfs_rq *gcfs_rq, bool in_kernel) {}
+
#endif /* CONFIG_CFS_BANDWIDTH */
#if !defined(CONFIG_CFS_BANDWIDTH) || !defined(CONFIG_NO_HZ_FULL)
@@ -7056,6 +7134,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ /* A task on CPU has its in-kernel stats discounted */
+ bool task_in_kernel = !task_on_cpu(rq, p) && se_in_kernel(se);
int h_nr_idle = task_has_idle_policy(p);
int h_nr_runnable = 1;
int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -7110,6 +7190,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_runnable += h_nr_runnable;
cfs_rq->h_nr_queued++;
cfs_rq->h_nr_idle += h_nr_idle;
+ account_kcs_enqueue(cfs_rq, task_in_kernel);
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
@@ -7136,6 +7217,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
cfs_rq->h_nr_runnable += h_nr_runnable;
cfs_rq->h_nr_queued++;
cfs_rq->h_nr_idle += h_nr_idle;
+ account_kcs_enqueue(cfs_rq, task_in_kernel);
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
@@ -7196,6 +7278,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
bool task_sleep = flags & DEQUEUE_SLEEP;
bool task_delayed = flags & DEQUEUE_DELAYED;
struct task_struct *p = NULL;
+ bool task_in_kernel = false;
int h_nr_idle = 0;
int h_nr_queued = 0;
int h_nr_runnable = 0;
@@ -7205,6 +7288,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (entity_is_task(se)) {
p = task_of(se);
h_nr_queued = 1;
+ /* A task on CPU has its in-kernel stats discounted */
+ task_in_kernel = !task_on_cpu(rq, p) && se_in_kernel(se);
h_nr_idle = task_has_idle_policy(p);
if (task_sleep || task_delayed || !se->sched_delayed)
h_nr_runnable = 1;
@@ -7226,6 +7311,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_runnable -= h_nr_runnable;
cfs_rq->h_nr_queued -= h_nr_queued;
cfs_rq->h_nr_idle -= h_nr_idle;
+ account_kcs_dequeue(cfs_rq, task_in_kernel);
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
@@ -7267,6 +7353,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq->h_nr_runnable -= h_nr_runnable;
cfs_rq->h_nr_queued -= h_nr_queued;
cfs_rq->h_nr_idle -= h_nr_idle;
+ account_kcs_dequeue(cfs_rq, task_in_kernel);
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
@@ -9029,6 +9116,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
*/
if (prev != p) {
struct sched_entity *pse = &prev->se;
+ bool prev_in_kernel = task_on_rq_queued(prev) && se_in_kernel(pse);
+ bool next_in_kernel = se_in_kernel(se);
struct cfs_rq *cfs_rq;
while (!(cfs_rq = is_same_group(se, pse))) {
@@ -9036,18 +9125,35 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
int pse_depth = pse->depth;
if (se_depth <= pse_depth) {
- put_prev_entity(cfs_rq_of(pse), pse);
+ cfs_rq = cfs_rq_of(pse);
+
+ account_kcs_enqueue(cfs_rq, prev_in_kernel);
+ put_prev_entity(cfs_rq, pse);
pse = parent_entity(pse);
}
if (se_depth >= pse_depth) {
- set_next_entity(cfs_rq_of(se), se);
+ cfs_rq = cfs_rq_of(se);
+
+ set_next_entity(cfs_rq, se);
se = parent_entity(se);
+ account_kcs_dequeue(cfs_rq, next_in_kernel);
}
}
put_prev_entity(cfs_rq, pse);
set_next_entity(cfs_rq, se);
+ if (prev_in_kernel != next_in_kernel) {
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+
+ if (prev_in_kernel)
+ account_kcs_enqueue(cfs_rq, prev_in_kernel);
+ else
+ account_kcs_dequeue(cfs_rq, next_in_kernel);
+ }
+ }
+
__set_next_task_fair(rq, p, true);
}
@@ -9114,10 +9220,17 @@ void fair_server_init(struct rq *rq)
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next)
{
struct sched_entity *se = &prev->se;
+ /*
+ * When next is NULL, it is a save-restore operation. If the task is no
+ * longer queued on the rq, the stats have been already discounted at
+ * pick and should not be adjusted here.
+ */
+ bool task_in_kernel = next && task_on_rq_queued(prev) && se_in_kernel(se);
struct cfs_rq *cfs_rq;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
+ account_kcs_enqueue(cfs_rq, task_in_kernel);
put_prev_entity(cfs_rq, se);
}
}
@@ -13424,11 +13537,13 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
{
struct sched_entity *se = &p->se;
+ bool task_in_kernel = !task_on_cpu(rq, p) && se_in_kernel(se);
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
set_next_entity(cfs_rq, se);
+ account_kcs_dequeue(cfs_rq, task_in_kernel);
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 08/22] sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled hierarchy
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (6 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 07/22] sched/fair: Propagate preempted entity information up cgroup hierarchy K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 09/22] sched/fair: Introduce cfs_rq throttled states in preparation for partial throttling K Prateek Nayak
` (14 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
pick_task_fair() makes sure that a throttled cfs_rq is dequeued and
unreachable to pick_eevdf() when it encounters one during the pick.
With deferred throttling, it is possible for a throttled cfs_rq to have
kernel mode preempted entities queued on it, making it runnable.
Allow pick_eevdf() to find and return only the kernel mode preempted
entities when picking on a throttled hierarchy. Introduce two new
wrappers, pick_entity() and pick_subtree(), around entity_eligible() to
abstract away the nuances of picking on a throttled hierarchy.
Introduce pick_se_on_throttled() and pick_subtree_on_throttled() to
find the eligibility of a kernel mode preempted entity or its subtree
amongst all kernel mode preempted entities still queued on the rbtree
using the EEVDF stats.
Since the current task is outside the "kernel_cs_count" tracking, take
special care when accounting for it in pick_*_on_throttled().
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 134 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 124 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f28624e4442..4fd70012b479 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -526,6 +526,9 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se);
static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se);
static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta);
+static __always_inline int pick_se_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se);
+static __always_inline
+int pick_subtree_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se);
/**************************************************************
* Scheduling class tree data structure manipulation methods:
@@ -750,6 +753,24 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
return vruntime_eligible(cfs_rq, se->vruntime);
}
+static __always_inline
+int pick_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, bool h_throttled)
+{
+ if (unlikely(h_throttled))
+ return pick_se_on_throttled(cfs_rq, se);
+
+ return vruntime_eligible(cfs_rq, se->vruntime);
+}
+
+static __always_inline
+int pick_subtree(struct cfs_rq *cfs_rq, struct sched_entity *se, bool h_throttled)
+{
+ if (unlikely(h_throttled))
+ return pick_subtree_on_throttled(cfs_rq, se);
+
+ return vruntime_eligible(cfs_rq, se->min_vruntime);
+}
+
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
{
u64 min_vruntime = cfs_rq->min_vruntime;
@@ -936,7 +957,7 @@ static inline void cancel_protect_slice(struct sched_entity *se)
*
* Which allows tree pruning through eligibility.
*/
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool h_throttled)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
struct sched_entity *se = __pick_first_entity(cfs_rq);
@@ -950,14 +971,14 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (cfs_rq->nr_queued == 1)
return curr && curr->on_rq ? curr : se;
- if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+ if (curr && (!curr->on_rq || !pick_entity(cfs_rq, curr, h_throttled)))
curr = NULL;
if (sched_feat(RUN_TO_PARITY) && curr && protect_slice(curr))
return curr;
/* Pick the leftmost entity if it's eligible */
- if (se && entity_eligible(cfs_rq, se)) {
+ if (se && pick_entity(cfs_rq, se, h_throttled)) {
best = se;
goto found;
}
@@ -970,8 +991,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
* Eligible entities in left subtree are always better
* choices, since they have earlier deadlines.
*/
- if (left && vruntime_eligible(cfs_rq,
- __node_2_se(left)->min_vruntime)) {
+ if (left && pick_subtree(cfs_rq, __node_2_se(left), h_throttled)) {
node = left;
continue;
}
@@ -983,7 +1003,7 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
* entity, so check the current node since it is the one
* with earliest deadline that might be eligible.
*/
- if (entity_eligible(cfs_rq, se)) {
+ if (pick_entity(cfs_rq, se, h_throttled)) {
best = se;
break;
}
@@ -5601,14 +5621,14 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
/*
* Picking the ->next buddy will affect latency but not fairness.
*/
- if (sched_feat(PICK_BUDDY) &&
- cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) {
+ if (sched_feat(PICK_BUDDY) && cfs_rq->next &&
+ pick_entity(cfs_rq, cfs_rq->next, throttled_hierarchy(cfs_rq))) {
/* ->next will never be delayed */
SCHED_WARN_ON(cfs_rq->next->sched_delayed);
return cfs_rq->next;
}
- se = pick_eevdf(cfs_rq);
+ se = pick_eevdf(cfs_rq, throttled_hierarchy(cfs_rq));
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
/*
@@ -6795,6 +6815,89 @@ static __always_inline void init_se_kcs_stats(struct sched_entity *se)
se->min_kcs_vruntime = (se_in_kernel(se)) ? se->vruntime : LLONG_MAX;
}
+/*
+ * Current task is outside the hierarchy during pick_eevdf(). Since
+ * "kernel_cs_count" has not been adjusted yet with put_prev_entity(),
+ * its preempted status is not accounted to the hierarchy. Check the
+ * status of current task too when accounting the contribution of
+ * cfs_rq->curr during the pick.
+ */
+static inline bool curr_h_is_throttled(struct cfs_rq *cfs_rq)
+{
+ struct sched_entity *curr = cfs_rq->curr;
+ struct task_struct *p;
+
+ /* Current hierarchy has been dequeued. */
+ if (!curr || !curr->on_rq)
+ return false;
+
+ /*
+ * There are kernel mode preempted tasks
+ * queued below this cfs_rq.
+ */
+ if (se_in_kernel(cfs_rq->curr))
+ return true;
+
+ p = rq_of(cfs_rq)->curr;
+ /* Current task has been dequeued. */
+ if (!task_on_rq_queued(p))
+ return false;
+
+ /* Current task is still in kernel mode. */
+ return se_in_kernel(&p->se);
+}
+
+/* Same as vruntime eligible except this works with avg_kcs_vruntime and avg_kcs_load. */
+static __always_inline
+int throttled_vruntime_eligible(struct cfs_rq *cfs_rq, u64 vruntime, bool account_curr)
+{
+ s64 avg = cfs_rq->avg_kcs_vruntime;
+ long load = cfs_rq->avg_kcs_load;
+
+ if (account_curr) {
+ struct sched_entity *curr = cfs_rq->curr;
+ unsigned long weight = scale_load_down(curr->load.weight);
+
+ avg += entity_key(cfs_rq, curr) * weight;
+ load += weight;
+ }
+
+ return avg >= (s64)(vruntime - cfs_rq->min_vruntime) * load;
+}
+
+/* Same as entity_eligible() but for throttled hierarchy. */
+static __always_inline int pick_se_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ bool account_curr = curr_h_is_throttled(cfs_rq);
+
+ if (se == cfs_rq->curr) {
+ /*
+ * If cfs_rq->curr is not accountable, it implies there
+ * are no more kernel mode preempted tasks below it.
+ */
+ if (!account_curr)
+ return false;
+ } else if (!se_in_kernel(se))
+ return false;
+
+ return throttled_vruntime_eligible(cfs_rq,
+ se->vruntime,
+ account_curr);
+}
+
+/* Similar to entity_eligible(cfs_rq, se->min_vruntime) but for throttled hierarchy. */
+static __always_inline
+int pick_subtree_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ /* There are no kernel mode preempted entities in the subtree. */
+ if (se->min_kcs_vruntime == LLONG_MAX)
+ return false;
+
+ return throttled_vruntime_eligible(cfs_rq,
+ se->min_kcs_vruntime,
+ curr_h_is_throttled(cfs_rq));
+}
+
static inline void __min_kcs_vruntime_update(struct sched_entity *se, struct rb_node *node)
{
if (node) {
@@ -6969,6 +7072,17 @@ static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct s
static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) {}
static __always_inline void init_se_kcs_stats(struct sched_entity *se) {}
+static __always_inline int pick_se_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ return vruntime_eligible(cfs_rq, se->vruntime);
+}
+
+static __always_inline
+int pick_subtree_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ return vruntime_eligible(cfs_rq, se->min_vruntime);
+}
+
static inline bool min_kcs_vruntime_update(struct sched_entity *se)
{
return true;
@@ -9045,7 +9159,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
/*
* If @p has become the most eligible task, force preemption.
*/
- if (pick_eevdf(cfs_rq) == pse)
+ if (pick_eevdf(cfs_rq, throttled_hierarchy(cfs_rq)) == pse)
goto preempt;
return;
--
2.43.0
* [RFC PATCH 09/22] sched/fair: Introduce cfs_rq throttled states in preparation for partial throttling
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (7 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 08/22] sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled hierarchy K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 10/22] sched/fair: Prepare throttle_cfs_rq() to allow " K Prateek Nayak
` (13 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
With the introduction of throttle deferral, a cfs_rq that encounters a
throttle event while kernel mode preempted entities are queued on it
will be marked partially throttled, allowing only those kernel mode
entities to run until they sleep or exit to userspace.
Introduce a "throttle_state" enum to define the three throttle states -
CFS_UNTHROTTLED, CFS_THROTTLED_PARTIAL, and CFS_THROTTLED. In addition
to cfs_rq_throttled(), which will now track both partial and complete
throttle, introduce a new helper cfs_rq_h_throttled() to detect a
completely throttled hierarchy.
Convert the appropriate cfs_rq_throttled() checks to cfs_rq_h_throttled()
to guard logic that is only concerned with a complete throttle. Also
take the opportunity to convert any open coded references to
cfs_rq->throttled to use one of the two helpers.
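The distinction between the two helpers can be sketched as a small
userspace model; the enum values mirror the ones in the diff below, while
the model_-prefixed helpers are illustrative stand-ins for the kernel's
cfs_rq_throttled() and cfs_rq_h_throttled():

```c
#include <assert.h>

/* Mirrors the cfs_rq->throttled states introduced by this patch. */
enum throttle_state {
	CFS_UNTHROTTLED = 0,
	CFS_THROTTLED_PARTIAL,	/* only kernel mode preempted entities run */
	CFS_THROTTLED,		/* fully throttled, PELT frozen */
};

/* cfs_rq_throttled(): true for both partial and complete throttle. */
static int model_cfs_rq_throttled(enum throttle_state s)
{
	return s != CFS_UNTHROTTLED;
}

/* cfs_rq_h_throttled(): true only for a completely frozen hierarchy. */
static int model_cfs_rq_h_throttled(enum throttle_state s)
{
	return s == CFS_THROTTLED;
}
```

This is why the original `cfs_rq->throttled = 1` assignment keeps working
unchanged as a truth value: any nonzero state still reads as throttled,
and only CFS_THROTTLED freezes the hierarchy.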
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 67 ++++++++++++++++++++++++++++++++++-----------
1 file changed, 51 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4fd70012b479..c84cd2d92343 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5340,7 +5340,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
}
static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+
+/* cfs_rq is throttled either completely or partially */
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+/* cfs_rq is throttled completely and the hierarchy is frozen */
+static inline int cfs_rq_h_throttled(struct cfs_rq *cfs_rq);
static void
requeue_delayed_entity(struct sched_entity *se);
@@ -5404,7 +5408,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
#ifdef CONFIG_CFS_BANDWIDTH
struct rq *rq = rq_of(cfs_rq);
- if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
+ if (cfs_rq_h_throttled(cfs_rq) && !cfs_rq->throttled_clock)
cfs_rq->throttled_clock = rq_clock(rq);
if (!cfs_rq->throttled_clock_self)
cfs_rq->throttled_clock_self = rq_clock(rq);
@@ -5448,7 +5452,7 @@ static void set_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable--;
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
break;
}
}
@@ -5470,7 +5474,7 @@ static void clear_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable++;
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
break;
}
}
@@ -5817,7 +5821,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
if (likely(cfs_rq->runtime_remaining > 0))
return;
- if (cfs_rq->throttled)
+ if (cfs_rq_throttled(cfs_rq))
return;
/*
* if we're unable to extend our runtime we resched so that the active
@@ -5836,11 +5840,37 @@ void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
__account_cfs_rq_runtime(cfs_rq, delta_exec);
}
+/* cfs_rq->throttled states */
+enum throttle_state {
+ /*
+ * cfs_rq is unthrottled; All the queued entities
+ * can be picked to run.
+ */
+ CFS_UNTHROTTLED = 0,
+ /*
+ * cfs_rq is only marked throttled. There are kernel
+ * mode preempted entities that are still runnable.
+ * PELT is not frozen yet.
+ */
+ CFS_THROTTLED_PARTIAL,
+ /*
+ * cfs_rq is fully throttled with PELT frozen. There
+ * are no entities that are considered runnable under
+ * throttle.
+ */
+ CFS_THROTTLED
+};
+
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
return cfs_bandwidth_used() && cfs_rq->throttled;
}
+static inline int cfs_rq_h_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_bandwidth_used() && (cfs_rq->throttled == CFS_THROTTLED);
+}
+
/* check whether cfs_rq, or any parent, is throttled */
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
@@ -6011,7 +6041,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
* Note: distribution will already see us throttled via the
* throttled-list. rq->lock protects completion.
*/
- cfs_rq->throttled = 1;
+ cfs_rq->throttled = CFS_THROTTLED;
SCHED_WARN_ON(cfs_rq->throttled_clock);
if (cfs_rq->nr_queued)
cfs_rq->throttled_clock = rq_clock(rq);
@@ -6028,7 +6058,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
se = cfs_rq->tg->se[cpu_of(rq)];
- cfs_rq->throttled = 0;
+ cfs_rq->throttled = CFS_UNTHROTTLED;
update_rq_clock(rq);
@@ -6080,7 +6110,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_idle += idle_delta;
/* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
+ if (cfs_rq_h_throttled(qcfs_rq))
goto unthrottle_throttle;
}
@@ -6098,7 +6128,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
qcfs_rq->h_nr_idle += idle_delta;
/* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
+ if (cfs_rq_h_throttled(qcfs_rq))
goto unthrottle_throttle;
}
@@ -6503,7 +6533,7 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
* it's possible for a throttled entity to be forced into a running
* state (e.g. set_curr_task), in this case we're finished.
*/
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
return true;
return throttle_cfs_rq(cfs_rq);
@@ -7029,6 +7059,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return 0;
}
+static inline int cfs_rq_h_throttled(struct cfs_rq *cfs_rq)
+{
+ return 0;
+}
+
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
return 0;
@@ -7310,7 +7345,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
h_nr_idle = 1;
/* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
goto enqueue_throttle;
flags = ENQUEUE_WAKEUP;
@@ -7337,7 +7372,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
h_nr_idle = 1;
/* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
goto enqueue_throttle;
}
@@ -7431,7 +7466,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
h_nr_idle = h_nr_queued;
/* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
return 0;
/* Don't dequeue parent if it has other entities besides us */
@@ -7472,8 +7507,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
+ /* end evaluation on encountering a throttled cfs_rq hierarchy */
+ if (cfs_rq_h_throttled(cfs_rq))
return 0;
}
@@ -13519,7 +13554,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
return;
if (!throttled_hierarchy(cfs_rq))
@@ -13533,7 +13568,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);
- if (cfs_rq_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq))
break;
if (!throttled_hierarchy(cfs_rq))
--
2.43.0
* [RFC PATCH 10/22] sched/fair: Prepare throttle_cfs_rq() to allow partial throttling
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (8 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 09/22] sched/fair: Introduce cfs_rq throttled states in preparation for partial throttling K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 11/22] sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status K Prateek Nayak
` (12 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
throttle_cfs_rq() will determine the throttle state the first time it
is called, using the not yet fully implemented
determine_throttle_state() helper, to completely or partially throttle
a cfs_rq when it has run out of bandwidth.
For a partial throttle, throttle_cfs_rq() will only set
cfs_rq->throttled to CFS_THROTTLED_PARTIAL, leaving the rest unchanged.
When a partially throttled hierarchy needs to be upgraded, once all the
kernel mode preempted entities yield or exit to userspace,
throttle_cfs_rq() will be called again to upgrade the throttle status.
determine_throttle_state() currently only returns CFS_THROTTLED and
will be fully implemented in subsequent commits once the foundation is
laid.
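The throttle flow described above amounts to a small state machine. The
sketch below is a userspace model of it; the model_-prefixed names are
stand-ins, and the behaviour of model_determine_throttle_state() is the
*assumed* future behaviour once the CFS_THROTTLED_PARTIAL plumbing lands
in later patches, not what this patch implements:

```c
#include <assert.h>

enum throttle_state {
	CFS_UNTHROTTLED = 0,
	CFS_THROTTLED_PARTIAL,
	CFS_THROTTLED,
};

/*
 * Assumed future behaviour: stay partially throttled while kernel
 * mode preempted entities remain queued; throttle fully otherwise.
 */
static enum throttle_state model_determine_throttle_state(int nr_kernel_cs)
{
	return nr_kernel_cs ? CFS_THROTTLED_PARTIAL : CFS_THROTTLED;
}

/*
 * Model of the repeated-call contract: the first call may leave the
 * cfs_rq partially throttled; a later call, after the kernel mode
 * entities have drained, upgrades it to a full throttle.
 */
static enum throttle_state
model_throttle_cfs_rq(enum throttle_state prev, int nr_kernel_cs)
{
	enum throttle_state next = model_determine_throttle_state(nr_kernel_cs);

	if (next == prev)
		return prev;	/* nothing to do */
	return next;		/* fresh throttle, or partial -> full upgrade */
}
```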
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 47 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c84cd2d92343..8e1df614e82f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5894,6 +5894,16 @@ static inline int throttled_lb_pair(struct task_group *tg,
throttled_hierarchy(dest_cfs_rq);
}
+static enum throttle_state
+determine_throttle_state(struct cfs_rq *gcfs_rq, struct sched_entity *se)
+{
+ /*
+ * TODO: Implement rest once plumbing for
+ * CFS_THROTTLED_PARTIAL is done.
+ */
+ return CFS_THROTTLED;
+}
+
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -5946,9 +5956,25 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
long queued_delta, runnable_delta, idle_delta, dequeue = 1;
+ struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
long rq_h_nr_queued = rq->cfs.h_nr_queued;
+ int new_state = determine_throttle_state(cfs_rq, se);
+ int prev_state = cfs_rq->throttled;
+
+ /* Fully throttled cfs_rq should not reach here */
+ SCHED_WARN_ON(cfs_rq_h_throttled(cfs_rq));
+
+ /* Nothing to do */
+ if (new_state == prev_state)
+ return false;
+
+ /*
+ * We've been upgraded! Just dequeue since we are already on the
+ * throttled_list. Let distribution unthrottle us.
+ */
+ if (prev_state)
+ goto throttle_dequeue;
raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5969,10 +5995,13 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
raw_spin_unlock(&cfs_b->lock);
if (!dequeue)
- return false; /* Throttle no longer required. */
+ new_state = CFS_UNTHROTTLED;
- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ /* If the cfs_rq is only partially throttled, we are done. */
+ if (new_state < CFS_THROTTLED)
+ goto done;
+throttle_dequeue:
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
@@ -6041,11 +6070,16 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
* Note: distribution will already see us throttled via the
* throttled-list. rq->lock protects completion.
*/
- cfs_rq->throttled = CFS_THROTTLED;
- SCHED_WARN_ON(cfs_rq->throttled_clock);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock = rq_clock(rq);
- return true;
+ cfs_rq->throttled = new_state;
+ if (new_state == CFS_THROTTLED) {
+ SCHED_WARN_ON(cfs_rq->throttled_clock);
+ if (cfs_rq->nr_queued)
+ cfs_rq->throttled_clock = rq_clock(rq);
+
+ return true;
+ }
+
+ return false;
}
void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
@@ -6536,6 +6570,11 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
if (cfs_rq_h_throttled(cfs_rq))
return true;
+ /*
+ * throttle_cfs_rq() will reevaluate the throttle status for
+ * partially throttled hierarchy and upgrade to a full throttle
+ * once all the kernel mode entities leave or exit to user mode.
+ */
return throttle_cfs_rq(cfs_rq);
}
--
2.43.0
* [RFC PATCH 11/22] sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (9 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 10/22] sched/fair: Prepare throttle_cfs_rq() to allow " K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 12/22] sched/fair: Prepare bandwidth distribution to unthrottle partial throttles right away K Prateek Nayak
` (11 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
If an entity that was blocked while running in kernel mode wakes up, a
fully throttled hierarchy needs to be demoted to a partially throttled
one.
Prepare unthrottle_cfs_rq() to demote the throttle status when the
caller explicitly requests a demotion via the "demote_to_partial"
indicator.
Modify all current callers of unthrottle_cfs_rq() to pass
"demote_to_partial" as false since all the existing scenarios
completely unthrottle a cfs_rq.
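The two paths through unthrottle_cfs_rq() can be sketched as a userspace
model; the struct and helper names are simplified stand-ins, tracking
only the two pieces of state this patch touches (the throttle state and
throttled_list membership):

```c
#include <assert.h>
#include <stdbool.h>

enum throttle_state {
	CFS_UNTHROTTLED = 0,
	CFS_THROTTLED_PARTIAL,
	CFS_THROTTLED,
};

/* Stand-in holding just the state relevant to the demotion path. */
struct model_cfs_rq {
	enum throttle_state throttled;
	bool on_throttled_list;
};

static void model_unthrottle(struct model_cfs_rq *rq, bool demote_to_partial)
{
	if (demote_to_partial) {
		/* Demotion is only valid on a fully throttled hierarchy. */
		assert(rq->throttled == CFS_THROTTLED);
		rq->throttled = CFS_THROTTLED_PARTIAL;
		/* Stays on the throttled_list so distribution finishes the job. */
	} else {
		rq->throttled = CFS_UNTHROTTLED;
		rq->on_throttled_list = false;
	}
}
```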
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++----------
kernel/sched/sched.h | 2 +-
3 files changed, 31 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0851cdad9242..a797517d3dcf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9488,7 +9488,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
cfs_rq->runtime_remaining = 0;
if (cfs_rq->throttled)
- unthrottle_cfs_rq(cfs_rq);
+ unthrottle_cfs_rq(cfs_rq, false);
}
if (runtime_was_enabled && !runtime_enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e1df614e82f..091493bc8506 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6082,17 +6082,26 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
return false;
}
-void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+void unthrottle_cfs_rq(struct cfs_rq *cfs_rq, bool demote_to_partial)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
struct sched_entity *se;
long queued_delta, runnable_delta, idle_delta;
long rq_h_nr_queued = rq->cfs.h_nr_queued;
+ int throttled_state = cfs_rq->throttled;
se = cfs_rq->tg->se[cpu_of(rq)];
- cfs_rq->throttled = CFS_UNTHROTTLED;
+ if (demote_to_partial) {
+ /*
+ * A demotion to partially throttled state can only be
+ * requested on a fully throttled hierarchy.
+ */
+ SCHED_WARN_ON(!cfs_rq_h_throttled(cfs_rq));
+ cfs_rq->throttled = CFS_THROTTLED_PARTIAL;
+ } else
+ cfs_rq->throttled = CFS_UNTHROTTLED;
update_rq_clock(rq);
@@ -6101,9 +6110,16 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;
cfs_rq->throttled_clock = 0;
}
- list_del_rcu(&cfs_rq->throttled_list);
+
+ /* Partial throttle should retain itself in the throttled_list */
+ if (!demote_to_partial)
+ list_del_rcu(&cfs_rq->throttled_list);
raw_spin_unlock(&cfs_b->lock);
+ /* If cfs_rq was partially throttled, we have nothing to do */
+ if (throttled_state == CFS_THROTTLED_PARTIAL)
+ goto unthrottle_throttle;
+
/* update hierarchical throttle state */
walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
@@ -6176,8 +6192,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
unthrottle_throttle:
assert_list_leaf_cfs_rq(rq);
- /* Determine whether we need to wake up potentially idle CPU: */
- if (rq->curr == rq->idle && rq->cfs.nr_queued)
+ /*
+ * Determine whether we need to wake up potentially idle CPU or
+ * reevaluate our pick on the throttled hierarchy.
+ */
+ if (cfs_rq->curr || (rq->curr == rq->idle && rq->cfs.nr_queued))
resched_curr(rq);
}
@@ -6212,7 +6231,7 @@ static void __cfsb_csd_unthrottle(void *arg)
list_del_init(&cursor->throttled_csd_list);
if (cfs_rq_throttled(cursor))
- unthrottle_cfs_rq(cursor);
+ unthrottle_cfs_rq(cursor, false);
}
rcu_read_unlock();
@@ -6227,7 +6246,7 @@ static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
bool first;
if (rq == this_rq()) {
- unthrottle_cfs_rq(cfs_rq);
+ unthrottle_cfs_rq(cfs_rq, false);
return;
}
@@ -6243,7 +6262,7 @@ static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
#else
static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
{
- unthrottle_cfs_rq(cfs_rq);
+ unthrottle_cfs_rq(cfs_rq, false);
}
#endif
@@ -6329,7 +6348,7 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
list_del_init(&cfs_rq->throttled_csd_list);
if (cfs_rq_throttled(cfs_rq))
- unthrottle_cfs_rq(cfs_rq);
+ unthrottle_cfs_rq(cfs_rq, false);
rq_unlock_irqrestore(rq, &rf);
}
@@ -6786,7 +6805,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
* there's some valid quota amount
*/
cfs_rq->runtime_remaining = 1;
- unthrottle_cfs_rq(cfs_rq);
+ unthrottle_cfs_rq(cfs_rq, false);
}
rcu_read_unlock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22567d236f82..bd43271fa166 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -554,7 +554,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
-extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
+extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq, bool demote_to_partial);
extern bool cfs_task_bw_constrained(struct task_struct *p);
extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
--
2.43.0
* [RFC PATCH 12/22] sched/fair: Prepare bandwidth distribution to unthrottle partial throttles right away
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (10 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 11/22] sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 13/22] sched/fair: Correct the throttle status supplied to pick_eevdf() K Prateek Nayak
` (10 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Unthrottling a partially throttled hierarchy is cheap since it only
involves updating the cfs_rq's throttle status and does not involve
modifying the hierarchical indicators via tg_tree walks or enqueuing
the cfs_rq back.
Unthrottle a partially throttled cfs_rq right away. This also helps
avoid a subtle race with async throttling, specifically:
    distribute_cfs_runtime()
      rq_lock_irqsave(rq)
      # Replenish bandwidth               schedule()
      unthrottle_cfs_rq_async()             local_irq_disable()
        smp_call_function_single_async()
        -> # IPI pending
      rq_unlock_irqrestore()                rq_lock()
                                            pick_task_fair()
                                              check_cfs_rq_runtime()
                                              # runtime_remaining > 0 but
                                              # cfs_rq->throttled != 0
The above can lead to pick_eevdf() failing to find a sched_entity,
which is now avoided.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 091493bc8506..8824e89a3ede 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6320,6 +6320,17 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
/* we check whether we're throttled above */
if (cfs_rq->runtime_remaining > 0) {
+ /*
+ * A partially throttled cfs_rq is cheap to unthrottle
+ * since it only requires updating the throttled
+ * status of the cfs_rq. Save the need for reacquiring
+ * the rq_lock and a possible IPI by unthrottling it here.
+ */
+ if (!cfs_rq_h_throttled(cfs_rq)) {
+ unthrottle_cfs_rq(cfs_rq, false);
+ goto next;
+ }
+
if (cpu_of(rq) != this_cpu) {
unthrottle_cfs_rq_async(cfs_rq);
} else {
--
2.43.0
* [RFC PATCH 13/22] sched/fair: Correct the throttle status supplied to pick_eevdf()
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (11 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 12/22] sched/fair: Prepare bandwidth distribution to unthrottle partial throttles right away K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 14/22] sched/fair: Mark a task if it was picked on a partially throttled hierarchy K Prateek Nayak
` (9 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
A partially throttled hierarchy does not set the hierarchical indicator.
Find the throttled status of the hierarchy when traversing down the tree
in pick_task_fair() and pass the correct throttle status to
pick_eevdf().
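The descent described above accumulates the throttle status level by
level: once any level on the path is throttled, every level below it
must be picked in throttled mode. A minimal userspace sketch of that
accumulation (model_descent_throttled() and its flat level array are
illustrative stand-ins for the walk pick_task_fair() performs):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model the hierarchy as a flat array of per-level throttle flags,
 * indexed from the root down. Returns the h_throttled value that the
 * pick at @pick_level would see: true if this level or any level
 * above it is throttled (partially or completely).
 */
static bool model_descent_throttled(const bool *level_throttled, int depth,
				    int pick_level)
{
	bool h_throttled = false;

	for (int d = 0; d <= pick_level && d < depth; d++)
		h_throttled = h_throttled || level_throttled[d];
	return h_throttled;
}
```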
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8824e89a3ede..1d871509b246 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5618,7 +5618,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
-pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
+pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq, bool h_throttled)
{
struct sched_entity *se;
@@ -5626,13 +5626,13 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
* Picking the ->next buddy will affect latency but not fairness.
*/
if (sched_feat(PICK_BUDDY) && cfs_rq->next &&
- pick_entity(cfs_rq, cfs_rq->next, throttled_hierarchy(cfs_rq))) {
+ pick_entity(cfs_rq, cfs_rq->next, h_throttled)) {
/* ->next will never be delayed */
SCHED_WARN_ON(cfs_rq->next->sched_delayed);
return cfs_rq->next;
}
- se = pick_eevdf(cfs_rq, throttled_hierarchy(cfs_rq));
+ se = pick_eevdf(cfs_rq, h_throttled);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
/*
@@ -9187,6 +9187,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct sched_entity *se = &donor->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
+ bool h_throttled = false;
if (unlikely(se == pse))
return;
@@ -9260,10 +9261,16 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
if (do_preempt_short(cfs_rq, pse, se))
cancel_protect_slice(se);
+ for_each_sched_entity(se) {
+ h_throttled = h_throttled || cfs_rq_throttled(cfs_rq_of(se));
+ if (h_throttled)
+ break;
+ }
+
/*
* If @p has become the most eligible task, force preemption.
*/
- if (pick_eevdf(cfs_rq, throttled_hierarchy(cfs_rq)) == pse)
+ if (pick_eevdf(cfs_rq, h_throttled) == pse)
goto preempt;
return;
@@ -9276,12 +9283,15 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ bool h_throttled;
again:
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_queued)
return NULL;
+ h_throttled = false;
+
do {
/* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
@@ -9290,7 +9300,8 @@ static struct task_struct *pick_task_fair(struct rq *rq)
if (unlikely(check_cfs_rq_runtime(cfs_rq)))
goto again;
- se = pick_next_entity(rq, cfs_rq);
+ h_throttled = h_throttled || cfs_rq_throttled(cfs_rq);
+ se = pick_next_entity(rq, cfs_rq, h_throttled);
if (!se)
goto again;
cfs_rq = group_cfs_rq(se);
--
2.43.0
* [RFC PATCH 14/22] sched/fair: Mark a task if it was picked on a partially throttled hierarchy
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (12 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 13/22] sched/fair: Correct the throttle status supplied to pick_eevdf() K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 15/22] sched/fair: Call resched_curr() from sched_notify_syscall_exit() K Prateek Nayak
` (8 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
With partial throttle, the hierarchical throttle indicator, namely
"cfs_rq->throttle_count", will not be set. To detect a task picked on a
partially throttled hierarchy, set a "sched_throttled" indicator in its
sched_entity to trigger a schedule() when "kernel_cs_count" hits 0.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 3 +++
kernel/sched/fair.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 32 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 48115de839a7..200cc086e121 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -597,6 +597,9 @@ struct sched_entity {
* is accounted at key decision points.
*/
int kernel_cs_count;
+
+ /* Entity picked on a throttled hierarchy */
+ unsigned char sched_throttled;
/* hole */
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d871509b246..68c194169c00 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6856,6 +6856,17 @@ static __always_inline int se_in_kernel(struct sched_entity *se)
return se->kernel_cs_count;
}
+/* se picked on a partially throttled hierarchy. */
+static inline void task_mark_throttled(struct task_struct *p)
+{
+ p->se.sched_throttled = 1;
+}
+
+static inline void task_clear_throttled(struct task_struct *p)
+{
+ p->se.sched_throttled = 0;
+}
+
/*
* Same as avg_vruntime_add() except avg_kcs_vruntime_add() only adjusts the avg_kcs_vruntime
* and avg_kcs_load of kernel mode preempted entity when it joins the rbtree.
@@ -7171,6 +7182,8 @@ static __always_inline int se_in_kernel(struct sched_entity *se)
return false;
}
+static __always_inline void task_mark_throttled(struct task_struct *p) {}
+static __always_inline void task_clear_throttled(struct task_struct *p) {}
static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) {}
@@ -9281,6 +9294,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
static struct task_struct *pick_task_fair(struct rq *rq)
{
+ struct task_struct *next;
struct sched_entity *se;
struct cfs_rq *cfs_rq;
bool h_throttled;
@@ -9307,7 +9321,11 @@ static struct task_struct *pick_task_fair(struct rq *rq)
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
- return task_of(se);
+ next = task_of(se);
+ if (h_throttled)
+ task_mark_throttled(next);
+
+ return next;
}
static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
@@ -9349,6 +9367,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
bool next_in_kernel = se_in_kernel(se);
struct cfs_rq *cfs_rq;
+ task_clear_throttled(prev);
+
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
int pse_depth = pse->depth;
@@ -9457,6 +9477,14 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
bool task_in_kernel = next && task_on_rq_queued(prev) && se_in_kernel(se);
struct cfs_rq *cfs_rq;
+ /*
+ * Clear the pick on throttled indicator only if
+ * another task was picked and not for a save /
+ * restore operation for the task.
+ */
+ if (next)
+ task_clear_throttled(prev);
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
account_kcs_enqueue(cfs_rq, task_in_kernel);
--
2.43.0
* [RFC PATCH 15/22] sched/fair: Call resched_curr() from sched_notify_syscall_exit()
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (13 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 14/22] sched/fair: Mark a task if it was picked on a partially throttled hierarchy K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 16/22] sched/fair: Prepare enqueue to partially unthrottle cfs_rq K Prateek Nayak
` (7 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
With pick_task_fair() marking a pick on a throttled hierarchy via
"sched_throttled", reschedule the current task when its
"kernel_cs_count" drops to 0.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68c194169c00..0332e95d36b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6839,6 +6839,7 @@ bool cfs_task_bw_constrained(struct task_struct *p)
__always_inline void sched_notify_critical_section_entry(void)
{
+ SCHED_WARN_ON(current->se.kernel_cs_count);
current->se.kernel_cs_count++;
/*
* Post this point, the task is considered to be in a kernel
@@ -6848,7 +6849,23 @@ __always_inline void sched_notify_critical_section_entry(void)
__always_inline void sched_notify_critical_section_exit(void)
{
+ lockdep_assert_irqs_disabled();
+
current->se.kernel_cs_count--;
+ SCHED_WARN_ON(current->se.kernel_cs_count);
+
+ /*
+ * XXX: Can we get away with using set_thread_flag()
+ * and not grabbing the rq_lock since we'll call
+ * schedule() soon after enabling interrupts again in
+ * exit_to_user_mode_loop()?
+ */
+ if (!current->se.kernel_cs_count && current->se.sched_throttled) {
+ struct rq *rq = this_rq();
+
+ guard(rq_lock_irqsave)(rq);
+ resched_curr(rq);
+ }
}
static __always_inline int se_in_kernel(struct sched_entity *se)
--
2.43.0
* [RFC PATCH 16/22] sched/fair: Prepare enqueue to partially unthrottle cfs_rq
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (14 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 15/22] sched/fair: Call resched_curr() from sched_notify_syscall_exit() K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 17/22] sched/fair: Clear pick on throttled indicator when a task leaves the fair class K Prateek Nayak
` (6 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
Partially unthrottle a frozen hierarchy when a kernel mode preempted
entity is enqueued on a fully throttled cfs_rq. unthrottle_throttled()
will be fully implemented once the plumbing for partial throttle is
complete.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0332e95d36b5..3bcb56a62266 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7117,6 +7117,43 @@ static inline void account_kcs_dequeue(struct cfs_rq *gcfs_rq, bool in_kernel)
SCHED_WARN_ON(se->kernel_cs_count < 0);
}
+/*
+ * Unthrottle a fully throttled hierarchy when a kernel mode task
+ * joins the hierarchy.
+ */
+static void unthrottle_throttled(struct cfs_rq *gcfs_rq, bool in_kernel)
+{
+ struct rq *rq = rq_of(gcfs_rq);
+ struct sched_entity *se = gcfs_rq->tg->se[cpu_of(rq)];
+
+ /* TODO: Remove this early return once plumbing is done */
+ return;
+
+ /*
+ * Demoting a cfs_rq to partial throttle will trigger a
+ * rq_clock update. Skip all the updates and use the
+ * last updated time.
+ */
+ rq_clock_start_loop_update(rq);
+ unthrottle_cfs_rq(gcfs_rq, true);
+
+ for_each_sched_entity(se) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ /*
+ * Fixup what is missed by unthrottle_cfs_rq() that
+ * enqueue_task_fair() would have done.
+ */
+ update_cfs_group(se);
+ account_kcs_enqueue(cfs_rq, in_kernel);
+
+ if (cfs_rq_h_throttled(cfs_rq))
+ unthrottle_cfs_rq(cfs_rq, true);
+ }
+
+ rq_clock_stop_loop_update(rq);
+}
+
#ifdef CONFIG_NO_HZ_FULL
/* called from pick_next_task_fair() */
static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
@@ -7224,6 +7261,7 @@ static inline bool min_kcs_vruntime_update(struct sched_entity *se)
static inline void account_kcs_enqueue(struct cfs_rq *gcfs_rq, bool in_kernel) {}
static inline void account_kcs_dequeue(struct cfs_rq *gcfs_rq, bool in_kernel) {}
+static __always_inline void unthrottle_throttled(struct cfs_rq *cfs_rq, bool in_kernel) {}
#endif /* CONFIG_CFS_BANDWIDTH */
@@ -7444,8 +7482,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
h_nr_idle = 1;
/* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_h_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq)) {
+ /*
+ * Since a kernel mode preempted entity has
+ * joined a fully throttled hierarchy, unfreeze
+ * it. Since unthrottle_cfs_rq() adjusts the
+ * h_nr_* stats and the averages internally,
+ * skip to the end.
+ */
+ if (task_in_kernel)
+ unthrottle_throttled(cfs_rq, task_in_kernel);
goto enqueue_throttle;
+ }
flags = ENQUEUE_WAKEUP;
}
@@ -7471,8 +7519,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
h_nr_idle = 1;
/* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_h_throttled(cfs_rq))
+ if (cfs_rq_h_throttled(cfs_rq)) {
+ /* Ditto as above */
+ if (task_in_kernel)
+ unthrottle_throttled(cfs_rq, task_in_kernel);
goto enqueue_throttle;
+ }
}
if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
--
2.43.0
* [RFC PATCH 17/22] sched/fair: Clear pick on throttled indicator when a task leaves the fair class
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (15 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 16/22] sched/fair: Prepare enqueue to partially unthrottle cfs_rq K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 18/22] sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled hierarchy K Prateek Nayak
` (5 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
Prepare switched_from_fair() to clear the "sched_throttled" indicator
when a task leaves the fair class.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bcb56a62266..423c5a95989e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13805,6 +13805,7 @@ static void attach_task_cfs_rq(struct task_struct *p)
static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
+ task_clear_throttled(p);
detach_task_cfs_rq(p);
}
--
2.43.0
* [RFC PATCH 18/22] sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled hierarchy
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (16 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 17/22] sched/fair: Clear pick on throttled indicator when a task leaves the fair class K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 19/22] sched/fair: Ignore in-kernel indicators for running task outside of schedule() K Prateek Nayak
` (4 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
Kernel mode preempted tasks being put back on a throttled hierarchy
need to be reachable during pick. Demote the throttle status to partial
if pick_next_task_fair() finds that the previous task was preempted in
kernel mode but is on a fully throttled hierarchy.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 423c5a95989e..1497b0aed1c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9407,6 +9407,38 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct task_struct *p;
int new_tasks;
+#ifdef CONFIG_CFS_BANDWIDTH
+ se = &prev->se;
+
+ /*
+ * A task on throttled hierarchy was forced into running state.
+ * Recheck throttle status when the task hits schedule since the
+ * "kernel_cs_count" is stable now. If task is preempted in
+ * kernel mode, partially unthrottle the hierarchy now for it to
+ * be reachable for pick_task_fair() that follows later.
+ */
+ if (prev->sched_class == &fair_sched_class &&
+ task_on_rq_queued(prev) &&
+ se_in_kernel(se) &&
+ throttled_hierarchy(cfs_rq_of(se))) {
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+
+ /* There is at least one fully throttled cfs_rq */
+ if (cfs_rq_h_throttled(cfs_rq))
+ break;
+ }
+
+ /*
+ * Only unthrottle; Do not adjust "kernel_cs_count" yet
+ * since account_kcs_enqueue() below will adjust it.
+ */
+ unthrottle_throttled(cfs_rq, false);
+ }
+#endif
+
again:
p = pick_task_fair(rq);
if (!p)
--
2.43.0
* [RFC PATCH 19/22] sched/fair: Ignore in-kernel indicators for running task outside of schedule()
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (17 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 18/22] sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled hierarchy K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 20/22] sched/fair: Implement determine_throttle_state() for partial throttle K Prateek Nayak
` (3 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
A running task can go through a dequeue_task() -> put_prev_task() ->
enqueue_task() -> set_next_task() cycle while on_cpu for a number of
operations.
Since the task is running on a remote CPU, its "kernel_cs_count" is
unstable and looking at it can lead to wrong outcomes (imbalanced
accounting at __enqueue_entity() and __dequeue_entity()).
Use "sched_throttled" to indicate that the task's kernel mode indicators
should be ignored. put_prev_task() is called with next set to NULL only
for save / restore operations, and this is used as an indicator to set
the ignore bit. A subsequent call to set_next_task_fair() will clear
this indicator.
There are cases where the save / restore cycle can fully throttle the
task's hierarchy. One such sequence is:
dequeue_task()
put_prev_task() # Sets cfs_rq->curr to NULL
...
enqueue_task()
enqueue_task_fair()
check_enqueue_throttle()
# cfs_rq->curr is NULL so goes ahead
# Full throttle
set_next_task() # Sets cfs_rq->curr back
If set_next_task_fair() finds a task that is marked for ignoring but has
been forced to run on a throttled hierarchy, request a resched for
schedule() to partially unthrottle the hierarchy, if required, before
going ahead with the pick.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 115 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 111 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1497b0aed1c2..55e53db8da45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6837,6 +6837,21 @@ bool cfs_task_bw_constrained(struct task_struct *p)
return false;
}
+/* se->sched_throttled states: Modifications are serialized by rq_lock */
+enum sched_throttled_states {
+ /* No throttle status */
+ PICK_ON_UNTHROTTLED = 0x0,
+ /* Task was picked on a throttled hierarchy */
+ PICK_ON_THROTTLED = 0x1,
+ /*
+ * Ignore the task's "kernel_cs_count". Used for save / restore
+ * operations on running tasks from a remote CPU. Prevents
+ * enqueue and set_next_entity() from adjusting the stats and
+ * puts it off until schedule() is called on the CPU.
+ */
+ PICK_IGNORE = 0x2,
+};
+
__always_inline void sched_notify_critical_section_entry(void)
{
SCHED_WARN_ON(current->se.kernel_cs_count);
@@ -6847,6 +6862,8 @@ __always_inline void sched_notify_critical_section_entry(void)
*/
}
+static inline int task_picked_on_throttled(struct task_struct *p);
+
__always_inline void sched_notify_critical_section_exit(void)
{
lockdep_assert_irqs_disabled();
@@ -6860,7 +6877,7 @@ __always_inline void sched_notify_critical_section_exit(void)
* schedule() soon after enabling interrupts again in
* exit_to_user_mode_loop()?
*/
- if (!current->se.kernel_cs_count && current->se.sched_throttled) {
+ if (!current->se.kernel_cs_count && task_picked_on_throttled(current)) {
struct rq *rq = this_rq();
guard(rq_lock_irqsave)(rq);
@@ -6873,15 +6890,40 @@ static __always_inline int se_in_kernel(struct sched_entity *se)
return se->kernel_cs_count;
}
-/* se picked on a partially throttled hierarchy. */
+/* task picked on a partially throttled hierarchy. */
static inline void task_mark_throttled(struct task_struct *p)
{
- p->se.sched_throttled = 1;
+ WRITE_ONCE(p->se.sched_throttled, PICK_ON_THROTTLED);
+}
+
+static inline void task_mark_ignore(struct task_struct *p)
+{
+ WRITE_ONCE(p->se.sched_throttled, READ_ONCE(p->se.sched_throttled) | PICK_IGNORE);
}
static inline void task_clear_throttled(struct task_struct *p)
{
- p->se.sched_throttled = 0;
+ WRITE_ONCE(p->se.sched_throttled, PICK_ON_UNTHROTTLED);
+}
+
+static inline void task_clear_ignore(struct task_struct *p)
+{
+ WRITE_ONCE(p->se.sched_throttled, READ_ONCE(p->se.sched_throttled) & ~PICK_IGNORE);
+}
+
+static inline int __kcs_ignore_entity(struct sched_entity *se)
+{
+ return READ_ONCE(se->sched_throttled) & PICK_IGNORE;
+}
+
+static inline int task_picked_on_throttled(struct task_struct *p)
+{
+ return READ_ONCE(p->se.sched_throttled) & PICK_ON_THROTTLED;
+}
+
+static inline int ignore_task_kcs_stats(struct task_struct *p)
+{
+ return __kcs_ignore_entity(&p->se);
}
/*
@@ -6893,6 +6935,10 @@ static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct s
unsigned long weight;
s64 key;
+ /* See avg_kcs_vruntime_sub() */
+ if (__kcs_ignore_entity(se))
+ return;
+
if (!se_in_kernel(se))
return;
@@ -6912,6 +6958,15 @@ static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct s
unsigned long weight;
s64 key;
+ /*
+ * A remote running task is being enqueued for a restore operation.
+ * Since its "kernel_cs_count" is unstable while it is running,
+ * do not account its stats yet. A set_next_task() -> schedule() will
+ * follow on the CPU to adjust it.
+ */
+ if (__kcs_ignore_entity(se))
+ return;
+
if (!se_in_kernel(se))
return;
@@ -7238,6 +7293,20 @@ static __always_inline int se_in_kernel(struct sched_entity *se)
static __always_inline void task_mark_throttled(struct task_struct *p) {}
static __always_inline void task_clear_throttled(struct task_struct *p) {}
+static __always_inline void task_mark_ignore(struct task_struct *p) {}
+static __always_inline void task_clear_ignore(struct task_struct *p) {}
+
+static inline int task_picked_on_throttled(struct task_struct *p)
+{
+ return 0;
+}
+
+static inline int ignore_task_kcs_stats(struct task_struct *p)
+{
+ return 0;
+}
+
+
static __always_inline void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static __always_inline void avg_kcs_vruntime_update(struct cfs_rq *cfs_rq, s64 delta) {}
@@ -7437,6 +7506,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
util_est_enqueue(&rq->cfs, p);
+ /*
+ * Running task has just moved to the fair class.
+ * Ignore "kernel_cs_count" until set_next_task_fair()
+ */
+ if (task_on_cpu(rq, p) && (flags & ENQUEUE_MOVE))
+ task_mark_ignore(p);
+
if (flags & ENQUEUE_DELAYED) {
requeue_delayed_entity(se);
return;
@@ -9582,9 +9658,23 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
* Clear the pick on throttled indicator only if
* another task was picked and not for a save /
* restore operation for the task.
+ *
+ * For a save / restore operation of a running task,
+ * mark the task to be ignored for "kernel_cs_stats"
+ * adjustment. Either set_next_task_fair() or
+ * switched_from_fair() will clear these indicators.
*/
if (next)
task_clear_throttled(prev);
+ else {
+ /*
+ * put_prev_task_fair() is only called with next as NULL
+ * during save / restore operations. Since idle thread
+ * is always runnable, all other cases will have a valid
+ * prev task set.
+ */
+ task_mark_ignore(prev);
+ }
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -13897,6 +13987,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
{
struct sched_entity *se = &p->se;
bool task_in_kernel = !task_on_cpu(rq, p) && se_in_kernel(se);
+ bool h_throttled = false;
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -13905,6 +13996,22 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
account_kcs_dequeue(cfs_rq, task_in_kernel);
/* ensure bandwidth has been allocated on our new cfs_rq */
account_cfs_rq_runtime(cfs_rq, 0);
+ /* Check if task is on a partially throttled hierarchy */
+ h_throttled = h_throttled || cfs_rq_throttled(cfs_rq);
+ }
+
+ /* Mark the end of save / restore operation. */
+ if (ignore_task_kcs_stats(p)) {
+ task_clear_ignore(p);
+
+ /*
+ * If the hierarchy is throttled but the task was not picked on
+ * a throttled hierarchy, the hierarchy was throttled during the
+ * course of a save / restore operation. Request a resched for
+ * pick_next_task_fair() to reevaluate the throttle status.
+ */
+ if (h_throttled && !task_picked_on_throttled(p))
+ resched_curr(rq);
}
__set_next_task_fair(rq, p, first);
--
2.43.0
* [RFC PATCH 20/22] sched/fair: Implement determine_throttle_state() for partial throttle
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (18 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 19/22] sched/fair: Ignore in-kernel indicators for running task outside of schedule() K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 21/22] [MAYBE BUGFIX] sched/fair: Group all the se->min_* members together for propagation K Prateek Nayak
` (2 subsequent siblings)
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
With the plumbing for partial throttle in place, implement
determine_throttle_state() for partial throttle when it finds a cfs_rq
with kernel mode preempted entities on it. Also remove the early return
in unthrottle_throttled().
"Let it rip"
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 55e53db8da45..39c7e8f548ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5894,13 +5894,37 @@ static inline int throttled_lb_pair(struct task_group *tg,
throttled_hierarchy(dest_cfs_rq);
}
+static __always_inline int se_in_kernel(struct sched_entity *se);
+static inline int ignore_task_kcs_stats(struct task_struct *p);
+
static enum throttle_state
determine_throttle_state(struct cfs_rq *gcfs_rq, struct sched_entity *se)
{
+ struct sched_entity *curr = gcfs_rq->curr;
+
+ if (se_in_kernel(se))
+ return CFS_THROTTLED_PARTIAL;
+
/*
- * TODO: Implement rest once plumbing for
- * CFS_THROTTLED_PARTIAL is done.
+ * Check if current task's hierarchy needs throttle deferral.
+ * For save / restore operations, cfs_rq->curr could still be
+ * set but the task has already been dequeued by the time
+ * put_prev_task() is called. Only check if gcfs_rq->curr is
+ * set to check the current task's indicator. If the hierarchy
+ * leads to a queued task executing in kernel or is having its
+ * stats ignored, request a partial throttle.
+ *
+ * set_next_task_fair() will request a resched if the throttle status
+ * changes once the stats are reconsidered.
*/
+ if (curr) {
+ struct task_struct *p = rq_of(gcfs_rq)->curr;
+
+ if (task_on_rq_queued(p) &&
+ (ignore_task_kcs_stats(p) || se_in_kernel(&p->se)))
+ return CFS_THROTTLED_PARTIAL;
+ }
+
return CFS_THROTTLED;
}
@@ -7181,9 +7205,6 @@ static void unthrottle_throttled(struct cfs_rq *gcfs_rq, bool in_kernel)
struct rq *rq = rq_of(gcfs_rq);
struct sched_entity *se = gcfs_rq->tg->se[cpu_of(rq)];
- /* TODO: Remove this early return once plumbing is done */
- return;
-
/*
* Demoting a cfs_rq to partial throttle will trigger a
* rq_clock update. Skip all the updates and use the
--
2.43.0
* [RFC PATCH 21/22] [MAYBE BUGFIX] sched/fair: Group all the se->min_* members together for propagation
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (19 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 20/22] sched/fair: Implement determine_throttle_state() for partial throttle K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 22/22] [DEBUG] sched/fair: Debug pick_eevdf() returning NULL! K Prateek Nayak
2025-02-20 10:55 ` [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode Peter Zijlstra
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
When working on the prototype, at some point, min_kcs_vruntime heap
integrity issues were encountered. Many bugs have been ironed out since,
but when auditing rb_{add,erase}_augmented_cached(), it was noticed that
only "min_vruntime" was passed as the RBAUGMENTED member to
RB_DECLARE_CALLBACKS().
The generated augment->copy() function only copies the RBAUGMENTED
member. It is very likely that augment->propagate() is always called
after augment->copy(), similar to the convention in augment->rotate(),
which reconstructs all of min_{vruntime,slice,kcs_vruntime} before
comparing whether the values have changed. However, this author was not
sure whether augment->propagate() indeed reaches the copied node for
Case 2 in __rb_erase_augmented().
Group all the min_* members that need propagation to the rb_root
together and pass the new "min" struct as the RBAUGMENTED member to
ease the author's paranoia.
XXX: Reverting this bit reintroduces the heap integrity issue. It does
not happen immediately in the run, and the amount of traces is
overwhelming to analyze which series of rbtree operations leads to it,
but with this patch I haven't run into the heap integrity issue so far.
The cover letter should have more details.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched.h | 20 +++++++++-------
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 54 +++++++++++++++++++++----------------------
3 files changed, 40 insertions(+), 36 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 200cc086e121..6fd5cee3a0f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -549,8 +549,18 @@ struct sched_entity {
struct load_weight load;
struct rb_node run_node;
u64 deadline;
- u64 min_vruntime;
- u64 min_slice;
+ struct {
+ u64 vruntime;
+ u64 slice;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ /*
+ * min_vruntime of the kernel mode preempted
+ * entities in the subtree of this sched entity.
+ */
+ s64 kcs_vruntime;
+#endif
+ } min;
struct list_head group_node;
unsigned char on_rq;
@@ -601,12 +611,6 @@ struct sched_entity {
/* Entity picked on a throttled hierarchy */
unsigned char sched_throttled;
/* hole */
-
- /*
- * min_vruntime of the kernel mode preempted entities
- * in the subtree of this sched entity.
- */
- s64 min_kcs_vruntime;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ef047add7f9e..37bca53f109f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -821,7 +821,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
raw_spin_rq_lock_irqsave(rq, flags);
root = __pick_root_entity(cfs_rq);
if (root)
- left_vruntime = root->min_vruntime;
+ left_vruntime = root->min.vruntime;
first = __pick_first_entity(cfs_rq);
if (first)
left_deadline = first->deadline;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39c7e8f548ca..97566a043398 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -768,7 +768,7 @@ int pick_subtree(struct cfs_rq *cfs_rq, struct sched_entity *se, bool h_throttle
if (unlikely(h_throttled))
return pick_subtree_on_throttled(cfs_rq, se);
- return vruntime_eligible(cfs_rq, se->min_vruntime);
+ return vruntime_eligible(cfs_rq, se->min.vruntime);
}
static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
@@ -800,9 +800,9 @@ static void update_min_vruntime(struct cfs_rq *cfs_rq)
if (se) {
if (!curr)
- vruntime = se->min_vruntime;
+ vruntime = se->min.vruntime;
else
- vruntime = min_vruntime(vruntime, se->min_vruntime);
+ vruntime = min_vruntime(vruntime, se->min.vruntime);
}
/* ensure we never gain time by being placed backwards. */
@@ -819,7 +819,7 @@ static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
min_slice = curr->slice;
if (root)
- min_slice = min(min_slice, root->min_slice);
+ min_slice = min(min_slice, root->min.slice);
return min_slice;
}
@@ -835,8 +835,8 @@ static inline void __min_vruntime_update(struct sched_entity *se, struct rb_node
{
if (node) {
struct sched_entity *rse = __node_2_se(node);
- if (vruntime_gt(min_vruntime, se, rse))
- se->min_vruntime = rse->min_vruntime;
+ if (vruntime_gt(min.vruntime, se, rse))
+ se->min.vruntime = rse->min.vruntime;
}
}
@@ -844,8 +844,8 @@ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *n
{
if (node) {
struct sched_entity *rse = __node_2_se(node);
- if (rse->min_slice < se->min_slice)
- se->min_slice = rse->min_slice;
+ if (rse->min.slice < se->min.slice)
+ se->min.slice = rse->min.slice;
}
}
@@ -853,30 +853,30 @@ static __always_inline void init_se_kcs_stats(struct sched_entity *se);
static inline bool min_kcs_vruntime_update(struct sched_entity *se);
/*
- * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
+ * se->min.vruntime = min(se->vruntime, {left,right}->min_vruntime)
*/
static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
{
- u64 old_min_vruntime = se->min_vruntime;
- u64 old_min_slice = se->min_slice;
+ u64 old_min_vruntime = se->min.vruntime;
+ u64 old_min_slice = se->min.slice;
struct rb_node *node = &se->run_node;
bool kcs_stats_unchanged = min_kcs_vruntime_update(se);
- se->min_vruntime = se->vruntime;
+ se->min.vruntime = se->vruntime;
__min_vruntime_update(se, node->rb_right);
__min_vruntime_update(se, node->rb_left);
- se->min_slice = se->slice;
+ se->min.slice = se->slice;
__min_slice_update(se, node->rb_right);
__min_slice_update(se, node->rb_left);
- return se->min_vruntime == old_min_vruntime &&
- se->min_slice == old_min_slice &&
+ return se->min.vruntime == old_min_vruntime &&
+ se->min.slice == old_min_slice &&
kcs_stats_unchanged;
}
RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
- run_node, min_vruntime, min_vruntime_update);
+ run_node, min, min_vruntime_update);
/*
* Enqueue an entity into the rb-tree:
@@ -885,8 +885,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
avg_vruntime_add(cfs_rq, se);
init_se_kcs_stats(se);
- se->min_vruntime = se->vruntime;
- se->min_slice = se->slice;
+ se->min.vruntime = se->vruntime;
+ se->min.slice = se->slice;
rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
__entity_less, &min_vruntime_cb);
}
@@ -953,7 +953,7 @@ static inline void cancel_protect_slice(struct sched_entity *se)
* tree keeps the entries sorted on deadline, but also functions as a
* heap based on the vruntime by keeping:
*
- * se->min_vruntime = min(se->vruntime, se->{left,right}->min_vruntime)
+ * se->min.vruntime = min(se->vruntime, se->{left,right}->min_vruntime)
*
* Which allows tree pruning through eligibility.
*/
@@ -7018,7 +7018,7 @@ static __always_inline void init_se_kcs_stats(struct sched_entity *se)
* the upper bound to differentiate the case where no kernel mode preempted
* entities are queued on the subtree.
*/
- se->min_kcs_vruntime = (se_in_kernel(se)) ? se->vruntime : LLONG_MAX;
+ se->min.kcs_vruntime = (se_in_kernel(se)) ? se->vruntime : LLONG_MAX;
}
/*
@@ -7096,11 +7096,11 @@ static __always_inline
int pick_subtree_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/* There are no kernel mode preempted entities in the subtree. */
- if (se->min_kcs_vruntime == LLONG_MAX)
+ if (se->min.kcs_vruntime == LLONG_MAX)
return false;
return throttled_vruntime_eligible(cfs_rq,
- se->min_kcs_vruntime,
+ se->min.kcs_vruntime,
curr_h_is_throttled(cfs_rq));
}
@@ -7109,21 +7109,21 @@ static inline void __min_kcs_vruntime_update(struct sched_entity *se, struct rb_
if (node) {
struct sched_entity *rse = __node_2_se(node);
- if (rse->min_kcs_vruntime < se->min_kcs_vruntime)
- se->min_kcs_vruntime = rse->min_kcs_vruntime;
+ if (rse->min.kcs_vruntime < se->min.kcs_vruntime)
+ se->min.kcs_vruntime = rse->min.kcs_vruntime;
}
}
static inline bool min_kcs_vruntime_update(struct sched_entity *se)
{
- u64 old_min_kcs_vruntime = se->min_kcs_vruntime;
+ u64 old_min_kcs_vruntime = se->min.kcs_vruntime;
struct rb_node *node = &se->run_node;
init_se_kcs_stats(se);
__min_kcs_vruntime_update(se, node->rb_right);
__min_kcs_vruntime_update(se, node->rb_left);
- return se->min_kcs_vruntime == old_min_kcs_vruntime;
+ return se->min.kcs_vruntime == old_min_kcs_vruntime;
}
static inline void account_kcs_enqueue(struct cfs_rq *gcfs_rq, bool in_kernel)
@@ -7341,7 +7341,7 @@ static __always_inline int pick_se_on_throttled(struct cfs_rq *cfs_rq, struct sc
static __always_inline
int pick_subtree_on_throttled(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- return vruntime_eligible(cfs_rq, se->min_vruntime);
+ return vruntime_eligible(cfs_rq, se->min.vruntime);
}
static inline bool min_kcs_vruntime_update(struct sched_entity *se)
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [RFC PATCH 22/22] [DEBUG] sched/fair: Debug pick_eevdf() returning NULL!
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (20 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 21/22] [MAYBE BUGFIX] sched/fair: Group all the se->min_* members together for propagation K Prateek Nayak
@ 2025-02-20 9:32 ` K Prateek Nayak
2025-02-20 10:55 ` [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode Peter Zijlstra
22 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
K Prateek Nayak, Gautham R. Shenoy, Swapnil Sapkal
Dump the stats of the cfs_rq and the entities queued on it when
pick_eevdf() fails to find a runnable entity. Take the panic that
follows, since this scenario implies a breakdown of the scheduling
algorithm.
XXX: This will only build with CONFIG_CFS_BANDWIDTH enabled.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 75 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97566a043398..270e5f4b2741 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5610,6 +5610,78 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
+static void debug_print_se(int cpu, struct sched_entity *se, bool h_throttled)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+ pr_warn("CPU%d: se: load(%lu) vruntime(%lld) entity_key(%lld) deadline(%lld) min_vruntime(%lld) on_rq(%d)\n", cpu, scale_load_down(se->load.weight), se->vruntime, entity_key(cfs_rq, se), se->deadline, se->min.vruntime, se->on_rq);
+ pr_warn("CPU%d: se kcs: kernel_cs_count(%d) min_kcs_vruntime(%lld) pick_entity(%d)\n", cpu, se->kernel_cs_count, se->min.kcs_vruntime, pick_entity(cfs_rq, se, h_throttled));
+}
+
+static void debug_print_cfs_rq(int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se, bool h_throttled)
+{
+ pr_warn("CPU%d: ----- cfs_rq ----\n", cpu);
+ pr_warn("CPU%d: cfs_rq: throttled?(%d) cfs_rq->throttled(%d) h_nr_queued(%d) h_nr_runnable(%d) nr_queued(%d) gse->kernel_cs_count(%d)\n", cpu, h_throttled, cfs_rq->throttled, cfs_rq->h_nr_queued, cfs_rq->h_nr_runnable, cfs_rq->nr_queued, (se)? se->kernel_cs_count: -1);
+ pr_warn("CPU%d: cfs_rq EEVDF: avg_vruntime(%lld) avg_load(%lld) avg_kcs_vruntime(%lld) avg_kcs_load(%lld) \n", cpu, cfs_rq->avg_vruntime, cfs_rq->avg_load, cfs_rq->avg_kcs_vruntime, cfs_rq->avg_kcs_load);
+
+ if (cfs_rq->curr) {
+ pr_warn("CPU%d: ----- cfs_rq->curr ----\n", cpu);
+ debug_print_se(cpu, cfs_rq->curr, h_throttled);
+ }
+ pr_warn("CPU%d: ----- cfs_rq done ----\n", cpu);
+}
+
+static void debug_recursive(int cpu, struct rb_node *node, bool h_throttled)
+{
+ debug_print_se(cpu, __node_2_se(node), h_throttled);
+
+ if (node->rb_left) {
+ pr_warn("CPU%d: ----- Left Subtree ----\n", cpu);
+ debug_recursive(cpu, node->rb_left, h_throttled);
+ pr_warn("CPU%d: ----- Left Subtree Done ----\n", cpu);
+ }
+
+ if (node->rb_right) {
+ pr_warn("CPU%d: ----- Right Subtree ----\n", cpu);
+ debug_recursive(cpu, node->rb_right, h_throttled);
+ pr_warn("CPU%d: ----- Right Subtree Done ----\n", cpu);
+ }
+}
+
+static void debug_pick_next_entity(struct cfs_rq *cfs_rq, bool h_throttled)
+{
+ struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
+ struct task_struct *p = rq_of(cfs_rq)->curr;
+ int cpu = smp_processor_id();
+
+ if (p) {
+ pr_warn("CPU%d: ----- current task ----\n", cpu);
+ pr_warn("CPU%d: pid(%d) comm(%s) task_cpu(%d) task_on_rq_queued(%d) task_on_rq_migrating(%d) normal_policy(%d) idle_policy(%d)\n", cpu, p->pid, p->comm, task_cpu(p), task_on_rq_queued(p), task_on_rq_migrating(p), normal_policy(p->policy), idle_policy(p->policy));
+ pr_warn("CPU%d: ----- current task done ----\n", cpu);
+ }
+
+ debug_print_cfs_rq(cpu, cfs_rq, se, h_throttled);
+
+ if (node) {
+ pr_warn("CPU%d: ----- rbtree traversal: root ----\n", cpu);
+ debug_recursive(cpu, node, h_throttled);
+ pr_warn("CPU%d: ----- rbtree done ----\n", cpu);
+ }
+
+ cfs_rq = cfs_rq_of(se);
+ se = parent_entity(se);
+
+ for_each_sched_entity(se) {
+ pr_warn("CPU%d: ----- parent cfs_rq ----\n", cpu);
+ debug_print_cfs_rq(cpu, cfs_rq, se, h_throttled);
+
+ cfs_rq = cfs_rq_of(se);
+ }
+
+ debug_print_cfs_rq(cpu, cfs_rq, NULL, false);
+}
+
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
@@ -5633,6 +5705,9 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq, bool h_throttled)
}
se = pick_eevdf(cfs_rq, h_throttled);
+ if (!se)
+ debug_pick_next_entity(cfs_rq, h_throttled);
+
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header
2025-02-20 9:32 ` [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header K Prateek Nayak
@ 2025-02-20 10:43 ` Peter Zijlstra
2025-02-20 10:56 ` K Prateek Nayak
0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-02-20 10:43 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
Ben Segall, Thomas Gleixner, Andy Lutomirski, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
Gautham R. Shenoy, Swapnil Sapkal
On Thu, Feb 20, 2025 at 09:32:36AM +0000, K Prateek Nayak wrote:
> Retain the prototype of syscall_enter_from_user_mode_work() in
> linux/entry-common.h and move the function definition to
> kernel/entry/common.c in preparation to notify the scheduler of task
> entering and exiting kernel mode for syscall. The two architectures that
> use it directly (x86, s390) and the four that call it via
> syscall_enter_from_user_mode() (x86, riscv, loongarch, s390) end up
> selecting GENERIC_ENTRY, hence, no functional changes are intended.
>
> syscall_enter_from_user_mode_work() will end up calling a function whose
> visibility needs to be limited to kernel/* use only for CFS throttling
> deferral.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> include/linux/entry-common.h | 10 +---------
> kernel/entry/common.c | 10 ++++++++++
> 2 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index fc61d0205c97..7569a49cf7a0 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -161,15 +161,7 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
> * ptrace_report_syscall_entry(), __secure_computing(), trace_sys_enter()
> * 2) Invocation of audit_syscall_entry()
> */
> -static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
> -{
> - unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
> -
> - if (work & SYSCALL_WORK_ENTER)
> - syscall = syscall_trace_enter(regs, syscall, work);
> -
> - return syscall;
> -}
> +long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall);
>
> /**
> * syscall_enter_from_user_mode - Establish state and check and handle work
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index e33691d5adf7..cc93cdcc36d0 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -79,6 +79,16 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
> instrumentation_end();
> }
>
> +__always_inline long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
> +{
> + unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
> +
> + if (work & SYSCALL_WORK_ENTER)
> + syscall = syscall_trace_enter(regs, syscall, work);
> +
> + return syscall;
> +}
> +
> /* Workaround to allow gradual conversion of architecture code */
> void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
This breaks s390. While you retain an external linkage, the function
loses the noinstr tag that's needed for correctness.
Also, extern __always_inline is flaky as heck. Please don't do this.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
` (21 preceding siblings ...)
2025-02-20 9:32 ` [RFC PATCH 22/22] [DEBUG] sched/fair: Debug pick_eevdf() returning NULL! K Prateek Nayak
@ 2025-02-20 10:55 ` Peter Zijlstra
2025-02-20 11:18 ` K Prateek Nayak
22 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-02-20 10:55 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
Ben Segall, Thomas Gleixner, Andy Lutomirski, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
Gautham R. Shenoy, Swapnil Sapkal
On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
> Proposed approach
> =================
>
> This approach builds on Ben Segall's proposal in [4] which marked the
> task in schedule() when exiting to usermode by setting
> "in_return_to_user" flag except this prototype takes it a step ahead and
> marks a "kernel critical section" within the syscall boundary using a
> per-task "kernel_cs_count".
>
> The rationale behind this approach is that the task can only hold
> kernel resources when running in kernel mode in preemptible context. In
> this POC, the entire syscall boundary is marked as a kernel critical
> section but in the future, the API can be used to mark fine grained
> boundaries like between an up_read(), down_read() or up_write(),
> down_write() pair.
>
> Within a kernel critical section, throttling events are deferred until
> the task's "kernel_cs_count" hits 0. Currently this count is an integer
> to catch any cases where the count turns negative as a result of
> oversights on my part but this could be changed to a preempt count like
> mechanism to request a resched.
>
> cfs_rq throttled picked again
> v v
>
> ----|*********| (preempted by tick / wakeup) |***********| (full throttle)
>
> ^ ^
> critical section cfs_rq is throttled partially critical section
> entry since the task is still exit
> runnable as it was preempted in
> kernel critical section
>
> The EEVDF infrastructure is extended to track the avg_vruntime and the
> avg_load of only those entities preempted in kernel mode. When a cfs_rq
> is throttled, it uses these metrics to select among the kernel mode
> preempted tasks and run them until they exit to user mode.
> pick_eevdf() is made aware that it is operating on a throttled hierarchy
> to only select among these tasks that were preempted in kernel mode (and
> the sched entities of cfs_rq that lead to them). When a throttled
> entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen but
> the hierarchy remains accessible for picking until that point.
>
> root
> / \
> A B * (throttled)
> ... / | \
> 0 1* 2*
>
> (*) Preempted in kernel mode
>
> o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
> o avg_kcs_load = load(1) + load(2)
>
> o throttled_vruntime_eligible:
>
> entity preempted in kernel mode &&
> entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
>
> o rbtree is augmented with a "min_kcs_vruntime" field in sched entity
> that propagates the smallest vruntime of all the entities in
> the subtree that are preempted in kernel mode. If they were
> executing in usermode when preempted, this will be set to LLONG_MAX.
>
> This is used to construct a min-heap and select through the
> entities. Consider rbtree of B to look like:
>
> 1*
> / \
> 2* 0
>
> min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX;
> min_kcs_vruntime = min(se->min_kcs_vruntime,
> __node_2_se(rb_left)->min_kcs_vruntime,
> __node_2_se(rb_right)->min_kcs_vruntime);
>
> pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
> tree to first check the left subtree for eligibility, then the node
> itself, and then the right subtree.
>
*groan*... why not actually dequeue the tasks and only retain those with
non-zero cs_count? That avoids having to duplicate everything, no?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header
2025-02-20 10:43 ` Peter Zijlstra
@ 2025-02-20 10:56 ` K Prateek Nayak
0 siblings, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 10:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
Ben Segall, Thomas Gleixner, Andy Lutomirski, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
Gautham R. Shenoy, Swapnil Sapkal
Hello Peter,
Thank you for taking a look.
On 2/20/2025 4:13 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 09:32:36AM +0000, K Prateek Nayak wrote:
>> Retain the prototype of syscall_enter_from_user_mode_work() in
>> linux/entry-common.h and move the function definition to
>> kernel/entry/common.c in preparation to notify the scheduler of task
>> entering and exiting kernel mode for syscall. The two architectures that
>> use it directly (x86, s390) and the four that call it via
>> syscall_enter_from_user_mode() (x86, riscv, loongarch, s390) end up
>> selecting GENERIC_ENTRY, hence, no functional changes are intended.
>>
>> [..snip..]
>>
>> @@ -79,6 +79,16 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
>> instrumentation_end();
>> }
>>
>> +__always_inline long syscall_enter_from_user_mode_work(struct pt_regs *regs, long syscall)
>> +{
>> + unsigned long work = READ_ONCE(current_thread_info()->syscall_work);
>> +
>> + if (work & SYSCALL_WORK_ENTER)
>> + syscall = syscall_trace_enter(regs, syscall, work);
>> +
>> + return syscall;
>> +}
>> +
>> /* Workaround to allow gradual conversion of architecture code */
>> void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
>
> This breaks s390. While you retain an external linkage, the function
> loses the noinstr tag that's needed for correctness.
>
> Also, extern __always_inline is flaky as heck. Please don't do this.
Noted! I'll try to find another way around this.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 10:55 ` [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode Peter Zijlstra
@ 2025-02-20 11:18 ` K Prateek Nayak
2025-02-20 11:32 ` Peter Zijlstra
0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 11:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
Ben Segall, Thomas Gleixner, Andy Lutomirski, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
Gautham R. Shenoy, Swapnil Sapkal
Hello Peter,
On 2/20/2025 4:25 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
>> Proposed approach
>> =================
>>
>> This approach builds on Ben Segall's proposal in [4] which marked the
>> task in schedule() when exiting to usermode by setting
>> "in_return_to_user" flag except this prototype takes it a step ahead and
>> marks a "kernel critical section" within the syscall boundary using a
>> per-task "kernel_cs_count".
>>
>> The rationale behind this approach is that the task can only hold
>> kernel resources when running in kernel mode in preemptible context. In
>> this POC, the entire syscall boundary is marked as a kernel critical
>> section but in the future, the API can be used to mark fine grained
>> boundaries like between an up_read(), down_read() or up_write(),
>> down_write() pair.
>>
>> Within a kernel critical section, throttling events are deferred until
>> the task's "kernel_cs_count" hits 0. Currently this count is an integer
>> to catch any cases where the count turns negative as a result of
>> oversights on my part but this could be changed to a preempt count like
>> mechanism to request a resched.
>>
>> cfs_rq throttled picked again
>> v v
>>
>> ----|*********| (preempted by tick / wakeup) |***********| (full throttle)
>>
>> ^ ^
>> critical section cfs_rq is throttled partially critical section
>> entry since the task is still exit
>> runnable as it was preempted in
>> kernel critical section
>>
>> The EEVDF infrastructure is extended to track the avg_vruntime and the
>> avg_load of only those entities preempted in kernel mode. When a cfs_rq
>> is throttled, it uses these metrics to select among the kernel mode
>> preempted tasks and run them until they exit to user mode.
>> pick_eevdf() is made aware that it is operating on a throttled hierarchy
>> to only select among these tasks that were preempted in kernel mode (and
>> the sched entities of cfs_rq that lead to them). When a throttled
>> entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen but
>> the hierarchy remains accessible for picking until that point.
>>
>> root
>> / \
>> A B * (throttled)
>> ... / | \
>> 0 1* 2*
>>
>> (*) Preempted in kernel mode
>>
>> o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
>> o avg_kcs_load = load(1) + load(2)
>>
>> o throttled_vruntime_eligible:
>>
>> entity preempted in kernel mode &&
>> entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
>>
>> o rbtree is augmented with a "min_kcs_vruntime" field in sched entity
>> that propagates the smallest vruntime of all the entities in
>> the subtree that are preempted in kernel mode. If they were
>> executing in usermode when preempted, this will be set to LLONG_MAX.
>>
>> This is used to construct a min-heap and select through the
>> entities. Consider rbtree of B to look like:
>>
>> 1*
>> / \
>> 2* 0
>>
>> min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX;
>> min_kcs_vruntime = min(se->min_kcs_vruntime,
>> __node_2_se(rb_left)->min_kcs_vruntime,
>> __node_2_se(rb_right)->min_kcs_vruntime);
>>
>> pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
>> tree to first check the left subtree for eligibility, then the node
>> itself, and then the right subtree.
>>
>
> *groan*... why not actually dequeue the tasks and only retain those with
> non-zero cs_count? That avoids having to duplicate everything, no?
The rationale there was that, with a growing number of tasks on the
cfs_rq, the throttle path has to perform a lot of dequeues and the
unthrottle at distribution has to enqueue all the dequeued threads back.
This is one way to keep all the tasks queued but allow pick to only
select among those that are preempted in kernel mode.
Since per-task throttling needs to tag, dequeue, and re-enqueue each
task, I'm putting this out as an alternate approach that does not
increase the complexities of tg_tree walks which Ben had noted on
Valentin's series [1]. Instead we retain the per cfs_rq throttling
at the cost of some stats tracking at enqueue and dequeue
boundaries.
If you have strong feelings against any specific part, or the entirety
of this approach, please do let me know, and I'll do my best to see if
a tweaked approach or an alternate implementation can scale well with
growing thread counts (or at least try to defend the bits in question if
they still hold merit).
Any and all feedback is appreciated :)
[1] https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 11:18 ` K Prateek Nayak
@ 2025-02-20 11:32 ` Peter Zijlstra
2025-02-20 12:04 ` K Prateek Nayak
2025-02-20 15:40 ` Valentin Schneider
0 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-02-20 11:32 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
Ben Segall, Thomas Gleixner, Andy Lutomirski, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Josh Don, Qais Yousef, Paul E. McKenney, David Vernet,
Gautham R. Shenoy, Swapnil Sapkal
On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
> The rationale there was that, with a growing number of tasks on the
> cfs_rq, the throttle path has to perform a lot of dequeues and the
> unthrottle at distribution has to enqueue all the dequeued threads back.
>
> This is one way to keep all the tasks queued but allow pick to only
> select among those that are preempted in kernel mode.
>
> Since per-task throttling needs to tag, dequeue, and re-enqueue each
> task, I'm putting this out as an alternate approach that does not
> increase the complexities of tg_tree walks which Ben had noted on
> Valentin's series [1]. Instead we retain the per cfs_rq throttling
> at the cost of some stats tracking at enqueue and dequeue
> boundaries.
>
> If you have strong feelings against any specific part, or the entirety
> of this approach, please do let me know, and I'll do my best to see if
> a tweaked approach or an alternate implementation can scale well with
> growing thread counts (or at least try to defend the bits in question if
> they still hold merit).
>
> Any and all feedback is appreciated :)
Pfff.. I hate it all :-)
So the dequeue approach puts the pain on the people actually using the
bandwidth crud, while this 'some extra accounting' crap has *everybody*
pay for this nonsense, right?
I'm not sure how bad this extra accounting is, but I do fear death by a
thousand cuts.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 11:32 ` Peter Zijlstra
@ 2025-02-20 12:04 ` K Prateek Nayak
2025-02-21 2:04 ` Josh Don
2025-02-20 15:40 ` Valentin Schneider
1 sibling, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 12:04 UTC (permalink / raw)
To: Peter Zijlstra, Ben Segall, Josh Don
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Valentin Schneider,
Thomas Gleixner, Andy Lutomirski, linux-kernel, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Sebastian Andrzej Siewior,
Clark Williams, linux-rt-devel, Tejun Heo, Frederic Weisbecker,
Barret Rhoden, Petr Mladek, Qais Yousef, Paul E. McKenney,
David Vernet, Gautham R. Shenoy, Swapnil Sapkal
Hello Peter,
On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>
>> The rationale there was that, with a growing number of tasks on the
>> cfs_rq, the throttle path has to perform a lot of dequeues and the
>> unthrottle at distribution has to enqueue all the dequeued threads back.
>>
>> This is one way to keep all the tasks queued but allow pick to only
>> select among those that are preempted in kernel mode.
>>
>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>> task, I'm putting this out as an alternate approach that does not
>> increase the complexities of tg_tree walks which Ben had noted on
>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>> at the cost of some stats tracking at enqueue and dequeue
>> boundaries.
>>
>> If you have strong feelings against any specific part, or the entirety
>> of this approach, please do let me know, and I'll do my best to see if
>> a tweaked approach or an alternate implementation can scale well with
>> growing thread counts (or at least try to defend the bits in question if
>> they still hold merit).
>>
>> Any and all feedback is appreciated :)
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud,
Back in Josh Don's presentation at the "Humongous Servers vs Kernel
Scalability" BoF [1] at LPC'24, they mentioned that one server handles
around "O(250k) threads" (Slide 21).
Assuming 256 logical CPUs from the first couple of slides, that is
about 1K tasks that can potentially be throttled in one go on each
CPU. Doing that within a single rq_lock critical section may take quite
a bit of time.
Is the expectation that these deployments have to be managed more
smartly if we move to a per-task throttling model? Else it is just
a hard lockup by a thousand tasks.
If Ben or Josh can comment on any scalability issues they might have
seen on their deployments, and any lessons they have drawn from them
since LPC'24, it would be great. Any stats on the number of tasks that
get throttled in one go would also be helpful.
[1] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf
> while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?
That is correct. Let me go and get some numbers to see if the overhead
is visible, but with deeper hierarchies there is already a lot going on
that may hide these overheads. I'll try with different nesting levels
and a wakeup-heavy task.
>
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.
We surely don't want that!
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 11:32 ` Peter Zijlstra
2025-02-20 12:04 ` K Prateek Nayak
@ 2025-02-20 15:40 ` Valentin Schneider
2025-02-20 16:58 ` K Prateek Nayak
2025-02-21 1:47 ` Josh Don
1 sibling, 2 replies; 36+ messages in thread
From: Valentin Schneider @ 2025-02-20 15:40 UTC (permalink / raw)
To: Peter Zijlstra, K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Ben Segall,
Thomas Gleixner, Andy Lutomirski, linux-kernel, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Sebastian Andrzej Siewior,
Clark Williams, linux-rt-devel, Tejun Heo, Frederic Weisbecker,
Barret Rhoden, Petr Mladek, Josh Don, Qais Yousef,
Paul E. McKenney, David Vernet, Gautham R. Shenoy, Swapnil Sapkal
On 20/02/25 12:32, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>
>> The rationale there was with growing number of tasks on cfs_rq, the
>> throttle path has to perform a lot of dequeues and the unthrottle at
>> distribution has to enqueue all the dequeued threads back.
>>
>> This is one way to keep all the tasks queued but allow pick to only
>> select among those that are preempted in kernel mode.
>>
>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>> task, I'm putting this out as an alternate approach that does not
>> increase the complexities of tg_tree walks which Ben had noted on
>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>> at the cost of some stats tracking at enqueue and dequeue
>> boundaries.
>>
>> If you have strong feelings against any specific part, or the entirety
>> of this approach, please do let me know, and I'll do my best to see if
>> a tweaked approach or an alternate implementation can scale well with
>> growing thread counts (or at least try to defend the bits in question if
>> they hold merit still).
>>
>> Any and all feedback is appreciated :)
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud, while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?
>
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.
FWIW that was my main worry with the dual tree approach and why I gave up
on it in favor of the per-task dequeue faff. Having the overhead mainly
contained in throttle/unthrottle is a lot more attractive than adding
(arguably small) overhead to the enqueue/dequeue paths. There was also the
headache of figuring out what to do with the .*nr_running fields and what
is reflected to load balance, which isn't an issue with the per-task thing.
As pointed out by Ben in [1], the issue with the per-task approach is the
scalability of the unthrottle. You have the rq lock held and you
potentially end up unthrottling a deep cgroup hierarchy, putting each
individual task back on its cfs_rq.
I can't find my notes on that in a hurry, but my idea with that for a next
version was to periodically release the rq lock as we go up the cgroup
hierarchy during unthrottle - the idea being that we can mess with part of
hierarchy, and as long as that part isn't connected to the rest (i.e. it's
not enqueued, like we currently do for CFS throttling), "it should be
safe".
FYI I haven't given up on this, it's just that repeatedly context switching
between IPI deferral and this didn't really work for me so I'm sticking to
one 'till it gets somewhere.
[1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 15:40 ` Valentin Schneider
@ 2025-02-20 16:58 ` K Prateek Nayak
2025-02-21 1:47 ` Josh Don
1 sibling, 0 replies; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-20 16:58 UTC (permalink / raw)
To: Valentin Schneider, Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Ben Segall,
Thomas Gleixner, Andy Lutomirski, linux-kernel, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Sebastian Andrzej Siewior,
Clark Williams, linux-rt-devel, Tejun Heo, Frederic Weisbecker,
Barret Rhoden, Petr Mladek, Josh Don, Qais Yousef,
Paul E. McKenney, David Vernet, Gautham R. Shenoy, Swapnil Sapkal
Hello Valentin,
On 2/20/2025 9:10 PM, Valentin Schneider wrote:
> On 20/02/25 12:32, Peter Zijlstra wrote:
>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>
>>> The rationale there was with growing number of tasks on cfs_rq, the
>>> throttle path has to perform a lot of dequeues and the unthrottle at
>>> distribution has to enqueue all the dequeued threads back.
>>>
>>> This is one way to keep all the tasks queued but allow pick to only
>>> select among those that are preempted in kernel mode.
>>>
>>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>>> task, I'm putting this out as an alternate approach that does not
>>> increase the complexities of tg_tree walks which Ben had noted on
>>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>>> at the cost of some stats tracking at enqueue and dequeue
>>> boundaries.
>>>
>>> If you have strong feelings against any specific part, or the entirety
>>> of this approach, please do let me know, and I'll do my best to see if
>>> a tweaked approach or an alternate implementation can scale well with
>>> growing thread counts (or at least try to defend the bits in question if
>>> they hold merit still).
>>>
>>> Any and all feedback is appreciated :)
>>
>> Pfff.. I hate it all :-)
>>
>> So the dequeue approach puts the pain on the people actually using the
>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>> pay for this nonsense, right?
>>
>> I'm not sure how bad this extra accounting is, but I do fear death by a
>> thousand cuts.
>
> FWIW that was my main worry with the dual tree approach and why I gave up
> on it in favor of the per-task dequeue faff. Having the overhead mainly
> contained in throttle/unthrottle is a lot more attractive than adding
> (arguably small) overhead to the enqueue/dequeue paths. There was also the
> headache of figuring out what to do with the .*nr_running fields and what
> is reflected to load balance, which isn't an issue with the per-task thing.
My thinking was that with the differentiation between nr_queued and
nr_runnable now, the counts would be simpler to keep correct (I might
be wrong).
This approach retains the single rbtree, but yes, there is a cost
associated with maintaining these stats. The stats collection can be
deferred until a bandwidth constraint is first enforced, but a small
cost then remains in every enqueue, dequeue, put_prev_entity, and
set_next_entity path thereafter.
Arguably, this should be no more costly than the current tracking of
h_nr_delayed + min_slice in the enqueue and dequeue paths, but I might
be wrong.
>
> As pointed out by Ben in [1], the issue with the per-task approach is the
> scalability of the unthrottle. You have the rq lock held and you
> potentially end up unthrottling a deep cgroup hierarchy, putting each
> individual task back on its cfs_rq.
Agreed which is why this alternate approach to retain the throttling and
unthrottling at cfs_rq level was worth a try.
>
> I can't find my notes on that in a hurry, but my idea with that for a next
> version was to periodically release the rq lock as we go up the cgroup
> hierarchy during unthrottle - the idea being that we can mess with part of
> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
> not enqueued, like we currently do for CFS throttling), "it should be
> safe".
That is pretty nifty! My only concern there would be the case where a
part of the throttled hierarchy is still reachable on unthrottle but
another part has dequeued itself - some tasks might have to wait until
their part is queued again to be reachable during the pick, and a
bunch of rescheds would follow with each batch of enqueues.
>
> FYI I haven't given up on this, it's just that repeatedly context switching
> between IPI deferral and this didn't really work for me so I'm sticking to
> one 'till it gets somewhere.
Ack! This RFC was to get feedback from folks and to see if there are
any takers for cfs_rq-level throttling, and to understand the reasons
for moving to per-task throttling. Safe to say I'm slowly getting some
answers :)
>
> [1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 15:40 ` Valentin Schneider
2025-02-20 16:58 ` K Prateek Nayak
@ 2025-02-21 1:47 ` Josh Don
2025-02-25 21:13 ` Valentin Schneider
1 sibling, 1 reply; 36+ messages in thread
From: Josh Don @ 2025-02-21 1:47 UTC (permalink / raw)
To: Valentin Schneider
Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
Vincent Guittot, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Qais Yousef, Paul E. McKenney, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal
On Thu, Feb 20, 2025 at 7:40 AM Valentin Schneider <vschneid@redhat.com> wrote:
...
> As pointed out by Ben in [1], the issue with the per-task approach is the
> scalability of the unthrottle. You have the rq lock held and you
> potentially end up unthrottling a deep cgroup hierarchy, putting each
> individual task back on its cfs_rq.
>
> I can't find my notes on that in a hurry, but my idea with that for a next
> version was to periodically release the rq lock as we go up the cgroup
> hierarchy during unthrottle - the idea being that we can mess with part of
> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
> not enqueued, like we currently do for CFS throttling), "it should be
> safe".
Can you elaborate a bit more? Even if we periodically release the
lock, we're still spending quite a long time in non-preemptible kernel
context, and unthrottle is also driven by an hrtimer. So we can still
do quite a lot of damage depending on how long the whole loop takes.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-20 12:04 ` K Prateek Nayak
@ 2025-02-21 2:04 ` Josh Don
2025-02-21 3:37 ` K Prateek Nayak
0 siblings, 1 reply; 36+ messages in thread
From: Josh Don @ 2025-02-21 2:04 UTC (permalink / raw)
To: K Prateek Nayak, Peter Zijlstra
Cc: Ben Segall, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Thomas Gleixner, Andy Lutomirski,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Qais Yousef, Paul E. McKenney, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal
On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Peter,
>
> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> > On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
> >> Any and all feedback is appreciated :)
> >
> > Pfff.. I hate it all :-)
> >
> > So the dequeue approach puts the pain on the people actually using the
> > bandwidth crud, while this 'some extra accounting' crap has *everybody*
> > pay for this nonsense, right?
Doing the context tracking could also provide benefit beyond CFS
bandwidth. As an example, we often see a pattern where a thread
acquires one mutex, then sleeps on trying to take a second mutex. When
the thread eventually is woken due to the second mutex now being
available, the thread now needs to wait to get back on cpu, which can
take an arbitrary amount of time depending on where it landed in the
tree, its weight, etc. Other threads trying to acquire that first
mutex now experience priority inversion as they must wait for the
original thread to get back on cpu and release the mutex. Re-using the
same context tracking, we could prioritize execution of threads in
kernel critical sections, even if they aren't the fair next choice.
If that isn't convincing enough, we could certainly throw another
kconfig or boot param for this behavior :)
> Is the expectation that these deployments have to be managed more
> smartly if we move to a per-task throttling model? Else it is just
> hard lockup by a thousand tasks.
+1, I don't see the per-task throttling being able to scale here.
> If Ben or Josh can comment on any scalability issues they might have
> seen on their deployment and any learning they have drawn from them
> since LPC'24, it would be great. Any stats on number of tasks that
> get throttled at one go would also be helpful.
Maybe just to emphasize that we continue to see the same type of
slowness; throttle/unthrottle when traversing a large cgroup
sub-hierarchy is still an issue for us and we're working on sending a
patch to ideally break this up to do the updates more lazily, as
described at LPC.
In particular, throttle/unthrottle (whether it be on a group basis or
a per-task basis) is a loop that is subject to a lot of cache misses.
Best,
Josh
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-21 2:04 ` Josh Don
@ 2025-02-21 3:37 ` K Prateek Nayak
2025-02-21 19:42 ` Josh Don
0 siblings, 1 reply; 36+ messages in thread
From: K Prateek Nayak @ 2025-02-21 3:37 UTC (permalink / raw)
To: Josh Don, Peter Zijlstra
Cc: Ben Segall, Ingo Molnar, Juri Lelli, Vincent Guittot,
Valentin Schneider, Thomas Gleixner, Andy Lutomirski,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Qais Yousef, Paul E. McKenney, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal
Hello Josh,
Thank you for sharing the background!
On 2/21/2025 7:34 AM, Josh Don wrote:
> On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Peter,
>>
>> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
>>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>>> Any and all feedback is appreciated :)
>>>
>>> Pfff.. I hate it all :-)
>>>
>>> So the dequeue approach puts the pain on the people actually using the
>>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>>> pay for this nonsense, right?
>
> Doing the context tracking could also provide benefit beyond CFS
> bandwidth. As an example, we often see a pattern where a thread
> acquires one mutex, then sleeps on trying to take a second mutex. When
> the thread eventually is woken due to the second mutex now being
> available, the thread now needs to wait to get back on cpu, which can
> take an arbitrary amount of time depending on where it landed in the
> tree, its weight, etc. Other threads trying to acquire that first
> mutex now experience priority inversion as they must wait for the
> original thread to get back on cpu and release the mutex. Re-using the
> same context tracking, we could prioritize execution of threads in
> kernel critical sections, even if they aren't the fair next choice.
Just out of curiosity, have you tried running with proxy-execution [1][2]
on your deployments to mitigate priority inversion in mutexes? I've
tested it with smaller-scale benchmarks and I haven't seen much overhead
except in the case of a few microbenchmarks, but I'm not sure if you've
run into any issues at your scale.
[1] https://lore.kernel.org/lkml/20241125195204.2374458-1-jstultz@google.com/
[2] https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v14-6.13-rc1/
>
> If that isn't convincing enough, we could certainly throw another
> kconfig or boot param for this behavior :)
>
>> Is the expectation that these deployments have to be managed more
>> smartly if we move to a per-task throttling model? Else it is just
>> hard lockup by a thousand tasks.
>
> +1, I don't see the per-task throttling being able to scale here.
>
>> If Ben or Josh can comment on any scalability issues they might have
>> seen on their deployment and any learning they have drawn from them
>> since LPC'24, it would be great. Any stats on number of tasks that
>> get throttled at one go would also be helpful.
>
> Maybe just to emphasize that we continue to see the same type of
> slowness; throttle/unthrottle when traversing a large cgroup
> sub-hierarchy is still an issue for us and we're working on sending a
> patch to ideally break this up to do the updates more lazily, as
> described at LPC.
Is it possible to share an example hierarchy from one of your
deployments? Your presentation for LPC'24 [1] says "O(1000) cgroups" but
is it possible to reveal the kind of nesting you deal with and at which
levels bandwidth controls are set? Even something like "O(10) cgroups on
root with BW throttling set, and each of them containing O(100) cgroups
below" could also help match a test setup.
[1] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf
>
> In particular, throttle/unthrottle (whether it be on a group basis or
> a per-task basis) is a loop that is subject to a lot of cache misses.
>
> Best,
> Josh
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-21 3:37 ` K Prateek Nayak
@ 2025-02-21 19:42 ` Josh Don
0 siblings, 0 replies; 36+ messages in thread
From: Josh Don @ 2025-02-21 19:42 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, Ben Segall, Ingo Molnar, Juri Lelli,
Vincent Guittot, Valentin Schneider, Thomas Gleixner,
Andy Lutomirski, linux-kernel, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Sebastian Andrzej Siewior, Clark Williams,
linux-rt-devel, Tejun Heo, Frederic Weisbecker, Barret Rhoden,
Petr Mladek, Qais Yousef, Paul E. McKenney, David Vernet,
Gautham R. Shenoy, Swapnil Sapkal
On Thu, Feb 20, 2025 at 7:38 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
...
> Just out of curiosity, have you tried running with proxy-execution [1][2]
> on your deployments to mitigate priority inversion in mutexes? I've
> tested it with smaller scale benchmarks and I haven't seem much overhead
> except for in case of a few microbenchmarks but I'm not sure if you've
> run into any issues at your scale.
The confounding issue is that we see tail issues with other types of
primitives, such as semaphores. That led us to try an approach similar
to yours, treating kernel mode as a critical section from the
perspective of e.g. CFSB.
> Is it possible to share an example hierarchy from one of your
> deployments? Your presentation for LPC'24 [1] says "O(1000) cgroups" but
> is it possible to reveal the kind of nesting you deal with and at which
> levels are bandwidth controls set. Even something like "O(10) cgroups on
> root with BW throttling set, and each of them contain O(100) cgroups
> below" could also help match a test setup.
Sure, I can help shed some additional light. In terms of cgroup depth,
we try to keep that fairly limited, given the cgroup depth scaling
issues with task enqueue/dequeue. Max depth is maybe around ~5
depending on the exact job configuration, with an average closer to
2-3. However, width is quite large as we have many large dual socket
machines that can handle hundreds of individual jobs (as I called out
in the presentation, larger cpu count leads to more cgroups on the
machine in order to fully utilize resources). The example I referred
to in the presentation looks something like:
root -> subtree_parent (this cgroup has CFSB enabled, period = 100ms)
-> (~300-400 direct children, with some fraction having additional
child cgroups, bringing total to O(1000))
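For anyone wanting to approximate this shape for a test setup, a
hypothetical cgroup v2 configuration could look like the following.
Everything here is an assumption for illustration: the paths, the child
count, and the 50ms quota are made up (the message only states the
100ms period), and this needs root plus a cgroup2 mount at
/sys/fs/cgroup:

```shell
# Enable the cpu controller for children of the root cgroup.
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control

# One parent with CFS bandwidth set: "<quota_us> <period_us>",
# i.e. 50ms of quota per 100ms period (quota value is invented).
mkdir /sys/fs/cgroup/subtree_parent
echo "50000 100000" > /sys/fs/cgroup/subtree_parent/cpu.max

# Many direct children under the bandwidth-limited parent.
for i in $(seq 1 300); do
	mkdir "/sys/fs/cgroup/subtree_parent/job$i"
done
```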
Best,
Josh
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
2025-02-21 1:47 ` Josh Don
@ 2025-02-25 21:13 ` Valentin Schneider
0 siblings, 0 replies; 36+ messages in thread
From: Valentin Schneider @ 2025-02-25 21:13 UTC (permalink / raw)
To: Josh Don
Cc: Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Juri Lelli,
Vincent Guittot, Ben Segall, Thomas Gleixner, Andy Lutomirski,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Sebastian Andrzej Siewior, Clark Williams, linux-rt-devel,
Tejun Heo, Frederic Weisbecker, Barret Rhoden, Petr Mladek,
Qais Yousef, Paul E. McKenney, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal
On 20/02/25 17:47, Josh Don wrote:
> On Thu, Feb 20, 2025 at 7:40 AM Valentin Schneider <vschneid@redhat.com> wrote:
> ...
>> As pointed out by Ben in [1], the issue with the per-task approach is the
>> scalability of the unthrottle. You have the rq lock held and you
>> potentially end up unthrottling a deep cgroup hierarchy, putting each
>> individual task back on its cfs_rq.
>>
>> I can't find my notes on that in a hurry, but my idea with that for a next
>> version was to periodically release the rq lock as we go up the cgroup
>> hierarchy during unthrottle - the idea being that we can mess with part of
>> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
>> not enqueued, like we currently do for CFS throttling), "it should be
>> safe".
>
> Can you elaborate a bit more? Even if we periodically release the
> lock, we're still spending quite a long time in non-preemptible kernel
> context, and unthrottle is also driven by an hrtimer. So we can still
> do quite a lot of damage depending on how long the whole loop takes.
Indeed, this only gives the rq lock a breather, but it doesn't help with
preempt / irq off.
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2025-02-25 21:13 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-20 9:32 [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 01/22] kernel/entry/common: Move syscall_enter_from_user_mode_work() out of header K Prateek Nayak
2025-02-20 10:43 ` Peter Zijlstra
2025-02-20 10:56 ` K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 02/22] sched/fair: Convert "se->runnable_weight" to unsigned int and pack the struct K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 03/22] [PoC] kernel/entry/common: Mark syscall as a kernel critical section K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 04/22] [PoC] kernel/sched: Inititalize "kernel_cs_count" for new tasks K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 05/22] sched/fair: Track EEVDF stats for entities preempted in kernel mode K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 06/22] sched/fair: Propagate the min_vruntime of kernel mode preempted entity K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 07/22] sched/fair: Propagate preempted entity information up cgroup hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 08/22] sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 09/22] sched/fair: Introduce cfs_rq throttled states in preparation for partial throttling K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 10/22] sched/fair: Prepare throttle_cfs_rq() to allow " K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 11/22] sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 12/22] sched/fair: Prepare bandwidth distribution to unthrottle partial throttles right away K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 13/22] sched/fair: Correct the throttle status supplied to pick_eevdf() K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 14/22] sched/fair: Mark a task if it was picked on a partially throttled hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 15/22] sched/fair: Call resched_curr() from sched_notify_syscall_exit() K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 16/22] sched/fair: Prepare enqueue to partially unthrottle cfs_rq K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 17/22] sched/fair: Clear pick on throttled indicator when task leave fair class K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 18/22] sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled hierarchy K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 19/22] sched/fair: Ignore in-kernel indicators for running task outside of schedule() K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 20/22] sched/fair: Implement determine_throttle_state() for partial throttle K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 21/22] [MAYBE BUGFIX] sched/fair: Group all the se->min_* members together for propagation K Prateek Nayak
2025-02-20 9:32 ` [RFC PATCH 22/22] [DEBUG] sched/fair: Debug pick_eevdf() returning NULL! K Prateek Nayak
2025-02-20 10:55 ` [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode Peter Zijlstra
2025-02-20 11:18 ` K Prateek Nayak
2025-02-20 11:32 ` Peter Zijlstra
2025-02-20 12:04 ` K Prateek Nayak
2025-02-21 2:04 ` Josh Don
2025-02-21 3:37 ` K Prateek Nayak
2025-02-21 19:42 ` Josh Don
2025-02-20 15:40 ` Valentin Schneider
2025-02-20 16:58 ` K Prateek Nayak
2025-02-21 1:47 ` Josh Don
2025-02-25 21:13 ` Valentin Schneider
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox