public inbox for linux-s390@vger.kernel.org
 help / color / mirror / Atom feed
From: Shrikanth Hegde <sshegde@linux.ibm.com>
To: Tobias Huschle <huschle@linux.ibm.com>, linux-kernel@vger.kernel.org
Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org
Subject: Re: [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked
Date: Tue, 18 Feb 2025 11:28:38 +0530	[thread overview]
Message-ID: <cc6996cc-c5c1-429d-ade0-9978b859f207@linux.ibm.com> (raw)
In-Reply-To: <20250217113252.21796-1-huschle@linux.ibm.com>



On 2/17/25 17:02, Tobias Huschle wrote:
> Changes to v1
> 
> parked vs idle
> - parked CPUs are now never considered to be idle
> - a scheduler group is now considered parked iff there are parked CPUs
>    and there are no idle CPUs, i.e. all non parked CPUs are busy or there
>    are only parked CPUs. A scheduler group with parked tasks can be
>    considered to not be parked, if it has idle CPUs which can pick up
>    the parked tasks.
> - idle_cpu_without always returns that the CPU will not be idle if the
>    CPU is parked
> 
> active balance, no_hz, queuing
> - should_we_balance always returns true if a scheduler groups contains
>    a parked CPU and that CPU has a running task
> - stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
>    if a task is running
> - tasks are being prevented to be queued on parked CPUs in ttwu_queue_cond
> 
> cleanup
> - removed duplicate checks for parked CPUs
> 
> CPU capacity
> - added a patch which removes parked cpus and their capacity from
>    scheduler statistics
> 
> 
> Original description:
> 
> Adding a new scheduler group type which allows to remove all tasks
> from certain CPUs through load balancing can help in scenarios where
> such CPUs are currently unfavorable to use, for example in a
> virtualized environment.
> 
> Functionally, this works as intended. The question would be, if this
> could be considered to be added and would be worth going forward
> with. If so, which areas would need additional attention?
> Some cases are referenced below.
> 
> The underlying concept and the approach of adding a new scheduler
> group type were presented in the Sched MC of the 2024 LPC.
> A short summary:
> 
> Some architectures (e.g. s390) provide virtualization on a firmware
> level. This implies, that Linux kernels running on such architectures
> run on virtualized CPUs.
> 
> Like in other virtualized environments, the CPUs are most likely shared
> with other guests on the hardware level. This implies, that Linux
> kernels running in such an environment may encounter 'steal time'. In
> other words, instead of being able to use all available time on a
> physical CPU, some of said available time is 'stolen' by other guests.
> 
> This can cause side effects if a guest is interrupted at an unfavorable
> point in time or if the guest is waiting for one of its other virtual
> CPUs to perform certain actions while those are suspended in favour of
> another guest.
> 
> Architectures, like arch/s390, address this issue by providing an
> alternative classification for the CPUs seen by the Linux kernel.
> 
> The following example is arch/s390 specific:
> In the default mode (horizontal CPU polarization), all CPUs are treated
> equally and can be subject to steal time equally.
> In the alternate mode (vertical CPU polarization), the underlying
> firmware hypervisor assigns the CPUs, visible to the guest, different
> types, depending on how many CPUs the guest is entitled to use. Said
> entitlement is configured by assigning weights to all active guests.
> The three CPU types are:
>      - vertical high   : On these CPUs, the guest has always highest
>                          priority over other guests. This means
>                          especially that if the guest executes tasks on
>                          these CPUs, it will encounter no steal time.
>      - vertical medium : These CPUs are meant to cover fractions of
>                          entitlement.
>      - vertical low    : These CPUs will have no priority when being
>                          scheduled. This implies especially, that while
>                          all other guests are using their full
>                          entitlement, these CPUs might not be ran for a
>                          significant amount of time.
> 
> As a consequence, using vertical lows while the underlying hypervisor
> experiences a high load, driven by all defined guests, is to be avoided.
> 
> In order to consequently move tasks off of vertical lows, introduce a
> new type of scheduler groups: group_parked.
> Parked implies, that processes should be evacuated as fast as possible
> from these CPUs. This implies that other CPUs should start pulling tasks
> immediately, while the parked CPUs should refuse to pull any tasks
> themselves.
> Adding a group type beyond group_overloaded achieves the expected
> behavior. By making its selection architecture dependent, it has
> no effect on architectures which will not make use of that group type.
> 
> This approach works very well for many kinds of workloads. Tasks are
> getting migrated back and forth in line with changing the parked
> state of the involved CPUs.
> 
> There are a couple of issues and corner cases which need further
> considerations:
> - rt & dl:      Realtime and deadline scheduling require some additional
>                  attention.

I think we need to address atleast rt, there would be some non percpu 
kworker threads which need to move out of parked cpus.

> - ext:          Probably affected as well. Needs some conceptional
>                  thoughts first.
> - raciness:     Right now, there are no synchronization efforts. It needs
>                  to be considered whether those might be necessary or if
>                  it is alright that the parked-state of a CPU might change
>                  during load-balancing.
> 
> Patches apply to tip:sched/core
> 
> The s390 patch serves as a simplified implementation example.


Gave it a try on powerpc with the debugfs file. it works for 
sched_normal tasks.

> 
> Tobias Huschle (3):
>    sched/fair: introduce new scheduler group type group_parked
>    sched/fair: adapt scheduler group weight and capacity for parked CPUs
>    s390/topology: Add initial implementation for selection of parked CPUs
> 
>   arch/s390/include/asm/smp.h    |   2 +
>   arch/s390/kernel/smp.c         |   5 ++
>   include/linux/sched/topology.h |  19 ++++++
>   kernel/sched/core.c            |  13 ++++-
>   kernel/sched/fair.c            | 104 ++++++++++++++++++++++++++++-----
>   kernel/sched/syscalls.c        |   3 +
>   6 files changed, 130 insertions(+), 16 deletions(-)
> 


  parent reply	other threads:[~2025-02-18  5:58 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-17 11:32 [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Tobias Huschle
2025-02-17 11:32 ` [RFC PATCH v2 1/3] " Tobias Huschle
2025-02-18  5:44   ` Shrikanth Hegde
2025-02-20 10:53     ` Tobias Huschle
2025-02-17 11:32 ` [RFC PATCH v2 2/3] sched/fair: adapt scheduler group weight and capacity for parked CPUs Tobias Huschle
2025-02-17 11:32 ` [RFC PATCH v2 3/3] s390/topology: Add initial implementation for selection of " Tobias Huschle
2025-02-18  5:58 ` Shrikanth Hegde [this message]
2025-02-20 10:55   ` [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked Tobias Huschle
2025-02-25 10:33     ` Shrikanth Hegde

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cc6996cc-c5c1-429d-ade0-9978b859f207@linux.ibm.com \
    --to=sshegde@linux.ibm.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=huschle@linux.ibm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox