From: Yury Norov <yury.norov@gmail.com>
To: Ankit Jain <ankit-aj.jain@broadcom.com>
Cc: linux@rasmusvillemoes.dk, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, juri.lelli@redhat.com,
pauld@redhat.com, ajay.kaher@broadcom.com,
alexey.makhalov@broadcom.com, vasavi.sirnapalli@broadcom.com,
Paul Turner <pjt@google.com>, Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH] lib/cpumask: Boot option to disable tasks distribution within cpumask
Date: Tue, 30 Apr 2024 11:23:07 -0700
Message-ID: <ZjE3C9UgeZR02Jyy@yury-ThinkPad>
In-Reply-To: <20240430090431.1619622-1-ankit-aj.jain@broadcom.com>

On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks within cpumask upon initial
> wakeup.

So let's add the authors to the CC list?

> For Telco RAN deployments, isolcpus are a necessity to cater to
> the requirement of low latency applications. These isolcpus are generally
> tickless so that high priority SCHED_FIFO tasks can execute without any
> OS jitter. Since load balancing is disabled on isocpus, any task
> which gets placed on these CPUs can not be migrated on its own.
> For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> with all isolcpus becomes the requirement and these RT applications are
> affine to execute on a specific isolcpu within the kubernetes pod.
> However, there may be some non-RT tasks which could also schedule in the
> same kubernetes pod without being affine to any specific CPU(inherits the
> pod cpuset affinity).
OK... It looks like adding scheduler maintainers is also a necessity to
cater here...
> With multiple spawning and running containers inside
> the pod, container runtime spawns several non-RT initializing tasks
> ("runc init") inside the pod and due to above mentioned commits, these
> non-RT tasks may get placed on any isolcpus and may starve if it happens
> to wakeup on the same CPU as SCHED_FIFO task because RT throttling is also
> disabled in telco setup. Thus, RAN deployment fails and eventually leads
> to system hangs.

Not that I'm familiar with your setup, but this sounds like a userspace
configuration problem. Can you try to move your non-RT tasks into a
cgroup attached to non-RT CPUs, or something like that?

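Something along these lines, as a rough userspace sketch only. It assumes
a cgroup v2 hierarchy with the cpuset controller enabled for the target
group; write_file() and confine_to_cpus() are hypothetical helper names,
and the cgroup directory and CPU list are whatever fits your deployment.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Write a string to a file; return 0 on success, -1 on failure. */
static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(val, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Hypothetical helper: restrict an existing cgroup v2 directory @cg to
 * the CPU list @cpus (for example "0-1"), then move @pid into it.
 * Assumes the cpuset controller is enabled for @cg by its parent.
 */
static int confine_to_cpus(const char *cg, const char *cpus, pid_t pid)
{
	char path[4096], buf[32];

	assert(cg && cpus);

	snprintf(path, sizeof(path), "%s/cpuset.cpus", cg);
	if (write_file(path, cpus))
		return -1;

	snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
	snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
	return write_file(path, buf);
}
```

The container runtime would call something like confine_to_cpus() for the
non-RT init tasks with a CPU list that excludes the CPUs running
SCHED_FIFO work, so initial placement never lands on them.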
> With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> there is an option provided for such usecases to disable the distribution
> of tasks within the cpumask logic and use the previous 'pick first cpu'
> approach for initial placement of tasks. Because many telco vendors
> configure the system in such a way that the first cpu within a cpuset
> of pod doesn't run any SCHED_FIFO or High priority tasks.
>
> Co-developed-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
> Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
> Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
> ---
> lib/cpumask.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index e77ee9d46f71..3dea87d5ec1f 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> }
> EXPORT_SYMBOL(cpumask_local_spread);
>
> +/*
> + * Task distribution within the cpumask feature disabled?
> + */
> +static bool cpumask_pick_firstcpu __read_mostly;
> +
> +/*
> + * Disable Tasks distribution within the cpumask feature
> + */
> +static int __init cpumask_pick_firstcpu_setup(char *str)
> +{
> +	cpumask_pick_firstcpu = 1;
> +	pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> +	return 1;
> +}
> +
> +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> +
> static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
>
> /**
> @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
>  {
>  	unsigned int next, prev;
>
> +	/*
> +	 * Don't distribute, if tasks distribution
> +	 * within cpumask feature is disabled
> +	 */
> +	if (cpumask_pick_firstcpu)
> +		return cpumask_any_and(src1p, src2p);

No, this is the wrong way.

To begin with, this parameter shouldn't control a single random
function. At the very least, the other cpumask_*_distribute() helpers
should be consistent with the policy.

But in general... I don't think we should do things like that at all.
The cpumask API is a simple and plain wrapper around bitmaps. If you want
to modify the behavior of the scheduler, you could do that at the
scheduler level, not in a random helper function.

Consider 2 cases:
 - Someone unrelated to the scheduler uses the same helper and is
   affected by this parameter inadvertently.
 - The scheduler switches to another function to distribute CPUs,
   and your setups suddenly break again. This time deep in
   production.
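
To illustrate where the policy belongs, here is a minimal userspace
sketch. It is not kernel code: toy_mask_t, toy_first_and(),
toy_distribute_and() and toy_select_cpu() are hypothetical stand-ins, and
the round-robin is only a simplified model of a "distribute" helper. The
point is that the helpers stay policy-free and the caller decides which
one to use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit n set means CPU n is usable. */
typedef uint64_t toy_mask_t;

#define TOY_NR_CPUS 64u

/* Lowest CPU set in both masks, or TOY_NR_CPUS if the intersection is empty. */
static unsigned int toy_first_and(toy_mask_t a, toy_mask_t b)
{
	toy_mask_t m = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (m & ((toy_mask_t)1 << cpu))
			return cpu;
	return TOY_NR_CPUS;
}

/*
 * Round-robin: next CPU after *prev that is set in both masks, wrapping
 * around. Remembers the choice in *prev. Returns TOY_NR_CPUS if empty.
 */
static unsigned int toy_distribute_and(toy_mask_t a, toy_mask_t b,
				       unsigned int *prev)
{
	toy_mask_t m = a & b;
	unsigned int i, cpu;

	assert(*prev < TOY_NR_CPUS);

	for (i = 1; i <= TOY_NR_CPUS; i++) {
		cpu = (*prev + i) % TOY_NR_CPUS;
		if (m & ((toy_mask_t)1 << cpu)) {
			*prev = cpu;
			return cpu;
		}
	}
	return TOY_NR_CPUS;
}

enum toy_policy { TOY_PICK_FIRST, TOY_DISTRIBUTE };

/* The policy decision lives in the caller, not inside the mask helpers. */
static unsigned int toy_select_cpu(enum toy_policy policy, toy_mask_t allowed,
				   toy_mask_t online, unsigned int *prev)
{
	if (policy == TOY_PICK_FIRST)
		return toy_first_and(allowed, online);
	return toy_distribute_and(allowed, online, prev);
}
```

With allowed CPUs {1, 2, 4} and *prev starting at 0, TOY_PICK_FIRST
always yields 1, while TOY_DISTRIBUTE cycles through 1, 2, 4 and wraps
back to 1. Other users of the helpers are unaffected by the policy.
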
Thanks,
Yury

>  	/* NOTE: our first selection will skip 0. */
>  	prev = __this_cpu_read(distribute_cpu_mask_prev);
>
> --
> 2.23.1