From: Yury Norov <yury.norov@gmail.com>
To: Ankit Jain <ankit-aj.jain@broadcom.com>
Cc: linux@rasmusvillemoes.dk, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, juri.lelli@redhat.com,
	pauld@redhat.com, ajay.kaher@broadcom.com,
	alexey.makhalov@broadcom.com, vasavi.sirnapalli@broadcom.com,
	Paul Turner <pjt@google.com>, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>
Subject: Re: [PATCH] lib/cpumask: Boot option to disable tasks distribution within cpumask
Date: Tue, 30 Apr 2024 11:23:07 -0700
Message-ID: <ZjE3C9UgeZR02Jyy@yury-ThinkPad>
In-Reply-To: <20240430090431.1619622-1-ankit-aj.jain@broadcom.com>

On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks within cpumask upon initial
> wakeup.

So let's add the authors to the CC list?

> For Telco RAN deployments, isolcpus are a necessity to cater to
> the requirement of low latency applications. These isolcpus are generally
> tickless so that high priority SCHED_FIFO tasks can execute without any
> OS jitter. Since load balancing is disabled on isocpus, any task
> which gets placed on these CPUs can not be migrated on its own.
> For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> with all isolcpus becomes the requirement and these RT applications are
> affine to execute on a specific isolcpu within the kubernetes pod.
> However, there may be some non-RT tasks which could also schedule in the
> same kubernetes pod without being affine to any specific CPU(inherits the
> pod cpuset affinity).

OK... It looks like adding scheduler maintainers is also a necessity to
cater here...

> With multiple spawning and running containers inside
> the pod, container runtime spawns several non-RT initializing tasks
> ("runc init") inside the pod and due to above mentioned commits, these
> non-RT tasks may get placed on any isolcpus and may starve if it happens
> to wakeup on the same CPU as SCHED_FIFO task because RT throttling is also
> disabled in telco setup. Thus, RAN deployment fails and eventually leads
> to system hangs.

Not that I'm familiar with your setup, but this sounds like a userspace
configuration problem. Can you try to move your non-RT tasks into a
cgroup attached to non-RT CPUs, or something like that?
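
A rough sketch of what that could look like from userspace, assuming
cgroup v2 with the cpuset controller enabled. The cgroup directory, the
housekeeping mask and all helper names below are hypothetical; a real
Kubernetes deployment would normally let the container runtime manage
its cgroups, so treat this only as an illustration of the idea:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the set bits of @mask as a kernel-style CPU list ("0-1,6-7").
 * Returns 0 on success, -1 if @buf is too small.
 */
static int format_cpulist(uint64_t mask, char *buf, size_t len)
{
	size_t used = 0;
	int cpu = 0;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	while (cpu < 64) {
		int start, end, n;

		if (!(mask & (UINT64_C(1) << cpu))) {
			cpu++;
			continue;
		}
		/* Found the start of a run of set bits; find its end. */
		start = cpu;
		while (cpu < 64 && (mask & (UINT64_C(1) << cpu)))
			cpu++;
		end = cpu - 1;

		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     used ? "," : "", start, end);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}

/* Write @val to @path; returns 0 on success, -1 on any error. */
static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int rc;

	if (!f)
		return -1;
	rc = (fputs(val, f) == EOF) ? -1 : 0;
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}

/*
 * Confine @pid to the CPUs in @housekeeping through a cgroup v2 directory.
 * @cgroup_dir is an assumption: it must already exist and have the
 * cpuset controller enabled by its parent.
 */
static int confine_pid(const char *cgroup_dir, uint64_t housekeeping, long pid)
{
	char path[512], val[256];

	if (format_cpulist(housekeeping, val, sizeof(val)))
		return -1;
	snprintf(path, sizeof(path), "%s/cpuset.cpus", cgroup_dir);
	if (write_str(path, val))
		return -1;
	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
	snprintf(val, sizeof(val), "%ld", pid);
	return write_str(path, val);
}
```

With something like this, the "runc init" style helpers would be placed
by configuration rather than by whichever CPU the kernel happens to pick.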

> With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> there is an option provided for such usecases to disable the distribution
> of tasks within the cpumask logic and use the previous 'pick first cpu'
> approach for initial placement of tasks. Because many telco vendors
> configure the system in such a way that the first cpu within a cpuset
> of pod doesn't run any SCHED_FIFO or High priority tasks.
> 
> Co-developed-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
> Signed-off-by: Alexey Makhalov <alexey.makhalov@broadcom.com>
> Signed-off-by: Ankit Jain <ankit-aj.jain@broadcom.com>
> ---
>  lib/cpumask.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index e77ee9d46f71..3dea87d5ec1f 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>  }
>  EXPORT_SYMBOL(cpumask_local_spread);
>  
> +/*
> + * Task distribution within the cpumask feature disabled?
> + */
> +static bool cpumask_pick_firstcpu __read_mostly;
> +
> +/*
> + * Disable Tasks distribution within the cpumask feature
> + */
> +static int __init cpumask_pick_firstcpu_setup(char *str)
> +{
> +	cpumask_pick_firstcpu = 1;
> +	pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> +	return 1;
> +}
> +
> +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> +
>  static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
>  
>  /**
> @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
>  {
>  	unsigned int next, prev;
>  
> +	/*
> +	 * Don't distribute, if tasks distribution
> +	 * within cpumask feature is disabled
> +	 */
> +	if (cpumask_pick_firstcpu)
> +		return cpumask_any_and(src1p, src2p);

No, this is the wrong way.

To begin with, this parameter shouldn't control a single random
function. At the very least, the other cpumask_*_distribute() helpers
should be consistent with the policy.

But in general... I don't think we should do things like that at all.
The cpumask API is a simple, plain wrapper around bitmaps. If you want
to modify the behavior of the scheduler, do that at the scheduler
level, not in a random helper function.

Consider 2 cases:
 - Someone unrelated to the scheduler uses the same helper and is
   inadvertently affected by this parameter.
 - The scheduler switches to another function to distribute CPUs,
   and your setups suddenly break again. This time deep in
   production.
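
To make the two cases concrete, here is a minimal userspace model, not
kernel code. A mask is a plain 64-bit word, and pick_first(),
pick_distribute(), struct placement_policy and place_task() are all
hypothetical names. The helpers stay policy-free; the choice between
"first" and "distribute" is made by the caller that owns the policy:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Helper: lowest set bit of @mask, or -1 if the mask is empty. */
static int pick_first(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/*
 * Helper: next set bit after *prev, wrapping around, or -1 if the mask
 * is empty. The caller owns @prev, so unrelated users do not share state.
 */
static int pick_distribute(uint64_t mask, int *prev)
{
	assert(*prev >= -1 && *prev < 64);

	for (int i = 1; i <= 64; i++) {
		int cpu = (*prev + i) % 64;

		if (mask & (UINT64_C(1) << cpu)) {
			*prev = cpu;
			return cpu;
		}
	}
	return -1;
}

/* The policy lives with the caller (the "scheduler" of this model). */
struct placement_policy {
	bool pick_first;	/* true: always take the first allowed CPU */
	int prev;		/* last CPU handed out; -1 before first use */
};

static int place_task(struct placement_policy *p, uint64_t allowed)
{
	if (p->pick_first)
		return pick_first(allowed);
	return pick_distribute(allowed, &p->prev);
}
```

Any other user of pick_distribute() keeps its behavior no matter how
place_task() is configured, and the policy follows place_task() even if
it later switches to a different helper.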

Thanks,
Yury

>  	/* NOTE: our first selection will skip 0. */
>  	prev = __this_cpu_read(distribute_cpu_mask_prev);
>  
> -- 
> 2.23.1
