From: Ankit Jain <ankitja@vmware.com>
To: peterz@infradead.org, yury.norov@gmail.com,
andriy.shevchenko@linux.intel.com, linux@rasmusvillemoes.dk,
qyousef@layalina.io, pjt@google.com, joshdon@google.com,
bristot@redhat.com, vschneid@redhat.com,
linux-kernel@vger.kernel.org
Cc: namit@vmware.com, amakhalov@vmware.com, srinidhir@vmware.com,
vsirnapalli@vmware.com, vbrahmajosyula@vmware.com,
akaher@vmware.com, srivatsa@csail.mit.edu,
Ankit Jain <ankitja@vmware.com>
Subject: [PATCH RFC] cpumask: Randomly distribute the tasks within affinity mask
Date: Wed, 11 Oct 2023 12:49:25 +0530
Message-ID: <20231011071925.761590-1-ankitja@vmware.com>

commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
introduced logic to distribute tasks at initial wakeup across cpus
where load balancing works poorly or is disabled entirely (isolated cpus).
However, there are cases in which tasks spawned onto isolcpus are not
distributed properly. In a production deployment, the initial wakeup of
tasks spawned from housekeeping cpus onto isolcpus (nohz_full cpus)
happens on the first cpu within the isolcpus range instead of being
distributed across the isolcpus. Because distribute_cpu_mask_prev is
shared, one group of processes clobbers the value stored by another
group, and vice versa.
When housekeeping cpus spawn multiple child tasks to wake up on
isolcpus (nohz_full cpus) via cpusets.cpus/sched_setaffinity(),
distribution is currently based on the per-cpu
distribute_cpu_mask_prev counter. At the same time, the housekeeping
cpus run percpu-bound timer interrupts, rcu threads, and other
system/user tasks whose affinity is the housekeeping cpus. In a
real-life environment, the housekeeping cpus are few and heavily
loaded. So the distribute_cpu_mask_prev value written by those tasks
skews the offset used by the tasks waking up on isolcpus, and most of
them end up waking on the first cpu within the isolcpus set.
Steps to reproduce:
Kernel cmdline parameters:
isolcpus=2-5 skew_tick=1 nohz=on nohz_full=2-5
rcu_nocbs=2-5 rcu_nocb_poll idle=poll irqaffinity=0-1
* pid=$(echo $$)
* taskset -pc 0 $pid
* cat loop-normal.c
int main(void)
{
while (1)
;
return 0;
}
* gcc -o loop-normal loop-normal.c
* for i in {1..50}; do ./loop-normal & done
* pids=$(ps -a | grep loop-normal | cut -d' ' -f5)
* for i in $pids; do taskset -pc 2-5 $i ; done
Expected output:
* All 50 "loop-normal" tasks should wake up on cpus 2-5,
equally distributed.
* ps -eLo cpuid,pid,tid,ppid,cls,psr,cls,cmd | grep "^ [2345]"
Actual output:
* All 50 "loop-normal" tasks woke up on cpu2 only.
Analysis:
Percpu-bound timer interrupt and rcu thread activity occurs every few
microseconds on the housekeeping cpus, exercising
find_lowest_rq() -> cpumask_any_and_distribute()/cpumask_any_distribute().
As a result, the per-cpu variable distribute_cpu_mask_prev on the
housekeeping cpus keeps getting set to a housekeeping cpu. The
bash/docker processes share the same per-cpu variable because they run
on housekeeping cpus. Thus, per the logic in the commits above, the
intersection of the clobbered distribute_cpu_mask_prev and the new
mask (isolcpus) always yields the first cpu within the new mask.
Fix the issue by picking a random cpu out of the applicable CPU set
instead of relying on distribute_cpu_mask_prev.
Fixes: 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
Fixes: 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
Signed-off-by: Ankit Jain <ankitja@vmware.com>
---
lib/cpumask.c | 40 +++++++++++++++++++++-------------------
1 file changed, 21 insertions(+), 19 deletions(-)
diff --git a/lib/cpumask.c b/lib/cpumask.c
index a7fd02b5ae26..95a7c1b40e95 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -155,45 +155,47 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
}
EXPORT_SYMBOL(cpumask_local_spread);
-static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
-
/**
* cpumask_any_and_distribute - Return an arbitrary cpu within src1p & src2p.
* @src1p: first &cpumask for intersection
* @src2p: second &cpumask for intersection
*
- * Iterated calls using the same srcp1 and srcp2 will be distributed within
- * their intersection.
+ * Iterated calls using the same srcp1 and srcp2 will be randomly distributed
+ * within their intersection.
*
* Returns >= nr_cpu_ids if the intersection is empty.
*/
unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
const struct cpumask *src2p)
{
- unsigned int next, prev;
+ unsigned int n_cpus, nth_cpu;
- /* NOTE: our first selection will skip 0. */
- prev = __this_cpu_read(distribute_cpu_mask_prev);
+ n_cpus = cpumask_weight_and(src1p, src2p);
+ if (n_cpus == 0)
+ return nr_cpu_ids;
- next = find_next_and_bit_wrap(cpumask_bits(src1p), cpumask_bits(src2p),
- nr_cpumask_bits, prev + 1);
- if (next < nr_cpu_ids)
- __this_cpu_write(distribute_cpu_mask_prev, next);
+ nth_cpu = get_random_u32_below(n_cpus);
- return next;
+ return find_nth_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
+ nr_cpumask_bits, nth_cpu);
}
EXPORT_SYMBOL(cpumask_any_and_distribute);
+/**
+ * cpumask_any_distribute - Return an arbitrary cpu from srcp.
+ * @srcp: &cpumask for selection
+ *
+ * Iterated calls using the same srcp will be randomly distributed
+ * within srcp.
+ *
+ * Returns >= nr_cpu_ids if srcp is empty.
+ */
unsigned int cpumask_any_distribute(const struct cpumask *srcp)
{
- unsigned int next, prev;
+ unsigned int n_cpus, nth_cpu;
- /* NOTE: our first selection will skip 0. */
- prev = __this_cpu_read(distribute_cpu_mask_prev);
- next = find_next_bit_wrap(cpumask_bits(srcp), nr_cpumask_bits, prev + 1);
- if (next < nr_cpu_ids)
- __this_cpu_write(distribute_cpu_mask_prev, next);
+ n_cpus = cpumask_weight(srcp);
+ if (n_cpus == 0)
+ return nr_cpu_ids;
- return next;
+ nth_cpu = get_random_u32_below(n_cpus);
+
+ return find_nth_bit(cpumask_bits(srcp), nr_cpumask_bits, nth_cpu);
}
EXPORT_SYMBOL(cpumask_any_distribute);
--
2.23.1