* [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
@ 2024-07-25 12:03 Zheng Zucheng
2024-07-25 14:05 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Zheng Zucheng @ 2024-07-25 12:03 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, oleg
Cc: linux-kernel
In extreme test scenarios:
the 14th field, utime, in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns
In cputime_adjust(), stime becomes greater than rtime due to a
mul_u64_u64_div_u64() precision problem.
Before calling mul_u64_u64_div_u64():
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
After calling mul_u64_u64_div_u64():
stime = 135989949653530
An unsigned underflow occurs because rtime is less than stime:
utime = rtime - stime = 135989749728000 - 135989949653530
= -199925530
= (u64)18446744073709518790
Trigger scenario:
1. The user task runs in kernel mode most of the time.
2. ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y &&
TICK_CPU_ACCOUNTING=y
Fix the mul_u64_u64_div_u64() precision loss by resetting stime to rtime.
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
---
kernel/sched/cputime.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index aa48b2ec879d..365c74e95537 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
}
stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
+ if (unlikely(stime > rtime))
+ stime = rtime;
update:
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-25 12:03 [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime Zheng Zucheng @ 2024-07-25 14:05 ` Peter Zijlstra 2024-07-25 14:49 ` zhengzucheng 2024-07-26 2:32 ` [PATCH v2 " Zheng Zucheng 2024-09-02 1:56 ` [Question] Include isolated cpu to ensure that tasks are not scheduled to isolated cpu? zhengzucheng 2 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2024-07-25 14:05 UTC (permalink / raw) To: Zheng Zucheng Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, oleg, linux-kernel On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote: > In extreme test scenarios: > the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime, > utime = 18446744073709518790 ns, rtime = 135989749728000 ns > > In cputime_adjust() process, stime is greater than rtime due to > mul_u64_u64_div_u64() precision problem. > before call mul_u64_u64_div_u64(), > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. > after call mul_u64_u64_div_u64(), > stime = 135989949653530 > > unsigned reversion occurs because rtime is less than stime. > utime = rtime - stime = 135989749728000 - 135989949653530 > = -199925530 > = (u64)18446744073709518790 > > Trigger scenario: > 1. User task run in kernel mode most of time. > 2. 
The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y && > TICK_CPU_ACCOUNTING=y > > Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime > > Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") > Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> > --- > kernel/sched/cputime.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index aa48b2ec879d..365c74e95537 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, > } > > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); > + if (unlikely(stime > rtime)) > + stime = rtime; But but but... for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y the code you're patching is not compiled! ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-25 14:05 ` Peter Zijlstra @ 2024-07-25 14:49 ` zhengzucheng 2024-07-25 15:14 ` Peter Zijlstra 0 siblings, 1 reply; 19+ messages in thread From: zhengzucheng @ 2024-07-25 14:49 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, oleg, linux-kernel Sorry, I made a mistake here. CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set. 在 2024/7/25 22:05, Peter Zijlstra 写道: > On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote: >> In extreme test scenarios: >> the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime, >> utime = 18446744073709518790 ns, rtime = 135989749728000 ns >> >> In cputime_adjust() process, stime is greater than rtime due to >> mul_u64_u64_div_u64() precision problem. >> before call mul_u64_u64_div_u64(), >> stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. >> after call mul_u64_u64_div_u64(), >> stime = 135989949653530 >> >> unsigned reversion occurs because rtime is less than stime. >> utime = rtime - stime = 135989749728000 - 135989949653530 >> = -199925530 >> = (u64)18446744073709518790 >> >> Trigger scenario: >> 1. User task run in kernel mode most of time. >> 2. 
The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y && >> TICK_CPU_ACCOUNTING=y >> >> Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime >> >> Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") >> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> >> --- >> kernel/sched/cputime.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c >> index aa48b2ec879d..365c74e95537 100644 >> --- a/kernel/sched/cputime.c >> +++ b/kernel/sched/cputime.c >> @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, >> } >> >> stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); >> + if (unlikely(stime > rtime)) >> + stime = rtime; > But but but... for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y the code you're > patching is not compiled! > > > . ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-25 14:49 ` zhengzucheng @ 2024-07-25 15:14 ` Peter Zijlstra 0 siblings, 0 replies; 19+ messages in thread From: Peter Zijlstra @ 2024-07-25 15:14 UTC (permalink / raw) To: zhengzucheng Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, oleg, linux-kernel On Thu, Jul 25, 2024 at 10:49:46PM +0800, zhengzucheng wrote: > Sorry, I made a mistake here. CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set. > > 在 2024/7/25 22:05, Peter Zijlstra 写道: > > On Thu, Jul 25, 2024 at 12:03:15PM +0000, Zheng Zucheng wrote: > > > In extreme test scenarios: > > > the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime, > > > utime = 18446744073709518790 ns, rtime = 135989749728000 ns > > > > > > In cputime_adjust() process, stime is greater than rtime due to > > > mul_u64_u64_div_u64() precision problem. > > > before call mul_u64_u64_div_u64(), > > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. > > > after call mul_u64_u64_div_u64(), > > > stime = 135989949653530 > > > > > > unsigned reversion occurs because rtime is less than stime. > > > utime = rtime - stime = 135989749728000 - 135989949653530 > > > = -199925530 > > > = (u64)18446744073709518790 > > > > > > Trigger scenario: > > > 1. User task run in kernel mode most of time. > > > 2. 
The ARM64 architecture && CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y && > > > TICK_CPU_ACCOUNTING=y > > > > > > Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime > > > > > > Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") > > > Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> > > > --- > > > kernel/sched/cputime.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > > > index aa48b2ec879d..365c74e95537 100644 > > > --- a/kernel/sched/cputime.c > > > +++ b/kernel/sched/cputime.c > > > @@ -582,6 +582,8 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, > > > } > > > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); > > > + if (unlikely(stime > rtime)) > > > + stime = rtime; Ooh,.. I see, this is because the generic fallback for mul_u64_u64_div_u64() is yuck :/ On x86_64 this is just two instructions and it does a native: u64*u64->u128 u128/u64->u64 And this should never happen. But in the generic case, we appoximate and urgh. So yeah, but then perhaps add a comment like: /* * Because mul_u64_u64_div_u64() can approximate on some * achitectures; enforce the constraint that: a*b/(b+c) <= a. */ if (unlikely(stime > rtime)) stime = rtime; Also, I would look into doing a native arm64 version, I'd be surprised if it could not do better than the generic variant. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH v2 -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-25 12:03 [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime Zheng Zucheng 2024-07-25 14:05 ` Peter Zijlstra @ 2024-07-26 2:32 ` Zheng Zucheng 2024-07-26 10:44 ` Oleg Nesterov 2024-07-29 10:34 ` [tip: sched/core] " tip-bot2 for Zheng Zucheng 2024-09-02 1:56 ` [Question] Include isolated cpu to ensure that tasks are not scheduled to isolated cpu? zhengzucheng 2 siblings, 2 replies; 19+ messages in thread From: Zheng Zucheng @ 2024-07-26 2:32 UTC (permalink / raw) To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, oleg Cc: linux-kernel In extreme test scenarios: the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime, utime = 18446744073709518790 ns, rtime = 135989749728000 ns In cputime_adjust() process, stime is greater than rtime due to mul_u64_u64_div_u64() precision problem. before call mul_u64_u64_div_u64(), stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. after call mul_u64_u64_div_u64(), stime = 135989949653530 unsigned reversion occurs because rtime is less than stime. utime = rtime - stime = 135989749728000 - 135989949653530 = -199925530 = (u64)18446744073709518790 Trigger condition: 1). User task run in kernel mode most of time 2). ARM64 architecture 3). 
TICK_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime v2: - Add comment - Update trigger condition Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") Cc: <stable@vger.kernel.org> Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> --- kernel/sched/cputime.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index aa48b2ec879d..4feef0d4e449 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, } stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); + /* + * Because mul_u64_u64_div_u64() can approximate on some + * achitectures; enforce the constraint that: a*b/(b+c) <= a. + */ + if (unlikely(stime > rtime)) + stime = rtime; update: /* -- 2.34.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2 -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-26 2:32 ` [PATCH v2 " Zheng Zucheng @ 2024-07-26 10:44 ` Oleg Nesterov 2024-07-26 13:04 ` Oleg Nesterov 2024-07-30 6:55 ` Uwe Kleine-König 2024-07-29 10:34 ` [tip: sched/core] " tip-bot2 for Zheng Zucheng 1 sibling, 2 replies; 19+ messages in thread From: Oleg Nesterov @ 2024-07-26 10:44 UTC (permalink / raw) To: Zheng Zucheng Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel On 07/26, Zheng Zucheng wrote: > > before call mul_u64_u64_div_u64(), > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. So stime + utime == 175138003500000 > after call mul_u64_u64_div_u64(), > stime = 135989949653530 Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000) returns 135989749728000 == rtime, see below. Nevermind... > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, > } > > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); > + /* > + * Because mul_u64_u64_div_u64() can approximate on some > + * achitectures; enforce the constraint that: a*b/(b+c) <= a. > + */ > + if (unlikely(stime > rtime)) > + stime = rtime; Thanks, Acked-by: Oleg Nesterov <oleg@redhat.com> ------------------------------------------------------------------------------- But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ? See the new() function in the code below. Say, with the numbers above I get $ ./test 175136586720000 135989749728000 175138003500000 old -> 135989749728000 e=1100089950.609375 new -> 135988649638050 e=0.609375 Oleg. 
------------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <assert.h> typedef unsigned long long u64; static inline int fls64(u64 x) { int bitpos = -1; /* * AMD64 says BSRQ won't clobber the dest reg if x==0; Intel64 says the * dest reg is undefined if x==0, but their CPU architect says its * value is written to set it to the same as before. */ asm("bsrq %1,%q0" : "+r" (bitpos) : "rm" (x)); return bitpos + 1; } static inline int ilog2(u64 n) { return fls64(n) - 1; } #define swap(a, b) \ do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0) static inline u64 div64_u64_rem(u64 dividend, u64 divisor, u64 *remainder) { *remainder = dividend % divisor; return dividend / divisor; } static inline u64 div64_u64(u64 dividend, u64 divisor) { return dividend / divisor; } //----------------------------------------------------------------------------- // current implementation of mul_u64_u64_div_u64 u64 old(u64 a, u64 b, u64 c) { u64 res = 0, div, rem; int shift; /* can a * b overflow ? */ if (ilog2(a) + ilog2(b) > 62) { /* * Note that the algorithm after the if block below might lose * some precision and the result is more exact for b > a. So * exchange a and b if a is bigger than b. * * For example with a = 43980465100800, b = 100000000, c = 1000000000 * the below calculation doesn't modify b at all because div == 0 * and then shift becomes 45 + 26 - 62 = 9 and so the result * becomes 4398035251080. However with a and b swapped the exact * result is calculated (i.e. 4398046510080). */ if (a > b) swap(a, b); /* * (b * a) / c is equal to * * (b / c) * a + * (b % c) * a / c * * if nothing overflows. Can the 1st multiplication * overflow? Yes, but we do not care: this can only * happen if the end result can't fit in u64 anyway. 
* * So the code below does * * res = (b / c) * a; * b = b % c; */ div = div64_u64_rem(b, c, &rem); res = div * a; b = rem; shift = ilog2(a) + ilog2(b) - 62; if (shift > 0) { /* drop precision */ b >>= shift; c >>= shift; if (!c) return res; } } return res + div64_u64(a * b, c); } u64 new(u64 a, u64 b, u64 c) { u64 res = 0, div, rem; /* can a * b overflow ? */ while (ilog2(a) + ilog2(b) > 62) { if (a > b) swap(b, a); if (b >= c) { /* * (b * a) / c is equal to * * (b / c) * a + * (b % c) * a / c * * if nothing overflows. Can the 1st multiplication * overflow? Yes, but we do not care: this can only * happen if the end result can't fit in u64 anyway. * * So the code below does * * res += (b / c) * a; * b = b % c; */ div = div64_u64_rem(b, c, &rem); res += div * a; b = rem; continue; } /* drop precision */ b >>= 1; c >>= 1; if (!c) return res; } return res + div64_u64(a * b, c); } int main(int argc, char **argv) { u64 a, b, c, ro, rn; double rd; assert(argc == 4); a = strtoull(argv[1], NULL, 0); b = strtoull(argv[2], NULL, 0); c = strtoull(argv[3], NULL, 0); rd = (((double)a) * b) / c; ro = old(a, b, c); rn = new(a, b, c); printf("old -> %lld\te=%f\n", ro, ro - rd); printf("new -> %lld\te=%f\n", rn, rn - rd); return 0; } ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-26 10:44 ` Oleg Nesterov @ 2024-07-26 13:04 ` Oleg Nesterov 2024-07-26 13:08 ` Peter Zijlstra 2024-07-30 6:55 ` Uwe Kleine-König 1 sibling, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2024-07-26 13:04 UTC (permalink / raw) To: Zheng Zucheng Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel On 07/26, Oleg Nesterov wrote: > > On 07/26, Zheng Zucheng wrote: > > > > before call mul_u64_u64_div_u64(), > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. > > So stime + utime == 175138003500000 > > > after call mul_u64_u64_div_u64(), > > stime = 135989949653530 > > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000) > returns 135989749728000 == rtime, see below. Seriously, can you re-check your numbers? it would be nice to understand why x86_64 differs... > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ? > See the new() function in the code below. Just in case, the usage of ilog2 can be improved, but this is minor. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-26 13:04 ` Oleg Nesterov @ 2024-07-26 13:08 ` Peter Zijlstra 2024-07-26 13:30 ` Oleg Nesterov 0 siblings, 1 reply; 19+ messages in thread From: Peter Zijlstra @ 2024-07-26 13:08 UTC (permalink / raw) To: Oleg Nesterov Cc: Zheng Zucheng, mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote: > On 07/26, Oleg Nesterov wrote: > > > > On 07/26, Zheng Zucheng wrote: > > > > > > before call mul_u64_u64_div_u64(), > > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. > > > > So stime + utime == 175138003500000 > > > > > after call mul_u64_u64_div_u64(), > > > stime = 135989949653530 > > > > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000) > > returns 135989749728000 == rtime, see below. > > Seriously, can you re-check your numbers? it would be nice to understand why > x86_64 differs... x86_64 has a custom mul_u64_u64_div_u64() implementation. > > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ? > > See the new() function in the code below. > > Just in case, the usage of ilog2 can be improved, but this is minor. I meant to go look at this, it seems to loop less than your improved version, but I'm chasing crashes atm. Perhaps it provides inspiration. https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-26 13:08 ` Peter Zijlstra @ 2024-07-26 13:30 ` Oleg Nesterov 0 siblings, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2024-07-26 13:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Zheng Zucheng, mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel On 07/26, Peter Zijlstra wrote: > > On Fri, Jul 26, 2024 at 03:04:01PM +0200, Oleg Nesterov wrote: > > On 07/26, Oleg Nesterov wrote: > > > > > > On 07/26, Zheng Zucheng wrote: > > > > > > > > before call mul_u64_u64_div_u64(), > > > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. > > > > > > So stime + utime == 175138003500000 > > > > > > > after call mul_u64_u64_div_u64(), > > > > stime = 135989949653530 > > > > > > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000) > > > returns 135989749728000 == rtime, see below. > > > > Seriously, can you re-check your numbers? it would be nice to understand why > > x86_64 differs... > > x86_64 has a custom mul_u64_u64_div_u64() implementation. Yes sure, but my user-space test-case uses the "generic" version, > > > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ? > > > See the new() function in the code below. > > > > Just in case, the usage of ilog2 can be improved, but this is minor. > > I meant to go look at this, it seems to loop less than your improved > version, but I'm chasing crashes atm. Perhaps it provides inspiration. > > https://codebrowser.dev/llvm/compiler-rt/lib/builtins/udivmodti4.c.html#__udivmodti4 Thanks... I'll try to take a look tommorrow, but at first glance I will never understand this (clever) code ;) Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v2 -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-26 10:44 ` Oleg Nesterov 2024-07-26 13:04 ` Oleg Nesterov @ 2024-07-30 6:55 ` Uwe Kleine-König 1 sibling, 0 replies; 19+ messages in thread From: Uwe Kleine-König @ 2024-07-30 6:55 UTC (permalink / raw) To: Oleg Nesterov Cc: Zheng Zucheng, mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, linux-kernel, Andrew Morton, Nicolas Pitre [-- Attachment #1: Type: text/plain, Size: 1780 bytes --] Hello, On Fri, Jul 26, 2024 at 12:44:29PM +0200, Oleg Nesterov wrote: > On 07/26, Zheng Zucheng wrote: > > > > before call mul_u64_u64_div_u64(), > > stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. > > So stime + utime == 175138003500000 > > > after call mul_u64_u64_div_u64(), > > stime = 135989949653530 > > Hmm. On x86 mul_u64_u64_div_u64(175136586720000, 135989749728000, 175138003500000) > returns 135989749728000 == rtime, see below. > > Nevermind... > > > --- a/kernel/sched/cputime.c > > +++ b/kernel/sched/cputime.c > > @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, > > } > > > > stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); > > + /* > > + * Because mul_u64_u64_div_u64() can approximate on some > > + * achitectures; enforce the constraint that: a*b/(b+c) <= a. > > + */ > > + if (unlikely(stime > rtime)) > > + stime = rtime; > > Thanks, > > Acked-by: Oleg Nesterov <oleg@redhat.com> > > ------------------------------------------------------------------------------- > But perhaps it makes sense to improve the accuracy of mul_u64_u64_div_u64() ? Note there is a patch by Nicolas Pitre currently in mm-nonmm-unstable that makes mul_u64_u64_div_u64() precise. It was in next for a while as commit 3cc8bf1a81ef ("mul_u64_u64_div_u64: make it precise always") which might explain problems to reproduce the incorrect results. 
An obvious alternative to backporting this change to kernel/sched/cputime.c for stable is to backport Nicolas's patch instead. Andrew asked me to provide a justification to send Nicolas's patch for inclusion in the current devel cycle. So it might make it in before 6.11. Best regards Uwe [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
* [tip: sched/core] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime 2024-07-26 2:32 ` [PATCH v2 " Zheng Zucheng 2024-07-26 10:44 ` Oleg Nesterov @ 2024-07-29 10:34 ` tip-bot2 for Zheng Zucheng 1 sibling, 0 replies; 19+ messages in thread From: tip-bot2 for Zheng Zucheng @ 2024-07-29 10:34 UTC (permalink / raw) To: linux-tip-commits Cc: Zheng Zucheng, Peter Zijlstra (Intel), stable, x86, linux-kernel The following commit has been merged into the sched/core branch of tip: Commit-ID: 77baa5bafcbe1b2a15ef9c37232c21279c95481c Gitweb: https://git.kernel.org/tip/77baa5bafcbe1b2a15ef9c37232c21279c95481c Author: Zheng Zucheng <zhengzucheng@huawei.com> AuthorDate: Fri, 26 Jul 2024 02:32:35 Committer: Peter Zijlstra <peterz@infradead.org> CommitterDate: Mon, 29 Jul 2024 12:22:32 +02:00 sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime In extreme test scenarios: the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime, utime = 18446744073709518790 ns, rtime = 135989749728000 ns In cputime_adjust() process, stime is greater than rtime due to mul_u64_u64_div_u64() precision problem. before call mul_u64_u64_div_u64(), stime = 175136586720000, rtime = 135989749728000, utime = 1416780000. after call mul_u64_u64_div_u64(), stime = 135989949653530 unsigned reversion occurs because rtime is less than stime. utime = rtime - stime = 135989749728000 - 135989949653530 = -199925530 = (u64)18446744073709518790 Trigger condition: 1). User task run in kernel mode most of time 2). ARM64 architecture 3). 
TICK_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com --- kernel/sched/cputime.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index a5e0029..0bed0fa 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -582,6 +582,12 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev, } stime = mul_u64_u64_div_u64(stime, rtime, stime + utime); + /* + * Because mul_u64_u64_div_u64() can approximate on some + * achitectures; enforce the constraint that: a*b/(b+c) <= a. + */ + if (unlikely(stime > rtime)) + stime = rtime; update: /* ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [Question] Include isolated cpu to ensure that tasks are not scheduled to isolated cpu? 2024-07-25 12:03 [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime Zheng Zucheng 2024-07-25 14:05 ` Peter Zijlstra 2024-07-26 2:32 ` [PATCH v2 " Zheng Zucheng @ 2024-09-02 1:56 ` zhengzucheng 2024-09-02 3:00 ` Waiman Long 2 siblings, 1 reply; 19+ messages in thread From: zhengzucheng @ 2024-09-02 1:56 UTC (permalink / raw) To: peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, bristot, vschneid, oleg, Frederic Weisbecker, mingo, peterx, tj, longman Cc: linux-kernel In a cpuset subsystem, cpuset.cpus contains isolated cpu and non-isolated cpu. Is there any way to ensure that the task runs only on the non-isolated cpus? eg: isolcpus=1, cpusete.cpus=0-7. It is found that some tasks are scheduled to cpu1. In addition, task run on isolated cpu cann't be scheduled to other cpu in the future. Thanks! ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] Include isolated cpu to ensure that tasks are not scheduled to isolated cpu? 2024-09-02 1:56 ` [Question] Include isolated cpu to ensure that tasks are not scheduled to isolated cpu? zhengzucheng @ 2024-09-02 3:00 ` Waiman Long 2024-09-13 4:03 ` [Question] sched:the load is unbalanced in the VM overcommitment scenario zhengzucheng 0 siblings, 1 reply; 19+ messages in thread From: Waiman Long @ 2024-09-02 3:00 UTC (permalink / raw) To: zhengzucheng, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, oleg, Frederic Weisbecker, mingo, peterx, tj Cc: linux-kernel On 9/1/24 21:56, zhengzucheng wrote: > In a cpuset subsystem, cpuset.cpus contains isolated cpu and > non-isolated cpu. > Is there any way to ensure that the task runs only on the non-isolated > cpus? > eg: > isolcpus=1, cpusete.cpus=0-7. It is found that some tasks are > scheduled to cpu1. > > In addition, task run on isolated cpu cann't be scheduled to other cpu > in the future. The best way is to avoid mixing isolated and scheduling CPUs in the same cpuset especially if you are using cgroup v1. If you are using cgroup v2, one way to avoid the use of isolated CPUs is to put all of them into an isolated partition. This will ensure that those isolated CPUs won't be used even if they are put into the cpuset.cpus of other cpusets accidentally Cheers, Longman ^ permalink raw reply [flat|nested] 19+ messages in thread
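Waiman's cgroup v2 suggestion can be sketched as a configuration fragment (paths and CPU numbers are illustrative; cpuset.cpus.partition requires a cgroup v2 mount and an exclusive cpuset):

```shell
# Put CPU 1 into a dedicated isolated partition so the scheduler
# never uses it for tasks in other cpusets, even if it accidentally
# appears in their cpuset.cpus.
mkdir /sys/fs/cgroup/isolated
echo 1 > /sys/fs/cgroup/isolated/cpuset.cpus
echo isolated > /sys/fs/cgroup/isolated/cpuset.cpus.partition

# Verify the partition took effect ("isolated", not "isolated invalid")
cat /sys/fs/cgroup/isolated/cpuset.cpus.partition
```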
* [Question] sched:the load is unbalanced in the VM overcommitment scenario 2024-09-02 3:00 ` Waiman Long @ 2024-09-13 4:03 ` zhengzucheng 2024-09-13 15:55 ` Vincent Guittot 2024-09-13 17:17 ` Waiman Long 0 siblings, 2 replies; 19+ messages in thread From: zhengzucheng @ 2024-09-13 4:03 UTC (permalink / raw) To: Waiman Long, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, oleg, Frederic Weisbecker, mingo, peterx, tj, tjcao980311, vincent.guittot, zhengzucheng Cc: linux-kernel In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8 CPUs are overcommitted to 2 x 8u VMs, and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs resources, the other VM has 6 CPUs. The host is configured with 80 CPUs in a sched domain and other CPUs are in the idle state. The root cause is that the load of the host is unbalanced, some vCPUs exclusively occupy CPU resources. when the CPU that triggers load balance calculates imbalance value, env->imbalance = 0 is calculated because of local->avg_load > sds->avg_load. As a result, the load balance fails. The processing logic: https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70 It's normal from kernel load balance, but it's not reasonable from the perspective of VM users. In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule domain to fix it. Is there any other method to fix this problem? thanks. 
Abstracted reproduction case: 1.environment information: [root@localhost ~]# cat /proc/schedstat cpu0 domain0 00000000,00000000,00010000,00000000,00000001 domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff cpu1 domain0 00000000,00000000,00020000,00000000,00000002 domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff cpu2 domain0 00000000,00000000,00040000,00000000,00000004 domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff cpu3 domain0 00000000,00000000,00080000,00000000,00000008 domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 2.test case: vcpu.c #include <stdio.h> #include <unistd.h> int main() { sleep(20); while (1); return 0; } gcc vcpu.c -o vcpu ----------------------------------------------------------------- test.sh #!/bin/bash #vcpu1 mkdir /sys/fs/cgroup/cpuset/vcpu_1 echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems for i in {1..8} do ./vcpu & pid=$! sleep 1 echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks done #vcpu2 mkdir /sys/fs/cgroup/cpuset/vcpu_2 echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems for i in {1..8} do ./vcpu & pid=$! 
sleep 1 echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks done ------------------------------------------------------------------ [root@localhost ~]# ./test.sh [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu) 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu 14606 root 20 0 2448 872 784 R 100.0 0.0 13:09.72 ./vcpu 14620 root 20 0 2448 916 832 R 100.0 0.0 13:07.72 ./vcpu 14622 root 20 0 2448 920 836 R 100.0 0.0 13:06.72 ./vcpu 14629 root 20 0 2448 920 832 R 100.0 0.0 13:05.72 ./vcpu 14643 root 20 0 2448 924 836 R 21.0 0.0 2:37.13 ./vcpu 14645 root 20 0 2448 868 784 R 21.0 0.0 2:36.51 ./vcpu 14589 root 20 0 2448 900 816 R 20.0 0.0 2:45.16 ./vcpu 14608 root 20 0 2448 956 872 R 20.0 0.0 2:42.24 ./vcpu 14632 root 20 0 2448 872 788 R 20.0 0.0 2:38.08 ./vcpu 14638 root 20 0 2448 924 840 R 20.0 0.0 2:37.48 ./vcpu 14652 root 20 0 2448 928 844 R 20.0 0.0 2:36.42 ./vcpu 14654 root 20 0 2448 924 840 R 20.0 0.0 2:36.14 ./vcpu 14663 root 20 0 2448 900 816 R 20.0 0.0 2:35.38 ./vcpu 14669 root 20 0 2448 868 784 R 20.0 0.0 2:35.70 ./vcpu ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
  2024-09-13  4:03 ` [Question] sched:the load is unbalanced in the VM overcommitment scenario zhengzucheng
@ 2024-09-13 15:55 ` Vincent Guittot
  2024-09-14  7:03   ` zhengzucheng
  2024-09-13 17:17 ` Waiman Long
  1 sibling, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2024-09-13 15:55 UTC (permalink / raw)
To: zhengzucheng
Cc: Waiman Long, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, oleg, Frederic Weisbecker, mingo,
	peterx, tj, tjcao980311, linux-kernel

On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
>
> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> CPUs are overcommitted to 2 x 8u VMs,
> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> resources, the other VM has 6 CPUs.
> The host is configured with 80 CPUs in a sched domain and other CPUs are
> in the idle state.
> The root cause is that the load of the host is unbalanced, some vCPUs
> exclusively occupy CPU resources.
> when the CPU that triggers load balance calculates imbalance value,
> env->imbalance = 0 is calculated because of
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
> It's normal from kernel load balance, but it's not reasonable from the
> perspective of VM users.
> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> domain to fix it.
> Is there any other method to fix this problem? thanks.

I'm not sure how to understand your setup and why the load balance is
not balancing correctly 16 vCPU between the 8 CPUs.

From your test case description below, you have 8 always running
threads in cgroup A and 8 always running threads in cgroup B and the 2
cgroups have only 8 CPUs among 80. This should not be a problem for
load balance. I tried something similar although not exactly the same
with cgroupv2 and rt-app and I don't have noticeable imbalance

Do you have more details that you can share about your system ?

Which kernel version are you using ? Which arch ?

> Abstracted reproduction case:
> 1.environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?

> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
>	sleep(20);
>	while (1);
>	return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>	./vcpu &
>	pid=$!
>	sleep 1
>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>	./vcpu &
>	pid=$!
>	sleep 1
>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> 14606 root 20 0 2448  872 784 R 100.0 0.0 13:09.72 ./vcpu
> 14620 root 20 0 2448  916 832 R 100.0 0.0 13:07.72 ./vcpu
> 14622 root 20 0 2448  920 836 R 100.0 0.0 13:06.72 ./vcpu
> 14629 root 20 0 2448  920 832 R 100.0 0.0 13:05.72 ./vcpu
> 14643 root 20 0 2448  924 836 R  21.0 0.0  2:37.13 ./vcpu
> 14645 root 20 0 2448  868 784 R  21.0 0.0  2:36.51 ./vcpu
> 14589 root 20 0 2448  900 816 R  20.0 0.0  2:45.16 ./vcpu
> 14608 root 20 0 2448  956 872 R  20.0 0.0  2:42.24 ./vcpu
> 14632 root 20 0 2448  872 788 R  20.0 0.0  2:38.08 ./vcpu
> 14638 root 20 0 2448  924 840 R  20.0 0.0  2:37.48 ./vcpu
> 14652 root 20 0 2448  928 844 R  20.0 0.0  2:36.42 ./vcpu
> 14654 root 20 0 2448  924 840 R  20.0 0.0  2:36.14 ./vcpu
> 14663 root 20 0 2448  900 816 R  20.0 0.0  2:35.38 ./vcpu
> 14669 root 20 0 2448  868 784 R  20.0 0.0  2:35.70 ./vcpu
* Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
  2024-09-13 15:55 ` Vincent Guittot
@ 2024-09-14  7:03   ` zhengzucheng
  2024-09-17  6:19     ` Vincent Guittot
  0 siblings, 1 reply; 19+ messages in thread
From: zhengzucheng @ 2024-09-14  7:03 UTC (permalink / raw)
To: Vincent Guittot
Cc: Waiman Long, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, oleg, Frederic Weisbecker, mingo,
	peterx, tj, tjcao980311, linux-kernel

On 2024/9/13 23:55, Vincent Guittot wrote:
> On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
>> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
>> CPUs are overcommitted to 2 x 8u VMs,
>> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
>> resources, the other VM has 6 CPUs.
>> The host is configured with 80 CPUs in a sched domain and other CPUs are
>> in the idle state.
>> The root cause is that the load of the host is unbalanced, some vCPUs
>> exclusively occupy CPU resources.
>> when the CPU that triggers load balance calculates imbalance value,
>> env->imbalance = 0 is calculated because of
>> local->avg_load > sds->avg_load. As a result, the load balance fails.
>> The processing logic:
>> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>>
>> It's normal from kernel load balance, but it's not reasonable from the
>> perspective of VM users.
>> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
>> domain to fix it.
>> Is there any other method to fix this problem? thanks.
> I'm not sure how to understand your setup and why the load balance is
> not balancing correctly 16 vCPU between the 8 CPUs.
>
> From your test case description below, you have 8 always running
> threads in cgroup A and 8 always running threads in cgroup B and the 2
> cgroups have only 8 CPUs among 80. This should not be a problem for
> load balance. I tried something similar although not exactly the same
> with cgroupv2 and rt-app and I don't have noticeable imbalance
>
> Do you have more details that you can share about your system ?
>
> Which kernel version are you using ? Which arch ?

kernel version: 6.11.0-rc7
arch: x86_64, with cgroup v1

>> Abstracted reproduction case:
>> 1.environment information:
>>
>> [root@localhost ~]# cat /proc/schedstat
>>
>> cpu0
>> domain0 00000000,00000000,00010000,00000000,00000001
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu1
>> domain0 00000000,00000000,00020000,00000000,00000002
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu2
>> domain0 00000000,00000000,00040000,00000000,00000004
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>> cpu3
>> domain0 00000000,00000000,00080000,00000000,00000008
>> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
>> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
> and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?

domain0 is SMT and domain1 is MC
thread_siblings_list: 0,80. 1,81. 2,82. 3,83
LLC is at domain1 level

>> 2.test case:
>>
>> vcpu.c
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> int main()
>> {
>>	sleep(20);
>>	while (1);
>>	return 0;
>> }
>>
>> gcc vcpu.c -o vcpu
>> -----------------------------------------------------------------
>> test.sh
>>
>> #!/bin/bash
>>
>> #vcpu1
>> mkdir /sys/fs/cgroup/cpuset/vcpu_1
>> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
>> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
>> for i in {1..8}
>> do
>>	./vcpu &
>>	pid=$!
>>	sleep 1
>>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
>> done
>>
>> #vcpu2
>> mkdir /sys/fs/cgroup/cpuset/vcpu_2
>> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
>> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
>> for i in {1..8}
>> do
>>	./vcpu &
>>	pid=$!
>>	sleep 1
>>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
>> done
>> ------------------------------------------------------------------
>> [root@localhost ~]# ./test.sh
>>
>> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>>
>> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
>> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
>> 14606 root 20 0 2448  872 784 R 100.0 0.0 13:09.72 ./vcpu
>> 14620 root 20 0 2448  916 832 R 100.0 0.0 13:07.72 ./vcpu
>> 14622 root 20 0 2448  920 836 R 100.0 0.0 13:06.72 ./vcpu
>> 14629 root 20 0 2448  920 832 R 100.0 0.0 13:05.72 ./vcpu
>> 14643 root 20 0 2448  924 836 R  21.0 0.0  2:37.13 ./vcpu
>> 14645 root 20 0 2448  868 784 R  21.0 0.0  2:36.51 ./vcpu
>> 14589 root 20 0 2448  900 816 R  20.0 0.0  2:45.16 ./vcpu
>> 14608 root 20 0 2448  956 872 R  20.0 0.0  2:42.24 ./vcpu
>> 14632 root 20 0 2448  872 788 R  20.0 0.0  2:38.08 ./vcpu
>> 14638 root 20 0 2448  924 840 R  20.0 0.0  2:37.48 ./vcpu
>> 14652 root 20 0 2448  928 844 R  20.0 0.0  2:36.42 ./vcpu
>> 14654 root 20 0 2448  924 840 R  20.0 0.0  2:36.14 ./vcpu
>> 14663 root 20 0 2448  900 816 R  20.0 0.0  2:35.38 ./vcpu
>> 14669 root 20 0 2448  868 784 R  20.0 0.0  2:35.70 ./vcpu
* Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
  2024-09-14  7:03 ` zhengzucheng
@ 2024-09-17  6:19   ` Vincent Guittot
  0 siblings, 0 replies; 19+ messages in thread
From: Vincent Guittot @ 2024-09-17  6:19 UTC (permalink / raw)
To: zhengzucheng
Cc: Waiman Long, peterz, juri.lelli, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, oleg, Frederic Weisbecker, mingo,
	peterx, tj, tjcao980311, linux-kernel

On Sat, 14 Sept 2024 at 09:04, zhengzucheng <zhengzucheng@huawei.com> wrote:
>
> On 2024/9/13 23:55, Vincent Guittot wrote:
> > On Fri, 13 Sept 2024 at 06:03, zhengzucheng <zhengzucheng@huawei.com> wrote:
> >> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> >> CPUs are overcommitted to 2 x 8u VMs,
> >> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> >> resources, the other VM has 6 CPUs.
> >> The host is configured with 80 CPUs in a sched domain and other CPUs are
> >> in the idle state.
> >> The root cause is that the load of the host is unbalanced, some vCPUs
> >> exclusively occupy CPU resources.
> >> when the CPU that triggers load balance calculates imbalance value,
> >> env->imbalance = 0 is calculated because of
> >> local->avg_load > sds->avg_load. As a result, the load balance fails.
> >> The processing logic:
> >> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
> >>
> >> It's normal from kernel load balance, but it's not reasonable from the
> >> perspective of VM users.
> >> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> >> domain to fix it.
> >> Is there any other method to fix this problem? thanks.
> > I'm not sure how to understand your setup and why the load balance is
> > not balancing correctly 16 vCPU between the 8 CPUs.
> >
> > From your test case description below, you have 8 always running
> > threads in cgroup A and 8 always running threads in cgroup B and the 2
> > cgroups have only 8 CPUs among 80. This should not be a problem for
> > load balance. I tried something similar although not exactly the same
> > with cgroupv2 and rt-app and I don't have noticeable imbalance
> >
> > Do you have more details that you can share about your system ?
> >
> > Which kernel version are you using ? Which arch ?
>
> kernel version: 6.11.0-rc7
> arch: x86_64, with cgroup v1

okay

> >> Abstracted reproduction case:
> >> 1.environment information:
> >>
> >> [root@localhost ~]# cat /proc/schedstat
> >>
> >> cpu0
> >> domain0 00000000,00000000,00010000,00000000,00000001
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu1
> >> domain0 00000000,00000000,00020000,00000000,00000002
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu2
> >> domain0 00000000,00000000,00040000,00000000,00000004
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> >> cpu3
> >> domain0 00000000,00000000,00080000,00000000,00000008
> >> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> >> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> > Is it correct to assume that domain0 is SMT, domain1 MC and domain2 PKG ?
> > and cpu80-83 are in the other group of PKG ? and LLC is at domain1 level ?
>
> domain0 is SMT and domain1 is MC
> thread_siblings_list: 0,80. 1,81. 2,82. 3,83

Yeah, I should have read more carefully the domain0 cpumask

> LLC is at domain1 level
>
> >> 2.test case:
> >>
> >> vcpu.c
> >> #include <stdio.h>
> >> #include <unistd.h>
> >>
> >> int main()
> >> {
> >>	sleep(20);
> >>	while (1);
> >>	return 0;
> >> }
> >>
> >> gcc vcpu.c -o vcpu
> >> -----------------------------------------------------------------
> >> test.sh
> >>
> >> #!/bin/bash
> >>
> >> #vcpu1
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> >> for i in {1..8}
> >> do
> >>	./vcpu &
> >>	pid=$!
> >>	sleep 1
> >>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> >> done
> >>
> >> #vcpu2
> >> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> >> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> >> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> >> for i in {1..8}
> >> do
> >>	./vcpu &
> >>	pid=$!
> >>	sleep 1
> >>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> >> done
> >> ------------------------------------------------------------------
> >> [root@localhost ~]# ./test.sh
> >>
> >> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
> >>
> >> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> >> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> >> 14606 root 20 0 2448  872 784 R 100.0 0.0 13:09.72 ./vcpu
> >> 14620 root 20 0 2448  916 832 R 100.0 0.0 13:07.72 ./vcpu
> >> 14622 root 20 0 2448  920 836 R 100.0 0.0 13:06.72 ./vcpu
> >> 14629 root 20 0 2448  920 832 R 100.0 0.0 13:05.72 ./vcpu
> >> 14643 root 20 0 2448  924 836 R  21.0 0.0  2:37.13 ./vcpu
> >> 14645 root 20 0 2448  868 784 R  21.0 0.0  2:36.51 ./vcpu
> >> 14589 root 20 0 2448  900 816 R  20.0 0.0  2:45.16 ./vcpu
> >> 14608 root 20 0 2448  956 872 R  20.0 0.0  2:42.24 ./vcpu
> >> 14632 root 20 0 2448  872 788 R  20.0 0.0  2:38.08 ./vcpu
> >> 14638 root 20 0 2448  924 840 R  20.0 0.0  2:37.48 ./vcpu
> >> 14652 root 20 0 2448  928 844 R  20.0 0.0  2:36.42 ./vcpu
> >> 14654 root 20 0 2448  924 840 R  20.0 0.0  2:36.14 ./vcpu
> >> 14663 root 20 0 2448  900 816 R  20.0 0.0  2:35.38 ./vcpu
> >> 14669 root 20 0 2448  868 784 R  20.0 0.0  2:35.70 ./vcpu

So I finally understood your situation. The limited cpuset screws up
the avg load of the system for domain1. The group_imbalanced state is
there to try to fix an imbalanced situation related to tasks that are
pinned to a subset of the CPUs of the sched domain, but this can't
cover all cases.
* Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
  2024-09-13  4:03 ` [Question] sched:the load is unbalanced in the VM overcommitment scenario zhengzucheng
  2024-09-13 15:55 ` Vincent Guittot
@ 2024-09-13 17:17 ` Waiman Long
  2024-09-14  2:15   ` zhengzucheng
  1 sibling, 1 reply; 19+ messages in thread
From: Waiman Long @ 2024-09-13 17:17 UTC (permalink / raw)
To: zhengzucheng, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, oleg,
	Frederic Weisbecker, mingo, peterx, tj, tjcao980311
Cc: linux-kernel

On 9/13/24 00:03, zhengzucheng wrote:
> In the VM overcommitment scenario, the overcommitment ratio is 1:2, 8
> CPUs are overcommitted to 2 x 8u VMs,
> and 16 vCPUs are bound to 8 cpu. However, one VM obtains only 2 CPUs
> resources, the other VM has 6 CPUs.
> The host is configured with 80 CPUs in a sched domain and other CPUs
> are in the idle state.
> The root cause is that the load of the host is unbalanced, some vCPUs
> exclusively occupy CPU resources.
> when the CPU that triggers load balance calculates imbalance value,
> env->imbalance = 0 is calculated because of
> local->avg_load > sds->avg_load. As a result, the load balance fails.
> The processing logic:
> https://github.com/torvalds/linux/commit/91dcf1e8068e9a8823e419a7a34ff4341275fb70
>
> It's normal from kernel load balance, but it's not reasonable from the
> perspective of VM users.
> In cgroup v1, set cpuset.sched_load_balance=0 to modify the schedule
> domain to fix it.
> Is there any other method to fix this problem? thanks.
>
> Abstracted reproduction case:
> 1.environment information:
>
> [root@localhost ~]# cat /proc/schedstat
>
> cpu0
> domain0 00000000,00000000,00010000,00000000,00000001
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu1
> domain0 00000000,00000000,00020000,00000000,00000002
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu2
> domain0 00000000,00000000,00040000,00000000,00000004
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
> cpu3
> domain0 00000000,00000000,00080000,00000000,00000008
> domain1 00000000,00ffffff,ffff0000,000000ff,ffffffff
> domain2 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
>
> 2.test case:
>
> vcpu.c
> #include <stdio.h>
> #include <unistd.h>
>
> int main()
> {
>	sleep(20);
>	while (1);
>	return 0;
> }
>
> gcc vcpu.c -o vcpu
> -----------------------------------------------------------------
> test.sh
>
> #!/bin/bash
>
> #vcpu1
> mkdir /sys/fs/cgroup/cpuset/vcpu_1
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_1/cpuset.mems
> for i in {1..8}
> do
>	./vcpu &
>	pid=$!
>	sleep 1
>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_1/tasks
> done
>
> #vcpu2
> mkdir /sys/fs/cgroup/cpuset/vcpu_2
> echo '0-3, 80-83' > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.cpus
> echo 0 > /sys/fs/cgroup/cpuset/vcpu_2/cpuset.mems
> for i in {1..8}
> do
>	./vcpu &
>	pid=$!
>	sleep 1
>	echo $pid > /sys/fs/cgroup/cpuset/vcpu_2/tasks
> done
> ------------------------------------------------------------------
> [root@localhost ~]# ./test.sh
>
> [root@localhost ~]# top -d 1 -c -p $(pgrep -d',' -f vcpu)
>
> 14591 root 20 0 2448 1012 928 R 100.0 0.0 13:10.73 ./vcpu
> 14582 root 20 0 2448 1012 928 R 100.0 0.0 13:12.71 ./vcpu
> 14606 root 20 0 2448  872 784 R 100.0 0.0 13:09.72 ./vcpu
> 14620 root 20 0 2448  916 832 R 100.0 0.0 13:07.72 ./vcpu
> 14622 root 20 0 2448  920 836 R 100.0 0.0 13:06.72 ./vcpu
> 14629 root 20 0 2448  920 832 R 100.0 0.0 13:05.72 ./vcpu
> 14643 root 20 0 2448  924 836 R  21.0 0.0  2:37.13 ./vcpu
> 14645 root 20 0 2448  868 784 R  21.0 0.0  2:36.51 ./vcpu
> 14589 root 20 0 2448  900 816 R  20.0 0.0  2:45.16 ./vcpu
> 14608 root 20 0 2448  956 872 R  20.0 0.0  2:42.24 ./vcpu
> 14632 root 20 0 2448  872 788 R  20.0 0.0  2:38.08 ./vcpu
> 14638 root 20 0 2448  924 840 R  20.0 0.0  2:37.48 ./vcpu
> 14652 root 20 0 2448  928 844 R  20.0 0.0  2:36.42 ./vcpu
> 14654 root 20 0 2448  924 840 R  20.0 0.0  2:36.14 ./vcpu
> 14663 root 20 0 2448  900 816 R  20.0 0.0  2:35.38 ./vcpu
> 14669 root 20 0 2448  868 784 R  20.0 0.0  2:35.70 ./vcpu

Your script creates two cpusets with the same set of CPUs. The
scheduling aspect of the tasks, however, is not controlled by cpuset;
it is controlled by the cpu cgroup. I suppose that all these tasks are
in the same cpu cgroup. It is possible that the commit you mentioned
might have caused some unfairness in allocating CPU time to different
processes within the same cpu cgroup. Maybe you can try to put them
into separate cpu cgroups with equal weight to see if that can improve
the scheduling fairness?

BTW, you don't actually need to use 2 different cpusets if they all get
the same set of CPUs and memory nodes. Also, setting
cpuset.sched_load_balance=0 may not actually get what you want unless
all the cpusets that use those CPUs have cpuset.sched_load_balance set
to 0, including the root cgroup. Turning off this flag may disable load
balancing, but it may not guarantee fairness, depending on which CPUs
are being used by those tasks when they start, unless you explicitly
assign the CPUs to them when starting these tasks.

Cheers,
Longman
* Re: [Question] sched:the load is unbalanced in the VM overcommitment scenario
  2024-09-13 17:17 ` Waiman Long
@ 2024-09-14  2:15   ` zhengzucheng
  0 siblings, 0 replies; 19+ messages in thread
From: zhengzucheng @ 2024-09-14  2:15 UTC (permalink / raw)
To: Waiman Long, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, oleg,
	Frederic Weisbecker, mingo, peterx, tj, tjcao980311
Cc: linux-kernel

On 2024/9/14 1:17, Waiman Long wrote:
> you don't actually need to use 2 different cpusets if they all get the
> same set of CPUs and memory nodes

Yes, you're right. The purpose of setting up two different cpusets is to
simulate the scenario of two VMs; each cpuset is a VM.
For example, the VM configuration is as follows:

<domain type='kvm' id='12676'>
  <name>master</name>
  <vcpu placement='static' cpuset='0-3,80-83'>8</vcpu>
  <iothreads>1</iothreads>
  <iothreadids>
    <iothread id='1'/>
  </iothreadids>
  <cputune>
    <vcpupin vcpu='0' cpuset='0-3,80-83'/>
    <vcpupin vcpu='1' cpuset='0-3,80-83'/>
    <vcpupin vcpu='2' cpuset='0-3,80-83'/>
    <vcpupin vcpu='3' cpuset='0-3,80-83'/>
    <vcpupin vcpu='4' cpuset='0-3,80-83'/>
    <vcpupin vcpu='5' cpuset='0-3,80-83'/>
    <vcpupin vcpu='6' cpuset='0-3,80-83'/>
    <vcpupin vcpu='7' cpuset='0-3,80-83'/>
    <emulatorpin cpuset='0-79'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>
end of thread, other threads: [~2024-09-17  6:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-25 12:03 [PATCH -next] sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime Zheng Zucheng
2024-07-25 14:05 ` Peter Zijlstra
2024-07-25 14:49   ` zhengzucheng
2024-07-25 15:14     ` Peter Zijlstra
2024-07-26  2:32 ` [PATCH v2 " Zheng Zucheng
2024-07-26 10:44   ` Oleg Nesterov
2024-07-26 13:04     ` Oleg Nesterov
2024-07-26 13:08       ` Peter Zijlstra
2024-07-26 13:30         ` Oleg Nesterov
2024-07-30  6:55       ` Uwe Kleine-König
2024-07-29 10:34   ` [tip: sched/core] " tip-bot2 for Zheng Zucheng
2024-09-02  1:56 ` [Question] Include isolated cpu to ensure that tasks are not scheduled to isolated cpu? zhengzucheng
2024-09-02  3:00   ` Waiman Long
2024-09-13  4:03 ` [Question] sched:the load is unbalanced in the VM overcommitment scenario zhengzucheng
2024-09-13 15:55   ` Vincent Guittot
2024-09-14  7:03     ` zhengzucheng
2024-09-17  6:19       ` Vincent Guittot
2024-09-13 17:17   ` Waiman Long
2024-09-14  2:15     ` zhengzucheng