* [PATCH v2 0/2] minor cpu bandwidth control fix
@ 2024-07-23 12:20 Chuyi Zhou
  2024-07-23 12:20 ` [PATCH v2 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction Chuyi Zhou
  2024-07-23 12:20 ` [PATCH v2 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
  0 siblings, 2 replies; 7+ messages in thread
From: Chuyi Zhou @ 2024-07-23 12:20 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon, Chuyi Zhou

Hello,

This patchset fixes two minor issues in cpu bandwidth control.

Patch #1 fixes the inaccurate __cfs_bandwidth_used accounting.

Patch #2 reduces unnecessary overhead in tg_set_cfs_bandwidth()
observed in our production environment.

Please see the individual patches for more details; comments are always
welcome.

---
Changes in v2:

patch#1:
- guard(cpus_read_lock) before cfs_bandwidth_usage_dec() in
  destroy_cfs_bandwidth(). (Benjamin)
- do cfs_bandwidth_usage_dec() after the __cfsb_csd_unthrottle loop.
  (Benjamin)
- move the call to destroy_cfs_bandwidth() to cpu_cgroup_css_free().
  (Benjamin)

patch#2:
- move the check under cfs_constraints_mutex. (Chengming and Benjamin)

Link to v1: https://lore.kernel.org/lkml/20240721125208.5348-1-zhouchuyi@bytedance.com/

Chuyi Zhou (2):
  sched/fair: Decrease cfs bandwidth usage in task_group destruction
  sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth

 kernel/sched/core.c  | 5 +++++
 kernel/sched/fair.c  | 13 +++++++------
 kernel/sched/sched.h | 2 ++
 3 files changed, 14 insertions(+), 6 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 7+ messages in thread
* [PATCH v2 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-23 12:20 [PATCH v2 0/2] minor cpu bandwidth control fix Chuyi Zhou
@ 2024-07-23 12:20 ` Chuyi Zhou
  2024-07-24  1:26   ` Benjamin Segall
  2024-07-24  2:29   ` Chengming Zhou
  2024-07-23 12:20 ` [PATCH v2 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
  1 sibling, 2 replies; 7+ messages in thread
From: Chuyi Zhou @ 2024-07-23 12:20 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon, Chuyi Zhou

The static key __cfs_bandwidth_used indicates whether bandwidth control
is enabled in the system. Currently, it is only decreased when a task
group disables bandwidth control. This is incorrect: if a task group
enabled bandwidth control in the past, __cfs_bandwidth_used will never
go back to zero, even if no task_group is using bandwidth control now.

This patch fixes the issue by decreasing the bandwidth usage count in
destroy_cfs_bandwidth(). cfs_bandwidth_usage_dec() calls
static_key_slow_dec_cpuslocked(), which needs to hold the hotplug lock,
but cfs bandwidth destruction may run in an RCU callback. Move the call
to destroy_cfs_bandwidth() from unregister_fair_sched_group() to
cpu_cgroup_css_free(), which runs in process context.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/fair.c  | 13 +++++++------
 kernel/sched/sched.h | 2 ++
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d35c48239be..7720d34bd71b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8816,6 +8816,8 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
+	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
+
 	/*
 	 * Relies on the RCU grace period between css_released() and this.
 	 */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3cdd86ab2e..c56b6d5b8ed7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5615,7 +5615,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime_snap = cfs_b->runtime;
 }
 
-static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 {
 	return &tg->cfs_bandwidth;
 }
@@ -6438,7 +6438,7 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
-static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
 	int __maybe_unused i;
 
@@ -6472,6 +6472,9 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 		local_irq_restore(flags);
 	}
 #endif
+	guard(cpus_read_lock)();
+	if (cfs_b->quota != RUNTIME_INF)
+		cfs_bandwidth_usage_dec();
 }
 
 /*
@@ -6614,11 +6617,11 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *paren
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 #endif
 
-static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 {
 	return NULL;
 }
-static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
+void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
 static inline void update_runtime_enabled(struct rq *rq) {}
 static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
 #ifdef CONFIG_CGROUP_SCHED
@@ -12992,8 +12995,6 @@ void unregister_fair_sched_group(struct task_group *tg)
 	struct rq *rq;
 	int cpu;
 
-	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
-
 	for_each_possible_cpu(cpu) {
 		if (tg->se[cpu])
 			remove_entity_load_avg(tg->se[cpu]);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a071022bdec..d251842867ce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2938,6 +2938,8 @@ extern void init_dl_rq(struct dl_rq *dl_rq);
 extern void cfs_bandwidth_usage_inc(void);
 extern void cfs_bandwidth_usage_dec(void);
 
+extern struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg);
+extern void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 #ifdef CONFIG_NO_HZ_COMMON
 
 #define NOHZ_BALANCE_KICK_BIT	0
-- 
2.20.1
* Re: [PATCH v2 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-23 12:20 ` [PATCH v2 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction Chuyi Zhou
@ 2024-07-24  1:26   ` Benjamin Segall
  2024-07-24  2:29   ` Chengming Zhou
  1 sibling, 0 replies; 7+ messages in thread
From: Benjamin Segall @ 2024-07-24  1:26 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, vschneid, chengming.zhou, linux-kernel, joshdon

Chuyi Zhou <zhouchuyi@bytedance.com> writes:

> The static key __cfs_bandwidth_used indicates whether bandwidth control
> is enabled in the system. Currently, it is only decreased when a task
> group disables bandwidth control. This is incorrect: if a task group
> enabled bandwidth control in the past, __cfs_bandwidth_used will never
> go back to zero, even if no task_group is using bandwidth control now.
>
> This patch fixes the issue by decreasing the bandwidth usage count in
> destroy_cfs_bandwidth(). cfs_bandwidth_usage_dec() calls
> static_key_slow_dec_cpuslocked(), which needs to hold the hotplug lock,
> but cfs bandwidth destruction may run in an RCU callback. Move the call
> to destroy_cfs_bandwidth() from unregister_fair_sched_group() to
> cpu_cgroup_css_free(), which runs in process context.
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>

Reviewed-by: Ben Segall <bsegall@google.com>

> ---
>  kernel/sched/core.c  | 2 ++
>  kernel/sched/fair.c  | 13 +++++++------
>  kernel/sched/sched.h | 2 ++
>  3 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d35c48239be..7720d34bd71b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -12992,8 +12995,6 @@ void unregister_fair_sched_group(struct task_group *tg)
>  	struct rq *rq;
>  	int cpu;
>  
> -	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> -
>  	for_each_possible_cpu(cpu) {
>  		if (tg->se[cpu])
>  			remove_entity_load_avg(tg->se[cpu]);

There is a slightly subtle point here that autogroup cannot have a quota
set. If there's some shenanigans way that that's possible, then it would
need a destroy as well. autogroup is already making assumptions anyways
though.
* Re: [PATCH v2 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-23 12:20 ` [PATCH v2 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction Chuyi Zhou
  2024-07-24  1:26   ` Benjamin Segall
@ 2024-07-24  2:29   ` Chengming Zhou
  1 sibling, 0 replies; 7+ messages in thread
From: Chengming Zhou @ 2024-07-24  2:29 UTC (permalink / raw)
  To: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel, joshdon

On 2024/7/23 20:20, Chuyi Zhou wrote:
> The static key __cfs_bandwidth_used indicates whether bandwidth control
> is enabled in the system. Currently, it is only decreased when a task
> group disables bandwidth control. This is incorrect: if a task group
> enabled bandwidth control in the past, __cfs_bandwidth_used will never
> go back to zero, even if no task_group is using bandwidth control now.
>
> This patch fixes the issue by decreasing the bandwidth usage count in
> destroy_cfs_bandwidth(). cfs_bandwidth_usage_dec() calls
> static_key_slow_dec_cpuslocked(), which needs to hold the hotplug lock,
> but cfs bandwidth destruction may run in an RCU callback. Move the call
> to destroy_cfs_bandwidth() from unregister_fair_sched_group() to
> cpu_cgroup_css_free(), which runs in process context.
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>

Yeah, autogroup can't have bandwidth set, so it's ok to just destroy
bandwidth in .css_free().

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Just some nits below:

> ---
>  kernel/sched/core.c  | 2 ++
>  kernel/sched/fair.c  | 13 +++++++------
>  kernel/sched/sched.h | 2 ++
>  3 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d35c48239be..7720d34bd71b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8816,6 +8816,8 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct task_group *tg = css_tg(css);
>  
> +	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));

Instead of exporting this tg_cfs_bandwidth(), how about just changing
the parameter of init_cfs_bandwidth()/destroy_cfs_bandwidth() to tg?
That may be clearer, but this is your call.

Thanks.

> +
>  	/*
>  	 * Relies on the RCU grace period between css_released() and this.
>  	 */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da3cdd86ab2e..c56b6d5b8ed7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5615,7 +5615,7 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  	cfs_b->runtime_snap = cfs_b->runtime;
>  }
>  
> -static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>  {
>  	return &tg->cfs_bandwidth;
>  }
> @@ -6438,7 +6438,7 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>  
> -static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
>  	int __maybe_unused i;
>  
> @@ -6472,6 +6472,9 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  		local_irq_restore(flags);
>  	}
>  #endif
> +	guard(cpus_read_lock)();
> +	if (cfs_b->quota != RUNTIME_INF)
> +		cfs_bandwidth_usage_dec();
>  }
>  
>  /*
> @@ -6614,11 +6617,11 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *paren
>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>  #endif
>  
> -static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> +struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>  {
>  	return NULL;
>  }
> -static inline void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
> +void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
>  static inline void update_runtime_enabled(struct rq *rq) {}
>  static inline void unthrottle_offline_cfs_rqs(struct rq *rq) {}
>  #ifdef CONFIG_CGROUP_SCHED
> @@ -12992,8 +12995,6 @@ void unregister_fair_sched_group(struct task_group *tg)
>  	struct rq *rq;
>  	int cpu;
>  
> -	destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
> -
>  	for_each_possible_cpu(cpu) {
>  		if (tg->se[cpu])
>  			remove_entity_load_avg(tg->se[cpu]);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 8a071022bdec..d251842867ce 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2938,6 +2938,8 @@ extern void init_dl_rq(struct dl_rq *dl_rq);
>  extern void cfs_bandwidth_usage_inc(void);
>  extern void cfs_bandwidth_usage_dec(void);
>  
> +extern struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg);
> +extern void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>  #ifdef CONFIG_NO_HZ_COMMON
>  
>  #define NOHZ_BALANCE_KICK_BIT	0
* [PATCH v2 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth
  2024-07-23 12:20 [PATCH v2 0/2] minor cpu bandwidth control fix Chuyi Zhou
  2024-07-23 12:20 ` [PATCH v2 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction Chuyi Zhou
@ 2024-07-23 12:20 ` Chuyi Zhou
  2024-07-24  1:27   ` Benjamin Segall
  2024-07-24  2:31   ` Chengming Zhou
  1 sibling, 2 replies; 7+ messages in thread
From: Chuyi Zhou @ 2024-07-23 12:20 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon, Chuyi Zhou

In our kubernetes production environment, we have observed a high
frequency of writes to cpu.max, approximately every 2~4 seconds for each
cgroup, with the same value being written each time. This can result in
unnecessary overhead, especially on machines with a large number of CPUs
and cgroups.

This happens because kubelet and runc attempt to persist resource
configurations through frequent updates with the same value. While
kubelet and runc could be optimized to avoid this overhead (e.g. check
the current value of cpu request/limit before writing to cpu.max), it is
still worthwhile to bail out of tg_set_cfs_bandwidth() when we attempt
to update with the same values.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7720d34bd71b..0cc564f45511 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9090,6 +9090,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 	guard(cpus_read_lock)();
 	guard(mutex)(&cfs_constraints_mutex);
 
+	if (cfs_b->period == ns_to_ktime(period) && cfs_b->quota == quota && cfs_b->burst == burst)
+		return 0;
+
 	ret = __cfs_schedulable(tg, period, quota);
 	if (ret)
 		return ret;
-- 
2.20.1
* Re: [PATCH v2 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth
  2024-07-23 12:20 ` [PATCH v2 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
@ 2024-07-24  1:27   ` Benjamin Segall
  2024-07-24  2:31   ` Chengming Zhou
  1 sibling, 0 replies; 7+ messages in thread
From: Benjamin Segall @ 2024-07-24  1:27 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, vschneid, chengming.zhou, linux-kernel, joshdon

Chuyi Zhou <zhouchuyi@bytedance.com> writes:

> In our kubernetes production environment, we have observed a high
> frequency of writes to cpu.max, approximately every 2~4 seconds for each
> cgroup, with the same value being written each time. This can result in
> unnecessary overhead, especially on machines with a large number of CPUs
> and cgroups.
>
> This happens because kubelet and runc attempt to persist resource
> configurations through frequent updates with the same value. While
> kubelet and runc could be optimized to avoid this overhead (e.g. check
> the current value of cpu request/limit before writing to cpu.max), it is
> still worthwhile to bail out of tg_set_cfs_bandwidth() when we attempt
> to update with the same values.
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>

Reviewed-by: Ben Segall <bsegall@google.com>

> ---
>  kernel/sched/core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7720d34bd71b..0cc564f45511 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9090,6 +9090,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  	guard(cpus_read_lock)();
>  	guard(mutex)(&cfs_constraints_mutex);
>  
> +	if (cfs_b->period == ns_to_ktime(period) && cfs_b->quota == quota && cfs_b->burst == burst)
> +		return 0;
> +
>  	ret = __cfs_schedulable(tg, period, quota);
>  	if (ret)
>  		return ret;
* Re: [PATCH v2 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth
  2024-07-23 12:20 ` [PATCH v2 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
  2024-07-24  1:27   ` Benjamin Segall
@ 2024-07-24  2:31   ` Chengming Zhou
  1 sibling, 0 replies; 7+ messages in thread
From: Chengming Zhou @ 2024-07-24  2:31 UTC (permalink / raw)
  To: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel, joshdon

On 2024/7/23 20:20, Chuyi Zhou wrote:
> In our kubernetes production environment, we have observed a high
> frequency of writes to cpu.max, approximately every 2~4 seconds for each
> cgroup, with the same value being written each time. This can result in
> unnecessary overhead, especially on machines with a large number of CPUs
> and cgroups.
>
> This happens because kubelet and runc attempt to persist resource
> configurations through frequent updates with the same value. While
> kubelet and runc could be optimized to avoid this overhead (e.g. check
> the current value of cpu request/limit before writing to cpu.max), it is
> still worthwhile to bail out of tg_set_cfs_bandwidth() when we attempt
> to update with the same values.
>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  kernel/sched/core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7720d34bd71b..0cc564f45511 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9090,6 +9090,9 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  	guard(cpus_read_lock)();
>  	guard(mutex)(&cfs_constraints_mutex);
>  
> +	if (cfs_b->period == ns_to_ktime(period) && cfs_b->quota == quota && cfs_b->burst == burst)
> +		return 0;
> +

Should we break this into multiple lines?

Feel free to add:

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

>  	ret = __cfs_schedulable(tg, period, quota);
>  	if (ret)
>  		return ret;