public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] minor cpu bandwidth control fix
@ 2024-07-21 12:52 Chuyi Zhou
  2024-07-21 12:52 ` [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction Chuyi Zhou
  2024-07-21 12:52 ` [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
  0 siblings, 2 replies; 12+ messages in thread
From: Chuyi Zhou @ 2024-07-21 12:52 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon, Chuyi Zhou

Hello,

This patchset tries to fix minor issues in cpu bandwidth control.

patch#1 tries to fix the inaccurate __cfs_bandwidth_used.
patch#2 tries to reduce the unnecessary overhead in
tg_set_cfs_bandwidth() observed in our production environment.

Please see individual patches for more details, and comments are always
welcome.

Chuyi Zhou (2):
  sched/fair: Decrease cfs bandwidth usage in task_group destruction
  sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth

 kernel/sched/core.c | 2 ++
 kernel/sched/fair.c | 3 +++
 2 files changed, 5 insertions(+)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-21 12:52 [PATCH 0/2] minor cpu bandwidth control fix Chuyi Zhou
@ 2024-07-21 12:52 ` Chuyi Zhou
  2024-07-22  3:47   ` Zhang Qiao
  2024-07-21 12:52 ` [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
  1 sibling, 1 reply; 12+ messages in thread
From: Chuyi Zhou @ 2024-07-21 12:52 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon, Chuyi Zhou

The static key __cfs_bandwidth_used is used to indicate whether bandwidth
control is enabled in the system. Currently, it is only decreased when a
task group disables bandwidth control. This is incorrect because if there
was a task group in the past that enabled bandwidth control, the
__cfs_bandwidth_used will never go to zero, even if there are no task_group
using bandwidth control now.

This patch tries to fix this issue by decreasing bandwidth usage in
destroy_cfs_bandwidth().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1e07ce90284..7ad50dc31a93 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6447,6 +6447,9 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	hrtimer_cancel(&cfs_b->period_timer);
 	hrtimer_cancel(&cfs_b->slack_timer);
 
+	if (cfs_b->quota != RUNTIME_INF)
+		cfs_bandwidth_usage_dec();
+
 	/*
 	 * It is possible that we still have some cfs_rq's pending on a CSD
 	 * list, though this race is very rare. In order for this to occur, we
-- 
2.20.1



* [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth
  2024-07-21 12:52 [PATCH 0/2] minor cpu bandwidth control fix Chuyi Zhou
  2024-07-21 12:52 ` [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction Chuyi Zhou
@ 2024-07-21 12:52 ` Chuyi Zhou
  2024-07-22  7:21   ` Chengming Zhou
  2024-07-22 22:24   ` Benjamin Segall
  1 sibling, 2 replies; 12+ messages in thread
From: Chuyi Zhou @ 2024-07-21 12:52 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon, Chuyi Zhou

In the kubernetes production environment, we have observed a high
frequency of writes to cpu.max, approximately every 2~4 seconds for each
cgroup, with the same value being written each time. This can result in
unnecessary overhead, especially on machines with a large number of CPUs
and cgroups.

This is because kubelet and runc attempt to persist resource
configurations through frequent updates with the same value in this manner.
While optimizations can be made to kubelet and runc to avoid such
overhead (e.g. checking the current value of the cpu request/limit before
writing to cpu.max), it is still worthwhile to bail out of
tg_set_cfs_bandwidth() if we attempt to update with the same value.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d35c48239be..4db3ef2a703b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9081,6 +9081,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
 				     burst + quota > max_cfs_runtime))
 		return -EINVAL;
 
+	if (cfs_b->period == ns_to_ktime(period) && cfs_b->quota == quota && cfs_b->burst == burst)
+		return 0;
 	/*
 	 * Prevent race between setting of cfs_rq->runtime_enabled and
 	 * unthrottle_offline_cfs_rqs().
-- 
2.20.1



* Re: [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-21 12:52 ` [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction Chuyi Zhou
@ 2024-07-22  3:47   ` Zhang Qiao
  2024-07-22  6:04     ` Chuyi Zhou
  0 siblings, 1 reply; 12+ messages in thread
From: Zhang Qiao @ 2024-07-22  3:47 UTC (permalink / raw)
  To: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon



Hi, Chuyi

在 2024/7/21 20:52, Chuyi Zhou 写道:
> The static key __cfs_bandwidth_used is used to indicate whether bandwidth
> control is enabled in the system. Currently, it is only decreased when a
> task group disables bandwidth control. This is incorrect because if there
> was a task group in the past that enabled bandwidth control, the
> __cfs_bandwidth_used will never go to zero, even if there are no task_group
> using bandwidth control now.
> 
> This patch tries to fix this issue by decreasing bandwidth usage in
> destroy_cfs_bandwidth().
> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  kernel/sched/fair.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b1e07ce90284..7ad50dc31a93 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6447,6 +6447,9 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  	hrtimer_cancel(&cfs_b->period_timer);
>  	hrtimer_cancel(&cfs_b->slack_timer);
>  
> +	if (cfs_b->quota != RUNTIME_INF)
> +		cfs_bandwidth_usage_dec();

This calls static_key_slow_dec_cpuslocked, but destroy_cfs_bandwidth
isn't holding the hotplug lock [1].

To fix this issue, I also sent a patch, but it was not merged into mainline [2].

[1]: https://lore.kernel.org/all/20210712162655.w3j6uczwbfkzazvt@oracle.com/
[2]: https://lore.kernel.org/all/20210910094139.184582-1-zhangqiao22@huawei.com/

Thanks,

-- 
Qiao Zhang.

> +
>  	/*
>  	 * It is possible that we still have some cfs_rq's pending on a CSD
>  	 * list, though this race is very rare. In order for this to occur, we


* Re: [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-22  3:47   ` Zhang Qiao
@ 2024-07-22  6:04     ` Chuyi Zhou
  2024-07-22  7:16       ` Zhang Qiao
  0 siblings, 1 reply; 12+ messages in thread
From: Chuyi Zhou @ 2024-07-22  6:04 UTC (permalink / raw)
  To: Zhang Qiao, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon

Hello

在 2024/7/22 11:47, Zhang Qiao 写道:
> 
> 
> Hi, Chuyi
> 
> 在 2024/7/21 20:52, Chuyi Zhou 写道:
>> The static key __cfs_bandwidth_used is used to indicate whether bandwidth
>> control is enabled in the system. Currently, it is only decreased when a
>> task group disables bandwidth control. This is incorrect because if there
>> was a task group in the past that enabled bandwidth control, the
>> __cfs_bandwidth_used will never go to zero, even if there are no task_group
>> using bandwidth control now.
>>
>> This patch tries to fix this issue by decreasing bandwidth usage in
>> destroy_cfs_bandwidth().
>>
>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> ---
>>   kernel/sched/fair.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b1e07ce90284..7ad50dc31a93 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6447,6 +6447,9 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>   	hrtimer_cancel(&cfs_b->period_timer);
>>   	hrtimer_cancel(&cfs_b->slack_timer);
>>   
>> +	if (cfs_b->quota != RUNTIME_INF)
>> +		cfs_bandwidth_usage_dec();
> 
> This calls static_key_slow_dec_cpuslocked, but destroy_cfs_bandwidth
> isn't holding the hotplug lock [1].
> 
> To fix this issue, I also sent a patch, but it was not merged into mainline [2].
> 
> [1]: https://lore.kernel.org/all/20210712162655.w3j6uczwbfkzazvt@oracle.com/
> [2]: https://lore.kernel.org/all/20210910094139.184582-1-zhangqiao22@huawei.com/
> 

Thanks for your information.

I think maybe cfs_bandwidth_usage_dec() should be moved to other more 
suitable places where could hold hotplug lock(e.g. 
cpu_cgroup_css_released()). I would do some test to verify it.

Thanks.






* Re: [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-22  6:04     ` Chuyi Zhou
@ 2024-07-22  7:16       ` Zhang Qiao
  2024-07-22  7:46         ` Chuyi Zhou
  0 siblings, 1 reply; 12+ messages in thread
From: Zhang Qiao @ 2024-07-22  7:16 UTC (permalink / raw)
  To: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon

hi

在 2024/7/22 14:04, Chuyi Zhou 写道:
> Hello
> 
> 在 2024/7/22 11:47, Zhang Qiao 写道:
>>
>>
>> Hi, Chuyi
>>
>> 在 2024/7/21 20:52, Chuyi Zhou 写道:
>>> The static key __cfs_bandwidth_used is used to indicate whether bandwidth
>>> control is enabled in the system. Currently, it is only decreased when a
>>> task group disables bandwidth control. This is incorrect because if there
>>> was a task group in the past that enabled bandwidth control, the
>>> __cfs_bandwidth_used will never go to zero, even if there are no task_group
>>> using bandwidth control now.
>>>
>>> This patch tries to fix this issue by decreasing bandwidth usage in
>>> destroy_cfs_bandwidth().
>>>
>>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>>> ---
>>>   kernel/sched/fair.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index b1e07ce90284..7ad50dc31a93 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6447,6 +6447,9 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>       hrtimer_cancel(&cfs_b->period_timer);
>>>       hrtimer_cancel(&cfs_b->slack_timer);
>>>   +    if (cfs_b->quota != RUNTIME_INF)
>>> +        cfs_bandwidth_usage_dec();
>>
>> This calls static_key_slow_dec_cpuslocked, but destroy_cfs_bandwidth
>> isn't holding the hotplug lock [1].
>>
>> To fix this issue, I also sent a patch, but it was not merged into mainline [2].
>>
>> [1]: https://lore.kernel.org/all/20210712162655.w3j6uczwbfkzazvt@oracle.com/
>> [2]: https://lore.kernel.org/all/20210910094139.184582-1-zhangqiao22@huawei.com/
>>
> 
> Thanks for your information.
> 
> I think maybe cfs_bandwidth_usage_dec() should be moved to other more suitable places where could
> hold hotplug lock(e.g. cpu_cgroup_css_released()). I would do some test to verify it.
> 

The cpu_cgroup_css_released() also doesn't seem to be in the cpu hotplug lock-holding context.


> Thanks.
> 
> 
> 
> 
> 


* Re: [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth
  2024-07-21 12:52 ` [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
@ 2024-07-22  7:21   ` Chengming Zhou
  2024-07-22 22:24   ` Benjamin Segall
  1 sibling, 0 replies; 12+ messages in thread
From: Chengming Zhou @ 2024-07-22  7:21 UTC (permalink / raw)
  To: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel, joshdon

On 2024/7/21 20:52, Chuyi Zhou wrote:
> In the kubernetes production environment, we have observed a high
> frequency of writes to cpu.max, approximately every 2~4 seconds for each
> cgroup, with the same value being written each time. This can result in
> unnecessary overhead, especially on machines with a large number of CPUs
> and cgroups.
> 
> This is because kubelet and runc attempt to persist resource
> configurations through frequent updates with the same value in this manner.

Ok.

> While optimizations can be made to kubelet and runc to avoid such
> overhead (e.g. checking the current value of the cpu request/limit before
> writing to cpu.max), it is still worthwhile to bail out of
> tg_set_cfs_bandwidth() if we attempt to update with the same value.

Yeah, we can optimize this situation with a little checking code;
it seems worthwhile to do IMHO.

> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>   kernel/sched/core.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d35c48239be..4db3ef2a703b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9081,6 +9081,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>   				     burst + quota > max_cfs_runtime))
>   		return -EINVAL;
>   
> +	if (cfs_b->period == ns_to_ktime(period) && cfs_b->quota == quota && cfs_b->burst == burst)
> +		return 0;

Maybe we'd better do these checks under lock protection, right?

Thanks.

>   	/*
>   	 * Prevent race between setting of cfs_rq->runtime_enabled and
>   	 * unthrottle_offline_cfs_rqs().


* Re: [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-22  7:16       ` Zhang Qiao
@ 2024-07-22  7:46         ` Chuyi Zhou
  2024-07-22  8:16           ` Zhang Qiao
  0 siblings, 1 reply; 12+ messages in thread
From: Chuyi Zhou @ 2024-07-22  7:46 UTC (permalink / raw)
  To: Zhang Qiao, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon

Hello,

在 2024/7/22 15:16, Zhang Qiao 写道:
> hi
> 
> 在 2024/7/22 14:04, Chuyi Zhou 写道:
>> Hello
>>
>> 在 2024/7/22 11:47, Zhang Qiao 写道:
>>>
>>>
>>> Hi, Chuyi
>>>
>>> 在 2024/7/21 20:52, Chuyi Zhou 写道:
>>>> The static key __cfs_bandwidth_used is used to indicate whether bandwidth
>>>> control is enabled in the system. Currently, it is only decreased when a
>>>> task group disables bandwidth control. This is incorrect because if there
>>>> was a task group in the past that enabled bandwidth control, the
>>>> __cfs_bandwidth_used will never go to zero, even if there are no task_group
>>>> using bandwidth control now.
>>>>
>>>> This patch tries to fix this issue by decreasing bandwidth usage in
>>>> destroy_cfs_bandwidth().
>>>>
>>>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>>>> ---
>>>>    kernel/sched/fair.c | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b1e07ce90284..7ad50dc31a93 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6447,6 +6447,9 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>>>        hrtimer_cancel(&cfs_b->period_timer);
>>>>        hrtimer_cancel(&cfs_b->slack_timer);
>>>>    +    if (cfs_b->quota != RUNTIME_INF)
>>>> +        cfs_bandwidth_usage_dec();
>>>
>>> This calls static_key_slow_dec_cpuslocked, but destroy_cfs_bandwidth
>>> isn't holding the hotplug lock [1].
>>>
>>> To fix this issue, I also sent a patch, but it was not merged into mainline [2].
>>>
>>> [1]: https://lore.kernel.org/all/20210712162655.w3j6uczwbfkzazvt@oracle.com/
>>> [2]: https://lore.kernel.org/all/20210910094139.184582-1-zhangqiao22@huawei.com/
>>>
>>
>> Thanks for your information.
>>
>> I think maybe cfs_bandwidth_usage_dec() should be moved to other more suitable places where could
>> hold hotplug lock(e.g. cpu_cgroup_css_released()). I would do some test to verify it.
>>
> 
> The cpu_cgroup_css_released() also doesn't seem to be in the cpu hotplug lock-holding context.
> 

IIUC, cpus_read_lock()/cpus_read_unlock() can be called in
cpu_cgroup_css_released(), right? But the cfs bandwidth destruction may
run in an RCU callback, since the task group list is protected by RCU,
so we cannot take the lock there. Did I miss something important?



* Re: [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-22  7:46         ` Chuyi Zhou
@ 2024-07-22  8:16           ` Zhang Qiao
  2024-07-22 22:20             ` Benjamin Segall
  0 siblings, 1 reply; 12+ messages in thread
From: Zhang Qiao @ 2024-07-22  8:16 UTC (permalink / raw)
  To: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
  Cc: chengming.zhou, linux-kernel, joshdon



在 2024/7/22 15:46, Chuyi Zhou 写道:
>>>
>>> Thanks for your information.
>>>
>>> I think maybe cfs_bandwidth_usage_dec() should be moved to other more suitable places where could
>>> hold hotplug lock(e.g. cpu_cgroup_css_released()). I would do some test to verify it.
>>>
>>
>> The cpu_cgroup_css_released() also doesn't seem to be in the cpu hotplug lock-holding context.
>>
> 
> IIUC, cpus_read_lock()/cpus_read_unlock() can be called in cpu_cgroup_css_released(), right? But
> the cfs bandwidth destruction may run in an RCU callback, since the task group list is protected
> by RCU, so we cannot take the lock there. Did I miss something important?


Okay, you're right. I overlooked that we can't take the hotplug lock in an RCU callback.


* Re: [PATCH 1/2] sched/fair: Decrease cfs bandwidth usage in task_group destruction
  2024-07-22  8:16           ` Zhang Qiao
@ 2024-07-22 22:20             ` Benjamin Segall
  0 siblings, 0 replies; 12+ messages in thread
From: Benjamin Segall @ 2024-07-22 22:20 UTC (permalink / raw)
  To: Zhang Qiao
  Cc: Chuyi Zhou, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, mgorman, vschneid, chengming.zhou,
	linux-kernel, joshdon

Zhang Qiao <zhangqiao22@huawei.com> writes:

> 在 2024/7/22 15:46, Chuyi Zhou 写道:
>>>>
>>>> Thanks for your information.
>>>>
>>>> I think maybe cfs_bandwidth_usage_dec() should be moved to other more suitable places where could
>>>> hold hotplug lock(e.g. cpu_cgroup_css_released()). I would do some test to verify it.
>>>>
>>>
>>> The cpu_cgroup_css_released() also doesn't seem to be in the cpu hotplug lock-holding context.
>>>
>> 
>> IIUC, cpus_read_lock()/cpus_read_unlock() can be called in cpu_cgroup_css_released(), right? But
>> the cfs bandwidth destruction may run in an RCU callback, since the task group list is protected
>> by RCU, so we cannot take the lock there. Did I miss something important?
>
>
> Okay, you're right. I overlooked that we can't take the hotplug lock in an RCU callback.

Yeah, cpu_cgroup_css_released()/cpu_cgroup_css_free() are fine I think,
and it should be correct to move the call to destroy_cfs_bandwidth() to
cpu_cgroup_css_free() (it's unfortunate in terms of code organization,
but as far as correctness goes it should be fine).

As far as the diff goes, the _dec should go after the
__cfsb_csd_unthrottle loop.


* Re: [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth
  2024-07-21 12:52 ` [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth Chuyi Zhou
  2024-07-22  7:21   ` Chengming Zhou
@ 2024-07-22 22:24   ` Benjamin Segall
  2024-07-23  2:34     ` Chuyi Zhou
  1 sibling, 1 reply; 12+ messages in thread
From: Benjamin Segall @ 2024-07-22 22:24 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, vschneid, chengming.zhou, linux-kernel, joshdon

Chuyi Zhou <zhouchuyi@bytedance.com> writes:

> In the kubernetes production environment, we have observed a high
> frequency of writes to cpu.max, approximately every 2~4 seconds for each
> cgroup, with the same value being written each time. This can result in
> unnecessary overhead, especially on machines with a large number of CPUs
> and cgroups.
>
> This is because kubelet and runc attempt to persist resource
> configurations through frequent updates with the same value in this manner.
> While optimizations can be made to kubelet and runc to avoid such
> overhead (e.g. checking the current value of the cpu request/limit before
> writing to cpu.max), it is still worthwhile to bail out of
> tg_set_cfs_bandwidth() if we attempt to update with the same value.

Yeah, this is silly of userspace, but we also do the same sort of check
for shares, so sure.

We do need it to be inside a lock though; multiple threads could in
theory write to the file at once. Just move it down inside
cfs_constraints_mutex and that's fine.

>
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  kernel/sched/core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d35c48239be..4db3ef2a703b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9081,6 +9081,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
>  				     burst + quota > max_cfs_runtime))
>  		return -EINVAL;
>  
> +	if (cfs_b->period == ns_to_ktime(period) && cfs_b->quota == quota && cfs_b->burst == burst)
> +		return 0;
>  	/*
>  	 * Prevent race between setting of cfs_rq->runtime_enabled and
>  	 * unthrottle_offline_cfs_rqs().


* Re: [PATCH 2/2] sched/core: Avoid unnecessary update in tg_set_cfs_bandwidth
  2024-07-22 22:24   ` Benjamin Segall
@ 2024-07-23  2:34     ` Chuyi Zhou
  0 siblings, 0 replies; 12+ messages in thread
From: Chuyi Zhou @ 2024-07-23  2:34 UTC (permalink / raw)
  To: Benjamin Segall
  Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, mgorman, vschneid, chengming.zhou, linux-kernel, joshdon



在 2024/7/23 06:24, Benjamin Segall 写道:
> Chuyi Zhou <zhouchuyi@bytedance.com> writes:
> 
>> In the kubernetes production environment, we have observed a high
>> frequency of writes to cpu.max, approximately every 2~4 seconds for each
>> cgroup, with the same value being written each time. This can result in
>> unnecessary overhead, especially on machines with a large number of CPUs
>> and cgroups.
>>
>> This is because kubelet and runc attempt to persist resource
>> configurations through frequent updates with same value in this manner.
>> While optimizations can be made to kubelet and runc to avoid such
>> overhead(e.g. check the current value of cpu request/limit before writing
>> to cpu.max), it is still worth to bail out from tg_set_cfs_bandwidth() if
>> we attempt to update with the same value.
> 
> Yeah, this is silly of userspace, but we also do the same sort of check
> for shares, so sure.
> 
> We do need it to be inside of a lock though, multiple threads could
> in theory write to the file at once. Just move it down inside of
> cfs_constraints_mutex and that's fine.
> 

Thanks, I will send v2 later.

