public inbox for linux-kernel@vger.kernel.org
* [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization
@ 2023-02-23 18:51 Shrikanth Hegde
  2023-02-23 23:09 ` Benjamin Segall
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Shrikanth Hegde @ 2023-02-23 18:51 UTC (permalink / raw)
  To: mingo, peterz, vincent.guittot
  Cc: dietmar.eggemann, bsegall, tglx, srikar, arjan, svaidy,
	linux-kernel, sshegde

The CPU cfs bandwidth controller uses an hrtimer. Currently no initial value is
set, so when there are multiple CPU cgroups all the period timers align at
expiry.

There is a performance gain that can be achieved here if the timers are
interleaved when the utilization of each CPU cgroup is low and the total
utilization of all the CPU cgroups is less than 50%. If the timers are
interleaved, the unthrottled cgroup can run freely without many context
switches and can also benefit from SMT Folding. This effect is further
amplified in a SPLPAR environment.

This commit adds a random offset after initializing each hrtimer, which
interleaves the timers at expiry and helps achieve the performance gain
described above.
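The intended effect can be illustrated with a toy simulation (hypothetical
numbers chosen to mirror the 25%/100ms setup used in the measurements below;
this is a sketch of the idea, not kernel code). Each cgroup is assumed to burn
its whole quota at the start of every period, as a greedy low-utilization
workload would:

```python
def overlap_fraction(offsets, period=100, quota=25, horizon=10_000):
    """Fraction of sampled time at which two or more cgroups run at once.

    Each cgroup's period starts at its offset (ms) and it runs for `quota`
    ms at the start of every period.
    """
    both = 0
    for t in range(horizon):
        running = sum(1 for off in offsets if (t - off) % period < quota)
        if running >= 2:
            both += 1
    return both / horizon

aligned = overlap_fraction([0, 0])       # no initial offset: timers align
interleaved = overlap_fraction([0, 50])  # offsets spread inside the period

print(aligned)      # 0.25 -> the two cgroups always contend for CPUs
print(interleaved)  # 0.0  -> they never run at the same time
```

With aligned timers the two cgroups always wake together and contend; with
offsets spread inside the period each one gets the machine to itself, which is
what allows SMT Folding to help.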

This was tested on a powerpc platform with 8 cores, SMT=8. Socket power was
measured while the workload ran; stress-ng was benchmarked together with the
power measurements. Throughput-oriented benchmarks show a significant gain of
up to 25%, while power consumption increases by up to 15%.

Workload: stress-ng --cpu=32 --cpu-ops=50000.
1CG - 1 cgroup is running.
2CG - 2 cgroups are running together.
Time taken to complete stress-ng in seconds and power is in watts.
Each cgroup is throttled at 25% with 100ms as the period value.
           6.2-rc6                     |   with patch
8 core   1CG    power   2CG     power  |  1CG    power  2 CG    power
        27.5    80.6    40      90     |  27.3    82    32.3    104
        27.5    81      40.2    91     |  27.5    81    38.7     96
        27.7    80      40.1    89     |  27.6    80    29.7    106
        27.7    80.1    40.3    94     |  27.6    80    31.5    105
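For reference, a throttled cgroup like the ones above could be set up roughly
as follows. This is a sketch assuming the cgroup v2 interface mounted at
/sys/fs/cgroup; the group name is made up, and the quota shown is 25% of a
single CPU (the actual tests used a larger, multi-CPU budget — scale the quota
accordingly):

```shell
# Create a cgroup and cap it at 25% with a 100ms period.
# cpu.max takes "<quota_us> <period_us>".
mkdir /sys/fs/cgroup/CG1
echo "25000 100000" > /sys/fs/cgroup/CG1/cpu.max

# Move the current shell into the group, then run the workload.
echo $$ > /sys/fs/cgroup/CG1/cgroup.procs
stress-ng --cpu=32 --cpu-ops=50000
```

Each such cgroup gets its own cfs_bandwidth period timer, which is what the
patch offsets at initialization.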

Latency might be affected by this change, since a CPU may be in a deeper idle
state when its timer fires, which is more likely when the timers are
interleaved. schbench was used to measure latency. Each cgroup is throttled at
25% with the period value set to 100ms. Numbers are for both cgroups running
simultaneously. Latency values don't degrade much, and some improvement is
seen in the tail latencies.

		6.2-rc6        with patch
Groups: 16
50.0th:          39.5            42.5
75.0th:         924.0           922.0
90.0th:         972.0           968.0
95.0th:        1005.5           994.0
99.0th:        4166.0          2287.0
99.5th:        7314.0          7448.0
99.9th:       15024.0         13600.0

Groups: 32
50.0th:         819.0           463.0
75.0th:        1596.0           918.0
90.0th:        5992.0          1281.5
95.0th:       13184.0          2765.0
99.0th:       21792.0         14240.0
99.5th:       25696.0         18920.0
99.9th:       33280.0         35776.0

Groups: 64
50.0th:        4806.0          3440.0
75.0th:       31136.0         33664.0
90.0th:       54144.0         58752.0
95.0th:       66176.0         67200.0
99.0th:       84736.0         91520.0
99.5th:       97408.0        114048.0
99.9th:      136448.0        140032.0

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>

Initial RFC PATCH, discussions and details on the problem:
Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/
Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/

---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff4dbbae3b10..2a4a0969e04f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5923,6 +5923,10 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+
+	/* Add a random offset so that timers interleave */
+	hrtimer_set_expires(&cfs_b->period_timer,
+			    get_random_u32_below(cfs_b->period));
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->slack_started = false;
--
2.31.1



* Re: [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization
  2023-02-23 18:51 [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization Shrikanth Hegde
@ 2023-02-23 23:09 ` Benjamin Segall
  2023-03-09 14:21   ` Shrikanth Hegde
  2023-03-14  9:59 ` Peter Zijlstra
  2023-03-22  9:22 ` [tip: sched/core] sched: " tip-bot2 for Shrikanth Hegde
  2 siblings, 1 reply; 6+ messages in thread
From: Benjamin Segall @ 2023-02-23 23:09 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, peterz, vincent.guittot, dietmar.eggemann, tglx, srikar,
	arjan, svaidy, linux-kernel

Shrikanth Hegde <sshegde@linux.vnet.ibm.com> writes:

> CPU cfs bandwidth controller uses hrtimer. Currently there is no initial
> value set. Hence all period timers would align at expiry.
> This happens when there are multiple CPU cgroup's.
>
> There is a performance gain that can be achieved here if the timers are
> interleaved when the utilization of each CPU cgroup is low and total
> utilization of all the CPU cgroup's is less than 50%. If the timers are
> interleaved, then the unthrottled cgroup can run freely without many
> context switches and can also benefit from SMT Folding. This effect will
> be further amplified in SPLPAR environment.
>
> This commit adds a random offset after initializing each hrtimer. This
> would result in interleaving the timers at expiry, which helps in achieving
> the said performance gain.
>
> This was tested on powerpc platform with 8 core SMT=8. Socket power was
> measured when the workload. Benchmarked the stress-ng with power
> information. Throughput oriented benchmarks show significant gain up to
> 25% while power consumption increases up to 15%.
>
> Workload: stress-ng --cpu=32 --cpu-ops=50000.
> 1CG - 1 cgroup is running.
> 2CG - 2 cgroups are running together.
> Time taken to complete stress-ng in seconds and power is in watts.
> each cgroup is throttled at 25% with 100ms as the period value.
>            6.2-rc6                     |   with patch
> 8 core   1CG    power   2CG     power  |  1CG    power  2 CG    power
>         27.5    80.6    40      90     |  27.3    82    32.3    104
>         27.5    81      40.2    91     |  27.5    81    38.7     96
>         27.7    80      40.1    89     |  27.6    80    29.7    106
>         27.7    80.1    40.3    94     |  27.6    80    31.5    105
>
> Latency might be affected by this change. That could happen if the CPU was
> in a deep idle state which is possible if we interleave the timers. Used
> schbench for measuring the latency. Each cgroup is throttled at 25% with
> period value is set to 100ms. Numbers are when both the cgroups are
> running simultaneously. Latency values don't degrade much. Some
> improvement is seen in tail latencies.
>
> 		6.2-rc6        with patch
> Groups: 16
> 50.0th:          39.5            42.5
> 75.0th:         924.0           922.0
> 90.0th:         972.0           968.0
> 95.0th:        1005.5           994.0
> 99.0th:        4166.0          2287.0
> 99.5th:        7314.0          7448.0
> 99.9th:       15024.0         13600.0
>
> Groups: 32
> 50.0th:         819.0           463.0
> 75.0th:        1596.0           918.0
> 90.0th:        5992.0          1281.5
> 95.0th:       13184.0          2765.0
> 99.0th:       21792.0         14240.0
> 99.5th:       25696.0         18920.0
> 99.9th:       33280.0         35776.0
>
> Groups: 64
> 50.0th:        4806.0          3440.0
> 75.0th:       31136.0         33664.0
> 90.0th:       54144.0         58752.0
> 95.0th:       66176.0         67200.0
> 99.0th:       84736.0         91520.0
> 99.5th:       97408.0        114048.0
> 99.9th:      136448.0        140032.0
>
> Signed-off-by: Shrikanth Hegde<sshegde@linux.vnet.ibm.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Ben Segall <bsegall@google.com>

>
> Initial RFC PATCH, discussions and details on the problem:
> Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/
> Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/
>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff4dbbae3b10..2a4a0969e04f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5923,6 +5923,10 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
>  	cfs_b->period_timer.function = sched_cfs_period_timer;
> +
> +	/* Add a random offset so that timers interleave */
> +	hrtimer_set_expires(&cfs_b->period_timer,
> +			    get_random_u32_below(cfs_b->period));
>  	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
>  	cfs_b->slack_started = false;
> --
> 2.31.1


* Re: [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization
  2023-02-23 23:09 ` Benjamin Segall
@ 2023-03-09 14:21   ` Shrikanth Hegde
  2023-03-14  7:55     ` Vincent Guittot
  0 siblings, 1 reply; 6+ messages in thread
From: Shrikanth Hegde @ 2023-03-09 14:21 UTC (permalink / raw)
  To: Benjamin Segall, mingo, peterz, vincent.guittot
  Cc: dietmar.eggemann, tglx, srikar, arjan, svaidy, linux-kernel,
	shrikanth hegde



>> CPU cfs bandwidth controller uses hrtimer. Currently there is no initial
>> value set. Hence all period timers would align at expiry.
>> This happens when there are multiple CPU cgroup's.
>>
>> There is a performance gain that can be achieved here if the timers are
>> interleaved when the utilization of each CPU cgroup is low and total
>> utilization of all the CPU cgroup's is less than 50%. If the timers are
>> interleaved, then the unthrottled cgroup can run freely without many
>> context switches and can also benefit from SMT Folding. This effect will
>> be further amplified in SPLPAR environment.
>>
>> This commit adds a random offset after initializing each hrtimer. This
>> would result in interleaving the timers at expiry, which helps in achieving
>> the said performance gain.
>>
>> This was tested on powerpc platform with 8 core SMT=8. Socket power was
>> measured when the workload. Benchmarked the stress-ng with power
>> information. Throughput oriented benchmarks show significant gain up to
>> 25% while power consumption increases up to 15%.
>>
>> Workload: stress-ng --cpu=32 --cpu-ops=50000.
>> 1CG - 1 cgroup is running.
>> 2CG - 2 cgroups are running together.
>> Time taken to complete stress-ng in seconds and power is in watts.
>> each cgroup is throttled at 25% with 100ms as the period value.
>>            6.2-rc6                     |   with patch
>> 8 core   1CG    power   2CG     power  |  1CG    power  2 CG    power
>>         27.5    80.6    40      90     |  27.3    82    32.3    104
>>         27.5    81      40.2    91     |  27.5    81    38.7     96
>>         27.7    80      40.1    89     |  27.6    80    29.7    106
>>         27.7    80.1    40.3    94     |  27.6    80    31.5    105
>>
>> Latency might be affected by this change. That could happen if the CPU was
>> in a deep idle state which is possible if we interleave the timers. Used
>> schbench for measuring the latency. Each cgroup is throttled at 25% with
>> period value is set to 100ms. Numbers are when both the cgroups are
>> running simultaneously. Latency values don't degrade much. Some
>> improvement is seen in tail latencies.
>>
>> 		6.2-rc6        with patch
>> Groups: 16
>> 50.0th:          39.5            42.5
>> 75.0th:         924.0           922.0
>> 90.0th:         972.0           968.0
>> 95.0th:        1005.5           994.0
>> 99.0th:        4166.0          2287.0
>> 99.5th:        7314.0          7448.0
>> 99.9th:       15024.0         13600.0
>>
>> Groups: 32
>> 50.0th:         819.0           463.0
>> 75.0th:        1596.0           918.0
>> 90.0th:        5992.0          1281.5
>> 95.0th:       13184.0          2765.0
>> 99.0th:       21792.0         14240.0
>> 99.5th:       25696.0         18920.0
>> 99.9th:       33280.0         35776.0
>>
>> Groups: 64
>> 50.0th:        4806.0          3440.0
>> 75.0th:       31136.0         33664.0
>> 90.0th:       54144.0         58752.0
>> 95.0th:       66176.0         67200.0
>> 99.0th:       84736.0         91520.0
>> 99.5th:       97408.0        114048.0
>> 99.9th:      136448.0        140032.0
>>
>> Signed-off-by: Shrikanth Hegde<sshegde@linux.vnet.ibm.com>
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Reviewed-by: Ben Segall <bsegall@google.com>

Ben, thank you very much for the review.

Peter, Ingo, Vincent,
Could you please provide your inputs? Are there any other benchmarks I should run?


Sorry about the subject line. I should have included sched/fair. I realized it
after the mail was sent. I sent another mail with only the subject line changed;
they are the same except for the subject line.
https://lore.kernel.org/lkml/20230223185918.1500132-1-sshegde@linux.vnet.ibm.com/

Should I resend the patch with a better changelog and a cover letter detailing
the test results instead?

> 
>>
>> Initial RFC PATCH, discussions and details on the problem:
>> Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/
>> Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/
>>
>> ---
>>  kernel/sched/fair.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ff4dbbae3b10..2a4a0969e04f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5923,6 +5923,10 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>>  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
>>  	cfs_b->period_timer.function = sched_cfs_period_timer;
>> +
>> +	/* Add a random offset so that timers interleave */
>> +	hrtimer_set_expires(&cfs_b->period_timer,
>> +			    get_random_u32_below(cfs_b->period));
>>  	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>>  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
>>  	cfs_b->slack_started = false;
>> --
>> 2.31.1


* Re: [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization
  2023-03-09 14:21   ` Shrikanth Hegde
@ 2023-03-14  7:55     ` Vincent Guittot
  0 siblings, 0 replies; 6+ messages in thread
From: Vincent Guittot @ 2023-03-14  7:55 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Benjamin Segall, mingo, peterz, dietmar.eggemann, tglx, srikar,
	arjan, svaidy, linux-kernel

On Thu, 9 Mar 2023 at 15:22, Shrikanth Hegde <sshegde@linux.vnet.ibm.com> wrote:
>
>
>
> >> CPU cfs bandwidth controller uses hrtimer. Currently there is no initial
> >> value set. Hence all period timers would align at expiry.
> >> This happens when there are multiple CPU cgroup's.
> >>
> >> There is a performance gain that can be achieved here if the timers are
> >> interleaved when the utilization of each CPU cgroup is low and total
> >> utilization of all the CPU cgroup's is less than 50%. If the timers are
> >> interleaved, then the unthrottled cgroup can run freely without many
> >> context switches and can also benefit from SMT Folding. This effect will
> >> be further amplified in SPLPAR environment.
> >>
> >> This commit adds a random offset after initializing each hrtimer. This
> >> would result in interleaving the timers at expiry, which helps in achieving
> >> the said performance gain.
> >>
> >> This was tested on powerpc platform with 8 core SMT=8. Socket power was
> >> measured when the workload. Benchmarked the stress-ng with power
> >> information. Throughput oriented benchmarks show significant gain up to
> >> 25% while power consumption increases up to 15%.
> >>
> >> Workload: stress-ng --cpu=32 --cpu-ops=50000.
> >> 1CG - 1 cgroup is running.
> >> 2CG - 2 cgroups are running together.
> >> Time taken to complete stress-ng in seconds and power is in watts.
> >> each cgroup is throttled at 25% with 100ms as the period value.
> >>            6.2-rc6                     |   with patch
> >> 8 core   1CG    power   2CG     power  |  1CG    power  2 CG    power
> >>         27.5    80.6    40      90     |  27.3    82    32.3    104
> >>         27.5    81      40.2    91     |  27.5    81    38.7     96
> >>         27.7    80      40.1    89     |  27.6    80    29.7    106
> >>         27.7    80.1    40.3    94     |  27.6    80    31.5    105
> >>
> >> Latency might be affected by this change. That could happen if the CPU was
> >> in a deep idle state which is possible if we interleave the timers. Used
> >> schbench for measuring the latency. Each cgroup is throttled at 25% with
> >> period value is set to 100ms. Numbers are when both the cgroups are
> >> running simultaneously. Latency values don't degrade much. Some
> >> improvement is seen in tail latencies.
> >>
> >>              6.2-rc6        with patch
> >> Groups: 16
> >> 50.0th:          39.5            42.5
> >> 75.0th:         924.0           922.0
> >> 90.0th:         972.0           968.0
> >> 95.0th:        1005.5           994.0
> >> 99.0th:        4166.0          2287.0
> >> 99.5th:        7314.0          7448.0
> >> 99.9th:       15024.0         13600.0
> >>
> >> Groups: 32
> >> 50.0th:         819.0           463.0
> >> 75.0th:        1596.0           918.0
> >> 90.0th:        5992.0          1281.5
> >> 95.0th:       13184.0          2765.0
> >> 99.0th:       21792.0         14240.0
> >> 99.5th:       25696.0         18920.0
> >> 99.9th:       33280.0         35776.0
> >>
> >> Groups: 64
> >> 50.0th:        4806.0          3440.0
> >> 75.0th:       31136.0         33664.0
> >> 90.0th:       54144.0         58752.0
> >> 95.0th:       66176.0         67200.0
> >> 99.0th:       84736.0         91520.0
> >> 99.5th:       97408.0        114048.0
> >> 99.9th:      136448.0        140032.0
> >>
> >> Signed-off-by: Shrikanth Hegde<sshegde@linux.vnet.ibm.com>
> >> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> >> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> >
> > Reviewed-by: Ben Segall <bsegall@google.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

>
> Ben, Thank you very much for the review.
>
> Peter, Ingo, Vincent,
> Could you please provide your inputs? Any other benchmarks should be run?
>
>
> Sorry about the subject line. I should have included sched/fair. I realized it after
> mail was sent. I sent another mail with only subject line changed.
> They are same except subject line.
> https://lore.kernel.org/lkml/20230223185918.1500132-1-sshegde@linux.vnet.ibm.com/
>
> Should I resend the patch with better change log and a cover letter detailing
> test results instead?
>
> >
> >>
> >> Initial RFC PATCH, discussions and details on the problem:
> >> Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/
> >> Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/
> >>
> >> ---
> >>  kernel/sched/fair.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ff4dbbae3b10..2a4a0969e04f 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5923,6 +5923,10 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >>      INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> >>      hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> >>      cfs_b->period_timer.function = sched_cfs_period_timer;
> >> +
> >> +    /* Add a random offset so that timers interleave */
> >> +    hrtimer_set_expires(&cfs_b->period_timer,
> >> +                        get_random_u32_below(cfs_b->period));
> >>      hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >>      cfs_b->slack_timer.function = sched_cfs_slack_timer;
> >>      cfs_b->slack_started = false;
> >> --
> >> 2.31.1


* Re: [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization
  2023-02-23 18:51 [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization Shrikanth Hegde
  2023-02-23 23:09 ` Benjamin Segall
@ 2023-03-14  9:59 ` Peter Zijlstra
  2023-03-22  9:22 ` [tip: sched/core] sched: " tip-bot2 for Shrikanth Hegde
  2 siblings, 0 replies; 6+ messages in thread
From: Peter Zijlstra @ 2023-03-14  9:59 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: mingo, vincent.guittot, dietmar.eggemann, bsegall, tglx, srikar,
	arjan, svaidy, linux-kernel

On Fri, Feb 24, 2023 at 12:21:53AM +0530, Shrikanth Hegde wrote:
> CPU cfs bandwidth controller uses hrtimer. Currently there is no initial
> value set. Hence all period timers would align at expiry.
> This happens when there are multiple CPU cgroup's.
> 
> There is a performance gain that can be achieved here if the timers are
> interleaved when the utilization of each CPU cgroup is low and total
> utilization of all the CPU cgroup's is less than 50%. If the timers are
> interleaved, then the unthrottled cgroup can run freely without many
> context switches and can also benefit from SMT Folding. This effect will
> be further amplified in SPLPAR environment.
> 
> This commit adds a random offset after initializing each hrtimer. This
> would result in interleaving the timers at expiry, which helps in achieving
> the said performance gain.
> 
> This was tested on powerpc platform with 8 core SMT=8. Socket power was
> measured when the workload. Benchmarked the stress-ng with power
> information. Throughput oriented benchmarks show significant gain up to
> 25% while power consumption increases up to 15%.
> 
> Workload: stress-ng --cpu=32 --cpu-ops=50000.
> 1CG - 1 cgroup is running.
> 2CG - 2 cgroups are running together.
> Time taken to complete stress-ng in seconds and power is in watts.
> each cgroup is throttled at 25% with 100ms as the period value.
>            6.2-rc6                     |   with patch
> 8 core   1CG    power   2CG     power  |  1CG    power  2 CG    power
>         27.5    80.6    40      90     |  27.3    82    32.3    104
>         27.5    81      40.2    91     |  27.5    81    38.7     96
>         27.7    80      40.1    89     |  27.6    80    29.7    106
>         27.7    80.1    40.3    94     |  27.6    80    31.5    105
> 
> Latency might be affected by this change. That could happen if the CPU was
> in a deep idle state which is possible if we interleave the timers. Used
> schbench for measuring the latency. Each cgroup is throttled at 25% with
> period value is set to 100ms. Numbers are when both the cgroups are
> running simultaneously. Latency values don't degrade much. Some
> improvement is seen in tail latencies.
> 
> 		6.2-rc6        with patch
> Groups: 16
> 50.0th:          39.5            42.5
> 75.0th:         924.0           922.0
> 90.0th:         972.0           968.0
> 95.0th:        1005.5           994.0
> 99.0th:        4166.0          2287.0
> 99.5th:        7314.0          7448.0
> 99.9th:       15024.0         13600.0
> 
> Groups: 32
> 50.0th:         819.0           463.0
> 75.0th:        1596.0           918.0
> 90.0th:        5992.0          1281.5
> 95.0th:       13184.0          2765.0
> 99.0th:       21792.0         14240.0
> 99.5th:       25696.0         18920.0
> 99.9th:       33280.0         35776.0
> 
> Groups: 64
> 50.0th:        4806.0          3440.0
> 75.0th:       31136.0         33664.0
> 90.0th:       54144.0         58752.0
> 95.0th:       66176.0         67200.0
> 99.0th:       84736.0         91520.0
> 99.5th:       97408.0        114048.0
> 99.9th:      136448.0        140032.0
> 
> Signed-off-by: Shrikanth Hegde<sshegde@linux.vnet.ibm.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Initial RFC PATCH, discussions and details on the problem:
> Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/
> Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/

Picked up for tip/sched/core.


* [tip: sched/core] sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization
  2023-02-23 18:51 [PATCH V3] Interleave cfs bandwidth timers for improved single thread performance at low utilization Shrikanth Hegde
  2023-02-23 23:09 ` Benjamin Segall
  2023-03-14  9:59 ` Peter Zijlstra
@ 2023-03-22  9:22 ` tip-bot2 for Shrikanth Hegde
  2 siblings, 0 replies; 6+ messages in thread
From: tip-bot2 for Shrikanth Hegde @ 2023-03-22  9:22 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra, Thomas Gleixner, Shrikanth Hegde, Ben Segall,
	Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     41abdba9374734b743019fc1cc05e3225c82ba6b
Gitweb:        https://git.kernel.org/tip/41abdba9374734b743019fc1cc05e3225c82ba6b
Author:        Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
AuthorDate:    Fri, 24 Feb 2023 00:21:53 +05:30
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 22 Mar 2023 10:10:58 +01:00

sched: Interleave cfs bandwidth timers for improved single thread performance at low utilization

The CPU cfs bandwidth controller uses an hrtimer. Currently no initial value is
set, so when there are multiple CPU cgroups all the period timers align at
expiry.

There is a performance gain that can be achieved here if the timers are
interleaved when the utilization of each CPU cgroup is low and the total
utilization of all the CPU cgroups is less than 50%. If the timers are
interleaved, the unthrottled cgroup can run freely without many context
switches and can also benefit from SMT Folding. This effect is further
amplified in a SPLPAR environment.

This commit adds a random offset after initializing each hrtimer, which
interleaves the timers at expiry and helps achieve the performance gain
described above.

This was tested on a powerpc platform with 8 cores, SMT=8. Socket power was
measured while the workload ran; stress-ng was benchmarked together with the
power measurements. Throughput-oriented benchmarks show a significant gain of
up to 25%, while power consumption increases by up to 15%.

Workload: stress-ng --cpu=32 --cpu-ops=50000.
1CG - 1 cgroup is running.
2CG - 2 cgroups are running together.
Time taken to complete stress-ng in seconds and power is in watts.
Each cgroup is throttled at 25% with 100ms as the period value.
           6.2-rc6                     |   with patch
8 core   1CG    power   2CG     power  |  1CG    power  2 CG    power
        27.5    80.6    40      90     |  27.3    82    32.3    104
        27.5    81      40.2    91     |  27.5    81    38.7     96
        27.7    80      40.1    89     |  27.6    80    29.7    106
        27.7    80.1    40.3    94     |  27.6    80    31.5    105

Latency might be affected by this change, since a CPU may be in a deeper idle
state when its timer fires, which is more likely when the timers are
interleaved. schbench was used to measure latency. Each cgroup is throttled at
25% with the period value set to 100ms. Numbers are for both cgroups running
simultaneously. Latency values don't degrade much, and some improvement is
seen in the tail latencies.

		6.2-rc6        with patch
Groups: 16
50.0th:          39.5            42.5
75.0th:         924.0           922.0
90.0th:         972.0           968.0
95.0th:        1005.5           994.0
99.0th:        4166.0          2287.0
99.5th:        7314.0          7448.0
99.9th:       15024.0         13600.0

Groups: 32
50.0th:         819.0           463.0
75.0th:        1596.0           918.0
90.0th:        5992.0          1281.5
95.0th:       13184.0          2765.0
99.0th:       21792.0         14240.0
99.5th:       25696.0         18920.0
99.9th:       33280.0         35776.0

Groups: 64
50.0th:        4806.0          3440.0
75.0th:       31136.0         33664.0
90.0th:       54144.0         58752.0
95.0th:       66176.0         67200.0
99.0th:       84736.0         91520.0
99.5th:       97408.0        114048.0
99.9th:      136448.0        140032.0

Initial RFC PATCH, discussions and details on the problem:

Link1: https://lore.kernel.org/lkml/5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com/
Link2: https://lore.kernel.org/lkml/9c57c92c-3e0c-b8c5-4be9-8f4df344a347@linux.vnet.ibm.com/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230223185153.1499710-1-sshegde@linux.vnet.ibm.com
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b572367..bc358dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5923,6 +5923,10 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+
+	/* Add a random offset so that timers interleave */
+	hrtimer_set_expires(&cfs_b->period_timer,
+			    get_random_u32_below(cfs_b->period));
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->slack_started = false;

