Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization

From: Ingo Molnar <mingo@kernel.org>
To: shrikanth hegde <sshegde@linux.vnet.ibm.com>
Cc: tglx@linutronix.de, peterz@infradead.org, arjan@linux.intel.com,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	svaidy@linux.ibm.com, linux-kernel@vger.kernel.org,
	bigeasy@linutronix.de
Subject: Re: [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization
Date: Tue, 31 Jan 2023 11:37:12 +0100
Message-ID: <Y9jvWCGGICsKGPFt@gmail.com>
In-Reply-To: <5ae3cb09-8c9a-11e8-75a7-cc774d9bc283@linux.vnet.ibm.com>


* shrikanth hegde <sshegde@linux.vnet.ibm.com> wrote:

> In the current hrtimer design, _softexpires is used to trigger the timer
> function.  _softexpires is set to a multiple of the period/interval value,
> which saves power by reducing wakeups.  As a result, different timers with
> the same period/interval value align, and their callback functions are
> called at the same time.
> 
> The CPU bandwidth controller (CPU cgroup) uses these hrtimers to implement
> period and quota.  The period timer refills the quota and allows the
> throttled cgroups to start running again.  When there are multiple such
> cgroups with the same period value, their period timers are aligned.  Hence
> the timers of multiple cgroups fire at the same time and end up unthrottling
> each cgroup's runqueues.  Since all the cgroups start together, they compete
> for CPU and are likely to use all SMT threads.
> 
> There is a performance gain to be had here if the timers are interleaved
> when the utilization of each CPU cgroup is low and the total utilization of
> all the CPU cgroups is less than 50%.  This is likely true when using
> containers.  If the timers aren't rounded off, then the unthrottled cgroup
> can run freely without many context switches and can also benefit from SMT
> folding[1].  This effect will be further amplified in an SPLPAR
> environment[2], as it would cause fewer hypervisor preemptions.  There may
> also be a benefit from fewer IPI storms.  Docker provides a config option
> for the period timer value, whereas Kubernetes only provides a millicore
> option.  Hence, in a typical deployment, period values will be set to 100ms,
> as Kubernetes millicore will set the quota accordingly without altering the
> period value.
> 
> [1] SMT folding is a mechanism where the processor core is reconfigured to a
> lower SMT mode to improve performance when some sibling threads are idle.  In
> an SMT8 core, when only one or two threads are running on a core, we get the
> best throughput compared to running all 8 threads.
> 
> [2] SPLPAR is a Shared Processor Logical PARtition.  There can be many
> SPLPARs running on the same physical machine, sharing the CPU resources.  One
> SPLPAR can consume all the CPU resources it can if the other SPLPARs are
> idle.  Processors within the SPLPAR are called vCPUs.  There can be more
> vCPUs than CPUs.  Hence, at a given instant, if more vCPUs are requested than
> there are CPUs, a vCPU can be preempted.  When the timers align, there will
> be a spike in requested vCPUs when the timers expire.  This can lead to
> preemption when the other SPLPARs are not idle.
> 
> I came up with a naive patch, more of a hack.  The alternative is to use a
> slightly modified API for cgroups, so that all other timers still align and
> wakeups remain reduced.  A new hrtimer API is likely better; I can send out
> that patch quickly.  Here I am trying to misalign the timers by setting
> _softexpires at a multiple of interval/10 instead of interval.  I ran
> stress-ng with two cgroups.  The numbers are with and without the patch on a
> Power10 machine with SMT=8.  The table below shows the time taken by each
> cgroup to complete.  In the last column, both cgroups are run together and
> the data shows the average time taken by the cgroups to complete.  Here each
> cgroup is assigned 25% runtime.
> 
> Workload: stress-ng --cpu=4 --cpu-ops=100000.  The data shows the time taken
> to complete, in seconds, for each run.
> 
> Without Patch:
> period/quota    cgroup1 runs    cgroup2 runs    cgroup1 & cgroup2
>                    alone           alone         run together
> 100ms/200ms         120s            120s            155s
>                     120s            120s            155s
>                     120s            120s            155s
> With Patch:
> period/quota    cgroup1 runs    cgroup2 runs    cgroup1 & cgroup2
>                    alone           alone         run together
> 100ms/200ms         120s            120s            131s
>                     120s            120s            155s
>                     120s            120s            121s
> 
> There is no benefit at utilization of 50% or more, and no degradation
> either.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
> ---
>  kernel/time/hrtimer.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3ae661ab6260..d160f49f0cce 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -1055,6 +1055,17 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
> 
>  		orun = ktime_divns(delta, incr);
>  		hrtimer_add_expires_ns(timer, incr * orun);
> +		/*
> +		 * Avoid timer round-off, so that all cfs bandwidth timers
> +		 * don't start at the same time
> +		 */
> +		if (incr >= 100000000ULL) {
> +			s64 interleave = 0;
> +			interleave = ktime_sub_ns(delta,  incr * orun);
> +			interleave = interleave - (ktime_to_ns(delta) % (incr/10));
> +			if (interleave > 0)
> +				hrtimer_add_expires_ns(timer, interleave);
> +		}

Any reason why you did this in the hrtimer code, instead of the 
(sched_cfs_period_timer?) hrtimer handler itself?
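
For readers following the arithmetic in the quoted hunk, a minimal userspace
sketch may help.  It assumes plain int64_t nanosecond values in place of
ktime_t, and interleave_offset_ns() is a hypothetical name used only for
illustration, not a kernel function.  It models only the extra offset the
hunk adds on top of incr * orun, not the rest of hrtimer_forward().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the extra offset computed by the quoted hunk.
 * Assumptions: int64_t nanoseconds stand in for ktime_t, and this
 * function name is hypothetical.
 *
 * delta_ns: how far "now" is past the timer's current expiry
 * incr_ns:  the timer interval
 */
static int64_t interleave_offset_ns(int64_t delta_ns, int64_t incr_ns)
{
	int64_t orun, rem, slot;

	/* incr_ns >= 10 keeps incr_ns / 10 non-zero. */
	assert(delta_ns >= 0 && incr_ns >= 10);

	/* The hunk only applies to intervals of 100ms or more. */
	if (incr_ns < 100000000LL)
		return 0;

	orun = delta_ns / incr_ns;        /* whole intervals missed */
	rem = delta_ns - incr_ns * orun;  /* what is left over: delta % incr */
	slot = incr_ns / 10;

	/*
	 * Subtract delta % (incr / 10).  When incr is a multiple of 10,
	 * this rounds the leftover down to a multiple of incr / 10.
	 */
	rem -= delta_ns % slot;

	/* The hunk adds the offset only when it is positive. */
	return rem > 0 ? rem : 0;
}
```

With a 100ms interval and "now" 257ms past the expiry, the model yields 50ms:
two whole intervals are skipped, and the remaining 57ms is rounded down to
the nearest 10ms step.  For intervals shorter than 100ms it yields 0.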

Thanks,

	Ingo

Thread overview: 7+ messages
2023-01-31  5:48 [RFC PATCH] hrtimer: interleave timers for improved single thread performance at low utilization shrikanth hegde
2023-01-31 10:37 ` Ingo Molnar [this message]
2023-01-31 12:09   ` shrikanth hegde
2023-01-31 11:08 ` Thomas Gleixner
2023-01-31 12:27   ` shrikanth hegde
2023-01-31 14:55 ` Arjan van de Ven
2023-01-31 15:50   ` shrikanth hegde
