From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751898AbcF0M0B (ORCPT ); Mon, 27 Jun 2016 08:26:01 -0400 Received: from mail-wm0-f68.google.com ([74.125.82.68]:35761 "EHLO mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751601AbcF0MZ7 (ORCPT ); Mon, 27 Jun 2016 08:25:59 -0400 Date: Mon, 27 Jun 2016 14:25:56 +0200 From: Frederic Weisbecker To: riel@redhat.com Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@kernel.org, pbonzini@redhat.com, fweisbec@redhat.com, wanpeng.li@hotmail.com, efault@gmx.de, tglx@linutronix.de, rkrcmar@redhat.com Subject: Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time Message-ID: <20160627122553.GA2111@lerouge> References: <1466648751-7958-1-git-send-email-riel@redhat.com> <1466648751-7958-2-git-send-email-riel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1466648751-7958-2-git-send-email-riel@redhat.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@redhat.com wrote: > From: Rik van Riel > > Currently, if there was any irq or softirq time during 'ticks' > jiffies, the entire period will be accounted as irq or softirq > time. > > This is inaccurate if only a subset of 'ticks' jiffies was > actually spent handling irqs, and could conceivably mis-count > all of the ticks during a period as irq time, when there was > some irq and some softirq time. Good catch! Many comments following. > > This can actually happen when irqtime_account_process_tick > is called from account_idle_ticks, which can pass a larger > number of ticks down all at once. > > Fix this by changing irqtime_account_hi_update and > irqtime_account_si_update to round elapsed irq and softirq > time to jiffies, and return the number of jiffies spent in > each mode, similar to how steal time is handled. > > Additionally, have irqtime_account_process_tick take into > account how much time was spent in each of steal, irq, > and softirq time. > > The latter could help improve the accuracy of timekeeping Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD. > when returning from idle on a NO_HZ_IDLE CPU. > > Properly accounting how much time was spent in hardirq and > softirq time will also allow the NO_HZ_FULL code to re-use > these same functions for hardirq and softirq accounting. > > Signed-off-by: Rik van Riel > --- > kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++------------- > 1 file changed, 60 insertions(+), 21 deletions(-) > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index 3d60e5d76fdb..15b813c014be 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr) > } > EXPORT_SYMBOL_GPL(irqtime_account_irq); > > -static int irqtime_account_hi_update(void) > +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies) > { > u64 *cpustat = kcpustat_this_cpu->cpustat; > + unsigned long irq_jiffies = 0; > unsigned long flags; > - u64 latest_ns; > - int ret = 0; > + u64 irq; Should be cputime_t ? And perhaps renamed to irq_cputime to distinguish clearly against irq_jiffies? > > local_irq_save(flags); > - latest_ns = this_cpu_read(cpu_hardirq_time); > - if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ]) > - ret = 1; > + irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ]; cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact even cputime64_t). So you need to convert cpu_hardirq_time before doing the substract. > + if (irq > cputime_one_jiffy) { > + irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq)); > + cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies); > + } > local_irq_restore(flags); > - return ret; > + return irq_jiffies; I think we shouldn't use jiffies at all here and only rely on cputime_t. max_jiffies should be of cputime_t and this function should return the cputime_t. The resulting code should be more simple and in fact more precise (more below on that). > } > > -static int irqtime_account_si_update(void) > +static unsigned long irqtime_account_si_update(unsigned long max_jiffies) > { > u64 *cpustat = kcpustat_this_cpu->cpustat; > + unsigned long si_jiffies = 0; > unsigned long flags; > - u64 latest_ns; > - int ret = 0; > + u64 softirq; > > local_irq_save(flags); > - latest_ns = this_cpu_read(cpu_softirq_time); > - if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ]) > - ret = 1; > + softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ]; > + if (softirq > cputime_one_jiffy) { > + si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq)); > + cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies); > + } > local_irq_restore(flags); > - return ret; > + return si_jiffies; So same comments apply here. [...] > * Accumulate raw cputime values of dead tasks (sig->[us]time) and live > * tasks (sum on group iteration) belonging to @tsk's group. > */ > @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick, > { > cputime_t scaled = cputime_to_scaled(cputime_one_jiffy); > u64 cputime = (__force u64) cputime_one_jiffy; > - u64 *cpustat = kcpustat_this_cpu->cpustat; > + unsigned long other; > > - if (steal_account_process_tick(ULONG_MAX)) > + /* > + * When returning from idle, many ticks can get accounted at > + * once, including some ticks of steal, irq, and softirq time. > + * Subtract those ticks from the amount of time accounted to > + * idle, or potentially user or system time. Due to rounding, > + * other time can exceed ticks occasionally. > + */ > + other = account_other_ticks(ticks); > + if (other >= ticks) > return; > + ticks -= other; > > cputime *= ticks; > scaled *= ticks; So instead of dealing with ticks here, I think you should rather use the above cputime as both the limit and the remaining time to account after steal/irqs. This should avoid some middle conversions and improve precision when cputime_t == nsecs granularity. If we account 2 ticks to idle (lets say HZ=100) and irq time to account is 15 ms. 2 ticks = 20 ms so we have 5 ms left to account to idle. With the jiffies granularity in this patch, we would account one tick to irqtime (1 tick = 10 ms, there will be 5 ms to account back later) and one tick to idle time whereas if you deal with cputime_t, you are going to account the correct amount of idle time. Beyond the scope of this patchset, we should probably kill cputime_t and account everything to nsecs. I have a patchset that did that 2 years ago, I should probably re-iterate it. Thanks.