All of lore.kernel.org
 help / color / mirror / Atom feed
From: Luiz Capitulino <lcapitulino@redhat.com>
To: Wanpeng Li <kernellwp@gmail.com>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Wanpeng Li <wanpeng.li@hotmail.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Rik van Riel <riel@redhat.com>, Mike Galbraith <efault@gmx.de>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling
Date: Fri, 7 Apr 2017 12:58:01 -0400	[thread overview]
Message-ID: <20170407125801.503ae098@redhat.com> (raw)
In-Reply-To: <1491460584-4908-1-git-send-email-wanpeng.li@hotmail.com>

On Wed,  5 Apr 2017 23:36:24 -0700
Wanpeng Li <kernellwp@gmail.com> wrote:

> From: Wanpeng Li <wanpeng.li@hotmail.com>
> 
> This can be reproduced by setting up a CPU for nohz_full and pin two tasks 
> that hog the CPU 100% of the time to that CPU, top reports 70% system time 
> and 30% user time utilization sometimes. The trace like this:
> 
>    hog-11980 [015]   341.494491: function:    enter_from_user_mode <-- apic_timer_interrupt
> <idle>-0     [000]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    __context_tracking_exit <-- enter_from_user_mode
> <idle>-0     [000]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
>    hog-11980 [015]   341.494492: bprint:      vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339)
>    hog-11980 [015]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
>    hog-11980 [015]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
> <idle>-0     [000]   341.494494: function:    do_timer <-- tick_do_update_jiffies64.part.14
>    hog-11980 [015]   341.494494: function:    irq_exit <-- smp_apic_timer_interrupt
> <idle>-0     [000]   341.494494: bprint:      do_timer: updated jiffies_64=4295008340 ticks=1
>    hog-11980 [015]   341.494494: function:    __context_tracking_enter <-- prepare_exit_to_usermode
>    hog-11980 [015]   341.494494: function:    vtime_user_enter <-- __context_tracking_enter
>    hog-11980 [015]   341.494495: bprint:      vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339)
>    hog-11980 [015]   341.494495: function:    __vtime_account_system <-- vtime_user_enter
>    hog-11980 [015]   341.494495: bprint:      get_vtime_delta: vtime_snap=4295008339 now=4295008340
>    hog-11980 [015]   341.494495: function:    account_system_time <-- __vtime_account_system
>    hog-11980 [015]   341.494495: bprint:      account_system_time: cputime=995488
> <idle>-0     [000]   341.494497: function:    irq_exit <-- smp_apic_timer_interrupt
> 
> In this trace, we see the following:
> 
>  1. On CPU15, we transition from user-space to kernel-space because
>     of a timer interrupt (it's the tick)
> 
>  2. vtimer_delta() returns 0, because jiffies didn't change since the
>     last accounting
> 
>  3. While CPU15 is executing in kernel-space, jiffies is updated
>     by CPU0
> 
>  4. When going back to user-space, vtime_delta() returns non-zero
>     and the whole time is accounted for system time (observe how
>     the cputime parameter in account_system_time() is less than 1ms)
> 
> 
> After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN 
> to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to 
> jiffy based sampling. So the issue can happen if all CPUs fire their ticks 
> at the same time:
> 
> 
>                CPU 0                  CPU 1
>                -----                  -----
>                                       exit_user() // no cputime update 
> tick X         update_jiffies
>                                       enter_user() // cputime update
> 
> 
>                                       exit_user() // no cputime update
> tick X+1       update_jiffies
>                                       enter_user() // cputime update
> 
> The nohz_full CPU receives an interrupt at the same time the timer 
> interrupt fires on the housekeeping CPU.
> 
> This patch offsets the tick to avert all ticks alignment in order 
> that the vtime sampling does not end up "in phase" with the jiffies 
> incrementing.
> 
> Reported-by: Luiz Capitulino <lcapitulino@redhat.com>
> Suggested-by: Rik van Riel <riel@redhat.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Luiz Capitulino <lcapitulino@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>

Reviewed-and-Tested-by: Luiz Capitulino <lcapitulino@redhat.com>

Thanks for everyone who participated on tracking this down
and fixing it!

> ---
>  kernel/time/tick-sched.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..35d79f7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
>  	/* Get the next period (per-CPU) */
>  	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>  
> -	/* Offset the tick to avert jiffies_lock contention. */
> -	if (sched_skew_tick) {
> +	/*
> +	 * Offset the tick to avert jiffies_lock contention, and all ticks
> +	 * alignment in order that the vtime sampling does not end up "in
> +	 * phase" with the jiffies incrementing.
> +	 */
> +	if (sched_skew_tick || tick_nohz_full_enabled()) {
>  		u64 offset = ktime_to_ns(tick_period) >> 1;
>  		do_div(offset, num_possible_cpus());
>  		offset *= smp_processor_id();

  parent reply	other threads:[~2017-04-07 16:58 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-06  6:36 [PATCH] tick/nohz: Fix wrong user and system time accouting against vtime sampling Wanpeng Li
2017-04-06 14:36 ` Rik van Riel
2017-04-07 16:58 ` Luiz Capitulino [this message]
2017-04-10 15:45 ` Thomas Gleixner
2017-04-10 18:01   ` Frederic Weisbecker
2017-04-10 21:51     ` Wanpeng Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170407125801.503ae098@redhat.com \
    --to=lcapitulino@redhat.com \
    --cc=efault@gmx.de \
    --cc=fweisbec@gmail.com \
    --cc=kernellwp@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=wanpeng.li@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.