From: Luiz Capitulino <lcapitulino@redhat.com>
To: Rik van Riel <riel@redhat.com>
Cc: fweisbec@gmail.com, linux-kernel@vger.kernel.org,
linux-rt-users@vger.kernel.org
Subject: Re: [BUG nohz]: wrong user and system time accounting
Date: Thu, 23 Mar 2017 21:39:40 -0400 [thread overview]
Message-ID: <20170323213940.00377a6e@redhat.com> (raw)
In-Reply-To: <1490317718.8850.61.camel@redhat.com>
On Thu, 23 Mar 2017 21:08:38 -0400
Rik van Riel <riel@redhat.com> wrote:
> On Thu, 2017-03-23 at 21:05 -0400, Luiz Capitulino wrote:
> > On Thu, 23 Mar 2017 20:56:02 -0400
> > Rik van Riel <riel@redhat.com> wrote:
> >
> > > On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote:
> > > > When there are two or more tasks executing in user-space and
> > > > taking 100% of a nohz_full CPU, top reports 70% system time
> > > > and 30% user time utilization. Sometimes I'm even able to get
> > > > 100% system time and 0% user time.
> > > >
> > > > This was reproduced with latest Linus tree (093b995), but I
> > > > don't believe it's a regression (at least not a recent one)
> > > > as I can reproduce it with older kernels. Also, I have
> > > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> > > > without it yet.
> > > >
> > > > Below you'll find the steps to reproduce and some initial
> > > > analysis.
> > > >
> > > > Steps to reproduce
> > > > ------------------
> > > >
> > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> > > >
> > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> > > >
> > > > 3. Run top -d1 and check system time
> > > >
> > > > NOTE: When there's only one task hogging a nohz_full CPU, top
> > > > shows 100% user-time, as expected
> > > >
> > > > Initial analysis
> > > > ----------------
> > > >
> > > > When tracing vtime accounting functions and the user-space/kernel
> > > > transitions when the issue is taking place, I see several of the
> > > > following:
> > > >
> > > > hog-10552 [015] 1132.711104:
> > > > function: enter_from_user_mode <--
> > > > apic_timer_interrupt
> > > > hog-10552 [015] 1132.711105:
> > > > function: __context_tracking_exit <--
> > > > enter_from_user_mode
> > > > hog-10552 [015] 1132.711105:
> > > > bprint: __context_tracking_exit.part.4: new state=1
> > > > cur
> > > > state=1 active=1
> > > > hog-10552 [015] 1132.711105:
> > > > function: vtime_account_user <--
> > > > __context_tracking_exit.part.4
> > > > hog-10552 [015] 1132.711105:
> > > > function: smp_apic_timer_interrupt <--
> > > > apic_timer_interrupt
> > > > hog-10552 [015] 1132.711106: function: irq_enter <--
> > > > smp_apic_timer_interrupt
> > > > hog-10552 [015] 1132.711106:
> > > > function: tick_sched_timer
> > > > <-- __hrtimer_run_queues
> > > > hog-10552 [015] 1132.711108: function: irq_exit <--
> > > > smp_apic_timer_interrupt
> > > > hog-10552 [015] 1132.711108:
> > > > function: __context_tracking_enter <--
> > > > prepare_exit_to_usermode
> > > > hog-10552 [015] 1132.711108:
> > > > bprint: __context_tracking_enter.part.2: new
> > > > state=1
> > > > cur state=0 active=1
> > > > hog-10552 [015] 1132.711109:
> > > > function: vtime_user_enter
> > > > <-- __context_tracking_enter.part.2
> > > > hog-10552 [015] 1132.711109:
> > > > function: __vtime_account_system <-- vtime_user_enter
> > > > hog-10552 [015] 1132.711109:
> > > > function: account_system_time <--
> > > > __vtime_account_system
> > > >
> > > > On entering the kernel due to a timer interrupt,
> > > > vtime_account_user()
> > > > skips user-time accounting. Then later on when returning to user-
> > > > space,
> > > > vtime_user_enter() is probably accounting the whole time (ie.
> > > > user-
> > > > space
> > > > plus kernel-space) to system time.
> > > >
> > > > Now, when does vtime_account_user() skips accounting? Well, when
> > > > the
> > > > time delta is less then one jiffie. This would imply that
> > > > vtime_account_user()
> > > > is being called less than one jiffie since the last accounting,
> > > > but I
> > > > haven't
> > > > confirmed any of this yet.
> > >
> > > Jiffies should be advanced by the timer interrupt, on the
> > > housekeeping CPU, which is not doing context tracking.
> >
> > The hypothesis isn't that it wasn't advanced, but that we stayed in
> > user-space less than 1ms.
>
> That is part of the hypothesis. The other part of the hypothesis
> involves jiffies advancing on the nohz_full & isolated CPU while
> that CPU is in kernel mode 30% of the time.
OK.
> I have no good explanation for the latter yet...
>
> > > Why is the isolated/nohz_full CPU receiving timer interrupts
> > > at all?
> > >
> > > I thought it would not, but obviously I am wrong. What is
> > > going on here?
> >
> > There are two runnable SCHED_OTHER tasks on the nohz_full CPU. When
> > that happens, the tick is re-activated. We're not nohz_full anymore,
> > but accounting should still work.
>
> Isn't the scheduler tick distinct from the timer interrupt,
> or am I confused?
If you consider the scheduler tick to be the code that's run
by scheduler_tick(), yes they are distinct. But I was referring
to tick_sched_timer() the "main" tick handler. This one runs
as a hrtimer handler. In the case described in this email, the
timer interrupt fires because the nohz code sets up a hrtimer
to run (which is the tick, tick_sched_timer()).
Btw, _if_ the hypothesis is correct, I guess I might be able to
create a reproducer that doesn't depend on the tick. A task
staying 980us busy-looping in user-space and then making a
few dozen microseconds kernel call will probably report 100%
system time. This will be hard to do, but I'll give it try tomorrow.
next prev parent reply other threads:[~2017-03-24 1:39 UTC|newest]
Thread overview: 67+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-03-23 20:55 [BUG nohz]: wrong user and system time accounting Luiz Capitulino
2017-03-24 0:56 ` Rik van Riel
2017-03-24 1:05 ` Luiz Capitulino
2017-03-24 1:08 ` Rik van Riel
2017-03-24 1:39 ` Luiz Capitulino [this message]
2017-03-27 5:33 ` lkml
2017-03-24 1:52 ` Wanpeng Li
2017-03-24 3:56 ` Luiz Capitulino
2017-03-27 1:56 ` Wanpeng Li
2017-03-27 17:35 ` Rik van Riel
2017-03-28 7:19 ` Wanpeng Li
[not found] ` <20170328132406.7d23579c@redhat.com>
[not found] ` <20170328161454.4a5d9e8b@redhat.com>
2017-03-28 21:01 ` Rik van Riel
2017-03-28 21:26 ` Luiz Capitulino
2017-03-29 9:56 ` Wanpeng Li
2017-03-29 12:56 ` Frederic Weisbecker
2017-03-28 21:24 ` Rik van Riel
2017-03-28 21:30 ` Luiz Capitulino
[not found] ` <20170329131656.1d6cb743@redhat.com>
2017-03-29 20:08 ` Rik van Riel
2017-03-29 22:54 ` Frederic Weisbecker
2017-03-30 12:57 ` Rik van Riel
2017-03-30 1:58 ` Wanpeng Li
2017-03-30 12:40 ` Frederic Weisbecker
2017-03-30 13:19 ` Mike Galbraith
2017-03-30 4:27 ` Mike Galbraith
2017-03-30 6:47 ` Wanpeng Li
2017-03-30 11:52 ` Wanpeng Li
2017-03-30 12:33 ` Mike Galbraith
2017-03-30 13:38 ` Frederic Weisbecker
2017-03-30 13:59 ` Wanpeng Li
2017-03-30 14:18 ` Frederic Weisbecker
2017-03-30 21:25 ` Luiz Capitulino
2017-03-31 20:09 ` Luiz Capitulino
2017-03-31 23:24 ` Frederic Weisbecker
2017-04-01 3:11 ` Luiz Capitulino
2017-04-03 15:23 ` Frederic Weisbecker
2017-04-03 19:06 ` Luiz Capitulino
2017-04-04 17:36 ` Luiz Capitulino
2017-04-05 14:26 ` Rik van Riel
2017-04-11 11:03 ` Wanpeng Li
2017-04-11 11:36 ` Peter Zijlstra
2017-04-11 11:43 ` Wanpeng Li
2017-04-11 14:22 ` Thomas Gleixner
2017-04-12 13:18 ` Frederic Weisbecker
2017-04-12 14:57 ` Thomas Gleixner
2017-04-12 15:14 ` Frederic Weisbecker
2017-04-13 4:31 ` Wanpeng Li
2017-04-13 13:32 ` Frederic Weisbecker
2017-05-02 10:01 ` Wanpeng Li
2017-05-15 8:17 ` Wanpeng Li
2017-06-29 17:22 ` Frederic Weisbecker
2017-03-30 12:51 ` Frederic Weisbecker
2017-03-30 13:02 ` Rik van Riel
2017-03-30 13:35 ` Mike Galbraith
2017-04-03 14:40 ` Frederic Weisbecker
2017-04-04 7:32 ` Mike Galbraith
2017-03-30 13:44 ` Frederic Weisbecker
[not found] ` <20170329221700.GB23895@lerouge>
2017-03-29 22:46 ` Wanpeng Li
2017-03-30 2:14 ` Luiz Capitulino
2017-03-30 12:27 ` Wanpeng Li
2017-03-27 18:38 ` Luiz Capitulino
2017-03-28 5:28 ` Wanpeng Li
2017-03-28 13:44 ` Luiz Capitulino
2017-03-29 13:04 ` Frederic Weisbecker
2017-03-29 13:14 ` Rik van Riel
2017-03-29 13:23 ` Luiz Capitulino
2017-03-29 21:12 ` Frederic Weisbecker
2017-03-30 1:48 ` Luiz Capitulino
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170323213940.00377a6e@redhat.com \
--to=lcapitulino@redhat.com \
--cc=fweisbec@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rt-users@vger.kernel.org \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.