From: Konrad Rzeszutek Wilk <konrad@darnok.org>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: xen-devel@lists.xensource.com,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Joe Jin <joe.jin@oracle.com>,
Zhenzhong Duan <zhenzhong.duan@oracle.com>,
Jan Beulich <JBeulich@suse.com>, Laszlo Ersek <lersek@redhat.com>
Subject: Re: [PATCH] remove blocked time accounting from xen "clockchip"
Date: Thu, 19 Jan 2012 14:42:39 -0500 [thread overview]
Message-ID: <20120119194232.GA3728@konrad-lan> (raw)
In-Reply-To: <4EBC125A.70300@goop.org>
On Thu, Nov 10, 2011 at 10:05:14AM -0800, Jeremy Fitzhardinge wrote:
> On 11/09/2011 05:35 AM, Jan Beulich wrote:
> >>>> On 18.10.11 at 22:42, Laszlo Ersek <lersek@redhat.com> wrote:
> >> ... because the "clock_event_device framework" already accounts for idle
> >> time through the "event_handler" function pointer in
> >> xen_timer_interrupt().
> >>
> >> The patch is intended as the completion of [1]. It should fix the double
> >> idle times seen in PV guests' /proc/stat [2]. It should be orthogonal to
> >> stolen time accounting (the removed code seems to be isolated).
> > After some more looking around I still think it's incorrect, albeit for
> > a different reason: What tick_nohz_restart_sched_tick() accounts
> > as idle time is *all* time that passed while in cpu_idle(). What gets
> > accounted in do_stolen_accounting() (without your patch) is
> > different:
> > - time the vCPU was in RUNSTATE_blocked gets accounted as idle
> > - time the vCPU was in RUNSTATE_runnable and RUNSTATE_offline
> > gets accounted as stolen.
> >
> > That is, on an overcommitted system (and without your patch) I
> > would expect you to not see the (full) double idle increment for a not
> > fully idle and not fully loaded vCPU.
> >
> > Now we can of course say that for most purposes it's fine to
> > ignore the difference between idle and steal time, but there must
> > be a reason these have been getting accounted separately in Linux
> > without regard to Xen.
> >
> > So I really think that what needs to be suppressed to address this
> > is tick_nohz_restart_sched_tick()'s call to account_idle_ticks(). But
> > simply forcing CONFIG_VIRT_CPU_ACCOUNTING is not an immediate
> > option (not even when considering a Xen-only kernel), as that has
> > further implications. Instead I wonder whether under Xen we should
> > e.g. update per_cpu(tick_cpu_sched, cpu).idle_jiffies prior to calling
> > tick_nohz_restart_sched_tick() (possibly implicitly by doing the
> > update in do_stolen_accounting(), though that wouldn't work when
> > the wakeup is due to an interrupt other than the timer one).
> >
> > The alternative of accounting the negative of the steal time as
> > idle time in do_stolen_accounting() doesn't look correct either,
> > since not all stolen time was necessarily accounted as idle (some
> > would have got - also incorrectly - accounted as process or system
> > time).
> >
> > So my conclusion is that only the full implementation of what
> > CONFIG_VIRT_CPU_ACCOUNTING expects would actually get
> > things sorted out properly (but that would imply Xen *and* native
> > are willing to pay the performance price, or the enabling of this
> > can be made dynamic so that at least native could remain
> > unaffected).
> >
> > Or the negative stolen time should be accounted to the current
> > process, the system, or as idle, depending on the context which
> > do_stolen_accounting() runs in (just like timer_interrupt() did in
> > the XenoLinux kernels for positive accounting).
>
> I was always suspicious of the timer-interrupt-based stolen time
> accounting code. It's really a hold-over from when ticks were the main
> timekeeping mechanism, but with highres timers its massive overkill and
> probably a source of performance degradation.
>
> Overall, the stolen time accounting isn't really very important. The
> kernel doesn't use it at all internally - it doesn't affect scheduling
> decisions, for example. It's only exported in /proc/stat, and some
> tools like top will display it if its non-zero. It could be useful for
> diagnosing performance problems on a heavily loaded host system, but
> other tools like "xentop" would give you as much information.
>
> So really, all this code is nice to have, but I'm not sure its worth a
> lot of time to fix, especially if it makes idle accounting hard (which
> *is* important).
>
> However, that said, PeterZ recently added some code to the scheduler so
> that time "stolen" by interrupt handling isn't accounted against the
> task running at the time, and the design is specifically intended to
> also be useful for virtualization stolen time as well. There are some
> KVM patches floating around to implement this, and we should look at
I finally got some time to look at them and I think they are these ones:
git log --oneline e03b644fe68b1c6401465b02724d261538dba10f..3c404b578fab699c4708279938078d9404b255a4
3c404b5 KVM guest: Add a pv_ops stub for steal time
c9aaa89 KVM: Steal time implementation
9ddabbe KVM: KVM Steal time guest/host interface
4b6b35f KVM: Add constant to represent KVM MSRs enabled bit in guest/host interface
What is interesting is that they end up inserting a bunch of:
+ if (steal_account_process_tick())
+ return;
+
in irqtime_account_process_tick and in account_process_tick.
The steal_account_process_tick just makes a pv_time_ops.steal_clock
based on the asm goto of paravirt_steal_enabled key.
I think if we were to persue this we could rip out in the
'do_stolen_accounting' the call to 'accont_idle_tick' and
'account_idle_time' completly. In essence making that function
behave as a pv_time_ops.steal_clock implementation.
Also that would mean we could remove in the 'xen_timer_interrupt'
function the call to 'do_stolen_accounting'
I think this would mean we would account _only_ for the
RUNSTATE_runnable and RUNSTATE_offline as Laszlo's earlier patch
suggested?
And the RUNSTATE_blocked would be figured out by the existing
mechanism in the sched.c code?
> doing a Xen implementation. That would be much more practically useful,
> since (I think) it allows the scheduler to be aware of stolen time and
> not penalize tasks for time they never got to use. Or at the very least
> it could give you per-task stolen stats (maybe?) which would be useful.
>
> J
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
next prev parent reply other threads:[~2012-01-19 19:42 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-10-18 20:42 [PATCH] remove blocked time accounting from xen "clockchip" Laszlo Ersek
2011-10-19 7:51 ` Jan Beulich
2011-10-19 14:54 ` Laszlo Ersek
2011-10-20 14:35 ` Laszlo Ersek
2011-10-20 15:02 ` Laszlo Ersek
2011-10-26 20:52 ` Konrad Rzeszutek Wilk
2011-11-09 13:35 ` Jan Beulich
2011-11-09 17:47 ` Laszlo Ersek
2011-11-10 8:32 ` Jan Beulich
2011-11-10 18:05 ` Jeremy Fitzhardinge
2012-01-19 19:42 ` Konrad Rzeszutek Wilk [this message]
2012-01-20 9:57 ` Jan Beulich
2012-01-20 16:00 ` Konrad Rzeszutek Wilk
2012-01-23 22:07 ` Konrad Rzeszutek Wilk
2011-12-21 8:32 ` Jan Beulich
2011-12-21 13:53 ` Laszlo Ersek
2011-12-21 14:58 ` Jan Beulich
-- strict thread matches above, loose matches on Subject: below --
2011-12-22 8:49 Jan Beulich
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120119194232.GA3728@konrad-lan \
--to=konrad@darnok.org \
--cc=JBeulich@suse.com \
--cc=jeremy@goop.org \
--cc=joe.jin@oracle.com \
--cc=konrad.wilk@oracle.com \
--cc=lersek@redhat.com \
--cc=xen-devel@lists.xensource.com \
--cc=zhenzhong.duan@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).