From: Frederic Weisbecker <fweisbec@gmail.com>
To: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
Tony Luck <tony.luck@intel.com>,
Wanpeng Li <wanpeng.li@hotmail.com>,
Peter Zijlstra <peterz@infradead.org>,
Michael Ellerman <mpe@ellerman.id.au>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Thomas Gleixner <tglx@linutronix.de>,
Paul Mackerras <paulus@samba.org>, Ingo Molnar <mingo@kernel.org>,
Fenghua Yu <fenghua.yu@intel.com>, Rik van Riel <riel@redhat.com>,
Stanislaw Gruszka <sgruszka@redhat.com>
Subject: Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs
Date: Mon, 21 Nov 2016 17:20:06 +0100 [thread overview]
Message-ID: <20161121162003.GB7554@lerouge> (raw)
In-Reply-To: <20161121075956.2b36b3e3@mschwide>
On Mon, Nov 21, 2016 at 07:59:56AM +0100, Martin Schwidefsky wrote:
> On Fri, 18 Nov 2016 15:47:02 +0100
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > Just because some code isn't too complex doesn't mean we really want to keep it.
> > I get regular questions about what unit does cputime_t map to on a given
> > configuration. Everybody gets confused about that. On many of the
> > patches we got on cputime for the last years, I had to fix quite some issues
> > with bad granularity assumption. In fact most fixes that came to kernel/sched/cputime.c
> > recently, after merge or review, were about people getting confused with cputime_t granularity.
>
> These regular question you get about the cputime_t is exactly what I was referring
> to. If the value would just be a u64 the guys asking the question about cputime_t
> would just assume the value to be nano-seconds and then go ahead and break things.
Sure, replacing cputime_t with u64 without changing the unit wouldn't help. But changing
it to nsecs and expect people to deduce it from the u64 type sounds a good direction.
>
> > Especially for stats that come from nsecs clocks (steal and irqtime), we always have to maintain an
> > accumulator and make sure we don't lose some nanosec deltas.
>
> Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case.
Right.
>
> > And we have to maintain several workarounds, sometimes even in the fastpath in
> > order to cope with the cputime_t random granularity all over.
> >
> > Some fastpath examples:
> >
> > * steal time accounting (need to convert nsecs to cputime then back)
> > * irqtime accounting (maintain accumulators)
> > * cputime_adjust, used on any user read of cputime (need to convert from nsecs
> > to cputime on cputime_adjust)
> >
> > But the worst really is about maintainance. This patchset removes around 600 lines.
>
> Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from
> the generic cputime_jiffies.h and cputime_nsecs.h.
Well, still worth it :-)
> > > The do_account_vtime function is called once per jiffy and once per task
> > > switch. HZ is usually set to 100 for s390, the conversion once per jiffy
> > > would not be so bad, but the call on the scheduling path *will* hurt.
> >
> > I don't think we need to flush on task switch. If we maintain the accumulators
> > on the task/thread struct instead of per-cpu, then the remaining time after
> > task switch out will be accounted on next tick after after next task switch in.
>
> You can not properly calculate steal time if you allow sleeping tasks to sit on
> up to 5*HZ worth of cpu time.
Ah, you mean that when the task goes to sleep, we shouldn't miss more than one
tick worth of system/user time but the steal time can be much higher, right?
> I think we *have* to do accounting on task switch.
> At least on s390, likely on powerpc as well. Why not make that an option for
> the architecture with the yet-to-be-written accumulating code.
Ok, how about doing the accumulation and always account on task switch for now,
we'll see later if it's worth having such an option.
>
> > > What is even worse is the vtime_account_irq_enter path, that is call several
> > > times for each *interrupt*, at least two times for an interrupt without
> > > additional processing and four times if a softirq is triggered.
> >
> > Actually maintaining an accumulator to flush on ticks is probably going to increase
> > the perf because of that. account_system_time() is called twice per interrupt, and
> > such function do much more than just account the time to the task_struct and cpustat
> > fields. The same applies to userspace boundaries and context switch. The account_*_time()
> > functions can be expensive.
>
> The account_system_time twice per interrupt can be removed with the accumulation
> idea. We will have to see how expensive the accounting_xxx_time calls are on
> the context switch path.
Right.
>
> > >
> > > Now it has been proposed to implement lazy accounting to accumulate deltas
> > > and do the expensive conversions only infrequently. This is pretty straight-
> > > forward for account_user_time but to do this for the account_system_time
> > > function is more complicated. The function has to differentiate between
> > > guest/hardirq/softirq and pure system time. We would need to keep sums for
> > > each bucket and provide a separate function to add to each bucket. Like
> > > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > > account_system_time(). Then it is up to the arch code to sort out the details
> > > and call the accounting code once per jiffy for each of the buckets.
> >
> > That wouldn't be too hard really. The s390 code in vtime.c already does that.
>
> Yes, I agree that the accumulating change would not be too hard. Can I make the
> request that we try to get that done first before doing the cleanup ?
Of course. I see you started something, I'll be glad to help!
>
> > > We still have to do the whole thing on each task switch though.
> >
> > Not if we maintain the deltas in the task_struct.
> >
> > >
> > > But I am still not happy about the approach. What is the compelling reason
> > > for this change except for the "but it looks ugly"?
> >
> > The diffstat (600 lines removed). Also the fact that we have all these workarounds
> > in the core code just for the special case of 1 arch (s390) and a half
> > (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE).
> >
> > I'd much rather have all that complexity moved in a vtime_native.c shared by s390 and powerpc
> > that takes care of proper accumulation in cputime_t and flushes that on ticks in nsecs rather
> > than having all these cputime_t game all over the kernel.
>
> The goal to have nano-seconds only in the core code is a good one. And with the
> accumulator I think s390 can live with it. The change would have a real upside
> too. There are these stupid divisions for scaled cputime that we have to calculate
> for every call to account_xxx_time(). These would not be done for the interrupts
> anymore.
Exactly!
Thanks.
next prev parent reply other threads:[~2016-11-21 16:20 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-17 18:08 [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 01/36] jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 02/36] time: Introduce jiffies64_to_nsecs() Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 03/36] sched: Remove unused INIT_CPUTIME macro Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 04/36] cputime: Convert kcpustat to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 05/36] macintosh/rack-meter: Remove cputime_t internal use Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 06/36] cputime: Convert guest time accounting to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 07/36] cputime: Special API to return old-typed cputime Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 08/36] cputime: Convert task/group cputime to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 09/36] alpha: Convert obsolete cputime_t " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 10/36] x86: Convert obsolete cputime type " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 11/36] isdn: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 12/36] binfmt: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 13/36] acct: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 14/36] delaycct: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 15/36] tsacct: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 16/36] signal: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 17/36] cputime: Increment kcpustat directly on irqtime account Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 18/36] posix-timers: Use TICK_NSEC instead of a dynamically ad-hoc calculated version Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 19/36] posix-timers: Convert internals to use nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 20/36] itimer: Convert internal cputime_t units to nsec Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 21/36] sched: Remove temporary cputime_t accessors Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 22/36] cputime: Push time to account_user_time() in nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 23/36] cputime: Push time to account_steal_time() " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 24/36] cputime: Push time to account_idle_time() " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 25/36] cputime: Push time to account_system_time() " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 26/36] cputime: Complete nsec conversion of tick based accounting Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 27/36] vtime: Return nsecs instead of cputime_t to account Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 28/36] cputime: Remove jiffies based cputime Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 29/36] ia64: Move nsecs based cputime headers to the last arch using it Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 30/36] ia64: Convert vtime to use nsec units directly Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 31/36] ia64: Remove unused cputime definitions Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 32/36] s390: Make arch_cpu_idle_time() to return nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 33/36] powerpc: Remove unused cputime definitions Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 34/36] s390: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 35/36] cputime: Remove unused nsec_to_cputime Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 36/36] cputime: Remove asm generic headers Frederic Weisbecker
2016-11-18 12:08 ` [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs Martin Schwidefsky
2016-11-18 14:47 ` Frederic Weisbecker
2016-11-21 6:59 ` Martin Schwidefsky
2016-11-21 10:17 ` Martin Schwidefsky
2016-11-22 13:45 ` Frederic Weisbecker
2016-11-22 14:28 ` Martin Schwidefsky
2016-11-21 16:20 ` Frederic Weisbecker [this message]
2016-11-22 6:11 ` Martin Schwidefsky
2016-11-21 9:49 ` Ingo Molnar
2016-11-21 16:22 ` Frederic Weisbecker
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161121162003.GB7554@lerouge \
--to=fweisbec@gmail.com \
--cc=benh@kernel.crashing.org \
--cc=fenghua.yu@intel.com \
--cc=heiko.carstens@de.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=mpe@ellerman.id.au \
--cc=paulus@samba.org \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=schwidefsky@de.ibm.com \
--cc=sgruszka@redhat.com \
--cc=tglx@linutronix.de \
--cc=tony.luck@intel.com \
--cc=wanpeng.li@hotmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).