All of lore.kernel.org
 help / color / mirror / Atom feed
From: Frederic Weisbecker <fweisbec@gmail.com>
To: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Tony Luck <tony.luck@intel.com>,
	Wanpeng Li <wanpeng.li@hotmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul Mackerras <paulus@samba.org>, Ingo Molnar <mingo@kernel.org>,
	Fenghua Yu <fenghua.yu@intel.com>, Rik van Riel <riel@redhat.com>,
	Stanislaw Gruszka <sgruszka@redhat.com>
Subject: Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs
Date: Mon, 21 Nov 2016 17:20:06 +0100	[thread overview]
Message-ID: <20161121162003.GB7554@lerouge> (raw)
In-Reply-To: <20161121075956.2b36b3e3@mschwide>

On Mon, Nov 21, 2016 at 07:59:56AM +0100, Martin Schwidefsky wrote:
> On Fri, 18 Nov 2016 15:47:02 +0100
> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > Just because some code isn't too complex doesn't mean we really want to keep it.
> > I get regular questions about what unit does cputime_t map to on a given
> > configuration. Everybody gets confused about that. On many of the
> > patches we got on cputime for the last years, I had to fix quite some issues
> > with bad granularity assumption. In fact most fixes that came to kernel/sched/cputime.c
> > recently, after merge or review, were about people getting confused with cputime_t granularity.
> 
> These regular question you get about the cputime_t is exactly what I was referring
> to. If the value would just be a u64 the guys asking the question about cputime_t
> would just assume the value to be nano-seconds and then go ahead and break things.

Sure, replacing cputime_t with u64 without changing the unit wouldn't help. But changing
it to nsecs and expect people to deduce it from the u64 type sounds a good direction.

> 
> > Especially for stats that come from nsecs clocks (steal and irqtime), we always have to maintain an
> > accumulator and make sure we don't lose some nanosec deltas.
> 
> Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case.

Right.

> 
> > And we have to maintain several workarounds, sometimes even in the fastpath in
> > order to cope with the cputime_t random granularity all over.
> > 
> > Some fastpath examples:
> > 
> > * steal time accounting (need to convert nsecs to cputime then back)
> > * irqtime accounting (maintain accumulators)
> > * cputime_adjust, used on any user read of cputime (need to convert from nsecs
> >   to cputime on cputime_adjust)
> > 
> > But the worst really is about maintainance. This patchset removes around 600 lines.
> 
> Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from
> the generic cputime_jiffies.h and cputime_nsecs.h.

Well, still worth it :-)

> > > The do_account_vtime function is called once per jiffy and once per task
> > > switch. HZ is usually set to 100 for s390, the conversion once per jiffy
> > > would not be so bad, but the call on the scheduling path *will* hurt.
> > 
> > I don't think we need to flush on task switch. If we maintain the accumulators
> > on the task/thread struct instead of per-cpu, then the remaining time after
> > task switch out will be accounted on next tick after after next task switch in.
> 
> You can not properly calculate steal time if you allow sleeping tasks to sit on
> up to 5*HZ worth of cpu time.

Ah, you mean that when the task goes to sleep, we shouldn't miss more than one
tick worth of system/user time but the steal time can be much higher, right?

> I think we *have* to do accounting on task switch.
> At least on s390, likely on powerpc as well. Why not make that an option for
> the architecture with the yet-to-be-written accumulating code.

Ok, how about doing the accumulation and always account on task switch for now,
we'll see later if it's worth having such an option.

> 
> > > What is even worse is the vtime_account_irq_enter path, that is call several
> > > times for each *interrupt*, at least two times for an interrupt without
> > > additional processing and four times if a softirq is triggered.
> > 
> > Actually maintaining an accumulator to flush on ticks is probably going to increase
> > the perf because of that. account_system_time() is called twice per interrupt, and
> > such function do much more than just account the time to the task_struct and cpustat
> > fields. The same applies to userspace boundaries and context switch. The account_*_time()
> > functions can be expensive.
> 
> The account_system_time twice per interrupt can be removed with the accumulation
> idea. We will have to see how expensive the accounting_xxx_time calls are on
> the context switch path.

Right.

> 
> > > 
> > > Now it has been proposed to implement lazy accounting to accumulate deltas
> > > and do the expensive conversions only infrequently. This is pretty straight-
> > > forward for account_user_time but to do this for the account_system_time
> > > function is more complicated. The function has to differentiate between
> > > guest/hardirq/softirq and pure system time. We would need to keep sums for
> > > each bucket and provide a separate function to add to each bucket. Like
> > > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > > account_system_time(). Then it is up to the arch code to sort out the details
> > > and call the accounting code once per jiffy for each of the buckets.
> > 
> > That wouldn't be too hard really. The s390 code in vtime.c already does that.
> 
> Yes, I agree that the accumulating change would not be too hard. Can I make the
> request that we try to get that done first before doing the cleanup ?

Of course. I see you started something, I'll be glad to help!

>  
> > > We still have to do the whole thing on each task switch though.
> > 
> > Not if we maintain the deltas in the task_struct.
> > 
> > > 
> > > But I am still not happy about the approach. What is the compelling reason
> > > for this change except for the "but it looks ugly"?
> > 
> > The diffstat (600 lines removed). Also the fact that we have all these workarounds
> > in the core code just for the special case of 1 arch (s390) and a half
> > (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE).
> > 
> > I'd much rather have all that complexity moved in a vtime_native.c shared by s390 and powerpc
> > that takes care of proper accumulation in cputime_t and flushes that on ticks in nsecs rather
> > than having all these cputime_t game all over the kernel.
> 
> The goal to have nano-seconds only in the core code is a good one. And with the
> accumulator I think s390 can live with it. The change would have a real upside
> too. There are these stupid divisions for scaled cputime that we have to calculate
> for every call to account_xxx_time(). These would not be done for the interrupts
> anymore.

Exactly!


Thanks.

  parent reply	other threads:[~2016-11-21 16:20 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-17 18:08 [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 01/36] jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 02/36] time: Introduce jiffies64_to_nsecs() Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 03/36] sched: Remove unused INIT_CPUTIME macro Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 04/36] cputime: Convert kcpustat to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 05/36] macintosh/rack-meter: Remove cputime_t internal use Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 06/36] cputime: Convert guest time accounting to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 07/36] cputime: Special API to return old-typed cputime Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 08/36] cputime: Convert task/group cputime to nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 09/36] alpha: Convert obsolete cputime_t " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 10/36] x86: Convert obsolete cputime type " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 11/36] isdn: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 12/36] binfmt: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 13/36] acct: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 14/36] delaycct: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 15/36] tsacct: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 16/36] signal: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 17/36] cputime: Increment kcpustat directly on irqtime account Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 18/36] posix-timers: Use TICK_NSEC instead of a dynamically ad-hoc calculated version Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 19/36] posix-timers: Convert internals to use nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 20/36] itimer: Convert internal cputime_t units to nsec Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 21/36] sched: Remove temporary cputime_t accessors Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 22/36] cputime: Push time to account_user_time() in nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 23/36] cputime: Push time to account_steal_time() " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 24/36] cputime: Push time to account_idle_time() " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 25/36] cputime: Push time to account_system_time() " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 26/36] cputime: Complete nsec conversion of tick based accounting Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 27/36] vtime: Return nsecs instead of cputime_t to account Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 28/36] cputime: Remove jiffies based cputime Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 29/36] ia64: Move nsecs based cputime headers to the last arch using it Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 30/36] ia64: Convert vtime to use nsec units directly Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 31/36] ia64: Remove unused cputime definitions Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 32/36] s390: Make arch_cpu_idle_time() to return nsecs Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 33/36] powerpc: Remove unused cputime definitions Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 34/36] s390: " Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 35/36] cputime: Remove unused nsec_to_cputime Frederic Weisbecker
2016-11-17 18:08 ` [PATCH 36/36] cputime: Remove asm generic headers Frederic Weisbecker
2016-11-18 12:08 ` [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs Martin Schwidefsky
2016-11-18 14:47   ` Frederic Weisbecker
2016-11-21  6:59     ` Martin Schwidefsky
2016-11-21 10:17       ` Martin Schwidefsky
2016-11-22 13:45         ` Frederic Weisbecker
2016-11-22 14:28           ` Martin Schwidefsky
2016-11-21 16:20       ` Frederic Weisbecker [this message]
2016-11-22  6:11         ` Martin Schwidefsky
2016-11-21  9:49     ` Ingo Molnar
2016-11-21 16:22       ` Frederic Weisbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161121162003.GB7554@lerouge \
    --to=fweisbec@gmail.com \
    --cc=benh@kernel.crashing.org \
    --cc=fenghua.yu@intel.com \
    --cc=heiko.carstens@de.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=mpe@ellerman.id.au \
    --cc=paulus@samba.org \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=schwidefsky@de.ibm.com \
    --cc=sgruszka@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=wanpeng.li@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.