* [git pull request] scheduler updates
@ 2007-08-12 16:32 Ingo Molnar
2007-08-14 8:37 ` [accounting regression since rc1] " Christian Borntraeger
[not found] ` <200708141032.47235.borntraeger@de.ibm.com>
0 siblings, 2 replies; 40+ messages in thread
From: Ingo Molnar @ 2007-08-12 16:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel
Linus, please pull the latest scheduler git tree from:
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
three bugfixes:
- a nice fix from eagle-eye Oleg for a subtle typo in the balancing
code, the effect of this bug was more agressive idle balancing. This
bug was introduced by one of the original CFS commits.
- a round of global->static fixes from Adrian Bunk - this change,
besides the cleanup effect, chops 100 bytes off sched.o.
- Peter Zijlstra noticed a sleeper-bonus bug. I kept this patch under
observation and testing this past week and saw no ill effects so far.
It could fix two suspected regressions. (It could improve Kasper
Sandberg's workload and it could improve the sleeper/runner
problem/bug Roman Zippel was seeing.)
test-built and test-booted on x86-32 and x86-64, and did a dozen of
randconfig builds for good measure (which uncovered two new build errors
in latest -git).
Thanks,
Ingo
--------------->
Adrian Bunk (1):
sched: make global code static
Ingo Molnar (1):
sched: fix sleeper bonus
Oleg Nesterov (1):
sched: run_rebalance_domains: s/SCHED_IDLE/CPU_IDLE/
include/linux/cpu.h | 2 --
kernel/sched.c | 48 ++++++++++++++++++++++++------------------------
kernel/sched_fair.c | 12 ++++++------
3 files changed, 30 insertions(+), 32 deletions(-)
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [accounting regression since rc1] scheduler updates 2007-08-12 16:32 [git pull request] scheduler updates Ingo Molnar @ 2007-08-14 8:37 ` Christian Borntraeger 2007-08-16 8:17 ` [PATCH][RFC] Re: accounting regression since rc1 Christian Borntraeger 2007-08-20 15:45 ` [accounting regression since rc1] scheduler updates Ingo Molnar [not found] ` <200708141032.47235.borntraeger@de.ibm.com> 1 sibling, 2 replies; 40+ messages in thread From: Christian Borntraeger @ 2007-08-14 8:37 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras This is a 2nd try with correct email address, sorry for the duplicates. Am Sonntag, 12. August 2007 schrieb Ingo Molnar: > Linus, please pull the latest scheduler git tree from: Hello Ingo, this is a followup to the discussion in http://lkml.org/lkml/2007/7/19/538 Since 2.6.12, s390 already does precise accouting for system and user time. Depending on CONFIG_VIRT_CPU_ACCOUNTING, we use two 64bit hardware timers on s390: the first returns the wall clock time and is stepped even if the virtual cpu is not backed by a physical cpu. The second timer is only stepped, when the virtual cpu is backed by a physical cpu. The timers have a very high accurancy, and the architecture guarantees that bit 51 is increased by one/microsecond. We store both timers on each context switch, irq, syscall, and machinecheck in entry.S. The calculation are made in arch/s390/kernel/vtime.c in accouting_system_vtime and friends with microsecond accurracy. This is also used for irq accouting (see the definition of irq_enter). It basically boils down to precise numbers in the cpu stat and the utime/stime for processes as well as knowledge about time stolen by the hypervisor. With CFS the accounting was changed, and everything is now based on sum_exec_runtime. There is now an accounting regression on s390 (and maybe ppc64), as the default jiffy implemenation does not know anything about virtual cpus. While looking for a solution, I started with a very quick hack and reverted b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa for the procfs related changes. If I revert that commit, it seems that I get the old behaviour - but of course this is just a hack. I see some options now: 1. Jan could finish his sched_clock implementation for s390 and we would get close to the precise numbers. This would also let CFS make better decisions. Downside: its not as precise as before as we do some math on the numbers and it will burn cycles to compute numbers we already have (utime=sum*utime/stime). 2. set sum_exec_runtime based on the precise utime and stime. Dont know enough about CFS if this would show different scheduling behaviour than 1 3. ifdef fs/proc/array.c depending on CONFIG_VIRT_CPU_ACCOUNTING. This will save some cycles, and the numbers are precise to a microsecond. Downside: the scheduler gets no information about virtual cpus and steal time so its probably not completely fair 4. implement sched_clock AND reuse the exisiting utime and stime numbers. 5. other clever solutions I cannot see Any suggestions? Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH][RFC] Re: accounting regression since rc1 2007-08-14 8:37 ` [accounting regression since rc1] " Christian Borntraeger @ 2007-08-16 8:17 ` Christian Borntraeger 2007-08-20 15:45 ` [accounting regression since rc1] scheduler updates Ingo Molnar 1 sibling, 0 replies; 40+ messages in thread From: Christian Borntraeger @ 2007-08-16 8:17 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras Ingo, this patch fixes the accounting regression for CONFIG_VIRT_CPU_ACCOUNTING. It reverts parts of commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa by converting fs/proc/array.c back to cputime_t. The new functions task_utime and task_stime now return cputime_t instead of clock_t. If CONFIG_VIRT_CPU_ACCOUTING is set, task->utime and task->stime are returned directly instead of using sum_exec_runtime. Patch is tested on s390x with and without VIRT_CPU_ACCOUTING as well as on i386. Not tested on ppc64 - Paul? Feedback is welcome. Signed-Off-By: Christian Borntraeger <borntraeger@de.ibm.com> --- fs/proc/array.c | 41 ++++++++++++++++++++++++++--------------- 1 file changed, 26 insertions(+), 15 deletions(-) Index: linux-2.6/fs/proc/array.c =================================================================== --- linux-2.6.orig/fs/proc/array.c +++ linux-2.6/fs/proc/array.c @@ -320,7 +320,18 @@ int proc_pid_status(struct task_struct * return buffer - orig; } -static clock_t task_utime(struct task_struct *p) +#if defined(CONFIG_VIRT_CPU_ACCOUNTING) +static cputime_t task_utime(struct task_struct *p) +{ + return p->utime; +} + +static cputime_t task_stime(struct task_struct *p) +{ + return p->stime; +} +#else +static cputime_t task_utime(struct task_struct *p) { clock_t utime = cputime_to_clock_t(p->utime), total = utime + cputime_to_clock_t(p->stime); @@ -337,10 +348,10 @@ static clock_t task_utime(struct task_st } utime = (clock_t)temp; - return utime; + return clock_t_to_cputime(utime); } -static clock_t task_stime(struct task_struct *p) +static cputime_t task_stime(struct task_struct *p) { clock_t stime; @@ -349,10 +360,12 @@ static clock_t task_stime(struct task_st * the total, to make sure the total observed by userspace * grows monotonically - apps rely on that): */ - stime = nsec_to_clock_t(p->se.sum_exec_runtime) - task_utime(p); + stime = nsec_to_clock_t(p->se.sum_exec_runtime) - + cputime_to_clock_t(task_utime(p)); - return stime; + return clock_t_to_cputime(stime); } +#endif static int do_task_stat(struct task_struct *task, char *buffer, int whole) { @@ -368,8 +381,7 @@ static int do_task_stat(struct task_stru unsigned long long start_time; unsigned long cmin_flt = 0, cmaj_flt = 0; unsigned long min_flt = 0, maj_flt = 0; - cputime_t cutime, cstime; - clock_t utime, stime; + cputime_t cutime, cstime, utime, stime; unsigned long rsslim = 0; char tcomm[sizeof(task->comm)]; unsigned long flags; @@ -387,8 +399,7 @@ static int do_task_stat(struct task_stru sigemptyset(&sigign); sigemptyset(&sigcatch); - cutime = cstime = cputime_zero; - utime = stime = 0; + cutime = cstime = utime = stime = cputime_zero; rcu_read_lock(); if (lock_task_sighand(task, &flags)) { @@ -414,15 +425,15 @@ static int do_task_stat(struct task_stru do { min_flt += t->min_flt; maj_flt += t->maj_flt; - utime += task_utime(t); - stime += task_stime(t); + utime = cputime_add(utime, task_utime(t)); + stime = cputime_add(stime, task_stime(t)); t = next_thread(t); } while (t != task); min_flt += sig->min_flt; maj_flt += sig->maj_flt; - utime += cputime_to_clock_t(sig->utime); - stime += cputime_to_clock_t(sig->stime); + utime = cputime_add(utime, sig->utime); + stime = cputime_add(stime, sig->stime); } sid = signal_session(sig); @@ -471,8 +482,8 @@ static int do_task_stat(struct task_stru cmin_flt, maj_flt, cmaj_flt, - utime, - stime, + cputime_to_clock_t(utime), + cputime_to_clock_t(stime), cputime_to_clock_t(cutime), cputime_to_clock_t(cstime), priority, ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-14 8:37 ` [accounting regression since rc1] " Christian Borntraeger 2007-08-16 8:17 ` [PATCH][RFC] Re: accounting regression since rc1 Christian Borntraeger @ 2007-08-20 15:45 ` Ingo Molnar 2007-08-20 17:03 ` Martin Schwidefsky 2007-08-21 8:17 ` Christian Borntraeger 1 sibling, 2 replies; 40+ messages in thread From: Ingo Molnar @ 2007-08-20 15:45 UTC (permalink / raw) To: Christian Borntraeger Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > 1. Jan could finish his sched_clock implementation for s390 and we > would get close to the precise numbers. This would also let CFS make > better decisions. [...] i think this is the best option and it should give us the same /proc accuracy on s390 as before, plus improved scheduler precision. (and improved tracing accuracy, etc. etc.) Note that for architectures that already have sched_clock() at least as precise as the stime/utime stats there's no problem - and that seems to include all architectures except s390. could you send that precise sched_clock() patch? It should be an order of magnitude simpler than the high-precision stime/utime tracking you already do, and it's needed for quality scheduling anyway. > [...] Downside: its not as precise as before as we do some math on the > numbers and it will burn cycles to compute numbers we already have > (utime=sum*utime/stime). i can see no real downside to it: if all of stime, utime and sum_exec_clock are precise, then the numbers we present via /proc are precise too: sum_exec * utime / stime; there should be no loss of precision on s390 because the multiplication/division rounding is not accumulating - we keep the precise sum_exec, utime and stime values untouched. on x86 we dont really want to slow down every irq and syscall event with precise stime/utime stats for 'top' to display. On s390 the multiplication and division is indeed superfluous but it keeps the code generic for arches where utime/stime is less precise and irq-sampled - while the sum is always precise. It also animates architectures that have an imprecise sched_clock() implementation to improve its accuracy. Accessing the /proc files alone is many orders of magnitude more expensive than this simple multiplication and division. Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 15:45 ` [accounting regression since rc1] scheduler updates Ingo Molnar @ 2007-08-20 17:03 ` Martin Schwidefsky 2007-08-20 18:08 ` Ingo Molnar 2007-08-21 8:17 ` Christian Borntraeger 1 sibling, 1 reply; 40+ messages in thread From: Martin Schwidefsky @ 2007-08-20 17:03 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Mon, 2007-08-20 at 17:45 +0200, Ingo Molnar wrote: > * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > > 1. Jan could finish his sched_clock implementation for s390 and we > > would get close to the precise numbers. This would also let CFS make > > better decisions. [...] > > i think this is the best option and it should give us the same /proc > accuracy on s390 as before, plus improved scheduler precision. (and > improved tracing accuracy, etc. etc.) Note that for architectures that > already have sched_clock() at least as precise as the stime/utime stats > there's no problem - and that seems to include all architectures except > s390. For far we have used the TOD clock for sched_clock. This clocks measures real time with an accuracy of 1usec or better. The [us]time accounting with CONFIG_VIRT_CPU_ACCOUNTING=y is done using the CPU timer. This timer measures virtual time with an accuracy of 1usec of better. Without CONFIG_VIRT_CPU_ACCOUNTING the [us]time accounting is done with HZ ticks. Which means that sched_clock() is at least as precise as [us]time on s390 as well, only that we distinguish between real time / virtual time if the improved accounting is used. > could you send that precise sched_clock() patch? It should be an order > of magnitude simpler than the high-precision stime/utime tracking you > already do, and it's needed for quality scheduling anyway. Sure if you can explain what it should do. This is still unclear to me, for a non-idle CPU the virtual cpu time should be used but for an idle CPU the real time should be used ? That seems rather ill-defined to me. On s390 we have three times to consider, real time, virtual cpu time and steal time. For a given period we have real = virtual + steal. And if a cpu is idle we have real = steal, virtual = 0. My best interpretation of what you want is that sched_clock should progress with virtual cpu time if the current process is not idle and with the real time if it is. No ? > > [...] Downside: its not as precise as before as we do some math on the > > numbers and it will burn cycles to compute numbers we already have > > (utime=sum*utime/stime). > > i can see no real downside to it: if all of stime, utime and > sum_exec_clock are precise, then the numbers we present via /proc are > precise too: > > sum_exec * utime / stime; > > there should be no loss of precision on s390 because the > multiplication/division rounding is not accumulating - we keep the > precise sum_exec, utime and stime values untouched. But then sched_clock() has to return the virtual cpu time only, otherwise it will be hard to make sum_exec exact, wouldn't it? And why should we jump through all these loops to come up with values that are only as good as the values we already have? > on x86 we dont really want to slow down every irq and syscall event with > precise stime/utime stats for 'top' to display. On s390 the > multiplication and division is indeed superfluous but it keeps the code > generic for arches where utime/stime is less precise and irq-sampled - > while the sum is always precise. It also animates architectures that > have an imprecise sched_clock() implementation to improve its accuracy. > Accessing the /proc files alone is many orders of magnitude more > expensive than this simple multiplication and division. Yes, I can understand why you don't want to have the exact cpu accounting scheme on x86 since it will slow down every context switch quite a bit (that includes user <-> kernel, softirq <-> hardirq <-> process context, ..). On s390 the cost is acceptable, for an empty system call it is about 40 additional cycles for the precise accounting. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 17:03 ` Martin Schwidefsky @ 2007-08-20 18:08 ` Ingo Molnar 2007-08-20 18:33 ` Martin Schwidefsky ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Ingo Molnar @ 2007-08-20 18:08 UTC (permalink / raw) To: Martin Schwidefsky Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > > could you send that precise sched_clock() patch? It should be an > > order of magnitude simpler than the high-precision stime/utime > > tracking you already do, and it's needed for quality scheduling > > anyway. > > Sure if you can explain what it should do. This is still unclear to > me, for a non-idle CPU the virtual cpu time should be used but for an > idle CPU the real time should be used ? That seems rather ill-defined > to me. On s390 we have three times to consider, real time, virtual cpu > time and steal time. For a given period we have real = virtual + > steal. And if a cpu is idle we have real = steal, virtual = 0. My best > interpretation of what you want is that sched_clock should progress > with virtual cpu time if the current process is not idle and with the > real time if it is. No ? The core scheduler is invariant to sched_clock()'s behavior during idle periods [i.e. sched_clock() can do _anything_ during idle periods, and scheduling would not/schould not change], and that's the source of the uncertainty you noted. So the best way to proceed is still a bit unclear to me - but in any case, a few boundary conditions are already cast into stone: we definitely dont want to make life harder for s390 (without giving any tangible benefits) and we dont want to regress any existing precision of s390 either. We seem to agree wrt. sched_clock()'s behavior while the virtual CPU is busy: sched_clock() very much wants to track virtual time. (real time is pretty much meaningless and coupling sched_clock() to real time would make the virtual machine's behavior dependent on the host's load, which breaks the "seemless virtualization to inside observers" common-sense requirement of virtual-CPU scheduling.) For sched_clock()'s behavior while the virtual CPU is idle: my current idea for that is the patch below (a loosely analoguous problem exists with nohz/dynticks): it makes sched_clock() valid across idle periods too and uses wall-clock time for that. If a virtual CPU is idle then i think the "real = steal, virtual = 0" way of thinking about idle looks a bit unnatural to me - wouldnt it be better to think in terms of "steal = 0, virtual = real" ? Basically a virtual CPU can idle at "perfect speed", without the host "stealing" any cycles from it. And with that way of thinking, if s390 passed in the real-idle-time value to the new callbacks below it would all fall into place. Hm? that way we'd have a meaningful sched_clock() across idle periods too, useful for tracers, better scheduler debug-statistics, etc. Ingo ----------> Subject: sched: sched_clock_idle_[sleep|wakeup]_event() From: Ingo Molnar <mingo@elte.hu> construct a more or less wall-clock time out of sched_clock(), by using ACPI-idle's existing knowledge about how much time we spent idling. This allows the rq clock to work around TSC-stops-in-C2, TSC-gets-corrupted-in-C3 type of problems. ( Besides the scheduler's statistics this also benefits blktrace and printk-timestamps as well. ) Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup callbacks allow the scheduler to get out the most of the period where the CPU has a reliable TSC. This results in slightly more precise task statistics. the ACPI bits were acked by Len. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Len Brown <len.brown@intel.com> --- arch/i386/kernel/tsc.c | 1 - drivers/acpi/processor_idle.c | 32 +++++++++++++++++++++++++------- include/linux/sched.h | 3 ++- kernel/sched.c | 41 ++++++++++++++++++++++++++++++++--------- kernel/sched_debug.c | 3 ++- 5 files changed, 61 insertions(+), 19 deletions(-) Index: linux/arch/i386/kernel/tsc.c =================================================================== --- linux.orig/arch/i386/kernel/tsc.c +++ linux/arch/i386/kernel/tsc.c @@ -292,7 +292,6 @@ static struct clocksource clocksource_ts void mark_tsc_unstable(char *reason) { - sched_clock_unstable_event(); if (!tsc_unstable) { tsc_unstable = 1; tsc_enabled = 0; Index: linux/drivers/acpi/processor_idle.c =================================================================== --- linux.orig/drivers/acpi/processor_idle.c +++ linux/drivers/acpi/processor_idle.c @@ -63,6 +63,7 @@ ACPI_MODULE_NAME("processor_idle"); #define ACPI_PROCESSOR_FILE_POWER "power" #define US_TO_PM_TIMER_TICKS(t) ((t * (PM_TIMER_FREQUENCY/1000)) / 1000) +#define PM_TIMER_TICK_NS (1000000000ULL/PM_TIMER_FREQUENCY) #define C2_OVERHEAD 4 /* 1us (3.579 ticks per us) */ #define C3_OVERHEAD 4 /* 1us (3.579 ticks per us) */ static void (*pm_idle_save) (void) __read_mostly; @@ -462,6 +463,9 @@ static void acpi_processor_idle(void) * TBD: Can't get time duration while in C1, as resumes * go to an ISR rather than here. Need to instrument * base interrupt handler. + * + * Note: the TSC better not stop in C1, sched_clock() will + * skew otherwise. */ sleep_ticks = 0xFFFFFFFF; break; @@ -469,6 +473,8 @@ static void acpi_processor_idle(void) case ACPI_STATE_C2: /* Get start time (ticks) */ t1 = inl(acpi_gbl_FADT.xpm_timer_block.address); + /* Tell the scheduler that we are going deep-idle: */ + sched_clock_idle_sleep_event(); /* Invoke C2 */ acpi_state_timer_broadcast(pr, cx, 1); acpi_cstate_enter(cx); @@ -479,17 +485,22 @@ static void acpi_processor_idle(void) /* TSC halts in C2, so notify users */ mark_tsc_unstable("possible TSC halt in C2"); #endif + /* Compute time (ticks) that we were actually asleep */ + sleep_ticks = ticks_elapsed(t1, t2); + + /* Tell the scheduler how much we idled: */ + sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS); + /* Re-enable interrupts */ local_irq_enable(); + /* Do not account our idle-switching overhead: */ + sleep_ticks -= cx->latency_ticks + C2_OVERHEAD; + current_thread_info()->status |= TS_POLLING; - /* Compute time (ticks) that we were actually asleep */ - sleep_ticks = - ticks_elapsed(t1, t2) - cx->latency_ticks - C2_OVERHEAD; acpi_state_timer_broadcast(pr, cx, 0); break; case ACPI_STATE_C3: - /* * disable bus master * bm_check implies we need ARB_DIS @@ -518,6 +529,8 @@ static void acpi_processor_idle(void) t1 = inl(acpi_gbl_FADT.xpm_timer_block.address); /* Invoke C3 */ acpi_state_timer_broadcast(pr, cx, 1); + /* Tell the scheduler that we are going deep-idle: */ + sched_clock_idle_sleep_event(); acpi_cstate_enter(cx); /* Get end time (ticks) */ t2 = inl(acpi_gbl_FADT.xpm_timer_block.address); @@ -531,12 +544,17 @@ static void acpi_processor_idle(void) /* TSC halts in C3, so notify users */ mark_tsc_unstable("TSC halts in C3"); #endif + /* Compute time (ticks) that we were actually asleep */ + sleep_ticks = ticks_elapsed(t1, t2); + /* Tell the scheduler how much we idled: */ + sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS); + /* Re-enable interrupts */ local_irq_enable(); + /* Do not account our idle-switching overhead: */ + sleep_ticks -= cx->latency_ticks + C3_OVERHEAD; + current_thread_info()->status |= TS_POLLING; - /* Compute time (ticks) that we were actually asleep */ - sleep_ticks = - ticks_elapsed(t1, t2) - cx->latency_ticks - C3_OVERHEAD; acpi_state_timer_broadcast(pr, cx, 0); break; Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1388,7 +1388,8 @@ extern void sched_exec(void); #define sched_exec() {} #endif -extern void sched_clock_unstable_event(void); +extern void sched_clock_idle_sleep_event(void); +extern void sched_clock_idle_wakeup_event(u64 delta_ns); #ifdef CONFIG_HOTPLUG_CPU extern void idle_task_exit(void); Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -262,7 +262,8 @@ struct rq { s64 clock_max_delta; unsigned int clock_warps, clock_overflows; - unsigned int clock_unstable_events; + u64 idle_clock; + unsigned int clock_deep_idle_events; u64 tick_timestamp; atomic_t nr_iowait; @@ -556,18 +557,40 @@ static inline struct rq *this_rq_lock(vo } /* - * CPU frequency is/was unstable - start new by setting prev_clock_raw: + * We are going deep-idle (irqs are disabled): */ -void sched_clock_unstable_event(void) +void sched_clock_idle_sleep_event(void) { - unsigned long flags; - struct rq *rq; + struct rq *rq = cpu_rq(smp_processor_id()); - rq = task_rq_lock(current, &flags); - rq->prev_clock_raw = sched_clock(); - rq->clock_unstable_events++; - task_rq_unlock(rq, &flags); + spin_lock(&rq->lock); + __update_rq_clock(rq); + spin_unlock(&rq->lock); + rq->clock_deep_idle_events++; +} +EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event); + +/* + * We just idled delta nanoseconds (called with irqs disabled): + */ +void sched_clock_idle_wakeup_event(u64 delta_ns) +{ + struct rq *rq = cpu_rq(smp_processor_id()); + u64 now = sched_clock(); + + rq->idle_clock += delta_ns; + /* + * Override the previous timestamp and ignore all + * sched_clock() deltas that occured while we idled, + * and use the PM-provided delta_ns to advance the + * rq clock: + */ + spin_lock(&rq->lock); + rq->prev_clock_raw = now; + rq->clock += delta_ns; + spin_unlock(&rq->lock); } +EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event); /* * resched_task - mark a task 'to be rescheduled now'. Index: linux/kernel/sched_debug.c =================================================================== --- linux.orig/kernel/sched_debug.c +++ linux/kernel/sched_debug.c @@ -154,10 +154,11 @@ static void print_cpu(struct seq_file *m P(next_balance); P(curr->pid); P(clock); + P(idle_clock); P(prev_clock_raw); P(clock_warps); P(clock_overflows); - P(clock_unstable_events); + P(clock_deep_idle_events); P(clock_max_delta); P(cpu_load[0]); P(cpu_load[1]); ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 18:08 ` Ingo Molnar @ 2007-08-20 18:33 ` Martin Schwidefsky 2007-08-20 19:00 ` Balbir Singh ` (3 more replies) 2007-08-20 23:07 ` Paul Mackerras 2007-08-21 2:18 ` Andi Kleen 2 siblings, 4 replies; 40+ messages in thread From: Martin Schwidefsky @ 2007-08-20 18:33 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote: > For sched_clock()'s behavior while the virtual CPU is idle: my current > idea for that is the patch below (a loosely analoguous problem exists > with nohz/dynticks): it makes sched_clock() valid across idle periods > too and uses wall-clock time for that. Ok, that would mean that sched_clock can just return the virtual cpu time and the two hooks starts and stops the idle periods as far as the scheduler is concerned. In this case we can use the patch from Jan with the new implementation for sched_clock and add the two hooks to the places where the cpu-idle notifiers are done (do_monitor_call and default_idle). In fact this could be an idle-notifier. Hmm, I take a closer look tomorrow when I'm back at the office. > If a virtual CPU is idle then i think the "real = steal, virtual = 0" > way of thinking about idle looks a bit unnatural to me - wouldnt it be > better to think in terms of "steal = 0, virtual = real" ? Basically a > virtual CPU can idle at "perfect speed", without the host "stealing" any > cycles from it. And with that way of thinking, if s390 passed in the > real-idle-time value to the new callbacks below it would all fall into > place. Hm? How you think about an idle cpu depends on your viewpoint. The source for the virtual cpu time on s390 is the cpu timer. This timer is stopped when a virtual cpu looses the physical cpu, so it seems natural to me to think that real=steal, virtual=0 because the cpu timer is stopped while the cpu is idle. The other way of thinking about it is as valid though. > that way we'd have a meaningful sched_clock() across idle periods too, > useful for tracers, better scheduler debug-statistics, etc. That would be good. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 18:33 ` Martin Schwidefsky @ 2007-08-20 19:00 ` Balbir Singh 2007-08-20 19:05 ` Ingo Molnar ` (2 subsequent siblings) 3 siblings, 0 replies; 40+ messages in thread From: Balbir Singh @ 2007-08-20 19:00 UTC (permalink / raw) To: schwidefsky Cc: Ingo Molnar, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras Martin Schwidefsky wrote: > On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote: >> For sched_clock()'s behavior while the virtual CPU is idle: my current >> idea for that is the patch below (a loosely analoguous problem exists >> with nohz/dynticks): it makes sched_clock() valid across idle periods >> too and uses wall-clock time for that. > > Ok, that would mean that sched_clock can just return the virtual cpu > time and the two hooks starts and stops the idle periods as far as the > scheduler is concerned. In this case we can use the patch from Jan with > the new implementation for sched_clock and add the two hooks to the > places where the cpu-idle notifiers are done (do_monitor_call and > default_idle). In fact this could be an idle-notifier. Hmm, I take a > closer look tomorrow when I'm back at the office. > <snip> I am partially responsible for the regression. While working on the CPU accounting change, I for some unknown reason always assumed that sched_clock() was virtualized. I should have taken a closer look. Ingo, with this new approach, sched_clock() although not virtualized, advances as if it is (due to the idle state change accounting). I have one question though, what if the underlying CPU is forcefully scheduled out from the virtual CPU? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 18:33 ` Martin Schwidefsky 2007-08-20 19:00 ` Balbir Singh @ 2007-08-20 19:05 ` Ingo Molnar 2007-08-21 7:20 ` Christian Borntraeger 2007-08-20 19:12 ` Ingo Molnar 2007-08-21 7:00 ` Christian Borntraeger 3 siblings, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-20 19:05 UTC (permalink / raw) To: Martin Schwidefsky Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote: > > For sched_clock()'s behavior while the virtual CPU is idle: my current > > idea for that is the patch below (a loosely analoguous problem exists > > with nohz/dynticks): it makes sched_clock() valid across idle periods > > too and uses wall-clock time for that. > > Ok, that would mean that sched_clock can just return the virtual cpu > time and the two hooks starts and stops the idle periods as far as the > scheduler is concerned. In this case we can use the patch from Jan > with the new implementation for sched_clock and add the two hooks to > the places where the cpu-idle notifiers are done (do_monitor_call and > default_idle). In fact this could be an idle-notifier. Hmm, I take a > closer look tomorrow when I'm back at the office. ok. Just to make it sure wrt. release-management: you said s390 sched_clock() is currently at least as precise as stime/utime - so this would suggest that there is no regression over v2.6.22? Regardless of whether it's a live regression or not, i think we want the nohz improvement (and the s390 patch if the callbacks are OK to you) in .23, and we want to migrate all users of "raw" sched_clock() [blktrace, softlockup-detector, print-timestamps, etc.] over to the better cpu_clock() interface. Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 19:05 ` Ingo Molnar @ 2007-08-21 7:20 ` Christian Borntraeger 0 siblings, 0 replies; 40+ messages in thread From: Christian Borntraeger @ 2007-08-21 7:20 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras Am Montag, 20. August 2007 schrieb Ingo Molnar: > ok. Just to make it sure wrt. release-management: you said s390 > sched_clock() is currently at least as precise as stime/utime - so this > would suggest that there is no regression over v2.6.22? Regardless of On current git s390 has a sched_clock implementaton based on real time. That means we have no knowledge about steal time. E.g. if you only get 50% of your physial cpu from the hypervisor on 2.6.22 a single cpu bound process has 50% in top while on 2.6.23-rc top shows 100%, so yes, there is a regression. Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 18:33 ` Martin Schwidefsky 2007-08-20 19:00 ` Balbir Singh 2007-08-20 19:05 ` Ingo Molnar @ 2007-08-20 19:12 ` Ingo Molnar 2007-08-21 7:00 ` Christian Borntraeger 3 siblings, 0 replies; 40+ messages in thread From: Ingo Molnar @ 2007-08-20 19:12 UTC (permalink / raw) To: Martin Schwidefsky Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > > If a virtual CPU is idle then i think the "real = steal, virtual = > > 0" way of thinking about idle looks a bit unnatural to me - wouldnt > > it be better to think in terms of "steal = 0, virtual = real" ? > > Basically a virtual CPU can idle at "perfect speed", without the > > host "stealing" any cycles from it. And with that way of thinking, > > if s390 passed in the real-idle-time value to the new callbacks > > below it would all fall into place. Hm? > > How you think about an idle cpu depends on your viewpoint. The source > for the virtual cpu time on s390 is the cpu timer. This timer is > stopped when a virtual cpu looses the physical cpu, so it seems > natural to me to think that real=steal, virtual=0 because the cpu > timer is stopped while the cpu is idle. The other way of thinking > about it is as valid though. my thinking is this: the structure of "idle time" only matters if it can be observed from "within" a virtual machine - via timers. Are on s390 any of the typical app-visible timers (timer_list, etc.) driven by the virtual tick? [which slows down if a virtual CPU is scheduled away by the host/monitor/hypervisor?] Or is the virtual tick only affecting scheduling/cpu-accounting statistics in essence? Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 18:33 ` Martin Schwidefsky ` (2 preceding siblings ...) 2007-08-20 19:12 ` Ingo Molnar @ 2007-08-21 7:00 ` Christian Borntraeger 2007-08-21 9:18 ` Martin Schwidefsky 3 siblings, 1 reply; 40+ messages in thread From: Christian Borntraeger @ 2007-08-21 7:00 UTC (permalink / raw) To: schwidefsky Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras Am Montag, 20. August 2007 schrieb Martin Schwidefsky: > On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote: > Ok, that would mean that sched_clock can just return the virtual cpu > time and the two hooks starts and stops the idle periods as far as the > scheduler is concerned. In this case we can use the patch from Jan with > the new implementation for sched_clock and add the two hooks to the > places where the cpu-idle notifiers are done (do_monitor_call and > default_idle). In fact this could be an idle-notifier. Hmm, I take a > closer look tomorrow when I'm back at the office. > > > If a virtual CPU is idle then i think the "real = steal, virtual = 0" > > way of thinking about idle looks a bit unnatural to me - wouldnt it be > > better to think in terms of "steal = 0, virtual = real" ? Basically a > > virtual CPU can idle at "perfect speed", without the host "stealing" any > > cycles from it. And with that way of thinking, if s390 passed in the > > real-idle-time value to the new callbacks below it would all fall into > > place. Hm? Martin, I think we already do something like this. If you look at cpustat in 2.6.22 and earlier we already have steal increase = 0, idle increase = 100 % on an idle cpu, even on s390. So while from the hardware perspective steal is growing, we do the right thing in Linux, no? Chrisian ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 7:00 ` Christian Borntraeger @ 2007-08-21 9:18 ` Martin Schwidefsky 0 siblings, 0 replies; 40+ messages in thread From: Martin Schwidefsky @ 2007-08-21 9:18 UTC (permalink / raw) To: Christian Borntraeger Cc: Ingo Molnar, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Tue, 2007-08-21 at 09:00 +0200, Christian Borntraeger wrote: > Am Montag, 20. August 2007 schrieb Martin Schwidefsky: > > On Mon, 2007-08-20 at 20:08 +0200, Ingo Molnar wrote: > > Ok, that would mean that sched_clock can just return the virtual cpu > > time and the two hooks starts and stops the idle periods as far as the > > scheduler is concerned. In this case we can use the patch from Jan with > > the new implementation for sched_clock and add the two hooks to the > > places where the cpu-idle notifiers are done (do_monitor_call and > > default_idle). In fact this could be an idle-notifier. Hmm, I take a > > closer look tomorrow when I'm back at the office. > > > > > If a virtual CPU is idle then i think the "real = steal, virtual = 0" > > > way of thinking about idle looks a bit unnatural to me - wouldnt it be > > > better to think in terms of "steal = 0, virtual = real" ? Basically a > > > virtual CPU can idle at "perfect speed", without the host "stealing" any > > > cycles from it. And with that way of thinking, if s390 passed in the > > > real-idle-time value to the new callbacks below it would all fall into > > > place. Hm? > > Martin, > > I think we already do something like this. If you look at cpustat in 2.6.22 > and earlier we already have steal increase = 0, idle increase = 100 % on an > idle cpu, even on s390. So while from the hardware perspective steal is > growing, we do the right thing in Linux, no? This is done in kernel/sched.c:account_steal_time(). If the architecture backend reports steal time for idle it is accounted as idle time. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 18:08 ` Ingo Molnar 2007-08-20 18:33 ` Martin Schwidefsky @ 2007-08-20 23:07 ` Paul Mackerras 2007-08-21 2:18 ` Andi Kleen 2 siblings, 0 replies; 40+ messages in thread From: Paul Mackerras @ 2007-08-20 23:07 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens Ingo Molnar writes: > We seem to agree wrt. sched_clock()'s behavior while the virtual CPU is > busy: sched_clock() very much wants to track virtual time. (real time is > pretty much meaningless and coupling sched_clock() to real time would > make the virtual machine's behavior dependent on the host's load, which > breaks the "seemless virtualization to inside observers" common-sense > requirement of virtual-CPU scheduling.) OK, that would imply that virtualized powerpc machines want to use the PURR register for sched_clock(). The PURR basically counts time that this virtual CPU (SMT thread) spends dispatching instructions, so it excludes both the time taken by the hypervisor and the time (dispatch cycles) taken by the other thread. > For sched_clock()'s behavior while the virtual CPU is idle: my current > idea for that is the patch below (a loosely analoguous problem exists > with nohz/dynticks): it makes sched_clock() valid across idle periods > too and uses wall-clock time for that. The straightforward thing is just to use the PURR all the time, even during idle. What that means is this: * If the other thread is active then sched_clock() will continue to advance at a slow rate during idle (reflecting the fact that the active thread is getting almost all of the dispatch cycles). * If the other thread is idle and the partition is a "shared processor" partition then sched_clock() not advance since in that case we idle in the hypervisor. * If the other thread is idle and the partition is a "dedicated processor" partition then sched_clock() will advance at half speed on both threads, since the two threads each get half of the dispatch cycles. It sounds like this behaviour should be OK - do you agree? > If a virtual CPU is idle then i think the "real = steal, virtual = 0" > way of thinking about idle looks a bit unnatural to me - wouldnt it be > better to think in terms of "steal = 0, virtual = real" ? Basically a Stolen time while the virtual CPU is idle gets (or at least, used to get :) accounted as idle time. Paul. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 18:08 ` Ingo Molnar 2007-08-20 18:33 ` Martin Schwidefsky 2007-08-20 23:07 ` Paul Mackerras @ 2007-08-21 2:18 ` Andi Kleen 2007-08-21 7:09 ` Ingo Molnar 2 siblings, 1 reply; 40+ messages in thread From: Andi Kleen @ 2007-08-21 2:18 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras Ingo Molnar <mingo@elte.hu> writes: You should just be using idle notifiers instead instead of adding more and more custom hooks (like NOHZ has already) x86-64 still has them and there is a old patch around to add them to i386. -Andi ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 2:18 ` Andi Kleen @ 2007-08-21 7:09 ` Ingo Molnar 2007-08-21 10:07 ` Andi Kleen 0 siblings, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 7:09 UTC (permalink / raw) To: Andi Kleen Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Andi Kleen <andi@firstfloor.org> wrote: > You should just be using idle notifiers instead instead of adding more > and more custom hooks (like NOHZ has already) x86-64 still has them > and there is a old patch around to add them to i386. these are specially placed callbacks that we want to call from certain ACPI codepaths but not from all of them. Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 7:09 ` Ingo Molnar @ 2007-08-21 10:07 ` Andi Kleen 2007-08-21 10:20 ` Ingo Molnar 0 siblings, 1 reply; 40+ messages in thread From: Andi Kleen @ 2007-08-21 10:07 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Tue, Aug 21, 2007 at 09:09:22AM +0200, Ingo Molnar wrote: > > * Andi Kleen <andi@firstfloor.org> wrote: > > > You should just be using idle notifiers instead instead of adding more > > and more custom hooks (like NOHZ has already) x86-64 still has them > > and there is a old patch around to add them to i386. > > these are specially placed callbacks that we want to call from certain > ACPI codepaths but not from all of them. Because you believe TSC only stops in C2 and C3? That's not correct on all systems. -Andi ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 10:07 ` Andi Kleen @ 2007-08-21 10:20 ` Ingo Molnar 2007-08-21 11:15 ` Andi Kleen 0 siblings, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 10:20 UTC (permalink / raw) To: Andi Kleen Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Andi Kleen <andi@firstfloor.org> wrote: > On Tue, Aug 21, 2007 at 09:09:22AM +0200, Ingo Molnar wrote: > > > > * Andi Kleen <andi@firstfloor.org> wrote: > > > > > You should just be using idle notifiers instead instead of adding > > > more and more custom hooks (like NOHZ has already) x86-64 still > > > has them and there is a old patch around to add them to i386. > > > > these are specially placed callbacks that we want to call from > > certain ACPI codepaths but not from all of them. > > Because you believe TSC only stops in C2 and C3? That's not correct on > all systems. i know there are some incredibly broken (but rare) boxes where the bios will report it only knows C1 and do C2? Is that the case you are referring to, or is there something else too? Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 10:20 ` Ingo Molnar @ 2007-08-21 11:15 ` Andi Kleen 2007-08-21 11:20 ` Ingo Molnar 0 siblings, 1 reply; 40+ messages in thread From: Andi Kleen @ 2007-08-21 11:15 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras > i know there are some incredibly broken (but rare) boxes where the bios > will report it only knows C1 and do C2? Is that the case you are > referring to, or is there something else too? There are first a couple of older and not so old (Centaur) chips that generally stop the TSC in C1. And also some boxes who shouldn't have anything deeper than C2 have trouble with the TSC. For example I got a Opteron machine (which definitely shouldn't have any C2 since it's two socket) where the TSC appears to stop or at least slow down a lot in C1. And thirdly it's just unclean to add all kinds of custom hooks there. It was already ugly in NOHZ, please don't continue that. -Andi ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 11:15 ` Andi Kleen @ 2007-08-21 11:20 ` Ingo Molnar 0 siblings, 0 replies; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 11:20 UTC (permalink / raw) To: Andi Kleen Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Andi Kleen <andi@firstfloor.org> wrote: > > i know there are some incredibly broken (but rare) boxes where the bios > > will report it only knows C1 and do C2? Is that the case you are > > referring to, or is there something else too? > > There are first a couple of older and not so old (Centaur) chips that > generally stop the TSC in C1. there's not much we can do about them: the ACPI code doesnt measure their idle time, right? This is mostly for statistics purposes, so unless "broken" means tons of boxes, we dont have to have 100% coverage. > And also some boxes who shouldn't have anything deeper than C2 have > trouble with the TSC. For example I got a Opteron machine (which > definitely shouldn't have any C2 since it's two socket) where the TSC > appears to stop or at least slow down a lot in C1. how much is "a lot" in C1? There's an AMD TSC-slows-down-C1 erratum but that should be less than 1%. (which is fine enough for idle time measurements) > And thirdly it's just unclean to add all kinds of custom hooks there. > It was already ugly in NOHZ, please don't continue that. i'm not opposed to the idle notifiers but iirc the idle notifiers caused problems in themselves so part of them were reverted. We can do this more cleanly in .24 - it will make the benefits of the notifier cleanup even more apparent. Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-20 15:45 ` [accounting regression since rc1] scheduler updates Ingo Molnar 2007-08-20 17:03 ` Martin Schwidefsky @ 2007-08-21 8:17 ` Christian Borntraeger 2007-08-21 8:42 ` Ingo Molnar 2007-08-21 11:25 ` Ingo Molnar 1 sibling, 2 replies; 40+ messages in thread From: Christian Borntraeger @ 2007-08-21 8:17 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras Am Montag, 20. August 2007 schrieb Ingo Molnar: > could you send that precise sched_clock() patch? It should be an order > of magnitude simpler than the high-precision stime/utime tracking you > already do, and it's needed for quality scheduling anyway. I have a question about that. I just played with sched_clock, and even when I intentionally slow down sched_clock by a factor of 2, my cpu bound process gets 100 % in top. If this is intentional, I dont understand how a virtualized sched_clock would fix the accounting change? Thanks Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 8:17 ` Christian Borntraeger @ 2007-08-21 8:42 ` Ingo Molnar 2007-08-21 9:11 ` Martin Schwidefsky 2007-08-21 11:25 ` Ingo Molnar 1 sibling, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 8:42 UTC (permalink / raw) To: Christian Borntraeger Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > Am Montag, 20. August 2007 schrieb Ingo Molnar: > > could you send that precise sched_clock() patch? It should be an order > > of magnitude simpler than the high-precision stime/utime tracking you > > already do, and it's needed for quality scheduling anyway. > > I have a question about that. I just played with sched_clock, and even > when I intentionally slow down sched_clock by a factor of 2, my cpu > bound process gets 100 % in top. If this is intentional, I dont > understand how a virtualized sched_clock would fix the accounting > change? hm, does on s390 scheduler_tick() get driven in virtual time or in real time? The very latest scheduler code will enforce a minimum rate of sched_clock() across two scheduler_tick() calls (in rc3 and later kernels). If sched_clock() "slows down" but scheduler_tick() still has a real-time frequency then that impacts the quality of scheduling. So scheduler_tick() and sched_clock() must really have the same behavior (either both are virtual or both are real), so that scheduling becomes invariant to steal-time. Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 8:42 ` Ingo Molnar @ 2007-08-21 9:11 ` Martin Schwidefsky 2007-08-21 9:34 ` Ingo Molnar 0 siblings, 1 reply; 40+ messages in thread From: Martin Schwidefsky @ 2007-08-21 9:11 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Tue, 2007-08-21 at 10:42 +0200, Ingo Molnar wrote: > * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > > Am Montag, 20. August 2007 schrieb Ingo Molnar: > > > could you send that precise sched_clock() patch? It should be an order > > > of magnitude simpler than the high-precision stime/utime tracking you > > > already do, and it's needed for quality scheduling anyway. > > > > I have a question about that. I just played with sched_clock, and even > > when I intentionally slow down sched_clock by a factor of 2, my cpu > > bound process gets 100 % in top. If this is intentional, I dont > > understand how a virtualized sched_clock would fix the accounting > > change? > > hm, does on s390 scheduler_tick() get driven in virtual time or in real > time? The very latest scheduler code will enforce a minimum rate of > sched_clock() across two scheduler_tick() calls (in rc3 and later > kernels). If sched_clock() "slows down" but scheduler_tick() still has a > real-time frequency then that impacts the quality of scheduling. So > scheduler_tick() and sched_clock() must really have the same behavior > (either both are virtual or both are real), so that scheduling becomes > invariant to steal-time. scheduler_tick() is based on the HZ timer which uses the TOD clock = real time. sched_clock() currently uses the TOD clock as well so in regard to the new scheduler we currently do not have a problem. We have a problem with cpu time accounting, the change to the /proc code breaks the precise accounting on s390. To solve the cpu time accounting we need to change sched_clock() to the cpu timer = virtual time. To change the scheduler_tick() as well requires another patch and I fear it would complicate things in the s390 backend. And if you say that the scheduling becomes invariant to steal-time, how is the cpu time accounting via sum_exec supposed to work if it does not take steal-time into account ? -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 9:11 ` Martin Schwidefsky @ 2007-08-21 9:34 ` Ingo Molnar 2007-08-21 9:48 ` Paul Mackerras ` (3 more replies) 0 siblings, 4 replies; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 9:34 UTC (permalink / raw) To: Martin Schwidefsky Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > > hm, does on s390 scheduler_tick() get driven in virtual time or in > > real time? The very latest scheduler code will enforce a minimum > > rate of sched_clock() across two scheduler_tick() calls (in rc3 and > > later kernels). If sched_clock() "slows down" but scheduler_tick() > > still has a real-time frequency then that impacts the quality of > > scheduling. So scheduler_tick() and sched_clock() must really have > > the same behavior (either both are virtual or both are real), so > > that scheduling becomes invariant to steal-time. > > scheduler_tick() is based on the HZ timer which uses the TOD clock = > real time. sched_clock() currently uses the TOD clock as well so in > regard to the new scheduler we currently do not have a problem. We > have a problem with cpu time accounting, the change to the /proc code > breaks the precise accounting on s390. To solve the cpu time > accounting we need to change sched_clock() to the cpu timer = virtual > time. To change the scheduler_tick() as well requires another patch > and I fear it would complicate things in the s390 backend. my feeling is that it gives us generally higher-quality scheduling if we drive all things scheduler via virtual time. Do you agree with that? > And if you say that the scheduling becomes invariant to steal-time, > how is the cpu time accounting via sum_exec supposed to work if it > does not take steal-time into account ? right now there are two distinct and independent things: scheduler behavior (the scheduling decisions the scheduler makes) and accounting behavior. the 'invariant' i mentioned only covers scheduler behavior, not accounting behavior. Accounting is separate in theory, but coupled in practice now via sum_exec_runtime. Before we do a patch to decouple them again, lets make sure we agree on the direction to take here. There are two ways to account within a virtual machine: either in real time or in virtual time. it seems you'd like accounting to be sensitive to 'external load' - i.e. you'd like an 'internal' top to show the 'real' CPU accounting, right? Wouldnt it be more consistent if a virtual box would not show any dependency on external load? (i.e. it would slow down all of its internal functionality transparently, without exposing it via /proc. The only way to observe that would be the TOD interfaces: gettimeofday and real-time clock driven POSIX timers. Even timer_list could be driven via virtual time - although that would probably break user expectations, right?) Or would accounting-in-virtual-time break user expectations too? (most of the other hypervisors let guests account in virtual time.) Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 9:34 ` Ingo Molnar @ 2007-08-21 9:48 ` Paul Mackerras 2007-08-21 10:38 ` Martin Schwidefsky ` (2 subsequent siblings) 3 siblings, 0 replies; 40+ messages in thread From: Paul Mackerras @ 2007-08-21 9:48 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens Ingo Molnar writes: > my feeling is that it gives us generally higher-quality scheduling if we > drive all things scheduler via virtual time. Do you agree with that? On PowerPC at least, while we can measure virtual time, there is no hardware facility for getting an interrupt after a certain amount of virtual time has elapsed, but there is a facility to get an interrupt after an elapsed real-time interval. So sched_clock() could easily change to measure virtual time, but I don't see how to make scheduler_tick() be driven off virtual time. It sounds like s390 is the same. > it seems you'd like accounting to be sensitive to 'external load' - i.e. > you'd like an 'internal' top to show the 'real' CPU accounting, right? > > Wouldnt it be more consistent if a virtual box would not show any > dependency on external load? (i.e. it would slow down all of its > internal functionality transparently, without exposing it via /proc. The The tools that use the data in /proc get unhappy if user time + system time + hardirq time + softirq time + idle time + stolen time doesn't add up to real time. The way we handle that is by making stolen time represent the time taken away by the hypervisor (and on PowerPC with SMT, the time taken by the other thread too). Paul. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 9:34 ` Ingo Molnar 2007-08-21 9:48 ` Paul Mackerras @ 2007-08-21 10:38 ` Martin Schwidefsky 2007-08-21 11:36 ` Ingo Molnar 2007-08-21 10:39 ` Christian Borntraeger 2007-08-21 10:43 ` Christian Borntraeger 3 siblings, 1 reply; 40+ messages in thread From: Martin Schwidefsky @ 2007-08-21 10:38 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Tue, 2007-08-21 at 11:34 +0200, Ingo Molnar wrote: > * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > > > > hm, does on s390 scheduler_tick() get driven in virtual time or in > > > real time? The very latest scheduler code will enforce a minimum > > > rate of sched_clock() across two scheduler_tick() calls (in rc3 and > > > later kernels). If sched_clock() "slows down" but scheduler_tick() > > > still has a real-time frequency then that impacts the quality of > > > scheduling. So scheduler_tick() and sched_clock() must really have > > > the same behavior (either both are virtual or both are real), so > > > that scheduling becomes invariant to steal-time. > > > > scheduler_tick() is based on the HZ timer which uses the TOD clock = > > real time. sched_clock() currently uses the TOD clock as well so in > > regard to the new scheduler we currently do not have a problem. We > > have a problem with cpu time accounting, the change to the /proc code > > breaks the precise accounting on s390. To solve the cpu time > > accounting we need to change sched_clock() to the cpu timer = virtual > > time. To change the scheduler_tick() as well requires another patch > > and I fear it would complicate things in the s390 backend. > > my feeling is that it gives us generally higher-quality scheduling if we > drive all things scheduler via virtual time. Do you agree with that? Yes, I'm in favour of converting sched_clock to use virtual time. It makes sense to me. > > And if you say that the scheduling becomes invariant to steal-time, > > how is the cpu time accounting via sum_exec supposed to work if it > > does not take steal-time into account ? > > right now there are two distinct and independent things: scheduler > behavior (the scheduling decisions the scheduler makes) and accounting > behavior. > > the 'invariant' i mentioned only covers scheduler behavior, not > accounting behavior. Accounting is separate in theory, but coupled in > practice now via sum_exec_runtime. Hmm, ok. But the fact is that right now the accounting via sum_exec_runtime is broken in regard to virtual cpus, isn't it? > Before we do a patch to decouple them again, lets make sure we agree on > the direction to take here. There are two ways to account within a > virtual machine: either in real time or in virtual time. > > it seems you'd like accounting to be sensitive to 'external load' - i.e. > you'd like an 'internal' top to show the 'real' CPU accounting, right? Yes, we want utime and stime represent the time spent on the physical cpu. To make up for the missing time the steal time field has been introduced. > Wouldnt it be more consistent if a virtual box would not show any > dependency on external load? (i.e. it would slow down all of its > internal functionality transparently, without exposing it via /proc. The > only way to observe that would be the TOD interfaces: gettimeofday and > real-time clock driven POSIX timers. Even timer_list could be driven via > virtual time - although that would probably break user expectations, > right?) Or would accounting-in-virtual-time break user expectations too? > (most of the other hypervisors let guests account in virtual time. No, imho it is less consistent if the virtual box shows virtual time. If you look at the top output as a user and it shows that some process used x% of cpu what does it tell you? With virtual cpus next to nothing, you have to normalize the numbers with the %steal while the process was running. But even then it still is not a good number because the %steal changes while a process is running. The only good solution is to use virtual time for all cputime values. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 10:38 ` Martin Schwidefsky @ 2007-08-21 11:36 ` Ingo Molnar 2007-08-21 11:58 ` Martin Schwidefsky 0 siblings, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 11:36 UTC (permalink / raw) To: Martin Schwidefsky Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > > Wouldnt it be more consistent if a virtual box would not show any > > dependency on external load? (i.e. it would slow down all of its > > internal functionality transparently, without exposing it via /proc. > > The only way to observe that would be the TOD interfaces: > > gettimeofday and real-time clock driven POSIX timers. Even > > timer_list could be driven via virtual time - although that would > > probably break user expectations, right?) Or would > > accounting-in-virtual-time break user expectations too? (most of the > > other hypervisors let guests account in virtual time. > > No, imho it is less consistent if the virtual box shows virtual time. > If you look at the top output as a user and it shows that some process > used x% of cpu what does it tell you? [...] it tells me something really important: that the virtual box's scheduler gave this task as much CPU time as it could. > [...] With virtual cpus next to nothing, you have to normalize the > numbers with the %steal while the process was running. But even then > it still is not a good number because the %steal changes while a > process is running. The only good solution is to use virtual time for > all cputime values. the steal percentage is really a concept one step higher in the scheduling hierarchy. You should be running top (or an equivalent tool) in the _hypervisor_ context if you want to know about how much time each virtual box gets. 'mixing' that information with the 'internal' task statistics of the virtual box is less consistent IMO and leads to the loss of information. (with the 'mixing' method there's no way to tell whether a task got all CPU time it asked for - and _that_ is which an admin is interested in just as much as the time allocation between virtual boxes.) so in say KVM you determine the steal percentage by looking at 'top' output in the hypervisor context. (or looking at steal% in the internal top output) Looking at 'top' in the guest tells you everything internal to that guest, without mixing external scheduling information into it. Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 11:36 ` Ingo Molnar @ 2007-08-21 11:58 ` Martin Schwidefsky 0 siblings, 0 replies; 40+ messages in thread From: Martin Schwidefsky @ 2007-08-21 11:58 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Tue, 2007-08-21 at 13:36 +0200, Ingo Molnar wrote: > * Martin Schwidefsky <schwidefsky@de.ibm.com> wrote: > > > > Wouldnt it be more consistent if a virtual box would not show any > > > dependency on external load? (i.e. it would slow down all of its > > > internal functionality transparently, without exposing it via /proc. > > > The only way to observe that would be the TOD interfaces: > > > gettimeofday and real-time clock driven POSIX timers. Even > > > timer_list could be driven via virtual time - although that would > > > probably break user expectations, right?) Or would > > > accounting-in-virtual-time break user expectations too? (most of the > > > other hypervisors let guests account in virtual time. > > > > No, imho it is less consistent if the virtual box shows virtual time. > > If you look at the top output as a user and it shows that some process > > used x% of cpu what does it tell you? [...] > > it tells me something really important: that the virtual box's scheduler > gave this task as much CPU time as it could. So? As far as accounting is concerned the user doesn't care one bit what the scheduler decided. The user cares how much cpu was used. If you want to know how much of the cputime available to the virtual box was used for a process you just have to add/subtract the steal time. Your have the complete picture what happened. You cannot get the complete picture if you have do process accounting based on the TOD clock, you never know how much cpu was actually spent for a process. > > [...] With virtual cpus next to nothing, you have to normalize the > > numbers with the %steal while the process was running. But even then > > it still is not a good number because the %steal changes while a > > process is running. The only good solution is to use virtual time for > > all cputime values. > > the steal percentage is really a concept one step higher in the > scheduling hierarchy. You should be running top (or an equivalent tool) > in the _hypervisor_ context if you want to know about how much time each > virtual box gets. 'mixing' that information with the 'internal' task > statistics of the virtual box is less consistent IMO and leads to the > loss of information. (with the 'mixing' method there's no way to tell > whether a task got all CPU time it asked for - and _that_ is which an > admin is interested in just as much as the time allocation between > virtual boxes.) Not really. If I look at a process I want to know how much cpu it used. Not virtual but real cpu. And we learned this the hard way, trying to do accounting with numbers that have to get normalized by numbers from the hypervisor is just plain broken. > so in say KVM you determine the steal percentage by looking at 'top' > output in the hypervisor context. (or looking at steal% in the internal > top output) Looking at 'top' in the guest tells you everything internal > to that guest, without mixing external scheduling information into it. The information about accounting numbers that are internal to the guest is 99.99% useless. You always have to take the external scheduling into account. We choose to do the accounting in a way that does not require additional steps to get to useful numbers. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 9:34 ` Ingo Molnar 2007-08-21 9:48 ` Paul Mackerras 2007-08-21 10:38 ` Martin Schwidefsky @ 2007-08-21 10:39 ` Christian Borntraeger 2007-08-21 10:43 ` Christian Borntraeger 3 siblings, 0 replies; 40+ messages in thread From: Christian Borntraeger @ 2007-08-21 10:39 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras Am Dienstag, 21. August 2007 schrieb Ingo Molnar: > Wouldnt it be more consistent if a virtual box would not show any > dependency on external load? (i.e. it would slow down all of its > internal functionality transparently, without exposing it via /proc. The > only way to observe that would be the TOD interfaces: gettimeofday and > real-time clock driven POSIX timers. Even timer_list could be driven via > virtual time - although that would probably break user expectations, > right?) Or would accounting-in-virtual-time break user expectations too? > (most of the other hypervisors let guests account in virtual time.) Most hypervisors let guest account in virtual time because this requires no code change. We had that in the past as well. But now we have lots of customers that rely on a different accounting model. Before CONFIG_VIRT_CPU_ACCOUNTING we got this model of top showing 99% of cpu used, even if the hypervisor gives us 20% of the physical one. We actually want to show that this process gets only 19.8% of a real cpu for several reasons: - people do workload management based on used cycles - people/departments pay money based on used cycles - If your physical cpu has to much load, you want to identify processes by physical cpu usage There are even some vendors that claimed that Linux accouting was completely broken and useless on s390 and people should use their vendor tool to get the right numbers. While these tools still have important features CONFIG_VIRT_CPU_ACCOUNTING fixed at least the "broken" parts. Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 9:34 ` Ingo Molnar ` (2 preceding siblings ...) 2007-08-21 10:39 ` Christian Borntraeger @ 2007-08-21 10:43 ` Christian Borntraeger 2007-08-21 11:15 ` Ingo Molnar 3 siblings, 1 reply; 40+ messages in thread From: Christian Borntraeger @ 2007-08-21 10:43 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras Am Dienstag, 21. August 2007 schrieb Ingo Molnar: > the 'invariant' i mentioned only covers scheduler behavior, not > accounting behavior. Accounting is separate in theory, but coupled in > practice now via sum_exec_runtime. Forgot to answer about that: That means that the current design does not cover our requirement of showing process real time, even if we implement sched_clock? In that case I would suggest to merge my patch as a quick but correct solution and do it properly for 2.6.24. Of course better solutions are welcome :-) Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 10:43 ` Christian Borntraeger @ 2007-08-21 11:15 ` Ingo Molnar 2007-08-21 11:24 ` Christian Borntraeger 0 siblings, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 11:15 UTC (permalink / raw) To: Christian Borntraeger Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > Am Dienstag, 21. August 2007 schrieb Ingo Molnar: > > the 'invariant' i mentioned only covers scheduler behavior, not > > accounting behavior. Accounting is separate in theory, but coupled in > > practice now via sum_exec_runtime. > > Forgot to answer about that: That means that the current design does > not cover our requirement of showing process real time, even if we > implement sched_clock? In that case I would suggest to merge my patch > as a quick but correct solution and do it properly for 2.6.24. Of > course better solutions are welcome :-) you mean to revert b27f03d4bd? I'd really like to see this fixed for real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the only case affected) Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 11:15 ` Ingo Molnar @ 2007-08-21 11:24 ` Christian Borntraeger 2007-08-21 11:30 ` Ingo Molnar 0 siblings, 1 reply; 40+ messages in thread From: Christian Borntraeger @ 2007-08-21 11:24 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras Am Dienstag, 21. August 2007 schrieb Ingo Molnar: > you mean to revert b27f03d4bd? I'd really like to see this fixed for > real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the > only case affected) Not a complete revert, but an ifdef-workaround. I wrote this patch last week and you were on cc: http://marc.info/?l=linux-mm-commits&m=118737949222362&w=2 This patch should fix s390 and let everybody else use your new code. If you are conviced that we get a better solution before 2.6.23 hits the street, thats fine with me. Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 11:24 ` Christian Borntraeger @ 2007-08-21 11:30 ` Ingo Molnar 2007-08-21 11:58 ` Christian Borntraeger 0 siblings, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 11:30 UTC (permalink / raw) To: Christian Borntraeger Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > Am Dienstag, 21. August 2007 schrieb Ingo Molnar: > > you mean to revert b27f03d4bd? I'd really like to see this fixed for > > real for s390 + CONFIG_VIRT_CPU_ACCOUNTING=y. (which seems to be the > > only case affected) > > Not a complete revert, but an ifdef-workaround. I wrote this patch last week > and you were on cc: > http://marc.info/?l=linux-mm-commits&m=118737949222362&w=2 > > This patch should fix s390 and let everybody else use your new code. > If you are conviced that we get a better solution before 2.6.23 hits > the street, thats fine with me. what do you think about the rq_clock() #ifdef i did in the previous mail plus you making sched_clock() virtual? That way you can keep scheduler_tick() driven by real-time, although that generally will cause artifacts with SMP load-balancing too. (that was true in the past too) but i dont mind your patch either - it's really the architecture's choice how visible it wants to make external load to the task stats of its virtual machines. I think it is more logical to say that 100% CPU time displayed in 'top' means that the task got all the CPU time it asked for from the virtual machine. (and if you are curious about how much time was stolen from the virtual box altogether you look at the stolen-time stats in isolation.) Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 11:30 ` Ingo Molnar @ 2007-08-21 11:58 ` Christian Borntraeger 2007-08-21 12:21 ` Ingo Molnar 0 siblings, 1 reply; 40+ messages in thread From: Christian Borntraeger @ 2007-08-21 11:58 UTC (permalink / raw) To: Ingo Molnar Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras > what do you think about the rq_clock() #ifdef i did in the previous mail > plus you making sched_clock() virtual? That way you can keep > scheduler_tick() driven by real-time, although that generally will cause > artifacts with SMP load-balancing too. (that was true in the past too) I just has a test run with a virtual sched_clock and your patch. Unfortunately, it doesnt work. top shows 100% for a cpu bound process, but steal time shows about 5% stolen cpu. This brings me to another problem: runtime. Let me give an example. You get 90% cpu from your hipervisor in a shared environment. If you now start a cpu bound task that gets the full cpu for lets say 10 minutes. I REALLY want to see 9 minutes in ps and top because my department might pay for used cpu cycles. > > but i dont mind your patch either - it's really the architecture's > choice how visible it wants to make external load to the task stats of > its virtual machines. I think it is more logical to say that 100% CPU > time displayed in 'top' means that the task got all the CPU time it > asked for from the virtual machine. (and if you are curious about how > much time was stolen from the virtual box altogether you look at the > stolen-time stats in isolation.) Well, as I said we started with the same approach (virtual cpu) but we learned that these numbers have no meaning at all because the hypervisor does have different scheduling timeslices and having 100% inside the guest can still result in almost nothing if the system is really loaded. Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 11:58 ` Christian Borntraeger @ 2007-08-21 12:21 ` Ingo Molnar 2007-08-21 12:57 ` Martin Schwidefsky 0 siblings, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 12:21 UTC (permalink / raw) To: Christian Borntraeger Cc: Martin Schwidefsky, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > but i dont mind your patch either - it's really the architecture's > > choice how visible it wants to make external load to the task stats > > of its virtual machines. I think it is more logical to say that 100% > > CPU time displayed in 'top' means that the task got all the CPU time > > it asked for from the virtual machine. (and if you are curious about > > how much time was stolen from the virtual box altogether you look at > > the stolen-time stats in isolation.) > > Well, as I said we started with the same approach (virtual cpu) but we > learned that these numbers have no meaning at all because the > hypervisor does have different scheduling timeslices and having 100% > inside the guest can still result in almost nothing if the system is > really loaded. hm, i think i must have used the wrong terminology, so let me describe what i mean, so that we can argue this more efficiently ;-) What i call "real time sched_clock()" is a sched_clock() that returns the GTOD (the real time) of the hypervisor. I.e. sched_clock() advances by 1 billion units every wall-clock second, in each guest. A "virtual time sched_clock()" is a sched_clock() that returns only the amount of time the virtual CPU was executed by the hypervisor. I.e. on a 3 times overloaded hypervisor with 3 guests it will advance 333 million nanoseconds per 1 wall-clock second, in each guest. (it is 'virtual' because the clock slows down as load goes up. In CFS-speak the virtual clock is the "fair-clock".) to me the right scheme for sched_clock() is the virtual variant: to return the load-scaled nanoseconds. That way CFS will be able to schedule fairly even if time has been "stolen" from a task [by virtue of the hypervisor scheduling away the guest context without giving any notice about this to the guest kernel] - because sched_clock() measures the virtual time that got allocated to that guest by the hypervisor. [ here i'm assuming precise host and precise guest statistics (which is naturally the case if both are Linux), and in that context the virtual numbers very much make sense, and whether 'top' displays 100% for a sole CPU-bound task should be mostly a matter of tooling. ] Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 12:21 ` Ingo Molnar @ 2007-08-21 12:57 ` Martin Schwidefsky 0 siblings, 0 replies; 40+ messages in thread From: Martin Schwidefsky @ 2007-08-21 12:57 UTC (permalink / raw) To: Ingo Molnar Cc: Christian Borntraeger, Linus Torvalds, Andrew Morton, linux-kernel, Jan Glauber, heiko.carstens, Paul Mackerras On Tue, 2007-08-21 at 14:21 +0200, Ingo Molnar wrote: > hm, i think i must have used the wrong terminology, so let me describe > what i mean, so that we can argue this more efficiently ;-) Ok, seems we need to be a bit more precise. > What i call "real time sched_clock()" is a sched_clock() that returns > the GTOD (the real time) of the hypervisor. I.e. sched_clock() advances > by 1 billion units every wall-clock second, in each guest. Which is what we call the TOD clock. > A "virtual time sched_clock()" is a sched_clock() that returns only the > amount of time the virtual CPU was executed by the hypervisor. I.e. on a > 3 times overloaded hypervisor with 3 guests it will advance 333 million > nanoseconds per 1 wall-clock second, in each guest. (it is 'virtual' > because the clock slows down as load goes up. In CFS-speak the virtual > clock is the "fair-clock".) We 100% agree that sched_clock() should be virtual. > to me the right scheme for sched_clock() is the virtual variant: to > return the load-scaled nanoseconds. That way CFS will be able to > schedule fairly even if time has been "stolen" from a task [by virtue of > the hypervisor scheduling away the guest context without giving any > notice about this to the guest kernel] - because sched_clock() measures > the virtual time that got allocated to that guest by the hypervisor. Ok, this means we will need Jans patch that makes sched_clock() use the cpu timer. You said that this change would require that scheduler_tick() has to use virtual time as well, which would be the second patch. And then we require a third patch that fixes the process accounting. > [ here i'm assuming precise host and precise guest statistics (which is > naturally the case if both are Linux), and in that context the virtual > numbers very much make sense, and whether 'top' displays 100% for a > sole CPU-bound task should be mostly a matter of tooling. ] It is not only a matter of tooling. The tool needs to have enough, precise information to actually return the requested information. If the user want to know how much real cpu a process has used (and our user do want to know that), the output of /proc/<pid>/stat fields 14-17 have to contain the real cpu usage for user/system/cumulated user and cumulated system time. If they would contain the virtual cpu usage you cannot tell how much real cpu a process used, even if you have access to the steal timer numbers. The reason is that you would have to synchronize the scheduling points in the guest and the hypervisor to come up with reasonable numbers. This is something we obviously do not want to do. The output of /proc/<pid>/stat is what Christian and I are complaining about. Since the introduction of CFS and VIRT_ACCOUNTING=y the output of /proc/<pid>/stat has changed its meaning and in our opinion it is wrong now. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 8:17 ` Christian Borntraeger 2007-08-21 8:42 ` Ingo Molnar @ 2007-08-21 11:25 ` Ingo Molnar 2007-08-22 7:50 ` Christian Borntraeger 1 sibling, 1 reply; 40+ messages in thread From: Ingo Molnar @ 2007-08-21 11:25 UTC (permalink / raw) To: Christian Borntraeger Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > Am Montag, 20. August 2007 schrieb Ingo Molnar: > > could you send that precise sched_clock() patch? It should be an order > > of magnitude simpler than the high-precision stime/utime tracking you > > already do, and it's needed for quality scheduling anyway. > > I have a question about that. I just played with sched_clock, and even > when I intentionally slow down sched_clock by a factor of 2, my cpu > bound process gets 100 % in top. If this is intentional, I dont > understand how a virtualized sched_clock would fix the accounting > change? could you try the patch below, does it work any better? Ingo --- kernel/sched.c | 9 +++++++++ 1 file changed, 9 insertions(+) Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -333,6 +333,14 @@ static void __update_rq_clock(struct rq #ifdef CONFIG_SCHED_DEBUG WARN_ON_ONCE(cpu_of(rq) != smp_processor_id()); #endif +#ifdef CONFIG_VIRT_CPU_ACCOUNTING + /* + * Trust sched_clock on s390: + */ + if (unlikely(delta > rq->clock_max_delta)) + rq->clock_max_delta = delta; + clock += delta; +#else /* * Protect against sched_clock() occasionally going backwards: */ @@ -355,6 +363,7 @@ static void __update_rq_clock(struct rq clock += delta; } } +#endif rq->prev_clock_raw = now; rq->clock = clock; ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-21 11:25 ` Ingo Molnar @ 2007-08-22 7:50 ` Christian Borntraeger 2007-08-22 7:59 ` Ingo Molnar 0 siblings, 1 reply; 40+ messages in thread From: Christian Borntraeger @ 2007-08-22 7:50 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras Am Dienstag, 21. August 2007 schrieben Sie: > could you try the patch below, does it work any better? I looked again at the scheduler code and things are getting better when I run the patch below on top of your patch and with our sched_clock prototype. I guess there is a reason why you want rq->clock advanced by at least one tick? We discussed calling scheduler_tick with virtual time as well. Would it have the same result? What would be the impact on latency? After looking at the current s390 timer code, it seems that this kind of change is not trivial enough to be rc3+ ready. I personally think, that for 2.6.23 we should use the patch against fs/proc/array.c and everything else for 2.6.24? Christian --- kernel/sched.c | 6 ------ 1 file changed, 6 deletions(-) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -3321,15 +3321,9 @@ void scheduler_tick(void) int cpu = smp_processor_id(); struct rq *rq = cpu_rq(cpu); struct task_struct *curr = rq->curr; - u64 next_tick = rq->tick_timestamp + TICK_NSEC; spin_lock(&rq->lock); __update_rq_clock(rq); - /* - * Let rq->clock advance by at least TICK_NSEC: - */ - if (unlikely(rq->clock < next_tick)) - rq->clock = next_tick; rq->tick_timestamp = rq->clock; update_cpu_load(rq); if (curr != rq->idle) /* FIXME: needed? */ ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [accounting regression since rc1] scheduler updates 2007-08-22 7:50 ` Christian Borntraeger @ 2007-08-22 7:59 ` Ingo Molnar 0 siblings, 0 replies; 40+ messages in thread From: Ingo Molnar @ 2007-08-22 7:59 UTC (permalink / raw) To: Christian Borntraeger Cc: Linus Torvalds, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras * Christian Borntraeger <borntraeger@de.ibm.com> wrote: > Am Dienstag, 21. August 2007 schrieben Sie: > > could you try the patch below, does it work any better? > > I looked again at the scheduler code and things are getting better > when I run the patch below on top of your patch and with our > sched_clock prototype. I guess there is a reason why you want > rq->clock advanced by at least one tick? yeah - on PCs if for whatever reason the TSC misbehaves (and that's quite frequent) then this code sets a minimum boundary for behavior. If sched_clock() is totally random or does not advance at all or goes backwards all the time then rq_clock() still functions and falls back to jiffies-granularity behavior in essence. > We discussed calling scheduler_tick with virtual time as well. > Would it have the same result? > What would be the impact on latency? if you call scheduler_tick() with virtual time then the "safety" measures in rq_clock() do not kick in and sched_clock() behaves correctly as far as the scheduler is concerned. (if everything is in virtual time then the scheduler has no way to observe/notice that in reality this is a virtual machine.) > After looking at the current s390 timer code, it seems that this kind of > change is not trivial enough to be rc3+ ready. > I personally think, that for 2.6.23 we should use the patch against > fs/proc/array.c and everything else for 2.6.24? yes, that has the least impact for .23 - i have added your array.c patch to my queue. Ingo ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <200708141032.47235.borntraeger@de.ibm.com>]
[parent not found: <alpine.LFD.0.999.0708140835240.30176@woody.linux-foundation.org>]
* Re: [accounting regression since rc1] scheduler updates [not found] ` <alpine.LFD.0.999.0708140835240.30176@woody.linux-foundation.org> @ 2007-08-14 18:19 ` Christian Borntraeger 0 siblings, 0 replies; 40+ messages in thread From: Christian Borntraeger @ 2007-08-14 18:19 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andrew Morton, linux-kernel, Martin Schwidefsky, Jan Glauber, heiko.carstens, Paul Mackerras Am Dienstag, 14. August 2007 schrieb Linus Torvalds: > On Tue, 14 Aug 2007, Christian Borntraeger wrote: > > Hello Ingo, > Just a fyi: I think Ingo is on vacation this week without any internet > access, I think he'll be back next Monday. > > Linus Well deserved. As Martin and Jan are away as well I have a week to look into that problem myself :-) Christian ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2007-08-22 7:59 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-08-12 16:32 [git pull request] scheduler updates Ingo Molnar
2007-08-14 8:37 ` [accounting regression since rc1] " Christian Borntraeger
2007-08-16 8:17 ` [PATCH][RFC] Re: accounting regression since rc1 Christian Borntraeger
2007-08-20 15:45 ` [accounting regression since rc1] scheduler updates Ingo Molnar
2007-08-20 17:03 ` Martin Schwidefsky
2007-08-20 18:08 ` Ingo Molnar
2007-08-20 18:33 ` Martin Schwidefsky
2007-08-20 19:00 ` Balbir Singh
2007-08-20 19:05 ` Ingo Molnar
2007-08-21 7:20 ` Christian Borntraeger
2007-08-20 19:12 ` Ingo Molnar
2007-08-21 7:00 ` Christian Borntraeger
2007-08-21 9:18 ` Martin Schwidefsky
2007-08-20 23:07 ` Paul Mackerras
2007-08-21 2:18 ` Andi Kleen
2007-08-21 7:09 ` Ingo Molnar
2007-08-21 10:07 ` Andi Kleen
2007-08-21 10:20 ` Ingo Molnar
2007-08-21 11:15 ` Andi Kleen
2007-08-21 11:20 ` Ingo Molnar
2007-08-21 8:17 ` Christian Borntraeger
2007-08-21 8:42 ` Ingo Molnar
2007-08-21 9:11 ` Martin Schwidefsky
2007-08-21 9:34 ` Ingo Molnar
2007-08-21 9:48 ` Paul Mackerras
2007-08-21 10:38 ` Martin Schwidefsky
2007-08-21 11:36 ` Ingo Molnar
2007-08-21 11:58 ` Martin Schwidefsky
2007-08-21 10:39 ` Christian Borntraeger
2007-08-21 10:43 ` Christian Borntraeger
2007-08-21 11:15 ` Ingo Molnar
2007-08-21 11:24 ` Christian Borntraeger
2007-08-21 11:30 ` Ingo Molnar
2007-08-21 11:58 ` Christian Borntraeger
2007-08-21 12:21 ` Ingo Molnar
2007-08-21 12:57 ` Martin Schwidefsky
2007-08-21 11:25 ` Ingo Molnar
2007-08-22 7:50 ` Christian Borntraeger
2007-08-22 7:59 ` Ingo Molnar
[not found] ` <200708141032.47235.borntraeger@de.ibm.com>
[not found] ` <alpine.LFD.0.999.0708140835240.30176@woody.linux-foundation.org>
2007-08-14 18:19 ` Christian Borntraeger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox