* [BUG] Paravirtual time accounting / IRQ time accounting
@ 2014-03-19 9:42 lwcheng
2014-03-20 15:01 ` Glauber Costa
0 siblings, 1 reply; 8+ messages in thread
From: lwcheng @ 2014-03-19 9:42 UTC (permalink / raw)
To: linux-kernel; +Cc: glommer
In consolidated environments, when there are multiple virtual machines (VMs)
running on one CPU core, timekeeping will be a problem to the guest OS.
Here, I report my findings about Linux process scheduler.
Description
------------
Linux CFS relies on rq->clock_task to charge each task, determine
vruntime, etc.
When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ
will be excluded from updating rq->clock_task.
When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by
the hypervisor
will also be excluded from updating rq->clock_task.
With "both" CONFIG_IRQ_TIME_ACCOUNTING and
CONFIG_PARAVIRT_TIME_ACCOUNTING enabled,
I put three KVM guests on one core and run hackbench in each guest. I
find that
in the guests, rq->clock_task stays *unchanged*. The malfunction
embarrasses CFS.
------------
Analysis
------------
[src/kernel/sched/core.c]
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
... ...
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
... ...
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((¶virt_steal_rq_enabled))) {
steal = paravirt_steal_clock(cpu_of(rq));
steal -= rq->prev_steal_time_rq;
... ...
rq->prev_steal_time_rq += steal;
delta -= steal;
}
#endif
rq->clock_task += delta;
... ...
}
--
"delta" -> the intended increment to rq->clock_task
"irq_delta" -> the time spent on serving IRQ (hard + soft)
"steal" -> the time stolen by the underlying hypervisor
--
"irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable
to VM scheduling delays. "irq_delta" can include part or whole of "steal".
I observe that [irq_delta + steal >> delta].
As a result, "delta" becomes zero. That is why rq->clock_task stops.
------------
Please confirm this bug. Thanks.
Luwei Cheng
--
CS student
The University of Hong Kong
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [BUG] Paravirtual time accounting / IRQ time accounting 2014-03-19 9:42 [BUG] Paravirtual time accounting / IRQ time accounting lwcheng @ 2014-03-20 15:01 ` Glauber Costa 2014-03-21 5:50 ` Mike Galbraith 2014-03-21 11:31 ` Rik van Riel 0 siblings, 2 replies; 8+ messages in thread From: Glauber Costa @ 2014-03-20 15:01 UTC (permalink / raw) To: lwcheng, Rik van Riel, Peter Zijlstra; +Cc: LKML On Wed, Mar 19, 2014 at 6:42 AM, <lwcheng@cs.hku.hk> wrote: > In consolidated environments, when there are multiple virtual machines (VMs) > running on one CPU core, timekeeping will be a problem to the guest OS. > Here, I report my findings about Linux process scheduler. > > > Description > ------------ > Linux CFS relies on rq->clock_task to charge each task, determine vruntime, > etc. > > When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ > will be excluded from updating rq->clock_task. > When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by the > hypervisor > will also be excluded from updating rq->clock_task. > > With "both" CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING > enabled, > I put three KVM guests on one core and run hackbench in each guest. I find > that > in the guests, rq->clock_task stays *unchanged*. The malfunction embarrasses > CFS. > ------------ > > > Analysis > ------------ > [src/kernel/sched/core.c] > static void update_rq_clock_task(struct rq *rq, s64 delta) > { > ... ... > #ifdef CONFIG_IRQ_TIME_ACCOUNTING > irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; > ... ... > rq->prev_irq_time += irq_delta; > delta -= irq_delta; > #endif > > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > if (static_key_false((¶virt_steal_rq_enabled))) { > steal = paravirt_steal_clock(cpu_of(rq)); > steal -= rq->prev_steal_time_rq; > ... ... > rq->prev_steal_time_rq += steal; > delta -= steal; > } > #endif > > rq->clock_task += delta; > ... ... > } > -- > "delta" -> the intended increment to rq->clock_task > "irq_delta" -> the time spent on serving IRQ (hard + soft) > "steal" -> the time stolen by the underlying hypervisor > -- > "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable > to VM scheduling delays. This looks like a real problem indeed. The main problem in searching for a solution, is that of course not all of the irq time is steal time and vice versa. In this case, we could subtract irq_time from steal, and add only the steal part time that is in excess. I don't think this is 100 % guaranteed, but maybe it is a good approximation. Rik, do you have an opinion on this ? ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG] Paravirtual time accounting / IRQ time accounting 2014-03-20 15:01 ` Glauber Costa @ 2014-03-21 5:50 ` Mike Galbraith 2014-03-22 6:47 ` lwcheng 2014-03-21 11:31 ` Rik van Riel 1 sibling, 1 reply; 8+ messages in thread From: Mike Galbraith @ 2014-03-21 5:50 UTC (permalink / raw) To: Glauber Costa; +Cc: lwcheng, Rik van Riel, Peter Zijlstra, LKML On Thu, 2014-03-20 at 12:01 -0300, Glauber Costa wrote: > On Wed, Mar 19, 2014 at 6:42 AM, <lwcheng@cs.hku.hk> wrote: > > In consolidated environments, when there are multiple virtual machines (VMs) > > running on one CPU core, timekeeping will be a problem to the guest OS. > > Here, I report my findings about Linux process scheduler. > > > > > > Description > > ------------ > > Linux CFS relies on rq->clock_task to charge each task, determine vruntime, > > etc. > > > > When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ > > will be excluded from updating rq->clock_task. > > When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by the > > hypervisor > > will also be excluded from updating rq->clock_task. > > > > With "both" CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING > > enabled, > > I put three KVM guests on one core and run hackbench in each guest. I find > > that > > in the guests, rq->clock_task stays *unchanged*. The malfunction embarrasses > > CFS. > > ------------ > > > > > > Analysis > > ------------ > > [src/kernel/sched/core.c] > > static void update_rq_clock_task(struct rq *rq, s64 delta) > > { > > ... ... > > #ifdef CONFIG_IRQ_TIME_ACCOUNTING > > irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; > > ... ... > > rq->prev_irq_time += irq_delta; > > delta -= irq_delta; > > #endif > > > > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING > > if (static_key_false((¶virt_steal_rq_enabled))) { > > steal = paravirt_steal_clock(cpu_of(rq)); > > steal -= rq->prev_steal_time_rq; > > ... ... > > rq->prev_steal_time_rq += steal; > > delta -= steal; > > } > > #endif > > > > rq->clock_task += delta; > > ... ... > > } > > -- > > "delta" -> the intended increment to rq->clock_task > > "irq_delta" -> the time spent on serving IRQ (hard + soft) > > "steal" -> the time stolen by the underlying hypervisor > > -- > > "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable > > to VM scheduling delays. > > This looks like a real problem indeed. The main problem in searching > for a solution, is that of course not all of the irq time is steal > time and vice versa. In this case, we could subtract irq_time from > steal, and add only the steal part time that is in excess. I don't > think this is 100 % guaranteed, but maybe it is a good approximation. > > Rik, do you have an opinion on this ? Hrm, on my little Q6600 box, I'm running 3 VMS all pinned to CPU3, all running hackbench -l zillion, one of them also running crash, staring at it's sole rq->clock_task as I write this, with kernels (3.11.10) on both host and guest configured as reported. clock_task = 631322187004, clock_task = 631387807452, clock_task = 631474214294, clock_task = 631523864893, clock_task = 631604646268, clock_task = 631643276025, Maybe 3 VMs isn't enough overload for such a beastly CPU. Top reports some very funky utilization numbers, but other than that, the things seem to work fine here. perf thinks scheduling work too. -Mike ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG] Paravirtual time accounting / IRQ time accounting 2014-03-21 5:50 ` Mike Galbraith @ 2014-03-22 6:47 ` lwcheng 2014-03-22 7:44 ` Mike Galbraith 0 siblings, 1 reply; 8+ messages in thread From: lwcheng @ 2014-03-22 6:47 UTC (permalink / raw) To: Mike Galbraith; +Cc: Glauber Costa, Rik van Riel, Peter Zijlstra, LKML Quoting Mike Galbraith <umgwanakikbuti@gmail.com>: > On Thu, 2014-03-20 at 12:01 -0300, Glauber Costa wrote: >> On Wed, Mar 19, 2014 at 6:42 AM, <lwcheng@cs.hku.hk> wrote: >> > In consolidated environments, when there are multiple virtual >> machines (VMs) >> > running on one CPU core, timekeeping will be a problem to the guest OS. >> > Here, I report my findings about Linux process scheduler. >> > >> > >> > Description >> > ------------ >> > Linux CFS relies on rq->clock_task to charge each task, determine >> vruntime, >> > etc. >> > >> > When CONFIG_IRQ_TIME_ACCOUNTING is enabled, the time spent on serving IRQ >> > will be excluded from updating rq->clock_task. >> > When CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled, the time stolen by the >> > hypervisor >> > will also be excluded from updating rq->clock_task. >> > >> > With "both" CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING >> > enabled, >> > I put three KVM guests on one core and run hackbench in each guest. I find >> > that >> > in the guests, rq->clock_task stays *unchanged*. The malfunction >> embarrasses >> > CFS. >> > ------------ >> > >> > >> > Analysis >> > ------------ >> > [src/kernel/sched/core.c] >> > static void update_rq_clock_task(struct rq *rq, s64 delta) >> > { >> > ... ... >> > #ifdef CONFIG_IRQ_TIME_ACCOUNTING >> > irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; >> > ... ... >> > rq->prev_irq_time += irq_delta; >> > delta -= irq_delta; >> > #endif >> > >> > #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING >> > if (static_key_false((¶virt_steal_rq_enabled))) { >> > steal = paravirt_steal_clock(cpu_of(rq)); >> > steal -= rq->prev_steal_time_rq; >> > ... ... >> > rq->prev_steal_time_rq += steal; >> > delta -= steal; >> > } >> > #endif >> > >> > rq->clock_task += delta; >> > ... ... >> > } >> > -- >> > "delta" -> the intended increment to rq->clock_task >> > "irq_delta" -> the time spent on serving IRQ (hard + soft) >> > "steal" -> the time stolen by the underlying hypervisor >> > -- >> > "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable >> > to VM scheduling delays. >> >> This looks like a real problem indeed. The main problem in searching >> for a solution, is that of course not all of the irq time is steal >> time and vice versa. In this case, we could subtract irq_time from >> steal, and add only the steal part time that is in excess. I don't >> think this is 100 % guaranteed, but maybe it is a good approximation. >> >> Rik, do you have an opinion on this ? > > Hrm, on my little Q6600 box, I'm running 3 VMS all pinned to CPU3, all > running hackbench -l zillion, one of them also running crash, staring at > it's sole rq->clock_task as I write this, with kernels (3.11.10) on both > host and guest configured as reported. > > clock_task = 631322187004, > clock_task = 631387807452, > clock_task = 631474214294, > clock_task = 631523864893, > clock_task = 631604646268, > clock_task = 631643276025, > > Maybe 3 VMs isn't enough overload for such a beastly CPU. Top reports > some very funky utilization numbers, but other than that, the things > seem to work fine here. perf thinks scheduling work too. > > -Mike > I checked the source code again. I forgot to mention that I commented out steal_ticks() in my experiments (sorry): [kernel/sched/core.c] ------------------------------- #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { /* u64 st; */ steal = paravirt_steal_clock(cpu_of(rq)); steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) steal = delta; /* st = steal_ticks(steal); steal = st * TICK_NSEC; */ rq->prev_steal_time_rq += steal; delta -= steal; } #endif rq->clock_task += delta; ------------------------------- I do it just for "accuracy", because I fully trust "steal" reported by the hypervisor. I do not quite understand why it is trimmed again using steal_ticks(). Please enlighten me. Even when "steal == delta", as long as "steal" is not exactly (N * TICK_NSEC), after steal_ticks(), it will be always smaller than "delta". So, rq->clock_task can still progress, but may not in the precise way. Each time, the error is within the range (0, TICK_NSEC). When CONFIG_IRQ_TIME_ACCOUNTING is disabled, deleting steal_ticks() does not affect the update to rq->clock_task. I tested both 3.10.0 and 3.13.5. The results are consistent. -Luwei ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG] Paravirtual time accounting / IRQ time accounting 2014-03-22 6:47 ` lwcheng @ 2014-03-22 7:44 ` Mike Galbraith 0 siblings, 0 replies; 8+ messages in thread From: Mike Galbraith @ 2014-03-22 7:44 UTC (permalink / raw) To: lwcheng; +Cc: Glauber Costa, Rik van Riel, Peter Zijlstra, LKML On Sat, 2014-03-22 at 14:47 +0800, lwcheng@cs.hku.hk wrote: > I checked the source code again. I forgot to mention that I commented out > steal_ticks() in my experiments (sorry): Oh well, it wasn't a _complete_ waste of time. I now have a gaggle of cute little alien boxen to dissect should I ever again feel inspired to explore non silicon based CPU anatomy. -Mike ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG] Paravirtual time accounting / IRQ time accounting 2014-03-20 15:01 ` Glauber Costa 2014-03-21 5:50 ` Mike Galbraith @ 2014-03-21 11:31 ` Rik van Riel 2014-03-22 7:15 ` lwcheng 1 sibling, 1 reply; 8+ messages in thread From: Rik van Riel @ 2014-03-21 11:31 UTC (permalink / raw) To: Glauber Costa, lwcheng, Peter Zijlstra; +Cc: LKML On 03/20/2014 11:01 AM, Glauber Costa wrote: > On Wed, Mar 19, 2014 at 6:42 AM, <lwcheng@cs.hku.hk> wrote: >> ------------ >> [src/kernel/sched/core.c] >> static void update_rq_clock_task(struct rq *rq, s64 delta) >> { >> ... ... >> #ifdef CONFIG_IRQ_TIME_ACCOUNTING >> irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; >> ... ... >> rq->prev_irq_time += irq_delta; >> delta -= irq_delta; >> #endif >> >> #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING >> if (static_key_false((¶virt_steal_rq_enabled))) { >> steal = paravirt_steal_clock(cpu_of(rq)); >> steal -= rq->prev_steal_time_rq; >> ... ... >> rq->prev_steal_time_rq += steal; >> delta -= steal; >> } >> #endif >> >> rq->clock_task += delta; >> ... ... >> } >> -- >> "delta" -> the intended increment to rq->clock_task >> "irq_delta" -> the time spent on serving IRQ (hard + soft) >> "steal" -> the time stolen by the underlying hypervisor >> -- >> "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable >> to VM scheduling delays. > > This looks like a real problem indeed. The main problem in searching > for a solution, is that of course not all of the irq time is steal > time and vice versa. In this case, we could subtract irq_time from > steal, and add only the steal part time that is in excess. I don't > think this is 100 % guaranteed, but maybe it is a good approximation. > > Rik, do you have an opinion on this ? The other way around may be better, since steal time (when it happens) is likely to be of "time slice" magnitude, while irq time will happen more frequently, and in dozens-of-microseconds intervals. Furthermore, we have no way to measure what the irq time is, except by looking at how much real time elapsed. For steal time, however, the hypervisor tells us exactly how much time was stolen. That means when steal time and irq time happen simultaneously, the amount of steal time should always be smaller than the calculated irq time for that period. actual irq_time = calculated irq time - reported steal time; -- All rights reversed ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG] Paravirtual time accounting / IRQ time accounting 2014-03-21 11:31 ` Rik van Riel @ 2014-03-22 7:15 ` lwcheng 2014-03-22 14:57 ` Rik van Riel 0 siblings, 1 reply; 8+ messages in thread From: lwcheng @ 2014-03-22 7:15 UTC (permalink / raw) To: Rik van Riel; +Cc: Glauber Costa, Peter Zijlstra, LKML Quoting Rik van Riel <riel@redhat.com>: > On 03/20/2014 11:01 AM, Glauber Costa wrote: >> On Wed, Mar 19, 2014 at 6:42 AM, <lwcheng@cs.hku.hk> wrote: > >>> ------------ >>> [src/kernel/sched/core.c] >>> static void update_rq_clock_task(struct rq *rq, s64 delta) >>> { >>> ... ... >>> #ifdef CONFIG_IRQ_TIME_ACCOUNTING >>> irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; >>> ... ... >>> rq->prev_irq_time += irq_delta; >>> delta -= irq_delta; >>> #endif >>> >>> #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING >>> if (static_key_false((¶virt_steal_rq_enabled))) { >>> steal = paravirt_steal_clock(cpu_of(rq)); >>> steal -= rq->prev_steal_time_rq; >>> ... ... >>> rq->prev_steal_time_rq += steal; >>> delta -= steal; >>> } >>> #endif >>> >>> rq->clock_task += delta; >>> ... ... >>> } >>> -- >>> "delta" -> the intended increment to rq->clock_task >>> "irq_delta" -> the time spent on serving IRQ (hard + soft) >>> "steal" -> the time stolen by the underlying hypervisor >>> -- >>> "irq_delta" is calculated based on sched_clock_cpu(), which is vulnerable >>> to VM scheduling delays. >> >> This looks like a real problem indeed. The main problem in searching >> for a solution, is that of course not all of the irq time is steal >> time and vice versa. In this case, we could subtract irq_time from >> steal, and add only the steal part time that is in excess. I don't >> think this is 100 % guaranteed, but maybe it is a good approximation. >> >> Rik, do you have an opinion on this ? > > The other way around may be better, since steal time (when it > happens) is likely to be of "time slice" magnitude, while irq > time will happen more frequently, and in dozens-of-microseconds > intervals. > > Furthermore, we have no way to measure what the irq time is, > except by looking at how much real time elapsed. For steal time, > however, the hypervisor tells us exactly how much time was stolen. > > That means when steal time and irq time happen simultaneously, > the amount of steal time should always be smaller than the > calculated irq time for that period. > > actual irq_time = calculated irq time - reported steal time; > > -- > All rights reversed > I observe that sometimes irq_time only includes "part" of steal_time. Like you said, irq_time is in dozens-of-microseconds. In VMs, as all devices seen are virtual ones, irq_time seems to be not as desired as it is in physical hosts. A quick (but not radical) solution may be: disable CONFIG_IRQ_TIME_ACCOUNTING if CONFIG_PARAVIRT is selected. Just adopt tick-based accounting: CONFIG_TICK_CPU_ACCOUNTING I am thinking what irq_time really *means* in VMs. -Luwei ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [BUG] Paravirtual time accounting / IRQ time accounting 2014-03-22 7:15 ` lwcheng @ 2014-03-22 14:57 ` Rik van Riel 0 siblings, 0 replies; 8+ messages in thread From: Rik van Riel @ 2014-03-22 14:57 UTC (permalink / raw) To: lwcheng; +Cc: Glauber Costa, Peter Zijlstra, LKML On 03/22/2014 03:15 AM, lwcheng@cs.hku.hk wrote: > I observe that sometimes irq_time only includes "part" of steal_time. > > Like you said, irq_time is in dozens-of-microseconds. In VMs, as all > devices seen are virtual ones, irq_time seems to be not as desired > as it is in physical hosts. > > A quick (but not radical) solution may be: > disable CONFIG_IRQ_TIME_ACCOUNTING if CONFIG_PARAVIRT is selected. > Just adopt tick-based accounting: CONFIG_TICK_CPU_ACCOUNTING > > I am thinking what irq_time really *means* in VMs. That is not really an option for Linux distributions, which tend to run the same kernel image on the host and in the guest... -- All rights reversed ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2014-03-22 14:57 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-03-19 9:42 [BUG] Paravirtual time accounting / IRQ time accounting lwcheng 2014-03-20 15:01 ` Glauber Costa 2014-03-21 5:50 ` Mike Galbraith 2014-03-22 6:47 ` lwcheng 2014-03-22 7:44 ` Mike Galbraith 2014-03-21 11:31 ` Rik van Riel 2014-03-22 7:15 ` lwcheng 2014-03-22 14:57 ` Rik van Riel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox