From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dan Hecht Subject: Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree Date: Tue, 06 Mar 2007 13:07:59 -0800 Message-ID: <45EDD82F.90204@vmware.com> References: <200703060654.l266sVxr014860@shell0.pdx.osdl.net> <45ED16D2.3000202@vmware.com> <20070306084258.GA15745@elte.hu> <20070306084647.GA16280@elte.hu> <45ED2C82.3080008@vmware.com> <1173178774.24738.311.camel@localhost.localdomain> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Return-path: In-Reply-To: <1173178774.24738.311.camel@localhost.localdomain> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: virtualization-bounces@lists.osdl.org Errors-To: virtualization-bounces@lists.osdl.org To: tglx@linutronix.de Cc: Virtualization Mailing List , john stultz , LKML , Ingo Molnar , akpm@linux-foundation.org List-Id: virtualization@lists.linuxfoundation.org On 03/06/2007 02:59 AM, Thomas Gleixner wrote: > On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote: >>> a proper CE device also has the added bonus of making high-res timers = >>> guests work automatically. It should be simple: just pass it through to = >>> your hypervisor, a hyper-CE-device, like a hyper-clocksource device has = >>> essentially no guest-side complexity. >>> = >> It is not so simple. In theory it works great. In reality, the i386 = >> implementation is completely hardwired to work the way hardware works, = >> and breaking the clockevent code out of the deep ties to the APIC is = >> extremely non-trivial. We tried, and could not accomplish it for 2.6.21 = >> because the hrtimers integration was complex, and introduced many bugs = >> for us. > = > Why is this so non-trivial ? All you have to do is _NOT_ register > PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead, > which uses the hypervisor timer emulation instead of real hardware. > = > clockevents breaks the hardwired assumptions of the old timer code and > allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e. > stuff like > = > /* Disable PIT. */ > outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */ > = Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and = we still need to stop it. In any case, clockevents doesn't really make it easier nor harder as far = as init goes. In the pre-clockevent days, we replace setup_pit_timer, = setup_boot_clock, setup_secondary_clock. With clockevents, I think the = hook points are the same. Mostly just need to allow the per-cpu = lapic_event to be generalized to local_clock_events that can be set to = whatever device we want. The other thing on i386 is just some minor = annoyances due initially setting up only the PIT on cpu0 on irq 0 and = then later setting up per-cpu timer on lvtt, and making this all place = nice with paravirt timers. But these are just details and just require = some minor changes and will be working, but it just takes some massaging. So, that is not the real reason to move over the clockevents. The real = reason is to use the generic interrupt handlers. We understand that, = and will get to that point. In the mean time, we are harming no one. = Our code has zero effect when you booting natively or on a non-VMI = hypervisor. >> We worked around this by keeping NO_IDLE_HZ support, which now = >> you deprecated. So now we are using NO_HZ without a hyper-CE device, = >> and it is working fine. We understand the benefits of moving to the CE = >> model - but it cannot be done overnight. > = > This is ugly as hell. NO_HZ enables the dyntick functions in idle(), > irq_enter() and irq_exit() so the clockevents code is actually invoked. > I have not looked close enough why this does work at all. > = I believe this was just a quick fix in response to Ingo breaking the VMI = build yesterday by disabling NO_IDLE_HZ on us. There is no technical = reason why NO_IDLE_HZ=3Dy can't coexist with NO_HZ. (The two work okay together because when using NO_IDLE_HZ, the hooks are = deeper in a custom safe_halt routine which isn't registered when using = nohz mode at runtime, and conversely, the nohz code is guarded at = runtime by the ts->nohz_mode. So, the two really can co-exist at = compile time). Again, no one is arguing that we shouldn't move to clockevents, it's = just a matter of time (sorry, no pun intended). The vmi-time code was introduced to solve some shortcomings of the old = (pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was = especially painful for virtualization. Certainly, = clocksource/clockevents/NO_HZ solves many of the problems (basically, = moving away from counting interrupts to using time sources). e.g. xtime = updating is no longer a worry with the new timeofday/clocksource stuff. = But there are some that may not quite be solved, listed below. (I = know I'm not telling you anything new, but I might as well flesh it out = for the other paravirt folks while the code is fresh in my mind): 1) Stolen time (virtual cpu is ready to run but not running): this is = handled inconsistently between the various clockevent handlers / = CLOCK_EVT_MODE_ONESHOT combinations: a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you = define "periodic" timer in a paravirtual world. If you do something = like Xen-style where you send periodic events only to running vcpus, = then this handler suffers from some of the same problems as the old i386 = timer handler: - jiffies updated according to the number of interrupts you get, so = falls behind monotonic time. generally, counting timer interrupts is = bad for paravirt. - process time updated according to the number of interrupts, so = falls behind monotonic time. This is probably okay though, since it is = essentially tracking (mono - stolen) time. I.e. the missing time is stolen. - jiffies updated only by boot cpu, which is a problem for paravirt = since the boot vcpu can be descheduled while the other vcpus are scheduled. - can probably just avoid this mode by not advertising PERIODIC = capability by your clockevent device. b) tick_handle_periodic / CLOCK_EVT_MODE_ONESHOT: - jiffies updated according to monotonic time since we'll loop to = catch up the oneshot timer. - process time accounted in monotonic time, for the same reason. = this is probably not what we'd want since it will charge time to = whatever process happened to be scheduled in the guest during periods of = stolen time. - Same problem about jiffies only updated by one vcpu. c) tick_nohz_handler (implies CLOCK_EVT_MODE_ONESHOT): - jiffies updated according to monotonic time. this is good. - Process time effectively does not count stolen time since = update_process_times is called once per callback but oneshot expiry is = advanced into the future. This is probably the right thing for = paravirt, but, inconsistent with (b). - jiffies updated by all vcpus, which is good. d) hrtimer_interrupt (really tick_sched_timer): - w.r.t. stolen time, will be similar to (c). We'll advance the = sched_timer to the next period in the future, skipping stolen time for = process accounting. - jiffies will be kept up to date with monotonic time. - jiffies updated by all vcpus, which is good. Summary: Cases (c) and (d) should be relatively well behaved for = paravirt. So, if we can depend on NO_HZ=3Dy and not being disabled at = runtime, we should be okay. We may want to have some knowledge of = stolen time passed from the hypervisor (if we wanted more accurate = process time accounting), but this can be put back in later, and isn't = too important with sample based accounting system like Linux. But, we = still need to QA all four cases, and all four cases will likely expose = different bugs due to second order effects. 2) Virtual interrupts have a relatively high overhead as compared with = native interrupts. So, in vmitime, we wanted to be able to lower the = timer interrupt rate at runtime, even if HZ is a compile time constant = (and set to something high, like 1000hz). While we could hack this in = by using evt->min_delta_ns, it wouldn't really work since process time = accounting would be wrong. Instead, we should allow the = tick_sched_timer in cases (c) and (d) to have runtime configurable = period, and then scale the time value accordingly before passing to = account_system_time. This is probably something the Xen folks will want = also, since I think Xen itself only gets 100hz hard timer, and so it can = implement at best a oneshot virtual timer with 100hz resolution. Any = objections to us doing something like this? 3) clockevent set_next_event interface is suboptimal for paravirt (and = probably realtime-ish uses). The problem is that the expiry is passed = as a relative time. On paravirt, an arbitrary amount of (stolen) time = may have passed since the delta was computed and when the timer device = is programmed, causing that next interrupt to be too far out in the = future. It seems a better interface for set_next_event would be to pass = the current time and the absolute expiry. Actually, I sent email to = Thomas and Ingo about this (and some other clockevents/hrtimer feedback) = in July 2006, but never heard back. Thoughts? I think all the other important points that the vmitime code addressed = are also addressed by clocksource/clockevents/NO_HZ. Finally, while I agree that writing the clockevent callback code is = trivial, we will hit bugs when moving over. It is something we need to = do, but just takes some time for us to test and shake out the bugs. For = example, we are seeing this bug. It seems to me that tick_sched_timer = should not be running in softirq context, but only from the = hrtimer_interrupt. Is that right? I'm not sure how we get into this case. Switched to high resolution mode on CPU 0 ------------[ cut here ]------------ kernel BUG at = /dhecht/linux/testing/torvalds/linux-2.6.21/kernel/posix-cpu-timers.c:1295! invalid opcode: 0000 [#1] SMP Modules linked in: CPU: 0 EIP: 0062:[] Not tainted VLI EFLAGS: 00010202 (2.6.21-rc2-smp #29) EIP is at run_posix_cpu_timers+0x24/0x6a7 eax: 00000200 ebx: c1405be0 ecx: c1405102 edx: c14051d4 esi: c1405ca0 edi: 77ad5c64 ebp: dfed5a50 esp: dfe81db4 ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 006a Process swapper (pid: 1, ti=3Ddfe80000 task=3Ddfed5a50 task.ti=3Ddfe80000) Stack: 00000078 c039e760 0000007d c010a885 28f8c4ec 00000000 dfed5a50 = c14051c0 dfe81df8 c0122ac3 c14051cc 00000003 00000008 c14051d4 dfed5a50 = 00000000 dfe81df4 dfe81df4 c1405be0 c1405ca0 77ad5c64 00000000 c013cd3a = c1405c18 Call Trace: [] sched_clock+0x3d/0x69 [] scheduler_tick+0x52/0xc9 [] tick_sched_timer+0x57/0x9d [] tick_sched_timer+0x0/0x9d [] hrtimer_run_queues+0x1c3/0x21d [] run_timer_softirq+0x21/0x177 [] __do_softirq+0x66/0xca [] do_softirq+0x43/0x51 [] irq_exit+0x38/0x6b [] do_IRQ+0x82/0x99 [] serial8250_console_write+0x129/0x136 [] serial8250_console_putchar+0x0/0x78 [] common_interrupt+0x23/0x30 [] vprintk+0x29b/0x2ca [] idr_get_new_above_int+0x13c/0x216 [] printk+0x1b/0x1f [] audit_init+0x26/0x101 [] ikconfig_init+0x14/0x37 [] init+0x145/0x23e [] ret_from_fork+0x6/0x20 [] syscall_exit+0x5/0x18 [] init+0x0/0x23e [] init+0x0/0x23e [] kernel_thread_helper+0x7/0x10 =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D Code: e9 b0 fe ff ff 5b c3 55 57 56 53 83 ec 48 89 c5 8d 44 24 40 89 44 = 24 40 89 44 24 44 e8 19 9a ec 3b 90 8d 74 26 00 f6 c4 02 74 04 <0f> 0b = eb fe 8b 95 24 01 00 00 85 d2 74 10 8b 85 08 01 00 00 03 EIP: [] run_posix_cpu_timers+0x24/0x6a7 SS:ESP 006a:dfe81db4 Kernel panic - not syncing: Fatal exception in interrupt thanks, Dan