[Patch] x86/HVM: Fix RTC interrupt modelling

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [Patch] x86/HVM: Fix RTC interrupt modelling
@ 2014-02-10 11:17 Andrew Cooper
  2014-02-10 12:19 ` Tim Deegan
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Andrew Cooper @ 2014-02-10 11:17 UTC (permalink / raw)
  To: Xen-devel
  Cc: Keir Fraser, George Dunlap, Andrew Cooper, Tim Deegan,
	Jan Beulich, Roger Pau Monné

This reverts large amounts of:
  9607327abbd3e77bde6cc7b5327f3efd781fc06e
    "x86/HVM: properly handle RTC periodic timer even when !RTC_PIE"
  620d5dad54008e40798c4a0c4322aef274c36fa3
    "x86/HVM: assorted RTC emulation adjustments"

and by extentsion:
  f3347f520cb4d8aa4566182b013c6758d80cbe88
    "x86/HVM: adjust IRQ (de-)assertion"
  c2f79c464849e5f796aa9d1d0f26fe356abd1a1a
    "x86/HVM: fix processing of RTC REG_B writes"
  527824f41f5fac9cba3d4441b2e73d3118d98837
    "x86/hvm: Centralize and simplify the RTC IRQ logic."

The current code has a pathological case, tickled by the access pattern of
Windows 2003 Server SP2.  Occasonally on boot (which I presume is during a
time calibration against the RTC Periodic Timer), Windows gets stuck in an
infinite loop reading RTC REG_C.  This affects 32 and 64 bit guests.

In the pathological case, the VM state looks like this:
  * RTC: 64Hz period, periodic interrupts enabled
  * RTC_IRQ in IOAPIC as vector 0xd1, edge triggered, not pending
  * vector 0xd1 set in LAPIC IRR and ISR, TPR at 0xd0
  * Reads from REG_C return 'RTC_PF | RTC_IRQF'

With an intstrumented Xen, dumping the periodic timers with a guest in this
state shows a single timer with pt->irq_issued=1 and pt->pending_intr_nr=2.

Windows is presumably waiting for reads of REG_C to drop to 0, and reading
REG_C clears the value each time in the emulated RTC.  However:

  * {svm,vmx}_intr_assist() calls pt_update_irq() unconditionally.
  * pt_update_irq() always finds the RTC as earliest_pt.
  * rtc_periodic_interrupt() unconditionally sets RTC_PF in no_ack mode.  It
    returns true, indicating that pt_update_irq() should really inject the
    interrupt.
  * pt_update_irq() decides that it doesn't need to fake up part of
    pt_intr_post() because this is a real interrupt.
  * {svm,vmx}_intr_assist() can't inject the interrupt as it is already
    pending, so exits early without calling pt_intr_post().

The underlying problem here comes because the AF and UF bits of RTC interrupt
state is modelled by the RTC code, but the PF is modelled by the pt code.  The
root cause of windows infinite loop is that RTC_PF is being re-set on vmentry
before the interrupt logic has worked out that it can't actually inject an RTC
interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
should be reading 0.

This patch reverts the RTC_PF logic handling to its former state, whereby
rtc_periodic_cb() is called strictly when the periodic timer logic has
successfully injected a periodic interrupt.  In doing so, it is important that
the RTC code itself never directly triggers an interrupt for the periodic
timer (other than the case when setting REG_B.PIE, where the pt code will have
dropped the interrupt).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Tim Deegan <tim@xen.org>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---

I still dont know exactly what condition causes windows to tickle this
behavour.  It is seen about 1 or 2 times in 9 tests running a 12 hour VM
lifecycle test.  Over the weekend, 100 of these tests have passed without a
single reoccurence of the infinite loop.  The change has also passed a windows
extended regression test, so it would appear that other versions of windows
are still fine with the change.

Roger: as this caused issues for FreeBSD, would you mind testing it again
please?

George: Regarding 4.4 - I request a release ack.  However, this is quite a big
and complicated patch; it took Tim and myself several hours to write, even
given an understanding of the pathalogical case.  Having said that, about half
the patch is just reversions (listed above) with the other half being brand
new logic.

In sumary:
  * As Xen currently stands, w2k3 SP2 (still supported) is liable to fall into
    an infinite loop on boot, because of a bug in Xen's emulation of the RTC.
    Other guests risk the same infinite loop.
  * There is a risk that some of the logic is not quite correct; the RTC
    emulation has proved tricky time and time again.
  * The results from XenRT suggest that the new emulation is better than the
    old.
---
 xen/arch/x86/hvm/rtc.c |   94 ++++++++++++++++++++++++++++++------------------
 xen/arch/x86/hvm/vpt.c |   50 +++-----------------------
 2 files changed, 63 insertions(+), 81 deletions(-)

diff --git a/xen/arch/x86/hvm/rtc.c b/xen/arch/x86/hvm/rtc.c
index cdedefe..ebeb674 100644
--- a/xen/arch/x86/hvm/rtc.c
+++ b/xen/arch/x86/hvm/rtc.c
@@ -59,48 +59,78 @@ static void rtc_set_time(RTCState *s);
 static inline int from_bcd(RTCState *s, int a);
 static inline int convert_hour(RTCState *s, int hour);
 
-static void rtc_update_irq(RTCState *s)
+/*
+ * Send an edge on the RTC ISA IRQ line.  The RTC spec states that it should
+ * be a line level interrupt, but the PIIX3 states that it must be edge
+ * triggered.  We model the RTC using edge semantics.
+ */
+static void rtc_toggle_irq(RTCState *s)
 {
+    hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
+    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
+}
+
+static void rtc_update_regb(RTCState *s, uint8_t new_b)
+{
+    uint8_t new_c = s->hw.cmos_data[RTC_REG_C] & ~RTC_IRQF;
+
     ASSERT(spin_is_locked(&s->lock));
 
-    if ( rtc_mode_is(s, strict) && (s->hw.cmos_data[RTC_REG_C] & RTC_IRQF) )
-        return;
+    if ( new_b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
+        new_c |= RTC_IRQF;
 
-    /* IRQ is raised if any source is both raised & enabled */
-    if ( !(s->hw.cmos_data[RTC_REG_B] &
-           s->hw.cmos_data[RTC_REG_C] &
-           (RTC_PF | RTC_AF | RTC_UF)) )
-        return;
+    s->hw.cmos_data[RTC_REG_B] = new_b;
+    s->hw.cmos_data[RTC_REG_C] = new_c;
 
-    s->hw.cmos_data[RTC_REG_C] |= RTC_IRQF;
-    if ( rtc_mode_is(s, no_ack) )
-        hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
-    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
+    if ( new_c & RTC_IRQF )
+        rtc_toggle_irq(s);
+}
+
+/*
+ * Strictly only called when setting REG_C.{AF,UF}.  The logic depend on
+ * knowing that REB_B is unchanged, and REG_C does not have a falling edge.
+ * 'event' should be RTC_AF or RTC_UF.
+ */
+static void rtc_irq_event(RTCState *s, uint8_t event)
+{
+    uint8_t b = s->hw.cmos_data[RTC_REG_B];
+    uint8_t old_c = s->hw.cmos_data[RTC_REG_C];
+    uint8_t new_c = old_c & ~RTC_IRQF;
+
+    ASSERT(spin_is_locked(&s->lock));
+
+    if ( b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
+        new_c |= RTC_IRQF;
+
+    if ( (b & event) &&
+         (rtc_mode_is(s, no_ack) || !(old_c & RTC_IRQF)) )
+        rtc_toggle_irq(s);
+
+    s->hw.cmos_data[RTC_REG_C] = new_c;
 }
 
-bool_t rtc_periodic_interrupt(void *opaque)
+/*
+ * Callback from the periodic timer state machine, indicating that a periodic
+ * timer interrupt has been injected to the guest on our behalf.
+ */
+static void rtc_periodic_cb(struct vcpu *v, void *opaque)
 {
     RTCState *s = opaque;
-    bool_t ret;
 
     spin_lock(&s->lock);
-    ret = rtc_mode_is(s, no_ack) || !(s->hw.cmos_data[RTC_REG_C] & RTC_IRQF);
-    if ( rtc_mode_is(s, no_ack) || !(s->hw.cmos_data[RTC_REG_C] & RTC_PF) )
-    {
-        s->hw.cmos_data[RTC_REG_C] |= RTC_PF;
-        rtc_update_irq(s);
-    }
-    else if ( ++(s->pt_dead_ticks) >= 10 )
+
+    if ( !rtc_mode_is(s, no_ack) &&
+         (s->hw.cmos_data[RTC_REG_C] & RTC_PF) &&
+         (++(s->pt_dead_ticks) >= 10) )
     {
         /* VM is ignoring its RTC; no point in running the timer */
         destroy_periodic_time(&s->pt);
         s->pt_code = 0;
     }
-    if ( !(s->hw.cmos_data[RTC_REG_C] & RTC_IRQF) )
-        ret = 0;
-    spin_unlock(&s->lock);
 
-    return ret;
+    /* Specifically not raising the irq as it has already been done for us */
+    s->hw.cmos_data[RTC_REG_C] |= RTC_PF | RTC_IRQF;
+    spin_unlock(&s->lock);
 }
 
 /* Enable/configure/disable the periodic timer based on the RTC_PIE and
@@ -135,7 +165,7 @@ static void rtc_timer_update(RTCState *s)
                 else
                     delta = period - ((NOW() - s->start_time) % period);
                 create_periodic_time(v, &s->pt, delta, period,
-                                     RTC_IRQ, NULL, s);
+                                     RTC_IRQ, rtc_periodic_cb, s);
             }
             break;
         }
@@ -211,9 +241,8 @@ static void rtc_update_timer2(void *opaque)
     spin_lock(&s->lock);
     if (!(s->hw.cmos_data[RTC_REG_B] & RTC_SET))
     {
-        s->hw.cmos_data[RTC_REG_C] |= RTC_UF;
         s->hw.cmos_data[RTC_REG_A] &= ~RTC_UIP;
-        rtc_update_irq(s);
+        rtc_irq_event(s, RTC_UF);
         check_update_timer(s);
     }
     spin_unlock(&s->lock);
@@ -402,8 +431,7 @@ static void rtc_alarm_cb(void *opaque)
     spin_lock(&s->lock);
     if (!(s->hw.cmos_data[RTC_REG_B] & RTC_SET))
     {
-        s->hw.cmos_data[RTC_REG_C] |= RTC_AF;
-        rtc_update_irq(s);
+        rtc_irq_event(s, RTC_AF);
         alarm_timer_update(s);
     }
     spin_unlock(&s->lock);
@@ -484,12 +512,11 @@ static int rtc_ioport_write(void *opaque, uint32_t addr, uint32_t data)
             if ( orig & RTC_SET )
                 rtc_set_time(s);
         }
-        s->hw.cmos_data[RTC_REG_B] = data;
         /*
          * If the interrupt is already set when the interrupt becomes
          * enabled, raise an interrupt immediately.
          */
-        rtc_update_irq(s);
+        rtc_update_regb(s, data);
         if ( (data & RTC_PIE) && !(orig & RTC_PIE) )
             rtc_timer_update(s);
         if ( (data ^ orig) & RTC_SET )
@@ -647,9 +674,6 @@ static uint32_t rtc_ioport_read(RTCState *s, uint32_t addr)
     case RTC_REG_C:
         ret = s->hw.cmos_data[s->hw.cmos_index];
         s->hw.cmos_data[RTC_REG_C] = 0x00;
-        if ( (ret & RTC_IRQF) && !rtc_mode_is(s, no_ack) )
-            hvm_isa_irq_deassert(d, RTC_IRQ);
-        rtc_update_irq(s);
         check_update_timer(s);
         alarm_timer_update(s);
         rtc_timer_update(s);
diff --git a/xen/arch/x86/hvm/vpt.c b/xen/arch/x86/hvm/vpt.c
index 1961bda..e12e940 100644
--- a/xen/arch/x86/hvm/vpt.c
+++ b/xen/arch/x86/hvm/vpt.c
@@ -22,7 +22,6 @@
 #include <asm/hvm/vpt.h>
 #include <asm/event.h>
 #include <asm/apic.h>
-#include <asm/mc146818rtc.h>
 
 #define mode_is(d, name) \
     ((d)->arch.hvm_domain.params[HVM_PARAM_TIMER_MODE] == HVMPTM_##name)
@@ -228,23 +227,17 @@ static void pt_timer_fn(void *data)
 int pt_update_irq(struct vcpu *v)
 {
     struct list_head *head = &v->arch.hvm_vcpu.tm_list;
-    struct periodic_time *pt, *temp, *earliest_pt;
-    uint64_t max_lag;
+    struct periodic_time *pt, *temp, *earliest_pt = NULL;
+    uint64_t max_lag = -1ULL;
     int irq, is_lapic;
-    void *pt_priv;
 
- rescan:
     spin_lock(&v->arch.hvm_vcpu.tm_lock);
 
- rescan_locked:
-    earliest_pt = NULL;
-    max_lag = -1ULL;
     list_for_each_entry_safe ( pt, temp, head, list )
     {
         if ( pt->pending_intr_nr )
         {
-            /* RTC code takes care of disabling the timer itself. */
-            if ( (pt->irq != RTC_IRQ || !pt->priv) && pt_irq_masked(pt) )
+            if ( pt_irq_masked(pt) )
             {
                 /* suspend timer emulation */
                 list_del(&pt->list);
@@ -270,47 +263,12 @@ int pt_update_irq(struct vcpu *v)
     earliest_pt->irq_issued = 1;
     irq = earliest_pt->irq;
     is_lapic = (earliest_pt->source == PTSRC_lapic);
-    pt_priv = earliest_pt->priv;
 
     spin_unlock(&v->arch.hvm_vcpu.tm_lock);
 
     if ( is_lapic )
-        vlapic_set_irq(vcpu_vlapic(v), irq, 0);
-    else if ( irq == RTC_IRQ && pt_priv )
     {
-        if ( !rtc_periodic_interrupt(pt_priv) )
-            irq = -1;
-
-        pt_lock(earliest_pt);
-
-        if ( irq < 0 && earliest_pt->pending_intr_nr )
-        {
-            /*
-             * RTC periodic timer runs without the corresponding interrupt
-             * being enabled - need to mimic enough of pt_intr_post() to keep
-             * things going.
-             */
-            earliest_pt->pending_intr_nr = 0;
-            earliest_pt->irq_issued = 0;
-            set_timer(&earliest_pt->timer, earliest_pt->scheduled);
-        }
-        else if ( irq >= 0 && pt_irq_masked(earliest_pt) )
-        {
-            if ( earliest_pt->on_list )
-            {
-                /* suspend timer emulation */
-                list_del(&earliest_pt->list);
-                earliest_pt->on_list = 0;
-            }
-            irq = -1;
-        }
-
-        /* Avoid dropping the lock if we can. */
-        if ( irq < 0 && v == earliest_pt->vcpu )
-            goto rescan_locked;
-        pt_unlock(earliest_pt);
-        if ( irq < 0 )
-            goto rescan;
+        vlapic_set_irq(vcpu_vlapic(v), irq, 0);
     }
     else
     {
-- 
1.7.10.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 11:17 [Patch] x86/HVM: Fix RTC interrupt modelling Andrew Cooper
@ 2014-02-10 12:19 ` Tim Deegan
  2014-02-10 15:17 ` Roger Pau Monné
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Tim Deegan @ 2014-02-10 12:19 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: George Dunlap, Roger Pau Monné, Keir Fraser, Jan Beulich,
	Xen-devel

At 11:17 +0000 on 10 Feb (1392027468), Andrew Cooper wrote:
> The current code has a pathological case, tickled by the access pattern of
> Windows 2003 Server SP2.  Occasonally on boot (which I presume is during a
> time calibration against the RTC Periodic Timer), Windows gets stuck in an
> infinite loop reading RTC REG_C.  This affects 32 and 64 bit guests.
> 
> In the pathological case, the VM state looks like this:
>   * RTC: 64Hz period, periodic interrupts enabled
>   * RTC_IRQ in IOAPIC as vector 0xd1, edge triggered, not pending
>   * vector 0xd1 set in LAPIC IRR and ISR, TPR at 0xd0
>   * Reads from REG_C return 'RTC_PF | RTC_IRQF'
> 
> With an intstrumented Xen, dumping the periodic timers with a guest in this
> state shows a single timer with pt->irq_issued=1 and pt->pending_intr_nr=2.
> 
> Windows is presumably waiting for reads of REG_C to drop to 0

s/presumably/definitely/; we disassembled the kernel code in question.

>, and reading
> REG_C clears the value each time in the emulated RTC.  However:
> 
>   * {svm,vmx}_intr_assist() calls pt_update_irq() unconditionally.
>   * pt_update_irq() always finds the RTC as earliest_pt.

This is, AFAICT, because we are in no-missed-ticks mode, and there are
multiple RTC ticks queued waiting to be delivered.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 11:17 [Patch] x86/HVM: Fix RTC interrupt modelling Andrew Cooper
  2014-02-10 12:19 ` Tim Deegan
@ 2014-02-10 15:17 ` Roger Pau Monné
  2014-02-10 15:33 ` Keir Fraser
  2014-02-10 16:34 ` Jan Beulich
  3 siblings, 0 replies; 22+ messages in thread
From: Roger Pau Monné @ 2014-02-10 15:17 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: George Dunlap, Keir Fraser, Tim Deegan, Jan Beulich

On 10/02/14 12:17, Andrew Cooper wrote:
> This reverts large amounts of:
>   9607327abbd3e77bde6cc7b5327f3efd781fc06e
>     "x86/HVM: properly handle RTC periodic timer even when !RTC_PIE"
>   620d5dad54008e40798c4a0c4322aef274c36fa3
>     "x86/HVM: assorted RTC emulation adjustments"
> 
> and by extentsion:
>   f3347f520cb4d8aa4566182b013c6758d80cbe88
>     "x86/HVM: adjust IRQ (de-)assertion"
>   c2f79c464849e5f796aa9d1d0f26fe356abd1a1a
>     "x86/HVM: fix processing of RTC REG_B writes"
>   527824f41f5fac9cba3d4441b2e73d3118d98837
>     "x86/hvm: Centralize and simplify the RTC IRQ logic."
> 
> The current code has a pathological case, tickled by the access pattern of
> Windows 2003 Server SP2.  Occasonally on boot (which I presume is during a
> time calibration against the RTC Periodic Timer), Windows gets stuck in an
> infinite loop reading RTC REG_C.  This affects 32 and 64 bit guests.
> 
> In the pathological case, the VM state looks like this:
>   * RTC: 64Hz period, periodic interrupts enabled
>   * RTC_IRQ in IOAPIC as vector 0xd1, edge triggered, not pending
>   * vector 0xd1 set in LAPIC IRR and ISR, TPR at 0xd0
>   * Reads from REG_C return 'RTC_PF | RTC_IRQF'
> 
> With an intstrumented Xen, dumping the periodic timers with a guest in this
> state shows a single timer with pt->irq_issued=1 and pt->pending_intr_nr=2.
> 
> Windows is presumably waiting for reads of REG_C to drop to 0, and reading
> REG_C clears the value each time in the emulated RTC.  However:
> 
>   * {svm,vmx}_intr_assist() calls pt_update_irq() unconditionally.
>   * pt_update_irq() always finds the RTC as earliest_pt.
>   * rtc_periodic_interrupt() unconditionally sets RTC_PF in no_ack mode.  It
>     returns true, indicating that pt_update_irq() should really inject the
>     interrupt.
>   * pt_update_irq() decides that it doesn't need to fake up part of
>     pt_intr_post() because this is a real interrupt.
>   * {svm,vmx}_intr_assist() can't inject the interrupt as it is already
>     pending, so exits early without calling pt_intr_post().
> 
> The underlying problem here comes because the AF and UF bits of RTC interrupt
> state is modelled by the RTC code, but the PF is modelled by the pt code.  The
> root cause of windows infinite loop is that RTC_PF is being re-set on vmentry
> before the interrupt logic has worked out that it can't actually inject an RTC
> interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
> should be reading 0.
> 
> This patch reverts the RTC_PF logic handling to its former state, whereby
> rtc_periodic_cb() is called strictly when the periodic timer logic has
> successfully injected a periodic interrupt.  In doing so, it is important that
> the RTC code itself never directly triggers an interrupt for the periodic
> timer (other than the case when setting REG_B.PIE, where the pt code will have
> dropped the interrupt).
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Tim Deegan <tim@xen.org>
> CC: Keir Fraser <keir@xen.org>
> CC: Jan Beulich <JBeulich@suse.com>
> CC: George Dunlap <george.dunlap@eu.citrix.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> ---
> 
> I still dont know exactly what condition causes windows to tickle this
> behavour.  It is seen about 1 or 2 times in 9 tests running a 12 hour VM
> lifecycle test.  Over the weekend, 100 of these tests have passed without a
> single reoccurence of the infinite loop.  The change has also passed a windows
> extended regression test, so it would appear that other versions of windows
> are still fine with the change.
> 
> Roger: as this caused issues for FreeBSD, would you mind testing it again
> please?

Tested-by: Roger Pau Monné <roger.pau@citrix.com>
On FreeBSD 10.0, 9.2 and 8.4

No apparent regressions AFAICT.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 11:17 [Patch] x86/HVM: Fix RTC interrupt modelling Andrew Cooper
  2014-02-10 12:19 ` Tim Deegan
  2014-02-10 15:17 ` Roger Pau Monné
@ 2014-02-10 15:33 ` Keir Fraser
  2014-02-10 16:34 ` Jan Beulich
  3 siblings, 0 replies; 22+ messages in thread
From: Keir Fraser @ 2014-02-10 15:33 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: George Dunlap, Tim Deegan, Jan Beulich, Roger Pau Monné

On 10/02/2014 11:17, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:

> George: Regarding 4.4 - I request a release ack.  However, this is quite a big
> and complicated patch; it took Tim and myself several hours to write, even
> given an understanding of the pathalogical case.  Having said that, about half
> the patch is just reversions (listed above) with the other half being brand
> new logic.

It does at least look to simplify the code, and make it clearer, as well as
fixing this bug.

Acked-by: Keir Fraser <keir@xen.org>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 11:17 [Patch] x86/HVM: Fix RTC interrupt modelling Andrew Cooper
                   ` (2 preceding siblings ...)
  2014-02-10 15:33 ` Keir Fraser
@ 2014-02-10 16:34 ` Jan Beulich
  2014-02-10 17:13   ` Andrew Cooper
                     ` (2 more replies)
  3 siblings, 3 replies; 22+ messages in thread
From: Jan Beulich @ 2014-02-10 16:34 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel, KeirFraser, roger.pau

>>> On 10.02.14 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> This reverts large amounts of:
>   9607327abbd3e77bde6cc7b5327f3efd781fc06e
>     "x86/HVM: properly handle RTC periodic timer even when !RTC_PIE"
>   620d5dad54008e40798c4a0c4322aef274c36fa3
>     "x86/HVM: assorted RTC emulation adjustments"
> 
> and by extentsion:
>   f3347f520cb4d8aa4566182b013c6758d80cbe88
>     "x86/HVM: adjust IRQ (de-)assertion"
>   c2f79c464849e5f796aa9d1d0f26fe356abd1a1a
>     "x86/HVM: fix processing of RTC REG_B writes"
>   527824f41f5fac9cba3d4441b2e73d3118d98837
>     "x86/hvm: Centralize and simplify the RTC IRQ logic."

So what does "by extension" mean here? Are these being
reverted?

> The current code has a pathological case, tickled by the access pattern of
> Windows 2003 Server SP2.  Occasonally on boot (which I presume is during a
> time calibration against the RTC Periodic Timer), Windows gets stuck in an
> infinite loop reading RTC REG_C.  This affects 32 and 64 bit guests.
> 
> In the pathological case, the VM state looks like this:
>   * RTC: 64Hz period, periodic interrupts enabled
>   * RTC_IRQ in IOAPIC as vector 0xd1, edge triggered, not pending
>   * vector 0xd1 set in LAPIC IRR and ISR, TPR at 0xd0
>   * Reads from REG_C return 'RTC_PF | RTC_IRQF'
> 
> With an intstrumented Xen, dumping the periodic timers with a guest in this
> state shows a single timer with pt->irq_issued=1 and pt->pending_intr_nr=2.
> 
> Windows is presumably waiting for reads of REG_C to drop to 0, and reading
> REG_C clears the value each time in the emulated RTC.  However:
> 
>   * {svm,vmx}_intr_assist() calls pt_update_irq() unconditionally.
>   * pt_update_irq() always finds the RTC as earliest_pt.
>   * rtc_periodic_interrupt() unconditionally sets RTC_PF in no_ack mode.  It
>     returns true, indicating that pt_update_irq() should really inject the
>     interrupt.
>   * pt_update_irq() decides that it doesn't need to fake up part of
>     pt_intr_post() because this is a real interrupt.
>   * {svm,vmx}_intr_assist() can't inject the interrupt as it is already
>     pending, so exits early without calling pt_intr_post().
> 
> The underlying problem here comes because the AF and UF bits of RTC 
> interrupt
> state is modelled by the RTC code, but the PF is modelled by the pt code.  
> The
> root cause of windows infinite loop is that RTC_PF is being re-set on 
> vmentry
> before the interrupt logic has worked out that it can't actually inject an 
> RTC
> interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
> should be reading 0.

So you're undoing a whole lot of changes done with the goal of
getting the overall emulation closer to what real hardware does,
just to paper over an issue elsewhere in the code? Not really an
approach I'm in favor of.

>   * The results from XenRT suggest that the new emulation is better than the
>     old.

"Better" in the sense of the limited set of uses of the virtual hardware
by whatever selection of guest OSes is being run there. But very
likely not "better" in the sense on matching up with how the respective
hardware specification would require it to behave.

> --- a/xen/arch/x86/hvm/rtc.c
> +++ b/xen/arch/x86/hvm/rtc.c
> @@ -59,48 +59,78 @@ static void rtc_set_time(RTCState *s);
>  static inline int from_bcd(RTCState *s, int a);
>  static inline int convert_hour(RTCState *s, int hour);
>  
> -static void rtc_update_irq(RTCState *s)
> +/*
> + * Send an edge on the RTC ISA IRQ line.  The RTC spec states that it should
> + * be a line level interrupt, but the PIIX3 states that it must be edge
> + * triggered.  We model the RTC using edge semantics.
> + */
> +static void rtc_toggle_irq(RTCState *s)
>  {
> +    hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
> +    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
> +}
> +
> +static void rtc_update_regb(RTCState *s, uint8_t new_b)
> +{
> +    uint8_t new_c = s->hw.cmos_data[RTC_REG_C] & ~RTC_IRQF;
> +
>      ASSERT(spin_is_locked(&s->lock));
>  
> -    if ( rtc_mode_is(s, strict) && (s->hw.cmos_data[RTC_REG_C] & RTC_IRQF) )
> -        return;
> +    if ( new_b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
> +        new_c |= RTC_IRQF;

Without going back to reading the spec, iirc RTC_IRQF is a sticky
bit cleared _only_ be REG_C reads, i.e. you shouldn't clear it
earlier in the function and then conditionally set it here.

>  
> -    /* IRQ is raised if any source is both raised & enabled */
> -    if ( !(s->hw.cmos_data[RTC_REG_B] &
> -           s->hw.cmos_data[RTC_REG_C] &
> -           (RTC_PF | RTC_AF | RTC_UF)) )
> -        return;
> +    s->hw.cmos_data[RTC_REG_B] = new_b;
> +    s->hw.cmos_data[RTC_REG_C] = new_c;
>  
> -    s->hw.cmos_data[RTC_REG_C] |= RTC_IRQF;
> -    if ( rtc_mode_is(s, no_ack) )
> -        hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
> -    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
> +    if ( new_c & RTC_IRQF )
> +        rtc_toggle_irq(s);

Which then implies that the condition here would also need to
consider the old state of the flag.

> +static void rtc_irq_event(RTCState *s, uint8_t event)
> +{
> +    uint8_t b = s->hw.cmos_data[RTC_REG_B];
> +    uint8_t old_c = s->hw.cmos_data[RTC_REG_C];
> +    uint8_t new_c = old_c & ~RTC_IRQF;
> +
> +    ASSERT(spin_is_locked(&s->lock));
> +
> +    if ( b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
> +        new_c |= RTC_IRQF;

Same comment as above.

> @@ -647,9 +674,6 @@ static uint32_t rtc_ioport_read(RTCState *s, uint32_t addr)
>      case RTC_REG_C:
>          ret = s->hw.cmos_data[s->hw.cmos_index];
>          s->hw.cmos_data[RTC_REG_C] = 0x00;
> -        if ( (ret & RTC_IRQF) && !rtc_mode_is(s, no_ack) )
> -            hvm_isa_irq_deassert(d, RTC_IRQ);

Why? With RTC_IRQF going from 1 to 0, the interrupt line should
get de-asserted.

> -        rtc_update_irq(s);

So given the problem description, this would seem to be the most
important part at a first glance. But looking more closely, I'm getting
the impression that the call to rtc_update_irq() had no effect at all
here anyway: The function would always bail on the second if() due
to REG_C having got cleared a few lines up.

> @@ -270,47 +263,12 @@ int pt_update_irq(struct vcpu *v)
>      earliest_pt->irq_issued = 1;
>      irq = earliest_pt->irq;
>      is_lapic = (earliest_pt->source == PTSRC_lapic);
> -    pt_priv = earliest_pt->priv;
>  
>      spin_unlock(&v->arch.hvm_vcpu.tm_lock);
>  
>      if ( is_lapic )
> -        vlapic_set_irq(vcpu_vlapic(v), irq, 0);
> -    else if ( irq == RTC_IRQ && pt_priv )
>      {
> -        if ( !rtc_periodic_interrupt(pt_priv) )
> -            irq = -1;
> -
> -        pt_lock(earliest_pt);
> -
> -        if ( irq < 0 && earliest_pt->pending_intr_nr )
> -        {
> -            /*
> -             * RTC periodic timer runs without the corresponding interrupt
> -             * being enabled - need to mimic enough of pt_intr_post() to keep
> -             * things going.
> -             */
> -            earliest_pt->pending_intr_nr = 0;
> -            earliest_pt->irq_issued = 0;
> -            set_timer(&earliest_pt->timer, earliest_pt->scheduled);
> -        }
> -        else if ( irq >= 0 && pt_irq_masked(earliest_pt) )
> -        {
> -            if ( earliest_pt->on_list )
> -            {
> -                /* suspend timer emulation */
> -                list_del(&earliest_pt->list);
> -                earliest_pt->on_list = 0;
> -            }
> -            irq = -1;
> -        }
> -
> -        /* Avoid dropping the lock if we can. */
> -        if ( irq < 0 && v == earliest_pt->vcpu )
> -            goto rescan_locked;
> -        pt_unlock(earliest_pt);
> -        if ( irq < 0 )
> -            goto rescan;
> +        vlapic_set_irq(vcpu_vlapic(v), irq, 0);

If you didn't put this single function call in braces, the patch would
become more clear, as it would then be exactly the "else if()"
branch that got removed by it.

As a round-up: I'm not going to veto this, but I'm also not going to
be putting my name under it, nor am I going to make another
attempt to clean up the RTC emulation if this is to go in unchanged.
I'm personally getting the impression that the root cause of the
observed problem is still being left in place (and perhaps still not
being fully understood), and hence this whole change goes in the
wrong direction, _even_ if it makes the problem it is aiming at
fixing indeed appear to go away.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 16:34 ` Jan Beulich
@ 2014-02-10 17:13   ` Andrew Cooper
  2014-02-11  9:28     ` Jan Beulich
  2014-02-10 17:21   ` Tim Deegan
  2014-02-10 17:46   ` George Dunlap
  2 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2014-02-10 17:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Tim Deegan, Xen-devel, KeirFraser, roger.pau

On 10/02/14 16:34, Jan Beulich wrote:
>>>> On 10.02.14 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> This reverts large amounts of:
>>   9607327abbd3e77bde6cc7b5327f3efd781fc06e
>>     "x86/HVM: properly handle RTC periodic timer even when !RTC_PIE"
>>   620d5dad54008e40798c4a0c4322aef274c36fa3
>>     "x86/HVM: assorted RTC emulation adjustments"
>>
>> and by extentsion:
>>   f3347f520cb4d8aa4566182b013c6758d80cbe88
>>     "x86/HVM: adjust IRQ (de-)assertion"
>>   c2f79c464849e5f796aa9d1d0f26fe356abd1a1a
>>     "x86/HVM: fix processing of RTC REG_B writes"
>>   527824f41f5fac9cba3d4441b2e73d3118d98837
>>     "x86/hvm: Centralize and simplify the RTC IRQ logic."
> So what does "by extension" mean here? Are these being
> reverted?

The logic here was based on the logic being reverted, so the changes
themselves are mostly gone.

>
>> The current code has a pathological case, tickled by the access pattern of
>> Windows 2003 Server SP2.  Occasonally on boot (which I presume is during a
>> time calibration against the RTC Periodic Timer), Windows gets stuck in an
>> infinite loop reading RTC REG_C.  This affects 32 and 64 bit guests.
>>
>> In the pathological case, the VM state looks like this:
>>   * RTC: 64Hz period, periodic interrupts enabled
>>   * RTC_IRQ in IOAPIC as vector 0xd1, edge triggered, not pending
>>   * vector 0xd1 set in LAPIC IRR and ISR, TPR at 0xd0
>>   * Reads from REG_C return 'RTC_PF | RTC_IRQF'
>>
>> With an intstrumented Xen, dumping the periodic timers with a guest in this
>> state shows a single timer with pt->irq_issued=1 and pt->pending_intr_nr=2.
>>
>> Windows is presumably waiting for reads of REG_C to drop to 0, and reading
>> REG_C clears the value each time in the emulated RTC.  However:
>>
>>   * {svm,vmx}_intr_assist() calls pt_update_irq() unconditionally.
>>   * pt_update_irq() always finds the RTC as earliest_pt.
>>   * rtc_periodic_interrupt() unconditionally sets RTC_PF in no_ack mode.  It
>>     returns true, indicating that pt_update_irq() should really inject the
>>     interrupt.
>>   * pt_update_irq() decides that it doesn't need to fake up part of
>>     pt_intr_post() because this is a real interrupt.
>>   * {svm,vmx}_intr_assist() can't inject the interrupt as it is already
>>     pending, so exits early without calling pt_intr_post().
>>
>> The underlying problem here comes because the AF and UF bits of RTC 
>> interrupt
>> state is modelled by the RTC code, but the PF is modelled by the pt code.  
>> The
>> root cause of windows infinite loop is that RTC_PF is being re-set on 
>> vmentry
>> before the interrupt logic has worked out that it can't actually inject an 
>> RTC
>> interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
>> should be reading 0.
> So you're undoing a whole lot of changes done with the goal of
> getting the overall emulation closer to what real hardware does,
> just to paper over an issue elsewhere in the code? Not really an
> approach I'm in favor of.

I fail to see how the current code is closer to what hardware does than
this proposed patch.

>
>>   * The results from XenRT suggest that the new emulation is better than the
>>     old.
> "Better" in the sense of the limited set of uses of the virtual hardware
> by whatever selection of guest OSes is being run there. But very
> likely not "better" in the sense on matching up with how the respective
> hardware specification would require it to behave.
>
>> --- a/xen/arch/x86/hvm/rtc.c
>> +++ b/xen/arch/x86/hvm/rtc.c
>> @@ -59,48 +59,78 @@ static void rtc_set_time(RTCState *s);
>>  static inline int from_bcd(RTCState *s, int a);
>>  static inline int convert_hour(RTCState *s, int hour);
>>  
>> -static void rtc_update_irq(RTCState *s)
>> +/*
>> + * Send an edge on the RTC ISA IRQ line.  The RTC spec states that it should
>> + * be a line level interrupt, but the PIIX3 states that it must be edge
>> + * triggered.  We model the RTC using edge semantics.
>> + */
>> +static void rtc_toggle_irq(RTCState *s)
>>  {
>> +    hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
>> +    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
>> +}
>> +
>> +static void rtc_update_regb(RTCState *s, uint8_t new_b)
>> +{
>> +    uint8_t new_c = s->hw.cmos_data[RTC_REG_C] & ~RTC_IRQF;
>> +
>>      ASSERT(spin_is_locked(&s->lock));
>>  
>> -    if ( rtc_mode_is(s, strict) && (s->hw.cmos_data[RTC_REG_C] & RTC_IRQF) )
>> -        return;
>> +    if ( new_b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
>> +        new_c |= RTC_IRQF;
> Without going back to reading the spec, iirc RTC_IRQF is a sticky
> bit cleared _only_ be REG_C reads, i.e. you shouldn't clear it
> earlier in the function and then conditionally set it here.

The main problem is that the MC146818 states that the interrupt is line
level, while the PIIX3 mandates that it is edge triggered, and our ACPI
tables define it to be edge triggered.

The datasheet states "The IRQF bit in Register C is a '1' whenever the
¬IRQ pin is being driven low".  With edge semantics where the line never
actually stays asserted, this would degrade to being sticky until read.

However, with edge semantics, we need to send new interrupts when
enabling bits in REG_B even if IRQF is already outstanding, which is why
the logic starts by masking it back out.  Furthermore, in no-ack mode,
we expect IRQF to be set (as the guest isn't reading REG_C), but still
wanting interrupts.

Perhaps IRQF should be or'd back together when writing the state back.

>
>>  
>> -    /* IRQ is raised if any source is both raised & enabled */
>> -    if ( !(s->hw.cmos_data[RTC_REG_B] &
>> -           s->hw.cmos_data[RTC_REG_C] &
>> -           (RTC_PF | RTC_AF | RTC_UF)) )
>> -        return;
>> +    s->hw.cmos_data[RTC_REG_B] = new_b;
>> +    s->hw.cmos_data[RTC_REG_C] = new_c;
>>  
>> -    s->hw.cmos_data[RTC_REG_C] |= RTC_IRQF;
>> -    if ( rtc_mode_is(s, no_ack) )
>> -        hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
>> -    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
>> +    if ( new_c & RTC_IRQF )
>> +        rtc_toggle_irq(s);
> Which then implies that the condition here would also need to
> consider the old state of the flag.
>
>> +static void rtc_irq_event(RTCState *s, uint8_t event)
>> +{
>> +    uint8_t b = s->hw.cmos_data[RTC_REG_B];
>> +    uint8_t old_c = s->hw.cmos_data[RTC_REG_C];
>> +    uint8_t new_c = old_c & ~RTC_IRQF;
>> +
>> +    ASSERT(spin_is_locked(&s->lock));
>> +
>> +    if ( b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
>> +        new_c |= RTC_IRQF;
> Same comment as above.
>
>> @@ -647,9 +674,6 @@ static uint32_t rtc_ioport_read(RTCState *s, uint32_t addr)
>>      case RTC_REG_C:
>>          ret = s->hw.cmos_data[s->hw.cmos_index];
>>          s->hw.cmos_data[RTC_REG_C] = 0x00;
>> -        if ( (ret & RTC_IRQF) && !rtc_mode_is(s, no_ack) )
>> -            hvm_isa_irq_deassert(d, RTC_IRQ);
> Why? With RTC_IRQF going from 1 to 0, the interrupt line should
> get de-asserted.
>
>> -        rtc_update_irq(s);
> So given the problem description, this would seem to be the most
> important part at a first glance. But looking more closely, I'm getting
> the impression that the call to rtc_update_irq() had no effect at all
> here anyway: The function would always bail on the second if() due
> to REG_C having got cleared a few lines up.
>
>> @@ -270,47 +263,12 @@ int pt_update_irq(struct vcpu *v)
>>      earliest_pt->irq_issued = 1;
>>      irq = earliest_pt->irq;
>>      is_lapic = (earliest_pt->source == PTSRC_lapic);
>> -    pt_priv = earliest_pt->priv;
>>  
>>      spin_unlock(&v->arch.hvm_vcpu.tm_lock);
>>  
>>      if ( is_lapic )
>> -        vlapic_set_irq(vcpu_vlapic(v), irq, 0);
>> -    else if ( irq == RTC_IRQ && pt_priv )
>>      {
>> -        if ( !rtc_periodic_interrupt(pt_priv) )
>> -            irq = -1;
>> -
>> -        pt_lock(earliest_pt);
>> -
>> -        if ( irq < 0 && earliest_pt->pending_intr_nr )
>> -        {
>> -            /*
>> -             * RTC periodic timer runs without the corresponding interrupt
>> -             * being enabled - need to mimic enough of pt_intr_post() to keep
>> -             * things going.
>> -             */
>> -            earliest_pt->pending_intr_nr = 0;
>> -            earliest_pt->irq_issued = 0;
>> -            set_timer(&earliest_pt->timer, earliest_pt->scheduled);
>> -        }
>> -        else if ( irq >= 0 && pt_irq_masked(earliest_pt) )
>> -        {
>> -            if ( earliest_pt->on_list )
>> -            {
>> -                /* suspend timer emulation */
>> -                list_del(&earliest_pt->list);
>> -                earliest_pt->on_list = 0;
>> -            }
>> -            irq = -1;
>> -        }
>> -
>> -        /* Avoid dropping the lock if we can. */
>> -        if ( irq < 0 && v == earliest_pt->vcpu )
>> -            goto rescan_locked;
>> -        pt_unlock(earliest_pt);
>> -        if ( irq < 0 )
>> -            goto rescan;
>> +        vlapic_set_irq(vcpu_vlapic(v), irq, 0);
> If you didn't put this single function call in braces, the patch would
> become more clear, as it would then be exactly the "else if()"
> branch that got removed by it.

Right - This part of the patch was produced with reversion alone, but
dropping the braces would make it clearer.

>
> As a round-up: I'm not going to veto this, but I'm also not going to
> be putting my name under it, nor am I going to make another
> attempt to clean up the RTC emulation if this is to go in unchanged.
> I'm personally getting the impression that the root cause of the
> observed problem is still being left in place (and perhaps still not
> being fully understood), and hence this whole change goes in the
> wrong direction, _even_ if it makes the problem it is aiming at
> fixing indeed appear to go away.
>
> Jan

I am not sure I follow you.  The root case of the infinite loop is
because Xen was erroniously setting REG_C.PF, because pt_update_irq()
was trying to pre-guess what the interrupt injection logic would do,
(and getting it wrong).

~Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 17:13   ` Andrew Cooper
@ 2014-02-11  9:28     ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2014-02-11  9:28 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel, KeirFraser, roger.pau

>>> On 10.02.14 at 18:13, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 10/02/14 16:34, Jan Beulich wrote:
>>>>> On 10.02.14 at 12:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> This reverts large amounts of:
>>>   9607327abbd3e77bde6cc7b5327f3efd781fc06e
>>>     "x86/HVM: properly handle RTC periodic timer even when !RTC_PIE"
>>>   620d5dad54008e40798c4a0c4322aef274c36fa3
>>>     "x86/HVM: assorted RTC emulation adjustments"
>>>
>>> and by extentsion:
>>>   f3347f520cb4d8aa4566182b013c6758d80cbe88
>>>     "x86/HVM: adjust IRQ (de-)assertion"
>>>   c2f79c464849e5f796aa9d1d0f26fe356abd1a1a
>>>     "x86/HVM: fix processing of RTC REG_B writes"
>>>   527824f41f5fac9cba3d4441b2e73d3118d98837
>>>     "x86/hvm: Centralize and simplify the RTC IRQ logic."
>> So what does "by extension" mean here? Are these being
>> reverted?
> 
> The logic here was based on the logic being reverted, so the changes
> themselves are mostly gone.
> 
>>
>>> The current code has a pathological case, tickled by the access pattern of
>>> Windows 2003 Server SP2.  Occasonally on boot (which I presume is during a
>>> time calibration against the RTC Periodic Timer), Windows gets stuck in an
>>> infinite loop reading RTC REG_C.  This affects 32 and 64 bit guests.
>>>
>>> In the pathological case, the VM state looks like this:
>>>   * RTC: 64Hz period, periodic interrupts enabled
>>>   * RTC_IRQ in IOAPIC as vector 0xd1, edge triggered, not pending
>>>   * vector 0xd1 set in LAPIC IRR and ISR, TPR at 0xd0
>>>   * Reads from REG_C return 'RTC_PF | RTC_IRQF'
>>>
>>> With an intstrumented Xen, dumping the periodic timers with a guest in this
>>> state shows a single timer with pt->irq_issued=1 and pt->pending_intr_nr=2.
>>>
>>> Windows is presumably waiting for reads of REG_C to drop to 0, and reading
>>> REG_C clears the value each time in the emulated RTC.  However:
>>>
>>>   * {svm,vmx}_intr_assist() calls pt_update_irq() unconditionally.
>>>   * pt_update_irq() always finds the RTC as earliest_pt.
>>>   * rtc_periodic_interrupt() unconditionally sets RTC_PF in no_ack mode.  It
>>>     returns true, indicating that pt_update_irq() should really inject the
>>>     interrupt.
>>>   * pt_update_irq() decides that it doesn't need to fake up part of
>>>     pt_intr_post() because this is a real interrupt.
>>>   * {svm,vmx}_intr_assist() can't inject the interrupt as it is already
>>>     pending, so exits early without calling pt_intr_post().
>>>
>>> The underlying problem here comes because the AF and UF bits of RTC 
>>> interrupt
>>> state is modelled by the RTC code, but the PF is modelled by the pt code.  
>>> The
>>> root cause of windows infinite loop is that RTC_PF is being re-set on 
>>> vmentry
>>> before the interrupt logic has worked out that it can't actually inject an 
>>> RTC
>>> interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
>>> should be reading 0.
>> So you're undoing a whole lot of changes done with the goal of
>> getting the overall emulation closer to what real hardware does,
>> just to paper over an issue elsewhere in the code? Not really an
>> approach I'm in favor of.
> 
> I fail to see how the current code is closer to what hardware does than
> this proposed patch.

The thing is that if any of the changes you revert (fully or partly)
was actively wrong, you should have broken up the patch such
that it reverts the broken pieces in individual steps. The way I see
the situation right now is that you effectively say "individually all
those changes have been fine, but collectively they have a
problem which we can't really isolate, so let's revert them altogether".

The most prominent example of further diverging from actual
hardware behavior is the correct driving of PF without PIE, as
just explained in another response on this thread.

>>> -    if ( rtc_mode_is(s, strict) && (s->hw.cmos_data[RTC_REG_C] & RTC_IRQF) )
>>> -        return;
>>> +    if ( new_b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
>>> +        new_c |= RTC_IRQF;
>> Without going back to reading the spec, iirc RTC_IRQF is a sticky
>> bit cleared _only_ be REG_C reads, i.e. you shouldn't clear it
>> earlier in the function and then conditionally set it here.
> 
> The main problem is that the MC146818 states that the interrupt is line
> level, while the PIIX3 mandates that it is edge triggered, and our ACPI
> tables define it to be edge triggered.
> 
> The datasheet states "The IRQF bit in Register C is a '1' whenever the
> ¬IRQ pin is being driven low".  With edge semantics where the line never
> actually stays asserted, this would degrade to being sticky until read.

Edge semantics say nothing about the line remaining asserted, i.e.
an edge triggered interrupt could have its line remaining asserted
for all the time except when a new interrupt is needed, at which
point it would get briefly de-asserted and then asserted again.
Hence the interrupt assertion requirements of PIIX3 and those of
the MC146818 aren't really contradicting with one another. And
our code (in no-ack mode) mostly does what I just described: It
de-asserts the line briefly before re-asserting it. In "strict" mode
the de-assertion happens when IRQF gets cleared.

> However, with edge semantics, we need to send new interrupts when
> enabling bits in REG_B even if IRQF is already outstanding,

Why?

> which is why
> the logic starts by masking it back out.  Furthermore, in no-ack mode,
> we expect IRQF to be set (as the guest isn't reading REG_C), but still
> wanting interrupts.

Which is precisely what rtc_update_irq() does.

>> As a round-up: I'm not going to veto this, but I'm also not going to
>> be putting my name under it, nor am I going to make another
>> attempt to clean up the RTC emulation if this is to go in unchanged.
>> I'm personally getting the impression that the root cause of the
>> observed problem is still being left in place (and perhaps still not
>> being fully understood), and hence this whole change goes in the
>> wrong direction, _even_ if it makes the problem it is aiming at
>> fixing indeed appear to go away.
> 
> I am not sure I follow you.  The root case of the infinite loop is
> because Xen was erroniously setting REG_C.PF, because pt_update_irq()
> was trying to pre-guess what the interrupt injection logic would do,
> (and getting it wrong).

So it would be - as also said in the response to Tim - that
interaction between vpt and rtc code that needs further
tweaking.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 16:34 ` Jan Beulich
  2014-02-10 17:13   ` Andrew Cooper
@ 2014-02-10 17:21   ` Tim Deegan
  2014-02-11  9:15     ` Jan Beulich
  2014-02-10 17:46   ` George Dunlap
  2 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2014-02-10 17:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

At 16:34 +0000 on 10 Feb (1392046444), Jan Beulich wrote:
> > The underlying problem here comes because the AF and UF bits of RTC 
> > interrupt
> > state is modelled by the RTC code, but the PF is modelled by the pt code.  
> > The
> > root cause of windows infinite loop is that RTC_PF is being re-set on 
> > vmentry
> > before the interrupt logic has worked out that it can't actually inject an 
> > RTC
> > interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
> > should be reading 0.
> 
> So you're undoing a whole lot of changes done with the goal of
> getting the overall emulation closer to what real hardware does,
> just to paper over an issue elsewhere in the code? Not really an
> approach I'm in favor of.

My understanding was that the problem is explicitly in the change to
how RTC code is called from vpt code.

Originally, the RTC callback was called from pt_intr_post, like other
vpt sources.  Your rework changed it to be called much earlier, when
the vpt was considering which time source to choose.  AIUI that was to
let the RTC code tell the VPT not to inject, if the guest hasn't acked
the last interrupt, right?

Since that was changed later to allow a certain number of dead ticks
before deciding to stop the timer chain, the decision no longer has to
be made so early -- we can allow one more IRQ to go in and then
disable it. 

That is the main change of this cset:  we go back to driving
the interrupt from the vpt code and fixing up the RTC state after vpt
tells us it's injected an interrupt.

> > +    if ( new_b & new_c & (RTC_PF | RTC_AF | RTC_UF) )
> > +        new_c |= RTC_IRQF;
> 
> Without going back to reading the spec, iirc RTC_IRQF is a sticky
> bit cleared _only_ be REG_C reads, i.e. you shouldn't clear it
> earlier in the function and then conditionally set it here.

All those bits are sticky, so if IRQF was set before it will be now;
but true that we could drop the mask at the top of the function.
(+ again in rtc_irq_event)

> >  
> > -    /* IRQ is raised if any source is both raised & enabled */
> > -    if ( !(s->hw.cmos_data[RTC_REG_B] &
> > -           s->hw.cmos_data[RTC_REG_C] &
> > -           (RTC_PF | RTC_AF | RTC_UF)) )
> > -        return;
> > +    s->hw.cmos_data[RTC_REG_B] = new_b;
> > +    s->hw.cmos_data[RTC_REG_C] = new_c;
> >  
> > -    s->hw.cmos_data[RTC_REG_C] |= RTC_IRQF;
> > -    if ( rtc_mode_is(s, no_ack) )
> > -        hvm_isa_irq_deassert(vrtc_domain(s), RTC_IRQ);
> > -    hvm_isa_irq_assert(vrtc_domain(s), RTC_IRQ);
> > +    if ( new_c & RTC_IRQF )
> > +        rtc_toggle_irq(s);
> 
> Which then implies that the condition here would also need to
> consider the old state of the flag.

Sure.

> > @@ -647,9 +674,6 @@ static uint32_t rtc_ioport_read(RTCState *s, uint32_t addr)
> >      case RTC_REG_C:
> >          ret = s->hw.cmos_data[s->hw.cmos_index];
> >          s->hw.cmos_data[RTC_REG_C] = 0x00;
> > -        if ( (ret & RTC_IRQF) && !rtc_mode_is(s, no_ack) )
> > -            hvm_isa_irq_deassert(d, RTC_IRQ);
> 
> Why? With RTC_IRQF going from 1 to 0, the interrupt line should
> get de-asserted.

After this patch the RTC model only sends edges (as mentioned in the
description, maybe not clearly enough).  I think that might be
caused by some confusion on my part.  I'll go into it a bit more below.

> > -        rtc_update_irq(s);
> 
> So given the problem description, this would seem to be the most
> important part at a first glance. But looking more closely, I'm getting
> the impression that the call to rtc_update_irq() had no effect at all
> here anyway: The function would always bail on the second if() due
> to REG_C having got cleared a few lines up.

Yeah, this has nothing to do with the bug being fixed here.  The old
REG_C read was operating correctly, but on the return-to-guest path:
 - vpt sees another RTC interrupt is due and calls RTC code
 - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
 - vlapic code sees the last interrupt is still in the ISR and does
   nothing;
 - we return to the guest having set IRQF but not consumed a timer
   event, so vpt stste is the same
 - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
   waiting for a read of 0.
 - repeat forever.

AFAICS this is caused by the disconnect where the RTC tries to
reinject but doesn't use up a tick.  A real RTC doesn't have that
problem, because it doesn't have all the scheduling artefacts that the
VPT code is intended to work around.

So IMO the right fix is to go back to letting the VPT code control IRQ
injection for the PF source, which is what this patch does.

(I don't consider it to be a revert of any particular csets, BTW --
there have been a lot of other improvements in the RTC as part of that
and later series that this leaves in place).

The switch to always sending edges is incidental, and I'd be happy to
get rid of it.  IIRC the reason we ended up with that is that when the
vpt code injects an interrupt, we need to set IRQF|PF in REG_C but
_not_ assert the line (which would cause a second interrupt).
Tracking that disconnect seemed tricky, esp. across save/restore. 

If you have any idea how we could sort that out, I'd be inclined to go
back to a level-triggered model (i.e. following the RTC spec) even in
no-ack mode.  

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 17:21   ` Tim Deegan
@ 2014-02-11  9:15     ` Jan Beulich
  2014-02-11 12:11       ` Tim Deegan
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2014-02-11  9:15 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

>>> On 10.02.14 at 18:21, Tim Deegan <tim@xen.org> wrote:
> At 16:34 +0000 on 10 Feb (1392046444), Jan Beulich wrote:
>> > The underlying problem here comes because the AF and UF bits of RTC 
>> > interrupt
>> > state is modelled by the RTC code, but the PF is modelled by the pt code.  
>> > The
>> > root cause of windows infinite loop is that RTC_PF is being re-set on 
>> > vmentry
>> > before the interrupt logic has worked out that it can't actually inject an 
>> > RTC
>> > interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
>> > should be reading 0.
>> 
>> So you're undoing a whole lot of changes done with the goal of
>> getting the overall emulation closer to what real hardware does,
>> just to paper over an issue elsewhere in the code? Not really an
>> approach I'm in favor of.
> 
> My understanding was that the problem is explicitly in the change to
> how RTC code is called from vpt code.
> 
> Originally, the RTC callback was called from pt_intr_post, like other
> vpt sources.  Your rework changed it to be called much earlier, when
> the vpt was considering which time source to choose.  AIUI that was to
> let the RTC code tell the VPT not to inject, if the guest hasn't acked
> the last interrupt, right?

Not only, it was also a necessary pre-adjustment for the !PIE case
to work (see below).

> Since that was changed later to allow a certain number of dead ticks
> before deciding to stop the timer chain, the decision no longer has to
> be made so early -- we can allow one more IRQ to go in and then
> disable it. 
> 
> That is the main change of this cset:  we go back to driving
> the interrupt from the vpt code and fixing up the RTC state after vpt
> tells us it's injected an interrupt.

And that's what is wrong imo, as it doesn't allow driving PF correctly
when !PIE.

>> > -        rtc_update_irq(s);
>> 
>> So given the problem description, this would seem to be the most
>> important part at a first glance. But looking more closely, I'm getting
>> the impression that the call to rtc_update_irq() had no effect at all
>> here anyway: The function would always bail on the second if() due
>> to REG_C having got cleared a few lines up.
> 
> Yeah, this has nothing to do with the bug being fixed here.  The old
> REG_C read was operating correctly, but on the return-to-guest path:
>  - vpt sees another RTC interrupt is due and calls RTC code
>  - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
>  - vlapic code sees the last interrupt is still in the ISR and does
>    nothing;
>  - we return to the guest having set IRQF but not consumed a timer
>    event, so vpt stste is the same
>  - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
>    waiting for a read of 0.
>  - repeat forever.

Which would call for a flag suppressing the setting of PF|IRQF
until the timer event got consumed. Possibly with some safety
belt for this to not get deferred indefinitely (albeit if the interrupt
doesn't get injected for extended periods of time, the guest
would presumably have more severe problems than these flags
not getting updated as expected).

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11  9:15     ` Jan Beulich
@ 2014-02-11 12:11       ` Tim Deegan
  2014-02-11 12:50         ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2014-02-11 12:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

At 09:15 +0000 on 11 Feb (1392106520), Jan Beulich wrote:
> >>> On 10.02.14 at 18:21, Tim Deegan <tim@xen.org> wrote:
> > At 16:34 +0000 on 10 Feb (1392046444), Jan Beulich wrote:
> >> > The underlying problem here comes because the AF and UF bits of RTC 
> >> > interrupt
> >> > state is modelled by the RTC code, but the PF is modelled by the pt code.  
> >> > The
> >> > root cause of windows infinite loop is that RTC_PF is being re-set on 
> >> > vmentry
> >> > before the interrupt logic has worked out that it can't actually inject an 
> >> > RTC
> >> > interrupt, causing Windows to erroniously read (RTC_PF|RTC_IRQF) when it
> >> > should be reading 0.
> >> 
> >> So you're undoing a whole lot of changes done with the goal of
> >> getting the overall emulation closer to what real hardware does,
> >> just to paper over an issue elsewhere in the code? Not really an
> >> approach I'm in favor of.
> > 
> > My understanding was that the problem is explicitly in the change to
> > how RTC code is called from vpt code.
> > 
> > Originally, the RTC callback was called from pt_intr_post, like other
> > vpt sources.  Your rework changed it to be called much earlier, when
> > the vpt was considering which time source to choose.  AIUI that was to
> > let the RTC code tell the VPT not to inject, if the guest hasn't acked
> > the last interrupt, right?
> 
> Not only, it was also a necessary pre-adjustment for the !PIE case
> to work (see below).
> 
> > Since that was changed later to allow a certain number of dead ticks
> > before deciding to stop the timer chain, the decision no longer has to
> > be made so early -- we can allow one more IRQ to go in and then
> > disable it. 
> > 
> > That is the main change of this cset:  we go back to driving
> > the interrupt from the vpt code and fixing up the RTC state after vpt
> > tells us it's injected an interrupt.
> 
> And that's what is wrong imo, as it doesn't allow driving PF correctly
> when !PIE.

Oh, I see -- the current code doesn't turn the vpt off when !PIE.  Can
you remember why not?  Have I forgotten some wrinkle or race here?

> >> > -        rtc_update_irq(s);
> >> 
> >> So given the problem description, this would seem to be the most
> >> important part at a first glance. But looking more closely, I'm getting
> >> the impression that the call to rtc_update_irq() had no effect at all
> >> here anyway: The function would always bail on the second if() due
> >> to REG_C having got cleared a few lines up.
> > 
> > Yeah, this has nothing to do with the bug being fixed here.  The old
> > REG_C read was operating correctly, but on the return-to-guest path:
> >  - vpt sees another RTC interrupt is due and calls RTC code
> >  - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
> >  - vlapic code sees the last interrupt is still in the ISR and does
> >    nothing;
> >  - we return to the guest having set IRQF but not consumed a timer
> >    event, so vpt stste is the same
> >  - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
> >    waiting for a read of 0.
> >  - repeat forever.
> 
> Which would call for a flag suppressing the setting of PF|IRQF
> until the timer event got consumed. Possibly with some safety
> belt for this to not get deferred indefinitely (albeit if the interrupt
> doesn't get injected for extended periods of time, the guest
> would presumably have more severe problems than these flags
> not getting updated as expected).

That's pretty much what we're doing here -- the pt_intr_post callback
sets PF|IRQF when the interrupt is injected.

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 12:11       ` Tim Deegan
@ 2014-02-11 12:50         ` Jan Beulich
  2014-02-11 13:15           ` Tim Deegan
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2014-02-11 12:50 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

>>> On 11.02.14 at 13:11, Tim Deegan <tim@xen.org> wrote:
> At 09:15 +0000 on 11 Feb (1392106520), Jan Beulich wrote:
>> >>> On 10.02.14 at 18:21, Tim Deegan <tim@xen.org> wrote:
>> > My understanding was that the problem is explicitly in the change to
>> > how RTC code is called from vpt code.
>> > 
>> > Originally, the RTC callback was called from pt_intr_post, like other
>> > vpt sources.  Your rework changed it to be called much earlier, when
>> > the vpt was considering which time source to choose.  AIUI that was to
>> > let the RTC code tell the VPT not to inject, if the guest hasn't acked
>> > the last interrupt, right?
>> 
>> Not only, it was also a necessary pre-adjustment for the !PIE case
>> to work (see below).
>> 
>> > Since that was changed later to allow a certain number of dead ticks
>> > before deciding to stop the timer chain, the decision no longer has to
>> > be made so early -- we can allow one more IRQ to go in and then
>> > disable it. 
>> > 
>> > That is the main change of this cset:  we go back to driving
>> > the interrupt from the vpt code and fixing up the RTC state after vpt
>> > tells us it's injected an interrupt.
>> 
>> And that's what is wrong imo, as it doesn't allow driving PF correctly
>> when !PIE.
> 
> Oh, I see -- the current code doesn't turn the vpt off when !PIE.  Can
> you remember why not?  Have I forgotten some wrinkle or race here?

Because an OS could inspect PF without setting PIE.

>> > Yeah, this has nothing to do with the bug being fixed here.  The old
>> > REG_C read was operating correctly, but on the return-to-guest path:
>> >  - vpt sees another RTC interrupt is due and calls RTC code
>> >  - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
>> >  - vlapic code sees the last interrupt is still in the ISR and does
>> >    nothing;
>> >  - we return to the guest having set IRQF but not consumed a timer
>> >    event, so vpt stste is the same
>> >  - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
>> >    waiting for a read of 0.
>> >  - repeat forever.
>> 
>> Which would call for a flag suppressing the setting of PF|IRQF
>> until the timer event got consumed. Possibly with some safety
>> belt for this to not get deferred indefinitely (albeit if the interrupt
>> doesn't get injected for extended periods of time, the guest
>> would presumably have more severe problems than these flags
>> not getting updated as expected).
> 
> That's pretty much what we're doing here -- the pt_intr_post callback
> sets PF|IRQF when the interrupt is injected.

Right, except you do this be reverting other stuff rather than
adding the missing functionality on top.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 12:50         ` Jan Beulich
@ 2014-02-11 13:15           ` Tim Deegan
  2014-02-11 13:31             ` Jan Beulich
  2014-02-11 13:59             ` Andrew Cooper
  0 siblings, 2 replies; 22+ messages in thread
From: Tim Deegan @ 2014-02-11 13:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

At 12:50 +0000 on 11 Feb (1392119457), Jan Beulich wrote:
> >>> On 11.02.14 at 13:11, Tim Deegan <tim@xen.org> wrote:
> > At 09:15 +0000 on 11 Feb (1392106520), Jan Beulich wrote:
> >> >>> On 10.02.14 at 18:21, Tim Deegan <tim@xen.org> wrote:
> >> > That is the main change of this cset:  we go back to driving
> >> > the interrupt from the vpt code and fixing up the RTC state after vpt
> >> > tells us it's injected an interrupt.
> >> 
> >> And that's what is wrong imo, as it doesn't allow driving PF correctly
> >> when !PIE.
> > 
> > Oh, I see -- the current code doesn't turn the vpt off when !PIE.  Can
> > you remember why not?  Have I forgotten some wrinkle or race here?
> 
> Because an OS could inspect PF without setting PIE.

Ugh. :( 

> >> > Yeah, this has nothing to do with the bug being fixed here.  The old
> >> > REG_C read was operating correctly, but on the return-to-guest path:
> >> >  - vpt sees another RTC interrupt is due and calls RTC code
> >> >  - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
> >> >  - vlapic code sees the last interrupt is still in the ISR and does
> >> >    nothing;
> >> >  - we return to the guest having set IRQF but not consumed a timer
> >> >    event, so vpt stste is the same
> >> >  - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
> >> >    waiting for a read of 0.
> >> >  - repeat forever.
> >> 
> >> Which would call for a flag suppressing the setting of PF|IRQF
> >> until the timer event got consumed. Possibly with some safety
> >> belt for this to not get deferred indefinitely (albeit if the interrupt
> >> doesn't get injected for extended periods of time, the guest
> >> would presumably have more severe problems than these flags
> >> not getting updated as expected).
> > 
> > That's pretty much what we're doing here -- the pt_intr_post callback
> > sets PF|IRQF when the interrupt is injected.
> 
> Right, except you do this be reverting other stuff rather than
> adding the missing functionality on top.

Absolutely -- because once we went back to having PF set only when the
interrupt was injected, it seemed better to reduce the amount of
special-case plumbing for RTC than to add yet more.

But for the case of an OS polling for PF with PIE clear, I guess we
might need to keep all the current special cases.  Was that a known
observed bug or a theoretical one?  I can't see a way of handling
both that case and the w2k3 case.

Either we always set PF when the tick happens, even if the interrupt
is masked (which breaks w2k3) or we don't set it until we can deliver
the interrupt (which breaks pollers).

Or equivalently, either setting PF consumes a tick (which breaks
no-missed-ticks mode for OSes that don't poll) or it doesn't (which
breaks w2k3).

Or do we have to treat 'masked in REG_B/IOAPIC' differently from
'masked by ISR/TPR/RFLAGS.IF/...'?

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 13:15           ` Tim Deegan
@ 2014-02-11 13:31             ` Jan Beulich
  2014-02-11 13:57               ` Tim Deegan
  2014-02-11 13:59             ` Andrew Cooper
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2014-02-11 13:31 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

>>> On 11.02.14 at 14:15, Tim Deegan <tim@xen.org> wrote:
> At 12:50 +0000 on 11 Feb (1392119457), Jan Beulich wrote:
>> Right, except you do this be reverting other stuff rather than
>> adding the missing functionality on top.
> 
> Absolutely -- because once we went back to having PF set only when the
> interrupt was injected, it seemed better to reduce the amount of
> special-case plumbing for RTC than to add yet more.
> 
> But for the case of an OS polling for PF with PIE clear, I guess we
> might need to keep all the current special cases.  Was that a known
> observed bug or a theoretical one?

A theoretical one. Along the lines of the general theme of all the
patches around that time: Get the emulation closer to what real
hardware does.

> I can't see a way of handling both that case and the w2k3 case.

Since hardware can, software certainly could as well.

> Either we always set PF when the tick happens, even if the interrupt
> is masked (which breaks w2k3) or we don't set it until we can deliver
> the interrupt (which breaks pollers).
> 
> Or equivalently, either setting PF consumes a tick (which breaks
> no-missed-ticks mode for OSes that don't poll) or it doesn't (which
> breaks w2k3).

Are these two really equivalent (perhaps they are in our current
implementation, but I ask from an abstract POV)? In particular
since for the above it's not immediately clear what "masked"
means, i.e. ...

> Or do we have to treat 'masked in REG_B/IOAPIC' differently from
> 'masked by ISR/TPR/RFLAGS.IF/...'?

... this might be the right direction (albeit I think it would be REG_B
on one side and collectively IOAPIC/ISR/TPR/EFLAGS.IF).

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 13:31             ` Jan Beulich
@ 2014-02-11 13:57               ` Tim Deegan
  2014-02-11 14:35                 ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2014-02-11 13:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

At 13:31 +0000 on 11 Feb (1392121899), Jan Beulich wrote:
> >>> On 11.02.14 at 14:15, Tim Deegan <tim@xen.org> wrote:
> > At 12:50 +0000 on 11 Feb (1392119457), Jan Beulich wrote:
> >> Right, except you do this be reverting other stuff rather than
> >> adding the missing functionality on top.
> > 
> > Absolutely -- because once we went back to having PF set only when the
> > interrupt was injected, it seemed better to reduce the amount of
> > special-case plumbing for RTC than to add yet more.
> > 
> > But for the case of an OS polling for PF with PIE clear, I guess we
> > might need to keep all the current special cases.  Was that a known
> > observed bug or a theoretical one?
> 
> A theoretical one. Along the lines of the general theme of all the
> patches around that time: Get the emulation closer to what real
> hardware does.

Righto.

> > I can't see a way of handling both that case and the w2k3 case.
> 
> Since hardware can, software certainly could as well.

Hardware doesn't have to do things like no-missed-ticks mode; it
certainly doesn't have to deal with no-ack mode. :(

> > Or do we have to treat 'masked in REG_B/IOAPIC' differently from
> > 'masked by ISR/TPR/RFLAGS.IF/...'?
> 
> ... this might be the right direction (albeit I think it would be REG_B
> on one side and collectively IOAPIC/ISR/TPR/EFLAGS.IF).

Yeah, maybe.

Something like, if !PIE then we set PF and consume the tick in the
early vpt callback, otherwise we do it in the late callback (and in
both cases, the early vpt callback doesn't actually assert the line)?

Or: we drop the early callback, and go back to setting PF|IRQF in the
pt_intr_post callback (much like the patch under discussion), and
disable the vpt when the guest clears PIE.  Then in the REG_C read, if
!PIE, we can set the PF bit if a tick should have happened since
the last read (but saving ourselves the bother of running the actual
timer).  Then the only case where things are wrong is an OS which
_both_ polls for the edge _and_ relies on the vpt logic for adding the
right number of ticks after being descheduled.  Or an OS which asks
for interrupts, masks them in the *APIC/RFLAGS and then polls for PF.
But that guest, I think, is indistinguishable from w2k3 in the stuck
state.

Actually that second patch doesn't sound too bad.  If that sounds OK
to you I can look into coding it up on Thursday.

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 13:57               ` Tim Deegan
@ 2014-02-11 14:35                 ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2014-02-11 14:35 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Andrew Cooper, Xen-devel, KeirFraser, roger.pau

>>> On 11.02.14 at 14:57, Tim Deegan <tim@xen.org> wrote:
> At 13:31 +0000 on 11 Feb (1392121899), Jan Beulich wrote:
>> >>> On 11.02.14 at 14:15, Tim Deegan <tim@xen.org> wrote:
>> > Or do we have to treat 'masked in REG_B/IOAPIC' differently from
>> > 'masked by ISR/TPR/RFLAGS.IF/...'?
>> 
>> ... this might be the right direction (albeit I think it would be REG_B
>> on one side and collectively IOAPIC/ISR/TPR/EFLAGS.IF).
> 
> Yeah, maybe.
> 
> Something like, if !PIE then we set PF and consume the tick in the
> early vpt callback, otherwise we do it in the late callback (and in
> both cases, the early vpt callback doesn't actually assert the line)?

Yes. Question is how intrusive a change that would be (namely to
have the two callbacks deal correctly with an RTC register - in
particular PIE - changing between them being called).

> Or: we drop the early callback, and go back to setting PF|IRQF in the
> pt_intr_post callback (much like the patch under discussion), and
> disable the vpt when the guest clears PIE.  Then in the REG_C read, if
> !PIE, we can set the PF bit if a tick should have happened since
> the last read (but saving ourselves the bother of running the actual
> timer).  Then the only case where things are wrong is an OS which
> _both_ polls for the edge _and_ relies on the vpt logic for adding the
> right number of ticks after being descheduled.  Or an OS which asks
> for interrupts, masks them in the *APIC/RFLAGS and then polls for PF.
> But that guest, I think, is indistinguishable from w2k3 in the stuck
> state.
> 
> Actually that second patch doesn't sound too bad.  If that sounds OK
> to you I can look into coding it up on Thursday.

That second one might be okay too, as long as it is clearly written
down in at least the commit message which modes we expect to
work and which not.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 13:15           ` Tim Deegan
  2014-02-11 13:31             ` Jan Beulich
@ 2014-02-11 13:59             ` Andrew Cooper
  2014-02-11 14:10               ` Tim Deegan
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2014-02-11 13:59 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Xen-devel, KeirFraser, Jan Beulich, roger.pau

On 11/02/14 13:15, Tim Deegan wrote:
> At 12:50 +0000 on 11 Feb (1392119457), Jan Beulich wrote:
>>>>> On 11.02.14 at 13:11, Tim Deegan <tim@xen.org> wrote:
>>> At 09:15 +0000 on 11 Feb (1392106520), Jan Beulich wrote:
>>>>>>> On 10.02.14 at 18:21, Tim Deegan <tim@xen.org> wrote:
>>>>> That is the main change of this cset:  we go back to driving
>>>>> the interrupt from the vpt code and fixing up the RTC state after vpt
>>>>> tells us it's injected an interrupt.
>>>> And that's what is wrong imo, as it doesn't allow driving PF correctly
>>>> when !PIE.
>>> Oh, I see -- the current code doesn't turn the vpt off when !PIE.  Can
>>> you remember why not?  Have I forgotten some wrinkle or race here?
>> Because an OS could inspect PF without setting PIE.
> Ugh. :( 
>
>>>>> Yeah, this has nothing to do with the bug being fixed here.  The old
>>>>> REG_C read was operating correctly, but on the return-to-guest path:
>>>>>  - vpt sees another RTC interrupt is due and calls RTC code
>>>>>  - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
>>>>>  - vlapic code sees the last interrupt is still in the ISR and does
>>>>>    nothing;
>>>>>  - we return to the guest having set IRQF but not consumed a timer
>>>>>    event, so vpt stste is the same
>>>>>  - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
>>>>>    waiting for a read of 0.
>>>>>  - repeat forever.
>>>> Which would call for a flag suppressing the setting of PF|IRQF
>>>> until the timer event got consumed. Possibly with some safety
>>>> belt for this to not get deferred indefinitely (albeit if the interrupt
>>>> doesn't get injected for extended periods of time, the guest
>>>> would presumably have more severe problems than these flags
>>>> not getting updated as expected).
>>> That's pretty much what we're doing here -- the pt_intr_post callback
>>> sets PF|IRQF when the interrupt is injected.
>> Right, except you do this be reverting other stuff rather than
>> adding the missing functionality on top.
> Absolutely -- because once we went back to having PF set only when the
> interrupt was injected, it seemed better to reduce the amount of
> special-case plumbing for RTC than to add yet more.
>
> But for the case of an OS polling for PF with PIE clear, I guess we
> might need to keep all the current special cases.  Was that a known
> observed bug or a theoretical one?  I can't see a way of handling
> both that case and the w2k3 case.
>
> Either we always set PF when the tick happens, even if the interrupt
> is masked (which breaks w2k3) or we don't set it until we can deliver
> the interrupt (which breaks pollers).

This doesn't break w2k3.  Setting PF when a tick happens (or should
happen for !PIE) is the correct thing to do.

The bug is that we see an interrupt pending and set PF when we
shouldn't, so w2k3 is constantly seeing PF set when it shouldn't be. 
The first read of REG_C should clear PF, which should then stay clear
until the next 64Hz period is up, whether or not interrupts are pending
injection.

~Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 13:59             ` Andrew Cooper
@ 2014-02-11 14:10               ` Tim Deegan
  2014-02-11 14:52                 ` Andrew Cooper
  0 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2014-02-11 14:10 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: George Dunlap, Xen-devel, KeirFraser, Jan Beulich, roger.pau

At 13:59 +0000 on 11 Feb (1392123546), Andrew Cooper wrote:
> On 11/02/14 13:15, Tim Deegan wrote:
> > At 12:50 +0000 on 11 Feb (1392119457), Jan Beulich wrote:
> >>>>> On 11.02.14 at 13:11, Tim Deegan <tim@xen.org> wrote:
> >>> At 09:15 +0000 on 11 Feb (1392106520), Jan Beulich wrote:
> >>>>>>> On 10.02.14 at 18:21, Tim Deegan <tim@xen.org> wrote:
> >>>>> That is the main change of this cset:  we go back to driving
> >>>>> the interrupt from the vpt code and fixing up the RTC state after vpt
> >>>>> tells us it's injected an interrupt.
> >>>> And that's what is wrong imo, as it doesn't allow driving PF correctly
> >>>> when !PIE.
> >>> Oh, I see -- the current code doesn't turn the vpt off when !PIE.  Can
> >>> you remember why not?  Have I forgotten some wrinkle or race here?
> >> Because an OS could inspect PF without setting PIE.
> > Ugh. :( 
> >
> >>>>> Yeah, this has nothing to do with the bug being fixed here.  The old
> >>>>> REG_C read was operating correctly, but on the return-to-guest path:
> >>>>>  - vpt sees another RTC interrupt is due and calls RTC code
> >>>>>  - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
> >>>>>  - vlapic code sees the last interrupt is still in the ISR and does
> >>>>>    nothing;
> >>>>>  - we return to the guest having set IRQF but not consumed a timer
> >>>>>    event, so vpt stste is the same
> >>>>>  - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
> >>>>>    waiting for a read of 0.
> >>>>>  - repeat forever.
> >>>> Which would call for a flag suppressing the setting of PF|IRQF
> >>>> until the timer event got consumed. Possibly with some safety
> >>>> belt for this to not get deferred indefinitely (albeit if the interrupt
> >>>> doesn't get injected for extended periods of time, the guest
> >>>> would presumably have more severe problems than these flags
> >>>> not getting updated as expected).
> >>> That's pretty much what we're doing here -- the pt_intr_post callback
> >>> sets PF|IRQF when the interrupt is injected.
> >> Right, except you do this be reverting other stuff rather than
> >> adding the missing functionality on top.
> > Absolutely -- because once we went back to having PF set only when the
> > interrupt was injected, it seemed better to reduce the amount of
> > special-case plumbing for RTC than to add yet more.
> >
> > But for the case of an OS polling for PF with PIE clear, I guess we
> > might need to keep all the current special cases.  Was that a known
> > observed bug or a theoretical one?  I can't see a way of handling
> > both that case and the w2k3 case.
> >
> > Either we always set PF when the tick happens, even if the interrupt
> > is masked (which breaks w2k3) or we don't set it until we can deliver
> > the interrupt (which breaks pollers).
> 
> This doesn't break w2k3.  Setting PF when a tick happens (or should
> happen for !PIE) is the correct thing to do.
> 
> The bug is that we see an interrupt pending and set PF when we
> shouldn't

We _are_ setting PF when the tick happens; it's just that because of
no-missed-ticks mode the tick happens before w2k3 has finished
handling the last one.  At that point, anything we do breaks w2k3 in
some way -- either we leave the tick pending until the interrupt is
actually delivered (which leads to the hang) or we consume the tick
even though the interrupt will be lost (which causes clock drift).

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-11 14:10               ` Tim Deegan
@ 2014-02-11 14:52                 ` Andrew Cooper
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Cooper @ 2014-02-11 14:52 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, Xen-devel, KeirFraser, Jan Beulich, roger.pau

On 11/02/14 14:10, Tim Deegan wrote:
> At 13:59 +0000 on 11 Feb (1392123546), Andrew Cooper wrote:
>> On 11/02/14 13:15, Tim Deegan wrote:
>>> At 12:50 +0000 on 11 Feb (1392119457), Jan Beulich wrote:
>>>>>>> On 11.02.14 at 13:11, Tim Deegan <tim@xen.org> wrote:
>>>>> At 09:15 +0000 on 11 Feb (1392106520), Jan Beulich wrote:
>>>>>>>>> On 10.02.14 at 18:21, Tim Deegan <tim@xen.org> wrote:
>>>>>>> That is the main change of this cset:  we go back to driving
>>>>>>> the interrupt from the vpt code and fixing up the RTC state after vpt
>>>>>>> tells us it's injected an interrupt.
>>>>>> And that's what is wrong imo, as it doesn't allow driving PF correctly
>>>>>> when !PIE.
>>>>> Oh, I see -- the current code doesn't turn the vpt off when !PIE.  Can
>>>>> you remember why not?  Have I forgotten some wrinkle or race here?
>>>> Because an OS could inspect PF without setting PIE.
>>> Ugh. :( 
>>>
>>>>>>> Yeah, this has nothing to do with the bug being fixed here.  The old
>>>>>>> REG_C read was operating correctly, but on the return-to-guest path:
>>>>>>>  - vpt sees another RTC interrupt is due and calls RTC code
>>>>>>>  - RTC code sees REG_C clear, sets PF|IRQF and asserts the line
>>>>>>>  - vlapic code sees the last interrupt is still in the ISR and does
>>>>>>>    nothing;
>>>>>>>  - we return to the guest having set IRQF but not consumed a timer
>>>>>>>    event, so vpt stste is the same
>>>>>>>  - the guest sees the old REG_C, with PF|IRQF set, and re-reads, 
>>>>>>>    waiting for a read of 0.
>>>>>>>  - repeat forever.
>>>>>> Which would call for a flag suppressing the setting of PF|IRQF
>>>>>> until the timer event got consumed. Possibly with some safety
>>>>>> belt for this to not get deferred indefinitely (albeit if the interrupt
>>>>>> doesn't get injected for extended periods of time, the guest
>>>>>> would presumably have more severe problems than these flags
>>>>>> not getting updated as expected).
>>>>> That's pretty much what we're doing here -- the pt_intr_post callback
>>>>> sets PF|IRQF when the interrupt is injected.
>>>> Right, except you do this be reverting other stuff rather than
>>>> adding the missing functionality on top.
>>> Absolutely -- because once we went back to having PF set only when the
>>> interrupt was injected, it seemed better to reduce the amount of
>>> special-case plumbing for RTC than to add yet more.
>>>
>>> But for the case of an OS polling for PF with PIE clear, I guess we
>>> might need to keep all the current special cases.  Was that a known
>>> observed bug or a theoretical one?  I can't see a way of handling
>>> both that case and the w2k3 case.
>>>
>>> Either we always set PF when the tick happens, even if the interrupt
>>> is masked (which breaks w2k3) or we don't set it until we can deliver
>>> the interrupt (which breaks pollers).
>> This doesn't break w2k3.  Setting PF when a tick happens (or should
>> happen for !PIE) is the correct thing to do.
>>
>> The bug is that we see an interrupt pending and set PF when we
>> shouldn't
> We _are_ setting PF when the tick happens; it's just that because of
> no-missed-ticks mode the tick happens before w2k3 has finished
> handling the last one.  At that point, anything we do breaks w2k3 in
> some way -- either we leave the tick pending until the interrupt is
> actually delivered (which leads to the hang) or we consume the tick
> even though the interrupt will be lost (which causes clock drift).
>
> Tim.

No - we are setting PF on every vmentry, not every tick.

* pt_update_irq() finds the timer pending and decides to inject an
interrupt.  This sets REG_C.PF
* {svm,vmx}_intr_assist() bails early because it can't actually inject
the interrupt.
* pt_intr_post() doesn't run, which doesn't update the PT state, yet
because pt_update_irq() thought it was injecting an interrupt, it didn't
run its faked-up pt_intr_post()
* Next VMentry finds an erroneously pending tick and decides to inject
an interrupt.

If w2k3 were to repeatedly read REG_C alone without writing to the index
register in-between, it would observe PF being alternately set and clear.


w2k3 would still work perfectly fine if we only set PF when actually
injecting the interrupt, which is why the patch at the root of this
thread fixes the observed hang.  However, Jans comment about the
behaviour of the PF bit when !PIE is quite correct.

Therefore, for !PIE, PF must be updated ahead of time, but for PIE, it
must be updated when the interrupt is actually injected.

~Andrew

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 16:34 ` Jan Beulich
  2014-02-10 17:13   ` Andrew Cooper
  2014-02-10 17:21   ` Tim Deegan
@ 2014-02-10 17:46   ` George Dunlap
  2014-02-10 18:43     ` Andrew Cooper
  2 siblings, 1 reply; 22+ messages in thread
From: George Dunlap @ 2014-02-10 17:46 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: Tim Deegan, Xen-devel, KeirFraser, roger.pau

On 02/10/2014 04:34 PM, Jan Beulich wrote:
>>    * The results from XenRT suggest that the new emulation is better than the
>>      old.
> "Better" in the sense of the limited set of uses of the virtual hardware
> by whatever selection of guest OSes is being run there. But very
> likely not "better" in the sense on matching up with how the respective
> hardware specification would require it to behave.

The context of the above sentence was in a justification for including 
it in 4.4.  Obviously "occasionally gets stuck during boot" is a pretty 
bad bug that we'd like to see fix.  But given the tricky nature of this 
whole area, there's a risk that this will cause regressions in *other* 
situations or operating systems.  What I understand Andy to be saying is 
that with the patch, the RTC appears to cause less problems than without it.

What your analysis is missing, Andy, is what the effects might be if 
there were a bug.  Obviously other guests might hang during boot; but 
what else?  Might they hang at some point much later, perhaps when being 
pounded with interrupts due to heavy network traffic? Might the clock 
begin to drift or jump around?  Would the XenRT testing catch that if it 
happened?  And, would those potential bugs be worse than what we have now?

There's a reason for trying to go through the whole exercise, 
particularly in bugs like this.  I do have a lot of faith in our 
intuition to consider hundreds of individual factors and come up with a 
reasonably good judgement of the probabilities -- but only if it is 
guided properly.  We are all very prone to only consider the things we 
happen to be thinking about, and to completely ignore all the things we 
don't happen to be looking at.  My own temptation, looking at this bug, 
is to say, "Random hangs during boot, yeah, that's pretty bad; we should 
take it."  But then I'm only looking at the positives of the patch: I'm 
not really making a balance of the positives versus the negatives.

The goal of going through the "worst-case-scenario" exercise is to bring 
to our minds the potential outcomes we are prone to ignore. Only then 
can we reasonably trust our intuition to make a properly informed judgement.

  -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 17:46   ` George Dunlap
@ 2014-02-10 18:43     ` Andrew Cooper
  2014-02-11  8:47       ` Tim Deegan
  2014-02-11  9:02       ` Jan Beulich
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Cooper @ 2014-02-10 18:43 UTC (permalink / raw)
  To: George Dunlap; +Cc: Tim Deegan, Xen-devel, KeirFraser, Jan Beulich, roger.pau

On 10/02/14 17:46, George Dunlap wrote:
> On 02/10/2014 04:34 PM, Jan Beulich wrote:
>>>    * The results from XenRT suggest that the new emulation is better
>>> than the
>>>      old.
>> "Better" in the sense of the limited set of uses of the virtual hardware
>> by whatever selection of guest OSes is being run there. But very
>> likely not "better" in the sense on matching up with how the respective
>> hardware specification would require it to behave.
>
> The context of the above sentence was in a justification for including
> it in 4.4.  Obviously "occasionally gets stuck during boot" is a
> pretty bad bug that we'd like to see fix.  But given the tricky nature
> of this whole area, there's a risk that this will cause regressions in
> *other* situations or operating systems.  What I understand Andy to be
> saying is that with the patch, the RTC appears to cause less problems
> than without it.

That was indeed the message I was trying to convey.

>
> What your analysis is missing, Andy, is what the effects might be if
> there were a bug.  Obviously other guests might hang during boot; but
> what else?  Might they hang at some point much later, perhaps when
> being pounded with interrupts due to heavy network traffic? Might the
> clock begin to drift or jump around?  Would the XenRT testing catch
> that if it happened?  And, would those potential bugs be worse than
> what we have now?

This is much more complicated to answer.  Lets try.

The patch only touches RTC and Periodic Timer code.  All of the Periodic
Timer code is reversions back to the previous code (mid Xen-4.3 and
before), along with an attempted justification as to why the old code
was more correct than the current code.

Some of the RTC changes are reversions, but there is also new logic. 
All new logic is to do with how to update REG_B and REG_C correctly. 
None of the other functionality is touched.

REG_B and REG_C are to do with interrupts, and which events
should(B)/have(C) generated interrupts.  The worst case is that a guest
gets none/too-few/too-many interrupts when trying to drive the RTC. 
None of this should lead to clock skew, as reading the time values
directly will still provide the same information as before, although any
guest which attempts to guess time based on counting periodic interrupts
from the RTC is a) already broken and b) already having massive skew as
a VM due to vcpu scheduling.

XenRT does have tests for clock drift, but don't know for certain
whether they have been run against the new code yet.

I will ensure they get run on v2 of the patch.

~Andrew

>
> There's a reason for trying to go through the whole exercise,
> particularly in bugs like this.  I do have a lot of faith in our
> intuition to consider hundreds of individual factors and come up with
> a reasonably good judgement of the probabilities -- but only if it is
> guided properly.  We are all very prone to only consider the things we
> happen to be thinking about, and to completely ignore all the things
> we don't happen to be looking at.  My own temptation, looking at this
> bug, is to say, "Random hangs during boot, yeah, that's pretty bad; we
> should take it."  But then I'm only looking at the positives of the
> patch: I'm not really making a balance of the positives versus the
> negatives.
>
> The goal of going through the "worst-case-scenario" exercise is to
> bring to our minds the potential outcomes we are prone to ignore. Only
> then can we reasonably trust our intuition to make a properly informed
> judgement.
>
>  -George

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 18:43     ` Andrew Cooper
@ 2014-02-11  8:47       ` Tim Deegan
  2014-02-11  9:02       ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Tim Deegan @ 2014-02-11  8:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: George Dunlap, Xen-devel, KeirFraser, Jan Beulich, roger.pau

At 18:43 +0000 on 10 Feb (1392054204), Andrew Cooper wrote:
> REG_B and REG_C are to do with interrupts, and which events
> should(B)/have(C) generated interrupts.  The worst case is that a guest
> gets none/too-few/too-many interrupts when trying to drive the RTC. 
> None of this should lead to clock skew, as reading the time values
> directly will still provide the same information as before, although any
> guest which attempts to guess time based on counting periodic interrupts
> from the RTC is a) already broken and b) already having massive skew as
> a VM due to vcpu scheduling.

Sadly, not the case.  There absolutely are OSes that compute wallclock
time by counting timer ticks -- that's why we have the no-missed-ticks
mode.

Based on what these bugs have looked like in the past, the main risk
is that some kind of guest behaviour will cause clock skew or even
clock freeze in the guest.   In the past some of these have depended
on workload, e.g. something that changes the tick frequency. 

> XenRT does have tests for clock drift, but don't know for certain
> whether they have been run against the new code yet.
> 
> I will ensure they get run on v2 of the patch.

Excellent!

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Patch] x86/HVM: Fix RTC interrupt modelling
  2014-02-10 18:43     ` Andrew Cooper
  2014-02-11  8:47       ` Tim Deegan
@ 2014-02-11  9:02       ` Jan Beulich
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2014-02-11  9:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: George Dunlap, Tim Deegan, Xen-devel, KeirFraser, roger.pau

>>> On 10.02.14 at 19:43, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> REG_B and REG_C are to do with interrupts, and which events
> should(B)/have(C) generated interrupts.  The worst case is that a guest
> gets none/too-few/too-many interrupts when trying to drive the RTC. 
> None of this should lead to clock skew, as reading the time values
> directly will still provide the same information as before, although any
> guest which attempts to guess time based on counting periodic interrupts
> from the RTC is a) already broken

Why? If an OS itself guarantees low enough interrupt latency (for
the chosen period), no interrupt would ever get lost, and counting
would be precise afaict.

> and b) already having massive skew as
> a VM due to vcpu scheduling.

Yes, very likely. But that's a problem we (the ones virtualizing the
OS) have to solve, not the OS (it would be the other way around
only if the OS was PV or at least virtualization aware).

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-02-11 14:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-02-10 11:17 [Patch] x86/HVM: Fix RTC interrupt modelling Andrew Cooper
2014-02-10 12:19 ` Tim Deegan
2014-02-10 15:17 ` Roger Pau Monné
2014-02-10 15:33 ` Keir Fraser
2014-02-10 16:34 ` Jan Beulich
2014-02-10 17:13   ` Andrew Cooper
2014-02-11  9:28     ` Jan Beulich
2014-02-10 17:21   ` Tim Deegan
2014-02-11  9:15     ` Jan Beulich
2014-02-11 12:11       ` Tim Deegan
2014-02-11 12:50         ` Jan Beulich
2014-02-11 13:15           ` Tim Deegan
2014-02-11 13:31             ` Jan Beulich
2014-02-11 13:57               ` Tim Deegan
2014-02-11 14:35                 ` Jan Beulich
2014-02-11 13:59             ` Andrew Cooper
2014-02-11 14:10               ` Tim Deegan
2014-02-11 14:52                 ` Andrew Cooper
2014-02-10 17:46   ` George Dunlap
2014-02-10 18:43     ` Andrew Cooper
2014-02-11  8:47       ` Tim Deegan
2014-02-11  9:02       ` Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).