[Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
@ 2008-10-29 15:22 Gleb Natapov
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 1/3] Change qemu_set_irq() to return status information Gleb Natapov
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-10-29 15:22 UTC (permalink / raw)
  To: qemu-devel

Qemu device emulation for timers might be inaccurate and
causes coalescing of several IRQs into one. It happens when the
load on the host is high and the guest did not manage to ack the
previous IRQ. The problem can be reproduced by copying of a big
file or many small ones inside Windows guest. When you do that 
guest clock start to lag behind the host one.

The first patch in the series changes qemu_irq subsystem to return
IRQ delivery status information. If device is notified that IRQs
where lost it can regenerate them as needed. The following two
patches add IRQ regeneration to PIC and RTC devices.

---

Gleb Natapov (3):
      Fix time drift problem under high load when RTC is in use.
      Fix time drift problem under high load when PIT is in use.
      Change qemu_set_irq() to return status information.


 hw/apic.c          |  103 +++++++++++++++++++++++++++++++++-------------------
 hw/arm_gic.c       |    6 ++-
 hw/arm_pic.c       |    4 ++
 hw/arm_timer.c     |    4 +-
 hw/cbus.c          |   12 +++++-
 hw/esp.c           |    4 ++
 hw/etraxfs_pic.c   |   12 ++++--
 hw/fdc.c           |    4 ++
 hw/heathrow_pic.c  |    4 ++
 hw/i8254.c         |   23 +++++++++++-
 hw/i8259.c         |   19 +++++++---
 hw/ide.c           |    8 +++-
 hw/integratorcp.c  |    4 ++
 hw/irq.c           |   18 +++++++--
 hw/irq.h           |   35 ++++++++++++------
 hw/max7310.c       |    4 ++
 hw/mc146818rtc.c   |   11 +++++-
 hw/mcf5206.c       |    4 ++
 hw/mcf_intc.c      |    6 ++-
 hw/mips_int.c      |    6 ++-
 hw/mpcore.c        |    4 ++
 hw/mst_fpga.c      |    4 ++
 hw/musicpal.c      |    4 ++
 hw/nseries.c       |    7 +++-
 hw/omap.h          |    2 +
 hw/omap1.c         |   44 +++++++++++++++++-----
 hw/omap2.c         |   12 +++++-
 hw/omap_dma.c      |    8 +++-
 hw/omap_mmc.c      |    4 ++
 hw/openpic.c       |    4 ++
 hw/palm.c          |    4 ++
 hw/pc.c            |    4 ++
 hw/pc.h            |    2 +
 hw/pci.c           |    8 +++-
 hw/pcnet.c         |    4 ++
 hw/pl061.c         |    4 ++
 hw/pl190.c         |    4 ++
 hw/ppc.c           |   18 ++++++---
 hw/ppc4xx_devs.c   |    6 ++-
 hw/pxa2xx.c        |    4 ++
 hw/pxa2xx_gpio.c   |    6 ++-
 hw/pxa2xx_pcmcia.c |    6 ++-
 hw/pxa2xx_pic.c    |    4 ++
 hw/rc4030.c        |    4 ++
 hw/sbi.c           |    6 ++-
 hw/sharpsl.h       |    2 +
 hw/slavio_intctl.c |    8 +++-
 hw/sparc32_dma.c   |    4 ++
 hw/spitz.c         |   20 ++++++++--
 hw/ssd0323.c       |    4 ++
 hw/stellaris.c     |   10 ++++-
 hw/sun4c_intctl.c  |    4 ++
 hw/sun4m.c         |    7 +++-
 hw/tc6393xb.c      |    5 ++-
 hw/tusb6010.c      |    4 ++
 hw/twl92230.c      |    8 +++-
 hw/versatilepb.c   |    4 ++
 hw/zaurus.c        |    4 ++
 58 files changed, 392 insertions(+), 160 deletions(-)

-- 
        Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 1/3] Change qemu_set_irq() to return status information.
  2008-10-29 15:22 [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Gleb Natapov
@ 2008-10-29 15:22 ` Gleb Natapov
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 2/3] Fix time drift problem under high load when PIT is in use Gleb Natapov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-10-29 15:22 UTC (permalink / raw)
  To: qemu-devel

The return value is less then zero if interrupt is masked, zero if it
is known that interrupt is lost (due to coalescing) or greater then zero
if interrupt is delivered or was successfully queued for delivery by
interrupt controller. If virtual clock is in use we fall back to old
behaviour by always returning 1. Device emulation can use this info as it
pleases. Included patch adds detection of interrupt coalescing into PIC and
APIC code for edge triggered interrupts.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---

 hw/apic.c          |  103 +++++++++++++++++++++++++++++++++-------------------
 hw/arm_gic.c       |    6 ++-
 hw/arm_pic.c       |    4 ++
 hw/arm_timer.c     |    4 +-
 hw/cbus.c          |   12 +++++-
 hw/esp.c           |    4 ++
 hw/etraxfs_pic.c   |   12 ++++--
 hw/fdc.c           |    4 ++
 hw/heathrow_pic.c  |    4 ++
 hw/i8259.c         |   19 +++++++---
 hw/ide.c           |    8 +++-
 hw/integratorcp.c  |    4 ++
 hw/irq.c           |   18 +++++++--
 hw/irq.h           |   35 ++++++++++++------
 hw/max7310.c       |    4 ++
 hw/mcf5206.c       |    4 ++
 hw/mcf_intc.c      |    6 ++-
 hw/mips_int.c      |    6 ++-
 hw/mpcore.c        |    4 ++
 hw/mst_fpga.c      |    4 ++
 hw/musicpal.c      |    4 ++
 hw/nseries.c       |    7 +++-
 hw/omap.h          |    2 +
 hw/omap1.c         |   44 +++++++++++++++++-----
 hw/omap2.c         |   12 +++++-
 hw/omap_dma.c      |    8 +++-
 hw/omap_mmc.c      |    4 ++
 hw/openpic.c       |    4 ++
 hw/palm.c          |    4 ++
 hw/pc.c            |    4 ++
 hw/pc.h            |    2 +
 hw/pci.c           |    8 +++-
 hw/pcnet.c         |    4 ++
 hw/pl061.c         |    4 ++
 hw/pl190.c         |    4 ++
 hw/ppc.c           |   18 ++++++---
 hw/ppc4xx_devs.c   |    6 ++-
 hw/pxa2xx.c        |    4 ++
 hw/pxa2xx_gpio.c   |    6 ++-
 hw/pxa2xx_pcmcia.c |    6 ++-
 hw/pxa2xx_pic.c    |    4 ++
 hw/rc4030.c        |    4 ++
 hw/sbi.c           |    6 ++-
 hw/sharpsl.h       |    2 +
 hw/slavio_intctl.c |    8 +++-
 hw/sparc32_dma.c   |    4 ++
 hw/spitz.c         |   20 ++++++++--
 hw/ssd0323.c       |    4 ++
 hw/stellaris.c     |   10 ++++-
 hw/sun4c_intctl.c  |    4 ++
 hw/sun4m.c         |    7 +++-
 hw/tc6393xb.c      |    5 ++-
 hw/tusb6010.c      |    4 ++
 hw/twl92230.c      |    8 +++-
 hw/versatilepb.c   |    4 ++
 hw/zaurus.c        |    4 ++
 56 files changed, 361 insertions(+), 157 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index a2915f8..908f2b6 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -102,7 +102,7 @@ static APICState *local_apics[MAX_APICS + 1];
 static int last_apic_id = 0;
 
 static void apic_init_ipi(APICState *s);
-static void apic_set_irq(APICState *s, int vector_num, int trigger_mode);
+static int apic_set_irq(APICState *s, int vector_num, int trigger_mode);
 static void apic_update_irq(APICState *s);
 
 /* Find first bit starting from msb */
@@ -133,6 +133,14 @@ static inline void reset_bit(uint32_t *tab, int index)
     tab[i] &= ~mask;
 }
 
+static inline int get_bit(uint32_t *tab, int index)
+{
+    int i, mask;
+    i = index >> 5;
+    mask = 1 << (index & 0x1f);
+    return !!(tab[i] & mask);
+}
+
 static void apic_local_deliver(CPUState *env, int vector)
 {
     APICState *s = env->apic_state;
@@ -203,12 +211,13 @@ void apic_deliver_pic_intr(CPUState *env, int level)
     }\
 }
 
-static void apic_bus_deliver(const uint32_t *deliver_bitmask,
+static int apic_bus_deliver(const uint32_t *deliver_bitmask,
                              uint8_t delivery_mode,
                              uint8_t vector_num, uint8_t polarity,
                              uint8_t trigger_mode)
 {
     APICState *apic_iter;
+    int ret = 0;
 
     switch (delivery_mode) {
         case APIC_DM_LOWPRI:
@@ -225,11 +234,12 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
                 if (d >= 0) {
                     apic_iter = local_apics[d];
                     if (apic_iter) {
-                        apic_set_irq(apic_iter, vector_num, trigger_mode);
+                        ret = apic_set_irq(apic_iter, vector_num, trigger_mode);
                     }
-                }
+                } else
+                    ret = 1;
             }
-            return;
+            return ret;
 
         case APIC_DM_FIXED:
             break;
@@ -237,29 +247,33 @@ static void apic_bus_deliver(const uint32_t *deliver_bitmask,
         case APIC_DM_SMI:
             foreach_apic(apic_iter, deliver_bitmask,
                 cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_SMI) );
-            return;
+            return 1;
 
         case APIC_DM_NMI:
             foreach_apic(apic_iter, deliver_bitmask,
                 cpu_interrupt(apic_iter->cpu_env, CPU_INTERRUPT_NMI) );
-            return;
+            return 1;
 
         case APIC_DM_INIT:
             /* normal INIT IPI sent to processors */
             foreach_apic(apic_iter, deliver_bitmask,
                          apic_init_ipi(apic_iter) );
-            return;
+            return 1;
 
         case APIC_DM_EXTINT:
             /* handled in I/O APIC code */
             break;
 
         default:
-            return;
+            return ret;
     }
 
+    /* if interrupt should be delivered to more then one apic report it as
+     * delivered if at least one apic accepts it */
     foreach_apic(apic_iter, deliver_bitmask,
-                 apic_set_irq(apic_iter, vector_num, trigger_mode) );
+                 ret += apic_set_irq(apic_iter, vector_num, trigger_mode) );
+
+    return ret;
 }
 
 void cpu_set_apic_base(CPUState *env, uint64_t val)
@@ -349,14 +363,18 @@ static void apic_update_irq(APICState *s)
     cpu_interrupt(s->cpu_env, CPU_INTERRUPT_HARD);
 }
 
-static void apic_set_irq(APICState *s, int vector_num, int trigger_mode)
+static int apic_set_irq(APICState *s, int vector_num, int trigger_mode)
 {
+    int ret = !get_bit(s->irr, vector_num);
+
     set_bit(s->irr, vector_num);
     if (trigger_mode)
         set_bit(s->tmr, vector_num);
     else
         reset_bit(s->tmr, vector_num);
     apic_update_irq(s);
+
+    return ret;
 }
 
 static void apic_eoi(APICState *s)
@@ -903,9 +921,8 @@ int apic_init(CPUState *env)
     return 0;
 }
 
-static void ioapic_service(IOAPICState *s)
+static int ioapic_service_one(IOAPICState *s, uint8_t pin)
 {
-    uint8_t i;
     uint8_t trig_mode;
     uint8_t vector;
     uint8_t delivery_mode;
@@ -916,34 +933,44 @@ static void ioapic_service(IOAPICState *s)
     uint8_t polarity;
     uint32_t deliver_bitmask[MAX_APIC_WORDS];
 
+    mask = 1 << pin;
+    if (!(s->irr & mask))
+        return 1;
+    
+    entry = s->ioredtbl[pin];
+    if ((entry & APIC_LVT_MASKED))
+        return -1;
+
+    trig_mode = ((entry >> 15) & 1);
+    dest = entry >> 56;
+    dest_mode = (entry >> 11) & 1;
+    delivery_mode = (entry >> 8) & 7;
+    polarity = (entry >> 13) & 1;
+    if (trig_mode == APIC_TRIGGER_EDGE)
+        s->irr &= ~mask;
+    if (delivery_mode == APIC_DM_EXTINT)
+        vector = pic_read_irq(isa_pic);
+    else
+        vector = entry & 0xff;
+
+    apic_get_delivery_bitmask(deliver_bitmask, dest, dest_mode);
+    return apic_bus_deliver(deliver_bitmask, delivery_mode, vector, polarity,
+            trig_mode);
+}
+
+static void ioapic_service(IOAPICState *s)
+{
+    uint8_t i;
+
     for (i = 0; i < IOAPIC_NUM_PINS; i++) {
-        mask = 1 << i;
-        if (s->irr & mask) {
-            entry = s->ioredtbl[i];
-            if (!(entry & APIC_LVT_MASKED)) {
-                trig_mode = ((entry >> 15) & 1);
-                dest = entry >> 56;
-                dest_mode = (entry >> 11) & 1;
-                delivery_mode = (entry >> 8) & 7;
-                polarity = (entry >> 13) & 1;
-                if (trig_mode == APIC_TRIGGER_EDGE)
-                    s->irr &= ~mask;
-                if (delivery_mode == APIC_DM_EXTINT)
-                    vector = pic_read_irq(isa_pic);
-                else
-                    vector = entry & 0xff;
-
-                apic_get_delivery_bitmask(deliver_bitmask, dest, dest_mode);
-                apic_bus_deliver(deliver_bitmask, delivery_mode,
-                                 vector, polarity, trig_mode);
-            }
-        }
+        ioapic_service_one(s, i);
     }
 }
 
-void ioapic_set_irq(void *opaque, int vector, int level)
+int ioapic_set_irq(void *opaque, int vector, int level)
 {
     IOAPICState *s = opaque;
+    int ret = 1;
 
     if (vector >= 0 && vector < IOAPIC_NUM_PINS) {
         uint32_t mask = 1 << vector;
@@ -953,7 +980,7 @@ void ioapic_set_irq(void *opaque, int vector, int level)
             /* level triggered */
             if (level) {
                 s->irr |= mask;
-                ioapic_service(s);
+                ioapic_service_one(s, vector);
             } else {
                 s->irr &= ~mask;
             }
@@ -961,10 +988,12 @@ void ioapic_set_irq(void *opaque, int vector, int level)
             /* edge triggered */
             if (level) {
                 s->irr |= mask;
-                ioapic_service(s);
+                ret = ioapic_service_one(s, vector);
             }
         }
     }
+
+    return ret;
 }
 
 static uint32_t ioapic_mem_readl(void *opaque, target_phys_addr_t addr)
diff --git a/hw/arm_gic.c b/hw/arm_gic.c
index 54e99f4..a0fc3f0 100644
--- a/hw/arm_gic.c
+++ b/hw/arm_gic.c
@@ -154,13 +154,13 @@ gic_set_pending_private(gic_state *s, int cpu, int irq)
 }
 
 /* Process a change in an external IRQ input.  */
-static void gic_set_irq(void *opaque, int irq, int level)
+static int gic_set_irq(void *opaque, int irq, int level)
 {
     gic_state *s = (gic_state *)opaque;
     /* The first external input line is internal interrupt 32.  */
     irq += 32;
     if (level == GIC_TEST_LEVEL(irq, ALL_CPU_MASK))
-        return;
+        return 1;
 
     if (level) {
         GIC_SET_LEVEL(irq, ALL_CPU_MASK);
@@ -172,6 +172,8 @@ static void gic_set_irq(void *opaque, int irq, int level)
         GIC_CLEAR_LEVEL(irq, ALL_CPU_MASK);
     }
     gic_update(s);
+
+    return 1;
 }
 
 static void gic_set_running_irq(gic_state *s, int cpu, int irq)
diff --git a/hw/arm_pic.c b/hw/arm_pic.c
index 1fe55b7..753a404 100644
--- a/hw/arm_pic.c
+++ b/hw/arm_pic.c
@@ -21,7 +21,7 @@ void irq_info(void)
 
 
 /* Input 0 is IRQ and input 1 is FIQ.  */
-static void arm_pic_cpu_handler(void *opaque, int irq, int level)
+static int arm_pic_cpu_handler(void *opaque, int irq, int level)
 {
     CPUState *env = (CPUState *)opaque;
     switch (irq) {
@@ -40,6 +40,8 @@ static void arm_pic_cpu_handler(void *opaque, int irq, int level)
     default:
         cpu_abort(env, "arm_pic_cpu_handler: Bad interrput line %d\n", irq);
     }
+
+    return 1;
 }
 
 qemu_irq *arm_pic_init_cpu(CPUState *env)
diff --git a/hw/arm_timer.c b/hw/arm_timer.c
index 5150fe9..635c7e2 100644
--- a/hw/arm_timer.c
+++ b/hw/arm_timer.c
@@ -195,12 +195,12 @@ typedef struct {
 } sp804_state;
 
 /* Merge the IRQs from the two component devices.  */
-static void sp804_set_irq(void *opaque, int irq, int level)
+static int sp804_set_irq(void *opaque, int irq, int level)
 {
     sp804_state *s = (sp804_state *)opaque;
 
     s->level[irq] = level;
-    qemu_set_irq(s->irq, s->level[0] || s->level[1]);
+    return qemu_set_irq(s->irq, s->level[0] || s->level[1]);
 }
 
 static uint32_t sp804_read(void *opaque, target_phys_addr_t offset)
diff --git a/hw/cbus.c b/hw/cbus.c
index 03218b4..9a57cec 100644
--- a/hw/cbus.c
+++ b/hw/cbus.c
@@ -97,7 +97,7 @@ static void cbus_cycle(struct cbus_priv_s *s)
     }
 }
 
-static void cbus_clk(void *opaque, int line, int level)
+static int cbus_clk(void *opaque, int line, int level)
 {
     struct cbus_priv_s *s = (struct cbus_priv_s *) opaque;
 
@@ -112,16 +112,20 @@ static void cbus_clk(void *opaque, int line, int level)
     }
 
     s->clk = level;
+
+    return 1;
 }
 
-static void cbus_dat(void *opaque, int line, int level)
+static int cbus_dat(void *opaque, int line, int level)
 {
     struct cbus_priv_s *s = (struct cbus_priv_s *) opaque;
 
     s->dat = level;
+
+    return 1;
 }
 
-static void cbus_sel(void *opaque, int line, int level)
+static int cbus_sel(void *opaque, int line, int level)
 {
     struct cbus_priv_s *s = (struct cbus_priv_s *) opaque;
 
@@ -132,6 +136,8 @@ static void cbus_sel(void *opaque, int line, int level)
     }
 
     s->sel = level;
+
+    return 1;
 }
 
 struct cbus_s *cbus_init(qemu_irq dat)
diff --git a/hw/esp.c b/hw/esp.c
index 6b16cf4..abc2a2a 100644
--- a/hw/esp.c
+++ b/hw/esp.c
@@ -409,10 +409,12 @@ static void esp_reset(void *opaque)
     s->do_cmd = 0;
 }
 
-static void parent_esp_reset(void *opaque, int irq, int level)
+static int parent_esp_reset(void *opaque, int irq, int level)
 {
     if (level)
         esp_reset(opaque);
+
+    return 1;
 }
 
 static uint32_t esp_mem_readb(void *opaque, target_phys_addr_t addr)
diff --git a/hw/etraxfs_pic.c b/hw/etraxfs_pic.c
index d145bec..0f7c8ed 100644
--- a/hw/etraxfs_pic.c
+++ b/hw/etraxfs_pic.c
@@ -144,7 +144,7 @@ void irq_info(void)
 {
 }
 
-static void irq_handler(void *opaque, int irq, int level)
+static int irq_handler(void *opaque, int irq, int level)
 {	
 	struct fs_pic_state_t *fs = (void *)opaque;
 	CPUState *env = fs->env;
@@ -186,9 +186,11 @@ static void irq_handler(void *opaque, int irq, int level)
 		cpu_reset_interrupt(env, CPU_INTERRUPT_HARD);
 		D(printf("%s reset irqs\n", __func__));
 	}
+
+	return 1;
 }
 
-static void nmi_handler(void *opaque, int irq, int level)
+static int nmi_handler(void *opaque, int irq, int level)
 {	
 	struct fs_pic_state_t *fs = (void *)opaque;
 	CPUState *env = fs->env;
@@ -204,14 +206,16 @@ static void nmi_handler(void *opaque, int irq, int level)
 		cpu_interrupt(env, CPU_INTERRUPT_NMI);
 	else
 		cpu_reset_interrupt(env, CPU_INTERRUPT_NMI);
+
+	return 1;
 }
 
-static void guru_handler(void *opaque, int irq, int level)
+static int guru_handler(void *opaque, int irq, int level)
 {	
 	struct fs_pic_state_t *fs = (void *)opaque;
 	CPUState *env = fs->env;
 	cpu_abort(env, "%s unsupported exception\n", __func__);
-
+	return 1;
 }
 
 
diff --git a/hw/fdc.c b/hw/fdc.c
index cd00420..b212d89 100644
--- a/hw/fdc.c
+++ b/hw/fdc.c
@@ -713,7 +713,7 @@ static void fdctrl_external_reset(void *opaque)
     fdctrl_reset(s, 0);
 }
 
-static void fdctrl_handle_tc(void *opaque, int irq, int level)
+static int fdctrl_handle_tc(void *opaque, int irq, int level)
 {
     //fdctrl_t *s = opaque;
 
@@ -721,6 +721,8 @@ static void fdctrl_handle_tc(void *opaque, int irq, int level)
         // XXX
         FLOPPY_DPRINTF("TC pulsed\n");
     }
+
+    return 1;
 }
 
 /* XXX: may change if moved to bdrv */
diff --git a/hw/heathrow_pic.c b/hw/heathrow_pic.c
index dc2a30c..81657cf 100644
--- a/hw/heathrow_pic.c
+++ b/hw/heathrow_pic.c
@@ -135,7 +135,7 @@ static CPUReadMemoryFunc *pic_read[] = {
 };
 
 
-static void heathrow_pic_set_irq(void *opaque, int num, int level)
+static int heathrow_pic_set_irq(void *opaque, int num, int level)
 {
     HeathrowPICS *s = opaque;
     HeathrowPIC *pic;
@@ -159,6 +159,8 @@ static void heathrow_pic_set_irq(void *opaque, int num, int level)
         pic->levels &= ~irq_bit;
     }
     heathrow_pic_update(s);
+
+    return 1;
 }
 
 qemu_irq *heathrow_pic_init(int *pmem_index,
diff --git a/hw/i8259.c b/hw/i8259.c
index 750a76c..5ea3415 100644
--- a/hw/i8259.c
+++ b/hw/i8259.c
@@ -72,9 +72,9 @@ static uint64_t irq_count[16];
 #endif
 
 /* set irq level. If an edge is detected, then the IRR is set to 1 */
-static inline void pic_set_irq1(PicState *s, int irq, int level)
+static inline int pic_set_irq1(PicState *s, int irq, int level)
 {
-    int mask;
+    int mask, ret = 1;
     mask = 1 << irq;
     if (s->elcr & mask) {
         /* level triggered */
@@ -88,13 +88,18 @@ static inline void pic_set_irq1(PicState *s, int irq, int level)
     } else {
         /* edge triggered */
         if (level) {
-            if ((s->last_irr & mask) == 0)
+            if ((s->last_irr & mask) == 0) {
+                if((s->irr & mask))
+                    ret = 0;
                 s->irr |= mask;
+            }
             s->last_irr |= mask;
         } else {
             s->last_irr &= ~mask;
         }
     }
+
+    return (s->imr & mask) ? -1 : ret;
 }
 
 /* return the highest priority found in mask (highest = smallest
@@ -180,9 +185,10 @@ void pic_update_irq(PicState2 *s)
 int64_t irq_time[16];
 #endif
 
-static void i8259_set_irq(void *opaque, int irq, int level)
+static int i8259_set_irq(void *opaque, int irq, int level)
 {
     PicState2 *s = opaque;
+    int pic_ret, ioapic_ret = -1;
 
 #if defined(DEBUG_PIC) || defined(DEBUG_IRQ_COUNT)
     if (level != irq_level[irq]) {
@@ -201,11 +207,12 @@ static void i8259_set_irq(void *opaque, int irq, int level)
         irq_time[irq] = qemu_get_clock(vm_clock);
     }
 #endif
-    pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+    pic_ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
     /* used for IOAPIC irqs */
     if (s->alt_irq_func)
-        s->alt_irq_func(s->alt_irq_opaque, irq, level);
+        ioapic_ret = s->alt_irq_func(s->alt_irq_opaque, irq, level);
     pic_update_irq(s);
+    return MAX(pic_ret, ioapic_ret);
 }
 
 /* acknowledge interrupt 'irq' */
diff --git a/hw/ide.c b/hw/ide.c
index 33e8b39..6aec211 100644
--- a/hw/ide.c
+++ b/hw/ide.c
@@ -3135,7 +3135,7 @@ static void cmd646_update_irq(PCIIDEState *d)
 }
 
 /* the PCI irq level is the logical OR of the two channels */
-static void cmd646_set_irq(void *opaque, int channel, int level)
+static int cmd646_set_irq(void *opaque, int channel, int level)
 {
     PCIIDEState *d = opaque;
     int irq_mask;
@@ -3146,6 +3146,8 @@ static void cmd646_set_irq(void *opaque, int channel, int level)
     else
         d->dev.config[MRDMODE] &= ~irq_mask;
     cmd646_update_irq(d);
+
+    return 1;
 }
 
 /* CMD646 PCI IDE controller */
@@ -3545,7 +3547,7 @@ static inline void md_interrupt_update(struct md_s *s)
                     !(s->opt & OPT_SRESET));
 }
 
-static void md_set_irq(void *opaque, int irq, int level)
+static int md_set_irq(void *opaque, int irq, int level)
 {
     struct md_s *s = (struct md_s *) opaque;
     if (level)
@@ -3554,6 +3556,8 @@ static void md_set_irq(void *opaque, int irq, int level)
         s->stat &= ~STAT_INT;
 
     md_interrupt_update(s);
+
+    return 1;
 }
 
 static void md_reset(struct md_s *s)
diff --git a/hw/integratorcp.c b/hw/integratorcp.c
index 779d46b..b69512e 100644
--- a/hw/integratorcp.c
+++ b/hw/integratorcp.c
@@ -290,7 +290,7 @@ static void icp_pic_update(icp_pic_state *s)
     qemu_set_irq(s->parent_fiq, flags != 0);
 }
 
-static void icp_pic_set_irq(void *opaque, int irq, int level)
+static int icp_pic_set_irq(void *opaque, int irq, int level)
 {
     icp_pic_state *s = (icp_pic_state *)opaque;
     if (level)
@@ -298,6 +298,8 @@ static void icp_pic_set_irq(void *opaque, int irq, int level)
     else
         s->level &= ~(1 << irq);
     icp_pic_update(s);
+
+    return 1;
 }
 
 static uint32_t icp_pic_read(void *opaque, target_phys_addr_t offset)
diff --git a/hw/irq.c b/hw/irq.c
index eca707d..2159604 100644
--- a/hw/irq.c
+++ b/hw/irq.c
@@ -24,18 +24,26 @@
 #include "qemu-common.h"
 #include "irq.h"
 
+extern int use_icount;
+
 struct IRQState {
     qemu_irq_handler handler;
     void *opaque;
     int n;
 };
 
-void qemu_set_irq(qemu_irq irq, int level)
+int qemu_set_irq(qemu_irq irq, int level)
 {
+    int ret;
+
     if (!irq)
-        return;
+        return 1;
+
+    ret = irq->handler(irq->opaque, irq->n, level);
 
-    irq->handler(irq->opaque, irq->n, level);
+    /* if virtual clock is in use then emulation is as close to real HW as
+     * possible, so we never re-inject lost interrupts in this case */
+    return use_icount ? 1 : ret;
 }
 
 qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
@@ -56,11 +64,11 @@ qemu_irq *qemu_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
     return s;
 }
 
-static void qemu_notirq(void *opaque, int line, int level)
+static int qemu_notirq(void *opaque, int line, int level)
 {
     struct IRQState *irq = opaque;
 
-    irq->handler(irq->opaque, irq->n, !level);
+    return irq->handler(irq->opaque, irq->n, !level);
 }
 
 qemu_irq qemu_irq_invert(qemu_irq irq)
diff --git a/hw/irq.h b/hw/irq.h
index 0880ad2..53e5921 100644
--- a/hw/irq.h
+++ b/hw/irq.h
@@ -4,25 +4,38 @@
 /* Generic IRQ/GPIO pin infrastructure.  */
 
 /* FIXME: Rmove one of these.  */
-typedef void (*qemu_irq_handler)(void *opaque, int n, int level);
-typedef void SetIRQFunc(void *opaque, int irq_num, int level);
-
-void qemu_set_irq(qemu_irq irq, int level);
-
-static inline void qemu_irq_raise(qemu_irq irq)
+/* 
+ * Negative return value indicates that interrupt is masked.
+ * Positive return value indicates that interrupt was delivered
+ * to at least one CPU or was queued in interrupt controller for
+ * delivery. If return value is greater then one it indicates the
+ * number of CPUs that handled the interrupt.
+ * Zero value indicates that interrupt was lost (due to coalescing
+ * for instance).
+ */
+typedef int (*qemu_irq_handler)(void *opaque, int n, int level);
+typedef int SetIRQFunc(void *opaque, int irq_num, int level);
+
+int qemu_set_irq(qemu_irq irq, int level);
+
+static inline int qemu_irq_raise(qemu_irq irq)
 {
-    qemu_set_irq(irq, 1);
+    return qemu_set_irq(irq, 1);
 }
 
-static inline void qemu_irq_lower(qemu_irq irq)
+static inline int qemu_irq_lower(qemu_irq irq)
 {
-    qemu_set_irq(irq, 0);
+    return qemu_set_irq(irq, 0);
 }
 
-static inline void qemu_irq_pulse(qemu_irq irq)
+static inline int qemu_irq_pulse(qemu_irq irq)
 {
-    qemu_set_irq(irq, 1);
+    int ret;
+
+    ret = qemu_set_irq(irq, 1);
     qemu_set_irq(irq, 0);
+
+    return ret;
 }
 
 /* Returns an array of N IRQs.  */
diff --git a/hw/max7310.c b/hw/max7310.c
index 2816611..a8e759f 100644
--- a/hw/max7310.c
+++ b/hw/max7310.c
@@ -177,7 +177,7 @@ static int max7310_load(QEMUFile *f, void *opaque, int version_id)
     return 0;
 }
 
-static void max7310_gpio_set(void *opaque, int line, int level)
+static int max7310_gpio_set(void *opaque, int line, int level)
 {
     struct max7310_s *s = (struct max7310_s *) opaque;
     if (line >= sizeof(s->handler) / sizeof(*s->handler) || line  < 0)
@@ -187,6 +187,8 @@ static void max7310_gpio_set(void *opaque, int line, int level)
         s->level |= s->direction & (1 << line);
     else
         s->level &= ~(s->direction & (1 << line));
+
+    return 1;
 }
 
 /* MAX7310 is SMBus-compatible (can be used with only SMBus protocols),
diff --git a/hw/mcf5206.c b/hw/mcf5206.c
index 449ca12..79d9e8e 100644
--- a/hw/mcf5206.c
+++ b/hw/mcf5206.c
@@ -228,7 +228,7 @@ static void m5206_mbar_update(m5206_mbar_state *s)
     m68k_set_irq_level(s->env, level, vector);
 }
 
-static void m5206_mbar_set_irq(void *opaque, int irq, int level)
+static int m5206_mbar_set_irq(void *opaque, int irq, int level)
 {
     m5206_mbar_state *s = (m5206_mbar_state *)opaque;
     if (level) {
@@ -237,6 +237,8 @@ static void m5206_mbar_set_irq(void *opaque, int irq, int level)
         s->ipr &= ~(1 << irq);
     }
     m5206_mbar_update(s);
+
+    return 1;
 }
 
 /* System Integration Module.  */
diff --git a/hw/mcf_intc.c b/hw/mcf_intc.c
index 4e99aeb..edca76f 100644
--- a/hw/mcf_intc.c
+++ b/hw/mcf_intc.c
@@ -106,16 +106,18 @@ static void mcf_intc_write(void *opaque, target_phys_addr_t addr, uint32_t val)
     mcf_intc_update(s);
 }
 
-static void mcf_intc_set_irq(void *opaque, int irq, int level)
+static int mcf_intc_set_irq(void *opaque, int irq, int level)
 {
     mcf_intc_state *s = (mcf_intc_state *)opaque;
     if (irq >= 64)
-        return;
+        return 1;
     if (level)
         s->ipr |= 1ull << irq;
     else
         s->ipr &= ~(1ull << irq);
     mcf_intc_update(s);
+
+    return 1;
 }
 
 static void mcf_intc_reset(mcf_intc_state *s)
diff --git a/hw/mips_int.c b/hw/mips_int.c
index ad48b4f..7e18e1b 100644
--- a/hw/mips_int.c
+++ b/hw/mips_int.c
@@ -18,12 +18,12 @@ void cpu_mips_update_irq(CPUState *env)
         cpu_reset_interrupt(env, CPU_INTERRUPT_HARD);
 }
 
-static void cpu_mips_irq_request(void *opaque, int irq, int level)
+static int cpu_mips_irq_request(void *opaque, int irq, int level)
 {
     CPUState *env = (CPUState *)opaque;
 
     if (irq < 0 || irq > 7)
-        return;
+        return 1;
 
     if (level) {
         env->CP0_Cause |= 1 << (irq + CP0Ca_IP);
@@ -31,6 +31,8 @@ static void cpu_mips_irq_request(void *opaque, int irq, int level)
         env->CP0_Cause &= ~(1 << (irq + CP0Ca_IP));
     }
     cpu_mips_update_irq(env);
+
+    return 1;
 }
 
 void cpu_mips_irq_init_cpu(CPUState *env)
diff --git a/hw/mpcore.c b/hw/mpcore.c
index d5b28fe..6e71081 100644
--- a/hw/mpcore.c
+++ b/hw/mpcore.c
@@ -293,7 +293,7 @@ static const int mpcore_irq_map[32] = {
     -1, -1, -1, -1,  9,  3, -1, -1,
 };
 
-static void mpcore_rirq_set_irq(void *opaque, int irq, int level)
+static int mpcore_rirq_set_irq(void *opaque, int irq, int level)
 {
     mpcore_rirq_state *s = (mpcore_rirq_state *)opaque;
     int i;
@@ -307,6 +307,8 @@ static void mpcore_rirq_set_irq(void *opaque, int irq, int level)
             qemu_set_irq(s->cpuic[irq], level);
         }
     }
+
+    return 1;
 }
 
 qemu_irq *mpcore_irq_init(qemu_irq *cpu_irq)
diff --git a/hw/mst_fpga.c b/hw/mst_fpga.c
index 2d5ac5a..1773c2a 100644
--- a/hw/mst_fpga.c
+++ b/hw/mst_fpga.c
@@ -62,7 +62,7 @@ mst_fpga_update_gpio(mst_irq_state *s)
 	s->prev_level = level;
 }
 
-static void
+static int 
 mst_fpga_set_irq(void *opaque, int irq, int level)
 {
 	mst_irq_state *s = (mst_irq_state *)opaque;
@@ -76,6 +76,8 @@ mst_fpga_set_irq(void *opaque, int irq, int level)
 		s->intsetclr = 1u << irq;
 		qemu_set_irq(s->parent[0], level);
 	}
+
+	return 1;
 }
 
 
diff --git a/hw/musicpal.c b/hw/musicpal.c
index c7c11de..0a1cf01 100644
--- a/hw/musicpal.c
+++ b/hw/musicpal.c
@@ -951,7 +951,7 @@ static void mv88w8618_pic_update(mv88w8618_pic_state *s)
     qemu_set_irq(s->parent_irq, (s->level & s->enabled));
 }
 
-static void mv88w8618_pic_set_irq(void *opaque, int irq, int level)
+static int mv88w8618_pic_set_irq(void *opaque, int irq, int level)
 {
     mv88w8618_pic_state *s = opaque;
 
@@ -960,6 +960,8 @@ static void mv88w8618_pic_set_irq(void *opaque, int irq, int level)
     else
         s->level &= ~(1 << irq);
     mv88w8618_pic_update(s);
+
+    return 1;
 }
 
 static uint32_t mv88w8618_pic_read(void *opaque, target_phys_addr_t offset)
diff --git a/hw/nseries.c b/hw/nseries.c
index 80fd9e8..1ddc28e 100644
--- a/hw/nseries.c
+++ b/hw/nseries.c
@@ -123,13 +123,14 @@ struct n800_s {
 
 #define N8X0_BD_ADDR			0x00, 0x1a, 0x89, 0x9e, 0x3e, 0x81
 
-static void n800_mmc_cs_cb(void *opaque, int line, int level)
+static int n800_mmc_cs_cb(void *opaque, int line, int level)
 {
     /* TODO: this seems to actually be connected to the menelaus, to
      * which also both MMC slots connect.  */
     omap_mmc_enable((struct omap_mmc_s *) opaque, !level);
 
     printf("%s: MMC slot %i active\n", __FUNCTION__, level + 1);
+    return 1;
 }
 
 static void n8x0_gpio_setup(struct n800_s *s)
@@ -755,11 +756,13 @@ static void n8x0_uart_setup(struct n800_s *s)
     omap_uart_attach(s->cpu->uart[BT_UART], radio);
 }
 
-static void n8x0_usb_power_cb(void *opaque, int line, int level)
+static int n8x0_usb_power_cb(void *opaque, int line, int level)
 {
     struct n800_s *s = opaque;
 
     tusb6010_power(s->usb, level);
+
+    return 1;
 }
 
 static void n8x0_usb_setup(struct n800_s *s)
diff --git a/hw/omap.h b/hw/omap.h
index 4c30436..5d2f729 100644
--- a/hw/omap.h
+++ b/hw/omap.h
@@ -992,7 +992,7 @@ uint32_t omap_badwidth_read32(void *opaque, target_phys_addr_t addr);
 void omap_badwidth_write32(void *opaque, target_phys_addr_t addr,
                 uint32_t value);
 
-void omap_mpu_wakeup(void *opaque, int irq, int req);
+int omap_mpu_wakeup(void *opaque, int irq, int req);
 
 # define OMAP_BAD_REG(paddr)		\
         fprintf(stderr, "%s: Bad register " OMAP_FMT_plx "\n",	\
diff --git a/hw/omap1.c b/hw/omap1.c
index a32563b..7066f2e 100644
--- a/hw/omap1.c
+++ b/hw/omap1.c
@@ -153,7 +153,7 @@ static inline void omap_inth_update(struct omap_intr_handler_s *s, int is_fiq)
 #define INT_FALLING_EDGE	0
 #define INT_LOW_LEVEL		1
 
-static void omap_set_intr(void *opaque, int irq, int req)
+static int omap_set_intr(void *opaque, int irq, int req)
 {
     struct omap_intr_handler_s *ih = (struct omap_intr_handler_s *) opaque;
     uint32_t rise;
@@ -177,10 +177,12 @@ static void omap_set_intr(void *opaque, int irq, int req)
         bank->irqs &= ~rise;
         bank->inputs &= ~(1 << n);
     }
+
+    return 1;
 }
 
 /* Simplified version with no edge detection */
-static void omap_set_intr_noedge(void *opaque, int irq, int req)
+static int omap_set_intr_noedge(void *opaque, int irq, int req)
 {
     struct omap_intr_handler_s *ih = (struct omap_intr_handler_s *) opaque;
     uint32_t rise;
@@ -197,6 +199,8 @@ static void omap_set_intr_noedge(void *opaque, int irq, int req)
         }
     } else
         bank->irqs = (bank->inputs &= ~(1 << n)) | bank->swi;
+
+    return 1;
 }
 
 static uint32_t omap_inth_read(void *opaque, target_phys_addr_t addr)
@@ -738,13 +742,15 @@ static void omap_timer_tick(void *opaque)
     omap_timer_update(timer);
 }
 
-static void omap_timer_clk_update(void *opaque, int line, int on)
+static int omap_timer_clk_update(void *opaque, int line, int on)
 {
     struct omap_mpu_timer_s *timer = (struct omap_mpu_timer_s *) opaque;
 
     omap_timer_sync(timer);
     timer->rate = on ? omap_clk_getrate(timer->clk) : 0;
     omap_timer_update(timer);
+
+    return 1;
 }
 
 static void omap_timer_clk_setup(struct omap_mpu_timer_s *timer)
@@ -2575,7 +2581,7 @@ struct omap_mpuio_s {
     int clk;
 };
 
-static void omap_mpuio_set(void *opaque, int line, int level)
+static int omap_mpuio_set(void *opaque, int line, int level)
 {
     struct omap_mpuio_s *s = (struct omap_mpuio_s *) opaque;
     uint16_t prev = s->inputs;
@@ -2595,6 +2601,8 @@ static void omap_mpuio_set(void *opaque, int line, int level)
                 (s->event >> 1) == line)	/* PIN_SELECT */
             s->latch = s->inputs;
     }
+
+    return 1;
 }
 
 static void omap_mpuio_kbd_update(struct omap_mpuio_s *s)
@@ -2766,13 +2774,15 @@ static void omap_mpuio_reset(struct omap_mpuio_s *s)
     s->clk = 1;
 }
 
-static void omap_mpuio_onoff(void *opaque, int line, int on)
+static int omap_mpuio_onoff(void *opaque, int line, int on)
 {
     struct omap_mpuio_s *s = (struct omap_mpuio_s *) opaque;
 
     s->clk = on;
     if (on)
         omap_mpuio_kbd_update(s);
+
+    return 1;
 }
 
 struct omap_mpuio_s *omap_mpuio_init(target_phys_addr_t base,
@@ -2841,7 +2851,7 @@ struct omap_gpio_s {
     uint16_t pins;
 };
 
-static void omap_gpio_set(void *opaque, int line, int level)
+static int omap_gpio_set(void *opaque, int line, int level)
 {
     struct omap_gpio_s *s = (struct omap_gpio_s *) opaque;
     uint16_t prev = s->inputs;
@@ -2856,6 +2866,8 @@ static void omap_gpio_set(void *opaque, int line, int level)
         s->ints |= 1 << line;
         qemu_irq_raise(s->irq);
     }
+
+    return 1;
 }
 
 static uint32_t omap_gpio_read(void *opaque, target_phys_addr_t addr)
@@ -3247,12 +3259,14 @@ static void omap_pwl_reset(struct omap_mpu_state_s *s)
     omap_pwl_update(s);
 }
 
-static void omap_pwl_clk_update(void *opaque, int line, int on)
+static int omap_pwl_clk_update(void *opaque, int line, int on)
 {
     struct omap_mpu_state_s *s = (struct omap_mpu_state_s *) opaque;
 
     s->pwl.clk = on;
     omap_pwl_update(s);
+
+    return 1;
 }
 
 static void omap_pwl_init(target_phys_addr_t base, struct omap_mpu_state_s *s,
@@ -4311,7 +4325,7 @@ struct omap_mcbsp_s *omap_mcbsp_init(target_phys_addr_t base,
     return s;
 }
 
-static void omap_mcbsp_i2s_swallow(void *opaque, int line, int level)
+static int omap_mcbsp_i2s_swallow(void *opaque, int line, int level)
 {
     struct omap_mcbsp_s *s = (struct omap_mcbsp_s *) opaque;
 
@@ -4319,9 +4333,11 @@ static void omap_mcbsp_i2s_swallow(void *opaque, int line, int level)
         s->rx_req = s->codec->in.len;
         omap_mcbsp_rx_newdata(s);
     }
+
+    return 1;
 }
 
-static void omap_mcbsp_i2s_start(void *opaque, int line, int level)
+static int omap_mcbsp_i2s_start(void *opaque, int line, int level)
 {
     struct omap_mcbsp_s *s = (struct omap_mcbsp_s *) opaque;
 
@@ -4329,6 +4345,8 @@ static void omap_mcbsp_i2s_start(void *opaque, int line, int level)
         s->tx_req = s->codec->out.size;
         omap_mcbsp_tx_newdata(s);
     }
+
+    return 1;
 }
 
 void omap_mcbsp_i2s_attach(struct omap_mcbsp_s *s, struct i2s_codec_s *slave)
@@ -4459,12 +4477,14 @@ static CPUWriteMemoryFunc *omap_lpg_writefn[] = {
     omap_badwidth_write8,
 };
 
-static void omap_lpg_clk_update(void *opaque, int line, int on)
+static int omap_lpg_clk_update(void *opaque, int line, int on)
 {
     struct omap_lpg_s *s = (struct omap_lpg_s *) opaque;
 
     s->clk = on;
     omap_lpg_update(s);
+
+    return 1;
 }
 
 struct omap_lpg_s *omap_lpg_init(target_phys_addr_t base, omap_clk clk)
@@ -4599,12 +4619,14 @@ static void omap_setup_dsp_mapping(const struct omap_map_s *map)
     }
 }
 
-void omap_mpu_wakeup(void *opaque, int irq, int req)
+int omap_mpu_wakeup(void *opaque, int irq, int req)
 {
     struct omap_mpu_state_s *mpu = (struct omap_mpu_state_s *) opaque;
 
     if (mpu->env->halted)
         cpu_interrupt(mpu->env, CPU_INTERRUPT_EXITTB);
+
+    return 1;
 }
 
 static const struct dma_irq_map omap1_dma_irq_map[] = {
diff --git a/hw/omap2.c b/hw/omap2.c
index 5add059..e171dbb 100644
--- a/hw/omap2.c
+++ b/hw/omap2.c
@@ -193,7 +193,7 @@ static void omap_gp_timer_match(void *opaque)
     omap_gp_timer_intr(timer, GPT_MAT_IT);
 }
 
-static void omap_gp_timer_input(void *opaque, int line, int on)
+static int omap_gp_timer_input(void *opaque, int line, int on)
 {
     struct omap_gp_timer_s *s = (struct omap_gp_timer_s *) opaque;
     int trigger;
@@ -221,15 +221,19 @@ static void omap_gp_timer_input(void *opaque, int line, int on)
         if (s->capt2 == s->capt_num ++)
             omap_gp_timer_intr(s, GPT_TCAR_IT);
     }
+
+    return 1;
 }
 
-static void omap_gp_timer_clk_update(void *opaque, int line, int on)
+static int omap_gp_timer_clk_update(void *opaque, int line, int on)
 {
     struct omap_gp_timer_s *timer = (struct omap_gp_timer_s *) opaque;
 
     omap_gp_timer_sync(timer);
     timer->rate = on ? omap_clk_getrate(timer->clk) : 0;
     omap_gp_timer_update(timer);
+
+    return 1;
 }
 
 static void omap_gp_timer_clk_setup(struct omap_gp_timer_s *timer)
@@ -632,7 +636,7 @@ static inline void omap_gpio_module_int(struct omap2_gpio_s *s, int line)
     omap_gpio_module_wake(s, line);
 }
 
-static void omap_gpio_module_set(void *opaque, int line, int level)
+static int omap_gpio_module_set(void *opaque, int line, int level)
 {
     struct omap2_gpio_s *s = (struct omap2_gpio_s *) opaque;
 
@@ -645,6 +649,8 @@ static void omap_gpio_module_set(void *opaque, int line, int level)
             omap_gpio_module_int(s, line);
         s->inputs &= ~(1 << line);
     }
+
+    return 1;
 }
 
 static void omap_gpio_module_reset(struct omap2_gpio_s *s)
diff --git a/hw/omap_dma.c b/hw/omap_dma.c
index ba980df..52ba18e 100644
--- a/hw/omap_dma.c
+++ b/hw/omap_dma.c
@@ -1541,7 +1541,7 @@ static CPUWriteMemoryFunc *omap_dma_writefn[] = {
     omap_badwidth_write16,
 };
 
-static void omap_dma_request(void *opaque, int drq, int req)
+static int omap_dma_request(void *opaque, int drq, int req)
 {
     struct omap_dma_s *s = (struct omap_dma_s *) opaque;
     /* The request pins are level triggered in QEMU.  */
@@ -1552,10 +1552,12 @@ static void omap_dma_request(void *opaque, int drq, int req)
         }
     } else
         s->dma->drqbmp &= ~(1 << drq);
+
+    return 1;
 }
 
 /* XXX: this won't be needed once soc_dma knows about clocks.  */
-static void omap_dma_clk_update(void *opaque, int line, int on)
+static int omap_dma_clk_update(void *opaque, int line, int on)
 {
     struct omap_dma_s *s = (struct omap_dma_s *) opaque;
     int i;
@@ -1565,6 +1567,8 @@ static void omap_dma_clk_update(void *opaque, int line, int on)
     for (i = 0; i < s->chans; i ++)
         if (s->ch[i].active)
             soc_dma_set_request(s->ch[i].dma, on);
+
+    return 1;
 }
 
 static void omap_dma_setcaps(struct omap_dma_s *s)
diff --git a/hw/omap_mmc.c b/hw/omap_mmc.c
index bc46edf..e4e3088 100644
--- a/hw/omap_mmc.c
+++ b/hw/omap_mmc.c
@@ -555,7 +555,7 @@ static CPUWriteMemoryFunc *omap_mmc_writefn[] = {
     omap_badwidth_write16,
 };
 
-static void omap_mmc_cover_cb(void *opaque, int line, int level)
+static int omap_mmc_cover_cb(void *opaque, int line, int level)
 {
     struct omap_mmc_s *host = (struct omap_mmc_s *) opaque;
 
@@ -570,6 +570,8 @@ static void omap_mmc_cover_cb(void *opaque, int line, int level)
         qemu_set_irq(host->coverswitch, level);
         host->cdet_state = level;
     }
+
+    return 1;
 }
 
 struct omap_mmc_s *omap_mmc_init(target_phys_addr_t base,
diff --git a/hw/openpic.c b/hw/openpic.c
index def20eb..3f8d7b0 100644
--- a/hw/openpic.c
+++ b/hw/openpic.c
@@ -346,7 +346,7 @@ static void openpic_update_irq(openpic_t *opp, int n_IRQ)
     }
 }
 
-static void openpic_set_irq(void *opaque, int n_IRQ, int level)
+static int openpic_set_irq(void *opaque, int n_IRQ, int level)
 {
     openpic_t *opp = opaque;
     IRQ_src_t *src;
@@ -365,6 +365,8 @@ static void openpic_set_irq(void *opaque, int n_IRQ, int level)
             src->pending = 1;
     }
     openpic_update_irq(opp, n_IRQ);
+
+    return 1;
 }
 
 static void openpic_reset (openpic_t *opp)
diff --git a/hw/palm.c b/hw/palm.c
index 9bc6eec..96c12a0 100644
--- a/hw/palm.c
+++ b/hw/palm.c
@@ -138,7 +138,7 @@ static void palmte_button_event(void *opaque, int keycode)
                         !(keycode & 0x80));
 }
 
-static void palmte_onoff_gpios(void *opaque, int line, int level)
+static int palmte_onoff_gpios(void *opaque, int line, int level)
 {
     switch (line) {
     case 0:
@@ -163,6 +163,8 @@ static void palmte_onoff_gpios(void *opaque, int line, int level)
                         __FUNCTION__, line - 4, level ? "high" : "low");
         break;
     }
+
+    return 1;
 }
 
 static void palmte_gpio_setup(struct omap_mpu_state_s *cpu)
diff --git a/hw/pc.c b/hw/pc.c
index 167d628..54b9b0b 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -116,7 +116,7 @@ int cpu_get_pic_interrupt(CPUState *env)
     return intno;
 }
 
-static void pic_irq_request(void *opaque, int irq, int level)
+static int pic_irq_request(void *opaque, int irq, int level)
 {
     CPUState *env = first_cpu;
 
@@ -132,6 +132,8 @@ static void pic_irq_request(void *opaque, int irq, int level)
         else
             cpu_reset_interrupt(env, CPU_INTERRUPT_HARD);
     }
+
+    return 1;
 }
 
 /* PC cmos mappings */
diff --git a/hw/pc.h b/hw/pc.h
index d64d8a6..0ed9037 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -45,7 +45,7 @@ int apic_accept_pic_intr(CPUState *env);
 void apic_deliver_pic_intr(CPUState *env, int level);
 int apic_get_interrupt(CPUState *env);
 IOAPICState *ioapic_init(void);
-void ioapic_set_irq(void *opaque, int vector, int level);
+int ioapic_set_irq(void *opaque, int vector, int level);
 
 /* i8254.c */
 
diff --git a/hw/pci.c b/hw/pci.c
index a0f91a8..dc7f9ff 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -47,7 +47,7 @@ struct PCIBus {
 };
 
 static void pci_update_mappings(PCIDevice *d);
-static void pci_set_irq(void *opaque, int irq_num, int level);
+static int pci_set_irq(void *opaque, int irq_num, int level);
 
 target_phys_addr_t pci_mem_base;
 static int pci_irq_index;
@@ -485,7 +485,7 @@ uint32_t pci_data_read(void *opaque, uint32_t addr, int len)
 /* generic PCI irq support */
 
 /* 0 <= irq_num <= 3. level must be 0 or 1 */
-static void pci_set_irq(void *opaque, int irq_num, int level)
+static int pci_set_irq(void *opaque, int irq_num, int level)
 {
     PCIDevice *pci_dev = (PCIDevice *)opaque;
     PCIBus *bus;
@@ -493,7 +493,7 @@ static void pci_set_irq(void *opaque, int irq_num, int level)
 
     change = level - pci_dev->irq_state[irq_num];
     if (!change)
-        return;
+        return 1;
 
     pci_dev->irq_state[irq_num] = level;
     for (;;) {
@@ -505,6 +505,8 @@ static void pci_set_irq(void *opaque, int irq_num, int level)
     }
     bus->irq_count[irq_num] += change;
     bus->set_irq(bus->irq_opaque, irq_num, bus->irq_count[irq_num] != 0);
+
+    return 1;
 }
 
 /***********************************************************/
diff --git a/hw/pcnet.c b/hw/pcnet.c
index 188e5ff..0c9c1a9 100644
--- a/hw/pcnet.c
+++ b/hw/pcnet.c
@@ -2047,10 +2047,12 @@ void pci_pcnet_init(PCIBus *bus, NICInfo *nd, int devfn)
 #if defined (TARGET_SPARC) && !defined(TARGET_SPARC64) // Avoid compile failure
 #include "sun4m.h"
 
-static void parent_lance_reset(void *opaque, int irq, int level)
+static int parent_lance_reset(void *opaque, int irq, int level)
 {
     if (level)
         pcnet_h_reset(opaque);
+
+    return 1;
 }
 
 static void lance_mem_writew(void *opaque, target_phys_addr_t addr,
diff --git a/hw/pl061.c b/hw/pl061.c
index 6db5e39..09899e7 100644
--- a/hw/pl061.c
+++ b/hw/pl061.c
@@ -214,7 +214,7 @@ static void pl061_reset(pl061_state *s)
   s->cr = 0xff;
 }
 
-static void pl061_set_irq(void * opaque, int irq, int level)
+static int pl061_set_irq(void * opaque, int irq, int level)
 {
     pl061_state *s = (pl061_state *)opaque;
     uint8_t mask;
@@ -226,6 +226,8 @@ static void pl061_set_irq(void * opaque, int irq, int level)
             s->data |= mask;
         pl061_update(s);
     }
+
+    return 1;
 }
 
 static CPUReadMemoryFunc *pl061_readfn[] = {
diff --git a/hw/pl190.c b/hw/pl190.c
index 267c26b..ef5094c 100644
--- a/hw/pl190.c
+++ b/hw/pl190.c
@@ -56,7 +56,7 @@ static void pl190_update(pl190_state *s)
     qemu_set_irq(s->fiq, set);
 }
 
-static void pl190_set_irq(void *opaque, int irq, int level)
+static int pl190_set_irq(void *opaque, int irq, int level)
 {
     pl190_state *s = (pl190_state *)opaque;
 
@@ -65,6 +65,8 @@ static void pl190_set_irq(void *opaque, int irq, int level)
     else
         s->level &= ~(1u << irq);
     pl190_update(s);
+
+    return 1;
 }
 
 static void pl190_update_vectors(pl190_state *s)
diff --git a/hw/ppc.c b/hw/ppc.c
index 60d6e86..8ce522d 100644
--- a/hw/ppc.c
+++ b/hw/ppc.c
@@ -54,7 +54,7 @@ static void ppc_set_irq (CPUState *env, int n_IRQ, int level)
 }
 
 /* PowerPC 6xx / 7xx internal IRQ controller */
-static void ppc6xx_set_irq (void *opaque, int pin, int level)
+static int ppc6xx_set_irq (void *opaque, int pin, int level)
 {
     CPUState *env = opaque;
     int cur_level;
@@ -163,13 +163,15 @@ static void ppc6xx_set_irq (void *opaque, int pin, int level)
                 fprintf(logfile, "%s: unknown IRQ pin %d\n", __func__, pin);
             }
 #endif
-            return;
+            return 1;
         }
         if (level)
             env->irq_input_state |= 1 << pin;
         else
             env->irq_input_state &= ~(1 << pin);
     }
+
+    return 1;
 }
 
 void ppc6xx_irq_init (CPUState *env)
@@ -180,7 +182,7 @@ void ppc6xx_irq_init (CPUState *env)
 
 #if defined(TARGET_PPC64)
 /* PowerPC 970 internal IRQ controller */
-static void ppc970_set_irq (void *opaque, int pin, int level)
+static int ppc970_set_irq (void *opaque, int pin, int level)
 {
     CPUState *env = opaque;
     int cur_level;
@@ -287,13 +289,15 @@ static void ppc970_set_irq (void *opaque, int pin, int level)
                 fprintf(logfile, "%s: unknown IRQ pin %d\n", __func__, pin);
             }
 #endif
-            return;
+            return 1;
         }
         if (level)
             env->irq_input_state |= 1 << pin;
         else
             env->irq_input_state &= ~(1 << pin);
     }
+
+    return 1;
 }
 
 void ppc970_irq_init (CPUState *env)
@@ -304,7 +308,7 @@ void ppc970_irq_init (CPUState *env)
 #endif /* defined(TARGET_PPC64) */
 
 /* PowerPC 40x internal IRQ controller */
-static void ppc40x_set_irq (void *opaque, int pin, int level)
+static int ppc40x_set_irq (void *opaque, int pin, int level)
 {
     CPUState *env = opaque;
     int cur_level;
@@ -406,13 +410,15 @@ static void ppc40x_set_irq (void *opaque, int pin, int level)
                 fprintf(logfile, "%s: unknown IRQ pin %d\n", __func__, pin);
             }
 #endif
-            return;
+            return 1;
         }
         if (level)
             env->irq_input_state |= 1 << pin;
         else
             env->irq_input_state &= ~(1 << pin);
     }
+
+    return 1;
 }
 
 void ppc40x_irq_init (CPUState *env)
diff --git a/hw/ppc4xx_devs.c b/hw/ppc4xx_devs.c
index d240f0e..e55da8b 100644
--- a/hw/ppc4xx_devs.c
+++ b/hw/ppc4xx_devs.c
@@ -357,7 +357,7 @@ static void ppcuic_trigger_irq (ppcuic_t *uic)
     }
 }
 
-static void ppcuic_set_irq (void *opaque, int irq_num, int level)
+static int ppcuic_set_irq (void *opaque, int irq_num, int level)
 {
     ppcuic_t *uic;
     uint32_t mask, sr;
@@ -373,7 +373,7 @@ static void ppcuic_set_irq (void *opaque, int irq_num, int level)
     }
 #endif
     if (irq_num < 0 || irq_num > 31)
-        return;
+        return 1;
     sr = uic->uicsr;
 
     /* Update status register */
@@ -399,6 +399,8 @@ static void ppcuic_set_irq (void *opaque, int irq_num, int level)
 #endif
     if (sr != uic->uicsr)
         ppcuic_trigger_irq(uic);
+
+    return 1;
 }
 
 static target_ulong dcr_read_uic (void *opaque, int dcrn)
diff --git a/hw/pxa2xx.c b/hw/pxa2xx.c
index 2c838e5..a804c97 100644
--- a/hw/pxa2xx.c
+++ b/hw/pxa2xx.c
@@ -2016,7 +2016,7 @@ static struct pxa2xx_fir_s *pxa2xx_fir_init(target_phys_addr_t base,
     return s;
 }
 
-static void pxa2xx_reset(void *opaque, int line, int level)
+static int pxa2xx_reset(void *opaque, int line, int level)
 {
     struct pxa2xx_state_s *s = (struct pxa2xx_state_s *) opaque;
 
@@ -2024,6 +2024,8 @@ static void pxa2xx_reset(void *opaque, int line, int level)
         cpu_reset(s->env);
         /* TODO: reset peripherals */
     }
+
+    return 1;
 }
 
 /* Initialise a PXA270 integrated chip (ARM based core).  */
diff --git a/hw/pxa2xx_gpio.c b/hw/pxa2xx_gpio.c
index e3a30bc..3295f8d 100644
--- a/hw/pxa2xx_gpio.c
+++ b/hw/pxa2xx_gpio.c
@@ -87,7 +87,7 @@ static const int pxa2xx_gpio_wake[PXA2XX_GPIO_BANKS] = {
     0x8003fe1b, 0x002001fc, 0xec080000, 0x0012007f,
 };
 
-static void pxa2xx_gpio_set(void *opaque, int line, int level)
+static int pxa2xx_gpio_set(void *opaque, int line, int level)
 {
     struct pxa2xx_gpio_info_s *s = (struct pxa2xx_gpio_info_s *) opaque;
     int bank;
@@ -95,7 +95,7 @@ static void pxa2xx_gpio_set(void *opaque, int line, int level)
 
     if (line >= s->lines) {
         printf("%s: No GPIO pin %i\n", __FUNCTION__, line);
-        return;
+        return 1;
     }
 
     bank = line >> 5;
@@ -117,6 +117,8 @@ static void pxa2xx_gpio_set(void *opaque, int line, int level)
     /* Wake-up GPIOs */
     if (s->cpu_env->halted && (mask & ~s->dir[bank] & pxa2xx_gpio_wake[bank]))
         cpu_interrupt(s->cpu_env, CPU_INTERRUPT_EXITTB);
+
+    return 1;
 }
 
 static void pxa2xx_gpio_handler_update(struct pxa2xx_gpio_info_s *s) {
diff --git a/hw/pxa2xx_pcmcia.c b/hw/pxa2xx_pcmcia.c
index 1e96ee4..84f27dd 100644
--- a/hw/pxa2xx_pcmcia.c
+++ b/hw/pxa2xx_pcmcia.c
@@ -130,13 +130,13 @@ static CPUWriteMemoryFunc *pxa2xx_pcmcia_io_writefn[] = {
     pxa2xx_pcmcia_io_write,
 };
 
-static void pxa2xx_pcmcia_set_irq(void *opaque, int line, int level)
+static int pxa2xx_pcmcia_set_irq(void *opaque, int line, int level)
 {
     struct pxa2xx_pcmcia_s *s = (struct pxa2xx_pcmcia_s *) opaque;
     if (!s->irq)
-        return;
+        return 1;
 
-    qemu_set_irq(s->irq, level);
+    return qemu_set_irq(s->irq, level);
 }
 
 struct pxa2xx_pcmcia_s *pxa2xx_pcmcia_init(target_phys_addr_t base)
diff --git a/hw/pxa2xx_pic.c b/hw/pxa2xx_pic.c
index a3611b8..0328f24 100644
--- a/hw/pxa2xx_pic.c
+++ b/hw/pxa2xx_pic.c
@@ -68,7 +68,7 @@ static void pxa2xx_pic_update(void *opaque)
 
 /* Note: Here level means state of the signal on a pin, not
  * IRQ/FIQ distinction as in PXA Developer Manual.  */
-static void pxa2xx_pic_set_irq(void *opaque, int irq, int level)
+static int pxa2xx_pic_set_irq(void *opaque, int irq, int level)
 {
     struct pxa2xx_pic_state_s *s = (struct pxa2xx_pic_state_s *) opaque;
     int int_set = (irq >= 32);
@@ -80,6 +80,8 @@ static void pxa2xx_pic_set_irq(void *opaque, int irq, int level)
         s->int_pending[int_set] &= ~(1 << irq);
 
     pxa2xx_pic_update(opaque);
+
+    return 1;
 }
 
 static inline uint32_t pxa2xx_pic_highest(struct pxa2xx_pic_state_s *s) {
diff --git a/hw/rc4030.c b/hw/rc4030.c
index bf440db..245118d 100644
--- a/hw/rc4030.c
+++ b/hw/rc4030.c
@@ -420,7 +420,7 @@ static void update_jazz_irq(rc4030State *s)
         qemu_irq_lower(s->jazz_bus_irq);
 }
 
-static void rc4030_irq_jazz_request(void *opaque, int irq, int level)
+static int rc4030_irq_jazz_request(void *opaque, int irq, int level)
 {
     rc4030State *s = opaque;
 
@@ -431,6 +431,8 @@ static void rc4030_irq_jazz_request(void *opaque, int irq, int level)
     }
 
     update_jazz_irq(s);
+
+    return 1;
 }
 
 static void rc4030_periodic_timer(void *opaque)
diff --git a/hw/sbi.c b/hw/sbi.c
index 8d264f1..0e767e5 100644
--- a/hw/sbi.c
+++ b/hw/sbi.c
@@ -52,12 +52,14 @@ static void sbi_check_interrupts(void *opaque)
 {
 }
 
-static void sbi_set_irq(void *opaque, int irq, int level)
+static int sbi_set_irq(void *opaque, int irq, int level)
 {
+    return 1;
 }
 
-static void sbi_set_timer_irq_cpu(void *opaque, int cpu, int level)
+static int sbi_set_timer_irq_cpu(void *opaque, int cpu, int level)
 {
+    return 1;
 }
 
 static uint32_t sbi_mem_readl(void *opaque, target_phys_addr_t addr)
diff --git a/hw/sharpsl.h b/hw/sharpsl.h
index 184aae6..2054017 100644
--- a/hw/sharpsl.h
+++ b/hw/sharpsl.h
@@ -12,7 +12,7 @@
 /* zaurus.c */
 struct scoop_info_s *scoop_init(struct pxa2xx_state_s *cpu,
                 int instance, target_phys_addr_t target_base);
-void scoop_gpio_set(void *opaque, int line, int level);
+int scoop_gpio_set(void *opaque, int line, int level);
 qemu_irq *scoop_gpio_in_get(struct scoop_info_s *s);
 void scoop_gpio_out_set(struct scoop_info_s *s, int line,
                 qemu_irq handler);
diff --git a/hw/slavio_intctl.c b/hw/slavio_intctl.c
index 6a5d505..dec137b 100644
--- a/hw/slavio_intctl.c
+++ b/hw/slavio_intctl.c
@@ -284,7 +284,7 @@ static void slavio_check_interrupts(void *opaque)
  * "irq" here is the bit number in the system interrupt register to
  * separate serial and keyboard interrupts sharing a level.
  */
-static void slavio_set_irq(void *opaque, int irq, int level)
+static int slavio_set_irq(void *opaque, int irq, int level)
 {
     SLAVIO_INTCTLState *s = opaque;
     uint32_t mask = 1 << irq;
@@ -305,9 +305,11 @@ static void slavio_set_irq(void *opaque, int irq, int level)
         }
         slavio_check_interrupts(s);
     }
+
+    return 1;
 }
 
-static void slavio_set_timer_irq_cpu(void *opaque, int cpu, int level)
+static int slavio_set_timer_irq_cpu(void *opaque, int cpu, int level)
 {
     SLAVIO_INTCTLState *s = opaque;
 
@@ -322,6 +324,8 @@ static void slavio_set_timer_irq_cpu(void *opaque, int cpu, int level)
     }
 
     slavio_check_interrupts(s);
+
+    return 1;
 }
 
 static void slavio_intctl_save(QEMUFile *f, void *opaque)
diff --git a/hw/sparc32_dma.c b/hw/sparc32_dma.c
index c69b559..8b9fda6 100644
--- a/hw/sparc32_dma.c
+++ b/hw/sparc32_dma.c
@@ -116,7 +116,7 @@ void ledma_memory_write(void *opaque, target_phys_addr_t addr,
     }
 }
 
-static void dma_set_irq(void *opaque, int irq, int level)
+static int dma_set_irq(void *opaque, int irq, int level)
 {
     DMAState *s = opaque;
     if (level) {
@@ -128,6 +128,8 @@ static void dma_set_irq(void *opaque, int irq, int level)
         DPRINTF("Lower IRQ\n");
         qemu_irq_lower(s->irq);
     }
+
+    return 1;
 }
 
 void espdma_memory_read(void *opaque, uint8_t *buf, int len)
diff --git a/hw/spitz.c b/hw/spitz.c
index fc77174..33e6a17 100644
--- a/hw/spitz.c
+++ b/hw/spitz.c
@@ -261,7 +261,7 @@ static void spitz_keyboard_sense_update(struct spitz_keyboard_s *s)
     s->sense_state = sense;
 }
 
-static void spitz_keyboard_strobe(void *opaque, int line, int level)
+static int spitz_keyboard_strobe(void *opaque, int line, int level)
 {
     struct spitz_keyboard_s *s = (struct spitz_keyboard_s *) opaque;
 
@@ -270,6 +270,8 @@ static void spitz_keyboard_strobe(void *opaque, int line, int level)
     else
         s->strobe_state &= ~(1 << line);
     spitz_keyboard_sense_update(s);
+
+    return 1;
 }
 
 static void spitz_keyboard_keydown(struct spitz_keyboard_s *s, int keycode)
@@ -621,7 +623,7 @@ static void corgi_ssp_write(void *opaque, uint32_t value)
         max111x_write(max1111, value);
 }
 
-static void corgi_ssp_gpio_cs(void *opaque, int line, int level)
+static int corgi_ssp_gpio_cs(void *opaque, int line, int level)
 {
     switch (line) {
     case 0:
@@ -634,6 +636,8 @@ static void corgi_ssp_gpio_cs(void *opaque, int line, int level)
         max_en = !level;
         break;
     }
+
+    return 1;
 }
 
 #define MAX1111_BATT_VOLT	1
@@ -729,13 +733,15 @@ static void spitz_microdrive_attach(struct pxa2xx_state_s *cpu)
 #define SPITZ_GPIO_WM	5
 
 #ifdef HAS_AUDIO
-static void spitz_wm8750_addr(void *opaque, int line, int level)
+static int spitz_wm8750_addr(void *opaque, int line, int level)
 {
     i2c_slave *wm = (i2c_slave *) opaque;
     if (level)
         i2c_set_slave_address(wm, SPITZ_WM_ADDRH);
     else
         i2c_set_slave_address(wm, SPITZ_WM_ADDRL);
+
+    return 1;
 }
 #endif
 
@@ -774,7 +780,7 @@ static void spitz_akita_i2c_setup(struct pxa2xx_state_s *cpu)
 
 /* Other peripherals */
 
-static void spitz_out_switch(void *opaque, int line, int level)
+static int spitz_out_switch(void *opaque, int line, int level)
 {
     switch (line) {
     case 0:
@@ -799,6 +805,8 @@ static void spitz_out_switch(void *opaque, int line, int level)
         spitz_adc_temp_on(opaque, line, level);
         break;
     }
+
+    return 1;
 }
 
 #define SPITZ_SCP_LED_GREEN		1
@@ -846,11 +854,13 @@ static void spitz_scoop_gpio_setup(struct pxa2xx_state_s *cpu,
 
 static int spitz_hsync;
 
-static void spitz_lcd_hsync_handler(void *opaque, int line, int level)
+static int spitz_lcd_hsync_handler(void *opaque, int line, int level)
 {
     struct pxa2xx_state_s *cpu = (struct pxa2xx_state_s *) opaque;
     qemu_set_irq(pxa2xx_gpio_in_get(cpu->gpio)[SPITZ_GPIO_HSYNC], spitz_hsync);
     spitz_hsync ^= 1;
+
+    return 1;
 }
 
 static void spitz_gpio_setup(struct pxa2xx_state_s *cpu, int slots)
diff --git a/hw/ssd0323.c b/hw/ssd0323.c
index e496fe7..9b0e45e 100644
--- a/hw/ssd0323.c
+++ b/hw/ssd0323.c
@@ -268,11 +268,13 @@ static void ssd0323_invalidate_display(void * opaque)
 }
 
 /* Command/data input.  */
-static void ssd0323_cd(void *opaque, int n, int level)
+static int ssd0323_cd(void *opaque, int n, int level)
 {
     ssd0323_state *s = (ssd0323_state *)opaque;
     DPRINTF("%s mode\n", level ? "Data" : "Command");
     s->mode = level ? SSD0323_DATA : SSD0323_CMD;
+
+    return 1;
 }
 
 static void ssd0323_save(QEMUFile *f, void *opaque)
diff --git a/hw/stellaris.c b/hw/stellaris.c
index 5948079..b5fcf1e 100644
--- a/hw/stellaris.c
+++ b/hw/stellaris.c
@@ -980,12 +980,12 @@ static void stellaris_adc_update(stellaris_adc_state *s)
     qemu_set_irq(s->irq, level);
 }
 
-static void stellaris_adc_trigger(void *opaque, int irq, int level)
+static int stellaris_adc_trigger(void *opaque, int irq, int level)
 {
     stellaris_adc_state *s = (stellaris_adc_state *)opaque;
 
     if ((s->actss & 1) == 0) {
-        return;
+        return 1;
     }
 
     /* Some applications use the ADC as a random number source, so introduce
@@ -995,6 +995,8 @@ static void stellaris_adc_trigger(void *opaque, int irq, int level)
     stellaris_adc_fifo_write(s, 0, 0x200 + ((s->noise >> 16) & 7));
     s->ris |= 1;
     stellaris_adc_update(s);
+
+    return 1;
 }
 
 static void stellaris_adc_reset(stellaris_adc_state *s)
@@ -1221,11 +1223,13 @@ typedef struct {
     int current_dev;
 } stellaris_ssi_bus_state;
 
-static void stellaris_ssi_bus_select(void *opaque, int irq, int level)
+static int stellaris_ssi_bus_select(void *opaque, int irq, int level)
 {
     stellaris_ssi_bus_state *s = (stellaris_ssi_bus_state *)opaque;
 
     s->current_dev = level;
+
+    return 1;
 }
 
 static int stellaris_ssi_bus_xfer(void *opaque, int val)
diff --git a/hw/sun4c_intctl.c b/hw/sun4c_intctl.c
index 88cd4a5..462f62c 100644
--- a/hw/sun4c_intctl.c
+++ b/hw/sun4c_intctl.c
@@ -146,7 +146,7 @@ static void sun4c_check_interrupts(void *opaque)
 /*
  * "irq" here is the bit number in the system interrupt register
  */
-static void sun4c_set_irq(void *opaque, int irq, int level)
+static int sun4c_set_irq(void *opaque, int irq, int level)
 {
     Sun4c_INTCTLState *s = opaque;
     uint32_t mask = 1 << irq;
@@ -165,6 +165,8 @@ static void sun4c_set_irq(void *opaque, int irq, int level)
         }
         sun4c_check_interrupts(s);
     }
+
+    return 1;
 }
 
 static void sun4c_intctl_save(QEMUFile *f, void *opaque)
diff --git a/hw/sun4m.c b/hw/sun4m.c
index d1c5b94..945926a 100644
--- a/hw/sun4m.c
+++ b/hw/sun4m.c
@@ -318,7 +318,7 @@ void cpu_check_irqs(CPUState *env)
     }
 }
 
-static void cpu_set_irq(void *opaque, int irq, int level)
+static int cpu_set_irq(void *opaque, int irq, int level)
 {
     CPUState *env = opaque;
 
@@ -332,10 +332,13 @@ static void cpu_set_irq(void *opaque, int irq, int level)
         env->pil_in &= ~(1 << irq);
         cpu_check_irqs(env);
     }
+
+    return 1;
 }
 
-static void dummy_cpu_set_irq(void *opaque, int irq, int level)
+static int dummy_cpu_set_irq(void *opaque, int irq, int level)
 {
+    return 1;
 }
 
 static void *slavio_misc;
diff --git a/hw/tc6393xb.c b/hw/tc6393xb.c
index 6db6dcb..0b10bcb 100644
--- a/hw/tc6393xb.c
+++ b/hw/tc6393xb.c
@@ -78,16 +78,17 @@ qemu_irq *tc6393xb_gpio_in_get(struct tc6393xb_s *s)
     return s->gpio_in;
 }
 
-static void tc6393xb_gpio_set(void *opaque, int line, int level)
+static int tc6393xb_gpio_set(void *opaque, int line, int level)
 {
 //    struct tc6393xb_s *s = opaque;
 
     if (line > TC6393XB_GPIOS) {
         printf("%s: No GPIO pin %i\n", __FUNCTION__, line);
-        return;
+        return 1;
     }
 
     // FIXME: how does the chip reflect the GPIO input level change?
+    return 1;
 }
 
 void tc6393xb_gpio_out_set(struct tc6393xb_s *s, int line,
diff --git a/hw/tusb6010.c b/hw/tusb6010.c
index 6f37779..96a4e73 100644
--- a/hw/tusb6010.c
+++ b/hw/tusb6010.c
@@ -683,7 +683,7 @@ static void tusb_power_tick(void *opaque)
     }
 }
 
-static void tusb_musb_core_intr(void *opaque, int source, int level)
+static int tusb_musb_core_intr(void *opaque, int source, int level)
 {
     struct tusb_s *s = (struct tusb_s *) opaque;
     uint16_t otg_status = s->otg_status;
@@ -730,6 +730,8 @@ static void tusb_musb_core_intr(void *opaque, int source, int level)
         tusb_intr_update(s);
         break;
     }
+
+    return 1;
 }
 
 struct tusb_s *tusb6010_init(qemu_irq intr)
diff --git a/hw/twl92230.c b/hw/twl92230.c
index ad49eac..4573c6f 100644
--- a/hw/twl92230.c
+++ b/hw/twl92230.c
@@ -195,16 +195,18 @@ static inline int from_bcd(uint8_t val)
     return ((val >> 4) * 10) + (val & 0x0f);
 }
 
-static void menelaus_gpio_set(void *opaque, int line, int level)
+static int menelaus_gpio_set(void *opaque, int line, int level)
 {
     struct menelaus_s *s = (struct menelaus_s *) opaque;
 
     /* No interrupt generated */
     s->inputs &= ~(1 << line);
     s->inputs |= level << line;
+
+    return 1;
 }
 
-static void menelaus_pwrbtn_set(void *opaque, int line, int level)
+static int menelaus_pwrbtn_set(void *opaque, int line, int level)
 {
     struct menelaus_s *s = (struct menelaus_s *) opaque;
 
@@ -213,6 +215,8 @@ static void menelaus_pwrbtn_set(void *opaque, int line, int level)
         menelaus_update(s);
     }
     s->pwrbtn_state = level;
+
+    return 1;
 }
 
 #define MENELAUS_REV		0x01
diff --git a/hw/versatilepb.c b/hw/versatilepb.c
index f9e9988..a85afd2 100644
--- a/hw/versatilepb.c
+++ b/hw/versatilepb.c
@@ -49,7 +49,7 @@ static void vpb_sic_update_pic(vpb_sic_state *s)
     }
 }
 
-static void vpb_sic_set_irq(void *opaque, int irq, int level)
+static int vpb_sic_set_irq(void *opaque, int irq, int level)
 {
     vpb_sic_state *s = (vpb_sic_state *)opaque;
     if (level)
@@ -59,6 +59,8 @@ static void vpb_sic_set_irq(void *opaque, int irq, int level)
     if (s->pic_enable & (1u << irq))
         qemu_set_irq(s->parent[irq], level);
     vpb_sic_update(s);
+
+    return 1;
 }
 
 static uint32_t vpb_sic_read(void *opaque, target_phys_addr_t offset)
diff --git a/hw/zaurus.c b/hw/zaurus.c
index c475eaa..0b66859 100644
--- a/hw/zaurus.c
+++ b/hw/zaurus.c
@@ -166,7 +166,7 @@ static CPUWriteMemoryFunc *scoop_writefn[] = {
     scoop_writeb,
 };
 
-void scoop_gpio_set(void *opaque, int line, int level)
+int scoop_gpio_set(void *opaque, int line, int level)
 {
     struct scoop_info_s *s = (struct scoop_info_s *) s;
 
@@ -174,6 +174,8 @@ void scoop_gpio_set(void *opaque, int line, int level)
         s->gpio_level |= (1 << line);
     else
         s->gpio_level &= ~(1 << line);
+
+    return 1;
 }
 
 qemu_irq *scoop_gpio_in_get(struct scoop_info_s *s)

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 2/3] Fix time drift problem under high load when PIT is in use.
  2008-10-29 15:22 [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Gleb Natapov
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 1/3] Change qemu_set_irq() to return status information Gleb Natapov
@ 2008-10-29 15:22 ` Gleb Natapov
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 3/3] Fix time drift problem under high load when RTC " Gleb Natapov
  2008-10-31 19:17 ` [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Anthony Liguori
  3 siblings, 0 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-10-29 15:22 UTC (permalink / raw)
  To: qemu-devel

Count the number of interrupts that was lost due to interrupt coalescing
and re-inject them back when possible. This fixes time drift problem when
pit is used as a time source.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---

 hw/i8254.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/hw/i8254.c b/hw/i8254.c
index 4813b03..1a43ffc 100644
--- a/hw/i8254.c
+++ b/hw/i8254.c
@@ -49,6 +49,7 @@ typedef struct PITChannelState {
     int64_t count_load_time;
     /* irq handling */
     int64_t next_transition_time;
+    uint32_t irq_coalesced;
     QEMUTimer *irq_timer;
     qemu_irq irq;
 } PITChannelState;
@@ -228,6 +229,9 @@ static inline void pit_load_count(PITChannelState *s, int val)
 {
     if (val == 0)
         val = 0x10000;
+    /* recalculate how much IRQs to inject based on new timer value */
+    if(s->irq_coalesced)
+        s->irq_coalesced = (s->irq_coalesced * s->count) / val;
     s->count_load_time = qemu_get_clock(vm_clock);
     s->count = val;
     pit_irq_timer_update(s, s->count_load_time);
@@ -369,12 +373,28 @@ static void pit_irq_timer_update(PITChannelState *s, int64_t current_time)
         return;
     expire_time = pit_get_next_transition_time(s, current_time);
     irq_level = pit_get_out1(s, current_time);
-    qemu_set_irq(s->irq, irq_level);
+    if(irq_level) {
+        if(!qemu_irq_raise(s->irq))
+            s->irq_coalesced++;
+    } else {
+        qemu_irq_lower(s->irq);
+        if(s->irq_coalesced > 0) {
+            if(qemu_irq_raise(s->irq))
+                s->irq_coalesced--;
+            qemu_irq_lower(s->irq);
+        }
+    }
+
 #ifdef DEBUG_PIT
     printf("irq_level=%d next_delay=%f\n",
            irq_level,
            (double)(expire_time - current_time) / ticks_per_sec);
 #endif
+    if(s->irq_coalesced && expire_time != -1) {
+        uint32_t div = ((s->irq_coalesced >> 10) & 0x7f) + 2;
+        expire_time -= ((expire_time - current_time) / div);
+    }
+
     s->next_transition_time = expire_time;
     if (expire_time != -1)
         qemu_mod_timer(s->irq_timer, expire_time);
@@ -459,6 +479,7 @@ static void pit_reset(void *opaque)
         s = &pit->channels[i];
         s->mode = 3;
         s->gate = (i != 2);
+        s->irq_coalesced = 0;
         pit_load_count(s, 0);
     }
 }

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [Qemu-devel] [PATCH 3/3] Fix time drift problem under high load when RTC is in use.
  2008-10-29 15:22 [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Gleb Natapov
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 1/3] Change qemu_set_irq() to return status information Gleb Natapov
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 2/3] Fix time drift problem under high load when PIT is in use Gleb Natapov
@ 2008-10-29 15:22 ` Gleb Natapov
  2008-11-05 12:46   ` Dor Laor
  2008-10-31 19:17 ` [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Anthony Liguori
  3 siblings, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-10-29 15:22 UTC (permalink / raw)
  To: qemu-devel

Count the number of interrupts that was lost due to interrupt coalescing
and re-inject them back when possible. This fixes time drift problem when
RTC is used as a time source.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---

 hw/mc146818rtc.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/hw/mc146818rtc.c b/hw/mc146818rtc.c
index 30bb044..505e117 100644
--- a/hw/mc146818rtc.c
+++ b/hw/mc146818rtc.c
@@ -66,6 +66,7 @@ struct RTCState {
     int64_t next_periodic_time;
     /* second update */
     int64_t next_second_time;
+    uint32_t irq_coalesced;
     QEMUTimer *second_timer;
     QEMUTimer *second_timer2;
 };
@@ -101,7 +102,8 @@ static void rtc_periodic_timer(void *opaque)

     rtc_timer_update(s, s->next_periodic_time);
     s->cmos_data[RTC_REG_C] |= 0xc0;
-    qemu_irq_raise(s->irq);
+    if(!qemu_irq_raise(s->irq))
+        s->irq_coalesced++;
 }

 static void cmos_ioport_write(void *opaque, uint32_t addr, uint32_t data)
@@ -360,7 +362,12 @@ static uint32_t cmos_ioport_read(void *opaque, uint32_t addr)
         case RTC_REG_C:
             ret = s->cmos_data[s->cmos_index];
             qemu_irq_lower(s->irq);
-            s->cmos_data[RTC_REG_C] = 0x00;
+            if(!s->irq_coalesced) {
+                s->cmos_data[RTC_REG_C] = 0x00;
+            } else {
+                if(qemu_irq_raise(s->irq))
+                    s->irq_coalesced--;
+            }
             break;
         default:
             ret = s->cmos_data[s->cmos_index];

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] Fix time drift problem under high load when RTC is in use.
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 3/3] Fix time drift problem under high load when RTC " Gleb Natapov
@ 2008-11-05 12:46   ` Dor Laor
  0 siblings, 0 replies; 49+ messages in thread
From: Dor Laor @ 2008-11-05 12:46 UTC (permalink / raw)
  To: qemu-devel

Gleb Natapov wrote:
> Count the number of interrupts that was lost due to interrupt coalescing
> and re-inject them back when possible. This fixes time drift problem when
> RTC is used as a time source.
>
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>
>  hw/mc146818rtc.c |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/hw/mc146818rtc.c b/hw/mc146818rtc.c
> index 30bb044..505e117 100644
> --- a/hw/mc146818rtc.c
> +++ b/hw/mc146818rtc.c
> @@ -66,6 +66,7 @@ struct RTCState {
>      int64_t next_periodic_time;
>      /* second update */
>      int64_t next_second_time;
> +    uint32_t irq_coalesced;
>      QEMUTimer *second_timer;
>      QEMUTimer *second_timer2;
>  };
> @@ -101,7 +102,8 @@ static void rtc_periodic_timer(void *opaque)
>  
>      rtc_timer_update(s, s->next_periodic_time);
>      s->cmos_data[RTC_REG_C] |= 0xc0;
> -    qemu_irq_raise(s->irq);
> +    if(!qemu_irq_raise(s->irq))
> +        s->irq_coalesced++;
>  }
>  
>  static void cmos_ioport_write(void *opaque, uint32_t addr, uint32_t data)
> @@ -360,7 +362,12 @@ static uint32_t cmos_ioport_read(void *opaque, uint32_t addr)
>          case RTC_REG_C:
>              ret = s->cmos_data[s->cmos_index];
>              qemu_irq_lower(s->irq);
> -            s->cmos_data[RTC_REG_C] = 0x00;
> +            if(!s->irq_coalesced) {
> +                s->cmos_data[RTC_REG_C] = 0x00;
> +            } else {
> +                if(qemu_irq_raise(s->irq))
> +                    s->irq_coalesced--;
> +            }
>              break;
>          default:
>              ret = s->cmos_data[s->cmos_index];
>
>
>   
The rtc needs to recalculate the irq_coalesced counter if the frequency 
was changed, similarly to the pit patch.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-10-29 15:22 [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Gleb Natapov
                   ` (2 preceding siblings ...)
  2008-10-29 15:22 ` [Qemu-devel] [PATCH 3/3] Fix time drift problem under high load when RTC " Gleb Natapov
@ 2008-10-31 19:17 ` Anthony Liguori
  2008-11-02 13:04   ` Gleb Natapov
  3 siblings, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-10-31 19:17 UTC (permalink / raw)
  To: qemu-devel

Gleb Natapov wrote:
> Qemu device emulation for timers might be inaccurate and
> causes coalescing of several IRQs into one. It happens when the
> load on the host is high and the guest did not manage to ack the
> previous IRQ. The problem can be reproduced by copying of a big
> file or many small ones inside Windows guest. When you do that 
> guest clock start to lag behind the host one.
>
> The first patch in the series changes qemu_irq subsystem to return
> IRQ delivery status information. If device is notified that IRQs
> where lost it can regenerate them as needed. The following two
> patches add IRQ regeneration to PIC and RTC devices.
>   

I don't think any of the problems raised when this was initially 
posted.  Further, I don't think that always playing catch-up with 
interrupts is always the best course of action.

As I've said repeatedly in the past, any sort of time drift fixes needs 
to have a lot of data posted with it that is repeatable.

How much does this improve things with Windows?  How does having a high 
resolution timer in the host affect the problem to begin with?  How do 
Linux guests behave with this?

Even the Windows PV spec calls out three separate approaches to dealing 
with missed interrupts and provides an interface for the host to query 
the guest as to which one should be used.  I don't think any solution 
that uses a single technique is going to be correct.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-10-31 19:17 ` [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Anthony Liguori
@ 2008-11-02 13:04   ` Gleb Natapov
  2008-11-05 12:45     ` Dor Laor
                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-02 13:04 UTC (permalink / raw)
  To: qemu-devel

On Fri, Oct 31, 2008 at 02:17:19PM -0500, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> Qemu device emulation for timers might be inaccurate and
>> causes coalescing of several IRQs into one. It happens when the
>> load on the host is high and the guest did not manage to ack the
>> previous IRQ. The problem can be reproduced by copying of a big
>> file or many small ones inside Windows guest. When you do that guest 
>> clock start to lag behind the host one.
>>
>> The first patch in the series changes qemu_irq subsystem to return
>> IRQ delivery status information. If device is notified that IRQs
>> where lost it can regenerate them as needed. The following two
>> patches add IRQ regeneration to PIC and RTC devices.
>>   
>
> I don't think any of the problems raised when this was initially posted.  
So? I raise them now. Have you tried suggested scenario and was able to
reproduce the problem?

> Further, I don't think that always playing catch-up with interrupts is 
> always the best course of action.
>
Agree. Playing catch-up with interrupts is not always the best course of
action. But sometimes there is no other choice.

> As I've said repeatedly in the past, any sort of time drift fixes needs  
> to have a lot of data posted with it that is repeatable.
>
> How much does this improve things with Windows? 
The time drift is eliminated. If there is a spike in a load time may
slow down, but after that it catches up (this happens only during very
high loads though).

>                                                  How does having a high  
> resolution timer in the host affect the problem to begin with?
My test machine has relatively recent kernel that use high resolution
timers for time keeping. Also the problem is that guest does not receive
enough time to process injected interrupt. How hr timer can help here?

>                                                                 How do  
> Linux guests behave with this?
Linux guests don't use pit or RTC for time keeping. They are completely
unaffected by those patches.

>
> Even the Windows PV spec calls out three separate approaches to dealing  
> with missed interrupts and provides an interface for the host to query  
> the guest as to which one should be used.  I don't think any solution  
> that uses a single technique is going to be correct.
>
That is what I found in Microsoft docs:

  If a virtual processor is unavailable for a sufficiently long period of
  time, a full timer period may be missed. In this case, the hypervisor
  uses one of two techniques. The first technique involves timer period
  modulation, in effect shortening the period until the timer “catches
  up”.

  If a significant number of timer signals have been missed, the
  hypervisor may be unable to compensate by using period modulation. In
  this case, some timer expiration signals may be skipped completely.
  For timers that are marked as lazy, the hypervisor uses a second
  technique for dealing with the situation in which a virtual processor is
  unavailable for a long period of time. In this case, the timer signal is
  deferred until this virtual processor is available. If it doesn’t become
  available until shortly before the next timer is due to expire, it is
  skipped entirely. 

The first techniques is what I am trying to introduce with this patch
series.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-02 13:04   ` Gleb Natapov
@ 2008-11-05 12:45     ` Dor Laor
  2008-11-05 15:48       ` andrzej zaborowski
                         ` (2 more replies)
  2008-11-05 16:43     ` Anthony Liguori
  2008-11-06  3:41     ` Jamie Lokier
  2 siblings, 3 replies; 49+ messages in thread
From: Dor Laor @ 2008-11-05 12:45 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4391 bytes --]

Gleb Natapov wrote:
> On Fri, Oct 31, 2008 at 02:17:19PM -0500, Anthony Liguori wrote:
>   
>> Gleb Natapov wrote:
>>     
>>> Qemu device emulation for timers might be inaccurate and
>>> causes coalescing of several IRQs into one. It happens when the
>>> load on the host is high and the guest did not manage to ack the
>>> previous IRQ. The problem can be reproduced by copying of a big
>>> file or many small ones inside Windows guest. When you do that guest 
>>> clock start to lag behind the host one.
>>>
>>> The first patch in the series changes qemu_irq subsystem to return
>>> IRQ delivery status information. If device is notified that IRQs
>>> where lost it can regenerate them as needed. The following two
>>> patches add IRQ regeneration to PIC and RTC devices.
>>>   
>>>       
>> I don't think any of the problems raised when this was initially posted.  
>>     
> So? I raise them now. Have you tried suggested scenario and was able to
> reproduce the problem?
>
>   
It is the same issue, just another scenario.
>> Further, I don't think that always playing catch-up with interrupts is 
>> always the best course of action.
>>
>>     
> Agree. Playing catch-up with interrupts is not always the best course of
> action. But sometimes there is no other choice.
>
>   
>> As I've said repeatedly in the past, any sort of time drift fixes needs  
>> to have a lot of data posted with it that is repeatable.
>>
>> How much does this improve things with Windows? 
>>     
> The time drift is eliminated. If there is a spike in a load time may
> slow down, but after that it catches up (this happens only during very
> high loads though).
>
>   
Gleb, can you please provide more details:
- What's the host's kernel version exactly (including the high-res, dyn 
tick configured)
- What's the windows version? Is it standard HAL (pit) or ACPI (rtc) or 
both?
- The detailed scenario you use (example: I copied the entire c:/windows 
directory, etc)
- Without the patch, what the time drift after x seconds on the host.
- With the patch, is there a drift? Is there increased cpu consumption, etc

Btw: I ack the whole thing, including the problem, the scenario and the 
solution.
The first '1/3' was not received by my mailer.
>>                                                  How does having a high  
>> resolution timer in the host affect the problem to begin with?
>>     
> My test machine has relatively recent kernel that use high resolution
> timers for time keeping. Also the problem is that guest does not receive
> enough time to process injected interrupt. How hr timer can help here?
>
>   
>>                                                                 How do  
>> Linux guests behave with this?
>>     
> Linux guests don't use pit or RTC for time keeping. They are completely
> unaffected by those patches.
>
>   
It will probably also drift with clock=pit in the guest kernel cmdline.
>> Even the Windows PV spec calls out three separate approaches to dealing  
>> with missed interrupts and provides an interface for the host to query  
>> the guest as to which one should be used.  I don't think any solution  
>> that uses a single technique is going to be correct.
>>
>>     
> That is what I found in Microsoft docs:
>
>   If a virtual processor is unavailable for a sufficiently long period of
>   time, a full timer period may be missed. In this case, the hypervisor
>   uses one of two techniques. The first technique involves timer period
>   modulation, in effect shortening the period until the timer “catches
>   up”.
>
>   If a significant number of timer signals have been missed, the
>   hypervisor may be unable to compensate by using period modulation. In
>   this case, some timer expiration signals may be skipped completely.
>   For timers that are marked as lazy, the hypervisor uses a second
>   technique for dealing with the situation in which a virtual processor is
>   unavailable for a long period of time. In this case, the timer signal is
>   deferred until this virtual processor is available. If it doesn’t become
>   available until shortly before the next timer is due to expire, it is
>   skipped entirely. 
>
> The first techniques is what I am trying to introduce with this patch
> series.
>
> --
> 			Gleb.
>
>
>   


[-- Attachment #2: Type: text/html, Size: 5559 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-05 12:45     ` Dor Laor
@ 2008-11-05 15:48       ` andrzej zaborowski
  2008-11-05 16:33         ` Anthony Liguori
  2008-11-06  7:16         ` Gleb Natapov
  2008-11-05 17:43       ` Gleb Natapov
  2008-11-06 17:28       ` David S. Ahern
  2 siblings, 2 replies; 49+ messages in thread
From: andrzej zaborowski @ 2008-11-05 15:48 UTC (permalink / raw)
  To: Gleb Natapov, dlaor, qemu-devel

2008/11/5 Dor Laor <dlaor@redhat.com>:
> Gleb Natapov wrote:
>
> On Fri, Oct 31, 2008 at 02:17:19PM -0500, Anthony Liguori wrote:
>
>
> Gleb Natapov wrote:
>
>
> Qemu device emulation for timers might be inaccurate and
> causes coalescing of several IRQs into one. It happens when the
> load on the host is high and the guest did not manage to ack the
> previous IRQ. The problem can be reproduced by copying of a big
> file or many small ones inside Windows guest. When you do that guest
> clock start to lag behind the host one.
>
> The first patch in the series changes qemu_irq subsystem to return
> IRQ delivery status information. If device is notified that IRQs
> where lost it can regenerate them as needed. The following two
> patches add IRQ regeneration to PIC and RTC devices.
>
>
>
> I don't think any of the problems raised when this was initially posted.
>
>
> So? I raise them now. Have you tried suggested scenario and was able to
> reproduce the problem?
>
>
>
> It is the same issue, just another scenario.
>
> Further, I don't think that always playing catch-up with interrupts is
> always the best course of action.
>
>
>
> Agree. Playing catch-up with interrupts is not always the best course of
> action. But sometimes there is no other choice.
>
>
>
> As I've said repeatedly in the past, any sort of time drift fixes needs
> to have a lot of data posted with it that is repeatable.
>
> How much does this improve things with Windows?
>
>
> The time drift is eliminated. If there is a spike in a load time may
> slow down, but after that it catches up (this happens only during very
> high loads though).
>
>
>
> Gleb, can you please provide more details:
> - What's the host's kernel version exactly (including the high-res, dyn tick
> configured)
> - What's the windows version? Is it standard HAL (pit) or ACPI (rtc) or
> both?
> - The detailed scenario you use (example: I copied the entire c:/windows
> directory, etc)
> - Without the patch, what the time drift after x seconds on the host.
> - With the patch, is there a drift? Is there increased cpu consumption, etc
>
> Btw: I ack the whole thing, including the problem, the scenario and the
> solution.

I don't, as far as I understand it's a -win2k-hack type of addition,
i.e. the hardware doesn't do this but we want to improve usability by
working around a bad guest behaviour.  Modifying qemu_irq abstraction
doesn't sound like the right place for that, qemu_irq contrary to what
the name suggests doesn't have to be connected to any interrupt.

Instead you can have the interrupt sources register a callback in the
PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
the case of mc146818rtc.c wouldn't it be enough to check if the irq
has been acked by reading RTC_REG_C?  e.g.

static void rtc_periodic_timer(void *opaque)
{
    RTCState *s = opaque;

    rtc_timer_update(s, s->next_periodic_time);
+   if (s->cmos_data[RTC_REG_C] & 0xc0)
+         s->irq_coalesced++;
    s->cmos_data[RTC_REG_C] |= 0xc0;
    qemu_irq_raise(s->irq);
}

Cheers

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-05 15:48       ` andrzej zaborowski
@ 2008-11-05 16:33         ` Anthony Liguori
  2008-11-06  7:16         ` Gleb Natapov
  1 sibling, 0 replies; 49+ messages in thread
From: Anthony Liguori @ 2008-11-05 16:33 UTC (permalink / raw)
  To: qemu-devel; +Cc: dlaor, Gleb Natapov

andrzej zaborowski wrote:
> 2008/11/5 Dor Laor <dlaor@redhat.com>:
>   
> I don't, as far as I understand it's a -win2k-hack type of addition,
>   

Yes, I agree.  It is definitely not something we want enabled in all 
circumstances.

But before we get to how to conditionally enable it, I want to know 
exactly what circumstances it helps and by how much.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-05 15:48       ` andrzej zaborowski
  2008-11-05 16:33         ` Anthony Liguori
@ 2008-11-06  7:16         ` Gleb Natapov
  2008-11-06  9:37           ` andrzej zaborowski
  1 sibling, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-06  7:16 UTC (permalink / raw)
  To: andrzej zaborowski; +Cc: dlaor, qemu-devel

On Wed, Nov 05, 2008 at 04:48:32PM +0100, andrzej zaborowski wrote:
> > Btw: I ack the whole thing, including the problem, the scenario and the
> > solution.
> 
> I don't, as far as I understand it's a -win2k-hack type of addition,
> i.e. the hardware doesn't do this but we want to improve usability by
> working around a bad guest behaviour.  Modifying qemu_irq abstraction
> doesn't sound like the right place for that, qemu_irq contrary to what
> the name suggests doesn't have to be connected to any interrupt.
> 
It is nothing like a -win2k-hack since there is no any guest "bad
behaviour" that cause the problem. Yes real hardware doesn't do this,
but real hardware also provides OS with enough CPU power to handle every
single timer interrupt. And even if _some_ interrupts are dropped the
drift is easily fixed with NTP. Try to run Windows XP on very slow machine
and I am sure you'll see very noticeable time drift.

> Instead you can have the interrupt sources register a callback in the
> PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
It requires the mapping from interrupt vector inside the PIC to
interrupt source. This approach was rejected long time ago.

> the case of mc146818rtc.c wouldn't it be enough to check if the irq
> has been acked by reading RTC_REG_C?  e.g.
> 
> static void rtc_periodic_timer(void *opaque)
> {
>     RTCState *s = opaque;
> 
>     rtc_timer_update(s, s->next_periodic_time);
> +   if (s->cmos_data[RTC_REG_C] & 0xc0)
> +         s->irq_coalesced++;
>     s->cmos_data[RTC_REG_C] |= 0xc0;
>     qemu_irq_raise(s->irq);
> }
> 
PIC/APIC in effect has a queue of one interrupt. This means that if
timer tick is still not acknowledged it doesn't mean that interrupt
was not queued for delivery inside a PIC. With your suggestion more
interrupt will be delivered to a guest that was actually issued.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06  7:16         ` Gleb Natapov
@ 2008-11-06  9:37           ` andrzej zaborowski
  2008-11-06 10:08             ` Gleb Natapov
  0 siblings, 1 reply; 49+ messages in thread
From: andrzej zaborowski @ 2008-11-06  9:37 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: dlaor, qemu-devel

2008/11/6 Gleb Natapov <gleb@redhat.com>:
> On Wed, Nov 05, 2008 at 04:48:32PM +0100, andrzej zaborowski wrote:
>> > Btw: I ack the whole thing, including the problem, the scenario and the
>> > solution.
>>
>> I don't, as far as I understand it's a -win2k-hack type of addition,
>> i.e. the hardware doesn't do this but we want to improve usability by
>> working around a bad guest behaviour.  Modifying qemu_irq abstraction
>> doesn't sound like the right place for that, qemu_irq contrary to what
>> the name suggests doesn't have to be connected to any interrupt.
>>
> It is nothing like a -win2k-hack since there is no any guest "bad
> behaviour" that cause the problem. Yes real hardware doesn't do this,
> but real hardware also provides OS with enough CPU power to handle every
> single timer interrupt.

A guest that counts on having enough CPU for something is
timing-depenent (buggy).

> And even if _some_ interrupts are dropped the
> drift is easily fixed with NTP. Try to run Windows XP on very slow machine
> and I am sure you'll see very noticeable time drift.

Exactly.  You'll find the drift on real hardware, so you should find
it in the emulator too.  You're trying to hack around it.

Linux doesn't see this because the clocksource and the
clockevents-device come from separate clks there.  It is a windows'
problem.  It *is* "bad behaviour".

>
>> Instead you can have the interrupt sources register a callback in the
>> PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
> It requires the mapping from interrupt vector inside the PIC to
> interrupt source.

Of course.

> This approach was rejected long time ago.

Then you'll have to find a different one.  qemu_irq is the wrong place.

>
>> the case of mc146818rtc.c wouldn't it be enough to check if the irq
>> has been acked by reading RTC_REG_C?  e.g.
>>
>> static void rtc_periodic_timer(void *opaque)
>> {
>>     RTCState *s = opaque;
>>
>>     rtc_timer_update(s, s->next_periodic_time);
>> +   if (s->cmos_data[RTC_REG_C] & 0xc0)
>> +         s->irq_coalesced++;
>>     s->cmos_data[RTC_REG_C] |= 0xc0;
>>     qemu_irq_raise(s->irq);
>> }
>>
> PIC/APIC in effect has a queue of one interrupt. This means that if
> timer tick is still not acknowledged it doesn't mean that interrupt
> was not queued for delivery inside a PIC.

This doesn't matter, the tick that arrived while a previous interrupt
was not acked yet, is lost anyway, i.e. had been coalesced.  So
this'll give you the right number of interrupts to re-inject.

Ofcourse this, as well as your approach are both wrong because the
guest may be intentionally ignoring the irq and expecting the
interrupts to coalesce.  Once it starts processing the RTC interrupts
it will get an unexpected storm.

Cheers

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06  9:37           ` andrzej zaborowski
@ 2008-11-06 10:08             ` Gleb Natapov
  2008-11-06 13:21               ` andrzej zaborowski
  2008-11-06 13:44               ` Paul Brook
  0 siblings, 2 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-06 10:08 UTC (permalink / raw)
  To: andrzej zaborowski; +Cc: dlaor, qemu-devel

On Thu, Nov 06, 2008 at 10:37:43AM +0100, andrzej zaborowski wrote:
> 2008/11/6 Gleb Natapov <gleb@redhat.com>:
> > On Wed, Nov 05, 2008 at 04:48:32PM +0100, andrzej zaborowski wrote:
> >> > Btw: I ack the whole thing, including the problem, the scenario and the
> >> > solution.
> >>
> >> I don't, as far as I understand it's a -win2k-hack type of addition,
> >> i.e. the hardware doesn't do this but we want to improve usability by
> >> working around a bad guest behaviour.  Modifying qemu_irq abstraction
> >> doesn't sound like the right place for that, qemu_irq contrary to what
> >> the name suggests doesn't have to be connected to any interrupt.
> >>
> > It is nothing like a -win2k-hack since there is no any guest "bad
> > behaviour" that cause the problem. Yes real hardware doesn't do this,
> > but real hardware also provides OS with enough CPU power to handle every
> > single timer interrupt.
> 
> A guest that counts on having enough CPU for something is
> timing-depenent (buggy).
> 
Tell this to RT developers who count each CPU cycle.

> > And even if _some_ interrupts are dropped the
> > drift is easily fixed with NTP. Try to run Windows XP on very slow machine
> > and I am sure you'll see very noticeable time drift.
> 
> Exactly.  You'll find the drift on real hardware, so you should find
> it in the emulator too.  You're trying to hack around it.
>
If I'll try to run windows XP on 486 then yes, I'll see the time drift.
After analyzing the problem I, most certainly, will decide that HW is
too old will buy modern CPU and will solve the time drift problem. What do
you propose for QEMU users? To use real HW?

> Linux doesn't see this because the clocksource and the
> clockevents-device come from separate clks there.  It is a windows'
> problem.  It *is* "bad behaviour".
OK we will call the flag -win-time-drift-hack.

> 
> >
> >> Instead you can have the interrupt sources register a callback in the
> >> PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
> > It requires the mapping from interrupt vector inside the PIC to
> > interrupt source.
> 
> Of course.
> 
> > This approach was rejected long time ago.
> 
> Then you'll have to find a different one.
> 

I found one. Here it is, implemented by this patch series.

> qemu_irq is the wrong place.
Why? Don't give me "that is not how real HW works". Real HW, if properly
configured, will behave more or less deterministically. I.e if timer
interrupt is configured to generate highest priority interrupt vector
and IRQ handler is fast enough (can be calculated knowing CPU freq) the
chances of loosing interrupt will be minimal. And those few that are
lost due to SMM or NMI can be compensated by NTP.

> >
> >> the case of mc146818rtc.c wouldn't it be enough to check if the irq
> >> has been acked by reading RTC_REG_C?  e.g.
> >>
> >> static void rtc_periodic_timer(void *opaque)
> >> {
> >>     RTCState *s = opaque;
> >>
> >>     rtc_timer_update(s, s->next_periodic_time);
> >> +   if (s->cmos_data[RTC_REG_C] & 0xc0)
> >> +         s->irq_coalesced++;
> >>     s->cmos_data[RTC_REG_C] |= 0xc0;
> >>     qemu_irq_raise(s->irq);
> >> }
> >>
> > PIC/APIC in effect has a queue of one interrupt. This means that if
> > timer tick is still not acknowledged it doesn't mean that interrupt
> > was not queued for delivery inside a PIC.
> 
> This doesn't matter, the tick that arrived while a previous interrupt
> was not acked yet, is lost anyway,
Not it is not. Not necessary. It can be queued inside PIC and delivered
by PIC itself immediately after interrupt acknowledgement.

>                                    i.e. had been coalesced.  So
> this'll give you the right number of interrupts to re-inject.
No it will not. You'll inject more interrupt then needed and clock will
drift forwards.

> 
> Ofcourse this, as well as your approach are both wrong because the
> guest may be intentionally ignoring the irq and expecting the
> interrupts to coalesce.  Once it starts processing the RTC interrupts
> it will get an unexpected storm.
> 
This is one more thing which is broken in your suggestion, not mine. If a
guest wants to ignore interrupt it will mask it and my patch don't report
interrupts delivered while masked as been coalesced.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 10:08             ` Gleb Natapov
@ 2008-11-06 13:21               ` andrzej zaborowski
  2008-11-06 14:18                 ` Gleb Natapov
  2008-11-06 13:44               ` Paul Brook
  1 sibling, 1 reply; 49+ messages in thread
From: andrzej zaborowski @ 2008-11-06 13:21 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: dlaor, qemu-devel

2008/11/6 Gleb Natapov <gleb@redhat.com>:
> On Thu, Nov 06, 2008 at 10:37:43AM +0100, andrzej zaborowski wrote:
>> 2008/11/6 Gleb Natapov <gleb@redhat.com>:
>> > On Wed, Nov 05, 2008 at 04:48:32PM +0100, andrzej zaborowski wrote:
>> >> > Btw: I ack the whole thing, including the problem, the scenario and the
>> >> > solution.
>> >>
>> >> I don't, as far as I understand it's a -win2k-hack type of addition,
>> >> i.e. the hardware doesn't do this but we want to improve usability by
>> >> working around a bad guest behaviour.  Modifying qemu_irq abstraction
>> >> doesn't sound like the right place for that, qemu_irq contrary to what
>> >> the name suggests doesn't have to be connected to any interrupt.
>> >>
>> > It is nothing like a -win2k-hack since there is no any guest "bad
>> > behaviour" that cause the problem. Yes real hardware doesn't do this,
>> > but real hardware also provides OS with enough CPU power to handle every
>> > single timer interrupt.
>>
>> A guest that counts on having enough CPU for something is
>> timing-depenent (buggy).
>>
> Tell this to RT developers who count each CPU cycle.

They don't usually use qemu (they certainly shouldn't).

>
>> > And even if _some_ interrupts are dropped the
>> > drift is easily fixed with NTP. Try to run Windows XP on very slow machine
>> > and I am sure you'll see very noticeable time drift.
>>
>> Exactly.  You'll find the drift on real hardware, so you should find
>> it in the emulator too.  You're trying to hack around it.
>>
> If I'll try to run windows XP on 486 then yes, I'll see the time drift.
> After analyzing the problem I, most certainly, will decide that HW is
> too old will buy modern CPU and will solve the time drift problem. What do
> you propose for QEMU users? To use real HW?
>
>> Linux doesn't see this because the clocksource and the
>> clockevents-device come from separate clks there.  It is a windows'
>> problem.  It *is* "bad behaviour".
> OK we will call the flag -win-time-drift-hack.

I'm not saying it needs a flag, but that's it's a hack so it should stay local.

>
>>
>> >
>> >> Instead you can have the interrupt sources register a callback in the
>> >> PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
>> > It requires the mapping from interrupt vector inside the PIC to
>> > interrupt source.
>>
>> Of course.
>>
>> > This approach was rejected long time ago.
>>
>> Then you'll have to find a different one.
>>
>
> I found one. Here it is, implemented by this patch series.
>
>> qemu_irq is the wrong place.
> Why? Don't give me "that is not how real HW works". Real HW, if properly

I explained in a previous mail, but let me reiterate: qemu_irq
abstracts a pin on whose one end a device can set a voltage level and
the other end read it.  That's it - there's communication in one way
only.  If you want to send a notification the other direction use a
second qemu_irq or a callback.  It's a quite simple change in your
PIC.

> configured, will behave more or less deterministically. I.e if timer
> interrupt is configured to generate highest priority interrupt vector
> and IRQ handler is fast enough (can be calculated knowing CPU freq) the
> chances of loosing interrupt will be minimal. And those few that are
> lost due to SMM or NMI can be compensated by NTP.
>
>> >
>> >> the case of mc146818rtc.c wouldn't it be enough to check if the irq
>> >> has been acked by reading RTC_REG_C?  e.g.
>> >>
>> >> static void rtc_periodic_timer(void *opaque)
>> >> {
>> >>     RTCState *s = opaque;
>> >>
>> >>     rtc_timer_update(s, s->next_periodic_time);
>> >> +   if (s->cmos_data[RTC_REG_C] & 0xc0)
>> >> +         s->irq_coalesced++;
>> >>     s->cmos_data[RTC_REG_C] |= 0xc0;
>> >>     qemu_irq_raise(s->irq);
>> >> }
>> >>
>> > PIC/APIC in effect has a queue of one interrupt. This means that if
>> > timer tick is still not acknowledged it doesn't mean that interrupt
>> > was not queued for delivery inside a PIC.
>>
>> This doesn't matter, the tick that arrived while a previous interrupt
>> was not acked yet, is lost anyway,
> Not it is not. Not necessary. It can be queued inside PIC and delivered
> by PIC itself immediately after interrupt acknowledgement.

 You can argue that it's the new irq that's lost or it's the first one
that was lost, either way the PIC only sees one time the irq rising,
instead of two.  That means they were coalesced.

>
>>                                    i.e. had been coalesced.  So
>> this'll give you the right number of interrupts to re-inject.
> No it will not. You'll inject more interrupt then needed and clock will
> drift forwards.

The PIC won't see more interrupts (rising edges) than times the timer
had ticked to given moment.  Thus the guest OS won't see them either.

>
>>
>> Ofcourse this, as well as your approach are both wrong because the
>> guest may be intentionally ignoring the irq and expecting the
>> interrupts to coalesce.  Once it starts processing the RTC interrupts
>> it will get an unexpected storm.
>>
> This is one more thing which is broken in your suggestion, not mine. If a
> guest wants to ignore interrupt it will mask it and my patch don't report
> interrupts delivered while masked as been coalesced.

This is moot, it may choose to mask it or not, it can also mask it
lower down the path.

Cheers

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 13:21               ` andrzej zaborowski
@ 2008-11-06 14:18                 ` Gleb Natapov
  2008-11-06 14:35                   ` andrzej zaborowski
  0 siblings, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-06 14:18 UTC (permalink / raw)
  To: andrzej zaborowski; +Cc: dlaor, qemu-devel

On Thu, Nov 06, 2008 at 02:21:16PM +0100, andrzej zaborowski wrote:
> >> > It is nothing like a -win2k-hack since there is no any guest "bad
> >> > behaviour" that cause the problem. Yes real hardware doesn't do this,
> >> > but real hardware also provides OS with enough CPU power to handle every
> >> > single timer interrupt.
> >>
> >> A guest that counts on having enough CPU for something is
> >> timing-depenent (buggy).
> >>
> > Tell this to RT developers who count each CPU cycle.
> 
> They don't usually use qemu (they certainly shouldn't).
> 
It seems to me that you were saying that counting on CPU availability
is buggy in general and since Windows doing this it is buggy. And what I
am saying is that for certain thing it is valid thing to do. RT people
shouldn't use qemu in production of cause, should people use qemu to run
Windows in production?

> >
> >> > And even if _some_ interrupts are dropped the
> >> > drift is easily fixed with NTP. Try to run Windows XP on very slow machine
> >> > and I am sure you'll see very noticeable time drift.
> >>
> >> Exactly.  You'll find the drift on real hardware, so you should find
> >> it in the emulator too.  You're trying to hack around it.
> >>
> > If I'll try to run windows XP on 486 then yes, I'll see the time drift.
> > After analyzing the problem I, most certainly, will decide that HW is
> > too old will buy modern CPU and will solve the time drift problem. What do
> > you propose for QEMU users? To use real HW?
> >
You ignored this question, but it was not rhetorical at all. Can you
answer it please?

> >> Linux doesn't see this because the clocksource and the
> >> clockevents-device come from separate clks there.  It is a windows'
> >> problem.  It *is* "bad behaviour".
> > OK we will call the flag -win-time-drift-hack.
> 
> I'm not saying it needs a flag, but that's it's a hack so it should stay local.
>
The fix is already local. It is local to PIT and RTC. And to make it
local I change an infrastructure in the first patch.

> >
> >>
> >> >
> >> >> Instead you can have the interrupt sources register a callback in the
> >> >> PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
> >> > It requires the mapping from interrupt vector inside the PIC to
> >> > interrupt source.
> >>
> >> Of course.
> >>
> >> > This approach was rejected long time ago.
> >>
> >> Then you'll have to find a different one.
> >>
> >
> > I found one. Here it is, implemented by this patch series.
> >
> >> qemu_irq is the wrong place.
> > Why? Don't give me "that is not how real HW works". Real HW, if properly
> 
> I explained in a previous mail, but let me reiterate: qemu_irq
> abstracts a pin on whose one end a device can set a voltage level and
> the other end read it.  That's it - there's communication in one way
> only.  If you want to send a notification the other direction use a
> second qemu_irq or a callback.  It's a quite simple change in your
> PIC.
> 
And why is this abstract is set in stone? Why can't we extend it to be:
qemu_irq abstracts a pin on whose one end a device can set a voltage
level and the other end read it. If the receiving end "excepts" a signal
(for whatever definition of "except" appropriate for a receiving device)
the function returns positive number, if receiving device loose
information about the signal it returns zero.

> > configured, will behave more or less deterministically. I.e if timer
> > interrupt is configured to generate highest priority interrupt vector
> > and IRQ handler is fast enough (can be calculated knowing CPU freq) the
> > chances of loosing interrupt will be minimal. And those few that are
> > lost due to SMM or NMI can be compensated by NTP.
> >
> >> >
> >> >> the case of mc146818rtc.c wouldn't it be enough to check if the irq
> >> >> has been acked by reading RTC_REG_C?  e.g.
> >> >>
> >> >> static void rtc_periodic_timer(void *opaque)
> >> >> {
> >> >>     RTCState *s = opaque;
> >> >>
> >> >>     rtc_timer_update(s, s->next_periodic_time);
> >> >> +   if (s->cmos_data[RTC_REG_C] & 0xc0)
> >> >> +         s->irq_coalesced++;
> >> >>     s->cmos_data[RTC_REG_C] |= 0xc0;
> >> >>     qemu_irq_raise(s->irq);
> >> >> }
> >> >>
> >> > PIC/APIC in effect has a queue of one interrupt. This means that if
> >> > timer tick is still not acknowledged it doesn't mean that interrupt
> >> > was not queued for delivery inside a PIC.
> >>
> >> This doesn't matter, the tick that arrived while a previous interrupt
> >> was not acked yet, is lost anyway,
> > Not it is not. Not necessary. It can be queued inside PIC and delivered
> > by PIC itself immediately after interrupt acknowledgement.
> 
>  You can argue that it's the new irq that's lost or it's the first one
> that was lost, either way the PIC only sees one time the irq rising,
> instead of two.  That means they were coalesced.
Nothing is lost and PIC sees two irq rising. Example:
- RTC triggers first interrupt.
- It is delivered to PIC. PIC sets corespondent bit in IRR.
- CPU picks up RTC interrupt and it's bit is cleared from IRR bitmap.
- CPU jumps to RTC IRQ routing but before it gets a chance to acknowledge
  IRQ to PIC new timer is triggered.
- With your patch you increment irq_coalesced in that case.
- Interrupt is delivered to PIC. PIC sets corespondent bit in IRR
  but don't deliver the interrupt to the CPU since one is already 
  in progress.
- CPU acks first IRQ.
- PIC delivers second IRQ from IRR.
  
The result is that two interrupts are delivered to CPU and
irq_coalesced == 1 so one more will be needlessly signaled.

> >>                                    i.e. had been coalesced.  So
> >> this'll give you the right number of interrupts to re-inject.
> > No it will not. You'll inject more interrupt then needed and clock will
> > drift forwards.
> 
> The PIC won't see more interrupts (rising edges) than times the timer
> had ticked to given moment.  Thus the guest OS won't see them either.
> 
See above.

> >
> >>
> >> Ofcourse this, as well as your approach are both wrong because the
> >> guest may be intentionally ignoring the irq and expecting the
> >> interrupts to coalesce.  Once it starts processing the RTC interrupts
> >> it will get an unexpected storm.
> >>
> > This is one more thing which is broken in your suggestion, not mine. If a
> > guest wants to ignore interrupt it will mask it and my patch don't report
> > interrupts delivered while masked as been coalesced.
> 
> This is moot, it may choose to mask it or not, it can also mask it
> lower down the path.
> 
Are we talking about real OSes that we want to run as guests or we give
purely theoretical examples? Can you provide an examples of an OS that
doesn't mask interrupt or disable interrupt source in case it doesn't
want to dial with interrupts from a device?

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 14:18                 ` Gleb Natapov
@ 2008-11-06 14:35                   ` andrzej zaborowski
  2008-11-06 15:04                     ` Gleb Natapov
  0 siblings, 1 reply; 49+ messages in thread
From: andrzej zaborowski @ 2008-11-06 14:35 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: dlaor, qemu-devel

2008/11/6 Gleb Natapov <gleb@redhat.com>:
> On Thu, Nov 06, 2008 at 02:21:16PM +0100, andrzej zaborowski wrote:
>> >> > It is nothing like a -win2k-hack since there is no any guest "bad
>> >> > behaviour" that cause the problem. Yes real hardware doesn't do this,
>> >> > but real hardware also provides OS with enough CPU power to handle every
>> >> > single timer interrupt.
>> >>
>> >> A guest that counts on having enough CPU for something is
>> >> timing-depenent (buggy).
>> >>
>> > Tell this to RT developers who count each CPU cycle.
>>
>> They don't usually use qemu (they certainly shouldn't).
>>
> It seems to me that you were saying that counting on CPU availability
> is buggy in general and since Windows doing this it is buggy. And what I
> am saying is that for certain thing it is valid thing to do. RT people
> shouldn't use qemu in production of cause, should people use qemu to run
> Windows in production?

People shouldn't use buggy software in general.. if they do, they have
to accept its bugs ;)
Or they can have a nasty, but local, workaround.

>
>> >
>> >> > And even if _some_ interrupts are dropped the
>> >> > drift is easily fixed with NTP. Try to run Windows XP on very slow machine
>> >> > and I am sure you'll see very noticeable time drift.
>> >>
>> >> Exactly.  You'll find the drift on real hardware, so you should find
>> >> it in the emulator too.  You're trying to hack around it.
>> >>
>> > If I'll try to run windows XP on 486 then yes, I'll see the time drift.
>> > After analyzing the problem I, most certainly, will decide that HW is
>> > too old will buy modern CPU and will solve the time drift problem. What do
>> > you propose for QEMU users? To use real HW?

No, I didn't say we should not have a workaround for this specific
case.  I propose we use it.

That said, even on a modern cpu you have to assume from the start that
an emulator will give you about the speed of an 486.

>> >
> You ignored this question, but it was not rhetorical at all. Can you
> answer it please?
>
>> >> Linux doesn't see this because the clocksource and the
>> >> clockevents-device come from separate clks there.  It is a windows'
>> >> problem.  It *is* "bad behaviour".
>> > OK we will call the flag -win-time-drift-hack.
>>
>> I'm not saying it needs a flag, but that's it's a hack so it should stay local.
>>
> The fix is already local. It is local to PIT and RTC. And to make it
> local I change an infrastructure in the first patch.

Unnecessarily and invasively.  How's that different from an invasive ugly hack..

>
>> >
>> >>
>> >> >
>> >> >> Instead you can have the interrupt sources register a callback in the
>> >> >> PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
>> >> > It requires the mapping from interrupt vector inside the PIC to
>> >> > interrupt source.
>> >>
>> >> Of course.
>> >>
>> >> > This approach was rejected long time ago.
>> >>
>> >> Then you'll have to find a different one.
>> >>
>> >
>> > I found one. Here it is, implemented by this patch series.
>> >
>> >> qemu_irq is the wrong place.
>> > Why? Don't give me "that is not how real HW works". Real HW, if properly
>>
>> I explained in a previous mail, but let me reiterate: qemu_irq
>> abstracts a pin on whose one end a device can set a voltage level and
>> the other end read it.  That's it - there's communication in one way
>> only.  If you want to send a notification the other direction use a
>> second qemu_irq or a callback.  It's a quite simple change in your
>> PIC.
>>
> And why is this abstract is set in stone? Why can't we extend it to be:

Because we don't need to and because it's so backwards..

> qemu_irq abstracts a pin on whose one end a device can set a voltage
> level and the other end read it. If the receiving end "excepts" a signal
> (for whatever definition of "except" appropriate for a receiving device)
> the function returns positive number, if receiving device loose
> information about the signal it returns zero.
>
>> > configured, will behave more or less deterministically. I.e if timer
>> > interrupt is configured to generate highest priority interrupt vector
>> > and IRQ handler is fast enough (can be calculated knowing CPU freq) the
>> > chances of loosing interrupt will be minimal. And those few that are
>> > lost due to SMM or NMI can be compensated by NTP.
>> >
>> >> >
>> >> >> the case of mc146818rtc.c wouldn't it be enough to check if the irq
>> >> >> has been acked by reading RTC_REG_C?  e.g.
>> >> >>
>> >> >> static void rtc_periodic_timer(void *opaque)
>> >> >> {
>> >> >>     RTCState *s = opaque;
>> >> >>
>> >> >>     rtc_timer_update(s, s->next_periodic_time);
>> >> >> +   if (s->cmos_data[RTC_REG_C] & 0xc0)
>> >> >> +         s->irq_coalesced++;
>> >> >>     s->cmos_data[RTC_REG_C] |= 0xc0;
>> >> >>     qemu_irq_raise(s->irq);
>> >> >> }
>> >> >>
>> >> > PIC/APIC in effect has a queue of one interrupt. This means that if
>> >> > timer tick is still not acknowledged it doesn't mean that interrupt
>> >> > was not queued for delivery inside a PIC.
>> >>
>> >> This doesn't matter, the tick that arrived while a previous interrupt
>> >> was not acked yet, is lost anyway,
>> > Not it is not. Not necessary. It can be queued inside PIC and delivered
>> > by PIC itself immediately after interrupt acknowledgement.
>>
>>  You can argue that it's the new irq that's lost or it's the first one
>> that was lost, either way the PIC only sees one time the irq rising,
>> instead of two.  That means they were coalesced.
> Nothing is lost and PIC sees two irq rising. Example:
> - RTC triggers first interrupt.
> - It is delivered to PIC. PIC sets corespondent bit in IRR.
> - CPU picks up RTC interrupt and it's bit is cleared from IRR bitmap.
> - CPU jumps to RTC IRQ routing but before it gets a chance to acknowledge
>  IRQ to PIC new timer is triggered.
> - With your patch you increment irq_coalesced in that case.
> - Interrupt is delivered to PIC.

No, it isn't (unless the PIC is poorly implemented).  We raise the
irq, but since it's already high, nothing happens, there's no rising
edge.

> PIC sets corespondent bit in IRR
>  but don't deliver the interrupt to the CPU since one is already
>  in progress.
> - CPU acks first IRQ.

Only here it's lowered.

> - PIC delivers second IRQ from IRR.
>
> The result is that two interrupts are delivered to CPU and
> irq_coalesced == 1 so one more will be needlessly signaled.
>
>> >>                                    i.e. had been coalesced.  So
>> >> this'll give you the right number of interrupts to re-inject.
>> > No it will not. You'll inject more interrupt then needed and clock will
>> > drift forwards.
>>
>> The PIC won't see more interrupts (rising edges) than times the timer
>> had ticked to given moment.  Thus the guest OS won't see them either.
>>
> See above.
>
>> >
>> >>
>> >> Ofcourse this, as well as your approach are both wrong because the
>> >> guest may be intentionally ignoring the irq and expecting the
>> >> interrupts to coalesce.  Once it starts processing the RTC interrupts
>> >> it will get an unexpected storm.
>> >>
>> > This is one more thing which is broken in your suggestion, not mine. If a
>> > guest wants to ignore interrupt it will mask it and my patch don't report
>> > interrupts delivered while masked as been coalesced.
>>
>> This is moot, it may choose to mask it or not, it can also mask it
>> lower down the path.
>>
> Are we talking about real OSes that we want to run as guests or we give
> purely theoretical examples?

We're talking about emulating the hw.  We have to make compromises
sometimes, but we try to avoid them.

Cheers,
Andrzej

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 14:35                   ` andrzej zaborowski
@ 2008-11-06 15:04                     ` Gleb Natapov
  2008-11-06 15:41                       ` Anthony Liguori
  2008-11-07 23:18                       ` andrzej zaborowski
  0 siblings, 2 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-06 15:04 UTC (permalink / raw)
  To: andrzej zaborowski; +Cc: dlaor, qemu-devel

On Thu, Nov 06, 2008 at 03:35:16PM +0100, andrzej zaborowski wrote:
> 2008/11/6 Gleb Natapov <gleb@redhat.com>:
> > On Thu, Nov 06, 2008 at 02:21:16PM +0100, andrzej zaborowski wrote:
> >> >> > It is nothing like a -win2k-hack since there is no any guest "bad
> >> >> > behaviour" that cause the problem. Yes real hardware doesn't do this,
> >> >> > but real hardware also provides OS with enough CPU power to handle every
> >> >> > single timer interrupt.
> >> >>
> >> >> A guest that counts on having enough CPU for something is
> >> >> timing-depenent (buggy).
> >> >>
> >> > Tell this to RT developers who count each CPU cycle.
> >>
> >> They don't usually use qemu (they certainly shouldn't).
> >>
> > It seems to me that you were saying that counting on CPU availability
> > is buggy in general and since Windows doing this it is buggy. And what I
> > am saying is that for certain thing it is valid thing to do. RT people
> > shouldn't use qemu in production of cause, should people use qemu to run
> > Windows in production?
> 
> People shouldn't use buggy software in general.. if they do, they have
> to accept its bugs ;)
They don't want to (accept bugs I mean) :( Otherwise we agree on this point.

> 
> >> >
> > You ignored this question, but it was not rhetorical at all. Can you
> > answer it please?
> >
> >> >> Linux doesn't see this because the clocksource and the
> >> >> clockevents-device come from separate clks there.  It is a windows'
> >> >> problem.  It *is* "bad behaviour".
> >> > OK we will call the flag -win-time-drift-hack.
> >>
> >> I'm not saying it needs a flag, but that's it's a hack so it should stay local.
> >>
> > The fix is already local. It is local to PIT and RTC. And to make it
> > local I change an infrastructure in the first patch.
> 
> Unnecessarily and invasively.  How's that different from an invasive ugly hack..
> 
If interface was defined in such a way at the beginning you wouldn't
even noticed that it can be used to implement localized hacks ;)

> >
> >> >
> >> >>
> >> >> >
> >> >> >> Instead you can have the interrupt sources register a callback in the
> >> >> >> PIC that the PIC calls when the interrupt wasn't delivered.  Or.. in
> >> >> > It requires the mapping from interrupt vector inside the PIC to
> >> >> > interrupt source.
> >> >>
> >> >> Of course.
> >> >>
> >> >> > This approach was rejected long time ago.
> >> >>
> >> >> Then you'll have to find a different one.
> >> >>
> >> >
> >> > I found one. Here it is, implemented by this patch series.
> >> >
> >> >> qemu_irq is the wrong place.
> >> > Why? Don't give me "that is not how real HW works". Real HW, if properly
> >>
> >> I explained in a previous mail, but let me reiterate: qemu_irq
> >> abstracts a pin on whose one end a device can set a voltage level and
> >> the other end read it.  That's it - there's communication in one way
> >> only.  If you want to send a notification the other direction use a
> >> second qemu_irq or a callback.  It's a quite simple change in your
> >> PIC.
> >>
> > And why is this abstract is set in stone? Why can't we extend it to be:
> 
> Because we don't need to and because it's so backwards..
That is debatable...

> >> >> This doesn't matter, the tick that arrived while a previous interrupt
> >> >> was not acked yet, is lost anyway,
> >> > Not it is not. Not necessary. It can be queued inside PIC and delivered
> >> > by PIC itself immediately after interrupt acknowledgement.
> >>
> >>  You can argue that it's the new irq that's lost or it's the first one
> >> that was lost, either way the PIC only sees one time the irq rising,
> >> instead of two.  That means they were coalesced.
> > Nothing is lost and PIC sees two irq rising. Example:
> > - RTC triggers first interrupt.
> > - It is delivered to PIC. PIC sets corespondent bit in IRR.
> > - CPU picks up RTC interrupt and it's bit is cleared from IRR bitmap.
> > - CPU jumps to RTC IRQ routing but before it gets a chance to acknowledge
> >  IRQ to PIC new timer is triggered.
> > - With your patch you increment irq_coalesced in that case.
> > - Interrupt is delivered to PIC.
> 
> No, it isn't (unless the PIC is poorly implemented).  We raise the
> irq, but since it's already high, nothing happens, there's no rising
> edge.
> 
That would be the case if RTC used level triggered interrupts, but
RTC and PIT are edge-trigered. That is how they behave like it or not.

> > PIC sets corespondent bit in IRR
> >  but don't deliver the interrupt to the CPU since one is already
> >  in progress.
> > - CPU acks first IRQ.
> 
> Only here it's lowered.
> 
Nope.

> > - PIC delivers second IRQ from IRR.
> >
> > The result is that two interrupts are delivered to CPU and
> > irq_coalesced == 1 so one more will be needlessly signaled.
> >
> >> >>                                    i.e. had been coalesced.  So
> >> >> this'll give you the right number of interrupts to re-inject.
> >> > No it will not. You'll inject more interrupt then needed and clock will
> >> > drift forwards.
> >>
> >> The PIC won't see more interrupts (rising edges) than times the timer
> >> had ticked to given moment.  Thus the guest OS won't see them either.
> >>
> > See above.
> >
> >> >
> >> >>
> >> >> Ofcourse this, as well as your approach are both wrong because the
> >> >> guest may be intentionally ignoring the irq and expecting the
> >> >> interrupts to coalesce.  Once it starts processing the RTC interrupts
> >> >> it will get an unexpected storm.
> >> >>
> >> > This is one more thing which is broken in your suggestion, not mine. If a
> >> > guest wants to ignore interrupt it will mask it and my patch don't report
> >> > interrupts delivered while masked as been coalesced.
> >>
> >> This is moot, it may choose to mask it or not, it can also mask it
> >> lower down the path.
> >>
> > Are we talking about real OSes that we want to run as guests or we give
> > purely theoretical examples?
> 
> We're talking about emulating the hw.  We have to make compromises
> sometimes, but we try to avoid them.
> 
Fair enough. Can we compromise in this case :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 15:04                     ` Gleb Natapov
@ 2008-11-06 15:41                       ` Anthony Liguori
  2008-11-07 23:18                       ` andrzej zaborowski
  1 sibling, 0 replies; 49+ messages in thread
From: Anthony Liguori @ 2008-11-06 15:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: dlaor

Gleb Natapov wrote:
> On Thu, Nov 06, 2008 at 03:35:16PM +0100, andrzej zaborowski wrote:
>   
>>
>> We're talking about emulating the hw.  We have to make compromises
>> sometimes, but we try to avoid them.
>>
>>     
> Fair enough. Can we compromise in this case :)
>   

We have work ahead of us to determine how much of a problem this really 
is.  My suspicion is that we're going to discover that it's pretty 
difficult to produce noticable drift when using hosts with high 
resolution timers.

I think we'll also find that it's relatively easy to produce drift on 
older hosts.  If that's the case, I don't think anything that's very 
invasive is going to be acceptable because of diminishing returns.

I think that a well isolated, optional work around that's enabled only 
on hosts that lack high resolution timers will probably be the solution.

Regards,

Anthony Liguori

> --
> 			Gleb.
>
>
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 15:04                     ` Gleb Natapov
  2008-11-06 15:41                       ` Anthony Liguori
@ 2008-11-07 23:18                       ` andrzej zaborowski
  2008-11-08  8:23                         ` Gleb Natapov
  1 sibling, 1 reply; 49+ messages in thread
From: andrzej zaborowski @ 2008-11-07 23:18 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: dlaor, qemu-devel

2008/11/6 Gleb Natapov <gleb@redhat.com>:
>> >> >> This doesn't matter, the tick that arrived while a previous interrupt
>> >> >> was not acked yet, is lost anyway,
>> >> > Not it is not. Not necessary. It can be queued inside PIC and delivered
>> >> > by PIC itself immediately after interrupt acknowledgement.
>> >>
>> >>  You can argue that it's the new irq that's lost or it's the first one
>> >> that was lost, either way the PIC only sees one time the irq rising,
>> >> instead of two.  That means they were coalesced.
>> > Nothing is lost and PIC sees two irq rising. Example:
>> > - RTC triggers first interrupt.
>> > - It is delivered to PIC. PIC sets corespondent bit in IRR.
>> > - CPU picks up RTC interrupt and it's bit is cleared from IRR bitmap.
>> > - CPU jumps to RTC IRQ routing but before it gets a chance to acknowledge
>> >  IRQ to PIC new timer is triggered.
>> > - With your patch you increment irq_coalesced in that case.
>> > - Interrupt is delivered to PIC.
>>
>> No, it isn't (unless the PIC is poorly implemented).  We raise the
>> irq, but since it's already high, nothing happens, there's no rising
>> edge.
>>
> That would be the case if RTC used level triggered interrupts, but
> RTC and PIT are edge-trigered. That is how they behave like it or not.

Sorry, I'm not taking this at all.  If this was the case it would be
completely broken, but I just had a look at i8259 and the
implementation seems to be correct.

The two devices are connected with only one line.  The signal on the
line can be in one of two states (levels).  After the first tick it
becomes high.  It stays high until a read to RTC_REG_C clears it.  The
second call to qemu_irq_raise does not change the level, there's no
edge.  If there's no event, how possibly can there be a reaction?

Cheers

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-07 23:18                       ` andrzej zaborowski
@ 2008-11-08  8:23                         ` Gleb Natapov
  0 siblings, 0 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-08  8:23 UTC (permalink / raw)
  To: andrzej zaborowski; +Cc: dlaor, qemu-devel

On Sat, Nov 08, 2008 at 12:18:00AM +0100, andrzej zaborowski wrote:
> 2008/11/6 Gleb Natapov <gleb@redhat.com>:
> >> >> >> This doesn't matter, the tick that arrived while a previous interrupt
> >> >> >> was not acked yet, is lost anyway,
> >> >> > Not it is not. Not necessary. It can be queued inside PIC and delivered
> >> >> > by PIC itself immediately after interrupt acknowledgement.
> >> >>
> >> >>  You can argue that it's the new irq that's lost or it's the first one
> >> >> that was lost, either way the PIC only sees one time the irq rising,
> >> >> instead of two.  That means they were coalesced.
> >> > Nothing is lost and PIC sees two irq rising. Example:
> >> > - RTC triggers first interrupt.
> >> > - It is delivered to PIC. PIC sets corespondent bit in IRR.
> >> > - CPU picks up RTC interrupt and it's bit is cleared from IRR bitmap.
> >> > - CPU jumps to RTC IRQ routing but before it gets a chance to acknowledge
> >> >  IRQ to PIC new timer is triggered.
> >> > - With your patch you increment irq_coalesced in that case.
> >> > - Interrupt is delivered to PIC.
> >>
> >> No, it isn't (unless the PIC is poorly implemented).  We raise the
> >> irq, but since it's already high, nothing happens, there's no rising
> >> edge.
> >>
> > That would be the case if RTC used level triggered interrupts, but
> > RTC and PIT are edge-trigered. That is how they behave like it or not.
> 
> Sorry, I'm not taking this at all.  If this was the case it would be
> completely broken, but I just had a look at i8259 and the
> implementation seems to be correct.
> 
> The two devices are connected with only one line.  The signal on the
> line can be in one of two states (levels).  After the first tick it
> becomes high.  It stays high until a read to RTC_REG_C clears it.  The
> second call to qemu_irq_raise does not change the level, there's no
> edge.  If there's no event, how possibly can there be a reaction?
> 
Yes, I was wrong about how RTC behaves. I described how PIT works. PIT
generates square wave and does not wait for acknowledgement from an OS.
For RTC you suggestion will mostly work and will needlessly re-inject
only those ticks that were generated when RTC interrupt vector was masked.
Don't know how often this happens in reality.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 10:08             ` Gleb Natapov
  2008-11-06 13:21               ` andrzej zaborowski
@ 2008-11-06 13:44               ` Paul Brook
  1 sibling, 0 replies; 49+ messages in thread
From: Paul Brook @ 2008-11-06 13:44 UTC (permalink / raw)
  To: qemu-devel; +Cc: dlaor, Gleb Natapov

> > A guest that counts on having enough CPU for something is
> > timing-depenent (buggy).
>
> Tell this to RT developers who count each CPU cycle.

If you want any a realtime illusion you absolutely have to be using -icount. 
This is still a long way from being cycle accurate and is no use for 
benchmarking, but it does give some semblance of realistic virtual realtime 
behavior. Of course this is at the expense of some divergence between host 
and virtual time.

Paul

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-05 12:45     ` Dor Laor
  2008-11-05 15:48       ` andrzej zaborowski
@ 2008-11-05 17:43       ` Gleb Natapov
  2008-11-06 17:28       ` David S. Ahern
  2 siblings, 0 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-05 17:43 UTC (permalink / raw)
  To: dlaor, qemu-devel

On Wed, Nov 05, 2008 at 02:45:05PM +0200, Dor Laor wrote:
>>> As I've said repeatedly in the past, any sort of time drift fixes 
>>> needs  to have a lot of data posted with it that is repeatable.
>>>
>>> How much does this improve things with Windows?     
>> The time drift is eliminated. If there is a spike in a load time may
>> slow down, but after that it catches up (this happens only during very
>> high loads though).
>>
>>   
> Gleb, can you please provide more details:
> - What's the host's kernel version exactly (including the high-res, dyn  
> tick configured)
Kernel 2.6.25. Qemu uses dynticks (CLOCK_REALTIME) timer. But why
high-res timers should help fix time-drift problem at all? Time drift
happens because guest doesn't get enough time to run, if host timer will
be more precise this will not magically provide guest with more CPU time
to handle them.

> - What's the windows version? Is it standard HAL (pit) or ACPI (rtc) or  
> both?
In my case this what windows2003 with ACPI HAL.

> - The detailed scenario you use (example: I copied the entire c:/windows  
> directory, etc)
Just run one guest in qemu on otherwise un-loaded host and copy entire
c:/windows directory somewhere.

> - Without the patch, what the time drift after x seconds on the host.
After 15 minutes time drift was 5 minutes. But this is with write through
disk cache (default now). With write back cache situation is much
better. After 5 minutes the drift was 8 secs.

> - With the patch, is there a drift? Is there increased cpu consumption, etc
There is no drift with the patch. CPU consumption is 100 with cache
write back so it can be increased any more and with cache write through
qemu doing mostly IO. I am sure that copying will take more time with
the patch since more time interrupts will be handled by guest.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-05 12:45     ` Dor Laor
  2008-11-05 15:48       ` andrzej zaborowski
  2008-11-05 17:43       ` Gleb Natapov
@ 2008-11-06 17:28       ` David S. Ahern
  2 siblings, 0 replies; 49+ messages in thread
From: David S. Ahern @ 2008-11-06 17:28 UTC (permalink / raw)
  To: gleb; +Cc: qemu-devel



Dor Laor wrote:
> Gleb Natapov wrote:
>> On Fri, Oct 31, 2008 at 02:17:19PM -0500, Anthony Liguori wrote:
>>   
>>> Gleb Natapov wrote:
>>>     
>>>> Qemu device emulation for timers might be inaccurate and
>>>> causes coalescing of several IRQs into one. It happens when the
>>>> load on the host is high and the guest did not manage to ack the
>>>> previous IRQ. The problem can be reproduced by copying of a big
>>>> file or many small ones inside Windows guest. When you do that guest 
>>>> clock start to lag behind the host one.

<snip>

>>
>>   
>>>                                                                 How do  
>>> Linux guests behave with this?
>>>     
>> Linux guests don't use pit or RTC for time keeping. They are completely
>> unaffected by those patches.
>>
>>   
> It will probably also drift with clock=pit in the guest kernel cmdline.

It is my understanding that for linux guests using a 2.4 kernel (e.g.,
RHEL3) the PIT is the only option.

david

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-02 13:04   ` Gleb Natapov
  2008-11-05 12:45     ` Dor Laor
@ 2008-11-05 16:43     ` Anthony Liguori
  2008-11-06  3:55       ` Jamie Lokier
  2008-11-06  8:12       ` Gleb Natapov
  2008-11-06  3:41     ` Jamie Lokier
  2 siblings, 2 replies; 49+ messages in thread
From: Anthony Liguori @ 2008-11-05 16:43 UTC (permalink / raw)
  To: qemu-devel

Gleb Natapov wrote:
> So? I raise them now. Have you tried suggested scenario and was able to
> reproduce the problem?
>   

Sorry, I mistyped.  I meant to say, I don't think any of the problems 
raised when this was initially posted have been addressed.  Namely, I 
asked for hard data on how much this helped things and Paul complained 
that this fix only fixed things partially, and was very invasive to 
other architectures.

Basically, there are two hurdles to overcome here.  The first is that I 
don't think it's overwhelmingly obvious that this is the correct 
solution in all scenarios.  We need to understand the scenarios it helps 
and by how much.  We then probably need to make sure to limit this 
operation to those specific scenarios.

The second is that this is not how hardware behaves normally.  This 
makes it undesirably from an architectural perspective.  If it's 
necessary, we need to find a way to minimize it's impact in much the way 
-win2k-hack's impact is minimized.

> The time drift is eliminated. If there is a spike in a load time may
> slow down, but after that it catches up (this happens only during very
> high loads though).
>   

How bad is time drift without it.  Under workload X, we lose N seconds 
per Y hours and with this patch, under the same workload, we lose M 
seconds per Y hours and N << M.

I strongly, strongly doubt that you'll be eliminating drift 100%.  And 
please describe workload X in such a way that it is 100% reproducible.  
If you're using a multimedia file to do this, please provide a link to 
obtain the multimedia file.

>>                                                  How does having a high  
>> resolution timer in the host affect the problem to begin with?
>>     
> My test machine has relatively recent kernel that use high resolution
> timers for time keeping. Also the problem is that guest does not receive
> enough time to process injected interrupt. How hr timer can help here?
>   

If the host can awaken QEMU 1024 times a second and QEMU can deliver a 
timer interrupt each time, there is no need for time drift fixing.

I would think that with high res timers on the host, you would have to 
put the host under heavy load before drift began occurring.

>>                                                                 How do  
>> Linux guests behave with this?
>>     
> Linux guests don't use pit or RTC for time keeping. They are completely
> unaffected by those patches.
>   

They certainly can, under the right circumstances.

>> Even the Windows PV spec calls out three separate approaches to dealing  
>> with missed interrupts and provides an interface for the host to query  
>> the guest as to which one should be used.  I don't think any solution  
>> that uses a single technique is going to be correct.
>>
>>     
> That is what I found in Microsoft docs:
>
>   If a virtual processor is unavailable for a sufficiently long period of
>   time, a full timer period may be missed. In this case, the hypervisor
>   uses one of two techniques. The first technique involves timer period
>   modulation, in effect shortening the period until the timer “catches
>   up”.
>
>   If a significant number of timer signals have been missed, the
>   hypervisor may be unable to compensate by using period modulation. In
>   this case, some timer expiration signals may be skipped completely.
>   For timers that are marked as lazy, the hypervisor uses a second
>   technique for dealing with the situation in which a virtual processor is
>   unavailable for a long period of time. In this case, the timer signal is
>   deferred until this virtual processor is available. If it doesn’t become
>   available until shortly before the next timer is due to expire, it is
>   skipped entirely. 
>
> The first techniques is what I am trying to introduce with this patch
> series.
>   

There is a third technique whereas the hypervisor is supposed to 
modulate the delivery of missed ticks by ensuring an even distribution 
of them across the next few time slices.  The windows guest is supposed 
to be able to tell the hypervisor which technique it should be using.

Regards,

Anthony Liguori

> --
> 			Gleb.
>
>
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-05 16:43     ` Anthony Liguori
@ 2008-11-06  3:55       ` Jamie Lokier
  2008-11-06  8:12       ` Gleb Natapov
  1 sibling, 0 replies; 49+ messages in thread
From: Jamie Lokier @ 2008-11-06  3:55 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> >The time drift is eliminated. If there is a spike in a load time may
> >slow down, but after that it catches up (this happens only during very
> >high loads though).
> 
> How bad is time drift without it.  Under workload X, we lose N seconds 
> per Y hours and with this patch, under the same workload, we lose M 
> seconds per Y hours and N << M.

In my experience, N seconds (for any N) is typically a problem with VM
systems in general.

> I strongly, strongly doubt that you'll be eliminating drift 100%.

I do believe that the method of "virtual clock warping" can eliminate
drift 100% for all guest OSes including guests which have their own
lost-tick compensators, provided there's enough host CPU _on average_
for the guest to tick, over an appropriate averaging window.  It
should work even with guests which request a different scheme with the
Microsoft PV ops.

> If the host can awaken QEMU 1024 times a second and QEMU can deliver a 
> timer interrupt each time, there is no need for time drift fixing.
> 
> I would think that with high res timers on the host, you would have to 
> put the host under heavy load before drift began occurring.

If two guests are running at 100% CPU on a single CPU, I suspect
you'll find each QEMU instance does _not_ get to run 1024 times per
second even with high res timers.  They will behave like
non-interactive processes, running alternately with long timeslices -
so even 100 or 18 times a second won't fire.  That's a reasonable
scenario, and doesn't even require any I/O or swapping.

I've personally yet to see any VM which doesn't drift
over time periods of months (i.e. servers in VMs) unless the guest is
running some kind of regular clock sync protocol - and NTP does not
always work for them, because NTP assumes a fairly low jitter clock,
which guests on a loaded host don't always manage - see above.

-- Jamie

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-05 16:43     ` Anthony Liguori
  2008-11-06  3:55       ` Jamie Lokier
@ 2008-11-06  8:12       ` Gleb Natapov
  2008-11-06 14:10         ` Anthony Liguori
  1 sibling, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-06  8:12 UTC (permalink / raw)
  To: qemu-devel

On Wed, Nov 05, 2008 at 10:43:46AM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> So? I raise them now. Have you tried suggested scenario and was able to
>> reproduce the problem?
>>   
>
> Sorry, I mistyped.  I meant to say, I don't think any of the problems  
> raised when this was initially posted have been addressed.  Namely, I  
> asked for hard data on how much this helped things
Does my answer to Dor's mail provide enough data?

>                                                     and Paul complained  
> that this fix only fixed things partially, and was very invasive to  
> other architectures.
Paul complained that initial version of the patch doesn't compile on non
x86 and I fixed that long time ago. I don't remember him complaining
about invasiveness of the patch, but there is nothing that can be done
about it. The patch changes core API function that is used all over
the code, so every place that uses old function definition have to be
amended. The changes are trivial BTW. May be it is possible to add another
API with new semantics and change qemu_irq subsystem to use both in parallel,
but I personally don't like this. It like microsoft approach to add new APIs
instead of fixing old ones to maintain backwards compatibility. Qemu doesn't
have backwards compatibility problem so why use inferior approach?

>
> Basically, there are two hurdles to overcome here.  The first is that I  
> don't think it's overwhelmingly obvious that this is the correct  
> solution in all scenarios.  We need to understand the scenarios it helps  
> and by how much.  We then probably need to make sure to limit this  
> operation to those specific scenarios.
I think we can all agree that this is _not_ the correct solution in all
scenarios. For instance icount approach works for pure qemu and looks
like much more clean approach, that is why interrupt de-coalescing
is disabled if icount is in use. We can have interrupt de-coalescing
be disabled by default and enable it only with a command line option.

>
> The second is that this is not how hardware behaves normally.  This  
> makes it undesirably from an architectural perspective.  If it's  
> necessary, we need to find a way to minimize it's impact in much the way  
> -win2k-hack's impact is minimized.
>
>> The time drift is eliminated. If there is a spike in a load time may
>> slow down, but after that it catches up (this happens only during very
>> high loads though).
>>   
>
> How bad is time drift without it.  Under workload X, we lose N seconds  
> per Y hours and with this patch, under the same workload, we lose M  
> seconds per Y hours and N << M.
>
See may reply to Dor please.

> I strongly, strongly doubt that you'll be eliminating drift 100%.  And  
> please describe workload X in such a way that it is 100% reproducible.   
> If you're using a multimedia file to do this, please provide a link to  
> obtain the multimedia file.
I found much simpler way to reproduce the problem, no multimedia
involved. Just copy windows folder somewhere. It should eliminate time
drift 100% assuming that on average the guest will have enough time to
run.

>
>>>                                                  How does having a 
>>> high  resolution timer in the host affect the problem to begin with?
>>>     
>> My test machine has relatively recent kernel that use high resolution
>> timers for time keeping. Also the problem is that guest does not receive
>> enough time to process injected interrupt. How hr timer can help here?
>>   
>
> If the host can awaken QEMU 1024 times a second and QEMU can deliver a  
> timer interrupt each time, there is no need for time drift fixing.
>
It is not enough to wake QEMU process 1024 time a second to signal time
interrupt. A guest should also have enough time to run between each
interrupt to process it. If QEMU signals 1024 time interrupt a second
and guest process only half of that a second you'll see time drift.
And if guest asks for 1024hz frequency and guest can't provide that no
solution for time drift exists in that case.

> I would think that with high res timers on the host, you would have to  
> put the host under heavy load before drift began occurring.
>
I see a time drift even with guest using 100hz timers and I am pretty sure
my 2.6.25 host uses hr timers.

>>>                                                                 How 
>>> do  Linux guests behave with this?
>>>     
>> Linux guests don't use pit or RTC for time keeping. They are completely
>> unaffected by those patches.
>>   
>
> They certainly can, under the right circumstances.
>
I know they can, but by default they are smarter then that.

>>> Even the Windows PV spec calls out three separate approaches to 
>>> dealing  with missed interrupts and provides an interface for the 
>>> host to query  the guest as to which one should be used.  I don't 
>>> think any solution  that uses a single technique is going to be 
>>> correct.
>>>
>>>     
>> That is what I found in Microsoft docs:
>>
>>   If a virtual processor is unavailable for a sufficiently long period of
>>   time, a full timer period may be missed. In this case, the hypervisor
>>   uses one of two techniques. The first technique involves timer period
>>   modulation, in effect shortening the period until the timer “catches
>>   up”.
>>
>>   If a significant number of timer signals have been missed, the
>>   hypervisor may be unable to compensate by using period modulation. In
>>   this case, some timer expiration signals may be skipped completely.
>>   For timers that are marked as lazy, the hypervisor uses a second
>>   technique for dealing with the situation in which a virtual processor is
>>   unavailable for a long period of time. In this case, the timer signal is
>>   deferred until this virtual processor is available. If it doesn’t become
>>   available until shortly before the next timer is due to expire, it is
>>   skipped entirely. 
>>
>> The first techniques is what I am trying to introduce with this patch
>> series.
>>   
>
> There is a third technique whereas the hypervisor is supposed to  
> modulate the delivery of missed ticks by ensuring an even distribution  
> of them across the next few time slices.
The third technique you describe looks to me exactly like the first one
in the text I quoted. My patches implement that approach for PIT. RTC
injects missed interrupt as soon as the previous one is acknowledged.

>                                           The windows guest is supposed  
> to be able to tell the hypervisor which technique it should be using.
>
Do you have pointer to a documentation that describe it? I am especially
interested if old guest like windows XP and windows 2003 support this?
Surprisingly many people are interested in those old guests, not shiny
newer ones :)

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06  8:12       ` Gleb Natapov
@ 2008-11-06 14:10         ` Anthony Liguori
  2008-11-06 14:24           ` Paul Brook
  0 siblings, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-11-06 14:10 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: qemu-devel

Gleb Natapov wrote:
> On Wed, Nov 05, 2008 at 10:43:46AM -0600, Anthony Liguori wrote:
>   
>> Gleb Natapov wrote:
>>
>> Sorry, I mistyped.  I meant to say, I don't think any of the problems  
>> raised when this was initially posted have been addressed.  Namely, I  
>> asked for hard data on how much this helped things
>>     
> Does my answer to Dor's mail provide enough data?
>   

I'll try to reproduce locally.  Your case would be more compelling with 
a more thorough analysis but I can at least now attempt to determine how 
hr timers impacts this.

>>>>                                                  How does having a 
>>>> high  resolution timer in the host affect the problem to begin with?
>>>>     
>>>>         
>>> My test machine has relatively recent kernel that use high resolution
>>> timers for time keeping. Also the problem is that guest does not receive
>>> enough time to process injected interrupt. How hr timer can help here?
>>>   
>>>       
>> If the host can awaken QEMU 1024 times a second and QEMU can deliver a  
>> timer interrupt each time, there is no need for time drift fixing.
>>
>>     
> It is not enough to wake QEMU process 1024 time a second to signal time
> interrupt. A guest should also have enough time to run between each
> interrupt to process it. If QEMU signals 1024 time interrupt a second
> and guest process only half of that a second you'll see time drift.
> And if guest asks for 1024hz frequency and guest can't provide that no
> solution for time drift exists in that case.
>   

If nothing else is running on the system, the guest should get plenty of 
time to run timers.

If the guest is not getting enough time on the system to be able to run 
their timer ticks, then coalescing the timer ticks isn't going to help 
by definition (they aren't getting enough CPU time to run these routines).

>> I would think that with high res timers on the host, you would have to  
>> put the host under heavy load before drift began occurring.
>>
>>     
> I see a time drift even with guest using 100hz timers and I am pretty sure
> my 2.6.25 host uses hr timers.
>   

You see time drift with 100hz timers in the guest??  That makes very 
little sense as even without hr timers, the host should have no problem 
delivering a 10ms timer.  Note, we just fixed an issue with delayed 
timer deliver.  There may be more bugs like this that are causing ticks 
to get missed.  If you have nothing else running on the host, no matter 
what you do in the guest, I see absolutely no reason why time should 
drift if the guest is running a 100hz timer.

Do you have an explanation for this?

>>                                           The windows guest is supposed  
>> to be able to tell the hypervisor which technique it should be using.
>>
>>     
> Do you have pointer to a documentation that describe it? I am especially
> interested if old guest like windows XP and windows 2003 support this?
> Surprisingly many people are interested in those old guests, not shiny
> newer ones :)
>   

http://www.microsoft.com/downloads/details.aspx?FamilyID=91e2e518-c62c-4ff2-8e50-3a37ea4100f5&DisplayLang=en

Regards,

Anthony Liguori

> --
> 			Gleb.
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 14:10         ` Anthony Liguori
@ 2008-11-06 14:24           ` Paul Brook
  2008-11-06 14:40             ` Anthony Liguori
  0 siblings, 1 reply; 49+ messages in thread
From: Paul Brook @ 2008-11-06 14:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Gleb Natapov

> You see time drift with 100hz timers in the guest??  That makes very
> little sense as even without hr timers, the host should have no problem
> delivering a 10ms timer.

If your host has HZ=100 and no HR timers than a 10ms interval is fairly 
borderline, and it's not that hard to end up missing an interrupt, especially 
under heavy load.

If part of qemu gets swapped out then all bets are off, and you can easily 
stall for significant fractions of a second. No amount of host high 
resolution time support will help you there.

Paul

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 14:24           ` Paul Brook
@ 2008-11-06 14:40             ` Anthony Liguori
  2008-11-06 14:51               ` Gleb Natapov
  0 siblings, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-11-06 14:40 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel, Gleb Natapov

Paul Brook wrote:
>> You see time drift with 100hz timers in the guest??  That makes very
>> little sense as even without hr timers, the host should have no problem
>> delivering a 10ms timer.
>>     
>
> If your host has HZ=100 and no HR timers than a 10ms interval is fairly 
> borderline, and it's not that hard to end up missing an interrupt, especially 
> under heavy load.
>   

If the host is otherwise idle, then the guest should have ample time to 
run.  With /dev/rtc, you should be able to get a reliably 1ms timer.

Gleb: are you perhaps using a qcow2 file in conjunction with -snapshot?  
If that's the case, and you're doing an IO intensive workload, then you 
would be expanding the qcow2 file and updating the metadata is 
synchronous.  This could cause lost time for the guest.

If you are using qcow2, can you still reproduce with a raw file and not 
using -snapshot?

> If part of qemu gets swapped out then all bets are off, and you can easily 
> stall for significant fractions of a second. No amount of host high 
> resolution time support will help you there.
>   

Running a steady workload, you aren't going to be partially swapped.

Regards,

Anthony Liguori

> Paul
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 14:40             ` Anthony Liguori
@ 2008-11-06 14:51               ` Gleb Natapov
  2008-11-06 15:37                 ` Anthony Liguori
  0 siblings, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-06 14:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Thu, Nov 06, 2008 at 08:40:09AM -0600, Anthony Liguori wrote:
> Paul Brook wrote:
>>> You see time drift with 100hz timers in the guest??  That makes very
>>> little sense as even without hr timers, the host should have no problem
>>> delivering a 10ms timer.
>>>     
>>
>> If your host has HZ=100 and no HR timers than a 10ms interval is fairly 
>> borderline, and it's not that hard to end up missing an interrupt, 
>> especially under heavy load.
>>   
>
> If the host is otherwise idle, then the guest should have ample time to  
> run.  With /dev/rtc, you should be able to get a reliably 1ms timer.
And only one guest.

BTW I am saying that guest timer is 100HZ since I assume that windows2003
behaves like XP in this regard, but may be default timer frequency of 2003
is different. I'll check this.

>
> Gleb: are you perhaps using a qcow2 file in conjunction with -snapshot?   
I am using qcow2, but without -snapshot.

> If that's the case, and you're doing an IO intensive workload, then you  
> would be expanding the qcow2 file and updating the metadata is  
> synchronous.  This could cause lost time for the guest.
>
I think that's what happens when I see a huge drift, yes.

> If you are using qcow2, can you still reproduce with a raw file and not  
> using -snapshot?
I'll try next week to run with a raw file.

>
>> If part of qemu gets swapped out then all bets are off, and you can 
>> easily stall for significant fractions of a second. No amount of host 
>> high resolution time support will help you there.
>>   
>
> Running a steady workload, you aren't going to be partially swapped.
>
We want to oversubscribe host as much as possible, and workload will
vary during a lifetime of the VMs.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 14:51               ` Gleb Natapov
@ 2008-11-06 15:37                 ` Anthony Liguori
  2008-11-08  8:36                   ` Gleb Natapov
  0 siblings, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-11-06 15:37 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Brook, qemu-devel

Gleb Natapov wrote:
> On Thu, Nov 06, 2008 at 08:40:09AM -0600, Anthony Liguori wrote:
>   
>
>> Gleb: are you perhaps using a qcow2 file in conjunction with -snapshot?   
>>     
> I am using qcow2, but without -snapshot.
>   

Okay, you would still see this if your qcow2 is relatively small 
compared to the possible size it could be.

I totally believe that you could miss ticks from qcow2 metadata writing 
even with 100hz clock especially since we're using O_SYNC.  A relatively 
large write that has to extend the qcow2 file multiple times could 
conceivably block the guest for more than 10ms.  However, this is a bug 
in qcow2 IMHO.  Metadata updates should be done asynchronously and if 
they did, I bet this problem wouldn't occur.  A test against raw should 
confirm this.

>
>>> If part of qemu gets swapped out then all bets are off, and you can 
>>> easily stall for significant fractions of a second. No amount of host 
>>> high resolution time support will help you there.
>>>   
>>>       
>> Running a steady workload, you aren't going to be partially swapped.
>>
>>     
> We want to oversubscribe host as much as possible, and workload will
> vary during a lifetime of the VMs.
>   

I understand that we want guest time behave even when we're 
overcommitting the host CPU.

However, let's make sure we understand exactly what's going on such that 
we know precisely what we're fixing.  I believe the file copy benchmark 
is going to turn out to no longer produce drift with a raw image.  If 
that's the case, you'll need to find another benchmark to quantify drift.

I think the best ones are going to be intense host workload (and let's 
see how much is needed before we start drifting badly) and high guest 
frequencies with hosts that lack high resolution timers.  I think with a 
high resolution guest and no host overcommit, it should be very 
difficult to produce drift regardless of what the guest is doing.

Regards,

Anthony Liguori

> --
> 			Gleb.
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-06 15:37                 ` Anthony Liguori
@ 2008-11-08  8:36                   ` Gleb Natapov
  2008-11-08 22:14                     ` Dor Laor
                                       ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-08  8:36 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Thu, Nov 06, 2008 at 09:37:56AM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> On Thu, Nov 06, 2008 at 08:40:09AM -0600, Anthony Liguori wrote:
>>   
>>
>>> Gleb: are you perhaps using a qcow2 file in conjunction with 
>>> -snapshot?       
>> I am using qcow2, but without -snapshot.
>>   
>
> Okay, you would still see this if your qcow2 is relatively small  
> compared to the possible size it could be.
>
> I totally believe that you could miss ticks from qcow2 metadata writing  
> even with 100hz clock especially since we're using O_SYNC.  A relatively  
> large write that has to extend the qcow2 file multiple times could  
> conceivably block the guest for more than 10ms.  However, this is a bug  
> in qcow2 IMHO.  Metadata updates should be done asynchronously and if  
> they did, I bet this problem wouldn't occur.  A test against raw should  
> confirm this.
>
I ran the copy test once again with qcow2 image, but this time I copied 
from qcow2 to network fs and the drift still exists. Much smaller
though. 8 second per hour AFAIR.

>>
>>>> If part of qemu gets swapped out then all bets are off, and you can 
>>>> easily stall for significant fractions of a second. No amount of 
>>>> host high resolution time support will help you there.
>>>>         
>>> Running a steady workload, you aren't going to be partially swapped.
>>>
>>>     
>> We want to oversubscribe host as much as possible, and workload will
>> vary during a lifetime of the VMs.
>>   
>
> I understand that we want guest time behave even when we're  
> overcommitting the host CPU.
>
> However, let's make sure we understand exactly what's going on such that  
> we know precisely what we're fixing.  I believe the file copy benchmark  
> is going to turn out to no longer produce drift with a raw image.  If  
> that's the case, you'll need to find another benchmark to quantify drift.
>
Yes indeed. With raw image copy benchmark no longer runs enough time to
produce time drift big enough to be visible. So I ran this disk test
utility http://69.90.47.6/mybootdisks.com/mybootdisks_com/nu2/bst514.zip
for ~12 hours and the time drift was 12 secs (if I weren't so lazy and
wrote bat file to copy c:\windows in a loop I am sure result would be the
same). This is on completely idle host.


> I think the best ones are going to be intense host workload (and let's  
> see how much is needed before we start drifting badly) and high guest  
> frequencies with hosts that lack high resolution timers.  I think with a  
> high resolution guest and no host overcommit, it should be very  
> difficult to produce drift regardless of what the guest is doing.
>
Later I'll try to generate load on a host an see how this affects
guest's time drift.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-08  8:36                   ` Gleb Natapov
@ 2008-11-08 22:14                     ` Dor Laor
  2008-11-09  7:40                     ` Gleb Natapov
  2008-11-09 16:36                     ` Anthony Liguori
  2 siblings, 0 replies; 49+ messages in thread
From: Dor Laor @ 2008-11-08 22:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

[-- Attachment #1: Type: text/plain, Size: 4314 bytes --]

Gleb Natapov wrote:
> On Thu, Nov 06, 2008 at 09:37:56AM -0600, Anthony Liguori wrote:
>   
>> Gleb Natapov wrote:
>>     
>>> On Thu, Nov 06, 2008 at 08:40:09AM -0600, Anthony Liguori wrote:
>>>   
>>>
>>>       
>>>> Gleb: are you perhaps using a qcow2 file in conjunction with 
>>>> -snapshot?       
>>>>         
>>> I am using qcow2, but without -snapshot.
>>>   
>>>       
>> Okay, you would still see this if your qcow2 is relatively small  
>> compared to the possible size it could be.
>>
>> I totally believe that you could miss ticks from qcow2 metadata writing  
>> even with 100hz clock especially since we're using O_SYNC.  A relatively  
>> large write that has to extend the qcow2 file multiple times could  
>> conceivably block the guest for more than 10ms.  However, this is a bug  
>> in qcow2 IMHO.  Metadata updates should be done asynchronously and if  
>> they did, I bet this problem wouldn't occur.  A test against raw should  
>> confirm this.
>>
>>     
> I ran the copy test once again with qcow2 image, but this time I copied 
> from qcow2 to network fs and the drift still exists. Much smaller
> though. 8 second per hour AFAIR.
>
>   
>>>>> If part of qemu gets swapped out then all bets are off, and you can 
>>>>> easily stall for significant fractions of a second. No amount of 
>>>>> host high resolution time support will help you there.
>>>>>         
>>>>>           
>>>> Running a steady workload, you aren't going to be partially swapped.
>>>>
>>>>     
>>>>         
>>> We want to oversubscribe host as much as possible, and workload will
>>> vary during a lifetime of the VMs.
>>>   
>>>       
>> I understand that we want guest time behave even when we're  
>> overcommitting the host CPU.
>>
>> However, let's make sure we understand exactly what's going on such that  
>> we know precisely what we're fixing.  I believe the file copy benchmark  
>> is going to turn out to no longer produce drift with a raw image.  If  
>> that's the case, you'll need to find another benchmark to quantify drift.
>>
>>     
> Yes indeed. With raw image copy benchmark no longer runs enough time to
> produce time drift big enough to be visible. So I ran this disk test
> utility http://69.90.47.6/mybootdisks.com/mybootdisks_com/nu2/bst514.zip
> for ~12 hours and the time drift was 12 secs (if I weren't so lazy and
> wrote bat file to copy c:\windows in a loop I am sure result would be the
> same). This is on completely idle host.
>
>
>   
>> I think the best ones are going to be intense host workload (and let's  
>> see how much is needed before we start drifting badly) and high guest  
>> frequencies with hosts that lack high resolution timers.  I think with a  
>> high resolution guest and no host overcommit, it should be very  
>> difficult to produce drift regardless of what the guest is doing.
>>
>>     
> Later I'll try to generate load on a host an see how this affects
> guest's time drift.
>
> --
> 			Gleb.
>
>   

No doubt that either additional cpu load or maybe easily, disk io load 
on the host/storage will increase
the time drift. Can't wait to get the load results by Gleb.
Note that the problem becomes more visible when the guests uses 1000hz 
timers (when windows
plays any multimedia or for Linux with 1000Hz clock + pit/rtc.
There are many hosts out there with kernel < 2.6.24 --> no userspace hr 
timer.
Bottom line, no matter how we'll twist it there will be scenarios where 
we drift.

Maybe in order to minimize the effect we can use the in-kernel-pit for 
kvm. For rtc we can
use Andrzej's solution using RTC_REG_C. If we use a counter to count the 
number of
un/set events it can solve the drift for rtc without any irq_set api 
changes.

Another option is to provide the qemu_timer user a simple api to check 
if the guest vcpu was
scheduled by the host since the it was issued. If the vcpu did not the 
timer user can assume a drift.
(Less accurate from the irq solution thought).
The advantage here is that it helps the long forgotten -win2k-hack to work.
I almost forgot this hack is not stable and when you run 10+ VMs, some 
fail. Using this option
the hack will reliably force the host/guest to do 
dma_request---vcpu_exec---dma_done.
Now it only separate the two dma events with a timer.

Thanks for the fruitful discussion,
Dor

[-- Attachment #2: Type: text/html, Size: 5309 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-08  8:36                   ` Gleb Natapov
  2008-11-08 22:14                     ` Dor Laor
@ 2008-11-09  7:40                     ` Gleb Natapov
  2008-11-09 16:38                       ` Anthony Liguori
  2008-11-09 16:36                     ` Anthony Liguori
  2 siblings, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-09  7:40 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Sat, Nov 08, 2008 at 10:36:20AM +0200, Gleb Natapov wrote:
> > I think the best ones are going to be intense host workload (and let's  
> > see how much is needed before we start drifting badly) and high guest  
> > frequencies with hosts that lack high resolution timers.  I think with a  
> > high resolution guest and no host overcommit, it should be very  
> > difficult to produce drift regardless of what the guest is doing.
> >
> Later I'll try to generate load on a host an see how this affects
> guest's time drift.
> 
Just did that. Run qemu process bound to CPU 1 (taskset 1 qemu ...).
Run busy loop on the same CPU (taskset 1 bash -c "while true; do x='x'; done")
Run disk test utility as before inside a guest. After 40 minutes time
drift is almost 1 minute, but the drift was not gradual i.e I observed
gradual drift of 1 second per ~6 minutest, but sometimes there were jumps
of ~10 seconds. 

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-09  7:40                     ` Gleb Natapov
@ 2008-11-09 16:38                       ` Anthony Liguori
  2008-11-09 21:00                         ` Avi Kivity
  0 siblings, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-11-09 16:38 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Brook, qemu-devel

Gleb Natapov wrote:
> On Sat, Nov 08, 2008 at 10:36:20AM +0200, Gleb Natapov wrote:
>   
>>> I think the best ones are going to be intense host workload (and let's  
>>> see how much is needed before we start drifting badly) and high guest  
>>> frequencies with hosts that lack high resolution timers.  I think with a  
>>> high resolution guest and no host overcommit, it should be very  
>>> difficult to produce drift regardless of what the guest is doing.
>>>
>>>       
>> Later I'll try to generate load on a host an see how this affects
>> guest's time drift.
>>
>>     
> Just did that. Run qemu process bound to CPU 1 (taskset 1 qemu ...).
> Run busy loop on the same CPU (taskset 1 bash -c "while true; do x='x'; done")
> Run disk test utility as before inside a guest. After 40 minutes time
> drift is almost 1 minute, but the drift was not gradual i.e I observed
> gradual drift of 1 second per ~6 minutest, but sometimes there were jumps
> of ~10 seconds. 
>   

Yeah, this is what I would expect.  This is why we'll probably have to 
do some sort of interrupt catch-up.  But I'd like to make sure we have 
the idle host stuff figured out because I don't see a reason why we 
should lose ticks.  I don't want to mask other problems with something 
like TDF.

Regards,

Anthony Liguori

> --
> 			Gleb.
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-09 16:38                       ` Anthony Liguori
@ 2008-11-09 21:00                         ` Avi Kivity
  0 siblings, 0 replies; 49+ messages in thread
From: Avi Kivity @ 2008-11-09 21:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook, Gleb Natapov

Anthony Liguori wrote:
>
> Yeah, this is what I would expect.  This is why we'll probably have to 
> do some sort of interrupt catch-up.  But I'd like to make sure we have 
> the idle host stuff figured out because I don't see a reason why we 
> should lose ticks.  I don't want to mask other problems with something 
> like TDF.
>

Even an idle host is never truly idle.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-08  8:36                   ` Gleb Natapov
  2008-11-08 22:14                     ` Dor Laor
  2008-11-09  7:40                     ` Gleb Natapov
@ 2008-11-09 16:36                     ` Anthony Liguori
  2008-11-10 14:37                       ` Gleb Natapov
  2 siblings, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-11-09 16:36 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Brook, qemu-devel

Gleb Natapov wrote:
> On Thu, Nov 06, 2008 at 09:37:56AM -0600, Anthony Liguori wrote:
>   
>
> Yes indeed. With raw image copy benchmark no longer runs enough time to
> produce time drift big enough to be visible. So I ran this disk test
> utility http://69.90.47.6/mybootdisks.com/mybootdisks_com/nu2/bst514.zip
> for ~12 hours and the time drift was 12 secs (if I weren't so lazy and
> wrote bat file to copy c:\windows in a loop I am sure result would be the
> same). This is on completely idle host.
>   

What frequency is the guest running at?  If it's running at 100hz, then 
it missed a tick once every 36 seconds.  This means that the guest 
couldn't run long enough to handle a timer interrupt (which should be a 
relatively small number of cycles) in a 10ms period.

Does this drift go away with the TDF patches?  This almost makes me 
think that we aren't delivering interrupts at the right frequency and 
we're simply accumulating error.  In theory, the TDF patches shouldn't 
help that.

Otherwise, I'm curious if you have any insight into where we're pausing 
for 10ms that's causing the missed interrupt?

We could also be missing ticks somehow.  I think this warrants further 
investigation.

Regards,

Anthony Liguori

>   
>> I think the best ones are going to be intense host workload (and let's  
>> see how much is needed before we start drifting badly) and high guest  
>> frequencies with hosts that lack high resolution timers.  I think with a  
>> high resolution guest and no host overcommit, it should be very  
>> difficult to produce drift regardless of what the guest is doing.
>>
>>     
> Later I'll try to generate load on a host an see how this affects
> guest's time drift.
>
> --
> 			Gleb.
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-09 16:36                     ` Anthony Liguori
@ 2008-11-10 14:37                       ` Gleb Natapov
  2008-11-10 15:24                         ` Anthony Liguori
  0 siblings, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-10 14:37 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Sun, Nov 09, 2008 at 10:36:59AM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> On Thu, Nov 06, 2008 at 09:37:56AM -0600, Anthony Liguori wrote:
>>   
>>
>> Yes indeed. With raw image copy benchmark no longer runs enough time to
>> produce time drift big enough to be visible. So I ran this disk test
>> utility http://69.90.47.6/mybootdisks.com/mybootdisks_com/nu2/bst514.zip
>> for ~12 hours and the time drift was 12 secs (if I weren't so lazy and
>> wrote bat file to copy c:\windows in a loop I am sure result would be the
>> same). This is on completely idle host.
>>   
>
> What frequency is the guest running at?  If it's running at 100hz, then  
Windows program RTC to 64hz frequency.

> it missed a tick once every 36 seconds.  This means that the guest  
> couldn't run long enough to handle a timer interrupt (which should be a  
> relatively small number of cycles) in a 10ms period.
>
10ms is for qemu + guest, not just guest.

> Does this drift go away with the TDF patches?  This almost makes me  
> think that we aren't delivering interrupts at the right frequency and  
> we're simply accumulating error.  In theory, the TDF patches shouldn't  
> help that.
I haven't checked TDF patch yet, but I inserted prints into RTC
emulation to check it interrupt are really missed and they are. And
after 64 lost interrupts (took slightly more then one hour) the time
drift was 1 second, so TDF patches should fix that.

>
> Otherwise, I'm curious if you have any insight into where we're pausing  
> for 10ms that's causing the missed interrupt?
>
Don't know yet.

> We could also be missing ticks somehow.  I think this warrants further  
> investigation.
>
This is not the case from what I observe.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-10 14:37                       ` Gleb Natapov
@ 2008-11-10 15:24                         ` Anthony Liguori
  2008-11-10 15:29                           ` Gleb Natapov
  0 siblings, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-11-10 15:24 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Brook, qemu-devel

Gleb Natapov wrote:
> On Sun, Nov 09, 2008 at 10:36:59AM -0600, Anthony Liguori wrote:
>   
>
>> Otherwise, I'm curious if you have any insight into where we're pausing  
>> for 10ms that's causing the missed interrupt?
>>
>>     
> Don't know yet.
>   

What's your full command line?

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-10 15:24                         ` Anthony Liguori
@ 2008-11-10 15:29                           ` Gleb Natapov
  2008-11-10 15:46                             ` Anthony Liguori
  0 siblings, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-10 15:29 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Mon, Nov 10, 2008 at 09:24:55AM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> On Sun, Nov 09, 2008 at 10:36:59AM -0600, Anthony Liguori wrote:
>>   
>>
>>> Otherwise, I'm curious if you have any insight into where we're 
>>> pausing  for 10ms that's causing the missed interrupt?
>>>
>>>     
>> Don't know yet.
>>   
>
> What's your full command line?
>
/home/gleb/install/qemu/org/bin/qemu-system-x86_64 -hda Windows2003.raw -m 1024 -net
nic,model=rtl8139 -net user -vnc 172.17.10.111:1 -usbdevice tablet

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-10 15:29                           ` Gleb Natapov
@ 2008-11-10 15:46                             ` Anthony Liguori
  2008-11-10 15:51                               ` Gleb Natapov
  2008-11-11 14:43                               ` Gleb Natapov
  0 siblings, 2 replies; 49+ messages in thread
From: Anthony Liguori @ 2008-11-10 15:46 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Brook, qemu-devel

Gleb Natapov wrote:
> On Mon, Nov 10, 2008 at 09:24:55AM -0600, Anthony Liguori wrote:
>   
>> Gleb Natapov wrote:
>>     
>>> On Sun, Nov 09, 2008 at 10:36:59AM -0600, Anthony Liguori wrote:
>>>   
>>>
>>>       
>>>> Otherwise, I'm curious if you have any insight into where we're 
>>>> pausing  for 10ms that's causing the missed interrupt?
>>>>
>>>>     
>>>>         
>>> Don't know yet.
>>>   
>>>       
>> What's your full command line?
>>
>>     
> /home/gleb/install/qemu/org/bin/qemu-system-x86_64 -hda Windows2003.raw -m 1024 -net
> nic,model=rtl8139 -net user -vnc 172.17.10.111:1 -usbdevice tablet
>   

-usbdevice tablet may be the culprit.  Is the guest doing an significant IO?

Regards,

Anthony Liguori

> --
> 			Gleb.
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-10 15:46                             ` Anthony Liguori
@ 2008-11-10 15:51                               ` Gleb Natapov
  2008-11-11 14:43                               ` Gleb Natapov
  1 sibling, 0 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-10 15:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Mon, Nov 10, 2008 at 09:46:12AM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> On Mon, Nov 10, 2008 at 09:24:55AM -0600, Anthony Liguori wrote:
>>   
>>> Gleb Natapov wrote:
>>>     
>>>> On Sun, Nov 09, 2008 at 10:36:59AM -0600, Anthony Liguori wrote:
>>>>   
>>>>
>>>>       
>>>>> Otherwise, I'm curious if you have any insight into where we're  
>>>>> pausing  for 10ms that's causing the missed interrupt?
>>>>>
>>>>>             
>>>> Don't know yet.
>>>>         
>>> What's your full command line?
>>>
>>>     
>> /home/gleb/install/qemu/org/bin/qemu-system-x86_64 -hda Windows2003.raw -m 1024 -net
>> nic,model=rtl8139 -net user -vnc 172.17.10.111:1 -usbdevice tablet
>>   
>
> -usbdevice tablet may be the culprit.  Is the guest doing an significant IO?
>
It runs a disk test utility. That is a lot of disk IO.

I just checked qemu doesn't call cpu_exec() between two coalesced
interrupts.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-10 15:46                             ` Anthony Liguori
  2008-11-10 15:51                               ` Gleb Natapov
@ 2008-11-11 14:43                               ` Gleb Natapov
  2008-11-11 17:26                                 ` Anthony Liguori
  2008-11-11 20:17                                 ` Anthony Liguori
  1 sibling, 2 replies; 49+ messages in thread
From: Gleb Natapov @ 2008-11-11 14:43 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Mon, Nov 10, 2008 at 09:46:12AM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> On Mon, Nov 10, 2008 at 09:24:55AM -0600, Anthony Liguori wrote:
>>   
>>> Gleb Natapov wrote:
>>>     
>>>> On Sun, Nov 09, 2008 at 10:36:59AM -0600, Anthony Liguori wrote:
>>>>   
>>>>
>>>>       
>>>>> Otherwise, I'm curious if you have any insight into where we're  
>>>>> pausing  for 10ms that's causing the missed interrupt?
>>>>>
>>>>>             
>>>> Don't know yet.
>>>>         
>>> What's your full command line?
>>>
>>>     
>> /home/gleb/install/qemu/org/bin/qemu-system-x86_64 -hda Windows2003.raw -m 1024 -net
>> nic,model=rtl8139 -net user -vnc 172.17.10.111:1 -usbdevice tablet
>>   
>
> -usbdevice tablet may be the culprit.  Is the guest doing an significant IO?
>
-usbdevice tablet has nothing to do with it. Qemu misses interrupt even
without this option and with SDL screen it misses them in bunches when
SDL redraws a screen. In case of vnc qemu misses interrupt because of
fsync() call in raw_flush(), or so my instrumentation shows.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-11 14:43                               ` Gleb Natapov
@ 2008-11-11 17:26                                 ` Anthony Liguori
  2008-11-11 20:17                                 ` Anthony Liguori
  1 sibling, 0 replies; 49+ messages in thread
From: Anthony Liguori @ 2008-11-11 17:26 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Brook, qemu-devel

Gleb Natapov wrote:
> On Mon, Nov 10, 2008 at 09:46:12AM -0600, Anthony Liguori wrote:
>   
>> Gleb Natapov wrote:
>>     
>>> On Mon, Nov 10, 2008 at 09:24:55AM -0600, Anthony Liguori wrote:
>>>   
>>>       
>>>> Gleb Natapov wrote:
>>>>     
>>>>         
>>>>> On Sun, Nov 09, 2008 at 10:36:59AM -0600, Anthony Liguori wrote:
>>>>>   
>>>>>
>>>>>       
>>>>>           
>>>>>> Otherwise, I'm curious if you have any insight into where we're  
>>>>>> pausing  for 10ms that's causing the missed interrupt?
>>>>>>
>>>>>>             
>>>>>>             
>>>>> Don't know yet.
>>>>>         
>>>>>           
>>>> What's your full command line?
>>>>
>>>>     
>>>>         
>>> /home/gleb/install/qemu/org/bin/qemu-system-x86_64 -hda Windows2003.raw -m 1024 -net
>>> nic,model=rtl8139 -net user -vnc 172.17.10.111:1 -usbdevice tablet
>>>   
>>>       
>> -usbdevice tablet may be the culprit.  Is the guest doing an significant IO?
>>
>>     
> -usbdevice tablet has nothing to do with it. Qemu misses interrupt even
> without this option and with SDL screen it misses them in bunches when
> SDL redraws a screen. In case of vnc qemu misses interrupt because of
> fsync() call in raw_flush(), or so my instrumentation shows.
>   

eek, fsync is actually wrong here because there may be pending requests 
in flight that fsync won't take into account.  We should be doing an 
aio_fsync().  You're using IDE right?

Regards,

Anthony Liguori

> --
> 			Gleb.
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-11 14:43                               ` Gleb Natapov
  2008-11-11 17:26                                 ` Anthony Liguori
@ 2008-11-11 20:17                                 ` Anthony Liguori
  2008-11-12 11:42                                   ` Gleb Natapov
  1 sibling, 1 reply; 49+ messages in thread
From: Anthony Liguori @ 2008-11-11 20:17 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Paul Brook, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1216 bytes --]

Gleb Natapov wrote:
> On Mon, Nov 10, 2008 at 09:46:12AM -0600, Anthony Liguori wrote:
>   
> -usbdevice tablet has nothing to do with it. Qemu misses interrupt even
> without this option and with SDL screen it misses them in bunches when
> SDL redraws a screen. In case of vnc qemu misses interrupt because of
> fsync() call in raw_flush(), or so my instrumentation shows.
>   

Can you give this patch a spin?

This introduces a bdrv_aio_flush() which will wait for all existing AIO 
operations to complete before indicating completion.  It also fixes up 
IDE.  Fixing up SCSI will be a little more tricky but not much.  Since 
we now use O_DSYNC, it's unnecessary to do an fsync (or an fdatasync).

Assuming you're using IDE, this should eliminate any delays from fsync.  
SDL delays are unavoidable because it's going to come down to SDL doing 
sychronous updates to the X server.  The proper long term solution here 
would be to put SDL in it's own thread but I'm not too worried about 
this case.  It's a good argument for timer fixups but I think it's 
important to identify all the reasons we're losing ticks and if we can 
do something about it, we should.

Regards,

Anthony Liguori

> --
> 			Gleb.
>   


[-- Attachment #2: bdrv_aio_flush.patch --]
[-- Type: text/x-patch, Size: 11613 bytes --]

diff --git a/block-qcow.c b/block-qcow.c
index 1fecf30..ed7a62d 100644
--- a/block-qcow.c
+++ b/block-qcow.c
@@ -712,6 +712,13 @@ static BlockDriverAIOCB *qcow_aio_write(BlockDriverState *bs,
     return &acb->common;
 }
 
+static BlockDriverAIOCB *qcow_aio_flush(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BDRVQcowState *s = bs->opaque;
+    return bdrv_aio_flush(s->hd, cb, opaque);
+}
+
 static void qcow_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     QCowAIOCB *acb = (QCowAIOCB *)blockacb;
@@ -868,12 +875,6 @@ static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
     return 0;
 }
 
-static void qcow_flush(BlockDriverState *bs)
-{
-    BDRVQcowState *s = bs->opaque;
-    bdrv_flush(s->hd);
-}
-
 static int qcow_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
     BDRVQcowState *s = bs->opaque;
@@ -890,13 +891,14 @@ BlockDriver bdrv_qcow = {
     NULL,
     qcow_close,
     qcow_create,
-    qcow_flush,
+    NULL,
     qcow_is_allocated,
     qcow_set_key,
     qcow_make_empty,
 
     .bdrv_aio_read = qcow_aio_read,
     .bdrv_aio_write = qcow_aio_write,
+    .bdrv_aio_flush = qcow_aio_flush,
     .bdrv_aio_cancel = qcow_aio_cancel,
     .aiocb_size = sizeof(QCowAIOCB),
     .bdrv_write_compressed = qcow_write_compressed,
diff --git a/block-qcow2.c b/block-qcow2.c
index dc73769..69523a5 100644
--- a/block-qcow2.c
+++ b/block-qcow2.c
@@ -1401,6 +1401,13 @@ static BlockDriverAIOCB *qcow_aio_write(BlockDriverState *bs,
     return &acb->common;
 }
 
+static BlockDriverAIOCB *qcow_aio_flush(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BDRVQcowState *s = bs->opaque;
+    return bdrv_aio_flush(s->hd, cb, opaque);
+}
+
 static void qcow_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     QCowAIOCB *acb = (QCowAIOCB *)blockacb;
@@ -1632,12 +1639,6 @@ static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
     return 0;
 }
 
-static void qcow_flush(BlockDriverState *bs)
-{
-    BDRVQcowState *s = bs->opaque;
-    bdrv_flush(s->hd);
-}
-
 static int qcow_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
     BDRVQcowState *s = bs->opaque;
@@ -2637,13 +2638,14 @@ BlockDriver bdrv_qcow2 = {
     NULL,
     qcow_close,
     qcow_create,
-    qcow_flush,
+    NULL,
     qcow_is_allocated,
     qcow_set_key,
     qcow_make_empty,
 
     .bdrv_aio_read = qcow_aio_read,
     .bdrv_aio_write = qcow_aio_write,
+    .bdrv_aio_flush = qcow_aio_flush,
     .bdrv_aio_cancel = qcow_aio_cancel,
     .aiocb_size = sizeof(QCowAIOCB),
     .bdrv_write_compressed = qcow_write_compressed,
diff --git a/block-raw-posix.c b/block-raw-posix.c
index c06e38d..8cc1dba 100644
--- a/block-raw-posix.c
+++ b/block-raw-posix.c
@@ -449,6 +449,7 @@ typedef struct RawAIOCB {
     int fd;
     struct aiocb aiocb;
     struct RawAIOCB *next;
+    uint64_t flush_generation;
     int ret;
 } RawAIOCB;
 
@@ -456,24 +457,46 @@ typedef struct PosixAioState
 {
     int rfd, wfd;
     RawAIOCB *first_aio;
+    RawAIOCB *first_aio_flush;
+    uint64_t flush_generation;
 } PosixAioState;
 
 static int raw_fd_pool_get(BDRVRawState *s)
 {
     int i;
 
-    for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
-        /* already in use */
-        if (s->fd_pool[i] != -1)
-            continue;
+    if (s->fd_inuse == 0) {
+        s->fd_inuse++;
+        return s->fd;
+    }
 
-        /* try to dup file descriptor */
-        s->fd_pool[i] = dup(s->fd);
-        if (s->fd_pool[i] != -1)
+    /* try to find an unused fd that has
+     * already been opened */
+    for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
+        if (s->fd_pool_inuse[i] == 0 &&
+            s->fd_pool[i] != -1) {
+            s->fd_pool_inuse[i]++;
             return s->fd_pool[i];
+        }
+    }
+
+    /* try to find an unused slot that we can
+     * dup the main fd from */
+    for (i = 0; i < RAW_FD_POOL_SIZE; I++) {
+        if (s->fd_pool_inuse[i] == 0 &&
+            s->fd_pool[i] == -1) {
+            s->fd_pool[i] = dup(s->fd);
+            if (s->fd_pool[i] != -1) {
+                s->fd_pool_inuse[i]++;
+                return s->fd_pool[i];
+            }
+        }
     }
 
-    /* we couldn't dup the file descriptor so just use the main one */
+    /* Could not find an fd in the pool,
+     * use the main fd */
+    s->fd_inuse++;
+
     return s->fd;
 }
 
@@ -482,10 +505,15 @@ static void raw_fd_pool_put(RawAIOCB *acb)
     BDRVRawState *s = acb->common.bs->opaque;
     int i;
 
+    if (s->fd == acb->fd) {
+        s->fd_inuse--;
+        return;
+    }
+
     for (i = 0; i < RAW_FD_POOL_SIZE; i++) {
-        if (s->fd_pool[i] == acb->fd) {
-            close(s->fd_pool[i]);
-            s->fd_pool[i] = -1;
+        if (s->fd_pool_inuse[i] > 0 &&
+            s->fd_pool[i] == acb->fd) {
+            s->fd_pool_inuse[i]--;
         }
     }
 }
@@ -496,6 +524,7 @@ static void posix_aio_read(void *opaque)
     RawAIOCB *acb, **pacb;
     int ret;
     ssize_t len;
+    uint64_t min_generation = ~0ULL;
 
     do {
         char byte;
@@ -538,11 +567,29 @@ static void posix_aio_read(void *opaque)
                 qemu_aio_release(acb);
                 break;
             } else {
+                min_generation = MIN(min_generation,
+                                     acb->flush_generation);
                 pacb = &acb->next;
             }
         }
     }
  the_end: ;
+    pacb = &s->first_aio_flush;
+
+    while (*pacb) {
+        acb = *pacb;
+        if (acb->flush_generation < min_generation)
+            break;
+        else
+            pacb = &acb->next;
+    }
+
+    while (*pacb) {
+        acb = *pacb;
+        *pacb = acb->next;
+        acb->common.cb(acb->common.opaque, 0);
+        qemu_aio_release(acb);
+    }
 }
 
 static int posix_aio_flush(void *opaque)
@@ -595,6 +642,8 @@ static int posix_aio_init(void)
 
     qemu_aio_set_fd_handler(s->rfd, posix_aio_read, NULL, posix_aio_flush, s);
 
+    s->flush_generation = 0;
+
 #if defined(__linux__)
     {
         struct aioinit ai;
@@ -641,6 +690,7 @@ static RawAIOCB *raw_aio_setup(BlockDriverState *bs,
     else
         acb->aiocb.aio_nbytes = nb_sectors * 512;
     acb->aiocb.aio_offset = sector_num * 512;
+    acb->flush_generation = posix_aio_state->flush_generation;
     acb->next = posix_aio_state->first_aio;
     posix_aio_state->first_aio = acb;
     return acb;
@@ -715,6 +765,25 @@ static BlockDriverAIOCB *raw_aio_write(BlockDriverState *bs,
     return &acb->common;
 }
 
+static BlockDriverAIOCB *raw_aio_flush(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BDRVRawState *s = bs->opaque;
+    RawAIOCB *acb;
+
+    acb = qemu_aio_get(bs, cb, opaque);
+    if (!acb)
+        return NULL;
+
+    posix_aio_state->flush_generation++;
+    acb->flush_generation = posix_aio_state->flush_generation;
+
+    acb->next = posix_aio_state->first_aio_flush;
+    posix_aio_state->first_aio_flush = acb;
+
+    return &acb->common;
+}
+
 static void raw_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     int ret;
@@ -889,6 +958,7 @@ BlockDriver bdrv_raw = {
 #ifdef CONFIG_AIO
     .bdrv_aio_read = raw_aio_read,
     .bdrv_aio_write = raw_aio_write,
+    .bdrv_aio_flush = raw_aio_flush,
     .bdrv_aio_cancel = raw_aio_cancel,
     .aiocb_size = sizeof(RawAIOCB),
 #endif
diff --git a/block.c b/block.c
index 0f1734e..110bbe3 100644
--- a/block.c
+++ b/block.c
@@ -48,6 +48,8 @@ static BlockDriverAIOCB *bdrv_aio_read_em(BlockDriverState *bs,
 static BlockDriverAIOCB *bdrv_aio_write_em(BlockDriverState *bs,
         int64_t sector_num, const uint8_t *buf, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque);
+static BlockDriverAIOCB *bdrv_aio_flush_em(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque);
 static void bdrv_aio_cancel_em(BlockDriverAIOCB *acb);
 static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
                         uint8_t *buf, int nb_sectors);
@@ -129,6 +131,7 @@ static void bdrv_register(BlockDriver *bdrv)
         /* add AIO emulation layer */
         bdrv->bdrv_aio_read = bdrv_aio_read_em;
         bdrv->bdrv_aio_write = bdrv_aio_write_em;
+        bdrv->bdrv_aio_flush = bdrv_aio_flush_em;
         bdrv->bdrv_aio_cancel = bdrv_aio_cancel_em;
         bdrv->aiocb_size = sizeof(BlockDriverAIOCBSync);
     } else if (!bdrv->bdrv_read && !bdrv->bdrv_pread) {
@@ -1168,6 +1171,19 @@ BlockDriverAIOCB *bdrv_aio_write(BlockDriverState *bs, int64_t sector_num,
     return ret;
 }
 
+BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
+                                 BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (!drv)
+        return NULL;
+    if (bs->read_only)
+        return NULL;
+
+    return drv->bdrv_aio_flush(bs, cb, opaque);
+}
+
 void bdrv_aio_cancel(BlockDriverAIOCB *acb)
 {
     BlockDriver *drv = acb->bs->drv;
@@ -1218,6 +1234,20 @@ static BlockDriverAIOCB *bdrv_aio_write_em(BlockDriverState *bs,
     return &acb->common;
 }
 
+static BlockDriverAIOCB *bdrv_aio_flush_em(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BlockDriverAIOCBSync *acb;
+
+    acb = qemu_aio_get(bs, cb, opaque);
+    if (!acb->bh)
+        acb->bh = qemu_bh_new(bdrv_aio_bh_cb, acb);
+    bdrv_flush(bs);
+    acb->ret = 0;
+    qemu_bh_schedule(acb->bh);
+    return &acb->common;
+}
+
 static void bdrv_aio_cancel_em(BlockDriverAIOCB *blockacb)
 {
     BlockDriverAIOCBSync *acb = (BlockDriverAIOCBSync *)blockacb;
diff --git a/block.h b/block.h
index 12053d9..3139750 100644
--- a/block.h
+++ b/block.h
@@ -89,6 +89,8 @@ BlockDriverAIOCB *bdrv_aio_read(BlockDriverState *bs, int64_t sector_num,
 BlockDriverAIOCB *bdrv_aio_write(BlockDriverState *bs, int64_t sector_num,
                                  const uint8_t *buf, int nb_sectors,
                                  BlockDriverCompletionFunc *cb, void *opaque);
+BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
+                                 BlockDriverCompletionFunc *cb, void *opaque);
 void bdrv_aio_cancel(BlockDriverAIOCB *acb);
 
 int qemu_key_check(BlockDriverState *bs, const char *name);
diff --git a/block_int.h b/block_int.h
index e83fd2c..a342e18 100644
--- a/block_int.h
+++ b/block_int.h
@@ -54,6 +54,8 @@ struct BlockDriver {
     BlockDriverAIOCB *(*bdrv_aio_write)(BlockDriverState *bs,
         int64_t sector_num, const uint8_t *buf, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque);
+    BlockDriverAIOCB *(*bdrv_aio_flush)(BlockDriverState *bs,
+        BlockDriverCompletionFunc *cb, void *opaque);
     void (*bdrv_aio_cancel)(BlockDriverAIOCB *acb);
     int aiocb_size;
 
diff --git a/hw/ide.c b/hw/ide.c
index 0b672c8..7ed922c 100644
--- a/hw/ide.c
+++ b/hw/ide.c
@@ -2009,6 +2009,13 @@ static void ide_clear_hob(IDEState *ide_if)
     ide_if[1].select &= ~(1 << 7);
 }
 
+static void ide_flush_cb(void *opaque, int ret)
+{
+    IDEState *s = opaque;
+    s->status = READY_STAT | SEEK_STAT;
+    ide_set_irq(s);
+}
+
 static void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
     IDEState *ide_if = opaque;
@@ -2278,9 +2285,7 @@ static void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         case WIN_FLUSH_CACHE:
         case WIN_FLUSH_CACHE_EXT:
             if (s->bs)
-                bdrv_flush(s->bs);
-	    s->status = READY_STAT | SEEK_STAT;
-            ide_set_irq(s);
+                bdrv_aio_flush(s->bs, ide_flush_cb, s);
             break;
         case WIN_STANDBY:
         case WIN_STANDBY2:

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-11 20:17                                 ` Anthony Liguori
@ 2008-11-12 11:42                                   ` Gleb Natapov
  2008-11-12 11:54                                     ` Glauber Costa
  0 siblings, 1 reply; 49+ messages in thread
From: Gleb Natapov @ 2008-11-12 11:42 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Paul Brook, qemu-devel

On Tue, Nov 11, 2008 at 02:17:49PM -0600, Anthony Liguori wrote:
> Gleb Natapov wrote:
>> On Mon, Nov 10, 2008 at 09:46:12AM -0600, Anthony Liguori wrote:
>>   -usbdevice tablet has nothing to do with it. Qemu misses interrupt 
>> even
>> without this option and with SDL screen it misses them in bunches when
>> SDL redraws a screen. In case of vnc qemu misses interrupt because of
>> fsync() call in raw_flush(), or so my instrumentation shows.
>>   
>
> Can you give this patch a spin?
>
Doesn't compile for me. fd_pool_inuse and fd_inuse are used but not
defined.

> This introduces a bdrv_aio_flush() which will wait for all existing AIO  
> operations to complete before indicating completion.  It also fixes up  
> IDE.  Fixing up SCSI will be a little more tricky but not much.  Since  
> we now use O_DSYNC, it's unnecessary to do an fsync (or an fdatasync).
>
> Assuming you're using IDE, this should eliminate any delays from fsync.   
I am using IDE.

> SDL delays are unavoidable because it's going to come down to SDL doing  
> sychronous updates to the X server.  The proper long term solution here  
> would be to put SDL in it's own thread but I'm not too worried about  
And probably time-keeping deserves its own thread. And CPU execution
too.

> this case.  It's a good argument for timer fixups but I think it's  
> important to identify all the reasons we're losing ticks and if we can  
> do something about it, we should.
>
I completely agree that all the cases that can be fixed should be fixed,
but even if qemu itself will be perfect, it is still at mercy of scheduler
to get scheduled often enough. Qemu timers code tries to compensate missed
timer by triggering next one immediately. What I am seen with SDL is
that qemu missed ~10 timers and tries to run all of them at once without
giving CPU chance to handle any of them. As a result all but one of them
are lost.

--
			Gleb.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-12 11:42                                   ` Gleb Natapov
@ 2008-11-12 11:54                                     ` Glauber Costa
  2008-11-12 12:38                                       ` Dor Laor
  0 siblings, 1 reply; 49+ messages in thread
From: Glauber Costa @ 2008-11-12 11:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

On Wed, Nov 12, 2008 at 9:42 AM, Gleb Natapov <gleb@redhat.com> wrote:
> On Tue, Nov 11, 2008 at 02:17:49PM -0600, Anthony Liguori wrote:
>> Gleb Natapov wrote:
>>> On Mon, Nov 10, 2008 at 09:46:12AM -0600, Anthony Liguori wrote:
>>>   -usbdevice tablet has nothing to do with it. Qemu misses interrupt
>>> even
>>> without this option and with SDL screen it misses them in bunches when
>>> SDL redraws a screen. In case of vnc qemu misses interrupt because of
>>> fsync() call in raw_flush(), or so my instrumentation shows.
>>>
>>
>> Can you give this patch a spin?
>>
> Doesn't compile for me. fd_pool_inuse and fd_inuse are used but not
> defined.
>
>> This introduces a bdrv_aio_flush() which will wait for all existing AIO
>> operations to complete before indicating completion.  It also fixes up
>> IDE.  Fixing up SCSI will be a little more tricky but not much.  Since
>> we now use O_DSYNC, it's unnecessary to do an fsync (or an fdatasync).
>>
>> Assuming you're using IDE, this should eliminate any delays from fsync.
> I am using IDE.
>
>> SDL delays are unavoidable because it's going to come down to SDL doing
>> sychronous updates to the X server.  The proper long term solution here
>> would be to put SDL in it's own thread but I'm not too worried about
> And probably time-keeping deserves its own thread. And CPU execution
> too.

It might well be a stupid idea, (would have to benchmark it), but the
other day it occurred to me
that we could keep timekeeping in a separate _process_, with a shm
area, doing timekeeping
for all running guests.


-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-12 11:54                                     ` Glauber Costa
@ 2008-11-12 12:38                                       ` Dor Laor
  0 siblings, 0 replies; 49+ messages in thread
From: Dor Laor @ 2008-11-12 12:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paul Brook

[-- Attachment #1: Type: text/plain, Size: 2243 bytes --]

Glauber Costa wrote:
> On Wed, Nov 12, 2008 at 9:42 AM, Gleb Natapov <gleb@redhat.com> wrote:
>   
>> On Tue, Nov 11, 2008 at 02:17:49PM -0600, Anthony Liguori wrote:
>>     
>>> Gleb Natapov wrote:
>>>       
>>>> On Mon, Nov 10, 2008 at 09:46:12AM -0600, Anthony Liguori wrote:
>>>>   -usbdevice tablet has nothing to do with it. Qemu misses interrupt
>>>> even
>>>> without this option and with SDL screen it misses them in bunches when
>>>> SDL redraws a screen. In case of vnc qemu misses interrupt because of
>>>> fsync() call in raw_flush(), or so my instrumentation shows.
>>>>
>>>>         
>>> Can you give this patch a spin?
>>>
>>>       
>> Doesn't compile for me. fd_pool_inuse and fd_inuse are used but not
>> defined.
>>
>>     
>>> This introduces a bdrv_aio_flush() which will wait for all existing AIO
>>> operations to complete before indicating completion.  It also fixes up
>>> IDE.  Fixing up SCSI will be a little more tricky but not much.  Since
>>> we now use O_DSYNC, it's unnecessary to do an fsync (or an fdatasync).
>>>
>>> Assuming you're using IDE, this should eliminate any delays from fsync.
>>>       
>> I am using IDE.
>>
>>     
>>> SDL delays are unavoidable because it's going to come down to SDL doing
>>> sychronous updates to the X server.  The proper long term solution here
>>> would be to put SDL in it's own thread but I'm not too worried about
>>>       
>> And probably time-keeping deserves its own thread. And CPU execution
>> too.
>>     
>
> It might well be a stupid idea, (would have to benchmark it), but the
> other day it occurred to me
> that we could keep timekeeping in a separate _process_, with a shm
> area, doing timekeeping
> for all running guests.
>
>   
The problem is that the right vcpu should be preempted in order us 
injecting time irq into it.
It will introduce latency (signal/IPI).
This is why the in-kernel pit performs better/ more accurate.

What we can do is add a kernel interface to schedule a timer on a 
specific cpu for preempting the vcpu.
The problem such interface is a bit awkward and Avi feels it's beginning 
to be too complex.

So either we fix it using Gleb's set_irq ack method or just fix the RTC 
using Andrzej suggestion for RTC irq
status bit.

[-- Attachment #2: Type: text/html, Size: 3041 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load.
  2008-11-02 13:04   ` Gleb Natapov
  2008-11-05 12:45     ` Dor Laor
  2008-11-05 16:43     ` Anthony Liguori
@ 2008-11-06  3:41     ` Jamie Lokier
  2 siblings, 0 replies; 49+ messages in thread
From: Jamie Lokier @ 2008-11-06  3:41 UTC (permalink / raw)
  To: qemu-devel

Gleb Natapov wrote:
> >                                                                 How do  
> > Linux guests behave with this?
> Linux guests don't use pit or RTC for time keeping. They are completely
> unaffected by those patches.

Older Linux guests do, and those booted with clock=pit.

-- Jamie

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2008-11-12 12:42 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-29 15:22 [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Gleb Natapov
2008-10-29 15:22 ` [Qemu-devel] [PATCH 1/3] Change qemu_set_irq() to return status information Gleb Natapov
2008-10-29 15:22 ` [Qemu-devel] [PATCH 2/3] Fix time drift problem under high load when PIT is in use Gleb Natapov
2008-10-29 15:22 ` [Qemu-devel] [PATCH 3/3] Fix time drift problem under high load when RTC " Gleb Natapov
2008-11-05 12:46   ` Dor Laor
2008-10-31 19:17 ` [Qemu-devel] [RESEND][PATCH 0/3] Fix guest time drift under heavy load Anthony Liguori
2008-11-02 13:04   ` Gleb Natapov
2008-11-05 12:45     ` Dor Laor
2008-11-05 15:48       ` andrzej zaborowski
2008-11-05 16:33         ` Anthony Liguori
2008-11-06  7:16         ` Gleb Natapov
2008-11-06  9:37           ` andrzej zaborowski
2008-11-06 10:08             ` Gleb Natapov
2008-11-06 13:21               ` andrzej zaborowski
2008-11-06 14:18                 ` Gleb Natapov
2008-11-06 14:35                   ` andrzej zaborowski
2008-11-06 15:04                     ` Gleb Natapov
2008-11-06 15:41                       ` Anthony Liguori
2008-11-07 23:18                       ` andrzej zaborowski
2008-11-08  8:23                         ` Gleb Natapov
2008-11-06 13:44               ` Paul Brook
2008-11-05 17:43       ` Gleb Natapov
2008-11-06 17:28       ` David S. Ahern
2008-11-05 16:43     ` Anthony Liguori
2008-11-06  3:55       ` Jamie Lokier
2008-11-06  8:12       ` Gleb Natapov
2008-11-06 14:10         ` Anthony Liguori
2008-11-06 14:24           ` Paul Brook
2008-11-06 14:40             ` Anthony Liguori
2008-11-06 14:51               ` Gleb Natapov
2008-11-06 15:37                 ` Anthony Liguori
2008-11-08  8:36                   ` Gleb Natapov
2008-11-08 22:14                     ` Dor Laor
2008-11-09  7:40                     ` Gleb Natapov
2008-11-09 16:38                       ` Anthony Liguori
2008-11-09 21:00                         ` Avi Kivity
2008-11-09 16:36                     ` Anthony Liguori
2008-11-10 14:37                       ` Gleb Natapov
2008-11-10 15:24                         ` Anthony Liguori
2008-11-10 15:29                           ` Gleb Natapov
2008-11-10 15:46                             ` Anthony Liguori
2008-11-10 15:51                               ` Gleb Natapov
2008-11-11 14:43                               ` Gleb Natapov
2008-11-11 17:26                                 ` Anthony Liguori
2008-11-11 20:17                                 ` Anthony Liguori
2008-11-12 11:42                                   ` Gleb Natapov
2008-11-12 11:54                                     ` Glauber Costa
2008-11-12 12:38                                       ` Dor Laor
2008-11-06  3:41     ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).