public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* [patch 0/7] force the TSC unreliable by reporting C2 state
@ 2008-06-18 16:42 Marcelo Tosatti
  2008-06-18 16:42 ` [patch 1/7] kvm: qemu: inform valid C2 state in ACPI table Marcelo Tosatti
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

Avi, I don't think this causes such a huge performance regression. NOHZ
makes the frequency of timer reads go down significantly.

As for constant tick guests, well, the impact will be similar to
changing to SMP, since those fallback to ACPI timer anyway now.

The C2 emulation is required by Ubuntu 7.10 for example, which refuses
to process the CST notification.

Stock Linux kernels (as old as 2.6.20) do mark C2 invalid upon the CST
notification.

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 1/7] kvm: qemu: inform valid C2 state in ACPI table
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
@ 2008-06-18 16:42 ` Marcelo Tosatti
  2008-06-18 16:42 ` [patch 2/7] kvm: qemu: disable c2 via _CST notification Marcelo Tosatti
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Marcelo Tosatti

[-- Attachment #1: acpi-fake-c2 --]
[-- Type: text/plain, Size: 3080 bytes --]

Inform C2 state support via ACPI's CST per-processor package.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm-userspace.tip/bios/acpi-dsdt.dsl
===================================================================
--- kvm-userspace.tip.orig/bios/acpi-dsdt.dsl
+++ kvm-userspace.tip/bios/acpi-dsdt.dsl
@@ -33,6 +33,11 @@ DefinitionBlock (
 		PRU, 8,
 		PRD, 8,
 	}
+	OperationRegion(PWNO, SystemIO, 0xb040, 0x02)
+	Field (PWNO, WordAcc, NoLock, WriteAsZeros)
+	{
+		PWC, 16,
+	}
 
 #define gen_processor(nr, name) 				            \
 	Processor (CPU##name, nr, 0x0000b010, 0x06) {                       \
@@ -44,11 +49,23 @@ DefinitionBlock (
             Method (_STA) {                                                 \
                 Return(0xF)                                                 \
             }                                                               \
+            Name(_CST, Package() {                                          \
+                1, Package() {                                              \
+                    ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},  \
+                                        2, 2, 300},                         \
+                })                                                          \
         }                                                                   \
 
 
 
-        Processor (CPU0, 0x00, 0x0000b010, 0x06) {Method (_STA) { Return(0xF)}}
+        Processor (CPU0, 0x00, 0x0000b010, 0x06) {
+                        Method (_STA) { Return(0xF)}
+                        Name(_CST, Package() {
+                        1, Package() {
+                          ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+                                              2, 2, 300},
+                        })
+                  }
 	gen_processor(1, 1)
 	gen_processor(2, 2)
 	gen_processor(3, 3)
@@ -747,7 +764,29 @@ DefinitionBlock (
         Method(_L05) {
             Return(0x01)
         }
+#define gen_power_notify(nr, name)                                            \
+                Store (0xfffff,                                               \
+                       Index (DeRefOf (Index (\_PR.CPU##name._CST, 1)), 2))   \
+                Notify(\_PR.CPU##name, 0x81)                                  \
+
         Method(_L06) {
+            If (And(\_PR.PWC, 0x1)) {
+                gen_power_notify(0, 0)
+                gen_power_notify(1, 1)
+                gen_power_notify(2, 2)
+                gen_power_notify(3, 3)
+                gen_power_notify(4, 4)
+                gen_power_notify(5, 5)
+                gen_power_notify(6, 6)
+                gen_power_notify(7, 7)
+                gen_power_notify(8, 8)
+                gen_power_notify(9, 9)
+                gen_power_notify(10, A)
+                gen_power_notify(11, B)
+                gen_power_notify(12, C)
+                gen_power_notify(13, D)
+                gen_power_notify(14, E)
+            }
             Return(0x01)
         }
         Method(_L07) {

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 2/7] kvm: qemu: disable c2 via _CST notification
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
  2008-06-18 16:42 ` [patch 1/7] kvm: qemu: inform valid C2 state in ACPI table Marcelo Tosatti
@ 2008-06-18 16:42 ` Marcelo Tosatti
  2008-06-18 16:42 ` [patch 3/7] libkvm: in-kernel C2 halt interface Marcelo Tosatti
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Marcelo Tosatti

[-- Attachment #1: acpi-fake-c2-qemu --]
[-- Type: text/plain, Size: 3554 bytes --]

Write an invalid latency value the first time the guest attempts to idle
via P_LVL2 port.

This way the TSC is considered unreliable, and we get away with the
costs relative to APIC timer broadcasts on enter/exit necessary for C1+.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm-userspace.tip/qemu/hw/acpi.c
===================================================================
--- kvm-userspace.tip.orig/qemu/hw/acpi.c
+++ kvm-userspace.tip/qemu/hw/acpi.c
@@ -120,6 +120,29 @@ static void pm_tmr_timer(void *opaque)
     pm_update_sci(s);
 }
 
+/*
+ * Fake C2 emulation, so the OS will consider the TSC unreliable
+ * and fallback to C1 after the latency is updated to a high value
+ * in acpi-dsdt.dsl.
+ */
+static void qemu_system_cpu_power_notify(void);
+static uint32_t pm_ioport_readb(void *opaque, uint32_t addr)
+{
+    addr &= 0x3f;
+    switch (addr) {
+    case 0x14: /* P_LVL2 */
+         qemu_system_cpu_power_notify();
+    }
+#ifdef DEBUG
+    printf("pm_ioport_readb addr=%x\n", addr);
+#endif
+    return 0;
+}
+
+static void pm_ioport_writeb(void *opaque, uint32_t addr, uint32_t val)
+{
+}
+
 static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
 {
     PIIX4PMState *s = opaque;
@@ -419,6 +442,8 @@ static void pm_io_space_update(PIIX4PMSt
 #if defined(DEBUG)
         printf("PM: mapping to 0x%x\n", pm_io_base);
 #endif
+        register_ioport_write(pm_io_base, 64, 1, pm_ioport_writeb, s);
+        register_ioport_read(pm_io_base, 64, 1, pm_ioport_readb, s);
         register_ioport_write(pm_io_base, 64, 2, pm_ioport_writew, s);
         register_ioport_read(pm_io_base, 64, 2, pm_ioport_readw, s);
         register_ioport_write(pm_io_base, 64, 4, pm_ioport_writel, s);
@@ -537,6 +562,7 @@ void qemu_system_powerdown(void)
 }
 #endif
 #define GPE_BASE 0xafe0
+#define POWER_GPE_BASE 0xb040
 #define PROC_BASE 0xaf00
 #define PCI_BASE 0xae00
 #define PCI_EJ_BASE 0xae08
@@ -553,7 +579,12 @@ struct pci_status {
     uint32_t down;
 };
 
+struct power_gpe_regs {
+    uint8_t disable;
+};
+
 static struct gpe_regs gpe;
+static struct power_gpe_regs power_gpe;
 static struct pci_status pci0_status;
 
 static uint32_t gpe_readb(void *opaque, uint32_t addr)
@@ -622,6 +653,23 @@ static void gpe_writeb(void *opaque, uin
 #endif
 }
 
+static uint32_t cpu_power_read(void *opaque, uint32_t addr)
+{
+    struct power_gpe_regs *p = opaque;
+
+#if defined(DEBUG)
+    printf("cpu power read %lx == %lx\n", addr, p->disable);
+#endif
+    return p->disable;
+}
+
+static void cpu_power_write(void *opaque, uint32_t addr, uint32_t val)
+{
+#if defined(DEBUG)
+    printf("cpu power write %lx <== %lx\n", addr, val);
+#endif
+}
+
 static uint32_t pcihotplug_read(void *opaque, uint32_t addr)
 {
     uint32_t val = 0;
@@ -695,6 +743,9 @@ void qemu_system_hot_add_init(const char
     register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, NULL);
     register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, NULL);
 
+    register_ioport_write(POWER_GPE_BASE, 4, 2, cpu_power_write, &power_gpe);
+    register_ioport_read(POWER_GPE_BASE, 4, 2, cpu_power_read, &power_gpe);
+
     model = cpu_model;
 }
 
@@ -737,6 +788,15 @@ void qemu_system_cpu_hot_add(int cpu, in
         disable_processor(&gpe, cpu);
     qemu_set_irq(pm_state->irq, 0);
 }
+
+static void qemu_system_cpu_power_notify(void)
+{
+    power_gpe.disable = 1;
+
+    qemu_set_irq(pm_state->irq, 1);
+    qemu_set_irq(pm_state->irq, 0);
+}
+
 #endif
 
 static void enable_device(struct pci_status *p, struct gpe_regs *g, int slot)

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 3/7] libkvm: in-kernel C2 halt interface
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
  2008-06-18 16:42 ` [patch 1/7] kvm: qemu: inform valid C2 state in ACPI table Marcelo Tosatti
  2008-06-18 16:42 ` [patch 2/7] kvm: qemu: disable c2 via _CST notification Marcelo Tosatti
@ 2008-06-18 16:42 ` Marcelo Tosatti
  2008-06-18 16:42 ` [patch 4/7] libkvm: handle_io return handler value Marcelo Tosatti
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Marcelo Tosatti

[-- Attachment #1: libkvm-acpi-c2 --]
[-- Type: text/plain, Size: 1160 bytes --]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm-userspace.tip/libkvm/libkvm.c
===================================================================
--- kvm-userspace.tip.orig/libkvm/libkvm.c
+++ kvm-userspace.tip/libkvm/libkvm.c
@@ -827,6 +827,18 @@ int kvm_set_mpstate(kvm_context_t kvm, i
 }
 #endif
 
+#ifdef KVM_CAP_ACPI_C2
+int kvm_enable_acpi_c2(kvm_context_t kvm)
+{
+    int r;
+
+    r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_ACPI_C2);
+    if (r > 0)
+        return ioctl(kvm->vm_fd, KVM_ENABLE_ACPI_C2);
+    return -ENOSYS;
+}
+#endif
+
 static int handle_mmio(kvm_context_t kvm, struct kvm_run *kvm_run)
 {
 	unsigned long addr = kvm_run->mmio.phys_addr;
Index: kvm-userspace.tip/libkvm/libkvm.h
===================================================================
--- kvm-userspace.tip.orig/libkvm/libkvm.h
+++ kvm-userspace.tip/libkvm/libkvm.h
@@ -325,6 +325,14 @@ static inline int kvm_reset_mpstate(kvm_
 }
 #endif
 
+#ifdef KVM_CAP_ACPI_C2
+/*!
+ * \brief Enable in-kernel ACPI C2 emulation.
+ *
+ */
+int kvm_enable_acpi_c2(kvm_context_t kvm);
+#endif
+
 /*!
  * \brief Simulate an external vectored interrupt
  *

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 4/7] libkvm: handle_io return handler value
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2008-06-18 16:42 ` [patch 3/7] libkvm: in-kernel C2 halt interface Marcelo Tosatti
@ 2008-06-18 16:42 ` Marcelo Tosatti
  2008-06-18 16:42 ` [patch 5/7] qemu: kvm: unhalt vcpu0 on pit irq Marcelo Tosatti
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Marcelo Tosatti

[-- Attachment #1: libkvm-io-return --]
[-- Type: text/plain, Size: 524 bytes --]

This makes it possible for PIO handlers (such as the ACPI PLVL2 one) force
the current vcpu to main_loop().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm-userspace.tip/libkvm/libkvm.c
===================================================================
--- kvm-userspace.tip.orig/libkvm/libkvm.c
+++ kvm-userspace.tip/libkvm/libkvm.c
@@ -767,7 +767,7 @@ static int handle_io(kvm_context_t kvm, 
 		p += run->io.size;
 	}
 
-	return 0;
+	return r;
 }
 
 int handle_debug(kvm_context_t kvm, int vcpu)

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 5/7] qemu: kvm: unhalt vcpu0 on pit irq
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2008-06-18 16:42 ` [patch 4/7] libkvm: handle_io return handler value Marcelo Tosatti
@ 2008-06-18 16:42 ` Marcelo Tosatti
  2008-06-18 16:42 ` [patch 6/7] kvm: qemu: enable in-kernel C2 emulation / userspace emulation Marcelo Tosatti
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Marcelo Tosatti

[-- Attachment #1: pic-unhalt-cpu0 --]
[-- Type: text/plain, Size: 1990 bytes --]

Since the introduction of the IO thread host_alarm_handler() fails to
wakeup vcpu0 when a timer is triggered. This results in failure to take
vcpu0 out of halt in C2 emulation.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm-userspace.tip/qemu/hw/i8259.c
===================================================================
--- kvm-userspace.tip.orig/qemu/hw/i8259.c
+++ kvm-userspace.tip/qemu/hw/i8259.c
@@ -83,6 +83,8 @@ static inline void pic_set_irq1(PicState
         if (level) {
             s->irr |= mask;
             s->last_irr |= mask;
+            if (kvm_enabled())
+                kvm_unhalt_cpu0();
         } else {
             s->irr &= ~mask;
             s->last_irr &= ~mask;
@@ -93,6 +95,8 @@ static inline void pic_set_irq1(PicState
             if ((s->last_irr & mask) == 0)
                 s->irr |= mask;
             s->last_irr |= mask;
+            if (kvm_enabled())
+                kvm_unhalt_cpu0();
         } else {
             s->last_irr &= ~mask;
         }
Index: kvm-userspace.tip/qemu/qemu-kvm.c
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm.c
+++ kvm-userspace.tip/qemu/qemu-kvm.c
@@ -128,6 +128,12 @@ static void on_vcpu(CPUState *env, void 
         qemu_cond_wait(&qemu_work_cond);
 }
 
+void kvm_unhalt_cpu0(void)
+{
+    if (vcpu_info[0].thread)
+        vcpu_info[0].env->halted = 0;
+}
+
 void kvm_update_interrupt_request(CPUState *env)
 {
     int signal = 0;
Index: kvm-userspace.tip/qemu/qemu-kvm.h
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm.h
+++ kvm-userspace.tip/qemu/qemu-kvm.h
@@ -30,6 +30,7 @@ int kvm_qemu_init_env(CPUState *env);
 int kvm_qemu_check_extension(int ext);
 void kvm_apic_init(CPUState *env);
 int kvm_set_irq(int irq, int level);
+void kvm_unhalt_cpu0(void);
 
 int kvm_physical_memory_set_dirty_tracking(int enable);
 int kvm_update_dirty_pages_log(void);

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 6/7] kvm: qemu: enable in-kernel C2 emulation / userspace emulation
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2008-06-18 16:42 ` [patch 5/7] qemu: kvm: unhalt vcpu0 on pit irq Marcelo Tosatti
@ 2008-06-18 16:42 ` Marcelo Tosatti
  2008-06-18 16:42 ` [patch 7/7] KVM: in-kernel ACPI C2 idle emulation Marcelo Tosatti
  2008-06-18 20:09 ` [patch 0/7] force the TSC unreliable by reporting C2 state Anthony Liguori
  7 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Marcelo Tosatti

[-- Attachment #1: inkernel-acpic2-enable --]
[-- Type: text/plain, Size: 3414 bytes --]

Enable the in-kernel C2 emulation after the _CST notification has been
triggered.

Emulate in userspace for -no-kvm-irqchip case.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm-userspace.tip/qemu/hw/acpi.c
===================================================================
--- kvm-userspace.tip.orig/qemu/hw/acpi.c
+++ kvm-userspace.tip/qemu/hw/acpi.c
@@ -124,6 +124,7 @@ static void pm_tmr_timer(void *opaque)
  * Fake C2 emulation, so the OS will consider the TSC unreliable
  * and fallback to C1 after the latency is updated to a high value
  * in acpi-dsdt.dsl.
+ * We still emulate C2 halt in case the OS ignores the _CST notification.
  */
 static void qemu_system_cpu_power_notify(void);
 static uint32_t pm_ioport_readb(void *opaque, uint32_t addr)
@@ -131,7 +132,12 @@ static uint32_t pm_ioport_readb(void *op
     addr &= 0x3f;
     switch (addr) {
     case 0x14: /* P_LVL2 */
-         qemu_system_cpu_power_notify();
+        if (!cst_notified)
+            qemu_system_cpu_power_notify();
+            cst_notified = 1;
+         } else {
+            qemu_kvm_handle_plvl2_read();
+         }
     }
 #ifdef DEBUG
     printf("pm_ioport_readb addr=%x\n", addr);
@@ -791,10 +797,12 @@ void qemu_system_cpu_hot_add(int cpu, in
 
 static void qemu_system_cpu_power_notify(void)
 {
-    power_gpe.disable = 1;
+    if (kvm_irqchip_in_kernel(kvm_context)) {
+        power_gpe.disable = 1;
 
-    qemu_set_irq(pm_state->irq, 1);
-    qemu_set_irq(pm_state->irq, 0);
+        qemu_set_irq(pm_state->irq, 1);
+        qemu_set_irq(pm_state->irq, 0);
+    }
 }
 
 #endif
Index: kvm-userspace.tip/qemu/qemu-kvm-x86.c
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm-x86.c
+++ kvm-userspace.tip/qemu/qemu-kvm-x86.c
@@ -287,6 +287,21 @@ void kvm_load_mpstate(CPUState *env)
 #endif
 }
 
+uint32_t qemu_kvm_handle_plvl2_read(void)
+{
+    CPUState *env = cpu_single_env;
+    uint32_t ret;
+
+    if (kvm_irqchip_in_kernel(kvm_context)) {
+        kvm_enable_acpi_c2(kvm_context);
+        ret = 0;
+    } else {
+        kvm_arch_halt(NULL, env->cpu_index);
+        ret = 1;
+    }
+    return ret;
+}
+
 void kvm_arch_save_regs(CPUState *env)
 {
     struct kvm_regs regs;
Index: kvm-userspace.tip/qemu/qemu-kvm.h
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm.h
+++ kvm-userspace.tip/qemu/qemu-kvm.h
@@ -88,6 +88,9 @@ void qemu_kvm_system_reset_request(void)
 int handle_powerpc_dcr_read(int vcpu, uint32_t dcrn, uint32_t *data);
 int handle_powerpc_dcr_write(int vcpu,uint32_t dcrn, uint32_t data);
 #endif
+#if defined(TARGET_I386) || defined(TARGET_X86_64) || defined (TARGET_IA64)
+uint32_t qemu_kvm_handle_plvl2_read(void);
+#endif
 
 #if !defined(SYS_signalfd)
 struct signalfd_siginfo {
Index: kvm-userspace.tip/qemu/qemu-kvm.c
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm.c
+++ kvm-userspace.tip/qemu/qemu-kvm.c
@@ -619,6 +619,12 @@ static int kvm_debug(void *opaque, int v
 static int kvm_inb(void *opaque, uint16_t addr, uint8_t *data)
 {
     *data = cpu_inb(0, addr);
+    /*
+     * reads from the ACPI LVL2 port are similar to HLT emulation,
+     * so make current thread go sleep in main_loop().
+     */
+    if (addr == 0xb014 && *data)
+        return 1;
     return 0;
 }
 

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 7/7] KVM: in-kernel ACPI C2 idle emulation
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2008-06-18 16:42 ` [patch 6/7] kvm: qemu: enable in-kernel C2 emulation / userspace emulation Marcelo Tosatti
@ 2008-06-18 16:42 ` Marcelo Tosatti
  2008-06-23  3:01   ` Avi Kivity
  2008-06-18 20:09 ` [patch 0/7] force the TSC unreliable by reporting C2 state Anthony Liguori
  7 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 16:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, Marcelo Tosatti

[-- Attachment #1: acpi-c2-emul --]
[-- Type: text/plain, Size: 11633 bytes --]

Some guests fail to process the _CST notification which invalidates the C2 state.

Emulate C2 similarly to HLT for those cases.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: kvm/arch/x86/kvm/Makefile
===================================================================
--- kvm.orig/arch/x86/kvm/Makefile
+++ kvm/arch/x86/kvm/Makefile
@@ -11,7 +11,7 @@ endif
 EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
 
 kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
-	i8254.o
+	i8254.o acpi.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
Index: kvm/arch/x86/kvm/acpi.c
===================================================================
--- /dev/null
+++ kvm/arch/x86/kvm/acpi.c
@@ -0,0 +1,82 @@
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/acpi_pmtmr.h>
+#include <asm/io.h>
+#include "iodev.h"
+#include "irq.h"
+
+/*
+ * P_LVL2 = PBLK + 4h
+ *
+ * Note: matches BIOS (currently hardcoded) definition.
+ */
+#define ACPI_PBLK 0xb010
+
+struct kvm_acpi {
+	struct kvm_io_device dev;
+	struct kvm *kvm;
+};
+
+static void acpi_cstate_halt(struct kvm_vcpu *vcpu)
+{
+	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
+	mutex_unlock(&vcpu->kvm->lock);
+	up_read(&vcpu->kvm->slots_lock);
+	kvm_vcpu_block(vcpu);
+	down_read(&vcpu->kvm->slots_lock);
+	mutex_lock(&vcpu->kvm->lock);
+	vcpu->arch.pio.size = 0;
+}
+
+static void acpi_ioport_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
+			     gpa_t addr, int len, void *data)
+{
+
+	if (addr == ACPI_PBLK + 0x4)
+		acpi_cstate_halt(vcpu);
+
+	return;
+}
+
+static void acpi_ioport_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
+			      gpa_t addr, int len, const void *data)
+{
+	return;
+}
+
+static int acpi_in_range(struct kvm_io_device *this, gpa_t addr, int len,
+			 int is_write)
+{
+	struct kvm_acpi *acpi = (struct kvm_acpi *)this->private;
+
+	if (!irqchip_in_kernel(acpi->kvm))
+		return 0;
+
+	return (addr == ACPI_PBLK + 0x4);
+}
+
+void kvm_acpi_free(struct kvm_io_device *this)
+{
+	struct kvm_acpi *acpi = (struct kvm_acpi *)this->private;
+
+	kfree(acpi);
+}
+
+int kvm_acpi_init(struct kvm *kvm)
+{
+	struct kvm_acpi *acpi;
+
+	acpi = kzalloc(sizeof(struct kvm_acpi), GFP_KERNEL);
+	if (!acpi)
+		return -ENOMEM;
+
+	acpi->dev.read = acpi_ioport_read;
+	acpi->dev.write = acpi_ioport_write;
+	acpi->dev.in_range = acpi_in_range;
+	acpi->dev.private = acpi;
+	acpi->kvm = kvm;
+	acpi->dev.destructor = kvm_acpi_free;
+	kvm_io_bus_register_dev(&kvm->pio_bus, &acpi->dev);
+
+	return 0;
+}
Index: kvm/arch/x86/kvm/i8254.c
===================================================================
--- kvm.orig/arch/x86/kvm/i8254.c
+++ kvm/arch/x86/kvm/i8254.c
@@ -320,7 +320,7 @@ void kvm_pit_load_count(struct kvm *kvm,
 	mutex_unlock(&kvm->arch.vpit->pit_state.lock);
 }
 
-static void pit_ioport_write(struct kvm_io_device *this,
+static void pit_ioport_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
 			     gpa_t addr, int len, const void *data)
 {
 	struct kvm_pit *pit = (struct kvm_pit *)this->private;
@@ -393,7 +393,7 @@ static void pit_ioport_write(struct kvm_
 	mutex_unlock(&pit_state->lock);
 }
 
-static void pit_ioport_read(struct kvm_io_device *this,
+static void pit_ioport_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
 			    gpa_t addr, int len, void *data)
 {
 	struct kvm_pit *pit = (struct kvm_pit *)this->private;
@@ -464,8 +464,9 @@ static int pit_in_range(struct kvm_io_de
 		(addr < KVM_PIT_BASE_ADDRESS + KVM_PIT_MEM_LENGTH));
 }
 
-static void speaker_ioport_write(struct kvm_io_device *this,
-				 gpa_t addr, int len, const void *data)
+static void speaker_ioport_write(struct kvm_vcpu *vcpu,
+				 struct kvm_io_device *this, gpa_t addr,
+				 int len, const void *data)
 {
 	struct kvm_pit *pit = (struct kvm_pit *)this->private;
 	struct kvm_kpit_state *pit_state = &pit->pit_state;
@@ -478,7 +479,8 @@ static void speaker_ioport_write(struct 
 	mutex_unlock(&pit_state->lock);
 }
 
-static void speaker_ioport_read(struct kvm_io_device *this,
+static void speaker_ioport_read(struct kvm_vcpu *vcpu,
+				struct kvm_io_device *this,
 				gpa_t addr, int len, void *data)
 {
 	struct kvm_pit *pit = (struct kvm_pit *)this->private;
Index: kvm/arch/x86/kvm/i8259.c
===================================================================
--- kvm.orig/arch/x86/kvm/i8259.c
+++ kvm/arch/x86/kvm/i8259.c
@@ -362,7 +362,7 @@ static int picdev_in_range(struct kvm_io
 	}
 }
 
-static void picdev_write(struct kvm_io_device *this,
+static void picdev_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
 			 gpa_t addr, int len, const void *val)
 {
 	struct kvm_pic *s = this->private;
@@ -387,7 +387,7 @@ static void picdev_write(struct kvm_io_d
 	}
 }
 
-static void picdev_read(struct kvm_io_device *this,
+static void picdev_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
 			gpa_t addr, int len, void *val)
 {
 	struct kvm_pic *s = this->private;
Index: kvm/arch/x86/kvm/lapic.c
===================================================================
--- kvm.orig/arch/x86/kvm/lapic.c
+++ kvm/arch/x86/kvm/lapic.c
@@ -600,7 +600,7 @@ static u32 __apic_read(struct kvm_lapic 
 	return val;
 }
 
-static void apic_mmio_read(struct kvm_io_device *this,
+static void apic_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
 			   gpa_t address, int len, void *data)
 {
 	struct kvm_lapic *apic = (struct kvm_lapic *)this->private;
@@ -669,7 +669,7 @@ static void start_apic_timer(struct kvm_
 					apic->timer.period)));
 }
 
-static void apic_mmio_write(struct kvm_io_device *this,
+static void apic_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
 			    gpa_t address, int len, const void *data)
 {
 	struct kvm_lapic *apic = (struct kvm_lapic *)this->private;
Index: kvm/arch/x86/kvm/x86.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -19,6 +19,7 @@
 #include "mmu.h"
 #include "i8254.h"
 #include "tss.h"
+#include "acpi.h"
 
 #include <linux/clocksource.h>
 #include <linux/kvm.h>
@@ -834,6 +835,7 @@ int kvm_dev_ioctl_check_extension(long e
 	case KVM_CAP_PIT:
 	case KVM_CAP_NOP_IO_DELAY:
 	case KVM_CAP_MP_STATE:
+	case KVM_CAP_ACPI_C2:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -1725,6 +1727,10 @@ long kvm_arch_vm_ioctl(struct file *filp
 		r = 0;
 		break;
 	}
+	case KVM_ENABLE_ACPI_C2: {
+		r = kvm_acpi_init(kvm);
+		break;
+	}
 	default:
 		;
 	}
@@ -1844,7 +1850,7 @@ mmio:
 	mutex_lock(&vcpu->kvm->lock);
 	mmio_dev = vcpu_find_mmio_dev(vcpu, gpa, bytes, 0);
 	if (mmio_dev) {
-		kvm_iodevice_read(mmio_dev, gpa, bytes, val);
+		kvm_iodevice_read(vcpu, mmio_dev, gpa, bytes, val);
 		mutex_unlock(&vcpu->kvm->lock);
 		return X86EMUL_CONTINUE;
 	}
@@ -1899,7 +1905,7 @@ mmio:
 	mutex_lock(&vcpu->kvm->lock);
 	mmio_dev = vcpu_find_mmio_dev(vcpu, gpa, bytes, 1);
 	if (mmio_dev) {
-		kvm_iodevice_write(mmio_dev, gpa, bytes, val);
+		kvm_iodevice_write(vcpu, mmio_dev, gpa, bytes, val);
 		mutex_unlock(&vcpu->kvm->lock);
 		return X86EMUL_CONTINUE;
 	}
@@ -2244,11 +2250,11 @@ static void kernel_pio(struct kvm_io_dev
 
 	mutex_lock(&vcpu->kvm->lock);
 	if (vcpu->arch.pio.in)
-		kvm_iodevice_read(pio_dev, vcpu->arch.pio.port,
+		kvm_iodevice_read(vcpu, pio_dev, vcpu->arch.pio.port,
 				  vcpu->arch.pio.size,
 				  pd);
 	else
-		kvm_iodevice_write(pio_dev, vcpu->arch.pio.port,
+		kvm_iodevice_write(vcpu, pio_dev, vcpu->arch.pio.port,
 				   vcpu->arch.pio.size,
 				   pd);
 	mutex_unlock(&vcpu->kvm->lock);
@@ -2263,7 +2269,7 @@ static void pio_string_write(struct kvm_
 
 	mutex_lock(&vcpu->kvm->lock);
 	for (i = 0; i < io->cur_count; i++) {
-		kvm_iodevice_write(pio_dev, io->port,
+		kvm_iodevice_write(vcpu, pio_dev, io->port,
 				   io->size,
 				   pd);
 		pd += io->size;
Index: kvm/include/linux/kvm.h
===================================================================
--- kvm.orig/include/linux/kvm.h
+++ kvm/include/linux/kvm.h
@@ -371,6 +371,7 @@ struct kvm_trace_rec {
 #define KVM_CAP_PV_MMU 13
 #define KVM_CAP_MP_STATE 14
 #define KVM_CAP_COALESCED_MMIO 15
+#define KVM_CAP_ACPI_C2 16
 
 /*
  * ioctls for VM fds
@@ -400,6 +401,7 @@ struct kvm_trace_rec {
 			_IOW(KVMIO,  0x67, struct kvm_coalesced_mmio_zone)
 #define KVM_UNREGISTER_COALESCED_MMIO \
 			_IOW(KVMIO,  0x68, struct kvm_coalesced_mmio_zone)
+#define KVM_ENABLE_ACPI_C2	  _IO(KVMIO, 0x69)
 
 /*
  * ioctls for vcpu fds
Index: kvm/virt/kvm/iodev.h
===================================================================
--- kvm.orig/virt/kvm/iodev.h
+++ kvm/virt/kvm/iodev.h
@@ -16,14 +16,17 @@
 #ifndef __KVM_IODEV_H__
 #define __KVM_IODEV_H__
 
+#include <linux/kvm_host.h>
 #include <linux/kvm_types.h>
 
 struct kvm_io_device {
-	void (*read)(struct kvm_io_device *this,
+	void (*read)(struct kvm_vcpu *vcpu,
+		     struct kvm_io_device *this,
 		     gpa_t addr,
 		     int len,
 		     void *val);
-	void (*write)(struct kvm_io_device *this,
+	void (*write)(struct kvm_vcpu *vcpu,
+		      struct kvm_io_device *this,
 		      gpa_t addr,
 		      int len,
 		      const void *val);
@@ -34,20 +37,22 @@ struct kvm_io_device {
 	void             *private;
 };
 
-static inline void kvm_iodevice_read(struct kvm_io_device *dev,
+static inline void kvm_iodevice_read(struct kvm_vcpu *vcpu,
+				     struct kvm_io_device *dev,
 				     gpa_t addr,
 				     int len,
 				     void *val)
 {
-	dev->read(dev, addr, len, val);
+	dev->read(vcpu, dev, addr, len, val);
 }
 
-static inline void kvm_iodevice_write(struct kvm_io_device *dev,
+static inline void kvm_iodevice_write(struct kvm_vcpu *vcpu,
+				      struct kvm_io_device *dev,
 				      gpa_t addr,
 				      int len,
 				      const void *val)
 {
-	dev->write(dev, addr, len, val);
+	dev->write(vcpu, dev, addr, len, val);
 }
 
 static inline int kvm_iodevice_inrange(struct kvm_io_device *dev,
Index: kvm/arch/x86/kvm/acpi.h
===================================================================
--- /dev/null
+++ kvm/arch/x86/kvm/acpi.h
@@ -0,0 +1,6 @@
+#ifndef __KVM_ACPI_H
+#define __KVM_ACPI_H
+
+int kvm_acpi_init(struct kvm *kvm);
+
+#endif
Index: kvm/virt/kvm/ioapic.c
===================================================================
--- kvm.orig/virt/kvm/ioapic.c
+++ kvm/virt/kvm/ioapic.c
@@ -325,8 +325,8 @@ static int ioapic_in_range(struct kvm_io
 		 (addr < ioapic->base_address + IOAPIC_MEM_LENGTH)));
 }
 
-static void ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
-			     void *val)
+static void ioapic_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
+			     gpa_t addr, int len, void *val)
 {
 	struct kvm_ioapic *ioapic = (struct kvm_ioapic *)this->private;
 	u32 result;
@@ -362,8 +362,8 @@ static void ioapic_mmio_read(struct kvm_
 	}
 }
 
-static void ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
-			      const void *val)
+static void ioapic_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
+			      gpa_t addr, int len, const void *val)
 {
 	struct kvm_ioapic *ioapic = (struct kvm_ioapic *)this->private;
 	u32 data;
Index: kvm/virt/kvm/coalesced_mmio.c
===================================================================
--- kvm.orig/virt/kvm/coalesced_mmio.c
+++ kvm/virt/kvm/coalesced_mmio.c
@@ -60,7 +60,7 @@ static int coalesced_mmio_in_range(struc
 	return 0;
 }
 
-static void coalesced_mmio_write(struct kvm_io_device *this,
+static void coalesced_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
 				 gpa_t addr, int len, const void *val)
 {
 	struct kvm_coalesced_mmio_dev *dev =

-- 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2008-06-18 16:42 ` [patch 7/7] KVM: in-kernel ACPI C2 idle emulation Marcelo Tosatti
@ 2008-06-18 20:09 ` Anthony Liguori
  2008-06-18 20:40   ` Marcelo Tosatti
  7 siblings, 1 reply; 18+ messages in thread
From: Anthony Liguori @ 2008-06-18 20:09 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm

Marcelo Tosatti wrote:
> Avi, I don't think this causes such a huge performance regression. NOHZ
> makes the frequency of timer reads go down significantly.
>   

Have we yet determined why the TSC is so unstable in the first place?  
In theory, it should be relatively stable on single-node Intel and 
Barcelona chips.

Regards,

Anthony Liguori

> As for constant tick guests, well, the impact will be similar to
> changing to SMP, since those fallback to ACPI timer anyway now.
>
> The C2 emulation is required by Ubuntu 7.10 for example, which refuses
> to process the CST notification.
>
> Stock Linux kernels (as old as 2.6.20) do mark C2 invalid upon the CST
> notification.
>
>   


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 20:09 ` [patch 0/7] force the TSC unreliable by reporting C2 state Anthony Liguori
@ 2008-06-18 20:40   ` Marcelo Tosatti
  2008-06-18 21:02     ` Anthony Liguori
  0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 20:40 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm

On Wed, Jun 18, 2008 at 03:09:41PM -0500, Anthony Liguori wrote:
> Marcelo Tosatti wrote:
>> Avi, I don't think this causes such a huge performance regression. NOHZ
>> makes the frequency of timer reads go down significantly.
>>   
>
> Have we yet determined why the TSC is so unstable in the first place?   
> In theory, it should be relatively stable on single-node Intel and  
> Barcelona chips.

If the host enters C2/C3, or changes CPU frequency, it becomes
unreliable as a clocksource and there's no guarantee the guest will
detect that.

Also, as mentioned earlier, large systems with clustered APIC have
unstable TSC.

We _could_ hook this fake-C2-state thing to the host TSC reliability:

1) Hook into Linux's mark_tsc_unstable().
2) On migration check if the destination host is using the TSC, if not, 
force a faked-C2-state.

Problem with 2) is that not all guests honour the ACPI _CST package
notification (which would change C2's latency time from an unusable
value to something usable). And now I don't think assuming the _CST
notification to work is a good thing (after we found out that for ex.
Ubuntu 7.10 kernel ignores it).


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 20:40   ` Marcelo Tosatti
@ 2008-06-18 21:02     ` Anthony Liguori
  2008-06-18 21:21       ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: Anthony Liguori @ 2008-06-18 21:02 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm

Marcelo Tosatti wrote:
> On Wed, Jun 18, 2008 at 03:09:41PM -0500, Anthony Liguori wrote:
>   
>> Marcelo Tosatti wrote:
>>     
>>> Avi, I don't think this causes such a huge performance regression. NOHZ
>>> makes the frequency of timer reads go down significantly.
>>>   
>>>       
>> Have we yet determined why the TSC is so unstable in the first place?   
>> In theory, it should be relatively stable on single-node Intel and  
>> Barcelona chips.
>>     
>
> If the host enters C2/C3, or changes CPU frequency, it becomes
> unreliable as a clocksource and there's no guarantee the guest will
> detect that.
>   

On Intel, the TSC should be fixed-frequency for basically all shipping 
processors supporting VT.  Starting with 10h (Barcelona), I believe AMD 
also has a fixed frequency TSC.

> Also, as mentioned earlier, large systems with clustered APIC have
> unstable TSC.
>   

Right, that's why I qualified with single-node.

> We _could_ hook this fake-C2-state thing to the host TSC reliability:
>
> 1) Hook into Linux's mark_tsc_unstable().
> 2) On migration check if the destination host is using the TSC, if not, 
> force a faked-C2-state.
>
> Problem with 2) is that not all guests honour the ACPI _CST package
> notification (which would change C2's latency time from an unusable
> value to something usable). And now I don't think assuming the _CST
> notification to work is a good thing (after we found out that for ex.
> Ubuntu 7.10 kernel ignores it).
>   

I think that for hosts with a known unstable TSC, we should do something 
like this.  But I also think we have a bug with TSC synchronization for 
AMD although I don't at all know what the source of it is.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 21:02     ` Anthony Liguori
@ 2008-06-18 21:21       ` Marcelo Tosatti
  2008-06-18 21:42         ` Anthony Liguori
  0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 21:21 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm

On Wed, Jun 18, 2008 at 04:02:39PM -0500, Anthony Liguori wrote:
>>> Have we yet determined why the TSC is so unstable in the first place? 
>>>   In theory, it should be relatively stable on single-node Intel and  
>>> Barcelona chips.
>>>     
>>
>> If the host enters C2/C3, or changes CPU frequency, it becomes
>> unreliable as a clocksource and there's no guarantee the guest will
>> detect that.
>>   
>
> On Intel, the TSC should be fixed-frequency for basically all shipping  
> processors supporting VT.  Starting with 10h (Barcelona), I believe AMD  
> also has a fixed frequency TSC.

But still stops ticking in C2/C3 state, I suppose?

>> Also, as mentioned earlier, large systems with clustered APIC have
>> unstable TSC.
>>   
>
> Right, that's why I qualified with single-node.
>
>> We _could_ hook this fake-C2-state thing to the host TSC reliability:
>>
>> 1) Hook into Linux's mark_tsc_unstable().
>> 2) On migration check if the destination host is using the TSC, if not, 
>> force a faked-C2-state.
>>
>> Problem with 2) is that not all guests honour the ACPI _CST package
>> notification (which would change C2's latency time from an unusable
>> value to something usable). And now I don't think assuming the _CST
>> notification to work is a good thing (after we found out that for ex.
>> Ubuntu 7.10 kernel ignores it).
>>   
>
> I think that for hosts with a known unstable TSC, we should do something  
> like this.  But I also think we have a bug with TSC synchronization for  
> AMD although I don't at all know what the source of it is.

Chris has some patches around, I don't remember the details either.

Thanks


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 21:21       ` Marcelo Tosatti
@ 2008-06-18 21:42         ` Anthony Liguori
  2008-06-18 22:41           ` Marcelo Tosatti
  0 siblings, 1 reply; 18+ messages in thread
From: Anthony Liguori @ 2008-06-18 21:42 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm

Marcelo Tosatti wrote:
> On Wed, Jun 18, 2008 at 04:02:39PM -0500, Anthony Liguori wrote:
>   
>>>> Have we yet determined why the TSC is so unstable in the first place? 
>>>>   In theory, it should be relatively stable on single-node Intel and  
>>>> Barcelona chips.
>>>>     
>>>>         
>>> If the host enters C2/C3, or changes CPU frequency, it becomes
>>> unreliable as a clocksource and there's no guarantee the guest will
>>> detect that.
>>>   
>>>       
>> On Intel, the TSC should be fixed-frequency for basically all shipping  
>> processors supporting VT.  Starting with 10h (Barcelona), I believe AMD  
>> also has a fixed frequency TSC.
>>     
>
> But still stops ticking in C2/C3 state, I suppose?
>   

I don't know for sure but the TSC is not tied to the CPU clock so I 
would be surprised if it did.  I think that that would defeat the 
utility of a fixed-frequency TSC.

Regards,

Anthony Liguori

>>> Also, as mentioned earlier, large systems with clustered APIC have
>>> unstable TSC.
>>>   
>>>       
>> Right, that's why I qualified with single-node.
>>
>>     
>>> We _could_ hook this fake-C2-state thing to the host TSC reliability:
>>>
>>> 1) Hook into Linux's mark_tsc_unstable().
>>> 2) On migration check if the destination host is using the TSC, if not, 
>>> force a faked-C2-state.
>>>
>>> Problem with 2) is that not all guests honour the ACPI _CST package
>>> notification (which would change C2's latency time from an unusable
>>> value to something usable). And now I don't think assuming the _CST
>>> notification to work is a good thing (after we found out that for ex.
>>> Ubuntu 7.10 kernel ignores it).
>>>   
>>>       
>> I think that for hosts with a known unstable TSC, we should do something  
>> like this.  But I also think we have a bug with TSC synchronization for  
>> AMD although I don't at all know what the source of it is.
>>     
>
> Chris has some patches around, I don't remember the details either.
>
> Thanks
>
>   


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 21:42         ` Anthony Liguori
@ 2008-06-18 22:41           ` Marcelo Tosatti
  2008-06-18 22:57             ` john stultz
  2008-06-20 14:07             ` Andi Kleen
  0 siblings, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-06-18 22:41 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, kvm, John Stultz, Yang, Sheng

On Wed, Jun 18, 2008 at 04:42:40PM -0500, Anthony Liguori wrote:
> Marcelo Tosatti wrote:
>> On Wed, Jun 18, 2008 at 04:02:39PM -0500, Anthony Liguori wrote:
>>   
>>>>> Have we yet determined why the TSC is so unstable in the first 
>>>>> place?   In theory, it should be relatively stable on single-node 
>>>>> Intel and  Barcelona chips.
>>>>>             
>>>> If the host enters C2/C3, or changes CPU frequency, it becomes
>>>> unreliable as a clocksource and there's no guarantee the guest will
>>>> detect that.
>>>>         
>>> On Intel, the TSC should be fixed-frequency for basically all 
>>> shipping  processors supporting VT.  Starting with 10h (Barcelona), I 
>>> believe AMD  also has a fixed frequency TSC.
>>>     
>>
>> But still stops ticking in C2/C3 state, I suppose?
>>   
>
> I don't know for sure but the TSC is not tied to the CPU clock so I  
> would be surprised if it did.  I think that that would defeat the  
> utility of a fixed-frequency TSC.

Well, Linux assumes that TSC stops ticking on C2/C3.

Section 18.10 of Intel says:

"The specific processor configuration determines the behavior. Constant
TSC behavior ensures that the duration of each clock tick is uniform and
supports the use of the TSC as a wall clock timer even if the processor
core changes frequency. This is the architectural behavior moving
forward."

However it does not mention C2/C3.

Could someone confirm either way?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 22:41           ` Marcelo Tosatti
@ 2008-06-18 22:57             ` john stultz
  2008-06-18 23:08               ` Nakajima, Jun
  2008-06-20 14:07             ` Andi Kleen
  1 sibling, 1 reply; 18+ messages in thread
From: john stultz @ 2008-06-18 22:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Anthony Liguori, Avi Kivity, kvm, Yang, Sheng


On Wed, 2008-06-18 at 19:41 -0300, Marcelo Tosatti wrote:
> On Wed, Jun 18, 2008 at 04:42:40PM -0500, Anthony Liguori wrote:
> > Marcelo Tosatti wrote:
> >> On Wed, Jun 18, 2008 at 04:02:39PM -0500, Anthony Liguori wrote:
> >>   
> >>>>> Have we yet determined why the TSC is so unstable in the first 
> >>>>> place?   In theory, it should be relatively stable on single-node 
> >>>>> Intel and  Barcelona chips.
> >>>>>             
> >>>> If the host enters C2/C3, or changes CPU frequency, it becomes
> >>>> unreliable as a clocksource and there's no guarantee the guest will
> >>>> detect that.
> >>>>         
> >>> On Intel, the TSC should be fixed-frequency for basically all 
> >>> shipping  processors supporting VT.  Starting with 10h (Barcelona), I 
> >>> believe AMD  also has a fixed frequency TSC.
> >>>     
> >>
> >> But still stops ticking in C2/C3 state, I suppose?
> >>   
> >
> > I don't know for sure but the TSC is not tied to the CPU clock so I  
> > would be surprised if it did.  I think that that would defeat the  
> > utility of a fixed-frequency TSC.
> 
> Well, Linux assumes that TSC stops ticking on C2/C3.
> 
> Section 18.10 of Intel says:
> 
> "The specific processor configuration determines the behavior. Constant
> TSC behavior ensures that the duration of each clock tick is uniform and
> supports the use of the TSC as a wall clock timer even if the processor
> core changes frequency. This is the architectural behavior moving
> forward."
> 
> However it does not mention C2/C3.
> 
> Could someone confirm either way?

My understanding: On most systems, the TSC halts in C3. C2 may also halt
the TSC, but that seems to depend on the BIOS.

thanks
-john



^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 22:57             ` john stultz
@ 2008-06-18 23:08               ` Nakajima, Jun
  0 siblings, 0 replies; 18+ messages in thread
From: Nakajima, Jun @ 2008-06-18 23:08 UTC (permalink / raw)
  To: john stultz, Marcelo Tosatti
  Cc: Anthony Liguori, Avi Kivity, kvm@vger.kernel.org, Yang, Sheng

On 6/18/2008 3:57:16 PM, john stultz wrote:
>
> On Wed, 2008-06-18 at 19:41 -0300, Marcelo Tosatti wrote:
> > On Wed, Jun 18, 2008 at 04:42:40PM -0500, Anthony Liguori wrote:
> > > Marcelo Tosatti wrote:
> > > > On Wed, Jun 18, 2008 at 04:02:39PM -0500, Anthony Liguori wrote:
> > > >
> > > > > > > Have we yet determined why the TSC is so unstable in the first
> > > > > > > place?   In theory, it should be relatively stable on
> > > > > > > single-node Intel and  Barcelona chips.
> > > > > > >
> > > > > > If the host enters C2/C3, or changes CPU frequency, it
> > > > > > becomes unreliable as a clocksource and there's no guarantee
> > > > > > the guest will detect that.
> > > > > >
> > > > > On Intel, the TSC should be fixed-frequency for basically all
> > > > > shipping  processors supporting VT.  Starting with 10h
> > > > > (Barcelona), I believe AMD  also has a fixed frequency TSC.
> > > > >
> > > >
> > > > But still stops ticking in C2/C3 state, I suppose?
> > > >
> > >
> > > I don't know for sure but the TSC is not tied to the CPU clock so
> > > I would be surprised if it did.  I think that that would defeat
> > > the utility of a fixed-frequency TSC.
> >
> > Well, Linux assumes that TSC stops ticking on C2/C3.
> >
> > Section 18.10 of Intel says:
> >
> > "The specific processor configuration determines the behavior.
> > Constant TSC behavior ensures that the duration of each clock tick
> > is uniform and supports the use of the TSC as a wall clock timer
> > even if the processor core changes frequency. This is the
> > architectural behavior moving forward."
> >
> > However it does not mention C2/C3.
> >
> > Could someone confirm either way?
>
> My understanding: On most systems, the TSC halts in C3. C2 may also
> halt the TSC, but that seems to depend on the BIOS.

TSC stops counting in the H/W C3. The C-states reported by BIOS may not necessarily be mapped to H/W C-states; C2 used by BIOS may be C3 for H/W.

>
> thanks
> -john
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
             .
Jun Nakajima | Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 0/7] force the TSC unreliable by reporting C2 state
  2008-06-18 22:41           ` Marcelo Tosatti
  2008-06-18 22:57             ` john stultz
@ 2008-06-20 14:07             ` Andi Kleen
  1 sibling, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2008-06-20 14:07 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Anthony Liguori, Avi Kivity, kvm, John Stultz, Yang, Sheng

Marcelo Tosatti <mtosatti@redhat.com> writes:
>
> Well, Linux assumes that TSC stops ticking on C2/C3.

It doesn't always and Linux is overly conservative and doesn't know
the full rules (and in some cases it's also hard to know because the
BIOS hides systems). Also a lot of systems don't have C2/C3.

But it still happens occasionally so it has to be handled. Normally
we would expect guests to detect this because they have exactly the 
same problem on real hardware, but at least older Linux didn't always
get it correct.

But in general the newer kernel already keeps an estimate on how long C2/C3 took
(needed for power management) and nobody would stop KVM from just adding
that into the TSC offset that is supported by VT. You might still have some drift
from that though.

-Andi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 7/7] KVM: in-kernel ACPI C2 idle emulation
  2008-06-18 16:42 ` [patch 7/7] KVM: in-kernel ACPI C2 idle emulation Marcelo Tosatti
@ 2008-06-23  3:01   ` Avi Kivity
  0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-06-23  3:01 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

Marcelo Tosatti wrote:
> Some guests fail to process the _CST notification which invalidates the C2 state.
>
>   

Which?

> Emulate C2 similarly to HLT for those cases.
>   


Please split the API change from actual addition of the ACPI emulation.

> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> Index: kvm/arch/x86/kvm/Makefile
> ===================================================================
> --- kvm.orig/arch/x86/kvm/Makefile
> +++ kvm/arch/x86/kvm/Makefile
> @@ -11,7 +11,7 @@ endif
>  EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
>  
>  kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \
> -	i8254.o
> +	i8254.o acpi.o
>  obj-$(CONFIG_KVM) += kvm.o
>  kvm-intel-objs = vmx.o
>  obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
> Index: kvm/arch/x86/kvm/acpi.c
> ===================================================================
> --- /dev/null
> +++ kvm/arch/x86/kvm/acpi.c
> @@ -0,0 +1,82 @@
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/acpi_pmtmr.h>
> +#include <asm/io.h>
> +#include "iodev.h"
> +#include "irq.h"
> +
> +/*
> + * P_LVL2 = PBLK + 4h
> + *
> + * Note: matches BIOS (currently hardcoded) definition.
> + */
> +#define ACPI_PBLK 0xb010
>   

Please make it part of the API.

> +
> +struct kvm_acpi {
> +	struct kvm_io_device dev;
> +	struct kvm *kvm;
> +};
> +
> +static void acpi_cstate_halt(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> +	mutex_unlock(&vcpu->kvm->lock);
> +	up_read(&vcpu->kvm->slots_lock);
> +	kvm_vcpu_block(vcpu);
> +	down_read(&vcpu->kvm->slots_lock);
> +	mutex_lock(&vcpu->kvm->lock);
> +	vcpu->arch.pio.size = 0;
> +}
> +
> +static void acpi_ioport_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
> +			     gpa_t addr, int len, void *data)
> +{
> +
> +	if (addr == ACPI_PBLK + 0x4)
> +		acpi_cstate_halt(vcpu);
> +
>   

Don't you need to write to the data here?

> @@ -1725,6 +1727,10 @@ long kvm_arch_vm_ioctl(struct file *filp
>  		r = 0;
>  		break;
>  	}
> +	case KVM_ENABLE_ACPI_C2: {
> +		r = kvm_acpi_init(kvm);
> +		break;
> +	}
>   

Need a disable too.



-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2008-06-23  3:00 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-18 16:42 [patch 0/7] force the TSC unreliable by reporting C2 state Marcelo Tosatti
2008-06-18 16:42 ` [patch 1/7] kvm: qemu: inform valid C2 state in ACPI table Marcelo Tosatti
2008-06-18 16:42 ` [patch 2/7] kvm: qemu: disable c2 via _CST notification Marcelo Tosatti
2008-06-18 16:42 ` [patch 3/7] libkvm: in-kernel C2 halt interface Marcelo Tosatti
2008-06-18 16:42 ` [patch 4/7] libkvm: handle_io return handler value Marcelo Tosatti
2008-06-18 16:42 ` [patch 5/7] qemu: kvm: unhalt vcpu0 on pit irq Marcelo Tosatti
2008-06-18 16:42 ` [patch 6/7] kvm: qemu: enable in-kernel C2 emulation / userspace emulation Marcelo Tosatti
2008-06-18 16:42 ` [patch 7/7] KVM: in-kernel ACPI C2 idle emulation Marcelo Tosatti
2008-06-23  3:01   ` Avi Kivity
2008-06-18 20:09 ` [patch 0/7] force the TSC unreliable by reporting C2 state Anthony Liguori
2008-06-18 20:40   ` Marcelo Tosatti
2008-06-18 21:02     ` Anthony Liguori
2008-06-18 21:21       ` Marcelo Tosatti
2008-06-18 21:42         ` Anthony Liguori
2008-06-18 22:41           ` Marcelo Tosatti
2008-06-18 22:57             ` john stultz
2008-06-18 23:08               ` Nakajima, Jun
2008-06-20 14:07             ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox