* [patch 0/4] C2 "emulation"
@ 2008-05-24 23:43 Marcelo Tosatti
2008-05-24 23:43 ` [patch 1/4] QEMU/KVM: self-disabling C2 emulation Marcelo Tosatti
` (3 more replies)
0 siblings, 4 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-24 23:43 UTC (permalink / raw)
To: Avi Kivity, Chris Wright; +Cc: kvm-devel
As discussed before the TSC is unreliable as a clocksource for KVM
guests, especially with CPU migration.
The following patchset changes the QEMU ACPI tables to report a valid C2
state temporarily to force guests to switch to more reliable ones.
Also allow direct access to the PMTimer port which is constantly used
in C1 idle.
Tested with Linux and Windows XP.
--
^ permalink raw reply [flat|nested] 18+ messages in thread
* [patch 1/4] QEMU/KVM: self-disabling C2 emulation
2008-05-24 23:43 [patch 0/4] C2 "emulation" Marcelo Tosatti
@ 2008-05-24 23:43 ` Marcelo Tosatti
2008-05-24 23:43 ` [patch 2/4] libkvm: KVM_GET_PMTIMER ioctl support Marcelo Tosatti
` (2 subsequent siblings)
3 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-24 23:43 UTC (permalink / raw)
To: Avi Kivity, Chris Wright; +Cc: kvm-devel
[-- Attachment #1: acpi-c2-fake --]
[-- Type: text/plain, Size: 13848 bytes --]
Inform C2 state support via ACPI's CST per-processor package's, but
write an invalid latency value the first time the guest attempts to idle
via P_LVL2 port.
This way the TSC is considered unreliable, and we get away with the
costs relative to APIC timer broadcasts on enter/exit necessary for C1+.
It would be nice to fallback to plain hlt idle instead of C1, which
does not use the pmtimer for idle measurement, but Linux guests with
CONFIG_CPUIDLE enabled fallback to poll_idle instead which is very
inefficient.
Index: kvm-userspace.tip/bios/acpi-dsdt.dsl
===================================================================
--- kvm-userspace.tip.orig/bios/acpi-dsdt.dsl
+++ kvm-userspace.tip/bios/acpi-dsdt.dsl
@@ -33,8 +33,20 @@ DefinitionBlock (
PRU, 8,
PRD, 8,
}
+ OperationRegion(PWNO, SystemIO, 0xb040, 0x02)
+ Field (PWNO, WordAcc, NoLock, WriteAsZeros)
+ {
+ PWC, 16,
+ }
- Processor (CPU0, 0x00, 0x0000b010, 0x06) {Method (_STA) { Return(0xF)}}
+ Processor (CPU0, 0x00, 0x0000b010, 0x06) {
+ Method (_STA) { Return(0xF)}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
+ }
Processor (CPU1, 0x01, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x01, 0x01, 0x1, 0x0, 0x0, 0x0})
Method(_MAT, 0) {
@@ -44,6 +56,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU2, 0x02, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x02, 0x02, 0x1, 0x0, 0x0, 0x0})
@@ -54,6 +71,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU3, 0x03, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x03, 0x03, 0x1, 0x0, 0x0, 0x0})
@@ -64,6 +86,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU4, 0x04, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x04, 0x04, 0x1, 0x0, 0x0, 0x0})
@@ -74,6 +101,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU5, 0x05, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x05, 0x05, 0x1, 0x0, 0x0, 0x0})
@@ -84,6 +116,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU6, 0x06, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x06, 0x06, 0x1, 0x0, 0x0, 0x0})
@@ -94,6 +131,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU7, 0x07, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x07, 0x07, 0x1, 0x0, 0x0, 0x0})
@@ -104,6 +146,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU8, 0x08, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x08, 0x08, 0x1, 0x0, 0x0, 0x0})
@@ -114,6 +161,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPU9, 0x09, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x09, 0x09, 0x1, 0x0, 0x0, 0x0})
@@ -124,6 +176,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPUA, 0x0a, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x0A, 0x0A, 0x1, 0x0, 0x0, 0x0})
@@ -134,6 +191,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPUB, 0x0b, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x0B, 0x0B, 0x1, 0x0, 0x0, 0x0})
@@ -144,6 +206,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPUC, 0x0c, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x0C, 0x0C, 0x1, 0x0, 0x0, 0x0})
@@ -154,6 +221,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPUD, 0x0d, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x0D, 0x0D, 0x1, 0x0, 0x0, 0x0})
@@ -164,6 +236,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
Processor (CPUE, 0x0e, 0x0000b010, 0x06) {
Name (TMP, Buffer(0x8) {0x0, 0x8, 0x0E, 0x0E, 0x1, 0x0, 0x0, 0x0})
@@ -174,6 +251,11 @@ DefinitionBlock (
Method (_STA) {
Return(0xF)
}
+ Name(_CST, Package() {
+ 1, Package() {
+ ResourceTemplate() {Register(SystemIO, 8, 0, 0xb014)},
+ 2, 2, 300},
+ })
}
}
@@ -1544,6 +1626,81 @@ DefinitionBlock (
Return(0x01)
}
Method(_L06) {
+ If (And(\_PR.PWC, 0x1)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU0._CST, 1)), 2))
+ Notify(\_PR.CPU0, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x2)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU1._CST, 1)), 2))
+ Notify(\_PR.CPU1, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x4)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU2._CST, 1)), 2))
+ Notify(\_PR.CPU2, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x8)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU3._CST, 1)), 2))
+ Notify(\_PR.CPU3, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x10)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU4._CST, 1)), 2))
+ Notify(\_PR.CPU4, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x20)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU5._CST, 1)), 2))
+ Notify(\_PR.CPU5, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x40)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU6._CST, 1)), 2))
+ Notify(\_PR.CPU6, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x80)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU7._CST, 1)), 2))
+ Notify(\_PR.CPU7, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x100)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU8._CST, 1)), 2))
+ Notify(\_PR.CPU8, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x200)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPU9._CST, 1)), 2))
+ Notify(\_PR.CPU9, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x400)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPUA._CST, 1)), 2))
+ Notify(\_PR.CPUA, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x800)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPUB._CST, 1)), 2))
+ Notify(\_PR.CPUB, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x1000)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPUC._CST, 1)), 2))
+ Notify(\_PR.CPUC, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x2000)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPUD._CST, 1)), 2))
+ Notify(\_PR.CPUD, 0x81)
+ }
+
+ If (And(\_PR.PWC, 0x4000)) {
+ Store (0xfffff, Index (DeRefOf (Index (\_PR.CPUE._CST, 1)), 2))
+ Notify(\_PR.CPUE, 0x81)
+ }
+
Return(0x01)
}
Method(_L07) {
Index: kvm-userspace.tip/qemu/hw/acpi.c
===================================================================
--- kvm-userspace.tip.orig/qemu/hw/acpi.c
+++ kvm-userspace.tip/qemu/hw/acpi.c
@@ -121,6 +121,31 @@ static void pm_tmr_timer(void *opaque)
pm_update_sci(s);
}
+/*
+ * Fake C2 emulation, so the OS will consider the TSC unreliable
+ * an fallback to C1 after the latency is updated to a high value
+ * in acpi-dsdt.dsl.
+ */
+static void qemu_system_cpu_power_notify(int cpu);
+static uint32_t pm_ioport_readb(void *opaque, uint32_t addr)
+{
+ CPUState *env = cpu_single_env;
+
+ addr &= 0x3f;
+ switch (addr) {
+ case 0x14: /* P_LVL2 */
+ qemu_system_cpu_power_notify(env->cpu_index);
+ }
+#ifdef DEBUG
+ printf("pm_ioport_readb addr=%x\n", addr);
+#endif
+ return 0;
+}
+
+static void pm_ioport_writeb(void *opaque, uint32_t addr, uint32_t val)
+{
+}
+
static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
{
PIIX4PMState *s = opaque;
@@ -420,6 +445,8 @@ static void pm_io_space_update(PIIX4PMSt
#if defined(DEBUG)
printf("PM: mapping to 0x%x\n", pm_io_base);
#endif
+ register_ioport_write(pm_io_base, 64, 1, pm_ioport_writeb, s);
+ register_ioport_read(pm_io_base, 64, 1, pm_ioport_readb, s);
register_ioport_write(pm_io_base, 64, 2, pm_ioport_writew, s);
register_ioport_read(pm_io_base, 64, 2, pm_ioport_readw, s);
register_ioport_write(pm_io_base, 64, 4, pm_ioport_writel, s);
@@ -538,6 +565,7 @@ void qemu_system_powerdown(void)
}
#endif
#define GPE_BASE 0xafe0
+#define POWER_GPE_BASE 0xb040
#define PROC_BASE 0xaf00
#define PCI_BASE 0xae00
#define PCI_EJ_BASE 0xae08
@@ -554,7 +582,12 @@ struct pci_status {
uint32_t down;
};
+struct power_gpe_regs {
+ uint8_t cpus;
+};
+
static struct gpe_regs gpe;
+static struct power_gpe_regs power_gpe;
static struct pci_status pci0_status;
static uint32_t gpe_readb(void *opaque, uint32_t addr)
@@ -623,6 +656,23 @@ static void gpe_writeb(void *opaque, uin
#endif
}
+static uint32_t cpu_power_read(void *opaque, uint32_t addr)
+{
+ struct power_gpe_regs *p = opaque;
+
+#if defined(DEBUG)
+ printf("cpu power read %lx == %lx\n", addr, p->cpus);
+#endif
+ return p->cpus;
+}
+
+static void cpu_power_write(void *opaque, uint32_t addr, uint32_t val)
+{
+#if defined(DEBUG)
+ printf("cpu power write %lx <== %lx\n", addr, val);
+#endif
+}
+
static uint32_t pcihotplug_read(void *opaque, uint32_t addr)
{
uint32_t val = 0;
@@ -696,6 +746,9 @@ void qemu_system_hot_add_init(const char
register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, NULL);
register_ioport_read(PCI_EJ_BASE, 4, 4, pciej_read, NULL);
+ register_ioport_write(POWER_GPE_BASE, 4, 2, cpu_power_write, &power_gpe);
+ register_ioport_read(POWER_GPE_BASE, 4, 2, cpu_power_read, &power_gpe);
+
model = cpu_model;
}
@@ -738,6 +791,16 @@ void qemu_system_cpu_hot_add(int cpu, in
disable_processor(&gpe, cpu);
qemu_set_irq(pm_state->irq, 0);
}
+
+static void qemu_system_cpu_power_notify(int cpu)
+{
+ power_gpe.cpus = 0;
+
+ qemu_set_irq(pm_state->irq, 1);
+ power_gpe.cpus |= (1 << cpu);
+ qemu_set_irq(pm_state->irq, 0);
+}
+
#endif
static void enable_device(struct pci_status *p, struct gpe_regs *g, int slot)
--
^ permalink raw reply [flat|nested] 18+ messages in thread
* [patch 2/4] libkvm: KVM_GET_PMTIMER ioctl support
2008-05-24 23:43 [patch 0/4] C2 "emulation" Marcelo Tosatti
2008-05-24 23:43 ` [patch 1/4] QEMU/KVM: self-disabling C2 emulation Marcelo Tosatti
@ 2008-05-24 23:43 ` Marcelo Tosatti
2008-05-24 23:43 ` [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support Marcelo Tosatti
2008-05-24 23:43 ` [patch 4/4] KVM: allow direct access to PMTimer port Marcelo Tosatti
3 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-24 23:43 UTC (permalink / raw)
To: Avi Kivity, Chris Wright; +Cc: kvm-devel
[-- Attachment #1: kvm_pmtimer --]
[-- Type: text/plain, Size: 1203 bytes --]
Unfortunately there seems to be no standard interface to read the
pmtimer...
Index: kvm-userspace.tip/libkvm/libkvm.c
===================================================================
--- kvm-userspace.tip.orig/libkvm/libkvm.c
+++ kvm-userspace.tip/libkvm/libkvm.c
@@ -794,6 +794,17 @@ int kvm_set_mpstate(kvm_context_t kvm, i
return -ENOSYS;
}
#endif
+#ifdef KVM_CAP_GET_PMTIMER
+int kvm_get_pmtimer(kvm_context_t kvm, struct kvm_pmtimer *pmtmr)
+{
+ int r;
+
+ r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_GET_PMTIMER);
+ if (r > 0)
+ return ioctl(kvm->vm_fd, KVM_GET_PMTIMER, pmtmr);
+ return -ENOSYS;
+}
+#endif
static int handle_mmio(kvm_context_t kvm, struct kvm_run *kvm_run)
{
Index: kvm-userspace.tip/libkvm/libkvm.h
===================================================================
--- kvm-userspace.tip.orig/libkvm/libkvm.h
+++ kvm-userspace.tip/libkvm/libkvm.h
@@ -325,6 +325,13 @@ static inline int kvm_reset_mpstate(kvm_
}
#endif
+#ifdef KVM_CAP_GET_PMTIMER
+/*!
+ * \brief Read host pmtimer value
+ */
+int kvm_get_pmtimer(kvm_context_t kvm, struct kvm_pmtimer *pmtmr);
+#endif
+
/*!
* \brief Simulate an external vectored interrupt
*
--
^ permalink raw reply [flat|nested] 18+ messages in thread
* [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-24 23:43 [patch 0/4] C2 "emulation" Marcelo Tosatti
2008-05-24 23:43 ` [patch 1/4] QEMU/KVM: self-disabling C2 emulation Marcelo Tosatti
2008-05-24 23:43 ` [patch 2/4] libkvm: KVM_GET_PMTIMER ioctl support Marcelo Tosatti
@ 2008-05-24 23:43 ` Marcelo Tosatti
2008-05-25 10:18 ` Avi Kivity
2008-05-25 10:19 ` Avi Kivity
2008-05-24 23:43 ` [patch 4/4] KVM: allow direct access to PMTimer port Marcelo Tosatti
3 siblings, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-24 23:43 UTC (permalink / raw)
To: Avi Kivity, Chris Wright; +Cc: kvm-devel
[-- Attachment #1: acpi-pmtimer --]
[-- Type: text/plain, Size: 14922 bytes --]
QEMU support for direct pmtimer reads. Hopefully its safe, since its a
read-only register ?
With self-disable C2 + this I'm seeing less CPU usage when idle with
CONFIG_CPU_IDLE enabled. Quite noticeable on SMP guests. Windows XP is
comparable to standard (never seen it consume less than 10% either way,
usually 20-30%).
On migration the destination host can either lack ACPI or have the timer
in a different IO port, so emulation is necessary.
Or luckily the pmtimer is in the same address. Since the 24-bit counter
overflow period is only ~= 4.6 seconds, its probably worthwhile to wait
for synchronization before restarting the guest. Not implemented though.
Also simplify and fix the overflow emulation, which was happening every
2.3 seconds instead of the expected 4.6s.
Index: kvm-userspace.tip/bios/rombios32.c
===================================================================
--- kvm-userspace.tip.orig/bios/rombios32.c
+++ kvm-userspace.tip/bios/rombios32.c
@@ -391,7 +391,7 @@ uint8_t bios_uuid[16];
unsigned long ebda_cur_addr;
#endif
int acpi_enabled;
-uint32_t pm_io_base, smb_io_base;
+uint32_t pm_io_base, pmtmr_base, smb_io_base;
int pm_sci_int;
unsigned long bios_table_cur_addr;
unsigned long bios_table_end_addr;
@@ -819,6 +819,12 @@ static void pci_bios_init_device(PCIDevi
pci_config_writeb(d, PCI_INTERRUPT_LINE, 9);
pm_io_base = PM_IO_BASE;
+ pmtmr_base = cmos_readb(0x60);
+ pmtmr_base |= cmos_readb(0x61) << 8;
+ pmtmr_base |= cmos_readb(0x62) << 16;
+ pmtmr_base |= cmos_readb(0x63) << 24;
+ if (!pmtmr_base)
+ pmtmr_base = pm_io_base + 0x08;
pci_config_writel(d, 0x40, pm_io_base | 1);
pci_config_writeb(d, 0x80, 0x01); /* enable PM io space */
smb_io_base = SMB_IO_BASE;
@@ -1381,7 +1387,7 @@ void acpi_bios_init(void)
fadt->acpi_disable = 0xf0;
fadt->pm1a_evt_blk = cpu_to_le32(pm_io_base);
fadt->pm1a_cnt_blk = cpu_to_le32(pm_io_base + 0x04);
- fadt->pm_tmr_blk = cpu_to_le32(pm_io_base + 0x08);
+ fadt->pm_tmr_blk = cpu_to_le32(pmtmr_base);
fadt->pm1_evt_len = 4;
fadt->pm1_cnt_len = 2;
fadt->pm_tmr_len = 4;
Index: kvm-userspace.tip/qemu/hw/acpi.c
===================================================================
--- kvm-userspace.tip.orig/qemu/hw/acpi.c
+++ kvm-userspace.tip/qemu/hw/acpi.c
@@ -40,6 +40,10 @@ typedef struct PIIX4PMState {
uint16_t pmsts;
uint16_t pmen;
uint16_t pmcntrl;
+ uint32_t pmtimer_base;
+ uint8_t direct_access;
+ int32_t pmtimer_offset;
+ uint32_t pmtimer_io_offset;
uint8_t apmc;
uint8_t apms;
QEMUTimer *tmr_timer;
@@ -82,42 +86,51 @@ static uint32_t get_pmtmr(PIIX4PMState *
{
uint32_t d;
d = muldiv64(qemu_get_clock(vm_clock), PM_FREQ, ticks_per_sec);
- return d & 0xffffff;
+ d += s->pmtimer_offset;
+ d &= 0xffffff;
+ return d;
}
static int get_pmsts(PIIX4PMState *s)
{
- int64_t d;
- int pmsts;
- pmsts = s->pmsts;
- d = muldiv64(qemu_get_clock(vm_clock), PM_FREQ, ticks_per_sec);
- if (d >= s->tmr_overflow_time)
- s->pmsts |= TMROF_EN;
- return pmsts;
+ return s->pmsts;
+}
+
+static void schedule_pmtmr_sci(PIIX4PMState *s)
+{
+ int64_t expire_time;
+ uint32_t pmtmr, left;
+
+ if (s->direct_access)
+ qemu_kvm_get_pmtimer(&pmtmr);
+ else
+ pmtmr = get_pmtmr(s);
+
+ left = (1 << 24) - pmtmr;
+ expire_time = muldiv64(left, ticks_per_sec, PM_FREQ);
+ expire_time += qemu_get_clock(vm_clock);
+ qemu_mod_timer(s->tmr_timer, expire_time);
}
static void pm_update_sci(PIIX4PMState *s)
{
int sci_level, pmsts;
- int64_t expire_time;
pmsts = get_pmsts(s);
sci_level = (((pmsts & s->pmen) &
(RTC_EN | PWRBTN_EN | GBL_EN | TMROF_EN)) != 0);
qemu_set_irq(s->irq, sci_level);
/* schedule a timer interruption if needed */
- if ((s->pmen & TMROF_EN) && !(pmsts & TMROF_EN)) {
- expire_time = muldiv64(s->tmr_overflow_time, ticks_per_sec, PM_FREQ);
- qemu_mod_timer(s->tmr_timer, expire_time);
- s->tmr_overflow_time += 0x800000;
- } else {
+ if ((s->pmen & TMROF_EN) && !(s->pmsts & TMROF_EN))
+ schedule_pmtmr_sci(s);
+ else
qemu_del_timer(s->tmr_timer);
- }
}
static void pm_tmr_timer(void *opaque)
{
PIIX4PMState *s = opaque;
+ s->pmsts |= TMROF_EN;
pm_update_sci(s);
}
@@ -152,18 +165,8 @@ static void pm_ioport_writew(void *opaqu
addr &= 0x3f;
switch(addr) {
case 0x00:
- {
- int64_t d;
- int pmsts;
- pmsts = get_pmsts(s);
- if (pmsts & val & TMROF_EN) {
- /* if TMRSTS is reset, then compute the new overflow time */
- d = muldiv64(qemu_get_clock(vm_clock), PM_FREQ, ticks_per_sec);
- s->tmr_overflow_time = (d + 0x800000LL) & ~0x7fffffLL;
- }
- s->pmsts &= ~val;
- pm_update_sci(s);
- }
+ s->pmsts &= ~val;
+ pm_update_sci(s);
break;
case 0x02:
s->pmen = val;
@@ -235,14 +238,10 @@ static uint32_t pm_ioport_readl(void *op
uint32_t val;
addr &= 0x3f;
- switch(addr) {
- case 0x08:
+ if (addr == s->pmtimer_io_offset)
val = get_pmtmr(s);
- break;
- default:
+ else
val = 0;
- break;
- }
#ifdef DEBUG
printf("PM readl port=0x%04x val=0x%08x\n", addr, val);
#endif
@@ -433,9 +432,9 @@ static uint32_t smb_ioport_readb(void *o
return val;
}
-static void pm_io_space_update(PIIX4PMState *s)
+static void pm_io_space_update(PIIX4PMState *s, int migration)
{
- uint32_t pm_io_base;
+ uint32_t pm_io_base, pmtmr_len;
if (s->dev.config[0x80] & 1) {
pm_io_base = le32_to_cpu(*(uint32_t *)(s->dev.config + 0x40));
@@ -443,14 +442,29 @@ static void pm_io_space_update(PIIX4PMSt
/* XXX: need to improve memory and ioport allocation */
#if defined(DEBUG)
- printf("PM: mapping to 0x%x\n", pm_io_base);
+ printf("PM: mapping to 0x%x mig=%d\n", pm_io_base, migration);
#endif
register_ioport_write(pm_io_base, 64, 1, pm_ioport_writeb, s);
register_ioport_read(pm_io_base, 64, 1, pm_ioport_readb, s);
register_ioport_write(pm_io_base, 64, 2, pm_ioport_writew, s);
register_ioport_read(pm_io_base, 64, 2, pm_ioport_readw, s);
- register_ioport_write(pm_io_base, 64, 4, pm_ioport_writel, s);
- register_ioport_read(pm_io_base, 64, 4, pm_ioport_readl, s);
+
+ if (migration) {
+ s->pmtimer_io_offset = 0x08;
+ pmtmr_len = 64;
+ } else if (host_pmtimer_base) {
+ s->pmtimer_base = host_pmtimer_base;
+ s->pmtimer_io_offset = 0x0;
+ pmtmr_len = 4;
+ s->direct_access = 1;
+ } else {
+ s->pmtimer_base = pm_io_base;
+ s->pmtimer_io_offset = 0x08;
+ pmtmr_len = 64;
+ }
+
+ register_ioport_write(s->pmtimer_base, pmtmr_len, 4, pm_ioport_writel, s);
+ register_ioport_read(s->pmtimer_base, pmtmr_len, 4, pm_ioport_readl, s);
}
}
@@ -459,12 +473,13 @@ static void pm_write_config(PCIDevice *d
{
pci_default_write_config(d, address, val, len);
if (address == 0x80)
- pm_io_space_update((PIIX4PMState *)d);
+ pm_io_space_update((PIIX4PMState *)d, 0);
}
static void pm_save(QEMUFile* f,void *opaque)
{
PIIX4PMState *s = opaque;
+ uint32_t pmtmr_val;
pci_device_save(&s->dev, f);
@@ -475,6 +490,14 @@ static void pm_save(QEMUFile* f,void *op
qemu_put_8s(f, &s->apms);
qemu_put_timer(f, s->tmr_timer);
qemu_put_be64(f, s->tmr_overflow_time);
+ qemu_put_be32(f, s->pmtimer_base);
+ if (s->direct_access) {
+ if (qemu_kvm_get_pmtimer(&pmtmr_val) < 0)
+ pmtmr_val = 1 << 30;
+ } else
+ pmtmr_val = get_pmtmr(s);
+
+ qemu_put_be32(f, pmtmr_val);
}
static int pm_load(QEMUFile* f,void* opaque,int version_id)
@@ -482,7 +505,7 @@ static int pm_load(QEMUFile* f,void* opa
PIIX4PMState *s = opaque;
int ret;
- if (version_id > 1)
+ if (version_id > 2)
return -EINVAL;
ret = pci_device_load(&s->dev, f);
@@ -496,10 +519,19 @@ static int pm_load(QEMUFile* f,void* opa
qemu_get_8s(f, &s->apms);
qemu_get_timer(f, s->tmr_timer);
s->tmr_overflow_time=qemu_get_be64(f);
+ if (version_id >= 2) {
+ uint32_t pmtmr_val;
- pm_io_space_update(s);
+ s->pmtimer_base = qemu_get_be32(f);
+ pmtmr_val = qemu_get_be32(f);
+ if (pmtmr_val & (1 << 30))
+ return -EINVAL;
+ s->pmtimer_offset = pmtmr_val - get_pmtmr(s);
+ }
- return 0;
+ pm_io_space_update(s, 1);
+
+ return 0;
}
i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
@@ -548,7 +580,7 @@ i2c_bus *piix4_pm_init(PCIBus *bus, int
s->tmr_timer = qemu_new_timer(vm_clock, pm_tmr_timer, s);
- register_savevm("piix4_pm", 0, 1, pm_save, pm_load, s);
+ register_savevm("piix4_pm", 0, 2, pm_save, pm_load, s);
s->smbus = i2c_init_bus();
s->irq = sci_irq;
Index: kvm-userspace.tip/qemu/hw/pc.c
===================================================================
--- kvm-userspace.tip.orig/qemu/hw/pc.c
+++ kvm-userspace.tip/qemu/hw/pc.c
@@ -253,6 +253,11 @@ static void cmos_init(ram_addr_t ram_siz
}
rtc_set_memory(s, 0x5f, smp_cpus - 1);
+ rtc_set_memory(s, 0x60, host_pmtimer_base);
+ rtc_set_memory(s, 0x61, host_pmtimer_base >> 8);
+ rtc_set_memory(s, 0x62, host_pmtimer_base >> 16);
+ rtc_set_memory(s, 0x63, host_pmtimer_base >> 24);
+
if (ram_size > (16 * 1024 * 1024))
val = (ram_size / 65536) - ((16 * 1024 * 1024) / 65536);
else
Index: kvm-userspace.tip/qemu/qemu-kvm-x86.c
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm-x86.c
+++ kvm-userspace.tip/qemu/qemu-kvm-x86.c
@@ -11,12 +11,17 @@
#include <string.h>
#include "hw/hw.h"
+#include "sysemu.h"
#include "qemu-kvm.h"
#include <libkvm.h>
#include <pthread.h>
#include <sys/utsname.h>
#include <linux/kvm_para.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
#define MSR_IA32_TSC 0x10
@@ -545,6 +550,59 @@ static int get_para_features(kvm_context
return features;
}
+int kvm_arch_qemu_init(void)
+{
+ int fd, ret;
+ char buf[16384];
+ char *line, *saveptr;
+ struct kvm_pmtimer pmtmr;
+
+ if (no_direct_pmtimer)
+ return 0;
+
+ fd = open("/proc/ioports", O_RDONLY);
+ if (fd == -1) {
+ perror("open /proc/ioports");
+ exit(0);
+ }
+ ret = read(fd, buf, 16384);
+ if (ret == -1) {
+ perror("read /proc/ioports");
+ exit(0);
+ }
+
+ line = strtok_r(buf, "\n", &saveptr);
+ do {
+ char *pmstr;
+ line = pmstr = strtok_r(NULL, "\n", &saveptr);
+ if (pmstr && strstr(pmstr, "ACPI PM_TMR")) {
+ pmstr = strtok(line, "-");
+ while (*pmstr == ' ')
+ pmstr++;
+ host_pmtimer_base = strtoul(pmstr, NULL, 16);
+ /*
+ * Fail now instead of during migration
+ */
+ if (kvm_get_pmtimer(kvm_context, &pmtmr) < 0)
+ host_pmtimer_base = 0;
+ break;
+ }
+ } while (line);
+
+ return 0;
+}
+
+int qemu_kvm_get_pmtimer(uint32_t *value)
+{
+ int ret = 0;
+ struct kvm_pmtimer pmtmr;
+
+ ret = kvm_get_pmtimer(kvm_context, &pmtmr);
+ *value = pmtmr.val & 0xffffff;
+
+ return ret;
+}
+
int kvm_arch_qemu_init_env(CPUState *cenv)
{
struct kvm_cpuid_entry cpuid_ent[100];
Index: kvm-userspace.tip/qemu/qemu-kvm.c
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm.c
+++ kvm-userspace.tip/qemu/qemu-kvm.c
@@ -677,6 +677,7 @@ int kvm_qemu_create_context(void)
r = kvm_arch_qemu_create_context();
if(r <0)
kvm_qemu_destroy();
+ kvm_arch_qemu_init();
return 0;
}
Index: kvm-userspace.tip/qemu/qemu-kvm.h
===================================================================
--- kvm-userspace.tip.orig/qemu/qemu-kvm.h
+++ kvm-userspace.tip/qemu/qemu-kvm.h
@@ -49,6 +49,7 @@ void kvm_cpu_destroy_phys_mem(target_phy
unsigned long size);
int kvm_arch_qemu_create_context(void);
+int kvm_arch_qemu_init(void);
void kvm_arch_save_regs(CPUState *env);
void kvm_arch_load_regs(CPUState *env);
@@ -60,6 +61,8 @@ int kvm_arch_has_work(CPUState *env);
int kvm_arch_try_push_interrupts(void *opaque);
void kvm_arch_update_regs_for_sipi(CPUState *env);
void kvm_arch_cpu_reset(CPUState *env);
+int qemu_kvm_get_pmtimer(uint32_t *value);
+
CPUState *qemu_kvm_cpu_env(int index);
Index: kvm-userspace.tip/qemu/sysemu.h
===================================================================
--- kvm-userspace.tip.orig/qemu/sysemu.h
+++ kvm-userspace.tip/qemu/sysemu.h
@@ -94,6 +94,7 @@ extern int win2k_install_hack;
extern int alt_grab;
extern int usb_enabled;
extern int smp_cpus;
+extern unsigned int host_pmtimer_base;
extern int cursor_hide;
extern int graphic_rotate;
extern int no_quit;
@@ -101,6 +102,7 @@ extern int semihosting_enabled;
extern int autostart;
extern int old_param;
extern int hpagesize;
+extern int no_direct_pmtimer;
extern const char *bootp_filename;
Index: kvm-userspace.tip/qemu/vl.c
===================================================================
--- kvm-userspace.tip.orig/qemu/vl.c
+++ kvm-userspace.tip/qemu/vl.c
@@ -209,6 +209,7 @@ int win2k_install_hack = 0;
int usb_enabled = 0;
static VLANState *first_vlan;
int smp_cpus = 1;
+unsigned int host_pmtimer_base;
const char *vnc_display;
#if defined(TARGET_SPARC)
#define MAX_CPUS 16
@@ -235,6 +236,7 @@ int time_drift_fix = 0;
unsigned int kvm_shadow_memory = 0;
const char *mem_path = NULL;
int hpagesize = 0;
+int no_direct_pmtimer = 0;
const char *cpu_vendor_string;
#ifdef TARGET_ARM
int old_param = 0;
@@ -7931,6 +7933,7 @@ enum {
QEMU_OPTION_tdf,
QEMU_OPTION_kvm_shadow_memory,
QEMU_OPTION_mempath,
+ QEMU_OPTION_no_direct_pmtimer,
};
typedef struct QEMUOption {
@@ -8058,6 +8061,7 @@ const QEMUOption qemu_options[] = {
{ "clock", HAS_ARG, QEMU_OPTION_clock },
{ "startdate", HAS_ARG, QEMU_OPTION_startdate },
{ "mem-path", HAS_ARG, QEMU_OPTION_mempath },
+ { "no-direct-pmtimer", 0, QEMU_OPTION_no_direct_pmtimer },
{ NULL },
};
@@ -8962,6 +8966,9 @@ int main(int argc, char **argv)
case QEMU_OPTION_mempath:
mem_path = optarg;
break;
+ case QEMU_OPTION_no_direct_pmtimer:
+ no_direct_pmtimer = 1;
+ break;
case QEMU_OPTION_name:
qemu_name = optarg;
break;
--
^ permalink raw reply [flat|nested] 18+ messages in thread
* [patch 4/4] KVM: allow direct access to PMTimer port
2008-05-24 23:43 [patch 0/4] C2 "emulation" Marcelo Tosatti
` (2 preceding siblings ...)
2008-05-24 23:43 ` [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support Marcelo Tosatti
@ 2008-05-24 23:43 ` Marcelo Tosatti
2008-05-25 10:04 ` Avi Kivity
2008-05-25 12:31 ` Avi Kivity
3 siblings, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-24 23:43 UTC (permalink / raw)
To: Avi Kivity, Chris Wright; +Cc: kvm-devel, Len Brown
[-- Attachment #1: kvm-open-pmtmr --]
[-- Type: text/plain, Size: 3581 bytes --]
There's not much point in exiting for pmtimer reads, since it runs at a
fixed clock rate and its start value is undefined.
The KVM-specific ioctl to read the counter from userspace is not nice
though. Ideas?
CC: Len Brown <lenb@kernel.org>
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index baf9607..f7e44d8 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -24,6 +24,7 @@
#include <linux/vmalloc.h>
#include <linux/highmem.h>
#include <linux/sched.h>
+#include <linux/acpi_pmtmr.h>
#include <asm/desc.h>
@@ -424,6 +425,10 @@ static __init int svm_hardware_setup(void)
iopm_va = page_address(iopm_pages);
memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER));
clear_bit(0x80, iopm_va); /* allow direct access to PC debug port */
+ if (pmtmr_ioport)
+ for (r = 0; r < 4; r++)
+ clear_bit(pmtmr_ioport+r, iopm_va);
+
iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
if (boot_cpu_has(X86_FEATURE_NX))
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aaa99ed..4f77fd7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -26,6 +26,7 @@
#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/moduleparam.h>
+#include <linux/acpi_pmtmr.h>
#include <asm/io.h>
#include <asm/desc.h>
@@ -3261,6 +3262,10 @@ static int __init vmx_init(void)
va = kmap(vmx_io_bitmap_a);
memset(va, 0xff, PAGE_SIZE);
clear_bit(0x80, va);
+ if (pmtmr_ioport)
+ for (r = 0; r < 4; r++)
+ clear_bit(pmtmr_ioport+r, va);
+
kunmap(vmx_io_bitmap_a);
va = kmap(vmx_io_bitmap_b);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e537005..9769f52 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -27,6 +27,7 @@
#include <linux/module.h>
#include <linux/mman.h>
#include <linux/highmem.h>
+#include <linux/acpi_pmtmr.h>
#include <asm/uaccess.h>
#include <asm/msr.h>
@@ -790,6 +791,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_PIT:
case KVM_CAP_NOP_IO_DELAY:
case KVM_CAP_MP_STATE:
+ case KVM_CAP_GET_PMTIMER:
r = 1;
break;
case KVM_CAP_VAPIC:
@@ -1678,6 +1680,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = 0;
break;
}
+ case KVM_GET_PMTIMER: {
+ struct kvm_pmtimer pmtmr;
+ r = -ENODEV;
+ if (!pmtmr_ioport)
+ goto out;
+ pmtmr.val = inl(pmtmr_ioport);
+ r = -EFAULT;
+ if (copy_to_user(argp, &pmtmr, sizeof pmtmr))
+ goto out;
+ r = 0;
+ break;
+ }
default:
;
}
diff --git a/drivers/clocksource/acpi_pm.c b/drivers/clocksource/acpi_pm.c
index 7b46faf..d3d4294 100644
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -29,6 +29,7 @@
* in arch/i386/kernel/acpi/boot.c
*/
u32 pmtmr_ioport __read_mostly;
+EXPORT_SYMBOL(pmtmr_ioport);
static inline u32 read_pmtmr(void)
{
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index a281afe..3672890 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -247,6 +247,10 @@ struct kvm_mp_state {
__u32 mp_state;
};
+struct kvm_pmtimer {
+ __u32 val;
+};
+
struct kvm_s390_psw {
__u64 mask;
__u64 addr;
@@ -346,6 +350,7 @@ struct kvm_trace_rec {
#define KVM_CAP_NOP_IO_DELAY 12
#define KVM_CAP_PV_MMU 13
#define KVM_CAP_MP_STATE 14
+#define KVM_CAP_GET_PMTIMER 15
/*
* ioctls for VM fds
@@ -371,6 +376,7 @@ struct kvm_trace_rec {
#define KVM_CREATE_PIT _IO(KVMIO, 0x64)
#define KVM_GET_PIT _IOWR(KVMIO, 0x65, struct kvm_pit_state)
#define KVM_SET_PIT _IOR(KVMIO, 0x66, struct kvm_pit_state)
+#define KVM_GET_PMTIMER _IOR(KVMIO, 0x67, struct kvm_pmtimer)
/*
* ioctls for vcpu fds
--
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [patch 4/4] KVM: allow direct access to PMTimer port
2008-05-24 23:43 ` [patch 4/4] KVM: allow direct access to PMTimer port Marcelo Tosatti
@ 2008-05-25 10:04 ` Avi Kivity
2008-05-25 16:09 ` Marcelo Tosatti
2008-05-25 12:31 ` Avi Kivity
1 sibling, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2008-05-25 10:04 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel, Len Brown
Marcelo Tosatti wrote:
> There's not much point in exiting for pmtimer reads, since it runs at a
> fixed clock rate and its start value is undefined.
>
> The KVM-specific ioctl to read the counter from userspace is not nice
> though. Ideas?
>
> CC: Len Brown <lenb@kernel.org>
>
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index baf9607..f7e44d8 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -24,6 +24,7 @@
> #include <linux/vmalloc.h>
> #include <linux/highmem.h>
> #include <linux/sched.h>
> +#include <linux/acpi_pmtmr.h>
>
> #include <asm/desc.h>
>
> @@ -424,6 +425,10 @@ static __init int svm_hardware_setup(void)
> iopm_va = page_address(iopm_pages);
> memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER));
> clear_bit(0x80, iopm_va); /* allow direct access to PC debug port */
> + if (pmtmr_ioport)
> + for (r = 0; r < 4; r++)
> + clear_bit(pmtmr_ioport+r, iopm_va);
> +
>
What if the port conflicts with a virtualized port?
> iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
>
> if (boot_cpu_has(X86_FEATURE_NX))
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index aaa99ed..4f77fd7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -26,6 +26,7 @@
> #include <linux/highmem.h>
> #include <linux/sched.h>
> #include <linux/moduleparam.h>
> +#include <linux/acpi_pmtmr.h>
>
> #include <asm/io.h>
> #include <asm/desc.h>
> @@ -3261,6 +3262,10 @@ static int __init vmx_init(void)
> va = kmap(vmx_io_bitmap_a);
> memset(va, 0xff, PAGE_SIZE);
> clear_bit(0x80, va);
> + if (pmtmr_ioport)
> + for (r = 0; r < 4; r++)
> + clear_bit(pmtmr_ioport+r, va);
> +
>
Instead of duplicating the code, how about an kvm_x86_ops interface?
> kunmap(vmx_io_bitmap_a);
>
> va = kmap(vmx_io_bitmap_b);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e537005..9769f52 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -27,6 +27,7 @@
> #include <linux/module.h>
> #include <linux/mman.h>
> #include <linux/highmem.h>
> +#include <linux/acpi_pmtmr.h>
>
> #include <asm/uaccess.h>
> #include <asm/msr.h>
> @@ -790,6 +791,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_PIT:
> case KVM_CAP_NOP_IO_DELAY:
> case KVM_CAP_MP_STATE:
> + case KVM_CAP_GET_PMTIMER:
> r = 1;
> break;
> case KVM_CAP_VAPIC:
> @@ -1678,6 +1680,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
> r = 0;
> break;
> }
> + case KVM_GET_PMTIMER: {
> + struct kvm_pmtimer pmtmr;
> + r = -ENODEV;
> + if (!pmtmr_ioport)
> + goto out;
> + pmtmr.val = inl(pmtmr_ioport);
> + r = -EFAULT;
> + if (copy_to_user(argp, &pmtmr, sizeof pmtmr))
> + goto out;
> + r = 0;
> + break;
> + }
> default:
>
This definitely doesn't belong in kvm. /dev/pmtimer?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-24 23:43 ` [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support Marcelo Tosatti
@ 2008-05-25 10:18 ` Avi Kivity
2008-05-25 16:32 ` Marcelo Tosatti
2008-05-29 17:56 ` Marcelo Tosatti
2008-05-25 10:19 ` Avi Kivity
1 sibling, 2 replies; 18+ messages in thread
From: Avi Kivity @ 2008-05-25 10:18 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel
Marcelo Tosatti wrote:
> QEMU support for direct pmtimer reads. Hopefully its safe, since its a
> read-only register ?
>
> With self-disable C2 + this I'm seeing less CPU usage when idle with
> CONFIG_CPU_IDLE enabled. Quite noticeable on SMP guests. Windows XP is
> comparable to standard (never seen it consume less than 10% either way,
> usually 20-30%).
>
This is because the TPR optimization hack is disabled on smp. You need
a machine with FlexPriority to get usable smp Windows.
> On migration the destination host can either lack ACPI or have the timer
> in a different IO port, so emulation is necessary.
>
> Or luckily the pmtimer is in the same address. Since the 24-bit counter
> overflow period is only ~= 4.6 seconds, its probably worthwhile to wait
> for synchronization before restarting the guest. Not implemented though.
>
> Also simplify and fix the overflow emulation, which was happening every
> 2.3 seconds instead of the expected 4.6s.
>
> Index: kvm-userspace.tip/bios/rombios32.c
> ===================================================================
> --- kvm-userspace.tip.orig/bios/rombios32.c
> +++ kvm-userspace.tip/bios/rombios32.c
> @@ -391,7 +391,7 @@ uint8_t bios_uuid[16];
> unsigned long ebda_cur_addr;
> #endif
> int acpi_enabled;
> -uint32_t pm_io_base, smb_io_base;
> +uint32_t pm_io_base, pmtmr_base, smb_io_base;
> int pm_sci_int;
> unsigned long bios_table_cur_addr;
> unsigned long bios_table_end_addr;
> @@ -819,6 +819,12 @@ static void pci_bios_init_device(PCIDevi
> pci_config_writeb(d, PCI_INTERRUPT_LINE, 9);
>
> pm_io_base = PM_IO_BASE;
> + pmtmr_base = cmos_readb(0x60);
> + pmtmr_base |= cmos_readb(0x61) << 8;
> + pmtmr_base |= cmos_readb(0x62) << 16;
> + pmtmr_base |= cmos_readb(0x63) << 24;
> + if (!pmtmr_base)
> + pmtmr_base = pm_io_base + 0x08;
>
You're splitting the ACPI ioport range into two. I think the correct
fix here is to have qemu supply a PMBA hint to the BIOS. If the hint is
present, the bios should locate pm_io_base there, and should also avoid
placing other pio resources there.
> +static void schedule_pmtmr_sci(PIIX4PMState *s)
> +{
> + int64_t expire_time;
> + uint32_t pmtmr, left;
> +
> + if (s->direct_access)
> + qemu_kvm_get_pmtimer(&pmtmr);
> + else
> + pmtmr = get_pmtmr(s);
>
get_pmtmr() should have this logic.
> +
> + left = (1 << 24) - pmtmr;
>
The docs say that SCI is generated when bit 23 toggles, not on
overflow. See TMROF_STS PIIX4 documentation.
In any case, this should be in a separate patch.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-24 23:43 ` [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support Marcelo Tosatti
2008-05-25 10:18 ` Avi Kivity
@ 2008-05-25 10:19 ` Avi Kivity
2008-05-25 17:39 ` Marcelo Tosatti
1 sibling, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2008-05-25 10:19 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel
Marcelo Tosatti wrote:
> QEMU support for direct pmtimer reads. Hopefully its safe, since its a
> read-only register ?
>
ISTR that pmtimer can be either 24 bit or 32 bit. Can this cause a problem?
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 4/4] KVM: allow direct access to PMTimer port
2008-05-24 23:43 ` [patch 4/4] KVM: allow direct access to PMTimer port Marcelo Tosatti
2008-05-25 10:04 ` Avi Kivity
@ 2008-05-25 12:31 ` Avi Kivity
2008-05-25 16:12 ` Marcelo Tosatti
1 sibling, 1 reply; 18+ messages in thread
From: Avi Kivity @ 2008-05-25 12:31 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel, Len Brown
Marcelo Tosatti wrote:
> There's not much point in exiting for pmtimer reads, since it runs at a
> fixed clock rate and its start value is undefined.
>
On a migration farm with lots of different hardware, eventually all
guests will have migrated and none of them will be able to use the
passthrough port.
What about emulating pmtimer in the host kernel instead of userspace?
It increases the hit from 1 exit to 3 exits, but they're much faster exits.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 4/4] KVM: allow direct access to PMTimer port
2008-05-25 10:04 ` Avi Kivity
@ 2008-05-25 16:09 ` Marcelo Tosatti
0 siblings, 0 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-25 16:09 UTC (permalink / raw)
To: Avi Kivity; +Cc: Chris Wright, kvm-devel, Len Brown
On Sun, May 25, 2008 at 01:04:40PM +0300, Avi Kivity wrote:
> >@@ -424,6 +425,10 @@ static __init int svm_hardware_setup(void)
> > iopm_va = page_address(iopm_pages);
> > memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER));
> > clear_bit(0x80, iopm_va); /* allow direct access to PC debug port */
> >+ if (pmtmr_ioport)
> >+ for (r = 0; r < 4; r++)
> >+ clear_bit(pmtmr_ioport+r, iopm_va);
> >+
> >
>
> What if the port conflicts with a virtualized port?
Userspace registers the port range in QEMU even if it can be accessed
directly, so if there is a conflict it can be handled there.
> >+ if (pmtmr_ioport)
> >+ for (r = 0; r < 4; r++)
> >+ clear_bit(pmtmr_ioport+r, va);
> >+
> >
>
> Instead of duplicating the code, how about an kvm_x86_ops interface?
Yes sounds much better.
> >+ case KVM_GET_PMTIMER: {
> >+ struct kvm_pmtimer pmtmr;
> >+ r = -ENODEV;
> >+ if (!pmtmr_ioport)
> >+ goto out;
> >+ pmtmr.val = inl(pmtmr_ioport);
> >+ r = -EFAULT;
> >+ if (copy_to_user(argp, &pmtmr, sizeof pmtmr))
> >+ goto out;
> >+ r = 0;
> >+ break;
> >+ }
> > default:
> >
>
> This definitely doesn't belong in kvm. /dev/pmtimer?
Think so.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 4/4] KVM: allow direct access to PMTimer port
2008-05-25 12:31 ` Avi Kivity
@ 2008-05-25 16:12 ` Marcelo Tosatti
2008-05-26 8:03 ` Avi Kivity
0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-25 16:12 UTC (permalink / raw)
To: Avi Kivity; +Cc: Chris Wright, kvm-devel, Len Brown
On Sun, May 25, 2008 at 03:31:03PM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> >There's not much point in exiting for pmtimer reads, since it runs at a
> >fixed clock rate and its start value is undefined.
> >
>
> On a migration farm with lots of different hardware, eventually all
> guests will have migrated and none of them will be able to use the
> passthrough port.
>
> What about emulating pmtimer in the host kernel instead of userspace?
> It increases the hit from 1 exit to 3 exits, but they're much faster exits.
I agree, even began writing it, but I think the non-passthrough case
could be optimized later.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-25 10:18 ` Avi Kivity
@ 2008-05-25 16:32 ` Marcelo Tosatti
2008-05-26 8:16 ` Avi Kivity
2008-05-29 17:56 ` Marcelo Tosatti
1 sibling, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-25 16:32 UTC (permalink / raw)
To: Avi Kivity; +Cc: Chris Wright, kvm-devel
On Sun, May 25, 2008 at 01:18:46PM +0300, Avi Kivity wrote:
> > int acpi_enabled;
> >-uint32_t pm_io_base, smb_io_base;
> >+uint32_t pm_io_base, pmtmr_base, smb_io_base;
> > int pm_sci_int;
> > unsigned long bios_table_cur_addr;
> > unsigned long bios_table_end_addr;
> >@@ -819,6 +819,12 @@ static void pci_bios_init_device(PCIDevi
> > pci_config_writeb(d, PCI_INTERRUPT_LINE, 9);
> >
> > pm_io_base = PM_IO_BASE;
> >+ pmtmr_base = cmos_readb(0x60);
> >+ pmtmr_base |= cmos_readb(0x61) << 8;
> >+ pmtmr_base |= cmos_readb(0x62) << 16;
> >+ pmtmr_base |= cmos_readb(0x63) << 24;
> >+ if (!pmtmr_base)
> >+ pmtmr_base = pm_io_base + 0x08;
> >
>
> You're splitting the ACPI ioport range into two. I think the correct
> fix here is to have qemu supply a PMBA hint to the BIOS. If the hint is
> placing other pio resources there.
What is PMBA?
>From my understand ACPI supports an address for each register block, and
the PMTimer resides in a separate block. So what is the problem with
having different ACPI blocks in different ports?
Note that the GPE0 registers are in a different port range than
PM1EVT/PM1CNT/PMTimer already.
> >+static void schedule_pmtmr_sci(PIIX4PMState *s)
> >+{
> >+ int64_t expire_time;
> >+ uint32_t pmtmr, left;
> >+
> >+ if (s->direct_access)
> >+ qemu_kvm_get_pmtimer(&pmtmr);
> >+ else
> >+ pmtmr = get_pmtmr(s);
> >
>
> get_pmtmr() should have this logic.
>
> >+
> >+ left = (1 << 24) - pmtmr;
> >
>
> The docs say that SCI is generated when bit 23 toggles, not on
> overflow. See TMROF_STS PIIX4 documentation.
Yeah, misread the docs, as usual.
>
> In any case, this should be in a separate patch.
>
> --
> error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-25 10:19 ` Avi Kivity
@ 2008-05-25 17:39 ` Marcelo Tosatti
2008-05-26 8:23 ` Avi Kivity
0 siblings, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-25 17:39 UTC (permalink / raw)
To: Avi Kivity; +Cc: Chris Wright, kvm-devel
On Sun, May 25, 2008 at 01:19:30PM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
> >QEMU support for direct pmtimer reads. Hopefully its safe, since its a
> >read-only register ?
> >
>
> ISTR that pmtimer can be either 24 bit or 32 bit. Can this cause a problem?
Don't think so, since QEMU/KVM reports a 24-bit timer (tmr_val_ext field
of fadt is zero). If the host has a 32-bit timer, the higher bits are
ignored, and TMR_OF is emulated.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 4/4] KVM: allow direct access to PMTimer port
2008-05-25 16:12 ` Marcelo Tosatti
@ 2008-05-26 8:03 ` Avi Kivity
0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-05-26 8:03 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel, Len Brown
Marcelo Tosatti wrote:
> On Sun, May 25, 2008 at 03:31:03PM +0300, Avi Kivity wrote:
>
>> Marcelo Tosatti wrote:
>>
>>> There's not much point in exiting for pmtimer reads, since it runs at a
>>> fixed clock rate and its start value is undefined.
>>>
>>>
>> On a migration farm with lots of different hardware, eventually all
>> guests will have migrated and none of them will be able to use the
>> passthrough port.
>>
>> What about emulating pmtimer in the host kernel instead of userspace?
>> It increases the hit from 1 exit to 3 exits, but they're much faster exits.
>>
>
> I agree, even began writing it, but I think the non-passthrough case
> could be optimized later.
>
But, I claim above that the passthrough case will never be in use on
long-running guests due to migration.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-25 16:32 ` Marcelo Tosatti
@ 2008-05-26 8:16 ` Avi Kivity
0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-05-26 8:16 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel
Marcelo Tosatti wrote:
> On Sun, May 25, 2008 at 01:18:46PM +0300, Avi Kivity wrote:
>
>>> int acpi_enabled;
>>> -uint32_t pm_io_base, smb_io_base;
>>> +uint32_t pm_io_base, pmtmr_base, smb_io_base;
>>> int pm_sci_int;
>>> unsigned long bios_table_cur_addr;
>>> unsigned long bios_table_end_addr;
>>> @@ -819,6 +819,12 @@ static void pci_bios_init_device(PCIDevi
>>> pci_config_writeb(d, PCI_INTERRUPT_LINE, 9);
>>>
>>> pm_io_base = PM_IO_BASE;
>>> + pmtmr_base = cmos_readb(0x60);
>>> + pmtmr_base |= cmos_readb(0x61) << 8;
>>> + pmtmr_base |= cmos_readb(0x62) << 16;
>>> + pmtmr_base |= cmos_readb(0x63) << 24;
>>> + if (!pmtmr_base)
>>> + pmtmr_base = pm_io_base + 0x08;
>>>
>>>
>> You're splitting the ACPI ioport range into two. I think the correct
>> fix here is to have qemu supply a PMBA hint to the BIOS. If the hint is
>> placing other pio resources there.
>>
>
> What is PMBA?
>
>
Power Management Base Address, which must equal the value of pm_io_base
above.
> From my understand ACPI supports an address for each register block, and
> the PMTimer resides in a separate block. So what is the problem with
> having different ACPI blocks in different ports?
>
The particular chipset we emulate has all blocks in one contiguous
region starting at the PMBA.
> Note that the GPE0 registers are in a different port range than
> PM1EVT/PM1CNT/PMTimer already.
>
>
That's sucky. piix4 supports GPIO pins, we should have emulated them
instead of inventing our own.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-25 17:39 ` Marcelo Tosatti
@ 2008-05-26 8:23 ` Avi Kivity
0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-05-26 8:23 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel
Marcelo Tosatti wrote:
> On Sun, May 25, 2008 at 01:19:30PM +0300, Avi Kivity wrote:
>
>> Marcelo Tosatti wrote:
>>
>>> QEMU support for direct pmtimer reads. Hopefully its safe, since its a
>>> read-only register ?
>>>
>>>
>> ISTR that pmtimer can be either 24 bit or 32 bit. Can this cause a problem?
>>
>
> Don't think so, since QEMU/KVM reports a 24-bit timer (tmr_val_ext field
> of fadt is zero). If the host has a 32-bit timer, the higher bits are
> ignored, and TMR_OF is emulated.
>
Linux seems to mask the high bits properly, yes. Hopefully other OSes
do too.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-25 10:18 ` Avi Kivity
2008-05-25 16:32 ` Marcelo Tosatti
@ 2008-05-29 17:56 ` Marcelo Tosatti
2008-05-31 7:52 ` Avi Kivity
1 sibling, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2008-05-29 17:56 UTC (permalink / raw)
To: Avi Kivity; +Cc: Chris Wright, kvm-devel
On Sun, May 25, 2008 at 01:18:46PM +0300, Avi Kivity wrote:
> > pm_io_base = PM_IO_BASE;
> >+ pmtmr_base = cmos_readb(0x60);
> >+ pmtmr_base |= cmos_readb(0x61) << 8;
> >+ pmtmr_base |= cmos_readb(0x62) << 16;
> >+ pmtmr_base |= cmos_readb(0x63) << 24;
> >+ if (!pmtmr_base)
> >+ pmtmr_base = pm_io_base + 0x08;
> >
>
> You're splitting the ACPI ioport range into two. I think the correct
> fix here is to have qemu supply a PMBA hint to the BIOS. If the hint is
> present, the bios should locate pm_io_base there, and should also avoid
> placing other pio resources there.
PBLK_BASE (processor block) is statically defined in acpi-dsdt.dsl, and
I don't see any easy way to change that dynamically.
In practice I don't see any problem with not having the PMBA registers
in a contiguous range, since I doubt any ACPI implementation will assume
PIIX4 specific details (ACPI driver looks for the register addresses in
FADT, so does not matter if they're contiguous or not).
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support
2008-05-29 17:56 ` Marcelo Tosatti
@ 2008-05-31 7:52 ` Avi Kivity
0 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2008-05-31 7:52 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Chris Wright, kvm-devel
Marcelo Tosatti wrote:
> On Sun, May 25, 2008 at 01:18:46PM +0300, Avi Kivity wrote:
>
>>> pm_io_base = PM_IO_BASE;
>>> + pmtmr_base = cmos_readb(0x60);
>>> + pmtmr_base |= cmos_readb(0x61) << 8;
>>> + pmtmr_base |= cmos_readb(0x62) << 16;
>>> + pmtmr_base |= cmos_readb(0x63) << 24;
>>> + if (!pmtmr_base)
>>> + pmtmr_base = pm_io_base + 0x08;
>>>
>>>
>> You're splitting the ACPI ioport range into two. I think the correct
>> fix here is to have qemu supply a PMBA hint to the BIOS. If the hint is
>> present, the bios should locate pm_io_base there, and should also avoid
>> placing other pio resources there.
>>
>
> PBLK_BASE (processor block) is statically defined in acpi-dsdt.dsl, and
> I don't see any easy way to change that dynamically.
>
> In practice I don't see any problem with not having the PMBA registers
> in a contiguous range, since I doubt any ACPI implementation will assume
> PIIX4 specific details (ACPI driver looks for the register addresses in
> FADT, so does not matter if they're contiguous or not).
>
Right, it's a cleanliness issue, not correctness.
The only guest I can think of that will care about the actual layout is
linuxbios/coreboot.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2008-05-31 7:52 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-24 23:43 [patch 0/4] C2 "emulation" Marcelo Tosatti
2008-05-24 23:43 ` [patch 1/4] QEMU/KVM: self-disabling C2 emulation Marcelo Tosatti
2008-05-24 23:43 ` [patch 2/4] libkvm: KVM_GET_PMTIMER ioctl support Marcelo Tosatti
2008-05-24 23:43 ` [patch 3/4] QEMU/KVM: non-virtualized ACPI PMTimer support Marcelo Tosatti
2008-05-25 10:18 ` Avi Kivity
2008-05-25 16:32 ` Marcelo Tosatti
2008-05-26 8:16 ` Avi Kivity
2008-05-29 17:56 ` Marcelo Tosatti
2008-05-31 7:52 ` Avi Kivity
2008-05-25 10:19 ` Avi Kivity
2008-05-25 17:39 ` Marcelo Tosatti
2008-05-26 8:23 ` Avi Kivity
2008-05-24 23:43 ` [patch 4/4] KVM: allow direct access to PMTimer port Marcelo Tosatti
2008-05-25 10:04 ` Avi Kivity
2008-05-25 16:09 ` Marcelo Tosatti
2008-05-25 12:31 ` Avi Kivity
2008-05-25 16:12 ` Marcelo Tosatti
2008-05-26 8:03 ` Avi Kivity
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox