public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Jan Kiszka <jan.kiszka@siemens.com>
To: Blue Swirl <blauwirbel@gmail.com>
Cc: Avi Kivity <avi@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Anthony Liguori <aliguori@us.ibm.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Gleb Natapov <gleb@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 5/8] kvmvapic: Introduce TPR access optimization for Windows guests
Date: Mon, 13 Feb 2012 11:16:53 +0100	[thread overview]
Message-ID: <4F38E315.4040107@siemens.com> (raw)
In-Reply-To: <CAAu8pHsM=ZgP2ATqXOtxhBd-QsaGwSKCPdqQ2OCf0KAG=mUXsg@mail.gmail.com>

On 2012-02-11 16:25, Blue Swirl wrote:
> On Fri, Feb 10, 2012 at 18:31, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> This enables acceleration for MMIO-based TPR registers accesses of
>> 32-bit Windows guest systems. It is mostly useful with KVM enabled,
>> either on older Intel CPUs (without flexpriority feature, can also be
>> manually disabled for testing) or any current AMD processor.
>>
>> The approach introduced here is derived from the original version of
>> qemu-kvm. It was refactored, documented, and extended by support for
>> user space APIC emulation, both with and without KVM acceleration. The
>> VMState format was kept compatible, so was the ABI to the option ROM
>> that implements the guest-side para-virtualized driver service. This
>> enables seamless migration from qemu-kvm to upstream or, one day,
>> between KVM and TCG mode.
>>
>> The basic concept goes like this:
>>  - VAPIC PV interface consisting of I/O port 0x7e and (for KVM in-kernel
>>   irqchip) a vmcall hypercall is registered
>>  - VAPIC option ROM is loaded into guest
>>  - option ROM activates TPR MMIO access reporting via port 0x7e
>>  - TPR accesses are trapped and patched in the guest to call into option
>>   ROM instead, VAPIC support is enabled
>>  - option ROM TPR helpers track state in memory and invoke hypercall to
>>   poll for pending IRQs if required
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> 
> I must say that I find the approach horrible, patching guests and ROMs
> and looking up Windows internals. Taking the same approach to extreme,
> we could for example patch Xen guest to become a KVM guest. Not that I
> object merging.

Yes, this is horrible. But there is no real better way in the absence of
hardware assisted virtualization of the TPR. I think MS is recommending
this patching approach as well.

>> diff --git a/hw/apic.c b/hw/apic.c
>> index 086c544..2ebf3ca 100644
>> --- a/hw/apic.c
>> +++ b/hw/apic.c
>> @@ -35,6 +35,10 @@
>>  #define MSI_ADDR_DEST_ID_SHIFT         12
>>  #define        MSI_ADDR_DEST_ID_MASK           0x00ffff0
>>
>> +#define SYNC_FROM_VAPIC                 0x1
>> +#define SYNC_TO_VAPIC                   0x2
>> +#define SYNC_ISR_IRR_TO_VAPIC           0x4
> 
> Enum, please.

OK.

> 
>> +
>>  static APICCommonState *local_apics[MAX_APICS + 1];
>>
>>  static void apic_set_irq(APICCommonState *s, int vector_num, int trigger_mode);
>> @@ -78,6 +82,70 @@ static inline int get_bit(uint32_t *tab, int index)
>>     return !!(tab[i] & mask);
>>  }
>>
>> +/* return -1 if no bit is set */
>> +static int get_highest_priority_int(uint32_t *tab)
>> +{
>> +    int i;
>> +    for (i = 7; i >= 0; i--) {
>> +        if (tab[i] != 0) {
>> +            return i * 32 + fls_bit(tab[i]);
>> +        }
>> +    }
>> +    return -1;
>> +}
>> +
>> +static void apic_sync_vapic(APICCommonState *s, int sync_type)
>> +{
>> +    VAPICState vapic_state;
>> +    size_t length;
>> +    off_t start;
>> +    int vector;
>> +
>> +    if (!s->vapic_paddr) {
>> +        return;
>> +    }
>> +    if (sync_type & SYNC_FROM_VAPIC) {
>> +        cpu_physical_memory_rw(s->vapic_paddr, (void *)&vapic_state,
>> +                               sizeof(vapic_state), 0);
>> +        s->tpr = vapic_state.tpr;
>> +    }
>> +    if (sync_type & (SYNC_TO_VAPIC | SYNC_ISR_IRR_TO_VAPIC)) {
>> +        start = offsetof(VAPICState, isr);
>> +        length = offsetof(VAPICState, enabled) - offsetof(VAPICState, isr);
>> +
>> +        if (sync_type & SYNC_TO_VAPIC) {
>> +            assert(qemu_cpu_is_self(s->cpu_env));
>> +
>> +            vapic_state.tpr = s->tpr;
>> +            vapic_state.enabled = 1;
>> +            start = 0;
>> +            length = sizeof(VAPICState);
>> +        }
>> +
>> +        vector = get_highest_priority_int(s->isr);
>> +        if (vector < 0) {
>> +            vector = 0;
>> +        }
>> +        vapic_state.isr = vector & 0xf0;
>> +
>> +        vapic_state.zero = 0;
>> +
>> +        vector = get_highest_priority_int(s->irr);
>> +        if (vector < 0) {
>> +            vector = 0;
>> +        }
>> +        vapic_state.irr = vector & 0xff;
>> +
>> +        cpu_physical_memory_write_rom(s->vapic_paddr + start,
>> +                                      ((void *)&vapic_state) + start, length);
> 
> This assumes that the vapic_state structure matches guest what guest
> expect without conversion. Is this true for i386 on x86_64? I didn't
> check the structure in question.

Yes, the structure in question is a packed one, stable on both guest and
host side (the guest side is 32-bit only anyway).

>> diff --git a/hw/apic_common.c b/hw/apic_common.c
>> index 588531b..1977da7 100644
>> --- a/hw/apic_common.c
>> +++ b/hw/apic_common.c
>> @@ -20,8 +20,10 @@
>>  #include "apic.h"
>>  #include "apic_internal.h"
>>  #include "trace.h"
>> +#include "kvm.h"
>>
>>  static int apic_irq_delivered;
>> +bool apic_report_tpr_access;
> 
> This should go to APICCommonState.

Nope, it is a global state, also checked in a place where the APIC is
set up, thus have no local clue about it yet and needs to pick up the
global view.

>> @@ -238,6 +275,7 @@ static int apic_init_common(SysBusDevice *dev)
>>  {
>>     APICCommonState *s = APIC_COMMON(dev);
>>     APICCommonClass *info;
>> +    static DeviceState *vapic;
>>     static int apic_no;
>>
>>     if (apic_no >= MAX_APICS) {
>> @@ -248,10 +286,29 @@ static int apic_init_common(SysBusDevice *dev)
>>     info = APIC_COMMON_GET_CLASS(s);
>>     info->init(s);
>>
>> -    sysbus_init_mmio(&s->busdev, &s->io_memory);
>> +    sysbus_init_mmio(dev, &s->io_memory);
>> +
>> +    if (!vapic && s->vapic_control & VAPIC_ENABLE_MASK) {
>> +        vapic = sysbus_create_simple("kvmvapic", -1, NULL);
>> +    }
>> +    s->vapic = vapic;
>> +    if (apic_report_tpr_access && info->enable_tpr_reporting) {
> 
> I think you should not rely on apic_report_tpr_access being in sane
> condition during class init.

It is mandatory, e.g. for CPU hotplug, as reporting needs to be
consistent accross all VCPUs. Therefore it is a static global, set to
false initially. However, you are right, we lack proper clearing of  the
access report feature on reset, not only in this variable.

>> diff --git a/hw/kvmvapic.c b/hw/kvmvapic.c
>> new file mode 100644
>> index 0000000..0c4d304
>> --- /dev/null
>> +++ b/hw/kvmvapic.c
>> @@ -0,0 +1,774 @@
>> +/*
>> + * TPR optimization for 32-bit Windows guests
>> + *
>> + * Copyright (C) 2007-2008 Qumranet Technologies
>> + * Copyright (C) 2012      Jan Kiszka, Siemens AG
>> + *
>> + * This work is licensed under the terms of the GNU GPL version 2, or
>> + * (at your option) any later version. See the COPYING file in the
>> + * top-level directory.
>> + */
>> +#include "sysemu.h"
>> +#include "cpus.h"
>> +#include "kvm.h"
>> +#include "apic_internal.h"
>> +
>> +#define APIC_DEFAULT_ADDRESS    0xfee00000
>> +
>> +#define VAPIC_IO_PORT           0x7e
>> +
>> +#define VAPIC_INACTIVE          0
>> +#define VAPIC_ACTIVE            1
>> +#define VAPIC_STANDBY           2
> 
> Enums, please.

OK.

> 
>> +
>> +#define VAPIC_CPU_SHIFT         7
>> +
>> +#define ROM_BLOCK_SIZE          512
>> +#define ROM_BLOCK_MASK          (~(ROM_BLOCK_SIZE - 1))
>> +
>> +typedef struct VAPICHandlers {
>> +    uint32_t set_tpr;
>> +    uint32_t set_tpr_eax;
>> +    uint32_t get_tpr[8];
>> +    uint32_t get_tpr_stack;
>> +} QEMU_PACKED VAPICHandlers;
>> +
>> +typedef struct GuestROMState {
>> +    char signature[8];
>> +    uint32_t vaddr;
> 
> This does not look 64 bit clean.

It's packed.

> 
>> +    uint32_t fixup_start;
>> +    uint32_t fixup_end;
>> +    uint32_t vapic_vaddr;
>> +    uint32_t vapic_size;
>> +    uint32_t vcpu_shift;
>> +    uint32_t real_tpr_addr;
>> +    VAPICHandlers up;
>> +    VAPICHandlers mp;
>> +} QEMU_PACKED GuestROMState;
> 
> Why packed, is this passed to guest directly?

It is a data field in the option ROM, see vapic_base in kvmvapic.S.

> 
>> +
>> +typedef struct VAPICROMState {
>> +    SysBusDevice busdev;
>> +    MemoryRegion io;
>> +    MemoryRegion rom;
>> +    bool rom_mapped_writable;
> 
> I'd put this later to avoid a structure hole.

Moving it after rom_state may save us a few precious bytes. Well, ok. :)

> 
>> +    uint32_t state;
>> +    uint32_t rom_state_paddr;
>> +    uint32_t rom_state_vaddr;
>> +    uint32_t vapic_paddr;
>> +    uint32_t real_tpr_addr;
>> +    GuestROMState rom_state;
>> +    size_t rom_size;
>> +} VAPICROMState;
>> +
>> +#define TPR_INSTR_IS_WRITE              0x1
>> +#define TPR_INSTR_ABS_MODRM             0x2
>> +#define TPR_INSTR_MATCH_MODRM_REG       0x4
>> +
>> +typedef struct TPRInstruction {
>> +    uint8_t opcode;
>> +    uint8_t modrm_reg;
>> +    unsigned int flags;
>> +    size_t length;
>> +    off_t addr_offset;
>> +} TPRInstruction;
> 
> Also here the order is pessimized.

Don't see the gain here, though.

>> +static int find_real_tpr_addr(VAPICROMState *s, CPUState *env)
>> +{
>> +    target_phys_addr_t paddr;
>> +    target_ulong addr;
>> +
>> +    if (s->state == VAPIC_ACTIVE) {
>> +        return 0;
>> +    }
>> +    for (addr = 0xfffff000; addr >= 0x80000000; addr -= TARGET_PAGE_SIZE) {
>> +        paddr = cpu_get_phys_page_debug(env, addr);
>> +        if (paddr != APIC_DEFAULT_ADDRESS) {
>> +            continue;
>> +        }
>> +        s->real_tpr_addr = addr + 0x80;
>> +        update_guest_rom_state(s);
>> +        return 0;
>> +    }
> 
> This loop looks odd, what should it do, probe for unused address?

Seems to deserve a comment: We have to scan for the guest's mapping of
the APIC as we enter here without a hint from an TPR accessing
instruction. So we probe the potential range, trying to find the page
that maps to that known physical address (known in the sense that
Windows does not remap the APIC physically - nor does QEMU support that
so far).

>> +static int evaluate_tpr_instruction(VAPICROMState *s, CPUState *env,
>> +                                    target_ulong *pip, int access)
>> +{
>> +    const TPRInstruction *instr;
>> +    target_ulong ip = *pip;
>> +    uint8_t opcode[2];
>> +    uint32_t real_tpr_addr;
>> +    int i;
>> +
>> +    if ((ip & 0xf0000000) != 0x80000000 && (ip & 0xf0000000) != 0xe0000000) {
> 
> The constants should be using ULL suffix because target_ulong could be
> 64 bit, though maybe this is more optimal.

target_ulong is 64-bit unconditionally on x86. I'll add this.

>> +
>> +/*
>> + * Tries to read the unique processor number from the Kernel Processor Control
>> + * Region (KPCR) of 32-bit Windows. Returns -1 if the KPCR cannot be accessed
>> + * or is considered invalid.
>> + */
> 
> Horrible hack. Is guest OS type or version checked somewhere?

This is all about hacking Windows 32-bit. And this check encodes that
even stronger. The other important binding is the expected virtual
address of the ROM mapping under Windows. I would have preferred
checking the version directly, but no one has a complete list of
supported guests and their codes.

> 
>> +static int get_kpcr_number(CPUState *env)
>> +{
>> +    struct kpcr {
>> +        uint8_t  fill1[0x1c];
>> +        uint32_t self;
>> +        uint8_t  fill2[0x31];
>> +        uint8_t  number;
>> +    } QEMU_PACKED kpcr;
> 
> KPCR. Pointers to Windows documentation would be nice.

Oops, yes.

Unfortunately, this is only an internal structure, not officially
documented by MS. However, all supported OS versions a legacy by now, no
longer changing its structure.

>> +
>> +static int patch_hypercalls(VAPICROMState *s)
>> +{
>> +    target_phys_addr_t rom_paddr = s->rom_state_paddr & ROM_BLOCK_MASK;
>> +    static uint8_t vmcall_pattern[] = {
> 
> const
> 
>> +        0xb8, 0x1, 0, 0, 0, 0xf, 0x1, 0xc1
>> +    };
>> +    static uint8_t outl_pattern[] = {
> 
> const

Yep.

>> +static void vapic_write(void *opaque, target_phys_addr_t addr, uint64_t data,
>> +                        unsigned int size)
>> +{
>> +    CPUState *env = cpu_single_env;
>> +    target_phys_addr_t rom_paddr;
>> +    VAPICROMState *s = opaque;
>> +
>> +    cpu_synchronize_state(env);
>> +
>> +    /*
>> +     * The VAPIC supports two PIO-based hypercalls, both via port 0x7E.
>> +     *  o 16-bit write access:
>> +     *    Reports the option ROM initialization to the hypervisor. Written
>> +     *    value is the offset of the state structure in the ROM.
>> +     *  o 8-bit write access:
>> +     *    Reactivates the VAPIC after a guest hibernation, i.e. after the
>> +     *    option ROM content has been re-initialized by a guest power cycle.
>> +     *  o 32-bit write access:
>> +     *    Poll for pending IRQs, considering the current VAPIC state.
>> +     */
> 
> Different operation depending on size? Interesting.

Originally not my idea, just added the third case. :)

> 
>> +    switch (size) {
>> +    case 2:
>> +        if (s->state != VAPIC_INACTIVE) {
>> +            patch_hypercalls(s);
>> +            break;
>> +        }
>> +
>> +        rom_paddr = (env->segs[R_CS].base + env->eip) & ROM_BLOCK_MASK;
>> +        s->rom_state_paddr = rom_paddr + data;
>> +
>> +        if (vapic_prepare(s) < 0) {
>> +            break;
>> +        }
>> +        s->state = VAPIC_STANDBY;
>> +        break;
>> +    case 1:
>> +        if (kvm_enabled()) {
>> +            /*
>> +             * Disable triggering instruction in ROM by writing a NOP.
>> +             *
>> +             * We cannot do this in TCG mode as the reported IP is not
>> +             * reliable.
> 
> Given the hack level of the whole, it would not be impossible to find
> the IP using search PC.

Is there a specific pre-existing service you have in mind? Otherwise,
the complexity might not be worth the gain.

Thanks for having a look,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

  reply	other threads:[~2012-02-13 10:17 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-10 18:31 [PATCH v2 0/8] uq/master: TPR access optimization for Windows guests Jan Kiszka
2012-02-10 18:31 ` [PATCH v2 1/8] kvm: Set cpu_single_env only once Jan Kiszka
2012-02-11 10:02   ` Blue Swirl
2012-02-11 10:06     ` Jan Kiszka
2012-02-11 11:25       ` Blue Swirl
2012-02-11 11:49         ` Andreas Färber
2012-02-11 12:43           ` [Qemu-devel] " Jan Kiszka
2012-02-11 13:06             ` Andreas Färber
2012-02-11 13:07               ` [Qemu-devel] " Jan Kiszka
2012-02-11 13:21                 ` Andreas Färber
2012-02-11 13:35                   ` Jan Kiszka
2012-02-11 13:59                     ` Andreas Färber
2012-02-11 14:02                       ` Jan Kiszka
2012-02-11 14:12                         ` [Qemu-devel] " Andreas Färber
2012-02-11 14:24                           ` Jan Kiszka
2012-02-11 14:49                             ` Andreas Färber
2012-02-13  8:17                           ` [Qemu-devel] " Paolo Bonzini
2012-02-11 13:54             ` Blue Swirl
2012-02-11 14:00               ` Jan Kiszka
2012-02-11 14:11                 ` [Qemu-devel] " Blue Swirl
2012-02-11 14:18                   ` Jan Kiszka
2012-02-11 14:23                     ` Blue Swirl
2012-02-11 12:41         ` Jan Kiszka
2012-02-10 18:31 ` [PATCH v2 2/8] Allow to use pause_all_vcpus from VCPU context Jan Kiszka
2012-02-11 14:16   ` Blue Swirl
2012-02-11 14:31     ` Jan Kiszka
2012-02-10 18:31 ` [PATCH v2 3/8] target-i386: Add infrastructure for reporting TPR MMIO accesses Jan Kiszka
2012-02-11 14:32   ` [Qemu-devel] " Blue Swirl
2012-02-10 18:31 ` [PATCH v2 4/8] kvmvapic: Add option ROM Jan Kiszka
2012-02-10 18:31 ` [PATCH v2 5/8] kvmvapic: Introduce TPR access optimization for Windows guests Jan Kiszka
2012-02-11 15:25   ` Blue Swirl
2012-02-13 10:16     ` Jan Kiszka [this message]
2012-02-13 18:50       ` [Qemu-devel] " Blue Swirl
2012-02-13 19:11         ` Gleb Natapov
2012-02-13 19:22         ` Jan Kiszka
2012-02-14  7:54           ` Gleb Natapov
2012-02-14  8:55             ` Jan Kiszka
2012-02-14  8:59               ` Gleb Natapov
2012-02-10 18:31 ` [PATCH v2 6/8] kvmvapic: Simplify mp/up_set_tpr Jan Kiszka
2012-02-10 18:31 ` [PATCH v2 7/8] optionsrom: Reserve space for checksum Jan Kiszka
2012-02-11 11:46   ` Andreas Färber
2012-02-11 12:45     ` Jan Kiszka
2012-02-11 12:51       ` Andreas Färber
2012-02-11 12:57         ` Jan Kiszka
2012-02-10 18:31 ` [PATCH v2 8/8] kvmvapic: Use optionrom helpers Jan Kiszka

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F38E315.4040107@siemens.com \
    --to=jan.kiszka@siemens.com \
    --cc=aliguori@us.ibm.com \
    --cc=avi@redhat.com \
    --cc=blauwirbel@gmail.com \
    --cc=gleb@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=mtosatti@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox