From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: [RFC PATCH] Emulate MOVBE
Date: Wed, 10 Apr 2013 11:29:42 +0200
Message-ID: <20130410112942.07dfc167@slackpad>
References: <20130409234602.GI5077@pd.tnic>
Content-Type: text/plain; charset=ISO-8859-1
Cc: kvm@vger.kernel.org, Jörg Rödel, "H. Peter Anvin", x86-ml
To: Borislav Petkov
Received: from mail.andrep.de ([217.160.17.100]:38723 "EHLO mail.andrep.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750738Ab3DJJju convert rfc822-to-8bit (ORCPT ); Wed, 10 Apr 2013 05:39:50 -0400
In-Reply-To: <20130409234602.GI5077@pd.tnic>
Sender: kvm-owner@vger.kernel.org

On Wed, 10 Apr 2013 01:46:02 +0200
Borislav Petkov wrote:

> Hi guys,
>
> so I was trying to repro tglx's bug in smpboot.c and for some reason,
> the most reliable way to trigger it was to boot a 32-bit atom smp
> guest in kvm (don't ask :)).
>
> The problem, however, was that atom wants MOVBE and qemu doesn't
> emulate it yet (except Richard's patches, which I used in order to be
> able to actually even boot a guest).
>
> However, without hw acceleration, qemu is pretty slow, and waiting for
> an smp guest to boot in sw-only emulation is not fun. So, just for
> funsies, I decided to give the MOVBE emulation a try.
>
> Patch is below; an 8-core smp atom guest boots fine, and in 6-ish
> seconds with it. :-)
>
> I know, I know, it still needs cleaning up and proper rFLAGS handling,
> but that is for later. For now, I'd very much like to hear whether
> this approach even makes sense and, if so, what should be improved.
>
> Thanks, and thanks to Andre and Jörg for their help.
>
> Not-yet-signed-off-by: Borislav Petkov

You are missing a (hackish) older patch which enables the MOVBE CPUID
bit even if the host does not have it:

--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -273,7 +273,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		cpuid_mask(&entry->ecx, 4);
 		/* we support x2apic emulation even if host does not support
 		 * it since we emulate x2apic in software */
-		entry->ecx |= F(X2APIC);
+		entry->ecx |= (F(X2APIC) | F(MOVBE));
 		break;
 	/* function 2 entries are STATEFUL. That is, repeated cpuid commands
 	 * may return different values. This forces us to get_cpu() before

> --
> diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
> index 15f960c06ff7..ae01c765cd77 100644
> --- a/arch/x86/include/asm/kvm_emulate.h
> +++ b/arch/x86/include/asm/kvm_emulate.h
> @@ -281,6 +281,7 @@ struct x86_emulate_ctxt {
>
>  	/* decode cache */
>  	u8 twobyte;
> +	u8 thirdbyte;
>  	u8 b;
>  	u8 intercept;
>  	u8 lock_prefix;
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index a335cc6cde72..0ccff339359d 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -152,6 +152,7 @@
>  #define Avx        ((u64)1 << 43) /* Advanced Vector Extensions */
>  #define Fastop     ((u64)1 << 44) /* Use opcode::u.fastop */
>  #define NoWrite    ((u64)1 << 45) /* No writeback */
> +#define ThreeByte  ((u64)1 << 46) /* Three byte opcodes 0F 38 and 0F 3A */
>
>  #define X2(x...) x, x
>  #define X3(x...) X2(x), x
> @@ -3107,6 +3108,34 @@ static int em_mov(struct x86_emulate_ctxt *ctxt)
>  	return X86EMUL_CONTINUE;
>  }
>
> +static int em_movbe(struct x86_emulate_ctxt *ctxt)
> +{
> +	char *valptr = ctxt->src.valptr;
> +
> +	switch (ctxt->op_bytes) {
> +	case 2:
> +		*(u16 *)valptr = swab16(*(u16 *)valptr);
> +		break;
> +	case 4:
> +		*(u32 *)valptr = swab32(*(u32 *)valptr);
> +
> +		/*
> +		 * clear upper dword for 32-bit operand size in 64-bit mode.
> +		 */
> +		if (ctxt->mode == X86EMUL_MODE_PROT64)
> +			*((u32 *)valptr + 1) = 0x0;
> +		break;
> +	case 8:
> +		*(u64 *)valptr = swab64(*(u64 *)valptr);
> +		break;
> +	default:
> +		return X86EMUL_PROPAGATE_FAULT;
> +	}
> +
> +	memcpy(ctxt->dst.valptr, ctxt->src.valptr, ctxt->op_bytes);
> +	return X86EMUL_CONTINUE;
> +}
> +

On 04/09/2013 05:03 PM, Borislav Petkov wrote:
>
> Note to self: this destroys the src operand but it shouldn't. Fix it
> tomorrow.

What about this one?

	*(u16 *)(ctxt->dst.valptr) = swab16(*(u16 *)(ctxt->src.valptr));

Respective versions for the other operand sizes, of course, and drop
the final memcpy.

>  static int em_cr_write(struct x86_emulate_ctxt *ctxt)
>  {
>  	if (ctxt->ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val))
> @@ -3974,7 +4003,8 @@ static const struct opcode twobyte_table[256] = {
>  	I(ImplicitOps | VendorSpecific, em_sysenter),
>  	I(ImplicitOps | Priv | VendorSpecific, em_sysexit),
>  	N, N,
> -	N, N, N, N, N, N, N, N,
> +	I(ModRM | Mov | ThreeByte | VendorSpecific, em_movbe),
> +	N, N, N, N, N, N, N,

In the real world, VendorSpecific should be replaced with something
more meaningful. It depends on KVM's intention to emulate instructions,
which is actually out of scope for a pure virtualizer. What is the
opinion of the KVM folks on this? Shall we start to emulate
instructions the host does not provide? In this particular case a
relatively simple patch fixes a real problem (starting Atom-optimized
kernels on non-Atom machines). (And if one can believe the AMD Fam16h
SWOG [1], PS4^Wfuture AMD processors have MOVBE, so it's not even
actually only one CPU anymore.)
>  	/* 0x40 - 0x4F */
>  	X16(D(DstReg | SrcMem | ModRM | Mov)),
>  	/* 0x50 - 0x5F */
> @@ -4323,6 +4353,15 @@ done_prefixes:
>  	}
>  	ctxt->d = opcode.flags;
>
> +	if (ctxt->d & ThreeByte) {
> +		ctxt->thirdbyte = insn_fetch(u8, ctxt);
> +
> +		if (ctxt->thirdbyte == 0xf0)
> +			ctxt->d |= DstReg | SrcMem;
> +		else
> +			ctxt->d |= DstMem | SrcReg;
> +	}
> +

As already agreed on IRC, we should properly check for the other
opcodes in the 0F 38/3A table. Or make it right and implement a real
sub-table.

Regards,
Andre.

[1] http://support.amd.com/us/Processor_TechDocs/52128_16h_Software_Opt_Guide.zip