kvm.vger.kernel.org archive mirror
* [RFC PATCH 0/4] KVM: emulator: speed up instruction fetch
@ 2014-05-06 18:16 Paolo Bonzini
  2014-05-06 18:16 ` [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch Paolo Bonzini
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-06 18:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, bsd

This small series provides another ~6% speedup on top of Bandan's patches.
It does so by speeding up instruction fetches.  The three tricks, one for
each of the first three patches (the fourth is a cleanup), are to help the
compiler a bit by redistributing the work between do_insn_fetch_byte and
do_insn_fetch, to avoid repeated checks on the boundary of the fetch cache,
and to exploit the processor's support for unaligned accesses.
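
In outline, the fast path after the whole series boils down to one bounds
check against the fetch cache followed by a single multi-byte copy.  The
stand-alone user-space model below only illustrates that shape; the struct,
helper names and limits are simplified stand-ins, not the kernel code:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct fetch_cache {
	uint8_t data[15];	/* x86 instructions are at most 15 bytes */
	uint8_t *ptr;		/* next byte to decode */
	uint8_t *end;		/* one past the last byte fetched so far */
};

/* Stand-in for __do_insn_fetch_bytes(): top up the cache, bounded by the
 * 15-byte limit and by how many instruction bytes are actually available. */
static int refill(struct fetch_cache *fc, const uint8_t *bytes, size_t len,
		  unsigned need)
{
	unsigned cur = fc->end - fc->data;
	unsigned avail = 15 - cur;

	if (avail > len - cur)
		avail = len - cur;
	if (avail < need)
		return -1;	/* would run past the 15-byte limit */
	memcpy(fc->end, bytes + cur, avail);
	fc->end += avail;
	return 0;
}

/* Stand-in for insn_fetch()/insn_fetch_arr(): one bounds check per field,
 * then a single copy straight out of the cache (the kernel version uses an
 * unaligned load instead of memcpy for scalar fields). */
static int fetch(struct fetch_cache *fc, const uint8_t *bytes, size_t len,
		 void *dest, unsigned size)
{
	if ((unsigned)(fc->end - fc->ptr) < size &&
	    refill(fc, bytes, len, size))
		return -1;
	memcpy(dest, fc->ptr, size);
	fc->ptr += size;
	return 0;
}

int main(void)
{
	const uint8_t insn[] = { 0xb8, 0x34, 0x12 };	/* mov ax, 0x1234 */
	struct fetch_cache fc;
	uint8_t opcode;
	uint16_t imm;

	fc.ptr = fc.end = fc.data;
	if (fetch(&fc, insn, sizeof(insn), &opcode, 1) ||
	    fetch(&fc, insn, sizeof(insn), &imm, 2))
		return 1;
	printf("opcode %#x, imm %#x\n", opcode, imm);	/* little-endian */
	return 0;
}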

Before:
   819 cycles/emulated jump instruction
   1020 cycles/emulated move instruction
   1026 cycles/emulated arithmetic instruction
   1264 cycles/emulated memory load instruction
   1182 cycles/emulated memory store instruction
   1299 cycles/emulated memory RMW instruction

After:
   771 cycles/emulated jump instruction
   963 cycles/emulated move instruction
   960 cycles/emulated arithmetic instruction
   1192 cycles/emulated memory load instruction
   1110 cycles/emulated memory store instruction
   1228 cycles/emulated memory RMW instruction

Paolo Bonzini (4):
  KVM: emulate: speed up do_insn_fetch
  KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
  KVM: emulate: avoid per-byte copying in instruction fetches
  KVM: emulate: put pointers in the fetch_cache

 arch/x86/include/asm/kvm_emulate.h |  4 +-
 arch/x86/kvm/emulate.c             | 97 ++++++++++++++++++++------------------
 arch/x86/kvm/trace.h               |  6 +--
 3 files changed, 57 insertions(+), 50 deletions(-)

-- 
1.8.3.1



* [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch
  2014-05-06 18:16 [RFC PATCH 0/4] KVM: emulator: speed up instruction fetch Paolo Bonzini
@ 2014-05-06 18:16 ` Paolo Bonzini
  2014-05-07  2:30   ` Bandan Das
  2014-05-06 18:16 ` [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes Paolo Bonzini
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-06 18:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, bsd

Hoist the common case up from do_insn_fetch_byte to do_insn_fetch,
and prime the fetch_cache in x86_decode_insn.  This helps both the
compiler and the branch predictor.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/emulate.c | 67 +++++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 42820f5fdd04..c7b625bf0b5d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -703,51 +703,51 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
 }
 
 /*
- * Fetch the next byte of the instruction being emulated which is pointed to
- * by ctxt->_eip, then increment ctxt->_eip.
- *
- * Also prefetch the remaining bytes of the instruction without crossing page
+ * Prefetch the remaining bytes of the instruction without crossing page
  * boundary if they are not in fetch_cache yet.
  */
-static int do_insn_fetch_byte(struct x86_emulate_ctxt *ctxt, u8 *dest)
+static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
 {
 	struct fetch_cache *fc = &ctxt->fetch;
 	int rc;
 	int size, cur_size;
-
-	if (ctxt->_eip == fc->end) {
-		unsigned long linear;
-		struct segmented_address addr = { .seg = VCPU_SREG_CS,
-						  .ea  = ctxt->_eip };
-		cur_size = fc->end - fc->start;
-		size = min(15UL - cur_size,
-			   PAGE_SIZE - offset_in_page(ctxt->_eip));
-		rc = __linearize(ctxt, addr, size, false, true, &linear);
-		if (unlikely(rc != X86EMUL_CONTINUE))
-			return rc;
-		rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
-				      size, &ctxt->exception);
-		if (unlikely(rc != X86EMUL_CONTINUE))
-			return rc;
-		fc->end += size;
-	}
-	*dest = fc->data[ctxt->_eip - fc->start];
-	ctxt->_eip++;
+	unsigned long linear;
+
+	struct segmented_address addr = { .seg = VCPU_SREG_CS,
+					  .ea  = fc->end };
+	cur_size = fc->end - fc->start;
+	size = min(15UL - cur_size,
+		   PAGE_SIZE - offset_in_page(fc->end));
+	if (unlikely(size == 0))
+		return X86EMUL_UNHANDLEABLE;
+	rc = __linearize(ctxt, addr, size, false, true, &linear);
+	if (unlikely(rc != X86EMUL_CONTINUE))
+		return rc;
+	rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
+			      size, &ctxt->exception);
+	if (unlikely(rc != X86EMUL_CONTINUE))
+		return rc;
+	fc->end += size;
 	return X86EMUL_CONTINUE;
 }
 
 static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
-			 void *dest, unsigned size)
+			 void *__dest, unsigned size)
 {
 	int rc;
+	struct fetch_cache *fc = &ctxt->fetch;
+	u8 *dest = __dest;
+	u8 *src = &fc->data[ctxt->_eip - fc->start];
 
-	/* x86 instructions are limited to 15 bytes. */
-	if (unlikely(ctxt->_eip + size - ctxt->eip > 15))
-		return X86EMUL_UNHANDLEABLE;
 	while (size--) {
-		rc = do_insn_fetch_byte(ctxt, dest++);
-		if (rc != X86EMUL_CONTINUE)
-			return rc;
+		if (unlikely(ctxt->_eip == fc->end)) {
+			rc = do_insn_fetch_bytes(ctxt);
+			if (rc != X86EMUL_CONTINUE)
+				return rc;
+		}
+		*dest++ = *src++;
+		ctxt->_eip++;
+		continue;
 	}
 	return X86EMUL_CONTINUE;
 }
@@ -4272,6 +4272,11 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	ctxt->opcode_len = 1;
 	if (insn_len > 0)
 		memcpy(ctxt->fetch.data, insn, insn_len);
+	else {
+		rc = do_insn_fetch_bytes(ctxt);
+		if (rc != X86EMUL_CONTINUE)
+			return rc;
+	}
 
 	switch (mode) {
 	case X86EMUL_MODE_REAL:
-- 
1.8.3.1


* [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
  2014-05-06 18:16 [RFC PATCH 0/4] KVM: emulator: speed up instruction fetch Paolo Bonzini
  2014-05-06 18:16 ` [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch Paolo Bonzini
@ 2014-05-06 18:16 ` Paolo Bonzini
  2014-05-07  4:21   ` Bandan Das
  2014-05-06 18:16 ` [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches Paolo Bonzini
  2014-05-06 18:16 ` [RFC PATCH 4/4] KVM: emulate: put pointers in the fetch_cache Paolo Bonzini
  3 siblings, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-06 18:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, bsd

do_insn_fetch_bytes will only be called once in a given insn_fetch and
insn_fetch_arr, because in fact it will only be called at most twice
for any instruction and the first call is explicit in x86_decode_insn.
This observation lets us hoist the call out of the memory copying loop.
It does not buy performance, because most fetches are one byte long
anyway, but it prepares for the next patch.

The overflow check is tricky, but correct.  Because do_insn_fetch_bytes
has already been called once, we know that fc->end is at least 15.  So
it is okay to subtract the number of bytes we want to read.
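
The subtlety is easiest to see in isolation.  A minimal stand-alone snippet
(illustration only, not code from the patch) of why the subtraction form of
the check needs that guarantee:

#include <stdio.h>

int main(void)
{
	/*
	 * fc->end and ctxt->_eip are unsigned, so "end - size" wraps around
	 * when end < size, and then "eip > end - size" disagrees with the
	 * wrap-free form "eip + size > end".  The argument above is that
	 * after the priming fetch fc->end is at least 15 while size is at
	 * most 15, so the subtraction cannot wrap in the emulator.
	 */
	unsigned long eip = 2, end = 4;
	unsigned size = 8;

	printf("eip + size > end : %d\n", eip + size > end);	/* 1 */
	printf("eip > end - size : %d\n", eip > end - size);	/* 0, wrapped */
	return 0;
}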

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/emulate.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c7b625bf0b5d..886f9a88010f 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -706,7 +706,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
  * Prefetch the remaining bytes of the instruction without crossing page
  * boundary if they are not in fetch_cache yet.
  */
-static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
+static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 {
 	struct fetch_cache *fc = &ctxt->fetch;
 	int rc;
@@ -718,7 +718,14 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
 	cur_size = fc->end - fc->start;
 	size = min(15UL - cur_size,
 		   PAGE_SIZE - offset_in_page(fc->end));
-	if (unlikely(size == 0))
+
+	/*
+	 * One instruction can only straddle two pages,
+	 * and one has been loaded at the beginning of
+	 * x86_decode_insn.  So, if not enough bytes
+	 * still, we must have hit the 15-byte boundary.
+	 */
+	if (unlikely(size < op_size))
 		return X86EMUL_UNHANDLEABLE;
 	rc = __linearize(ctxt, addr, size, false, true, &linear);
 	if (unlikely(rc != X86EMUL_CONTINUE))
@@ -739,12 +746,14 @@ static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
 	u8 *dest = __dest;
 	u8 *src = &fc->data[ctxt->_eip - fc->start];
 
+	/* We have to be careful about overflow! */
+	if (unlikely(ctxt->_eip > fc->end - size)) {
+		rc != do_insn_fetch_bytes(ctxt, size);
+		if (rc != X86EMUL_CONTINNUE)
+			goto done;
+	}
+
 	while (size--) {
-		if (unlikely(ctxt->_eip == fc->end)) {
-			rc = do_insn_fetch_bytes(ctxt);
-			if (rc != X86EMUL_CONTINUE)
-				return rc;
-		}
 		*dest++ = *src++;
 		ctxt->_eip++;
 		continue;
@@ -4273,7 +4282,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	if (insn_len > 0)
 		memcpy(ctxt->fetch.data, insn, insn_len);
 	else {
-		rc = do_insn_fetch_bytes(ctxt);
+		rc = do_insn_fetch_bytes(ctxt, 1);
 		if (rc != X86EMUL_CONTINUE)
 			return rc;
 	}
-- 
1.8.3.1


* [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches
  2014-05-06 18:16 [RFC PATCH 0/4] KVM: emulator: speed up instruction fetch Paolo Bonzini
  2014-05-06 18:16 ` [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch Paolo Bonzini
  2014-05-06 18:16 ` [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes Paolo Bonzini
@ 2014-05-06 18:16 ` Paolo Bonzini
  2014-05-07  4:36   ` Bandan Das
  2014-05-06 18:16 ` [RFC PATCH 4/4] KVM: emulate: put pointers in the fetch_cache Paolo Bonzini
  3 siblings, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-06 18:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, bsd

We do not need a memory copying loop anymore in insn_fetch; we
can use a byte-aligned pointer to access instruction fields directly
from the fetch_cache.  This eliminates 40-80 cycles (corresponding to
a 5-7% improvement in performance) from each instruction.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/emulate.c | 47 ++++++++++++++++++++++-------------------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 886f9a88010f..245a2d0bfe68 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -706,7 +706,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
  * Prefetch the remaining bytes of the instruction without crossing page
  * boundary if they are not in fetch_cache yet.
  */
-static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
+static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 {
 	struct fetch_cache *fc = &ctxt->fetch;
 	int rc;
@@ -738,42 +738,39 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 	return X86EMUL_CONTINUE;
 }
 
-static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
-			 void *__dest, unsigned size)
+static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
+					       unsigned size)
 {
-	int rc;
-	struct fetch_cache *fc = &ctxt->fetch;
-	u8 *dest = __dest;
-	u8 *src = &fc->data[ctxt->_eip - fc->start];
-
 	/* We have to be careful about overflow! */
-	if (unlikely(ctxt->_eip > fc->end - size)) {
-		rc != do_insn_fetch_bytes(ctxt, size);
-		if (rc != X86EMUL_CONTINNUE)
-			goto done;
-	}
-
-	while (size--) {
-		*dest++ = *src++;
-		ctxt->_eip++;
-		continue;
-	}
-	return X86EMUL_CONTINUE;
+	if (unlikely(ctxt->_eip > ctxt->fetch.end - size))
+		return __do_insn_fetch_bytes(ctxt, size);
+	else
+		return X86EMUL_CONTINUE;
 }
 
 /* Fetch next part of the instruction being emulated. */
 #define insn_fetch(_type, _ctxt)					\
-({	unsigned long _x;						\
-	rc = do_insn_fetch(_ctxt, &_x, sizeof(_type));			\
+({	_type _x;							\
+	struct fetch_cache *_fc;					\
+									\
+	rc = do_insn_fetch_bytes(_ctxt, sizeof(_type));			\
 	if (rc != X86EMUL_CONTINUE)					\
 		goto done;						\
-	(_type)_x;							\
+	_fc = &ctxt->fetch;						\
+	_x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
+	ctxt->_eip += sizeof(_type);					\
+	_x;								\
 })
 
 #define insn_fetch_arr(_arr, _size, _ctxt)				\
-({	rc = do_insn_fetch(_ctxt, _arr, (_size));			\
+({									\
+	struct fetch_cache *_fc;					\
+	rc = do_insn_fetch_bytes(_ctxt, _size);				\
 	if (rc != X86EMUL_CONTINUE)					\
 		goto done;						\
+	_fc = &ctxt->fetch;						\
+	memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size);	\
+	ctxt->_eip += (_size);						\
 })
 
 /*
@@ -4282,7 +4279,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 	if (insn_len > 0)
 		memcpy(ctxt->fetch.data, insn, insn_len);
 	else {
-		rc = do_insn_fetch_bytes(ctxt, 1);
+		rc = __do_insn_fetch_bytes(ctxt, 1);
 		if (rc != X86EMUL_CONTINUE)
 			return rc;
 	}
-- 
1.8.3.1


* [RFC PATCH 4/4] KVM: emulate: put pointers in the fetch_cache
  2014-05-06 18:16 [RFC PATCH 0/4] KVM: emulator: speed up instruction fetch Paolo Bonzini
                   ` (2 preceding siblings ...)
  2014-05-06 18:16 ` [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches Paolo Bonzini
@ 2014-05-06 18:16 ` Paolo Bonzini
  3 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-06 18:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: kvm, bsd

This simplifies the code a bit, especially the overflow checks.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_emulate.h |  4 ++--
 arch/x86/kvm/emulate.c             | 34 +++++++++++++++-------------------
 arch/x86/kvm/trace.h               |  6 +++---
 3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index 085d68868756..2467e5e1b894 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -269,8 +269,8 @@ struct operand {
 
 struct fetch_cache {
 	u8 data[15];
-	unsigned long start;
-	unsigned long end;
+	u8 *ptr;
+	u8 *end;
 };
 
 struct read_cache {
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 245a2d0bfe68..3ab9310918a7 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -708,16 +708,15 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
  */
 static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 {
-	struct fetch_cache *fc = &ctxt->fetch;
 	int rc;
-	int size, cur_size;
+	int size;
 	unsigned long linear;
-
+	int cur_size = ctxt->fetch.end - ctxt->fetch.data;
 	struct segmented_address addr = { .seg = VCPU_SREG_CS,
-					  .ea  = fc->end };
-	cur_size = fc->end - fc->start;
-	size = min(15UL - cur_size,
-		   PAGE_SIZE - offset_in_page(fc->end));
+					   .ea = ctxt->eip + cur_size };
+
+	size = min(15UL ^ cur_size,
+		   PAGE_SIZE - offset_in_page(addr.ea));
 
 	/*
 	 * One instruction can only straddle two pages,
@@ -730,19 +729,18 @@ static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
 	rc = __linearize(ctxt, addr, size, false, true, &linear);
 	if (unlikely(rc != X86EMUL_CONTINUE))
 		return rc;
-	rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
+	rc = ctxt->ops->fetch(ctxt, linear, ctxt->fetch.end,
 			      size, &ctxt->exception);
 	if (unlikely(rc != X86EMUL_CONTINUE))
 		return rc;
-	fc->end += size;
+	ctxt->fetch.end += size;
 	return X86EMUL_CONTINUE;
 }
 
 static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
 					       unsigned size)
 {
-	/* We have to be careful about overflow! */
-	if (unlikely(ctxt->_eip > ctxt->fetch.end - size))
+	if (unlikely(ctxt->fetch.end - ctxt->fetch.ptr < size))
 		return __do_insn_fetch_bytes(ctxt, size);
 	else
 		return X86EMUL_CONTINUE;
@@ -751,26 +749,24 @@ static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
 /* Fetch next part of the instruction being emulated. */
 #define insn_fetch(_type, _ctxt)					\
 ({	_type _x;							\
-	struct fetch_cache *_fc;					\
 									\
 	rc = do_insn_fetch_bytes(_ctxt, sizeof(_type));			\
 	if (rc != X86EMUL_CONTINUE)					\
 		goto done;						\
-	_fc = &ctxt->fetch;						\
-	_x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
 	ctxt->_eip += sizeof(_type);					\
+	_x = *(_type __aligned(1) *) ctxt->fetch.ptr;			\
+	ctxt->fetch.ptr += sizeof(_type);				\
 	_x;								\
 })
 
 #define insn_fetch_arr(_arr, _size, _ctxt)				\
 ({									\
-	struct fetch_cache *_fc;					\
 	rc = do_insn_fetch_bytes(_ctxt, _size);				\
 	if (rc != X86EMUL_CONTINUE)					\
 		goto done;						\
-	_fc = &ctxt->fetch;						\
-	memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size);	\
 	ctxt->_eip += (_size);						\
+	memcpy(_arr, ctxt->fetch.ptr, _size);				\
+	ctxt->fetch.ptr += (_size);					\
 })
 
 /*
@@ -4273,8 +4269,8 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
 
 	ctxt->memop.type = OP_NONE;
 	ctxt->_eip = ctxt->eip;
-	ctxt->fetch.start = ctxt->_eip;
-	ctxt->fetch.end = ctxt->fetch.start + insn_len;
+	ctxt->fetch.ptr = ctxt->fetch.data;
+	ctxt->fetch.end = ctxt->fetch.data + insn_len;
 	ctxt->opcode_len = 1;
 	if (insn_len > 0)
 		memcpy(ctxt->fetch.data, insn, insn_len);
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 33574c95220d..e850a7d332be 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -721,10 +721,10 @@ TRACE_EVENT(kvm_emulate_insn,
 		),
 
 	TP_fast_assign(
-		__entry->rip = vcpu->arch.emulate_ctxt.fetch.start;
 		__entry->csbase = kvm_x86_ops->get_segment_base(vcpu, VCPU_SREG_CS);
-		__entry->len = vcpu->arch.emulate_ctxt._eip
-			       - vcpu->arch.emulate_ctxt.fetch.start;
+		__entry->len = vcpu->arch.emulate_ctxt.fetch.ptr
+			       - vcpu->arch.emulate_ctxt.fetch.data;
+		__entry->rip = vcpu->arch.emulate_ctxt._eip - __entry->len;
 		memcpy(__entry->insn,
 		       vcpu->arch.emulate_ctxt.fetch.data,
 		       15);
-- 
1.8.3.1


* Re: [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch
  2014-05-06 18:16 ` [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch Paolo Bonzini
@ 2014-05-07  2:30   ` Bandan Das
  2014-05-07  8:32     ` Paolo Bonzini
  0 siblings, 1 reply; 11+ messages in thread
From: Bandan Das @ 2014-05-07  2:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

Ok! Now that you posted your changes, I am getting to understand this
a little :)
 
Paolo Bonzini <pbonzini@redhat.com> writes:

> Hoist the common case up from do_insn_fetch_byte to do_insn_fetch,
> and prime the fetch_cache in x86_decode_insn.  This helps both the
> compiler and the branch predictor.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/emulate.c | 67 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 36 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 42820f5fdd04..c7b625bf0b5d 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -703,51 +703,51 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
>  }
>  
>  /*
> - * Fetch the next byte of the instruction being emulated which is pointed to
> - * by ctxt->_eip, then increment ctxt->_eip.
> - *
> - * Also prefetch the remaining bytes of the instruction without crossing page
> + * Prefetch the remaining bytes of the instruction without crossing page
>   * boundary if they are not in fetch_cache yet.
>   */
> -static int do_insn_fetch_byte(struct x86_emulate_ctxt *ctxt, u8 *dest)
> +static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
>  {
>  	struct fetch_cache *fc = &ctxt->fetch;
>  	int rc;
>  	int size, cur_size;
> -
> -	if (ctxt->_eip == fc->end) {
> -		unsigned long linear;
> -		struct segmented_address addr = { .seg = VCPU_SREG_CS,
> -						  .ea  = ctxt->_eip };
> -		cur_size = fc->end - fc->start;
> -		size = min(15UL - cur_size,
> -			   PAGE_SIZE - offset_in_page(ctxt->_eip));
> -		rc = __linearize(ctxt, addr, size, false, true, &linear);
> -		if (unlikely(rc != X86EMUL_CONTINUE))
> -			return rc;
> -		rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
> -				      size, &ctxt->exception);
> -		if (unlikely(rc != X86EMUL_CONTINUE))
> -			return rc;
> -		fc->end += size;
> -	}
> -	*dest = fc->data[ctxt->_eip - fc->start];
> -	ctxt->_eip++;
> +	unsigned long linear;
> +
> +	struct segmented_address addr = { .seg = VCPU_SREG_CS,
> +					  .ea  = fc->end };
> +	cur_size = fc->end - fc->start;
> +	size = min(15UL - cur_size,
> +		   PAGE_SIZE - offset_in_page(fc->end));
> +	if (unlikely(size == 0))
> +		return X86EMUL_UNHANDLEABLE;
> +	rc = __linearize(ctxt, addr, size, false, true, &linear);
> +	if (unlikely(rc != X86EMUL_CONTINUE))
> +		return rc;
> +	rc = ctxt->ops->fetch(ctxt, linear, fc->data + cur_size,
> +			      size, &ctxt->exception);
> +	if (unlikely(rc != X86EMUL_CONTINUE))
> +		return rc;
> +	fc->end += size;
>  	return X86EMUL_CONTINUE;
>  }
>  
>  static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
> -			 void *dest, unsigned size)
> +			 void *__dest, unsigned size)
>  {
>  	int rc;
> +	struct fetch_cache *fc = &ctxt->fetch;
> +	u8 *dest = __dest;
> +	u8 *src = &fc->data[ctxt->_eip - fc->start];
>  
> -	/* x86 instructions are limited to 15 bytes. */
> -	if (unlikely(ctxt->_eip + size - ctxt->eip > 15))
> -		return X86EMUL_UNHANDLEABLE;
>  	while (size--) {
> -		rc = do_insn_fetch_byte(ctxt, dest++);
> -		if (rc != X86EMUL_CONTINUE)
> -			return rc;
> +		if (unlikely(ctxt->_eip == fc->end)) {

Is this really going to be unlikely ?

> +			rc = do_insn_fetch_bytes(ctxt);
> +			if (rc != X86EMUL_CONTINUE)
> +				return rc;
> +		}
> +		*dest++ = *src++;
> +		ctxt->_eip++;
> +		continue;
>  	}
>  	return X86EMUL_CONTINUE;
>  }
> @@ -4272,6 +4272,11 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>  	ctxt->opcode_len = 1;
>  	if (insn_len > 0)
>  		memcpy(ctxt->fetch.data, insn, insn_len);
> +	else {
> +		rc = do_insn_fetch_bytes(ctxt);
> +		if (rc != X86EMUL_CONTINUE)
> +			return rc;
> +	}
>  
>  	switch (mode) {
>  	case X86EMUL_MODE_REAL:


* Re: [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
  2014-05-06 18:16 ` [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes Paolo Bonzini
@ 2014-05-07  4:21   ` Bandan Das
  2014-05-07  8:34     ` Paolo Bonzini
  0 siblings, 1 reply; 11+ messages in thread
From: Bandan Das @ 2014-05-07  4:21 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

Paolo Bonzini <pbonzini@redhat.com> writes:

> do_insn_fetch_bytes will only be called once in a given insn_fetch and
> insn_fetch_arr, because in fact it will only be called at most twice
> for any instruction and the first call is explicit in x86_decode_insn.
> This observation lets us hoist the call out of the memory copying loop.
> It does not buy performance, because most fetches are one byte long
> anyway, but it prepares for the next patch.
>
> The overflow check is tricky, but correct.  Because do_insn_fetch_bytes
> has already been called once, we know that fc->end is at least 15.  So
> it is okay to subtract the number of bytes we want to read.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/emulate.c | 25 +++++++++++++++++--------
>  1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index c7b625bf0b5d..886f9a88010f 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -706,7 +706,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
>   * Prefetch the remaining bytes of the instruction without crossing page
>   * boundary if they are not in fetch_cache yet.
>   */
> -static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
> +static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
>  {
>  	struct fetch_cache *fc = &ctxt->fetch;
>  	int rc;
> @@ -718,7 +718,14 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt)
>  	cur_size = fc->end - fc->start;
>  	size = min(15UL - cur_size,
>  		   PAGE_SIZE - offset_in_page(fc->end));
> -	if (unlikely(size == 0))
> +
> +	/*
> +	 * One instruction can only straddle two pages,
> +	 * and one has been loaded at the beginning of
> +	 * x86_decode_insn.  So, if not enough bytes
> +	 * still, we must have hit the 15-byte boundary.
> +	 */
> +	if (unlikely(size < op_size))
>  		return X86EMUL_UNHANDLEABLE;
>  	rc = __linearize(ctxt, addr, size, false, true, &linear);
>  	if (unlikely(rc != X86EMUL_CONTINUE))
> @@ -739,12 +746,14 @@ static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
>  	u8 *dest = __dest;
>  	u8 *src = &fc->data[ctxt->_eip - fc->start];
>  
> +	/* We have to be careful about overflow! */
> +	if (unlikely(ctxt->_eip > fc->end - size)) {
> +		rc != do_insn_fetch_bytes(ctxt, size);
This is a typo I guess ..

> +		if (rc != X86EMUL_CONTINNUE)
> +			goto done;
> +	}
> +
>  	while (size--) {
> -		if (unlikely(ctxt->_eip == fc->end)) {
> -			rc = do_insn_fetch_bytes(ctxt);
> -			if (rc != X86EMUL_CONTINUE)
> -				return rc;
> -		}
>  		*dest++ = *src++;
>  		ctxt->_eip++;
>  		continue;
> @@ -4273,7 +4282,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>  	if (insn_len > 0)
>  		memcpy(ctxt->fetch.data, insn, insn_len);
>  	else {
> -		rc = do_insn_fetch_bytes(ctxt);
> +		rc = do_insn_fetch_bytes(ctxt, 1);
Is this saying that if the cache is full, then we fetch one more byte ?

>  		if (rc != X86EMUL_CONTINUE)
>  			return rc;
>  	}


* Re: [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches
  2014-05-06 18:16 ` [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches Paolo Bonzini
@ 2014-05-07  4:36   ` Bandan Das
  2014-05-07  8:40     ` Paolo Bonzini
  0 siblings, 1 reply; 11+ messages in thread
From: Bandan Das @ 2014-05-07  4:36 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

Paolo Bonzini <pbonzini@redhat.com> writes:

> We do not need a memory copying loop anymore in insn_fetch; we
> can use a byte-aligned pointer to access instruction fields directly

Nice approach!

> from the fetch_cache.  This eliminates 40-80 cycles (corresponding to
> a 5-7% improvement in performance) from each instruction.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/emulate.c | 47 ++++++++++++++++++++++-------------------------
>  1 file changed, 22 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 886f9a88010f..245a2d0bfe68 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -706,7 +706,7 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
>   * Prefetch the remaining bytes of the instruction without crossing page
>   * boundary if they are not in fetch_cache yet.
>   */
> -static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
> +static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
>  {
>  	struct fetch_cache *fc = &ctxt->fetch;
>  	int rc;
> @@ -738,42 +738,39 @@ static int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
>  	return X86EMUL_CONTINUE;
>  }
>  
> -static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
> -			 void *__dest, unsigned size)
> +static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
> +					       unsigned size)
>  {
> -	int rc;
> -	struct fetch_cache *fc = &ctxt->fetch;
> -	u8 *dest = __dest;
> -	u8 *src = &fc->data[ctxt->_eip - fc->start];
> -
>  	/* We have to be careful about overflow! */
> -	if (unlikely(ctxt->_eip > fc->end - size)) {
> -		rc != do_insn_fetch_bytes(ctxt, size);
> -		if (rc != X86EMUL_CONTINNUE)
> -			goto done;
> -	}
> -
> -	while (size--) {
> -		*dest++ = *src++;
> -		ctxt->_eip++;
> -		continue;
> -	}
> -	return X86EMUL_CONTINUE;
> +	if (unlikely(ctxt->_eip > ctxt->fetch.end - size))
> +		return __do_insn_fetch_bytes(ctxt, size);
> +	else
> +		return X86EMUL_CONTINUE;
>  }
>  
>  /* Fetch next part of the instruction being emulated. */
>  #define insn_fetch(_type, _ctxt)					\
> -({	unsigned long _x;						\
> -	rc = do_insn_fetch(_ctxt, &_x, sizeof(_type));			\
> +({	_type _x;							\
> +	struct fetch_cache *_fc;					\
> +									\
> +	rc = do_insn_fetch_bytes(_ctxt, sizeof(_type));			\
>  	if (rc != X86EMUL_CONTINUE)					\
>  		goto done;						\
> -	(_type)_x;							\
> +	_fc = &ctxt->fetch;						\
> +	_x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
For my own understanding, how does the __aligned help here ? Wouldn't 
that result in unaligned accesses that will actually impact performance ?

> +	ctxt->_eip += sizeof(_type);					\
> +	_x;								\
>  })
>  
>  #define insn_fetch_arr(_arr, _size, _ctxt)				\
> -({	rc = do_insn_fetch(_ctxt, _arr, (_size));			\
> +({									\
> +	struct fetch_cache *_fc;					\
> +	rc = do_insn_fetch_bytes(_ctxt, _size);				\
>  	if (rc != X86EMUL_CONTINUE)					\
>  		goto done;						\
> +	_fc = &ctxt->fetch;						\
> +	memcpy(_arr, &_fc->data[ctxt->_eip - _fc->start], _size);	\
> +	ctxt->_eip += (_size);						\
>  })
>  
>  /*
> @@ -4282,7 +4279,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>  	if (insn_len > 0)
>  		memcpy(ctxt->fetch.data, insn, insn_len);
>  	else {
> -		rc = do_insn_fetch_bytes(ctxt, 1);
> +		rc = __do_insn_fetch_bytes(ctxt, 1);
>  		if (rc != X86EMUL_CONTINUE)
>  			return rc;
>  	}


* Re: [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch
  2014-05-07  2:30   ` Bandan Das
@ 2014-05-07  8:32     ` Paolo Bonzini
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-07  8:32 UTC (permalink / raw)
  To: Bandan Das; +Cc: linux-kernel, kvm

On 07/05/2014 04:30, Bandan Das wrote:
>> > +		if (unlikely(ctxt->_eip == fc->end)) {
> Is this really going to be unlikely ?
>

Yes, it happens at most once per instruction and only for instructions 
that cross pages.

Paolo


* Re: [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes
  2014-05-07  4:21   ` Bandan Das
@ 2014-05-07  8:34     ` Paolo Bonzini
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-07  8:34 UTC (permalink / raw)
  To: Bandan Das; +Cc: linux-kernel, kvm

On 07/05/2014 06:21, Bandan Das wrote:
>> > +		if (rc != X86EMUL_CONTINNUE)
>> > +			goto done;
>> > +	}
>> > +
>> >  	while (size--) {
>> > -		if (unlikely(ctxt->_eip == fc->end)) {
>> > -			rc = do_insn_fetch_bytes(ctxt);
>> > -			if (rc != X86EMUL_CONTINUE)
>> > -				return rc;
>> > -		}
>> >  		*dest++ = *src++;
>> >  		ctxt->_eip++;
>> >  		continue;
>> > @@ -4273,7 +4282,7 @@ int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len)
>> >  	if (insn_len > 0)
>> >  		memcpy(ctxt->fetch.data, insn, insn_len);
>> >  	else {
>> > -		rc = do_insn_fetch_bytes(ctxt);
>> > +		rc = do_insn_fetch_bytes(ctxt, 1);
> Is this saying that if the cache is full, then we fetch one more byte ?
>

No, it is saying that if the instruction is being executed for the first 
time (we can execute it multiple times if we reenter a repeated 
instruction after a userspace exit), we try to get at least one byte from 
RIP.  Most of the time, do_insn_fetch_bytes will fetch 15 bytes, which is 
the maximum length of an instruction.

Passing op_size == 1 matches this change in do_insn_fetch_bytes:

-	if (unlikely(size == 0))
+	if (unlikely(size < op_size))

Paolo


* Re: [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches
  2014-05-07  4:36   ` Bandan Das
@ 2014-05-07  8:40     ` Paolo Bonzini
  0 siblings, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2014-05-07  8:40 UTC (permalink / raw)
  To: Bandan Das; +Cc: linux-kernel, kvm

On 07/05/2014 06:36, Bandan Das wrote:
>> > +	_x = *(_type __aligned(1) *) &_fc->data[ctxt->_eip - _fc->start]; \
> For my own understanding, how does the __aligned help here ?

Except for 16-byte SSE accesses, x86 doesn't distinguish aligned and 
unaligned accesses.  You can read 4 bytes at 0x2345 and the processor 
will do the right thing.  Still it's better to tell the compiler what 
we're doing.

> Wouldn't
> that result in unaligned accesses that will actually impact performance ?

These accesses *can* and will in fact be unaligned.  For example, say 
you have "mov ax,0x1234" which is 0xb8 0x34 0x12.  When you read it into 
the fetch cache, you will have data[0] == 0xb8, data[1] == 0x34, data[2] 
== 0x12.  Fetching the 16-bit immediate from data[1] will then be an 
unaligned access.
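
A quick user-space rendering of that example (illustration only; memcpy
stands in for the __aligned(1) load the emulator actually uses):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
	/* "mov ax, 0x1234" as it sits in the fetch cache: b8 34 12 */
	uint8_t data[15] = { 0xb8, 0x34, 0x12 };
	uint16_t imm;

	/* The 16-bit immediate starts at data[1], an odd (unaligned)
	 * address; on x86 the load just works. */
	memcpy(&imm, &data[1], sizeof(imm));
	printf("immediate = %#x\n", imm);	/* 0x1234 on little-endian */
	return 0;
}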

Paolo



Thread overview: 11+ messages
2014-05-06 18:16 [RFC PATCH 0/4] KVM: emulator: speed up instruction fetch Paolo Bonzini
2014-05-06 18:16 ` [RFC PATCH 1/4] KVM: emulate: speed up do_insn_fetch Paolo Bonzini
2014-05-07  2:30   ` Bandan Das
2014-05-07  8:32     ` Paolo Bonzini
2014-05-06 18:16 ` [RFC PATCH 2/4] KVM: emulate: avoid repeated calls to do_insn_fetch_bytes Paolo Bonzini
2014-05-07  4:21   ` Bandan Das
2014-05-07  8:34     ` Paolo Bonzini
2014-05-06 18:16 ` [RFC PATCH 3/4] KVM: emulate: avoid per-byte copying in instruction fetches Paolo Bonzini
2014-05-07  4:36   ` Bandan Das
2014-05-07  8:40     ` Paolo Bonzini
2014-05-06 18:16 ` [RFC PATCH 4/4] KVM: emulate: put pointers in the fetch_cache Paolo Bonzini
