[PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault
@ 2022-04-01 23:37 David Matlack
  2022-04-01 23:37 ` [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David Matlack @ 2022-04-01 23:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Matlack, Ben Gardon, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

Now that the TDP MMU has a mechanism to split huge pages, it is trivial
to use it in the fault path whenever a huge page needs to be replaced
with a mapping at a lower level.

The main beneficiary of this change is NX HugePages, which now no longer
unmaps huge pages when instructions are fetched from them. This means
that the VM does not have to take faults when accessing other parts of
the huge page (for whatever reason). The other beneficiary of this
change is workloads that set eage_page_split=N, which can be desirable
for read-heavy workloads. The performance impacts are discussed in more
detail in PATCH 3.

To validate the performance impact on NX HugePages, this series
introduces a new selftest that leverages perf_test_util.c to execute
from a given region of memory across a variable number of vCPUS.

This new selftest, while only 188 lines, is largely a whittled down
copy-paste of access_tracking_perf_test.c. So there is obviously some
room for improvement and refactoring that I am procrastinating on.

Tested: Ran all kvm-unit-tests and KVM selftests.

David Matlack (3):
  KVM: selftests: Introduce a selftest to measure execution performance
  KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page()
  KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault

 arch/x86/kvm/mmu/tdp_mmu.c                    |  44 ++--
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/execute_perf_test.c | 188 ++++++++++++++++++
 .../selftests/kvm/include/perf_test_util.h    |   2 +
 .../selftests/kvm/lib/perf_test_util.c        |  25 ++-
 6 files changed, 244 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c

base-commit: d1fb6a1ca3e535f89628193ab94203533b264c8c
-- 
2.35.1.1094.g7c7d902a7c-goog

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance
  2022-04-01 23:37 [PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault David Matlack
@ 2022-04-01 23:37 ` David Matlack
  2022-04-04 18:13   ` Ben Gardon
  2022-04-01 23:37 ` [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page() David Matlack
  2022-04-01 23:37 ` [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault David Matlack
  2 siblings, 1 reply; 12+ messages in thread
From: David Matlack @ 2022-04-01 23:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Matlack, Ben Gardon, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

Introduce a new selftest, execute_perf_test, that uses the
perf_test_util framework to measure the performance of executing code
within a VM. This test is similar to the other perf_test_util-based
tests in that it spins up a variable number of vCPUs and runs them
concurrently, accessing memory.

In order to support executiong, extend perf_test_util to populate guest
memory with return instructions rather than random garbage. This way
memory can be execute simply by calling it.

Currently only x86-64 supports execution, but other architectures can be
easily added by providing their return code instruction.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../testing/selftests/kvm/execute_perf_test.c | 188 ++++++++++++++++++
 .../selftests/kvm/include/perf_test_util.h    |   2 +
 .../selftests/kvm/lib/perf_test_util.c        |  25 ++-
 5 files changed, 215 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 1f1b6c978bf7..3647ddacb103 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -56,6 +56,7 @@
 /demand_paging_test
 /dirty_log_test
 /dirty_log_perf_test
+/execute_perf_test
 /hardware_disable_test
 /kvm_create_max_vcpus
 /kvm_page_table_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index c9cdbd248727..3c67346b0766 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -92,6 +92,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += execute_perf_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
 TEST_GEN_PROGS_x86_64 += kvm_page_table_test
diff --git a/tools/testing/selftests/kvm/execute_perf_test.c b/tools/testing/selftests/kvm/execute_perf_test.c
new file mode 100644
index 000000000000..fa78facf44e7
--- /dev/null
+++ b/tools/testing/selftests/kvm/execute_perf_test.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <inttypes.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "perf_test_util.h"
+#include "guest_modes.h"
+
+/* Global variable used to synchronize all of the vCPU threads. */
+static int iteration;
+
+/* Set to true when vCPU threads should exit. */
+static bool done;
+
+/* The iteration that was last completed by each vCPU. */
+static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
+
+/* Whether to overlap the regions of memory vCPUs access. */
+static bool overlap_memory_access;
+
+struct test_params {
+	/* The backing source for the region of memory. */
+	enum vm_mem_backing_src_type backing_src;
+
+	/* The amount of memory to allocate for each vCPU. */
+	uint64_t vcpu_memory_bytes;
+
+	/* The number of vCPUs to create in the VM. */
+	int vcpus;
+};
+
+static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
+			 uint64_t expected_ucall)
+{
+	struct ucall uc;
+	uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
+
+	TEST_ASSERT(expected_ucall == actual_ucall,
+		    "Guest exited unexpectedly (expected ucall %" PRIu64
+		    ", got %" PRIu64 ")",
+		    expected_ucall, actual_ucall);
+}
+
+static bool spin_wait_for_next_iteration(int *current_iteration)
+{
+	int last_iteration = *current_iteration;
+
+	do {
+		if (READ_ONCE(done))
+			return false;
+
+		*current_iteration = READ_ONCE(iteration);
+	} while (last_iteration == *current_iteration);
+
+	return true;
+}
+
+static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
+{
+	struct kvm_vm *vm = perf_test_args.vm;
+	int vcpu_id = vcpu_args->vcpu_id;
+	int current_iteration = 0;
+
+	while (spin_wait_for_next_iteration(&current_iteration)) {
+		vcpu_run(vm, vcpu_id);
+		assert_ucall(vm, vcpu_id, UCALL_SYNC);
+		vcpu_last_completed_iteration[vcpu_id] = current_iteration;
+	}
+}
+
+static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
+{
+	while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
+	       target_iteration) {
+		continue;
+	}
+}
+
+static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
+{
+	struct timespec ts_start;
+	struct timespec ts_elapsed;
+	int next_iteration;
+	int vcpu_id;
+
+	/* Kick off the vCPUs by incrementing iteration. */
+	next_iteration = ++iteration;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts_start);
+
+	/* Wait for all vCPUs to finish the iteration. */
+	for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
+		spin_wait_for_vcpu(vcpu_id, next_iteration);
+
+	ts_elapsed = timespec_elapsed(ts_start);
+	pr_info("%-30s: %ld.%09lds\n",
+		description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
+}
+
+static void run_test(enum vm_guest_mode mode, void *arg)
+{
+	struct test_params *params = arg;
+	struct kvm_vm *vm;
+	int vcpus = params->vcpus;
+
+	vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1,
+				 params->backing_src, !overlap_memory_access);
+
+	perf_test_start_vcpu_threads(vcpus, vcpu_thread_main);
+
+	pr_info("\n");
+
+	perf_test_set_wr_fract(vm, 1);
+	run_iteration(vm, vcpus, "Populating memory");
+
+	perf_test_set_execute(vm, true);
+	run_iteration(vm, vcpus, "Executing from memory");
+
+	/* Set done to signal the vCPU threads to exit */
+	done = true;
+
+	perf_test_join_vcpu_threads(vcpus);
+	perf_test_destroy_vm(vm);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
+	       name);
+	puts("");
+	printf(" -h: Display this help message.");
+	guest_modes_help();
+	printf(" -b: specify the size of the memory region which should be\n"
+	       "     dirtied by each vCPU. e.g. 10M or 3G.\n"
+	       "     (default: 1G)\n");
+	printf(" -v: specify the number of vCPUs to run.\n");
+	printf(" -o: Overlap guest memory accesses instead of partitioning\n"
+	       "     them into a separate region of memory for each vCPU.\n");
+	backing_src_help("-s");
+	puts("");
+	exit(0);
+}
+
+int main(int argc, char *argv[])
+{
+	struct test_params params = {
+		.backing_src = DEFAULT_VM_MEM_SRC,
+		.vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
+		.vcpus = 1,
+	};
+	int opt;
+
+	guest_modes_append_default();
+
+	while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
+		switch (opt) {
+		case 'm':
+			guest_modes_cmdline(optarg);
+			break;
+		case 'b':
+			params.vcpu_memory_bytes = parse_size(optarg);
+			break;
+		case 'v':
+			params.vcpus = atoi(optarg);
+			break;
+		case 'o':
+			overlap_memory_access = true;
+			break;
+		case 's':
+			params.backing_src = parse_backing_src_type(optarg);
+			break;
+		case 'h':
+		default:
+			help(argv[0]);
+			break;
+		}
+	}
+
+	for_each_guest_mode(run_test, &params);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
index a86f953d8d36..0a5a56539aff 100644
--- a/tools/testing/selftests/kvm/include/perf_test_util.h
+++ b/tools/testing/selftests/kvm/include/perf_test_util.h
@@ -33,6 +33,7 @@ struct perf_test_args {
 	uint64_t gpa;
 	uint64_t guest_page_size;
 	int wr_fract;
+	bool execute;
 
 	struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
 };
@@ -46,6 +47,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
 void perf_test_destroy_vm(struct kvm_vm *vm);
 
 void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
+void perf_test_set_execute(struct kvm_vm *vm, bool execute);
 
 void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
 void perf_test_join_vcpu_threads(int vcpus);
diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
index 722df3a28791..1a5eb60b59da 100644
--- a/tools/testing/selftests/kvm/lib/perf_test_util.c
+++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
@@ -36,6 +36,16 @@ static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
 /* Set to true once all vCPU threads are up and running. */
 static bool all_vcpu_threads_running;
 
+/*
+ * When writing to guest memory, write the opcode for the `ret` instruction so
+ * that subsequent iteractions can exercise instruction fetch by calling the
+ * memory.
+ *
+ * NOTE: Non-x86 architectures would to use different values here to support
+ * execute.
+ */
+#define RETURN_OPCODE 0xC3
+
 /*
  * Continuously write to the first 8 bytes of each page in the
  * specified region.
@@ -58,8 +68,10 @@ static void guest_code(uint32_t vcpu_id)
 		for (i = 0; i < pages; i++) {
 			uint64_t addr = gva + (i * pta->guest_page_size);
 
-			if (i % pta->wr_fract == 0)
-				*(uint64_t *)addr = 0x0123456789ABCDEF;
+			if (pta->execute)
+				((void (*)(void)) addr)();
+			else if (i % pta->wr_fract == 0)
+				*(uint64_t *)addr = RETURN_OPCODE;
 			else
 				READ_ONCE(*(uint64_t *)addr);
 		}
@@ -198,6 +210,15 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
 	sync_global_to_guest(vm, perf_test_args);
 }
 
+void perf_test_set_execute(struct kvm_vm *vm, bool execute)
+{
+#ifndef __x86_64__
+	TEST_ASSERT(false, "Execute not supported on this architure; see RETURN_OPCODE.");
+#endif
+	perf_test_args.execute = execute;
+	sync_global_to_guest(vm, perf_test_args);
+}
+
 static void *vcpu_thread_main(void *data)
 {
 	struct vcpu_thread *vcpu = data;

base-commit: d1fb6a1ca3e535f89628193ab94203533b264c8c
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance
  2022-04-01 23:37 ` [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
@ 2022-04-04 18:13   ` Ben Gardon
  2022-04-05 20:59     ` David Matlack
  0 siblings, 1 reply; 12+ messages in thread
From: Ben Gardon @ 2022-04-04 18:13 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Fri, Apr 1, 2022 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
>
> Introduce a new selftest, execute_perf_test, that uses the
> perf_test_util framework to measure the performance of executing code
> within a VM. This test is similar to the other perf_test_util-based
> tests in that it spins up a variable number of vCPUs and runs them
> concurrently, accessing memory.
>
> In order to support executiong, extend perf_test_util to populate guest

*executing instructions in the data slot,

> memory with return instructions rather than random garbage. This way
> memory can be execute simply by calling it.

*executed

>
> Currently only x86-64 supports execution, but other architectures can be
> easily added by providing their return code instruction.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../testing/selftests/kvm/execute_perf_test.c | 188 ++++++++++++++++++
>  .../selftests/kvm/include/perf_test_util.h    |   2 +
>  .../selftests/kvm/lib/perf_test_util.c        |  25 ++-
>  5 files changed, 215 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c
>
> diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> index 1f1b6c978bf7..3647ddacb103 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -56,6 +56,7 @@
>  /demand_paging_test
>  /dirty_log_test
>  /dirty_log_perf_test
> +/execute_perf_test
>  /hardware_disable_test
>  /kvm_create_max_vcpus
>  /kvm_page_table_test
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index c9cdbd248727..3c67346b0766 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -92,6 +92,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
>  TEST_GEN_PROGS_x86_64 += demand_paging_test
>  TEST_GEN_PROGS_x86_64 += dirty_log_test
>  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
> +TEST_GEN_PROGS_x86_64 += execute_perf_test
>  TEST_GEN_PROGS_x86_64 += hardware_disable_test
>  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
>  TEST_GEN_PROGS_x86_64 += kvm_page_table_test
> diff --git a/tools/testing/selftests/kvm/execute_perf_test.c b/tools/testing/selftests/kvm/execute_perf_test.c
> new file mode 100644
> index 000000000000..fa78facf44e7
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/execute_perf_test.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <inttypes.h>
> +#include <limits.h>
> +#include <pthread.h>
> +#include <sys/mman.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +
> +#include "kvm_util.h"
> +#include "test_util.h"
> +#include "perf_test_util.h"
> +#include "guest_modes.h"
> +
> +/* Global variable used to synchronize all of the vCPU threads. */
> +static int iteration;

Should this be volatile? (same for other globals)

> +
> +/* Set to true when vCPU threads should exit. */
> +static bool done;
> +
> +/* The iteration that was last completed by each vCPU. */
> +static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
> +
> +/* Whether to overlap the regions of memory vCPUs access. */
> +static bool overlap_memory_access;

Can this be factored into the perf test util framework / test params?

> +
> +struct test_params {
> +       /* The backing source for the region of memory. */
> +       enum vm_mem_backing_src_type backing_src;
> +
> +       /* The amount of memory to allocate for each vCPU. */
> +       uint64_t vcpu_memory_bytes;
> +
> +       /* The number of vCPUs to create in the VM. */
> +       int vcpus;
> +};
> +
> +static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
> +                        uint64_t expected_ucall)
> +{
> +       struct ucall uc;
> +       uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
> +
> +       TEST_ASSERT(expected_ucall == actual_ucall,
> +                   "Guest exited unexpectedly (expected ucall %" PRIu64
> +                   ", got %" PRIu64 ")",
> +                   expected_ucall, actual_ucall);
> +}
> +
> +static bool spin_wait_for_next_iteration(int *current_iteration)
> +{
> +       int last_iteration = *current_iteration;
> +
> +       do {
> +               if (READ_ONCE(done))
> +                       return false;
> +
> +               *current_iteration = READ_ONCE(iteration);
> +       } while (last_iteration == *current_iteration);
> +
> +       return true;
> +}
> +
> +static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
> +{
> +       struct kvm_vm *vm = perf_test_args.vm;
> +       int vcpu_id = vcpu_args->vcpu_id;
> +       int current_iteration = 0;
> +
> +       while (spin_wait_for_next_iteration(&current_iteration)) {
> +               vcpu_run(vm, vcpu_id);
> +               assert_ucall(vm, vcpu_id, UCALL_SYNC);
> +               vcpu_last_completed_iteration[vcpu_id] = current_iteration;
> +       }
> +}
> +
> +static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
> +{
> +       while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
> +              target_iteration) {
> +               continue;
> +       }
> +}
> +
> +static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
> +{
> +       struct timespec ts_start;
> +       struct timespec ts_elapsed;
> +       int next_iteration;
> +       int vcpu_id;
> +
> +       /* Kick off the vCPUs by incrementing iteration. */
> +       next_iteration = ++iteration;
> +
> +       clock_gettime(CLOCK_MONOTONIC, &ts_start);
> +
> +       /* Wait for all vCPUs to finish the iteration. */
> +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
> +               spin_wait_for_vcpu(vcpu_id, next_iteration);
> +
> +       ts_elapsed = timespec_elapsed(ts_start);
> +       pr_info("%-30s: %ld.%09lds\n",
> +               description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> +}
> +
> +static void run_test(enum vm_guest_mode mode, void *arg)
> +{
> +       struct test_params *params = arg;
> +       struct kvm_vm *vm;
> +       int vcpus = params->vcpus;
> +
> +       vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1,
> +                                params->backing_src, !overlap_memory_access);
> +
> +       perf_test_start_vcpu_threads(vcpus, vcpu_thread_main);
> +
> +       pr_info("\n");
> +
> +       perf_test_set_wr_fract(vm, 1);
> +       run_iteration(vm, vcpus, "Populating memory");
> +
> +       perf_test_set_execute(vm, true);
> +       run_iteration(vm, vcpus, "Executing from memory");
> +
> +       /* Set done to signal the vCPU threads to exit */
> +       done = true;
> +
> +       perf_test_join_vcpu_threads(vcpus);
> +       perf_test_destroy_vm(vm);
> +}
> +
> +static void help(char *name)
> +{
> +       puts("");
> +       printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
> +              name);
> +       puts("");
> +       printf(" -h: Display this help message.");
> +       guest_modes_help();
> +       printf(" -b: specify the size of the memory region which should be\n"
> +              "     dirtied by each vCPU. e.g. 10M or 3G.\n"
> +              "     (default: 1G)\n");
> +       printf(" -v: specify the number of vCPUs to run.\n");
> +       printf(" -o: Overlap guest memory accesses instead of partitioning\n"
> +              "     them into a separate region of memory for each vCPU.\n");
> +       backing_src_help("-s");
> +       puts("");
> +       exit(0);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       struct test_params params = {
> +               .backing_src = DEFAULT_VM_MEM_SRC,
> +               .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
> +               .vcpus = 1,
> +       };
> +       int opt;
> +
> +       guest_modes_append_default();
> +
> +       while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> +               switch (opt) {
> +               case 'm':
> +                       guest_modes_cmdline(optarg);
> +                       break;
> +               case 'b':
> +                       params.vcpu_memory_bytes = parse_size(optarg);
> +                       break;
> +               case 'v':
> +                       params.vcpus = atoi(optarg);
> +                       break;
> +               case 'o':
> +                       overlap_memory_access = true;
> +                       break;
> +               case 's':
> +                       params.backing_src = parse_backing_src_type(optarg);
> +                       break;
> +               case 'h':
> +               default:
> +                       help(argv[0]);
> +                       break;
> +               }
> +       }
> +
> +       for_each_guest_mode(run_test, &params);
> +
> +       return 0;
> +}
> diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
> index a86f953d8d36..0a5a56539aff 100644
> --- a/tools/testing/selftests/kvm/include/perf_test_util.h
> +++ b/tools/testing/selftests/kvm/include/perf_test_util.h
> @@ -33,6 +33,7 @@ struct perf_test_args {
>         uint64_t gpa;
>         uint64_t guest_page_size;
>         int wr_fract;
> +       bool execute;
>
>         struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
>  };
> @@ -46,6 +47,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
>  void perf_test_destroy_vm(struct kvm_vm *vm);
>
>  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
> +void perf_test_set_execute(struct kvm_vm *vm, bool execute);
>
>  void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
>  void perf_test_join_vcpu_threads(int vcpus);
> diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> index 722df3a28791..1a5eb60b59da 100644
> --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> @@ -36,6 +36,16 @@ static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
>  /* Set to true once all vCPU threads are up and running. */
>  static bool all_vcpu_threads_running;
>
> +/*
> + * When writing to guest memory, write the opcode for the `ret` instruction so
> + * that subsequent iteractions can exercise instruction fetch by calling the
> + * memory.
> + *
> + * NOTE: Non-x86 architectures would to use different values here to support
> + * execute.
> + */
> +#define RETURN_OPCODE 0xC3
> +

This should be defined in an arch-specific header or surrounded by
ifdefs so that the build would fail for other archs.

>  /*
>   * Continuously write to the first 8 bytes of each page in the
>   * specified region.
> @@ -58,8 +68,10 @@ static void guest_code(uint32_t vcpu_id)
>                 for (i = 0; i < pages; i++) {
>                         uint64_t addr = gva + (i * pta->guest_page_size);
>
> -                       if (i % pta->wr_fract == 0)
> -                               *(uint64_t *)addr = 0x0123456789ABCDEF;
> +                       if (pta->execute)
> +                               ((void (*)(void)) addr)();
> +                       else if (i % pta->wr_fract == 0)
> +                               *(uint64_t *)addr = RETURN_OPCODE;

Oh interesting, you're using a write pass to set up the contents of
memory. I suppose that probably ends up being faster than memset, but
it introduces kind of a strange dependency.

>                         else
>                                 READ_ONCE(*(uint64_t *)addr);
>                 }
> @@ -198,6 +210,15 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
>         sync_global_to_guest(vm, perf_test_args);
>  }
>
> +void perf_test_set_execute(struct kvm_vm *vm, bool execute)
> +{
> +#ifndef __x86_64__
> +       TEST_ASSERT(false, "Execute not supported on this architure; see RETURN_OPCODE.");
> +#endif
> +       perf_test_args.execute = execute;
> +       sync_global_to_guest(vm, perf_test_args);
> +}
> +
>  static void *vcpu_thread_main(void *data)
>  {
>         struct vcpu_thread *vcpu = data;
>
> base-commit: d1fb6a1ca3e535f89628193ab94203533b264c8c
> --
> 2.35.1.1094.g7c7d902a7c-goog
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance
  2022-04-04 18:13   ` Ben Gardon
@ 2022-04-05 20:59     ` David Matlack
  2022-04-05 21:00       ` David Matlack
  0 siblings, 1 reply; 12+ messages in thread
From: David Matlack @ 2022-04-05 20:59 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Mon, Apr 04, 2022 at 11:13:40AM -0700, Ben Gardon wrote:
> On Fri, Apr 1, 2022 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Introduce a new selftest, execute_perf_test, that uses the
> > perf_test_util framework to measure the performance of executing code
> > within a VM. This test is similar to the other perf_test_util-based
> > tests in that it spins up a variable number of vCPUs and runs them
> > concurrently, accessing memory.
> >
> > In order to support executiong, extend perf_test_util to populate guest
> 
> *executing instructions in the data slot,
> 
> > memory with return instructions rather than random garbage. This way
> > memory can be execute simply by calling it.
> 
> *executed
> 
> >
> > Currently only x86-64 supports execution, but other architectures can be
> > easily added by providing their return code instruction.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  tools/testing/selftests/kvm/.gitignore        |   1 +
> >  tools/testing/selftests/kvm/Makefile          |   1 +
> >  .../testing/selftests/kvm/execute_perf_test.c | 188 ++++++++++++++++++
> >  .../selftests/kvm/include/perf_test_util.h    |   2 +
> >  .../selftests/kvm/lib/perf_test_util.c        |  25 ++-
> >  5 files changed, 215 insertions(+), 2 deletions(-)
> >  create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c
> >
> > diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> > index 1f1b6c978bf7..3647ddacb103 100644
> > --- a/tools/testing/selftests/kvm/.gitignore
> > +++ b/tools/testing/selftests/kvm/.gitignore
> > @@ -56,6 +56,7 @@
> >  /demand_paging_test
> >  /dirty_log_test
> >  /dirty_log_perf_test
> > +/execute_perf_test
> >  /hardware_disable_test
> >  /kvm_create_max_vcpus
> >  /kvm_page_table_test
> > diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> > index c9cdbd248727..3c67346b0766 100644
> > --- a/tools/testing/selftests/kvm/Makefile
> > +++ b/tools/testing/selftests/kvm/Makefile
> > @@ -92,6 +92,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
> >  TEST_GEN_PROGS_x86_64 += demand_paging_test
> >  TEST_GEN_PROGS_x86_64 += dirty_log_test
> >  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
> > +TEST_GEN_PROGS_x86_64 += execute_perf_test
> >  TEST_GEN_PROGS_x86_64 += hardware_disable_test
> >  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
> >  TEST_GEN_PROGS_x86_64 += kvm_page_table_test
> > diff --git a/tools/testing/selftests/kvm/execute_perf_test.c b/tools/testing/selftests/kvm/execute_perf_test.c
> > new file mode 100644
> > index 000000000000..fa78facf44e7
> > --- /dev/null
> > +++ b/tools/testing/selftests/kvm/execute_perf_test.c
> > @@ -0,0 +1,188 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <inttypes.h>
> > +#include <limits.h>
> > +#include <pthread.h>
> > +#include <sys/mman.h>
> > +#include <sys/types.h>
> > +#include <sys/stat.h>
> > +
> > +#include "kvm_util.h"
> > +#include "test_util.h"
> > +#include "perf_test_util.h"
> > +#include "guest_modes.h"
> > +
> > +/* Global variable used to synchronize all of the vCPU threads. */
> > +static int iteration;
> 
> Should this be volatile? (same for other globals)

Or atomic_t. This is a common pattern across almost all of the
perf_test_util-based tests that needs to be addressed.

> 
> > +
> > +/* Set to true when vCPU threads should exit. */
> > +static bool done;
> > +
> > +/* The iteration that was last completed by each vCPU. */
> > +static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
> > +
> > +/* Whether to overlap the regions of memory vCPUs access. */
> > +static bool overlap_memory_access;
> 
> Can this be factored into the perf test util framework / test params?

Yes. I'm planning to do a larger refactor of the perf_test_util
framework to consolidate code like this. But I want to leave that to a
separate series.

I'd be fine with deferring this test until that refactor is complete but
I don't think it's stricly necessary.

> 
> > +
> > +struct test_params {
> > +       /* The backing source for the region of memory. */
> > +       enum vm_mem_backing_src_type backing_src;
> > +
> > +       /* The amount of memory to allocate for each vCPU. */
> > +       uint64_t vcpu_memory_bytes;
> > +
> > +       /* The number of vCPUs to create in the VM. */
> > +       int vcpus;
> > +};
> > +
> > +static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
> > +                        uint64_t expected_ucall)
> > +{
> > +       struct ucall uc;
> > +       uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
> > +
> > +       TEST_ASSERT(expected_ucall == actual_ucall,
> > +                   "Guest exited unexpectedly (expected ucall %" PRIu64
> > +                   ", got %" PRIu64 ")",
> > +                   expected_ucall, actual_ucall);
> > +}
> > +
> > +static bool spin_wait_for_next_iteration(int *current_iteration)
> > +{
> > +       int last_iteration = *current_iteration;
> > +
> > +       do {
> > +               if (READ_ONCE(done))
> > +                       return false;
> > +
> > +               *current_iteration = READ_ONCE(iteration);
> > +       } while (last_iteration == *current_iteration);
> > +
> > +       return true;
> > +}
> > +
> > +static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
> > +{
> > +       struct kvm_vm *vm = perf_test_args.vm;
> > +       int vcpu_id = vcpu_args->vcpu_id;
> > +       int current_iteration = 0;
> > +
> > +       while (spin_wait_for_next_iteration(&current_iteration)) {
> > +               vcpu_run(vm, vcpu_id);
> > +               assert_ucall(vm, vcpu_id, UCALL_SYNC);
> > +               vcpu_last_completed_iteration[vcpu_id] = current_iteration;
> > +       }
> > +}
> > +
> > +static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
> > +{
> > +       while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
> > +              target_iteration) {
> > +               continue;
> > +       }
> > +}
> > +
> > +static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
> > +{
> > +       struct timespec ts_start;
> > +       struct timespec ts_elapsed;
> > +       int next_iteration;
> > +       int vcpu_id;
> > +
> > +       /* Kick off the vCPUs by incrementing iteration. */
> > +       next_iteration = ++iteration;
> > +
> > +       clock_gettime(CLOCK_MONOTONIC, &ts_start);
> > +
> > +       /* Wait for all vCPUs to finish the iteration. */
> > +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
> > +               spin_wait_for_vcpu(vcpu_id, next_iteration);
> > +
> > +       ts_elapsed = timespec_elapsed(ts_start);
> > +       pr_info("%-30s: %ld.%09lds\n",
> > +               description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > +}
> > +
> > +static void run_test(enum vm_guest_mode mode, void *arg)
> > +{
> > +       struct test_params *params = arg;
> > +       struct kvm_vm *vm;
> > +       int vcpus = params->vcpus;
> > +
> > +       vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1,
> > +                                params->backing_src, !overlap_memory_access);
> > +
> > +       perf_test_start_vcpu_threads(vcpus, vcpu_thread_main);
> > +
> > +       pr_info("\n");
> > +
> > +       perf_test_set_wr_fract(vm, 1);
> > +       run_iteration(vm, vcpus, "Populating memory");
> > +
> > +       perf_test_set_execute(vm, true);
> > +       run_iteration(vm, vcpus, "Executing from memory");
> > +
> > +       /* Set done to signal the vCPU threads to exit */
> > +       done = true;
> > +
> > +       perf_test_join_vcpu_threads(vcpus);
> > +       perf_test_destroy_vm(vm);
> > +}
> > +
> > +static void help(char *name)
> > +{
> > +       puts("");
> > +       printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
> > +              name);
> > +       puts("");
> > +       printf(" -h: Display this help message.");
> > +       guest_modes_help();
> > +       printf(" -b: specify the size of the memory region which should be\n"
> > +              "     dirtied by each vCPU. e.g. 10M or 3G.\n"
> > +              "     (default: 1G)\n");
> > +       printf(" -v: specify the number of vCPUs to run.\n");
> > +       printf(" -o: Overlap guest memory accesses instead of partitioning\n"
> > +              "     them into a separate region of memory for each vCPU.\n");
> > +       backing_src_help("-s");
> > +       puts("");
> > +       exit(0);
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +       struct test_params params = {
> > +               .backing_src = DEFAULT_VM_MEM_SRC,
> > +               .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
> > +               .vcpus = 1,
> > +       };
> > +       int opt;
> > +
> > +       guest_modes_append_default();
> > +
> > +       while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> > +               switch (opt) {
> > +               case 'm':
> > +                       guest_modes_cmdline(optarg);
> > +                       break;
> > +               case 'b':
> > +                       params.vcpu_memory_bytes = parse_size(optarg);
> > +                       break;
> > +               case 'v':
> > +                       params.vcpus = atoi(optarg);
> > +                       break;
> > +               case 'o':
> > +                       overlap_memory_access = true;
> > +                       break;
> > +               case 's':
> > +                       params.backing_src = parse_backing_src_type(optarg);
> > +                       break;
> > +               case 'h':
> > +               default:
> > +                       help(argv[0]);
> > +                       break;
> > +               }
> > +       }
> > +
> > +       for_each_guest_mode(run_test, &params);
> > +
> > +       return 0;
> > +}
> > diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
> > index a86f953d8d36..0a5a56539aff 100644
> > --- a/tools/testing/selftests/kvm/include/perf_test_util.h
> > +++ b/tools/testing/selftests/kvm/include/perf_test_util.h
> > @@ -33,6 +33,7 @@ struct perf_test_args {
> >         uint64_t gpa;
> >         uint64_t guest_page_size;
> >         int wr_fract;
> > +       bool execute;
> >
> >         struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
> >  };
> > @@ -46,6 +47,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
> >  void perf_test_destroy_vm(struct kvm_vm *vm);
> >
> >  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
> > +void perf_test_set_execute(struct kvm_vm *vm, bool execute);
> >
> >  void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
> >  void perf_test_join_vcpu_threads(int vcpus);
> > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > index 722df3a28791..1a5eb60b59da 100644
> > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > @@ -36,6 +36,16 @@ static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
> >  /* Set to true once all vCPU threads are up and running. */
> >  static bool all_vcpu_threads_running;
> >
> > +/*
> > + * When writing to guest memory, write the opcode for the `ret` instruction so
> > + * that subsequent iteractions can exercise instruction fetch by calling the
> > + * memory.
> > + *
> > + * NOTE: Non-x86 architectures would to use different values here to support
> > + * execute.
> > + */
> > +#define RETURN_OPCODE 0xC3
> > +
> 
> This should be defined in an arch-specific header or surrounded by
> ifdefs so that the build would fail for other archs.

Agreed, this should really go in an x86-specific header file. There's no
correctness issue (0xC3 works just as well as 0x0123456789ABCDEF and
non-x86 architectures are prevented from setting execute to true in
perf_test_set_execute()), but this is a lazy way to structure the code.

> 
> >  /*
> >   * Continuously write to the first 8 bytes of each page in the
> >   * specified region.
> > @@ -58,8 +68,10 @@ static void guest_code(uint32_t vcpu_id)
> >                 for (i = 0; i < pages; i++) {
> >                         uint64_t addr = gva + (i * pta->guest_page_size);
> >
> > -                       if (i % pta->wr_fract == 0)
> > -                               *(uint64_t *)addr = 0x0123456789ABCDEF;
> > +                       if (pta->execute)
> > +                               ((void (*)(void)) addr)();
> > +                       else if (i % pta->wr_fract == 0)
> > +                               *(uint64_t *)addr = RETURN_OPCODE;
> 
> Oh interesting, you're using a write pass to set up the contents of
> memory. I suppose that probably ends up being faster than memset, but
> it introduces kind of a strange dependency.

It also allows the memory to be mapped in a huge pages first so then it
can be split via NX HugePages. But I agree it's a strange dependency.
I'll have to think more about how to better structure this code.

> 
> >                         else
> >                                 READ_ONCE(*(uint64_t *)addr);
> >                 }
> > @@ -198,6 +210,15 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
> >         sync_global_to_guest(vm, perf_test_args);
> >  }
> >
> > +void perf_test_set_execute(struct kvm_vm *vm, bool execute)
> > +{
> > +#ifndef __x86_64__
> > +       TEST_ASSERT(false, "Execute not supported on this architure; see RETURN_OPCODE.");
> > +#endif
> > +       perf_test_args.execute = execute;
> > +       sync_global_to_guest(vm, perf_test_args);
> > +}
> > +
> >  static void *vcpu_thread_main(void *data)
> >  {
> >         struct vcpu_thread *vcpu = data;
> >
> > base-commit: d1fb6a1ca3e535f89628193ab94203533b264c8c
> > --
> > 2.35.1.1094.g7c7d902a7c-goog
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance
  2022-04-05 20:59     ` David Matlack
@ 2022-04-05 21:00       ` David Matlack
  0 siblings, 0 replies; 12+ messages in thread
From: David Matlack @ 2022-04-05 21:00 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Tue, Apr 05, 2022 at 08:59:01PM +0000, David Matlack wrote:
> On Mon, Apr 04, 2022 at 11:13:40AM -0700, Ben Gardon wrote:
> > On Fri, Apr 1, 2022 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
> > >
> > > Introduce a new selftest, execute_perf_test, that uses the
> > > perf_test_util framework to measure the performance of executing code
> > > within a VM. This test is similar to the other perf_test_util-based
> > > tests in that it spins up a variable number of vCPUs and runs them
> > > concurrently, accessing memory.
> > >
> > > In order to support executiong, extend perf_test_util to populate guest
> > 
> > *executing instructions in the data slot,
> > 
> > > memory with return instructions rather than random garbage. This way
> > > memory can be execute simply by calling it.
> > 
> > *executed
> > 
> > >
> > > Currently only x86-64 supports execution, but other architectures can be
> > > easily added by providing their return code instruction.
> > >
> > > Signed-off-by: David Matlack <dmatlack@google.com>
> > > ---
> > >  tools/testing/selftests/kvm/.gitignore        |   1 +
> > >  tools/testing/selftests/kvm/Makefile          |   1 +
> > >  .../testing/selftests/kvm/execute_perf_test.c | 188 ++++++++++++++++++
> > >  .../selftests/kvm/include/perf_test_util.h    |   2 +
> > >  .../selftests/kvm/lib/perf_test_util.c        |  25 ++-
> > >  5 files changed, 215 insertions(+), 2 deletions(-)
> > >  create mode 100644 tools/testing/selftests/kvm/execute_perf_test.c
> > >
> > > diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> > > index 1f1b6c978bf7..3647ddacb103 100644
> > > --- a/tools/testing/selftests/kvm/.gitignore
> > > +++ b/tools/testing/selftests/kvm/.gitignore
> > > @@ -56,6 +56,7 @@
> > >  /demand_paging_test
> > >  /dirty_log_test
> > >  /dirty_log_perf_test
> > > +/execute_perf_test
> > >  /hardware_disable_test
> > >  /kvm_create_max_vcpus
> > >  /kvm_page_table_test
> > > diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> > > index c9cdbd248727..3c67346b0766 100644
> > > --- a/tools/testing/selftests/kvm/Makefile
> > > +++ b/tools/testing/selftests/kvm/Makefile
> > > @@ -92,6 +92,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
> > >  TEST_GEN_PROGS_x86_64 += demand_paging_test
> > >  TEST_GEN_PROGS_x86_64 += dirty_log_test
> > >  TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
> > > +TEST_GEN_PROGS_x86_64 += execute_perf_test
> > >  TEST_GEN_PROGS_x86_64 += hardware_disable_test
> > >  TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
> > >  TEST_GEN_PROGS_x86_64 += kvm_page_table_test
> > > diff --git a/tools/testing/selftests/kvm/execute_perf_test.c b/tools/testing/selftests/kvm/execute_perf_test.c
> > > new file mode 100644
> > > index 000000000000..fa78facf44e7
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/kvm/execute_perf_test.c
> > > @@ -0,0 +1,188 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <inttypes.h>
> > > +#include <limits.h>
> > > +#include <pthread.h>
> > > +#include <sys/mman.h>
> > > +#include <sys/types.h>
> > > +#include <sys/stat.h>
> > > +
> > > +#include "kvm_util.h"
> > > +#include "test_util.h"
> > > +#include "perf_test_util.h"
> > > +#include "guest_modes.h"
> > > +
> > > +/* Global variable used to synchronize all of the vCPU threads. */
> > > +static int iteration;
> > 
> > Should this be volatile? (same for other globals)
> 
> Or atomic_t. This is a common pattern across almost all of the
> perf_test_util-based tests that needs to be addressed.
> 
> > 
> > > +
> > > +/* Set to true when vCPU threads should exit. */
> > > +static bool done;
> > > +
> > > +/* The iteration that was last completed by each vCPU. */
> > > +static int vcpu_last_completed_iteration[KVM_MAX_VCPUS];
> > > +
> > > +/* Whether to overlap the regions of memory vCPUs access. */
> > > +static bool overlap_memory_access;
> > 
> > Can this be factored into the perf test util framework / test params?
> 
> Yes. I'm planning to do a larger refactor of the perf_test_util
> framework to consolidate code like this. But I want to leave that to a
> separate series.
> 
> I'd be fine with deferring this test until that refactor is complete but
> I don't think it's stricly necessary.
> 
> > 
> > > +
> > > +struct test_params {
> > > +       /* The backing source for the region of memory. */
> > > +       enum vm_mem_backing_src_type backing_src;
> > > +
> > > +       /* The amount of memory to allocate for each vCPU. */
> > > +       uint64_t vcpu_memory_bytes;
> > > +
> > > +       /* The number of vCPUs to create in the VM. */
> > > +       int vcpus;
> > > +};
> > > +
> > > +static void assert_ucall(struct kvm_vm *vm, uint32_t vcpu_id,
> > > +                        uint64_t expected_ucall)
> > > +{
> > > +       struct ucall uc;
> > > +       uint64_t actual_ucall = get_ucall(vm, vcpu_id, &uc);
> > > +
> > > +       TEST_ASSERT(expected_ucall == actual_ucall,
> > > +                   "Guest exited unexpectedly (expected ucall %" PRIu64
> > > +                   ", got %" PRIu64 ")",
> > > +                   expected_ucall, actual_ucall);
> > > +}
> > > +
> > > +static bool spin_wait_for_next_iteration(int *current_iteration)
> > > +{
> > > +       int last_iteration = *current_iteration;
> > > +
> > > +       do {
> > > +               if (READ_ONCE(done))
> > > +                       return false;
> > > +
> > > +               *current_iteration = READ_ONCE(iteration);
> > > +       } while (last_iteration == *current_iteration);
> > > +
> > > +       return true;
> > > +}
> > > +
> > > +static void vcpu_thread_main(struct perf_test_vcpu_args *vcpu_args)
> > > +{
> > > +       struct kvm_vm *vm = perf_test_args.vm;
> > > +       int vcpu_id = vcpu_args->vcpu_id;
> > > +       int current_iteration = 0;
> > > +
> > > +       while (spin_wait_for_next_iteration(&current_iteration)) {
> > > +               vcpu_run(vm, vcpu_id);
> > > +               assert_ucall(vm, vcpu_id, UCALL_SYNC);
> > > +               vcpu_last_completed_iteration[vcpu_id] = current_iteration;
> > > +       }
> > > +}
> > > +
> > > +static void spin_wait_for_vcpu(int vcpu_id, int target_iteration)
> > > +{
> > > +       while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id]) !=
> > > +              target_iteration) {
> > > +               continue;
> > > +       }
> > > +}
> > > +
> > > +static void run_iteration(struct kvm_vm *vm, int vcpus, const char *description)
> > > +{
> > > +       struct timespec ts_start;
> > > +       struct timespec ts_elapsed;
> > > +       int next_iteration;
> > > +       int vcpu_id;
> > > +
> > > +       /* Kick off the vCPUs by incrementing iteration. */
> > > +       next_iteration = ++iteration;
> > > +
> > > +       clock_gettime(CLOCK_MONOTONIC, &ts_start);
> > > +
> > > +       /* Wait for all vCPUs to finish the iteration. */
> > > +       for (vcpu_id = 0; vcpu_id < vcpus; vcpu_id++)
> > > +               spin_wait_for_vcpu(vcpu_id, next_iteration);
> > > +
> > > +       ts_elapsed = timespec_elapsed(ts_start);
> > > +       pr_info("%-30s: %ld.%09lds\n",
> > > +               description, ts_elapsed.tv_sec, ts_elapsed.tv_nsec);
> > > +}
> > > +
> > > +static void run_test(enum vm_guest_mode mode, void *arg)
> > > +{
> > > +       struct test_params *params = arg;
> > > +       struct kvm_vm *vm;
> > > +       int vcpus = params->vcpus;
> > > +
> > > +       vm = perf_test_create_vm(mode, vcpus, params->vcpu_memory_bytes, 1,
> > > +                                params->backing_src, !overlap_memory_access);
> > > +
> > > +       perf_test_start_vcpu_threads(vcpus, vcpu_thread_main);
> > > +
> > > +       pr_info("\n");
> > > +
> > > +       perf_test_set_wr_fract(vm, 1);
> > > +       run_iteration(vm, vcpus, "Populating memory");
> > > +
> > > +       perf_test_set_execute(vm, true);
> > > +       run_iteration(vm, vcpus, "Executing from memory");
> > > +
> > > +       /* Set done to signal the vCPU threads to exit */
> > > +       done = true;
> > > +
> > > +       perf_test_join_vcpu_threads(vcpus);
> > > +       perf_test_destroy_vm(vm);
> > > +}
> > > +
> > > +static void help(char *name)
> > > +{
> > > +       puts("");
> > > +       printf("usage: %s [-h] [-m mode] [-b vcpu_bytes] [-v vcpus] [-o]  [-s mem_type]\n",
> > > +              name);
> > > +       puts("");
> > > +       printf(" -h: Display this help message.");
> > > +       guest_modes_help();
> > > +       printf(" -b: specify the size of the memory region which should be\n"
> > > +              "     dirtied by each vCPU. e.g. 10M or 3G.\n"
> > > +              "     (default: 1G)\n");
> > > +       printf(" -v: specify the number of vCPUs to run.\n");
> > > +       printf(" -o: Overlap guest memory accesses instead of partitioning\n"
> > > +              "     them into a separate region of memory for each vCPU.\n");
> > > +       backing_src_help("-s");
> > > +       puts("");
> > > +       exit(0);
> > > +}
> > > +
> > > +int main(int argc, char *argv[])
> > > +{
> > > +       struct test_params params = {
> > > +               .backing_src = DEFAULT_VM_MEM_SRC,
> > > +               .vcpu_memory_bytes = DEFAULT_PER_VCPU_MEM_SIZE,
> > > +               .vcpus = 1,
> > > +       };
> > > +       int opt;
> > > +
> > > +       guest_modes_append_default();
> > > +
> > > +       while ((opt = getopt(argc, argv, "hm:b:v:os:")) != -1) {
> > > +               switch (opt) {
> > > +               case 'm':
> > > +                       guest_modes_cmdline(optarg);
> > > +                       break;
> > > +               case 'b':
> > > +                       params.vcpu_memory_bytes = parse_size(optarg);
> > > +                       break;
> > > +               case 'v':
> > > +                       params.vcpus = atoi(optarg);
> > > +                       break;
> > > +               case 'o':
> > > +                       overlap_memory_access = true;
> > > +                       break;
> > > +               case 's':
> > > +                       params.backing_src = parse_backing_src_type(optarg);
> > > +                       break;
> > > +               case 'h':
> > > +               default:
> > > +                       help(argv[0]);
> > > +                       break;
> > > +               }
> > > +       }
> > > +
> > > +       for_each_guest_mode(run_test, &params);
> > > +
> > > +       return 0;
> > > +}
> > > diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h
> > > index a86f953d8d36..0a5a56539aff 100644
> > > --- a/tools/testing/selftests/kvm/include/perf_test_util.h
> > > +++ b/tools/testing/selftests/kvm/include/perf_test_util.h
> > > @@ -33,6 +33,7 @@ struct perf_test_args {
> > >         uint64_t gpa;
> > >         uint64_t guest_page_size;
> > >         int wr_fract;
> > > +       bool execute;
> > >
> > >         struct perf_test_vcpu_args vcpu_args[KVM_MAX_VCPUS];
> > >  };
> > > @@ -46,6 +47,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int vcpus,
> > >  void perf_test_destroy_vm(struct kvm_vm *vm);
> > >
> > >  void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract);
> > > +void perf_test_set_execute(struct kvm_vm *vm, bool execute);
> > >
> > >  void perf_test_start_vcpu_threads(int vcpus, void (*vcpu_fn)(struct perf_test_vcpu_args *));
> > >  void perf_test_join_vcpu_threads(int vcpus);
> > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > > index 722df3a28791..1a5eb60b59da 100644
> > > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c
> > > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c
> > > @@ -36,6 +36,16 @@ static void (*vcpu_thread_fn)(struct perf_test_vcpu_args *);
> > >  /* Set to true once all vCPU threads are up and running. */
> > >  static bool all_vcpu_threads_running;
> > >
> > > +/*
> > > + * When writing to guest memory, write the opcode for the `ret` instruction so
> > > + * that subsequent iteractions can exercise instruction fetch by calling the
> > > + * memory.
> > > + *
> > > + * NOTE: Non-x86 architectures would to use different values here to support
> > > + * execute.
> > > + */
> > > +#define RETURN_OPCODE 0xC3
> > > +
> > 
> > This should be defined in an arch-specific header or surrounded by
> > ifdefs so that the build would fail for other archs.
> 
> Agreed, this should really go in an x86-specific header file. There's no
> correctness issue (0xC3 works just as well as 0x0123456789ABCDEF and

I meant to say: 0xC3 works just as well as 0x0123456789ABCDEF *for
writes*.

> non-x86 architectures are prevented from setting execute to true in
> perf_test_set_execute()), but this is a lazy way to structure the code.
> 
> > 
> > >  /*
> > >   * Continuously write to the first 8 bytes of each page in the
> > >   * specified region.
> > > @@ -58,8 +68,10 @@ static void guest_code(uint32_t vcpu_id)
> > >                 for (i = 0; i < pages; i++) {
> > >                         uint64_t addr = gva + (i * pta->guest_page_size);
> > >
> > > -                       if (i % pta->wr_fract == 0)
> > > -                               *(uint64_t *)addr = 0x0123456789ABCDEF;
> > > +                       if (pta->execute)
> > > +                               ((void (*)(void)) addr)();
> > > +                       else if (i % pta->wr_fract == 0)
> > > +                               *(uint64_t *)addr = RETURN_OPCODE;
> > 
> > Oh interesting, you're using a write pass to set up the contents of
> > memory. I suppose that probably ends up being faster than memset, but
> > it introduces kind of a strange dependency.
> 
> It also allows the memory to be mapped in a huge pages first so then it
> can be split via NX HugePages. But I agree it's a strange dependency.
> I'll have to think more about how to better structure this code.
> 
> > 
> > >                         else
> > >                                 READ_ONCE(*(uint64_t *)addr);
> > >                 }
> > > @@ -198,6 +210,15 @@ void perf_test_set_wr_fract(struct kvm_vm *vm, int wr_fract)
> > >         sync_global_to_guest(vm, perf_test_args);
> > >  }
> > >
> > > +void perf_test_set_execute(struct kvm_vm *vm, bool execute)
> > > +{
> > > +#ifndef __x86_64__
> > > +       TEST_ASSERT(false, "Execute not supported on this architure; see RETURN_OPCODE.");
> > > +#endif
> > > +       perf_test_args.execute = execute;
> > > +       sync_global_to_guest(vm, perf_test_args);
> > > +}
> > > +
> > >  static void *vcpu_thread_main(void *data)
> > >  {
> > >         struct vcpu_thread *vcpu = data;
> > >
> > > base-commit: d1fb6a1ca3e535f89628193ab94203533b264c8c
> > > --
> > > 2.35.1.1094.g7c7d902a7c-goog
> > >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page()
  2022-04-01 23:37 [PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault David Matlack
  2022-04-01 23:37 ` [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
@ 2022-04-01 23:37 ` David Matlack
  2022-04-04 18:26   ` Ben Gardon
  2022-04-01 23:37 ` [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault David Matlack
  2 siblings, 1 reply; 12+ messages in thread
From: David Matlack @ 2022-04-01 23:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Matlack, Ben Gardon, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

In preparation for splitting huge pages during fault, pass account_nx to
tdp_mmu_split_huge_page(). Eager page splitting hard-codes account_nx to
false because the splitting is being done for dirty-logging rather than
vCPU execution faults.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d71d177ae6b8..9263765c8068 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1456,7 +1456,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 }
 
 static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
-				   struct kvm_mmu_page *sp, bool shared)
+				   struct kvm_mmu_page *sp, bool shared,
+				   bool account_nx)
 {
 	const u64 huge_spte = iter->old_spte;
 	const int level = iter->level;
@@ -1479,7 +1480,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * correctness standpoint since the translation will be the same either
 	 * way.
 	 */
-	ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
+	ret = tdp_mmu_link_sp(kvm, iter, sp, account_nx, shared);
 	if (ret)
 		goto out;
 
@@ -1539,7 +1540,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 				continue;
 		}
 
-		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
+		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared, false))
 			goto retry;
 
 		sp = NULL;
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page()
  2022-04-01 23:37 ` [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page() David Matlack
@ 2022-04-04 18:26   ` Ben Gardon
  0 siblings, 0 replies; 12+ messages in thread
From: Ben Gardon @ 2022-04-04 18:26 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Fri, Apr 1, 2022 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
>
> In preparation for splitting huge pages during fault, pass account_nx to
> tdp_mmu_split_huge_page(). Eager page splitting hard-codes account_nx to
> false because the splitting is being done for dirty-logging rather than
> vCPU execution faults.
>
> No functional change intended.
>
> Signed-off-by: David Matlack <dmatlack@google.com>

Looks good to me, but this will conflict with some other patches from
Mingwei and Sean, so someone will have to send out another version.

> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index d71d177ae6b8..9263765c8068 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1456,7 +1456,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>  }
>
>  static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> -                                  struct kvm_mmu_page *sp, bool shared)
> +                                  struct kvm_mmu_page *sp, bool shared,
> +                                  bool account_nx)
>  {
>         const u64 huge_spte = iter->old_spte;
>         const int level = iter->level;
> @@ -1479,7 +1480,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>          * correctness standpoint since the translation will be the same either
>          * way.
>          */
> -       ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
> +       ret = tdp_mmu_link_sp(kvm, iter, sp, account_nx, shared);
>         if (ret)
>                 goto out;
>
> @@ -1539,7 +1540,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>                                 continue;
>                 }
>
> -               if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
> +               if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared, false))
>                         goto retry;
>
>                 sp = NULL;
> --
> 2.35.1.1094.g7c7d902a7c-goog
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
  2022-04-01 23:37 [PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault David Matlack
  2022-04-01 23:37 ` [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
  2022-04-01 23:37 ` [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page() David Matlack
@ 2022-04-01 23:37 ` David Matlack
  2022-04-04 18:48   ` Ben Gardon
  2022-04-07 19:39   ` Sean Christopherson
  2 siblings, 2 replies; 12+ messages in thread
From: David Matlack @ 2022-04-01 23:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Matlack, Ben Gardon, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

Now that the TDP MMU has a mechanism to split huge pages, use it in the
fault path when a huge page needs to be replaced with a mapping at a
lower level.

This change reduces the negative performance impact of NX HugePages.
Prior to this change if a vCPU executed from a huge page and NX
HugePages was enabled, the vCPU would take a fault, zap the huge page,
and mapping the faulting address at 4KiB with execute permissions
enabled. The rest of the memory would be left *unmapped* and have to be
faulted back in by the guest upon access (read, write, or execute). If
guest is backed by 1GiB, a single execute instruction can zap an entire
GiB of its physical address space.

For example, it can take a VM longer to execute from its memory than to
populate that memory in the first place:

$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96

Populating memory             : 2.748378795s
Executing from memory         : 2.899670885s

With this change, such faults split the huge page instead of zapping it,
which avoids the non-present faults on the rest of the huge page:

$ ./execute_perf_test -s anonymous_hugetlb_1gb -v96

Populating memory             : 2.729544474s
Executing from memory         : 0.111965688s   <---

This change also reduces the performance impact of dirty logging when
eager_page_split=N for the same reasons as above but write faults.
eager_page_split=N (abbreviated "eps=N" below) can be desirable for
read-heavy workloads, as it avoids allocating memory to split huge pages
that are never written and avoids increasing the TLB miss cost on reads
of those pages.

             | Config: ept=Y, tdp_mmu=Y, 5% writes           |
             | Iteration 1 dirty memory time                 |
             | --------------------------------------------- |
vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
------------ | -------------- | ------------- | ------------ |
2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
96           | 3.016535213s   | 0.062608739s  | 0.044760838s |

Eager page splitting remains beneficial for write-heavy workloads, but
the gap is now reduced.

             | Config: ept=Y, tdp_mmu=Y, 100% writes         |
             | Iteration 1 dirty memory time                 |
             | --------------------------------------------- |
vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
------------ | -------------- | ------------- | ------------ |
2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
96           | 2.681123208s   | 0.673778787s  | 0.165386739s |

Further study is needed to determine if the remaining gap is acceptable
for customer workloads or if eager_page_split=N still requires a-priori
knowledge of the VM workload, especially when considering these costs
extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9263765c8068..5a2120d85347 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1131,6 +1131,10 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 	return 0;
 }
 
+static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
+					  struct tdp_iter *iter,
+					  bool account_nx);
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -1140,6 +1144,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
+	bool account_nx;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1155,28 +1160,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (iter.level == fault->goal_level)
 			break;
 
+		account_nx = fault->huge_page_disallowed &&
+			     fault->req_level >= iter.level;
+
 		/*
 		 * If there is an SPTE mapping a large page at a higher level
-		 * than the target, that SPTE must be cleared and replaced
-		 * with a non-leaf SPTE.
+		 * than the target, split it down one level.
 		 */
 		if (is_shadow_present_pte(iter.old_spte) &&
 		    is_large_pte(iter.old_spte)) {
-			if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
+			if (tdp_mmu_split_huge_page_atomic(vcpu, &iter, account_nx))
 				break;
 
-			/*
-			 * The iter must explicitly re-read the spte here
-			 * because the new value informs the !present
-			 * path below.
-			 */
-			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
+			continue;
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
-			bool account_nx = fault->huge_page_disallowed &&
-					  fault->req_level >= iter.level;
-
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -1496,6 +1495,20 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	return ret;
 }
 
+static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
+					  struct tdp_iter *iter,
+					  bool account_nx)
+{
+	struct kvm_mmu_page *sp = tdp_mmu_alloc_sp(vcpu);
+	int r;
+
+	r = tdp_mmu_split_huge_page(vcpu->kvm, iter, sp, true, account_nx);
+	if (r)
+		tdp_mmu_free_sp(sp);
+
+	return r;
+}
+
 static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 					 struct kvm_mmu_page *root,
 					 gfn_t start, gfn_t end,
-- 
2.35.1.1094.g7c7d902a7c-goog


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
  2022-04-01 23:37 ` [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault David Matlack
@ 2022-04-04 18:48   ` Ben Gardon
  2022-04-05 19:02     ` David Matlack
  2022-04-07 19:39   ` Sean Christopherson
  1 sibling, 1 reply; 12+ messages in thread
From: Ben Gardon @ 2022-04-04 18:48 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Fri, Apr 1, 2022 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
>
> Now that the TDP MMU has a mechanism to split huge pages, use it in the
> fault path when a huge page needs to be replaced with a mapping at a
> lower level.
>
> This change reduces the negative performance impact of NX HugePages.
> Prior to this change if a vCPU executed from a huge page and NX
> HugePages was enabled, the vCPU would take a fault, zap the huge page,
> and mapping the faulting address at 4KiB with execute permissions
> enabled. The rest of the memory would be left *unmapped* and have to be
> faulted back in by the guest upon access (read, write, or execute). If
> guest is backed by 1GiB, a single execute instruction can zap an entire
> GiB of its physical address space.
>
> For example, it can take a VM longer to execute from its memory than to
> populate that memory in the first place:
>
> $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
>
> Populating memory             : 2.748378795s
> Executing from memory         : 2.899670885s
>
> With this change, such faults split the huge page instead of zapping it,
> which avoids the non-present faults on the rest of the huge page:
>
> $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
>
> Populating memory             : 2.729544474s
> Executing from memory         : 0.111965688s   <---
>
> This change also reduces the performance impact of dirty logging when
> eager_page_split=N for the same reasons as above but write faults.
> eager_page_split=N (abbreviated "eps=N" below) can be desirable for
> read-heavy workloads, as it avoids allocating memory to split huge pages
> that are never written and avoids increasing the TLB miss cost on reads
> of those pages.
>
>              | Config: ept=Y, tdp_mmu=Y, 5% writes           |
>              | Iteration 1 dirty memory time                 |
>              | --------------------------------------------- |
> vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
> ------------ | -------------- | ------------- | ------------ |
> 2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
> 4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
> 8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
> 16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
> 32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
> 64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
> 96           | 3.016535213s   | 0.062608739s  | 0.044760838s |
>
> Eager page splitting remains beneficial for write-heavy workloads, but
> the gap is now reduced.
>
>              | Config: ept=Y, tdp_mmu=Y, 100% writes         |
>              | Iteration 1 dirty memory time                 |
>              | --------------------------------------------- |
> vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
> ------------ | -------------- | ------------- | ------------ |
> 2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
> 4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
> 8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
> 16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
> 32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
> 64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
> 96           | 2.681123208s   | 0.673778787s  | 0.165386739s |
>
> Further study is needed to determine if the remaining gap is acceptable
> for customer workloads or if eager_page_split=N still requires a-priori
> knowledge of the VM workload, especially when considering these costs
> extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++------------
>  1 file changed, 25 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9263765c8068..5a2120d85347 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1131,6 +1131,10 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>         return 0;
>  }
>
> +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> +                                         struct tdp_iter *iter,
> +                                         bool account_nx);
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1140,6 +1144,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
> +       bool account_nx;
>         int ret;
>
>         kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1155,28 +1160,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                 if (iter.level == fault->goal_level)
>                         break;
>
> +               account_nx = fault->huge_page_disallowed &&
> +                            fault->req_level >= iter.level;
> +
>                 /*
>                  * If there is an SPTE mapping a large page at a higher level
> -                * than the target, that SPTE must be cleared and replaced
> -                * with a non-leaf SPTE.
> +                * than the target, split it down one level.
>                  */
>                 if (is_shadow_present_pte(iter.old_spte) &&
>                     is_large_pte(iter.old_spte)) {
> -                       if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> +                       if (tdp_mmu_split_huge_page_atomic(vcpu, &iter, account_nx))
>                                 break;

I don't think we necessarily want to break here, as splitting a 1G
page would require two splits.

...

Oh tdp_mmu_split_huge_page_atomic returns non-zero to indicate an
error and if everything works we will split again. In the case of
failure, should we fall back to zapping?


>
> -                       /*
> -                        * The iter must explicitly re-read the spte here
> -                        * because the new value informs the !present
> -                        * path below.
> -                        */
> -                       iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> +                       continue;
>                 }
>
>                 if (!is_shadow_present_pte(iter.old_spte)) {
> -                       bool account_nx = fault->huge_page_disallowed &&
> -                                         fault->req_level >= iter.level;
> -
>                         /*
>                          * If SPTE has been frozen by another thread, just
>                          * give up and retry, avoiding unnecessary page table
> @@ -1496,6 +1495,20 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>         return ret;
>  }
>
> +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> +                                         struct tdp_iter *iter,
> +                                         bool account_nx)
> +{
> +       struct kvm_mmu_page *sp = tdp_mmu_alloc_sp(vcpu);
> +       int r;
> +
> +       r = tdp_mmu_split_huge_page(vcpu->kvm, iter, sp, true, account_nx);
> +       if (r)
> +               tdp_mmu_free_sp(sp);
> +
> +       return r;
> +}
> +
>  static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>                                          struct kvm_mmu_page *root,
>                                          gfn_t start, gfn_t end,
> --
> 2.35.1.1094.g7c7d902a7c-goog
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
  2022-04-04 18:48   ` Ben Gardon
@ 2022-04-05 19:02     ` David Matlack
  0 siblings, 0 replies; 12+ messages in thread
From: David Matlack @ 2022-04-05 19:02 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Mon, Apr 04, 2022 at 11:48:46AM -0700, Ben Gardon wrote:
> On Fri, Apr 1, 2022 at 4:37 PM David Matlack <dmatlack@google.com> wrote:
> >
> > Now that the TDP MMU has a mechanism to split huge pages, use it in the
> > fault path when a huge page needs to be replaced with a mapping at a
> > lower level.
> >
> > This change reduces the negative performance impact of NX HugePages.
> > Prior to this change if a vCPU executed from a huge page and NX
> > HugePages was enabled, the vCPU would take a fault, zap the huge page,
> > and mapping the faulting address at 4KiB with execute permissions
> > enabled. The rest of the memory would be left *unmapped* and have to be
> > faulted back in by the guest upon access (read, write, or execute). If
> > guest is backed by 1GiB, a single execute instruction can zap an entire
> > GiB of its physical address space.
> >
> > For example, it can take a VM longer to execute from its memory than to
> > populate that memory in the first place:
> >
> > $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
> >
> > Populating memory             : 2.748378795s
> > Executing from memory         : 2.899670885s
> >
> > With this change, such faults split the huge page instead of zapping it,
> > which avoids the non-present faults on the rest of the huge page:
> >
> > $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
> >
> > Populating memory             : 2.729544474s
> > Executing from memory         : 0.111965688s   <---
> >
> > This change also reduces the performance impact of dirty logging when
> > eager_page_split=N for the same reasons as above but write faults.
> > eager_page_split=N (abbreviated "eps=N" below) can be desirable for
> > read-heavy workloads, as it avoids allocating memory to split huge pages
> > that are never written and avoids increasing the TLB miss cost on reads
> > of those pages.
> >
> >              | Config: ept=Y, tdp_mmu=Y, 5% writes           |
> >              | Iteration 1 dirty memory time                 |
> >              | --------------------------------------------- |
> > vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
> > ------------ | -------------- | ------------- | ------------ |
> > 2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
> > 4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
> > 8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
> > 16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
> > 32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
> > 64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
> > 96           | 3.016535213s   | 0.062608739s  | 0.044760838s |
> >
> > Eager page splitting remains beneficial for write-heavy workloads, but
> > the gap is now reduced.
> >
> >              | Config: ept=Y, tdp_mmu=Y, 100% writes         |
> >              | Iteration 1 dirty memory time                 |
> >              | --------------------------------------------- |
> > vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
> > ------------ | -------------- | ------------- | ------------ |
> > 2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
> > 4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
> > 8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
> > 16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
> > 32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
> > 64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
> > 96           | 2.681123208s   | 0.673778787s  | 0.165386739s |
> >
> > Further study is needed to determine if the remaining gap is acceptable
> > for customer workloads or if eager_page_split=N still requires a-priori
> > knowledge of the VM workload, especially when considering these costs
> > extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
> >
> > Signed-off-by: David Matlack <dmatlack@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++------------
> >  1 file changed, 25 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9263765c8068..5a2120d85347 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1131,6 +1131,10 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> >         return 0;
> >  }
> >
> > +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> > +                                         struct tdp_iter *iter,
> > +                                         bool account_nx);
> > +
> >  /*
> >   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> >   * page tables and SPTEs to translate the faulting guest physical address.
> > @@ -1140,6 +1144,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >         struct kvm_mmu *mmu = vcpu->arch.mmu;
> >         struct tdp_iter iter;
> >         struct kvm_mmu_page *sp;
> > +       bool account_nx;
> >         int ret;
> >
> >         kvm_mmu_hugepage_adjust(vcpu, fault);
> > @@ -1155,28 +1160,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                 if (iter.level == fault->goal_level)
> >                         break;
> >
> > +               account_nx = fault->huge_page_disallowed &&
> > +                            fault->req_level >= iter.level;
> > +
> >                 /*
> >                  * If there is an SPTE mapping a large page at a higher level
> > -                * than the target, that SPTE must be cleared and replaced
> > -                * with a non-leaf SPTE.
> > +                * than the target, split it down one level.
> >                  */
> >                 if (is_shadow_present_pte(iter.old_spte) &&
> >                     is_large_pte(iter.old_spte)) {
> > -                       if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> > +                       if (tdp_mmu_split_huge_page_atomic(vcpu, &iter, account_nx))
> >                                 break;
> 
> I don't think we necessarily want to break here, as splitting a 1G
> page would require two splits.
> 
> ...
> 
> Oh tdp_mmu_split_huge_page_atomic returns non-zero to indicate an
> error and if everything works we will split again. In the case of
> failure, should we fall back to zapping?

The only way for tdp_mmu_split_huge_page_atomic() to fail is if
tdp_mmu_set_spte_atomic() fails (i.e. the huge page SPTE is frozen or
being concurrently modified). Breaking here means we go back into the
guest and retry the access.

I don't think we should fall back to zapping:

 - If the SPTE is frozen, zapping will also fail.
 - Otherwise, the SPTE is being modified by another CPU. It'd be a waste
   to immediately zap that CPU's work. e.g. Maybe another CPU just split
   this huge page for us :).

> 
> 
> >
> > -                       /*
> > -                        * The iter must explicitly re-read the spte here
> > -                        * because the new value informs the !present
> > -                        * path below.
> > -                        */
> > -                       iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> > +                       continue;
> >                 }
> >
> >                 if (!is_shadow_present_pte(iter.old_spte)) {
> > -                       bool account_nx = fault->huge_page_disallowed &&
> > -                                         fault->req_level >= iter.level;
> > -
> >                         /*
> >                          * If SPTE has been frozen by another thread, just
> >                          * give up and retry, avoiding unnecessary page table
> > @@ -1496,6 +1495,20 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> >         return ret;
> >  }
> >
> > +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> > +                                         struct tdp_iter *iter,
> > +                                         bool account_nx)
> > +{
> > +       struct kvm_mmu_page *sp = tdp_mmu_alloc_sp(vcpu);
> > +       int r;
> > +
> > +       r = tdp_mmu_split_huge_page(vcpu->kvm, iter, sp, true, account_nx);
> > +       if (r)
> > +               tdp_mmu_free_sp(sp);
> > +
> > +       return r;
> > +}
> > +
> >  static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> >                                          struct kvm_mmu_page *root,
> >                                          gfn_t start, gfn_t end,
> > --
> > 2.35.1.1094.g7c7d902a7c-goog
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
  2022-04-01 23:37 ` [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault David Matlack
  2022-04-04 18:48   ` Ben Gardon
@ 2022-04-07 19:39   ` Sean Christopherson
  2022-04-07 20:52     ` David Matlack
  1 sibling, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2022-04-07 19:39 UTC (permalink / raw)
  To: David Matlack
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Ben Gardon, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Fri, Apr 01, 2022, David Matlack wrote:
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++------------
>  1 file changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9263765c8068..5a2120d85347 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1131,6 +1131,10 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  	return 0;
>  }
>  
> +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> +					  struct tdp_iter *iter,
> +					  bool account_nx);
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1140,6 +1144,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	struct tdp_iter iter;
>  	struct kvm_mmu_page *sp;
> +	bool account_nx;
>  	int ret;
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1155,28 +1160,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		if (iter.level == fault->goal_level)
>  			break;
>  
> +		account_nx = fault->huge_page_disallowed &&
> +			     fault->req_level >= iter.level;
> +
>  		/*
>  		 * If there is an SPTE mapping a large page at a higher level
> -		 * than the target, that SPTE must be cleared and replaced
> -		 * with a non-leaf SPTE.
> +		 * than the target, split it down one level.
>  		 */
>  		if (is_shadow_present_pte(iter.old_spte) &&
>  		    is_large_pte(iter.old_spte)) {
> -			if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> +			if (tdp_mmu_split_huge_page_atomic(vcpu, &iter, account_nx))

As Ben brought up in patch 2, this conflicts in nasty ways with Mingwei's series
to more preciesly check sp->lpage_disallowed.  There's apparently a bug in that
code when using shadow paging, but assuming said bug isn't a blocking issue, I'd
prefer to land this on top of Mingwei's series.

With a bit of massaging, I think we can make the whole thing a bit more
straightforward.  This is what I ended up with (compile tested only, your patch 2
dropped, might split moving the "init" to a prep patch).   I'll give it a spin,
and assuming it works and Mingwei's bug is resolved, I'll post this and Mingwei's
series as a single thing.

---
 arch/x86/kvm/mmu/tdp_mmu.c | 99 ++++++++++++++++++--------------------
 1 file changed, 48 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f046af20f3d6..b0abf14570ea 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1126,6 +1126,9 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 	return 0;
 }

+static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
+				   struct kvm_mmu_page *sp, bool shared);
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -1136,7 +1139,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	int ret;
+	bool account_nx;
+	int ret, r;

 	kvm_mmu_hugepage_adjust(vcpu, fault);

@@ -1151,57 +1155,50 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (iter.level == fault->goal_level)
 			break;

-		/*
-		 * If there is an SPTE mapping a large page at a higher level
-		 * than the target, that SPTE must be cleared and replaced
-		 * with a non-leaf SPTE.
-		 */
+		/* Nothing to do if there's already a shadow page installed. */
 		if (is_shadow_present_pte(iter.old_spte) &&
-		    is_large_pte(iter.old_spte)) {
-			if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
-				break;
-
-			/*
-			 * The iter must explicitly re-read the spte here
-			 * because the new value informs the !present
-			 * path below.
-			 */
-			iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
+		    !is_large_pte(iter.old_spte))
+			continue;
+
+		/*
+		 * If the SPTE has been frozen by another thread, just give up
+		 * and retry to avoid unnecessary page table alloc and free.
+		 */
+		if (is_removed_spte(iter.old_spte))
+			break;
+
+		/*
+		 * The SPTE is either invalid or points at a huge page that
+		 * needs to be split.
+		 */
+		sp = tdp_mmu_alloc_sp(vcpu);
+		tdp_mmu_init_child_sp(sp, &iter);
+
+		account_nx = fault->huge_page_disallowed &&
+			     fault->req_level >= iter.level;
+
+		sp->lpage_disallowed = account_nx;
+		/*
+		 * Ensure lpage_disallowed is visible before the SP is marked
+		 * present (or not-huge), as mmu_lock is held for read.  Pairs
+		 * with the smp_rmb() in disallowed_hugepage_adjust().
+		 */
+		smp_wmb();
+
+		if (!is_shadow_present_pte(iter.old_spte))
+			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+		else
+			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
+
+		if (r) {
+			tdp_mmu_free_sp(sp);
+			break;
 		}

-		if (!is_shadow_present_pte(iter.old_spte)) {
-			bool account_nx = fault->huge_page_disallowed &&
-					  fault->req_level >= iter.level;
-
-			/*
-			 * If SPTE has been frozen by another thread, just
-			 * give up and retry, avoiding unnecessary page table
-			 * allocation and free.
-			 */
-			if (is_removed_spte(iter.old_spte))
-				break;
-
-			sp = tdp_mmu_alloc_sp(vcpu);
-			tdp_mmu_init_child_sp(sp, &iter);
-
-			sp->lpage_disallowed = account_nx;
-			/*
-			 * Ensure lpage_disallowed is visible before the SP is
-			 * marked present, as mmu_lock is held for read.  Pairs
-			 * with the smp_rmb() in disallowed_hugepage_adjust().
-			 */
-			smp_wmb();
-
-			if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
-				tdp_mmu_free_sp(sp);
-				break;
-			}
-
-			if (account_nx) {
-				spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-				__account_huge_nx_page(kvm, sp);
-				spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
-			}
+		if (account_nx) {
+			spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+			__account_huge_nx_page(kvm, sp);
+			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 		}
 	}

@@ -1472,8 +1469,6 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	const int level = iter->level;
 	int ret, i;

-	tdp_mmu_init_child_sp(sp, iter);
-
 	/*
 	 * No need for atomics when writing to sp->spt since the page table has
 	 * not been linked in yet and thus is not reachable from any other CPU.
@@ -1549,6 +1544,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 				continue;
 		}

+		tdp_mmu_init_child_sp(sp, &iter);
+
 		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
 			goto retry;


base-commit: f06d9d4f3d89912c40c57da45d64b9827d8580ac
--


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault
  2022-04-07 19:39   ` Sean Christopherson
@ 2022-04-07 20:52     ` David Matlack
  0 siblings, 0 replies; 12+ messages in thread
From: David Matlack @ 2022-04-07 20:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Ben Gardon, Zhenzhong Duan,
	open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86), Peter Xu

On Thu, Apr 7, 2022 at 12:39 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Fri, Apr 01, 2022, David Matlack wrote:
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++++++++------------
> >  1 file changed, 25 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9263765c8068..5a2120d85347 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1131,6 +1131,10 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> >       return 0;
> >  }
> >
> > +static int tdp_mmu_split_huge_page_atomic(struct kvm_vcpu *vcpu,
> > +                                       struct tdp_iter *iter,
> > +                                       bool account_nx);
> > +
> >  /*
> >   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
> >   * page tables and SPTEs to translate the faulting guest physical address.
> > @@ -1140,6 +1144,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >       struct kvm_mmu *mmu = vcpu->arch.mmu;
> >       struct tdp_iter iter;
> >       struct kvm_mmu_page *sp;
> > +     bool account_nx;
> >       int ret;
> >
> >       kvm_mmu_hugepage_adjust(vcpu, fault);
> > @@ -1155,28 +1160,22 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >               if (iter.level == fault->goal_level)
> >                       break;
> >
> > +             account_nx = fault->huge_page_disallowed &&
> > +                          fault->req_level >= iter.level;
> > +
> >               /*
> >                * If there is an SPTE mapping a large page at a higher level
> > -              * than the target, that SPTE must be cleared and replaced
> > -              * with a non-leaf SPTE.
> > +              * than the target, split it down one level.
> >                */
> >               if (is_shadow_present_pte(iter.old_spte) &&
> >                   is_large_pte(iter.old_spte)) {
> > -                     if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> > +                     if (tdp_mmu_split_huge_page_atomic(vcpu, &iter, account_nx))
>
> As Ben brought up in patch 2, this conflicts in nasty ways with Mingwei's series
> to more preciesly check sp->lpage_disallowed.  There's apparently a bug in that
> code when using shadow paging, but assuming said bug isn't a blocking issue, I'd
> prefer to land this on top of Mingwei's series.
>
> With a bit of massaging, I think we can make the whole thing a bit more
> straightforward.  This is what I ended up with (compile tested only, your patch 2
> dropped, might split moving the "init" to a prep patch).   I'll give it a spin,
> and assuming it works and Mingwei's bug is resolved, I'll post this and Mingwei's
> series as a single thing.

Sean and I spoke offline. I'll wait for Mingwei to send another
version of his patches and then send a v2 of my series on top of that.

>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 99 ++++++++++++++++++--------------------
>  1 file changed, 48 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f046af20f3d6..b0abf14570ea 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1126,6 +1126,9 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>         return 0;
>  }
>
> +static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
> +                                  struct kvm_mmu_page *sp, bool shared);
> +
>  /*
>   * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>   * page tables and SPTEs to translate the faulting guest physical address.
> @@ -1136,7 +1139,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>         struct kvm *kvm = vcpu->kvm;
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
> -       int ret;
> +       bool account_nx;
> +       int ret, r;
>
>         kvm_mmu_hugepage_adjust(vcpu, fault);
>
> @@ -1151,57 +1155,50 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                 if (iter.level == fault->goal_level)
>                         break;
>
> -               /*
> -                * If there is an SPTE mapping a large page at a higher level
> -                * than the target, that SPTE must be cleared and replaced
> -                * with a non-leaf SPTE.
> -                */
> +               /* Nothing to do if there's already a shadow page installed. */
>                 if (is_shadow_present_pte(iter.old_spte) &&
> -                   is_large_pte(iter.old_spte)) {
> -                       if (tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> -                               break;
> -
> -                       /*
> -                        * The iter must explicitly re-read the spte here
> -                        * because the new value informs the !present
> -                        * path below.
> -                        */
> -                       iter.old_spte = kvm_tdp_mmu_read_spte(iter.sptep);
> +                   !is_large_pte(iter.old_spte))
> +                       continue;
> +
> +               /*
> +                * If the SPTE has been frozen by another thread, just give up
> +                * and retry to avoid unnecessary page table alloc and free.
> +                */
> +               if (is_removed_spte(iter.old_spte))
> +                       break;
> +
> +               /*
> +                * The SPTE is either invalid or points at a huge page that
> +                * needs to be split.
> +                */
> +               sp = tdp_mmu_alloc_sp(vcpu);
> +               tdp_mmu_init_child_sp(sp, &iter);
> +
> +               account_nx = fault->huge_page_disallowed &&
> +                            fault->req_level >= iter.level;
> +
> +               sp->lpage_disallowed = account_nx;
> +               /*
> +                * Ensure lpage_disallowed is visible before the SP is marked
> +                * present (or not-huge), as mmu_lock is held for read.  Pairs
> +                * with the smp_rmb() in disallowed_hugepage_adjust().
> +                */
> +               smp_wmb();
> +
> +               if (!is_shadow_present_pte(iter.old_spte))
> +                       r = tdp_mmu_link_sp(kvm, &iter, sp, true);
> +               else
> +                       r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
> +
> +               if (r) {
> +                       tdp_mmu_free_sp(sp);
> +                       break;
>                 }
>
> -               if (!is_shadow_present_pte(iter.old_spte)) {
> -                       bool account_nx = fault->huge_page_disallowed &&
> -                                         fault->req_level >= iter.level;
> -
> -                       /*
> -                        * If SPTE has been frozen by another thread, just
> -                        * give up and retry, avoiding unnecessary page table
> -                        * allocation and free.
> -                        */
> -                       if (is_removed_spte(iter.old_spte))
> -                               break;
> -
> -                       sp = tdp_mmu_alloc_sp(vcpu);
> -                       tdp_mmu_init_child_sp(sp, &iter);
> -
> -                       sp->lpage_disallowed = account_nx;
> -                       /*
> -                        * Ensure lpage_disallowed is visible before the SP is
> -                        * marked present, as mmu_lock is held for read.  Pairs
> -                        * with the smp_rmb() in disallowed_hugepage_adjust().
> -                        */
> -                       smp_wmb();
> -
> -                       if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
> -                               tdp_mmu_free_sp(sp);
> -                               break;
> -                       }
> -
> -                       if (account_nx) {
> -                               spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> -                               __account_huge_nx_page(kvm, sp);
> -                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> -                       }
> +               if (account_nx) {
> +                       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +                       __account_huge_nx_page(kvm, sp);
> +                       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>                 }
>         }
>
> @@ -1472,8 +1469,6 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>         const int level = iter->level;
>         int ret, i;
>
> -       tdp_mmu_init_child_sp(sp, iter);
> -
>         /*
>          * No need for atomics when writing to sp->spt since the page table has
>          * not been linked in yet and thus is not reachable from any other CPU.
> @@ -1549,6 +1544,8 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
>                                 continue;
>                 }
>
> +               tdp_mmu_init_child_sp(sp, &iter);
> +
>                 if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
>                         goto retry;
>
>
> base-commit: f06d9d4f3d89912c40c57da45d64b9827d8580ac
> --
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-04-07 20:53 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-04-01 23:37 [PATCH 0/3] KVM: Split huge pages mapped by the TDP MMU on fault David Matlack
2022-04-01 23:37 ` [PATCH 1/3] KVM: selftests: Introduce a selftest to measure execution performance David Matlack
2022-04-04 18:13   ` Ben Gardon
2022-04-05 20:59     ` David Matlack
2022-04-05 21:00       ` David Matlack
2022-04-01 23:37 ` [PATCH 2/3] KVM: x86/mmu: Pass account_nx to tdp_mmu_split_huge_page() David Matlack
2022-04-04 18:26   ` Ben Gardon
2022-04-01 23:37 ` [PATCH 3/3] KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault David Matlack
2022-04-04 18:48   ` Ben Gardon
2022-04-05 19:02     ` David Matlack
2022-04-07 19:39   ` Sean Christopherson
2022-04-07 20:52     ` David Matlack

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).