* [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-19 13:51 ` Hoo Robert
2023-04-12 21:34 ` [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
` (22 subsequent siblings)
23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
At the moment, demand_paging_test does not support profiling/testing
multiple vCPU threads concurrently faulting on a single uffd because
(a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
region, so that each uffd services a single vCPU thread.
(b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
simply doesn't work: the test tries to register the same memory
to multiple uffds, causing an error.
Add support for many vcpus per uffd by
(1) Keeping "-u" behavior unchanged.
(2) Making "-u -a" create a single uffd for all of guest memory.
(3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
In cases (2) and (3) all vCPU threads fault on a single uffd.
With potentially multiple vCPUs per UFFD, it makes sense to
allow configuring the number of reader threads per UFFD as well: add the
"-r" flag to do so.
Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
.../selftests/kvm/aarch64/page_fault_test.c | 4 +-
.../selftests/kvm/demand_paging_test.c | 62 +++++++++----
.../selftests/kvm/include/userfaultfd_util.h | 18 +++-
.../selftests/kvm/lib/userfaultfd_util.c | 86 +++++++++++++------
4 files changed, 124 insertions(+), 46 deletions(-)
diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index df10f1ffa20d9..3b6d228a9340d 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -376,14 +376,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
pt_args.hva,
pt_args.paging_size,
- test->uffd_pt_handler);
+ 1, test->uffd_pt_handler);
*data_uffd = NULL;
if (test->uffd_data_handler)
*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
data_args.hva,
data_args.paging_size,
- test->uffd_data_handler);
+ 1, test->uffd_data_handler);
}
static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index b0e1fc4de9e29..6c2253f4a64ef 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -77,9 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
copy.mode = 0;
r = ioctl(uffd, UFFDIO_COPY, &copy);
- if (r == -1) {
- pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
- addr, tid, errno);
+ /*
+ * When multiple vCPU threads fault on a single page and there are
+ * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+ * will fail with EEXIST: handle that case without signaling an
+ * error.
+ */
+ if (r == -1 && errno != EEXIST) {
+ pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
+ addr, tid, errno);
return r;
}
} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
@@ -89,9 +95,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
cont.range.len = demand_paging_size;
r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
- if (r == -1) {
- pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
- addr, tid, errno);
+ /* See the note about EEXISTs in the UFFDIO_COPY branch. */
+ if (r == -1 && errno != EEXIST) {
+ pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
+ addr, tid, errno);
return r;
}
} else {
@@ -110,7 +117,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
struct test_params {
int uffd_mode;
+ bool single_uffd;
useconds_t uffd_delay;
+ int readers_per_uffd;
enum vm_mem_backing_src_type src_type;
bool partition_vcpu_memory_access;
};
@@ -133,7 +142,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct timespec start;
struct timespec ts_diff;
struct kvm_vm *vm;
- int i;
+ int i, num_uffds = 0;
+ uint64_t uffd_region_size;
vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
p->src_type, p->partition_vcpu_memory_access);
@@ -146,10 +156,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
memset(guest_data_prototype, 0xAB, demand_paging_size);
if (p->uffd_mode) {
- uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+ num_uffds = p->single_uffd ? 1 : nr_vcpus;
+ uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+ uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
TEST_ASSERT(uffd_descs, "Memory allocation failed");
- for (i = 0; i < nr_vcpus; i++) {
+ for (i = 0; i < num_uffds; i++) {
struct memstress_vcpu_args *vcpu_args;
void *vcpu_hva;
void *vcpu_alias;
@@ -160,8 +173,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa);
vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa);
- prefault_mem(vcpu_alias,
- vcpu_args->pages * memstress_args.guest_page_size);
+ prefault_mem(vcpu_alias, uffd_region_size);
/*
* Set up user fault fd to handle demand paging
@@ -169,7 +181,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
*/
uffd_descs[i] = uffd_setup_demand_paging(
p->uffd_mode, p->uffd_delay, vcpu_hva,
- vcpu_args->pages * memstress_args.guest_page_size,
+ uffd_region_size,
+ p->readers_per_uffd,
&handle_uffd_page_request);
}
}
@@ -186,7 +199,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
if (p->uffd_mode) {
/* Tell the user fault fd handler threads to quit */
- for (i = 0; i < nr_vcpus; i++)
+ for (i = 0; i < num_uffds; i++)
uffd_stop_demand_paging(uffd_descs[i]);
}
@@ -206,14 +219,19 @@ static void run_test(enum vm_guest_mode mode, void *arg)
static void help(char *name)
{
puts("");
- printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
- " [-b memory] [-s type] [-v vcpus] [-o]\n", name);
+ printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+ " [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+ " [-s type] [-v vcpus] [-o]\n", name);
guest_modes_help();
printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
" UFFD registration mode: 'MISSING' or 'MINOR'.\n");
+ printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+ " creating one for each region paged by a unique vCPU.\n"
+ " Set implicitly with -o; has no effect without -u.\n");
printf(" -d: add a delay in usec to the User Fault\n"
" FD handler to simulate demand paging\n"
" overheads. Ignored without -u.\n");
+ printf(" -r: Set the number of reader threads per uffd.\n");
printf(" -b: specify the size of the memory region which should be\n"
" demand paged by each vCPU. e.g. 10M or 3G.\n"
" Default: 1G\n");
@@ -231,12 +249,14 @@ int main(int argc, char *argv[])
struct test_params p = {
.src_type = DEFAULT_VM_MEM_SRC,
.partition_vcpu_memory_access = true,
+ .readers_per_uffd = 1,
+ .single_uffd = false,
};
int opt;
guest_modes_append_default();
- while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) {
+ while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
switch (opt) {
case 'm':
guest_modes_cmdline(optarg);
@@ -248,6 +268,9 @@ int main(int argc, char *argv[])
p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
break;
+ case 'a':
+ p.single_uffd = true;
+ break;
case 'd':
p.uffd_delay = strtoul(optarg, NULL, 0);
TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -265,6 +288,13 @@ int main(int argc, char *argv[])
break;
case 'o':
p.partition_vcpu_memory_access = false;
+ p.single_uffd = true;
+ break;
+ case 'r':
+ p.readers_per_uffd = atoi(optarg);
+ TEST_ASSERT(p.readers_per_uffd >= 1,
+ "Invalid number of readers per uffd %d: must be >=1",
+ p.readers_per_uffd);
break;
case 'h':
default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c345928..92cc1f9ec0686 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,30 @@
typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
+struct uffd_reader_args {
+ int uffd_mode;
+ int uffd;
+ useconds_t delay;
+ uffd_handler_t handler;
+ /* Holds the read end of the pipe for killing the reader. */
+ int pipe;
+};
+
struct uffd_desc {
int uffd_mode;
int uffd;
- int pipefds[2];
useconds_t delay;
uffd_handler_t handler;
- pthread_t thread;
+ uint64_t num_readers;
+ /* Holds the write ends of the pipes for killing the readers. */
+ int *pipefds;
+ pthread_t *readers;
+ struct uffd_reader_args *reader_args;
};
struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
void *hva, uint64_t len,
- uffd_handler_t handler);
+ uint64_t num_readers, uffd_handler_t handler);
void uffd_stop_demand_paging(struct uffd_desc *uffd);
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 92cef20902f1f..2723ee1e3e1b2 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
static void *uffd_handler_thread_fn(void *arg)
{
- struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
- int uffd = uffd_desc->uffd;
- int pipefd = uffd_desc->pipefds[0];
- useconds_t delay = uffd_desc->delay;
+ struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+ int uffd = reader_args->uffd;
int64_t pages = 0;
struct timespec start;
struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
pollfd[0].fd = uffd;
pollfd[0].events = POLLIN;
- pollfd[1].fd = pipefd;
+ pollfd[1].fd = reader_args->pipe;
pollfd[1].events = POLLIN;
r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
if (!(msg.event & UFFD_EVENT_PAGEFAULT))
continue;
- if (delay)
- usleep(delay);
- r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+ if (reader_args->delay)
+ usleep(reader_args->delay);
+ r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
if (r < 0)
return NULL;
pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
void *hva, uint64_t len,
- uffd_handler_t handler)
+ uint64_t num_readers, uffd_handler_t handler)
{
struct uffd_desc *uffd_desc;
bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
struct uffdio_api uffdio_api;
struct uffdio_register uffdio_register;
uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
- int ret;
+ int ret, i;
PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
is_minor ? "MINOR" : "MISSING",
is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
uffd_desc = malloc(sizeof(struct uffd_desc));
- TEST_ASSERT(uffd_desc, "malloc failed");
+ TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+ uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+ TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+ uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+ TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+ uffd_desc->reader_args = malloc(
+ sizeof(struct uffd_reader_args) * num_readers);
+ TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+ uffd_desc->num_readers = num_readers;
/* In order to get minor faults, prefault via the alias. */
if (is_minor)
@@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
expected_ioctls, "missing userfaultfd ioctls");
- ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
- TEST_ASSERT(!ret, "Failed to set up pipefd");
-
uffd_desc->uffd_mode = uffd_mode;
uffd_desc->uffd = uffd;
uffd_desc->delay = delay;
uffd_desc->handler = handler;
- pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
- uffd_desc);
- PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
- hva, hva + len);
+ for (i = 0; i < uffd_desc->num_readers; ++i) {
+ int pipes[2];
+
+ ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+ TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+ i, uffd_desc);
+
+ uffd_desc->pipefds[i] = pipes[1];
+
+ uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+ uffd_desc->reader_args[i].uffd = uffd;
+ uffd_desc->reader_args[i].delay = delay;
+ uffd_desc->reader_args[i].handler = handler;
+ uffd_desc->reader_args[i].pipe = pipes[0];
+
+ pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+ &uffd_desc->reader_args[i]);
+
+ PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+ i, hva, hva + len);
+ }
return uffd_desc;
}
@@ -167,19 +191,31 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
void uffd_stop_demand_paging(struct uffd_desc *uffd)
{
char c = 0;
- int ret;
+ int i, ret;
- ret = write(uffd->pipefds[1], &c, 1);
- TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+ for (i = 0; i < uffd->num_readers; ++i) {
+ ret = write(uffd->pipefds[i], &c, 1);
+ TEST_ASSERT(
+ ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+ }
- ret = pthread_join(uffd->thread, NULL);
- TEST_ASSERT(ret == 0, "Pthread_join failed.");
+ for (i = 0; i < uffd->num_readers; ++i) {
+ ret = pthread_join(uffd->readers[i], NULL);
+ TEST_ASSERT(
+ ret == 0,
+ "Pthread_join failed on reader thread %i for uffd_desc %p", i, uffd);
+ }
close(uffd->uffd);
- close(uffd->pipefds[1]);
- close(uffd->pipefds[0]);
+ for (i = 0; i < uffd->num_readers; ++i) {
+ close(uffd->pipefds[i]);
+ close(uffd->reader_args[i].pipe);
+ }
+ free(uffd->pipefds);
+ free(uffd->readers);
+ free(uffd->reader_args);
free(uffd);
}
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
@ 2023-04-19 13:51 ` Hoo Robert
2023-04-20 17:55 ` Anish Moorthy
0 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 13:51 UTC (permalink / raw)
To: Anish Moorthy, pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> At the moment, demand_paging_test does not support profiling/testing
> multiple vCPU threads concurrently faulting on a single uffd because
>
> (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
> region, so that each uffd services a single vCPU thread.
> (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
> simply doesn't work: the test tries to register the same memory
> to multiple uffds, causing an error.
>
> Add support for many vcpus per uffd by
> (1) Keeping "-u" behavior unchanged.
> (2) Making "-u -a" create a single uffd for all of guest memory.
> (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
> In cases (2) and (3) all vCPU threads fault on a single uffd.
>
> With multiple potentially multiple vCPU per UFFD, it makes sense to
^^^^^^^^
redundant "multiple"?
> allow configuring the number reader threads per UFFD as well: add the
> "-r" flag to do so.
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
> .../selftests/kvm/aarch64/page_fault_test.c | 4 +-
> .../selftests/kvm/demand_paging_test.c | 62 +++++++++----
> .../selftests/kvm/include/userfaultfd_util.h | 18 +++-
> .../selftests/kvm/lib/userfaultfd_util.c | 86 +++++++++++++------
> 4 files changed, 124 insertions(+), 46 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> index df10f1ffa20d9..3b6d228a9340d 100644
> --- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> +++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
> @@ -376,14 +376,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
> *pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
> pt_args.hva,
> pt_args.paging_size,
> - test->uffd_pt_handler);
> + 1, test->uffd_pt_handler);
>
> *data_uffd = NULL;
> if (test->uffd_data_handler)
> *data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
> data_args.hva,
> data_args.paging_size,
> - test->uffd_data_handler);
> + 1, test->uffd_data_handler);
> }
>
> static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index b0e1fc4de9e29..6c2253f4a64ef 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -77,9 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> copy.mode = 0;
>
> r = ioctl(uffd, UFFDIO_COPY, &copy);
> - if (r == -1) {
> - pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
> - addr, tid, errno);
> + /*
> + * With multiple vCPU threads fault on a single page and there are
> + * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> + * will fail with EEXIST: handle that case without signaling an
> + * error.
> + */
But this code path is also taken in other cases, isn't it? In
those cases, is it still safe to ignore EEXIST?
> + if (r == -1 && errno != EEXIST) {
> + pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> + addr, tid, errno);
unintended indent changes I think.
> return r;
> }
> } else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> @@ -89,9 +95,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> cont.range.len = demand_paging_size;
>
> r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> - if (r == -1) {
> - pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
> - addr, tid, errno);
> + /* See the note about EEXISTs in the UFFDIO_COPY branch. */
Personally I would suggest copying the comment here. What if some day the
above code/comment is changed or deleted?
> + if (r == -1 && errno != EEXIST) {
> + pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
> + addr, tid, errno);
Ditto
> return r;
> }
> } else {
> @@ -110,7 +117,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
>
> struct test_params {
> int uffd_mode;
> + bool single_uffd;
> useconds_t uffd_delay;
> + int readers_per_uffd;
> enum vm_mem_backing_src_type src_type;
> bool partition_vcpu_memory_access;
> };
> @@ -133,7 +142,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> struct timespec start;
> struct timespec ts_diff;
> struct kvm_vm *vm;
> - int i;
> + int i, num_uffds = 0;
> + uint64_t uffd_region_size;
>
> vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
> p->src_type, p->partition_vcpu_memory_access);
> @@ -146,10 +156,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> memset(guest_data_prototype, 0xAB, demand_paging_size);
>
> if (p->uffd_mode) {
> - uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
> + num_uffds = p->single_uffd ? 1 : nr_vcpus;
> + uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
> +
> + uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
> TEST_ASSERT(uffd_descs, "Memory allocation failed");
>
> - for (i = 0; i < nr_vcpus; i++) {
> + for (i = 0; i < num_uffds; i++) {
> struct memstress_vcpu_args *vcpu_args;
> void *vcpu_hva;
> void *vcpu_alias;
> @@ -160,8 +173,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> vcpu_hva = addr_gpa2hva(vm, vcpu_args->gpa);
> vcpu_alias = addr_gpa2alias(vm, vcpu_args->gpa);
>
> - prefault_mem(vcpu_alias,
> - vcpu_args->pages * memstress_args.guest_page_size);
> + prefault_mem(vcpu_alias, uffd_region_size);
>
> /*
> * Set up user fault fd to handle demand paging
> @@ -169,7 +181,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> */
> uffd_descs[i] = uffd_setup_demand_paging(
> p->uffd_mode, p->uffd_delay, vcpu_hva,
> - vcpu_args->pages * memstress_args.guest_page_size,
> + uffd_region_size,
> + p->readers_per_uffd,
> &handle_uffd_page_request);
> }
> }
> @@ -186,7 +199,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>
> if (p->uffd_mode) {
> /* Tell the user fault fd handler threads to quit */
> - for (i = 0; i < nr_vcpus; i++)
> + for (i = 0; i < num_uffds; i++)
> uffd_stop_demand_paging(uffd_descs[i]);
> }
>
> @@ -206,14 +219,19 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> static void help(char *name)
> {
> puts("");
> - printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
> - " [-b memory] [-s type] [-v vcpus] [-o]\n", name);
> + printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
> + " [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
> + " [-s type] [-v vcpus] [-o]\n", name);
Ditto
> guest_modes_help();
> printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
> " UFFD registration mode: 'MISSING' or 'MINOR'.\n");
> + printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
> + " creating one for each region paged by a unique vCPU\n"
> + " Set implicitly with -o, and no effect without -u.\n");
Ditto
> printf(" -d: add a delay in usec to the User Fault\n"
> " FD handler to simulate demand paging\n"
> " overheads. Ignored without -u.\n");
> + printf(" -r: Set the number of reader threads per uffd.\n");
> printf(" -b: specify the size of the memory region which should be\n"
> " demand paged by each vCPU. e.g. 10M or 3G.\n"
> " Default: 1G\n");
> @@ -231,12 +249,14 @@ int main(int argc, char *argv[])
> struct test_params p = {
> .src_type = DEFAULT_VM_MEM_SRC,
> .partition_vcpu_memory_access = true,
> + .readers_per_uffd = 1,
> + .single_uffd = false,
> };
> int opt;
>
> guest_modes_append_default();
>
> - while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:o")) != -1) {
> + while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
> switch (opt) {
> case 'm':
> guest_modes_cmdline(optarg);
> @@ -248,6 +268,9 @@ int main(int argc, char *argv[])
> p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
> break;
> + case 'a':
> + p.single_uffd = true;
> + break;
> case 'd':
> p.uffd_delay = strtoul(optarg, NULL, 0);
> TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
> @@ -265,6 +288,13 @@ int main(int argc, char *argv[])
> break;
> case 'o':
> p.partition_vcpu_memory_access = false;
> + p.single_uffd = true;
> + break;
> + case 'r':
> + p.readers_per_uffd = atoi(optarg);
> + TEST_ASSERT(p.readers_per_uffd >= 1,
> + "Invalid number of readers per uffd %d: must be >=1",
> + p.readers_per_uffd);
> break;
> case 'h':
> default:
> diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
> index 877449c345928..92cc1f9ec0686 100644
> --- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
> +++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
> @@ -17,18 +17,30 @@
>
> typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
>
> +struct uffd_reader_args {
> + int uffd_mode;
> + int uffd;
> + useconds_t delay;
> + uffd_handler_t handler;
> + /* Holds the read end of the pipe for killing the reader. */
> + int pipe;
> +};
> +
> struct uffd_desc {
> int uffd_mode;
> int uffd;
> - int pipefds[2];
> useconds_t delay;
> uffd_handler_t handler;
> - pthread_t thread;
> + uint64_t num_readers;
> + /* Holds the write ends of the pipes for killing the readers. */
> + int *pipefds;
> + pthread_t *readers;
> + struct uffd_reader_args *reader_args;
> };
>
> struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> void *hva, uint64_t len,
> - uffd_handler_t handler);
> + uint64_t num_readers, uffd_handler_t handler);
>
> void uffd_stop_demand_paging(struct uffd_desc *uffd);
>
> diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> index 92cef20902f1f..2723ee1e3e1b2 100644
> --- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> +++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> @@ -27,10 +27,8 @@
>
> static void *uffd_handler_thread_fn(void *arg)
> {
> - struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
> - int uffd = uffd_desc->uffd;
> - int pipefd = uffd_desc->pipefds[0];
> - useconds_t delay = uffd_desc->delay;
> + struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
> + int uffd = reader_args->uffd;
> int64_t pages = 0;
> struct timespec start;
> struct timespec ts_diff;
> @@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
>
> pollfd[0].fd = uffd;
> pollfd[0].events = POLLIN;
> - pollfd[1].fd = pipefd;
> + pollfd[1].fd = reader_args->pipe;
> pollfd[1].events = POLLIN;
>
> r = poll(pollfd, 2, -1);
> @@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
> if (!(msg.event & UFFD_EVENT_PAGEFAULT))
> continue;
>
> - if (delay)
> - usleep(delay);
> - r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
> + if (reader_args->delay)
> + usleep(reader_args->delay);
> + r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
> if (r < 0)
> return NULL;
> pages++;
> @@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
>
> struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> void *hva, uint64_t len,
> - uffd_handler_t handler)
> + uint64_t num_readers, uffd_handler_t handler)
> {
> struct uffd_desc *uffd_desc;
> bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
> @@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> struct uffdio_api uffdio_api;
> struct uffdio_register uffdio_register;
> uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
> - int ret;
> + int ret, i;
>
> PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
> is_minor ? "MINOR" : "MISSING",
> is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
>
> uffd_desc = malloc(sizeof(struct uffd_desc));
> - TEST_ASSERT(uffd_desc, "malloc failed");
> + TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
> +
> + uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
> + TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
> +
> + uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
> + TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
> +
> + uffd_desc->reader_args = malloc(
> + sizeof(struct uffd_reader_args) * num_readers);
> + TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
> +
> + uffd_desc->num_readers = num_readers;
>
> /* In order to get minor faults, prefault via the alias. */
> if (is_minor)
> @@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
> expected_ioctls, "missing userfaultfd ioctls");
>
> - ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
> - TEST_ASSERT(!ret, "Failed to set up pipefd");
> -
> uffd_desc->uffd_mode = uffd_mode;
> uffd_desc->uffd = uffd;
> uffd_desc->delay = delay;
> uffd_desc->handler = handler;
Now that these info are encapsulated into reader args below, looks
unnecessary to have them in uffd_desc here.
> - pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
> - uffd_desc);
>
> - PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
> - hva, hva + len);
> + for (i = 0; i < uffd_desc->num_readers; ++i) {
> + int pipes[2];
> +
> + ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
> + TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
> + i, uffd_desc);
> +
> + uffd_desc->pipefds[i] = pipes[1];
> +
> + uffd_desc->reader_args[i].uffd_mode = uffd_mode;
> + uffd_desc->reader_args[i].uffd = uffd;
> + uffd_desc->reader_args[i].delay = delay;
> + uffd_desc->reader_args[i].handler = handler;
> + uffd_desc->reader_args[i].pipe = pipes[0];
> +
> + pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
> + &uffd_desc->reader_args[i]);
> +
> + PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
> + i, hva, hva + len);
> + }
>
> return uffd_desc;
> }
> @@ -167,19 +191,31 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> void uffd_stop_demand_paging(struct uffd_desc *uffd)
> {
> char c = 0;
> - int ret;
> + int i, ret;
>
> - ret = write(uffd->pipefds[1], &c, 1);
> - TEST_ASSERT(ret == 1, "Unable to write to pipefd");
> + for (i = 0; i < uffd->num_readers; ++i) {
> + ret = write(uffd->pipefds[i], &c, 1);
> + TEST_ASSERT(
> + ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
> + }
>
> - ret = pthread_join(uffd->thread, NULL);
> - TEST_ASSERT(ret == 0, "Pthread_join failed.");
> + for (i = 0; i < uffd->num_readers; ++i) {
> + ret = pthread_join(uffd->readers[i], NULL);
> + TEST_ASSERT(
> + ret == 0,
> + "Pthread_join failed on reader thread %i for uffd_desc %p", i, uffd);
> + }
>
> close(uffd->uffd);
>
> - close(uffd->pipefds[1]);
> - close(uffd->pipefds[0]);
> + for (i = 0; i < uffd->num_readers; ++i) {
> + close(uffd->pipefds[i]);
> + close(uffd->reader_args[i].pipe);
> + }
>
> + free(uffd->pipefds);
> + free(uffd->readers);
> + free(uffd->reader_args);
> free(uffd);
> }
>
* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
2023-04-19 13:51 ` Hoo Robert
@ 2023-04-20 17:55 ` Anish Moorthy
2023-04-21 12:15 ` Robert Hoo
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 17:55 UTC (permalink / raw)
To: Hoo Robert
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 19, 2023 at 6:51 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> > At the moment, demand_paging_test does not support profiling/testing
> > multiple vCPU threads concurrently faulting on a single uffd because
> >
> > (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
> > region, so that each uffd services a single vCPU thread.
> > (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
> > simply doesn't work: the test tries to register the same memory
> > to multiple uffds, causing an error.
> >
> > Add support for many vcpus per uffd by
> > (1) Keeping "-u" behavior unchanged.
> > (2) Making "-u -a" create a single uffd for all of guest memory.
> > (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
> > In cases (2) and (3) all vCPU threads fault on a single uffd.
> >
> > With multiple potentially multiple vCPU per UFFD, it makes sense to
> ^^^^^^^^
> redundant "multiple"?
Thanks, fixed
> > --- a/tools/testing/selftests/kvm/demand_paging_test.c
> > +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> > @@ -77,9 +77,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> > copy.mode = 0;
> >
> > r = ioctl(uffd, UFFDIO_COPY, &copy);
> > - if (r == -1) {
> > - pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
> > - addr, tid, errno);
> > + /*
> > + * With multiple vCPU threads fault on a single page and there are
> > + * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> > + * will fail with EEXIST: handle that case without signaling an
> > + * error.
> > + */
>
> But this code path is also gone through in other cases, isn't it? In
> those cases, is it still safe to ignore EEXIST?
Good point: the answer is no, it's not always safe to ignore EEXISTs
here. For instance, the first UFFDIO_CONTINUE for a page shouldn't be
allowed to EEXIST, and that's swept under the rug here. I've added the
following to the comment:
+ * Note that this does sweep under the rug any EEXISTs occurring
+ * from, e.g., the first UFFDIO_COPY/CONTINUEs on a page. A
+ * realistic VMM would maintain some other state to correctly
+ * surface EEXISTs to userspace or prevent duplicate
+ * COPY/CONTINUEs from happening in the first place.
I could add that extra state to the self test (via, for instance, an
atomic bitmap that threads "or" into before issuing any
COPY/CONTINUEs), but it's a bit of an extra complication without any
real payoff. Let me know if you think the comment's inadequate, though.
> > + if (r == -1 && errno != EEXIST) {
> > + pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> > + addr, tid, errno);
>
> unintended indent changes I think.
>
> > return r;
> > }
> > } else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> > @@ -89,9 +95,10 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> > cont.range.len = demand_paging_size;
> >
> > r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> > - if (r == -1) {
> > - pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
> > - addr, tid, errno);
> > + /* See the note about EEXISTs in the UFFDIO_COPY branch. */
>
> Personally I would suggest copying the comment here. What if some day
> the code/comment above is changed or deleted?
You might be right: on the other hand, if the comment ever gets
updated then it would have to be done in two places. Anyone to break
the tie? :)
> > @@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> > TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
> > expected_ioctls, "missing userfaultfd ioctls");
> >
> > - ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
> > - TEST_ASSERT(!ret, "Failed to set up pipefd");
> > -
> > uffd_desc->uffd_mode = uffd_mode;
> > uffd_desc->uffd = uffd;
> > uffd_desc->delay = delay;
> > uffd_desc->handler = handler;
>
> Now that these info are encapsulated into reader args below, looks
> unnecessary to have them in uffd_desc here.
Good point. I've removed uffd_mode, delay, and handler from uffd_desc.
I left the "uffd" field in because that's a shared resource, and
close()ing it as "close(desc->uffd)" makes more sense than, say,
"close(desc->reader_args[0].uffd)"
* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
2023-04-20 17:55 ` Anish Moorthy
@ 2023-04-21 12:15 ` Robert Hoo
2023-04-21 16:21 ` Anish Moorthy
0 siblings, 1 reply; 103+ messages in thread
From: Robert Hoo @ 2023-04-21 12:15 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
Anish Moorthy <amoorthy@google.com> 于2023年4月21日周五 01:56写道:
>
> > > + /*
> > > + * With multiple vCPU threads fault on a single page and there are
> > > + * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> > > + * will fail with EEXIST: handle that case without signaling an
> > > + * error.
> > > + */
> >
> > But this code path is also gone through in other cases, isn't it? In
> > those cases, is it still safe to ignore EEXIST?
>
> Good point: the answer is no, it's not always safe to ignore EEXISTs
> here. For instance the first UFFDIO_CONTINUE for a page shouldn't be
> allowed to EEXIST, and that's swept under the rug here. I've added the
> following to the comment
>
> + * Note that this does sweep under the rug any EEXISTs occurring
> + * from, e.g., the first UFFDIO_COPY/CONTINUEs on a page. A
> + * realistic VMM would maintain some other state to correctly
> + * surface EEXISTs to userspace or prevent duplicate
> + * COPY/CONTINUEs from happening in the first place.
>
> I could add that extra state to the self test (via for instance, an
> atomic bitmap that threads "or" into before issuing any
> COPY/CONTINUEs) but it's a bit of an extra complication without any
> real payoff. Let me know if you think the comment's inadequate though.
>
IIUC, you could say that in this demand paging test case, even a
duplicate copy/continue does no harm anyway. Am I right?
> > > + /* See the note about EEXISTs in the UFFDIO_COPY branch. */
> >
> > Personally I would suggest copying the comment here. What if some day
> > the code/comment above is changed or deleted?
>
> You might be right: on the other hand, if the comment ever gets
> updated then it would have to be done in two places. Anyone to break
> the tie? :)
Whoever updates the code is responsible for updating the comments. Make sense? :)
>
> > > @@ -148,18 +158,32 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
> > > TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
> > > expected_ioctls, "missing userfaultfd ioctls");
> > >
> > > - ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
> > > - TEST_ASSERT(!ret, "Failed to set up pipefd");
> > > -
> > > uffd_desc->uffd_mode = uffd_mode;
> > > uffd_desc->uffd = uffd;
> > > uffd_desc->delay = delay;
> > > uffd_desc->handler = handler;
> >
> > Now that these info are encapsulated into reader args below, looks
> > unnecessary to have them in uffd_desc here.
>
> Good point. I've removed uffd_mode, delay, and handler from uffd_desc.
> I left the "uffd" field in because that's a shared resource, and
> close()ing it as "close(desc->uffd)" makes more sense than, say,
> "close(desc->reader_args[0].uffd)"
Sure, that's also what I originally changed on my side. Sorry, I didn't
mention it earlier.
* Re: [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
2023-04-21 12:15 ` Robert Hoo
@ 2023-04-21 16:21 ` Anish Moorthy
0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-21 16:21 UTC (permalink / raw)
To: Robert Hoo
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Fri, Apr 21, 2023 at 5:15 AM Robert Hoo <robert.hoo.linux@gmail.com> wrote:
>
> IIUC, you could say that in this demand paging test case, even a
> duplicate copy/continue does no harm anyway. Am I right?
It's probably more accurate to say that it never happens in the first
place. I've added a sentence to that effect here.
> > > > + /* See the note about EEXISTs in the UFFDIO_COPY branch. */
> > >
> > > Personally I would suggest copying the comment here. What if some day
> > > the code/comment above is changed or deleted?
> >
> > You might be right: on the other hand, if the comment ever gets
> > updated then it would have to be done in two places. Anyone to break
> > the tie? :)
>
> Whoever updates the code is responsible for updating the comments. Make sense? :)
Fair enough, done.
* [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-19 13:36 ` Hoo Robert
2023-04-12 21:34 ` [PATCH v3 03/22] KVM: Allow hva_pfn_fast() to resolve read-only faults Anish Moorthy
` (21 subsequent siblings)
23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
With multiple reader threads POLLing a single UFFD, the test suffers
from the thundering herd problem: performance degrades as the number of
reader threads is increased. Solve this issue [1] by switching the
polling mechanism to EPOLL + EPOLLEXCLUSIVE.
Also, change the error-handling convention of uffd_handler_thread_fn.
Instead of just printing errors and returning early from the polling
loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
successful exit from uffd_handler_thread_fn, ie one triggered by a
write to the exit pipe.
Performance samples generated by the command in [2] are given below.
Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
1 249k 185k
2 201k 235k
4 186k 155k
16 150k 217k
32 89k 198k
[1] Single-vCPU performance does suffer somewhat.
[2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
.../selftests/kvm/demand_paging_test.c | 1 -
.../selftests/kvm/lib/userfaultfd_util.c | 74 +++++++++----------
2 files changed, 35 insertions(+), 40 deletions(-)
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 6c2253f4a64ef..c729cee4c2055 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
-#include <poll.h>
#include <pthread.h>
#include <linux/userfaultfd.h>
#include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 2723ee1e3e1b2..909ad69c1cb04 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
#include <poll.h>
#include <pthread.h>
#include <linux/userfaultfd.h>
+#include <sys/epoll.h>
#include <sys/syscall.h>
#include "kvm_util.h"
@@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
int64_t pages = 0;
struct timespec start;
struct timespec ts_diff;
+ int epollfd;
+ struct epoll_event evt;
+
+ epollfd = epoll_create(1);
+ TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+ evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+ evt.data.u32 = 0;
+ TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+ "Failed to add uffd to epollfd");
+
+ evt.events = EPOLLIN;
+ evt.data.u32 = 1;
+ TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+ "Failed to add pipe to epollfd");
clock_gettime(CLOCK_MONOTONIC, &start);
while (1) {
struct uffd_msg msg;
- struct pollfd pollfd[2];
- char tmp_chr;
int r;
- pollfd[0].fd = uffd;
- pollfd[0].events = POLLIN;
- pollfd[1].fd = reader_args->pipe;
- pollfd[1].events = POLLIN;
-
- r = poll(pollfd, 2, -1);
- switch (r) {
- case -1:
- pr_info("poll err");
- continue;
- case 0:
- continue;
- case 1:
- break;
- default:
- pr_info("Polling uffd returned %d", r);
- return NULL;
- }
+ r = epoll_wait(epollfd, &evt, 1, -1);
+ TEST_ASSERT(r == 1,
+ "Unexpected number of events (%d) from epoll, errno = %d",
+ r, errno);
- if (pollfd[0].revents & POLLERR) {
- pr_info("uffd revents has POLLERR");
- return NULL;
- }
+ if (evt.data.u32 == 1) {
+ char tmp_chr;
- if (pollfd[1].revents & POLLIN) {
- r = read(pollfd[1].fd, &tmp_chr, 1);
+ TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+ "Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+ r = read(reader_args->pipe, &tmp_chr, 1);
TEST_ASSERT(r == 1,
- "Error reading pipefd in UFFD thread\n");
+ "Error reading pipefd in uffd reader thread");
return NULL;
}
- if (!(pollfd[0].revents & POLLIN))
- continue;
+ TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+ "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
r = read(uffd, &msg, sizeof(msg));
if (r == -1) {
- if (errno == EAGAIN)
- continue;
- pr_info("Read of uffd got errno %d\n", errno);
- return NULL;
+ TEST_ASSERT(errno == EAGAIN,
+ "Error reading from UFFD: errno = %d", errno);
+ continue;
}
- if (r != sizeof(msg)) {
- pr_info("Read on uffd returned unexpected size: %d bytes", r);
- return NULL;
- }
+ TEST_ASSERT(r == sizeof(msg),
+ "Read on uffd returned unexpected number of bytes (%d)", r);
if (!(msg.event & UFFD_EVENT_PAGEFAULT))
continue;
@@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
if (reader_args->delay)
usleep(reader_args->delay);
r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
- if (r < 0)
- return NULL;
+ TEST_ASSERT(r >= 0,
+ "Reader thread handler fn returned negative value %d", r);
pages++;
}
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
2023-04-12 21:34 ` [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-04-19 13:36 ` Hoo Robert
2023-04-19 23:26 ` Anish Moorthy
0 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 13:36 UTC (permalink / raw)
To: Anish Moorthy, pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> With multiple reader threads POLLing a single UFFD, the test suffers
> from the thundering herd problem: performance degrades as the number of
> reader threads is increased. Solve this issue [1] by switching the
> polling mechanism to EPOLL + EPOLLEXCLUSIVE.
>
> Also, change the error-handling convention of uffd_handler_thread_fn.
> Instead of just printing errors and returning early from the polling
> loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
> successful exit from uffd_handler_thread_fn, ie one triggered by a
> write to the exit pipe.
>
> Performance samples generated by the command in [2] are given below.
>
> Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
> 1 249k 185k
> 2 201k 235k
> 4 186k 155k
> 16 150k 217k
> 32 89k 198k
>
> [1] Single-vCPU performance does suffer somewhat.
> [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
> .../selftests/kvm/demand_paging_test.c | 1 -
> .../selftests/kvm/lib/userfaultfd_util.c | 74 +++++++++----------
> 2 files changed, 35 insertions(+), 40 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index 6c2253f4a64ef..c729cee4c2055 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -13,7 +13,6 @@
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
> -#include <poll.h>
> #include <pthread.h>
> #include <linux/userfaultfd.h>
> #include <sys/syscall.h>
> diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> index 2723ee1e3e1b2..909ad69c1cb04 100644
> --- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> +++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
> @@ -16,6 +16,7 @@
> #include <poll.h>
> #include <pthread.h>
> #include <linux/userfaultfd.h>
> +#include <sys/epoll.h>
> #include <sys/syscall.h>
>
> #include "kvm_util.h"
> @@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
> int64_t pages = 0;
> struct timespec start;
> struct timespec ts_diff;
> + int epollfd;
> + struct epoll_event evt;
> +
> + epollfd = epoll_create(1);
> + TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
> +
> + evt.events = EPOLLIN | EPOLLEXCLUSIVE;
> + evt.data.u32 = 0;
> + TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
> + "Failed to add uffd to epollfd");
> +
> + evt.events = EPOLLIN;
> + evt.data.u32 = 1;
> + TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
> + "Failed to add pipe to epollfd");
>
> clock_gettime(CLOCK_MONOTONIC, &start);
> while (1) {
> struct uffd_msg msg;
> - struct pollfd pollfd[2];
> - char tmp_chr;
> int r;
>
> - pollfd[0].fd = uffd;
> - pollfd[0].events = POLLIN;
> - pollfd[1].fd = reader_args->pipe;
> - pollfd[1].events = POLLIN;
> -
> - r = poll(pollfd, 2, -1);
> - switch (r) {
> - case -1:
> - pr_info("poll err");
> - continue;
> - case 0:
> - continue;
> - case 1:
> - break;
> - default:
> - pr_info("Polling uffd returned %d", r);
> - return NULL;
> - }
> + r = epoll_wait(epollfd, &evt, 1, -1);
> + TEST_ASSERT(r == 1,
> + "Unexpected number of events (%d) from epoll, errno = %d",
> + r, errno);
>
too much indentation, also seen elsewhere.
> - if (pollfd[0].revents & POLLERR) {
> - pr_info("uffd revents has POLLERR");
> - return NULL;
> - }
> + if (evt.data.u32 == 1) {
> + char tmp_chr;
>
> - if (pollfd[1].revents & POLLIN) {
> - r = read(pollfd[1].fd, &tmp_chr, 1);
> + TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
> + "Reader thread received EPOLLERR or EPOLLHUP on pipe.");
> + r = read(reader_args->pipe, &tmp_chr, 1);
> TEST_ASSERT(r == 1,
> - "Error reading pipefd in UFFD thread\n");
> + "Error reading pipefd in uffd reader thread");
> return NULL;
How about a goto to
ts_diff = timespec_elapsed(start);
Otherwise the last stats won't get a chance to be calculated.
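The suggestion could look roughly like this sketch of the loop's exit path (illustrative only; the `done` label is invented, and the elided bodies match the surrounding diff):

```c
	while (1) {
		/* ... epoll_wait() and event dispatch ... */
		if (evt.data.u32 == 1) {
			/* Exit-pipe event: leave via goto instead of
			 * "return NULL" so the final stats still run. */
			goto done;
		}
		/* ... read and handle uffd messages, pages++ ... */
	}
done:
	ts_diff = timespec_elapsed(start);
	/* ... log pages and paging rate, then return NULL ... */
```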
> }
>
> - if (!(pollfd[0].revents & POLLIN))
> - continue;
> + TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
> + "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
>
> r = read(uffd, &msg, sizeof(msg));
> if (r == -1) {
> - if (errno == EAGAIN)
> - continue;
> - pr_info("Read of uffd got errno %d\n", errno);
> - return NULL;
> + TEST_ASSERT(errno == EAGAIN,
> + "Error reading from UFFD: errno = %d", errno);
> + continue;
> }
>
> - if (r != sizeof(msg)) {
> - pr_info("Read on uffd returned unexpected size: %d bytes", r);
> - return NULL;
> - }
> + TEST_ASSERT(r == sizeof(msg),
> + "Read on uffd returned unexpected number of bytes (%d)", r);
>
> if (!(msg.event & UFFD_EVENT_PAGEFAULT))
> continue;
> @@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
> if (reader_args->delay)
> usleep(reader_args->delay);
> r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
> - if (r < 0)
> - return NULL;
> + TEST_ASSERT(r >= 0,
> + "Reader thread handler fn returned negative value %d", r);
> pages++;
> }
>
* Re: [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
2023-04-19 13:36 ` Hoo Robert
@ 2023-04-19 23:26 ` Anish Moorthy
0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-19 23:26 UTC (permalink / raw)
To: Hoo Robert
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 19, 2023 at 6:36 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> How about a goto to
> ts_diff = timespec_elapsed(start);
> Otherwise the last stats won't get a chance to be calculated.
Good idea, done.
> > + TEST_ASSERT(r == 1,
> > + "Unexpected number of events (%d) from epoll, errno = %d",
> > + r, errno);
> >
> too much indentation, also seen elsewhere.
Augh, my editor has been set to a tab width of 4 this entire time.
That... explains a lot >:(
> > }
> >
> > - if (!(pollfd[0].revents & POLLIN))
> > - continue;
> > + TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
> > + "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
> >
> > r = read(uffd, &msg, sizeof(msg));
> > if (r == -1) {
> > - if (errno == EAGAIN)
> > - continue;
> > - pr_info("Read of uffd got errno %d\n", errno);
> > - return NULL;
> > + TEST_ASSERT(errno == EAGAIN,
> > + "Error reading from UFFD: errno = %d", errno);
> > + continue;
> > }
> >
> > - if (r != sizeof(msg)) {
> > - pr_info("Read on uffd returned unexpected size: %d bytes", r);
> > - return NULL;
> > - }
> > + TEST_ASSERT(r == sizeof(msg),
> > + "Read on uffd returned unexpected number of bytes (%d)", r);
> >
> > if (!(msg.event & UFFD_EVENT_PAGEFAULT))
> > continue;
> > @@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
> > if (reader_args->delay)
> > usleep(reader_args->delay);
> > r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
> > - if (r < 0)
> > - return NULL;
> > + TEST_ASSERT(r >= 0,
> > + "Reader thread handler fn returned negative value %d", r);
> > pages++;
> > }
> >
>
* [PATCH v3 03/22] KVM: Allow hva_pfn_fast() to resolve read-only faults.
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 01/22] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 02/22] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
` (20 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
hva_to_pfn_fast() currently just fails for read-only faults, which is
unnecessary. Instead, try pinning the page without passing FOLL_WRITE.
This allows read-only faults to (potentially) be resolved without
falling back to slow GUP.
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
virt/kvm/kvm_main.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f40b72eb0e7bf..cf7d3de6f3689 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2479,7 +2479,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
}
/*
- * The fast path to get the writable pfn which will be stored in @pfn,
+ * The fast path to get the pfn which will be stored in @pfn,
* true indicates success, otherwise false is returned. It's also the
* only part that runs if we can in atomic context.
*/
@@ -2493,10 +2493,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* or the caller allows to map a writable pfn for a read fault
* request.
*/
- if (!(write_fault || writable))
- return false;
+ unsigned int gup_flags = (write_fault || writable) ? FOLL_WRITE : 0;
- if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+ if (get_user_page_fast_only(addr, gup_flags, page)) {
*pfn = page_to_pfn(page[0]);
if (writable)
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (2 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 03/22] KVM: Allow hva_pfn_fast() to resolve read-only faults Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-05-02 17:17 ` Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
` (19 subsequent siblings)
23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Give kvm_run.exit_reason a defined initial value on entry into KVM_RUN:
other architectures (riscv, arm64) already use KVM_EXIT_UNKNOWN for this
purpose, so copy that convention.
This gives vCPUs trying to fill the run struct a mechanism to avoid
overwriting already-populated data, albeit an imperfect one. Being able
to detect an already-populated KVM run struct will prevent at least some
bugs in the upcoming implementation of KVM_CAP_MEMORY_FAULT_INFO, which
will attempt to fill the run struct whenever a vCPU fails a guest memory
access.
Without the already-populated check, KVM_CAP_MEMORY_FAULT_INFO could
change kvm_run in any code paths which
1. Populate kvm_run for some exit and prepare to return to userspace
2. Access guest memory for some reason (but without returning -EFAULTs
to userspace)
3. Finish the return to userspace set up in (1), now with the contents
of kvm_run changed to contain efault info.
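The already-populated check enabled by this convention could be used roughly as follows. This is a hypothetical sketch: kvm_populate_efault_info() and the memory_fault fields only land later in this series, and the exact layout shown here is illustrative.

```c
static void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
				     uint64_t gpa, uint64_t len)
{
	/* Refuse to clobber an exit that is already set up for userspace:
	 * KVM_EXIT_UNKNOWN means nobody has populated the run struct yet. */
	if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
		return;

	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
	vcpu->run->memory_fault.gpa = gpa;
	vcpu->run->memory_fault.len = len;
}
```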
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/x86.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 237c483b12301..ca73eb066af81 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10965,6 +10965,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
kvm_run->flags = 0;
kvm_load_guest_fpu(vcpu);
+ kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
kvm_vcpu_srcu_read_lock(vcpu);
if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
if (kvm_run->immediate_exit) {
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
2023-04-12 21:34 ` [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
@ 2023-05-02 17:17 ` Anish Moorthy
2023-05-02 18:51 ` Sean Christopherson
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-02 17:17 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
During some testing yesterday I realized that this patch actually
breaks the self test, causing an error which the later self test
changes cover up.
Running "./demand_paging_test -b 512M -u MINOR -s shmem -v 1" from
kvm/next (b3c98052d469) with just this patch applied gives the
following output
> # ./demand_paging_test -b 512M -u MINOR -s shmem -v 1
> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> guest physical test memory: [0x7fcdfffe000, 0x7fcffffe000)
> Finished creating vCPUs and starting uffd threads
> Started all vCPUs
> ==== Test Assertion Failure ====
> demand_paging_test.c:50: false
> pid=13293 tid=13297 errno=4 - Interrupted system call
> // Some stack trace stuff
> Invalid guest sync status: exit_reason=UNKNOWN, ucall=0
The problem is the get_ucall() part of the following block in the self
test's vcpu_worker()
> ret = _vcpu_run(vcpu);
> TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> TEST_ASSERT(false,
> "Invalid guest sync status: exit_reason=%s\n",
> exit_reason_str(run->exit_reason));
> }
I took a look and, while get_ucall() does depend on the value of
exit_reason, the error's root cause isn't clear to me yet.
Moving the "exit_reason = kvm_exit_unknown" line to later in the
function, right above the vcpu_run() call "fixes" the problem. I've
done that for now and will bisect later to investigate: if anyone
has any clues please let me know.
* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
2023-05-02 17:17 ` Anish Moorthy
@ 2023-05-02 18:51 ` Sean Christopherson
2023-05-02 19:49 ` Anish Moorthy
0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-05-02 18:51 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Tue, May 02, 2023, Anish Moorthy wrote:
> During some testing yesterday I realized that this patch actually
> breaks the self test, causing an error which the later self test
> changes cover up.
>
> Running "./demand_paging_test -b 512M -u MINOR -s shmem -v 1" from
> kvm/next (b3c98052d469) with just this patch applied gives the
> following output
>
> > # ./demand_paging_test -b 512M -u MINOR -s shmem -v 1
> > Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> > guest physical test memory: [0x7fcdfffe000, 0x7fcffffe000)
> > Finished creating vCPUs and starting uffd threads
> > Started all vCPUs
> > ==== Test Assertion Failure ====
> > demand_paging_test.c:50: false
> > pid=13293 tid=13297 errno=4 - Interrupted system call
> > // Some stack trace stuff
> > Invalid guest sync status: exit_reason=UNKNOWN, ucall=0
>
> The problem is the get_ucall() part of the following block in the self
> test's vcpu_worker()
>
> > ret = _vcpu_run(vcpu);
> > TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> > if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> > TEST_ASSERT(false,
> > "Invalid guest sync status: exit_reason=%s\n",
> > exit_reason_str(run->exit_reason));
> > }
>
> I took a look and, while get_ucall() does depend on the value of
> exit_reason, the error's root cause isn't clear to me yet.
Stating what you likely already know... On x86, the UCALL is performed via port
I/O, and so the selftests framework zeros out the ucall struct if the userspace
exit reason isn't KVM_EXIT_IO.
> Moving the "exit_reason = kvm_exit_unknown" line to later in the
> function, right above the vcpu_run() call "fixes" the problem. I've
> done that for now and will bisect later to investigate: if anyone
> has any clues please let me know.
Clobbering vcpu->run->exit_reason before this code block is a bug:
if (unlikely(vcpu->arch.complete_userspace_io)) {
int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
vcpu->arch.complete_userspace_io = NULL;
r = cui(vcpu);
if (r <= 0)
goto out;
} else {
WARN_ON_ONCE(vcpu->arch.pio.count);
WARN_ON_ONCE(vcpu->mmio_needed);
}
if (kvm_run->immediate_exit) {
r = -EINTR;
goto out;
}
For userspace I/O and MMIO, KVM requires userspace to "complete" the instruction
that triggered the exit to userspace, e.g. write memory/registers and skip the
instruction as needed. The immediate_exit flag is set by userspace when userspace
wants to retain control and is doing KVM_RUN purely to placate KVM. In selftests,
this is done by vcpu_run_complete_io().
The one part I'm a bit surprised by is that this caused ucall problems. The ucall
framework invokes vcpu_run_complete_io() _after_ it grabs the information.
addr = ucall_arch_get_ucall(vcpu);
if (addr) {
TEST_ASSERT(addr != (void *)GUEST_UCALL_FAILED,
"Guest failed to allocate ucall struct");
memcpy(uc, addr, sizeof(*uc));
vcpu_run_complete_io(vcpu);
} else {
memset(uc, 0, sizeof(*uc));
}
Making multiple calls to get_ucall() after a single guest ucall would explain
everything as only the first get_ucall() would succeed, but AFAICT the test doesn't
invoke get_ucall() multiple times.
Aha! Found it. _vcpu_run() invokes assert_on_unhandled_exception(), which does
if (get_ucall(vcpu, &uc) == UCALL_UNHANDLED) {
uint64_t vector = uc.args[0];
TEST_FAIL("Unexpected vectored event in guest (vector:0x%lx)",
vector);
}
and thus triggers vcpu_run_complete_io() before demand_paging_test's vcpu_worker()
gets control and does _its_ get_ucall().
* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
2023-05-02 18:51 ` Sean Christopherson
@ 2023-05-02 19:49 ` Anish Moorthy
2023-05-02 20:41 ` Sean Christopherson
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-02 19:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Thanks for nailing this down for me! One more question: should we be
concerned about any guest memory accesses occurring in the preamble to
that vcpu_run() call in kvm_arch_vcpu_ioctl_run()?
I only see two spots from which an EFAULT could make it to userspace,
those being the sync_regs() and cui() calls. The former looks clean
but I'm not sure about the latter. As written it's not an issue per se
if the cui() call tries a vCPU memory access: the
kvm_populate_efault_info() helper will just not populate the run
struct and WARN_ON_ONCE(). But it would be good to know about.
* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
2023-05-02 19:49 ` Anish Moorthy
@ 2023-05-02 20:41 ` Sean Christopherson
2023-05-02 21:46 ` Anish Moorthy
0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-05-02 20:41 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Tue, May 02, 2023, Anish Moorthy wrote:
> Thanks for nailing this down for me! One more question: should we be
> concerned about any guest memory accesses occurring in the preamble to
> that vcpu_run() call in kvm_arch_vcpu_ioctl_run()?
>
> I only see two spots from which an EFAULT could make it to userspace,
> those being the sync_regs() and cui() calls. The former looks clean
Ya, sync_regs() is a non-issue, that doesn't touch guest memory unless userspace
is doing something truly bizarre.
> but I'm not sure about the latter. As written it's not an issue per se
> if the cui() call tries a vCPU memory access- the
> kvm_populate_efault_info() helper will just not populate the run
> struct and WARN_ON_ONCE(). But it would be good to know about.
If KVM triggers a WARN_ON_ONCE(), then that's an issue. Though looking at the
code, the cui() aspect is a moot point. As I stated in the previous discussion,
the WARN_ON_ONCE() in question needs to be off-by-default.
: Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
: but set kvm_run.exit_reason to some magic number, e.g. zero it out. Then KVM could
: WARN if something tries to overwrite kvm_run.exit_reason. The WARN would need to
: be buried by a Kconfig or something since kvm_run can be modified by userspace,
: but other than that I think it would work.
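The spitballed idea above can be sketched as a standalone model: the initial -EFAULT fills memory_fault but zeroes exit_reason, and anything that later overwrites exit_reason trips a (Kconfig-gated) warning. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_run {
	uint32_t exit_reason;	/* 0 acts as the "fault info pending" magic */
	uint64_t fault_gpa;
	bool fault_filled;
};

static bool warn_fired;

/* The initial -EFAULT detection: record the fault, zero the exit reason. */
static void model_fill_efault(struct model_run *run, uint64_t gpa)
{
	run->fault_gpa = gpa;
	run->fault_filled = true;
	run->exit_reason = 0;
}

/* Any later exit-reason write over pending fault info trips the warning. */
static void model_set_exit_reason(struct model_run *run, uint32_t reason)
{
	if (run->fault_filled && run->exit_reason == 0)
		warn_fired = true;	/* stands in for the buried WARN */
	run->exit_reason = reason;
}
```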
* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
2023-05-02 20:41 ` Sean Christopherson
@ 2023-05-02 21:46 ` Anish Moorthy
2023-05-02 22:31 ` Sean Christopherson
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-02 21:46 UTC (permalink / raw)
To: Sean Christopherson
Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Tue, May 2, 2023 at 1:41 PM Sean Christopherson <seanjc@google.com> wrote:
>
> If KVM triggers a WARN_ON_ONCE(), then that's an issue. Though looking at the
> code, the cui() aspect is a moot point. As I stated in the previous discussion,
> the WARN_ON_ONCE() in question needs to be off-by-default.
>
> : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
> : but set kvm_run.exit_reason to some magic number, e.g. zero it out. Then KVM could
> : WARN if something tries to overwrite kvm_run.exit_reason. The WARN would need to
> : be buried by a Kconfig or something since kvm_run can be modified by userspace,
> : but other than that I think it would work.
Ah, ok: I thought using WARN_ON_ONCE instead of WARN might have
obviated the Kconfig. I'll go add one.
* Re: [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN
2023-05-02 21:46 ` Anish Moorthy
@ 2023-05-02 22:31 ` Sean Christopherson
0 siblings, 0 replies; 103+ messages in thread
From: Sean Christopherson @ 2023-05-02 22:31 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Tue, May 02, 2023, Anish Moorthy wrote:
> On Tue, May 2, 2023 at 1:41 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > If KVM triggers a WARN_ON_ONCE(), then that's an issue. Though looking at the
> > code, the cui() aspect is a moot point. As I stated in the previous discussion,
> > the WARN_ON_ONCE() in question needs to be off-by-default.
> >
> > : Hmm, one idea would be to have the initial -EFAULT detection fill kvm_run.memory_fault,
> > : but set kvm_run.exit_reason to some magic number, e.g. zero it out. Then KVM could
> > : WARN if something tries to overwrite kvm_run.exit_reason. The WARN would need to
> > : be buried by a Kconfig or something since kvm_run can be modified by userspace,
> > : but other than that I think it would work.
>
> Ah, ok: I thought using WARN_ON_ONCE instead of WARN might have
> obviated the Kconfig. I'll go add one.
Don't put too much effort into anything at this point. I'm not entirely convinced
that it's worth carrying a Kconfig for this one-off case (my "suggestion" was mostly
just me spitballing), and at a quick glance through the rest of the series, I'll
definitely have more comments when I do a full review, i.e. things may change too.
* [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (3 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 04/22] KVM: x86: Set vCPU exit reason to KVM_EXIT_UNKNOWN at the start of KVM_RUN Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-19 13:57 ` Hoo Robert
` (2 more replies)
2023-04-12 21:34 ` [PATCH v3 06/22] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
` (18 subsequent siblings)
23 siblings, 3 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
besides a return value of -1 and errno of EFAULT when a vCPU fails an
access to guest memory.
Add documentation, updates to the KVM headers, and a helper function
(kvm_populate_efault_info) for implementing the capability.
Besides simply filling the run struct, kvm_populate_efault_info takes
two safety measures:
a. It tries to prevent concurrent fills on a single vCPU run struct
by checking that the run struct being modified corresponds to the
currently loaded vCPU.
b. It tries to avoid filling an already-populated run struct by
checking whether the exit reason has been modified since entry
into KVM_RUN.
Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
even though EFAULT annotations are currently entirely absent. Picking a
point to declare the implementation "done" is difficult because
1. Annotations will be performed incrementally in subsequent commits
across both core and arch-specific KVM.
2. The initial series will very likely miss some cases which need
annotation. Although these omissions are to be fixed in the future,
userspace thus still needs to expect and be able to handle
unannotated EFAULTs.
Given these qualifications, just marking it available here seems the
least arbitrary thing to do.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
Documentation/virt/kvm/api.rst | 35 +++++++++++++++++++++++++++
arch/arm64/kvm/arm.c | 1 +
arch/x86/kvm/x86.c | 1 +
include/linux/kvm_host.h | 12 ++++++++++
include/uapi/linux/kvm.h | 16 +++++++++++++
tools/include/uapi/linux/kvm.h | 11 +++++++++
virt/kvm/kvm_main.c | 44 ++++++++++++++++++++++++++++++++++
7 files changed, 120 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 48fad65568227..f174f43c38d45 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6637,6 +6637,18 @@ array field represents return values. The userspace should update the return
values of SBI call before resuming the VCPU. For more details on RISC-V SBI
spec refer, https://github.com/riscv/riscv-sbi-doc.
+::
+
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+ __u64 flags;
+ __u64 gpa;
+ __u64 len; /* in bytes */
+ } memory_fault;
+
+Indicates a vCPU memory fault on the guest physical address range
+[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
+
::
/* KVM_EXIT_NOTIFY */
@@ -7670,6 +7682,29 @@ This capability is aimed to mitigate the threat that malicious VMs can
cause CPU stuck (due to event windows don't open up) and make the CPU
unavailable to host or other VMs.
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86, arm64
+:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
+ the capability.
+:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
+
+When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
+memory accesses may be annotated with additional information. When KVM_RUN
+returns an error with errno=EFAULT, userspace may check the exit reason: if it
+is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
+member of the run struct.
+
+The 'gpa' and 'len' (in bytes) fields describe the range of guest
+physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is
+currently always zero.
+
+NOTE: The implementation of this capability is incomplete. Even with it enabled,
+userspace may receive "bare" EFAULTs (i.e. exit reason !=
+KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
+reported to the maintainers.
+
8. Other capabilities.
======================
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a43e1cb3b7e97..a932346b59f61 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VCPU_ATTRIBUTES:
case KVM_CAP_PTP_KVM:
case KVM_CAP_ARM_SYSTEM_SUSPEND:
+ case KVM_CAP_MEMORY_FAULT_INFO:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ca73eb066af81..0925678e741de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4432,6 +4432,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VAPIC:
case KVM_CAP_ENABLE_CAP:
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
+ case KVM_CAP_MEMORY_FAULT_INFO:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 90edc16d37e59..776f9713f3921 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -805,6 +805,8 @@ struct kvm {
struct notifier_block pm_notifier;
#endif
char stats_id[KVM_STATS_NAME_SIZE];
+
+ bool fill_efault_info;
};
#define kvm_err(fmt, ...) \
@@ -2277,4 +2279,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
/* Max number of entries allowed for each kvm dirty ring */
#define KVM_DIRTY_RING_MAX_ENTRIES 65536
+/*
+ * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
+ * populate the memory_fault field with the given information.
+ *
+ * Does nothing if KVM_CAP_MEMORY_FAULT_INFO is not enabled. WARNs and does
+ * nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if 'vcpu' is not
+ * the current running vcpu.
+ */
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+ uint64_t gpa, uint64_t len);
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4003a166328cc..bc73e8381a2bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_RISCV_SBI 35
#define KVM_EXIT_RISCV_CSR 36
#define KVM_EXIT_NOTIFY 37
+#define KVM_EXIT_MEMORY_FAULT 38
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -505,6 +506,16 @@ struct kvm_run {
#define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
__u32 flags;
} notify;
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+ /*
+ * Indicates a memory fault on the guest physical address range
+ * [gpa, gpa + len). flags is always zero for now.
+ */
+ __u64 flags;
+ __u64 gpa;
+ __u64 len; /* in bytes */
+ } memory_fault;
/* Fix the size of the union. */
char padding[256];
};
@@ -1184,6 +1195,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
#define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
+#define KVM_CAP_MEMORY_FAULT_INFO 227
#ifdef KVM_CAP_IRQ_ROUTING
@@ -2237,4 +2249,8 @@ struct kvm_s390_zpci_op {
/* flags for kvm_s390_zpci_op->u.reg_aen.flags */
#define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
+/* flags for KVM_CAP_MEMORY_FAULT_INFO */
+#define KVM_MEMORY_FAULT_INFO_DISABLE 0
+#define KVM_MEMORY_FAULT_INFO_ENABLE 1
+
#endif /* __LINUX_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 4003a166328cc..5c57796364d65 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -264,6 +264,7 @@ struct kvm_xen_exit {
#define KVM_EXIT_RISCV_SBI 35
#define KVM_EXIT_RISCV_CSR 36
#define KVM_EXIT_NOTIFY 37
+#define KVM_EXIT_MEMORY_FAULT 38
/* For KVM_EXIT_INTERNAL_ERROR */
/* Emulate instruction failed. */
@@ -505,6 +506,16 @@ struct kvm_run {
#define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
__u32 flags;
} notify;
+ /* KVM_EXIT_MEMORY_FAULT */
+ struct {
+ /*
+ * Indicates a memory fault on the guest physical address range
+ * [gpa, gpa + len). flags is always zero for now.
+ */
+ __u64 flags;
+ __u64 gpa;
+ __u64 len; /* in bytes */
+ } memory_fault;
/* Fix the size of the union. */
char padding[256];
};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf7d3de6f3689..f3effc93cbef3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
spin_lock_init(&kvm->mn_invalidate_lock);
rcuwait_init(&kvm->mn_memslots_update_rcuwait);
xa_init(&kvm->vcpu_array);
+ kvm->fill_efault_info = false;
INIT_LIST_HEAD(&kvm->gpc_list);
spin_lock_init(&kvm->gpc_lock);
@@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
put_pid(oldpid);
}
r = kvm_arch_vcpu_ioctl_run(vcpu);
+ WARN_ON_ONCE(r == -EFAULT &&
+ vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
}
@@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
return r;
}
+ case KVM_CAP_MEMORY_FAULT_INFO: {
+ if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
+ || (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
+ && cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
+ return -EINVAL;
+ }
+ kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
+ return 0;
+ }
default:
return kvm_vm_ioctl_enable_cap(kvm, cap);
}
@@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
return init_context.err;
}
+
+inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
+ uint64_t gpa, uint64_t len)
+{
+ if (!vcpu->kvm->fill_efault_info)
+ return;
+
+ preempt_disable();
+ /*
+ * Ensure that this vCPU isn't modifying another vCPU's run struct, which
+ * would open the door for races between concurrent calls to this
+ * function.
+ */
+ if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
+ goto out;
+ /*
+ * Try not to overwrite an already-populated run struct.
+ * This isn't a perfect solution, as there's no guarantee that the exit
+ * reason is set before the run struct is populated, but it should prevent
+ * at least some bugs.
+ */
+ else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
+ goto out;
+
+ vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+ vcpu->run->memory_fault.gpa = gpa;
+ vcpu->run->memory_fault.len = len;
+ vcpu->run->memory_fault.flags = 0;
+
+out:
+ preempt_enable();
+}
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-04-19 13:57 ` Hoo Robert
2023-04-20 18:09 ` Anish Moorthy
2023-06-01 19:52 ` Oliver Upton
2023-07-04 10:10 ` Kautuk Consul
2 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 13:57 UTC (permalink / raw)
To: Anish Moorthy, pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On 4/13/2023 5:34 AM, Anish Moorthy wrote:
> KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> besides a return value of -1 and errno of EFAULT when a vCPU fails an
> access to guest memory.
>
> Add documentation, updates to the KVM headers, and a helper function
> (kvm_populate_efault_info) for implementing the capability.
kvm_populate_efault_info(), function name.
>
> Besides simply filling the run struct, kvm_populate_efault_info takes
Ditto
> two safety measures
>
> a. It tries to prevent concurrent fills on a single vCPU run struct
> by checking that the run struct being modified corresponds to the
> currently loaded vCPU.
> b. It tries to avoid filling an already-populated run struct by
> checking whether the exit reason has been modified since entry
> into KVM_RUN.
>
> Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
> even though EFAULT annotation are currently totally absent. Picking a
> point to declare the implementation "done" is difficult because
>
> 1. Annotations will be performed incrementally in subsequent commits
> across both core and arch-specific KVM.
> 2. The initial series will very likely miss some cases which need
> annotation. Although these omissions are to be fixed in the future,
> userspace thus still needs to expect and be able to handle
> unannotated EFAULTs.
>
> Given these qualifications, just marking it available here seems the
> least arbitrary thing to do.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
> Documentation/virt/kvm/api.rst | 35 +++++++++++++++++++++++++++
> arch/arm64/kvm/arm.c | 1 +
> arch/x86/kvm/x86.c | 1 +
> include/linux/kvm_host.h | 12 ++++++++++
> include/uapi/linux/kvm.h | 16 +++++++++++++
> tools/include/uapi/linux/kvm.h | 11 +++++++++
> virt/kvm/kvm_main.c | 44 ++++++++++++++++++++++++++++++++++
> 7 files changed, 120 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 48fad65568227..f174f43c38d45 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6637,6 +6637,18 @@ array field represents return values. The userspace should update the return
> values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + __u64 flags;
> + __u64 gpa;
> + __u64 len; /* in bytes */
> + } memory_fault;
> +
> +Indicates a vCPU memory fault on the guest physical address range
> +[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
> +
> ::
>
> /* KVM_EXIT_NOTIFY */
> @@ -7670,6 +7682,29 @@ This capability is aimed to mitigate the threat that malicious VMs can
> cause CPU stuck (due to event windows don't open up) and make the CPU
> unavailable to host or other VMs.
>
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86, arm64
> +:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
> + the capability.
> +:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
> +
> +When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
> +memory accesses may be annotated with additional information. When KVM_RUN
> +returns an error with errno=EFAULT, userspace may check the exit reason: if it
> +is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
> +member of the run struct.
> +
> +The 'gpa' and 'len' (in bytes) fields describe the range of guest
> +physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is
> +currently always zero.
> +
> +NOTE: The implementation of this capability is incomplete. Even with it enabled,
> +userspace may receive "bare" EFAULTs (i.e. exit reason !=
> +KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
> +reported to the maintainers.
> +
> 8. Other capabilities.
> ======================
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a43e1cb3b7e97..a932346b59f61 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VCPU_ATTRIBUTES:
> case KVM_CAP_PTP_KVM:
> case KVM_CAP_ARM_SYSTEM_SUSPEND:
> + case KVM_CAP_MEMORY_FAULT_INFO:
> r = 1;
> break;
> case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ca73eb066af81..0925678e741de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4432,6 +4432,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VAPIC:
> case KVM_CAP_ENABLE_CAP:
> case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> + case KVM_CAP_MEMORY_FAULT_INFO:
> r = 1;
> break;
> case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 90edc16d37e59..776f9713f3921 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -805,6 +805,8 @@ struct kvm {
> struct notifier_block pm_notifier;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> +
> + bool fill_efault_info;
> };
>
> #define kvm_err(fmt, ...) \
> @@ -2277,4 +2279,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * Does nothing if KVM_CAP_MEMORY_FAULT_INFO is not enabled. WARNs and does
> + * nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if 'vcpu' is not
> + * the current running vcpu.
> + */
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> + uint64_t gpa, uint64_t len);
> #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4003a166328cc..bc73e8381a2bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_RISCV_SBI 35
> #define KVM_EXIT_RISCV_CSR 36
> #define KVM_EXIT_NOTIFY 37
> +#define KVM_EXIT_MEMORY_FAULT 38
struct exit_reason[] string for KVM_EXIT_MEMORY_FAULT can be added as
well.
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
> #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> __u32 flags;
> } notify;
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + /*
> + * Indicates a memory fault on the guest physical address range
> + * [gpa, gpa + len). flags is always zero for now.
> + */
> + __u64 flags;
> + __u64 gpa;
> + __u64 len; /* in bytes */
> + } memory_fault;
> /* Fix the size of the union. */
> char padding[256];
> };
> @@ -1184,6 +1195,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> +#define KVM_CAP_MEMORY_FAULT_INFO 227
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2237,4 +2249,8 @@ struct kvm_s390_zpci_op {
> /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
>
> +/* flags for KVM_CAP_MEMORY_FAULT_INFO */
> +#define KVM_MEMORY_FAULT_INFO_DISABLE 0
> +#define KVM_MEMORY_FAULT_INFO_ENABLE 1
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 4003a166328cc..5c57796364d65 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_RISCV_SBI 35
> #define KVM_EXIT_RISCV_CSR 36
> #define KVM_EXIT_NOTIFY 37
> +#define KVM_EXIT_MEMORY_FAULT 38
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
> #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> __u32 flags;
> } notify;
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + /*
> + * Indicates a memory fault on the guest physical address range
> + * [gpa, gpa + len). flags is always zero for now.
> + */
> + __u64 flags;
> + __u64 gpa;
> + __u64 len; /* in bytes */
> + } memory_fault;
> /* Fix the size of the union. */
> char padding[256];
> };
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf7d3de6f3689..f3effc93cbef3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> + kvm->fill_efault_info = false;
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
> put_pid(oldpid);
> }
> r = kvm_arch_vcpu_ioctl_run(vcpu);
> + WARN_ON_ONCE(r == -EFAULT &&
> + vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
> trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
> break;
> }
> @@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>
> return r;
> }
> + case KVM_CAP_MEMORY_FAULT_INFO: {
> + if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
> + || (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
> + && cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
> + return -EINVAL;
> + }
> + kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
> + return 0;
> + }
> default:
> return kvm_vm_ioctl_enable_cap(kvm, cap);
> }
> @@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>
> return init_context.err;
> }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> + uint64_t gpa, uint64_t len)
> +{
> + if (!vcpu->kvm->fill_efault_info)
> + return;
> +
> + preempt_disable();
> + /*
> + * Ensure the this vCPU isn't modifying another vCPU's run struct, which
> + * would open the door for races between concurrent calls to this
> + * function.
> + */
> + if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> + goto out;
> + /*
> + * Try not to overwrite an already-populated run struct.
> + * This isn't a perfect solution, as there's no guarantee that the exit
> + * reason is set before the run struct is populated, but it should prevent
> + * at least some bugs.
> + */
> + else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> + goto out;
> +
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + vcpu->run->memory_fault.gpa = gpa;
> + vcpu->run->memory_fault.len = len;
> + vcpu->run->memory_fault.flags = 0;
> +
> +out:
> + preempt_enable();
> +}
* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-04-19 13:57 ` Hoo Robert
@ 2023-04-20 18:09 ` Anish Moorthy
2023-04-21 12:28 ` Robert Hoo
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 18:09 UTC (permalink / raw)
To: Hoo Robert
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 19, 2023 at 6:57 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> kvm_populate_efault_info(), function name.
> ...
> Ditto
Done
> struct exit_reason[] string for KVM_EXIT_MEMORY_FAULT can be added as
> well.
Done, assuming you mean the exit_reasons_known definition in kvm_util.c
* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-04-20 18:09 ` Anish Moorthy
@ 2023-04-21 12:28 ` Robert Hoo
0 siblings, 0 replies; 103+ messages in thread
From: Robert Hoo @ 2023-04-21 12:28 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
Anish Moorthy <amoorthy@google.com> wrote on Fri, Apr 21, 2023 at 02:10:
> > struct exit_reason[] string for KVM_EXIT_MEMORY_FAULT can be added as
> > well.
>
> Done, assuming you mean the exit_reasons_known definition in kvm_util.c
Yes.
* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
2023-04-19 13:57 ` Hoo Robert
@ 2023-06-01 19:52 ` Oliver Upton
2023-06-01 20:30 ` Anish Moorthy
2023-07-04 10:10 ` Kautuk Consul
2 siblings, 1 reply; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 19:52 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 12, 2023 at 09:34:53PM +0000, Anish Moorthy wrote:
[...]
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86, arm64
> +:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
> + the capability.
> +:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
> +
> +When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
> +memory accesses may be annotated with additional information. When KVM_RUN
> +returns an error with errno=EFAULT, userspace may check the exit reason: if it
> +is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
> +member of the run struct.
So the other angle of my concern w.r.t. NOWAIT exits is the fact that
userspace gets to decide whether or not we annotate such an exit. We all
agree that a NOWAIT exit w/o context isn't actionable, right?
Sean is suggesting that we abuse the fact that kvm_run already contains
junk for EFAULT exits and populate kvm_run::memory_fault unconditionally
[*]. I agree with him, and it eliminates the odd quirk of 'bare' NOWAIT
exits too. Old userspace will still see 'garbage' in kvm_run struct,
but one man's trash is another man's treasure after all :)
So, based on that, could you:
- Unconditionally prepare MEMORY_FAULT exits everywhere you're
converting here
- Redefine KVM_CAP_MEMORY_FAULT_INFO as an informational cap, and do
not accept an attempt to enable it. Instead, have calls to
KVM_CHECK_EXTENSION return a set of flags describing the supported
feature set.
Eventually, you can stuff a bit in there to advertise that all
EFAULTs are reliable.
[*] https://lore.kernel.org/kvmarm/ZHjqkdEOVUiazj5d@google.com/
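The informational-cap shape suggested above can be sketched as a standalone model: instead of ENABLE_CAP taking args, KVM_CHECK_EXTENSION returns a bitmask of supported features. The flag names and values here are hypothetical.

```c
#include <assert.h>

#define MODEL_MEMFAULT_INFO		(1 << 0)	/* annotated exits exist */
#define MODEL_MEMFAULT_ALL_EFAULTS	(1 << 1)	/* future: every EFAULT annotated */

/* What a kernel with only partial annotation coverage might report. */
static int model_check_extension(void)
{
	return MODEL_MEMFAULT_INFO;
}

/* Userspace tests the future "all EFAULTs are reliable" bit. */
static int model_all_efaults_reliable(int flags)
{
	return (flags & MODEL_MEMFAULT_ALL_EFAULTS) != 0;
}
```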
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf7d3de6f3689..f3effc93cbef3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> + kvm->fill_efault_info = false;
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
> put_pid(oldpid);
> }
> r = kvm_arch_vcpu_ioctl_run(vcpu);
> + WARN_ON_ONCE(r == -EFAULT &&
> + vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
This might be a bit overkill, as it will definitely fire on unsupported
architectures. Instead you may want to condition this on an architecture
actually selecting support for MEMORY_FAULT_INFO.
> trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
> break;
> }
> @@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>
> return r;
> }
> + case KVM_CAP_MEMORY_FAULT_INFO: {
> + if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
> + || (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
> + && cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
> + return -EINVAL;
> + }
> + kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
> + return 0;
> + }
> default:
> return kvm_vm_ioctl_enable_cap(kvm, cap);
> }
> @@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>
> return init_context.err;
> }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> + uint64_t gpa, uint64_t len)
> +{
> + if (!vcpu->kvm->fill_efault_info)
> + return;
> +
> + preempt_disable();
> + /*
> + * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> + * would open the door for races between concurrent calls to this
> + * function.
> + */
> + if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> + goto out;
> + /*
> + * Try not to overwrite an already-populated run struct.
> + * This isn't a perfect solution, as there's no guarantee that the exit
> + * reason is set before the run struct is populated, but it should prevent
> + * at least some bugs.
> + */
> + else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> + goto out;
> +
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + vcpu->run->memory_fault.gpa = gpa;
> + vcpu->run->memory_fault.len = len;
> + vcpu->run->memory_fault.flags = 0;
> +
> +out:
> + preempt_enable();
> +}
> --
> 2.40.0.577.gac1e443424-goog
>
--
Thanks,
Oliver
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-06-01 19:52 ` Oliver Upton
@ 2023-06-01 20:30 ` Anish Moorthy
2023-06-01 21:29 ` Oliver Upton
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-06-01 20:30 UTC (permalink / raw)
To: Oliver Upton
Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On Thu, Jun 1, 2023 at 12:52 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> So the other angle of my concern w.r.t. NOWAIT exits is the fact that
> userspace gets to decide whether or not we annotate such an exit. We all
> agree that a NOWAIT exit w/o context isn't actionable, right?
Yup
> Sean is suggesting that we abuse the fact that kvm_run already contains
> junk for EFAULT exits and populate kvm_run::memory_fault unconditionally
> [*]. I agree with him, and it eliminates the odd quirk of 'bare' NOWAIT
> exits too. Old userspace will still see 'garbage' in kvm_run struct,
> but one man's trash is another man's treasure after all :)
>
> So, based on that, could you:
>
> - Unconditionally prepare MEMORY_FAULT exits everywhere you're
> converting here
>
> - Redefine KVM_CAP_MEMORY_FAULT_INFO as an informational cap, and do
> not accept an attempt to enable it. Instead, have calls to
> KVM_CHECK_EXTENSION return a set of flags describing the supported
> feature set.
Sure. I've been collecting feedback as it comes in, so I can send up a
v4 with everything up to now soon. The major thing left to resolve is
that the exact set of annotations is still waiting on feedback: I've
already gone ahead and dropped everything I wasn't sure of in [1], so
the next version will be quite a bit smaller. If it turns out that
I've dropped too much, then I can add things back in based on the
feedback.
[1] https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
> Eventually, you can stuff a bit in there to advertise that all
> EFAULTs are reliable.
I don't think this is an objective: the idea is to annotate efaults
tracing back to user accesses (see [2]). Although the idea of
annotating with some "unrecoverable" flag set for other efaults has
been tossed around, so we may end up with that.
[2] https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#m5715f3a14a6a9ff9a4188918ec105592f0bfc69a
> [*] https://lore.kernel.org/kvmarm/ZHjqkdEOVUiazj5d@google.com/
>
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index cf7d3de6f3689..f3effc93cbef3 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > spin_lock_init(&kvm->mn_invalidate_lock);
> > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > xa_init(&kvm->vcpu_array);
> > + kvm->fill_efault_info = false;
> >
> > INIT_LIST_HEAD(&kvm->gpc_list);
> > spin_lock_init(&kvm->gpc_lock);
> > @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
> > put_pid(oldpid);
> > }
> > r = kvm_arch_vcpu_ioctl_run(vcpu);
> > + WARN_ON_ONCE(r == -EFAULT &&
> > + vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
>
> This might be a bit overkill, as it will definitely fire on unsupported
> architectures. Instead you may want to condition this on an architecture
> actually selecting support for MEMORY_FAULT_INFO.
Ah, that's embarrassing. Thanks for the catch.
* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-06-01 20:30 ` Anish Moorthy
@ 2023-06-01 21:29 ` Oliver Upton
0 siblings, 0 replies; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 21:29 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On Thu, Jun 01, 2023 at 01:30:58PM -0700, Anish Moorthy wrote:
> On Thu, Jun 1, 2023 at 12:52 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > Eventually, you can stuff a bit in there to advertise that all
> > EFAULTs are reliable.
>
> I don't think this is an objective: the idea is to annotate efaults
> tracing back to user accesses (see [2]). Although the idea of
> annotating with some "unrecoverable" flag set for other efaults has
> been tossed around, so we may end up with that.
Right, there's quite a bit of detail entailed by what such a bit
means... In any case, the idea would be to have a forward-looking
stance with the UAPI where we can bolt on more things to the existing
CAP in the future.
> [2] https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#m5715f3a14a6a9ff9a4188918ec105592f0bfc69a
>
> > [*] https://lore.kernel.org/kvmarm/ZHjqkdEOVUiazj5d@google.com/
> >
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index cf7d3de6f3689..f3effc93cbef3 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> > > spin_lock_init(&kvm->mn_invalidate_lock);
> > > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > > xa_init(&kvm->vcpu_array);
> > > + kvm->fill_efault_info = false;
> > >
> > > INIT_LIST_HEAD(&kvm->gpc_list);
> > > spin_lock_init(&kvm->gpc_lock);
> > > @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
> > > put_pid(oldpid);
> > > }
> > > r = kvm_arch_vcpu_ioctl_run(vcpu);
> > > + WARN_ON_ONCE(r == -EFAULT &&
> > > + vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
> >
> > This might be a bit overkill, as it will definitely fire on unsupported
> > architectures. Instead you may want to condition this on an architecture
> > actually selecting support for MEMORY_FAULT_INFO.
>
> Ah, that's embarrassing. Thanks for the catch.
No problem at all. Pretty sure I've done a lot more actually egregious
changes than you have ;)
While we're here, forgot to mention it before but please clean up that
indentation too. I think you may've gotten in a fight with the Google3
styling of your editor and lost :)
--
Thanks,
Oliver
* Re: [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO
2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
2023-04-19 13:57 ` Hoo Robert
2023-06-01 19:52 ` Oliver Upton
@ 2023-07-04 10:10 ` Kautuk Consul
2 siblings, 0 replies; 103+ messages in thread
From: Kautuk Consul @ 2023-07-04 10:10 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On 2023-04-12 21:34:53, Anish Moorthy wrote:
> KVM_CAP_MEMORY_FAULT_INFO allows kvm_run to return useful information
> besides a return value of -1 and errno of EFAULT when a vCPU fails an
> access to guest memory.
>
> Add documentation, updates to the KVM headers, and a helper function
> (kvm_populate_efault_info) for implementing the capability.
>
> Besides simply filling the run struct, kvm_populate_efault_info takes
> two safety measures
>
> a. It tries to prevent concurrent fills on a single vCPU run struct
> by checking that the run struct being modified corresponds to the
> currently loaded vCPU.
> b. It tries to avoid filling an already-populated run struct by
> checking whether the exit reason has been modified since entry
> into KVM_RUN.
>
> Finally, mark KVM_CAP_MEMORY_FAULT_INFO as available on arm64 and x86,
> even though EFAULT annotations are currently totally absent. Picking a
> point to declare the implementation "done" is difficult because
>
> 1. Annotations will be performed incrementally in subsequent commits
> across both core and arch-specific KVM.
> 2. The initial series will very likely miss some cases which need
> annotation. Although these omissions are to be fixed in the future,
> userspace thus still needs to expect and be able to handle
> unannotated EFAULTs.
>
> Given these qualifications, just marking it available here seems the
> least arbitrary thing to do.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
> Documentation/virt/kvm/api.rst | 35 +++++++++++++++++++++++++++
> arch/arm64/kvm/arm.c | 1 +
> arch/x86/kvm/x86.c | 1 +
> include/linux/kvm_host.h | 12 ++++++++++
> include/uapi/linux/kvm.h | 16 +++++++++++++
> tools/include/uapi/linux/kvm.h | 11 +++++++++
> virt/kvm/kvm_main.c | 44 ++++++++++++++++++++++++++++++++++
> 7 files changed, 120 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 48fad65568227..f174f43c38d45 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6637,6 +6637,18 @@ array field represents return values. The userspace should update the return
> values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> +::
> +
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + __u64 flags;
> + __u64 gpa;
> + __u64 len; /* in bytes */
> + } memory_fault;
> +
> +Indicates a vCPU memory fault on the guest physical address range
> +[gpa, gpa + len). See KVM_CAP_MEMORY_FAULT_INFO for more details.
> +
> ::
>
> /* KVM_EXIT_NOTIFY */
> @@ -7670,6 +7682,29 @@ This capability is aimed to mitigate the threat that malicious VMs can
> cause CPU stuck (due to event windows don't open up) and make the CPU
> unavailable to host or other VMs.
>
> +7.34 KVM_CAP_MEMORY_FAULT_INFO
> +------------------------------
> +
> +:Architectures: x86, arm64
> +:Parameters: args[0] - KVM_MEMORY_FAULT_INFO_ENABLE|DISABLE to enable/disable
> + the capability.
> +:Returns: 0 on success, or -EINVAL if unsupported or invalid args[0].
> +
> +When enabled, EFAULTs "returned" by KVM_RUN in response to failed vCPU guest
> +memory accesses may be annotated with additional information. When KVM_RUN
> +returns an error with errno=EFAULT, userspace may check the exit reason: if it
> +is KVM_EXIT_MEMORY_FAULT, userspace is then permitted to read the 'memory_fault'
> +member of the run struct.
> +
> +The 'gpa' and 'len' (in bytes) fields describe the range of guest
> +physical memory to which access failed, i.e. [gpa, gpa + len). 'flags' is
> +currently always zero.
> +
> +NOTE: The implementation of this capability is incomplete. Even with it enabled,
> +userspace may receive "bare" EFAULTs (i.e. exit reason !=
> +KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
> +reported to the maintainers.
> +
> 8. Other capabilities.
> ======================
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index a43e1cb3b7e97..a932346b59f61 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -220,6 +220,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VCPU_ATTRIBUTES:
> case KVM_CAP_PTP_KVM:
> case KVM_CAP_ARM_SYSTEM_SUSPEND:
> + case KVM_CAP_MEMORY_FAULT_INFO:
> r = 1;
> break;
> case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ca73eb066af81..0925678e741de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4432,6 +4432,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VAPIC:
> case KVM_CAP_ENABLE_CAP:
> case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> + case KVM_CAP_MEMORY_FAULT_INFO:
> r = 1;
> break;
> case KVM_CAP_EXIT_HYPERCALL:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 90edc16d37e59..776f9713f3921 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -805,6 +805,8 @@ struct kvm {
> struct notifier_block pm_notifier;
> #endif
> char stats_id[KVM_STATS_NAME_SIZE];
> +
> + bool fill_efault_info;
> };
>
> #define kvm_err(fmt, ...) \
> @@ -2277,4 +2279,14 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> /* Max number of entries allowed for each kvm dirty ring */
> #define KVM_DIRTY_RING_MAX_ENTRIES 65536
>
> +/*
> + * Attempts to set the run struct's exit reason to KVM_EXIT_MEMORY_FAULT and
> + * populate the memory_fault field with the given information.
> + *
> + * Does nothing if KVM_CAP_MEMORY_FAULT_INFO is not enabled. WARNs and does
> + * nothing if the exit reason is not KVM_EXIT_UNKNOWN, or if 'vcpu' is not
> + * the current running vcpu.
> + */
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> + uint64_t gpa, uint64_t len);
> #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4003a166328cc..bc73e8381a2bb 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_RISCV_SBI 35
> #define KVM_EXIT_RISCV_CSR 36
> #define KVM_EXIT_NOTIFY 37
> +#define KVM_EXIT_MEMORY_FAULT 38
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
> #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> __u32 flags;
> } notify;
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + /*
> + * Indicates a memory fault on the guest physical address range
> + * [gpa, gpa + len). flags is always zero for now.
> + */
> + __u64 flags;
> + __u64 gpa;
> + __u64 len; /* in bytes */
> + } memory_fault;
> /* Fix the size of the union. */
> char padding[256];
> };
> @@ -1184,6 +1195,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> +#define KVM_CAP_MEMORY_FAULT_INFO 227
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -2237,4 +2249,8 @@ struct kvm_s390_zpci_op {
> /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
> #define KVM_S390_ZPCIOP_REGAEN_HOST (1 << 0)
>
> +/* flags for KVM_CAP_MEMORY_FAULT_INFO */
> +#define KVM_MEMORY_FAULT_INFO_DISABLE 0
> +#define KVM_MEMORY_FAULT_INFO_ENABLE 1
> +
> #endif /* __LINUX_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 4003a166328cc..5c57796364d65 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -264,6 +264,7 @@ struct kvm_xen_exit {
> #define KVM_EXIT_RISCV_SBI 35
> #define KVM_EXIT_RISCV_CSR 36
> #define KVM_EXIT_NOTIFY 37
> +#define KVM_EXIT_MEMORY_FAULT 38
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> /* Emulate instruction failed. */
> @@ -505,6 +506,16 @@ struct kvm_run {
> #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> __u32 flags;
> } notify;
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + /*
> + * Indicates a memory fault on the guest physical address range
> + * [gpa, gpa + len). flags is always zero for now.
> + */
> + __u64 flags;
> + __u64 gpa;
> + __u64 len; /* in bytes */
> + } memory_fault;
> /* Fix the size of the union. */
> char padding[256];
> };
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cf7d3de6f3689..f3effc93cbef3 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1142,6 +1142,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
> spin_lock_init(&kvm->mn_invalidate_lock);
> rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> xa_init(&kvm->vcpu_array);
> + kvm->fill_efault_info = false;
>
> INIT_LIST_HEAD(&kvm->gpc_list);
> spin_lock_init(&kvm->gpc_lock);
> @@ -4096,6 +4097,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
> put_pid(oldpid);
> }
> r = kvm_arch_vcpu_ioctl_run(vcpu);
> + WARN_ON_ONCE(r == -EFAULT &&
> + vcpu->run->exit_reason != KVM_EXIT_MEMORY_FAULT);
> trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
> break;
> }
> @@ -4672,6 +4675,15 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>
> return r;
> }
> + case KVM_CAP_MEMORY_FAULT_INFO: {
> + if (!kvm_vm_ioctl_check_extension_generic(kvm, cap->cap)
> + || (cap->args[0] != KVM_MEMORY_FAULT_INFO_ENABLE
> + && cap->args[0] != KVM_MEMORY_FAULT_INFO_DISABLE)) {
> + return -EINVAL;
> + }
> + kvm->fill_efault_info = cap->args[0] == KVM_MEMORY_FAULT_INFO_ENABLE;
> + return 0;
> + }
> default:
> return kvm_vm_ioctl_enable_cap(kvm, cap);
> }
> @@ -6173,3 +6185,35 @@ int kvm_vm_create_worker_thread(struct kvm *kvm, kvm_vm_thread_fn_t thread_fn,
>
> return init_context.err;
> }
> +
> +inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> + uint64_t gpa, uint64_t len)
> +{
> + if (!vcpu->kvm->fill_efault_info)
> + return;
> +
> + preempt_disable();
> + /*
> > + * Ensure that this vCPU isn't modifying another vCPU's run struct, which
> + * would open the door for races between concurrent calls to this
> + * function.
> + */
> + if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> + goto out;
Why use WARN_ON_ONCE when there is a clear possibility of preemption
kicking in (with the possibility of vcpu_load/vcpu_put being called
in the new task) before preempt_disable() is called in this function ?
I think you should use WARN_ON_ONCE only where there is some impossible
situation happening, not when there is a possibility of that
situation happening as per the kernel code. I think that this WARN_ON_ONCE
could make sense if kvm_populate_efault_info() is called from atomic context,
but not when you are disabling preemption from this function itself.
Basically I don't think there is any way we can guarantee that
preemption DOESN'T kick in before the preempt_disable() such that
this warning is actually something that deserves to have a kernel
WARN_ON_ONCE() warning.
Can we get rid of this WARN_ON_ONCE and straightaway jump to the
out label if "(vcpu != __this_cpu_read(kvm_running_vcpu))" is true, or
please do correct me if I am wrong about something ?
> + /*
> + * Try not to overwrite an already-populated run struct.
> + * This isn't a perfect solution, as there's no guarantee that the exit
> + * reason is set before the run struct is populated, but it should prevent
> + * at least some bugs.
> + */
> + else if (WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN))
> + goto out;
> +
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + vcpu->run->memory_fault.gpa = gpa;
> + vcpu->run->memory_fault.len = len;
> + vcpu->run->memory_fault.flags = 0;
> +
> +out:
> + preempt_enable();
> +}
> --
> 2.40.0.577.gac1e443424-goog
>
* [PATCH v3 06/22] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (4 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 05/22] KVM: Add KVM_CAP_MEMORY_FAULT_INFO Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
` (17 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
The order of parameters in these function signatures is a little strange,
with "offset" actually applying to "gfn" rather than to "data". Add
short comments to make things perfectly clear.
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
virt/kvm/kvm_main.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f3effc93cbef3..63b4285d858d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2982,6 +2982,9 @@ static int next_segment(unsigned long len, int offset)
return len;
}
+/*
+ * Copy 'len' bytes from guest memory at '(gfn * PAGE_SIZE) + offset' to 'data'
+ */
static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
void *data, int offset, int len)
{
@@ -3083,6 +3086,9 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
}
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
+/*
+ * Copy 'len' bytes from 'data' into guest memory at '(gfn * PAGE_SIZE) + offset'
+ */
static int __kvm_write_guest_page(struct kvm *kvm,
struct kvm_memory_slot *memslot, gfn_t gfn,
const void *data, int offset, int len)
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (5 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 06/22] KVM: Add docstrings to __kvm_write_guest_page() and __kvm_read_guest_page() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-20 20:52 ` Peter Xu
2023-04-12 21:34 ` [PATCH v3 08/22] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
` (16 subsequent siblings)
23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
kvm_vcpu_write_guest_page()
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
virt/kvm/kvm_main.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 63b4285d858d1..b29a38af543f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
const void *data, int offset, int len)
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
- return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
+ if (ret == -EFAULT)
+ kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
+ return ret;
}
EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
2023-04-12 21:34 ` [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
@ 2023-04-20 20:52 ` Peter Xu
2023-04-20 23:29 ` Anish Moorthy
0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-20 20:52 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, kvm, kvmarm
On Wed, Apr 12, 2023 at 09:34:55PM +0000, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
> kvm_vcpu_write_guest_page()
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
> virt/kvm/kvm_main.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 63b4285d858d1..b29a38af543f0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> const void *data, int offset, int len)
> {
> struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> + int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
>
> - return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> + if (ret == -EFAULT)
> + kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
> + return ret;
> }
> EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
Why need to trap this? Is this -EFAULT part of the "scalable userfault"
plan or not?
My previous memory was one can still leave things like copy_to_user() to go
via the userfaults channels which should work in parallel with the new vcpu
MEMORY_FAULT exit. But maybe the plan changed?
Thanks,
--
Peter Xu
* Re: [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
2023-04-20 20:52 ` Peter Xu
@ 2023-04-20 23:29 ` Anish Moorthy
2023-04-21 15:00 ` Peter Xu
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 23:29 UTC (permalink / raw)
To: Peter Xu
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, kvm, kvmarm
On Thu, Apr 20, 2023 at 1:52 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Apr 12, 2023 at 09:34:55PM +0000, Anish Moorthy wrote:
> > Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
> > kvm_vcpu_write_guest_page()
> >
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> > virt/kvm/kvm_main.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 63b4285d858d1..b29a38af543f0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > const void *data, int offset, int len)
> > {
> > struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > + int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> >
> > - return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> > + if (ret == -EFAULT)
> > + kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
> > + return ret;
> > }
> > EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
>
> Why need to trap this? Is this -EFAULT part of the "scalable userfault"
> plan or not?
>
> My previous memory was one can still leave things like copy_to_user() to go
> via the userfaults channels which should work in parallel with the new vcpu
> MEMORY_FAULT exit. But maybe the plan changed?
This commit isn't really part of the "scalable uffd" changes, which
basically correspond to KVM_CAP_ABSENT_MAPPING_FAULT. There should be
more details in the cover letter, but basically my v1 just included
KVM_CAP_ABSENT_MAPPING_FAULT: Sean argued that the API there ("return
to userspace whenever KVM fails a guest memory access to a page
fault") was problematic, and so I reworked the series to include a
general capability for reporting extra information for failed guest
memory accesses (KVM_CAP_MEMORY_FAULT_INFO) and
KVM_CAP_ABSENT_MAPPING_FAULT (which is meant to be used in combination
with the other cap) for the "scalable userfaultfd" changes.
As such most of the commits in this series are unrelated to
KVM_CAP_ABSENT_MAPPING_FAULT, and this is one of those commits. It
doesn't affect page faults generated by copy_to_user (which should
still be delivered via uffd).
* Re: [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page()
2023-04-20 23:29 ` Anish Moorthy
@ 2023-04-21 15:00 ` Peter Xu
0 siblings, 0 replies; 103+ messages in thread
From: Peter Xu @ 2023-04-21 15:00 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, kvm, kvmarm
On Thu, Apr 20, 2023 at 04:29:38PM -0700, Anish Moorthy wrote:
> On Thu, Apr 20, 2023 at 1:52 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Apr 12, 2023 at 09:34:55PM +0000, Anish Moorthy wrote:
> > > Implement KVM_CAP_MEMORY_FAULT_INFO for efaults from
> > > kvm_vcpu_write_guest_page()
> > >
> > > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > > ---
> > > virt/kvm/kvm_main.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 63b4285d858d1..b29a38af543f0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -3119,8 +3119,11 @@ int kvm_vcpu_write_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > > const void *data, int offset, int len)
> > > {
> > > struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > > + int ret = __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> > >
> > > - return __kvm_write_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
> > > + if (ret == -EFAULT)
> > > + kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
> > > + return ret;
> > > }
> > > EXPORT_SYMBOL_GPL(kvm_vcpu_write_guest_page);
> >
> > Why need to trap this? Is this -EFAULT part of the "scalable userfault"
> > plan or not?
> >
> > My previous memory was one can still leave things like copy_to_user() to go
> > via the userfaults channels which should work in parallel with the new vcpu
> > MEMORY_FAULT exit. But maybe the plan changed?
>
> This commit isn't really part of the "scalable uffd" changes, which
> basically correspond to KVM_CAP_ABSENT_MAPPING_FAULT. There should be
> more details in the cover letter, but basically my v1 just included
> KVM_CAP_ABSENT_MAPPING_FAULT: Sean argued that the API there ("return
> to userspace whenever KVM fails a guest memory access to a page
> fault") was problematic, and so I reworked the series to include a
> general capability for reporting extra information for failed guest
> memory accesses (KVM_CAP_MEMORY_FAULT_INFO) and
> KVM_CAP_ABSENT_MAPPING_FAULT (which is meant to be used in combination
> with the other cap) for the "scalable userfaultfd" changes.
>
> As such most of the commits in this series are unrelated to
> KVM_CAP_ABSENT_MAPPING_FAULT, and this is one of those commits. It
> doesn't affect page faults generated by copy_to_user (which should
> still be delivered via uffd).
Indeed it'll be an improvement itself to report more details for such an
error already. Makes sense to me, thanks,
--
Peter Xu
* [PATCH v3 08/22] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (6 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 07/22] KVM: Annotate -EFAULTs from kvm_vcpu_write_guest_page() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map() Anish Moorthy
` (15 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_vcpu_read_guest_page().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
virt/kvm/kvm_main.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b29a38af543f0..572adba9ad8ed 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3014,7 +3014,11 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
{
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
- return __kvm_read_guest_page(slot, gfn, data, offset, len);
+ int ret = __kvm_read_guest_page(slot, gfn, data, offset, len);
+
+ if (ret == -EFAULT)
+ kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE + offset, len);
+ return ret;
}
EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (7 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 08/22] KVM: Annotate -EFAULTs from kvm_vcpu_read_guest_page() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-20 20:53 ` Peter Xu
2023-04-12 21:34 ` [PATCH v3 10/22] KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault() Anish Moorthy
` (14 subsequent siblings)
23 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_vcpu_map().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
virt/kvm/kvm_main.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 572adba9ad8ed..f3be5aa49829a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2843,8 +2843,10 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
#endif
}
- if (!hva)
+ if (!hva) {
+ kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE, PAGE_SIZE);
return -EFAULT;
+ }
map->page = page;
map->hva = hva;
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
2023-04-12 21:34 ` [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map() Anish Moorthy
@ 2023-04-20 20:53 ` Peter Xu
2023-04-20 23:34 ` Anish Moorthy
0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-20 20:53 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, kvm, kvmarm
On Wed, Apr 12, 2023 at 09:34:57PM +0000, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> kvm_vcpu_map().
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
> virt/kvm/kvm_main.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 572adba9ad8ed..f3be5aa49829a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2843,8 +2843,10 @@ int kvm_vcpu_map(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map)
> #endif
> }
>
> - if (!hva)
> + if (!hva) {
> + kvm_populate_efault_info(vcpu, gfn * PAGE_SIZE, PAGE_SIZE);
> return -EFAULT;
> + }
>
> map->page = page;
> map->hva = hva;
Totally not familiar with nested, just a pure question on whether all the
kvm_vcpu_map() callers will be prepared to receive this -EFAULT yet?
I quickly went over the later patches but I didn't find a full solution
yet, but maybe I missed something.
Thanks,
--
Peter Xu
* Re: [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
2023-04-20 20:53 ` Peter Xu
@ 2023-04-20 23:34 ` Anish Moorthy
2023-04-21 14:58 ` Peter Xu
0 siblings, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 23:34 UTC (permalink / raw)
To: Peter Xu
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, kvm, kvmarm
On Thu, Apr 20, 2023 at 1:53 PM Peter Xu <peterx@redhat.com> wrote:
>
> Totally not familiar with nested, just a pure question on whether all the
> kvm_vcpu_map() callers will be prepared to receive this -EFAULT yet?
The return values of this function aren't being changed: I'm just
setting some extra state in the kvm_run_struct in the case where this
function already returns -EFAULT.
* Re: [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map()
2023-04-20 23:34 ` Anish Moorthy
@ 2023-04-21 14:58 ` Peter Xu
0 siblings, 0 replies; 103+ messages in thread
From: Peter Xu @ 2023-04-21 14:58 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, kvm, kvmarm
On Thu, Apr 20, 2023 at 04:34:39PM -0700, Anish Moorthy wrote:
> On Thu, Apr 20, 2023 at 1:53 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Totally not familiar with nested, just a pure question on whether all the
> > kvm_vcpu_map() callers will be prepared to receive this -EFAULT yet?
>
> The return values of this function aren't being changed: I'm just
> setting some extra state in the kvm_run_struct in the case where this
> function already returns -EFAULT.
Ah, I was wrongly assuming there'll be more -EFAULTs after you enable the
new memslot flag KVM_MEM_ABSENT_MAPPING_FAULT. But then when I re-read
your patch below I see that the new flag only affects __kvm_faultin_pfn().
Then I assume that's fine, thanks.
--
Peter Xu
* [PATCH v3 10/22] KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (8 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 09/22] KVM: Annotate -EFAULTs from kvm_vcpu_map() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-12 21:34 ` [PATCH v3 11/22] KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch() Anish Moorthy
` (13 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_mmu_page_fault().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/mmu/mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 144c5a01cd778..7391d1f75149d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5670,6 +5670,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
return -EIO;
}
+ if (r == -EFAULT)
+ kvm_populate_efault_info(vcpu, round_down(cr2_or_gpa, PAGE_SIZE),
+ PAGE_SIZE);
if (r < 0)
return r;
if (r != RET_PF_EMULATE)
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 11/22] KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (9 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 10/22] KVM: x86: Annotate -EFAULTs from kvm_mmu_page_fault() Anish Moorthy
@ 2023-04-12 21:34 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 12/22] KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault() Anish Moorthy
` (12 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:34 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
setup_vmgexit_scratch().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/svm/sev.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index c25aeb550cd97..9ef121f71dc26 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2683,6 +2683,7 @@ static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
pr_err("vmgexit: kvm_read_guest for scratch area failed\n");
kvfree(scratch_va);
+ kvm_populate_efault_info(&svm->vcpu, scratch_gpa_beg, len);
return -EFAULT;
}
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 12/22] KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (10 preceding siblings ...)
2023-04-12 21:34 ` [PATCH v3 11/22] KVM: x86: Annotate -EFAULTs from setup_vmgexit_scratch() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 13/22] KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page() Anish Moorthy
` (11 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for -EFAULTs caused by
kvm_handle_page_fault().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/mmu/mmu.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7391d1f75149d..937329bee654e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4371,8 +4371,11 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
#ifndef CONFIG_X86_64
/* A 64-bit CR2 should be impossible on 32-bit KVM. */
- if (WARN_ON_ONCE(fault_address >> 32))
+ if (WARN_ON_ONCE(fault_address >> 32)) {
+ kvm_populate_efault_info(vcpu, round_down(fault_address, PAGE_SIZE),
+ PAGE_SIZE);
return -EFAULT;
+ }
#endif
vcpu->arch.l1tf_flush_l1d = true;
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 13/22] KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (11 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 12/22] KVM: x86: Annotate -EFAULTs from kvm_handle_page_fault() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 14/22] KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing() Anish Moorthy
` (10 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_hv_get_assist_page().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/hyperv.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index b28fd020066f6..467fff271bc88 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -921,13 +921,21 @@ EXPORT_SYMBOL_GPL(kvm_hv_assist_page_enabled);
int kvm_hv_get_assist_page(struct kvm_vcpu *vcpu)
{
+ int ret = -EFAULT;
struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
if (!hv_vcpu || !kvm_hv_assist_page_enabled(vcpu))
- return -EFAULT;
+ goto out;
+
+ ret = kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data,
+ &hv_vcpu->vp_assist_page,
+ sizeof(struct hv_vp_assist_page));
- return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data,
- &hv_vcpu->vp_assist_page, sizeof(struct hv_vp_assist_page));
+out:
+ if (ret == -EFAULT)
+ kvm_populate_efault_info(vcpu, vcpu->arch.pv_eoi.data.gpa,
+ vcpu->arch.pv_eoi.data.len);
+ return ret;
}
EXPORT_SYMBOL_GPL(kvm_hv_get_assist_page);
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 14/22] KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (12 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 13/22] KVM: x86: Annotate -EFAULTs from kvm_hv_get_assist_page() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 15/22] KVM: x86: Annotate -EFAULTs from direct_map() Anish Moorthy
` (9 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_pv_clock_pairing().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/x86.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0925678e741de..3e9deab31e1c8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9589,8 +9589,10 @@ static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
ret = 0;
if (kvm_write_guest(vcpu->kvm, paddr, &clock_pairing,
- sizeof(struct kvm_clock_pairing)))
+ sizeof(struct kvm_clock_pairing))) {
+ kvm_populate_efault_info(vcpu, paddr, sizeof(struct kvm_clock_pairing));
ret = -KVM_EFAULT;
+ }
return ret;
}
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 15/22] KVM: x86: Annotate -EFAULTs from direct_map()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (13 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 14/22] KVM: x86: Annotate -EFAULTs from kvm_pv_clock_pairing() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 16/22] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
` (8 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
direct_map().
Since direct_map() traverses multiple levels of the shadow page table, it
seems like there are actually two correct guest physical address ranges
which could be provided.
1. A smaller, more specific range, which potentially corresponds to only
a part of what could not be mapped.
start = gfn_round_for_level(fault->gfn, fault->goal_level)
length = KVM_PAGES_PER_HPAGE(fault->goal_level)
2. The entire range which could not be mapped
start = gfn_round_for_level(fault->gfn, fault->goal_level)
length = KVM_PAGES_PER_HPAGE(fault->goal_level)
Take the first approach, although it's possible the second is actually
preferable.
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/mmu/mmu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 937329bee654e..a965c048edde8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3192,8 +3192,13 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
fault->req_level >= it.level);
}
- if (WARN_ON_ONCE(it.level != fault->goal_level))
+ if (WARN_ON_ONCE(it.level != fault->goal_level)) {
+ gfn_t rounded_gfn = gfn_round_for_level(fault->gfn, fault->goal_level);
+ uint64_t len = KVM_PAGES_PER_HPAGE(fault->goal_level);
+
+ kvm_populate_efault_info(vcpu, rounded_gfn, len);
return -EFAULT;
+ }
ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
base_gfn, fault->pfn, fault);
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 16/22] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (14 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 15/22] KVM: x86: Annotate -EFAULTs from direct_map() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
` (7 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_handle_error_pfn().
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/x86/kvm/mmu/mmu.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a965c048edde8..d83a3e1e3eff9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3218,6 +3218,9 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
+ uint64_t rounded_gfn;
+ uint64_t fault_size;
+
if (is_sigpending_pfn(fault->pfn)) {
kvm_handle_signal_exit(vcpu);
return -EINTR;
@@ -3236,6 +3239,10 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
return RET_PF_RETRY;
}
+ fault_size = KVM_HPAGE_SIZE(fault->goal_level);
+ rounded_gfn = round_down(fault->gfn * PAGE_SIZE, fault_size);
+
+ kvm_populate_efault_info(vcpu, rounded_gfn, fault_size);
return -EFAULT;
}
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (15 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 16/22] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-19 14:00 ` Hoo Robert
` (2 more replies)
2023-04-12 21:35 ` [PATCH v3 18/22] KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
` (6 subsequent siblings)
23 siblings, 3 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Add documentation, memslot flags, useful helper functions, and the
actual new capability itself.
Memory fault exits on absent mappings are particularly useful for
userfaultfd-based postcopy live migration. When many vCPUs fault on a
single userfaultfd the faults can take a while to surface to userspace
due to having to contend for uffd wait queue locks. Bypassing the uffd
entirely by returning information directly to the vCPU exit avoids this
contention and improves the fault rate.
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
include/linux/kvm_host.h | 7 +++++++
include/uapi/linux/kvm.h | 2 ++
tools/include/uapi/linux/kvm.h | 1 +
virt/kvm/kvm_main.c | 3 +++
5 files changed, 41 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f174f43c38d45..7967b9909e28b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
/* for kvm_userspace_memory_region::flags */
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+ #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
This ioctl allows the user to create, modify or delete a guest physical
memory slot. Bits 0-15 of "slot" specify the slot id and this value
@@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
be identical. This allows large pages in the guest to be backed by large
pages in the host.
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
+The flags field supports three flags
+
+1. KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
-use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+use it.
+2. KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
to make a new slot read-only. In this case, writes to this memory will be
posted to userspace as KVM_EXIT_MMIO exits.
+3. KVM_MEM_ABSENT_MAPPING_FAULT: see KVM_CAP_ABSENT_MAPPING_FAULT for details.
When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
the memory region are automatically reflected into the guest. For example, an
@@ -7705,6 +7709,27 @@ userspace may receive "bare" EFAULTs (i.e. exit reason !=
KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
reported to the maintainers.
+7.35 KVM_CAP_ABSENT_MAPPING_FAULT
+---------------------------------
+
+:Architectures: None
+:Returns: -EINVAL.
+
+The presence of this capability indicates that userspace may pass the
+KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
+to fail (-EFAULT) in response to page faults for which the userspace page tables
+do not contain present mappings. Attempting to enable the capability directly
+will fail.
+
+The range of guest physical memory causing the fault is advertised to userspace
+through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).
+
+Userspace should determine how best to make the mapping present, then take
+appropriate action. For instance, in the case of absent mappings this might
+involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
+faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
+mapping, userspace can return to KVM to retry the previous memory access.
+
8. Other capabilities.
======================
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 776f9713f3921..2407fc1e52ab8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2289,4 +2289,11 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
*/
inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
uint64_t gpa, uint64_t len);
+
+static inline bool kvm_slot_fault_on_absent_mapping(
+ const struct kvm_memory_slot *slot)
+{
+ return slot->flags & KVM_MEM_ABSENT_MAPPING_FAULT;
+}
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index bc73e8381a2bb..21df449e74648 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
@@ -1196,6 +1197,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
#define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
#define KVM_CAP_MEMORY_FAULT_INFO 227
+#define KVM_CAP_ABSENT_MAPPING_FAULT 228
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 5c57796364d65..59219da95634c 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
*/
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
+#define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f3be5aa49829a..7cd0ad94726df 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1525,6 +1525,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
valid_flags |= KVM_MEM_READONLY;
#endif
+ if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_ABSENT_MAPPING_FAULT))
+ valid_flags |= KVM_MEM_ABSENT_MAPPING_FAULT;
+
if (mem->flags & ~valid_flags)
return -EINVAL;
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
@ 2023-04-19 14:00 ` Hoo Robert
2023-04-20 18:23 ` Anish Moorthy
2023-04-24 21:02 ` Sean Christopherson
2023-06-01 18:19 ` Oliver Upton
2 siblings, 1 reply; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 14:00 UTC (permalink / raw)
To: Anish Moorthy, pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On 4/13/2023 5:35 AM, Anish Moorthy wrote:
> Add documentation, memslot flags, useful helper functions, and the
> actual new capability itself.
>
> Memory fault exits on absent mappings are particularly useful for
> userfaultfd-based postcopy live migration. When many vCPUs fault on a
> single userfaultfd the faults can take a while to surface to userspace
> due to having to contend for uffd wait queue locks. Bypassing the uffd
> entirely by returning information directly to the vCPU exit avoids this
> contention and improves the fault rate.
>
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
> Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
> include/linux/kvm_host.h | 7 +++++++
> include/uapi/linux/kvm.h | 2 ++
> tools/include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 3 +++
> 5 files changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f174f43c38d45..7967b9909e28b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
> /* for kvm_userspace_memory_region::flags */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> + #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
>
> This ioctl allows the user to create, modify or delete a guest physical
> memory slot. Bits 0-15 of "slot" specify the slot id and this value
> @@ -1342,12 +1343,15 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
> be identical. This allows large pages in the guest to be backed by large
> pages in the host.
>
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of
> +The flags field supports three flags
> +
> +1. KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> +use it.
> +2. KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
> to make a new slot read-only. In this case, writes to this memory will be
> posted to userspace as KVM_EXIT_MMIO exits.
> +3. KVM_MEM_ABSENT_MAPPING_FAULT: see KVM_CAP_ABSENT_MAPPING_FAULT for details.
>
> When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> the memory region are automatically reflected into the guest. For example, an
> @@ -7705,6 +7709,27 @@ userspace may receive "bare" EFAULTs (i.e. exit reason !=
> KVM_EXIT_MEMORY_FAULT) from KVM_RUN. These should be considered bugs and
> reported to the maintainers.
>
> +7.35 KVM_CAP_ABSENT_MAPPING_FAULT
> +---------------------------------
> +
> +:Architectures: None
> +:Returns: -EINVAL.
> +
> +The presence of this capability indicates that userspace may pass the
> +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> +to fail (-EFAULT) in response to page faults for which the userspace page tables
> +do not contain present mappings. Attempting to enable the capability directly
> +will fail.
> +
> +The range of guest physical memory causing the fault is advertised to userspace
> +through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).
> +
> +Userspace should determine how best to make the mapping present, then take
> +appropriate action. For instance, in the case of absent mappings this might
> +involve establishing the mapping for the first time via UFFDIO_COPY/CONTINUE or
> +faulting the mapping in using MADV_POPULATE_READ/WRITE. After establishing the
> +mapping, userspace can return to KVM to retry the previous memory access.
> +
> 8. Other capabilities.
> ======================
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 776f9713f3921..2407fc1e52ab8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2289,4 +2289,11 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> */
> inline void kvm_populate_efault_info(struct kvm_vcpu *vcpu,
> uint64_t gpa, uint64_t len);
> +
> +static inline bool kvm_slot_fault_on_absent_mapping(
> + const struct kvm_memory_slot *slot)
Strange line break.
> +{
> + return slot->flags & KVM_MEM_ABSENT_MAPPING_FAULT;
> +}
> +
> #endif
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index bc73e8381a2bb..21df449e74648 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
> */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> +#define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
>
> /* for KVM_IRQ_LINE */
> struct kvm_irq_level {
> @@ -1196,6 +1197,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> #define KVM_CAP_PMU_EVENT_MASKED_EVENTS 226
> #define KVM_CAP_MEMORY_FAULT_INFO 227
> +#define KVM_CAP_ABSENT_MAPPING_FAULT 228
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 5c57796364d65..59219da95634c 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -102,6 +102,7 @@ struct kvm_userspace_memory_region {
> */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> +#define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
>
> /* for KVM_IRQ_LINE */
> struct kvm_irq_level {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f3be5aa49829a..7cd0ad94726df 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1525,6 +1525,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
> valid_flags |= KVM_MEM_READONLY;
Is it better to also go via kvm_vm_ioctl_check_extension() rather than
#ifdef __KVM_HAVE_READONLY_MEM?
> #endif
>
> + if (kvm_vm_ioctl_check_extension(NULL, KVM_CAP_ABSENT_MAPPING_FAULT))
> + valid_flags |= KVM_MEM_ABSENT_MAPPING_FAULT;
> +
> if (mem->flags & ~valid_flags)
> return -EINVAL;
>
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-04-19 14:00 ` Hoo Robert
@ 2023-04-20 18:23 ` Anish Moorthy
0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 18:23 UTC (permalink / raw)
To: Hoo Robert
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 19, 2023 at 7:00 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
> > +static inline bool kvm_slot_fault_on_absent_mapping(
> > + const struct kvm_memory_slot *slot)
>
> Strange line break.
Fixed: there's now a single indent on the second line.
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index f3be5aa49829a..7cd0ad94726df 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1525,6 +1525,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
> > valid_flags |= KVM_MEM_READONLY;
>
> Is it better to also check this via kvm_vm_ioctl_check_extension() rather than
> #ifdef __KVM_HAVE_READONLY_MEM?
Probably, but that's unrelated, so I won't change it here.
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
2023-04-19 14:00 ` Hoo Robert
@ 2023-04-24 21:02 ` Sean Christopherson
2023-06-01 16:04 ` Oliver Upton
2023-06-01 18:19 ` Oliver Upton
2 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-04-24 21:02 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 12, 2023, Anish Moorthy wrote:
> Add documentation, memslot flags, useful helper functions, and the
> actual new capability itself.
>
> Memory fault exits on absent mappings are particularly useful for
> userfaultfd-based postcopy live migration. When many vCPUs fault on a
> single userfaultfd the faults can take a while to surface to userspace
> due to having to contend for uffd wait queue locks. Bypassing the uffd
> entirely by returning information directly to the vCPU exit avoids this
> contention and improves the fault rate.
>
> Suggested-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
> Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
> include/linux/kvm_host.h | 7 +++++++
> include/uapi/linux/kvm.h | 2 ++
> tools/include/uapi/linux/kvm.h | 1 +
> virt/kvm/kvm_main.c | 3 +++
> 5 files changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f174f43c38d45..7967b9909e28b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
> /* for kvm_userspace_memory_region::flags */
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> + #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
This name is both too specific and too vague. It's too specific because it affects
more than just "absent" mappings, it will affect any page fault that can't be
resolved by fast GUP, i.e. I'm objecting for all the same reasons I objected to
the exit reason being named KVM_MEMFAULT_REASON_ABSENT_MAPPING. It's too vague
because it doesn't describe what behavior the flag actually enables in any way.
I liked the "nowait" verbiage from the RFC. "fast_only" is an ok alternative,
but that's much more of a kernel-internal name.
Oliver, you had concerns with using "fault" in the name, is something like
KVM_MEM_NOWAIT_ON_PAGE_FAULT or KVM_MEM_NOWAIT_ON_FAULT palatable? IMO, "fault"
is perfectly ok, we just need to ensure it's unlikely to be ambiguous for userspace.
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-04-24 21:02 ` Sean Christopherson
@ 2023-06-01 16:04 ` Oliver Upton
0 siblings, 0 replies; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 16:04 UTC (permalink / raw)
To: Sean Christopherson
Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Better late than never, right? :)
On Mon, Apr 24, 2023 at 02:02:49PM -0700, Sean Christopherson wrote:
> On Wed, Apr 12, 2023, Anish Moorthy wrote:
> > Add documentation, memslot flags, useful helper functions, and the
> > actual new capability itself.
> >
> > Memory fault exits on absent mappings are particularly useful for
> > userfaultfd-based postcopy live migration. When many vCPUs fault on a
> > single userfaultfd the faults can take a while to surface to userspace
> > due to having to contend for uffd wait queue locks. Bypassing the uffd
> > entirely by returning information directly to the vCPU exit avoids this
> > contention and improves the fault rate.
> >
> > Suggested-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> > Documentation/virt/kvm/api.rst | 31 ++++++++++++++++++++++++++++---
> > include/linux/kvm_host.h | 7 +++++++
> > include/uapi/linux/kvm.h | 2 ++
> > tools/include/uapi/linux/kvm.h | 1 +
> > virt/kvm/kvm_main.c | 3 +++
> > 5 files changed, 41 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f174f43c38d45..7967b9909e28b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1312,6 +1312,7 @@ yet and must be cleared on entry.
> > /* for kvm_userspace_memory_region::flags */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY (1UL << 1)
> > + #define KVM_MEM_ABSENT_MAPPING_FAULT (1UL << 2)
>
> This name is both too specific and too vague. It's too specific because it affects
> more than just "absent" mappings, it will affect any page fault that can't be
> resolved by fast GUP, i.e. I'm objecting for all the same reasons I objected to
> the exit reason being named KVM_MEMFAULT_REASON_ABSENT_MAPPING. It's too vague
> because it doesn't describe what behavior the flag actually enables in any way.
>
> I liked the "nowait" verbiage from the RFC. "fast_only" is an ok alternative,
> but that's much more of a kernel-internal name.
>
> Oliver, you had concerns with using "fault" in the name, is something like
> KVM_MEM_NOWAIT_ON_PAGE_FAULT or KVM_MEM_NOWAIT_ON_FAULT palatable? IMO, "fault"
> is perfectly ok, we just need to ensure it's unlikely to be ambiguous for userspace.
Yeah, I can get over it. Slight preference towards KVM_MEM_NOWAIT_ON_FAULT,
fewer characters and still gets the point across.
--
Thanks,
Oliver
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
2023-04-19 14:00 ` Hoo Robert
2023-04-24 21:02 ` Sean Christopherson
@ 2023-06-01 18:19 ` Oliver Upton
2023-06-01 18:59 ` Sean Christopherson
2 siblings, 1 reply; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 18:19 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
Anish,
On Wed, Apr 12, 2023 at 09:35:05PM +0000, Anish Moorthy wrote:
> +7.35 KVM_CAP_ABSENT_MAPPING_FAULT
> +---------------------------------
> +
> +:Architectures: None
> +:Returns: -EINVAL.
> +
> +The presence of this capability indicates that userspace may pass the
> +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> +to fail (-EFAULT) in response to page faults for which the userspace page tables
> +do not contain present mappings. Attempting to enable the capability directly
> +will fail.
> +
> +The range of guest physical memory causing the fault is advertised to userspace
> +through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).
Maybe third time is the charm. I *really* do not like the
interdependence between NOWAIT exits and the completely orthogonal
annotation of existing EFAULT exits.
How do we support a userspace that only cares about NOWAIT exits but
doesn't want other EFAULT exits to be annotated? It is very likely that
userspace will only know how to resolve NOWAIT exits anyway. Since we do
not provide a precise description of the conditions that caused an exit,
there's no way for userspace to differentiate between NOWAIT exits and
other exits it couldn't care less about.
NOWAIT exits w/o annotation (i.e. a 'bare' EFAULT) make even less sense
since userspace cannot even tell what address needs fixing at that
point.
This is why I had been suggesting we separate the two capabilities and
make annotated exits an unconditional property of NOWAIT exits. It
aligns with the practical use you're proposing for the series, and still
puts userspace in the driver's seat for other issues it may or may not
care about.
--
Thanks,
Oliver
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-06-01 18:19 ` Oliver Upton
@ 2023-06-01 18:59 ` Sean Christopherson
2023-06-01 19:29 ` Oliver Upton
0 siblings, 1 reply; 103+ messages in thread
From: Sean Christopherson @ 2023-06-01 18:59 UTC (permalink / raw)
To: Oliver Upton
Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Thu, Jun 01, 2023, Oliver Upton wrote:
> Anish,
>
> On Wed, Apr 12, 2023 at 09:35:05PM +0000, Anish Moorthy wrote:
> > +7.35 KVM_CAP_ABSENT_MAPPING_FAULT
> > +---------------------------------
> > +
> > +:Architectures: None
> > +:Returns: -EINVAL.
> > +
> > +The presence of this capability indicates that userspace may pass the
> > +KVM_MEM_ABSENT_MAPPING_FAULT flag to KVM_SET_USER_MEMORY_REGION to cause KVM_RUN
> > +to fail (-EFAULT) in response to page faults for which the userspace page tables
> > +do not contain present mappings. Attempting to enable the capability directly
> > +will fail.
> > +
> > +The range of guest physical memory causing the fault is advertised to userspace
> > +through KVM_CAP_MEMORY_FAULT_INFO (if it is enabled).
>
> Maybe third time is the charm. I *really* do not like the
> interdependence between NOWAIT exits and the completely orthogonal
> annotation of existing EFAULT exits.
They're not completely orthogonal, because the touchpoints for NOWAIT are themselves
existing EFAULT exits.
> How do we support a userspace that only cares about NOWAIT exits but
> doesn't want other EFAULT exits to be annotated?
We don't. The proposed approach is to not change the return value, and the
vcpu->run union currently holds random garbage on -EFAULT, so I don't see any reason
to require userspace to opt-in, or to let userspace opt-out. I.e. fill
vcpu->run->memory_fault unconditionally (for the paths that are converted) and
advertise to userspace that vcpu->run->memory_fault *may* contain useful info on
-EFAULT when KVM_CAP_MEMORY_FAULT_INFO is supported. And then we define KVM's
ABI such that vcpu->run->memory_fault is guaranteed to be valid if an -EFAULT occurs
when faulting in guest memory (on supported architectures).
> It is very likely that userspace will only know how to resolve NOWAIT exits
> anyway. Since we do not provide a precise description of the conditions that
> caused an exit, there's no way for userspace to differentiate between NOWAIT
> exits and other exits it couldn't care less about.
>
> NOWAIT exits w/o annotation (i.e. a 'bare' EFAULT) make even less sense
> since userspace cannot even tell what address needs fixing at that
> point.
>
> This is why I had been suggesting we separate the two capabilities and
> make annotated exits an unconditional property of NOWAIT exits.
No, because as I've been stating ad nauseum, KVM cannot differentiate between a
NOWAIT -EFAULT and an -EFAULT that would have occurred regardless of the NOWAIT
behavior. Defining the ABI to be that KVM fills memory_fault if and only if the
slot has NOWAIT will create a mess, e.g. if an -EFAULT occurs while userspace
is doing a KVM_SET_USER_MEMORY_REGION to set NOWAIT, userspace may or may not see
valid memory_fault information depending on when the vCPU grabbed its memslot
snapshot.
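[Editor's note: to make the "may contain useful info" contract described above concrete, a userspace consumer might look roughly like the following sketch. The struct layout and the helper are illustrative placeholders for the uapi still under discussion, not the actual KVM ABI.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative stand-in for the memory_fault member of the kvm_run union
 * discussed above; real field names/layout are part of the in-flight uapi.
 */
struct memory_fault {
	uint64_t gpa;
	uint64_t len;
};

/*
 * Since memory_fault is only *may*-be-valid on -EFAULT from KVM_RUN,
 * userspace has to sanity-check it before acting: a fault it can service
 * must describe a non-empty range that lies inside a slot it registered.
 */
static int memory_fault_actionable(const struct memory_fault *mf,
				   uint64_t slot_gpa, uint64_t slot_len)
{
	if (mf->len == 0 || mf->len > slot_len)
		return 0;
	return mf->gpa >= slot_gpa &&
	       mf->gpa - slot_gpa <= slot_len - mf->len;
}
```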
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-06-01 18:59 ` Sean Christopherson
@ 2023-06-01 19:29 ` Oliver Upton
2023-06-01 19:34 ` Sean Christopherson
0 siblings, 1 reply; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 19:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Thu, Jun 01, 2023 at 11:59:29AM -0700, Sean Christopherson wrote:
> On Thu, Jun 01, 2023, Oliver Upton wrote:
> > How do we support a userspace that only cares about NOWAIT exits but
> > doesn't want other EFAULT exits to be annotated?
>
> We don't. The proposed approach is to not change the return value, and the
> vcpu->run union currently holds random garbage on -EFAULT, so I don't see any reason
> to require userspace to opt-in, or to let userspace opt-out. I.e. fill
> vcpu->run->memory_fault unconditionally (for the paths that are converted) and
> advertise to userspace that vcpu->run->memory_fault *may* contain useful info on
> -EFAULT when KVM_CAP_MEMORY_FAULT_INFO is supported. And then we define KVM's
> ABI such that vcpu->run->memory_fault is guarateed to be valid if an -EFAULT occurs
> when faulting in guest memory (on supported architectures).
Sure, but the series currently gives userspace an explicit opt-in for
existing EFAULT paths. Hold your breath, I'll reply over there so we
don't mix context.
> > It is very likely that userspace will only know how to resolve NOWAIT exits
> > anyway. Since we do not provide a precise description of the conditions that
> > caused an exit, there's no way for userspace to differentiate between NOWAIT
> > exits and other exits it couldn't care less about.
> >
> > NOWAIT exits w/o annotation (i.e. a 'bare' EFAULT) make even less sense
> > since userspace cannot even tell what address needs fixing at that
> > point.
> >
> > This is why I had been suggesting we separate the two capabilities and
> > make annotated exits an unconditional property of NOWAIT exits.
>
> No, because as I've been stating ad nauseum, KVM cannot differentiate between a
> NOWAIT -EFAULT and an -EFAULT that would have occurred regardless of the NOWAIT
> behavior.
IOW: "If you engage brain for more than a second, you'll actually see
the point"
Ok, I'm on board now and sorry for the noise.
--
Thanks,
Oliver
* Re: [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation
2023-06-01 19:29 ` Oliver Upton
@ 2023-06-01 19:34 ` Sean Christopherson
0 siblings, 0 replies; 103+ messages in thread
From: Sean Christopherson @ 2023-06-01 19:34 UTC (permalink / raw)
To: Oliver Upton
Cc: Anish Moorthy, pbonzini, maz, jthoughton, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Thu, Jun 01, 2023, Oliver Upton wrote:
> On Thu, Jun 01, 2023 at 11:59:29AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 01, 2023, Oliver Upton wrote:
> > > How do we support a userspace that only cares about NOWAIT exits but
> > > doesn't want other EFAULT exits to be annotated?
> >
> > We don't. The proposed approach is to not change the return value, and the
> > vcpu->run union currently holds random garbage on -EFAULT, so I don't see any reason
> > to require userspace to opt-in, or to let userspace opt-out. I.e. fill
> > vcpu->run->memory_fault unconditionally (for the paths that are converted) and
> > advertise to userspace that vcpu->run->memory_fault *may* contain useful info on
> > -EFAULT when KVM_CAP_MEMORY_FAULT_INFO is supported. And then we define KVM's
> > ABI such that vcpu->run->memory_fault is guaranteed to be valid if an -EFAULT occurs
> > when faulting in guest memory (on supported architectures).
>
> Sure, but the series currently gives userspace an explicit opt-in for
> existing EFAULT paths.
Yeah, that's one of the things I am/was going to provide feedback on; I've been
really slow getting into reviews for this cycle :-/
* [PATCH v3 18/22] KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (16 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 17/22] KVM: Introduce KVM_CAP_ABSENT_MAPPING_FAULT without implementation Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 19/22] KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort() Anish Moorthy
` (5 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
When the memslot flag is enabled, fail guest memory accesses for which
fast-gup fails (i.e., for which the mappings are not present).
Suggested-by: James Houghton <jthoughton@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
Documentation/virt/kvm/api.rst | 2 +-
arch/x86/kvm/mmu/mmu.c | 17 ++++++++++++-----
arch/x86/kvm/x86.c | 1 +
3 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7967b9909e28b..452bbca800b15 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7712,7 +7712,7 @@ reported to the maintainers.
7.35 KVM_CAP_ABSENT_MAPPING_FAULT
---------------------------------
-:Architectures: None
+:Architectures: x86
:Returns: -EINVAL.
The presence of this capability indicates that userspace may pass the
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d83a3e1e3eff9..4aef79b97c985 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4218,7 +4218,9 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
}
-static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ bool fault_on_absent_mapping)
{
struct kvm_memory_slot *slot = fault->slot;
bool async;
@@ -4251,9 +4253,12 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
}
async = false;
- fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
- fault->write, &fault->map_writable,
- &fault->hva);
+
+ fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn,
+ fault_on_absent_mapping, false,
+ fault_on_absent_mapping ? NULL : &async,
+ fault->write, &fault->map_writable, &fault->hva);
+
if (!async)
return RET_PF_CONTINUE; /* *pfn has correct page already */
@@ -4287,7 +4292,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
smp_rmb();
- ret = __kvm_faultin_pfn(vcpu, fault);
+ ret = __kvm_faultin_pfn(vcpu, fault,
+ likely(fault->slot)
+ && kvm_slot_fault_on_absent_mapping(fault->slot));
if (ret != RET_PF_CONTINUE)
return ret;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3e9deab31e1c8..bc465cde7acf6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4433,6 +4433,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ENABLE_CAP:
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
case KVM_CAP_MEMORY_FAULT_INFO:
+ case KVM_CAP_ABSENT_MAPPING_FAULT:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 19/22] KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (17 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 18/22] KVM: x86: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 20/22] KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
` (4 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Implement KVM_CAP_MEMORY_FAULT_INFO for at least some -EFAULTs returned
by user_mem_abort(). Other EFAULTs returned by this function come from
before the guest physical address of the fault is calculated: leave
those unannotated.
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
arch/arm64/kvm/mmu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ffe..d5ae636c26d62 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1307,8 +1307,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
}
- if (is_error_noslot_pfn(pfn))
+ if (is_error_noslot_pfn(pfn)) {
+ kvm_populate_efault_info(vcpu, round_down(gfn * PAGE_SIZE, vma_pagesize),
+ vma_pagesize);
return -EFAULT;
+ }
if (kvm_is_device_pfn(pfn)) {
/*
@@ -1357,6 +1360,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (kvm_vma_mte_allowed(vma)) {
sanitise_mte_tags(kvm, pfn, vma_pagesize);
} else {
+ kvm_populate_efault_info(vcpu,
+ round_down(gfn * PAGE_SIZE, vma_pagesize), vma_pagesize);
ret = -EFAULT;
goto out_unlock;
}
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 20/22] KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (18 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 19/22] KVM: arm64: Annotate (some) -EFAULTs from user_mem_abort() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 21/22] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
` (3 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Return -EFAULT from user_mem_abort when the memslot flag is enabled and
fast GUP fails to find a present mapping for the page.
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
Documentation/virt/kvm/api.rst | 2 +-
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/mmu.c | 11 +++++++++--
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 452bbca800b15..47f728701aca4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7712,7 +7712,7 @@ reported to the maintainers.
7.35 KVM_CAP_ABSENT_MAPPING_FAULT
---------------------------------
-:Architectures: x86
+:Architectures: x86, arm64
:Returns: -EINVAL.
The presence of this capability indicates that userspace may pass the
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a932346b59f61..c9666d7c6c4ff 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -221,6 +221,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_PTP_KVM:
case KVM_CAP_ARM_SYSTEM_SUSPEND:
case KVM_CAP_MEMORY_FAULT_INFO:
+ case KVM_CAP_ABSENT_MAPPING_FAULT:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d5ae636c26d62..26b9485557056 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1206,6 +1206,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
unsigned long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
+ bool exit_on_memory_fault = kvm_slot_fault_on_absent_mapping(memslot);
fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
write_fault = kvm_is_write_fault(vcpu);
@@ -1301,8 +1302,14 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*/
smp_rmb();
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- write_fault, &writable, NULL);
+ pfn = __gfn_to_pfn_memslot(memslot, gfn, exit_on_memory_fault, false, NULL,
+ write_fault, &writable, NULL);
+
+ if (exit_on_memory_fault && pfn == KVM_PFN_ERR_FAULT) {
+ kvm_populate_efault_info(vcpu,
+ round_down(gfn * PAGE_SIZE, vma_pagesize), vma_pagesize);
+ return -EFAULT;
+ }
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 21/22] KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (19 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 20/22] KVM: arm64: Implement KVM_CAP_ABSENT_MAPPING_FAULT Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
` (2 subsequent siblings)
23 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Memslot flags aren't currently exposed to the tests, and are just always
set to 0. Add a parameter to allow tests to manually set those flags.
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
tools/testing/selftests/kvm/access_tracking_perf_test.c | 2 +-
tools/testing/selftests/kvm/demand_paging_test.c | 4 ++--
tools/testing/selftests/kvm/dirty_log_perf_test.c | 2 +-
tools/testing/selftests/kvm/include/memstress.h | 2 +-
tools/testing/selftests/kvm/lib/memstress.c | 4 ++--
.../testing/selftests/kvm/memslot_modification_stress_test.c | 2 +-
6 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f567..b51656b408b83 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct kvm_vm *vm;
int nr_vcpus = params->nr_vcpus;
- vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+ vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, 0,
params->backing_src, !overlap_memory_access);
memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index c729cee4c2055..e84dde345edbc 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -144,8 +144,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
int i, num_uffds = 0;
uint64_t uffd_region_size;
- vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
- p->src_type, p->partition_vcpu_memory_access);
+ vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+ 1, 0, p->src_type, p->partition_vcpu_memory_access);
demand_paging_size = get_backing_src_pagesz(p->src_type);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index e9d6d1aecf89c..6c8749193cfa4 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -224,7 +224,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
int i;
vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
- p->slots, p->backing_src,
+ p->slots, 0, p->backing_src,
p->partition_vcpu_memory_access);
pr_info("Random seed: %u\n", p->random_seed);
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index 72e3e358ef7bd..1cba965d2d331 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -56,7 +56,7 @@ struct memstress_args {
extern struct memstress_args memstress_args;
struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
- uint64_t vcpu_memory_bytes, int slots,
+ uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
enum vm_mem_backing_src_type backing_src,
bool partition_vcpu_memory_access);
void memstress_destroy_vm(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index 5f1d3173c238c..7589b8cef6911 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -119,7 +119,7 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
}
struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
- uint64_t vcpu_memory_bytes, int slots,
+ uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
enum vm_mem_backing_src_type backing_src,
bool partition_vcpu_memory_access)
{
@@ -207,7 +207,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
vm_userspace_mem_region_add(vm, backing_src, region_start,
MEMSTRESS_MEM_SLOT_INDEX + i,
- region_pages, 0);
+ region_pages, slot_flags);
}
/* Do mapping for the demand paging memory slot */
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811f..0b19ec3ecc9cc 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
struct test_params *p = arg;
struct kvm_vm *vm;
- vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+ vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
VM_MEM_SRC_ANONYMOUS,
p->partition_vcpu_memory_access);
--
2.40.0.577.gac1e443424-goog
* [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (20 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 21/22] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
@ 2023-04-12 21:35 ` Anish Moorthy
2023-04-19 14:09 ` Hoo Robert
2023-04-27 15:48 ` James Houghton
2023-04-19 19:55 ` [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Peter Xu
2023-05-09 22:19 ` David Matlack
23 siblings, 2 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-12 21:35 UTC (permalink / raw)
To: pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, amoorthy, bgardon, dmatlack,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
Demonstrate a (very basic) scheme for supporting memory fault exits.
From the vCPU threads:
1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
with the purpose of establishing the absent mappings. Do so with
wake_waiters=false to avoid serializing on the userfaultfd wait queue
locks.
2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
assume that the mapping was already established but is currently
absent [A] and attempt to populate it using MADV_POPULATE_WRITE.
Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
wake_waiters=true to ensure that any threads sleeping on the uffd are
eventually woken up.
A real VMM would track whether it had already COPY/CONTINUEd pages (e.g.,
via a bitmap) to avoid calls destined to EEXIST. However, even the
naive approach is enough to demonstrate the performance advantages of
KVM_EXIT_MEMORY_FAULT.
[A] In reality it is much likelier that the vCPU thread simply lost a
race to establish the mapping for the page.
Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
.../selftests/kvm/demand_paging_test.c | 209 +++++++++++++-----
1 file changed, 155 insertions(+), 54 deletions(-)
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index e84dde345edbc..668bd63d944e7 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
#include <time.h>
#include <pthread.h>
#include <linux/userfaultfd.h>
+#include <sys/mman.h>
#include <sys/syscall.h>
#include "kvm_util.h"
@@ -31,6 +32,57 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
static size_t demand_paging_size;
static char *guest_data_prototype;
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+ bool is_vcpu);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+ int r;
+ void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+ r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+ TEST_ASSERT(r == 0,
+ "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
+ (uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+ int r, uffd;
+
+ /*
+ * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+ * the registered ranges should fault in the physical pages through
+ * MADV_POPULATE_WRITE.
+ */
+ if ((gpa < memstress_args.gpa)
+ || (gpa >= memstress_args.gpa + memstress_args.size)) {
+ madv_write_or_err(gpa);
+ } else {
+ if (uffd_delay)
+ usleep(uffd_delay);
+
+ uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+ r = handle_uffd_page_request(uffd_mode, uffd,
+ (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+ if (r == EEXIST)
+ madv_write_or_err(gpa);
+ }
+}
+
static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
{
struct kvm_vcpu *vcpu = vcpu_args->vcpu;
@@ -42,25 +94,36 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
clock_gettime(CLOCK_MONOTONIC, &start);
- /* Let the guest access its memory */
- ret = _vcpu_run(vcpu);
- TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
- if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
- TEST_ASSERT(false,
- "Invalid guest sync status: exit_reason=%s\n",
- exit_reason_str(run->exit_reason));
- }
+ while (true) {
+ /* Let the guest access its memory */
+ ret = _vcpu_run(vcpu);
+ TEST_ASSERT(ret == 0
+ || (errno == EFAULT
+ && run->exit_reason == KVM_EXIT_MEMORY_FAULT),
+ "vcpu_run failed: %d\n", ret);
+ if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+ if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
+ ready_page(run->memory_fault.gpa);
+ continue;
+ }
+
+ TEST_ASSERT(false,
+ "Invalid guest sync status: exit_reason=%s\n",
+ exit_reason_str(run->exit_reason));
+ }
- ts_diff = timespec_elapsed(start);
- PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
- ts_diff.tv_sec, ts_diff.tv_nsec);
+ ts_diff = timespec_elapsed(start);
+ PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
+ ts_diff.tv_sec, ts_diff.tv_nsec);
+ break;
+ }
}
-static int handle_uffd_page_request(int uffd_mode, int uffd,
- struct uffd_msg *msg)
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+ bool is_vcpu)
{
pid_t tid = syscall(__NR_gettid);
- uint64_t addr = msg->arg.pagefault.address;
struct timespec start;
struct timespec ts_diff;
int r;
@@ -71,56 +134,78 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
struct uffdio_copy copy;
copy.src = (uint64_t)guest_data_prototype;
- copy.dst = addr;
+ copy.dst = hva;
copy.len = demand_paging_size;
- copy.mode = 0;
+ copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
- r = ioctl(uffd, UFFDIO_COPY, &copy);
/*
- * With multiple vCPU threads fault on a single page and there are
- * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
- * will fail with EEXIST: handle that case without signaling an
- * error.
+ * With multiple vCPU threads and at least one of multiple reader threads
+ * or vCPU memory faults, multiple vCPUs accessing an absent page will
+ * almost certainly cause some thread doing the UFFDIO_COPY here to get
+ * EEXIST: make sure to allow that case.
*/
- if (r == -1 && errno != EEXIST) {
- pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
- addr, tid, errno);
- return r;
- }
+ r = ioctl(uffd, UFFDIO_COPY, &copy);
+ TEST_ASSERT(r == 0 || errno == EEXIST,
+ "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+ gettid(), hva, errno);
} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+ /* The comments in the UFFDIO_COPY branch also apply here. */
struct uffdio_continue cont = {0};
- cont.range.start = addr;
+ cont.range.start = hva;
cont.range.len = demand_paging_size;
+ cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
- /* See the note about EEXISTs in the UFFDIO_COPY branch. */
- if (r == -1 && errno != EEXIST) {
- pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
- addr, tid, errno);
- return r;
- }
+ TEST_ASSERT(r == 0 || errno == EEXIST,
+ "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+ gettid(), hva, errno);
} else {
TEST_FAIL("Invalid uffd mode %d", uffd_mode);
}
+ /*
+ * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
+ * waking threads waiting on the UFFD: make sure that happens here.
+ */
+ if (!is_vcpu) {
+ struct uffdio_range range = {
+ .start = hva,
+ .len = demand_paging_size
+ };
+ r = ioctl(uffd, UFFDIO_WAKE, &range);
+ TEST_ASSERT(
+ r == 0,
+ "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+ gettid(), hva, errno);
+ }
+
ts_diff = timespec_elapsed(start);
PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
timespec_to_ns(ts_diff));
PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
- demand_paging_size, addr, tid);
+ demand_paging_size, hva, tid);
return 0;
}
+static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
+ struct uffd_msg *msg)
+{
+ TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
+ "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+ msg->event);
+ return handle_uffd_page_request(uffd_mode, uffd,
+ msg->arg.pagefault.address, false);
+}
+
struct test_params {
- int uffd_mode;
bool single_uffd;
- useconds_t uffd_delay;
int readers_per_uffd;
enum vm_mem_backing_src_type src_type;
bool partition_vcpu_memory_access;
+ bool memfault_exits;
};
static void prefault_mem(void *alias, uint64_t len)
@@ -137,15 +222,26 @@ static void prefault_mem(void *alias, uint64_t len)
static void run_test(enum vm_guest_mode mode, void *arg)
{
struct test_params *p = arg;
- struct uffd_desc **uffd_descs = NULL;
struct timespec start;
struct timespec ts_diff;
struct kvm_vm *vm;
- int i, num_uffds = 0;
- uint64_t uffd_region_size;
+ int i;
+ uint32_t slot_flags = 0;
+ bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
+
+ if (uffd_memfault_exits) {
+ TEST_ASSERT(kvm_has_cap(KVM_CAP_ABSENT_MAPPING_FAULT) > 0,
+ "KVM does not have KVM_CAP_ABSENT_MAPPING_FAULT");
+ slot_flags = KVM_MEM_ABSENT_MAPPING_FAULT;
+ }
vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
- 1, 0, p->src_type, p->partition_vcpu_memory_access);
+ 1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
+
+ if (uffd_memfault_exits) {
+ vm_enable_cap(vm,
+ KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_FAULT_INFO_ENABLE);
+ }
demand_paging_size = get_backing_src_pagesz(p->src_type);
@@ -154,12 +250,12 @@ static void run_test(enum vm_guest_mode mode, void *arg)
"Failed to allocate buffer for guest data pattern");
memset(guest_data_prototype, 0xAB, demand_paging_size);
- if (p->uffd_mode) {
+ if (uffd_mode) {
num_uffds = p->single_uffd ? 1 : nr_vcpus;
uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
- TEST_ASSERT(uffd_descs, "Memory allocation failed");
+ TEST_ASSERT(uffd_descs, "Failed to allocate memory of uffd descriptors");
for (i = 0; i < num_uffds; i++) {
struct memstress_vcpu_args *vcpu_args;
@@ -179,10 +275,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
* requests.
*/
uffd_descs[i] = uffd_setup_demand_paging(
- p->uffd_mode, p->uffd_delay, vcpu_hva,
+ uffd_mode, uffd_delay, vcpu_hva,
uffd_region_size,
p->readers_per_uffd,
- &handle_uffd_page_request);
+ &handle_uffd_page_request_from_uffd);
}
}
@@ -196,7 +292,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
ts_diff = timespec_elapsed(start);
pr_info("All vCPU threads joined\n");
- if (p->uffd_mode) {
+ if (uffd_mode) {
/* Tell the user fault fd handler threads to quit */
for (i = 0; i < num_uffds; i++)
uffd_stop_demand_paging(uffd_descs[i]);
@@ -211,7 +307,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
memstress_destroy_vm(vm);
free(guest_data_prototype);
- if (p->uffd_mode)
+ if (uffd_mode)
free(uffd_descs);
}
@@ -220,7 +316,7 @@ static void help(char *name)
puts("");
printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
" [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
- " [-s type] [-v vcpus] [-o]\n", name);
+ " [-w] [-s type] [-v vcpus] [-o]\n", name);
guest_modes_help();
printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
" UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -231,6 +327,7 @@ static void help(char *name)
" FD handler to simulate demand paging\n"
" overheads. Ignored without -u.\n");
printf(" -r: Set the number of reader threads per uffd.\n");
+ printf(" -w: Enable kvm cap for memory fault exits.\n");
printf(" -b: specify the size of the memory region which should be\n"
" demand paged by each vCPU. e.g. 10M or 3G.\n"
" Default: 1G\n");
@@ -250,29 +347,30 @@ int main(int argc, char *argv[])
.partition_vcpu_memory_access = true,
.readers_per_uffd = 1,
.single_uffd = false,
+ .memfault_exits = false,
};
int opt;
guest_modes_append_default();
- while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
+ while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
switch (opt) {
case 'm':
guest_modes_cmdline(optarg);
break;
case 'u':
if (!strcmp("MISSING", optarg))
- p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+ uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
else if (!strcmp("MINOR", optarg))
- p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
- TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+ uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+ TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
break;
case 'a':
p.single_uffd = true;
break;
case 'd':
- p.uffd_delay = strtoul(optarg, NULL, 0);
- TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+ uffd_delay = strtoul(optarg, NULL, 0);
+ TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
break;
case 'b':
guest_percpu_mem_size = parse_size(optarg);
@@ -295,6 +393,9 @@ int main(int argc, char *argv[])
"Invalid number of readers per uffd %d: must be >=1",
p.readers_per_uffd);
break;
+ case 'w':
+ p.memfault_exits = true;
+ break;
case 'h':
default:
help(argv[0]);
@@ -302,7 +403,7 @@ int main(int argc, char *argv[])
}
}
- if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+ if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
!backing_src_is_shared(p.src_type)) {
TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
}
--
2.40.0.577.gac1e443424-goog
* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2023-04-19 14:09 ` Hoo Robert
2023-04-19 16:40 ` Anish Moorthy
2023-04-20 22:47 ` Anish Moorthy
2023-04-27 15:48 ` James Houghton
1 sibling, 2 replies; 103+ messages in thread
From: Hoo Robert @ 2023-04-19 14:09 UTC (permalink / raw)
To: Anish Moorthy, pbonzini, maz
Cc: oliver.upton, seanjc, jthoughton, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On 4/13/2023 5:35 AM, Anish Moorthy wrote:
> Demonstrate a (very basic) scheme for supporting memory fault exits.
>
> From the vCPU threads:
> 1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
> with the purpose of establishing the absent mappings. Do so with
> wake_waiters=false to avoid serializing on the userfaultfd wait queue
> locks.
>
> 2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
> assume that the mapping was already established but is currently
> absent [A] and attempt to populate it using MADV_POPULATE_WRITE.
>
> Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
> wake_waiters=true to ensure that any threads sleeping on the uffd are
> eventually woken up.
>
> A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
> via a bitmap) to avoid calls destined to EEXIST. However, even the
> naive approach is enough to demonstrate the performance advantages of
> KVM_EXIT_MEMORY_FAULT.
>
> [A] In reality it is much likelier that the vCPU thread simply lost a
> race to establish the mapping for the page.
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
> .../selftests/kvm/demand_paging_test.c | 209 +++++++++++++-----
> 1 file changed, 155 insertions(+), 54 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index e84dde345edbc..668bd63d944e7 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -15,6 +15,7 @@
> #include <time.h>
> #include <pthread.h>
> #include <linux/userfaultfd.h>
> +#include <sys/mman.h>
+#include <linux/mman.h> for MADV_POPULATE_WRITE definition.
> #include <sys/syscall.h>
>
> #include "kvm_util.h"
> @@ -31,6 +32,57 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
> static size_t demand_paging_size;
> static char *guest_data_prototype;
>
> +static int num_uffds;
> +static size_t uffd_region_size;
> +static struct uffd_desc **uffd_descs;
> +/*
> + * Delay when demand paging is performed through userfaultfd or directly by
> + * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
> + */
> +static useconds_t uffd_delay;
> +static int uffd_mode;
> +
> +
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> + bool is_vcpu);
> +
> +static void madv_write_or_err(uint64_t gpa)
> +{
> + int r;
> + void *hva = addr_gpa2hva(memstress_args.vm, gpa);
> +
> + r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
> + TEST_ASSERT(r == 0,
> + "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
> + (uintptr_t) hva, gpa, errno);
There are quite a few strange line breaks/indentations across this
patch set; editor's issue? :-)
> +}
> +
> +static void ready_page(uint64_t gpa)
> +{
> + int r, uffd;
> +
> + /*
> + * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
> + * the registered ranges should fault in the physical pages through
> + * MADV_POPULATE_WRITE.
> + */
> + if ((gpa < memstress_args.gpa)
> + || (gpa >= memstress_args.gpa + memstress_args.size)) {
> + madv_write_or_err(gpa);
> + } else {
> + if (uffd_delay)
> + usleep(uffd_delay);
> +
> + uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
> +
> + r = handle_uffd_page_request(uffd_mode, uffd,
> + (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
> +
> + if (r == EEXIST)
> + madv_write_or_err(gpa);
> + }
> +}
> +
> static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
> {
> struct kvm_vcpu *vcpu = vcpu_args->vcpu;
> @@ -42,25 +94,36 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
>
> clock_gettime(CLOCK_MONOTONIC, &start);
>
> - /* Let the guest access its memory */
> - ret = _vcpu_run(vcpu);
> - TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> - if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> - TEST_ASSERT(false,
> - "Invalid guest sync status: exit_reason=%s\n",
> - exit_reason_str(run->exit_reason));
> - }
> + while (true) {
> + /* Let the guest access its memory */
> + ret = _vcpu_run(vcpu);
> + TEST_ASSERT(ret == 0
> + || (errno == EFAULT
> + && run->exit_reason == KVM_EXIT_MEMORY_FAULT),
> + "vcpu_run failed: %d\n", ret);
> + if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
> +
> + if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
> + ready_page(run->memory_fault.gpa);
> + continue;
> + }
> +
> + TEST_ASSERT(false,
TEST_ASSERT(false, ...) == TEST_FAIL()
> + "Invalid guest sync status: exit_reason=%s\n",
> + exit_reason_str(run->exit_reason));
> + }
>
> - ts_diff = timespec_elapsed(start);
> - PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> - ts_diff.tv_sec, ts_diff.tv_nsec);
> + ts_diff = timespec_elapsed(start);
> + PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> + ts_diff.tv_sec, ts_diff.tv_nsec);
I think this vcpu exec time calc should be outside the while() {} block.
> + break;
> + }
> }
>
> -static int handle_uffd_page_request(int uffd_mode, int uffd,
> - struct uffd_msg *msg)
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> + bool is_vcpu)
> {
> pid_t tid = syscall(__NR_gettid);
> - uint64_t addr = msg->arg.pagefault.address;
> struct timespec start;
> struct timespec ts_diff;
> int r;
> @@ -71,56 +134,78 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> struct uffdio_copy copy;
>
> copy.src = (uint64_t)guest_data_prototype;
> - copy.dst = addr;
> + copy.dst = hva;
> copy.len = demand_paging_size;
> - copy.mode = 0;
> + copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
>
> - r = ioctl(uffd, UFFDIO_COPY, &copy);
> /*
> - * With multiple vCPU threads fault on a single page and there are
> - * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> - * will fail with EEXIST: handle that case without signaling an
> - * error.
> + * With multiple vCPU threads and at least one of multiple reader threads
> + * or vCPU memory faults, multiple vCPUs accessing an absent page will
> + * almost certainly cause some thread doing the UFFDIO_COPY here to get
> + * EEXIST: make sure to allow that case.
> */
> - if (r == -1 && errno != EEXIST) {
> - pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> - addr, tid, errno);
> - return r;
> - }
> + r = ioctl(uffd, UFFDIO_COPY, &copy);
> + TEST_ASSERT(r == 0 || errno == EEXIST,
> + "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
> + gettid(), hva, errno);
Can this gettid() be substituted by the tid above? Or #include the header
file for its prototype; otherwise it causes a build warning/error.
> } else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> + /* The comments in the UFFDIO_COPY branch also apply here. */
> struct uffdio_continue cont = {0};
>
> - cont.range.start = addr;
> + cont.range.start = hva;
> cont.range.len = demand_paging_size;
> + cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
>
> r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> - /* See the note about EEXISTs in the UFFDIO_COPY branch. */
> - if (r == -1 && errno != EEXIST) {
> - pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
> - addr, tid, errno);
> - return r;
> - }
> + TEST_ASSERT(r == 0 || errno == EEXIST,
> + "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
> + gettid(), hva, errno);
Ditto
> } else {
> TEST_FAIL("Invalid uffd mode %d", uffd_mode);
> }
>
> + /*
> + * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
> + * waking threads waiting on the UFFD: make sure that happens here.
> + */
> + if (!is_vcpu) {
> + struct uffdio_range range = {
> + .start = hva,
> + .len = demand_paging_size
> + };
> + r = ioctl(uffd, UFFDIO_WAKE, &range);
> + TEST_ASSERT(
> + r == 0,
> + "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
> + gettid(), hva, errno);
Ditto
> + }
> +
> ts_diff = timespec_elapsed(start);
>
> PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
> timespec_to_ns(ts_diff));
> PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
> - demand_paging_size, addr, tid);
> + demand_paging_size, hva, tid);
>
> return 0;
> }
>
> +static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
> + struct uffd_msg *msg)
> +{
> + TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
> + "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
> + msg->event);
> + return handle_uffd_page_request(uffd_mode, uffd,
> + msg->arg.pagefault.address, false);
> +}
> +
> struct test_params {
> - int uffd_mode;
> bool single_uffd;
> - useconds_t uffd_delay;
> int readers_per_uffd;
> enum vm_mem_backing_src_type src_type;
> bool partition_vcpu_memory_access;
> + bool memfault_exits;
> };
>
> static void prefault_mem(void *alias, uint64_t len)
> @@ -137,15 +222,26 @@ static void prefault_mem(void *alias, uint64_t len)
> static void run_test(enum vm_guest_mode mode, void *arg)
> {
> struct test_params *p = arg;
> - struct uffd_desc **uffd_descs = NULL;
> struct timespec start;
> struct timespec ts_diff;
> struct kvm_vm *vm;
> - int i, num_uffds = 0;
> - uint64_t uffd_region_size;
> + int i;
> + uint32_t slot_flags = 0;
> + bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
> +
> + if (uffd_memfault_exits) {
> + TEST_ASSERT(kvm_has_cap(KVM_CAP_ABSENT_MAPPING_FAULT) > 0,
> + "KVM does not have KVM_CAP_ABSENT_MAPPING_FAULT");
> + slot_flags = KVM_MEM_ABSENT_MAPPING_FAULT;
> + }
>
> vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
> - 1, 0, p->src_type, p->partition_vcpu_memory_access);
> + 1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
> +
> + if (uffd_memfault_exits) {
> + vm_enable_cap(vm,
> + KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_FAULT_INFO_ENABLE);
> + }
>
> demand_paging_size = get_backing_src_pagesz(p->src_type);
>
> @@ -154,12 +250,12 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> "Failed to allocate buffer for guest data pattern");
> memset(guest_data_prototype, 0xAB, demand_paging_size);
>
> - if (p->uffd_mode) {
> + if (uffd_mode) {
> num_uffds = p->single_uffd ? 1 : nr_vcpus;
> uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
>
> uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
> - TEST_ASSERT(uffd_descs, "Memory allocation failed");
> + TEST_ASSERT(uffd_descs, "Failed to allocate memory of uffd descriptors");
>
> for (i = 0; i < num_uffds; i++) {
> struct memstress_vcpu_args *vcpu_args;
> @@ -179,10 +275,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> * requests.
> */
> uffd_descs[i] = uffd_setup_demand_paging(
> - p->uffd_mode, p->uffd_delay, vcpu_hva,
> + uffd_mode, uffd_delay, vcpu_hva,
> uffd_region_size,
> p->readers_per_uffd,
> - &handle_uffd_page_request);
> + &handle_uffd_page_request_from_uffd);
> }
> }
>
> @@ -196,7 +292,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> ts_diff = timespec_elapsed(start);
> pr_info("All vCPU threads joined\n");
>
> - if (p->uffd_mode) {
> + if (uffd_mode) {
> /* Tell the user fault fd handler threads to quit */
> for (i = 0; i < num_uffds; i++)
> uffd_stop_demand_paging(uffd_descs[i]);
> @@ -211,7 +307,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> memstress_destroy_vm(vm);
>
> free(guest_data_prototype);
> - if (p->uffd_mode)
> + if (uffd_mode)
> free(uffd_descs);
> }
>
> @@ -220,7 +316,7 @@ static void help(char *name)
> puts("");
> printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
> " [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
> - " [-s type] [-v vcpus] [-o]\n", name);
> + " [-w] [-s type] [-v vcpus] [-o]\n", name);
> guest_modes_help();
> printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
> " UFFD registration mode: 'MISSING' or 'MINOR'.\n");
> @@ -231,6 +327,7 @@ static void help(char *name)
> " FD handler to simulate demand paging\n"
> " overheads. Ignored without -u.\n");
> printf(" -r: Set the number of reader threads per uffd.\n");
> + printf(" -w: Enable kvm cap for memory fault exits.\n");
> printf(" -b: specify the size of the memory region which should be\n"
> " demand paged by each vCPU. e.g. 10M or 3G.\n"
> " Default: 1G\n");
> @@ -250,29 +347,30 @@ int main(int argc, char *argv[])
> .partition_vcpu_memory_access = true,
> .readers_per_uffd = 1,
> .single_uffd = false,
> + .memfault_exits = false,
> };
> int opt;
>
> guest_modes_append_default();
>
> - while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
> + while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
> switch (opt) {
> case 'm':
> guest_modes_cmdline(optarg);
> break;
> case 'u':
> if (!strcmp("MISSING", optarg))
> - p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
> + uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
> else if (!strcmp("MINOR", optarg))
> - p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> - TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
> + uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> + TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
> break;
> case 'a':
> p.single_uffd = true;
> break;
> case 'd':
> - p.uffd_delay = strtoul(optarg, NULL, 0);
> - TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
> + uffd_delay = strtoul(optarg, NULL, 0);
> + TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
> break;
> case 'b':
> guest_percpu_mem_size = parse_size(optarg);
> @@ -295,6 +393,9 @@ int main(int argc, char *argv[])
> "Invalid number of readers per uffd %d: must be >=1",
> p.readers_per_uffd);
> break;
> + case 'w':
> + p.memfault_exits = true;
> + break;
> case 'h':
> default:
> help(argv[0]);
> @@ -302,7 +403,7 @@ int main(int argc, char *argv[])
> }
> }
>
> - if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
> + if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
> !backing_src_is_shared(p.src_type)) {
> TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
> }
* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
2023-04-19 14:09 ` Hoo Robert
@ 2023-04-19 16:40 ` Anish Moorthy
2023-04-20 22:47 ` Anish Moorthy
1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-19 16:40 UTC (permalink / raw)
To: Hoo Robert
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 19, 2023 at 7:10 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> There are quite a few strange line breaks/indentations across this
> patch set, editor's issue?:-)
A combination of editor issues and inconsistency on my part I think,
that's been a bit of a theme :/ Thanks for pointing out so many
places, I'll figure out what's going wrong (and also look at your
non-style related feedback as well :)
* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
2023-04-19 14:09 ` Hoo Robert
2023-04-19 16:40 ` Anish Moorthy
@ 2023-04-20 22:47 ` Anish Moorthy
1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-04-20 22:47 UTC (permalink / raw)
To: Hoo Robert
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 19, 2023 at 7:10 AM Hoo Robert <robert.hoo.linux@gmail.com> wrote:
>
> I think this vcpu exec time calc should be outside while() {} block.
Ah, you're right: fixed.
> can this gettid() be substituted by tid above? or #include header file
> for its prototype, otherwise build warning/error.
Huh, not sure how I missed the warning. Thanks, and done.
* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2023-04-19 14:09 ` Hoo Robert
@ 2023-04-27 15:48 ` James Houghton
2023-05-01 18:01 ` Anish Moorthy
1 sibling, 1 reply; 103+ messages in thread
From: James Houghton @ 2023-04-27 15:48 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 12, 2023 at 2:35 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> Demonstrate a (very basic) scheme for supporting memory fault exits.
>
> From the vCPU threads:
> 1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
> with the purpose of establishing the absent mappings. Do so with
> wake_waiters=false to avoid serializing on the userfaultfd wait queue
> locks.
>
> 2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
> assume that the mapping was already established but is currently
> absent [A] and attempt to populate it using MADV_POPULATE_WRITE.
>
> Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
> wake_waiters=true to ensure that any threads sleeping on the uffd are
> eventually woken up.
>
> A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
> via a bitmap) to avoid calls destined to EEXIST. However, even the
> naive approach is enough to demonstrate the performance advantages of
> KVM_EXIT_MEMORY_FAULT.
>
> [A] In reality it is much likelier that the vCPU thread simply lost a
> race to establish the mapping for the page.
>
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> Acked-by: James Houghton <jthoughton@google.com>
> ---
> .../selftests/kvm/demand_paging_test.c | 209 +++++++++++++-----
> 1 file changed, 155 insertions(+), 54 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index e84dde345edbc..668bd63d944e7 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -15,6 +15,7 @@
> #include <time.h>
> #include <pthread.h>
> #include <linux/userfaultfd.h>
> +#include <sys/mman.h>
> #include <sys/syscall.h>
>
> #include "kvm_util.h"
> @@ -31,6 +32,57 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
> static size_t demand_paging_size;
> static char *guest_data_prototype;
>
> +static int num_uffds;
> +static size_t uffd_region_size;
> +static struct uffd_desc **uffd_descs;
> +/*
> + * Delay when demand paging is performed through userfaultfd or directly by
> + * vcpu_worker in the case of a KVM_EXIT_MEMORY_FAULT.
> + */
> +static useconds_t uffd_delay;
> +static int uffd_mode;
> +
> +
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> + bool is_vcpu);
> +
> +static void madv_write_or_err(uint64_t gpa)
> +{
> + int r;
> + void *hva = addr_gpa2hva(memstress_args.vm, gpa);
> +
> + r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
> + TEST_ASSERT(r == 0,
> + "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
> + (uintptr_t) hva, gpa, errno);
> +}
> +
> +static void ready_page(uint64_t gpa)
> +{
> + int r, uffd;
> +
> + /*
> + * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
> + * the registered ranges should fault in the physical pages through
> + * MADV_POPULATE_WRITE.
> + */
> + if ((gpa < memstress_args.gpa)
> + || (gpa >= memstress_args.gpa + memstress_args.size)) {
> + madv_write_or_err(gpa);
> + } else {
> + if (uffd_delay)
> + usleep(uffd_delay);
> +
> + uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
> +
> + r = handle_uffd_page_request(uffd_mode, uffd,
> + (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
> +
> + if (r == EEXIST)
> + madv_write_or_err(gpa);
> + }
> +}
> +
> static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
> {
> struct kvm_vcpu *vcpu = vcpu_args->vcpu;
> @@ -42,25 +94,36 @@ static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
>
> clock_gettime(CLOCK_MONOTONIC, &start);
>
> - /* Let the guest access its memory */
> - ret = _vcpu_run(vcpu);
> - TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
> - if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
> - TEST_ASSERT(false,
> - "Invalid guest sync status: exit_reason=%s\n",
> - exit_reason_str(run->exit_reason));
> - }
> + while (true) {
> + /* Let the guest access its memory */
> + ret = _vcpu_run(vcpu);
> + TEST_ASSERT(ret == 0
> + || (errno == EFAULT
> + && run->exit_reason == KVM_EXIT_MEMORY_FAULT),
> + "vcpu_run failed: %d\n", ret);
> + if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
> +
> + if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
> + ready_page(run->memory_fault.gpa);
> + continue;
> + }
> +
> + TEST_ASSERT(false,
> + "Invalid guest sync status: exit_reason=%s\n",
> + exit_reason_str(run->exit_reason));
> + }
>
> - ts_diff = timespec_elapsed(start);
> - PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> - ts_diff.tv_sec, ts_diff.tv_nsec);
> + ts_diff = timespec_elapsed(start);
> + PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
> + ts_diff.tv_sec, ts_diff.tv_nsec);
> + break;
> + }
> }
>
> -static int handle_uffd_page_request(int uffd_mode, int uffd,
> - struct uffd_msg *msg)
> +static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
> + bool is_vcpu)
> {
> pid_t tid = syscall(__NR_gettid);
> - uint64_t addr = msg->arg.pagefault.address;
> struct timespec start;
> struct timespec ts_diff;
> int r;
> @@ -71,56 +134,78 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
> struct uffdio_copy copy;
>
> copy.src = (uint64_t)guest_data_prototype;
> - copy.dst = addr;
> + copy.dst = hva;
> copy.len = demand_paging_size;
> - copy.mode = 0;
> + copy.mode = UFFDIO_COPY_MODE_DONTWAKE;
>
> -	r = ioctl(uffd, UFFDIO_COPY, &copy);
> /*
> - * With multiple vCPU threads fault on a single page and there are
> - * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
> - * will fail with EEXIST: handle that case without signaling an
> - * error.
> + * With multiple vCPU threads and at least one of multiple reader threads
> + * or vCPU memory faults, multiple vCPUs accessing an absent page will
> + * almost certainly cause some thread doing the UFFDIO_COPY here to get
> + * EEXIST: make sure to allow that case.
> */
> - if (r == -1 && errno != EEXIST) {
> - pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
> - addr, tid, errno);
> - return r;
> - }
> +	r = ioctl(uffd, UFFDIO_COPY, &copy);
> + TEST_ASSERT(r == 0 || errno == EEXIST,
> + "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
> + gettid(), hva, errno);
> } else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
> + /* The comments in the UFFDIO_COPY branch also apply here. */
> struct uffdio_continue cont = {0};
>
> - cont.range.start = addr;
> + cont.range.start = hva;
> cont.range.len = demand_paging_size;
> + cont.mode = UFFDIO_CONTINUE_MODE_DONTWAKE;
>
> r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
> - /* See the note about EEXISTs in the UFFDIO_COPY branch. */
> - if (r == -1 && errno != EEXIST) {
> - pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
> - addr, tid, errno);
> - return r;
> - }
> + TEST_ASSERT(r == 0 || errno == EEXIST,
> + "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
> + gettid(), hva, errno);
> } else {
> TEST_FAIL("Invalid uffd mode %d", uffd_mode);
> }
>
> + /*
> + * If the above UFFDIO_COPY/CONTINUE fails with EEXIST, it will do so without
> + * waking threads waiting on the UFFD: make sure that happens here.
> + */
This comment sounds a little bit strange because we're always passing
MODE_DONTWAKE to UFFDIO_COPY/CONTINUE.
You *could* update the comment to reflect what this test is really
doing, but I think you actually want the test to do what the
comment suggests. That is, I think the code should:
1. DONTWAKE if is_vcpu
2. UFFDIO_WAKE if !is_vcpu && UFFDIO_COPY/CONTINUE failed (with
EEXIST, but we would have already crashed if it weren't).
This way, we can save a syscall with almost no added complexity, and
the existing userfaultfd tests remain basically untouched (i.e., no
longer always need an explicit UFFDIO_WAKE).
Thanks!
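A minimal sketch of the suggested policy (the helper names here are made up for illustration; they are not part of the patch):

```c
#include <stdbool.h>

/*
 * Sketch of the suggested wake policy. A vCPU-originated fault has no
 * reader sleeping on the uffd, so the copy can always be DONTWAKE; a
 * reader-thread fault only needs an explicit UFFDIO_WAKE when the
 * UFFDIO_COPY/CONTINUE raced and returned EEXIST, since a successful
 * copy issued without MODE_DONTWAKE performs the wake itself.
 */
bool uffd_copy_should_use_dontwake(bool is_vcpu)
{
	return is_vcpu;
}

bool uffd_needs_explicit_wake(bool is_vcpu, bool copy_got_eexist)
{
	return !is_vcpu && copy_got_eexist;
}
```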
> + if (!is_vcpu) {
> + struct uffdio_range range = {
> + .start = hva,
> + .len = demand_paging_size
> + };
> + r = ioctl(uffd, UFFDIO_WAKE, &range);
> + TEST_ASSERT(
> + r == 0,
> + "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
> + gettid(), hva, errno);
> + }
> +
> ts_diff = timespec_elapsed(start);
>
> PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
> timespec_to_ns(ts_diff));
> PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
> - demand_paging_size, addr, tid);
> + demand_paging_size, hva, tid);
>
> return 0;
> }
>
> +static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
> + struct uffd_msg *msg)
> +{
> + TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
> + "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
> + msg->event);
> + return handle_uffd_page_request(uffd_mode, uffd,
> + msg->arg.pagefault.address, false);
> +}
> +
> struct test_params {
> - int uffd_mode;
> bool single_uffd;
> - useconds_t uffd_delay;
> int readers_per_uffd;
> enum vm_mem_backing_src_type src_type;
> bool partition_vcpu_memory_access;
> + bool memfault_exits;
> };
>
> static void prefault_mem(void *alias, uint64_t len)
> @@ -137,15 +222,26 @@ static void prefault_mem(void *alias, uint64_t len)
> static void run_test(enum vm_guest_mode mode, void *arg)
> {
> struct test_params *p = arg;
> - struct uffd_desc **uffd_descs = NULL;
> struct timespec start;
> struct timespec ts_diff;
> struct kvm_vm *vm;
> - int i, num_uffds = 0;
> - uint64_t uffd_region_size;
> + int i;
> + uint32_t slot_flags = 0;
> + bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
> +
> + if (uffd_memfault_exits) {
> + TEST_ASSERT(kvm_has_cap(KVM_CAP_ABSENT_MAPPING_FAULT) > 0,
> + "KVM does not have KVM_CAP_ABSENT_MAPPING_FAULT");
> + slot_flags = KVM_MEM_ABSENT_MAPPING_FAULT;
> + }
>
> vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
> - 1, 0, p->src_type, p->partition_vcpu_memory_access);
> + 1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
> +
> + if (uffd_memfault_exits) {
> + vm_enable_cap(vm,
> + KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_FAULT_INFO_ENABLE);
> + }
>
> demand_paging_size = get_backing_src_pagesz(p->src_type);
>
> @@ -154,12 +250,12 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> "Failed to allocate buffer for guest data pattern");
> memset(guest_data_prototype, 0xAB, demand_paging_size);
>
> - if (p->uffd_mode) {
> + if (uffd_mode) {
> num_uffds = p->single_uffd ? 1 : nr_vcpus;
> uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
>
> uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
> - TEST_ASSERT(uffd_descs, "Memory allocation failed");
> + TEST_ASSERT(uffd_descs, "Failed to allocate memory of uffd descriptors");
>
> for (i = 0; i < num_uffds; i++) {
> struct memstress_vcpu_args *vcpu_args;
> @@ -179,10 +275,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> * requests.
> */
> uffd_descs[i] = uffd_setup_demand_paging(
> - p->uffd_mode, p->uffd_delay, vcpu_hva,
> + uffd_mode, uffd_delay, vcpu_hva,
> uffd_region_size,
> p->readers_per_uffd,
> - &handle_uffd_page_request);
> + &handle_uffd_page_request_from_uffd);
> }
> }
>
> @@ -196,7 +292,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> ts_diff = timespec_elapsed(start);
> pr_info("All vCPU threads joined\n");
>
> - if (p->uffd_mode) {
> + if (uffd_mode) {
> /* Tell the user fault fd handler threads to quit */
> for (i = 0; i < num_uffds; i++)
> uffd_stop_demand_paging(uffd_descs[i]);
> @@ -211,7 +307,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> memstress_destroy_vm(vm);
>
> free(guest_data_prototype);
> - if (p->uffd_mode)
> + if (uffd_mode)
> free(uffd_descs);
> }
>
> @@ -220,7 +316,7 @@ static void help(char *name)
> puts("");
> printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
> " [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
> - " [-s type] [-v vcpus] [-o]\n", name);
> + " [-w] [-s type] [-v vcpus] [-o]\n", name);
> guest_modes_help();
> printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
> " UFFD registration mode: 'MISSING' or 'MINOR'.\n");
> @@ -231,6 +327,7 @@ static void help(char *name)
> " FD handler to simulate demand paging\n"
> " overheads. Ignored without -u.\n");
> printf(" -r: Set the number of reader threads per uffd.\n");
> + printf(" -w: Enable kvm cap for memory fault exits.\n");
> printf(" -b: specify the size of the memory region which should be\n"
> " demand paged by each vCPU. e.g. 10M or 3G.\n"
> " Default: 1G\n");
> @@ -250,29 +347,30 @@ int main(int argc, char *argv[])
> .partition_vcpu_memory_access = true,
> .readers_per_uffd = 1,
> .single_uffd = false,
> + .memfault_exits = false,
> };
> int opt;
>
> guest_modes_append_default();
>
> - while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:r:")) != -1) {
> + while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:r:")) != -1) {
> switch (opt) {
> case 'm':
> guest_modes_cmdline(optarg);
> break;
> case 'u':
> if (!strcmp("MISSING", optarg))
> - p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
> + uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
> else if (!strcmp("MINOR", optarg))
> - p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> - TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
> + uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
> + TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
> break;
> case 'a':
> p.single_uffd = true;
> break;
> case 'd':
> - p.uffd_delay = strtoul(optarg, NULL, 0);
> - TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
> + uffd_delay = strtoul(optarg, NULL, 0);
> + TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
> break;
> case 'b':
> guest_percpu_mem_size = parse_size(optarg);
> @@ -295,6 +393,9 @@ int main(int argc, char *argv[])
> "Invalid number of readers per uffd %d: must be >=1",
> p.readers_per_uffd);
> break;
> + case 'w':
> + p.memfault_exits = true;
> + break;
> case 'h':
> default:
> help(argv[0]);
> @@ -302,7 +403,7 @@ int main(int argc, char *argv[])
> }
> }
>
> - if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
> + if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
> !backing_src_is_shared(p.src_type)) {
> TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
> }
> --
> 2.40.0.577.gac1e443424-goog
>
^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test
2023-04-27 15:48 ` James Houghton
@ 2023-05-01 18:01 ` Anish Moorthy
0 siblings, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-01 18:01 UTC (permalink / raw)
To: James Houghton
Cc: pbonzini, maz, oliver.upton, seanjc, bgardon, dmatlack, ricarkol,
axelrasmussen, peterx, kvm, kvmarm
On Thu, Apr 27, 2023 at 8:48 AM James Houghton <jthoughton@google.com> wrote:
>
> This comment sounds a little bit strange because we're always passing
> MODE_DONTWAKE to UFFDIO_COPY/CONTINUE.
>
> You *could* update the comment to reflect what this test is really
> doing, but I think you actually probably want the test to do what the
> comment suggests. That is, I think the code you should write should:
> 1. DONTWAKE if is_vcpu
> 2. UFFDIO_WAKE if !is_vcpu && UFFDIO_COPY/CONTINUE failed (with
> EEXIST, but we would have already crashed if it weren't).
>
> This way, we can save a syscall with almost no added complexity, and
> the existing userfaultfd tests remain basically untouched (i.e., no
> longer always need an explicit UFFDIO_WAKE).
>
> Thanks!
Good points, and taken: though in practice I suspect that every fault
read from the uffd will hit EEXIST and necessitate the wake anyway.
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (21 preceding siblings ...)
2023-04-12 21:35 ` [PATCH v3 22/22] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2023-04-19 19:55 ` Peter Xu
2023-04-19 20:15 ` Axel Rasmussen
2023-05-09 22:19 ` David Matlack
23 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-19 19:55 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
dmatlack, ricarkol, axelrasmussen, kvm, kvmarm
Hi, Anish,
On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> KVM's demand paging self test is extended to demonstrate the performance
> benefits of using the two new capabilities to bypass the userfaultfd
> wait queue. The performance samples below (rates in thousands of
> pages/s, n = 5), were generated using [2] on an x86 machine with 256
> cores.
>
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> 1 150 340
> 2 191 477
> 4 210 809
> 8 155 1239
> 16 130 1595
> 32 108 2299
> 64 86 3482
> 128 62 4134
> 256 36 4012
The numbers look very promising. Though..
>
> [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> A quick rundown of the new flags (also detailed in later commits)
> -a registers all of guest memory to a single uffd.
... this is the worst case scenario. I'd say it's slightly unfair to
compare by first introducing a bottleneck and then comparing against it. :)
Jokes aside: I think it would make more sense if such a performance solution
were measured on real systems showing real benefits, because so far it's
still not convincing enough with only the selftest, especially with only
one uffd.
I don't remember whether I used to discuss this with James before, but..
I know that having multiple uffds in production also means scattered guest
memory and scattered VMAs all over the place. However, splitting the large
guest memory into at least a few (or even tens of) VMAs may still be worth
trying. Do you think that would already resolve some of the contention on
userfaultfd, either on the queue or elsewhere?
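For reference, the contiguous sharding the selftest patch above already implements boils down to simple index arithmetic (a sketch mirroring the `uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]` lookup; `uffd_index_for_gpa` is an illustrative name, not from the series):

```c
#include <stdint.h>

/*
 * Map a faulting GPA to the uffd shard that registered it, assuming
 * guest memory of `total` bytes split evenly across `num_uffds`
 * contiguous regions.
 */
uint64_t uffd_index_for_gpa(uint64_t gpa, uint64_t base, uint64_t total,
			    int num_uffds)
{
	uint64_t region_size = total / num_uffds;

	return (gpa - base) / region_size;
}
```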
With a bunch of VMAs and userfaultfds (each paired with uffd fault handler
threads and a totally separate uffd queue), I'd expect other bottlenecks,
e.g. network bandwidth, to show up first, without teaching each vCPU
thread to report uffd faults itself.
This is pure speculation on my part, though; I think that's also why it
would be great if such a solution could be tested more or less in a real
migration scenario to show its real benefits.
Thanks,
--
Peter Xu
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-04-19 19:55 ` [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Peter Xu
@ 2023-04-19 20:15 ` Axel Rasmussen
2023-04-19 21:05 ` Peter Xu
0 siblings, 1 reply; 103+ messages in thread
From: Axel Rasmussen @ 2023-04-19 20:15 UTC (permalink / raw)
To: Peter Xu
Cc: Anish Moorthy, pbonzini, maz, oliver.upton, seanjc, jthoughton,
bgardon, dmatlack, ricarkol, kvm, kvmarm
On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hi, Anish,
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > KVM's demand paging self test is extended to demonstrate the performance
> > benefits of using the two new capabilities to bypass the userfaultfd
> > wait queue. The performance samples below (rates in thousands of
> > pages/s, n = 5), were generated using [2] on an x86 machine with 256
> > cores.
> >
> > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > 1 150 340
> > 2 191 477
> > 4 210 809
> > 8 155 1239
> > 16 130 1595
> > 32 108 2299
> > 64 86 3482
> > 128 62 4134
> > 256 36 4012
>
> The number looks very promising. Though..
>
> >
> > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> > A quick rundown of the new flags (also detailed in later commits)
> > -a registers all of guest memory to a single uffd.
>
> ... this is the worst case scenario. I'd say it's slightly unfair to
> compare by first introducing a bottleneck then compare with it. :)
>
> Jokes aside: I'd think it'll make more sense if such a performance solution
> will be measured on real systems showing real benefits, because so far it's
> still not convincing enough if it's only with the test especially with only
> one uffd.
>
> I don't remember whether I used to discuss this with James before, but..
>
> I know that having multiple uffds in productions also means scattered guest
> memory and scattered VMAs all over the place. However split the guest
> large mem into at least a few (or even tens of) VMAs may still be something
> worth trying? Do you think that'll already solve some of the contentions
> on userfaultfd, either on the queue or else?
We considered sharding into several UFFDs. I do think it helps, but
also I think there are two main problems with it:
- One is, I think there's a limit to how much you'd want to do that.
E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
but 1/100 or 1/1000 might become ridiculous in terms of the
"scattering" of VMAs and so on like you mentioned. Especially for very
large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm
not sure splitting just "slightly" is enough to get good performance.
- Another is, sharding UFFDs sort of assumes accesses are randomly
distributed across the guest physical address space. I'm not sure this
is guaranteed for all possible VMs / customer workloads. In other
words, even if we shard across several UFFDs, we may end up with a
small number of them being "hot".
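The second point can be made concrete with a toy fault trace (illustrative only, not from the series): under contiguous sharding, a workload concentrated in one slice of the guest address space funnels every fault to the same uffd no matter how many shards exist.

```c
#include <stdint.h>

#define NUM_SHARDS  16
#define REGION_SIZE 0x10000ULL /* 64KiB per shard in this toy setup */

/* Return the fault count observed by the busiest shard. */
int hottest_shard_load(const uint64_t *gpas, int nfaults)
{
	int counts[NUM_SHARDS] = {0};
	int i, max = 0;

	for (i = 0; i < nfaults; i++)
		counts[gpas[i] / REGION_SIZE]++;
	for (i = 0; i < NUM_SHARDS; i++)
		if (counts[i] > max)
			max = counts[i];
	return max;
}

/*
 * Eight faults, all within the guest's first 64KiB: every one lands on
 * shard 0, so 16-way sharding serializes this workload exactly as a
 * single uffd would.
 */
int demo_hot_shard(void)
{
	uint64_t gpas[] = {0x0000, 0x1000, 0x2000, 0x3000,
			   0x4000, 0x5000, 0x6000, 0x7000};

	return hottest_shard_load(gpas, 8);
}
```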
A benefit to Anish's series is that it solves the problem more
fundamentally, and allows demand paging with no "global" locking. So,
it will scale better regardless of VM size, or access pattern.
[1]: https://cloud.google.com/compute/docs/memory-optimized-machines
>
> With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> threads, totally separate uffd queues), I'd expect to some extend other
> things can pop up already, e.g., the network bandwidth, without teaching
> each vcpu thread to report uffd faults themselves.
>
> These are my pure imaginations though, I think that's also why it'll be
> great if such a solution can be tested more or less on a real migration
> scenario to show its real benefits.
I wonder, is there an existing open source QEMU/KVM based live
migration stress test?
I think we could share numbers from some of our internal benchmarks,
or at the very least give relative numbers (e.g. +50% increase), but
since a lot of the software stack is proprietary (e.g. we don't use
QEMU), it may not be that useful or reproducible for folks.
>
> Thanks,
>
> --
> Peter Xu
>
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-04-19 20:15 ` Axel Rasmussen
@ 2023-04-19 21:05 ` Peter Xu
[not found] ` <CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@mail.gmail.com>
0 siblings, 1 reply; 103+ messages in thread
From: Peter Xu @ 2023-04-19 21:05 UTC (permalink / raw)
To: Axel Rasmussen
Cc: Anish Moorthy, pbonzini, maz, oliver.upton, seanjc, jthoughton,
bgardon, dmatlack, ricarkol, kvm, kvmarm
On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hi, Anish,
> >
> > On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > > KVM's demand paging self test is extended to demonstrate the performance
> > > benefits of using the two new capabilities to bypass the userfaultfd
> > > wait queue. The performance samples below (rates in thousands of
> > > pages/s, n = 5), were generated using [2] on an x86 machine with 256
> > > cores.
> > >
> > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > > 1 150 340
> > > 2 191 477
> > > 4 210 809
> > > 8 155 1239
> > > 16 130 1595
> > > 32 108 2299
> > > 64 86 3482
> > > 128 62 4134
> > > 256 36 4012
> >
> > The number looks very promising. Though..
> >
> > >
> > > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> > > A quick rundown of the new flags (also detailed in later commits)
> > > -a registers all of guest memory to a single uffd.
> >
> > ... this is the worst case scenario. I'd say it's slightly unfair to
> > compare by first introducing a bottleneck then compare with it. :)
> >
> > Jokes aside: I'd think it'll make more sense if such a performance solution
> > will be measured on real systems showing real benefits, because so far it's
> > still not convincing enough if it's only with the test especially with only
> > one uffd.
> >
> > I don't remember whether I used to discuss this with James before, but..
> >
> > I know that having multiple uffds in productions also means scattered guest
> > memory and scattered VMAs all over the place. However split the guest
> > large mem into at least a few (or even tens of) VMAs may still be something
> > worth trying? Do you think that'll already solve some of the contentions
> > on userfaultfd, either on the queue or else?
>
> We considered sharding into several UFFDs. I do think it helps, but
> also I think there are two main problems with it:
>
> - One is, I think there's a limit to how much you'd want to do that.
> E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
> but 1/100 or 1/1000 might become ridiculous in terms of the
> "scattering" of VMAs and so on like you mentioned. Especially for very
> large VMs (e.g. consider Google offers VMs with ~11T of RAM [1]) I'm
> not sure splitting just "slightly" is enough to get good performance.
>
> - Another is, sharding UFFDs sort of assumes accesses are randomly
> distributed across the guest physical address space. I'm not sure this
> is guaranteed for all possible VMs / customer workloads. In other
> words, even if we shard across several UFFDs, we may end up with a
> small number of them being "hot".
I never tried to monitor this, but I had a feeling that it's actually
hard to maintain physical contiguity of the pages being used and accessed,
at least on Linux.
The more likely case, to me, is that the system's pages become heavily
scattered within a few hours of boot unless special care is taken, e.g.,
using hugetlb pages or reservations for a specific purpose.
I also think that's normally optimal for the system: e.g., NUMA balancing
helps nodes/CPUs use local memory, which spreads memory consumption, so
each core can access different pages that are local to it.
But I agree I can never guarantee that it'll always work. If you or Anish
could provide some data points to further support this issue, that would be
very interesting and helpful, IMHO, though not required.
>
> A benefit to Anish's series is that it solves the problem more
> fundamentally, and allows demand paging with no "global" locking. So,
> it will scale better regardless of VM size, or access pattern.
>
> [1]: https://cloud.google.com/compute/docs/memory-optimized-machines
>
> >
> > With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> > threads, totally separate uffd queues), I'd expect to some extend other
> > things can pop up already, e.g., the network bandwidth, without teaching
> > each vcpu thread to report uffd faults themselves.
> >
> > These are my pure imaginations though, I think that's also why it'll be
> > great if such a solution can be tested more or less on a real migration
> > scenario to show its real benefits.
>
> I wonder, is there an existing open source QEMU/KVM based live
> migration stress test?
I am not aware of any.
>
> I think we could share numbers from some of our internal benchmarks,
> or at the very least give relative numbers (e.g. +50% increase), but
> since a lot of the software stack is proprietary (e.g. we don't use
> QEMU), it may not be that useful or reproducible for folks.
Those numbers can still be helpful. I was not asking for reproducibility,
but for some test that better justifies this feature.
IMHO the demand paging test (at least the current one) may or may not be a
good way to show the value of this specific feature. With one uffd, it
obviously bottlenecks on the single uffd, so it doesn't tell us whether
scaling the number of uffds could help.
But it's not friendly to multi-uffd either, because that's the other
extreme case, where all memory accesses are spread across the cores, so the
feature probably won't show a result proving it worthwhile.
From another angle, when a kernel feature is proposed it's always nice
(and sometimes mandatory) to have at least one user of it (besides the unit
tests). I think that can also include proprietary software. It
doesn't need to be used in production already, but some PoC would
definitely be very helpful to move a feature forward towards community
acceptance.
Thanks,
--
Peter Xu
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-04-12 21:34 [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Anish Moorthy
` (22 preceding siblings ...)
2023-04-19 19:55 ` [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults Peter Xu
@ 2023-05-09 22:19 ` David Matlack
2023-05-10 16:35 ` Anish Moorthy
2023-05-10 22:35 ` Sean Christopherson
23 siblings, 2 replies; 103+ messages in thread
From: David Matlack @ 2023-05-09 22:19 UTC (permalink / raw)
To: Anish Moorthy
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> Upon receiving an annotated EFAULT, userspace may take appropriate
> action to resolve the failed access. For instance, this might involve a
> UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> migration postcopy.
As implemented, I think it will be prohibitively expensive if not
impossible for userspace to determine why KVM is returning EFAULT when
KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
decide the correct action to take (try to resolve or bail).
Consider the direct_map() case in PATCH 15. The only way to hit
that condition is a logic bug in KVM or data corruption. There isn't
really anything userspace can do to handle this situation, and it has no
way to distinguish that from faults to due absent mappings.
We could end up hitting cases where userspace loops forever doing
KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
Maybe we should just change direct_map() to use KVM_BUG() and return
something other than EFAULT. But the general problem still exists and
even if we have confidence in all the current EFAULT sites, we don't have
much protection against someone adding an EFAULT in the future that
userspace can't handle.
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-05-09 22:19 ` David Matlack
@ 2023-05-10 16:35 ` Anish Moorthy
2023-05-10 22:35 ` Sean Christopherson
1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-10 16:35 UTC (permalink / raw)
To: David Matlack
Cc: pbonzini, maz, oliver.upton, seanjc, jthoughton, bgardon,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Tue, May 9, 2023 at 3:19 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > Upon receiving an annotated EFAULT, userspace may take appropriate
> > action to resolve the failed access. For instance, this might involve a
> > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> > migration postcopy.
>
> As implemented, I think it will be prohibitively expensive if not
> impossible for userspace to determine why KVM is returning EFAULT when
> KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
> decide the correct action to take (try to resolve or bail).
>
> Consider the direct_map() case in patch in PATCH 15. The only way to hit
> that condition is a logic bug in KVM or data corruption. There isn't
> really anything userspace can do to handle this situation, and it has no
> way to distinguish that from faults to due absent mappings.
>
> We could end up hitting cases where userspace loops forever doing
> KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
>
> Maybe we should just change direct_map() to use KVM_BUG() and return
> something other than EFAULT. But the general problem still exists and
> even if we have confidence in all the current EFAULT sites, we don't have
> much protection against someone adding an EFAULT in the future that
> userspace can't handle.
Hmm, I had been operating under the assumption that userspace would
always be able to make the memory access succeed somehow; I
(naively) didn't count on some guest memory access errors being
unrecoverable.
If that's the case, then we're back to needing some way to distinguish
the new faults/exits emitted by user_mem_abort/kvm_faultin_pfn with
the ABSENT_MAPPING_FAULT cap enabled :/ Let me paste in a bit of what
Sean said to refute the idea of a special page-fault-failure set in
those spots.
(from https://lore.kernel.org/kvm/ZBoIzo8FGxSyUJ2I@google.com/)
On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Setting a flag that essentially says "failure when handling a guest page fault"
> is problematic on multiple fronts. Tying the ABI to KVM's internal implementation
> is not an option, i.e. the ABI would need to be defined as "on page faults from
> the guest". And then the resulting behavior would be non-deterministic, e.g.
> userspace would see different behavior if KVM accessed a "bad" gfn via emulation
> instead of in response to a guest page fault. And because of hardware TLBs, it
> would even be possible for the behavior to be non-deterministic on the same
> platform running the same guest code (though this would be extremely unlikely
> in practice).
>
> And even if userspace is ok with only handling guest page faults _today_, I highly
> doubt that will hold forever. I.e. at some point there will be a use case that
> wants to react to uaccess failures on fast-only memslots.
>
> Ignoring all of those issues, simply flagging "this -EFAULT occurred when
> handling a guest page fault" isn't precise enough for userspace to blindly resolve
> the failure. Even if KVM went through the trouble of setting information if and
> only if get_user_page_fast_only() failed while handling a guest page fault,
> userspace would still need/want a way to verify that the failure was expected and
> can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
> or mprotecting a page.
I wonder how much of this problem comes down to my description/name
(I suggested MEMFAULT_REASON_PAGE_FAULT_FAILURE) for the flag. I see
Sean's concerns about the behavioral issues when fast-only pages are
accessed via guest mode or via emulation/uaccess. What if the
description of the fast-only fault cap were tightened to something like
"generates vCPU faults/exits in response to *EPT/SLAT violations*
which cannot be mapped by present userspace page table entries"? I
think that would eliminate the emulation/uaccess issues (though I may
be wrong, so please let me know).
Of course, by the time we get to kvm_faultin_pfn we don't know that
we're faulting pages in response to an EPT violation... but if the
idea makes sense then that might justify some plumbing code.
^ permalink raw reply [flat|nested] 103+ messages in thread
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-05-09 22:19 ` David Matlack
2023-05-10 16:35 ` Anish Moorthy
@ 2023-05-10 22:35 ` Sean Christopherson
2023-05-10 23:44 ` Anish Moorthy
2023-05-23 17:49 ` Anish Moorthy
1 sibling, 2 replies; 103+ messages in thread
From: Sean Christopherson @ 2023-05-10 22:35 UTC (permalink / raw)
To: David Matlack
Cc: Anish Moorthy, pbonzini, maz, oliver.upton, jthoughton, bgardon,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Tue, May 09, 2023, David Matlack wrote:
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > Upon receiving an annotated EFAULT, userspace may take appropriate
> > action to resolve the failed access. For instance, this might involve a
> > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> > migration postcopy.
>
> As implemented, I think it will be prohibitively expensive if not
> impossible for userspace to determine why KVM is returning EFAULT when
> KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
> decide the correct action to take (try to resolve or bail).
>
> Consider the direct_map() case in PATCH 15. The only way to hit
> that condition is a logic bug in KVM or data corruption. There isn't
> really anything userspace can do to handle this situation, and it has no
> way to distinguish that from faults due to absent mappings.
>
> We could end up hitting cases where userspace loops forever doing
> KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
>
> Maybe we should just change direct_map() to use KVM_BUG() and return
> something other than EFAULT. But the general problem still exists and
> even if we have confidence in all the current EFAULT sites, we don't have
> much protection against someone adding an EFAULT in the future that
> userspace can't handle.
Yeah, when I speed read the series, several of the conversions stood out as being
"wrong". My (potentially unstated) idea was that KVM would only signal
KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
i.e. when the fault _might_ be resolvable by userspace.
If we want to populate KVM_EXIT_MEMORY_FAULT even on kernel bugs, and anything
else that userspace can't possibly resolve, then the easiest thing would be to
add a flag to signal that the fault is fatal, i.e. that userspace shouldn't retry.
Adding a flag may be more robust in the long term as it will force developers to
think about whether or not a fault is fatal, versus relying on documentation to
say "don't signal KVM_EXIT_MEMORY_FAULT for fatal EFAULT conditions".
Side topic, KVM x86 really should have a version of KVM_SYNC_X86_REGS that stores
registers for userspace, but doesn't load registers. That would allow userspace
to detect many infinite loops with minimal overhead, e.g. (1) set KVM_STORE_X86_REGS
during demand paging, (2) check RIP on every exit to see if the vCPU is making
forward progress, (3) escalate to checking all registers if RIP hasn't changed for
N exits, and finally (4) take action if the guest is well and truly stuck after
N more exits. KVM could even store RIP on every exit if userspace wanted to avoid
the overhead of storing registers until userspace actually wants all registers.
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-05-10 22:35 ` Sean Christopherson
@ 2023-05-10 23:44 ` Anish Moorthy
2023-05-23 17:49 ` Anish Moorthy
1 sibling, 0 replies; 103+ messages in thread
From: Anish Moorthy @ 2023-05-10 23:44 UTC (permalink / raw)
To: Sean Christopherson
Cc: David Matlack, pbonzini, maz, oliver.upton, jthoughton, bgardon,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Yeah, when I speed read the series, several of the conversions stood out as being
> "wrong". My (potentially unstated) idea was that KVM would only signal
> KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> i.e. when the fault _might_ be resolvable by userspace.
Well, you definitely tried to get the idea across somehow: even in my
cover letter here, I state
> As a first step, KVM_CAP_MEMORY_FAULT_INFO is introduced. This
> capability is meant to deliver useful information to userspace (i.e. the
> problematic range of guest physical memory) when a vCPU fails a guest
> memory access.
So the fact that I'm doing something more here is unintentional and
stems from unfamiliarity with all of the ways in which KVM does (or
does not) perform user accesses.
Sean, besides direct_map which other patches did you notice as needing
to be dropped/marked as unrecoverable errors?
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-05-10 22:35 ` Sean Christopherson
2023-05-10 23:44 ` Anish Moorthy
@ 2023-05-23 17:49 ` Anish Moorthy
2023-06-01 22:43 ` Oliver Upton
1 sibling, 1 reply; 103+ messages in thread
From: Anish Moorthy @ 2023-05-23 17:49 UTC (permalink / raw)
To: Sean Christopherson
Cc: David Matlack, pbonzini, maz, oliver.upton, jthoughton, bgardon,
ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Yeah, when I speed read the series, several of the conversions stood out as being
> > "wrong". My (potentially unstated) idea was that KVM would only signal
> > KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> > i.e. when the fault _might_ be resolvable by userspace.
>
> Sean, besides direct_map which other patches did you notice as needing
> to be dropped/marked as unrecoverable errors?
I went through the series on my own to try to identify the incorrect
annotations: here's my read.
Correct (or can easily be corrected)
-----------------------------------------------
- user_mem_abort
Incorrect as-is: the annotations in patch 19 cover an
error-on-no-slot case and one more case I don't fully understand.
The annotation in patch 20 should be good, though.
- kvm_vcpu_read/write_guest_page:
Incorrect as-is, but can be fixed: the current annotations cover
gpa_to_hva_memslot(_prot) failures, which can happen when "gpa" is not
covered by a memslot. However, we can leave these as bare efaults and
just annotate the copy_to/from_user failures, which userspace should
be able to resolve by checking/changing the slot permissions.
- kvm_handle_error_pfn
Correct: at the annotation point, the fault must be either (a) a
read/write to a writable memslot or (b) a read from a readable one.
hva_to_pfn must have returned KVM_PFN_ERR_FAULT, which userspace can
attempt to resolve using an madvise() call (e.g. MADV_POPULATE_WRITE).
Flatly Incorrect (will drop in next version)
-----------------------------------------------
- kvm_handle_page_fault
efault corresponds to a kernel bug not resolvable by userspace
- direct_map
Same as above
- kvm_mmu_page_fault
Not a "leaf" return of efault. Also, the
check-for-efault-and-annotate here catches efaults which userspace can
do nothing about, such as the one from direct_map [1].
Unsure (Switch kvm_read/write_guest to kvm_vcpu_read/write_guest?)
-----------------------------------------------
- setup_vmgexit_scratch and kvm_pv_clock_pairing
These efault on errors from kvm_read/write_guest, and theoretically
it does seem to make sense to annotate them. However, the annotations
are incorrect as-is for the same reason that the
kvm_vcpu_read/write_guest_page annotations need to be corrected.
In fact, the kvm_read/write_guest calls are of the form
"kvm_read_guest(vcpu->kvm, ...)": if we switched these calls to
kvm_vcpu_read/write_guest instead, then it seems like we'd get correct
annotations for free. Would it be correct to make this switch? If not,
then perhaps an optional kvm_vcpu* parameter for the "non-vcpu"
read/write functions strictly for annotation purposes? That seems
rather ugly though...
Unsure (Similar-ish to above)
-----------------------------------------------
- kvm_hv_get_assist_page
Incorrect as-is. The existing annotation would cover some efaults
which userspace seems unlikely to be able to resolve [2]. Right
after those, though, there's a copy_from_user which it could make
sense to annotate.
The efault here comes from failures of
kvm_read_guest_cached/kvm_read_guest_offset_cached, for which all of
the calls are again of the form "f(vcpu->kvm, ...)". Again, we'll need
either an (optional) vcpu parameter or to refactor these to just take
a "kvm_vcpu" instead if we want to annotate just the failing
uaccesses.
PS: I plan to add a couple of flags to the memory fault exit to
identify whether the failed access was a read/write/exec
[1] https://github.com/torvalds/linux/blob/v6.3/arch/x86/kvm/mmu/mmu.c#L3196
[2] https://github.com/torvalds/linux/blob/v6.3/virt/kvm/kvm_main.c#L3261-L3270
* Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.
2023-05-23 17:49 ` Anish Moorthy
@ 2023-06-01 22:43 ` Oliver Upton
0 siblings, 0 replies; 103+ messages in thread
From: Oliver Upton @ 2023-06-01 22:43 UTC (permalink / raw)
To: Anish Moorthy
Cc: Sean Christopherson, David Matlack, pbonzini, maz, jthoughton,
bgardon, ricarkol, axelrasmussen, peterx, kvm, kvmarm
On Tue, May 23, 2023 at 10:49:04AM -0700, Anish Moorthy wrote:
> On Wed, May 10, 2023 at 4:44 PM Anish Moorthy <amoorthy@google.com> wrote:
> >
> > On Wed, May 10, 2023 at 3:35 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > Yeah, when I speed read the series, several of the conversions stood out as being
> > > "wrong". My (potentially unstated) idea was that KVM would only signal
> > > KVM_EXIT_MEMORY_FAULT when the -EFAULT could be traced back to a user access,
> > > i.e. when the fault _might_ be resolvable by userspace.
> >
> > Sean, besides direct_map which other patches did you notice as needing
> > to be dropped/marked as unrecoverable errors?
>
> I tried going through on my own to try and identify the incorrect
> annotations: here's my read.
>
> Correct (or can easily be corrected)
> -----------------------------------------------
> - user_mem_abort
> Incorrect as is: the annotations in patch 19 are incorrect, as they
> cover an error-on-no-slot case and one more I don't fully understand:
That other case is a wart we endearingly refer to as MTE (Memory Tagging
Extension). You theoretically _could_ pop out an annotated exit here, as
userspace likely messed up the mapping (like PROT_MTE missing).
But I'm perfectly happy letting someone complain about it before we go
out of our way to annotate that one. So feel free to drop.
--
Thanks,
Oliver