linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>, Hugh Dickins <hughd@google.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Ives van Hoorne <ives@codesandbox.io>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@goole.com>,
	Alistair Popple <apopple@nvidia.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA
Date: Wed, 7 Dec 2022 16:32:17 +0100	[thread overview]
Message-ID: <be1a2244-24a8-480a-a38b-e4ebb9868ce5@redhat.com> (raw)
In-Reply-To: <Y4+xpCRQCazCymdS@x1n>

[-- Attachment #1: Type: text/plain, Size: 2456 bytes --]

> 
> David, would you please repost a new patch for this one and copy Ives to
> make sure it'll work for him in his systems?

Yes, will do. I think has was already cc'ed.

> 
> I'd suggest to drop the mprotect example, I'll reply later on that to the
> other email but I still don't know whether it's a good example for a reader
> to understand the problem.

Yes, can do.

> 
> No reproducer needed for numa I think - I guess Ives's test case would be
> far enough to verify it if possible.  I also hope what Ives saw was the
> numa balancing issue you reported, so maybe it'll resolve all problem he
> has.  Then with that verified and queued we can drop the mm/migrate patch.

I tried writing a numa hinting reproducer, but so far I assume that it's 
with current code not (easily) possible for shmem.

We'd have to call change_prot_numa() in order to protnone these PTEs. 
That can either happen via

a) mbind() with MPOL_MF_LAZY. However, user space is currently not able
    to set that flag (dead code ...).
b) task_numa_work(). However, vma_policy_mof() seems to always fail on
    shmem and prevent us from reaching that code. Reason is that shmem
    implements vm_ops->get_policy, and it seems to be impossible to get
    MPOL_F_MOF set. At least I haven't found an easy way or I am missing
    something.

So numa hinting might not immediately explain the lost write faults.

... but are there other ways to reach do_numa_page(), even without 
active NUMA hinting? I found at least one:


   map = mmap(NULL, size, PROT_WRITE, MAP_SHARED|MAP_ANON, -1, 0);
   memset(map, 0, size);
   uffd_wp_range(map, size);

On upstream during the next write fault, we'll end up in do_numa_page() 
and simply remap the page writable due to vm_page_prot, not triggering a 
write fault. I can see the "numa_hint_faults" counter in /proc/vmstat 
increasing accordingly, so we're really in do_numa_page().


PROT_WRITE on shmem with uffd-wp is completely non-functional as it 
seems. It seems to work with anon memory. And with my patch, it also 
works with shmem.

Attaching a simple reproducer (uffd-wp-prot-write.c).



Independent of uffd-wp on shmem, we seem to be missing propagating the 
uffd-wp bit when splitting the huge zeropage. So uffd-wp'ing the huge 
zeropage and then splitting it loses the uffd-wp markers. :/

Fix seems easy and I just tested my possible fix. Attaching a simple 
reproducer (uffd-wp-huge-zero.c).

-- 
Thanks,

David / dhildenb

[-- Attachment #2: uffd-wp-prot-write.c --]
[-- Type: text/x-csrc, Size: 3718 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <inttypes.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <linux/memfd.h>
#include <linux/userfaultfd.h>

static size_t pagesize;
static int uffd;
static int pagemap_fd;

#define SIZE (1024*1024*1024ull)
#define barrier() __asm__ __volatile__("": : :"memory")

volatile bool uffd_triggered;

static void uffd_wp_range(char *start, size_t size, bool wp)
{
	struct uffdio_writeprotect uffd_writeprotect;

	uffd_writeprotect.range.start = (unsigned long) start;
	uffd_writeprotect.range.len = size;
	if (wp) {
		uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	} else {
		uffd_writeprotect.mode = 0;
	}
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
		exit(1);
	}
}

static void *uffd_thread_fn(void *arg)
{
	static struct uffd_msg msg;
	ssize_t nread;

	while (1) {
		struct pollfd pollfd;
		int nready;

		pollfd.fd = uffd;
		pollfd.events = POLLIN;
		nready = poll(&pollfd, 1, -1);
		if (nready == -1) {
			fprintf(stderr, "poll() failed: %d\n", errno);
			exit(1);
		}

		nread = read(uffd, &msg, sizeof(msg));
		if (nread <= 0)
			continue;

		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
			printf("FAIL: wrong uffd-wp event fired\n");
			exit(1);
		}

		/* un-protect */
		uffd_triggered = true;
		uffd_wp_range((char *)(uintptr_t)msg.arg.pagefault.address, pagesize, false);
	}
	return arg;
}

static int setup_uffd(char *map, size_t size)
{
	struct uffdio_api uffdio_api;
	struct uffdio_register uffdio_register;
	pthread_t thread;

	uffd = syscall(__NR_userfaultfd,
		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	if (uffd < 0) {
		fprintf(stderr, "syscall() failed: %d\n", errno);
		return -errno;
	}

	uffdio_api.api = UFFD_API;
	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
		return -errno;
	}

	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
		return -ENOSYS;
	}

	uffdio_register.range.start = (unsigned long) map;
	uffdio_register.range.len = size;
	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
		return -errno;
	}

	pthread_create(&thread, NULL, uffd_thread_fn, NULL);

	return 0;
}

int main(void)
{
	const size_t size = SIZE;
	char *map, *cur;

	pagesize = getpagesize();
	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
	if (pagemap_fd < 0) {
		fprintf(stderr, "Opening /proc/self/pagemap failed\n");
		return 1;
	}

	map = mmap(NULL, size, PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
	if (map == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		return -errno;
	}

	if (madvise(map, size, MADV_NOHUGEPAGE)) {
		fprintf(stderr, "MADV_HUGEPAGE failed\n");
		return -errno;
	}

	/* Populate all pages ... */
	memset(map, 0, size);

	if (setup_uffd(map, size))
		return 1;

	/* ... and write-protect them using uffd-wp. */
	uffd_wp_range(map, size, true);

	/* Test if all faults trigger. */
	for (cur = map; cur < map + size; cur += pagesize) {
		uffd_triggered = false;
		barrier();

		/* Trigger a write fault. */
		*cur = 1;

		barrier();
		if (!uffd_triggered) {
			printf("FAIL: uffd-wp did not trigger\n");
			return 1;
		}
	}

	printf("PASS: uffd-wp triggered\n");
	return 0;
}

[-- Attachment #3: uffd-wp-huge-zero.c --]
[-- Type: text/x-csrc, Size: 4150 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <inttypes.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <linux/memfd.h>
#include <linux/userfaultfd.h>

static size_t pagesize;
static int uffd;
static int pagemap_fd;

#define SIZE (128*1024*1024ull)
#define barrier() __asm__ __volatile__("": : :"memory")

volatile bool uffd_triggered;

static void uffd_wp_range(char *start, size_t size, bool wp)
{
	struct uffdio_writeprotect uffd_writeprotect;

	uffd_writeprotect.range.start = (unsigned long) start;
	uffd_writeprotect.range.len = size;
	if (wp) {
		uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	} else {
		uffd_writeprotect.mode = 0;
	}
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
		exit(1);
	}
}

static void *uffd_thread_fn(void *arg)
{
	static struct uffd_msg msg;
	ssize_t nread;

	while (1) {
		struct pollfd pollfd;
		int nready;

		pollfd.fd = uffd;
		pollfd.events = POLLIN;
		nready = poll(&pollfd, 1, -1);
		if (nready == -1) {
			fprintf(stderr, "poll() failed: %d\n", errno);
			exit(1);
		}

		nread = read(uffd, &msg, sizeof(msg));
		if (nread <= 0)
			continue;

		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
			printf("FAIL: wrong uffd-wp event fired\n");
			exit(1);
		}

		/* un-protect */
		uffd_triggered = true;
		uffd_wp_range((char *)(uintptr_t)msg.arg.pagefault.address, pagesize, false);
	}
	return arg;
}

static int setup_uffd(char *map, size_t size)
{
	struct uffdio_api uffdio_api;
	struct uffdio_register uffdio_register;
	pthread_t thread;

	uffd = syscall(__NR_userfaultfd,
		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	if (uffd < 0) {
		fprintf(stderr, "syscall() failed: %d\n", errno);
		return -errno;
	}

	uffdio_api.api = UFFD_API;
	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
		return -errno;
	}

	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
		return -ENOSYS;
	}

	uffdio_register.range.start = (unsigned long) map;
	uffdio_register.range.len = size;
	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
		return -errno;
	}

	pthread_create(&thread, NULL, uffd_thread_fn, NULL);

	return 0;
}

int main(void)
{
	const size_t size = SIZE;
	char *map, *cur;

	pagesize = getpagesize();
	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
	if (pagemap_fd < 0) {
		fprintf(stderr, "Opening /proc/self/pagemap failed\n");
		return 1;
	}

	map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
	if (map == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		return -errno;
	}

	if (madvise(map, size, MADV_HUGEPAGE)) {
		fprintf(stderr, "MADV_HUGEPAGE failed\n");
		return -errno;
	}

	if (setup_uffd(map, size))
		return 1;

	/* Populate the shared zeropage, hopefullt also the huge one.*/
	madvise(map, size, MADV_POPULATE_READ);

	/* ... and write-protect all pages using uffd-wp. */
	uffd_wp_range(map, size, true);

	/*
	 * Discard every second odd page, this will split any huge zero
	 * THP and will hopefully keep the PTE protected using uffd-wp.
	 *
	 * Any mechanism to PTE-map the THP would do.
	 */
	for (cur = map + pagesize; cur < map + size; cur += 2 * pagesize)
		madvise(cur, pagesize, MADV_DONTNEED);

	/*
	 * Test every second even page (-> all remaining ones) if we get our
	 * uffd-wp events.
	 */
	for (cur = map; cur < map + size; cur += 2 * pagesize) {
		uffd_triggered = false;

		barrier();
		/* Trigger a write fault. */
		*cur = 1;
		barrier();

		if (!uffd_triggered) {
			printf("FAIL: uffd-wp did not trigger\n");
			return 1;
		}
	}

	printf("PASS: uffd-wp triggered\n");
	return 0;
}

  reply	other threads:[~2022-12-07 15:32 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-12-02 12:27 [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA David Hildenbrand
2022-12-02 16:33 ` Peter Xu
2022-12-02 16:56   ` David Hildenbrand
2022-12-02 17:11     ` David Hildenbrand
2022-12-05 21:08       ` Peter Xu
2022-12-06  0:46         ` [PATCH] mm/uffd: Always wr-protect pte in pte_mkuffd_wp() kernel test robot
2022-12-06 16:21           ` Peter Xu
2022-12-06 11:43         ` kernel test robot
2022-12-06 16:28         ` [PATCH RFC] mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA David Hildenbrand
2022-12-06 19:09           ` Hugh Dickins
2022-12-06 21:18             ` Peter Xu
2022-12-07 15:32               ` David Hildenbrand [this message]
2022-12-07 17:43                 ` Peter Xu
2022-12-07 19:53                   ` David Hildenbrand
2022-12-07 20:14                     ` Peter Xu
2022-12-06 21:27           ` Peter Xu
2022-12-07 13:33             ` David Hildenbrand
2022-12-07 15:59               ` Peter Xu
2022-12-07 20:10                 ` David Hildenbrand
2022-12-08 15:17                   ` Peter Xu
2022-12-06 18:38         ` [PATCH] mm/uffd: Always wr-protect pte in pte_mkuffd_wp() kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=be1a2244-24a8-480a-a38b-e4ebb9868ce5@redhat.com \
    --to=david@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=hughd@google.com \
    --cc=hughd@goole.com \
    --cc=ives@codesandbox.io \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nadav.amit@gmail.com \
    --cc=peterx@redhat.com \
    --cc=rppt@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).