* [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU
@ 2025-04-18 17:49 Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
` (7 more replies)
0 siblings, 8 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
After a long delay I'm posting the next iteration of the lockless
/proc/pid/maps reading patchset. Differences from v2 [1]:
- Add a set of tests concurrently modifying address space and checking for
correct reading results;
- Use new mmap_lock_speculate_xxx APIs for concurrent change detection and
retries;
- Add lockless PROCMAP_QUERY execution support;
The new tests are designed to check for any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps could return
inconsistent data because the file is generated page-by-page with the
mmap_lock being dropped between the pages. Such tearing is expected and
userspace is supposed to deal with that possibility. An example of
user-visible inconsistency is the same vma being printed twice: once
before it was modified and again after the modification. For example,
if a vma was extended, it might be found and reported twice. What is
not expected is to see a gap where there should have been a vma both
before and after the modification. This patchset increases the chances
of such tearing, therefore it's even more important now to test for
unexpected inconsistencies.
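To illustrate with hypothetical /proc/pid/maps fragments (addresses and
flags made up), an acceptable artifact is the extended vma being
reported twice across the page boundary:
  7f0000000000-7f0000001000 rw-p 00000000 00:00 0  <- last line of page 1
  7f0000000000-7f0000003000 rw-p 00000000 00:00 0  <- first line of page 2,
                                                      same vma after extension
while an unacceptable artifact would be the range 7f0000000000-7f0000001000
simply missing from the output even though a vma covered it both before
and after the modification.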
Thanks to Paul McKenney, who developed a benchmark to test the performance
of concurrent reads and updates, we also have data on performance
benefits:
The test has a pair of processes scanning /proc/PID/maps, and another
process unmapping and remapping 4K pages from a 128MB range of anonymous
memory. At the end of each 10-second run, the latency of each mmap()
or munmap() operation is measured, and for each run the maximum and mean
latency is printed. (Yes, the map/unmap process is started first, its
PID is passed to the scanners, and then the map/unmap process waits until
both scanners are running before starting its timed test. The scanners
keep scanning until the specified /proc/PID/maps file disappears.)
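For reference, below is a minimal sketch of one timed cycle of the
map/unmap process. This is not Paul's actual benchmark; page_addr and
the bookkeeping are illustrative:
  struct timespec t0, t1;
  long latency_ns;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  assert(!munmap(page_addr, 4096));	/* mmap() is timed the same way */
  clock_gettime(CLOCK_MONOTONIC, &t1);
  latency_ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec);
  /* accumulate the maximum and mean of latency_ns over the 10-second run */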
In summary, with stock mm, 78% of the runs had maximum latencies in
excess of 0.5 milliseconds, with more than half of the runs' latencies
exceeding a full millisecond. In contrast, 98% of the runs with Suren's
patch series applied had maximum latencies of less than 0.5 milliseconds.
From a median-performance viewpoint, Suren's series also looks good,
with stock mm weighing in at 13 microseconds and Suren's series at 10
microseconds, better than a 20% improvement.
[1] https://lore.kernel.org/all/20240123231014.3801041-1-surenb@google.com/
Suren Baghdasaryan (8):
selftests/proc: add /proc/pid/maps tearing from vma split test
selftests/proc: extend /proc/pid/maps tearing test to include vma
resizing
selftests/proc: extend /proc/pid/maps tearing test to include vma
remapping
selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
modified
selftests/proc: add verbose mode for tests to facilitate debugging
mm: make vm_area_struct anon_name field RCU-safe
mm/maps: read proc/pid/maps under RCU
mm/maps: execute PROCMAP_QUERY ioctl under RCU
fs/proc/internal.h | 6 +
fs/proc/task_mmu.c | 233 +++++-
include/linux/mm_inline.h | 28 +-
include/linux/mm_types.h | 3 +-
mm/madvise.c | 30 +-
tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
6 files changed, 1061 insertions(+), 32 deletions(-)
base-commit: 79f35c4125a9a3fd98efeed4cce1cd7ce5311a44
--
2.49.0.805.g082f7c87e0-goog
* [PATCH v3 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
` (6 subsequent siblings)
7 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
The content of /proc/pid/maps is generated page-by-page, with the
mmap_lock read lock (or other synchronization mechanism) being dropped
between these pages. This means the reader can occasionally retrieve
inconsistent information if the data used to generate the file is being
concurrently changed. For /proc/pid/maps that means it's possible to
read inconsistent data if vmas or the vma tree are concurrently
modified. A simple example is when a vma gets split or merged. If such
an action happens while /proc/pid/maps is being read and the vma happens
to be at the edge of the two pages being generated, the reader can see
the same vma twice: once before it got modified and a second time after
the modification. Seeing the same vma twice is considered acceptable,
and userspace can deal with this situation. What is unacceptable is
seeing a hole in the place occupied by a vma, for example as a result of
a vma being replaced with another one, leaving the space temporarily empty.
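The reader side of this race is just a sequence of page-sized read()
calls; a minimal sketch (error handling omitted, parse_maps_lines() is
a hypothetical consumer):
  char buf[4096];
  int fd = open("/proc/self/maps", O_RDONLY);
  ssize_t n;

  /* mmap_lock may be dropped between these read() calls, so vmas
   * crossing the buffer boundary can be observed torn */
  while ((n = read(fd, buf, sizeof(buf))) > 0)
	parse_maps_lines(buf, n);
  close(fd);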
Implement a test that reads /proc/pid/maps of a forked child process and
checks data consistency at the edge of two pages. The child process
constantly modifies its address space in a way that affects the vma
located at the end of the first page when /proc/pid/maps is read by the
parent process.
The parent checks the last vma of the first page and the first vma of the
last page for consistency with the split/merge results.
Since the test is designed to create a race between the file reader and
the vma tree modifier, we need multiple iterations to catch invalid
results. To limit the time the test runs, introduce a command-line
parameter
specifying the duration of the test in seconds. For example, the following
command will allow this concurrency test to run for 10 seconds:
proc-pid-vm -d 10
The default test duration is set to 5 seconds.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
tools/testing/selftests/proc/proc-pid-vm.c | 430 ++++++++++++++++++++-
1 file changed, 429 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index d04685771952..6e3f06376a1f 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -27,6 +27,7 @@
#undef NDEBUG
#include <assert.h>
#include <errno.h>
+#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdbool.h>
@@ -34,6 +35,7 @@
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
+#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/stat.h>
@@ -70,6 +72,8 @@ static void make_private_tmp(void)
}
}
+static unsigned long test_duration_sec = 5UL;
+static int page_size;
static pid_t pid = -1;
static void ate(void)
{
@@ -281,11 +285,431 @@ static void vsyscall(void)
}
}
-int main(void)
+/* /proc/pid/maps parsing routines */
+struct page_content {
+ char *data;
+ ssize_t size;
+};
+
+#define LINE_MAX_SIZE 256
+
+struct line_content {
+ char text[LINE_MAX_SIZE];
+ unsigned long start_addr;
+ unsigned long end_addr;
+};
+
+static void read_two_pages(int maps_fd, struct page_content *page1,
+ struct page_content *page2)
+{
+ ssize_t bytes_read;
+
+ assert(lseek(maps_fd, 0, SEEK_SET) >= 0);
+ bytes_read = read(maps_fd, page1->data, page_size);
+ assert(bytes_read > 0 && bytes_read < page_size);
+ page1->size = bytes_read;
+
+ bytes_read = read(maps_fd, page2->data, page_size);
+ assert(bytes_read > 0 && bytes_read < page_size);
+ page2->size = bytes_read;
+}
+
+static void copy_first_line(struct page_content *page, char *first_line)
+{
+ char *pos = strchr(page->data, '\n');
+
+ strncpy(first_line, page->data, pos - page->data);
+ first_line[pos - page->data] = '\0';
+}
+
+static void copy_last_line(struct page_content *page, char *last_line)
+{
+ /* Get the last line in the first page */
+ const char *end = page->data + page->size - 1;
+ /* skip last newline */
+ const char *pos = end - 1;
+
+ /* search previous newline */
+ while (pos[-1] != '\n')
+ pos--;
+ strncpy(last_line, pos, end - pos);
+ last_line[end - pos] = '\0';
+}
+
+/* Read the last line of the first page and the first line of the second page */
+static void read_boundary_lines(int maps_fd, struct page_content *page1,
+ struct page_content *page2,
+ struct line_content *last_line,
+ struct line_content *first_line)
+{
+ read_two_pages(maps_fd, page1, page2);
+
+ copy_last_line(page1, last_line->text);
+ copy_first_line(page2, first_line->text);
+
+ assert(sscanf(last_line->text, "%lx-%lx", &last_line->start_addr,
+ &last_line->end_addr) == 2);
+ assert(sscanf(first_line->text, "%lx-%lx", &first_line->start_addr,
+ &first_line->end_addr) == 2);
+}
+
+/* Thread synchronization routines */
+enum test_state {
+ INIT,
+ CHILD_READY,
+ PARENT_READY,
+ SETUP_READY,
+ SETUP_MODIFY_MAPS,
+ SETUP_MAPS_MODIFIED,
+ SETUP_RESTORE_MAPS,
+ SETUP_MAPS_RESTORED,
+ TEST_READY,
+ TEST_DONE,
+};
+
+struct vma_modifier_info;
+
+typedef void (*vma_modifier_op)(const struct vma_modifier_info *mod_info);
+typedef void (*vma_mod_result_check_op)(struct line_content *mod_last_line,
+ struct line_content *mod_first_line,
+ struct line_content *restored_last_line,
+ struct line_content *restored_first_line);
+
+struct vma_modifier_info {
+ int vma_count;
+ void *addr;
+ int prot;
+ void *next_addr;
+ vma_modifier_op vma_modify;
+ vma_modifier_op vma_restore;
+ vma_mod_result_check_op vma_mod_check;
+ pthread_mutex_t sync_lock;
+ pthread_cond_t sync_cond;
+ enum test_state curr_state;
+ bool exit;
+ void *child_mapped_addr[];
+};
+
+static void wait_for_state(struct vma_modifier_info *mod_info, enum test_state state)
+{
+ pthread_mutex_lock(&mod_info->sync_lock);
+ while (mod_info->curr_state != state)
+ pthread_cond_wait(&mod_info->sync_cond, &mod_info->sync_lock);
+ pthread_mutex_unlock(&mod_info->sync_lock);
+}
+
+static void signal_state(struct vma_modifier_info *mod_info, enum test_state state)
+{
+ pthread_mutex_lock(&mod_info->sync_lock);
+ mod_info->curr_state = state;
+ pthread_cond_signal(&mod_info->sync_cond);
+ pthread_mutex_unlock(&mod_info->sync_lock);
+}
+
+/* VMA modification routines */
+static void *child_vma_modifier(struct vma_modifier_info *mod_info)
+{
+ int prot = PROT_READ | PROT_WRITE;
+ int i;
+
+ for (i = 0; i < mod_info->vma_count; i++) {
+ mod_info->child_mapped_addr[i] = mmap(NULL, page_size * 3, prot,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ assert(mod_info->child_mapped_addr[i] != MAP_FAILED);
+ /* change protection in adjacent maps to prevent merging */
+ prot ^= PROT_WRITE;
+ }
+ signal_state(mod_info, CHILD_READY);
+ wait_for_state(mod_info, PARENT_READY);
+ while (true) {
+ signal_state(mod_info, SETUP_READY);
+ wait_for_state(mod_info, SETUP_MODIFY_MAPS);
+ if (mod_info->exit)
+ break;
+
+ mod_info->vma_modify(mod_info);
+ signal_state(mod_info, SETUP_MAPS_MODIFIED);
+ wait_for_state(mod_info, SETUP_RESTORE_MAPS);
+ mod_info->vma_restore(mod_info);
+ signal_state(mod_info, SETUP_MAPS_RESTORED);
+
+ wait_for_state(mod_info, TEST_READY);
+ while (mod_info->curr_state != TEST_DONE) {
+ mod_info->vma_modify(mod_info);
+ mod_info->vma_restore(mod_info);
+ }
+ }
+ for (i = 0; i < mod_info->vma_count; i++)
+ munmap(mod_info->child_mapped_addr[i], page_size * 3);
+
+ return NULL;
+}
+
+static void stop_vma_modifier(struct vma_modifier_info *mod_info)
+{
+ wait_for_state(mod_info, SETUP_READY);
+ mod_info->exit = true;
+ signal_state(mod_info, SETUP_MODIFY_MAPS);
+}
+
+static void capture_mod_pattern(int maps_fd,
+ struct vma_modifier_info *mod_info,
+ struct page_content *page1,
+ struct page_content *page2,
+ struct line_content *last_line,
+ struct line_content *first_line,
+ struct line_content *mod_last_line,
+ struct line_content *mod_first_line,
+ struct line_content *restored_last_line,
+ struct line_content *restored_first_line)
+{
+ signal_state(mod_info, SETUP_MODIFY_MAPS);
+ wait_for_state(mod_info, SETUP_MAPS_MODIFIED);
+
+ /* Copy last line of the first page and first line of the last page */
+ read_boundary_lines(maps_fd, page1, page2, mod_last_line, mod_first_line);
+
+ signal_state(mod_info, SETUP_RESTORE_MAPS);
+ wait_for_state(mod_info, SETUP_MAPS_RESTORED);
+
+ /* Copy last line of the first page and first line of the last page */
+ read_boundary_lines(maps_fd, page1, page2, restored_last_line, restored_first_line);
+
+ mod_info->vma_mod_check(mod_last_line, mod_first_line,
+ restored_last_line, restored_first_line);
+
+ /*
+ * The content of these lines after modify+restore should be the same
+ * as the original.
+ */
+ assert(strcmp(restored_last_line->text, last_line->text) == 0);
+ assert(strcmp(restored_first_line->text, first_line->text) == 0);
+}
+
+static inline void split_vma(const struct vma_modifier_info *mod_info)
+{
+ assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+ -1, 0) != MAP_FAILED);
+}
+
+static inline void merge_vma(const struct vma_modifier_info *mod_info)
+{
+ assert(mmap(mod_info->addr, page_size, mod_info->prot,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+ -1, 0) != MAP_FAILED);
+}
+
+static inline void check_split_result(struct line_content *mod_last_line,
+ struct line_content *mod_first_line,
+ struct line_content *restored_last_line,
+ struct line_content *restored_first_line)
+{
+ /* Make sure vmas at the boundaries are changing */
+ assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+ assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
+}
+
+static void test_maps_tearing_from_split(int maps_fd,
+ struct vma_modifier_info *mod_info,
+ struct page_content *page1,
+ struct page_content *page2,
+ struct line_content *last_line,
+ struct line_content *first_line)
+{
+ struct line_content split_last_line;
+ struct line_content split_first_line;
+ struct line_content restored_last_line;
+ struct line_content restored_first_line;
+
+ wait_for_state(mod_info, SETUP_READY);
+
+ /* re-read the file to avoid using stale data from previous test */
+ read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+ mod_info->vma_modify = split_vma;
+ mod_info->vma_restore = merge_vma;
+ mod_info->vma_mod_check = check_split_result;
+
+ capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+ &split_last_line, &split_first_line,
+ &restored_last_line, &restored_first_line);
+
+ /* Now start concurrent modifications for test_duration_sec */
+ signal_state(mod_info, TEST_READY);
+
+ struct line_content new_last_line;
+ struct line_content new_first_line;
+ struct timespec start_ts, end_ts;
+
+ clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+ do {
+ bool last_line_changed;
+ bool first_line_changed;
+
+ read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+ /* Check if we read vmas after split */
+ if (!strcmp(new_last_line.text, split_last_line.text)) {
+ /*
+ * The vmas should be consistent with split results,
+ * however if vma was concurrently restored after a
+ * split, it can be reported twice (first the original
+ * split one, then the same vma but extended after the
+ * merge) because we found it as the next vma again.
+ * In that case new first line will be the same as the
+ * last restored line.
+ */
+ assert(!strcmp(new_first_line.text, split_first_line.text) ||
+ !strcmp(new_first_line.text, restored_last_line.text));
+ } else {
+ /* The vmas should be consistent with merge results */
+ assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+ !strcmp(new_first_line.text, restored_first_line.text));
+ }
+ /*
+ * First and last lines should change in unison. If the last
+ * line changed then the first line should change as well and
+ * vice versa.
+ */
+ last_line_changed = strcmp(new_last_line.text, last_line->text) != 0;
+ first_line_changed = strcmp(new_first_line.text, first_line->text) != 0;
+ assert(last_line_changed == first_line_changed);
+
+ clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+ } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+ /* Signal the modifier process to stop and wait until it exits */
+ signal_state(mod_info, TEST_DONE);
+}
+
+static int test_maps_tearing(void)
+{
+ struct vma_modifier_info *mod_info;
+ pthread_mutexattr_t mutex_attr;
+ pthread_condattr_t cond_attr;
+ int shared_mem_size;
+ char fname[32];
+ int vma_count;
+ int maps_fd;
+ int status;
+ pid_t pid;
+
+ /*
+ * Have to map enough vmas for /proc/pid/maps to contain more than one
+ * page worth of vmas. Assume at least 32 bytes per line in maps output
+ */
+ vma_count = page_size / 32 + 1;
+ shared_mem_size = sizeof(struct vma_modifier_info) + vma_count * sizeof(void *);
+
+ /* map shared memory for communication with the child process */
+ mod_info = (struct vma_modifier_info *)mmap(NULL, shared_mem_size,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+ assert(mod_info != MAP_FAILED);
+
+ /* Initialize shared members */
+ pthread_mutexattr_init(&mutex_attr);
+ pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
+ assert(!pthread_mutex_init(&mod_info->sync_lock, &mutex_attr));
+ pthread_condattr_init(&cond_attr);
+ pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED);
+ assert(!pthread_cond_init(&mod_info->sync_cond, &cond_attr));
+ mod_info->vma_count = vma_count;
+ mod_info->curr_state = INIT;
+ mod_info->exit = false;
+
+ pid = fork();
+ if (!pid) {
+ /* Child process */
+ child_vma_modifier(mod_info);
+ return 0;
+ }
+
+ sprintf(fname, "/proc/%d/maps", pid);
+ maps_fd = open(fname, O_RDONLY);
+ assert(maps_fd != -1);
+
+ /* Wait for the child to map the VMAs */
+ wait_for_state(mod_info, CHILD_READY);
+
+ /* Read first two pages */
+ struct page_content page1;
+ struct page_content page2;
+
+ page1.data = malloc(page_size);
+ assert(page1.data);
+ page2.data = malloc(page_size);
+ assert(page2.data);
+
+ struct line_content last_line;
+ struct line_content first_line;
+
+ read_boundary_lines(maps_fd, &page1, &page2, &last_line, &first_line);
+
+ /*
+ * Find the addresses corresponding to the last line in the first page
+ * and the first line in the last page.
+ */
+ mod_info->addr = NULL;
+ mod_info->next_addr = NULL;
+ for (int i = 0; i < mod_info->vma_count; i++) {
+ if (mod_info->child_mapped_addr[i] == (void *)last_line.start_addr) {
+ mod_info->addr = mod_info->child_mapped_addr[i];
+ mod_info->prot = PROT_READ;
+ /* Even VMAs have write permission */
+ if ((i % 2) == 0)
+ mod_info->prot |= PROT_WRITE;
+ } else if (mod_info->child_mapped_addr[i] == (void *)first_line.start_addr) {
+ mod_info->next_addr = mod_info->child_mapped_addr[i];
+ }
+
+ if (mod_info->addr && mod_info->next_addr)
+ break;
+ }
+ assert(mod_info->addr && mod_info->next_addr);
+
+ signal_state(mod_info, PARENT_READY);
+
+ test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2,
+ &last_line, &first_line);
+
+ stop_vma_modifier(mod_info);
+
+ free(page2.data);
+ free(page1.data);
+
+ for (int i = 0; i < vma_count; i++)
+ munmap(mod_info->child_mapped_addr[i], page_size);
+ close(maps_fd);
+ waitpid(pid, &status, 0);
+ munmap(mod_info, shared_mem_size);
+
+ return 0;
+}
+
+int usage(void)
+{
+ fprintf(stderr, "Userland /proc/pid/{s}maps test cases\n");
+ fprintf(stderr, " -d: Duration for time-consuming tests\n");
+ fprintf(stderr, " -h: Help screen\n");
+ exit(-1);
+}
+
+int main(int argc, char **argv)
{
int pipefd[2];
int exec_fd;
+ int opt;
+
+ while ((opt = getopt(argc, argv, "d:h")) != -1) {
+ if (opt == 'd')
+ test_duration_sec = strtoul(optarg, NULL, 0);
+ else if (opt == 'h')
+ usage();
+ }
+ page_size = sysconf(_SC_PAGESIZE);
vsyscall();
switch (g_vsyscall) {
case 0:
@@ -578,6 +1002,10 @@ int main(void)
assert(err == -ENOENT);
}
+ /* Test tearing in /proc/$PID/maps */
+ if (test_maps_tearing())
+ return 1;
+
return 0;
}
#else
--
2.49.0.805.g082f7c87e0-goog
* [PATCH v3 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
` (5 subsequent siblings)
7 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
Test that /proc/pid/maps does not report unexpected holes in the address
space when a vma at the edge of the page is being concurrently remapped.
This remapping results in the vma shrinking and expanding from under the
reader. We should always see either the shrunk or the expanded (original)
version of the vma.
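Concretely, the modify/restore pair the test installs is an mremap()
shrink followed by an mremap() expand of the same 3-page vma (a
condensed view of shrink_vma()/expand_vma() from the diff below):
  mremap(addr, 3 * page_size, page_size, 0);	/* shrink: 3 pages -> 1 */
  mremap(addr, page_size, 3 * page_size, 0);	/* expand back: 1 -> 3 pages */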
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
tools/testing/selftests/proc/proc-pid-vm.c | 83 ++++++++++++++++++++++
1 file changed, 83 insertions(+)
diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index 6e3f06376a1f..39842e4ec45f 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -583,6 +583,86 @@ static void test_maps_tearing_from_split(int maps_fd,
signal_state(mod_info, TEST_DONE);
}
+static inline void shrink_vma(const struct vma_modifier_info *mod_info)
+{
+ assert(mremap(mod_info->addr, page_size * 3, page_size, 0) != MAP_FAILED);
+}
+
+static inline void expand_vma(const struct vma_modifier_info *mod_info)
+{
+ assert(mremap(mod_info->addr, page_size, page_size * 3, 0) != MAP_FAILED);
+}
+
+static inline void check_shrink_result(struct line_content *mod_last_line,
+ struct line_content *mod_first_line,
+ struct line_content *restored_last_line,
+ struct line_content *restored_first_line)
+{
+ /* Make sure only the last vma of the first page is changing */
+ assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+ assert(strcmp(mod_first_line->text, restored_first_line->text) == 0);
+}
+
+static void test_maps_tearing_from_resize(int maps_fd,
+ struct vma_modifier_info *mod_info,
+ struct page_content *page1,
+ struct page_content *page2,
+ struct line_content *last_line,
+ struct line_content *first_line)
+{
+ struct line_content shrunk_last_line;
+ struct line_content shrunk_first_line;
+ struct line_content restored_last_line;
+ struct line_content restored_first_line;
+
+ wait_for_state(mod_info, SETUP_READY);
+
+ /* re-read the file to avoid using stale data from previous test */
+ read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+ mod_info->vma_modify = shrink_vma;
+ mod_info->vma_restore = expand_vma;
+ mod_info->vma_mod_check = check_shrink_result;
+
+ capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+ &shrunk_last_line, &shrunk_first_line,
+ &restored_last_line, &restored_first_line);
+
+ /* Now start concurrent modifications for test_duration_sec */
+ signal_state(mod_info, TEST_READY);
+
+ struct line_content new_last_line;
+ struct line_content new_first_line;
+ struct timespec start_ts, end_ts;
+
+ clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+ do {
+ read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+ /* Check if we read vmas after shrinking it */
+ if (!strcmp(new_last_line.text, shrunk_last_line.text)) {
+ /*
+ * The vmas should be consistent with shrunk results,
+ * however if the vma was concurrently restored, it
+ * can be reported twice (first as shrunk one, then
+ * as restored one) because we found it as the next vma
+ * again. In that case new first line will be the same
+ * as the last restored line.
+ */
+ assert(!strcmp(new_first_line.text, shrunk_first_line.text) ||
+ !strcmp(new_first_line.text, restored_last_line.text));
+ } else {
+ /* The vmas should be consistent with the original/restored state */
+ assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+ !strcmp(new_first_line.text, restored_first_line.text));
+ }
+ clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+ } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+ /* Signal the modifier process to stop and wait until it exits */
+ signal_state(mod_info, TEST_DONE);
+}
+
static int test_maps_tearing(void)
{
struct vma_modifier_info *mod_info;
@@ -674,6 +754,9 @@ static int test_maps_tearing(void)
test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2,
&last_line, &first_line);
+ test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
+ &last_line, &first_line);
+
stop_vma_modifier(mod_info);
free(page2.data);
--
2.49.0.805.g082f7c87e0-goog
* [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-18 18:30 ` Lorenzo Stoakes
2025-04-18 17:49 ` [PATCH v3 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
` (4 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
Test that /proc/pid/maps does not report unexpected holes in the address
space when we concurrently remap a part of a vma into the middle of
another vma. This remapping results in the destination vma being split
into three parts, with the middle part then being patched back to
restore the original vma, all done concurrently from under the reader.
We should always see either the original vma or the split one, with no
holes.
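Schematically, with page-granular vmas and illustrative labels:
  before remap:  addr: [ P0 | P1 | P2 ]    next_addr: [ N0 | N1 | N2 ]
  after remap:   addr: [ P0 | N2'| P2 ]    (one vma split into three)
  after patch:   addr: [ P0 | P1'| P2 ]    (mprotect() merges them back)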
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
tools/testing/selftests/proc/proc-pid-vm.c | 92 ++++++++++++++++++++++
1 file changed, 92 insertions(+)
diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index 39842e4ec45f..1aef2db7e893 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
signal_state(mod_info, TEST_DONE);
}
+static inline void remap_vma(const struct vma_modifier_info *mod_info)
+{
+ /*
+ * Remap the last page of the next vma into the middle of the vma.
+ * This splits the current vma and the first and middle parts (the
+ * parts at lower addresses) become the last vma observed in the
+ * first page and the first vma observed in the last page.
+ */
+ assert(mremap(mod_info->next_addr + page_size * 2, page_size,
+ page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
+ mod_info->addr + page_size) != MAP_FAILED);
+}
+
+static inline void patch_vma(const struct vma_modifier_info *mod_info)
+{
+ assert(!mprotect(mod_info->addr + page_size, page_size,
+ mod_info->prot));
+}
+
+static inline void check_remap_result(struct line_content *mod_last_line,
+ struct line_content *mod_first_line,
+ struct line_content *restored_last_line,
+ struct line_content *restored_first_line)
+{
+ /* Make sure vmas at the boundaries are changing */
+ assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+ assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
+}
+
+static void test_maps_tearing_from_remap(int maps_fd,
+ struct vma_modifier_info *mod_info,
+ struct page_content *page1,
+ struct page_content *page2,
+ struct line_content *last_line,
+ struct line_content *first_line)
+{
+ struct line_content remapped_last_line;
+ struct line_content remapped_first_line;
+ struct line_content restored_last_line;
+ struct line_content restored_first_line;
+
+ wait_for_state(mod_info, SETUP_READY);
+
+ /* re-read the file to avoid using stale data from previous test */
+ read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+ mod_info->vma_modify = remap_vma;
+ mod_info->vma_restore = patch_vma;
+ mod_info->vma_mod_check = check_remap_result;
+
+ capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+ &remapped_last_line, &remapped_first_line,
+ &restored_last_line, &restored_first_line);
+
+ /* Now start concurrent modifications for test_duration_sec */
+ signal_state(mod_info, TEST_READY);
+
+ struct line_content new_last_line;
+ struct line_content new_first_line;
+ struct timespec start_ts, end_ts;
+
+ clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+ do {
+ read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+ /* Check if we read vmas after remapping it */
+ if (!strcmp(new_last_line.text, remapped_last_line.text)) {
+ /*
+ * The vmas should be consistent with remap results,
+ * however if the vma was concurrently restored, it
+ * can be reported twice (first as split one, then
+ * as restored one) because we found it as the next vma
+ * again. In that case new first line will be the same
+ * as the last restored line.
+ */
+ assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
+ !strcmp(new_first_line.text, restored_last_line.text));
+ } else {
+ /* The vmas should be consistent with the original/restored state */
+ assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+ !strcmp(new_first_line.text, restored_first_line.text));
+ }
+ clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+ } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+ /* Signal the modifier process to stop and wait until it exits */
+ signal_state(mod_info, TEST_DONE);
+}
+
static int test_maps_tearing(void)
{
struct vma_modifier_info *mod_info;
@@ -757,6 +846,9 @@ static int test_maps_tearing(void)
test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
&last_line, &first_line);
+ test_maps_tearing_from_remap(maps_fd, mod_info, &page1, &page2,
+ &last_line, &first_line);
+
stop_vma_modifier(mod_info);
free(page2.data);
--
2.49.0.805.g082f7c87e0-goog
* [PATCH v3 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
` (2 preceding siblings ...)
2025-04-18 17:49 ` [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 5/8] selftests/proc: add verbose mode for tests to facilitate debugging Suren Baghdasaryan
` (3 subsequent siblings)
7 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
Extend /proc/pid/maps tearing test to verify PROCMAP_QUERY ioctl operation
correctness while the vma is being concurrently modified.
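For reference, the minimal PROCMAP_QUERY invocation pattern that the
test wraps into the query_addr_at() helper below looks like this (only
the fields the test uses are shown):
  struct procmap_query q;

  memset(&q, 0, sizeof(q));
  q.size = sizeof(q);
  q.query_addr = (unsigned long long)addr;
  q.query_flags = 0;
  assert(!ioctl(maps_fd, PROCMAP_QUERY, &q));
  /* q.vma_start and q.vma_end now describe the vma covering addr */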
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
tools/testing/selftests/proc/proc-pid-vm.c | 60 ++++++++++++++++++++++
1 file changed, 60 insertions(+)
diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index 1aef2db7e893..b582f40851fb 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -486,6 +486,21 @@ static void capture_mod_pattern(int maps_fd,
assert(strcmp(restored_first_line->text, first_line->text) == 0);
}
+static void query_addr_at(int maps_fd, void *addr,
+ unsigned long *vma_start, unsigned long *vma_end)
+{
+ struct procmap_query q;
+
+ memset(&q, 0, sizeof(q));
+ q.size = sizeof(q);
+ /* Find the VMA at the given address */
+ q.query_addr = (unsigned long long)addr;
+ q.query_flags = 0;
+ assert(!ioctl(maps_fd, PROCMAP_QUERY, &q));
+ *vma_start = q.vma_start;
+ *vma_end = q.vma_end;
+}
+
static inline void split_vma(const struct vma_modifier_info *mod_info)
{
assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC,
@@ -546,6 +561,8 @@ static void test_maps_tearing_from_split(int maps_fd,
do {
bool last_line_changed;
bool first_line_changed;
+ unsigned long vma_start;
+ unsigned long vma_end;
read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
@@ -576,6 +593,19 @@ static void test_maps_tearing_from_split(int maps_fd,
first_line_changed = strcmp(new_first_line.text, first_line->text) != 0;
assert(last_line_changed == first_line_changed);
+ /* Check if PROCMAP_QUERY ioctl() finds the right VMA */
+ query_addr_at(maps_fd, mod_info->addr + page_size,
+ &vma_start, &vma_end);
+ /*
+ * The vma at the split address can be either the same as
+ * original one (if read before the split) or the same as the
+ * first line in the second page (if read after the split).
+ */
+ assert((vma_start == last_line->start_addr &&
+ vma_end == last_line->end_addr) ||
+ (vma_start == split_first_line.start_addr &&
+ vma_end == split_first_line.end_addr));
+
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
@@ -637,6 +667,9 @@ static void test_maps_tearing_from_resize(int maps_fd,
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
do {
+ unsigned long vma_start;
+ unsigned long vma_end;
+
read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
/* Check if we read vmas after shrinking it */
@@ -656,6 +689,17 @@ static void test_maps_tearing_from_resize(int maps_fd,
assert(!strcmp(new_last_line.text, restored_last_line.text) &&
!strcmp(new_first_line.text, restored_first_line.text));
}
+
+ /* Check if PROCMAP_QUERY ioctl() finds the right VMA */
+ query_addr_at(maps_fd, mod_info->addr, &vma_start, &vma_end);
+ /*
+ * The vma should stay at the same address and have either the
+ * original size of 3 pages or 1 page if read after shrinking.
+ */
+ assert(vma_start == last_line->start_addr &&
+ (vma_end - vma_start == page_size * 3 ||
+ vma_end - vma_start == page_size));
+
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
@@ -726,6 +770,9 @@ static void test_maps_tearing_from_remap(int maps_fd,
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
do {
+ unsigned long vma_start;
+ unsigned long vma_end;
+
read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
/* Check if we read vmas after remapping it */
@@ -745,6 +792,19 @@ static void test_maps_tearing_from_remap(int maps_fd,
assert(!strcmp(new_last_line.text, restored_last_line.text) &&
!strcmp(new_first_line.text, restored_first_line.text));
}
+
+ /* Check if PROCMAP_QUERY ioctl() finds the right VMA */
+ query_addr_at(maps_fd, mod_info->addr + page_size, &vma_start, &vma_end);
+ /*
+ * The vma should either stay at the same address and have the
+ * original size of 3 pages or we should find the remapped vma
+ * at the remap destination address with size of 1 page.
+ */
+ assert((vma_start == last_line->start_addr &&
+ vma_end - vma_start == page_size * 3) ||
+ (vma_start == last_line->start_addr + page_size &&
+ vma_end - vma_start == page_size));
+
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
--
2.49.0.805.g082f7c87e0-goog
* [PATCH v3 5/8] selftests/proc: add verbose mode for tests to facilitate debugging
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
` (3 preceding siblings ...)
2025-04-18 17:49 ` [PATCH v3 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 6/8] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
` (2 subsequent siblings)
7 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
Add a verbose mode to the proc tests to print debugging information.
Usage: proc-pid-vm -v
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
tools/testing/selftests/proc/proc-pid-vm.c | 154 +++++++++++++++++++--
1 file changed, 141 insertions(+), 13 deletions(-)
diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index b582f40851fb..97017f48cd70 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -73,6 +73,7 @@ static void make_private_tmp(void)
}
static unsigned long test_duration_sec = 5UL;
+static bool verbose;
static int page_size;
static pid_t pid = -1;
static void ate(void)
@@ -452,6 +453,99 @@ static void stop_vma_modifier(struct vma_modifier_info *mod_info)
signal_state(mod_info, SETUP_MODIFY_MAPS);
}
+static void print_first_lines(char *text, int nr)
+{
+ const char *end = text;
+
+ while (nr && (end = strchr(end, '\n')) != NULL) {
+ nr--;
+ end++;
+ }
+
+ if (end) {
+ int offs = end - text;
+ char save = text[offs];	/* may not be a newline */
+
+ text[offs] = '\0';
+ printf("%s", text);
+ text[offs] = save;
+ printf("\n");
+ } else {
+ printf("%s", text);
+ }
+}
+
+static void print_last_lines(char *text, int nr)
+{
+ const char *start = text + strlen(text);
+
+ nr++; /* to ignore the last newline */
+ while (nr) {
+ while (start > text && *start != '\n')
+ start--;
+ nr--;
+ start--;
+ }
+ printf("%s", start);
+}
+
+static void print_boundaries(const char *title,
+ struct page_content *page1,
+ struct page_content *page2)
+{
+ if (!verbose)
+ return;
+
+ printf("%s", title);
+ /* Print 3 boundary lines from each page */
+ print_last_lines(page1->data, 3);
+ printf("-----------------page boundary-----------------\n");
+ print_first_lines(page2->data, 3);
+}
+
+static bool print_boundaries_on(bool condition, const char *title,
+ struct page_content *page1,
+ struct page_content *page2)
+{
+ if (verbose && condition)
+ print_boundaries(title, page1, page2);
+
+ return condition;
+}
+
+static void report_test_start(const char *name)
+{
+ if (verbose)
+ printf("==== %s ====\n", name);
+}
+
+static struct timespec print_ts;
+
+static void start_test_loop(struct timespec *ts)
+{
+ if (verbose)
+ print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_iteration(struct timespec *ts)
+{
+ if (!verbose)
+ return;
+
+ /* Update every second */
+ if (print_ts.tv_sec == ts->tv_sec)
+ return;
+
+ printf(".");
+ fflush(stdout);
+ print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_loop(void)
+{
+ if (verbose)
+ printf("\n");
+}
+
static void capture_mod_pattern(int maps_fd,
struct vma_modifier_info *mod_info,
struct page_content *page1,
@@ -463,18 +557,24 @@ static void capture_mod_pattern(int maps_fd,
struct line_content *restored_last_line,
struct line_content *restored_first_line)
{
+ print_boundaries("Before modification", page1, page2);
+
signal_state(mod_info, SETUP_MODIFY_MAPS);
wait_for_state(mod_info, SETUP_MAPS_MODIFIED);
/* Copy last line of the first page and first line of the last page */
read_boundary_lines(maps_fd, page1, page2, mod_last_line, mod_first_line);
+ print_boundaries("After modification", page1, page2);
+
signal_state(mod_info, SETUP_RESTORE_MAPS);
wait_for_state(mod_info, SETUP_MAPS_RESTORED);
/* Copy last line of the first page and first line of the last page */
read_boundary_lines(maps_fd, page1, page2, restored_last_line, restored_first_line);
+ print_boundaries("After restore", page1, page2);
+
mod_info->vma_mod_check(mod_last_line, mod_first_line,
restored_last_line, restored_first_line);
@@ -546,6 +646,7 @@ static void test_maps_tearing_from_split(int maps_fd,
mod_info->vma_restore = merge_vma;
mod_info->vma_mod_check = check_split_result;
+ report_test_start("Tearing from split");
capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
&split_last_line, &split_first_line,
&restored_last_line, &restored_first_line);
@@ -558,6 +659,7 @@ static void test_maps_tearing_from_split(int maps_fd,
struct timespec start_ts, end_ts;
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+ start_test_loop(&start_ts);
do {
bool last_line_changed;
bool first_line_changed;
@@ -577,12 +679,17 @@ static void test_maps_tearing_from_split(int maps_fd,
* In that case new first line will be the same as the
* last restored line.
*/
- assert(!strcmp(new_first_line.text, split_first_line.text) ||
- !strcmp(new_first_line.text, restored_last_line.text));
+ assert(!print_boundaries_on(
+ strcmp(new_first_line.text, split_first_line.text) &&
+ strcmp(new_first_line.text, restored_last_line.text),
+ "Split result invalid", page1, page2));
+
} else {
/* The vmas should be consistent with merge results */
- assert(!strcmp(new_last_line.text, restored_last_line.text) &&
- !strcmp(new_first_line.text, restored_first_line.text));
+ assert(!print_boundaries_on(
+ strcmp(new_last_line.text, restored_last_line.text) ||
+ strcmp(new_first_line.text, restored_first_line.text),
+ "Merge result invalid", page1, page2));
}
/*
* First and last lines should change in unison. If the last
@@ -607,7 +714,9 @@ static void test_maps_tearing_from_split(int maps_fd,
vma_end == split_first_line.end_addr));
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+ end_test_iteration(&end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+ end_test_loop();
/* Signal the modifier process to stop and wait until it exits */
signal_state(mod_info, TEST_DONE);
@@ -654,6 +763,7 @@ static void test_maps_tearing_from_resize(int maps_fd,
mod_info->vma_restore = expand_vma;
mod_info->vma_mod_check = check_shrink_result;
+ report_test_start("Tearing from resize");
capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
&shrunk_last_line, &shrunk_first_line,
&restored_last_line, &restored_first_line);
@@ -666,6 +776,7 @@ static void test_maps_tearing_from_resize(int maps_fd,
struct timespec start_ts, end_ts;
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+ start_test_loop(&start_ts);
do {
unsigned long vma_start;
unsigned long vma_end;
@@ -682,12 +793,16 @@ static void test_maps_tearing_from_resize(int maps_fd,
* again. In that case new first line will be the same
* as the last restored line.
*/
- assert(!strcmp(new_first_line.text, shrunk_first_line.text) ||
- !strcmp(new_first_line.text, restored_last_line.text));
+ assert(!print_boundaries_on(
+ strcmp(new_first_line.text, shrunk_first_line.text) &&
+ strcmp(new_first_line.text, restored_last_line.text),
+ "Shrink result invalid", page1, page2));
} else {
/* The vmas should be consistent with the original/restored state */
- assert(!strcmp(new_last_line.text, restored_last_line.text) &&
- !strcmp(new_first_line.text, restored_first_line.text));
+ assert(!print_boundaries_on(
+ strcmp(new_last_line.text, restored_last_line.text) ||
+ strcmp(new_first_line.text, restored_first_line.text),
+ "Expand result invalid", page1, page2));
}
/* Check if PROCMAP_QUERY ioctl() finds the right VMA */
@@ -701,7 +816,9 @@ static void test_maps_tearing_from_resize(int maps_fd,
vma_end - vma_start == page_size));
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+ end_test_iteration(&end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+ end_test_loop();
/* Signal the modifier process to stop and wait until it exits */
signal_state(mod_info, TEST_DONE);
@@ -757,6 +874,7 @@ static void test_maps_tearing_from_remap(int maps_fd,
mod_info->vma_restore = patch_vma;
mod_info->vma_mod_check = check_remap_result;
+ report_test_start("Tearing from remap");
capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
&remapped_last_line, &remapped_first_line,
&restored_last_line, &restored_first_line);
@@ -769,6 +887,7 @@ static void test_maps_tearing_from_remap(int maps_fd,
struct timespec start_ts, end_ts;
clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+ start_test_loop(&start_ts);
do {
unsigned long vma_start;
unsigned long vma_end;
@@ -785,12 +904,16 @@ static void test_maps_tearing_from_remap(int maps_fd,
* again. In that case new first line will be the same
* as the last restored line.
*/
- assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
- !strcmp(new_first_line.text, restored_last_line.text));
+ assert(!print_boundaries_on(
+ strcmp(new_first_line.text, remapped_first_line.text) &&
+ strcmp(new_first_line.text, restored_last_line.text),
+ "Remap result invalid", page1, page2));
} else {
/* The vmas should be consistent with the original/restored state */
- assert(!strcmp(new_last_line.text, restored_last_line.text) &&
- !strcmp(new_first_line.text, restored_first_line.text));
+ assert(!print_boundaries_on(
+ strcmp(new_last_line.text, restored_last_line.text) ||
+ strcmp(new_first_line.text, restored_first_line.text),
+ "Remap restore result invalid", page1, page2));
}
/* Check if PROCMAP_QUERY ioctl() finds the right VMA */
@@ -806,7 +929,9 @@ static void test_maps_tearing_from_remap(int maps_fd,
vma_end - vma_start == page_size));
clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+ end_test_iteration(&end_ts);
} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+ end_test_loop();
/* Signal the modifier process to stop and wait until it exits */
signal_state(mod_info, TEST_DONE);
@@ -927,6 +1052,7 @@ int usage(void)
{
fprintf(stderr, "Userland /proc/pid/{s}maps test cases\n");
fprintf(stderr, " -d: Duration for time-consuming tests\n");
+ fprintf(stderr, " -v: Verbose mode\n");
fprintf(stderr, " -h: Help screen\n");
exit(-1);
}
@@ -937,9 +1063,11 @@ int main(int argc, char **argv)
int exec_fd;
int opt;
- while ((opt = getopt(argc, argv, "d:h")) != -1) {
+ while ((opt = getopt(argc, argv, "d:vh")) != -1) {
if (opt == 'd')
test_duration_sec = strtoul(optarg, NULL, 0);
+ else if (opt == 'v')
+ verbose = true;
else if (opt == 'h')
usage();
}
--
2.49.0.805.g082f7c87e0-goog
* [PATCH v3 6/8] mm: make vm_area_struct anon_name field RCU-safe
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
` (4 preceding siblings ...)
2025-04-18 17:49 ` [PATCH v3 5/8] selftests/proc: add verbose mode for tests to facilitate debugging Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl " Suren Baghdasaryan
7 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
For lockless /proc/pid/maps reading we have to ensure all the fields
used when generating the output are RCU-safe. The only pointer fields
in vm_area_struct which are used to generate that file's output are
vm_file and anon_name. vm_file is RCU-safe but anon_name is not. Make
anon_name RCU-safe as well.
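A reader is then expected to follow the usual RCU get-unless-zero
pattern; a sketch of the intended usage of the helper added below:
  rcu_read_lock();
  anon_name = anon_vma_name_get_rcu(vma);	/* NULL or a refcounted name */
  rcu_read_unlock();
  if (anon_name) {
	/* anon_name->name is stable while the reference is held */
	anon_vma_name_put(anon_name);
  }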
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
include/linux/mm_inline.h | 10 +++++++++-
include/linux/mm_types.h | 3 ++-
mm/madvise.c | 30 ++++++++++++++++++++++++++----
3 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f9157a0c42a5..9ac2d92d7ede 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -410,7 +410,7 @@ static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
struct anon_vma_name *anon_name = anon_vma_name(orig_vma);
if (anon_name)
- new_vma->anon_name = anon_vma_name_reuse(anon_name);
+ rcu_assign_pointer(new_vma->anon_name, anon_vma_name_reuse(anon_name));
}
static inline void free_anon_vma_name(struct vm_area_struct *vma)
@@ -432,6 +432,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
!strcmp(anon_name1->name, anon_name2->name);
}
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma);
+
#else /* CONFIG_ANON_VMA_NAME */
static inline void anon_vma_name_get(struct anon_vma_name *anon_name) {}
static inline void anon_vma_name_put(struct anon_vma_name *anon_name) {}
@@ -445,6 +447,12 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
return true;
}
+static inline
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+ return NULL;
+}
+
#endif /* CONFIG_ANON_VMA_NAME */
static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..15ec288d4a21 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -700,6 +700,7 @@ struct vm_userfaultfd_ctx {};
struct anon_vma_name {
struct kref kref;
+ struct rcu_head rcu;
/* The name needs to be at the end because it is dynamically sized. */
char name[];
};
@@ -874,7 +875,7 @@ struct vm_area_struct {
* terminated string containing the name given to the vma, or NULL if
* unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
*/
- struct anon_vma_name *anon_name;
+ struct anon_vma_name __rcu *anon_name;
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;
diff --git a/mm/madvise.c b/mm/madvise.c
index 8433ac9b27e0..ed03a5a2c140 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -101,14 +101,15 @@ void anon_vma_name_free(struct kref *kref)
{
struct anon_vma_name *anon_name =
container_of(kref, struct anon_vma_name, kref);
- kfree(anon_name);
+ kfree_rcu(anon_name, rcu);
}
struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
{
mmap_assert_locked(vma->vm_mm);
- return vma->anon_name;
+ return rcu_dereference_protected(vma->anon_name,
+ rwsem_is_locked(&vma->vm_mm->mmap_lock));
}
/* mmap_lock should be write-locked */
@@ -118,7 +119,7 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
struct anon_vma_name *orig_name = anon_vma_name(vma);
if (!anon_name) {
- vma->anon_name = NULL;
+ rcu_assign_pointer(vma->anon_name, NULL);
anon_vma_name_put(orig_name);
return 0;
}
@@ -126,11 +127,32 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
if (anon_vma_name_eq(orig_name, anon_name))
return 0;
- vma->anon_name = anon_vma_name_reuse(anon_name);
+ rcu_assign_pointer(vma->anon_name, anon_vma_name_reuse(anon_name));
anon_vma_name_put(orig_name);
return 0;
}
+
+/*
+ * Returned anon_vma_name is stable due to elevated refcount but not guaranteed
+ * to be assigned to the original VMA after the call.
+ */
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+ struct anon_vma_name *anon_name;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ anon_name = rcu_dereference(vma->anon_name);
+ if (!anon_name)
+ return NULL;
+
+ if (unlikely(!kref_get_unless_zero(&anon_name->kref)))
+ return NULL;
+
+ return anon_name;
+}
+
#else /* CONFIG_ANON_VMA_NAME */
static int replace_anon_vma_name(struct vm_area_struct *vma,
struct anon_vma_name *anon_name)
--
2.49.0.805.g082f7c87e0-goog
* [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
` (5 preceding siblings ...)
2025-04-18 17:49 ` [PATCH v3 6/8] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-22 22:48 ` Andrii Nakryiko
2025-04-29 15:39 ` Jann Horn
2025-04-18 17:49 ` [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl " Suren Baghdasaryan
7 siblings, 2 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
With the maple_tree supporting vma tree traversal under RCU, and the
vma and its important members being RCU-safe, /proc/pid/maps can be
read under RCU without the need to read-lock mmap_lock. However, vma
content can change from under us, therefore we make a copy of the vma
and
pin pointer fields used when generating the output (currently only
vm_file and anon_name). Afterwards we check for concurrent address
space modifications, wait for them to end and retry. While we take
the mmap_lock for reading during such contention, we do that momentarily
only to record the new mm_wr_seq counter. This change is designed to reduce
mmap_lock contention and prevent a process reading /proc/pid/maps files
(often a low priority task, such as monitoring/data collection services)
from blocking address space updates.
Note that this change has a userspace visible disadvantage: it allows
for sub-page data tearing as opposed to the previous mechanism where
data tearing could happen only between pages of generated output data.
Since current userspace considers data tearing between pages to be
acceptable, we assume it will be able to handle sub-page data tearing
as well.
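Condensed, the read protocol that get_vma_snapshot() below implements is:
  rcu_read_lock();
  mmap_lock_speculate_try_begin(mm, &seq);	/* record mm_wr_seq */
  vma = vma_next(&iter);
  memcpy(&copy, vma, sizeof(copy));		/* snapshot vma contents */
  /* pin copy.vm_file and copy.anon_name so they stay valid */
  if (mmap_lock_speculate_retry(mm, seq)) {
	/* a writer raced with us: drop the pins, briefly take
	 * mmap_read_lock() to record a new seq, then retry */
  }
  rcu_read_unlock();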
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
fs/proc/internal.h | 6 ++
fs/proc/task_mmu.c | 170 ++++++++++++++++++++++++++++++++++----
include/linux/mm_inline.h | 18 ++++
3 files changed, 177 insertions(+), 17 deletions(-)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96122e91c645..6e1169c1f4df 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -379,6 +379,12 @@ struct proc_maps_private {
struct task_struct *task;
struct mm_struct *mm;
struct vma_iterator iter;
+ bool mmap_locked;
+ loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+ unsigned int mm_wr_seq;
+ struct vm_area_struct vma_copy;
+#endif
#ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
#endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f9d50a61167c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
}
#endif
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
- loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static const struct seq_operations proc_pid_maps_op;
+
+/*
+ * Take VMA snapshot and pin vm_file and anon_name as they are used by
+ * show_map_vma.
+ */
+static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
+{
+ struct vm_area_struct *copy = &priv->vma_copy;
+ int ret = -EAGAIN;
+
+ memcpy(copy, vma, sizeof(*vma));
+ if (copy->vm_file && !get_file_rcu(&copy->vm_file))
+ goto out;
+
+ if (!anon_vma_name_get_if_valid(copy))
+ goto put_file;
+
+ if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
+ return 0;
+
+ /* Address space got modified, vma might be stale. Re-lock and retry. */
+ rcu_read_unlock();
+ ret = mmap_read_lock_killable(priv->mm);
+ if (!ret) {
+ /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
+ mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+ mmap_read_unlock(priv->mm);
+ ret = -EAGAIN;
+ }
+
+ rcu_read_lock();
+
+ anon_vma_name_put_if_valid(copy);
+put_file:
+ if (copy->vm_file)
+ fput(copy->vm_file);
+out:
+ return ret;
+}
+
+static void put_vma_snapshot(struct proc_maps_private *priv)
+{
+ struct vm_area_struct *vma = &priv->vma_copy;
+
+ anon_vma_name_put_if_valid(vma);
+ if (vma->vm_file)
+ fput(vma->vm_file);
+}
+
+static inline bool drop_mmap_lock(struct seq_file *m, struct proc_maps_private *priv)
+{
+ /*
+ * smaps and numa_maps perform a page table walk and therefore require
+ * mmap_lock, but maps can be read under RCU.
+ */
+ if (m->op != &proc_pid_maps_op)
+ return false;
+
+ /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
+ mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+ mmap_read_unlock(priv->mm);
+ rcu_read_lock();
+ memset(&priv->vma_copy, 0, sizeof(priv->vma_copy));
+
+ return true;
+}
+
+static struct vm_area_struct *get_stable_vma(struct vm_area_struct *vma,
+ struct proc_maps_private *priv,
+ loff_t last_pos)
+{
+ int ret;
+
+ put_vma_snapshot(priv);
+ while ((ret = get_vma_snapshot(priv, vma)) == -EAGAIN) {
+ /* lookup the vma at the last position again */
+ vma_iter_init(&priv->iter, priv->mm, last_pos);
+ vma = vma_next(&priv->iter);
+ }
+
+ return ret ? ERR_PTR(ret) : &priv->vma_copy;
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+/* Without per-vma locks VMA access is not RCU-safe */
+static inline bool drop_mmap_lock(struct seq_file *m,
+ struct proc_maps_private *priv)
+{
+ return false;
+}
+
+static struct vm_area_struct *get_stable_vma(struct vm_area_struct *vma,
+ struct proc_maps_private *priv,
+ loff_t last_pos)
+{
+ return vma;
+}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
{
+ struct proc_maps_private *priv = m->private;
struct vm_area_struct *vma = vma_next(&priv->iter);
+ if (vma && !priv->mmap_locked)
+ vma = get_stable_vma(vma, priv, *ppos);
+
+ if (IS_ERR(vma))
+ return vma;
+
if (vma) {
- *ppos = vma->vm_start;
+ /* Store previous position to be able to restart if needed */
+ priv->last_pos = *ppos;
+ /*
+ * Track the end of the reported vma to ensure position changes
+ * even if previous vma was merged with the next vma and we
+ * found the extended vma with the same vm_start.
+ */
+ *ppos = vma->vm_end;
} else {
*ppos = -2UL;
vma = get_gate_vma(priv->mm);
@@ -148,6 +265,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
unsigned long last_addr = *ppos;
struct mm_struct *mm;
+ priv->mmap_locked = true;
/* See m_next(). Zero at the start or after lseek. */
if (last_addr == -1UL)
return NULL;
@@ -170,12 +288,18 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
return ERR_PTR(-EINTR);
}
+ /* Drop mmap_lock if possible */
+ if (drop_mmap_lock(m, priv))
+ priv->mmap_locked = false;
+
+ if (last_addr > 0)
+ *ppos = last_addr = priv->last_pos;
vma_iter_init(&priv->iter, mm, last_addr);
hold_task_mempolicy(priv);
if (last_addr == -2UL)
return get_gate_vma(mm);
- return proc_get_vma(priv, ppos);
+ return proc_get_vma(m, ppos);
}
static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
@@ -184,7 +308,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
*ppos = -1UL;
return NULL;
}
- return proc_get_vma(m->private, ppos);
+ return proc_get_vma(m, ppos);
}
static void m_stop(struct seq_file *m, void *v)
@@ -196,7 +320,10 @@ static void m_stop(struct seq_file *m, void *v)
return;
release_task_mempolicy(priv);
- mmap_read_unlock(mm);
+ if (priv->mmap_locked)
+ mmap_read_unlock(mm);
+ else
+ rcu_read_unlock();
mmput(mm);
put_task_struct(priv->task);
priv->task = NULL;
@@ -243,14 +370,20 @@ static int do_maps_open(struct inode *inode, struct file *file,
static void get_vma_name(struct vm_area_struct *vma,
const struct path **path,
const char **name,
- const char **name_fmt)
+ const char **name_fmt, bool mmap_locked)
{
- struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
+ struct anon_vma_name *anon_name;
*name = NULL;
*path = NULL;
*name_fmt = NULL;
+ if (vma->vm_mm)
+ anon_name = mmap_locked ? anon_vma_name(vma) :
+ anon_vma_name_get_rcu(vma);
+ else
+ anon_name = NULL;
+
/*
* Print the dentry name for named mappings, and a
* special [heap] marker for the heap:
@@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
} else {
*path = file_user_path(vma->vm_file);
}
- return;
+ goto out;
}
if (vma->vm_ops && vma->vm_ops->name) {
*name = vma->vm_ops->name(vma);
if (*name)
- return;
+ goto out;
}
*name = arch_vma_name(vma);
if (*name)
- return;
+ goto out;
if (!vma->vm_mm) {
*name = "[vdso]";
- return;
+ goto out;
}
if (vma_is_initial_heap(vma)) {
*name = "[heap]";
- return;
+ goto out;
}
if (vma_is_initial_stack(vma)) {
*name = "[stack]";
- return;
+ goto out;
}
if (anon_name) {
*name_fmt = "[anon:%s]";
*name = anon_name->name;
- return;
}
+out:
+ if (anon_name && !mmap_locked)
+ anon_vma_name_put(anon_name);
}
static void show_vma_header_prefix(struct seq_file *m,
@@ -324,6 +459,7 @@ static void show_vma_header_prefix(struct seq_file *m,
static void
show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
{
+ struct proc_maps_private *priv = m->private;
const struct path *path;
const char *name_fmt, *name;
vm_flags_t flags = vma->vm_flags;
@@ -344,7 +480,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
end = vma->vm_end;
show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
- get_vma_name(vma, &path, &name, &name_fmt);
+ get_vma_name(vma, &path, &name, &name_fmt, priv->mmap_locked);
if (path) {
seq_pad(m, ' ');
seq_path(m, path, "\n");
@@ -549,7 +685,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
const char *name_fmt;
size_t name_sz = 0;
- get_vma_name(vma, &path, &name, &name_fmt);
+ get_vma_name(vma, &path, &name, &name_fmt, true);
if (path || name_fmt || name) {
name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9ac2d92d7ede..436512f1e759 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -434,6 +434,21 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma);
+/*
+ * Takes a reference if the vma's anon_name is valid and stable (has references).
+ * Fails only if anon_name is valid but we failed to get a reference.
+ */
+static inline bool anon_vma_name_get_if_valid(struct vm_area_struct *vma)
+{
+ return !vma->anon_name || anon_vma_name_get_rcu(vma);
+}
+
+static inline void anon_vma_name_put_if_valid(struct vm_area_struct *vma)
+{
+ if (vma->anon_name)
+ anon_vma_name_put(vma->anon_name);
+}
+
#else /* CONFIG_ANON_VMA_NAME */
static inline void anon_vma_name_get(struct anon_vma_name *anon_name) {}
static inline void anon_vma_name_put(struct anon_vma_name *anon_name) {}
@@ -453,6 +468,9 @@ struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
return NULL;
}
+static inline bool anon_vma_name_get_if_valid(struct vm_area_struct *vma) { return true; }
+static inline void anon_vma_name_put_if_valid(struct vm_area_struct *vma) {}
+
#endif /* CONFIG_ANON_VMA_NAME */
static inline void init_tlb_flush_pending(struct mm_struct *mm)
--
2.49.0.805.g082f7c87e0-goog
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
` (6 preceding siblings ...)
2025-04-18 17:49 ` [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
@ 2025-04-18 17:49 ` Suren Baghdasaryan
2025-04-22 22:54 ` Andrii Nakryiko
7 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 17:49 UTC (permalink / raw)
To: akpm
Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest, surenb
Utilize speculative vma lookup to find and snapshot a vma without
taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
address space modifications are detected and the lookup is retried.
While we take the mmap_lock for reading during such contention, we do
so only momentarily, to record the new mm_wr_seq counter.
This change is designed to reduce mmap_lock contention and prevent
PROCMAP_QUERY ioctl calls (often a low priority task, such as
monitoring/data collection services) from blocking address space
updates.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 55 insertions(+), 8 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f9d50a61167c..28b975ddff26 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -525,9 +525,53 @@ static int pid_maps_open(struct inode *inode, struct file *file)
PROCMAP_QUERY_VMA_FLAGS \
)
-static int query_vma_setup(struct mm_struct *mm)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static int query_vma_setup(struct proc_maps_private *priv)
{
- return mmap_read_lock_killable(mm);
+ if (!mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq)) {
+ int ret = mmap_read_lock_killable(priv->mm);
+
+ if (ret)
+ return ret;
+
+ /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
+ mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+ mmap_read_unlock(priv->mm);
+ }
+
+ memset(&priv->vma_copy, 0, sizeof(priv->vma_copy));
+ rcu_read_lock();
+
+ return 0;
+}
+
+static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ rcu_read_unlock();
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
+ unsigned long addr)
+{
+ struct vm_area_struct *vma;
+
+ vma_iter_init(&priv->iter, priv->mm, addr);
+ vma = vma_next(&priv->iter);
+ if (!vma)
+ return NULL;
+
+ vma = get_stable_vma(vma, priv, addr);
+
+ /* The only possible error is EINTR, just pretend we found nothing */
+ return IS_ERR(vma) ? NULL : vma;
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static int query_vma_setup(struct proc_maps_private *priv)
+{
+ return mmap_read_lock_killable(priv->mm);
}
static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
@@ -535,18 +579,21 @@ static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
mmap_read_unlock(mm);
}
-static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
+ unsigned long addr)
{
- return find_vma(mm, addr);
+ return find_vma(priv->mm, addr);
}
-static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
+#endif /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *query_matching_vma(struct proc_maps_private *priv,
unsigned long addr, u32 flags)
{
struct vm_area_struct *vma;
next_vma:
- vma = query_vma_find_by_addr(mm, addr);
+ vma = query_vma_find_by_addr(priv, addr);
if (!vma)
goto no_vma;
@@ -622,13 +669,13 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
if (!mm || !mmget_not_zero(mm))
return -ESRCH;
- err = query_vma_setup(mm);
+ err = query_vma_setup(priv);
if (err) {
mmput(mm);
return err;
}
- vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
+ vma = query_matching_vma(priv, karg.query_addr, karg.query_flags);
if (IS_ERR(vma)) {
err = PTR_ERR(vma);
vma = NULL;
--
2.49.0.805.g082f7c87e0-goog
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
2025-04-18 17:49 ` [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
@ 2025-04-18 18:30 ` Lorenzo Stoakes
2025-04-18 19:31 ` Suren Baghdasaryan
0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2025-04-18 18:30 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
osalvador, andrii, ryan.roberts, christophe.leroy, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Fri, Apr 18, 2025 at 10:49:54AM -0700, Suren Baghdasaryan wrote:
> Test that /proc/pid/maps does not report unexpected holes in the address
> space when we concurrently remap a part of a vma into the middle of
> another vma. This remapping results in the destination vma being split
> into three parts and the part in the middle then being patched back,
> all done concurrently from under the reader. We should always see either
> the original vma or the split one, with no holes.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Umm, but we haven't fixed this in the mremap code right? :) So isn't this test
failing right now? :P
My understanding from the meeting was you'd drop this commit/mark it skipped
for now or something like this?
> ---
> tools/testing/selftests/proc/proc-pid-vm.c | 92 ++++++++++++++++++++++
> 1 file changed, 92 insertions(+)
>
> diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
> index 39842e4ec45f..1aef2db7e893 100644
> --- a/tools/testing/selftests/proc/proc-pid-vm.c
> +++ b/tools/testing/selftests/proc/proc-pid-vm.c
> @@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
> signal_state(mod_info, TEST_DONE);
> }
>
> +static inline void remap_vma(const struct vma_modifier_info *mod_info)
> +{
> + /*
> + * Remap the last page of the next vma into the middle of the vma.
> + * This splits the current vma and the first and middle parts (the
> + * parts at lower addresses) become the last vma observed in the
> + * first page and the first vma observed in the last page.
> + */
> + assert(mremap(mod_info->next_addr + page_size * 2, page_size,
> + page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
> + mod_info->addr + page_size) != MAP_FAILED);
> +}
> +
> +static inline void patch_vma(const struct vma_modifier_info *mod_info)
> +{
> + assert(!mprotect(mod_info->addr + page_size, page_size,
> + mod_info->prot));
> +}
> +
> +static inline void check_remap_result(struct line_content *mod_last_line,
> + struct line_content *mod_first_line,
> + struct line_content *restored_last_line,
> + struct line_content *restored_first_line)
> +{
> + /* Make sure vmas at the boundaries are changing */
> + assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
> + assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
> +}
> +
> +static void test_maps_tearing_from_remap(int maps_fd,
> + struct vma_modifier_info *mod_info,
> + struct page_content *page1,
> + struct page_content *page2,
> + struct line_content *last_line,
> + struct line_content *first_line)
> +{
> + struct line_content remapped_last_line;
> + struct line_content remapped_first_line;
> + struct line_content restored_last_line;
> + struct line_content restored_first_line;
> +
> + wait_for_state(mod_info, SETUP_READY);
> +
> + /* re-read the file to avoid using stale data from previous test */
> + read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
> +
> + mod_info->vma_modify = remap_vma;
> + mod_info->vma_restore = patch_vma;
> + mod_info->vma_mod_check = check_remap_result;
> +
> + capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
> + &remapped_last_line, &remapped_first_line,
> + &restored_last_line, &restored_first_line);
> +
> + /* Now start concurrent modifications for test_duration_sec */
> + signal_state(mod_info, TEST_READY);
> +
> + struct line_content new_last_line;
> + struct line_content new_first_line;
> + struct timespec start_ts, end_ts;
> +
> + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
> + do {
> + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
> +
> + /* Check if we read the vmas after the remap */
> + if (!strcmp(new_last_line.text, remapped_last_line.text)) {
> + /*
> + * The vmas should be consistent with remap results,
> + * however if the vma was concurrently restored, it
> + * can be reported twice (first as split one, then
> + * as restored one) because we found it as the next vma
> + * again. In that case new first line will be the same
> + * as the last restored line.
> + */
> + assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
> + !strcmp(new_first_line.text, restored_last_line.text));
> + } else {
> + /* The vmas should be consistent with the original/restored state */
> + assert(!strcmp(new_last_line.text, restored_last_line.text) &&
> + !strcmp(new_first_line.text, restored_first_line.text));
> + }
> + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
> + } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
> +
> + /* Signal the modifier thread to stop and wait until it exits */
> + signal_state(mod_info, TEST_DONE);
> +}
> +
> static int test_maps_tearing(void)
> {
> struct vma_modifier_info *mod_info;
> @@ -757,6 +846,9 @@ static int test_maps_tearing(void)
> test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
> &last_line, &first_line);
>
> + test_maps_tearing_from_remap(maps_fd, mod_info, &page1, &page2,
> + &last_line, &first_line);
> +
> stop_vma_modifier(mod_info);
>
> free(page2.data);
> --
> 2.49.0.805.g082f7c87e0-goog
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
2025-04-18 18:30 ` Lorenzo Stoakes
@ 2025-04-18 19:31 ` Suren Baghdasaryan
2025-04-18 19:55 ` Lorenzo Stoakes
0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 19:31 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
osalvador, andrii, ryan.roberts, christophe.leroy, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Fri, Apr 18, 2025 at 11:30 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Apr 18, 2025 at 10:49:54AM -0700, Suren Baghdasaryan wrote:
> > Test that /proc/pid/maps does not report unexpected holes in the address
> > space when we concurrently remap a part of a vma into the middle of
> > another vma. This remapping results in the destination vma being split
> > into three parts and the part in the middle then being patched back,
> > all done concurrently from under the reader. We should always see either
> > the original vma or the split one, with no holes.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Umm, but we haven't fixed this in the mremap code right? :) So isn't this test
> failing right now? :P
>
> My understanding from the meeting was you'd drop this commit/mark it skipped
> for now or something like this?
After you pointed out the issue in mremap_to() I spent some time
investigating that code. IIUC the fact that mremap_to() does
do_munmap() and move_vma() as two separate operations should not
affect lockless reading because both those operations are done under
mmap_write_lock() without dropping it in between. Since my lockless
mechanism uses mmap_lock_speculate_xxx API to detect address space
modifications and retry if concurrent updates happen, it should be able
to handle this case. The vma it reports should be either the version
before mmap_write_lock was taken (before the mremap()) or after it got
dropped (after mremap() is complete). Did I miss something more
subtle?
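To spell out the reader-side pattern I'm relying on (just a sketch;
try_speculative_read() is a made-up name, but mmap_lock_speculate_try_begin(),
mmap_lock_speculate_retry(), VMA_ITERATOR() and vma_next() are the real APIs):

	static bool try_speculative_read(struct mm_struct *mm, unsigned long addr)
	{
		VMA_ITERATOR(vmi, mm, addr);
		struct vm_area_struct *vma;
		unsigned int seq;

		/* fails while a writer (e.g. mremap()) holds mmap_write_lock */
		if (!mmap_lock_speculate_try_begin(mm, &seq))
			return false;

		rcu_read_lock();
		vma = vma_next(&vmi);
		/* ... snapshot the fields used for the output ... */
		rcu_read_unlock();

		/*
		 * Returns false if any writer completed (or is still running)
		 * since try_begin. A writer holding mmap_write_lock across
		 * both do_munmap() and move_vma() therefore forces a retry,
		 * so the intermediate gap is never reported as final output.
		 */
		return !mmap_lock_speculate_retry(mm, seq);
	}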
>
> > ---
> > tools/testing/selftests/proc/proc-pid-vm.c | 92 ++++++++++++++++++++++
> > 1 file changed, 92 insertions(+)
> >
> > diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
> > index 39842e4ec45f..1aef2db7e893 100644
> > --- a/tools/testing/selftests/proc/proc-pid-vm.c
> > +++ b/tools/testing/selftests/proc/proc-pid-vm.c
> > @@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
> > signal_state(mod_info, TEST_DONE);
> > }
> >
> > +static inline void remap_vma(const struct vma_modifier_info *mod_info)
> > +{
> > + /*
> > + * Remap the last page of the next vma into the middle of the vma.
> > + * This splits the current vma and the first and middle parts (the
> > + * parts at lower addresses) become the last vma observed in the
> > + * first page and the first vma observed in the last page.
> > + */
> > + assert(mremap(mod_info->next_addr + page_size * 2, page_size,
> > + page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
> > + mod_info->addr + page_size) != MAP_FAILED);
> > +}
> > +
> > +static inline void patch_vma(const struct vma_modifier_info *mod_info)
> > +{
> > + assert(!mprotect(mod_info->addr + page_size, page_size,
> > + mod_info->prot));
> > +}
> > +
> > +static inline void check_remap_result(struct line_content *mod_last_line,
> > + struct line_content *mod_first_line,
> > + struct line_content *restored_last_line,
> > + struct line_content *restored_first_line)
> > +{
> > + /* Make sure vmas at the boundaries are changing */
> > + assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
> > + assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
> > +}
> > +
> > +static void test_maps_tearing_from_remap(int maps_fd,
> > + struct vma_modifier_info *mod_info,
> > + struct page_content *page1,
> > + struct page_content *page2,
> > + struct line_content *last_line,
> > + struct line_content *first_line)
> > +{
> > + struct line_content remapped_last_line;
> > + struct line_content remapped_first_line;
> > + struct line_content restored_last_line;
> > + struct line_content restored_first_line;
> > +
> > + wait_for_state(mod_info, SETUP_READY);
> > +
> > + /* re-read the file to avoid using stale data from previous test */
> > + read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
> > +
> > + mod_info->vma_modify = remap_vma;
> > + mod_info->vma_restore = patch_vma;
> > + mod_info->vma_mod_check = check_remap_result;
> > +
> > + capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
> > + &remapped_last_line, &remapped_first_line,
> > + &restored_last_line, &restored_first_line);
> > +
> > + /* Now start concurrent modifications for test_duration_sec */
> > + signal_state(mod_info, TEST_READY);
> > +
> > + struct line_content new_last_line;
> > + struct line_content new_first_line;
> > + struct timespec start_ts, end_ts;
> > +
> > + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
> > + do {
> > + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
> > +
> > + /* Check if we read the vmas after the remap */
> > + if (!strcmp(new_last_line.text, remapped_last_line.text)) {
> > + /*
> > + * The vmas should be consistent with remap results,
> > + * however if the vma was concurrently restored, it
> > + * can be reported twice (first as split one, then
> > + * as restored one) because we found it as the next vma
> > + * again. In that case new first line will be the same
> > + * as the last restored line.
> > + */
> > + assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
> > + !strcmp(new_first_line.text, restored_last_line.text));
> > + } else {
> > + /* The vmas should be consistent with the original/restored state */
> > + assert(!strcmp(new_last_line.text, restored_last_line.text) &&
> > + !strcmp(new_first_line.text, restored_first_line.text));
> > + }
> > + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
> > + } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
> > +
> > + /* Signal the modifier thread to stop and wait until it exits */
> > + signal_state(mod_info, TEST_DONE);
> > +}
> > +
> > static int test_maps_tearing(void)
> > {
> > struct vma_modifier_info *mod_info;
> > @@ -757,6 +846,9 @@ static int test_maps_tearing(void)
> > test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
> > &last_line, &first_line);
> >
> > + test_maps_tearing_from_remap(maps_fd, mod_info, &page1, &page2,
> > + &last_line, &first_line);
> > +
> > stop_vma_modifier(mod_info);
> >
> > free(page2.data);
> > --
> > 2.49.0.805.g082f7c87e0-goog
> >
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
2025-04-18 19:31 ` Suren Baghdasaryan
@ 2025-04-18 19:55 ` Lorenzo Stoakes
2025-04-18 20:03 ` Suren Baghdasaryan
0 siblings, 1 reply; 39+ messages in thread
From: Lorenzo Stoakes @ 2025-04-18 19:55 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
osalvador, andrii, ryan.roberts, christophe.leroy, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Fri, Apr 18, 2025 at 12:31:29PM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 18, 2025 at 11:30 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Apr 18, 2025 at 10:49:54AM -0700, Suren Baghdasaryan wrote:
> > > Test that /proc/pid/maps does not report unexpected holes in the address
> > > space when we concurrently remap a part of a vma into the middle of
> > > another vma. This remapping results in the destination vma being split
> > > into three parts and the part in the middle then being patched back,
> > > all done concurrently from under the reader. We should always see either
> > > the original vma or the split one, with no holes.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >
> > Umm, but we haven't fixed this in the mremap code right? :) So isn't this test
> > failing right now? :P
> >
> > My understanding from the meeting was you'd drop this commit/mark it skipped
> > for now or something like this?
>
> After you pointed out the issue in mremap_to() I spent some time
> investigating that code. IIUC the fact that mremap_to() does
> do_munmap() and move_vma() as two separate operations should not
> affect lockless reading because both those operations are done under
> mmap_write_lock() without dropping it in between. Since my lockless
> mechanism uses mmap_lock_speculate_xxx API to detect address space
> modifications and retry if concurrent updates happen, it should be able
> to handle this case. The vma it reports should be either the version
> before mmap_write_lock was taken (before the mremap()) or after it got
> dropped (after mremap() is complete). Did I miss something more
> subtle?
Speaking to Liam, seems perhaps the implementation has changed since we first
started looking at this?
I guess it's this speculation stuff that keeps you safe then, yeah we hold the
write lock over both.
So actually ideal if we can avoid having to fix up the mremap implementation!
If you're sure that the speculation combined with mmap write lock keeps us safe
then we should be ok I'd say.
>
> >
> > > ---
> > > tools/testing/selftests/proc/proc-pid-vm.c | 92 ++++++++++++++++++++++
> > > 1 file changed, 92 insertions(+)
> > >
> > > diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
> > > index 39842e4ec45f..1aef2db7e893 100644
> > > --- a/tools/testing/selftests/proc/proc-pid-vm.c
> > > +++ b/tools/testing/selftests/proc/proc-pid-vm.c
> > > @@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
> > > signal_state(mod_info, TEST_DONE);
> > > }
> > >
> > > +static inline void remap_vma(const struct vma_modifier_info *mod_info)
> > > +{
> > > + /*
> > > + * Remap the last page of the next vma into the middle of the vma.
> > > + * This splits the current vma and the first and middle parts (the
> > > + * parts at lower addresses) become the last vma observed in the
> > > + * first page and the first vma observed in the last page.
> > > + */
> > > + assert(mremap(mod_info->next_addr + page_size * 2, page_size,
> > > + page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
> > > + mod_info->addr + page_size) != MAP_FAILED);
> > > +}
> > > +
> > > +static inline void patch_vma(const struct vma_modifier_info *mod_info)
> > > +{
> > > + assert(!mprotect(mod_info->addr + page_size, page_size,
> > > + mod_info->prot));
> > > +}
> > > +
> > > +static inline void check_remap_result(struct line_content *mod_last_line,
> > > + struct line_content *mod_first_line,
> > > + struct line_content *restored_last_line,
> > > + struct line_content *restored_first_line)
> > > +{
> > > + /* Make sure vmas at the boundaries are changing */
> > > + assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
> > > + assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
> > > +}
> > > +
> > > +static void test_maps_tearing_from_remap(int maps_fd,
> > > + struct vma_modifier_info *mod_info,
> > > + struct page_content *page1,
> > > + struct page_content *page2,
> > > + struct line_content *last_line,
> > > + struct line_content *first_line)
> > > +{
> > > + struct line_content remapped_last_line;
> > > + struct line_content remapped_first_line;
> > > + struct line_content restored_last_line;
> > > + struct line_content restored_first_line;
> > > +
> > > + wait_for_state(mod_info, SETUP_READY);
> > > +
> > > + /* re-read the file to avoid using stale data from previous test */
> > > + read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
> > > +
> > > + mod_info->vma_modify = remap_vma;
> > > + mod_info->vma_restore = patch_vma;
> > > + mod_info->vma_mod_check = check_remap_result;
> > > +
> > > + capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
> > > + &remapped_last_line, &remapped_first_line,
> > > + &restored_last_line, &restored_first_line);
> > > +
> > > + /* Now start concurrent modifications for test_duration_sec */
> > > + signal_state(mod_info, TEST_READY);
> > > +
> > > + struct line_content new_last_line;
> > > + struct line_content new_first_line;
> > > + struct timespec start_ts, end_ts;
> > > +
> > > + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
> > > + do {
> > > + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
> > > +
> > > + /* Check if we read the vmas after the remap */
> > > + if (!strcmp(new_last_line.text, remapped_last_line.text)) {
> > > + /*
> > > + * The vmas should be consistent with remap results,
> > > + * however if the vma was concurrently restored, it
> > > + * can be reported twice (first as split one, then
> > > + * as restored one) because we found it as the next vma
> > > + * again. In that case new first line will be the same
> > > + * as the last restored line.
> > > + */
> > > + assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
> > > + !strcmp(new_first_line.text, restored_last_line.text));
> > > + } else {
> > > + /* The vmas should be consistent with the original/restored state */
> > > + assert(!strcmp(new_last_line.text, restored_last_line.text) &&
> > > + !strcmp(new_first_line.text, restored_first_line.text));
> > > + }
> > > + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
> > > + } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
> > > +
> > > + /* Signal the modifier thread to stop and wait until it exits */
> > > + signal_state(mod_info, TEST_DONE);
> > > +}
> > > +
> > > static int test_maps_tearing(void)
> > > {
> > > struct vma_modifier_info *mod_info;
> > > @@ -757,6 +846,9 @@ static int test_maps_tearing(void)
> > > test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
> > > &last_line, &first_line);
> > >
> > > + test_maps_tearing_from_remap(maps_fd, mod_info, &page1, &page2,
> > > + &last_line, &first_line);
> > > +
> > > stop_vma_modifier(mod_info);
> > >
> > > free(page2.data);
> > > --
> > > 2.49.0.805.g082f7c87e0-goog
> > >
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
2025-04-18 19:55 ` Lorenzo Stoakes
@ 2025-04-18 20:03 ` Suren Baghdasaryan
0 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-18 20:03 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
osalvador, andrii, ryan.roberts, christophe.leroy, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Fri, Apr 18, 2025 at 12:56 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Apr 18, 2025 at 12:31:29PM -0700, Suren Baghdasaryan wrote:
> > On Fri, Apr 18, 2025 at 11:30 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Fri, Apr 18, 2025 at 10:49:54AM -0700, Suren Baghdasaryan wrote:
> > > > Test that /proc/pid/maps does not report unexpected holes in the address
> > > > space when we concurrently remap a part of a vma into the middle of
> > > > another vma. This remapping results in the destination vma being split
> > > > into three parts and the part in the middle then being patched back,
> > > > all done concurrently from under the reader. We should always see either
> > > > the original vma or the split one, with no holes.
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > >
> > > Umm, but we haven't fixed this in the mremap code right? :) So isn't this test
> > > failing right now? :P
> > >
> > > My understanding from the meeting was you'd drop this commit/mark it skipped
> > > for now or something like this?
> >
> > After you pointed out the issue in mremap_to() I spent some time
> > investigating that code. IIUC the fact that mremap_to() does
> > do_munmap() and move_vma() as two separate operations should not
> > affect lockless reading because both those operations are done under
> > mmap_write_lock() without dropping it in between. Since my lockless
> > mechanism uses mmap_lock_speculate_xxx API to detect address space
> > modifications and retry if concurrent updates happen, it should be able
> > to handle this case. The vma it reports should be either the version
> > before mmap_write_lock was taken (before the mremap()) or after it got
> > dropped (after mremap() is complete). Did I miss something more
> > subtle?
>
> Speaking to Liam, seems perhaps the implementation has changed since we first
> started looking at this?
>
> I guess it's this speculation stuff that keeps you safe then, yeah we hold the
> write lock over both.
>
> So actually ideal if we can avoid having to fix up the mremap implementation!
>
> If you're sure that the speculation combined with mmap write lock keeps us safe
> then we should be ok I'd say.
Yeah, I tested that quite rigorously and confirmed (using the
mm->mm_lock_seq) that mmap_write_lock is not dropped somewhere in the
middle of those operations. I think we should be safe.
>
> >
> > >
> > > > ---
> > > > tools/testing/selftests/proc/proc-pid-vm.c | 92 ++++++++++++++++++++++
> > > > 1 file changed, 92 insertions(+)
> > > >
> > > > diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
> > > > index 39842e4ec45f..1aef2db7e893 100644
> > > > --- a/tools/testing/selftests/proc/proc-pid-vm.c
> > > > +++ b/tools/testing/selftests/proc/proc-pid-vm.c
> > > > @@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
> > > > signal_state(mod_info, TEST_DONE);
> > > > }
> > > >
> > > > +static inline void remap_vma(const struct vma_modifier_info *mod_info)
> > > > +{
> > > > + /*
> > > > + * Remap the last page of the next vma into the middle of the vma.
> > > > + * This splits the current vma and the first and middle parts (the
> > > > + * parts at lower addresses) become the last vma observed in the
> > > > + * first page and the first vma observed in the last page.
> > > > + */
> > > > + assert(mremap(mod_info->next_addr + page_size * 2, page_size,
> > > > + page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
> > > > + mod_info->addr + page_size) != MAP_FAILED);
> > > > +}
> > > > +
> > > > +static inline void patch_vma(const struct vma_modifier_info *mod_info)
> > > > +{
> > > > + assert(!mprotect(mod_info->addr + page_size, page_size,
> > > > + mod_info->prot));
> > > > +}
> > > > +
> > > > +static inline void check_remap_result(struct line_content *mod_last_line,
> > > > + struct line_content *mod_first_line,
> > > > + struct line_content *restored_last_line,
> > > > + struct line_content *restored_first_line)
> > > > +{
> > > > + /* Make sure vmas at the boundaries are changing */
> > > > + assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
> > > > + assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
> > > > +}
> > > > +
> > > > +static void test_maps_tearing_from_remap(int maps_fd,
> > > > + struct vma_modifier_info *mod_info,
> > > > + struct page_content *page1,
> > > > + struct page_content *page2,
> > > > + struct line_content *last_line,
> > > > + struct line_content *first_line)
> > > > +{
> > > > + struct line_content remapped_last_line;
> > > > + struct line_content remapped_first_line;
> > > > + struct line_content restored_last_line;
> > > > + struct line_content restored_first_line;
> > > > +
> > > > + wait_for_state(mod_info, SETUP_READY);
> > > > +
> > > > + /* re-read the file to avoid using stale data from previous test */
> > > > + read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
> > > > +
> > > > + mod_info->vma_modify = remap_vma;
> > > > + mod_info->vma_restore = patch_vma;
> > > > + mod_info->vma_mod_check = check_remap_result;
> > > > +
> > > > + capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
> > > > + &remapped_last_line, &remapped_first_line,
> > > > + &restored_last_line, &restored_first_line);
> > > > +
> > > > + /* Now start concurrent modifications for test_duration_sec */
> > > > + signal_state(mod_info, TEST_READY);
> > > > +
> > > > + struct line_content new_last_line;
> > > > + struct line_content new_first_line;
> > > > + struct timespec start_ts, end_ts;
> > > > +
> > > > + clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
> > > > + do {
> > > > + read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
> > > > +
> > > > + /* Check if we read the vmas after the remap */
> > > > + if (!strcmp(new_last_line.text, remapped_last_line.text)) {
> > > > + /*
> > > > + * The vmas should be consistent with remap results,
> > > > + * however if the vma was concurrently restored, it
> > > > + * can be reported twice (first as split one, then
> > > > + * as restored one) because we found it as the next vma
> > > > + * again. In that case new first line will be the same
> > > > + * as the last restored line.
> > > > + */
> > > > + assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
> > > > + !strcmp(new_first_line.text, restored_last_line.text));
> > > > + } else {
> > > > + /* The vmas should be consistent with the original/restored state */
> > > > + assert(!strcmp(new_last_line.text, restored_last_line.text) &&
> > > > + !strcmp(new_first_line.text, restored_first_line.text));
> > > > + }
> > > > + clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
> > > > + } while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
> > > > +
> > > > + /* Signal the modifier thread to stop and wait until it exits */
> > > > + signal_state(mod_info, TEST_DONE);
> > > > +}
> > > > +
> > > > static int test_maps_tearing(void)
> > > > {
> > > > struct vma_modifier_info *mod_info;
> > > > @@ -757,6 +846,9 @@ static int test_maps_tearing(void)
> > > > test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
> > > > &last_line, &first_line);
> > > >
> > > > + test_maps_tearing_from_remap(maps_fd, mod_info, &page1, &page2,
> > > > + &last_line, &first_line);
> > > > +
> > > > stop_vma_modifier(mod_info);
> > > >
> > > > free(page2.data);
> > > > --
> > > > 2.49.0.805.g082f7c87e0-goog
> > > >
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-18 17:49 ` [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
@ 2025-04-22 22:48 ` Andrii Nakryiko
2025-04-23 21:49 ` Suren Baghdasaryan
2025-04-29 15:39 ` Jann Horn
1 sibling, 1 reply; 39+ messages in thread
From: Andrii Nakryiko @ 2025-04-22 22:48 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> With maple_tree supporting vma tree traversal under RCU and vma and
> its important members being RCU-safe, /proc/pid/maps can be read under
> RCU and without the need to read-lock mmap_lock. However vma content
> can change from under us, therefore we make a copy of the vma and we
> pin pointer fields used when generating the output (currently only
> vm_file and anon_name). Afterwards we check for concurrent address
> space modifications, wait for them to end and retry. While we take
> the mmap_lock for reading during such contention, we do so only
> momentarily, to record the new mm_wr_seq counter. This change is designed
> to reduce
This is probably a stupid question, but why do we need to take a lock
just to record this counter? uprobes get away without taking mmap_lock
even for reads, and still record this seq counter. And then detect
whether there were any modifications in between. Why does this change
need more heavy-weight mmap_read_lock to do speculative reads?
> mmap_lock contention and prevent a process reading /proc/pid/maps files
> (often a low priority task, such as monitoring/data collection services)
> from blocking address space updates.
> Note that this change has a userspace visible disadvantage: it allows
> for sub-page data tearing as opposed to the previous mechanism where
> data tearing could happen only between pages of generated output data.
> Since current userspace considers data tearing between pages to be
> acceptable, we assume it will be able to handle sub-page data tearing
> as well.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> fs/proc/internal.h | 6 ++
> fs/proc/task_mmu.c | 170 ++++++++++++++++++++++++++++++++++----
> include/linux/mm_inline.h | 18 ++++
> 3 files changed, 177 insertions(+), 17 deletions(-)
>
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 96122e91c645..6e1169c1f4df 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -379,6 +379,12 @@ struct proc_maps_private {
> struct task_struct *task;
> struct mm_struct *mm;
> struct vma_iterator iter;
> + bool mmap_locked;
> + loff_t last_pos;
> +#ifdef CONFIG_PER_VMA_LOCK
> + unsigned int mm_wr_seq;
> + struct vm_area_struct vma_copy;
> +#endif
> #ifdef CONFIG_NUMA
> struct mempolicy *task_mempolicy;
> #endif
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f9d50a61167c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> }
> #endif
>
> -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> - loff_t *ppos)
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static const struct seq_operations proc_pid_maps_op;
> +
> +/*
> + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> + * show_map_vma.
> + */
> +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> +{
> + struct vm_area_struct *copy = &priv->vma_copy;
> + int ret = -EAGAIN;
> +
> + memcpy(copy, vma, sizeof(*vma));
> + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> + goto out;
> +
> + if (!anon_vma_name_get_if_valid(copy))
> + goto put_file;
Given vm_file and anon_vma_name are both RCU-protected, if we take
rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
even need getting/putting them, no?
I feel like I'm missing some important limitations, but I don't think
they are spelled out explicitly...
> +
> + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> + return 0;
> +
> + /* Address space got modified, vma might be stale. Re-lock and retry. */
> + rcu_read_unlock();
Another question I have is whether we really need to copy the vma into
priv->vma_copy to have a stable snapshot? Can't we just speculate like
with uprobes, under the assumption that the data doesn't change, and
once we are done producing text output, confirm that the speculation
was successful, and if not - retry?
We'd need an API for seq_file to roll back to the previous known-good
location for that, but that seems straightforward enough to do, no?
Just remember the buffer position before speculation, write the data,
check for mm modifications, and if something changed, roll back the
seq file to the last position.
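Roughly like this (hypothetical, since seq_file has no rollback API
today; m->count is the current write position in the seq_file buffer
and "retry" stands for redoing the vma lookup):

	size_t saved_count = m->count;	/* remember the buffer position */

	show_map_vma(m, vma);		/* speculative, possibly torn output */

	if (mmap_lock_speculate_retry(mm, seq)) {
		m->count = saved_count;	/* discard the torn output */
		goto retry;		/* re-lookup the vma and re-emit */
	}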
> + ret = mmap_read_lock_killable(priv->mm);
> + if (!ret) {
> + /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> + mmap_read_unlock(priv->mm);
> + ret = -EAGAIN;
> + }
> +
> + rcu_read_lock();
> +
> + anon_vma_name_put_if_valid(copy);
> +put_file:
> + if (copy->vm_file)
> + fput(copy->vm_file);
> +out:
> + return ret;
> +}
> +
> +static void put_vma_snapshot(struct proc_maps_private *priv)
> +{
> + struct vm_area_struct *vma = &priv->vma_copy;
> +
> + anon_vma_name_put_if_valid(vma);
> + if (vma->vm_file)
> + fput(vma->vm_file);
> +}
> +
[...]
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-04-18 17:49 ` [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl " Suren Baghdasaryan
@ 2025-04-22 22:54 ` Andrii Nakryiko
2025-04-23 21:53 ` Suren Baghdasaryan
2025-04-29 15:55 ` Jann Horn
0 siblings, 2 replies; 39+ messages in thread
From: Andrii Nakryiko @ 2025-04-22 22:54 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Utilize speculative vma lookup to find and snapshot a vma without
> taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> address space modifications are detected and the lookup is retried.
> While we take the mmap_lock for reading during such contention, we do
> so only momentarily, to record the new mm_wr_seq counter.
PROCMAP_QUERY is an even more obvious candidate for fully lockless
speculation, IMO (because it's more obvious that the vma's use is
localized to do_procmap_query(), instead of being spread across
m_start/m_next and m_show as with the seq_file approach). We do
rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
mmap_read_lock), use that VMA to produce (speculative) output, and
then validate that VMA or mm_struct didn't change with
mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
No need for vma_copy and any gets/puts, no?
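I.e., something along these lines (a sketch; it assumes every vma field
used for the output stays safe to read, if possibly stale, under RCU,
and reuses query_vma_find_by_addr() and karg from this patch):

	retry:
		rcu_read_lock();
		if (!mmap_lock_speculate_try_begin(mm, &seq)) {
			/* writer active: spin or briefly take mmap_read_lock */
			rcu_read_unlock();
			goto retry;
		}
		vma = query_vma_find_by_addr(priv, karg.query_addr);
		/* ... fill karg from vma, speculatively ... */
		rcu_read_unlock();
		if (mmap_lock_speculate_retry(mm, seq))
			goto retry;	/* mm changed underneath us, redo */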
> This change is designed to reduce mmap_lock contention and prevent
> PROCMAP_QUERY ioctl calls (often a low priority task, such as
> monitoring/data collection services) from blocking address space
> updates.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 55 insertions(+), 8 deletions(-)
>
[...]
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-22 22:48 ` Andrii Nakryiko
@ 2025-04-23 21:49 ` Suren Baghdasaryan
2025-04-23 22:06 ` Andrii Nakryiko
0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23 21:49 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > With maple_tree supporting vma tree traversal under RCU and vma and
> > its important members being RCU-safe, /proc/pid/maps can be read under
> > RCU and without the need to read-lock mmap_lock. However vma content
> > can change from under us, therefore we make a copy of the vma and we
> > pin pointer fields used when generating the output (currently only
> > vm_file and anon_name). Afterwards we check for concurrent address
> > space modifications, wait for them to end and retry. While we take
> > the mmap_lock for reading during such contention, we do so only
> > momentarily, to record the new mm_wr_seq counter. This change is designed
> > to reduce
>
> This is probably a stupid question, but why do we need to take a lock
> just to record this counter? uprobes get away without taking mmap_lock
> even for reads, and still record this seq counter. And then detect
> whether there were any modifications in between. Why does this change
> need more heavy-weight mmap_read_lock to do speculative reads?
Not a stupid question. mmap_read_lock() is used to wait for the writer
to finish what it's doing, and then we continue by recording a new
sequence counter value and calling mmap_read_unlock(). This is what
get_vma_snapshot() does. But your question made me realize that we can
optimize m_start() further by not taking mmap_read_lock at all.
Instead of taking mmap_read_lock and then doing drop_mmap_lock(), we can
try mmap_lock_speculate_try_begin() and, only if it fails, do the same
dance we do in get_vma_snapshot(). I think that should work.
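I.e., m_start() could then do roughly what query_vma_setup() does in
patch 8/8 (untested sketch):

	/* fast path: no writer active, skip mmap_read_lock entirely */
	if (!mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq)) {
		/* slow path: wait for the writer, then re-arm the seqcount */
		int ret = mmap_read_lock_killable(priv->mm);

		if (ret)
			return ERR_PTR(ret);
		/* succeeds because we are holding mmap_read_lock */
		mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
		mmap_read_unlock(priv->mm);
	}
	rcu_read_lock();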
>
> > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > (often a low priority task, such as monitoring/data collection services)
> > from blocking address space updates.
> > Note that this change has a userspace visible disadvantage: it allows
> > for sub-page data tearing as opposed to the previous mechanism where
> > data tearing could happen only between pages of generated output data.
> > Since current userspace considers data tearing between pages to be
> > acceptable, we assume it will be able to handle sub-page data tearing
> > as well.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > fs/proc/internal.h | 6 ++
> > fs/proc/task_mmu.c | 170 ++++++++++++++++++++++++++++++++++----
> > include/linux/mm_inline.h | 18 ++++
> > 3 files changed, 177 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 96122e91c645..6e1169c1f4df 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > struct task_struct *task;
> > struct mm_struct *mm;
> > struct vma_iterator iter;
> > + bool mmap_locked;
> > + loff_t last_pos;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + unsigned int mm_wr_seq;
> > + struct vm_area_struct vma_copy;
> > +#endif
> > #ifdef CONFIG_NUMA
> > struct mempolicy *task_mempolicy;
> > #endif
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b9e4fbbdf6e6..f9d50a61167c 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > }
> > #endif
> >
> > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > - loff_t *ppos)
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static const struct seq_operations proc_pid_maps_op;
> > +
> > +/*
> > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > + * show_map_vma.
> > + */
> > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > +{
> > + struct vm_area_struct *copy = &priv->vma_copy;
> > + int ret = -EAGAIN;
> > +
> > + memcpy(copy, vma, sizeof(*vma));
> > + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > + goto out;
> > +
> > + if (!anon_vma_name_get_if_valid(copy))
> > + goto put_file;
>
> Given vm_file and anon_vma_name are both RCU-protected, if we take
> rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
> even need getting/putting them, no?
Yeah, anon_vma_name indeed looks safe without pinning, but vm_file is
allocated from a SLAB_TYPESAFE_BY_RCU cache, so the pointer might still
be valid yet point to a different object even though the RCU grace
period has not passed. With my assumption that seq_file can't roll back
once show_map() is done, I would need a completely stable vma at the
time show_map() is executed, so that it does not change from under us
while we are generating the output.
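That's why get_vma_snapshot() pins first and validates afterwards,
roughly (simplified from the patch, anon_name handling omitted):

	/*
	 * Under rcu_read_lock() a SLAB_TYPESAFE_BY_RCU object is never
	 * returned to the page allocator, but its memory can be reused
	 * for a different struct file without waiting for a grace period.
	 */
	memcpy(copy, vma, sizeof(*vma));
	if (copy->vm_file && !get_file_rcu(&copy->vm_file))
		return -EAGAIN;		/* refcount already hit zero, retry */
	/* the reference pins *a* file; recheck it is still the right one */
	if (mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
		return -EAGAIN;		/* mm changed, the snapshot may be stale */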
OTOH, if we can indeed roll back while generating seq_file output, then
show_map() could output a potentially invalid vma, check for vma changes
afterwards and, when detected, roll back the seq_file and retry.
>
> I feel like I'm missing some important limitations, but I don't think
> they are spelled out explicitly...
>
> > +
> > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > + return 0;
> > +
> > + /* Address space got modified, vma might be stale. Re-lock and retry. */
> > + rcu_read_unlock();
>
> Another question I have is whether we really need to copy the vma into
> priv->vma_copy to have a stable snapshot? Can't we just speculate like
> with uprobes, under the assumption that the data doesn't change, and
> once we are done producing text output, confirm that the speculation
> was successful, and if not - retry?
>
> We'd need an API for seq_file to roll back to the previous known-good
> location for that, but that seems straightforward enough to do, no?
> Just remember the buffer position before speculation, write the data,
> check for mm modifications, and if something changed, roll back the
> seq file to the last position.
From looking at seq_read_iter() I think for a rollback we would have
to reset seq_file.count and seq_file.index to their previous states.
At this location:
https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272, if
show_map() returns a negative value, m->count will indeed be rolled back,
but not seq_file.index. Also, returning a negative value at
https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230
would be interpreted as a hard error... So, I'll need to spend some
time in this code to get the answer about rollback.
Thanks for the review!
>
>
> > + ret = mmap_read_lock_killable(priv->mm);
> > + if (!ret) {
> > + /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > + mmap_read_unlock(priv->mm);
> > + ret = -EAGAIN;
> > + }
> > +
> > + rcu_read_lock();
> > +
> > + anon_vma_name_put_if_valid(copy);
> > +put_file:
> > + if (copy->vm_file)
> > + fput(copy->vm_file);
> > +out:
> > + return ret;
> > +}
> > +
> > +static void put_vma_snapshot(struct proc_maps_private *priv)
> > +{
> > + struct vm_area_struct *vma = &priv->vma_copy;
> > +
> > + anon_vma_name_put_if_valid(vma);
> > + if (vma->vm_file)
> > + fput(vma->vm_file);
> > +}
> > +
>
> [...]
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-04-22 22:54 ` Andrii Nakryiko
@ 2025-04-23 21:53 ` Suren Baghdasaryan
2025-04-29 15:55 ` Jann Horn
1 sibling, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23 21:53 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Tue, Apr 22, 2025 at 3:54 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Utilize speculative vma lookup to find and snapshot a vma without
> > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > address space modifications are detected and the lookup is retried.
> > While we take the mmap_lock for reading during such contention, we do
> > so only momentarily, to record the new mm_wr_seq counter.
>
> PROCMAP_QUERY is an even more obvious candidate for fully lockless
> speculation, IMO (because it's more obvious that the vma's use is
> localized to do_procmap_query(), instead of being spread across
> m_start/m_next and m_show as with the seq_file approach). We do
> rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> mmap_read_lock), use that VMA to produce (speculative) output, and
> then validate that VMA or mm_struct didn't change with
> mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> No need for vma_copy and any gets/puts, no?
Yeah, since we can simply retry, this should indeed work without
trying to stabilize the vma. I'll update the code to simplify this.
Thanks!
>
> > This change is designed to reduce mmap_lock contention and prevent
> > PROCMAP_QUERY ioctl calls (often a low priority task, such as
> > monitoring/data collection services) from blocking address space
> > updates.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 55 insertions(+), 8 deletions(-)
> >
>
> [...]
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-23 21:49 ` Suren Baghdasaryan
@ 2025-04-23 22:06 ` Andrii Nakryiko
2025-04-24 0:23 ` Liam R. Howlett
0 siblings, 1 reply; 39+ messages in thread
From: Andrii Nakryiko @ 2025-04-23 22:06 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > RCU and without the need to read-lock mmap_lock. However vma content
> > > can change from under us, therefore we make a copy of the vma and we
> > > pin pointer fields used when generating the output (currently only
> > > vm_file and anon_name). Afterwards we check for concurrent address
> > > space modifications, wait for them to end and retry. While we take
> > > the mmap_lock for reading during such contention, we do that momentarily
> > > only to record new mm_wr_seq counter. This change is designed to reduce
> >
> > This is probably a stupid question, but why do we need to take a lock
> > just to record this counter? uprobes get away without taking mmap_lock
> > even for reads, and still record this seq counter. And then detect
> > whether there were any modifications in between. Why does this change
> > need more heavy-weight mmap_read_lock to do speculative reads?
>
> Not a stupid question. mmap_read_lock() is used to wait for the writer
> to finish what it's doing and then we continue by recording a new
> sequence counter value and call mmap_read_unlock. This is what
> get_vma_snapshot() does. But your question made me realize that we can
> optimize m_start() further by not taking mmap_read_lock at all.
> Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> try mmap_lock_speculate_try_begin() and only if it fails do the same
> dance we do in the get_vma_snapshot(). I think that should work.
Ok, yeah, it would be great to avoid taking a lock in a common case!
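Right, so m_start() could look something like this (sketch of your idea,
reusing mm_wr_seq from the patch):

/* Sketch only: try to speculate without any lock, fall back briefly. */
static int begin_speculation(struct proc_maps_private *priv)
{
	/* common case: no writer active, no lock taken at all */
	if (mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq))
		return 0;

	/* a writer is active: briefly take the lock to wait it out */
	if (mmap_read_lock_killable(priv->mm))
		return -EINTR;
	/* cannot fail while we hold mmap_read_lock */
	mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
	mmap_read_unlock(priv->mm);
	return 0;
}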
>
> >
> > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > (often a low priority task, such as monitoring/data collection services)
> > > from blocking address space updates.
> > > Note that this change has a userspace visible disadvantage: it allows
> > > for sub-page data tearing as opposed to the previous mechanism where
> > > data tearing could happen only between pages of generated output data.
> > > Since current userspace considers data tearing between pages to be
> > > acceptable, we assume it will be able to handle sub-page data tearing
> > > as well.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > fs/proc/internal.h | 6 ++
> > > fs/proc/task_mmu.c | 170 ++++++++++++++++++++++++++++++++++----
> > > include/linux/mm_inline.h | 18 ++++
> > > 3 files changed, 177 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > index 96122e91c645..6e1169c1f4df 100644
> > > --- a/fs/proc/internal.h
> > > +++ b/fs/proc/internal.h
> > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > > struct task_struct *task;
> > > struct mm_struct *mm;
> > > struct vma_iterator iter;
> > > + bool mmap_locked;
> > > + loff_t last_pos;
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > + unsigned int mm_wr_seq;
> > > + struct vm_area_struct vma_copy;
> > > +#endif
> > > #ifdef CONFIG_NUMA
> > > struct mempolicy *task_mempolicy;
> > > #endif
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > > }
> > > #endif
> > >
> > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > > - loff_t *ppos)
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +
> > > +static const struct seq_operations proc_pid_maps_op;
> > > +
> > > +/*
> > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > + * show_map_vma.
> > > + */
> > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > +{
> > > + struct vm_area_struct *copy = &priv->vma_copy;
> > > + int ret = -EAGAIN;
> > > +
> > > + memcpy(copy, vma, sizeof(*vma));
> > > + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > + goto out;
> > > +
> > > + if (!anon_vma_name_get_if_valid(copy))
> > > + goto put_file;
> >
> > Given vm_file and anon_vma_name are both RCU-protected, if we take
> > rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
> > even need getting/putting them, no?
>
> Yeah, anon_vma_name indeed looks safe without pinning but vm_file is
> using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer
> but pointing to a wrong object even if the rcu grace period did not
> pass. With my assumption that seq_file can't rollback once show_map()
> is done, I would need a completely stable vma at the time show_map()
> is executed so that it does not change from under us while we are
> generating the output.
> OTOH, if we indeed can rollback while generating seq_file output then
> show_map() could output potentially invalid vma, then check for vma
> changes and when detected, rollback seq_file and retry again.
>
> >
> > I feel like I'm missing some important limitations, but I don't think
> > they are spelled out explicitly...
> >
> > > +
> > > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > > + return 0;
> > > +
> > > + /* Address space got modified, vma might be stale. Re-lock and retry. */
> > > + rcu_read_unlock();
> >
> > Another question I have is whether we really need to copy vma into
> > priv->vma_copy to have a stable snapshot? Can't we just speculate like
> > with uprobes under assumption that data doesn't change. And once we
> > are done producing text output, confirm that speculation was
> > successful, and if not - retry?
> >
> > We'd need an API for seq_file to rollback to previous good known
> > location for that, but that seems straightforward enough to do, no?
> > Just remember buffer position before speculation, write data, check
> > for no mm modifications, and if something changed, rollback seq file
> > to last position.
>
> From looking at seq_read_iter() I think for a rollback we would have
> to reset seq_file.count and seq_file.index to their previous states.
> At this location:
> https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L272 if
> show_map() returns a negative value, m->count will indeed be rolled back
> but not seq_file.index. Also returning a negative value at
> https://elixir.bootlin.com/linux/v6.14.3/source/fs/seq_file.c#L230
> would be interpreted as a hard error... So, I'll need to spend some
> time in this code to get the answer about rollback.
> Thanks for the review!
Yeah, seq_file is a glorified wrapper around a memory buffer,
essentially. And at the lowest level, this transaction-like API would
basically just return seq->count before we start writing anything, and
rollback will just accept a new count to set seq->count to, if we need
to roll back.
Logistically this all needs to be factored into the whole seq_file
callbacks thing, of course, especially if "transaction" will be
started in m_start/m_next, while it can be "aborted" in m_show... So
that's what would need careful consideration.
But you can end up with faster and cleaner implementation, as we
discussed above, so worth giving it a try, IMO.
>
> >
> >
> > > + ret = mmap_read_lock_killable(priv->mm);
> > > + if (!ret) {
> > > + /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> > > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > > + mmap_read_unlock(priv->mm);
> > > + ret = -EAGAIN;
> > > + }
> > > +
> > > + rcu_read_lock();
> > > +
> > > + anon_vma_name_put_if_valid(copy);
> > > +put_file:
> > > + if (copy->vm_file)
> > > + fput(copy->vm_file);
> > > +out:
> > > + return ret;
> > > +}
> > > +
> > > +static void put_vma_snapshot(struct proc_maps_private *priv)
> > > +{
> > > + struct vm_area_struct *vma = &priv->vma_copy;
> > > +
> > > + anon_vma_name_put_if_valid(vma);
> > > + if (vma->vm_file)
> > > + fput(vma->vm_file);
> > > +}
> > > +
> >
> > [...]
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-23 22:06 ` Andrii Nakryiko
@ 2025-04-24 0:23 ` Liam R. Howlett
2025-04-24 15:20 ` Suren Baghdasaryan
0 siblings, 1 reply; 39+ messages in thread
From: Liam R. Howlett @ 2025-04-24 0:23 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Suren Baghdasaryan, akpm, lorenzo.stoakes, david, vbabka, peterx,
jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [250423 18:06]:
> On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > can change from under us, therefore we make a copy of the vma and we
> > > > pin pointer fields used when generating the output (currently only
> > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > space modifications, wait for them to end and retry. While we take
> > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > >
> > > This is probably a stupid question, but why do we need to take a lock
> > > just to record this counter? uprobes get away without taking mmap_lock
> > > even for reads, and still record this seq counter. And then detect
> > > whether there were any modifications in between. Why does this change
> > > need more heavy-weight mmap_read_lock to do speculative reads?
> >
> > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > to finish what it's doing and then we continue by recording a new
> > sequence counter value and call mmap_read_unlock. This is what
> > get_vma_snapshot() does. But your question made me realize that we can
> > optimize m_start() further by not taking mmap_read_lock at all.
> > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > dance we do in the get_vma_snapshot(). I think that should work.
>
> Ok, yeah, it would be great to avoid taking a lock in a common case!
We can check this counter once per 4k block and maintain the same
'tearing' that exists today instead of per-vma. Not that anyone said
they had an issue with changing it, but since we're on this road anyways
I'd thought I'd point out where we could end up.
I am concerned about live locking in either scenario, but I haven't
looked too deep into this pattern.
I also don't love (as usual) the lack of ensured forward progress.
It seems like we have four cases for the vm area state now:
1. we want to read a stable vma or set of vmas (per-vma locking)
2. we want to read a stable mm state for reading (the very short named
mmap_lock_speculate_try_begin)
3. we ensure a stable vma/mm state for reading (mmap read lock)
4. we are writing - get out of my way (mmap write lock).
Cheers,
Liam
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-24 0:23 ` Liam R. Howlett
@ 2025-04-24 15:20 ` Suren Baghdasaryan
2025-04-24 16:03 ` Andrii Nakryiko
0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-24 15:20 UTC (permalink / raw)
To: Liam R. Howlett, Andrii Nakryiko, Suren Baghdasaryan, akpm,
lorenzo.stoakes, david, vbabka, peterx, jannh, hannes, mhocko,
paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
osalvador, andrii, ryan.roberts, christophe.leroy, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [250423 18:06]:
> > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > pin pointer fields used when generating the output (currently only
> > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > space modifications, wait for them to end and retry. While we take
> > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > >
> > > > This is probably a stupid question, but why do we need to take a lock
> > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > even for reads, and still record this seq counter. And then detect
> > > > whether there were any modifications in between. Why does this change
> > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > >
> > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > to finish what it's doing and then we continue by recording a new
> > > sequence counter value and call mmap_read_unlock. This is what
> > > get_vma_snapshot() does. But your question made me realize that we can
> > > optimize m_start() further by not taking mmap_read_lock at all.
> > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > dance we do in the get_vma_snapshot(). I think that should work.
> >
> > Ok, yeah, it would be great to avoid taking a lock in a common case!
>
> We can check this counter once per 4k block and maintain the same
> 'tearing' that exists today instead of per-vma. Not that anyone said
> they had an issue with changing it, but since we're on this road anyways
> I'd thought I'd point out where we could end up.
We would need to run that check on the last call to show_map() right
before seq_file detects the overflow and flushes the page. On
contention we will also be throwing away more prepared data (up to a
page worth of records) vs only the last record. All in all I'm not
convinced this is worth doing unless increased chances of data tearing
are identified as a problem.
>
> I am concerned about live locking in either scenario, but I haven't
> looked too deep into this pattern.
>
> I also don't love (as usual) the lack of ensured forward progress.
Hmm. Maybe we should add a retry limit on
mmap_lock_speculate_try_begin() and once the limit is hit we just take
the mmap_read_lock and proceed with it? That would prevent a
hyperactive writer from blocking the reader's forward progress
indefinitely.
>
> It seems like we have four cases for the vm area state now:
> 1. we want to read a stable vma or set of vmas (per-vma locking)
> 2. we want to read a stable mm state for reading (the very short named
> mmap_lock_speculate_try_begin)
and we don't mind retrying on contention. This one should be done
under RCU protection.
> 3. we ensure a stable vma/mm state for reading (mmap read lock)
> 4. we are writing - get out of my way (mmap write lock).
I wouldn't call #2 a vma state. More of a usecase when we want to read
vma under RCU (valid but can change from under us) and then retry if
it might have been modified from under us.
>
> Cheers,
> Liam
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-24 15:20 ` Suren Baghdasaryan
@ 2025-04-24 16:03 ` Andrii Nakryiko
2025-04-24 16:42 ` Liam R. Howlett
0 siblings, 1 reply; 39+ messages in thread
From: Andrii Nakryiko @ 2025-04-24 16:03 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Liam R. Howlett, akpm, lorenzo.stoakes, david, vbabka, peterx,
jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [250423 18:06]:
> > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > pin pointer fields used when generating the output (currently only
> > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > >
> > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > even for reads, and still record this seq counter. And then detect
> > > > > whether there were any modifications in between. Why does this change
> > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > >
> > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > to finish what it's doing and then we continue by recording a new
> > > > sequence counter value and call mmap_read_unlock. This is what
> > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > dance we do in the get_vma_snapshot(). I think that should work.
> > >
> > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> >
> > We can check this counter once per 4k block and maintain the same
> > 'tearing' that exists today instead of per-vma. Not that anyone said
> > they had an issue with changing it, but since we're on this road anyways
> > I'd thought I'd point out where we could end up.
>
> We would need to run that check on the last call to show_map() right
> before seq_file detects the overflow and flushes the page. On
> contention we will also be throwing away more prepared data (up to a
> page worth of records) vs only the last record. All in all I'm not
> convinced this is worth doing unless increased chances of data tearing
> are identified as a problem.
>
Yep, I agree, with filling out 4K of data we run into much higher
chances of conflict, IMO. Not worth it, I'd say.
> >
> > I am concerned about live locking in either scenario, but I haven't
> > looked too deep into this pattern.
> >
> > I also don't love (as usual) the lack of ensured forward progress.
>
> Hmm. Maybe we should add a retry limit on
> mmap_lock_speculate_try_begin() and once the limit is hit we just take
> the mmap_read_lock and proceed with it? That would prevent a
> hyperactive writer from blocking the reader's forward progress
> indefinitely.
Came here to say the same. I'd add a small number of retries (3-5?)
and then fall back to the read-locked approach. The main challenge is
to keep all this logic nicely isolated from the main VMA
search/printing logic.
For a similar pattern in uprobes, we don't even bother to retry, we
just fall back to mmap_read_lock and proceed, under the assumption that
this is going to be very rare and thus not important from the overall
performance perspective.
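Roughly something like this (names and the retry count are made up;
with zero loop iterations this degenerates to the uprobes-style
immediate fallback):

#define SPECULATE_MAX_RETRIES	4

static void output_vmas(struct mm_struct *mm)
{
	unsigned int seq;
	int i;

	for (i = 0; i < SPECULATE_MAX_RETRIES; i++) {
		rcu_read_lock();
		if (mmap_lock_speculate_try_begin(mm, &seq)) {
			emit_output_speculative(mm);	/* hypothetical helper */
			if (!mmap_lock_speculate_retry(mm, seq)) {
				rcu_read_unlock();
				return;		/* output is consistent, done */
			}
		}
		rcu_read_unlock();
	}

	/* limit hit: take the lock so forward progress is guaranteed */
	mmap_read_lock(mm);
	emit_output_locked(mm);				/* hypothetical helper */
	mmap_read_unlock(mm);
}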
>
> >
> > It seems like we have four cases for the vm area state now:
> > 1. we want to read a stable vma or set of vmas (per-vma locking)
> > 2. we want to read a stable mm state for reading (the very short named
> > mmap_lock_speculate_try_begin)
>
> and we don't mind retrying on contention. This one should be done
> under RCU protection.
>
> > 3. we ensure a stable vma/mm state for reading (mmap read lock)
> > 4. we are writing - get out of my way (mmap write lock).
>
> I wouldn't call #2 a vma state. More of a usecase when we want to read
> vma under RCU (valid but can change from under us) and then retry if
> it might have been modified from under us.
>
> >
> > Cheers,
> > Liam
> >
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-24 16:03 ` Andrii Nakryiko
@ 2025-04-24 16:42 ` Liam R. Howlett
2025-04-24 17:38 ` Suren Baghdasaryan
2025-04-24 17:44 ` Andrii Nakryiko
0 siblings, 2 replies; 39+ messages in thread
From: Liam R. Howlett @ 2025-04-24 16:42 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Suren Baghdasaryan, akpm, lorenzo.stoakes, david, vbabka, peterx,
jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
yebin10, linux, willy, osalvador, andrii, ryan.roberts,
christophe.leroy, linux-kernel, linux-fsdevel, linux-mm,
linux-kselftest
* Andrii Nakryiko <andrii.nakryiko@gmail.com> [250424 12:04]:
> On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [250423 18:06]:
> > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > >
> > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > >
> > > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > whether there were any modifications in between. Why does this change
> > > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > > >
> > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > > to finish what it's doing and then we continue by recording a new
> > > > > sequence counter value and call mmap_read_unlock. This is what
> > > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > > dance we do in the get_vma_snapshot(). I think that should work.
> > > >
> > > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> > >
> > > We can check this counter once per 4k block and maintain the same
> > > 'tearing' that exists today instead of per-vma. Not that anyone said
> > > they had an issue with changing it, but since we're on this road anyways
> > > I'd thought I'd point out where we could end up.
> >
> > We would need to run that check on the last call to show_map() right
> > before seq_file detects the overflow and flushes the page. On
> > contention we will also be throwing away more prepared data (up to a
> > page worth of records) vs only the last record. All in all I'm not
> > convinced this is worth doing unless increased chances of data tearing
> > are identified as a problem.
> >
>
> Yep, I agree, with filling out 4K of data we run into much higher
> chances of conflict, IMO. Not worth it, I'd say.
Sounds good.
If this is an issue we do have a path forward still. Although it's less
desirable.
>
> > >
> > > I am concerned about live locking in either scenario, but I haven't
> > > looked too deep into this pattern.
> > >
> > > I also don't love (as usual) the lack of ensured forward progress.
> >
> > Hmm. Maybe we should add a retry limit on
> > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > the mmap_read_lock and proceed with it? That would prevent a
> > hyperactive writer from blocking the reader's forward progress
> > indefinitely.
>
> Came here to say the same. I'd add a small number of retries (3-5?)
> and then fall back to the read-locked approach. The main challenge is
> to keep all this logic nicely isolated from the main VMA
> search/printing logic.
>
> For a similar pattern in uprobes, we don't even bother to retry, we
> just fall back to mmap_read_lock and proceed, under the assumption that
> this is going to be very rare and thus not important from the overall
> performance perspective.
In this problem space we are dealing with a herd of readers caused by
writers delaying an ever-growing line of readers, right?
Assuming there is a backup caused by a writer, then I don't know if the
retry is going to do anything more than heat the data centre.
The readers that take the read lock will get the data, while the others
who arrive while the read lock is held can try lockless, but will most
likely have a run time that extends beyond the readers holding the lock
and will probably be interrupted by the writer.
We can predict the new readers will also not make it through in time
because the earlier ones failed. The new readers will then take the
lock and grow the line of readers.
Does that make sense?
Thanks,
Liam
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-24 16:42 ` Liam R. Howlett
@ 2025-04-24 17:38 ` Suren Baghdasaryan
2025-04-24 17:44 ` Andrii Nakryiko
1 sibling, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-24 17:38 UTC (permalink / raw)
To: Liam R. Howlett, Andrii Nakryiko, Suren Baghdasaryan, akpm,
lorenzo.stoakes, david, vbabka, peterx, jannh, hannes, mhocko,
paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
osalvador, andrii, ryan.roberts, christophe.leroy, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Thu, Apr 24, 2025 at 9:42 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [250424 12:04]:
> > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [250423 18:06]:
> > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > > >
> > > > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > > whether there were any modifications in between. Why does this change
> > > > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > > > >
> > > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > > > to finish what it's doing and then we continue by recording a new
> > > > > > sequence counter value and call mmap_read_unlock. This is what
> > > > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > > > dance we do in the get_vma_snapshot(). I think that should work.
> > > > >
> > > > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> > > >
> > > > We can check this counter once per 4k block and maintain the same
> > > > 'tearing' that exists today instead of per-vma. Not that anyone said
> > > > they had an issue with changing it, but since we're on this road anyways
> > > > I'd thought I'd point out where we could end up.
> > >
> > > We would need to run that check on the last call to show_map() right
> > > before seq_file detects the overflow and flushes the page. On
> > > contention we will also be throwing away more prepared data (up to a
> > > page worth of records) vs only the last record. All in all I'm not
> > > convinced this is worth doing unless increased chances of data tearing
> > > are identified as a problem.
> > >
> >
> > Yep, I agree, with filling out 4K of data we run into much higher
> > chances of conflict, IMO. Not worth it, I'd say.
>
> Sounds good.
>
> If this is an issue we do have a path forward still. Although it's less
> desirable.
>
> >
> > > >
> > > > I am concerned about live locking in either scenario, but I haven't
> > > > looked too deep into this pattern.
> > > >
> > > > I also don't love (as usual) the lack of ensured forward progress.
> > >
> > > Hmm. Maybe we should add a retry limit on
> > > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > > the mmap_read_lock and proceed with it? That would prevent a
> > > hyperactive writer from blocking the reader's forward progress
> > > indefinitely.
> >
> > Came here to say the same. I'd add a small number of retries (3-5?)
> > and then fall back to the read-locked approach. The main challenge is
> > to keep all this logic nicely isolated from the main VMA
> > search/printing logic.
> >
> > For a similar pattern in uprobes, we don't even bother to retry, we
> > just fall back to mmap_read_lock and proceed, under the assumption that
> > this is going to be very rare and thus not important from the overall
> > performance perspective.
>
> In this problem space we are dealing with a herd of readers caused by
> writers delaying an ever-growing line of readers, right?
I don't know if we have a herd of readers. The problem statement was
for a low-priority reader (monitors) blocking a high-priority writer.
Multiple readers are of course possible, so I think as long as we can
guarantee forward progress we should be ok. Is that reasonable?
>
> Assuming there is a backup caused by a writer, then I don't know if the
> retry is going to do anything more than heat the data centre.
>
> The readers that take the read lock will get the data, while the others
> who arrive while the read lock is held can try lockless, but will most
> likely have a run time that extends beyond the readers holding the lock
> and will probably be interrupted by the writer.
>
> We can predict the new readers will also not make it through in time
> because the earlier ones failed. The new readers will then take the
> lock and grow the line of readers.
>
> Does that make sense?
Yeah. I guess we could guarantee forward progress if the readers would
take mmap_read_lock on contention and produce one page worth of output
under that lock before dropping it and continuing with speculation
again. If contention happens again it grabs the mmap_read_lock and
repeats the dance. IOW, we start with speculation, on contention we
grab the lock to produce one page and go back to speculation. Repeat
until all vmas are reported. OTOH I guess Paul's benchmark would show
some regression though... Can't have everything :)
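In sketch form (the page-granular helpers are purely hypothetical):

/* Sketch: speculate by default, but after a failed speculation emit one
 * page under mmap_read_lock so the reader always makes progress. */
while (have_more_vmas(priv)) {			/* hypothetical helper */
	if (emit_page_speculative(priv))	/* hypothetical helper */
		continue;
	mmap_read_lock(priv->mm);
	emit_page_locked(priv);			/* hypothetical helper */
	mmap_read_unlock(priv->mm);
}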
>
> Thanks,
> Liam
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-24 16:42 ` Liam R. Howlett
2025-04-24 17:38 ` Suren Baghdasaryan
@ 2025-04-24 17:44 ` Andrii Nakryiko
1 sibling, 0 replies; 39+ messages in thread
From: Andrii Nakryiko @ 2025-04-24 17:44 UTC (permalink / raw)
To: Liam R. Howlett, Andrii Nakryiko, Suren Baghdasaryan, akpm,
lorenzo.stoakes, david, vbabka, peterx, jannh, hannes, mhocko,
paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
osalvador, andrii, ryan.roberts, christophe.leroy, linux-kernel,
linux-fsdevel, linux-mm, linux-kselftest
On Thu, Apr 24, 2025 at 9:42 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrii Nakryiko <andrii.nakryiko@gmail.com> [250424 12:04]:
> > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > * Andrii Nakryiko <andrii.nakryiko@gmail.com> [250423 18:06]:
> > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > > >
> > > > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > > whether there were any modifications in between. Why does this change
> > > > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > > > >
> > > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > > > to finish what it's doing and then we continue by recording a new
> > > > > > sequence counter value and call mmap_read_unlock. This is what
> > > > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > > > dance we do in the get_vma_snapshot(). I think that should work.
> > > > >
> > > > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> > > >
> > > > We can check this counter once per 4k block and maintain the same
> > > > 'tearing' that exists today instead of per-vma. Not that anyone said
> > > > they had an issue with changing it, but since we're on this road anyways
> > > > I'd thought I'd point out where we could end up.
> > >
> > > We would need to run that check on the last call to show_map() right
> > > before seq_file detects the overflow and flushes the page. On
> > > contention we will also be throwing away more prepared data (up to a
> > > page worth of records) vs only the last record. All in all I'm not
> > > convinced this is worth doing unless increased chances of data tearing
> > > are identified as a problem.
> > >
> >
> > Yep, I agree, with filling out 4K of data we run into much higher
> > chances of conflict, IMO. Not worth it, I'd say.
>
> Sounds good.
>
> If this is an issue we do have a path forward still. Although it's less
> desirable.
>
> >
> > > >
> > > > I am concerned about live locking in either scenario, but I haven't
> > > > looked too deep into this pattern.
> > > >
> > > > I also don't love (as usual) the lack of ensured forward progress.
> > >
> > > Hmm. Maybe we should add a retry limit on
> > > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > > the mmap_read_lock and proceed with it? That would prevent a
> > > hyperactive writer from blocking the reader's forward progress
> > > indefinitely.
> >
> > Came here to say the same. I'd add a small number of retries (3-5?)
> > and then fall back to the read-locked approach. The main challenge is
> > to keep all this logic nicely isolated from the main VMA
> > search/printing logic.
> >
> > For a similar pattern in uprobes, we don't even bother to retry, we
> > just fall back to mmap_read_lock and proceed, under the assumption that
> > this is going to be very rare and thus not important from the overall
> > performance perspective.
>
> In this problem space we are dealing with a herd of readers caused by
> writers delaying an ever-growing line of readers, right?
I'm assuming that the common case is that there is no writer: we attempt
a lockless vma read, but then (very rarely) a writer comes in and starts
to change something, disrupting the read. In the uprobe vma lookup
speculation case we don't even attempt a lockless read if there is
an active writer, we just fall back to mmap_read_lock.
So I guess in that case we don't really need many retries. Just check
if there is an active writer, and if so - mmap_read_lock. If there was
no writer, speculate, and when done double-check that nothing changed.
If something changed - retry with mmap_read_lock.
Does that sound more reasonable?
>
> Assuming there is a backup caused by a writer, then I don't know if the
> retry is going to do anything more than heat the data centre.
In this scenario, yes, I agree that retrying isn't useful, because
writer probably is going to be quite a lot slower than fast readers.
So see above, perhaps no retries are needed beyond just the lockless ->
mmap_read_lock retry. Just a quick mmap_lock_speculate_try_begin()
check at the start.
BTW, I realized that me referencing uprobe speculation is done with no
context or code pointers. I'm talking about
find_active_uprobe_speculative() in kernel/events/uprobes.c, if you
are curious.
>
> The readers that take the read lock will get the data, while the others
> who arrive while the read lock is held can try lockless, but will most
> likely have a run time that extends beyond the readers holding the lock
> and will probably be interrupted by the writer.
>
> We can predict the new readers will also not make it through in time
> because the earlier ones failed. The new readers will then take the
> lock and grow the line of readers.
>
> Does that make sense?
I think so, though not 100% sure I got all the points you are raising.
But see above if my thoughts make sense to you :)
>
> Thanks,
> Liam
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-18 17:49 ` [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
2025-04-22 22:48 ` Andrii Nakryiko
@ 2025-04-29 15:39 ` Jann Horn
2025-04-29 17:09 ` Suren Baghdasaryan
1 sibling, 1 reply; 39+ messages in thread
From: Jann Horn @ 2025-04-29 15:39 UTC (permalink / raw)
To: Suren Baghdasaryan, Al Viro, brauner, linux-fsdevel
Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx,
hannes, mhocko, paulmck, shuah, adobriyan, josef, yebin10, linux,
willy, osalvador, andrii, ryan.roberts, christophe.leroy,
linux-kernel, linux-mm, linux-kselftest
On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> With maple_tree supporting vma tree traversal under RCU and vma and
> its important members being RCU-safe, /proc/pid/maps can be read under
> RCU and without the need to read-lock mmap_lock. However vma content
> can change from under us, therefore we make a copy of the vma and we
> pin pointer fields used when generating the output (currently only
> vm_file and anon_name). Afterwards we check for concurrent address
> space modifications, wait for them to end and retry. While we take
> the mmap_lock for reading during such contention, we do that momentarily
> only to record new mm_wr_seq counter. This change is designed to reduce
> mmap_lock contention and prevent a process reading /proc/pid/maps files
> (often a low priority task, such as monitoring/data collection services)
> from blocking address space updates.
[...]
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f9d50a61167c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +/*
> + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> + * show_map_vma.
> + */
> +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> +{
> + struct vm_area_struct *copy = &priv->vma_copy;
> + int ret = -EAGAIN;
> +
> + memcpy(copy, vma, sizeof(*vma));
> + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> + goto out;
I think this uses get_file_rcu() in a different way than intended.
As I understand it, get_file_rcu() is supposed to be called on a
pointer which always points to a file with a non-zero refcount (except
when it is NULL). That's why it takes a file** instead of a file* - if
it observes a zero refcount, it assumes that the pointer must have
been updated in the meantime, and retries. Calling get_file_rcu() on a
pointer that points to a file with zero refcount, which I think can
happen with this patch, will cause an endless loop.
(Just as background: For other usecases, get_file_rcu() is supposed to
still behave nicely and not spuriously return NULL when the file* is
concurrently updated to point to another file*; that's what that loop
is for.)
(If my understanding is correct, maybe we should document that more
explicitly...)
Also, I think you are introducing an implicit assumption that
remove_vma() does not NULL out the ->vm_file pointer (because that
could cause tearing and could theoretically lead to a torn pointer
being accessed here).
One alternative might be to change the paths that drop references to
vma->vm_file (search for vma_close to find them) such that they first
NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
maybe with a new helper like this:
static void vma_fput(struct vm_area_struct *vma)
{
	struct file *file = vma->vm_file;

	if (file) {
		/* clear the pointer before fput() so a lockless reader
		 * never loads a file whose reference was already dropped */
		WRITE_ONCE(vma->vm_file, NULL);
		fput(file);
	}
}
Then on the lockless lookup path you could use get_file_rcu() on the
->vm_file pointer _of the original VMA_, and store the returned file*
into copy->vm_file.
> + if (!anon_vma_name_get_if_valid(copy))
> + goto put_file;
> +
> + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> + return 0;
We only check for concurrent updates at this point, so up to here,
anything we read from "copy" could contain torn pointers (both because
memcpy() is not guaranteed to copy pointers atomically and because the
updates to the original VMA are not done with WRITE_ONCE()).
That probably means that something like the preceding
anon_vma_name_get_if_valid() could crash on an access to a torn
pointer.
Please either do another mmap_lock_speculate_retry() check directly
after the memcpy(), or ensure nothing before this point reads from
"copy".
> + /* Address space got modified, vma might be stale. Re-lock and retry. */
> + rcu_read_unlock();
> + ret = mmap_read_lock_killable(priv->mm);
> + if (!ret) {
> + /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> + mmap_read_unlock(priv->mm);
> + ret = -EAGAIN;
> + }
> +
> + rcu_read_lock();
> +
> + anon_vma_name_put_if_valid(copy);
> +put_file:
> + if (copy->vm_file)
> + fput(copy->vm_file);
> +out:
> + return ret;
> +}
[...]
> @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
> } else {
> *path = file_user_path(vma->vm_file);
> }
> - return;
> + goto out;
> }
>
> if (vma->vm_ops && vma->vm_ops->name) {
> *name = vma->vm_ops->name(vma);
This seems to me like a big, subtle change of semantics. After this
change, vm_ops->name() will no longer receive a real VMA; and in
particular, I think the .name implementation special_mapping_name used
in special_mapping_vmops will have a UAF because it relies on
vma->vm_private_data pointing to a live object.
I think you'll need to fall back to using the mmap lock and the real
VMA if you see a non-NULL vma->vm_ops->name pointer.
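E.g., at snapshot time, something like (sketch):

	/* refuse to speculate on VMAs with a ->name() callback */
	if (vma->vm_ops && vma->vm_ops->name)
		return -EBUSY;	/* hypothetical: caller uses the mmap_read_lock path */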
> if (*name)
> - return;
> + goto out;
> }
>
> *name = arch_vma_name(vma);
> if (*name)
> - return;
> + goto out;
>
> if (!vma->vm_mm) {
> *name = "[vdso]";
> - return;
> + goto out;
> }
>
> if (vma_is_initial_heap(vma)) {
> *name = "[heap]";
> - return;
> + goto out;
> }
>
> if (vma_is_initial_stack(vma)) {
> *name = "[stack]";
> - return;
> + goto out;
> }
>
> if (anon_name) {
> *name_fmt = "[anon:%s]";
> *name = anon_name->name;
> - return;
> }
> +out:
> + if (anon_name && !mmap_locked)
> + anon_vma_name_put(anon_name);
Isn't this refcount drop too early, causing UAF read? We drop the
reference on the anon_name here, but (on some paths) we're about to
return anon_name->name to the caller through *name, and the caller
will read from it.
Ah, but I guess it's actually fine because the refcount increment was
unnecessary in the first place, because the vma pointer actually
points to a copy of the original VMA, and the copy has its own
refcounted reference to the anon_name thanks to get_vma_snapshot()?
It might be helpful to have some comments documenting which VMA
pointers can point to copies.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-04-22 22:54 ` Andrii Nakryiko
2025-04-23 21:53 ` Suren Baghdasaryan
@ 2025-04-29 15:55 ` Jann Horn
2025-04-29 17:14 ` Suren Baghdasaryan
1 sibling, 1 reply; 39+ messages in thread
From: Jann Horn @ 2025-04-29 15:55 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Suren Baghdasaryan, akpm, Liam.Howlett, lorenzo.stoakes, david,
vbabka, peterx, hannes, mhocko, paulmck, shuah, adobriyan,
brauner, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-fsdevel,
linux-mm, linux-kselftest
On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > Utilize speculative vma lookup to find and snapshot a vma without
> > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > address space modifications are detected and the lookup is retried.
> > While we take the mmap_lock for reading during such contention, we
> > do that momentarily only to record new mm_wr_seq counter.
>
> PROCMAP_QUERY is an even more obvious candidate for fully lockless
> speculation, IMO (because it's more obvious that vma's use is
> localized to do_procmap_query(), instead of being spread across
> m_start/m_next and m_show as with seq_file approach). We do
> rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> mmap_read_lock), use that VMA to produce (speculative) output, and
> then validate that VMA or mm_struct didn't change with
> mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> No need for vma_copy and any gets/puts, no?
I really strongly dislike this "fully lockless" approach because it
means we get data races all over the place, and it gets hard to reason
about what happens especially if we do anything other than reading
plain data from the VMA. When reading the implementation of
do_procmap_query(), at basically every memory read you'd have to think
twice as hard to figure out which fields can be concurrently updated
elsewhere and whether the subsequent sequence count recheck can
recover from the resulting badness.
Just as one example, I think get_vma_name() could (depending on
compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
between "if (vma->vm_ops && vma->vm_ops->name)" and
"vma->vm_ops->name(vma)". And I think this illustrates how the "fully
lockless" approach creates more implicit assumptions about the
behavior of core MM code, which could be broken by future changes to
MM code.
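For reference, the pattern I mean, plus a local-copy variant that only
narrows the compiler-level re-read hazard (it does not make the overall
lockless approach safe):

	/* racy: ->vm_ops may be re-read between the check and the call */
	if (vma->vm_ops && vma->vm_ops->name)
		*name = vma->vm_ops->name(vma);

	/* local copy: the pointer is loaded exactly once */
	const struct vm_operations_struct *ops = READ_ONCE(vma->vm_ops);

	if (ops && ops->name)
		*name = ops->name(vma);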
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-29 15:39 ` Jann Horn
@ 2025-04-29 17:09 ` Suren Baghdasaryan
2025-04-29 17:20 ` Jann Horn
0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-29 17:09 UTC (permalink / raw)
To: Jann Horn
Cc: Al Viro, brauner, linux-fsdevel, akpm, Liam.Howlett,
lorenzo.stoakes, david, vbabka, peterx, hannes, mhocko, paulmck,
shuah, adobriyan, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-mm,
linux-kselftest
On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@google.com> wrote:
>
> On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > With maple_tree supporting vma tree traversal under RCU and vma and
> > its important members being RCU-safe, /proc/pid/maps can be read under
> > RCU and without the need to read-lock mmap_lock. However vma content
> > can change from under us, therefore we make a copy of the vma and we
> > pin pointer fields used when generating the output (currently only
> > vm_file and anon_name). Afterwards we check for concurrent address
> > space modifications, wait for them to end and retry. While we take
> > the mmap_lock for reading during such contention, we do that momentarily
> > only to record new mm_wr_seq counter. This change is designed to reduce
> > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > (often a low priority task, such as monitoring/data collection services)
> > from blocking address space updates.
> [...]
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b9e4fbbdf6e6..f9d50a61167c 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> [...]
> > +/*
> > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > + * show_map_vma.
> > + */
> > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > +{
> > + struct vm_area_struct *copy = &priv->vma_copy;
> > + int ret = -EAGAIN;
> > +
> > + memcpy(copy, vma, sizeof(*vma));
> > + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > + goto out;
>
> I think this uses get_file_rcu() in a different way than intended.
>
> As I understand it, get_file_rcu() is supposed to be called on a
> pointer which always points to a file with a non-zero refcount (except
> when it is NULL). That's why it takes a file** instead of a file* - if
> it observes a zero refcount, it assumes that the pointer must have
> been updated in the meantime, and retries. Calling get_file_rcu() on a
> pointer that points to a file with zero refcount, which I think can
> happen with this patch, will cause an endless loop.
> (Just as background: For other usecases, get_file_rcu() is supposed to
> still behave nicely and not spuriously return NULL when the file* is
> concurrently updated to point to another file*; that's what that loop
> is for.)
Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
checking the return value of get_file_rcu() and retrying speculation
if it changed.
> (If my understanding is correct, maybe we should document that more
> explicitly...)
Good point. I'll add comments for get_file_rcu() as a separate patch.
>
> Also, I think you are introducing an implicit assumption that
> remove_vma() does not NULL out the ->vm_file pointer (because that
> could cause tearing and could theoretically lead to a torn pointer
> being accessed here).
>
> One alternative might be to change the paths that drop references to
> vma->vm_file (search for vma_close to find them) such that they first
> NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> maybe with a new helper like this:
>
> static void vma_fput(struct vm_area_struct *vma)
> {
> struct file *file = vma->vm_file;
>
> if (file) {
> WRITE_ONCE(vma->vm_file, NULL);
> fput(file);
> }
> }
>
> Then on the lockless lookup path you could use get_file_rcu() on the
> ->vm_file pointer _of the original VMA_, and store the returned file*
> into copy->vm_file.
Ack. Except for storing the return value of get_file_rcu(). I think
once we detect that get_file_rcu() returns a different file we should
bail out and retry. The change in file is an indication that the vma
got changed from under us, so whatever we have is stale.
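Something along these lines (sketch, on top of your vma_fput() scheme):

	struct file *file;

	memcpy(copy, vma, sizeof(*vma));
	/* take the reference from the original vma, as you suggest */
	file = get_file_rcu(&vma->vm_file);
	if (file != copy->vm_file) {
		/* ->vm_file changed after the memcpy(): vma is stale, bail */
		if (file)
			fput(file);
		return -EAGAIN;
	}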
>
> > + if (!anon_vma_name_get_if_valid(copy))
> > + goto put_file;
> > +
> > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > + return 0;
>
> We only check for concurrent updates at this point, so up to here,
> anything we read from "copy" could contain torn pointers (both because
> memcpy() is not guaranteed to copy pointers atomically and because the
> updates to the original VMA are not done with WRITE_ONCE()).
> That probably means that something like the preceding
> anon_vma_name_get_if_valid() could crash on an access to a torn
> pointer.
> Please either do another mmap_lock_speculate_retry() check directly
> after the memcpy(), or ensure nothing before this point reads from
> "copy".
Ack. I'll add mmap_lock_speculate_retry() check right after memcpy().
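Something like this (untested sketch, reusing the names from the
patch; "lock_and_retry" stands for the existing re-lock path):

        memcpy(copy, vma, sizeof(*vma));
        /* Don't dereference anything in the copy until we know it
         * wasn't torn by a concurrent address space change. */
        if (mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
                goto lock_and_retry;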
>
> > + /* Address space got modified, vma might be stale. Re-lock and retry. */
> > + rcu_read_unlock();
> > + ret = mmap_read_lock_killable(priv->mm);
> > + if (!ret) {
> > + /* mmap_lock_speculate_try_begin() succeeds when holding mmap_read_lock */
> > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > + mmap_read_unlock(priv->mm);
> > + ret = -EAGAIN;
> > + }
> > +
> > + rcu_read_lock();
> > +
> > + anon_vma_name_put_if_valid(copy);
> > +put_file:
> > + if (copy->vm_file)
> > + fput(copy->vm_file);
> > +out:
> > + return ret;
> > +}
> [...]
> > @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
> > } else {
> > *path = file_user_path(vma->vm_file);
> > }
> > - return;
> > + goto out;
> > }
> >
> > if (vma->vm_ops && vma->vm_ops->name) {
> > *name = vma->vm_ops->name(vma);
>
> This seems to me like a big, subtle change of semantics. After this
> change, vm_ops->name() will no longer receive a real VMA; and in
> particular, I think the .name implementation special_mapping_name used
> in special_mapping_vmops will have a UAF because it relies on
> vma->vm_private_data pointing to a live object.
Ah, I see. IOW, vma->vm_private_data might change from under us and I
wouldn't detect that. Moreover, this is just an example; other .name()
implementations might depend on other things.
>
> I think you'll need to fall back to using the mmap lock and the real
> VMA if you see a non-NULL vma->vm_ops->name pointer.
Yeah, either that or obtain the name and make a copy during
get_vma_snapshot() using the original vma. I'll need to check which
way is better.
>
> > if (*name)
> > - return;
> > + goto out;
> > }
> >
> > *name = arch_vma_name(vma);
> > if (*name)
> > - return;
> > + goto out;
> >
> > if (!vma->vm_mm) {
> > *name = "[vdso]";
> > - return;
> > + goto out;
> > }
> >
> > if (vma_is_initial_heap(vma)) {
> > *name = "[heap]";
> > - return;
> > + goto out;
> > }
> >
> > if (vma_is_initial_stack(vma)) {
> > *name = "[stack]";
> > - return;
> > + goto out;
> > }
> >
> > if (anon_name) {
> > *name_fmt = "[anon:%s]";
> > *name = anon_name->name;
> > - return;
> > }
> > +out:
> > + if (anon_name && !mmap_locked)
> > + anon_vma_name_put(anon_name);
>
> Isn't this refcount drop too early, causing UAF read? We drop the
> reference on the anon_name here, but (on some paths) we're about to
> return anon_name->name to the caller through *name, and the caller
> will read from it.
>
> Ah, but I guess it's actually fine because the refcount increment was
> unnecessary in the first place, because the vma pointer actually
> points to a copy of the original VMA, and the copy has its own
> refcounted reference to the anon_name thanks to get_vma_snapshot()?
> It might be helpful to have some comments documenting which VMA
> pointers can point to copies.
If I follow Andrii's suggestion and avoid vma copying, I'll need to
implement careful handling of pointers and allow get_vma_name() to
fail, indicating that the vma changed from under us. Let me see if
this is still doable in light of your findings.
Thanks for the insightful review and welcome back!
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-04-29 15:55 ` Jann Horn
@ 2025-04-29 17:14 ` Suren Baghdasaryan
2025-04-29 17:24 ` Jann Horn
0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-29 17:14 UTC (permalink / raw)
To: Jann Horn
Cc: Andrii Nakryiko, akpm, Liam.Howlett, lorenzo.stoakes, david,
vbabka, peterx, hannes, mhocko, paulmck, shuah, adobriyan,
brauner, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-fsdevel,
linux-mm, linux-kselftest
On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > Utilize speculative vma lookup to find and snapshot a vma without
> > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > address space modifications are detected and the lookup is retried.
> > > While we take the mmap_lock for reading during such contention, we
> > > do that momentarily only to record new mm_wr_seq counter.
> >
> > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > speculation, IMO (because it's more obvious that vma's use is
> > localized to do_procmap_query(), instead of being spread across
> > m_start/m_next and m_show as with seq_file approach). We do
> > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > mmap_read_lock), use that VMA to produce (speculative) output, and
> > then validate that VMA or mm_struct didn't change with
> > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > No need for vma_copy and any gets/puts, no?
>
> I really strongly dislike this "fully lockless" approach because it
> means we get data races all over the place, and it gets hard to reason
> about what happens especially if we do anything other than reading
> plain data from the VMA. When reading the implementation of
> do_procmap_query(), at basically every memory read you'd have to think
> twice as hard to figure out which fields can be concurrently updated
> elsewhere and whether the subsequent sequence count recheck can
> recover from the resulting badness.
>
> Just as one example, I think get_vma_name() could (depending on
> compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> between "if (vma->vm_ops && vma->vm_ops->name)" and
> "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> lockless" approach creates more implicit assumptions about the
> behavior of core MM code, which could be broken by future changes to
> MM code.
Yeah, I'll need to re-evaluate such an approach after your review. I
like having get_stable_vma() to obtain a completely stable version of
the vma in a localized place and then stop worrying about possible
races. If implemented correctly, would that be enough to address your
concern, Jann?
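To sketch the shape I have in mind (hypothetical; get_stable_vma()
would be the localized snapshot helper, "lock_and_retry" the usual
re-lock fallback):

        rcu_read_lock();
        if (!mmap_lock_speculate_try_begin(mm, &seq))
                goto lock_and_retry;            /* writer is active */
        vma = vma_lookup(mm, addr);             /* RCU-safe maple tree walk */
        if (vma) {
                /* private snapshot, pins vm_file and anon_name */
                vma = get_stable_vma(priv, vma);
                if (IS_ERR(vma))
                        goto lock_and_retry;
        }
        /* ... produce output from the stable snapshot ... */
        rcu_read_unlock();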
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-29 17:09 ` Suren Baghdasaryan
@ 2025-04-29 17:20 ` Jann Horn
2025-04-29 18:04 ` Suren Baghdasaryan
0 siblings, 1 reply; 39+ messages in thread
From: Jann Horn @ 2025-04-29 17:20 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Al Viro, brauner, linux-fsdevel, akpm, Liam.Howlett,
lorenzo.stoakes, david, vbabka, peterx, hannes, mhocko, paulmck,
shuah, adobriyan, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-mm,
linux-kselftest
Hi!
(I just noticed that I incorrectly assumed that VMAs use kfree_rcu
(not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
forgot all about that...)
On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@google.com> wrote:
> > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > RCU and without the need to read-lock mmap_lock. However vma content
> > > can change from under us, therefore we make a copy of the vma and we
> > > pin pointer fields used when generating the output (currently only
> > > vm_file and anon_name). Afterwards we check for concurrent address
> > > space modifications, wait for them to end and retry. While we take
> > > the mmap_lock for reading during such contention, we do that momentarily
> > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > (often a low priority task, such as monitoring/data collection services)
> > > from blocking address space updates.
> > [...]
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > [...]
> > > +/*
> > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > + * show_map_vma.
> > > + */
> > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > +{
> > > + struct vm_area_struct *copy = &priv->vma_copy;
> > > + int ret = -EAGAIN;
> > > +
> > > + memcpy(copy, vma, sizeof(*vma));
> > > + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > + goto out;
> >
> > I think this uses get_file_rcu() in a different way than intended.
> >
> > As I understand it, get_file_rcu() is supposed to be called on a
> > pointer which always points to a file with a non-zero refcount (except
> > when it is NULL). That's why it takes a file** instead of a file* - if
> > it observes a zero refcount, it assumes that the pointer must have
> > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > pointer that points to a file with zero refcount, which I think can
> > happen with this patch, will cause an endless loop.
> > (Just as background: For other usecases, get_file_rcu() is supposed to
> > still behave nicely and not spuriously return NULL when the file* is
> > concurrently updated to point to another file*; that's what that loop
> > is for.)
>
> Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> checking the return value of get_file_rcu() and retrying speculation
> if it changed.
I think you could probably still end up looping endlessly in get_file_rcu().
> > (If my understanding is correct, maybe we should document that more
> > explicitly...)
>
> Good point. I'll add comments for get_file_rcu() as a separate patch.
>
> >
> > Also, I think you are introducing an implicit assumption that
> > remove_vma() does not NULL out the ->vm_file pointer (because that
> > could cause tearing and could theoretically lead to a torn pointer
> > being accessed here).
> >
> > One alternative might be to change the paths that drop references to
> > vma->vm_file (search for vma_close to find them) such that they first
> > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > maybe with a new helper like this:
> >
> > static void vma_fput(struct vm_area_struct *vma)
> > {
> > struct file *file = vma->vm_file;
> >
> > if (file) {
> > WRITE_ONCE(vma->vm_file, NULL);
> > fput(file);
> > }
> > }
> >
> > Then on the lockless lookup path you could use get_file_rcu() on the
> > ->vm_file pointer _of the original VMA_, and store the returned file*
> > into copy->vm_file.
>
> Ack. Except for storing the return value of get_file_rcu(). I think
> once we detect that get_file_rcu() returns a different file we should
> bail out and retry. The change in file is an indication that the vma
> got changed from under us, so whatever we have is stale.
What does "different file" mean here - what file* would you compare
the returned one against?
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-04-29 17:14 ` Suren Baghdasaryan
@ 2025-04-29 17:24 ` Jann Horn
2025-05-01 22:10 ` Andrii Nakryiko
0 siblings, 1 reply; 39+ messages in thread
From: Jann Horn @ 2025-04-29 17:24 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Andrii Nakryiko, akpm, Liam.Howlett, lorenzo.stoakes, david,
vbabka, peterx, hannes, mhocko, paulmck, shuah, adobriyan,
brauner, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-fsdevel,
linux-mm, linux-kselftest
On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > address space modifications are detected and the lookup is retried.
> > > > While we take the mmap_lock for reading during such contention, we
> > > > do that momentarily only to record new mm_wr_seq counter.
> > >
> > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > speculation, IMO (because it's more obvious that vma's use is
> > > localized to do_procmap_query(), instead of being spread across
> > > m_start/m_next and m_show as with seq_file approach). We do
> > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > then validate that VMA or mm_struct didn't change with
> > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > No need for vma_copy and any gets/puts, no?
> >
> > I really strongly dislike this "fully lockless" approach because it
> > means we get data races all over the place, and it gets hard to reason
> > about what happens especially if we do anything other than reading
> > plain data from the VMA. When reading the implementation of
> > do_procmap_query(), at basically every memory read you'd have to think
> > twice as hard to figure out which fields can be concurrently updated
> > elsewhere and whether the subsequent sequence count recheck can
> > recover from the resulting badness.
> >
> > Just as one example, I think get_vma_name() could (depending on
> > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > lockless" approach creates more implicit assumptions about the
> > behavior of core MM code, which could be broken by future changes to
> > MM code.
>
> Yeah, I'll need to re-evaluate such an approach after your review. I
> like having get_stable_vma() to obtain a completely stable version of
> the vma in a localized place and then stop worrying about possible
> races. If implemented correctly, would that be enough to address your
> concern, Jann?
Yes, I think a stable local snapshot of the VMA (where tricky data
races are limited to the VMA snapshotting code) is a good tradeoff.
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-29 17:20 ` Jann Horn
@ 2025-04-29 18:04 ` Suren Baghdasaryan
2025-04-29 18:54 ` Jann Horn
0 siblings, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-29 18:04 UTC (permalink / raw)
To: Jann Horn
Cc: Al Viro, brauner, linux-fsdevel, akpm, Liam.Howlett,
lorenzo.stoakes, david, vbabka, peterx, hannes, mhocko, paulmck,
shuah, adobriyan, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-mm,
linux-kselftest
On Tue, Apr 29, 2025 at 10:21 AM Jann Horn <jannh@google.com> wrote:
>
> Hi!
>
> (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> forgot all about that...)
Does this fact affect your previous comments? Just want to make sure
I'm not missing something...
>
> On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@google.com> wrote:
> > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > can change from under us, therefore we make a copy of the vma and we
> > > > pin pointer fields used when generating the output (currently only
> > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > space modifications, wait for them to end and retry. While we take
> > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > (often a low priority task, such as monitoring/data collection services)
> > > > from blocking address space updates.
> > > [...]
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > [...]
> > > > +/*
> > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > + * show_map_vma.
> > > > + */
> > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > +{
> > > > + struct vm_area_struct *copy = &priv->vma_copy;
> > > > + int ret = -EAGAIN;
> > > > +
> > > > + memcpy(copy, vma, sizeof(*vma));
> > > > + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > + goto out;
> > >
> > > I think this uses get_file_rcu() in a different way than intended.
> > >
> > > As I understand it, get_file_rcu() is supposed to be called on a
> > > pointer which always points to a file with a non-zero refcount (except
> > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > it observes a zero refcount, it assumes that the pointer must have
> > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > pointer that points to a file with zero refcount, which I think can
> > > happen with this patch, will cause an endless loop.
> > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > still behave nicely and not spuriously return NULL when the file* is
> > > concurrently updated to point to another file*; that's what that loop
> > > is for.)
> >
> > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > checking the return value of get_file_rcu() and retrying speculation
> > if it changed.
>
> I think you could probably still end up looping endlessly in get_file_rcu().
By "retrying speculation" I meant it in the sense of
get_vma_snapshot() retry when it takes the mmap_read_lock and then
does mmap_lock_speculate_try_begin to restart speculation. I'm also
thinking about Liam's concern of guaranteeing forward progress for the
reader. Thinking maybe I should not drop mmap_read_lock immediately on
contention but generate some output (one vma or one page worth of
vmas) before dropping mmap_read_lock and proceeding with speculation.
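Roughly like this (sketch of the idea only; assumes the seq_file "m"
and the current vma are reachable from the retry path):

        ret = mmap_read_lock_killable(priv->mm);
        if (!ret) {
                /* guarantee forward progress: emit at least one vma
                 * while we hold the lock anyway */
                show_map_vma(m, vma);
                mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
                mmap_read_unlock(priv->mm);
        }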
>
> > > (If my understanding is correct, maybe we should document that more
> > > explicitly...)
> >
> > Good point. I'll add comments for get_file_rcu() as a separate patch.
> >
> > >
> > > Also, I think you are introducing an implicit assumption that
> > > remove_vma() does not NULL out the ->vm_file pointer (because that
> > > could cause tearing and could theoretically lead to a torn pointer
> > > being accessed here).
> > >
> > > One alternative might be to change the paths that drop references to
> > > vma->vm_file (search for vma_close to find them) such that they first
> > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > > maybe with a new helper like this:
> > >
> > > static void vma_fput(struct vm_area_struct *vma)
> > > {
> > > struct file *file = vma->vm_file;
> > >
> > > if (file) {
> > > WRITE_ONCE(vma->vm_file, NULL);
> > > fput(file);
> > > }
> > > }
> > >
> > > Then on the lockless lookup path you could use get_file_rcu() on the
> > > ->vm_file pointer _of the original VMA_, and store the returned file*
> > > into copy->vm_file.
> >
> > Ack. Except for storing the return value of get_file_rcu(). I think
> > once we detect that get_file_rcu() returns a different file we should
> > bail out and retry. The change in file is an indication that the vma
> > got changed from under us, so whatever we have is stale.
>
> What does "different file" mean here - what file* would you compare
> the returned one against?
Inside get_vma_snapshot() I would pass the original vma->vm_file to
get_file_rcu() and check if it returns the same value. If the value
got changed we jump to /* Address space got modified, vma might be
stale. Re-lock and retry. */ section. That should work, right?
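Roughly (untested sketch; "vma" is the one we found in the tree, "copy"
is the snapshot, "lock_and_retry" is that re-lock section):

        struct file *file, *pinned;

        file = READ_ONCE(vma->vm_file);
        pinned = file ? get_file_rcu(&vma->vm_file) : NULL;
        if (pinned != file) {
                /* we pinned somebody else's file, drop it again */
                if (pinned)
                        fput(pinned);
                goto lock_and_retry;    /* vma changed from under us */
        }
        copy->vm_file = pinned;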
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-29 18:04 ` Suren Baghdasaryan
@ 2025-04-29 18:54 ` Jann Horn
2025-04-29 20:33 ` Suren Baghdasaryan
2025-05-05 13:50 ` Christian Brauner
0 siblings, 2 replies; 39+ messages in thread
From: Jann Horn @ 2025-04-29 18:54 UTC (permalink / raw)
To: Suren Baghdasaryan, Al Viro, brauner, linux-fsdevel
Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx,
hannes, mhocko, paulmck, shuah, adobriyan, josef, yebin10, linux,
willy, osalvador, andrii, ryan.roberts, christophe.leroy,
linux-kernel, linux-mm, linux-kselftest
On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Apr 29, 2025 at 10:21 AM Jann Horn <jannh@google.com> wrote:
> >
> > Hi!
> >
> > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > forgot all about that...)
>
> Does this fact affect your previous comments? Just want to make sure
> I'm not missing something...
When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
down a VMA, and using get_file_rcu() for the lockless lookup, I did
not realize that you could actually also race with all the other
places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
did not think about whether any of those code paths might leave a VMA
with a dangling ->vm_file pointer.
I guess maybe that means you really do need to do the lookup from the
copied data, as you did in your patch; and that might require calling
get_file_active() on the copied ->vm_file pointer (instead of
get_file_rcu()), even though I think that is not really how
get_file_active() is supposed to be used (it's supposed to be used
when you know the original file hasn't been freed yet). Really what
you'd want for that is basically a raw __get_file_rcu(), but that is
static and I think Christian wouldn't want to expose more of these
internals outside VFS...
(In that case, all the stuff below about get_file_rcu() would be moot.)
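For reference, the get_file_active() variant would look roughly like
this (sketch):

        memcpy(copy, vma, sizeof(*vma));
        if (copy->vm_file) {
                /* The private copy can't change under us, so this
                 * either pins the snapshotted file or observes a zero
                 * refcount and returns NULL instead of looping the way
                 * get_file_rcu() does. */
                copy->vm_file = get_file_active(&copy->vm_file);
                if (!copy->vm_file)
                        goto out;       /* file went away, vma is stale */
        }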
Or you could pepper WRITE_ONCE() over all the places that write
->vm_file, and ensure that ->vm_file is always NULLed before its
reference is dropped... but that seems a bit more ugly to me.
> > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@google.com> wrote:
> > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > pin pointer fields used when generating the output (currently only
> > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > space modifications, wait for them to end and retry. While we take
> > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > from blocking address space updates.
> > > > [...]
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > [...]
> > > > > +/*
> > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > + * show_map_vma.
> > > > > + */
> > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > +{
> > > > > + struct vm_area_struct *copy = &priv->vma_copy;
> > > > > + int ret = -EAGAIN;
> > > > > +
> > > > > + memcpy(copy, vma, sizeof(*vma));
> > > > > + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > > + goto out;
> > > >
> > > > I think this uses get_file_rcu() in a different way than intended.
> > > >
> > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > pointer which always points to a file with a non-zero refcount (except
> > > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > > it observes a zero refcount, it assumes that the pointer must have
> > > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > > pointer that points to a file with zero refcount, which I think can
> > > > happen with this patch, will cause an endless loop.
> > > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > > still behave nicely and not spuriously return NULL when the file* is
> > > > concurrently updated to point to another file*; that's what that loop
> > > > is for.)
> > >
> > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > > checking the return value of get_file_rcu() and retrying speculation
> > > if it changed.
> >
> > I think you could probably still end up looping endlessly in get_file_rcu().
(Just to be clear: What I meant here is that get_file_rcu() loops
*internally*; get_file_rcu() is not guaranteed to ever return if the
pointed-to file has a zero refcount.)
> By "retrying speculation" I meant it in the sense of
> get_vma_snapshot() retry when it takes the mmap_read_lock and then
> does mmap_lock_speculate_try_begin to restart speculation. I'm also
> thinking about Liam's concern of guaranteeing forward progress for the
> reader. Thinking maybe I should not drop mmap_read_lock immediately on
> contention but generate some output (one vma or one page worth of
> vmas) before dropping mmap_read_lock and proceeding with speculation.
Hm, yeah, I guess you need that for forward progress...
> > > > (If my understanding is correct, maybe we should document that more
> > > > explicitly...)
> > >
> > > Good point. I'll add comments for get_file_rcu() as a separate patch.
> > >
> > > >
> > > > Also, I think you are introducing an implicit assumption that
> > > > remove_vma() does not NULL out the ->vm_file pointer (because that
> > > > could cause tearing and could theoretically lead to a torn pointer
> > > > being accessed here).
> > > >
> > > > One alternative might be to change the paths that drop references to
> > > > vma->vm_file (search for vma_close to find them) such that they first
> > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > > > maybe with a new helper like this:
> > > >
> > > > static void vma_fput(struct vm_area_struct *vma)
> > > > {
> > > > struct file *file = vma->vm_file;
> > > >
> > > > if (file) {
> > > > WRITE_ONCE(vma->vm_file, NULL);
> > > > fput(file);
> > > > }
> > > > }
> > > >
> > > > Then on the lockless lookup path you could use get_file_rcu() on the
> > > > ->vm_file pointer _of the original VMA_, and store the returned file*
> > > > into copy->vm_file.
> > >
> > > Ack. Except for storing the return value of get_file_rcu(). I think
> > > once we detect that get_file_rcu() returns a different file we should
> > > bail out and retry. The change in file is an indication that the vma
> > > got changed from under us, so whatever we have is stale.
> >
> > What does "different file" mean here - what file* would you compare
> > the returned one against?
>
> Inside get_vma_snapshot() I would pass the original vma->vm_file to
> get_file_rcu() and check if it returns the same value. If the value
> got changed we jump to /* Address space got modified, vma might be
> stale. Re-lock and retry. */ section. That should work, right?
Where do you get an "original vma->vm_file" from?
To be clear, get_file_rcu(p) returns one of the values that *p had
while get_file_rcu(p) is running.
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-29 18:54 ` Jann Horn
@ 2025-04-29 20:33 ` Suren Baghdasaryan
2025-04-29 20:43 ` Jann Horn
2025-05-05 13:50 ` Christian Brauner
1 sibling, 1 reply; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-04-29 20:33 UTC (permalink / raw)
To: Jann Horn
Cc: Al Viro, brauner, linux-fsdevel, akpm, Liam.Howlett,
lorenzo.stoakes, david, vbabka, peterx, hannes, mhocko, paulmck,
shuah, adobriyan, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-mm,
linux-kselftest
On Tue, Apr 29, 2025 at 11:55 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > Hi!
> > >
> > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > forgot all about that...)
> >
> > Does this fact affect your previous comments? Just want to make sure
> > I'm not missing something...
>
> When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> down a VMA, and using get_file_rcu() for the lockless lookup, I did
> not realize that you could actually also race with all the other
> places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> did not think about whether any of those code paths might leave a VMA
> with a dangling ->vm_file pointer.
So, let me summarize my understanding and see if it's correct.
If we copy the original vma, ensure that it hasn't changed while we
were copying (with mmap_lock_speculate_retry()) and then use
get_file_rcu(&copy->vm_file), I think we are guaranteed no UAF because
we are in an RCU read section. At this point the only issue is that
copy->vm_file might have lost its last refcount and get_file_rcu()
would enter an infinite loop. So, to avoid that we have to use the
original vma when calling get_file_rcu() but then we should also
ensure that vma itself does not change from under us due to
SLAB_TYPESAFE_BY_RCU used for vm_area_struct cache. If it does change
from under us we might end up accessing an invalid address if
vma->vm_file is being modified concurrently.
>
> I guess maybe that means you really do need to do the lookup from the
> copied data, as you did in your patch; and that might require calling
> get_file_active() on the copied ->vm_file pointer (instead of
> get_file_rcu()), even though I think that is not really how
> get_file_active() is supposed to be used (it's supposed to be used
> when you know the original file hasn't been freed yet). Really what
> you'd want for that is basically a raw __get_file_rcu(), but that is
> static and I think Christian wouldn't want to expose more of these
> internals outside VFS...
> (In that case, all the stuff below about get_file_rcu() would be moot.)
>
> Or you could pepper WRITE_ONCE() over all the places that write
> ->vm_file, and ensure that ->vm_file is always NULLed before its
> reference is dropped... but that seems a bit more ugly to me.
Ugh, yes. We either ensure no vma->vm_file tearing or use
__get_file_rcu() on a copy of the vma. Or we have to stabilize the vma
by locking it... Let me think about all these options. Thanks!
>
> > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@google.com> wrote:
> > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > pin pointer fields used when generating the output (currently only
> > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > > from blocking address space updates.
> > > > > [...]
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > [...]
> > > > > > +/*
> > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > > + * show_map_vma.
> > > > > > + */
> > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > > +{
> > > > > > + struct vm_area_struct *copy = &priv->vma_copy;
> > > > > > + int ret = -EAGAIN;
> > > > > > +
> > > > > > + memcpy(copy, vma, sizeof(*vma));
> > > > > > + if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > > > + goto out;
> > > > >
> > > > > I think this uses get_file_rcu() in a different way than intended.
> > > > >
> > > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > > pointer which always points to a file with a non-zero refcount (except
> > > > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > > > it observes a zero refcount, it assumes that the pointer must have
> > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > > > pointer that points to a file with zero refcount, which I think can
> > > > > happen with this patch, will cause an endless loop.
> > > > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > > > still behave nicely and not spuriously return NULL when the file* is
> > > > > concurrently updated to point to another file*; that's what that loop
> > > > > is for.)
> > > >
> > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > > > checking the return value of get_file_rcu() and retrying speculation
> > > > if it changed.
> > >
> > > I think you could probably still end up looping endlessly in get_file_rcu().
>
> (Just to be clear: What I meant here is that get_file_rcu() loops
> *internally*; get_file_rcu() is not guaranteed to ever return if the
> pointed-to file has a zero refcount.)
>
> > By "retrying speculation" I meant it in the sense of
> > get_vma_snapshot() retry when it takes the mmap_read_lock and then
> > does mmap_lock_speculate_try_begin to restart speculation. I'm also
> > thinking about Liam's concern of guaranteeing forward progress for the
> > reader. Thinking maybe I should not drop mmap_read_lock immediately on
> > contention but generate some output (one vma or one page worth of
> > vmas) before dropping mmap_read_lock and proceeding with speculation.
>
> Hm, yeah, I guess you need that for forward progress...
>
> > > > > (If my understanding is correct, maybe we should document that more
> > > > > explicitly...)
> > > >
> > > > Good point. I'll add comments for get_file_rcu() as a separate patch.
> > > >
> > > > >
> > > > > Also, I think you are introducing an implicit assumption that
> > > > > remove_vma() does not NULL out the ->vm_file pointer (because that
> > > > > could cause tearing and could theoretically lead to a torn pointer
> > > > > being accessed here).
> > > > >
> > > > > One alternative might be to change the paths that drop references to
> > > > > vma->vm_file (search for vma_close to find them) such that they first
> > > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > > > > maybe with a new helper like this:
> > > > >
> > > > > static void vma_fput(struct vm_area_struct *vma)
> > > > > {
> > > > > struct file *file = vma->vm_file;
> > > > >
> > > > > if (file) {
> > > > > WRITE_ONCE(vma->vm_file, NULL);
> > > > > fput(file);
> > > > > }
> > > > > }
> > > > >
> > > > > Then on the lockless lookup path you could use get_file_rcu() on the
> > > > > ->vm_file pointer _of the original VMA_, and store the returned file*
> > > > > into copy->vm_file.
> > > >
> > > > Ack. Except for storing the return value of get_file_rcu(). I think
> > > > once we detect that get_file_rcu() returns a different file we should
> > > > bail out and retry. The change in file is an indication that the vma
> > > > got changed from under us, so whatever we have is stale.
> > >
> > > What does "different file" mean here - what file* would you compare
> > > the returned one against?
> >
> > Inside get_vma_snapshot() I would pass the original vma->vm_file to
> > get_file_rcu() and check if it returns the same value. If the value
> > got changed we jump to /* Address space got modified, vma might be
> > stale. Re-lock and retry. */ section. That should work, right?
>
> Where do you get an "original vma->vm_file" from?
>
> To be clear, get_file_rcu(p) returns one of the values that *p had
> while get_file_rcu(p) is running.
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-29 20:33 ` Suren Baghdasaryan
@ 2025-04-29 20:43 ` Jann Horn
0 siblings, 0 replies; 39+ messages in thread
From: Jann Horn @ 2025-04-29 20:43 UTC (permalink / raw)
To: Suren Baghdasaryan, Al Viro, brauner
Cc: linux-fsdevel, akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka,
peterx, hannes, mhocko, paulmck, shuah, adobriyan, josef, yebin10,
linux, willy, osalvador, andrii, ryan.roberts, christophe.leroy,
linux-kernel, linux-mm, linux-kselftest
On Tue, Apr 29, 2025 at 10:33 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Apr 29, 2025 at 11:55 AM Jann Horn <jannh@google.com> wrote:
> > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > Hi!
> > > >
> > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > > forgot all about that...)
> > >
> > > Does this fact affect your previous comments? Just want to make sure
> > > I'm not missing something...
> >
> > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> > down a VMA, and using get_file_rcu() for the lockless lookup, I did
> > not realize that you could actually also race with all the other
> > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> > did not think about whether any of those code paths might leave a VMA
> > with a dangling ->vm_file pointer.
>
> So, let me summarize my understanding and see if it's correct.
>
> If we copy the original vma, ensure that it hasn't changed while we
> were copying (with mmap_lock_speculate_retry()) and then use
> get_file_rcu(&copy->vm_file), I think we are guaranteed no UAF because
> we are in an RCU read section. At this point the only issue is that
> copy->vm_file might have lost its last refcount and get_file_rcu()
> would enter an infinite loop.
Yeah. (Using get_file_active() would avoid that.)
> So, to avoid that we have to use the
> original vma when calling get_file_rcu()
Sorry - I originally said that, but I didn't think about
SLAB_TYPESAFE_BY_RCU when I had that in mind.
> but then we should also
> ensure that vma itself does not change from under us due to
> SLAB_TYPESAFE_BY_RCU used for vm_area_struct cache. If it does change
> from under us we might end up accessing an invalid address if
> vma->vm_file is being modified concurrently.
Yeah, I think in theory we would have data races, since the file*
reads in get_file_rcu() could race with all the (plain) ->vm_file
pointer stores. So I guess it might actually be safer to use the
copied VMA's ->vm_file for this, with get_file_active(). Though that
would be abusing get_file_active() quite a bit, so brauner@ should
probably look over this early and see whether he thinks that's
acceptable...
> > I guess maybe that means you really do need to do the lookup from the
> > copied data, as you did in your patch; and that might require calling
> > get_file_active() on the copied ->vm_file pointer (instead of
> > get_file_rcu()), even though I think that is not really how
> > get_file_active() is supposed to be used (it's supposed to be used
> > when you know the original file hasn't been freed yet). Really what
> > you'd want for that is basically a raw __get_file_rcu(), but that is
> > static and I think Christian wouldn't want to expose more of these
> > internals outside VFS...
> > (In that case, all the stuff below about get_file_rcu() would be moot.)
> >
> > Or you could pepper WRITE_ONCE() over all the places that write
> > ->vm_file, and ensure that ->vm_file is always NULLed before its
> > reference is dropped... but that seems a bit more ugly to me.
>
> Ugh, yes. We either ensure no vma->vm_file tearing or use
> __get_file_rcu() on a copy of the vma. Or we have to stabilize the vma
> by locking it... Let me think about all these options. Thanks!
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-04-29 17:24 ` Jann Horn
@ 2025-05-01 22:10 ` Andrii Nakryiko
2025-05-02 15:11 ` Jann Horn
0 siblings, 1 reply; 39+ messages in thread
From: Andrii Nakryiko @ 2025-05-01 22:10 UTC (permalink / raw)
To: Jann Horn
Cc: Suren Baghdasaryan, akpm, Liam.Howlett, lorenzo.stoakes, david,
vbabka, peterx, hannes, mhocko, paulmck, shuah, adobriyan,
brauner, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-fsdevel,
linux-mm, linux-kselftest
On Tue, Apr 29, 2025 at 10:25 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > address space modifications are detected and the lookup is retried.
> > > > > While we take the mmap_lock for reading during such contention, we
> > > > > do that momentarily only to record new mm_wr_seq counter.
> > > >
> > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > speculation, IMO (because it's more obvious that vma's use is
> > > > localized to do_procmap_query(), instead of being spread across
> > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > then validate that VMA or mm_struct didn't change with
> > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > No need for vma_copy and any gets/puts, no?
> > >
> > > I really strongly dislike this "fully lockless" approach because it
> > > means we get data races all over the place, and it gets hard to reason
> > > about what happens especially if we do anything other than reading
> > > plain data from the VMA. When reading the implementation of
> > > do_procmap_query(), at basically every memory read you'd have to think
> > > twice as hard to figure out which fields can be concurrently updated
> > > elsewhere and whether the subsequent sequence count recheck can
> > > recover from the resulting badness.
> > >
> > > Just as one example, I think get_vma_name() could (depending on
> > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > lockless" approach creates more implicit assumptions about the
> > > behavior of core MM code, which could be broken by future changes to
> > > MM code.
> >
> > Yeah, I'll need to re-evaluate such an approach after your review. I
> > like having get_stable_vma() to obtain a completely stable version of
> > the vma in a localized place and then stop worrying about possible
> > races. If implemented correctly, would that be enough to address your
> > concern, Jann?
>
> Yes, I think a stable local snapshot of the VMA (where tricky data
> races are limited to the VMA snapshotting code) is a good tradeoff.
I'm not sure I agree that a VMA snapshot is better either, tbh. It is
error-prone to have a byte-by-byte local copy of a VMA (which isn't
really that VMA anymore) and to pass it into ops callbacks (which
expect a "real" VMA)... Who guarantees that this won't backfire,
depending on vm_ops implementations? And constantly copying 176+ bytes
just to access a few fields out of it is a bit unfortunate...
Also taking mmap_read_lock() sort of defeats the point of "RCU-only
access". It's still locking/unlocking and bouncing cache lines between
writer and reader frequently. How slow is per-VMA formatting? If we
take mmap_read_lock, format VMA information into a buffer under this
lock, and drop the mmap_read_lock, would it really be that much slower
compared to what Suren is doing in this patch set? And if no, that
would be so much simpler compared to this semi-locked/semi-RCU way
that is added in this patch set, no?
But I do agree that vma->vm_ops->name access is hard to do in a
completely lockless way reliably. But also how frequently VMAs have
custom names/anon_vma_name? What if we detect that VMA has some
"fancy" functionality (like this custom name thing), and just fallback
to mmap_read_lock-protected logic, which needs to be supported as a
fallback even for lockless approach?
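E.g. something like (hypothetical helper, name made up):

        /* detect "fancy" VMAs that have to go through the locked path */
        static bool vma_needs_locked_path(const struct vm_area_struct *vma)
        {
                if (vma->vm_ops && vma->vm_ops->name)
                        return true;
#ifdef CONFIG_ANON_VMA_NAME
                if (READ_ONCE(vma->anon_name))
                        return true;
#endif
                return false;
        }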
This way we can process most (typical) VMAs completely locklessly,
while not adding any extra assumptions for all the potentially
complicated data pieces. WDYT?
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-05-01 22:10 ` Andrii Nakryiko
@ 2025-05-02 15:11 ` Jann Horn
2025-05-02 16:16 ` Suren Baghdasaryan
0 siblings, 1 reply; 39+ messages in thread
From: Jann Horn @ 2025-05-02 15:11 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Suren Baghdasaryan, akpm, Liam.Howlett, lorenzo.stoakes, david,
vbabka, peterx, hannes, mhocko, paulmck, shuah, adobriyan,
brauner, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-fsdevel,
linux-mm, linux-kselftest
On Fri, May 2, 2025 at 12:10 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Tue, Apr 29, 2025 at 10:25 AM Jann Horn <jannh@google.com> wrote:
> > On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > > address space modifications are detected and the lookup is retried.
> > > > > > While we take the mmap_lock for reading during such contention, we
> > > > > > do that momentarily only to record new mm_wr_seq counter.
> > > > >
> > > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > > speculation, IMO (because it's more obvious that vma's use is
> > > > > localized to do_procmap_query(), instead of being spread across
> > > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > > then validate that VMA or mm_struct didn't change with
> > > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > > No need for vma_copy and any gets/puts, no?
> > > >
> > > > I really strongly dislike this "fully lockless" approach because it
> > > > means we get data races all over the place, and it gets hard to reason
> > > > about what happens especially if we do anything other than reading
> > > > plain data from the VMA. When reading the implementation of
> > > > do_procmap_query(), at basically every memory read you'd have to think
> > > > twice as hard to figure out which fields can be concurrently updated
> > > > elsewhere and whether the subsequent sequence count recheck can
> > > > recover from the resulting badness.
> > > >
> > > > Just as one example, I think get_vma_name() could (depending on
> > > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > > lockless" approach creates more implicit assumptions about the
> > > > behavior of core MM code, which could be broken by future changes to
> > > > MM code.
> > >
> > > Yeah, I'll need to re-evaluate such an approach after your review. I
> > > like having get_stable_vma() to obtain a completely stable version of
> > > the vma in a localized place and then stop worrying about possible
> > > races. If implemented correctly, would that be enough to address your
> > > concern, Jann?
> >
> > Yes, I think a stable local snapshot of the VMA (where tricky data
> > races are limited to the VMA snapshotting code) is a good tradeoff.
>
> I'm not sure I agree that a VMA snapshot is better either, tbh. It is
> error-prone to have a byte-by-byte local copy of a VMA (which isn't
> really that VMA anymore) and to pass it into ops callbacks (which
> expect a "real" VMA)... Who guarantees that this won't backfire,
> depending on vm_ops implementations? And constantly copying 176+ bytes
> just to access a few fields out of it is a bit unfortunate...
Yeah, we shouldn't be passing VMA snapshots into ops callbacks, I
agree that we need to fall back to using proper locking for that.
> Also taking mmap_read_lock() sort of defeats the point of "RCU-only
> access". It's still locking/unlocking and bouncing cache lines between
> writer and reader frequently. How slow is per-VMA formatting?
I think this mainly does two things?
1. It shifts the latency burden of concurrent access toward the reader
a bit, essentially allowing writers to preempt this type of reader to
some extent.
2. It avoids bouncing cache lines between this type of reader and
other *readers*.
> If we
> take mmap_read_lock, format VMA information into a buffer under this
> lock, and drop the mmap_read_lock, would it really be that much slower
> compared to what Suren is doing in this patch set? And if no, that
> would be so much simpler compared to this semi-locked/semi-RCU way
> that is added in this patch set, no?
> But I do agree that vma->vm_ops->name access is hard to do in a
> completely lockless way reliably. But also how frequently VMAs have
> custom names/anon_vma_name?
I think there are typically two VMAs with vm_ops->name per MM, vvar
and vdso. (Since you also asked about anon_vma_name: I think
anon_vma_name is more frequent than that on Android, there seem to be
58 of those VMAs even in a simple "cat" process.)
> What if we detect that VMA has some
> "fancy" functionality (like this custom name thing), and just fallback
> to mmap_read_lock-protected logic, which needs to be supported as a
> fallback even for lockless approach?
>
> This way we can process most (typical) VMAs completely locklessly,
> while not adding any extra assumptions for all the potentially
> complicated data pieces. WDYT?
And then we'd also use the fallback path if karg.build_id_size is set?
And I guess we also need it if the VMA is hugetlb, because of
vma_kernel_pagesize()? And use READ_ONCE() in places like
vma_is_initial_heap()/vma_is_initial_stack()/arch_vma_name() for
accessing both the VMA and the MM?
And on top of that, we'd have to open-code/change anything that
currently uses the ->vm_ops (such as vma_kernel_pagesize()), because
between us checking the type of the VMA and later accessing ->vm_ops,
the VMA object could have been reallocated with different ->vm_ops?
I still don't like the idea of pushing the complexity of "the VMA
contents are unstable, and every read from the VMA object may return
data about a logically different VMA" down into these various helpers.
In my mind, adding "the VMA contents can be an internally consistent
snapshot" to the API contract for these helpers constrains the
weirdness a bit more - though I guess the helpers will
still need READ_ONCE() for accessing properties of the MM either
way...
* Re: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU
2025-05-02 15:11 ` Jann Horn
@ 2025-05-02 16:16 ` Suren Baghdasaryan
0 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-05-02 16:16 UTC (permalink / raw)
To: Jann Horn
Cc: Andrii Nakryiko, akpm, Liam.Howlett, lorenzo.stoakes, david,
vbabka, peterx, hannes, mhocko, paulmck, shuah, adobriyan,
brauner, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-fsdevel,
linux-mm, linux-kselftest
On Fri, May 2, 2025 at 3:11 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 2, 2025 at 12:10 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Tue, Apr 29, 2025 at 10:25 AM Jann Horn <jannh@google.com> wrote:
> > > On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > > > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > > > address space modifications are detected and the lookup is retried.
> > > > > > > While we take the mmap_lock for reading during such contention, we
> > > > > > > do that momentarily only to record new mm_wr_seq counter.
> > > > > >
> > > > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > > > speculation, IMO (because it's more obvious that vma's use is
> > > > > > localized to do_procmap_query(), instead of being spread across
> > > > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > > > then validate that VMA or mm_struct didn't change with
> > > > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > > > No need for vma_copy and any gets/puts, no?
> > > > >
> > > > > I really strongly dislike this "fully lockless" approach because it
> > > > > means we get data races all over the place, and it gets hard to reason
> > > > > about what happens especially if we do anything other than reading
> > > > > plain data from the VMA. When reading the implementation of
> > > > > do_procmap_query(), at basically every memory read you'd have to think
> > > > > twice as hard to figure out which fields can be concurrently updated
> > > > > elsewhere and whether the subsequent sequence count recheck can
> > > > > recover from the resulting badness.
> > > > >
> > > > > Just as one example, I think get_vma_name() could (depending on
> > > > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > > > lockless" approach creates more implicit assumptions about the
> > > > > behavior of core MM code, which could be broken by future changes to
> > > > > MM code.
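To spell out the race Jann describes (a sketch of the pattern, not the
exact get_vma_name() code; note that even the READ_ONCE() variant still
hands a possibly-stale vma to ->name()):

	/* racy: ->vm_ops is read twice and can change in between */
	if (vma->vm_ops && vma->vm_ops->name)
		name = vma->vm_ops->name(vma);	/* second read may see
						   &vma_dummy_vm_ops, whose
						   ->name is NULL */

	/* reading ->vm_ops once avoids the NULL deref, but ->name() still
	   runs against a vma that may describe a different mapping by now */
	const struct vm_operations_struct *ops = READ_ONCE(vma->vm_ops);

	if (ops && ops->name)
		name = ops->name(vma);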
> > > >
> > > > Yeah, I'll need to re-evaluate such an approach after your review. I
> > > > like having get_stable_vma() to obtain a completely stable version of
> > > > the vma in a localized place and then stop worrying about possible
> > > > races. If implemented correctly, would that be enough to address your
> > > > concern, Jann?
> > >
> > > Yes, I think a stable local snapshot of the VMA (where tricky data
> > > races are limited to the VMA snapshotting code) is a good tradeoff.
> >
> > I'm not sure I agree with the VMA snapshot being better either, tbh. It
> > is error-prone to have a byte-by-byte local copy of a VMA (which isn't
> > really that VMA anymore) and to pass it into ops callbacks (which
> > expect a "real" VMA)... Who guarantees that this won't backfire,
> > depending on the vm_ops implementation? And constantly copying 176+
> > bytes just to access a few fields of it is a bit unfortunate...
>
> Yeah, we shouldn't be passing VMA snapshots into ops callbacks, I
> agree that we need to fall back to using proper locking for that.
Yes, I'm exploring the option of falling back to per-vma locking to
stabilize the VMA when its reference is used in vm_ops.
>
> > Also taking mmap_read_lock() sort of defeats the point of "RCU-only
> > access". It's still locking/unlocking and bouncing cache lines between
> > writer and reader frequently. How slow is per-VMA formatting?
>
> I think this mainly does two things?
>
> 1. It shifts the latency burden of concurrent access toward the reader
> a bit, essentially allowing writers to preempt this type of reader to
> some extent.
> 2. It avoids bouncing cache lines between this type of reader and
> other *readers*.
>
> > If we
> > take mmap_read_lock, format VMA information into a buffer under this
> > lock, and drop the mmap_read_lock, would it really be that much slower
> > compared to what Suren is doing in this patch set? And if not, that
> > would be so much simpler compared to this semi-locked/semi-RCU way
> > that is added in this patch set, no?
The problem this patch is trying to address is low-priority readers
blocking a high-priority writer. Taking mmap_read_lock for each VMA
will not help solve this problem. If we have to use locking, then
taking a per-vma lock would at least narrow the contention scope
from the entire address space down to individual VMAs.
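Roughly, that per-vma fallback would be shaped like this (a sketch;
lock_vma_under_rcu() and vma_end_read() are the existing per-VMA lock
primitives, and error handling is elided):

	struct vm_area_struct *vma;

	vma = lock_vma_under_rcu(mm, addr);
	if (!vma)
		return -EAGAIN;		/* contended: fall back to mmap_read_lock */

	/* vma fields, ->vm_file and ->vm_ops are stable until vma_end_read() */
	/* ... generate output from the vma ... */

	vma_end_read(vma);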
>
> > But I do agree that vma->vm_ops->name access is hard to do in a
> > completely lockless way reliably. But also, how frequently do VMAs have
> > custom names/anon_vma_name?
>
> I think there are typically two VMAs with vm_ops->name per MM, vvar
> and vdso. (Since you also asked about anon_vma_name: I think
> anon_vma_name is more frequent than that on Android, there seem to be
> 58 of those VMAs even in a simple "cat" process.)
>
> > What if we detect that a VMA has some
> > "fancy" functionality (like this custom name thing), and just fallback
> > to mmap_read_lock-protected logic, which needs to be supported as a
> > fallback even for lockless approach?
> >
> > This way we can process most (typical) VMAs completely locklessly,
> > while not adding any extra assumptions for all the potentially
> > complicated data pieces. WDYT?
The option I'm currently contemplating is using per-vma locks to deal
with "fancy" cases and to do snapshotting otherwise. We have several
options with different complexity-vs-performance tradeoffs, and finding
the right balance will require some experimentation. I'll
likely need Paul's help soon to run his testcase with different
versions.
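As an illustration, such a "fancy" check might look like the
hypothetical predicate below (the policy is made up and is not from
this patch set; only the field and helper names are existing ones):

	static bool vma_needs_locked_fallback(struct vm_area_struct *vma)
	{
		if (vma->vm_ops && vma->vm_ops->name)	/* custom name, e.g. vdso */
			return true;
		if (is_vm_hugetlb_page(vma))		/* vma_kernel_pagesize() */
			return true;
	#ifdef CONFIG_ANON_VMA_NAME
		if (READ_ONCE(vma->anon_name))		/* user-named anon mapping */
			return true;
	#endif
		return false;
	}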
>
> And then we'd also use the fallback path if karg.build_id_size is set?
> And I guess we also need it if the VMA is hugetlb, because of
> vma_kernel_pagesize()? And use READ_ONCE() in places like
> vma_is_initial_heap()/vma_is_initial_stack()/arch_vma_name() for
> accessing both the VMA and the MM?
>
> And on top of that, we'd have to open-code/change anything that
> currently uses the ->vm_ops (such as vma_kernel_pagesize()), because
> between us checking the type of the VMA and later accessing ->vm_ops,
> the VMA object could have been reallocated with different ->vm_ops?
>
> I still don't like the idea of pushing the complexity of "the VMA
> contents are unstable, and every read from the VMA object may return
> data about a logically different VMA" down into these various helpers.
> In my mind, adding "the VMA contents can be an internally consistent
> snapshot" to the API contract for these helpers constrains the
> weirdness a bit more - though I guess the helpers will
> still need READ_ONCE() for accessing properties of the MM either
> way...
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-04-29 18:54 ` Jann Horn
2025-04-29 20:33 ` Suren Baghdasaryan
@ 2025-05-05 13:50 ` Christian Brauner
2025-05-05 16:38 ` Suren Baghdasaryan
1 sibling, 1 reply; 39+ messages in thread
From: Christian Brauner @ 2025-05-05 13:50 UTC (permalink / raw)
To: Jann Horn
Cc: Suren Baghdasaryan, Al Viro, linux-fsdevel, akpm, Liam.Howlett,
lorenzo.stoakes, david, vbabka, peterx, hannes, mhocko, paulmck,
shuah, adobriyan, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-mm,
linux-kselftest
On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
> On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > Hi!
> > >
> > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this; somehow I
> > > forgot all about that...)
> >
> > Does this fact affect your previous comments? Just want to make sure
> > I'm not missing something...
>
> When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> down a VMA, and using get_file_rcu() for the lockless lookup, I did
> not realize that you could actually also race with all the other
> places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> did not think about whether any of those code paths might leave a VMA
> with a dangling ->vm_file pointer.
>
> I guess maybe that means you really do need to do the lookup from the
> copied data, as you did in your patch; and that might require calling
> get_file_active() on the copied ->vm_file pointer (instead of
> get_file_rcu()), even though I think that is not really how
> get_file_active() is supposed to be used (it's supposed to be used
> when you know the original file hasn't been freed yet). Really what
I think it's fine for get_file_active() to be used in this way. That
->vm_file pointer usage should get a fat comment above it explaining why
what you're doing is safe.
> you'd want for that is basically a raw __get_file_rcu(), but that is
> static and I think Christian wouldn't want to expose more of these
> internals outside VFS...
Yeah, no. I don't want that to be usable outside of that file.
> (In that case, all the stuff below about get_file_rcu() would be moot.)
>
> Or you could pepper WRITE_ONCE() over all the places that write
> ->vm_file, and ensure that ->vm_file is always NULLed before its
> reference is dropped... but that seems a bit more ugly to me.
>
> > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn <jannh@google.com> wrote:
> > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > pin pointer fields used when generating the output (currently only
> > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > the mmap_lock for reading during such contention, we do that only
> > > > > > momentarily, to record the new mm_wr_seq counter. This change is
> > > > > > designed to reduce
> > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > > from blocking address space updates.
> > > > > [...]
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > [...]
> > > > > > +/*
> > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > > + * show_map_vma.
> > > > > > + */
> > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +	struct vm_area_struct *copy = &priv->vma_copy;
> > > > > > +	int ret = -EAGAIN;
> > > > > > +
> > > > > > +	memcpy(copy, vma, sizeof(*vma));
> > > > > > +	if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > > > +		goto out;
> > > > >
> > > > > I think this uses get_file_rcu() in a different way than intended.
> > > > >
> > > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > > pointer which always points to a file with a non-zero refcount (except
> > > > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > > > it observes a zero refcount, it assumes that the pointer must have
> > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > > > pointer that points to a file with zero refcount, which I think can
> > > > > happen with this patch, will cause an endless loop.
> > > > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > > > still behave nicely and not spuriously return NULL when the file* is
> > > > > concurrently updated to point to another file*; that's what that loop
> > > > > is for.)
> > > >
> > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > > > checking the return value of get_file_rcu() and retrying speculation
> > > > if it changed.
> > >
> > > I think you could probably still end up looping endlessly in get_file_rcu().
>
> (Just to be clear: What I meant here is that get_file_rcu() loops
> *internally*; get_file_rcu() is not guaranteed to ever return if the
> pointed-to file has a zero refcount.)
>
> > By "retrying speculation" I meant it in the sense of
> > get_vma_snapshot() retry when it takes the mmap_read_lock and then
> > does mmap_lock_speculate_try_begin to restart speculation. I'm also
> > thinking about Liam's concern of guaranteeing forward progress for the
> > reader. Thinking maybe I should not drop mmap_read_lock immediately on
> > contention but generate some output (one vma or one page worth of
> > vmas) before dropping mmap_read_lock and proceeding with speculation.
>
> Hm, yeah, I guess you need that for forward progress...
>
> > > > > (If my understanding is correct, maybe we should document that more
> > > > > explicitly...)
> > > >
> > > > Good point. I'll add comments for get_file_rcu() as a separate patch.
> > > >
> > > > >
> > > > > Also, I think you are introducing an implicit assumption that
> > > > > remove_vma() does not NULL out the ->vm_file pointer (because that
> > > > > could cause tearing and could theoretically lead to a torn pointer
> > > > > being accessed here).
> > > > >
> > > > > One alternative might be to change the paths that drop references to
> > > > > vma->vm_file (search for vma_close to find them) such that they first
> > > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > > > > maybe with a new helper like this:
> > > > >
> > > > > static void vma_fput(struct vm_area_struct *vma)
> > > > > {
> > > > > 	struct file *file = vma->vm_file;
> > > > >
> > > > > 	if (file) {
> > > > > 		WRITE_ONCE(vma->vm_file, NULL);
> > > > > 		fput(file);
> > > > > 	}
> > > > > }
> > > > >
> > > > > Then on the lockless lookup path you could use get_file_rcu() on the
> > > > > ->vm_file pointer _of the original VMA_, and store the returned file*
> > > > > into copy->vm_file.
> > > >
> > > > Ack. Except for storing the return value of get_file_rcu(). I think
> > > > once we detect that get_file_rcu() returns a different file we should
> > > > bail out and retry. The change in file is an indication that the vma
> > > > got changed from under us, so whatever we have is stale.
> > >
> > > What does "different file" mean here - what file* would you compare
> > > the returned one against?
> >
> > Inside get_vma_snapshot() I would pass the original vma->vm_file to
> > get_file_rcu() and check if it returns the same value. If the value
> > got changed we jump to /* Address space got modified, vma might be
> > stale. Re-lock and retry. */ section. That should work, right?
>
> Where do you get an "original vma->vm_file" from?
>
> To be clear, get_file_rcu(p) returns one of the values that *p had
> while get_file_rcu(p) is running.
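Concretely, the snapshot path with get_file_active() would then look
something like this (a sketch based on the get_vma_snapshot() hunk
quoted above; the "retry" label stands in for the re-lock-and-retry
path):

	memcpy(copy, vma, sizeof(*vma));
	if (copy->vm_file) {
		/*
		 * The snapshot was taken under RCU from SLAB_TYPESAFE_BY_RCU
		 * memory, so the file it points to may already have a zero
		 * refcount; get_file_active() returns NULL in that case
		 * instead of looping the way get_file_rcu() can.
		 */
		copy->vm_file = get_file_active(&copy->vm_file);
		if (!copy->vm_file)
			goto retry;	/* address space changed: re-lock and retry */
	}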
* Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
2025-05-05 13:50 ` Christian Brauner
@ 2025-05-05 16:38 ` Suren Baghdasaryan
0 siblings, 0 replies; 39+ messages in thread
From: Suren Baghdasaryan @ 2025-05-05 16:38 UTC (permalink / raw)
To: Christian Brauner
Cc: Jann Horn, Al Viro, linux-fsdevel, akpm, Liam.Howlett,
lorenzo.stoakes, david, vbabka, peterx, hannes, mhocko, paulmck,
shuah, adobriyan, josef, yebin10, linux, willy, osalvador, andrii,
ryan.roberts, christophe.leroy, linux-kernel, linux-mm,
linux-kselftest
On Mon, May 5, 2025 at 6:50 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
> > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > [...]
> > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> > down a VMA, and using get_file_rcu() for the lockless lookup, I did
> > not realize that you could actually also race with all the other
> > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> > did not think about whether any of those code paths might leave a VMA
> > with a dangling ->vm_file pointer.
> >
> > I guess maybe that means you really do need to do the lookup from the
> > copied data, as you did in your patch; and that might require calling
> > get_file_active() on the copied ->vm_file pointer (instead of
> > get_file_rcu()), even though I think that is not really how
> > get_file_active() is supposed to be used (it's supposed to be used
> > when you know the original file hasn't been freed yet). Really what
>
> I think it's fine for get_file_active() to be used in this way. That
> ->vm_file pointer usage should get a fat comment above it explaining why
> what you're doing is safe.
Got it. Will use it in my next version. Thanks!
Thread overview: 39+ messages
2025-04-18 17:49 [PATCH v3 0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
2025-04-18 18:30 ` Lorenzo Stoakes
2025-04-18 19:31 ` Suren Baghdasaryan
2025-04-18 19:55 ` Lorenzo Stoakes
2025-04-18 20:03 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 5/8] selftests/proc: add verbose mode for tests to facilitate debugging Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 6/8] mm: make vm_area_struct anon_name field RCU-safe Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU Suren Baghdasaryan
2025-04-22 22:48 ` Andrii Nakryiko
2025-04-23 21:49 ` Suren Baghdasaryan
2025-04-23 22:06 ` Andrii Nakryiko
2025-04-24 0:23 ` Liam R. Howlett
2025-04-24 15:20 ` Suren Baghdasaryan
2025-04-24 16:03 ` Andrii Nakryiko
2025-04-24 16:42 ` Liam R. Howlett
2025-04-24 17:38 ` Suren Baghdasaryan
2025-04-24 17:44 ` Andrii Nakryiko
2025-04-29 15:39 ` Jann Horn
2025-04-29 17:09 ` Suren Baghdasaryan
2025-04-29 17:20 ` Jann Horn
2025-04-29 18:04 ` Suren Baghdasaryan
2025-04-29 18:54 ` Jann Horn
2025-04-29 20:33 ` Suren Baghdasaryan
2025-04-29 20:43 ` Jann Horn
2025-05-05 13:50 ` Christian Brauner
2025-05-05 16:38 ` Suren Baghdasaryan
2025-04-18 17:49 ` [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl " Suren Baghdasaryan
2025-04-22 22:54 ` Andrii Nakryiko
2025-04-23 21:53 ` Suren Baghdasaryan
2025-04-29 15:55 ` Jann Horn
2025-04-29 17:14 ` Suren Baghdasaryan
2025-04-29 17:24 ` Jann Horn
2025-05-01 22:10 ` Andrii Nakryiko
2025-05-02 15:11 ` Jann Horn
2025-05-02 16:16 ` Suren Baghdasaryan