[PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
@ 2025-06-04 23:11 Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 1/7] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.

Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.

Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory.  At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test.  The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.011     0.008     0.455
    0.011     0.008     0.472
    0.011     0.008     0.535
    0.011     0.009     0.545
    ...
    0.011     0.014     2.875
    0.011     0.014     2.913
    0.011     0.014     3.007
    0.011     0.015     3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.006     0.005     0.036
    0.006     0.005     0.039
    0.006     0.005     0.039
    0.006     0.005     0.039
    ...
    0.006     0.006     0.403
    0.006     0.006     0.474
    0.006     0.006     0.479
    0.006     0.006     0.498

The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's event more important
now to test for unexpected inconsistencies.

[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test

Suren Baghdasaryan (7):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
    modified
  selftests/proc: add verbose more for tests to facilitate debugging
  mm/maps: read proc/pid/maps under per-vma lock
  mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                         |   6 +
 fs/proc/task_mmu.c                         | 233 +++++-
 tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
 3 files changed, 1011 insertions(+), 21 deletions(-)

base-commit: 2d0c297637e7d59771c1533847c666cdddc19884
-- 
2.49.0.1266.g31b7d2e469-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 1/7] selftests/proc: add /proc/pid/maps tearing from vma split test
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
@ 2025-06-04 23:11 ` Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 2/7] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

The /proc/pid/maps file is generated page by page, with the mmap_lock
released between pages.  This can lead to inconsistent reads if the
underlying vmas are concurrently modified. For instance, if a vma split
or merge occurs at a page boundary while /proc/pid/maps is being read,
the same vma might be seen twice: once before and once after the change.
This duplication is considered acceptable for userspace handling.
However, observing a "hole" where a vma should be (e.g., due to a vma
being replaced and the space temporarily being empty) is unacceptable.

Implement a test that:
1. Forks a child process which continuously modifies its address space,
specifically targeting a vma at the boundary between two pages.
2. The parent process repeatedly reads the child's /proc/pid/maps.
3. The parent process checks the last vma of the first page and
the first vma of the second page for consistency, looking for the
effects of vma splits or merges.

The test duration is configurable via the -d command-line parameter
in seconds to increase the likelihood of catching the race condition.
The default test duration is 5 seconds.

Example Command: proc-pid-vm -d 10

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-pid-vm.c | 430 ++++++++++++++++++++-
 1 file changed, 429 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index d04685771952..6e3f06376a1f 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -27,6 +27,7 @@
 #undef NDEBUG
 #include <assert.h>
 #include <errno.h>
+#include <pthread.h>
 #include <sched.h>
 #include <signal.h>
 #include <stdbool.h>
@@ -34,6 +35,7 @@
 #include <stdio.h>
 #include <string.h>
 #include <stdlib.h>
+#include <sys/mman.h>
 #include <sys/mount.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -70,6 +72,8 @@ static void make_private_tmp(void)
 	}
 }
 
+static unsigned long test_duration_sec = 5UL;
+static int page_size;
 static pid_t pid = -1;
 static void ate(void)
 {
@@ -281,11 +285,431 @@ static void vsyscall(void)
 	}
 }
 
-int main(void)
+/* /proc/pid/maps parsing routines */
+struct page_content {
+	char *data;
+	ssize_t size;
+};
+
+#define LINE_MAX_SIZE		256
+
+struct line_content {
+	char text[LINE_MAX_SIZE];
+	unsigned long start_addr;
+	unsigned long end_addr;
+};
+
+static void read_two_pages(int maps_fd, struct page_content *page1,
+			   struct page_content *page2)
+{
+	ssize_t  bytes_read;
+
+	assert(lseek(maps_fd, 0, SEEK_SET) >= 0);
+	bytes_read = read(maps_fd, page1->data, page_size);
+	assert(bytes_read > 0 && bytes_read < page_size);
+	page1->size = bytes_read;
+
+	bytes_read = read(maps_fd, page2->data, page_size);
+	assert(bytes_read > 0 && bytes_read < page_size);
+	page2->size = bytes_read;
+}
+
+static void copy_first_line(struct page_content *page, char *first_line)
+{
+	char *pos = strchr(page->data, '\n');
+
+	strncpy(first_line, page->data, pos - page->data);
+	first_line[pos - page->data] = '\0';
+}
+
+static void copy_last_line(struct page_content *page, char *last_line)
+{
+	/* Get the last line in the first page */
+	const char *end = page->data + page->size - 1;
+	/* skip last newline */
+	const char *pos = end - 1;
+
+	/* search previous newline */
+	while (pos[-1] != '\n')
+		pos--;
+	strncpy(last_line, pos, end - pos);
+	last_line[end - pos] = '\0';
+}
+
+/* Read the last line of the first page and the first line of the second page */
+static void read_boundary_lines(int maps_fd, struct page_content *page1,
+				struct page_content *page2,
+				struct line_content *last_line,
+				struct line_content *first_line)
+{
+	read_two_pages(maps_fd, page1, page2);
+
+	copy_last_line(page1, last_line->text);
+	copy_first_line(page2, first_line->text);
+
+	assert(sscanf(last_line->text, "%lx-%lx", &last_line->start_addr,
+		      &last_line->end_addr) == 2);
+	assert(sscanf(first_line->text, "%lx-%lx", &first_line->start_addr,
+		      &first_line->end_addr) == 2);
+}
+
+/* Thread synchronization routines */
+enum test_state {
+	INIT,
+	CHILD_READY,
+	PARENT_READY,
+	SETUP_READY,
+	SETUP_MODIFY_MAPS,
+	SETUP_MAPS_MODIFIED,
+	SETUP_RESTORE_MAPS,
+	SETUP_MAPS_RESTORED,
+	TEST_READY,
+	TEST_DONE,
+};
+
+struct vma_modifier_info;
+
+typedef void (*vma_modifier_op)(const struct vma_modifier_info *mod_info);
+typedef void (*vma_mod_result_check_op)(struct line_content *mod_last_line,
+					struct line_content *mod_first_line,
+					struct line_content *restored_last_line,
+					struct line_content *restored_first_line);
+
+struct vma_modifier_info {
+	int vma_count;
+	void *addr;
+	int prot;
+	void *next_addr;
+	vma_modifier_op vma_modify;
+	vma_modifier_op vma_restore;
+	vma_mod_result_check_op vma_mod_check;
+	pthread_mutex_t sync_lock;
+	pthread_cond_t sync_cond;
+	enum test_state curr_state;
+	bool exit;
+	void *child_mapped_addr[];
+};
+
+static void wait_for_state(struct vma_modifier_info *mod_info, enum test_state state)
+{
+	pthread_mutex_lock(&mod_info->sync_lock);
+	while (mod_info->curr_state != state)
+		pthread_cond_wait(&mod_info->sync_cond, &mod_info->sync_lock);
+	pthread_mutex_unlock(&mod_info->sync_lock);
+}
+
+static void signal_state(struct vma_modifier_info *mod_info, enum test_state state)
+{
+	pthread_mutex_lock(&mod_info->sync_lock);
+	mod_info->curr_state = state;
+	pthread_cond_signal(&mod_info->sync_cond);
+	pthread_mutex_unlock(&mod_info->sync_lock);
+}
+
+/* VMA modification routines */
+static void *child_vma_modifier(struct vma_modifier_info *mod_info)
+{
+	int prot = PROT_READ | PROT_WRITE;
+	int i;
+
+	for (i = 0; i < mod_info->vma_count; i++) {
+		mod_info->child_mapped_addr[i] = mmap(NULL, page_size * 3, prot,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		assert(mod_info->child_mapped_addr[i] != MAP_FAILED);
+		/* change protection in adjacent maps to prevent merging */
+		prot ^= PROT_WRITE;
+	}
+	signal_state(mod_info, CHILD_READY);
+	wait_for_state(mod_info, PARENT_READY);
+	while (true) {
+		signal_state(mod_info, SETUP_READY);
+		wait_for_state(mod_info, SETUP_MODIFY_MAPS);
+		if (mod_info->exit)
+			break;
+
+		mod_info->vma_modify(mod_info);
+		signal_state(mod_info, SETUP_MAPS_MODIFIED);
+		wait_for_state(mod_info, SETUP_RESTORE_MAPS);
+		mod_info->vma_restore(mod_info);
+		signal_state(mod_info, SETUP_MAPS_RESTORED);
+
+		wait_for_state(mod_info, TEST_READY);
+		while (mod_info->curr_state != TEST_DONE) {
+			mod_info->vma_modify(mod_info);
+			mod_info->vma_restore(mod_info);
+		}
+	}
+	for (i = 0; i < mod_info->vma_count; i++)
+		munmap(mod_info->child_mapped_addr[i], page_size * 3);
+
+	return NULL;
+}
+
+static void stop_vma_modifier(struct vma_modifier_info *mod_info)
+{
+	wait_for_state(mod_info, SETUP_READY);
+	mod_info->exit = true;
+	signal_state(mod_info, SETUP_MODIFY_MAPS);
+}
+
+static void capture_mod_pattern(int maps_fd,
+				struct vma_modifier_info *mod_info,
+				struct page_content *page1,
+				struct page_content *page2,
+				struct line_content *last_line,
+				struct line_content *first_line,
+				struct line_content *mod_last_line,
+				struct line_content *mod_first_line,
+				struct line_content *restored_last_line,
+				struct line_content *restored_first_line)
+{
+	signal_state(mod_info, SETUP_MODIFY_MAPS);
+	wait_for_state(mod_info, SETUP_MAPS_MODIFIED);
+
+	/* Copy last line of the first page and first line of the last page */
+	read_boundary_lines(maps_fd, page1, page2, mod_last_line, mod_first_line);
+
+	signal_state(mod_info, SETUP_RESTORE_MAPS);
+	wait_for_state(mod_info, SETUP_MAPS_RESTORED);
+
+	/* Copy last line of the first page and first line of the last page */
+	read_boundary_lines(maps_fd, page1, page2, restored_last_line, restored_first_line);
+
+	mod_info->vma_mod_check(mod_last_line, mod_first_line,
+				restored_last_line, restored_first_line);
+
+	/*
+	 * The content of these lines after modify+resore should be the same
+	 * as the original.
+	 */
+	assert(strcmp(restored_last_line->text, last_line->text) == 0);
+	assert(strcmp(restored_first_line->text, first_line->text) == 0);
+}
+
+static inline void split_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC,
+		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+		    -1, 0) != MAP_FAILED);
+}
+
+static inline void merge_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mmap(mod_info->addr, page_size, mod_info->prot,
+		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+		    -1, 0) != MAP_FAILED);
+}
+
+static inline void check_split_result(struct line_content *mod_last_line,
+				      struct line_content *mod_first_line,
+				      struct line_content *restored_last_line,
+				      struct line_content *restored_first_line)
+{
+	/* Make sure vmas at the boundaries are changing */
+	assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+	assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
+}
+
+static void test_maps_tearing_from_split(int maps_fd,
+					 struct vma_modifier_info *mod_info,
+					 struct page_content *page1,
+					 struct page_content *page2,
+					 struct line_content *last_line,
+					 struct line_content *first_line)
+{
+	struct line_content split_last_line;
+	struct line_content split_first_line;
+	struct line_content restored_last_line;
+	struct line_content restored_first_line;
+
+	wait_for_state(mod_info, SETUP_READY);
+
+	/* re-read the file to avoid using stale data from previous test */
+	read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+	mod_info->vma_modify = split_vma;
+	mod_info->vma_restore = merge_vma;
+	mod_info->vma_mod_check = check_split_result;
+
+	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+			    &split_last_line, &split_first_line,
+			    &restored_last_line, &restored_first_line);
+
+	/* Now start concurrent modifications for test_duration_sec */
+	signal_state(mod_info, TEST_READY);
+
+	struct line_content new_last_line;
+	struct line_content new_first_line;
+	struct timespec start_ts, end_ts;
+
+	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	do {
+		bool last_line_changed;
+		bool first_line_changed;
+
+		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+		/* Check if we read vmas after split */
+		if (!strcmp(new_last_line.text, split_last_line.text)) {
+			/*
+			 * The vmas should be consistent with split results,
+			 * however if vma was concurrently restored after a
+			 * split, it can be reported twice (first the original
+			 * split one, then the same vma but extended after the
+			 * merge) because we found it as the next vma again.
+			 * In that case new first line will be the same as the
+			 * last restored line.
+			 */
+			assert(!strcmp(new_first_line.text, split_first_line.text) ||
+			       !strcmp(new_first_line.text, restored_last_line.text));
+		} else {
+			/* The vmas should be consistent with merge results */
+			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+			       !strcmp(new_first_line.text, restored_first_line.text));
+		}
+		/*
+		 * First and last lines should change in unison. If the last
+		 * line changed then the first line should change as well and
+		 * vice versa.
+		 */
+		last_line_changed = strcmp(new_last_line.text, last_line->text) != 0;
+		first_line_changed = strcmp(new_first_line.text, first_line->text) != 0;
+		assert(last_line_changed == first_line_changed);
+
+		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+	/* Signal the modifyer thread to stop and wait until it exits */
+	signal_state(mod_info, TEST_DONE);
+}
+
+static int test_maps_tearing(void)
+{
+	struct vma_modifier_info *mod_info;
+	pthread_mutexattr_t mutex_attr;
+	pthread_condattr_t cond_attr;
+	int shared_mem_size;
+	char fname[32];
+	int vma_count;
+	int maps_fd;
+	int status;
+	pid_t pid;
+
+	/*
+	 * Have to map enough vmas for /proc/pid/maps to containt more than one
+	 * page worth of vmas. Assume at least 32 bytes per line in maps output
+	 */
+	vma_count = page_size / 32 + 1;
+	shared_mem_size = sizeof(struct vma_modifier_info) + vma_count * sizeof(void *);
+
+	/* map shared memory for communication with the child process */
+	mod_info = (struct vma_modifier_info *)mmap(NULL, shared_mem_size,
+		    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+	assert(mod_info != MAP_FAILED);
+
+	/* Initialize shared members */
+	pthread_mutexattr_init(&mutex_attr);
+	pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
+	assert(!pthread_mutex_init(&mod_info->sync_lock, &mutex_attr));
+	pthread_condattr_init(&cond_attr);
+	pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED);
+	assert(!pthread_cond_init(&mod_info->sync_cond, &cond_attr));
+	mod_info->vma_count = vma_count;
+	mod_info->curr_state = INIT;
+	mod_info->exit = false;
+
+	pid = fork();
+	if (!pid) {
+		/* Child process */
+		child_vma_modifier(mod_info);
+		return 0;
+	}
+
+	sprintf(fname, "/proc/%d/maps", pid);
+	maps_fd = open(fname, O_RDONLY);
+	assert(maps_fd != -1);
+
+	/* Wait for the child to map the VMAs */
+	wait_for_state(mod_info, CHILD_READY);
+
+	/* Read first two pages */
+	struct page_content page1;
+	struct page_content page2;
+
+	page1.data = malloc(page_size);
+	assert(page1.data);
+	page2.data = malloc(page_size);
+	assert(page2.data);
+
+	struct line_content last_line;
+	struct line_content first_line;
+
+	read_boundary_lines(maps_fd, &page1, &page2, &last_line, &first_line);
+
+	/*
+	 * Find the addresses corresponding to the last line in the first page
+	 * and the first line in the last page.
+	 */
+	mod_info->addr = NULL;
+	mod_info->next_addr = NULL;
+	for (int i = 0; i < mod_info->vma_count; i++) {
+		if (mod_info->child_mapped_addr[i] == (void *)last_line.start_addr) {
+			mod_info->addr = mod_info->child_mapped_addr[i];
+			mod_info->prot = PROT_READ;
+			/* Even VMAs have write permission */
+			if ((i % 2) == 0)
+				mod_info->prot |= PROT_WRITE;
+		} else if (mod_info->child_mapped_addr[i] == (void *)first_line.start_addr) {
+			mod_info->next_addr = mod_info->child_mapped_addr[i];
+		}
+
+		if (mod_info->addr && mod_info->next_addr)
+			break;
+	}
+	assert(mod_info->addr && mod_info->next_addr);
+
+	signal_state(mod_info, PARENT_READY);
+
+	test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2,
+				     &last_line, &first_line);
+
+	stop_vma_modifier(mod_info);
+
+	free(page2.data);
+	free(page1.data);
+
+	for (int i = 0; i < vma_count; i++)
+		munmap(mod_info->child_mapped_addr[i], page_size);
+	close(maps_fd);
+	waitpid(pid, &status, 0);
+	munmap(mod_info, shared_mem_size);
+
+	return 0;
+}
+
+int usage(void)
+{
+	fprintf(stderr, "Userland /proc/pid/{s}maps test cases\n");
+	fprintf(stderr, "  -d: Duration for time-consuming tests\n");
+	fprintf(stderr, "  -h: Help screen\n");
+	exit(-1);
+}
+
+int main(int argc, char **argv)
 {
 	int pipefd[2];
 	int exec_fd;
+	int opt;
+
+	while ((opt = getopt(argc, argv, "d:h")) != -1) {
+		if (opt == 'd')
+			test_duration_sec = strtoul(optarg, NULL, 0);
+		else if (opt == 'h')
+			usage();
+	}
 
+	page_size = sysconf(_SC_PAGESIZE);
 	vsyscall();
 	switch (g_vsyscall) {
 	case 0:
@@ -578,6 +1002,10 @@ int main(void)
 		assert(err == -ENOENT);
 	}
 
+	/* Test tearing in /proc/$PID/maps */
+	if (test_maps_tearing())
+		return 1;
+
 	return 0;
 }
 #else
-- 
2.49.0.1266.g31b7d2e469-goog



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 2/7] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 1/7] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
@ 2025-06-04 23:11 ` Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 3/7] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Test that /proc/pid/maps does not report unexpected holes in the address
space when a vma at the edge of the page is being concurrently remapped.
This remapping results in the vma shrinking and expanding from  under the
reader. We should always see either shrunk or expanded (original) version
of the vma.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-pid-vm.c | 83 ++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index 6e3f06376a1f..39842e4ec45f 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -583,6 +583,86 @@ static void test_maps_tearing_from_split(int maps_fd,
 	signal_state(mod_info, TEST_DONE);
 }
 
+static inline void shrink_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mremap(mod_info->addr, page_size * 3, page_size, 0) != MAP_FAILED);
+}
+
+static inline void expand_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mremap(mod_info->addr, page_size, page_size * 3, 0) != MAP_FAILED);
+}
+
+static inline void check_shrink_result(struct line_content *mod_last_line,
+				       struct line_content *mod_first_line,
+				       struct line_content *restored_last_line,
+				       struct line_content *restored_first_line)
+{
+	/* Make sure only the last vma of the first page is changing */
+	assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+	assert(strcmp(mod_first_line->text, restored_first_line->text) == 0);
+}
+
+static void test_maps_tearing_from_resize(int maps_fd,
+					  struct vma_modifier_info *mod_info,
+					  struct page_content *page1,
+					  struct page_content *page2,
+					  struct line_content *last_line,
+					  struct line_content *first_line)
+{
+	struct line_content shrunk_last_line;
+	struct line_content shrunk_first_line;
+	struct line_content restored_last_line;
+	struct line_content restored_first_line;
+
+	wait_for_state(mod_info, SETUP_READY);
+
+	/* re-read the file to avoid using stale data from previous test */
+	read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+	mod_info->vma_modify = shrink_vma;
+	mod_info->vma_restore = expand_vma;
+	mod_info->vma_mod_check = check_shrink_result;
+
+	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+			    &shrunk_last_line, &shrunk_first_line,
+			    &restored_last_line, &restored_first_line);
+
+	/* Now start concurrent modifications for test_duration_sec */
+	signal_state(mod_info, TEST_READY);
+
+	struct line_content new_last_line;
+	struct line_content new_first_line;
+	struct timespec start_ts, end_ts;
+
+	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	do {
+		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+		/* Check if we read vmas after shrinking it */
+		if (!strcmp(new_last_line.text, shrunk_last_line.text)) {
+			/*
+			 * The vmas should be consistent with shrunk results,
+			 * however if the vma was concurrently restored, it
+			 * can be reported twice (first as shrunk one, then
+			 * as restored one) because we found it as the next vma
+			 * again. In that case new first line will be the same
+			 * as the last restored line.
+			 */
+			assert(!strcmp(new_first_line.text, shrunk_first_line.text) ||
+			       !strcmp(new_first_line.text, restored_last_line.text));
+		} else {
+			/* The vmas should be consistent with the original/resored state */
+			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+			       !strcmp(new_first_line.text, restored_first_line.text));
+		}
+		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+	/* Signal the modifyer thread to stop and wait until it exits */
+	signal_state(mod_info, TEST_DONE);
+}
+
 static int test_maps_tearing(void)
 {
 	struct vma_modifier_info *mod_info;
@@ -674,6 +754,9 @@ static int test_maps_tearing(void)
 	test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2,
 				     &last_line, &first_line);
 
+	test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
+				      &last_line, &first_line);
+
 	stop_vma_modifier(mod_info);
 
 	free(page2.data);
-- 
2.49.0.1266.g31b7d2e469-goog



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 3/7] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 1/7] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 2/7] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
@ 2025-06-04 23:11 ` Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 4/7] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Test that /proc/pid/maps does not report unexpected holes in the address
space when we concurrently remap a part of a vma into the middle of
another vma. This remapping results in the destination vma being split
into three parts and the part in the middle being patched back from,
all done concurrently from under the reader. We should always see either
original vma or the split one with no holes.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-pid-vm.c | 92 ++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index 39842e4ec45f..1aef2db7e893 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -663,6 +663,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
 	signal_state(mod_info, TEST_DONE);
 }
 
+static inline void remap_vma(const struct vma_modifier_info *mod_info)
+{
+	/*
+	 * Remap the last page of the next vma into the middle of the vma.
+	 * This splits the current vma and the first and middle parts (the
+	 * parts at lower addresses) become the last vma objserved in the
+	 * first page and the first vma observed in the last page.
+	 */
+	assert(mremap(mod_info->next_addr + page_size * 2, page_size,
+		      page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
+		      mod_info->addr + page_size) != MAP_FAILED);
+}
+
+static inline void patch_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(!mprotect(mod_info->addr + page_size, page_size,
+			 mod_info->prot));
+}
+
+static inline void check_remap_result(struct line_content *mod_last_line,
+				      struct line_content *mod_first_line,
+				      struct line_content *restored_last_line,
+				      struct line_content *restored_first_line)
+{
+	/* Make sure vmas at the boundaries are changing */
+	assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+	assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
+}
+
+static void test_maps_tearing_from_remap(int maps_fd,
+				struct vma_modifier_info *mod_info,
+				struct page_content *page1,
+				struct page_content *page2,
+				struct line_content *last_line,
+				struct line_content *first_line)
+{
+	struct line_content remapped_last_line;
+	struct line_content remapped_first_line;
+	struct line_content restored_last_line;
+	struct line_content restored_first_line;
+
+	wait_for_state(mod_info, SETUP_READY);
+
+	/* re-read the file to avoid using stale data from previous test */
+	read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+	mod_info->vma_modify = remap_vma;
+	mod_info->vma_restore = patch_vma;
+	mod_info->vma_mod_check = check_remap_result;
+
+	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+			    &remapped_last_line, &remapped_first_line,
+			    &restored_last_line, &restored_first_line);
+
+	/* Now start concurrent modifications for test_duration_sec */
+	signal_state(mod_info, TEST_READY);
+
+	struct line_content new_last_line;
+	struct line_content new_first_line;
+	struct timespec start_ts, end_ts;
+
+	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	do {
+		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+		/* Check if we read vmas after remapping it */
+		if (!strcmp(new_last_line.text, remapped_last_line.text)) {
+			/*
+			 * The vmas should be consistent with remap results,
+			 * however if the vma was concurrently restored, it
+			 * can be reported twice (first as split one, then
+			 * as restored one) because we found it as the next vma
+			 * again. In that case new first line will be the same
+			 * as the last restored line.
+			 */
+			assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
+			       !strcmp(new_first_line.text, restored_last_line.text));
+		} else {
+			/* The vmas should be consistent with the original/resored state */
+			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+			       !strcmp(new_first_line.text, restored_first_line.text));
+		}
+		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+	/* Signal the modifyer thread to stop and wait until it exits */
+	signal_state(mod_info, TEST_DONE);
+}
+
 static int test_maps_tearing(void)
 {
 	struct vma_modifier_info *mod_info;
@@ -757,6 +846,9 @@ static int test_maps_tearing(void)
 	test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
 				      &last_line, &first_line);
 
+	test_maps_tearing_from_remap(maps_fd, mod_info, &page1, &page2,
+				     &last_line, &first_line);
+
 	stop_vma_modifier(mod_info);
 
 	free(page2.data);
-- 
2.49.0.1266.g31b7d2e469-goog



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 4/7] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2025-06-04 23:11 ` [PATCH v4 3/7] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
@ 2025-06-04 23:11 ` Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 5/7] selftests/proc: add verbose more for tests to facilitate debugging Suren Baghdasaryan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Extend /proc/pid/maps tearing test to verify PROCMAP_QUERY ioctl operation
correctness while the vma is being concurrently modified.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-pid-vm.c | 60 ++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index 1aef2db7e893..b582f40851fb 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -486,6 +486,21 @@ static void capture_mod_pattern(int maps_fd,
 	assert(strcmp(restored_first_line->text, first_line->text) == 0);
 }
 
+static void query_addr_at(int maps_fd, void *addr,
+			  unsigned long *vma_start, unsigned long *vma_end)
+{
+	struct procmap_query q;
+
+	memset(&q, 0, sizeof(q));
+	q.size = sizeof(q);
+	/* Find the VMA at the split address */
+	q.query_addr = (unsigned long long)addr;
+	q.query_flags = 0;
+	assert(!ioctl(maps_fd, PROCMAP_QUERY, &q));
+	*vma_start = q.vma_start;
+	*vma_end = q.vma_end;
+}
+
 static inline void split_vma(const struct vma_modifier_info *mod_info)
 {
 	assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC,
@@ -546,6 +561,8 @@ static void test_maps_tearing_from_split(int maps_fd,
 	do {
 		bool last_line_changed;
 		bool first_line_changed;
+		unsigned long vma_start;
+		unsigned long vma_end;
 
 		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
 
@@ -576,6 +593,19 @@ static void test_maps_tearing_from_split(int maps_fd,
 		first_line_changed = strcmp(new_first_line.text, first_line->text) != 0;
 		assert(last_line_changed == first_line_changed);
 
+		/* Check if PROCMAP_QUERY ioclt() finds the right VMA */
+		query_addr_at(maps_fd, mod_info->addr + page_size,
+			      &vma_start, &vma_end);
+		/*
+		 * The vma at the split address can be either the same as
+		 * original one (if read before the split) or the same as the
+		 * first line in the second page (if read after the split).
+		 */
+		assert((vma_start == last_line->start_addr &&
+			vma_end == last_line->end_addr) ||
+		       (vma_start == split_first_line.start_addr &&
+			vma_end == split_first_line.end_addr));
+
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
@@ -637,6 +667,9 @@ static void test_maps_tearing_from_resize(int maps_fd,
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
 	do {
+		unsigned long vma_start;
+		unsigned long vma_end;
+
 		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
 
 		/* Check if we read vmas after shrinking it */
@@ -656,6 +689,17 @@ static void test_maps_tearing_from_resize(int maps_fd,
 			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
 			       !strcmp(new_first_line.text, restored_first_line.text));
 		}
+
+		/* Check if PROCMAP_QUERY ioclt() finds the right VMA */
+		query_addr_at(maps_fd, mod_info->addr, &vma_start, &vma_end);
+		/*
+		 * The vma should stay at the same address and have either the
+		 * original size of 3 pages or 1 page if read after shrinking.
+		 */
+		assert(vma_start == last_line->start_addr &&
+		       (vma_end - vma_start == page_size * 3 ||
+			vma_end - vma_start == page_size));
+
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
@@ -726,6 +770,9 @@ static void test_maps_tearing_from_remap(int maps_fd,
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
 	do {
+		unsigned long vma_start;
+		unsigned long vma_end;
+
 		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
 
 		/* Check if we read vmas after remapping it */
@@ -745,6 +792,19 @@ static void test_maps_tearing_from_remap(int maps_fd,
 			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
 			       !strcmp(new_first_line.text, restored_first_line.text));
 		}
+
+		/* Check if PROCMAP_QUERY ioclt() finds the right VMA */
+		query_addr_at(maps_fd, mod_info->addr + page_size, &vma_start, &vma_end);
+		/*
+		 * The vma should either stay at the same address and have the
+		 * original size of 3 pages or we should find the remapped vma
+		 * at the remap destination address with size of 1 page.
+		 */
+		assert((vma_start == last_line->start_addr &&
+			vma_end - vma_start == page_size * 3) ||
+		       (vma_start == last_line->start_addr + page_size &&
+			vma_end - vma_start == page_size));
+
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
-- 
2.49.0.1266.g31b7d2e469-goog



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 5/7] selftests/proc: add verbose more for tests to facilitate debugging
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2025-06-04 23:11 ` [PATCH v4 4/7] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
@ 2025-06-04 23:11 ` Suren Baghdasaryan
  2025-06-04 23:11 ` [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock Suren Baghdasaryan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Add verbose mode to the proc tests to print debugging information.
Usage: proc-pid-vm -v

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-pid-vm.c | 154 +++++++++++++++++++--
 1 file changed, 141 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/proc/proc-pid-vm.c b/tools/testing/selftests/proc/proc-pid-vm.c
index b582f40851fb..97017f48cd70 100644
--- a/tools/testing/selftests/proc/proc-pid-vm.c
+++ b/tools/testing/selftests/proc/proc-pid-vm.c
@@ -73,6 +73,7 @@ static void make_private_tmp(void)
 }
 
 static unsigned long test_duration_sec = 5UL;
+static bool verbose;
 static int page_size;
 static pid_t pid = -1;
 static void ate(void)
@@ -452,6 +453,99 @@ static void stop_vma_modifier(struct vma_modifier_info *mod_info)
 	signal_state(mod_info, SETUP_MODIFY_MAPS);
 }
 
+static void print_first_lines(char *text, int nr)
+{
+	const char *end = text;
+
+	while (nr && (end = strchr(end, '\n')) != NULL) {
+		nr--;
+		end++;
+	}
+
+	if (end) {
+		int offs = end - text;
+
+		text[offs] = '\0';
+		printf(text);
+		text[offs] = '\n';
+		printf("\n");
+	} else {
+		printf(text);
+	}
+}
+
+static void print_last_lines(char *text, int nr)
+{
+	const char *start = text + strlen(text);
+
+	nr++; /* to ignore the last newline */
+	while (nr) {
+		while (start > text && *start != '\n')
+			start--;
+		nr--;
+		start--;
+	}
+	printf(start);
+}
+
+static void print_boundaries(const char *title,
+			     struct page_content *page1,
+			     struct page_content *page2)
+{
+	if (!verbose)
+		return;
+
+	printf("%s", title);
+	/* Print 3 boundary lines from each page */
+	print_last_lines(page1->data, 3);
+	printf("-----------------page boundary-----------------\n");
+	print_first_lines(page2->data, 3);
+}
+
+static bool print_boundaries_on(bool condition, const char *title,
+				struct page_content *page1,
+				struct page_content *page2)
+{
+	if (verbose && condition)
+		print_boundaries(title, page1, page2);
+
+	return condition;
+}
+
+static void report_test_start(const char *name)
+{
+	if (verbose)
+		printf("==== %s ====\n", name);
+}
+
+static struct timespec print_ts;
+
+static void start_test_loop(struct timespec *ts)
+{
+	if (verbose)
+		print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_iteration(struct timespec *ts)
+{
+	if (!verbose)
+		return;
+
+	/* Update every second */
+	if (print_ts.tv_sec == ts->tv_sec)
+		return;
+
+	printf(".");
+	fflush(stdout);
+	print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_loop(void)
+{
+	if (verbose)
+		printf("\n");
+}
+
 static void capture_mod_pattern(int maps_fd,
 				struct vma_modifier_info *mod_info,
 				struct page_content *page1,
@@ -463,18 +557,24 @@ static void capture_mod_pattern(int maps_fd,
 				struct line_content *restored_last_line,
 				struct line_content *restored_first_line)
 {
+	print_boundaries("Before modification", page1, page2);
+
 	signal_state(mod_info, SETUP_MODIFY_MAPS);
 	wait_for_state(mod_info, SETUP_MAPS_MODIFIED);
 
 	/* Copy last line of the first page and first line of the last page */
 	read_boundary_lines(maps_fd, page1, page2, mod_last_line, mod_first_line);
 
+	print_boundaries("After modification", page1, page2);
+
 	signal_state(mod_info, SETUP_RESTORE_MAPS);
 	wait_for_state(mod_info, SETUP_MAPS_RESTORED);
 
 	/* Copy last line of the first page and first line of the last page */
 	read_boundary_lines(maps_fd, page1, page2, restored_last_line, restored_first_line);
 
+	print_boundaries("After restore", page1, page2);
+
 	mod_info->vma_mod_check(mod_last_line, mod_first_line,
 				restored_last_line, restored_first_line);
 
@@ -546,6 +646,7 @@ static void test_maps_tearing_from_split(int maps_fd,
 	mod_info->vma_restore = merge_vma;
 	mod_info->vma_mod_check = check_split_result;
 
+	report_test_start("Tearing from split");
 	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
 			    &split_last_line, &split_first_line,
 			    &restored_last_line, &restored_first_line);
@@ -558,6 +659,7 @@ static void test_maps_tearing_from_split(int maps_fd,
 	struct timespec start_ts, end_ts;
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	start_test_loop(&start_ts);
 	do {
 		bool last_line_changed;
 		bool first_line_changed;
@@ -577,12 +679,17 @@ static void test_maps_tearing_from_split(int maps_fd,
 			 * In that case new first line will be the same as the
 			 * last restored line.
 			 */
-			assert(!strcmp(new_first_line.text, split_first_line.text) ||
-			       !strcmp(new_first_line.text, restored_last_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, split_first_line.text) &&
+					strcmp(new_first_line.text, restored_last_line.text),
+					"Split result invalid", page1, page2));
+
 		} else {
 			/* The vmas should be consistent with merge results */
-			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
-			       !strcmp(new_first_line.text, restored_first_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_last_line.text, restored_last_line.text) ||
+					strcmp(new_first_line.text, restored_first_line.text),
+					"Merge result invalid", page1, page2));
 		}
 		/*
 		 * First and last lines should change in unison. If the last
@@ -607,7 +714,9 @@ static void test_maps_tearing_from_split(int maps_fd,
 			vma_end == split_first_line.end_addr));
 
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+		end_test_iteration(&end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+	end_test_loop();
 
 	/* Signal the modifyer thread to stop and wait until it exits */
 	signal_state(mod_info, TEST_DONE);
@@ -654,6 +763,7 @@ static void test_maps_tearing_from_resize(int maps_fd,
 	mod_info->vma_restore = expand_vma;
 	mod_info->vma_mod_check = check_shrink_result;
 
+	report_test_start("Tearing from resize");
 	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
 			    &shrunk_last_line, &shrunk_first_line,
 			    &restored_last_line, &restored_first_line);
@@ -666,6 +776,7 @@ static void test_maps_tearing_from_resize(int maps_fd,
 	struct timespec start_ts, end_ts;
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	start_test_loop(&start_ts);
 	do {
 		unsigned long vma_start;
 		unsigned long vma_end;
@@ -682,12 +793,16 @@ static void test_maps_tearing_from_resize(int maps_fd,
 			 * again. In that case new first line will be the same
 			 * as the last restored line.
 			 */
-			assert(!strcmp(new_first_line.text, shrunk_first_line.text) ||
-			       !strcmp(new_first_line.text, restored_last_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, shrunk_first_line.text) &&
+					strcmp(new_first_line.text, restored_last_line.text),
+					"Shrink result invalid", page1, page2));
 		} else {
 			/* The vmas should be consistent with the original/resored state */
-			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
-			       !strcmp(new_first_line.text, restored_first_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_last_line.text, restored_last_line.text) ||
+					strcmp(new_first_line.text, restored_first_line.text),
+					"Expand result invalid", page1, page2));
 		}
 
 		/* Check if PROCMAP_QUERY ioclt() finds the right VMA */
@@ -701,7 +816,9 @@ static void test_maps_tearing_from_resize(int maps_fd,
 			vma_end - vma_start == page_size));
 
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+		end_test_iteration(&end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+	end_test_loop();
 
 	/* Signal the modifyer thread to stop and wait until it exits */
 	signal_state(mod_info, TEST_DONE);
@@ -757,6 +874,7 @@ static void test_maps_tearing_from_remap(int maps_fd,
 	mod_info->vma_restore = patch_vma;
 	mod_info->vma_mod_check = check_remap_result;
 
+	report_test_start("Tearing from remap");
 	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
 			    &remapped_last_line, &remapped_first_line,
 			    &restored_last_line, &restored_first_line);
@@ -769,6 +887,7 @@ static void test_maps_tearing_from_remap(int maps_fd,
 	struct timespec start_ts, end_ts;
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	start_test_loop(&start_ts);
 	do {
 		unsigned long vma_start;
 		unsigned long vma_end;
@@ -785,12 +904,16 @@ static void test_maps_tearing_from_remap(int maps_fd,
 			 * again. In that case new first line will be the same
 			 * as the last restored line.
 			 */
-			assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
-			       !strcmp(new_first_line.text, restored_last_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, remapped_first_line.text) &&
+					strcmp(new_first_line.text, restored_last_line.text),
+					"Remap result invalid", page1, page2));
 		} else {
 			/* The vmas should be consistent with the original/resored state */
-			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
-			       !strcmp(new_first_line.text, restored_first_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_last_line.text, restored_last_line.text) ||
+					strcmp(new_first_line.text, restored_first_line.text),
+					"Remap restore result invalid", page1, page2));
 		}
 
 		/* Check if PROCMAP_QUERY ioclt() finds the right VMA */
@@ -806,7 +929,9 @@ static void test_maps_tearing_from_remap(int maps_fd,
 			vma_end - vma_start == page_size));
 
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+		end_test_iteration(&end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+	end_test_loop();
 
 	/* Signal the modifyer thread to stop and wait until it exits */
 	signal_state(mod_info, TEST_DONE);
@@ -927,6 +1052,7 @@ int usage(void)
 {
 	fprintf(stderr, "Userland /proc/pid/{s}maps test cases\n");
 	fprintf(stderr, "  -d: Duration for time-consuming tests\n");
+	fprintf(stderr, "  -v: Verbose mode\n");
 	fprintf(stderr, "  -h: Help screen\n");
 	exit(-1);
 }
@@ -937,9 +1063,11 @@ int main(int argc, char **argv)
 	int exec_fd;
 	int opt;
 
-	while ((opt = getopt(argc, argv, "d:h")) != -1) {
+	while ((opt = getopt(argc, argv, "d:vh")) != -1) {
 		if (opt == 'd')
 			test_duration_sec = strtoul(optarg, NULL, 0);
+		else if (opt == 'v')
+			verbose = true;
 		else if (opt == 'h')
 			usage();
 	}
-- 
2.49.0.1266.g31b7d2e469-goog



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2025-06-04 23:11 ` [PATCH v4 5/7] selftests/proc: add verbose more for tests to facilitate debugging Suren Baghdasaryan
@ 2025-06-04 23:11 ` Suren Baghdasaryan
  2025-06-07 17:43   ` Lorenzo Stoakes
  2025-06-10  7:50   ` kernel test robot
  2025-06-04 23:11 ` [PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
  2025-06-13 15:01 ` [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Lorenzo Stoakes
  7 siblings, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

With maple_tree supporting vma tree traversal under RCU and per-vma
locks, /proc/pid/maps can be read while holding individual vma locks
instead of locking the entire address space.
Completely lockless approach would be quite complex with the main issue
being get_vma_name() using callbacks which might not work correctly with
a stable vma copy, requiring original (unstable) vma.
When per-vma lock acquisition fails, we take the mmap_lock for reading,
lock the vma, release the mmap_lock and continue. This guarantees the
reader to make forward progress even during lock contention. This will
interfere with the writer but for a very short time while we are
acquiring the per-vma lock and only when there was contention on the
vma reader is interested in.
One case requiring special handling is when vma changes between the
time it was found and the time it got locked. A problematic case would
be if vma got shrunk so that it's start moved higher in the address
space and a new vma was installed at the beginning:

reader found:               |--------VMA A--------|
VMA is modified:            |-VMA B-|----VMA A----|
reader locks modified VMA A
reader reports VMA A:       |  gap  |----VMA A----|

This would result in reporting a gap in the address space that does not
exist. To prevent this we retry the lookup after locking the vma, however
we do that only when we identify a gap and detect that the address space
was changed after we found the vma.
This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such
as monitoring/data collection services) from blocking address space
updates. Note that this change has a userspace visible disadvantage:
it allows for sub-page data tearing as opposed to the previous mechanism
where data tearing could happen only between pages of generated output
data. Since current userspace considers data tearing between pages to be
acceptable, we assume is will be able to handle sub-page data tearing
as well.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/internal.h |   6 ++
 fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 175 insertions(+), 8 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96122e91c645..3728c9012687 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -379,6 +379,12 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+	bool mmap_locked;
+	unsigned int mm_wr_seq;
+	struct vm_area_struct *locked_vma;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 27972c0749e7..36d883c4f394 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
 
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
-						loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
+					  struct vm_area_struct *vma,
+					  unsigned long last_pos,
+					  bool mm_unstable)
+{
+	vma = vma_start_read(priv->mm, vma);
+	if (IS_ERR_OR_NULL(vma))
+		return NULL;
+
+	/* Check if the vma we locked is the right one. */
+	if (unlikely(vma->vm_mm != priv->mm))
+		goto err;
+
+	/* vma should not be ahead of the last search position. */
+	if (unlikely(last_pos >= vma->vm_end))
+		goto err;
+
+	/*
+	 * vma ahead of last search position is possible but we need to
+	 * verify that it was not shrunk after we found it, and another
+	 * vma has not been installed ahead of it. Otherwise we might
+	 * observe a gap that should not be there.
+	 */
+	if (mm_unstable && last_pos < vma->vm_start) {
+		/* Verify only if the address space changed since vma lookup. */
+		if ((priv->mm_wr_seq & 1) ||
+		    mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
+			vma_iter_init(&priv->iter, priv->mm, last_pos);
+			if (vma != vma_next(&priv->iter))
+				goto err;
+		}
+	}
+
+	priv->locked_vma = vma;
+
+	return vma;
+err:
+	vma_end_read(vma);
+	return NULL;
+}
+
+
+static void unlock_vma(struct proc_maps_private *priv)
+{
+	if (priv->locked_vma) {
+		vma_end_read(priv->locked_vma);
+		priv->locked_vma = NULL;
+	}
+}
+
+static const struct seq_operations proc_pid_maps_op;
+
+static inline bool lock_content(struct seq_file *m,
+				struct proc_maps_private *priv)
+{
+	/*
+	 * smaps and numa_maps perform page table walk, therefore require
+	 * mmap_lock but maps can be read with locked vma only.
+	 */
+	if (m->op != &proc_pid_maps_op) {
+		if (mmap_read_lock_killable(priv->mm))
+			return false;
+
+		priv->mmap_locked = true;
+	} else {
+		rcu_read_lock();
+		priv->locked_vma = NULL;
+		priv->mmap_locked = false;
+	}
+
+	return true;
+}
+
+static inline void unlock_content(struct proc_maps_private *priv)
+{
+	if (priv->mmap_locked) {
+		mmap_read_unlock(priv->mm);
+	} else {
+		unlock_vma(priv);
+		rcu_read_unlock();
+	}
+}
+
+static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
+					   loff_t last_pos)
 {
-	struct vm_area_struct *vma = vma_next(&priv->iter);
+	struct vm_area_struct *vma;
+	int ret;
+
+	if (priv->mmap_locked)
+		return vma_next(&priv->iter);
+
+	unlock_vma(priv);
+	/*
+	 * Record sequence number ahead of vma lookup.
+	 * Odd seqcount means address space modification is in progress.
+	 */
+	mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
+	vma = vma_next(&priv->iter);
+	if (!vma)
+		return NULL;
+
+	vma = trylock_vma(priv, vma, last_pos, true);
+	if (vma)
+		return vma;
+
+	/* Address space got modified, vma might be stale. Re-lock and retry */
+	rcu_read_unlock();
+	ret = mmap_read_lock_killable(priv->mm);
+	rcu_read_lock();
+	if (ret)
+		return ERR_PTR(ret);
+
+	/* Lookup the vma at the last position again under mmap_read_lock */
+	vma_iter_init(&priv->iter, priv->mm, last_pos);
+	vma = vma_next(&priv->iter);
+	if (vma) {
+		vma = trylock_vma(priv, vma, last_pos, false);
+		WARN_ON(!vma); /* mm is stable, has to succeed */
+	}
+	mmap_read_unlock(priv->mm);
+
+	return vma;
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
 
+static inline bool lock_content(struct seq_file *m,
+				struct proc_maps_private *priv)
+{
+	return mmap_read_lock_killable(priv->mm) == 0;
+}
+
+static inline void unlock_content(struct proc_maps_private *priv)
+{
+	mmap_read_unlock(priv->mm);
+}
+
+static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
+					   loff_t last_pos)
+{
+	return vma_next(&priv->iter);
+}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma;
+
+	vma = get_next_vma(priv, *ppos);
+	if (IS_ERR(vma))
+		return vma;
+
+	/* Store previous position to be able to restart if needed */
+	priv->last_pos = *ppos;
 	if (vma) {
-		*ppos = vma->vm_start;
+		/*
+		 * Track the end of the reported vma to ensure position changes
+		 * even if previous vma was merged with the next vma and we
+		 * found the extended vma with the same vm_start.
+		 */
+		*ppos = vma->vm_end;
 	} else {
 		*ppos = -2UL;
 		vma = get_gate_vma(priv->mm);
@@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return NULL;
 	}
 
-	if (mmap_read_lock_killable(mm)) {
+	if (!lock_content(m, priv)) {
 		mmput(mm);
 		put_task_struct(priv->task);
 		priv->task = NULL;
 		return ERR_PTR(-EINTR);
 	}
 
+	if (last_addr > 0)
+		*ppos = last_addr = priv->last_pos;
 	vma_iter_init(&priv->iter, mm, last_addr);
 	hold_task_mempolicy(priv);
 	if (last_addr == -2UL)
 		return get_gate_vma(mm);
 
-	return proc_get_vma(priv, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
@@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
 		*ppos = -1UL;
 		return NULL;
 	}
-	return proc_get_vma(m->private, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void m_stop(struct seq_file *m, void *v)
@@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
 		return;
 
 	release_task_mempolicy(priv);
-	mmap_read_unlock(mm);
+	unlock_content(priv);
 	mmput(mm);
 	put_task_struct(priv->task);
 	priv->task = NULL;
-- 
2.49.0.1266.g31b7d2e469-goog



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (5 preceding siblings ...)
  2025-06-04 23:11 ` [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock Suren Baghdasaryan
@ 2025-06-04 23:11 ` Suren Baghdasaryan
  2025-06-13 20:36   ` Andrii Nakryiko
  2025-06-13 15:01 ` [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Lorenzo Stoakes
  7 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 23:11 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Utilize per-vma locks to stabilize vma after lookup without taking
mmap_lock during PROCMAP_QUERY ioctl execution. While we might take
mmap_lock for reading during contention, we do that momentarily only
to lock the vma.
This change is designed to reduce mmap_lock contention and prevent
PROCMAP_QUERY ioctl calls from blocking address space updates.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/task_mmu.c | 56 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 36d883c4f394..93ba35a84975 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -550,28 +550,60 @@ static int pid_maps_open(struct inode *inode, struct file *file)
 		PROCMAP_QUERY_VMA_FLAGS				\
 )
 
-static int query_vma_setup(struct mm_struct *mm)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static int query_vma_setup(struct proc_maps_private *priv)
 {
-	return mmap_read_lock_killable(mm);
+	rcu_read_lock();
+	priv->locked_vma = NULL;
+	priv->mmap_locked = false;
+
+	return 0;
 }
 
-static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+static void query_vma_teardown(struct proc_maps_private *priv)
 {
-	mmap_read_unlock(mm);
+	unlock_vma(priv);
+	rcu_read_unlock();
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
+						     unsigned long addr)
+{
+	vma_iter_init(&priv->iter, priv->mm, addr);
+	return get_next_vma(priv, addr);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static int query_vma_setup(struct proc_maps_private *priv)
+{
+	return mmap_read_lock_killable(priv->mm);
+}
+
+static void query_vma_teardown(struct proc_maps_private *priv)
+{
+	mmap_read_unlock(priv->mm);
 }
 
-static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
+						     unsigned long addr)
 {
-	return find_vma(mm, addr);
+	return find_vma(priv->mm, addr);
 }
 
-static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
+#endif  /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *query_matching_vma(struct proc_maps_private *priv,
 						 unsigned long addr, u32 flags)
 {
 	struct vm_area_struct *vma;
 
 next_vma:
-	vma = query_vma_find_by_addr(mm, addr);
+	vma = query_vma_find_by_addr(priv, addr);
+	if (IS_ERR(vma))
+		return vma;
+
 	if (!vma)
 		goto no_vma;
 
@@ -647,13 +679,13 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	if (!mm || !mmget_not_zero(mm))
 		return -ESRCH;
 
-	err = query_vma_setup(mm);
+	err = query_vma_setup(priv);
 	if (err) {
 		mmput(mm);
 		return err;
 	}
 
-	vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
+	vma = query_matching_vma(priv, karg.query_addr, karg.query_flags);
 	if (IS_ERR(vma)) {
 		err = PTR_ERR(vma);
 		vma = NULL;
@@ -738,7 +770,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	}
 
 	/* unlock vma or mmap_lock, and put mm_struct before copying data to user */
-	query_vma_teardown(mm, vma);
+	query_vma_teardown(priv);
 	mmput(mm);
 
 	if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr),
@@ -758,7 +790,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	return 0;
 
 out:
-	query_vma_teardown(mm, vma);
+	query_vma_teardown(priv);
 	mmput(mm);
 	kfree(name_buf);
 	return err;
-- 
2.49.0.1266.g31b7d2e469-goog



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-04 23:11 ` [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock Suren Baghdasaryan
@ 2025-06-07 17:43   ` Lorenzo Stoakes
  2025-06-08  1:41     ` Suren Baghdasaryan
  2025-06-10  7:50   ` kernel test robot
  1 sibling, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-06-07 17:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Hi Suren,

Forgive me but I am going to ask a lot of questions here :p just want to
make sure I'm getting everything right here.

On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote:
> With maple_tree supporting vma tree traversal under RCU and per-vma
> locks, /proc/pid/maps can be read while holding individual vma locks
> instead of locking the entire address space.

Nice :)

> Completely lockless approach would be quite complex with the main issue
> being get_vma_name() using callbacks which might not work correctly with
> a stable vma copy, requiring original (unstable) vma.

Hmmm can you expand on what a 'completely lockless' design might comprise of?

It's super un-greppable and I've not got clangd set up with an allmod kernel to
triple-check but I'm seeing at least 2 (are there more?):

gate_vma_name() which is:

	return "[vsyscall]";

special_mapping_name() which is:

	 return ((struct vm_special_mapping *)vma->vm_private_data)->name;

Which I'm guessing is the issue because it's a double pointer deref...

Seems such a silly issue to get stuck on, I wonder if we can't just change
this to function correctly?

> When per-vma lock acquisition fails, we take the mmap_lock for reading,
> lock the vma, release the mmap_lock and continue. This guarantees the
> reader to make forward progress even during lock contention. This will

Ah that fabled constant forward progress ;)

> interfere with the writer but for a very short time while we are
> acquiring the per-vma lock and only when there was contention on the
> vma reader is interested in.
> One case requiring special handling is when vma changes between the
> time it was found and the time it got locked. A problematic case would
> be if vma got shrunk so that it's start moved higher in the address
> space and a new vma was installed at the beginning:
>
> reader found:               |--------VMA A--------|
> VMA is modified:            |-VMA B-|----VMA A----|
> reader locks modified VMA A
> reader reports VMA A:       |  gap  |----VMA A----|
>
> This would result in reporting a gap in the address space that does not
> exist. To prevent this we retry the lookup after locking the vma, however
> we do that only when we identify a gap and detect that the address space
> was changed after we found the vma.

OK so in this case we have

1. Find VMA A - nothing is locked yet, but presumably we are under RCU so
   are... safe? From unmaps? Or are we? I guess actually the detach
   mechanism sorts this out for us perhaps?

2. We got unlucky and did this immediately prior to VMA A having its
   vma->vm_start, vm_end updated to reflect the split.

3. We lock VMA A, now position with an apparent gap after the prior VMA
which, in practice does not exist.

So I am guessing that by observing sequence numbers you are able to detect
that a change has occurred and thus retry the operation in this situation?

I know we previously discussed the possibility of this retry mechanism
going on forever, I guess I will see the resolution to this in the code :)

> This change is designed to reduce mmap_lock contention and prevent a
> process reading /proc/pid/maps files (often a low priority task, such
> as monitoring/data collection services) from blocking address space
> updates. Note that this change has a userspace visible disadvantage:
> it allows for sub-page data tearing as opposed to the previous mechanism
> where data tearing could happen only between pages of generated output
> data. Since current userspace considers data tearing between pages to be
> acceptable, we assume is will be able to handle sub-page data tearing
> as well.

By tearing do you mean for instance seeing a VMA more than once due to
e.g. a VMA expanding in a racey way?

Pedantic I know, but it might be worth goiing through all the merge case,
split and remap scenarios and explaining what might happen in each one (or
perhaps do that as some form of documentation?)

I can try to put together a list of all of the possibilities if that would
be helpful.

>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  fs/proc/internal.h |   6 ++
>  fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 175 insertions(+), 8 deletions(-)

I really hate having all this logic in the proc/task_mmu.c file.

This is really delicate stuff and I'd really like it to live in mm if
possible.

I reallise this might be a total pain, but I'm quite worried about us
putting super-delicate, carefully written VMA handling code in different
places.

Also having stuff in mm/vma.c opens the door to userland testing which,
when I finally have time to really expand that, would allow for some really
nice stress testing here.

>
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 96122e91c645..3728c9012687 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -379,6 +379,12 @@ struct proc_maps_private {
>  	struct task_struct *task;
>  	struct mm_struct *mm;
>  	struct vma_iterator iter;
> +	loff_t last_pos;
> +#ifdef CONFIG_PER_VMA_LOCK
> +	bool mmap_locked;
> +	unsigned int mm_wr_seq;

Is this the _last_ sequence number observed in the mm_struct? or rather,
previous? Nitty but maybe worth renaming accordingly.

> +	struct vm_area_struct *locked_vma;
> +#endif
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *task_mempolicy;
>  #endif
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 27972c0749e7..36d883c4f394 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
>  }
>  #endif
>
> -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> -						loff_t *ppos)
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
> +					  struct vm_area_struct *vma,
> +					  unsigned long last_pos,
> +					  bool mm_unstable)

This whole function is a bit weird tbh, you handle both the
mm_unstable=true and mm_unstable=false cases, in the latter we don't try to
lock at all...

Nitty (sorry I know this is mildly irritating review) but maybe needs to be
renamed, or split up somehow?

This is only trylocking in the mm_unstable case...

> +{
> +	vma = vma_start_read(priv->mm, vma);

Do we want to do this with mm_unstable == false?

I know (from my own documentation :)) taking a VMA read lock while holding
an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal?

> +	if (IS_ERR_OR_NULL(vma))
> +		return NULL;

Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years
ago from people moaning at me on code review :)

Sorry I know that's annoying but perhaps its indicative of an issue in the
interface? That's possibly out of scope here however.

Why are we ignoring errors here though? I guess because we don't care if
the VMA got detached from under us, we don't bother retrying like we do in
lock_vma_under_rcu()?

Should we just abstract that part of lock_vma_under_rcu() and use it?

> +
> +	/* Check if the vma we locked is the right one. */

Well it might not be the right one :) but might still belong to the right
mm, so maybe better to refer to the right virtual address space.

> +	if (unlikely(vma->vm_mm != priv->mm))
> +		goto err;
> +
> +	/* vma should not be ahead of the last search position. */

You mean behind the last search position? Surely a VMA being _ahead_ of it
is fine?

> +	if (unlikely(last_pos >= vma->vm_end))

Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that
problematic? Or is last_pos inclusive?

> +		goto err;

Am I correct in thinking thi is what is being checked?

          last_pos
             |
             v
|---------|
|         |
|---------|
        vm_end
   <--- vma 'next'??? How did we go backwards?

When last_pos gets updated, is it possible for a shrink to race to cause
this somehow?

Do we treat this as an entirely unexpected error condition? In which case
is a WARN_ON_ONCE() warranted?

> +
> +	/*
> +	 * vma ahead of last search position is possible but we need to
> +	 * verify that it was not shrunk after we found it, and another
> +	 * vma has not been installed ahead of it. Otherwise we might
> +	 * observe a gap that should not be there.
> +	 */

OK so this is the juicy bit.

> +	if (mm_unstable && last_pos < vma->vm_start) {
> +		/* Verify only if the address space changed since vma lookup. */
> +		if ((priv->mm_wr_seq & 1) ||

Can we wrap this into a helper? This is a 'you just have to know that odd
seq number means a write operation is in effect'. I know you have a comment
here, but I think something like:

	if (has_mm_been_modified(priv) ||

Would be a lot clearer.

Again this speaks to the usefulness of abstracting all this logic from the
proc code, we are putting super delicate VMA stuff here and it's just not
the right place.

As an aside, I don't see coverage in the process_addrs documentation on
sequence number odd/even or speculation?

I think we probably need to cover this to maintain an up-to-date
description of how the VMA locking mechanism works and is used?

> +		    mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {

Nit, again unrelated to this series, but would be useful to add a comment
to mmap_lock_speculate_retry() to indicate that a true return value
indicates a retry is needed, or renaming it.

Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a
comment.

Naming is hard :P

Anyway the totality of this expression is 'something changed' or 'read
section retry required'.

Under what circumstances would this happen?

OK so we're into the 'retry' logic here:

> +			vma_iter_init(&priv->iter, priv->mm, last_pos);

I'd definitely want Liam to confirm this is all above board and correct, as
these operations are pretty sensitive.

But assuming this is safe, we reset the iterator to the last position...

> +			if (vma != vma_next(&priv->iter))

Then assert the following VMA is the one we seek.

> +				goto err;

Might this ever be the case in the course of ordinary operation? Is this
really an error?

> +		}
> +	}
> +
> +	priv->locked_vma = vma;
> +
> +	return vma;
> +err:

As queried above, is this really an error path or something we might expect
to happen that could simply result in an expected fallback to mmap lock?

> +	vma_end_read(vma);
> +	return NULL;
> +}
> +
> +
> +static void unlock_vma(struct proc_maps_private *priv)
> +{
> +	if (priv->locked_vma) {
> +		vma_end_read(priv->locked_vma);
> +		priv->locked_vma = NULL;
> +	}
> +}
> +
> +static const struct seq_operations proc_pid_maps_op;
> +
> +static inline bool lock_content(struct seq_file *m,
> +				struct proc_maps_private *priv)

Pedantic I know but isn't 'lock_content' a bit generic?

He says, not being able to think of a great alternative...

OK maybe fine... :)

> +{
> +	/*
> +	 * smaps and numa_maps perform page table walk, therefore require
> +	 * mmap_lock but maps can be read with locked vma only.
> +	 */
> +	if (m->op != &proc_pid_maps_op) {

Nit but is there a neater way of checking this? Actually I imagine not...

But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g.

static inline bool is_maps_op(struct seq_file *m);

And check e.g.

if (is_maps_op(m)) { ... in the above.

Yeah this is nitty not a massive del :)

> +		if (mmap_read_lock_killable(priv->mm))
> +			return false;
> +
> +		priv->mmap_locked = true;
> +	} else {
> +		rcu_read_lock();
> +		priv->locked_vma = NULL;
> +		priv->mmap_locked = false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline void unlock_content(struct proc_maps_private *priv)
> +{
> +	if (priv->mmap_locked) {
> +		mmap_read_unlock(priv->mm);
> +	} else {
> +		unlock_vma(priv);
> +		rcu_read_unlock();

Does this always get called even in error cases?

> +	}
> +}
> +
> +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> +					   loff_t last_pos)

We really need a generalised RCU multi-VMA locking mechanism (we're looking
into madvise VMA locking atm with a conservative single VMA lock at the
moment, but in future we probably want to be able to span multiple for
instance) and this really really feels like it doesn't belong in this proc
code.

>  {
> -	struct vm_area_struct *vma = vma_next(&priv->iter);
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	if (priv->mmap_locked)
> +		return vma_next(&priv->iter);
> +
> +	unlock_vma(priv);
> +	/*
> +	 * Record sequence number ahead of vma lookup.
> +	 * Odd seqcount means address space modification is in progress.
> +	 */
> +	mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);

Hmm we're discarding the return value I guess we don't really care about
that at this stage? Or do we? Do we want to assert the read critical
section state here?

I guess since we have the mm_rq_seq which we use later it's the same thing
and doesn't matter.

~~(off topic a bit)~~

OK so off-topic again afaict we're doing something pretty horribly gross here.

We pass &priv->mm_rw_seq as 'unsigned int *seq' field to
mmap_lock_speculate_try_begin(), which in turn calls:

	return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq);

And this is defined as a macro of:

#define raw_seqcount_try_begin(s, start)				\
({									\
	start = raw_read_seqcount(s);					\
	!(start & 1);							\
})

So surely this expands to:

	*seq = raw_read_seqcount(&mm->mm_lock_seq);
	!(*seq & 1) // return true if even, false if odd

So we're basically ostensibly passing an unsigned int, but because we're
calling a macro it's actually just 'text' and we're instead able to then
reassign the underlying unsigned int * ptr and... ugh.

~~(/off topic a bit)~~

> +	vma = vma_next(&priv->iter);

> +	if (!vma)
> +		return NULL;
> +
> +	vma = trylock_vma(priv, vma, last_pos, true);
> +	if (vma)
> +		return vma;
> +

Really feels like this should be a boolean... I guess neat to reset vma if
not locked though.

> +	/* Address space got modified, vma might be stale. Re-lock and retry */

> +	rcu_read_unlock();

Might we see a VMA possibly actually legit unmapped in a race here? Do we
need to update last_pos/ppos to account for this? Otherwise we might just
fail on the last_pos >= vma->vm_end check in trylock_vma() no?

> +	ret = mmap_read_lock_killable(priv->mm);

Shouldn't we set priv->mmap_locked here?

I guess not as we are simply holding the mmap lock to definitely get the
next VMA.

> +	rcu_read_lock();
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	/* Lookup the vma at the last position again under mmap_read_lock */
> +	vma_iter_init(&priv->iter, priv->mm, last_pos);
> +	vma = vma_next(&priv->iter);
> +	if (vma) {
> +		vma = trylock_vma(priv, vma, last_pos, false);

Be good to use Liam's convention of /* mm_unstable = */ false to make this
clear.

Find it kinda weird again we're 'trylocking' something we already have
locked via the mmap lock but I already mentioend this... :)

> +		WARN_ON(!vma); /* mm is stable, has to succeed */

I wonder if this is really useful, at any rate seems like there'd be a
flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really
really ought not happen?

> +	}
> +	mmap_read_unlock(priv->mm);
> +
> +	return vma;
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
>
> +static inline bool lock_content(struct seq_file *m,
> +				struct proc_maps_private *priv)
> +{
> +	return mmap_read_lock_killable(priv->mm) == 0;
> +}
> +
> +static inline void unlock_content(struct proc_maps_private *priv)
> +{
> +	mmap_read_unlock(priv->mm);
> +}
> +
> +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> +					   loff_t last_pos)
> +{
> +	return vma_next(&priv->iter);
> +}
> +
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> +{
> +	struct proc_maps_private *priv = m->private;
> +	struct vm_area_struct *vma;
> +
> +	vma = get_next_vma(priv, *ppos);
> +	if (IS_ERR(vma))
> +		return vma;
> +
> +	/* Store previous position to be able to restart if needed */
> +	priv->last_pos = *ppos;
>  	if (vma) {
> -		*ppos = vma->vm_start;
> +		/*
> +		 * Track the end of the reported vma to ensure position changes
> +		 * even if previous vma was merged with the next vma and we
> +		 * found the extended vma with the same vm_start.
> +		 */

Right, so observing repetitions is acceptable in such circumstances? I mean
I agree.

> +		*ppos = vma->vm_end;

If we store the end, does the last_pos logic which resets the VMA iterator
later work correctly in all cases?

>  	} else {
>  		*ppos = -2UL;
>  		vma = get_gate_vma(priv->mm);

Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see
this was the original logic, but maybe put a comment about this as it's
weird and confusing? (and not your fault obviously :P)

Also, are all locks and state corectly handled in this case? Seems like one
of this nasty edge case situations that could have jagged edges...

> @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
>  		return NULL;
>  	}
>
> -	if (mmap_read_lock_killable(mm)) {
> +	if (!lock_content(m, priv)) {

Nice that this just slots in like this! :)

>  		mmput(mm);
>  		put_task_struct(priv->task);
>  		priv->task = NULL;
>  		return ERR_PTR(-EINTR);
>  	}
>
> +	if (last_addr > 0)

last_addr is an unsigned long, this will always be true.

You probably want to put an explicit check for -1UL, -2UL here or?

God I hate this mechanism for indicating gate VMA... yuck yuck (again, this
bit not your fault :P)

> +		*ppos = last_addr = priv->last_pos;
>  	vma_iter_init(&priv->iter, mm, last_addr);
>  	hold_task_mempolicy(priv);
>  	if (last_addr == -2UL)
>  		return get_gate_vma(mm);
>
> -	return proc_get_vma(priv, ppos);
> +	return proc_get_vma(m, ppos);
>  }
>
>  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
>  		*ppos = -1UL;
>  		return NULL;
>  	}
> -	return proc_get_vma(m->private, ppos);
> +	return proc_get_vma(m, ppos);
>  }
>
>  static void m_stop(struct seq_file *m, void *v)
> @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
>  		return;
>
>  	release_task_mempolicy(priv);
> -	mmap_read_unlock(mm);
> +	unlock_content(priv);
>  	mmput(mm);
>  	put_task_struct(priv->task);
>  	priv->task = NULL;
> --
> 2.49.0.1266.g31b7d2e469-goog
>

Sorry to add to workload by digging into so many details here, but we
really need to make sure all the i's are dotted and t's are crossed given
how fiddly and fragile this stuff is :)

Very much appreciate the work, this is a significant improvement and will
have a great deal of real world impact!

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-07 17:43   ` Lorenzo Stoakes
@ 2025-06-08  1:41     ` Suren Baghdasaryan
  2025-06-10 17:43       ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-08  1:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Sat, Jun 7, 2025 at 10:43 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Hi Suren,
>
> Forgive me but I am going to ask a lot of questions here :p just want to
> make sure I'm getting everything right here.

No worries and thank you for reviewing!

>
> On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote:
> > With maple_tree supporting vma tree traversal under RCU and per-vma
> > locks, /proc/pid/maps can be read while holding individual vma locks
> > instead of locking the entire address space.
>
> Nice :)
>
> > Completely lockless approach would be quite complex with the main issue
> > being get_vma_name() using callbacks which might not work correctly with
> > a stable vma copy, requiring original (unstable) vma.
>
> Hmmm can you expand on what a 'completely lockless' design might comprise of?

In my previous implementation
(https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/)
I was doing this under RCU while checking mmap_lock seq counter to
detect address space changes. That's what I meant by a completely
lockless approach here.

>
> It's super un-greppable and I've not got clangd set up with an allmod kernel to
> triple-check but I'm seeing at least 2 (are there more?):
>
> gate_vma_name() which is:
>
>         return "[vsyscall]";
>
> special_mapping_name() which is:
>
>          return ((struct vm_special_mapping *)vma->vm_private_data)->name;
>
> Which I'm guessing is the issue because it's a double pointer deref...

Correct but in more general terms, depending on implementation of the
vm_ops.name callback, vma->vm_ops->name(vma) might not work correctly
with a vma copy. special_mapping_name() is an example of that.

>
> Seems such a silly issue to get stuck on, I wonder if we can't just change
> this to function correctly?

I was thinking about different ways to overcome that but once I
realized per-vma locks result in even less contention and the
implementation is simpler and more robust, I decided that per-vma
locks direction is better.

>
> > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > lock the vma, release the mmap_lock and continue. This guarantees the
> > reader to make forward progress even during lock contention. This will
>
> Ah that fabled constant forward progress ;)
>
> > interfere with the writer but for a very short time while we are
> > acquiring the per-vma lock and only when there was contention on the
> > vma reader is interested in.
> > One case requiring special handling is when vma changes between the
> > time it was found and the time it got locked. A problematic case would
> > be if vma got shrunk so that it's start moved higher in the address
> > space and a new vma was installed at the beginning:
> >
> > reader found:               |--------VMA A--------|
> > VMA is modified:            |-VMA B-|----VMA A----|
> > reader locks modified VMA A
> > reader reports VMA A:       |  gap  |----VMA A----|
> >
> > This would result in reporting a gap in the address space that does not
> > exist. To prevent this we retry the lookup after locking the vma, however
> > we do that only when we identify a gap and detect that the address space
> > was changed after we found the vma.
>
> OK so in this case we have
>
> 1. Find VMA A - nothing is locked yet, but presumably we are under RCU so
>    are... safe? From unmaps? Or are we? I guess actually the detach
>    mechanism sorts this out for us perhaps?

Yes, VMAs are RCU-safe and we do detect if it got detached after we
found it but before we locked it.

>
> 2. We got unlucky and did this immediately prior to VMA A having its
>    vma->vm_start, vm_end updated to reflect the split.

Yes, the split happened after we found it and before we locked it.

>
> 3. We lock VMA A, now position with an apparent gap after the prior VMA
> which, in practice does not exist.

Correct.

>
> So I am guessing that by observing sequence numbers you are able to detect
> that a change has occurred and thus retry the operation in this situation?

Yes, we detect the gap and we detect that address space has changed,
so to endure we did not miss a split we fall back to mmap_read_lock,
lock the VMA while holding mmap_read_lock, drop mmap_read_lock and
retry.

>
> I know we previously discussed the possibility of this retry mechanism
> going on forever, I guess I will see the resolution to this in the code :)

Retry in this case won't go forever because we take mmap_read_lock
during the retry. In the worst case we will be constantly falling back
to mmap_read_lock but that's a very unlikely case (the writer should
be constantly splitting the vma right before the reader locks it).

>
> > This change is designed to reduce mmap_lock contention and prevent a
> > process reading /proc/pid/maps files (often a low priority task, such
> > as monitoring/data collection services) from blocking address space
> > updates. Note that this change has a userspace visible disadvantage:
> > it allows for sub-page data tearing as opposed to the previous mechanism
> > where data tearing could happen only between pages of generated output
> > data. Since current userspace considers data tearing between pages to be
> > acceptable, we assume is will be able to handle sub-page data tearing
> > as well.
>
> By tearing do you mean for instance seeing a VMA more than once due to
> e.g. a VMA expanding in a racey way?

Yes.

>
> Pedantic I know, but it might be worth goiing through all the merge case,
> split and remap scenarios and explaining what might happen in each one (or
> perhaps do that as some form of documentation?)
>
> I can try to put together a list of all of the possibilities if that would
> be helpful.

Hmm. That might be an interesting exercise. I called out this
particular case because my tests caught it. I spent some time thinking
about other possible scenarios where we would report a gap in a place
where there are no gaps but could not think of anything else.

>
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  fs/proc/internal.h |   6 ++
> >  fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 175 insertions(+), 8 deletions(-)
>
> I really hate having all this logic in the proc/task_mmu.c file.
>
> This is really delicate stuff and I'd really like it to live in mm if
> possible.
>
> I reallise this might be a total pain, but I'm quite worried about us
> putting super-delicate, carefully written VMA handling code in different
> places.
>
> Also having stuff in mm/vma.c opens the door to userland testing which,
> when I finally have time to really expand that, would allow for some really
> nice stress testing here.

That would require some sizable refactoring. I assume code for smaps
reading and PROCMAP_QUERY would have to be moved as well?

>
> >
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 96122e91c645..3728c9012687 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -379,6 +379,12 @@ struct proc_maps_private {
> >       struct task_struct *task;
> >       struct mm_struct *mm;
> >       struct vma_iterator iter;
> > +     loff_t last_pos;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     bool mmap_locked;
> > +     unsigned int mm_wr_seq;
>
> Is this the _last_ sequence number observed in the mm_struct? or rather,
> previous? Nitty but maybe worth renaming accordingly.

It's a copy of the mm->mm_wr_seq. I can add a comment if needed.

>
> > +     struct vm_area_struct *locked_vma;
> > +#endif
> >  #ifdef CONFIG_NUMA
> >       struct mempolicy *task_mempolicy;
> >  #endif
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 27972c0749e7..36d883c4f394 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> >  }
> >  #endif
> >
> > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > -                                             loff_t *ppos)
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
> > +                                       struct vm_area_struct *vma,
> > +                                       unsigned long last_pos,
> > +                                       bool mm_unstable)
>
> This whole function is a bit weird tbh, you handle both the
> mm_unstable=true and mm_unstable=false cases, in the latter we don't try to
> lock at all...

Why do you think so? vma_start_read() is always called but in case
mm_unstable=true we double check for the gaps to take care of the case
I mentioned in the changelog.

>
> Nitty (sorry I know this is mildly irritating review) but maybe needs to be
> renamed, or split up somehow?
>
> This is only trylocking in the mm_unstable case...

Nope, I think you misunderstood the intention, as I mentioned above.

>
> > +{
> > +     vma = vma_start_read(priv->mm, vma);
>
> Do we want to do this with mm_unstable == false?

Yes, always. mm_unstable=true only indicates that we are already
holding mmap_read_lock, so we don't need to double-check for gaps.
Perhaps I should add some comments to clarify what purpose this
parameter serves...

>
> I know (from my own documentation :)) taking a VMA read lock while holding
> an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal?

Ah, right. I should use vma_start_read_locked() instead when we are
holding mmap_read_lock. That's why that function was introduced. Will
change.

>
> > +     if (IS_ERR_OR_NULL(vma))
> > +             return NULL;
>
> Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years
> ago from people moaning at me on code review :)
>
> Sorry I know that's annoying but perhaps its indicative of an issue in the
> interface? That's possibly out of scope here however.

lock_vma_under_rcu() returns NULL or EAGAIN to signal
lock_vma_under_rcu() that it should retry the VMA lookup. In here in
either case we retry under mmap_read_lock, that's why EAGAIN is
ignored.

>
> Why are we ignoring errors here though? I guess because we don't care if
> the VMA got detached from under us, we don't bother retrying like we do in
> lock_vma_under_rcu()?

No, we take mmap_read_lock and retry in either case. Perhaps I should
split trylock_vma() into two separate functions - one for the case
when we are holding mmap_read_lock and another one when we don't? I
think that would have prevented many of your questions. I'll try that
and see how it looks.

>
> Should we just abstract that part of lock_vma_under_rcu() and use it?

trylock_vma() is not that similar to lock_vma_under_rcu() for that
IMO. Also lock_vma_under_rcu() is in the pagefault path which is very
hot, so I would not want to add conditions there to make it work for
trylock_vma().

>
> > +
> > +     /* Check if the vma we locked is the right one. */
>
> Well it might not be the right one :) but might still belong to the right
> mm, so maybe better to refer to the right virtual address space.

Ack. Will change to "Check if the vma belongs to the right address space. "

>
> > +     if (unlikely(vma->vm_mm != priv->mm))
> > +             goto err;
> > +
> > +     /* vma should not be ahead of the last search position. */
>
> You mean behind the last search position? Surely a VMA being _ahead_ of it
> is fine?

Yes, you are correct. "should not" should have been "should".

>
> > +     if (unlikely(last_pos >= vma->vm_end))
>
> Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that
> problematic? Or is last_pos inclusive?

last_pos is inclusive and vma->vm_end is not inclusive, so if last_pos
== vma->vm_end that would mean the vma is behind the last_pos. Since
we are searching forward from the last_pos, we should not be finding a
vma before last_pos unless it mutated.

>
> > +             goto err;
>
> Am I correct in thinking thi is what is being checked?
>
>           last_pos
>              |
>              v
> |---------|
> |         |
> |---------|
>         vm_end
>    <--- vma 'next'??? How did we go backwards?

Exactly.

>
> When last_pos gets updated, is it possible for a shrink to race to cause
> this somehow?

No, we update last_pos only after we locked the vma and confirmed it's
the right one.

>
> Do we treat this as an entirely unexpected error condition? In which case
> is a WARN_ON_ONCE() warranted?

No, the VMA might have mutated from under us before we locked it. For
example it might have been remapped to a higher address.

>
> > +
> > +     /*
> > +      * vma ahead of last search position is possible but we need to
> > +      * verify that it was not shrunk after we found it, and another
> > +      * vma has not been installed ahead of it. Otherwise we might
> > +      * observe a gap that should not be there.
> > +      */
>
> OK so this is the juicy bit.

Yep, that's the case singled out in the changelog.

>
>
> > +     if (mm_unstable && last_pos < vma->vm_start) {
> > +             /* Verify only if the address space changed since vma lookup. */
> > +             if ((priv->mm_wr_seq & 1) ||
>
> Can we wrap this into a helper? This is a 'you just have to know that odd
> seq number means a write operation is in effect'. I know you have a comment
> here, but I think something like:
>
>         if (has_mm_been_modified(priv) ||
>
> Would be a lot clearer.

Yeah, I was thinking about that. I think an even cleaner way would be
to remember the return value of mmap_lock_speculate_try_begin() and
pass it around. I was hoping to avoid that extra parameter but sounds
like for the sake of clarity that would be preferable?

>
> Again this speaks to the usefulness of abstracting all this logic from the
> proc code, we are putting super delicate VMA stuff here and it's just not
> the right place.
>
> As an aside, I don't see coverage in the process_addrs documentation on
> sequence number odd/even or speculation?
>
> I think we probably need to cover this to maintain an up-to-date
> description of how the VMA locking mechanism works and is used?

I think that's a very low level technical detail which I should not
have exposed here. As I mentioned, I should simply store the return
value of mmap_lock_speculate_try_begin() instead of doing these tricky
mm_wr_seq checks.

>
> > +                 mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
>
> Nit, again unrelated to this series, but would be useful to add a comment
> to mmap_lock_speculate_retry() to indicate that a true return value
> indicates a retry is needed, or renaming it.

This is how seqcount API works in general. Note that
mmap_lock_speculate_retry() is just a wrapper around
read_seqcount_retry().

>
> Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a
> comment.

See https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/seqlock.h#L395

>
> Naming is hard :P
>
> Anyway the totality of this expression is 'something changed' or 'read
> section retry required'.

Not quite. The expression is "something is changed from under us or
something was changing even before we started VMA lookup". Or in more
technical terms, mmap_write_lock was acquired while we were locking
the VMA or mmap_write_lock was already held even before we started the
VMA search.

>
> Under what circumstances would this happen?

See my previous comment and I hope that clarifies it.

>
> OK so we're into the 'retry' logic here:
>
> > +                     vma_iter_init(&priv->iter, priv->mm, last_pos);
>
> I'd definitely want Liam to confirm this is all above board and correct, as
> these operations are pretty sensitive.
>
> But assuming this is safe, we reset the iterator to the last position...
>
> > +                     if (vma != vma_next(&priv->iter))
>
> Then assert the following VMA is the one we seek.
>
> > +                             goto err;
>
> Might this ever be the case in the course of ordinary operation? Is this
> really an error?

This simply means that the VMA we found before is not at the place we
found it anymore. The locking fails and we should retry.

>
> > +             }
> > +     }
> > +
> > +     priv->locked_vma = vma;
> > +
> > +     return vma;
> > +err:
>
> As queried above, is this really an error path or something we might expect
> to happen that could simply result in an expected fallback to mmap lock?

It's a failure to lock the VMA, which is handled by retrying under
mmap_read_lock. So, trylock_vma() failure does not mean a fault in the
logic. It's expected to happen occasionally.

>
> > +     vma_end_read(vma);
> > +     return NULL;
> > +}
> > +
> > +
> > +static void unlock_vma(struct proc_maps_private *priv)
> > +{
> > +     if (priv->locked_vma) {
> > +             vma_end_read(priv->locked_vma);
> > +             priv->locked_vma = NULL;
> > +     }
> > +}
> > +
> > +static const struct seq_operations proc_pid_maps_op;
> > +
> > +static inline bool lock_content(struct seq_file *m,
> > +                             struct proc_maps_private *priv)
>
> Pedantic I know but isn't 'lock_content' a bit generic?
>
> He says, not being able to think of a great alternative...
>
> OK maybe fine... :)

Yeah, I struggled with this myself. Help in naming is appreciated.

>
> > +{
> > +     /*
> > +      * smaps and numa_maps perform page table walk, therefore require
> > +      * mmap_lock but maps can be read with locked vma only.
> > +      */
> > +     if (m->op != &proc_pid_maps_op) {
>
> Nit but is there a neater way of checking this? Actually I imagine not...
>
> But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g.
>
> static inline bool is_maps_op(struct seq_file *m);
>
> And check e.g.
>
> if (is_maps_op(m)) { ... in the above.
>
> Yeah this is nitty not a massive del :)

I'll try that and see how it looks. Thanks!

>
> > +             if (mmap_read_lock_killable(priv->mm))
> > +                     return false;
> > +
> > +             priv->mmap_locked = true;
> > +     } else {
> > +             rcu_read_lock();
> > +             priv->locked_vma = NULL;
> > +             priv->mmap_locked = false;
> > +     }
> > +
> > +     return true;
> > +}
> > +
> > +static inline void unlock_content(struct proc_maps_private *priv)
> > +{
> > +     if (priv->mmap_locked) {
> > +             mmap_read_unlock(priv->mm);
> > +     } else {
> > +             unlock_vma(priv);
> > +             rcu_read_unlock();
>
> Does this always get called even in error cases?

What error cases do you have in mind? Error to lock a VMA is handled
by retrying and we should be happily proceeding. Please clarify.

>
> > +     }
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
>
> We really need a generalised RCU multi-VMA locking mechanism (we're looking
> into madvise VMA locking atm with a conservative single VMA lock at the
> moment, but in future we probably want to be able to span multiple for
> instance) and this really really feels like it doesn't belong in this proc
> code.

Ok, I guess you are building a case to move more code into vma.c? I
see what you are doing :)

>
> >  {
> > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > +     struct vm_area_struct *vma;
> > +     int ret;
> > +
> > +     if (priv->mmap_locked)
> > +             return vma_next(&priv->iter);
> > +
> > +     unlock_vma(priv);
> > +     /*
> > +      * Record sequence number ahead of vma lookup.
> > +      * Odd seqcount means address space modification is in progress.
> > +      */
> > +     mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
>
> Hmm we're discarding the return value I guess we don't really care about
> that at this stage? Or do we? Do we want to assert the read critical
> section state here?

Yeah, as I mentioned, instead of relying on priv->mm_wr_seq being odd
I should record the return value of mmap_lock_speculate_try_begin().
In the functional sense these two are interchangeable.

>
> I guess since we have the mm_rq_seq which we use later it's the same thing
> and doesn't matter.

Yep.

>
> ~~(off topic a bit)~~
>
> OK so off-topic again afaict we're doing something pretty horribly gross here.
>
> We pass &priv->mm_rw_seq as 'unsigned int *seq' field to
> mmap_lock_speculate_try_begin(), which in turn calls:
>
>         return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq);
>
> And this is defined as a macro of:
>
> #define raw_seqcount_try_begin(s, start)                                \
> ({                                                                      \
>         start = raw_read_seqcount(s);                                   \
>         !(start & 1);                                                   \
> })
>
> So surely this expands to:
>
>         *seq = raw_read_seqcount(&mm->mm_lock_seq);
>         !(*seq & 1) // return true if even, false if odd
>
> So we're basically ostensibly passing an unsigned int, but because we're
> calling a macro it's actually just 'text' and we're instead able to then
> reassign the underlying unsigned int * ptr and... ugh.
>
> ~~(/off topic a bit)~~

Aaaand we are back...

>
> > +     vma = vma_next(&priv->iter);
>
>
>
> > +     if (!vma)
> > +             return NULL;
> > +
> > +     vma = trylock_vma(priv, vma, last_pos, true);
> > +     if (vma)
> > +             return vma;
> > +
>
> Really feels like this should be a boolean... I guess neat to reset vma if
> not locked though.

I guess I can change trylock_vma() to return boolean. We always return
the same vma or NULL I think.

>
> > +     /* Address space got modified, vma might be stale. Re-lock and retry */
>
> > +     rcu_read_unlock();
>
> Might we see a VMA possibly actually legit unmapped in a race here? Do we
> need to update last_pos/ppos to account for this? Otherwise we might just
> fail on the last_pos >= vma->vm_end check in trylock_vma() no?

Yes, it can happen and trylock_vma() will fail to lock the modified
VMA. That's by design. In such cases we retry the lookup from the same
last_pos.

>
> > +     ret = mmap_read_lock_killable(priv->mm);
>
> Shouldn't we set priv->mmap_locked here?

No, we will drop the mmap_read_lock shortly. priv->mmap_locked
indicates the overall mode we operate in. When priv->mmap_locked=false
we can still temporarily take the mmap_read_lock when retrying and
then drop it after we found the VMA.

>
> I guess not as we are simply holding the mmap lock to definitely get the
> next VMA.

Correct.

>
> > +     rcu_read_lock();
> > +     if (ret)
> > +             return ERR_PTR(ret);
> > +
> > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > +     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > +     vma = vma_next(&priv->iter);
> > +     if (vma) {
> > +             vma = trylock_vma(priv, vma, last_pos, false);
>
> Be good to use Liam's convention of /* mm_unstable = */ false to make this
> clear.

Yeah, I'm thinking of splitting trylock_vma() into two separate
functions for mm_unstable=true and mm_unstable=false cases.

>
> Find it kinda weird again we're 'trylocking' something we already have
> locked via the mmap lock but I already mentioend this... :)
>
> > +             WARN_ON(!vma); /* mm is stable, has to succeed */
>
> I wonder if this is really useful, at any rate seems like there'd be a
> flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really
> really ought not happen?

Well, I can't use BUG_ON(), so WARN_ON() is the next tool I have :) In
reality this should never happen, so
WARN_ON/WARN_ON_ONCE/WARN_ON_RATELIMITED/or whatever does not matter
much.

>
> > +     }
> > +     mmap_read_unlock(priv->mm);
> > +
> > +     return vma;
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> >
> > +static inline bool lock_content(struct seq_file *m,
> > +                             struct proc_maps_private *priv)
> > +{
> > +     return mmap_read_lock_killable(priv->mm) == 0;
> > +}
> > +
> > +static inline void unlock_content(struct proc_maps_private *priv)
> > +{
> > +     mmap_read_unlock(priv->mm);
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
> > +{
> > +     return vma_next(&priv->iter);
> > +}
> > +
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > +{
> > +     struct proc_maps_private *priv = m->private;
> > +     struct vm_area_struct *vma;
> > +
> > +     vma = get_next_vma(priv, *ppos);
> > +     if (IS_ERR(vma))
> > +             return vma;
> > +
> > +     /* Store previous position to be able to restart if needed */
> > +     priv->last_pos = *ppos;
> >       if (vma) {
> > -             *ppos = vma->vm_start;
> > +             /*
> > +              * Track the end of the reported vma to ensure position changes
> > +              * even if previous vma was merged with the next vma and we
> > +              * found the extended vma with the same vm_start.
> > +              */
>
> Right, so observing repetitions is acceptable in such circumstances? I mean
> I agree.

Yep, the VMA will be reported twice in such a case.

>
> > +             *ppos = vma->vm_end;
>
> If we store the end, does the last_pos logic which resets the VMA iterator
> later work correctly in all cases?

I think so. By resetting to vma->vm_end we will start the next search
from the address right next to the last reported VMA, no?

>
> >       } else {
> >               *ppos = -2UL;
> >               vma = get_gate_vma(priv->mm);
>
> Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see
> this was the original logic, but maybe put a comment about this as it's
> weird and confusing? (and not your fault obviously :P)

What comment would you like to see here?

>
> Also, are all locks and state corectly handled in this case? Seems like one
> of this nasty edge case situations that could have jagged edges...

I think we are fine. get_next_vma() returned NULL, so we did not lock
any VMA and priv->locked_vma should be NULL.

>
> > @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> >               return NULL;
> >       }
> >
> > -     if (mmap_read_lock_killable(mm)) {
> > +     if (!lock_content(m, priv)) {
>
> Nice that this just slots in like this! :)
>
> >               mmput(mm);
> >               put_task_struct(priv->task);
> >               priv->task = NULL;
> >               return ERR_PTR(-EINTR);
> >       }
> >
> > +     if (last_addr > 0)
>
> last_addr is an unsigned long, this will always be true.

Not unless last_addr==0. That's what I'm really checking here: is this
the first invocation of m_start(), in which case we are starting from
the beginning and not restarting from priv->last_pos. Should I add a
comment?

>
> You probably want to put an explicit check for -1UL, -2UL here or?
>
> God I hate this mechanism for indicating gate VMA... yuck yuck (again, this
> bit not your fault :P)

No, I don't care here about -1UL, -2UL, just that last_addr==0 or not.

>
> > +             *ppos = last_addr = priv->last_pos;
> >       vma_iter_init(&priv->iter, mm, last_addr);
> >       hold_task_mempolicy(priv);
> >       if (last_addr == -2UL)
> >               return get_gate_vma(mm);
> >
> > -     return proc_get_vma(priv, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> >               *ppos = -1UL;
> >               return NULL;
> >       }
> > -     return proc_get_vma(m->private, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void m_stop(struct seq_file *m, void *v)
> > @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
> >               return;
> >
> >       release_task_mempolicy(priv);
> > -     mmap_read_unlock(mm);
> > +     unlock_content(priv);
> >       mmput(mm);
> >       put_task_struct(priv->task);
> >       priv->task = NULL;
> > --
> > 2.49.0.1266.g31b7d2e469-goog
> >
>
> Sorry to add to workload by digging into so many details here, but we
> really need to make sure all the i's are dotted and t's are crossed given
> how fiddly and fragile this stuff is :)
>
> Very much appreciate the work, this is a significant improvement and will
> have a great deal of real world impact!

Thanks for meticulously going over the code! This is really helpful.
Suren.

>
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-04 23:11 ` [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock Suren Baghdasaryan
  2025-06-07 17:43   ` Lorenzo Stoakes
@ 2025-06-10  7:50   ` kernel test robot
  2025-06-10 14:02     ` Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: kernel test robot @ 2025-06-10  7:50 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: oe-lkp, lkp, linux-kernel, linux-fsdevel, akpm, Liam.Howlett,
	lorenzo.stoakes, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-mm, linux-kselftest, surenb, oliver.sang



Hello,

kernel test robot noticed "WARNING:at_include/linux/rwsem.h:#anon_vma_name" on:

commit: 5c3ce17006c6188d249bc07bfa639f2d76bbd8ac ("[PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock")
url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/selftests-proc-add-proc-pid-maps-tearing-from-vma-split-test/20250605-071433
patch link: https://lore.kernel.org/all/20250604231151.799834-7-surenb@google.com/
patch subject: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock

in testcase: locktorture
version: 
with following parameters:

	runtime: 300s
	test: cpuhotplug



config: x86_64-randconfig-005-20250606
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)


+-------------------------------------------------------------------------------+------------+------------+
|                                                                               | fa0f347301 | 5c3ce17006 |
+-------------------------------------------------------------------------------+------------+------------+
| WARNING:at_include/linux/rwsem.h:#anon_vma_name                               | 0          | 10         |
| RIP:anon_vma_name                                                             | 0          | 10         |
+-------------------------------------------------------------------------------+------------+------------+


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202506101503.903c6ffa-lkp@intel.com


[   41.709983][  T353] ------------[ cut here ]------------
[ 41.710541][ T353] WARNING: CPU: 1 PID: 353 at include/linux/rwsem.h:195 anon_vma_name (include/linux/rwsem.h:195) 
[   41.711251][  T353] Modules linked in:
[   41.711616][  T353] CPU: 1 UID: 0 PID: 353 Comm: grep Tainted: G                T   6.15.0-11198-g5c3ce17006c6 #1 PREEMPT  ce6b47a049c5ee6720891bd644c96f2c3c349eba
[   41.712738][  T353] Tainted: [T]=RANDSTRUCT
[   41.713101][  T353] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 41.713902][ T353] RIP: 0010:anon_vma_name (include/linux/rwsem.h:195) 
[ 41.714327][ T353] Code: 74 28 48 83 c3 40 48 89 d8 48 c1 e8 03 42 80 3c 38 00 74 08 48 89 df e8 ac 4b 02 00 48 8b 03 5b 41 5e 41 5f c3 cc cc cc cc cc <0f> 0b eb d4 48 c7 c1 74 46 b4 89 80 e1 07 80 c1 03 38 c1 7c 87 48
All code
========
   0:	74 28                	je     0x2a
   2:	48 83 c3 40          	add    $0x40,%rbx
   6:	48 89 d8             	mov    %rbx,%rax
   9:	48 c1 e8 03          	shr    $0x3,%rax
   d:	42 80 3c 38 00       	cmpb   $0x0,(%rax,%r15,1)
  12:	74 08                	je     0x1c
  14:	48 89 df             	mov    %rbx,%rdi
  17:	e8 ac 4b 02 00       	call   0x24bc8
  1c:	48 8b 03             	mov    (%rbx),%rax
  1f:	5b                   	pop    %rbx
  20:	41 5e                	pop    %r14
  22:	41 5f                	pop    %r15
  24:	c3                   	ret
  25:	cc                   	int3
  26:	cc                   	int3
  27:	cc                   	int3
  28:	cc                   	int3
  29:	cc                   	int3
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	eb d4                	jmp    0x2
  2e:	48 c7 c1 74 46 b4 89 	mov    $0xffffffff89b44674,%rcx
  35:	80 e1 07             	and    $0x7,%cl
  38:	80 c1 03             	add    $0x3,%cl
  3b:	38 c1                	cmp    %al,%cl
  3d:	7c 87                	jl     0xffffffffffffffc6
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	eb d4                	jmp    0xffffffffffffffd8
   4:	48 c7 c1 74 46 b4 89 	mov    $0xffffffff89b44674,%rcx
   b:	80 e1 07             	and    $0x7,%cl
   e:	80 c1 03             	add    $0x3,%cl
  11:	38 c1                	cmp    %al,%cl
  13:	7c 87                	jl     0xffffffffffffff9c
  15:	48                   	rex.W
[   41.715798][  T353] RSP: 0018:ffffc90002dcf9d8 EFLAGS: 00010246
[   41.716286][  T353] RAX: 0000000000000000 RBX: ffff888135319c40 RCX: ffffc90002dcfa78
[   41.716889][  T353] RDX: ffffc90002dcfa70 RSI: ffff88816ea2bc30 RDI: ffff88816d7485a8
[   41.717509][  T353] RBP: ffffc90002dcfa80 R08: 0000000000000000 R09: 0000000000000002
[   41.718117][  T353] R10: 0000000000000000 R11: ffffffff81ebd610 R12: dffffc0000000000
[   41.718710][  T353] R13: ffff888135319d10 R14: ffff888135319d10 R15: dffffc0000000000
[   41.719318][  T353] FS:  00007f17e7a81740(0000) GS:ffff88842312b000(0000) knlGS:0000000000000000
[   41.719998][  T353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   41.720503][  T353] CR2: 000055c5de49dc78 CR3: 0000000135bcc000 CR4: 00000000000406b0
[   41.721114][  T353] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   41.721717][  T353] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   41.722373][  T353] Call Trace:
[   41.722640][  T353]  <TASK>
[ 41.722881][ T353] get_vma_name (fs/proc/task_mmu.c:?) 
[ 41.723253][ T353] show_map_vma (fs/proc/task_mmu.c:509) 
[ 41.723617][ T353] show_map (fs/proc/task_mmu.c:525) 
[ 41.723922][ T353] seq_read_iter (fs/seq_file.c:231) 
[ 41.724311][ T353] seq_read (fs/seq_file.c:162) 
[ 41.724653][ T353] vfs_read (fs/read_write.c:570) 
[ 41.724981][ T353] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113) 
[ 41.725384][ T353] ksys_read (fs/read_write.c:715) 
[ 41.725703][ T353] ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[ 41.726174][ T353] do_syscall_64 (arch/x86/entry/syscall_64.c:?) 
[ 41.726538][ T353] ? find_held_lock (kernel/locking/lockdep.c:5353) 
[ 41.726900][ T353] ? exc_page_fault (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 arch/x86/mm/fault.c:1484 arch/x86/mm/fault.c:1532) 
[ 41.727288][ T353] ? do_user_addr_fault (arch/x86/include/asm/atomic.h:93 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:389 include/linux/refcount.h:432 include/linux/mmap_lock.h:142 include/linux/mmap_lock.h:237 arch/x86/mm/fault.c:1338) 
[ 41.727706][ T353] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:473) 
[ 41.728190][ T353] ? exc_page_fault (arch/x86/mm/fault.c:1536) 
[ 41.728590][ T353] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) 
[   41.729073][  T353] RIP: 0033:0x7f17e7b7c19d
[ 41.729432][ T353] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 54 0a 00 e8 49 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 24 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
All code
========
   0:	31 c0                	xor    %eax,%eax
   2:	e9 c6 fe ff ff       	jmp    0xfffffffffffffecd
   7:	50                   	push   %rax
   8:	48 8d 3d 66 54 0a 00 	lea    0xa5466(%rip),%rdi        # 0xa5475
   f:	e8 49 ff 01 00       	call   0x1ff5d
  14:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
  1b:	00 00 
  1d:	80 3d 41 24 0e 00 00 	cmpb   $0x0,0xe2441(%rip)        # 0xe2465
  24:	74 17                	je     0x3d
  26:	31 c0                	xor    %eax,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 5b                	ja     0x8d
  32:	c3                   	ret
  33:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
  3a:	00 00 00 
  3d:	48                   	rex.W
  3e:	83                   	.byte 0x83
  3f:	ec                   	in     (%dx),%al

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 5b                	ja     0x63
   8:	c3                   	ret
   9:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
  10:	00 00 00 
  13:	48                   	rex.W
  14:	83                   	.byte 0x83
  15:	ec                   	in     (%dx),%al
[   41.730862][  T353] RSP: 002b:00007fffc13c12e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   41.731448][  T353] RAX: ffffffffffffffda RBX: 00007fffc13c138c RCX: 00007f17e7b7c19d
[   41.732038][  T353] RDX: 0000000000002000 RSI: 00007f17e7a20000 RDI: 0000000000000003
[   41.732635][  T353] RBP: 00007fffc13c1390 R08: 00000000ffffffff R09: 0000000000000000
[   41.733252][  T353] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000003
[   41.733850][  T353] R13: 0000000000001000 R14: 000055c5de485951 R15: 0000000000002000
[   41.734481][  T353]  </TASK>
[   41.734719][  T353] irq event stamp: 3793
[ 41.735058][ T353] hardirqs last enabled at (3805): __console_unlock (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 kernel/printk/printk.c:344 kernel/printk/printk.c:2885) 
[ 41.735754][ T353] hardirqs last disabled at (3814): __console_unlock (kernel/printk/printk.c:342) 
[ 41.736478][ T353] softirqs last enabled at (3488): handle_softirqs (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:426 kernel/softirq.c:607) 
[ 41.737219][ T353] softirqs last disabled at (3835): __irq_exit_rcu (arch/x86/include/asm/atomic.h:23) 
[   41.737925][  T353] ---[ end trace 0000000000000000 ]---


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250610/202506101503.903c6ffa-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-10  7:50   ` kernel test robot
@ 2025-06-10 14:02     ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-10 14:02 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, linux-kernel, linux-fsdevel, akpm, Liam.Howlett,
	lorenzo.stoakes, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-mm, linux-kselftest

On Tue, Jun 10, 2025 at 12:51 AM kernel test robot
<oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:at_include/linux/rwsem.h:#anon_vma_name" on:
>
> commit: 5c3ce17006c6188d249bc07bfa639f2d76bbd8ac ("[PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock")
> url: https://github.com/intel-lab-lkp/linux/commits/Suren-Baghdasaryan/selftests-proc-add-proc-pid-maps-tearing-from-vma-split-test/20250605-071433
> patch link: https://lore.kernel.org/all/20250604231151.799834-7-surenb@google.com/
> patch subject: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock

Ah, I'll need to change anon_vma_name() to allow for only VMA to be
locked instead of doing mmap_assert_locked().

>
> in testcase: locktorture
> version:
> with following parameters:
>
>         runtime: 300s
>         test: cpuhotplug
>
>
>
> config: x86_64-randconfig-005-20250606
> compiler: clang-20
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> +-------------------------------------------------------------------------------+------------+------------+
> |                                                                               | fa0f347301 | 5c3ce17006 |
> +-------------------------------------------------------------------------------+------------+------------+
> | WARNING:at_include/linux/rwsem.h:#anon_vma_name                               | 0          | 10         |
> | RIP:anon_vma_name                                                             | 0          | 10         |
> +-------------------------------------------------------------------------------+------------+------------+
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202506101503.903c6ffa-lkp@intel.com
>
>
> [   41.709983][  T353] ------------[ cut here ]------------
> [ 41.710541][ T353] WARNING: CPU: 1 PID: 353 at include/linux/rwsem.h:195 anon_vma_name (include/linux/rwsem.h:195)
> [   41.711251][  T353] Modules linked in:
> [   41.711616][  T353] CPU: 1 UID: 0 PID: 353 Comm: grep Tainted: G                T   6.15.0-11198-g5c3ce17006c6 #1 PREEMPT  ce6b47a049c5ee6720891bd644c96f2c3c349eba
> [   41.712738][  T353] Tainted: [T]=RANDSTRUCT
> [   41.713101][  T353] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 41.713902][ T353] RIP: 0010:anon_vma_name (include/linux/rwsem.h:195)
> [ 41.714327][ T353] Code: 74 28 48 83 c3 40 48 89 d8 48 c1 e8 03 42 80 3c 38 00 74 08 48 89 df e8 ac 4b 02 00 48 8b 03 5b 41 5e 41 5f c3 cc cc cc cc cc <0f> 0b eb d4 48 c7 c1 74 46 b4 89 80 e1 07 80 c1 03 38 c1 7c 87 48
> All code
> ========
>    0:   74 28                   je     0x2a
>    2:   48 83 c3 40             add    $0x40,%rbx
>    6:   48 89 d8                mov    %rbx,%rax
>    9:   48 c1 e8 03             shr    $0x3,%rax
>    d:   42 80 3c 38 00          cmpb   $0x0,(%rax,%r15,1)
>   12:   74 08                   je     0x1c
>   14:   48 89 df                mov    %rbx,%rdi
>   17:   e8 ac 4b 02 00          call   0x24bc8
>   1c:   48 8b 03                mov    (%rbx),%rax
>   1f:   5b                      pop    %rbx
>   20:   41 5e                   pop    %r14
>   22:   41 5f                   pop    %r15
>   24:   c3                      ret
>   25:   cc                      int3
>   26:   cc                      int3
>   27:   cc                      int3
>   28:   cc                      int3
>   29:   cc                      int3
>   2a:*  0f 0b                   ud2             <-- trapping instruction
>   2c:   eb d4                   jmp    0x2
>   2e:   48 c7 c1 74 46 b4 89    mov    $0xffffffff89b44674,%rcx
>   35:   80 e1 07                and    $0x7,%cl
>   38:   80 c1 03                add    $0x3,%cl
>   3b:   38 c1                   cmp    %al,%cl
>   3d:   7c 87                   jl     0xffffffffffffffc6
>   3f:   48                      rex.W
>
> Code starting with the faulting instruction
> ===========================================
>    0:   0f 0b                   ud2
>    2:   eb d4                   jmp    0xffffffffffffffd8
>    4:   48 c7 c1 74 46 b4 89    mov    $0xffffffff89b44674,%rcx
>    b:   80 e1 07                and    $0x7,%cl
>    e:   80 c1 03                add    $0x3,%cl
>   11:   38 c1                   cmp    %al,%cl
>   13:   7c 87                   jl     0xffffffffffffff9c
>   15:   48                      rex.W
> [   41.715798][  T353] RSP: 0018:ffffc90002dcf9d8 EFLAGS: 00010246
> [   41.716286][  T353] RAX: 0000000000000000 RBX: ffff888135319c40 RCX: ffffc90002dcfa78
> [   41.716889][  T353] RDX: ffffc90002dcfa70 RSI: ffff88816ea2bc30 RDI: ffff88816d7485a8
> [   41.717509][  T353] RBP: ffffc90002dcfa80 R08: 0000000000000000 R09: 0000000000000002
> [   41.718117][  T353] R10: 0000000000000000 R11: ffffffff81ebd610 R12: dffffc0000000000
> [   41.718710][  T353] R13: ffff888135319d10 R14: ffff888135319d10 R15: dffffc0000000000
> [   41.719318][  T353] FS:  00007f17e7a81740(0000) GS:ffff88842312b000(0000) knlGS:0000000000000000
> [   41.719998][  T353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   41.720503][  T353] CR2: 000055c5de49dc78 CR3: 0000000135bcc000 CR4: 00000000000406b0
> [   41.721114][  T353] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   41.721717][  T353] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   41.722373][  T353] Call Trace:
> [   41.722640][  T353]  <TASK>
> [ 41.722881][ T353] get_vma_name (fs/proc/task_mmu.c:?)
> [ 41.723253][ T353] show_map_vma (fs/proc/task_mmu.c:509)
> [ 41.723617][ T353] show_map (fs/proc/task_mmu.c:525)
> [ 41.723922][ T353] seq_read_iter (fs/seq_file.c:231)
> [ 41.724311][ T353] seq_read (fs/seq_file.c:162)
> [ 41.724653][ T353] vfs_read (fs/read_write.c:570)
> [ 41.724981][ T353] ? do_syscall_64 (arch/x86/entry/syscall_64.c:113)
> [ 41.725384][ T353] ksys_read (fs/read_write.c:715)
> [ 41.725703][ T353] ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
> [ 41.726174][ T353] do_syscall_64 (arch/x86/entry/syscall_64.c:?)
> [ 41.726538][ T353] ? find_held_lock (kernel/locking/lockdep.c:5353)
> [ 41.726900][ T353] ? exc_page_fault (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 arch/x86/mm/fault.c:1484 arch/x86/mm/fault.c:1532)
> [ 41.727288][ T353] ? do_user_addr_fault (arch/x86/include/asm/atomic.h:93 include/linux/atomic/atomic-arch-fallback.h:949 include/linux/atomic/atomic-instrumented.h:401 include/linux/refcount.h:389 include/linux/refcount.h:432 include/linux/mmap_lock.h:142 include/linux/mmap_lock.h:237 arch/x86/mm/fault.c:1338)
> [ 41.727706][ T353] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:473)
> [ 41.728190][ T353] ? exc_page_fault (arch/x86/mm/fault.c:1536)
> [ 41.728590][ T353] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
> [   41.729073][  T353] RIP: 0033:0x7f17e7b7c19d
> [ 41.729432][ T353] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 54 0a 00 e8 49 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 24 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
> All code
> ========
>    0:   31 c0                   xor    %eax,%eax
>    2:   e9 c6 fe ff ff          jmp    0xfffffffffffffecd
>    7:   50                      push   %rax
>    8:   48 8d 3d 66 54 0a 00    lea    0xa5466(%rip),%rdi        # 0xa5475
>    f:   e8 49 ff 01 00          call   0x1ff5d
>   14:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
>   1b:   00 00
>   1d:   80 3d 41 24 0e 00 00    cmpb   $0x0,0xe2441(%rip)        # 0xe2465
>   24:   74 17                   je     0x3d
>   26:   31 c0                   xor    %eax,%eax
>   28:   0f 05                   syscall
>   2a:*  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
>   30:   77 5b                   ja     0x8d
>   32:   c3                      ret
>   33:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>   3a:   00 00 00
>   3d:   48                      rex.W
>   3e:   83                      .byte 0x83
>   3f:   ec                      in     (%dx),%al
>
> Code starting with the faulting instruction
> ===========================================
>    0:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
>    6:   77 5b                   ja     0x63
>    8:   c3                      ret
>    9:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>   10:   00 00 00
>   13:   48                      rex.W
>   14:   83                      .byte 0x83
>   15:   ec                      in     (%dx),%al
> [   41.730862][  T353] RSP: 002b:00007fffc13c12e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [   41.731448][  T353] RAX: ffffffffffffffda RBX: 00007fffc13c138c RCX: 00007f17e7b7c19d
> [   41.732038][  T353] RDX: 0000000000002000 RSI: 00007f17e7a20000 RDI: 0000000000000003
> [   41.732635][  T353] RBP: 00007fffc13c1390 R08: 00000000ffffffff R09: 0000000000000000
> [   41.733252][  T353] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000003
> [   41.733850][  T353] R13: 0000000000001000 R14: 000055c5de485951 R15: 0000000000002000
> [   41.734481][  T353]  </TASK>
> [   41.734719][  T353] irq event stamp: 3793
> [ 41.735058][ T353] hardirqs last enabled at (3805): __console_unlock (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 kernel/printk/printk.c:344 kernel/printk/printk.c:2885)
> [ 41.735754][ T353] hardirqs last disabled at (3814): __console_unlock (kernel/printk/printk.c:342)
> [ 41.736478][ T353] softirqs last enabled at (3488): handle_softirqs (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:426 kernel/softirq.c:607)
> [ 41.737219][ T353] softirqs last disabled at (3835): __irq_exit_rcu (arch/x86/include/asm/atomic.h:23)
> [   41.737925][  T353] ---[ end trace 0000000000000000 ]---
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20250610/202506101503.903c6ffa-lkp@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-08  1:41     ` Suren Baghdasaryan
@ 2025-06-10 17:43       ` Lorenzo Stoakes
  2025-06-11  0:16         ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 17:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Sat, Jun 07, 2025 at 06:41:35PM -0700, Suren Baghdasaryan wrote:
> On Sat, Jun 7, 2025 at 10:43 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Hi Suren,
> >
> > Forgive me but I am going to ask a lot of questions here :p just want to
> > make sure I'm getting everything right here.
>
> No worries and thank you for reviewing!

No problem!

>
> >
> > On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote:
> > > With maple_tree supporting vma tree traversal under RCU and per-vma
> > > locks, /proc/pid/maps can be read while holding individual vma locks
> > > instead of locking the entire address space.
> >
> > Nice :)
> >
> > > Completely lockless approach would be quite complex with the main issue
> > > being get_vma_name() using callbacks which might not work correctly with
> > > a stable vma copy, requiring original (unstable) vma.
> >
> > Hmmm can you expand on what a 'completely lockless' design might comprise of?
>
> In my previous implementation
> (https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/)
> I was doing this under RCU while checking mmap_lock seq counter to
> detect address space changes. That's what I meant by a completely
> lockless approach here.

Oh did that approach not even use VMA locks _at all_?

>
> >
> > It's super un-greppable and I've not got clangd set up with an allmod kernel to
> > triple-check but I'm seeing at least 2 (are there more?):
> >
> > gate_vma_name() which is:
> >
> >         return "[vsyscall]";
> >
> > special_mapping_name() which is:
> >
> >          return ((struct vm_special_mapping *)vma->vm_private_data)->name;
> >
> > Which I'm guessing is the issue because it's a double pointer deref...
>
> Correct but in more general terms, depending on implementation of the
> vm_ops.name callback, vma->vm_ops->name(vma) might not work correctly
> with a vma copy. special_mapping_name() is an example of that.

Yeah, this is a horrible situation to be in for such a trivial thing. But I
guess unavoidable for now.

>
> >
> > Seems such a silly issue to get stuck on, I wonder if we can't just change
> > this to function correctly?
>
> I was thinking about different ways to overcome that but once I
> realized per-vma locks result in even less contention and the
> implementation is simpler and more robust, I decided that per-vma
> locks direction is better.

Ack well in that case :)

But still it'd be nice to somehow restrict the impact of this callback.

>
> >
> > > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > > lock the vma, release the mmap_lock and continue. This guarantees the
> > > reader to make forward progress even during lock contention. This will
> >
> > Ah that fabled constant forward progress ;)
> >
> > > interfere with the writer but for a very short time while we are
> > > acquiring the per-vma lock and only when there was contention on the
> > > vma reader is interested in.
> > > One case requiring special handling is when vma changes between the
> > > time it was found and the time it got locked. A problematic case would
> > > be if vma got shrunk so that it's start moved higher in the address
> > > space and a new vma was installed at the beginning:
> > >
> > > reader found:               |--------VMA A--------|
> > > VMA is modified:            |-VMA B-|----VMA A----|
> > > reader locks modified VMA A
> > > reader reports VMA A:       |  gap  |----VMA A----|
> > >
> > > This would result in reporting a gap in the address space that does not
> > > exist. To prevent this we retry the lookup after locking the vma, however
> > > we do that only when we identify a gap and detect that the address space
> > > was changed after we found the vma.
> >
> > OK so in this case we have
> >
> > 1. Find VMA A - nothing is locked yet, but presumably we are under RCU so
> >    are... safe? From unmaps? Or are we? I guess actually the detach
> >    mechanism sorts this out for us perhaps?
>
> Yes, VMAs are RCU-safe and we do detect if it got detached after we
> found it but before we locked it.

Ack I thought so.

>
> >
> > 2. We got unlucky and did this immediately prior to VMA A having its
> >    vma->vm_start, vm_end updated to reflect the split.
>
> Yes, the split happened after we found it and before we locked it.
>
> >
> > 3. We lock VMA A, now position with an apparent gap after the prior VMA
> > which, in practice does not exist.
>
> Correct.

Ack

>
> >
> > So I am guessing that by observing sequence numbers you are able to detect
> > that a change has occurred and thus retry the operation in this situation?
>
> Yes, we detect the gap and we detect that address space has changed,
> so to endure we did not miss a split we fall back to mmap_read_lock,
> lock the VMA while holding mmap_read_lock, drop mmap_read_lock and
> retry.
>
> >
> > I know we previously discussed the possibility of this retry mechanism
> > going on forever, I guess I will see the resolution to this in the code :)
>
> Retry in this case won't go forever because we take mmap_read_lock
> during the retry. In the worst case we will be constantly falling back
> to mmap_read_lock but that's a very unlikely case (the writer should
> be constantly splitting the vma right before the reader locks it).

It might be worth adding that to commit message to underline that this has
been considered and this is the resolution.

Something like:

	we guarantee forward progress by always resolving contention via a
	fallback to an mmap-read lock.

	We shouldn't see a repeated fallback to mmap read locks in
	practice, as this require a vanishingly unlikely series of lock
	contentions (for instance due to repeated VMA split
	operations). However even if this did somehow happen, we would
	still progress.

>
> >
> > > This change is designed to reduce mmap_lock contention and prevent a
> > > process reading /proc/pid/maps files (often a low priority task, such
> > > as monitoring/data collection services) from blocking address space
> > > updates. Note that this change has a userspace visible disadvantage:
> > > it allows for sub-page data tearing as opposed to the previous mechanism
> > > where data tearing could happen only between pages of generated output
> > > data. Since current userspace considers data tearing between pages to be
> > > acceptable, we assume is will be able to handle sub-page data tearing
> > > as well.
> >
> > By tearing do you mean for instance seeing a VMA more than once due to
> > e.g. a VMA expanding in a racey way?
>
> Yes.
>
> >
> > Pedantic I know, but it might be worth goiing through all the merge case,
> > split and remap scenarios and explaining what might happen in each one (or
> > perhaps do that as some form of documentation?)
> >
> > I can try to put together a list of all of the possibilities if that would
> > be helpful.
>
> Hmm. That might be an interesting exercise. I called out this
> particular case because my tests caught it. I spent some time thinking
> about other possible scenarios where we would report a gap in a place
> where there are no gaps but could not think of anything else.

todo++; :)

>
> >
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  fs/proc/internal.h |   6 ++
> > >  fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
> > >  2 files changed, 175 insertions(+), 8 deletions(-)
> >
> > I really hate having all this logic in the proc/task_mmu.c file.
> >
> > This is really delicate stuff and I'd really like it to live in mm if
> > possible.
> >
> > I reallise this might be a total pain, but I'm quite worried about us
> > putting super-delicate, carefully written VMA handling code in different
> > places.
> >
> > Also having stuff in mm/vma.c opens the door to userland testing which,
> > when I finally have time to really expand that, would allow for some really
> > nice stress testing here.
>
> That would require some sizable refactoring. I assume code for smaps
> reading and PROCMAP_QUERY would have to be moved as well?

Yeah, I know, and apologies for that, but I really oppose us having this
super delicate VMA logic in an fs/proc file, one we don't maintain for that
matter.

I know it's a total pain, but this just isn't the right place to be doing
such a careful dance.

I'm not saying relocate code that belongs here, but find a way to abstract
the operations.

Perhaps could be a walker or something that does all the state transition
stuff that you can then just call from the walker functions here?

You could then figure out something similar for the PROCMAP_QUERY logic.

We're not doing this VMA locking stuff for smaps are we? As that is walking
page tables anyway right? So nothing would change for that.

>
> >
> > >
> > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > index 96122e91c645..3728c9012687 100644
> > > --- a/fs/proc/internal.h
> > > +++ b/fs/proc/internal.h
> > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > >       struct task_struct *task;
> > >       struct mm_struct *mm;
> > >       struct vma_iterator iter;
> > > +     loff_t last_pos;
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +     bool mmap_locked;
> > > +     unsigned int mm_wr_seq;
> >
> > Is this the _last_ sequence number observed in the mm_struct? or rather,
> > previous? Nitty but maybe worth renaming accordingly.
>
> It's a copy of the mm->mm_wr_seq. I can add a comment if needed.

Right, of course. But I think the problem is the 'when' it refers to. It's
the sequence number associatied with the mm here sure, but when was it
snapshotted? How do we use it?

Something like 'last_seen_seqnum' or 'mm_wr_seq_start' or something plus a
comment would be helpful.

This is nitty I know... but this stuff is very confusing and I think every
little bit we do to help explain things is helpful here.

>
> >
> > > +     struct vm_area_struct *locked_vma;
> > > +#endif
> > >  #ifdef CONFIG_NUMA
> > >       struct mempolicy *task_mempolicy;
> > >  #endif
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 27972c0749e7..36d883c4f394 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > >  }
> > >  #endif
> > >
> > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > > -                                             loff_t *ppos)
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +
> > > +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
> > > +                                       struct vm_area_struct *vma,
> > > +                                       unsigned long last_pos,
> > > +                                       bool mm_unstable)
> >
> > This whole function is a bit weird tbh, you handle both the
> > mm_unstable=true and mm_unstable=false cases, in the latter we don't try to
> > lock at all...
>
> Why do you think so? vma_start_read() is always called but in case
> mm_unstable=true we double check for the gaps to take care of the case
> I mentioned in the changelog.

Well the read lock will always succeed if mmap read lock is held right?
Actually... no :)

I see your point below about vma_start_read_locked() :>)

I see below you suggest splitting into two functions, that seems to be a
good way forward.

I _think_ we won't even need the checks re: mm and last_pos in that case
right? As holding the mmap lock we should be able to guarantee? Or at least
the mm check?

>
> >
> > Nitty (sorry I know this is mildly irritating review) but maybe needs to be
> > renamed, or split up somehow?
> >
> > This is only trylocking in the mm_unstable case...
>
> Nope, I think you misunderstood the intention, as I mentioned above.
>
> >
> > > +{
> > > +     vma = vma_start_read(priv->mm, vma);
> >
> > Do we want to do this with mm_unstable == false?
>
> Yes, always. mm_unstable=true only indicates that we are already
> holding mmap_read_lock, so we don't need to double-check for gaps.
> Perhaps I should add some comments to clarify what purpose this
> parameter serves...
>
> >
> > I know (from my own documentation :)) taking a VMA read lock while holding
> > an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal?
>
> Ah, right. I should use vma_start_read_locked() instead when we are
> holding mmap_read_lock. That's why that function was introduced. Will
> change.

Yeah, I'll pretend this is what I meant to sound smart :P but this is a
really good point!

>
> >
> > > +     if (IS_ERR_OR_NULL(vma))
> > > +             return NULL;
> >
> > Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years
> > ago from people moaning at me on code review :)
> >
> > Sorry I know that's annoying but perhaps its indicative of an issue in the
> > interface? That's possibly out of scope here however.
>
> lock_vma_under_rcu() returns NULL or EAGAIN to signal
> lock_vma_under_rcu() that it should retry the VMA lookup. In here in
> either case we retry under mmap_read_lock, that's why EAGAIN is
> ignored.

Yeah indeed you're right. I guess I'm just echoing previous review traumas
here :P

>
> >
> > Why are we ignoring errors here though? I guess because we don't care if
> > the VMA got detached from under us, we don't bother retrying like we do in
> > lock_vma_under_rcu()?
>
> No, we take mmap_read_lock and retry in either case. Perhaps I should
> split trylock_vma() into two separate functions - one for the case
> when we are holding mmap_read_lock and another one when we don't? I
> think that would have prevented many of your questions. I'll try that
> and see how it looks.

Yeah that'd be helpful. I think this should also simplify things?

>
> >
> > Should we just abstract that part of lock_vma_under_rcu() and use it?
>
> trylock_vma() is not that similar to lock_vma_under_rcu() for that
> IMO. Also lock_vma_under_rcu() is in the pagefault path which is very
> hot, so I would not want to add conditions there to make it work for
> trylock_vma().

Right sure.

But I'm just wondering why we don't do the retry stuff, e.g.:

		/* Check if the VMA got isolated after we found it */
		if (PTR_ERR(vma) == -EAGAIN) {
			count_vm_vma_lock_event(VMA_LOCK_MISS);
			/* The area was replaced with another one */
			goto retry;
		}

I mean do we need to retry under mmap lock in that case? Can we just retry
the lookup? Or is this not a worthwhile optimisation here?

>
> >
> > > +
> > > +     /* Check if the vma we locked is the right one. */
> >
> > Well it might not be the right one :) but might still belong to the right
> > mm, so maybe better to refer to the right virtual address space.
>
> Ack. Will change to "Check if the vma belongs to the right address space. "

Thanks!

>
> >
> > > +     if (unlikely(vma->vm_mm != priv->mm))
> > > +             goto err;
> > > +
> > > +     /* vma should not be ahead of the last search position. */
> >
> > You mean behind the last search position? Surely a VMA being _ahead_ of it
> > is fine?
>
> Yes, you are correct. "should not" should have been "should".

Thanks!

>
> >
> > > +     if (unlikely(last_pos >= vma->vm_end))
> >
> > Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that
> > problematic? Or is last_pos inclusive?
>
> last_pos is inclusive and vma->vm_end is not inclusive, so if last_pos
> == vma->vm_end that would mean the vma is behind the last_pos. Since
> we are searching forward from the last_pos, we should not be finding a
> vma before last_pos unless it mutated.

Ahhh that explains it. Thanks.

>
> >
> > > +             goto err;
> >
> > Am I correct in thinking thi is what is being checked?
> >
> >           last_pos
> >              |
> >              v
> > |---------|
> > |         |
> > |---------|
> >         vm_end
> >    <--- vma 'next'??? How did we go backwards?
>
> Exactly.
>
> >
> > When last_pos gets updated, is it possible for a shrink to race to cause
> > this somehow?
>
> No, we update last_pos only after we locked the vma and confirmed it's
> the right one.

Ack.

>
> >
> > Do we treat this as an entirely unexpected error condition? In which case
> > is a WARN_ON_ONCE() warranted?
>
> No, the VMA might have mutated from under us before we locked it. For
> example it might have been remapped to a higher address.
>
> >
> > > +
> > > +     /*
> > > +      * vma ahead of last search position is possible but we need to
> > > +      * verify that it was not shrunk after we found it, and another
> > > +      * vma has not been installed ahead of it. Otherwise we might
> > > +      * observe a gap that should not be there.
> > > +      */
> >
> > OK so this is the juicy bit.
>
> Yep, that's the case singled out in the changelog.

And rightly so!

>
> >
> >
> > > +     if (mm_unstable && last_pos < vma->vm_start) {
> > > +             /* Verify only if the address space changed since vma lookup. */
> > > +             if ((priv->mm_wr_seq & 1) ||
> >
> > Can we wrap this into a helper? This is a 'you just have to know that odd
> > seq number means a write operation is in effect'. I know you have a comment
> > here, but I think something like:
> >
> >         if (has_mm_been_modified(priv) ||
> >
> > Would be a lot clearer.
>
> Yeah, I was thinking about that. I think an even cleaner way would be
> to remember the return value of mmap_lock_speculate_try_begin() and
> pass it around. I was hoping to avoid that extra parameter but sounds
> like for the sake of clarity that would be preferable?

You know, it's me so I might have to mention a helper struct here :P it's
the two most Lorenzo things - helper sructs and churn...

>
> >
> > Again this speaks to the usefulness of abstracting all this logic from the
> > proc code, we are putting super delicate VMA stuff here and it's just not
> > the right place.
> >
> > As an aside, I don't see coverage in the process_addrs documentation on
> > sequence number odd/even or speculation?
> >
> > I think we probably need to cover this to maintain an up-to-date
> > description of how the VMA locking mechanism works and is used?
>
> I think that's a very low level technical detail which I should not
> have exposed here. As I mentioned, I should simply store the return
> value of mmap_lock_speculate_try_begin() instead of doing these tricky
> mm_wr_seq checks.

Right yeah I'm all for simplifying if we can! Sounds sensible.

>
> >
> > > +                 mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
> >
> > Nit, again unrelated to this series, but would be useful to add a comment
> > to mmap_lock_speculate_retry() to indicate that a true return value
> > indicates a retry is needed, or renaming it.
>
> This is how seqcount API works in general. Note that
> mmap_lock_speculate_retry() is just a wrapper around
> read_seqcount_retry().

Yeah, I guess I can moan to PeterZ about that :P

It's not a big deal honestly, but it was just something I found confusing.

I think adjusting the comment above to something like:

		/*
		 * Verify if the address space changed since vma lookup, or if
		 * the speculative lock needs to be retried.
		 */

Or perhaps somethig more in line with the description you give below?

>
> >
> > Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a
> > comment.
>
> See https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/seqlock.h#L395

Yeah I saw that, but going 2 levels deep to read a comment isn't great.

But again this isn't the end of the world.

>
> >
> > Naming is hard :P
> >
> > Anyway the totality of this expression is 'something changed' or 'read
> > section retry required'.
>
> Not quite. The expression is "something is changed from under us or
> something was changing even before we started VMA lookup". Or in more
> technical terms, mmap_write_lock was acquired while we were locking
> the VMA or mmap_write_lock was already held even before we started the
> VMA search.

OK so read section retry required = the seq num changes from under us
(checked carefully with memory barriers and carefully considered and
thought out such logic), and the priv->mm_wr_seq check before it is the
'was this changed even before we began?'

I wonder btw if we could put both into a single helper function to check
whether that'd be clearer.

>
> >
> > Under what circumstances would this happen?
>
> See my previous comment and I hope that clarifies it.

Thanks!

>
> >
> > OK so we're into the 'retry' logic here:
> >
> > > +                     vma_iter_init(&priv->iter, priv->mm, last_pos);
> >
> > I'd definitely want Liam to confirm this is all above board and correct, as
> > these operations are pretty sensitive.
> >
> > But assuming this is safe, we reset the iterator to the last position...
> >
> > > +                     if (vma != vma_next(&priv->iter))
> >
> > Then assert the following VMA is the one we seek.
> >
> > > +                             goto err;
> >
> > Might this ever be the case in the course of ordinary operation? Is this
> > really an error?
>
> This simply means that the VMA we found before is not at the place we
> found it anymore. The locking fails and we should retry.

I know it's pedantic but feels like 'err' is not a great name for this.

Maybe 'nolock' or something? Or 'lock_failed'?

>
> >
> > > +             }
> > > +     }
> > > +
> > > +     priv->locked_vma = vma;
> > > +
> > > +     return vma;
> > > +err:
> >
> > As queried above, is this really an error path or something we might expect
> > to happen that could simply result in an expected fallback to mmap lock?
>
> It's a failure to lock the VMA, which is handled by retrying under
> mmap_read_lock. So, trylock_vma() failure does not mean a fault in the
> logic. It's expected to happen occasionally.

Ack yes understood thanks!

>
> >
> > > +     vma_end_read(vma);
> > > +     return NULL;
> > > +}
> > > +
> > > +
> > > +static void unlock_vma(struct proc_maps_private *priv)
> > > +{
> > > +     if (priv->locked_vma) {
> > > +             vma_end_read(priv->locked_vma);
> > > +             priv->locked_vma = NULL;
> > > +     }
> > > +}
> > > +
> > > +static const struct seq_operations proc_pid_maps_op;
> > > +
> > > +static inline bool lock_content(struct seq_file *m,
> > > +                             struct proc_maps_private *priv)
> >
> > Pedantic I know but isn't 'lock_content' a bit generic?
> >
> > He says, not being able to think of a great alternative...
> >
> > OK maybe fine... :)
>
> Yeah, I struggled with this myself. Help in naming is appreciated.

This is where it gets difficult haha so easy to point out but not so easy
to fix...

lock_vma_range()?

>
> >
> > > +{
> > > +     /*
> > > +      * smaps and numa_maps perform page table walk, therefore require
> > > +      * mmap_lock but maps can be read with locked vma only.
> > > +      */
> > > +     if (m->op != &proc_pid_maps_op) {
> >
> > Nit but is there a neater way of checking this? Actually I imagine not...
> >
> > But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g.
> >
> > static inline bool is_maps_op(struct seq_file *m);
> >
> > And check e.g.
> >
> > if (is_maps_op(m)) { ... in the above.
> >
> > Yeah this is nitty not a massive del :)
>
> I'll try that and see how it looks. Thanks!

Thanks!

>
> >
> > > +             if (mmap_read_lock_killable(priv->mm))
> > > +                     return false;
> > > +
> > > +             priv->mmap_locked = true;
> > > +     } else {
> > > +             rcu_read_lock();
> > > +             priv->locked_vma = NULL;
> > > +             priv->mmap_locked = false;
> > > +     }
> > > +
> > > +     return true;
> > > +}
> > > +
> > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > +{
> > > +     if (priv->mmap_locked) {
> > > +             mmap_read_unlock(priv->mm);
> > > +     } else {
> > > +             unlock_vma(priv);
> > > +             rcu_read_unlock();
> >
> > Does this always get called even in error cases?
>
> What error cases do you have in mind? Error to lock a VMA is handled
> by retrying and we should be happily proceeding. Please clarify.

Well it was more of a question really - can the traversal through
/proc/$pid/maps result in some kind of error that doesn't reach this
function, thereby leaving things locked mistakenly?

If not then happy days :)

I'm guessing there isn't.

>
> >
> > > +     }
> > > +}
> > > +
> > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > +                                        loff_t last_pos)
> >
> > We really need a generalised RCU multi-VMA locking mechanism (we're looking
> > into madvise VMA locking atm with a conservative single VMA lock at the
> > moment, but in future we probably want to be able to span multiple for
> > instance) and this really really feels like it doesn't belong in this proc
> > code.
>
> Ok, I guess you are building a case to move more code into vma.c? I
> see what you are doing :)

Haha damn it, my evil plans revealed :P

>
> >
> > >  {
> > > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > > +     struct vm_area_struct *vma;
> > > +     int ret;
> > > +
> > > +     if (priv->mmap_locked)
> > > +             return vma_next(&priv->iter);
> > > +
> > > +     unlock_vma(priv);
> > > +     /*
> > > +      * Record sequence number ahead of vma lookup.
> > > +      * Odd seqcount means address space modification is in progress.
> > > +      */
> > > +     mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> >
> > Hmm we're discarding the return value I guess we don't really care about
> > that at this stage? Or do we? Do we want to assert the read critical
> > section state here?
>
> Yeah, as I mentioned, instead of relying on priv->mm_wr_seq being odd
> I should record the return value of mmap_lock_speculate_try_begin().
> In the functional sense these two are interchangeable.

Ack, thanks!

>
> >
> > I guess since we have the mm_rq_seq which we use later it's the same thing
> > and doesn't matter.
>
> Yep.

Ack

>
> >
> > ~~(off topic a bit)~~
> >
> > OK so off-topic again afaict we're doing something pretty horribly gross here.
> >
> > We pass &priv->mm_rw_seq as 'unsigned int *seq' field to
> > mmap_lock_speculate_try_begin(), which in turn calls:
> >
> >         return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq);
> >
> > And this is defined as a macro of:
> >
> > #define raw_seqcount_try_begin(s, start)                                \
> > ({                                                                      \
> >         start = raw_read_seqcount(s);                                   \
> >         !(start & 1);                                                   \
> > })
> >
> > So surely this expands to:
> >
> >         *seq = raw_read_seqcount(&mm->mm_lock_seq);
> >         !(*seq & 1) // return true if even, false if odd
> >
> > So we're basically ostensibly passing an unsigned int, but because we're
> > calling a macro it's actually just 'text' and we're instead able to then
> > reassign the underlying unsigned int * ptr and... ugh.
> >
> > ~~(/off topic a bit)~~
>
> Aaaand we are back...

:)) yeah this isn't your fault, just a related 'wtf' moan :P we can pretend
like it never happened *ahem*

>
> >
> > > +     vma = vma_next(&priv->iter);
> >
> >
> >
> > > +     if (!vma)
> > > +             return NULL;
> > > +
> > > +     vma = trylock_vma(priv, vma, last_pos, true);
> > > +     if (vma)
> > > +             return vma;
> > > +
> >
> > Really feels like this should be a boolean... I guess neat to reset vma if
> > not locked though.
>
> I guess I can change trylock_vma() to return boolean. We always return
> the same vma or NULL I think.

Ack, I mean I guess you're looking at reworking it in general so can take
this into account.

>
> >
> > > +     /* Address space got modified, vma might be stale. Re-lock and retry */
> >
> > > +     rcu_read_unlock();
> >
> > Might we see a VMA possibly actually legit unmapped in a race here? Do we
> > need to update last_pos/ppos to account for this? Otherwise we might just
> > fail on the last_pos >= vma->vm_end check in trylock_vma() no?
>
> Yes, it can happen and trylock_vma() will fail to lock the modified
> VMA. That's by design. In such cases we retry the lookup from the same
> last_pos.

OK and then we're fine with it because the gap we report will be an actual
gap.

>
> >
> > > +     ret = mmap_read_lock_killable(priv->mm);
> >
> > Shouldn't we set priv->mmap_locked here?
>
> No, we will drop the mmap_read_lock shortly. priv->mmap_locked
> indicates the overall mode we operate in. When priv->mmap_locked=false
> we can still temporarily take the mmap_read_lock when retrying and
> then drop it after we found the VMA.

Right yeah, makes sense.

>
> >
> > I guess not as we are simply holding the mmap lock to definitely get the
> > next VMA.
>
> Correct.

Ack

>
> >
> > > +     rcu_read_lock();
> > > +     if (ret)
> > > +             return ERR_PTR(ret);
> > > +
> > > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > > +     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > > +     vma = vma_next(&priv->iter);
> > > +     if (vma) {
> > > +             vma = trylock_vma(priv, vma, last_pos, false);
> >
> > Be good to use Liam's convention of /* mm_unstable = */ false to make this
> > clear.
>
> Yeah, I'm thinking of splitting trylock_vma() into two separate
> functions for mm_unstable=true and mm_unstable=false cases.

Yes :) thanks!

>
> >
> > Find it kinda weird again we're 'trylocking' something we already have
> > locked via the mmap lock but I already mentioend this... :)
> >
> > > +             WARN_ON(!vma); /* mm is stable, has to succeed */
> >
> > I wonder if this is really useful, at any rate seems like there'd be a
> > flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really
> > really ought not happen?
>
> Well, I can't use BUG_ON(), so WARN_ON() is the next tool I have :) In
> reality this should never happen, so
> WARN_ON/WARN_ON_ONCE/WARN_ON_RATELIMITED/or whatever does not matter
> much.

I think if you refactor into two separate functions this becomes even more
unnecessary because then you are using a vma lock function that can never
fail etc.

I mean maybe just stick a VM_ in front if it's not going to happen but for
debug/dev/early stabilisation purposes we want to keep an eye on it.

>
> >
> > > +     }
> > > +     mmap_read_unlock(priv->mm);
> > > +
> > > +     return vma;
> > > +}
> > > +
> > > +#else /* CONFIG_PER_VMA_LOCK */
> > >
> > > +static inline bool lock_content(struct seq_file *m,
> > > +                             struct proc_maps_private *priv)
> > > +{
> > > +     return mmap_read_lock_killable(priv->mm) == 0;
> > > +}
> > > +
> > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > +{
> > > +     mmap_read_unlock(priv->mm);
> > > +}
> > > +
> > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > +                                        loff_t last_pos)
> > > +{
> > > +     return vma_next(&priv->iter);
> > > +}
> > > +
> > > +#endif /* CONFIG_PER_VMA_LOCK */
> > > +
> > > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > > +{
> > > +     struct proc_maps_private *priv = m->private;
> > > +     struct vm_area_struct *vma;
> > > +
> > > +     vma = get_next_vma(priv, *ppos);
> > > +     if (IS_ERR(vma))
> > > +             return vma;
> > > +
> > > +     /* Store previous position to be able to restart if needed */
> > > +     priv->last_pos = *ppos;
> > >       if (vma) {
> > > -             *ppos = vma->vm_start;
> > > +             /*
> > > +              * Track the end of the reported vma to ensure position changes
> > > +              * even if previous vma was merged with the next vma and we
> > > +              * found the extended vma with the same vm_start.
> > > +              */
> >
> > Right, so observing repetitions is acceptable in such circumstances? I mean
> > I agree.
>
> Yep, the VMA will be reported twice in such a case.

Ack.

>
> >
> > > +             *ppos = vma->vm_end;
> >
> > If we store the end, does the last_pos logic which resets the VMA iterator
> > later work correctly in all cases?
>
> I think so. By resetting to vma->vm_end we will start the next search
> from the address right next to the last reported VMA, no?

Yeah, I was just wondering whether there were any odd corner case that
might be problematic.

But since we treat last_pos as inclusive as you said in a response above,
and of course vma->vm_end is exclusive, then this makes sense.

>
> >
> > >       } else {
> > >               *ppos = -2UL;
> > >               vma = get_gate_vma(priv->mm);
> >
> > Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see
> > this was the original logic, but maybe put a comment about this as it's
> > weird and confusing? (and not your fault obviously :P)
>
> What comment would you like to see here?

It's so gross this. I guess something about the inner workings of gate VMAs
and the use of -2UL as a weird sentinel etc.

But this is out of scope here.

>
> >
> > Also, are all locks and state corectly handled in this case? Seems like one
> > of this nasty edge case situations that could have jagged edges...
>
> I think we are fine. get_next_vma() returned NULL, so we did not lock
> any VMA and priv->locked_vma should be NULL.
>
> >
> > > @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> > >               return NULL;
> > >       }
> > >
> > > -     if (mmap_read_lock_killable(mm)) {
> > > +     if (!lock_content(m, priv)) {
> >
> > Nice that this just slots in like this! :)
> >
> > >               mmput(mm);
> > >               put_task_struct(priv->task);
> > >               priv->task = NULL;
> > >               return ERR_PTR(-EINTR);
> > >       }
> > >
> > > +     if (last_addr > 0)
> >
> > last_addr is an unsigned long, this will always be true.
>
> Not unless last_addr==0. That's what I'm really checking here: is this
> the first invocation of m_start(), in which case we are starting from
> the beginning and not restarting from priv->last_pos. Should I add a
> comment?

Yeah sorry I was being an idiot, I misread this as >= 0 obviously.

I had assumed you were checking for the -2 and -1 cases (though -1 early
exits above).

So in that case, are you handling the gate VMA correctly here? Surely we
should exclude that? Wouldn't setting ppos = last_addr = priv->last_pos be
incorrect if this were a gate vma?

Even if we then call get_gate_vma() we've changed these values? Or is that
fine?

And yeah a comment would be good thanks!

>
> >
> > You probably want to put an explicit check for -1UL, -2UL here or?
> >
> > God I hate this mechanism for indicating gate VMA... yuck yuck (again, this
> > bit not your fault :P)
>
> No, I don't care here about -1UL, -2UL, just that last_addr==0 or not.

OK, so maybe above concerns not a thing.

>
> >
> > > +             *ppos = last_addr = priv->last_pos;
> > >       vma_iter_init(&priv->iter, mm, last_addr);
> > >       hold_task_mempolicy(priv);
> > >       if (last_addr == -2UL)
> > >               return get_gate_vma(mm);
> > >
> > > -     return proc_get_vma(priv, ppos);
> > > +     return proc_get_vma(m, ppos);
> > >  }
> > >
> > >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > > @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > >               *ppos = -1UL;
> > >               return NULL;
> > >       }
> > > -     return proc_get_vma(m->private, ppos);
> > > +     return proc_get_vma(m, ppos);
> > >  }
> > >
> > >  static void m_stop(struct seq_file *m, void *v)
> > > @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
> > >               return;
> > >
> > >       release_task_mempolicy(priv);
> > > -     mmap_read_unlock(mm);
> > > +     unlock_content(priv);
> > >       mmput(mm);
> > >       put_task_struct(priv->task);
> > >       priv->task = NULL;
> > > --
> > > 2.49.0.1266.g31b7d2e469-goog
> > >
> >
> > Sorry to add to workload by digging into so many details here, but we
> > really need to make sure all the i's are dotted and t's are crossed given
> > how fiddly and fragile this stuff is :)
> >
> > Very much appreciate the work, this is a significant improvement and will
> > have a great deal of real world impact!
>
> Thanks for meticulously going over the code! This is really helpful.
> Suren.

No problem!

>
> >
> > Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-10 17:43       ` Lorenzo Stoakes
@ 2025-06-11  0:16         ` Suren Baghdasaryan
  2025-06-11 10:24           ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-11  0:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Tue, Jun 10, 2025 at 10:43 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Sat, Jun 07, 2025 at 06:41:35PM -0700, Suren Baghdasaryan wrote:
> > On Sat, Jun 7, 2025 at 10:43 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > Hi Suren,
> > >
> > > Forgive me but I am going to ask a lot of questions here :p just want to
> > > make sure I'm getting everything right here.
> >
> > No worries and thank you for reviewing!
>
> No problem!
>
> >
> > >
> > > On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote:
> > > > With maple_tree supporting vma tree traversal under RCU and per-vma
> > > > locks, /proc/pid/maps can be read while holding individual vma locks
> > > > instead of locking the entire address space.
> > >
> > > Nice :)
> > >
> > > > Completely lockless approach would be quite complex with the main issue
> > > > being get_vma_name() using callbacks which might not work correctly with
> > > > a stable vma copy, requiring original (unstable) vma.
> > >
> > > Hmmm can you expand on what a 'completely lockless' design might comprise of?
> >
> > In my previous implementation
> > (https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/)
> > I was doing this under RCU while checking mmap_lock seq counter to
> > detect address space changes. That's what I meant by a completely
> > lockless approach here.
>
> Oh did that approach not even use VMA locks _at all_?

Correct, it was done under RCU protection.

>
> >
> > >
> > > It's super un-greppable and I've not got clangd set up with an allmod kernel to
> > > triple-check but I'm seeing at least 2 (are there more?):
> > >
> > > gate_vma_name() which is:
> > >
> > >         return "[vsyscall]";
> > >
> > > special_mapping_name() which is:
> > >
> > >          return ((struct vm_special_mapping *)vma->vm_private_data)->name;
> > >
> > > Which I'm guessing is the issue because it's a double pointer deref...
> >
> > Correct but in more general terms, depending on implementation of the
> > vm_ops.name callback, vma->vm_ops->name(vma) might not work correctly
> > with a vma copy. special_mapping_name() is an example of that.
>
> Yeah, this is a horrible situation to be in for such a trivial thing. But I
> guess unavoidable for now.
>
> >
> > >
> > > Seems such a silly issue to get stuck on, I wonder if we can't just change
> > > this to function correctly?
> >
> > I was thinking about different ways to overcome that but once I
> > realized per-vma locks result in even less contention and the
> > implementation is simpler and more robust, I decided that per-vma
> > locks direction is better.
>
> Ack well in that case :)
>
> But still it'd be nice to somehow restrict the impact of this callback.

With VMA locked we are back in a safe place, I think.

>
> >
> > >
> > > > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > > > lock the vma, release the mmap_lock and continue. This guarantees the
> > > > reader to make forward progress even during lock contention. This will
> > >
> > > Ah that fabled constant forward progress ;)
> > >
> > > > interfere with the writer but for a very short time while we are
> > > > acquiring the per-vma lock and only when there was contention on the
> > > > vma reader is interested in.
> > > > One case requiring special handling is when vma changes between the
> > > > time it was found and the time it got locked. A problematic case would
> > > > be if vma got shrunk so that it's start moved higher in the address
> > > > space and a new vma was installed at the beginning:
> > > >
> > > > reader found:               |--------VMA A--------|
> > > > VMA is modified:            |-VMA B-|----VMA A----|
> > > > reader locks modified VMA A
> > > > reader reports VMA A:       |  gap  |----VMA A----|
> > > >
> > > > This would result in reporting a gap in the address space that does not
> > > > exist. To prevent this we retry the lookup after locking the vma, however
> > > > we do that only when we identify a gap and detect that the address space
> > > > was changed after we found the vma.
> > >
> > > OK so in this case we have
> > >
> > > 1. Find VMA A - nothing is locked yet, but presumably we are under RCU so
> > >    are... safe? From unmaps? Or are we? I guess actually the detach
> > >    mechanism sorts this out for us perhaps?
> >
> > Yes, VMAs are RCU-safe and we do detect if it got detached after we
> > found it but before we locked it.
>
> Ack I thought so.
>
> >
> > >
> > > 2. We got unlucky and did this immediately prior to VMA A having its
> > >    vma->vm_start, vm_end updated to reflect the split.
> >
> > Yes, the split happened after we found it and before we locked it.
> >
> > >
> > > 3. We lock VMA A, now position with an apparent gap after the prior VMA
> > > which, in practice does not exist.
> >
> > Correct.
>
> Ack
>
> >
> > >
> > > So I am guessing that by observing sequence numbers you are able to detect
> > > that a change has occurred and thus retry the operation in this situation?
> >
> > Yes, we detect the gap and we detect that address space has changed,
> > so to endure we did not miss a split we fall back to mmap_read_lock,
> > lock the VMA while holding mmap_read_lock, drop mmap_read_lock and
> > retry.
> >
> > >
> > > I know we previously discussed the possibility of this retry mechanism
> > > going on forever, I guess I will see the resolution to this in the code :)
> >
> > Retry in this case won't go forever because we take mmap_read_lock
> > during the retry. In the worst case we will be constantly falling back
> > to mmap_read_lock but that's a very unlikely case (the writer should
> > be constantly splitting the vma right before the reader locks it).
>
> It might be worth adding that to commit message to underline that this has
> been considered and this is the resolution.
>
> Something like:
>
>         we guarantee forward progress by always resolving contention via a
>         fallback to an mmap-read lock.
>
>         We shouldn't see a repeated fallback to mmap read locks in
>         practice, as this require a vanishingly unlikely series of lock
>         contentions (for instance due to repeated VMA split
>         operations). However even if this did somehow happen, we would
>         still progress.

Ack.

>
> >
> > >
> > > > This change is designed to reduce mmap_lock contention and prevent a
> > > > process reading /proc/pid/maps files (often a low priority task, such
> > > > as monitoring/data collection services) from blocking address space
> > > > updates. Note that this change has a userspace visible disadvantage:
> > > > it allows for sub-page data tearing as opposed to the previous mechanism
> > > > where data tearing could happen only between pages of generated output
> > > > data. Since current userspace considers data tearing between pages to be
> > > > acceptable, we assume is will be able to handle sub-page data tearing
> > > > as well.
> > >
> > > By tearing do you mean for instance seeing a VMA more than once due to
> > > e.g. a VMA expanding in a racey way?
> >
> > Yes.
> >
> > >
> > > Pedantic I know, but it might be worth goiing through all the merge case,
> > > split and remap scenarios and explaining what might happen in each one (or
> > > perhaps do that as some form of documentation?)
> > >
> > > I can try to put together a list of all of the possibilities if that would
> > > be helpful.
> >
> > Hmm. That might be an interesting exercise. I called out this
> > particular case because my tests caught it. I spent some time thinking
> > about other possible scenarios where we would report a gap in a place
> > where there are no gaps but could not think of anything else.
>
> todo++; :)
>
> >
> > >
> > > >
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  fs/proc/internal.h |   6 ++
> > > >  fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
> > > >  2 files changed, 175 insertions(+), 8 deletions(-)
> > >
> > > I really hate having all this logic in the proc/task_mmu.c file.
> > >
> > > This is really delicate stuff and I'd really like it to live in mm if
> > > possible.
> > >
> > > I reallise this might be a total pain, but I'm quite worried about us
> > > putting super-delicate, carefully written VMA handling code in different
> > > places.
> > >
> > > Also having stuff in mm/vma.c opens the door to userland testing which,
> > > when I finally have time to really expand that, would allow for some really
> > > nice stress testing here.
> >
> > That would require some sizable refactoring. I assume code for smaps
> > reading and PROCMAP_QUERY would have to be moved as well?
>
> Yeah, I know, and apologies for that, but I really oppose us having this
> super delicate VMA logic in an fs/proc file, one we don't maintain for that
> matter.
>
> I know it's a total pain, but this just isn't the right place to be doing
> such a careful dance.
>
> I'm not saying relocate code that belongs here, but find a way to abstract
> the operations.

Ok, I'll take a stab at refactoring purely mm-related code and will
see how that looks.

>
> Perhaps could be a walker or something that does all the state transition
> stuff that you can then just call from the walker functions here?
>
> You could then figure out something similar for the PROCMAP_QUERY logic.
>
> We're not doing this VMA locking stuff for smaps are we? As that is walking
> page tables anyway right? So nothing would change for that.

Yeah, smaps would stay as they are but refactoring might affect its
code portions as well.

>
> >
> > >
> > > >
> > > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > > index 96122e91c645..3728c9012687 100644
> > > > --- a/fs/proc/internal.h
> > > > +++ b/fs/proc/internal.h
> > > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > > >       struct task_struct *task;
> > > >       struct mm_struct *mm;
> > > >       struct vma_iterator iter;
> > > > +     loff_t last_pos;
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +     bool mmap_locked;
> > > > +     unsigned int mm_wr_seq;
> > >
> > > Is this the _last_ sequence number observed in the mm_struct? or rather,
> > > previous? Nitty but maybe worth renaming accordingly.
> >
> > It's a copy of the mm->mm_wr_seq. I can add a comment if needed.
>
> Right, of course. But I think the problem is the 'when' it refers to. It's
> the sequence number associatied with the mm here sure, but when was it
> snapshotted? How do we use it?
>
> Something like 'last_seen_seqnum' or 'mm_wr_seq_start' or something plus a
> comment would be helpful.
>
> This is nitty I know... but this stuff is very confusing and I think every
> little bit we do to help explain things is helpful here.

Ok, I'll add a comment that mm_wr_seq is a snapshot of mm->mm_wr_seq
before we started the VMA lookup.

>
> >
> > >
> > > > +     struct vm_area_struct *locked_vma;
> > > > +#endif
> > > >  #ifdef CONFIG_NUMA
> > > >       struct mempolicy *task_mempolicy;
> > > >  #endif
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 27972c0749e7..36d883c4f394 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > > >  }
> > > >  #endif
> > > >
> > > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > > > -                                             loff_t *ppos)
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +
> > > > +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
> > > > +                                       struct vm_area_struct *vma,
> > > > +                                       unsigned long last_pos,
> > > > +                                       bool mm_unstable)
> > >
> > > This whole function is a bit weird tbh, you handle both the
> > > mm_unstable=true and mm_unstable=false cases, in the latter we don't try to
> > > lock at all...
> >
> > Why do you think so? vma_start_read() is always called but in case
> > mm_unstable=true we double check for the gaps to take care of the case
> > I mentioned in the changelog.
>
> Well the read lock will always succeed if mmap read lock is held right?
> Actually... no :)
>
> I see your point below about vma_start_read_locked() :>)
>
> I see below you suggest splitting into two functions, that seems to be a
> good way forward.

Ack.

>
> I _think_ we won't even need the checks re: mm and last_pos in that case
> right? As holding the mmap lock we should be able to guarantee? Or at least
> the mm check?

Correct. These checks are needed only if we are searching the VMA
under RCU protection before locking it. If we are holding mmap_lock
then all this is not needed.

>
> >
> > >
> > > Nitty (sorry I know this is mildly irritating review) but maybe needs to be
> > > renamed, or split up somehow?
> > >
> > > This is only trylocking in the mm_unstable case...
> >
> > Nope, I think you misunderstood the intention, as I mentioned above.
> >
> > >
> > > > +{
> > > > +     vma = vma_start_read(priv->mm, vma);
> > >
> > > Do we want to do this with mm_unstable == false?
> >
> > Yes, always. mm_unstable=true only indicates that we are already
> > holding mmap_read_lock, so we don't need to double-check for gaps.
> > Perhaps I should add some comments to clarify what purpose this
> > parameter serves...
> >
> > >
> > > I know (from my own documentation :)) taking a VMA read lock while holding
> > > an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal?
> >
> > Ah, right. I should use vma_start_read_locked() instead when we are
> > holding mmap_read_lock. That's why that function was introduced. Will
> > change.
>
> Yeah, I'll pretend this is what I meant to sound smart :P but this is a
> really good point!
>
> >
> > >
> > > > +     if (IS_ERR_OR_NULL(vma))
> > > > +             return NULL;
> > >
> > > Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years
> > > ago from people moaning at me on code review :)
> > >
> > > Sorry I know that's annoying but perhaps its indicative of an issue in the
> > > interface? That's possibly out of scope here however.
> >
> > lock_vma_under_rcu() returns NULL or EAGAIN to signal
> > lock_vma_under_rcu() that it should retry the VMA lookup. In here in
> > either case we retry under mmap_read_lock, that's why EAGAIN is
> > ignored.
>
> Yeah indeed you're right. I guess I'm just echoing previous review traumas
> here :P
>
> >
> > >
> > > Why are we ignoring errors here though? I guess because we don't care if
> > > the VMA got detached from under us, we don't bother retrying like we do in
> > > lock_vma_under_rcu()?
> >
> > No, we take mmap_read_lock and retry in either case. Perhaps I should
> > split trylock_vma() into two separate functions - one for the case
> > when we are holding mmap_read_lock and another one when we don't? I
> > think that would have prevented many of your questions. I'll try that
> > and see how it looks.
>
> Yeah that'd be helpful. I think this should also simplify things?

Yes. Will try that.

>
> >
> > >
> > > Should we just abstract that part of lock_vma_under_rcu() and use it?
> >
> > trylock_vma() is not that similar to lock_vma_under_rcu() for that
> > IMO. Also lock_vma_under_rcu() is in the pagefault path which is very
> > hot, so I would not want to add conditions there to make it work for
> > trylock_vma().
>
> Right sure.
>
> But I'm just wondering why we don't do the retry stuff, e.g.:
>
>                 /* Check if the VMA got isolated after we found it */
>                 if (PTR_ERR(vma) == -EAGAIN) {
>                         count_vm_vma_lock_event(VMA_LOCK_MISS);
>                         /* The area was replaced with another one */
>                         goto retry;
>                 }
>
> I mean do we need to retry under mmap lock in that case? Can we just retry
> the lookup? Or is this not a worthwhile optimisation here?

Hmm. That might be applicable here as well. Let me think some more
about it. Theoretically that might affect our forward progress
guarantee but for us to retry infinitely the VMA we find has to be
knocked out from under us each time we find it. So, quite unlikely to
happen continuously.

>
> >
> > >
> > > > +
> > > > +     /* Check if the vma we locked is the right one. */
> > >
> > > Well it might not be the right one :) but might still belong to the right
> > > mm, so maybe better to refer to the right virtual address space.
> >
> > Ack. Will change to "Check if the vma belongs to the right address space. "
>
> Thanks!
>
> >
> > >
> > > > +     if (unlikely(vma->vm_mm != priv->mm))
> > > > +             goto err;
> > > > +
> > > > +     /* vma should not be ahead of the last search position. */
> > >
> > > You mean behind the last search position? Surely a VMA being _ahead_ of it
> > > is fine?
> >
> > Yes, you are correct. "should not" should have been "should".
>
> Thanks!
>
> >
> > >
> > > > +     if (unlikely(last_pos >= vma->vm_end))
> > >
> > > Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that
> > > problematic? Or is last_pos inclusive?
> >
> > last_pos is inclusive and vma->vm_end is not inclusive, so if last_pos
> > == vma->vm_end that would mean the vma is behind the last_pos. Since
> > we are searching forward from the last_pos, we should not be finding a
> > vma before last_pos unless it mutated.
>
> Ahhh that explains it. Thanks.
>
> >
> > >
> > > > +             goto err;
> > >
> > > Am I correct in thinking thi is what is being checked?
> > >
> > >           last_pos
> > >              |
> > >              v
> > > |---------|
> > > |         |
> > > |---------|
> > >         vm_end
> > >    <--- vma 'next'??? How did we go backwards?
> >
> > Exactly.
> >
> > >
> > > When last_pos gets updated, is it possible for a shrink to race to cause
> > > this somehow?
> >
> > No, we update last_pos only after we locked the vma and confirmed it's
> > the right one.
>
> Ack.
>
> >
> > >
> > > Do we treat this as an entirely unexpected error condition? In which case
> > > is a WARN_ON_ONCE() warranted?
> >
> > No, the VMA might have mutated from under us before we locked it. For
> > example it might have been remapped to a higher address.
> >
> > >
> > > > +
> > > > +     /*
> > > > +      * vma ahead of last search position is possible but we need to
> > > > +      * verify that it was not shrunk after we found it, and another
> > > > +      * vma has not been installed ahead of it. Otherwise we might
> > > > +      * observe a gap that should not be there.
> > > > +      */
> > >
> > > OK so this is the juicy bit.
> >
> > Yep, that's the case singled out in the changelog.
>
> And rightly so!
>
> >
> > >
> > >
> > > > +     if (mm_unstable && last_pos < vma->vm_start) {
> > > > +             /* Verify only if the address space changed since vma lookup. */
> > > > +             if ((priv->mm_wr_seq & 1) ||
> > >
> > > Can we wrap this into a helper? This is a 'you just have to know that odd
> > > seq number means a write operation is in effect'. I know you have a comment
> > > here, but I think something like:
> > >
> > >         if (has_mm_been_modified(priv) ||
> > >
> > > Would be a lot clearer.
> >
> > Yeah, I was thinking about that. I think an even cleaner way would be
> > to remember the return value of mmap_lock_speculate_try_begin() and
> > pass it around. I was hoping to avoid that extra parameter but sounds
> > like for the sake of clarity that would be preferable?
>
> You know, it's me so I might have to mention a helper struct here :P it's
> the two most Lorenzo things - helper sructs and churn...
>
> >
> > >
> > > Again this speaks to the usefulness of abstracting all this logic from the
> > > proc code, we are putting super delicate VMA stuff here and it's just not
> > > the right place.
> > >
> > > As an aside, I don't see coverage in the process_addrs documentation on
> > > sequence number odd/even or speculation?
> > >
> > > I think we probably need to cover this to maintain an up-to-date
> > > description of how the VMA locking mechanism works and is used?
> >
> > I think that's a very low level technical detail which I should not
> > have exposed here. As I mentioned, I should simply store the return
> > value of mmap_lock_speculate_try_begin() instead of doing these tricky
> > mm_wr_seq checks.
>
> Right yeah I'm all for simplifying if we can! Sounds sensible.
>
> >
> > >
> > > > +                 mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
> > >
> > > Nit, again unrelated to this series, but would be useful to add a comment
> > > to mmap_lock_speculate_retry() to indicate that a true return value
> > > indicates a retry is needed, or renaming it.
> >
> > This is how seqcount API works in general. Note that
> > mmap_lock_speculate_retry() is just a wrapper around
> > read_seqcount_retry().
>
> Yeah, I guess I can moan to PeterZ about that :P
>
> It's not a big deal honestly, but it was just something I found confusing.
>
> I think adjusting the comment above to something like:
>
>                 /*
>                  * Verify if the address space changed since vma lookup, or if
>                  * the speculative lock needs to be retried.
>                  */
>
> Or perhaps somethig more in line with the description you give below?

Ack.

>
> >
> > >
> > > Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a
> > > comment.
> >
> > See https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/seqlock.h#L395
>
> Yeah I saw that, but going 2 levels deep to read a comment isn't great.
>
> But again this isn't the end of the world.
>
> >
> > >
> > > Naming is hard :P
> > >
> > > Anyway the totality of this expression is 'something changed' or 'read
> > > section retry required'.
> >
> > Not quite. The expression is "something is changed from under us or
> > something was changing even before we started VMA lookup". Or in more
> > technical terms, mmap_write_lock was acquired while we were locking
> > the VMA or mmap_write_lock was already held even before we started the
> > VMA search.
>
> OK so read section retry required = the seq num changes from under us
> (checked carefully with memory barriers and carefully considered and
> thought out such logic), and the priv->mm_wr_seq check before it is the
> 'was this changed even before we began?'
>
> I wonder btw if we could put both into a single helper function to check
> whether that'd be clearer.

So this will look something like this:

priv->can_speculate = mmap_lock_speculate_try_begin();
...
if (!priv->can_speculate || mmap_lock_speculate_retry()) {
    // fallback
}

Is that descriptive enough?

>
> >
> > >
> > > Under what circumstances would this happen?
> >
> > See my previous comment and I hope that clarifies it.
>
> Thanks!
>
> >
> > >
> > > OK so we're into the 'retry' logic here:
> > >
> > > > +                     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > >
> > > I'd definitely want Liam to confirm this is all above board and correct, as
> > > these operations are pretty sensitive.
> > >
> > > But assuming this is safe, we reset the iterator to the last position...
> > >
> > > > +                     if (vma != vma_next(&priv->iter))
> > >
> > > Then assert the following VMA is the one we seek.
> > >
> > > > +                             goto err;
> > >
> > > Might this ever be the case in the course of ordinary operation? Is this
> > > really an error?
> >
> > This simply means that the VMA we found before is not at the place we
> > found it anymore. The locking fails and we should retry.
>
> I know it's pedantic but feels like 'err' is not a great name for this.
>
> Maybe 'nolock' or something? Or 'lock_failed'?

lock_failed sounds good.


>
> >
> > >
> > > > +             }
> > > > +     }
> > > > +
> > > > +     priv->locked_vma = vma;
> > > > +
> > > > +     return vma;
> > > > +err:
> > >
> > > As queried above, is this really an error path or something we might expect
> > > to happen that could simply result in an expected fallback to mmap lock?
> >
> > It's a failure to lock the VMA, which is handled by retrying under
> > mmap_read_lock. So, trylock_vma() failure does not mean a fault in the
> > logic. It's expected to happen occasionally.
>
> Ack yes understood thanks!
>
> >
> > >
> > > > +     vma_end_read(vma);
> > > > +     return NULL;
> > > > +}
> > > > +
> > > > +
> > > > +static void unlock_vma(struct proc_maps_private *priv)
> > > > +{
> > > > +     if (priv->locked_vma) {
> > > > +             vma_end_read(priv->locked_vma);
> > > > +             priv->locked_vma = NULL;
> > > > +     }
> > > > +}
> > > > +
> > > > +static const struct seq_operations proc_pid_maps_op;
> > > > +
> > > > +static inline bool lock_content(struct seq_file *m,
> > > > +                             struct proc_maps_private *priv)
> > >
> > > Pedantic I know but isn't 'lock_content' a bit generic?
> > >
> > > He says, not being able to think of a great alternative...
> > >
> > > OK maybe fine... :)
> >
> > Yeah, I struggled with this myself. Help in naming is appreciated.
>
> This is where it gets difficult haha so easy to point out but not so easy
> to fix...
>
> lock_vma_range()?

Ack.

>
> >
> > >
> > > > +{
> > > > +     /*
> > > > +      * smaps and numa_maps perform page table walk, therefore require
> > > > +      * mmap_lock but maps can be read with locked vma only.
> > > > +      */
> > > > +     if (m->op != &proc_pid_maps_op) {
> > >
> > > Nit but is there a neater way of checking this? Actually I imagine not...
> > >
> > > But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g.
> > >
> > > static inline bool is_maps_op(struct seq_file *m);
> > >
> > > And check e.g.
> > >
> > > if (is_maps_op(m)) { ... in the above.
> > >
> > > Yeah this is nitty not a massive del :)
> >
> > I'll try that and see how it looks. Thanks!
>
> Thanks!
>
> >
> > >
> > > > +             if (mmap_read_lock_killable(priv->mm))
> > > > +                     return false;
> > > > +
> > > > +             priv->mmap_locked = true;
> > > > +     } else {
> > > > +             rcu_read_lock();
> > > > +             priv->locked_vma = NULL;
> > > > +             priv->mmap_locked = false;
> > > > +     }
> > > > +
> > > > +     return true;
> > > > +}
> > > > +
> > > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > > +{
> > > > +     if (priv->mmap_locked) {
> > > > +             mmap_read_unlock(priv->mm);
> > > > +     } else {
> > > > +             unlock_vma(priv);
> > > > +             rcu_read_unlock();
> > >
> > > Does this always get called even in error cases?
> >
> > What error cases do you have in mind? Error to lock a VMA is handled
> > by retrying and we should be happily proceeding. Please clarify.
>
> Well it was more of a question really - can the traversal through
> /proc/$pid/maps result in some kind of error that doesn't reach this
> function, thereby leaving things locked mistakenly?
>
> If not then happy days :)
>
> I'm guessing there isn't.

There is EINTR in m_start() but unlock_content() won't be called in
that case, so I think we are good.

>
> >
> > >
> > > > +     }
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > > +                                        loff_t last_pos)
> > >
> > > We really need a generalised RCU multi-VMA locking mechanism (we're looking
> > > into madvise VMA locking atm with a conservative single VMA lock at the
> > > moment, but in future we probably want to be able to span multiple for
> > > instance) and this really really feels like it doesn't belong in this proc
> > > code.
> >
> > Ok, I guess you are building a case to move more code into vma.c? I
> > see what you are doing :)
>
> Haha damn it, my evil plans revealed :P
>
> >
> > >
> > > >  {
> > > > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > > > +     struct vm_area_struct *vma;
> > > > +     int ret;
> > > > +
> > > > +     if (priv->mmap_locked)
> > > > +             return vma_next(&priv->iter);
> > > > +
> > > > +     unlock_vma(priv);
> > > > +     /*
> > > > +      * Record sequence number ahead of vma lookup.
> > > > +      * Odd seqcount means address space modification is in progress.
> > > > +      */
> > > > +     mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > >
> > > Hmm we're discarding the return value I guess we don't really care about
> > > that at this stage? Or do we? Do we want to assert the read critical
> > > section state here?
> >
> > Yeah, as I mentioned, instead of relying on priv->mm_wr_seq being odd
> > I should record the return value of mmap_lock_speculate_try_begin().
> > In the functional sense these two are interchangeable.
>
> Ack, thanks!
>
> >
> > >
> > > I guess since we have the mm_rq_seq which we use later it's the same thing
> > > and doesn't matter.
> >
> > Yep.
>
> Ack
>
> >
> > >
> > > ~~(off topic a bit)~~
> > >
> > > OK so off-topic again afaict we're doing something pretty horribly gross here.
> > >
> > > We pass &priv->mm_rw_seq as 'unsigned int *seq' field to
> > > mmap_lock_speculate_try_begin(), which in turn calls:
> > >
> > >         return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq);
> > >
> > > And this is defined as a macro of:
> > >
> > > #define raw_seqcount_try_begin(s, start)                                \
> > > ({                                                                      \
> > >         start = raw_read_seqcount(s);                                   \
> > >         !(start & 1);                                                   \
> > > })
> > >
> > > So surely this expands to:
> > >
> > >         *seq = raw_read_seqcount(&mm->mm_lock_seq);
> > >         !(*seq & 1) // return true if even, false if odd
> > >
> > > So we're basically ostensibly passing an unsigned int, but because we're
> > > calling a macro it's actually just 'text' and we're instead able to then
> > > reassign the underlying unsigned int * ptr and... ugh.
> > >
> > > ~~(/off topic a bit)~~
> >
> > Aaaand we are back...
>
> :)) yeah this isn't your fault, just a related 'wtf' moan :P we can pretend
> like it never happened *ahem*
>
> >
> > >
> > > > +     vma = vma_next(&priv->iter);
> > >
> > >
> > >
> > > > +     if (!vma)
> > > > +             return NULL;
> > > > +
> > > > +     vma = trylock_vma(priv, vma, last_pos, true);
> > > > +     if (vma)
> > > > +             return vma;
> > > > +
> > >
> > > Really feels like this should be a boolean... I guess neat to reset vma if
> > > not locked though.
> >
> > I guess I can change trylock_vma() to return boolean. We always return
> > the same vma or NULL I think.
>
> Ack, I mean I guess you're looking at reworking it in general so can take
> this into account.

Ack.

>
> >
> > >
> > > > +     /* Address space got modified, vma might be stale. Re-lock and retry */
> > >
> > > > +     rcu_read_unlock();
> > >
> > > Might we see a VMA possibly actually legit unmapped in a race here? Do we
> > > need to update last_pos/ppos to account for this? Otherwise we might just
> > > fail on the last_pos >= vma->vm_end check in trylock_vma() no?
> >
> > Yes, it can happen and trylock_vma() will fail to lock the modified
> > VMA. That's by design. In such cases we retry the lookup from the same
> > last_pos.
>
> OK and then we're fine with it because the gap we report will be an actual
> gap.

Yes, either the actual gap or a VMA newly mapped at that address.

>
> >
> > >
> > > > +     ret = mmap_read_lock_killable(priv->mm);
> > >
> > > Shouldn't we set priv->mmap_locked here?
> >
> > No, we will drop the mmap_read_lock shortly. priv->mmap_locked
> > indicates the overall mode we operate in. When priv->mmap_locked=false
> > we can still temporarily take the mmap_read_lock when retrying and
> > then drop it after we found the VMA.
>
> Right yeah, makes sense.
>
> >
> > >
> > > I guess not as we are simply holding the mmap lock to definitely get the
> > > next VMA.
> >
> > Correct.
>
> Ack
>
> >
> > >
> > > > +     rcu_read_lock();
> > > > +     if (ret)
> > > > +             return ERR_PTR(ret);
> > > > +
> > > > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > > > +     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > > > +     vma = vma_next(&priv->iter);
> > > > +     if (vma) {
> > > > +             vma = trylock_vma(priv, vma, last_pos, false);
> > >
> > > Be good to use Liam's convention of /* mm_unstable = */ false to make this
> > > clear.
> >
> > Yeah, I'm thinking of splitting trylock_vma() into two separate
> > functions for mm_unstable=true and mm_unstable=false cases.
>
> Yes :) thanks!
>
> >
> > >
> > > Find it kinda weird again we're 'trylocking' something we already have
> > > locked via the mmap lock but I already mentioend this... :)
> > >
> > > > +             WARN_ON(!vma); /* mm is stable, has to succeed */
> > >
> > > I wonder if this is really useful, at any rate seems like there'd be a
> > > flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really
> > > really ought not happen?
> >
> > Well, I can't use BUG_ON(), so WARN_ON() is the next tool I have :) In
> > reality this should never happen, so
> > WARN_ON/WARN_ON_ONCE/WARN_ON_RATELIMITED/or whatever does not matter
> > much.
>
> I think if you refactor into two separate functions this becomes even more
> unnecessary because then you are using a vma lock function that can never
> fail etc.
>
> I mean maybe just stick a VM_ in front if it's not going to happen but for
> debug/dev/early stabilisation purposes we want to keep an eye on it.

Yeah, I think after refactoring we won't need any warnings here.

>
> >
> > >
> > > > +     }
> > > > +     mmap_read_unlock(priv->mm);
> > > > +
> > > > +     return vma;
> > > > +}
> > > > +
> > > > +#else /* CONFIG_PER_VMA_LOCK */
> > > >
> > > > +static inline bool lock_content(struct seq_file *m,
> > > > +                             struct proc_maps_private *priv)
> > > > +{
> > > > +     return mmap_read_lock_killable(priv->mm) == 0;
> > > > +}
> > > > +
> > > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > > +{
> > > > +     mmap_read_unlock(priv->mm);
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > > +                                        loff_t last_pos)
> > > > +{
> > > > +     return vma_next(&priv->iter);
> > > > +}
> > > > +
> > > > +#endif /* CONFIG_PER_VMA_LOCK */
> > > > +
> > > > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > > > +{
> > > > +     struct proc_maps_private *priv = m->private;
> > > > +     struct vm_area_struct *vma;
> > > > +
> > > > +     vma = get_next_vma(priv, *ppos);
> > > > +     if (IS_ERR(vma))
> > > > +             return vma;
> > > > +
> > > > +     /* Store previous position to be able to restart if needed */
> > > > +     priv->last_pos = *ppos;
> > > >       if (vma) {
> > > > -             *ppos = vma->vm_start;
> > > > +             /*
> > > > +              * Track the end of the reported vma to ensure position changes
> > > > +              * even if previous vma was merged with the next vma and we
> > > > +              * found the extended vma with the same vm_start.
> > > > +              */
> > >
> > > Right, so observing repetitions is acceptable in such circumstances? I mean
> > > I agree.
> >
> > Yep, the VMA will be reported twice in such a case.
>
> Ack.
>
> >
> > >
> > > > +             *ppos = vma->vm_end;
> > >
> > > If we store the end, does the last_pos logic which resets the VMA iterator
> > > later work correctly in all cases?
> >
> > I think so. By resetting to vma->vm_end we will start the next search
> > from the address right next to the last reported VMA, no?
>
> Yeah, I was just wondering whether there were any odd corner case that
> might be problematic.
>
> But since we treat last_pos as inclusive as you said in a response above,
> and of course vma->vm_end is exclusive, then this makes sense.
>
> >
> > >
> > > >       } else {
> > > >               *ppos = -2UL;
> > > >               vma = get_gate_vma(priv->mm);
> > >
> > > Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see
> > > this was the original logic, but maybe put a comment about this as it's
> > > weird and confusing? (and not your fault obviously :P)
> >
> > What comment would you like to see here?
>
> It's so gross this. I guess something about the inner workings of gate VMAs
> and the use of -2UL as a weird sentinel etc.

Ok, I'll try to add a meaningful comment here.

>
> But this is out of scope here.
>
> >
> > >
> > > Also, are all locks and state corectly handled in this case? Seems like one
> > > of this nasty edge case situations that could have jagged edges...
> >
> > I think we are fine. get_next_vma() returned NULL, so we did not lock
> > any VMA and priv->locked_vma should be NULL.
> >
> > >
> > > > @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> > > >               return NULL;
> > > >       }
> > > >
> > > > -     if (mmap_read_lock_killable(mm)) {
> > > > +     if (!lock_content(m, priv)) {
> > >
> > > Nice that this just slots in like this! :)
> > >
> > > >               mmput(mm);
> > > >               put_task_struct(priv->task);
> > > >               priv->task = NULL;
> > > >               return ERR_PTR(-EINTR);
> > > >       }
> > > >
> > > > +     if (last_addr > 0)
> > >
> > > last_addr is an unsigned long, this will always be true.
> >
> > Not unless last_addr==0. That's what I'm really checking here: is this
> > the first invocation of m_start(), in which case we are starting from
> > the beginning and not restarting from priv->last_pos. Should I add a
> > comment?
>
> Yeah sorry I was being an idiot, I misread this as >= 0 obviously.
>
> I had assumed you were checking for the -2 and -1 cases (though -1 early
> exits above).
>
> So in that case, are you handling the gate VMA correctly here? Surely we
> should exclude that? Wouldn't setting ppos = last_addr = priv->last_pos be
> incorrect if this were a gate vma?

You are actually right. last_addr can be -2UL here and we should not
override it. I'll fix it. Thanks!

>
> Even if we then call get_gate_vma() we've changed these values? Or is that
> fine?
>
> And yeah a comment would be good thanks!
>
> >
> > >
> > > You probably want to put an explicit check for -1UL, -2UL here or?
> > >
> > > God I hate this mechanism for indicating gate VMA... yuck yuck (again, this
> > > bit not your fault :P)
> >
> > No, I don't care here about -1UL, -2UL, just that last_addr==0 or not.
>
> OK, so maybe above concerns not a thing.
>
> >
> > >
> > > > +             *ppos = last_addr = priv->last_pos;
> > > >       vma_iter_init(&priv->iter, mm, last_addr);
> > > >       hold_task_mempolicy(priv);
> > > >       if (last_addr == -2UL)
> > > >               return get_gate_vma(mm);
> > > >
> > > > -     return proc_get_vma(priv, ppos);
> > > > +     return proc_get_vma(m, ppos);
> > > >  }
> > > >
> > > >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > > > @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > > >               *ppos = -1UL;
> > > >               return NULL;
> > > >       }
> > > > -     return proc_get_vma(m->private, ppos);
> > > > +     return proc_get_vma(m, ppos);
> > > >  }
> > > >
> > > >  static void m_stop(struct seq_file *m, void *v)
> > > > @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
> > > >               return;
> > > >
> > > >       release_task_mempolicy(priv);
> > > > -     mmap_read_unlock(mm);
> > > > +     unlock_content(priv);
> > > >       mmput(mm);
> > > >       put_task_struct(priv->task);
> > > >       priv->task = NULL;
> > > > --
> > > > 2.49.0.1266.g31b7d2e469-goog
> > > >
> > >
> > > Sorry to add to workload by digging into so many details here, but we
> > > really need to make sure all the i's are dotted and t's are crossed given
> > > how fiddly and fragile this stuff is :)
> > >
> > > Very much appreciate the work, this is a significant improvement and will
> > > have a great deal of real world impact!
> >
> > Thanks for meticulously going over the code! This is really helpful.
> > Suren.
>
> No problem!
>
> >
> > >
> > > Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-11  0:16         ` Suren Baghdasaryan
@ 2025-06-11 10:24           ` Lorenzo Stoakes
  2025-06-11 15:12             ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-06-11 10:24 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Thanks for your patient replies :)

OK to save us both time in such a huuuuge back-and-forth - I agree broadly
with your comments below and I think we are aligned on everything now.

I will try to get you a list of merge scenarios and ideally have a look at
the test code too if I have time this week.

But otherwise hopefully we are good for a respin here?

Cheers, Lorenzo

On Tue, Jun 10, 2025 at 05:16:36PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 10, 2025 at 10:43 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Sat, Jun 07, 2025 at 06:41:35PM -0700, Suren Baghdasaryan wrote:
> > > On Sat, Jun 7, 2025 at 10:43 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > Hi Suren,
> > > >
> > > > Forgive me but I am going to ask a lot of questions here :p just want to
> > > > make sure I'm getting everything right here.
> > >
> > > No worries and thank you for reviewing!
> >
> > No problem!
> >
> > >
> > > >
> > > > On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote:
> > > > > With maple_tree supporting vma tree traversal under RCU and per-vma
> > > > > locks, /proc/pid/maps can be read while holding individual vma locks
> > > > > instead of locking the entire address space.
> > > >
> > > > Nice :)
> > > >
> > > > > Completely lockless approach would be quite complex with the main issue
> > > > > being get_vma_name() using callbacks which might not work correctly with
> > > > > a stable vma copy, requiring original (unstable) vma.
> > > >
> > > > Hmmm can you expand on what a 'completely lockless' design might comprise of?
> > >
> > > In my previous implementation
> > > (https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/)
> > > I was doing this under RCU while checking mmap_lock seq counter to
> > > detect address space changes. That's what I meant by a completely
> > > lockless approach here.
> >
> > Oh did that approach not even use VMA locks _at all_?
>
> Correct, it was done under RCU protection.
>
> >
> > >
> > > >
> > > > It's super un-greppable and I've not got clangd set up with an allmod kernel to
> > > > triple-check but I'm seeing at least 2 (are there more?):
> > > >
> > > > gate_vma_name() which is:
> > > >
> > > >         return "[vsyscall]";
> > > >
> > > > special_mapping_name() which is:
> > > >
> > > >          return ((struct vm_special_mapping *)vma->vm_private_data)->name;
> > > >
> > > > Which I'm guessing is the issue because it's a double pointer deref...
> > >
> > > Correct but in more general terms, depending on implementation of the
> > > vm_ops.name callback, vma->vm_ops->name(vma) might not work correctly
> > > with a vma copy. special_mapping_name() is an example of that.
> >
> > Yeah, this is a horrible situation to be in for such a trivial thing. But I
> > guess unavoidable for now.
> >
> > >
> > > >
> > > > Seems such a silly issue to get stuck on, I wonder if we can't just change
> > > > this to function correctly?
> > >
> > > I was thinking about different ways to overcome that but once I
> > > realized per-vma locks result in even less contention and the
> > > implementation is simpler and more robust, I decided that per-vma
> > > locks direction is better.
> >
> > Ack well in that case :)
> >
> > But still it'd be nice to somehow restrict the impact of this callback.
>
> With VMA locked we are back in a safe place, I think.
>
> >
> > >
> > > >
> > > > > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > > > > lock the vma, release the mmap_lock and continue. This guarantees the
> > > > > reader to make forward progress even during lock contention. This will
> > > >
> > > > Ah that fabled constant forward progress ;)
> > > >
> > > > > interfere with the writer but for a very short time while we are
> > > > > acquiring the per-vma lock and only when there was contention on the
> > > > > vma reader is interested in.
> > > > > One case requiring special handling is when vma changes between the
> > > > > time it was found and the time it got locked. A problematic case would
> > > > > be if vma got shrunk so that it's start moved higher in the address
> > > > > space and a new vma was installed at the beginning:
> > > > >
> > > > > reader found:               |--------VMA A--------|
> > > > > VMA is modified:            |-VMA B-|----VMA A----|
> > > > > reader locks modified VMA A
> > > > > reader reports VMA A:       |  gap  |----VMA A----|
> > > > >
> > > > > This would result in reporting a gap in the address space that does not
> > > > > exist. To prevent this we retry the lookup after locking the vma, however
> > > > > we do that only when we identify a gap and detect that the address space
> > > > > was changed after we found the vma.
> > > >
> > > > OK so in this case we have
> > > >
> > > > 1. Find VMA A - nothing is locked yet, but presumably we are under RCU so
> > > >    are... safe? From unmaps? Or are we? I guess actually the detach
> > > >    mechanism sorts this out for us perhaps?
> > >
> > > Yes, VMAs are RCU-safe and we do detect if it got detached after we
> > > found it but before we locked it.
> >
> > Ack I thought so.
> >
> > >
> > > >
> > > > 2. We got unlucky and did this immediately prior to VMA A having its
> > > >    vma->vm_start, vm_end updated to reflect the split.
> > >
> > > Yes, the split happened after we found it and before we locked it.
> > >
> > > >
> > > > 3. We lock VMA A, now position with an apparent gap after the prior VMA
> > > > which, in practice does not exist.
> > >
> > > Correct.
> >
> > Ack
> >
> > >
> > > >
> > > > So I am guessing that by observing sequence numbers you are able to detect
> > > > that a change has occurred and thus retry the operation in this situation?
> > >
> > > Yes, we detect the gap and we detect that address space has changed,
> > > so to endure we did not miss a split we fall back to mmap_read_lock,
> > > lock the VMA while holding mmap_read_lock, drop mmap_read_lock and
> > > retry.
> > >
> > > >
> > > > I know we previously discussed the possibility of this retry mechanism
> > > > going on forever, I guess I will see the resolution to this in the code :)
> > >
> > > Retry in this case won't go forever because we take mmap_read_lock
> > > during the retry. In the worst case we will be constantly falling back
> > > to mmap_read_lock but that's a very unlikely case (the writer should
> > > be constantly splitting the vma right before the reader locks it).
> >
> > It might be worth adding that to commit message to underline that this has
> > been considered and this is the resolution.
> >
> > Something like:
> >
> >         we guarantee forward progress by always resolving contention via a
> >         fallback to an mmap-read lock.
> >
> >         We shouldn't see a repeated fallback to mmap read locks in
> >         practice, as this require a vanishingly unlikely series of lock
> >         contentions (for instance due to repeated VMA split
> >         operations). However even if this did somehow happen, we would
> >         still progress.
>
> Ack.
>
> >
> > >
> > > >
> > > > > This change is designed to reduce mmap_lock contention and prevent a
> > > > > process reading /proc/pid/maps files (often a low priority task, such
> > > > > as monitoring/data collection services) from blocking address space
> > > > > updates. Note that this change has a userspace visible disadvantage:
> > > > > it allows for sub-page data tearing as opposed to the previous mechanism
> > > > > where data tearing could happen only between pages of generated output
> > > > > data. Since current userspace considers data tearing between pages to be
> > > > > acceptable, we assume is will be able to handle sub-page data tearing
> > > > > as well.
> > > >
> > > > By tearing do you mean for instance seeing a VMA more than once due to
> > > > e.g. a VMA expanding in a racey way?
> > >
> > > Yes.
> > >
> > > >
> > > > Pedantic I know, but it might be worth goiing through all the merge case,
> > > > split and remap scenarios and explaining what might happen in each one (or
> > > > perhaps do that as some form of documentation?)
> > > >
> > > > I can try to put together a list of all of the possibilities if that would
> > > > be helpful.
> > >
> > > Hmm. That might be an interesting exercise. I called out this
> > > particular case because my tests caught it. I spent some time thinking
> > > about other possible scenarios where we would report a gap in a place
> > > where there are no gaps but could not think of anything else.
> >
> > todo++; :)
> >
> > >
> > > >
> > > > >
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > ---
> > > > >  fs/proc/internal.h |   6 ++
> > > > >  fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
> > > > >  2 files changed, 175 insertions(+), 8 deletions(-)
> > > >
> > > > I really hate having all this logic in the proc/task_mmu.c file.
> > > >
> > > > This is really delicate stuff and I'd really like it to live in mm if
> > > > possible.
> > > >
> > > > I reallise this might be a total pain, but I'm quite worried about us
> > > > putting super-delicate, carefully written VMA handling code in different
> > > > places.
> > > >
> > > > Also having stuff in mm/vma.c opens the door to userland testing which,
> > > > when I finally have time to really expand that, would allow for some really
> > > > nice stress testing here.
> > >
> > > That would require some sizable refactoring. I assume code for smaps
> > > reading and PROCMAP_QUERY would have to be moved as well?
> >
> > Yeah, I know, and apologies for that, but I really oppose us having this
> > super delicate VMA logic in an fs/proc file, one we don't maintain for that
> > matter.
> >
> > I know it's a total pain, but this just isn't the right place to be doing
> > such a careful dance.
> >
> > I'm not saying relocate code that belongs here, but find a way to abstract
> > the operations.
>
> Ok, I'll take a stab at refactoring purely mm-related code and will
> see how that looks.
>
> >
> > Perhaps could be a walker or something that does all the state transition
> > stuff that you can then just call from the walker functions here?
> >
> > You could then figure out something similar for the PROCMAP_QUERY logic.
> >
> > We're not doing this VMA locking stuff for smaps are we? As that is walking
> > page tables anyway right? So nothing would change for that.
>
> Yeah, smaps would stay as they are but refactoring might affect its
> code portions as well.
>
> >
> > >
> > > >
> > > > >
> > > > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > > > index 96122e91c645..3728c9012687 100644
> > > > > --- a/fs/proc/internal.h
> > > > > +++ b/fs/proc/internal.h
> > > > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > > > >       struct task_struct *task;
> > > > >       struct mm_struct *mm;
> > > > >       struct vma_iterator iter;
> > > > > +     loff_t last_pos;
> > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > +     bool mmap_locked;
> > > > > +     unsigned int mm_wr_seq;
> > > >
> > > > Is this the _last_ sequence number observed in the mm_struct? or rather,
> > > > previous? Nitty but maybe worth renaming accordingly.
> > >
> > > It's a copy of the mm->mm_wr_seq. I can add a comment if needed.
> >
> > Right, of course. But I think the problem is the 'when' it refers to. It's
> > the sequence number associatied with the mm here sure, but when was it
> > snapshotted? How do we use it?
> >
> > Something like 'last_seen_seqnum' or 'mm_wr_seq_start' or something plus a
> > comment would be helpful.
> >
> > This is nitty I know... but this stuff is very confusing and I think every
> > little bit we do to help explain things is helpful here.
>
> Ok, I'll add a comment that mm_wr_seq is a snapshot of mm->mm_wr_seq
> before we started the VMA lookup.
>
> >
> > >
> > > >
> > > > > +     struct vm_area_struct *locked_vma;
> > > > > +#endif
> > > > >  #ifdef CONFIG_NUMA
> > > > >       struct mempolicy *task_mempolicy;
> > > > >  #endif
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index 27972c0749e7..36d883c4f394 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > > > >  }
> > > > >  #endif
> > > > >
> > > > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > > > > -                                             loff_t *ppos)
> > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > +
> > > > > +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
> > > > > +                                       struct vm_area_struct *vma,
> > > > > +                                       unsigned long last_pos,
> > > > > +                                       bool mm_unstable)
> > > >
> > > > This whole function is a bit weird tbh, you handle both the
> > > > mm_unstable=true and mm_unstable=false cases, in the latter we don't try to
> > > > lock at all...
> > >
> > > Why do you think so? vma_start_read() is always called but in case
> > > mm_unstable=true we double check for the gaps to take care of the case
> > > I mentioned in the changelog.
> >
> > Well the read lock will always succeed if mmap read lock is held right?
> > Actually... no :)
> >
> > I see your point below about vma_start_read_locked() :>)
> >
> > I see below you suggest splitting into two functions, that seems to be a
> > good way forward.
>
> Ack.
>
> >
> > I _think_ we won't even need the checks re: mm and last_pos in that case
> > right? As holding the mmap lock we should be able to guarantee? Or at least
> > the mm check?
>
> Correct. These checks are needed only if we are searching the VMA
> under RCU protection before locking it. If we are holding mmap_lock
> then all this is not needed.
>
> >
> > >
> > > >
> > > > Nitty (sorry I know this is mildly irritating review) but maybe needs to be
> > > > renamed, or split up somehow?
> > > >
> > > > This is only trylocking in the mm_unstable case...
> > >
> > > Nope, I think you misunderstood the intention, as I mentioned above.
> > >
> > > >
> > > > > +{
> > > > > +     vma = vma_start_read(priv->mm, vma);
> > > >
> > > > Do we want to do this with mm_unstable == false?
> > >
> > > Yes, always. mm_unstable=true only indicates that we are already
> > > holding mmap_read_lock, so we don't need to double-check for gaps.
> > > Perhaps I should add some comments to clarify what purpose this
> > > parameter serves...
> > >
> > > >
> > > > I know (from my own documentation :)) taking a VMA read lock while holding
> > > > an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal?
> > >
> > > Ah, right. I should use vma_start_read_locked() instead when we are
> > > holding mmap_read_lock. That's why that function was introduced. Will
> > > change.
> >
> > Yeah, I'll pretend this is what I meant to sound smart :P but this is a
> > really good point!
> >
> > >
> > > >
> > > > > +     if (IS_ERR_OR_NULL(vma))
> > > > > +             return NULL;
> > > >
> > > > Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years
> > > > ago from people moaning at me on code review :)
> > > >
> > > > Sorry I know that's annoying but perhaps its indicative of an issue in the
> > > > interface? That's possibly out of scope here however.
> > >
> > > lock_vma_under_rcu() returns NULL or EAGAIN to signal
> > > lock_vma_under_rcu() that it should retry the VMA lookup. In here in
> > > either case we retry under mmap_read_lock, that's why EAGAIN is
> > > ignored.
> >
> > Yeah indeed you're right. I guess I'm just echoing previous review traumas
> > here :P
> >
> > >
> > > >
> > > > Why are we ignoring errors here though? I guess because we don't care if
> > > > the VMA got detached from under us, we don't bother retrying like we do in
> > > > lock_vma_under_rcu()?
> > >
> > > No, we take mmap_read_lock and retry in either case. Perhaps I should
> > > split trylock_vma() into two separate functions - one for the case
> > > when we are holding mmap_read_lock and another one when we don't? I
> > > think that would have prevented many of your questions. I'll try that
> > > and see how it looks.
> >
> > Yeah that'd be helpful. I think this should also simplify things?
>
> Yes. Will try that.
>
> >
> > >
> > > >
> > > > Should we just abstract that part of lock_vma_under_rcu() and use it?
> > >
> > > trylock_vma() is not that similar to lock_vma_under_rcu() for that
> > > IMO. Also lock_vma_under_rcu() is in the pagefault path which is very
> > > hot, so I would not want to add conditions there to make it work for
> > > trylock_vma().
> >
> > Right sure.
> >
> > But I'm just wondering why we don't do the retry stuff, e.g.:
> >
> >                 /* Check if the VMA got isolated after we found it */
> >                 if (PTR_ERR(vma) == -EAGAIN) {
> >                         count_vm_vma_lock_event(VMA_LOCK_MISS);
> >                         /* The area was replaced with another one */
> >                         goto retry;
> >                 }
> >
> > I mean do we need to retry under mmap lock in that case? Can we just retry
> > the lookup? Or is this not a worthwhile optimisation here?
>
> Hmm. That might be applicable here as well. Let me think some more
> about it. Theoretically that might affect our forward progress
> guarantee but for us to retry infinitely the VMA we find has to be
> knocked out from under us each time we find it. So, quite unlikely to
> happen continuously.
>
> >
> > >
> > > >
> > > > > +
> > > > > +     /* Check if the vma we locked is the right one. */
> > > >
> > > > Well it might not be the right one :) but might still belong to the right
> > > > mm, so maybe better to refer to the right virtual address space.
> > >
> > > Ack. Will change to "Check if the vma belongs to the right address space. "
> >
> > Thanks!
> >
> > >
> > > >
> > > > > +     if (unlikely(vma->vm_mm != priv->mm))
> > > > > +             goto err;
> > > > > +
> > > > > +     /* vma should not be ahead of the last search position. */
> > > >
> > > > You mean behind the last search position? Surely a VMA being _ahead_ of it
> > > > is fine?
> > >
> > > Yes, you are correct. "should not" should have been "should".
> >
> > Thanks!
> >
> > >
> > > >
> > > > > +     if (unlikely(last_pos >= vma->vm_end))
> > > >
> > > > Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that
> > > > problematic? Or is last_pos inclusive?
> > >
> > > last_pos is inclusive and vma->vm_end is not inclusive, so if last_pos
> > > == vma->vm_end that would mean the vma is behind the last_pos. Since
> > > we are searching forward from the last_pos, we should not be finding a
> > > vma before last_pos unless it mutated.
> >
> > Ahhh that explains it. Thanks.
> >
> > >
> > > >
> > > > > +             goto err;
> > > >
> > > > Am I correct in thinking thi is what is being checked?
> > > >
> > > >           last_pos
> > > >              |
> > > >              v
> > > > |---------|
> > > > |         |
> > > > |---------|
> > > >         vm_end
> > > >    <--- vma 'next'??? How did we go backwards?
> > >
> > > Exactly.
> > >
> > > >
> > > > When last_pos gets updated, is it possible for a shrink to race to cause
> > > > this somehow?
> > >
> > > No, we update last_pos only after we locked the vma and confirmed it's
> > > the right one.
> >
> > Ack.
> >
> > >
> > > >
> > > > Do we treat this as an entirely unexpected error condition? In which case
> > > > is a WARN_ON_ONCE() warranted?
> > >
> > > No, the VMA might have mutated from under us before we locked it. For
> > > example it might have been remapped to a higher address.
> > >
> > > >
> > > > > +
> > > > > +     /*
> > > > > +      * vma ahead of last search position is possible but we need to
> > > > > +      * verify that it was not shrunk after we found it, and another
> > > > > +      * vma has not been installed ahead of it. Otherwise we might
> > > > > +      * observe a gap that should not be there.
> > > > > +      */
> > > >
> > > > OK so this is the juicy bit.
> > >
> > > Yep, that's the case singled out in the changelog.
> >
> > And rightly so!
> >
> > >
> > > >
> > > >
> > > > > +     if (mm_unstable && last_pos < vma->vm_start) {
> > > > > +             /* Verify only if the address space changed since vma lookup. */
> > > > > +             if ((priv->mm_wr_seq & 1) ||
> > > >
> > > > Can we wrap this into a helper? This is a 'you just have to know that odd
> > > > seq number means a write operation is in effect'. I know you have a comment
> > > > here, but I think something like:
> > > >
> > > >         if (has_mm_been_modified(priv) ||
> > > >
> > > > Would be a lot clearer.
> > >
> > > Yeah, I was thinking about that. I think an even cleaner way would be
> > > to remember the return value of mmap_lock_speculate_try_begin() and
> > > pass it around. I was hoping to avoid that extra parameter but sounds
> > > like for the sake of clarity that would be preferable?
> >
> > You know, it's me so I might have to mention a helper struct here :P it's
> > the two most Lorenzo things - helper sructs and churn...
> >
> > >
> > > >
> > > > Again this speaks to the usefulness of abstracting all this logic from the
> > > > proc code, we are putting super delicate VMA stuff here and it's just not
> > > > the right place.
> > > >
> > > > As an aside, I don't see coverage in the process_addrs documentation on
> > > > sequence number odd/even or speculation?
> > > >
> > > > I think we probably need to cover this to maintain an up-to-date
> > > > description of how the VMA locking mechanism works and is used?
> > >
> > > I think that's a very low level technical detail which I should not
> > > have exposed here. As I mentioned, I should simply store the return
> > > value of mmap_lock_speculate_try_begin() instead of doing these tricky
> > > mm_wr_seq checks.
> >
> > Right yeah I'm all for simplifying if we can! Sounds sensible.
> >
> > >
> > > >
> > > > > +                 mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
> > > >
> > > > Nit, again unrelated to this series, but would be useful to add a comment
> > > > to mmap_lock_speculate_retry() to indicate that a true return value
> > > > indicates a retry is needed, or renaming it.
> > >
> > > This is how seqcount API works in general. Note that
> > > mmap_lock_speculate_retry() is just a wrapper around
> > > read_seqcount_retry().
> >
> > Yeah, I guess I can moan to PeterZ about that :P
> >
> > It's not a big deal honestly, but it was just something I found confusing.
> >
> > I think adjusting the comment above to something like:
> >
> >                 /*
> >                  * Verify if the address space changed since vma lookup, or if
> >                  * the speculative lock needs to be retried.
> >                  */
> >
> > Or perhaps somethig more in line with the description you give below?
>
> Ack.
>
> >
> > >
> > > >
> > > > Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a
> > > > comment.
> > >
> > > See https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/seqlock.h#L395
> >
> > Yeah I saw that, but going 2 levels deep to read a comment isn't great.
> >
> > But again this isn't the end of the world.
> >
> > >
> > > >
> > > > Naming is hard :P
> > > >
> > > > Anyway the totality of this expression is 'something changed' or 'read
> > > > section retry required'.
> > >
> > > Not quite. The expression is "something is changed from under us or
> > > something was changing even before we started VMA lookup". Or in more
> > > technical terms, mmap_write_lock was acquired while we were locking
> > > the VMA or mmap_write_lock was already held even before we started the
> > > VMA search.
> >
> > OK so read section retry required = the seq num changes from under us
> > (checked carefully with memory barriers and carefully considered and
> > thought out such logic), and the priv->mm_wr_seq check before it is the
> > 'was this changed even before we began?'
> >
> > I wonder btw if we could put both into a single helper function to check
> > whether that'd be clearer.
>
> So this will look something like this:
>
> priv->can_speculate = mmap_lock_speculate_try_begin();
> ...
> if (!priv->can_speculate || mmap_lock_speculate_retry()) {
>     // fallback
> }
>
> Is that descriptive enough?
>
> >
> > >
> > > >
> > > > Under what circumstances would this happen?
> > >
> > > See my previous comment and I hope that clarifies it.
> >
> > Thanks!
> >
> > >
> > > >
> > > > OK so we're into the 'retry' logic here:
> > > >
> > > > > +                     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > > >
> > > > I'd definitely want Liam to confirm this is all above board and correct, as
> > > > these operations are pretty sensitive.
> > > >
> > > > But assuming this is safe, we reset the iterator to the last position...
> > > >
> > > > > +                     if (vma != vma_next(&priv->iter))
> > > >
> > > > Then assert the following VMA is the one we seek.
> > > >
> > > > > +                             goto err;
> > > >
> > > > Might this ever be the case in the course of ordinary operation? Is this
> > > > really an error?
> > >
> > > This simply means that the VMA we found before is not at the place we
> > > found it anymore. The locking fails and we should retry.
> >
> > I know it's pedantic but feels like 'err' is not a great name for this.
> >
> > Maybe 'nolock' or something? Or 'lock_failed'?
>
> lock_failed sounds good.
>
>
> >
> > >
> > > >
> > > > > +             }
> > > > > +     }
> > > > > +
> > > > > +     priv->locked_vma = vma;
> > > > > +
> > > > > +     return vma;
> > > > > +err:
> > > >
> > > > As queried above, is this really an error path or something we might expect
> > > > to happen that could simply result in an expected fallback to mmap lock?
> > >
> > > It's a failure to lock the VMA, which is handled by retrying under
> > > mmap_read_lock. So, trylock_vma() failure does not mean a fault in the
> > > logic. It's expected to happen occasionally.
> >
> > Ack yes understood thanks!
> >
> > >
> > > >
> > > > > +     vma_end_read(vma);
> > > > > +     return NULL;
> > > > > +}
> > > > > +
> > > > > +
> > > > > +static void unlock_vma(struct proc_maps_private *priv)
> > > > > +{
> > > > > +     if (priv->locked_vma) {
> > > > > +             vma_end_read(priv->locked_vma);
> > > > > +             priv->locked_vma = NULL;
> > > > > +     }
> > > > > +}
> > > > > +
> > > > > +static const struct seq_operations proc_pid_maps_op;
> > > > > +
> > > > > +static inline bool lock_content(struct seq_file *m,
> > > > > +                             struct proc_maps_private *priv)
> > > >
> > > > Pedantic I know but isn't 'lock_content' a bit generic?
> > > >
> > > > He says, not being able to think of a great alternative...
> > > >
> > > > OK maybe fine... :)
> > >
> > > Yeah, I struggled with this myself. Help in naming is appreciated.
> >
> > This is where it gets difficult haha so easy to point out but not so easy
> > to fix...
> >
> > lock_vma_range()?
>
> Ack.
>
> >
> > >
> > > >
> > > > > +{
> > > > > +     /*
> > > > > +      * smaps and numa_maps perform page table walk, therefore require
> > > > > +      * mmap_lock but maps can be read with locked vma only.
> > > > > +      */
> > > > > +     if (m->op != &proc_pid_maps_op) {
> > > >
> > > > Nit but is there a neater way of checking this? Actually I imagine not...
> > > >
> > > > But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g.
> > > >
> > > > static inline bool is_maps_op(struct seq_file *m);
> > > >
> > > > And check e.g.
> > > >
> > > > if (is_maps_op(m)) { ... in the above.
> > > >
> > > > Yeah this is nitty not a massive del :)
> > >
> > > I'll try that and see how it looks. Thanks!
> >
> > Thanks!
> >
> > >
> > > >
> > > > > +             if (mmap_read_lock_killable(priv->mm))
> > > > > +                     return false;
> > > > > +
> > > > > +             priv->mmap_locked = true;
> > > > > +     } else {
> > > > > +             rcu_read_lock();
> > > > > +             priv->locked_vma = NULL;
> > > > > +             priv->mmap_locked = false;
> > > > > +     }
> > > > > +
> > > > > +     return true;
> > > > > +}
> > > > > +
> > > > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > > > +{
> > > > > +     if (priv->mmap_locked) {
> > > > > +             mmap_read_unlock(priv->mm);
> > > > > +     } else {
> > > > > +             unlock_vma(priv);
> > > > > +             rcu_read_unlock();
> > > >
> > > > Does this always get called even in error cases?
> > >
> > > What error cases do you have in mind? Error to lock a VMA is handled
> > > by retrying and we should be happily proceeding. Please clarify.
> >
> > Well it was more of a question really - can the traversal through
> > /proc/$pid/maps result in some kind of error that doesn't reach this
> > function, thereby leaving things locked mistakenly?
> >
> > If not then happy days :)
> >
> > I'm guessing there isn't.
>
> There is EINTR in m_start() but unlock_content() won't be called in
> that case, so I think we are good.
>
> >
> > >
> > > >
> > > > > +     }
> > > > > +}
> > > > > +
> > > > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > > > +                                        loff_t last_pos)
> > > >
> > > > We really need a generalised RCU multi-VMA locking mechanism (we're looking
> > > > into madvise VMA locking atm with a conservative single VMA lock at the
> > > > moment, but in future we probably want to be able to span multiple for
> > > > instance) and this really really feels like it doesn't belong in this proc
> > > > code.
> > >
> > > Ok, I guess you are building a case to move more code into vma.c? I
> > > see what you are doing :)
> >
> > Haha damn it, my evil plans revealed :P
> >
> > >
> > > >
> > > > >  {
> > > > > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > > > > +     struct vm_area_struct *vma;
> > > > > +     int ret;
> > > > > +
> > > > > +     if (priv->mmap_locked)
> > > > > +             return vma_next(&priv->iter);
> > > > > +
> > > > > +     unlock_vma(priv);
> > > > > +     /*
> > > > > +      * Record sequence number ahead of vma lookup.
> > > > > +      * Odd seqcount means address space modification is in progress.
> > > > > +      */
> > > > > +     mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > > >
> > > > Hmm we're discarding the return value I guess we don't really care about
> > > > that at this stage? Or do we? Do we want to assert the read critical
> > > > section state here?
> > >
> > > Yeah, as I mentioned, instead of relying on priv->mm_wr_seq being odd
> > > I should record the return value of mmap_lock_speculate_try_begin().
> > > In the functional sense these two are interchangeable.
> >
> > Ack, thanks!
> >
> > >
> > > >
> > > > I guess since we have the mm_rq_seq which we use later it's the same thing
> > > > and doesn't matter.
> > >
> > > Yep.
> >
> > Ack
> >
> > >
> > > >
> > > > ~~(off topic a bit)~~
> > > >
> > > > OK so off-topic again afaict we're doing something pretty horribly gross here.
> > > >
> > > > We pass &priv->mm_rw_seq as 'unsigned int *seq' field to
> > > > mmap_lock_speculate_try_begin(), which in turn calls:
> > > >
> > > >         return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq);
> > > >
> > > > And this is defined as a macro of:
> > > >
> > > > #define raw_seqcount_try_begin(s, start)                                \
> > > > ({                                                                      \
> > > >         start = raw_read_seqcount(s);                                   \
> > > >         !(start & 1);                                                   \
> > > > })
> > > >
> > > > So surely this expands to:
> > > >
> > > >         *seq = raw_read_seqcount(&mm->mm_lock_seq);
> > > >         !(*seq & 1) // return true if even, false if odd
> > > >
> > > > So we're basically ostensibly passing an unsigned int, but because we're
> > > > calling a macro it's actually just 'text' and we're instead able to then
> > > > reassign the underlying unsigned int * ptr and... ugh.
> > > >
> > > > ~~(/off topic a bit)~~
> > >
> > > Aaaand we are back...
> >
> > :)) yeah this isn't your fault, just a related 'wtf' moan :P we can pretend
> > like it never happened *ahem*
> >
> > >
> > > >
> > > > > +     vma = vma_next(&priv->iter);
> > > >
> > > >
> > > >
> > > > > +     if (!vma)
> > > > > +             return NULL;
> > > > > +
> > > > > +     vma = trylock_vma(priv, vma, last_pos, true);
> > > > > +     if (vma)
> > > > > +             return vma;
> > > > > +
> > > >
> > > > Really feels like this should be a boolean... I guess neat to reset vma if
> > > > not locked though.
> > >
> > > I guess I can change trylock_vma() to return boolean. We always return
> > > the same vma or NULL I think.
> >
> > Ack, I mean I guess you're looking at reworking it in general so can take
> > this into account.
>
> Ack.
>
> >
> > >
> > > >
> > > > > +     /* Address space got modified, vma might be stale. Re-lock and retry */
> > > >
> > > > > +     rcu_read_unlock();
> > > >
> > > > Might we see a VMA possibly actually legit unmapped in a race here? Do we
> > > > need to update last_pos/ppos to account for this? Otherwise we might just
> > > > fail on the last_pos >= vma->vm_end check in trylock_vma() no?
> > >
> > > Yes, it can happen and trylock_vma() will fail to lock the modified
> > > VMA. That's by design. In such cases we retry the lookup from the same
> > > last_pos.
> >
> > OK and then we're fine with it because the gap we report will be an actual
> > gap.
>
> Yes, either the actual gap or a VMA newly mapped at that address.
>
> >
> > >
> > > >
> > > > > +     ret = mmap_read_lock_killable(priv->mm);
> > > >
> > > > Shouldn't we set priv->mmap_locked here?
> > >
> > > No, we will drop the mmap_read_lock shortly. priv->mmap_locked
> > > indicates the overall mode we operate in. When priv->mmap_locked=false
> > > we can still temporarily take the mmap_read_lock when retrying and
> > > then drop it after we found the VMA.
> >
> > Right yeah, makes sense.
> >
> > >
> > > >
> > > > I guess not as we are simply holding the mmap lock to definitely get the
> > > > next VMA.
> > >
> > > Correct.
> >
> > Ack
> >
> > >
> > > >
> > > > > +     rcu_read_lock();
> > > > > +     if (ret)
> > > > > +             return ERR_PTR(ret);
> > > > > +
> > > > > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > > > > +     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > > > > +     vma = vma_next(&priv->iter);
> > > > > +     if (vma) {
> > > > > +             vma = trylock_vma(priv, vma, last_pos, false);
> > > >
> > > > Be good to use Liam's convention of /* mm_unstable = */ false to make this
> > > > clear.
> > >
> > > Yeah, I'm thinking of splitting trylock_vma() into two separate
> > > functions for mm_unstable=true and mm_unstable=false cases.
> >
> > Yes :) thanks!
> >
> > >
> > > >
> > > > Find it kinda weird again we're 'trylocking' something we already have
> > > > locked via the mmap lock but I already mentioend this... :)
> > > >
> > > > > +             WARN_ON(!vma); /* mm is stable, has to succeed */
> > > >
> > > > I wonder if this is really useful, at any rate seems like there'd be a
> > > > flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really
> > > > really ought not happen?
> > >
> > > Well, I can't use BUG_ON(), so WARN_ON() is the next tool I have :) In
> > > reality this should never happen, so
> > > WARN_ON/WARN_ON_ONCE/WARN_ON_RATELIMITED/or whatever does not matter
> > > much.
> >
> > I think if you refactor into two separate functions this becomes even more
> > unnecessary because then you are using a vma lock function that can never
> > fail etc.
> >
> > I mean maybe just stick a VM_ in front if it's not going to happen but for
> > debug/dev/early stabilisation purposes we want to keep an eye on it.
>
> Yeah, I think after refactoring we won't need any warnings here.
>
> >
> > >
> > > >
> > > > > +     }
> > > > > +     mmap_read_unlock(priv->mm);
> > > > > +
> > > > > +     return vma;
> > > > > +}
> > > > > +
> > > > > +#else /* CONFIG_PER_VMA_LOCK */
> > > > >
> > > > > +static inline bool lock_content(struct seq_file *m,
> > > > > +                             struct proc_maps_private *priv)
> > > > > +{
> > > > > +     return mmap_read_lock_killable(priv->mm) == 0;
> > > > > +}
> > > > > +
> > > > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > > > +{
> > > > > +     mmap_read_unlock(priv->mm);
> > > > > +}
> > > > > +
> > > > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > > > +                                        loff_t last_pos)
> > > > > +{
> > > > > +     return vma_next(&priv->iter);
> > > > > +}
> > > > > +
> > > > > +#endif /* CONFIG_PER_VMA_LOCK */
> > > > > +
> > > > > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > > > > +{
> > > > > +     struct proc_maps_private *priv = m->private;
> > > > > +     struct vm_area_struct *vma;
> > > > > +
> > > > > +     vma = get_next_vma(priv, *ppos);
> > > > > +     if (IS_ERR(vma))
> > > > > +             return vma;
> > > > > +
> > > > > +     /* Store previous position to be able to restart if needed */
> > > > > +     priv->last_pos = *ppos;
> > > > >       if (vma) {
> > > > > -             *ppos = vma->vm_start;
> > > > > +             /*
> > > > > +              * Track the end of the reported vma to ensure position changes
> > > > > +              * even if previous vma was merged with the next vma and we
> > > > > +              * found the extended vma with the same vm_start.
> > > > > +              */
> > > >
> > > > Right, so observing repetitions is acceptable in such circumstances? I mean
> > > > I agree.
> > >
> > > Yep, the VMA will be reported twice in such a case.
> >
> > Ack.
> >
> > >
> > > >
> > > > > +             *ppos = vma->vm_end;
> > > >
> > > > If we store the end, does the last_pos logic which resets the VMA iterator
> > > > later work correctly in all cases?
> > >
> > > I think so. By resetting to vma->vm_end we will start the next search
> > > from the address right next to the last reported VMA, no?
> >
> > Yeah, I was just wondering whether there were any odd corner case that
> > might be problematic.
> >
> > But since we treat last_pos as inclusive as you said in a response above,
> > and of course vma->vm_end is exclusive, then this makes sense.
> >
> > >
> > > >
> > > > >       } else {
> > > > >               *ppos = -2UL;
> > > > >               vma = get_gate_vma(priv->mm);
> > > >
> > > > Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see
> > > > this was the original logic, but maybe put a comment about this as it's
> > > > weird and confusing? (and not your fault obviously :P)
> > >
> > > What comment would you like to see here?
> >
> > It's so gross this. I guess something about the inner workings of gate VMAs
> > and the use of -2UL as a weird sentinel etc.
>
> Ok, I'll try to add a meaningful comment here.
>
> >
> > But this is out of scope here.
> >
> > >
> > > >
> > > > Also, are all locks and state corectly handled in this case? Seems like one
> > > > of this nasty edge case situations that could have jagged edges...
> > >
> > > I think we are fine. get_next_vma() returned NULL, so we did not lock
> > > any VMA and priv->locked_vma should be NULL.
> > >
> > > >
> > > > > @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> > > > >               return NULL;
> > > > >       }
> > > > >
> > > > > -     if (mmap_read_lock_killable(mm)) {
> > > > > +     if (!lock_content(m, priv)) {
> > > >
> > > > Nice that this just slots in like this! :)
> > > >
> > > > >               mmput(mm);
> > > > >               put_task_struct(priv->task);
> > > > >               priv->task = NULL;
> > > > >               return ERR_PTR(-EINTR);
> > > > >       }
> > > > >
> > > > > +     if (last_addr > 0)
> > > >
> > > > last_addr is an unsigned long, this will always be true.
> > >
> > > Not unless last_addr==0. That's what I'm really checking here: is this
> > > the first invocation of m_start(), in which case we are starting from
> > > the beginning and not restarting from priv->last_pos. Should I add a
> > > comment?
> >
> > Yeah sorry I was being an idiot, I misread this as >= 0 obviously.
> >
> > I had assumed you were checking for the -2 and -1 cases (though -1 early
> > exits above).
> >
> > So in that case, are you handling the gate VMA correctly here? Surely we
> > should exclude that? Wouldn't setting ppos = last_addr = priv->last_pos be
> > incorrect if this were a gate vma?
>
> You are actually right. last_addr can be -2UL here and we should not
> override it. I'll fix it. Thanks!
>
> >
> > Even if we then call get_gate_vma() we've changed these values? Or is that
> > fine?
> >
> > And yeah a comment would be good thanks!
> >
> > >
> > > >
> > > > You probably want to put an explicit check for -1UL, -2UL here or?
> > > >
> > > > God I hate this mechanism for indicating gate VMA... yuck yuck (again, this
> > > > bit not your fault :P)
> > >
> > > No, I don't care here about -1UL, -2UL, just that last_addr==0 or not.
> >
> > OK, so maybe above concerns not a thing.
> >
> > >
> > > >
> > > > > +             *ppos = last_addr = priv->last_pos;
> > > > >       vma_iter_init(&priv->iter, mm, last_addr);
> > > > >       hold_task_mempolicy(priv);
> > > > >       if (last_addr == -2UL)
> > > > >               return get_gate_vma(mm);
> > > > >
> > > > > -     return proc_get_vma(priv, ppos);
> > > > > +     return proc_get_vma(m, ppos);
> > > > >  }
> > > > >
> > > > >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > > > > @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > > > >               *ppos = -1UL;
> > > > >               return NULL;
> > > > >       }
> > > > > -     return proc_get_vma(m->private, ppos);
> > > > > +     return proc_get_vma(m, ppos);
> > > > >  }
> > > > >
> > > > >  static void m_stop(struct seq_file *m, void *v)
> > > > > @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
> > > > >               return;
> > > > >
> > > > >       release_task_mempolicy(priv);
> > > > > -     mmap_read_unlock(mm);
> > > > > +     unlock_content(priv);
> > > > >       mmput(mm);
> > > > >       put_task_struct(priv->task);
> > > > >       priv->task = NULL;
> > > > > --
> > > > > 2.49.0.1266.g31b7d2e469-goog
> > > > >
> > > >
> > > > Sorry to add to workload by digging into so many details here, but we
> > > > really need to make sure all the i's are dotted and t's are crossed given
> > > > how fiddly and fragile this stuff is :)
> > > >
> > > > Very much appreciate the work, this is a significant improvement and will
> > > > have a great deal of real world impact!
> > >
> > > Thanks for meticulously going over the code! This is really helpful.
> > > Suren.
> >
> > No problem!
> >
> > >
> > > >
> > > > Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock
  2025-06-11 10:24           ` Lorenzo Stoakes
@ 2025-06-11 15:12             ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-11 15:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Wed, Jun 11, 2025 at 3:25 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Thanks for your patient replies :)
>
> OK to save us both time in such a huuuuge back-and-forth - I agree broadly
> with your comments below and I think we are aligned on everything now.
>
> I will try to get you a list of merge scenarios and ideally have a look at
> the test code too if I have time this week.
>
> But otherwise hopefully we are good for a respin here?

Ack. Working on it.

>
> Cheers, Lorenzo
>
> On Tue, Jun 10, 2025 at 05:16:36PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Jun 10, 2025 at 10:43 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Sat, Jun 07, 2025 at 06:41:35PM -0700, Suren Baghdasaryan wrote:
> > > > On Sat, Jun 7, 2025 at 10:43 AM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > Hi Suren,
> > > > >
> > > > > Forgive me but I am going to ask a lot of questions here :p just want to
> > > > > make sure I'm getting everything right here.
> > > >
> > > > No worries and thank you for reviewing!
> > >
> > > No problem!
> > >
> > > >
> > > > >
> > > > > On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote:
> > > > > > With maple_tree supporting vma tree traversal under RCU and per-vma
> > > > > > locks, /proc/pid/maps can be read while holding individual vma locks
> > > > > > instead of locking the entire address space.
> > > > >
> > > > > Nice :)
> > > > >
> > > > > > Completely lockless approach would be quite complex with the main issue
> > > > > > being get_vma_name() using callbacks which might not work correctly with
> > > > > > a stable vma copy, requiring original (unstable) vma.
> > > > >
> > > > > Hmmm can you expand on what a 'completely lockless' design might comprise of?
> > > >
> > > > In my previous implementation
> > > > (https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/)
> > > > I was doing this under RCU while checking mmap_lock seq counter to
> > > > detect address space changes. That's what I meant by a completely
> > > > lockless approach here.
> > >
> > > Oh did that approach not even use VMA locks _at all_?
> >
> > Correct, it was done under RCU protection.
> >
> > >
> > > >
> > > > >
> > > > > It's super un-greppable and I've not got clangd set up with an allmod kernel to
> > > > > triple-check but I'm seeing at least 2 (are there more?):
> > > > >
> > > > > gate_vma_name() which is:
> > > > >
> > > > >         return "[vsyscall]";
> > > > >
> > > > > special_mapping_name() which is:
> > > > >
> > > > >          return ((struct vm_special_mapping *)vma->vm_private_data)->name;
> > > > >
> > > > > Which I'm guessing is the issue because it's a double pointer deref...
> > > >
> > > > Correct but in more general terms, depending on implementation of the
> > > > vm_ops.name callback, vma->vm_ops->name(vma) might not work correctly
> > > > with a vma copy. special_mapping_name() is an example of that.
> > >
> > > Yeah, this is a horrible situation to be in for such a trivial thing. But I
> > > guess unavoidable for now.
> > >
> > > >
> > > > >
> > > > > Seems such a silly issue to get stuck on, I wonder if we can't just change
> > > > > this to function correctly?
> > > >
> > > > I was thinking about different ways to overcome that but once I
> > > > realized per-vma locks result in even less contention and the
> > > > implementation is simpler and more robust, I decided that per-vma
> > > > locks direction is better.
> > >
> > > Ack well in that case :)
> > >
> > > But still it'd be nice to somehow restrict the impact of this callback.
> >
> > With VMA locked we are back in a safe place, I think.
> >
> > >
> > > >
> > > > >
> > > > > > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > > > > > lock the vma, release the mmap_lock and continue. This guarantees the
> > > > > > reader to make forward progress even during lock contention. This will
> > > > >
> > > > > Ah that fabled constant forward progress ;)
> > > > >
> > > > > > interfere with the writer but for a very short time while we are
> > > > > > acquiring the per-vma lock and only when there was contention on the
> > > > > > vma reader is interested in.
> > > > > > One case requiring special handling is when vma changes between the
> > > > > > time it was found and the time it got locked. A problematic case would
> > > > > > be if vma got shrunk so that it's start moved higher in the address
> > > > > > space and a new vma was installed at the beginning:
> > > > > >
> > > > > > reader found:               |--------VMA A--------|
> > > > > > VMA is modified:            |-VMA B-|----VMA A----|
> > > > > > reader locks modified VMA A
> > > > > > reader reports VMA A:       |  gap  |----VMA A----|
> > > > > >
> > > > > > This would result in reporting a gap in the address space that does not
> > > > > > exist. To prevent this we retry the lookup after locking the vma, however
> > > > > > we do that only when we identify a gap and detect that the address space
> > > > > > was changed after we found the vma.
> > > > >
> > > > > OK so in this case we have
> > > > >
> > > > > 1. Find VMA A - nothing is locked yet, but presumably we are under RCU so
> > > > >    are... safe? From unmaps? Or are we? I guess actually the detach
> > > > >    mechanism sorts this out for us perhaps?
> > > >
> > > > Yes, VMAs are RCU-safe and we do detect if it got detached after we
> > > > found it but before we locked it.
> > >
> > > Ack I thought so.
> > >
> > > >
> > > > >
> > > > > 2. We got unlucky and did this immediately prior to VMA A having its
> > > > >    vma->vm_start, vm_end updated to reflect the split.
> > > >
> > > > Yes, the split happened after we found it and before we locked it.
> > > >
> > > > >
> > > > > 3. We lock VMA A, now position with an apparent gap after the prior VMA
> > > > > which, in practice does not exist.
> > > >
> > > > Correct.
> > >
> > > Ack
> > >
> > > >
> > > > >
> > > > > So I am guessing that by observing sequence numbers you are able to detect
> > > > > that a change has occurred and thus retry the operation in this situation?
> > > >
> > > > Yes, we detect the gap and we detect that address space has changed,
> > > > so to endure we did not miss a split we fall back to mmap_read_lock,
> > > > lock the VMA while holding mmap_read_lock, drop mmap_read_lock and
> > > > retry.
> > > >
> > > > >
> > > > > I know we previously discussed the possibility of this retry mechanism
> > > > > going on forever, I guess I will see the resolution to this in the code :)
> > > >
> > > > Retry in this case won't go forever because we take mmap_read_lock
> > > > during the retry. In the worst case we will be constantly falling back
> > > > to mmap_read_lock but that's a very unlikely case (the writer should
> > > > be constantly splitting the vma right before the reader locks it).
> > >
> > > It might be worth adding that to commit message to underline that this has
> > > been considered and this is the resolution.
> > >
> > > Something like:
> > >
> > >         we guarantee forward progress by always resolving contention via a
> > >         fallback to an mmap-read lock.
> > >
> > >         We shouldn't see a repeated fallback to mmap read locks in
> > >         practice, as this require a vanishingly unlikely series of lock
> > >         contentions (for instance due to repeated VMA split
> > >         operations). However even if this did somehow happen, we would
> > >         still progress.
> >
> > Ack.
> >
> > >
> > > >
> > > > >
> > > > > > This change is designed to reduce mmap_lock contention and prevent a
> > > > > > process reading /proc/pid/maps files (often a low priority task, such
> > > > > > as monitoring/data collection services) from blocking address space
> > > > > > updates. Note that this change has a userspace visible disadvantage:
> > > > > > it allows for sub-page data tearing as opposed to the previous mechanism
> > > > > > where data tearing could happen only between pages of generated output
> > > > > > data. Since current userspace considers data tearing between pages to be
> > > > > > acceptable, we assume is will be able to handle sub-page data tearing
> > > > > > as well.
> > > > >
> > > > > By tearing do you mean for instance seeing a VMA more than once due to
> > > > > e.g. a VMA expanding in a racey way?
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > Pedantic I know, but it might be worth goiing through all the merge case,
> > > > > split and remap scenarios and explaining what might happen in each one (or
> > > > > perhaps do that as some form of documentation?)
> > > > >
> > > > > I can try to put together a list of all of the possibilities if that would
> > > > > be helpful.
> > > >
> > > > Hmm. That might be an interesting exercise. I called out this
> > > > particular case because my tests caught it. I spent some time thinking
> > > > about other possible scenarios where we would report a gap in a place
> > > > where there are no gaps but could not think of anything else.
> > >
> > > todo++; :)
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > > ---
> > > > > >  fs/proc/internal.h |   6 ++
> > > > > >  fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
> > > > > >  2 files changed, 175 insertions(+), 8 deletions(-)
> > > > >
> > > > > I really hate having all this logic in the proc/task_mmu.c file.
> > > > >
> > > > > This is really delicate stuff and I'd really like it to live in mm if
> > > > > possible.
> > > > >
> > > > > I reallise this might be a total pain, but I'm quite worried about us
> > > > > putting super-delicate, carefully written VMA handling code in different
> > > > > places.
> > > > >
> > > > > Also having stuff in mm/vma.c opens the door to userland testing which,
> > > > > when I finally have time to really expand that, would allow for some really
> > > > > nice stress testing here.
> > > >
> > > > That would require some sizable refactoring. I assume code for smaps
> > > > reading and PROCMAP_QUERY would have to be moved as well?
> > >
> > > Yeah, I know, and apologies for that, but I really oppose us having this
> > > super delicate VMA logic in an fs/proc file, one we don't maintain for that
> > > matter.
> > >
> > > I know it's a total pain, but this just isn't the right place to be doing
> > > such a careful dance.
> > >
> > > I'm not saying relocate code that belongs here, but find a way to abstract
> > > the operations.
> >
> > Ok, I'll take a stab at refactoring purely mm-related code and will
> > see how that looks.
> >
> > >
> > > Perhaps could be a walker or something that does all the state transition
> > > stuff that you can then just call from the walker functions here?
> > >
> > > You could then figure out something similar for the PROCMAP_QUERY logic.
> > >
> > > We're not doing this VMA locking stuff for smaps are we? As that is walking
> > > page tables anyway right? So nothing would change for that.
> >
> > Yeah, smaps would stay as they are but refactoring might affect its
> > code portions as well.
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > > > > index 96122e91c645..3728c9012687 100644
> > > > > > --- a/fs/proc/internal.h
> > > > > > +++ b/fs/proc/internal.h
> > > > > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > > > > >       struct task_struct *task;
> > > > > >       struct mm_struct *mm;
> > > > > >       struct vma_iterator iter;
> > > > > > +     loff_t last_pos;
> > > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > > +     bool mmap_locked;
> > > > > > +     unsigned int mm_wr_seq;
> > > > >
> > > > > Is this the _last_ sequence number observed in the mm_struct? or rather,
> > > > > previous? Nitty but maybe worth renaming accordingly.
> > > >
> > > > It's a copy of the mm->mm_wr_seq. I can add a comment if needed.
> > >
> > > Right, of course. But I think the problem is the 'when' it refers to. It's
> > > the sequence number associatied with the mm here sure, but when was it
> > > snapshotted? How do we use it?
> > >
> > > Something like 'last_seen_seqnum' or 'mm_wr_seq_start' or something plus a
> > > comment would be helpful.
> > >
> > > This is nitty I know... but this stuff is very confusing and I think every
> > > little bit we do to help explain things is helpful here.
> >
> > Ok, I'll add a comment that mm_wr_seq is a snapshot of mm->mm_wr_seq
> > before we started the VMA lookup.
> >
> > >
> > > >
> > > > >
> > > > > > +     struct vm_area_struct *locked_vma;
> > > > > > +#endif
> > > > > >  #ifdef CONFIG_NUMA
> > > > > >       struct mempolicy *task_mempolicy;
> > > > > >  #endif
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index 27972c0749e7..36d883c4f394 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > > @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> > > > > >  }
> > > > > >  #endif
> > > > > >
> > > > > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > > > > > -                                             loff_t *ppos)
> > > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > > +
> > > > > > +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv,
> > > > > > +                                       struct vm_area_struct *vma,
> > > > > > +                                       unsigned long last_pos,
> > > > > > +                                       bool mm_unstable)
> > > > >
> > > > > This whole function is a bit weird tbh, you handle both the
> > > > > mm_unstable=true and mm_unstable=false cases, in the latter we don't try to
> > > > > lock at all...
> > > >
> > > > Why do you think so? vma_start_read() is always called but in case
> > > > mm_unstable=true we double check for the gaps to take care of the case
> > > > I mentioned in the changelog.
> > >
> > > Well the read lock will always succeed if mmap read lock is held right?
> > > Actually... no :)
> > >
> > > I see your point below about vma_start_read_locked() :>)
> > >
> > > I see below you suggest splitting into two functions, that seems to be a
> > > good way forward.
> >
> > Ack.
> >
> > >
> > > I _think_ we won't even need the checks re: mm and last_pos in that case
> > > right? As holding the mmap lock we should be able to guarantee? Or at least
> > > the mm check?
> >
> > Correct. These checks are needed only if we are searching the VMA
> > under RCU protection before locking it. If we are holding mmap_lock
> > then all this is not needed.
> >
> > >
> > > >
> > > > >
> > > > > Nitty (sorry I know this is mildly irritating review) but maybe needs to be
> > > > > renamed, or split up somehow?
> > > > >
> > > > > This is only trylocking in the mm_unstable case...
> > > >
> > > > Nope, I think you misunderstood the intention, as I mentioned above.
> > > >
> > > > >
> > > > > > +{
> > > > > > +     vma = vma_start_read(priv->mm, vma);
> > > > >
> > > > > Do we want to do this with mm_unstable == false?
> > > >
> > > > Yes, always. mm_unstable=true only indicates that we are already
> > > > holding mmap_read_lock, so we don't need to double-check for gaps.
> > > > Perhaps I should add some comments to clarify what purpose this
> > > > parameter serves...
> > > >
> > > > >
> > > > > I know (from my own documentation :)) taking a VMA read lock while holding
> > > > > an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal?
> > > >
> > > > Ah, right. I should use vma_start_read_locked() instead when we are
> > > > holding mmap_read_lock. That's why that function was introduced. Will
> > > > change.
> > >
> > > Yeah, I'll pretend this is what I meant to sound smart :P but this is a
> > > really good point!
> > >
> > > >
> > > > >
> > > > > > +     if (IS_ERR_OR_NULL(vma))
> > > > > > +             return NULL;
> > > > >
> > > > > Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years
> > > > > ago from people moaning at me on code review :)
> > > > >
> > > > > Sorry I know that's annoying but perhaps its indicative of an issue in the
> > > > > interface? That's possibly out of scope here however.
> > > >
> > > > lock_vma_under_rcu() returns NULL or EAGAIN to signal
> > > > lock_vma_under_rcu() that it should retry the VMA lookup. In here in
> > > > either case we retry under mmap_read_lock, that's why EAGAIN is
> > > > ignored.
> > >
> > > Yeah indeed you're right. I guess I'm just echoing previous review traumas
> > > here :P
> > >
> > > >
> > > > >
> > > > > Why are we ignoring errors here though? I guess because we don't care if
> > > > > the VMA got detached from under us, we don't bother retrying like we do in
> > > > > lock_vma_under_rcu()?
> > > >
> > > > No, we take mmap_read_lock and retry in either case. Perhaps I should
> > > > split trylock_vma() into two separate functions - one for the case
> > > > when we are holding mmap_read_lock and another one when we don't? I
> > > > think that would have prevented many of your questions. I'll try that
> > > > and see how it looks.
> > >
> > > Yeah that'd be helpful. I think this should also simplify things?
> >
> > Yes. Will try that.
> >
> > >
> > > >
> > > > >
> > > > > Should we just abstract that part of lock_vma_under_rcu() and use it?
> > > >
> > > > trylock_vma() is not that similar to lock_vma_under_rcu() for that
> > > > IMO. Also lock_vma_under_rcu() is in the pagefault path which is very
> > > > hot, so I would not want to add conditions there to make it work for
> > > > trylock_vma().
> > >
> > > Right sure.
> > >
> > > But I'm just wondering why we don't do the retry stuff, e.g.:
> > >
> > >                 /* Check if the VMA got isolated after we found it */
> > >                 if (PTR_ERR(vma) == -EAGAIN) {
> > >                         count_vm_vma_lock_event(VMA_LOCK_MISS);
> > >                         /* The area was replaced with another one */
> > >                         goto retry;
> > >                 }
> > >
> > > I mean do we need to retry under mmap lock in that case? Can we just retry
> > > the lookup? Or is this not a worthwhile optimisation here?
> >
> > Hmm. That might be applicable here as well. Let me think some more
> > about it. Theoretically that might affect our forward progress
> > guarantee but for us to retry infinitely the VMA we find has to be
> > knocked out from under us each time we find it. So, quite unlikely to
> > happen continuously.
> >
> > >
> > > >
> > > > >
> > > > > > +
> > > > > > +     /* Check if the vma we locked is the right one. */
> > > > >
> > > > > Well it might not be the right one :) but might still belong to the right
> > > > > mm, so maybe better to refer to the right virtual address space.
> > > >
> > > > Ack. Will change to "Check if the vma belongs to the right address space. "
> > >
> > > Thanks!
> > >
> > > >
> > > > >
> > > > > > +     if (unlikely(vma->vm_mm != priv->mm))
> > > > > > +             goto err;
> > > > > > +
> > > > > > +     /* vma should not be ahead of the last search position. */
> > > > >
> > > > > You mean behind the last search position? Surely a VMA being _ahead_ of it
> > > > > is fine?
> > > >
> > > > Yes, you are correct. "should not" should have been "should".
> > >
> > > Thanks!
> > >
> > > >
> > > > >
> > > > > > +     if (unlikely(last_pos >= vma->vm_end))
> > > > >
> > > > > Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that
> > > > > problematic? Or is last_pos inclusive?
> > > >
> > > > last_pos is inclusive and vma->vm_end is not inclusive, so if last_pos
> > > > == vma->vm_end that would mean the vma is behind the last_pos. Since
> > > > we are searching forward from the last_pos, we should not be finding a
> > > > vma before last_pos unless it mutated.
> > >
> > > Ahhh that explains it. Thanks.
> > >
> > > >
> > > > >
> > > > > > +             goto err;
> > > > >
> > > > > Am I correct in thinking thi is what is being checked?
> > > > >
> > > > >           last_pos
> > > > >              |
> > > > >              v
> > > > > |---------|
> > > > > |         |
> > > > > |---------|
> > > > >         vm_end
> > > > >    <--- vma 'next'??? How did we go backwards?
> > > >
> > > > Exactly.
> > > >
> > > > >
> > > > > When last_pos gets updated, is it possible for a shrink to race to cause
> > > > > this somehow?
> > > >
> > > > No, we update last_pos only after we locked the vma and confirmed it's
> > > > the right one.
> > >
> > > Ack.
> > >
> > > >
> > > > >
> > > > > Do we treat this as an entirely unexpected error condition? In which case
> > > > > is a WARN_ON_ONCE() warranted?
> > > >
> > > > No, the VMA might have mutated from under us before we locked it. For
> > > > example it might have been remapped to a higher address.
> > > >
> > > > >
> > > > > > +
> > > > > > +     /*
> > > > > > +      * vma ahead of last search position is possible but we need to
> > > > > > +      * verify that it was not shrunk after we found it, and another
> > > > > > +      * vma has not been installed ahead of it. Otherwise we might
> > > > > > +      * observe a gap that should not be there.
> > > > > > +      */
> > > > >
> > > > > OK so this is the juicy bit.
> > > >
> > > > Yep, that's the case singled out in the changelog.
> > >
> > > And rightly so!
> > >
> > > >
> > > > >
> > > > >
> > > > > > +     if (mm_unstable && last_pos < vma->vm_start) {
> > > > > > +             /* Verify only if the address space changed since vma lookup. */
> > > > > > +             if ((priv->mm_wr_seq & 1) ||
> > > > >
> > > > > Can we wrap this into a helper? This is a 'you just have to know that odd
> > > > > seq number means a write operation is in effect'. I know you have a comment
> > > > > here, but I think something like:
> > > > >
> > > > >         if (has_mm_been_modified(priv) ||
> > > > >
> > > > > Would be a lot clearer.
> > > >
> > > > Yeah, I was thinking about that. I think an even cleaner way would be
> > > > to remember the return value of mmap_lock_speculate_try_begin() and
> > > > pass it around. I was hoping to avoid that extra parameter but sounds
> > > > like for the sake of clarity that would be preferable?
> > >
> > > You know, it's me so I might have to mention a helper struct here :P it's
> > > the two most Lorenzo things - helper sructs and churn...
> > >
> > > >
> > > > >
> > > > > Again this speaks to the usefulness of abstracting all this logic from the
> > > > > proc code, we are putting super delicate VMA stuff here and it's just not
> > > > > the right place.
> > > > >
> > > > > As an aside, I don't see coverage in the process_addrs documentation on
> > > > > sequence number odd/even or speculation?
> > > > >
> > > > > I think we probably need to cover this to maintain an up-to-date
> > > > > description of how the VMA locking mechanism works and is used?
> > > >
> > > > I think that's a very low level technical detail which I should not
> > > > have exposed here. As I mentioned, I should simply store the return
> > > > value of mmap_lock_speculate_try_begin() instead of doing these tricky
> > > > mm_wr_seq checks.
> > >
> > > Right yeah I'm all for simplifying if we can! Sounds sensible.
> > >
> > > >
> > > > >
> > > > > > +                 mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) {
> > > > >
> > > > > Nit, again unrelated to this series, but would be useful to add a comment
> > > > > to mmap_lock_speculate_retry() to indicate that a true return value
> > > > > indicates a retry is needed, or renaming it.
> > > >
> > > > This is how seqcount API works in general. Note that
> > > > mmap_lock_speculate_retry() is just a wrapper around
> > > > read_seqcount_retry().
> > >
> > > Yeah, I guess I can moan to PeterZ about that :P
> > >
> > > It's not a big deal honestly, but it was just something I found confusing.
> > >
> > > I think adjusting the comment above to something like:
> > >
> > >                 /*
> > >                  * Verify if the address space changed since vma lookup, or if
> > >                  * the speculative lock needs to be retried.
> > >                  */
> > >
> > > Or perhaps somethig more in line with the description you give below?
> >
> > Ack.
> >
> > >
> > > >
> > > > >
> > > > > Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a
> > > > > comment.
> > > >
> > > > See https://elixir.bootlin.com/linux/v6.15.1/source/include/linux/seqlock.h#L395
> > >
> > > Yeah I saw that, but going 2 levels deep to read a comment isn't great.
> > >
> > > But again this isn't the end of the world.
> > >
> > > >
> > > > >
> > > > > Naming is hard :P
> > > > >
> > > > > Anyway the totality of this expression is 'something changed' or 'read
> > > > > section retry required'.
> > > >
> > > > Not quite. The expression is "something is changed from under us or
> > > > something was changing even before we started VMA lookup". Or in more
> > > > technical terms, mmap_write_lock was acquired while we were locking
> > > > the VMA or mmap_write_lock was already held even before we started the
> > > > VMA search.
> > >
> > > OK so read section retry required = the seq num changes from under us
> > > (checked carefully with memory barriers and carefully considered and
> > > thought out such logic), and the priv->mm_wr_seq check before it is the
> > > 'was this changed even before we began?'
> > >
> > > I wonder btw if we could put both into a single helper function to check
> > > whether that'd be clearer.
> >
> > So this will look something like this:
> >
> > priv->can_speculate = mmap_lock_speculate_try_begin();
> > ...
> > if (!priv->can_speculate || mmap_lock_speculate_retry()) {
> >     // fallback
> > }
> >
> > Is that descriptive enough?
> >
> > >
> > > >
> > > > >
> > > > > Under what circumstances would this happen?
> > > >
> > > > See my previous comment and I hope that clarifies it.
> > >
> > > Thanks!
> > >
> > > >
> > > > >
> > > > > OK so we're into the 'retry' logic here:
> > > > >
> > > > > > +                     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > > > >
> > > > > I'd definitely want Liam to confirm this is all above board and correct, as
> > > > > these operations are pretty sensitive.
> > > > >
> > > > > But assuming this is safe, we reset the iterator to the last position...
> > > > >
> > > > > > +                     if (vma != vma_next(&priv->iter))
> > > > >
> > > > > Then assert the following VMA is the one we seek.
> > > > >
> > > > > > +                             goto err;
> > > > >
> > > > > Might this ever be the case in the course of ordinary operation? Is this
> > > > > really an error?
> > > >
> > > > This simply means that the VMA we found before is not at the place we
> > > > found it anymore. The locking fails and we should retry.
> > >
> > > I know it's pedantic but feels like 'err' is not a great name for this.
> > >
> > > Maybe 'nolock' or something? Or 'lock_failed'?
> >
> > lock_failed sounds good.
> >
> >
> > >
> > > >
> > > > >
> > > > > > +             }
> > > > > > +     }
> > > > > > +
> > > > > > +     priv->locked_vma = vma;
> > > > > > +
> > > > > > +     return vma;
> > > > > > +err:
> > > > >
> > > > > As queried above, is this really an error path or something we might expect
> > > > > to happen that could simply result in an expected fallback to mmap lock?
> > > >
> > > > It's a failure to lock the VMA, which is handled by retrying under
> > > > mmap_read_lock. So, trylock_vma() failure does not mean a fault in the
> > > > logic. It's expected to happen occasionally.
> > >
> > > Ack yes understood thanks!
> > >
> > > >
> > > > >
> > > > > > +     vma_end_read(vma);
> > > > > > +     return NULL;
> > > > > > +}
> > > > > > +
> > > > > > +
> > > > > > +static void unlock_vma(struct proc_maps_private *priv)
> > > > > > +{
> > > > > > +     if (priv->locked_vma) {
> > > > > > +             vma_end_read(priv->locked_vma);
> > > > > > +             priv->locked_vma = NULL;
> > > > > > +     }
> > > > > > +}
> > > > > > +
> > > > > > +static const struct seq_operations proc_pid_maps_op;
> > > > > > +
> > > > > > +static inline bool lock_content(struct seq_file *m,
> > > > > > +                             struct proc_maps_private *priv)
> > > > >
> > > > > Pedantic I know but isn't 'lock_content' a bit generic?
> > > > >
> > > > > He says, not being able to think of a great alternative...
> > > > >
> > > > > OK maybe fine... :)
> > > >
> > > > Yeah, I struggled with this myself. Help in naming is appreciated.
> > >
> > > This is where it gets difficult haha so easy to point out but not so easy
> > > to fix...
> > >
> > > lock_vma_range()?
> >
> > Ack.
> >
> > >
> > > >
> > > > >
> > > > > > +{
> > > > > > +     /*
> > > > > > +      * smaps and numa_maps perform page table walk, therefore require
> > > > > > +      * mmap_lock but maps can be read with locked vma only.
> > > > > > +      */
> > > > > > +     if (m->op != &proc_pid_maps_op) {
> > > > >
> > > > > Nit but is there a neater way of checking this? Actually I imagine not...
> > > > >
> > > > > But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g.
> > > > >
> > > > > static inline bool is_maps_op(struct seq_file *m);
> > > > >
> > > > > And check e.g.
> > > > >
> > > > > if (is_maps_op(m)) { ... in the above.
> > > > >
> > > > > Yeah this is nitty not a massive del :)
> > > >
> > > > I'll try that and see how it looks. Thanks!
> > >
> > > Thanks!
> > >
> > > >
> > > > >
> > > > > > +             if (mmap_read_lock_killable(priv->mm))
> > > > > > +                     return false;
> > > > > > +
> > > > > > +             priv->mmap_locked = true;
> > > > > > +     } else {
> > > > > > +             rcu_read_lock();
> > > > > > +             priv->locked_vma = NULL;
> > > > > > +             priv->mmap_locked = false;
> > > > > > +     }
> > > > > > +
> > > > > > +     return true;
> > > > > > +}
> > > > > > +
> > > > > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > > > > +{
> > > > > > +     if (priv->mmap_locked) {
> > > > > > +             mmap_read_unlock(priv->mm);
> > > > > > +     } else {
> > > > > > +             unlock_vma(priv);
> > > > > > +             rcu_read_unlock();
> > > > >
> > > > > Does this always get called even in error cases?
> > > >
> > > > What error cases do you have in mind? Error to lock a VMA is handled
> > > > by retrying and we should be happily proceeding. Please clarify.
> > >
> > > Well it was more of a question really - can the traversal through
> > > /proc/$pid/maps result in some kind of error that doesn't reach this
> > > function, thereby leaving things locked mistakenly?
> > >
> > > If not then happy days :)
> > >
> > > I'm guessing there isn't.
> >
> > There is EINTR in m_start() but unlock_content() won't be called in
> > that case, so I think we are good.
> >
> > >
> > > >
> > > > >
> > > > > > +     }
> > > > > > +}
> > > > > > +
> > > > > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > > > > +                                        loff_t last_pos)
> > > > >
> > > > > We really need a generalised RCU multi-VMA locking mechanism (we're looking
> > > > > into madvise VMA locking atm with a conservative single VMA lock at the
> > > > > moment, but in future we probably want to be able to span multiple for
> > > > > instance) and this really really feels like it doesn't belong in this proc
> > > > > code.
> > > >
> > > > Ok, I guess you are building a case to move more code into vma.c? I
> > > > see what you are doing :)
> > >
> > > Haha damn it, my evil plans revealed :P
> > >
> > > >
> > > > >
> > > > > >  {
> > > > > > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > > > > > +     struct vm_area_struct *vma;
> > > > > > +     int ret;
> > > > > > +
> > > > > > +     if (priv->mmap_locked)
> > > > > > +             return vma_next(&priv->iter);
> > > > > > +
> > > > > > +     unlock_vma(priv);
> > > > > > +     /*
> > > > > > +      * Record sequence number ahead of vma lookup.
> > > > > > +      * Odd seqcount means address space modification is in progress.
> > > > > > +      */
> > > > > > +     mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > > > >
> > > > > Hmm we're discarding the return value I guess we don't really care about
> > > > > that at this stage? Or do we? Do we want to assert the read critical
> > > > > section state here?
> > > >
> > > > Yeah, as I mentioned, instead of relying on priv->mm_wr_seq being odd
> > > > I should record the return value of mmap_lock_speculate_try_begin().
> > > > In the functional sense these two are interchangeable.
> > >
> > > Ack, thanks!
> > >
> > > >
> > > > >
> > > > > I guess since we have the mm_rq_seq which we use later it's the same thing
> > > > > and doesn't matter.
> > > >
> > > > Yep.
> > >
> > > Ack
> > >
> > > >
> > > > >
> > > > > ~~(off topic a bit)~~
> > > > >
> > > > > OK so off-topic again afaict we're doing something pretty horribly gross here.
> > > > >
> > > > > We pass &priv->mm_rw_seq as 'unsigned int *seq' field to
> > > > > mmap_lock_speculate_try_begin(), which in turn calls:
> > > > >
> > > > >         return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq);
> > > > >
> > > > > And this is defined as a macro of:
> > > > >
> > > > > #define raw_seqcount_try_begin(s, start)                                \
> > > > > ({                                                                      \
> > > > >         start = raw_read_seqcount(s);                                   \
> > > > >         !(start & 1);                                                   \
> > > > > })
> > > > >
> > > > > So surely this expands to:
> > > > >
> > > > >         *seq = raw_read_seqcount(&mm->mm_lock_seq);
> > > > >         !(*seq & 1) // return true if even, false if odd
> > > > >
> > > > > So we're basically ostensibly passing an unsigned int, but because we're
> > > > > calling a macro it's actually just 'text' and we're instead able to then
> > > > > reassign the underlying unsigned int * ptr and... ugh.
> > > > >
> > > > > ~~(/off topic a bit)~~
> > > >
> > > > Aaaand we are back...
> > >
> > > :)) yeah this isn't your fault, just a related 'wtf' moan :P we can pretend
> > > like it never happened *ahem*
> > >
> > > >
> > > > >
> > > > > > +     vma = vma_next(&priv->iter);
> > > > >
> > > > >
> > > > >
> > > > > > +     if (!vma)
> > > > > > +             return NULL;
> > > > > > +
> > > > > > +     vma = trylock_vma(priv, vma, last_pos, true);
> > > > > > +     if (vma)
> > > > > > +             return vma;
> > > > > > +
> > > > >
> > > > > Really feels like this should be a boolean... I guess neat to reset vma if
> > > > > not locked though.
> > > >
> > > > I guess I can change trylock_vma() to return boolean. We always return
> > > > the same vma or NULL I think.
> > >
> > > Ack, I mean I guess you're looking at reworking it in general so can take
> > > this into account.
> >
> > Ack.
> >
> > >
> > > >
> > > > >
> > > > > > +     /* Address space got modified, vma might be stale. Re-lock and retry */
> > > > >
> > > > > > +     rcu_read_unlock();
> > > > >
> > > > > Might we see a VMA possibly actually legit unmapped in a race here? Do we
> > > > > need to update last_pos/ppos to account for this? Otherwise we might just
> > > > > fail on the last_pos >= vma->vm_end check in trylock_vma() no?
> > > >
> > > > Yes, it can happen and trylock_vma() will fail to lock the modified
> > > > VMA. That's by design. In such cases we retry the lookup from the same
> > > > last_pos.
> > >
> > > OK and then we're fine with it because the gap we report will be an actual
> > > gap.
> >
> > Yes, either the actual gap or a VMA newly mapped at that address.
> >
> > >
> > > >
> > > > >
> > > > > > +     ret = mmap_read_lock_killable(priv->mm);
> > > > >
> > > > > Shouldn't we set priv->mmap_locked here?
> > > >
> > > > No, we will drop the mmap_read_lock shortly. priv->mmap_locked
> > > > indicates the overall mode we operate in. When priv->mmap_locked=false
> > > > we can still temporarily take the mmap_read_lock when retrying and
> > > > then drop it after we found the VMA.
> > >
> > > Right yeah, makes sense.
> > >
> > > >
> > > > >
> > > > > I guess not as we are simply holding the mmap lock to definitely get the
> > > > > next VMA.
> > > >
> > > > Correct.
> > >
> > > Ack
> > >
> > > >
> > > > >
> > > > > > +     rcu_read_lock();
> > > > > > +     if (ret)
> > > > > > +             return ERR_PTR(ret);
> > > > > > +
> > > > > > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > > > > > +     vma_iter_init(&priv->iter, priv->mm, last_pos);
> > > > > > +     vma = vma_next(&priv->iter);
> > > > > > +     if (vma) {
> > > > > > +             vma = trylock_vma(priv, vma, last_pos, false);
> > > > >
> > > > > Be good to use Liam's convention of /* mm_unstable = */ false to make this
> > > > > clear.
> > > >
> > > > Yeah, I'm thinking of splitting trylock_vma() into two separate
> > > > functions for mm_unstable=true and mm_unstable=false cases.
> > >
> > > Yes :) thanks!
> > >
> > > >
> > > > >
> > > > > Find it kinda weird again we're 'trylocking' something we already have
> > > > > locked via the mmap lock but I already mentioend this... :)
> > > > >
> > > > > > +             WARN_ON(!vma); /* mm is stable, has to succeed */
> > > > >
> > > > > I wonder if this is really useful, at any rate seems like there'd be a
> > > > > flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really
> > > > > really ought not happen?
> > > >
> > > > Well, I can't use BUG_ON(), so WARN_ON() is the next tool I have :) In
> > > > reality this should never happen, so
> > > > WARN_ON/WARN_ON_ONCE/WARN_ON_RATELIMITED/or whatever does not matter
> > > > much.
> > >
> > > I think if you refactor into two separate functions this becomes even more
> > > unnecessary because then you are using a vma lock function that can never
> > > fail etc.
> > >
> > > I mean maybe just stick a VM_ in front if it's not going to happen but for
> > > debug/dev/early stabilisation purposes we want to keep an eye on it.
> >
> > Yeah, I think after refactoring we won't need any warnings here.
> >
> > >
> > > >
> > > > >
> > > > > > +     }
> > > > > > +     mmap_read_unlock(priv->mm);
> > > > > > +
> > > > > > +     return vma;
> > > > > > +}
> > > > > > +
> > > > > > +#else /* CONFIG_PER_VMA_LOCK */
> > > > > >
> > > > > > +static inline bool lock_content(struct seq_file *m,
> > > > > > +                             struct proc_maps_private *priv)
> > > > > > +{
> > > > > > +     return mmap_read_lock_killable(priv->mm) == 0;
> > > > > > +}
> > > > > > +
> > > > > > +static inline void unlock_content(struct proc_maps_private *priv)
> > > > > > +{
> > > > > > +     mmap_read_unlock(priv->mm);
> > > > > > +}
> > > > > > +
> > > > > > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > > > > > +                                        loff_t last_pos)
> > > > > > +{
> > > > > > +     return vma_next(&priv->iter);
> > > > > > +}
> > > > > > +
> > > > > > +#endif /* CONFIG_PER_VMA_LOCK */
> > > > > > +
> > > > > > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > > > > > +{
> > > > > > +     struct proc_maps_private *priv = m->private;
> > > > > > +     struct vm_area_struct *vma;
> > > > > > +
> > > > > > +     vma = get_next_vma(priv, *ppos);
> > > > > > +     if (IS_ERR(vma))
> > > > > > +             return vma;
> > > > > > +
> > > > > > +     /* Store previous position to be able to restart if needed */
> > > > > > +     priv->last_pos = *ppos;
> > > > > >       if (vma) {
> > > > > > -             *ppos = vma->vm_start;
> > > > > > +             /*
> > > > > > +              * Track the end of the reported vma to ensure position changes
> > > > > > +              * even if previous vma was merged with the next vma and we
> > > > > > +              * found the extended vma with the same vm_start.
> > > > > > +              */
> > > > >
> > > > > Right, so observing repetitions is acceptable in such circumstances? I mean
> > > > > I agree.
> > > >
> > > > Yep, the VMA will be reported twice in such a case.
> > >
> > > Ack.
> > >
> > > >
> > > > >
> > > > > > +             *ppos = vma->vm_end;
> > > > >
> > > > > If we store the end, does the last_pos logic which resets the VMA iterator
> > > > > later work correctly in all cases?
> > > >
> > > > I think so. By resetting to vma->vm_end we will start the next search
> > > > from the address right next to the last reported VMA, no?
> > >
> > > Yeah, I was just wondering whether there were any odd corner case that
> > > might be problematic.
> > >
> > > But since we treat last_pos as inclusive as you said in a response above,
> > > and of course vma->vm_end is exclusive, then this makes sense.
> > >
> > > >
> > > > >
> > > > > >       } else {
> > > > > >               *ppos = -2UL;
> > > > > >               vma = get_gate_vma(priv->mm);
> > > > >
> > > > > Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see
> > > > > this was the original logic, but maybe put a comment about this as it's
> > > > > weird and confusing? (and not your fault obviously :P)
> > > >
> > > > What comment would you like to see here?
> > >
> > > It's so gross this. I guess something about the inner workings of gate VMAs
> > > and the use of -2UL as a weird sentinel etc.
> >
> > Ok, I'll try to add a meaningful comment here.
> >
> > >
> > > But this is out of scope here.
> > >
> > > >
> > > > >
> > > > > Also, are all locks and state corectly handled in this case? Seems like one
> > > > > of this nasty edge case situations that could have jagged edges...
> > > >
> > > > I think we are fine. get_next_vma() returned NULL, so we did not lock
> > > > any VMA and priv->locked_vma should be NULL.
> > > >
> > > > >
> > > > > > @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> > > > > >               return NULL;
> > > > > >       }
> > > > > >
> > > > > > -     if (mmap_read_lock_killable(mm)) {
> > > > > > +     if (!lock_content(m, priv)) {
> > > > >
> > > > > Nice that this just slots in like this! :)
> > > > >
> > > > > >               mmput(mm);
> > > > > >               put_task_struct(priv->task);
> > > > > >               priv->task = NULL;
> > > > > >               return ERR_PTR(-EINTR);
> > > > > >       }
> > > > > >
> > > > > > +     if (last_addr > 0)
> > > > >
> > > > > last_addr is an unsigned long, this will always be true.
> > > >
> > > > Not unless last_addr==0. That's what I'm really checking here: is this
> > > > the first invocation of m_start(), in which case we are starting from
> > > > the beginning and not restarting from priv->last_pos. Should I add a
> > > > comment?
> > >
> > > Yeah sorry I was being an idiot, I misread this as >= 0 obviously.
> > >
> > > I had assumed you were checking for the -2 and -1 cases (though -1 early
> > > exits above).
> > >
> > > So in that case, are you handling the gate VMA correctly here? Surely we
> > > should exclude that? Wouldn't setting ppos = last_addr = priv->last_pos be
> > > incorrect if this were a gate vma?
> >
> > You are actually right. last_addr can be -2UL here and we should not
> > override it. I'll fix it. Thanks!
> >
> > >
> > > Even if we then call get_gate_vma() we've changed these values? Or is that
> > > fine?
> > >
> > > And yeah a comment would be good thanks!
> > >
> > > >
> > > > >
> > > > > You probably want to put an explicit check for -1UL, -2UL here or?
> > > > >
> > > > > God I hate this mechanism for indicating gate VMA... yuck yuck (again, this
> > > > > bit not your fault :P)
> > > >
> > > > No, I don't care here about -1UL, -2UL, just that last_addr==0 or not.
> > >
> > > OK, so maybe above concerns not a thing.
> > >
> > > >
> > > > >
> > > > > > +             *ppos = last_addr = priv->last_pos;
> > > > > >       vma_iter_init(&priv->iter, mm, last_addr);
> > > > > >       hold_task_mempolicy(priv);
> > > > > >       if (last_addr == -2UL)
> > > > > >               return get_gate_vma(mm);
> > > > > >
> > > > > > -     return proc_get_vma(priv, ppos);
> > > > > > +     return proc_get_vma(m, ppos);
> > > > > >  }
> > > > > >
> > > > > >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > > > > > @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> > > > > >               *ppos = -1UL;
> > > > > >               return NULL;
> > > > > >       }
> > > > > > -     return proc_get_vma(m->private, ppos);
> > > > > > +     return proc_get_vma(m, ppos);
> > > > > >  }
> > > > > >
> > > > > >  static void m_stop(struct seq_file *m, void *v)
> > > > > > @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v)
> > > > > >               return;
> > > > > >
> > > > > >       release_task_mempolicy(priv);
> > > > > > -     mmap_read_unlock(mm);
> > > > > > +     unlock_content(priv);
> > > > > >       mmput(mm);
> > > > > >       put_task_struct(priv->task);
> > > > > >       priv->task = NULL;
> > > > > > --
> > > > > > 2.49.0.1266.g31b7d2e469-goog
> > > > > >
> > > > >
> > > > > Sorry to add to workload by digging into so many details here, but we
> > > > > really need to make sure all the i's are dotted and t's are crossed given
> > > > > how fiddly and fragile this stuff is :)
> > > > >
> > > > > Very much appreciate the work, this is a significant improvement and will
> > > > > have a great deal of real world impact!
> > > >
> > > > Thanks for meticulously going over the code! This is really helpful.
> > > > Suren.
> > >
> > > No problem!
> > >
> > > >
> > > > >
> > > > > Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
  2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (6 preceding siblings ...)
  2025-06-04 23:11 ` [PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
@ 2025-06-13 15:01 ` Lorenzo Stoakes
  2025-06-13 19:11   ` Suren Baghdasaryan
  7 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-06-13 15:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Hi Suren,

I promised I'd share VMA merging scenarios so we can be absolutely sure we have
all cases covered, I share that below. I also included information on split.

Hopefully this is useful! And maybe we can somehow put in a comment or commit
msg or something somewhere? Not sure if a bit much for that though :)

Note that in all of the below we hold exclusive mmap, vma + rmap write locks.

## Merge with change to EXISTING VMA

### Merge both

                      start    end
                         |<---->|
                 |-------********-------|
                   prev   middle   next
                  extend  delete  delete

1. Set prev VMA range [prev->vm_start, next->vmend)
2. Overwrite prev, middle, next nodes in maple tree with prev
3. Detach middle VMA
4. Free middle VMA
5. Detach next VMA
6. Free next VMA

### Merge left full

                       start        end
                         |<--------->|
                 |-------*************
                   prev     middle
                  extend    delete

1. Set prev VMA range [prev->vm_start, end)
2. Overwrite prev, middle nodes in maple tree with prev
3. Detach middle VMA
4. Free middle VMA

### Merge left partial

                       start   end
		         |<---->|
		 |-------*************
		   prev     middle
		  extend  partial overwrite

1. Set prev VMA range [prev->vm_start, end)
2. Set middle range [end, middle->vm_end)
3. Overwrite prev, middle (partial) nodes in maple tree with prev

### Merge right full

               start        end
		 |<--------->|
		 *************-------|
		    middle     next
		    delete    extend

1. Set next range [start, next->vm_end)
2. Overwrite middle, next nodes in maple tree with next
3. Detach middle VMA
4. Free middle VMA

### Merge right partial

                   start    end
		     |<----->|
		 *************-------|
		    middle     next
		    shrink    extend

1. Set middle range [middle->vm_start, start)
2. Set next range [start, next->vm_end)
3. Overwrite middle (partial), next nodes in maple tree with next

## Merge due to introduction of proposed NEW VMA

These cases are easier as there's no existing VMA to either remove or partially
adjust.

### Merge both

                       start     end
		         |<------>|
		 |-------..........-------|
		   prev  (proposed)  next
		  extend            delete

1. Set prev VMA range [prev->vm_start, next->vm_end)
2. Overwrite prev, next nodes in maple tree with prev
3. Detach next VMA
4. Delete next VMA

### Merge left

                       start     end
		         |<------>|
		 |-------..........
		   prev  (proposed)
		  extend

1. Set prev VMA range [prev->vm_start, end)
2. Overwrite prev node in maple tree with newly extended prev

(This is what's used for brk() and bprm_mm_init() stack relocation in
relocate_vma_down() too)

### Merge right

                       start     end
		         |<------>|
		         ..........-------|
		         (proposed)  next
		                    extend

1. Set next VMA range [start, next->vm_end)
2. Overwrite next node in maple tree with newly extended next

## Split VMA

If new below:

                    addr
                |-----.-----|
                | new .     |
                |-----.-----|
                     vma
Otherwise:

                    addr
                |-----.-----|
                |     . new |
                |-----.-----|
		     vma

1. Duplicate vma
2. If new below, set new range to [vma-vm_start, addr)
3. Otherwise, set new range to [addr, vma->vm_end)
4. If new below, Set vma range to [addr, vma->vm_end)
5. Otherwise, set vma range to [vma->vm_start, addr)
6. Partially overwrite vma node in maple tree with new

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
  2025-06-13 15:01 ` [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Lorenzo Stoakes
@ 2025-06-13 19:11   ` Suren Baghdasaryan
  2025-06-13 19:26     ` Lorenzo Stoakes
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-06-13 19:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Jun 13, 2025 at 8:01 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Hi Suren,
>
> I promised I'd share VMA merging scenarios so we can be absolutely sure we have
> all cases covered, I share that below. I also included information on split.

Thanks Lorenzo! This is great and very helpful.

>
> Hopefully this is useful! And maybe we can somehow put in a comment or commit
> msg or something somewhere? Not sure if a bit much for that though :)

I'll see if I can add a short version into my next cover letter.

>
> Note that in all of the below we hold exclusive mmap, vma + rmap write locks.
>
> ## Merge with change to EXISTING VMA
>
> ### Merge both
>
>                       start    end
>                          |<---->|
>                  |-------********-------|
>                    prev   middle   next
>                   extend  delete  delete
>
> 1. Set prev VMA range [prev->vm_start, next->vmend)
> 2. Overwrite prev, middle, next nodes in maple tree with prev
> 3. Detach middle VMA
> 4. Free middle VMA
> 5. Detach next VMA
> 6. Free next VMA

This case should be fine with per-vma locks while reading
/proc/pid/maps. In the worst case we will report some of the original
vmas before the merge and then the final merged vma, so prev might be
seen twice but no gaps should be observed.

>
> ### Merge left full
>
>                        start        end
>                          |<--------->|
>                  |-------*************
>                    prev     middle
>                   extend    delete
>
> 1. Set prev VMA range [prev->vm_start, end)
> 2. Overwrite prev, middle nodes in maple tree with prev
> 3. Detach middle VMA
> 4. Free middle VMA

Same as the previous case. Worst case we report prev twice - once
before the merge, once after the merge.

>
> ### Merge left partial
>
>                        start   end
>                          |<---->|
>                  |-------*************
>                    prev     middle
>                   extend  partial overwrite
>
> 1. Set prev VMA range [prev->vm_start, end)
> 2. Set middle range [end, middle->vm_end)
> 3. Overwrite prev, middle (partial) nodes in maple tree with prev

We might report prev twice here and this might cause us to retry if we
see a temporary gap between old prev and new middle vma. But retry
should handle this case, so I think we are good here.

>
> ### Merge right full
>
>                start        end
>                  |<--------->|
>                  *************-------|
>                     middle     next
>                     delete    extend
>
> 1. Set next range [start, next->vm_end)
> 2. Overwrite middle, next nodes in maple tree with next
> 3. Detach middle VMA
> 4. Free middle VMA

Worst case we report middle twice.

>
> ### Merge right partial
>
>                    start    end
>                      |<----->|
>                  *************-------|
>                     middle     next
>                     shrink    extend
>
> 1. Set middle range [middle->vm_start, start)
> 2. Set next range [start, next->vm_end)
> 3. Overwrite middle (partial), next nodes in maple tree with next

Worse case we retry and report middle twice.

>
> ## Merge due to introduction of proposed NEW VMA
>
> These cases are easier as there's no existing VMA to either remove or partially
> adjust.
>
> ### Merge both
>
>                        start     end
>                          |<------>|
>                  |-------..........-------|
>                    prev  (proposed)  next
>                   extend            delete
>
> 1. Set prev VMA range [prev->vm_start, next->vm_end)
> 2. Overwrite prev, next nodes in maple tree with prev
> 3. Detach next VMA
> 4. Delete next VMA

Worst case we report prev twice after retry.

>
> ### Merge left
>
>                        start     end
>                          |<------>|
>                  |-------..........
>                    prev  (proposed)
>                   extend
>
> 1. Set prev VMA range [prev->vm_start, end)
> 2. Overwrite prev node in maple tree with newly extended prev

Worst case we report prev twice.

>
> (This is what's used for brk() and bprm_mm_init() stack relocation in
> relocate_vma_down() too)
>
> ### Merge right
>
>                        start     end
>                          |<------>|
>                          ..........-------|
>                          (proposed)  next
>                                     extend
>
> 1. Set next VMA range [start, next->vm_end)
> 2. Overwrite next node in maple tree with newly extended next

This will show either a legit gap + original next or the extended next
with no gap. Both ways we are fine.

>
> ## Split VMA
>
> If new below:
>
>                     addr
>                 |-----.-----|
>                 | new .     |
>                 |-----.-----|
>                      vma
> Otherwise:
>
>                     addr
>                 |-----.-----|
>                 |     . new |
>                 |-----.-----|
>                      vma
>
> 1. Duplicate vma
> 2. If new below, set new range to [vma-vm_start, addr)
> 3. Otherwise, set new range to [addr, vma->vm_end)
> 4. If new below, Set vma range to [addr, vma->vm_end)
> 5. Otherwise, set vma range to [vma->vm_start, addr)
> 6. Partially overwrite vma node in maple tree with new

These are fine too. We will either report before-split view or after-split view.
Thanks,
Suren.

>
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
  2025-06-13 19:11   ` Suren Baghdasaryan
@ 2025-06-13 19:26     ` Lorenzo Stoakes
  0 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-06-13 19:26 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Fri, Jun 13, 2025 at 12:11:43PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 13, 2025 at 8:01 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Hi Suren,
> >
> > I promised I'd share VMA merging scenarios so we can be absolutely sure we have
> > all cases covered, I share that below. I also included information on split.
>
> Thanks Lorenzo! This is great and very helpful.

No problem! I do intend to look at the tests here too, but I just didn't
have time to get to that this week.

I don't think this should block the respin should it? Anyway hopefully I'll
be able to take a look next week.

>
> >
> > Hopefully this is useful! And maybe we can somehow put in a comment or commit
> > msg or something somewhere? Not sure if a bit much for that though :)
>
> I'll see if I can add a short version into my next cover letter.

Thanks!

Liam suggested somehow integrating this into our VMA userland testing or at
least as documentation there, I will put on my todo :)

Your replies below honestly make me feel more relaxed about this change
overall - it helps us really identify the known cases (Donald Rumsfeld of
course would tell us to fear the unknown unknowns but we do what we can) -
and if they are clearly thought out and confirmed to be safe then happy
days.

I wonder if we ought to have the tests explicitly try to trigger each case?
I'm not sure how practical/useful that would be however.

>
> >
> > Note that in all of the below we hold exclusive mmap, vma + rmap write locks.
> >
> > ## Merge with change to EXISTING VMA
> >
> > ### Merge both
> >
> >                       start    end
> >                          |<---->|
> >                  |-------********-------|
> >                    prev   middle   next
> >                   extend  delete  delete
> >
> > 1. Set prev VMA range [prev->vm_start, next->vmend)
> > 2. Overwrite prev, middle, next nodes in maple tree with prev
> > 3. Detach middle VMA
> > 4. Free middle VMA
> > 5. Detach next VMA
> > 6. Free next VMA
>
> This case should be fine with per-vma locks while reading
> /proc/pid/maps. In the worst case we will report some of the original
> vmas before the merge and then the final merged vma, so prev might be
> seen twice but no gaps should be observed.
>
> >
> > ### Merge left full
> >
> >                        start        end
> >                          |<--------->|
> >                  |-------*************
> >                    prev     middle
> >                   extend    delete
> >
> > 1. Set prev VMA range [prev->vm_start, end)
> > 2. Overwrite prev, middle nodes in maple tree with prev
> > 3. Detach middle VMA
> > 4. Free middle VMA
>
> Same as the previous case. Worst case we report prev twice - once
> before the merge, once after the merge.
>
> >
> > ### Merge left partial
> >
> >                        start   end
> >                          |<---->|
> >                  |-------*************
> >                    prev     middle
> >                   extend  partial overwrite
> >
> > 1. Set prev VMA range [prev->vm_start, end)
> > 2. Set middle range [end, middle->vm_end)
> > 3. Overwrite prev, middle (partial) nodes in maple tree with prev
>
> We might report prev twice here and this might cause us to retry if we
> see a temporary gap between old prev and new middle vma. But retry
> should handle this case, so I think we are good here.
>
> >
> > ### Merge right full
> >
> >                start        end
> >                  |<--------->|
> >                  *************-------|
> >                     middle     next
> >                     delete    extend
> >
> > 1. Set next range [start, next->vm_end)
> > 2. Overwrite middle, next nodes in maple tree with next
> > 3. Detach middle VMA
> > 4. Free middle VMA
>
> Worst case we report middle twice.
>
> >
> > ### Merge right partial
> >
> >                    start    end
> >                      |<----->|
> >                  *************-------|
> >                     middle     next
> >                     shrink    extend
> >
> > 1. Set middle range [middle->vm_start, start)
> > 2. Set next range [start, next->vm_end)
> > 3. Overwrite middle (partial), next nodes in maple tree with next
>
> Worse case we retry and report middle twice.
>
> >
> > ## Merge due to introduction of proposed NEW VMA
> >
> > These cases are easier as there's no existing VMA to either remove or partially
> > adjust.
> >
> > ### Merge both
> >
> >                        start     end
> >                          |<------>|
> >                  |-------..........-------|
> >                    prev  (proposed)  next
> >                   extend            delete
> >
> > 1. Set prev VMA range [prev->vm_start, next->vm_end)
> > 2. Overwrite prev, next nodes in maple tree with prev
> > 3. Detach next VMA
> > 4. Delete next VMA
>
> Worst case we report prev twice after retry.
>
> >
> > ### Merge left
> >
> >                        start     end
> >                          |<------>|
> >                  |-------..........
> >                    prev  (proposed)
> >                   extend
> >
> > 1. Set prev VMA range [prev->vm_start, end)
> > 2. Overwrite prev node in maple tree with newly extended prev
>
> Worst case we report prev twice.
>
> >
> > (This is what's used for brk() and bprm_mm_init() stack relocation in
> > relocate_vma_down() too)
> >
> > ### Merge right
> >
> >                        start     end
> >                          |<------>|
> >                          ..........-------|
> >                          (proposed)  next
> >                                     extend
> >
> > 1. Set next VMA range [start, next->vm_end)
> > 2. Overwrite next node in maple tree with newly extended next
>
> This will show either a legit gap + original next or the extended next
> with no gap. Both ways we are fine.
>
> >
> > ## Split VMA
> >
> > If new below:
> >
> >                     addr
> >                 |-----.-----|
> >                 | new .     |
> >                 |-----.-----|
> >                      vma
> > Otherwise:
> >
> >                     addr
> >                 |-----.-----|
> >                 |     . new |
> >                 |-----.-----|
> >                      vma
> >
> > 1. Duplicate vma
> > 2. If new below, set new range to [vma-vm_start, addr)
> > 3. Otherwise, set new range to [addr, vma->vm_end)
> > 4. If new below, Set vma range to [addr, vma->vm_end)
> > 5. Otherwise, set vma range to [vma->vm_start, addr)
> > 6. Partially overwrite vma node in maple tree with new
>
> These are fine too. We will either report before-split view or after-split view.
> Thanks,
> Suren.
>
> >
> > Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks
  2025-06-04 23:11 ` [PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
@ 2025-06-13 20:36   ` Andrii Nakryiko
  0 siblings, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2025-06-13 20:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Wed, Jun 4, 2025 at 4:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Utilize per-vma locks to stabilize vma after lookup without taking
> mmap_lock during PROCMAP_QUERY ioctl execution. While we might take
> mmap_lock for reading during contention, we do that momentarily only
> to lock the vma.
> This change is designed to reduce mmap_lock contention and prevent
> PROCMAP_QUERY ioctl calls from blocking address space updates.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  fs/proc/task_mmu.c | 56 ++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 44 insertions(+), 12 deletions(-)
>

The overall approach in this patch set looks good to me! PROCMAP_QUERY
changes specifically are pretty straightforward, nice. LGTM:

Acked-by: Andrii Nakryiko <andrii@kernel.org>

And for the rest of the changes you seem to be in good hands, so I'll
just be waiting for the final thing to land, thanks for working on
this!


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-06-13 20:36 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-04 23:11 [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
2025-06-04 23:11 ` [PATCH v4 1/7] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
2025-06-04 23:11 ` [PATCH v4 2/7] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
2025-06-04 23:11 ` [PATCH v4 3/7] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
2025-06-04 23:11 ` [PATCH v4 4/7] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
2025-06-04 23:11 ` [PATCH v4 5/7] selftests/proc: add verbose more for tests to facilitate debugging Suren Baghdasaryan
2025-06-04 23:11 ` [PATCH v4 6/7] mm/maps: read proc/pid/maps under per-vma lock Suren Baghdasaryan
2025-06-07 17:43   ` Lorenzo Stoakes
2025-06-08  1:41     ` Suren Baghdasaryan
2025-06-10 17:43       ` Lorenzo Stoakes
2025-06-11  0:16         ` Suren Baghdasaryan
2025-06-11 10:24           ` Lorenzo Stoakes
2025-06-11 15:12             ` Suren Baghdasaryan
2025-06-10  7:50   ` kernel test robot
2025-06-10 14:02     ` Suren Baghdasaryan
2025-06-04 23:11 ` [PATCH v4 7/7] mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
2025-06-13 20:36   ` Andrii Nakryiko
2025-06-13 15:01 ` [PATCH v4 0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Lorenzo Stoakes
2025-06-13 19:11   ` Suren Baghdasaryan
2025-06-13 19:26     ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).