* [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
@ 2025-07-04  6:07 Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
                   ` (8 more replies)
  0 siblings, 9 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges; however, it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low-priority monitoring tasks; when they block high-priority tasks, the
result is priority inversion.

Locking the entire address space is required to present a fully coherent
picture of it, but even the current implementation does not strictly
guarantee that: vmas are output in page-size chunks and mmap_lock is
dropped between chunks. Address space modifications are possible while
mmap_lock is dropped, and userspace reading the content is expected to
deal with possible concurrent address space modifications. Given these
relaxed rules, holding mmap_lock is not strictly needed as long as we can
guarantee that a concurrently modified vma is reported either in its
original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces contention
with tasks modifying the address space because they now contend for
individual vmas rather than for the entire address space. The same is done
for the PROCMAP_QUERY ioctl, which locks only the vma that falls into the
requested range instead of the entire address space. A previous version of
this patchset [1] tried to perform the /proc/pid/maps read under RCU, but
its implementation was quite complex and its results were worse than this
version's because it still relied on mmap_lock speculation, which retries
if any part of the address space gets modified. The new implementation is
both simpler and results in less contention. Note that a similar approach
would not work for reading /proc/pid/smaps, which also walks the page
tables, and that is not RCU-safe.
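
To illustrate the idea (this is not the code added by patches 7-8/8, just
a rough sketch using the existing lock_vma_under_rcu()/vma_end_read()
helpers and a hypothetical report_vma_at() function), the per-vma locking
pattern with an mmap_lock fallback looks roughly like this:

static void report_vma_at(struct mm_struct *mm, unsigned long addr,
			  struct seq_file *m)
{
	struct vm_area_struct *vma;

	/* Try to read-lock only the vma covering addr */
	vma = lock_vma_under_rcu(mm, addr);
	if (vma) {
		/* ... format this single vma into m ... */
		vma_end_read(vma);
		return;
	}

	/* Per-vma lock contended or no vma found: fall back to mmap_lock */
	if (mmap_read_lock_killable(mm))
		return;
	vma = find_vma(mm, addr);
	if (vma) {
		/* ... format this single vma into m ... */
	}
	mmap_read_unlock(mm);
}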

Paul McKenney designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory.  At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test.  The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.011     0.008     0.455
    0.011     0.008     0.472
    0.011     0.008     0.535
    0.011     0.009     0.545
    ...
    0.011     0.014     2.875
    0.011     0.014     2.913
    0.011     0.014     3.007
    0.011     0.015     3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.006     0.005     0.036
    0.006     0.005     0.039
    0.006     0.005     0.039
    0.006     0.005     0.039
    ...
    0.006     0.006     0.403
    0.006     0.006     0.474
    0.006     0.006     0.479
    0.006     0.006     0.498
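
For reference, the map/unmap side of that workload is conceptually just a
tight loop that unmaps and re-maps one page inside a large anonymous
mapping. The following is only a rough, simplified approximation, not
Paul's actual test (which is in [2]):

#include <assert.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	const size_t range = 128UL << 20;	/* 128MB anonymous region */
	size_t page = sysconf(_SC_PAGESIZE);
	char *base, *target;

	base = mmap(NULL, range, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(base != MAP_FAILED);
	target = base + range / 2;		/* one page inside the range */

	for (int i = 0; i < 1000000; i++) {
		assert(munmap(target, page) == 0);
		assert(mmap(target, page, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
			    -1, 0) != MAP_FAILED);
	}
	return 0;
}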

The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. One user-visible inconsistency is the
same vma being printed twice: once before it was modified and once after.
For example, if a vma was extended, it might be found and reported twice.
What is not expected is to see a gap where a vma should have been both
before and after the modification. This patchset increases the chances of
such tearing, so it is even more important
now to test for unexpected inconsistencies.
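
To illustrate what "dealing with" such duplication can look like in
userspace, the reader below (a minimal sketch, not part of this series)
treats any entry that starts below the previous entry's end as a
re-reported, possibly modified vma rather than an error:

#include <stdio.h>

int main(void)
{
	/* A real monitor would open /proc/<pid>/maps of the observed task */
	FILE *f = fopen("/proc/self/maps", "r");
	char line[512];
	unsigned long start, end, prev_end = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
			continue;
		/* Overlap with what we already saw: the entry was re-reported */
		if (start < prev_end)
			printf("re-reported entry: %s", line);
		if (end > prev_end)
			prev_end = end;
	}
	fclose(f);
	return 0;
}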

In [3] Lorenzo identified the following possible vma merging/splitting
scenarios:

Merges with changes to existing vmas:
1. Merge both - mapping a vma over another one and between two vmas which
can be merged after this replacement;
2. Merge left full - mapping a vma at the end of an existing one and
completely over its right neighbor;
3. Merge left partial - mapping a vma at the end of an existing one and
partially over its right neighbor;
4. Merge right full - mapping a vma before the start of an existing one
and completely over its left neighbor;
5. Merge right partial - mapping a vma before the start of an existing one
and partially over its left neighbor;

Merges without changes to existing vmas:
6. Merge both - mapping a vma into a gap between two vmas which can be
merged after the insertion;
7. Merge left - mapping a vma at the end of an existing one;
8. Merge right - mapping a vma before the start of an existing one;

Splits
9. Split with new vma at the lower address;
10. Split with new vma at the higher address;

If such merges or splits happen concurrently with the /proc/pid/maps
read, we might report a vma twice, once before the modification and once
after it is modified:

Case 1 might report the overwritten and the previous vma along with the
final merged vma;
Case 2 might report the previous and the final merged vma;
Case 3 might cause us to retry once we detect the temporary gap caused by
shrinking of the right neighbor;
Case 4 might report the overwritten and the final merged vma;
Case 5 might cause us to retry once we detect the temporary gap caused by
shrinking of the left neighbor;
Case 6 might report the previous vma and the gap along with the final
merged vma;
Case 7 might report the previous and the final merged vma;
Case 8 might report the original gap and the final merged vma covering the
gap;
Case 9 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma start;
Case 10 might cause us to retry once we detect the temporary gap caused by
shrinking of the original vma at the vma end;

In all these cases the retry mechanism prevents us from reporting possible
temporary gaps.

Changes since v5 [4]:
- Made /proc/pid/maps tearing test a separate selftest,
per Alexey Dobriyan
- Changed asserts with or'ed conditions into separate ones,
per Alexey Dobriyan
- Added a small cleanup patch [6/8] to avoid unnecessary seq_file position
type casting
- Removed unnecessary is_sentinel_pos() helper
- Changed titles to use fs/proc/task_mmu instead of mm/maps prefix,
per David Hildenbrand
- Included Lorenzo's fix for mmap lock assertion in anon_vma_name()
- Reworked the last patch to avoid allocation in the rcu read section,
which replaces Jeongjun Park's fix

!!! NOTES FOR APPLYING THE PATCHSET !!!

Applies cleanly over mm-unstable after reverting the old version along
with its fixes. The following patches should be reverted before applying
this patchset:

b33ce1be8a40 ("selftests/proc: add /proc/pid/maps tearing from vma split test")
b538e0580fd6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma resizing")
4996b4409cc6 ("selftests/proc: extend /proc/pid/maps tearing test to include vma remapping")
c39471f78d5e ("selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified")
487570f548f3 ("selftests/proc: add verbose more for tests to facilitate debugging")
e1ba4969cba1 ("mm/maps: read proc/pid/maps under per-vma lock")
ecb110179e77 ("mm/madvise: fixup stray mmap lock assert in anon_vma_name()")
6772c457a865 ("fs/proc/task_mmu:: execute PROCMAP_QUERY ioctl under per-vma locks")
d5c67bb2c5fb ("mm/maps: move kmalloc() call location in do_procmap_query() out of RCU critical section")

[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test
[3] https://lore.kernel.org/all/e1863f40-39ab-4e5b-984a-c48765ffde1c@lucifer.local/
[4] https://lore.kernel.org/all/20250624193359.3865351-1-surenb@google.com/

Suren Baghdasaryan (8):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
    modified
  selftests/proc: add verbose mode for tests to facilitate debugging
  fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                            |   5 +
 fs/proc/task_mmu.c                            | 188 +++-
 include/linux/mmap_lock.h                     |  11 +
 mm/madvise.c                                  |   3 +-
 mm/mmap_lock.c                                |  88 ++
 tools/testing/selftests/proc/.gitignore       |   1 +
 tools/testing/selftests/proc/Makefile         |   1 +
 tools/testing/selftests/proc/proc-maps-race.c | 829 ++++++++++++++++++
 8 files changed, 1098 insertions(+), 28 deletions(-)
 create mode 100644 tools/testing/selftests/proc/proc-maps-race.c

-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v6 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

The /proc/pid/maps file is generated page by page, with the mmap_lock
released between pages.  This can lead to inconsistent reads if the
underlying vmas are concurrently modified. For instance, if a vma split
or merge occurs at a page boundary while /proc/pid/maps is being read,
the same vma might be seen twice: once before and once after the change.
This duplication is considered acceptable for userspace handling.
However, observing a "hole" where a vma should be (e.g., due to a vma
being replaced and the space temporarily being empty) is unacceptable.

Implement a test that:
1. Forks a child process which continuously modifies its address space,
specifically targeting a vma at the boundary between two pages.
2. The parent process repeatedly reads the child's /proc/pid/maps.
3. The parent process checks the last vma of the first page and
the first vma of the second page for consistency, looking for the
effects of vma splits or merges.

The test duration is configurable via the -d command-line parameter
in seconds to increase the likelihood of catching the race condition.
The default test duration is 5 seconds.

Example Command: proc-maps-race -d 10

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/.gitignore       |   1 +
 tools/testing/selftests/proc/Makefile         |   1 +
 tools/testing/selftests/proc/proc-maps-race.c | 459 ++++++++++++++++++
 3 files changed, 461 insertions(+)
 create mode 100644 tools/testing/selftests/proc/proc-maps-race.c

diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 973968f45bba..19bb333e2485 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -5,6 +5,7 @@
 /proc-2-is-kthread
 /proc-fsconfig-hidepid
 /proc-loadavg-001
+/proc-maps-race
 /proc-multiple-procfs
 /proc-empty-vm
 /proc-pid-vm
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index b12921b9794b..50aba102201a 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -9,6 +9,7 @@ TEST_GEN_PROGS += fd-002-posix-eq
 TEST_GEN_PROGS += fd-003-kthread
 TEST_GEN_PROGS += proc-2-is-kthread
 TEST_GEN_PROGS += proc-loadavg-001
+TEST_GEN_PROGS += proc-maps-race
 TEST_GEN_PROGS += proc-empty-vm
 TEST_GEN_PROGS += proc-pid-vm
 TEST_GEN_PROGS += proc-self-map-files-001
diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
new file mode 100644
index 000000000000..523afd83d34f
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -0,0 +1,459 @@
+/*
+ * Copyright (c) 2025 Suren Baghdasaryan <surenb@google.com>
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+/*
+ * Fork a child that concurrently modifies address space while the main
+ * process is reading /proc/$PID/maps and verifying the results. Address
+ * space modifications include:
+ *     VMA splitting and merging
+ *
+ */
+#undef NDEBUG
+#include <assert.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+static unsigned long test_duration_sec = 5UL;
+static int page_size;
+
+/* /proc/pid/maps parsing routines */
+struct page_content {
+	char *data;
+	ssize_t size;
+};
+
+#define LINE_MAX_SIZE		256
+
+struct line_content {
+	char text[LINE_MAX_SIZE];
+	unsigned long start_addr;
+	unsigned long end_addr;
+};
+
+static void read_two_pages(int maps_fd, struct page_content *page1,
+			   struct page_content *page2)
+{
+	ssize_t  bytes_read;
+
+	assert(lseek(maps_fd, 0, SEEK_SET) >= 0);
+	bytes_read = read(maps_fd, page1->data, page_size);
+	assert(bytes_read > 0 && bytes_read < page_size);
+	page1->size = bytes_read;
+
+	bytes_read = read(maps_fd, page2->data, page_size);
+	assert(bytes_read > 0 && bytes_read < page_size);
+	page2->size = bytes_read;
+}
+
+static void copy_first_line(struct page_content *page, char *first_line)
+{
+	char *pos = strchr(page->data, '\n');
+
+	strncpy(first_line, page->data, pos - page->data);
+	first_line[pos - page->data] = '\0';
+}
+
+static void copy_last_line(struct page_content *page, char *last_line)
+{
+	/* Get the last line in the first page */
+	const char *end = page->data + page->size - 1;
+	/* skip last newline */
+	const char *pos = end - 1;
+
+	/* search previous newline */
+	while (pos[-1] != '\n')
+		pos--;
+	strncpy(last_line, pos, end - pos);
+	last_line[end - pos] = '\0';
+}
+
+/* Read the last line of the first page and the first line of the second page */
+static void read_boundary_lines(int maps_fd, struct page_content *page1,
+				struct page_content *page2,
+				struct line_content *last_line,
+				struct line_content *first_line)
+{
+	read_two_pages(maps_fd, page1, page2);
+
+	copy_last_line(page1, last_line->text);
+	copy_first_line(page2, first_line->text);
+
+	assert(sscanf(last_line->text, "%lx-%lx", &last_line->start_addr,
+		      &last_line->end_addr) == 2);
+	assert(sscanf(first_line->text, "%lx-%lx", &first_line->start_addr,
+		      &first_line->end_addr) == 2);
+}
+
+/* Thread synchronization routines */
+enum test_state {
+	INIT,
+	CHILD_READY,
+	PARENT_READY,
+	SETUP_READY,
+	SETUP_MODIFY_MAPS,
+	SETUP_MAPS_MODIFIED,
+	SETUP_RESTORE_MAPS,
+	SETUP_MAPS_RESTORED,
+	TEST_READY,
+	TEST_DONE,
+};
+
+struct vma_modifier_info;
+
+typedef void (*vma_modifier_op)(const struct vma_modifier_info *mod_info);
+typedef void (*vma_mod_result_check_op)(struct line_content *mod_last_line,
+					struct line_content *mod_first_line,
+					struct line_content *restored_last_line,
+					struct line_content *restored_first_line);
+
+struct vma_modifier_info {
+	int vma_count;
+	void *addr;
+	int prot;
+	void *next_addr;
+	vma_modifier_op vma_modify;
+	vma_modifier_op vma_restore;
+	vma_mod_result_check_op vma_mod_check;
+	pthread_mutex_t sync_lock;
+	pthread_cond_t sync_cond;
+	enum test_state curr_state;
+	bool exit;
+	void *child_mapped_addr[];
+};
+
+static void wait_for_state(struct vma_modifier_info *mod_info, enum test_state state)
+{
+	pthread_mutex_lock(&mod_info->sync_lock);
+	while (mod_info->curr_state != state)
+		pthread_cond_wait(&mod_info->sync_cond, &mod_info->sync_lock);
+	pthread_mutex_unlock(&mod_info->sync_lock);
+}
+
+static void signal_state(struct vma_modifier_info *mod_info, enum test_state state)
+{
+	pthread_mutex_lock(&mod_info->sync_lock);
+	mod_info->curr_state = state;
+	pthread_cond_signal(&mod_info->sync_cond);
+	pthread_mutex_unlock(&mod_info->sync_lock);
+}
+
+/* VMA modification routines */
+static void *child_vma_modifier(struct vma_modifier_info *mod_info)
+{
+	int prot = PROT_READ | PROT_WRITE;
+	int i;
+
+	for (i = 0; i < mod_info->vma_count; i++) {
+		mod_info->child_mapped_addr[i] = mmap(NULL, page_size * 3, prot,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		assert(mod_info->child_mapped_addr[i] != MAP_FAILED);
+		/* change protection in adjacent maps to prevent merging */
+		prot ^= PROT_WRITE;
+	}
+	signal_state(mod_info, CHILD_READY);
+	wait_for_state(mod_info, PARENT_READY);
+	while (true) {
+		signal_state(mod_info, SETUP_READY);
+		wait_for_state(mod_info, SETUP_MODIFY_MAPS);
+		if (mod_info->exit)
+			break;
+
+		mod_info->vma_modify(mod_info);
+		signal_state(mod_info, SETUP_MAPS_MODIFIED);
+		wait_for_state(mod_info, SETUP_RESTORE_MAPS);
+		mod_info->vma_restore(mod_info);
+		signal_state(mod_info, SETUP_MAPS_RESTORED);
+
+		wait_for_state(mod_info, TEST_READY);
+		while (mod_info->curr_state != TEST_DONE) {
+			mod_info->vma_modify(mod_info);
+			mod_info->vma_restore(mod_info);
+		}
+	}
+	for (i = 0; i < mod_info->vma_count; i++)
+		munmap(mod_info->child_mapped_addr[i], page_size * 3);
+
+	return NULL;
+}
+
+static void stop_vma_modifier(struct vma_modifier_info *mod_info)
+{
+	wait_for_state(mod_info, SETUP_READY);
+	mod_info->exit = true;
+	signal_state(mod_info, SETUP_MODIFY_MAPS);
+}
+
+static void capture_mod_pattern(int maps_fd,
+				struct vma_modifier_info *mod_info,
+				struct page_content *page1,
+				struct page_content *page2,
+				struct line_content *last_line,
+				struct line_content *first_line,
+				struct line_content *mod_last_line,
+				struct line_content *mod_first_line,
+				struct line_content *restored_last_line,
+				struct line_content *restored_first_line)
+{
+	signal_state(mod_info, SETUP_MODIFY_MAPS);
+	wait_for_state(mod_info, SETUP_MAPS_MODIFIED);
+
+	/* Copy last line of the first page and first line of the last page */
+	read_boundary_lines(maps_fd, page1, page2, mod_last_line, mod_first_line);
+
+	signal_state(mod_info, SETUP_RESTORE_MAPS);
+	wait_for_state(mod_info, SETUP_MAPS_RESTORED);
+
+	/* Copy last line of the first page and first line of the last page */
+	read_boundary_lines(maps_fd, page1, page2, restored_last_line, restored_first_line);
+
+	mod_info->vma_mod_check(mod_last_line, mod_first_line,
+				restored_last_line, restored_first_line);
+
+	/*
+	 * The content of these lines after modify+restore should be the same
+	 * as the original.
+	 */
+	assert(strcmp(restored_last_line->text, last_line->text) == 0);
+	assert(strcmp(restored_first_line->text, first_line->text) == 0);
+}
+
+static inline void split_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC,
+		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+		    -1, 0) != MAP_FAILED);
+}
+
+static inline void merge_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mmap(mod_info->addr, page_size, mod_info->prot,
+		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+		    -1, 0) != MAP_FAILED);
+}
+
+static inline void check_split_result(struct line_content *mod_last_line,
+				      struct line_content *mod_first_line,
+				      struct line_content *restored_last_line,
+				      struct line_content *restored_first_line)
+{
+	/* Make sure vmas at the boundaries are changing */
+	assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+	assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
+}
+
+static void test_maps_tearing_from_split(int maps_fd,
+					 struct vma_modifier_info *mod_info,
+					 struct page_content *page1,
+					 struct page_content *page2,
+					 struct line_content *last_line,
+					 struct line_content *first_line)
+{
+	struct line_content split_last_line;
+	struct line_content split_first_line;
+	struct line_content restored_last_line;
+	struct line_content restored_first_line;
+
+	wait_for_state(mod_info, SETUP_READY);
+
+	/* re-read the file to avoid using stale data from previous test */
+	read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+	mod_info->vma_modify = split_vma;
+	mod_info->vma_restore = merge_vma;
+	mod_info->vma_mod_check = check_split_result;
+
+	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+			    &split_last_line, &split_first_line,
+			    &restored_last_line, &restored_first_line);
+
+	/* Now start concurrent modifications for test_duration_sec */
+	signal_state(mod_info, TEST_READY);
+
+	struct line_content new_last_line;
+	struct line_content new_first_line;
+	struct timespec start_ts, end_ts;
+
+	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	do {
+		bool last_line_changed;
+		bool first_line_changed;
+
+		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+		/* Check if we read vmas after split */
+		if (!strcmp(new_last_line.text, split_last_line.text)) {
+			/*
+			 * The vmas should be consistent with split results,
+			 * however if vma was concurrently restored after a
+			 * split, it can be reported twice (first the original
+			 * split one, then the same vma but extended after the
+			 * merge) because we found it as the next vma again.
+			 * In that case new first line will be the same as the
+			 * last restored line.
+			 */
+			assert(!strcmp(new_first_line.text, split_first_line.text) ||
+			       !strcmp(new_first_line.text, restored_last_line.text));
+		} else {
+			/* The vmas should be consistent with merge results */
+			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+			       !strcmp(new_first_line.text, restored_first_line.text));
+		}
+		/*
+		 * First and last lines should change in unison. If the last
+		 * line changed then the first line should change as well and
+		 * vice versa.
+		 */
+		last_line_changed = strcmp(new_last_line.text, last_line->text) != 0;
+		first_line_changed = strcmp(new_first_line.text, first_line->text) != 0;
+		assert(last_line_changed == first_line_changed);
+
+		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+	/* Signal the modifier thread to stop and wait until it exits */
+	signal_state(mod_info, TEST_DONE);
+}
+
+int usage(void)
+{
+	fprintf(stderr, "Userland /proc/pid/{s}maps race test cases\n");
+	fprintf(stderr, "  -d: Duration for time-consuming tests\n");
+	fprintf(stderr, "  -h: Help screen\n");
+	exit(-1);
+}
+
+int main(int argc, char **argv)
+{
+	struct vma_modifier_info *mod_info;
+	pthread_mutexattr_t mutex_attr;
+	pthread_condattr_t cond_attr;
+	int shared_mem_size;
+	char fname[32];
+	int vma_count;
+	int maps_fd;
+	int status;
+	pid_t pid;
+	int opt;
+
+	while ((opt = getopt(argc, argv, "d:h")) != -1) {
+		if (opt == 'd')
+			test_duration_sec = strtoul(optarg, NULL, 0);
+		else if (opt == 'h')
+			usage();
+	}
+
+	page_size = sysconf(_SC_PAGESIZE);
+	/*
+	 * Have to map enough vmas for /proc/pid/maps to contain more than one
+	 * page worth of vmas. Assume at least 32 bytes per line in maps output
+	 */
+	vma_count = page_size / 32 + 1;
+	shared_mem_size = sizeof(struct vma_modifier_info) + vma_count * sizeof(void *);
+
+	/* map shared memory for communication with the child process */
+	mod_info = (struct vma_modifier_info *)mmap(NULL, shared_mem_size,
+		    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+	assert(mod_info != MAP_FAILED);
+
+	/* Initialize shared members */
+	pthread_mutexattr_init(&mutex_attr);
+	pthread_mutexattr_setpshared(&mutex_attr, PTHREAD_PROCESS_SHARED);
+	assert(!pthread_mutex_init(&mod_info->sync_lock, &mutex_attr));
+	pthread_condattr_init(&cond_attr);
+	pthread_condattr_setpshared(&cond_attr, PTHREAD_PROCESS_SHARED);
+	assert(!pthread_cond_init(&mod_info->sync_cond, &cond_attr));
+	mod_info->vma_count = vma_count;
+	mod_info->curr_state = INIT;
+	mod_info->exit = false;
+
+	pid = fork();
+	if (!pid) {
+		/* Child process */
+		child_vma_modifier(mod_info);
+		return 0;
+	}
+
+	sprintf(fname, "/proc/%d/maps", pid);
+	maps_fd = open(fname, O_RDONLY);
+	assert(maps_fd != -1);
+
+	/* Wait for the child to map the VMAs */
+	wait_for_state(mod_info, CHILD_READY);
+
+	/* Read first two pages */
+	struct page_content page1;
+	struct page_content page2;
+
+	page1.data = malloc(page_size);
+	assert(page1.data);
+	page2.data = malloc(page_size);
+	assert(page2.data);
+
+	struct line_content last_line;
+	struct line_content first_line;
+
+	read_boundary_lines(maps_fd, &page1, &page2, &last_line, &first_line);
+
+	/*
+	 * Find the addresses corresponding to the last line in the first page
+	 * and the first line in the last page.
+	 */
+	mod_info->addr = NULL;
+	mod_info->next_addr = NULL;
+	for (int i = 0; i < mod_info->vma_count; i++) {
+		if (mod_info->child_mapped_addr[i] == (void *)last_line.start_addr) {
+			mod_info->addr = mod_info->child_mapped_addr[i];
+			mod_info->prot = PROT_READ;
+			/* Even VMAs have write permission */
+			if ((i % 2) == 0)
+				mod_info->prot |= PROT_WRITE;
+		} else if (mod_info->child_mapped_addr[i] == (void *)first_line.start_addr) {
+			mod_info->next_addr = mod_info->child_mapped_addr[i];
+		}
+
+		if (mod_info->addr && mod_info->next_addr)
+			break;
+	}
+	assert(mod_info->addr && mod_info->next_addr);
+
+	signal_state(mod_info, PARENT_READY);
+
+	test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2,
+				     &last_line, &first_line);
+
+	stop_vma_modifier(mod_info);
+
+	free(page2.data);
+	free(page1.data);
+
+	for (int i = 0; i < vma_count; i++)
+		munmap(mod_info->child_mapped_addr[i], page_size);
+	close(maps_fd);
+	waitpid(pid, &status, 0);
+	munmap(mod_info, shared_mem_size);
+
+	return 0;
+}
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v6 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Test that /proc/pid/maps does not report unexpected holes in the address
space when a vma at the edge of the page is being concurrently resized
via mremap(). The resizing results in the vma shrinking and expanding
from under the reader. We should always see either the shrunk or the
expanded (original) version of the vma.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-maps-race.c | 83 +++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 523afd83d34f..10365b4e68e1 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -336,6 +336,86 @@ static void test_maps_tearing_from_split(int maps_fd,
 	signal_state(mod_info, TEST_DONE);
 }
 
+static inline void shrink_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mremap(mod_info->addr, page_size * 3, page_size, 0) != MAP_FAILED);
+}
+
+static inline void expand_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(mremap(mod_info->addr, page_size, page_size * 3, 0) != MAP_FAILED);
+}
+
+static inline void check_shrink_result(struct line_content *mod_last_line,
+				       struct line_content *mod_first_line,
+				       struct line_content *restored_last_line,
+				       struct line_content *restored_first_line)
+{
+	/* Make sure only the last vma of the first page is changing */
+	assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+	assert(strcmp(mod_first_line->text, restored_first_line->text) == 0);
+}
+
+static void test_maps_tearing_from_resize(int maps_fd,
+					  struct vma_modifier_info *mod_info,
+					  struct page_content *page1,
+					  struct page_content *page2,
+					  struct line_content *last_line,
+					  struct line_content *first_line)
+{
+	struct line_content shrunk_last_line;
+	struct line_content shrunk_first_line;
+	struct line_content restored_last_line;
+	struct line_content restored_first_line;
+
+	wait_for_state(mod_info, SETUP_READY);
+
+	/* re-read the file to avoid using stale data from previous test */
+	read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+	mod_info->vma_modify = shrink_vma;
+	mod_info->vma_restore = expand_vma;
+	mod_info->vma_mod_check = check_shrink_result;
+
+	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+			    &shrunk_last_line, &shrunk_first_line,
+			    &restored_last_line, &restored_first_line);
+
+	/* Now start concurrent modifications for test_duration_sec */
+	signal_state(mod_info, TEST_READY);
+
+	struct line_content new_last_line;
+	struct line_content new_first_line;
+	struct timespec start_ts, end_ts;
+
+	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	do {
+		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+		/* Check if we read vmas after shrinking it */
+		if (!strcmp(new_last_line.text, shrunk_last_line.text)) {
+			/*
+			 * The vmas should be consistent with shrunk results,
+			 * however if the vma was concurrently restored, it
+			 * can be reported twice (first as shrunk one, then
+			 * as restored one) because we found it as the next vma
+			 * again. In that case new first line will be the same
+			 * as the last restored line.
+			 */
+			assert(!strcmp(new_first_line.text, shrunk_first_line.text) ||
+			       !strcmp(new_first_line.text, restored_last_line.text));
+		} else {
+			/* The vmas should be consistent with the original/restored state */
+			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+			       !strcmp(new_first_line.text, restored_first_line.text));
+		}
+		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+	/* Signal the modifier thread to stop and wait until it exits */
+	signal_state(mod_info, TEST_DONE);
+}
+
 int usage(void)
 {
 	fprintf(stderr, "Userland /proc/pid/{s}maps race test cases\n");
@@ -444,6 +524,9 @@ int main(int argc, char **argv)
 	test_maps_tearing_from_split(maps_fd, mod_info, &page1, &page2,
 				     &last_line, &first_line);
 
+	test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
+				      &last_line, &first_line);
+
 	stop_vma_modifier(mod_info);
 
 	free(page2.data);
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v6 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Test that /proc/pid/maps does not report unexpected holes in the address
space when we concurrently remap a part of a vma into the middle of
another vma. This remapping results in the destination vma being split
into three parts, with the middle part then being patched back to its
original protection, all done concurrently from under the reader. We
should always see either the original vma or the split one, with no holes.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-maps-race.c | 92 +++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 10365b4e68e1..764821ffd63d 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -416,6 +416,95 @@ static void test_maps_tearing_from_resize(int maps_fd,
 	signal_state(mod_info, TEST_DONE);
 }
 
+static inline void remap_vma(const struct vma_modifier_info *mod_info)
+{
+	/*
+	 * Remap the last page of the next vma into the middle of the vma.
+	 * This splits the current vma and the first and middle parts (the
+	 * parts at lower addresses) become the last vma observed in the
+	 * first page and the first vma observed in the last page.
+	 */
+	assert(mremap(mod_info->next_addr + page_size * 2, page_size,
+		      page_size, MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
+		      mod_info->addr + page_size) != MAP_FAILED);
+}
+
+static inline void patch_vma(const struct vma_modifier_info *mod_info)
+{
+	assert(!mprotect(mod_info->addr + page_size, page_size,
+			 mod_info->prot));
+}
+
+static inline void check_remap_result(struct line_content *mod_last_line,
+				      struct line_content *mod_first_line,
+				      struct line_content *restored_last_line,
+				      struct line_content *restored_first_line)
+{
+	/* Make sure vmas at the boundaries are changing */
+	assert(strcmp(mod_last_line->text, restored_last_line->text) != 0);
+	assert(strcmp(mod_first_line->text, restored_first_line->text) != 0);
+}
+
+static void test_maps_tearing_from_remap(int maps_fd,
+				struct vma_modifier_info *mod_info,
+				struct page_content *page1,
+				struct page_content *page2,
+				struct line_content *last_line,
+				struct line_content *first_line)
+{
+	struct line_content remapped_last_line;
+	struct line_content remapped_first_line;
+	struct line_content restored_last_line;
+	struct line_content restored_first_line;
+
+	wait_for_state(mod_info, SETUP_READY);
+
+	/* re-read the file to avoid using stale data from previous test */
+	read_boundary_lines(maps_fd, page1, page2, last_line, first_line);
+
+	mod_info->vma_modify = remap_vma;
+	mod_info->vma_restore = patch_vma;
+	mod_info->vma_mod_check = check_remap_result;
+
+	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
+			    &remapped_last_line, &remapped_first_line,
+			    &restored_last_line, &restored_first_line);
+
+	/* Now start concurrent modifications for test_duration_sec */
+	signal_state(mod_info, TEST_READY);
+
+	struct line_content new_last_line;
+	struct line_content new_first_line;
+	struct timespec start_ts, end_ts;
+
+	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	do {
+		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
+
+		/* Check if we read vmas after remapping it */
+		if (!strcmp(new_last_line.text, remapped_last_line.text)) {
+			/*
+			 * The vmas should be consistent with remap results,
+			 * however if the vma was concurrently restored, it
+			 * can be reported twice (first as split one, then
+			 * as restored one) because we found it as the next vma
+			 * again. In that case new first line will be the same
+			 * as the last restored line.
+			 */
+			assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
+			       !strcmp(new_first_line.text, restored_last_line.text));
+		} else {
+			/* The vmas should be consistent with the original/restored state */
+			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
+			       !strcmp(new_first_line.text, restored_first_line.text));
+		}
+		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+
+	/* Signal the modifier thread to stop and wait until it exits */
+	signal_state(mod_info, TEST_DONE);
+}
+
 int usage(void)
 {
 	fprintf(stderr, "Userland /proc/pid/{s}maps race test cases\n");
@@ -527,6 +616,9 @@ int main(int argc, char **argv)
 	test_maps_tearing_from_resize(maps_fd, mod_info, &page1, &page2,
 				      &last_line, &first_line);
 
+	test_maps_tearing_from_remap(maps_fd, mod_info, &page1, &page2,
+				     &last_line, &first_line);
+
 	stop_vma_modifier(mod_info);
 
 	free(page2.data);
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v6 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2025-07-04  6:07 ` [PATCH v6 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 5/8] selftests/proc: add verbose mode for tests to facilitate debugging Suren Baghdasaryan
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Extend /proc/pid/maps tearing test to verify PROCMAP_QUERY ioctl operation
correctness while the vma is being concurrently modified.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-maps-race.c | 62 +++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 764821ffd63d..6acdafdac9db 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -30,6 +30,8 @@
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <linux/fs.h>
+#include <sys/ioctl.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/types.h>
@@ -239,6 +241,21 @@ static void capture_mod_pattern(int maps_fd,
 	assert(strcmp(restored_first_line->text, first_line->text) == 0);
 }
 
+static void query_addr_at(int maps_fd, void *addr,
+			  unsigned long *vma_start, unsigned long *vma_end)
+{
+	struct procmap_query q;
+
+	memset(&q, 0, sizeof(q));
+	q.size = sizeof(q);
+	/* Find the VMA at the given address */
+	q.query_addr = (unsigned long long)addr;
+	q.query_flags = 0;
+	assert(!ioctl(maps_fd, PROCMAP_QUERY, &q));
+	*vma_start = q.vma_start;
+	*vma_end = q.vma_end;
+}
+
 static inline void split_vma(const struct vma_modifier_info *mod_info)
 {
 	assert(mmap(mod_info->addr, page_size, mod_info->prot | PROT_EXEC,
@@ -299,6 +316,8 @@ static void test_maps_tearing_from_split(int maps_fd,
 	do {
 		bool last_line_changed;
 		bool first_line_changed;
+		unsigned long vma_start;
+		unsigned long vma_end;
 
 		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
 
@@ -329,6 +348,19 @@ static void test_maps_tearing_from_split(int maps_fd,
 		first_line_changed = strcmp(new_first_line.text, first_line->text) != 0;
 		assert(last_line_changed == first_line_changed);
 
+		/* Check if PROCMAP_QUERY ioctl() finds the right VMA */
+		query_addr_at(maps_fd, mod_info->addr + page_size,
+			      &vma_start, &vma_end);
+		/*
+		 * The vma at the split address can be either the same as
+		 * original one (if read before the split) or the same as the
+		 * first line in the second page (if read after the split).
+		 */
+		assert((vma_start == last_line->start_addr &&
+			vma_end == last_line->end_addr) ||
+		       (vma_start == split_first_line.start_addr &&
+			vma_end == split_first_line.end_addr));
+
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
@@ -390,6 +422,9 @@ static void test_maps_tearing_from_resize(int maps_fd,
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
 	do {
+		unsigned long vma_start;
+		unsigned long vma_end;
+
 		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
 
 		/* Check if we read vmas after shrinking it */
@@ -409,6 +444,17 @@ static void test_maps_tearing_from_resize(int maps_fd,
 			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
 			       !strcmp(new_first_line.text, restored_first_line.text));
 		}
+
+		/* Check if PROCMAP_QUERY ioctl() finds the right VMA */
+		query_addr_at(maps_fd, mod_info->addr, &vma_start, &vma_end);
+		/*
+		 * The vma should stay at the same address and have either the
+		 * original size of 3 pages or 1 page if read after shrinking.
+		 */
+		assert(vma_start == last_line->start_addr &&
+		       (vma_end - vma_start == page_size * 3 ||
+			vma_end - vma_start == page_size));
+
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
@@ -479,6 +525,9 @@ static void test_maps_tearing_from_remap(int maps_fd,
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
 	do {
+		unsigned long vma_start;
+		unsigned long vma_end;
+
 		read_boundary_lines(maps_fd, page1, page2, &new_last_line, &new_first_line);
 
 		/* Check if we read vmas after remapping it */
@@ -498,6 +547,19 @@ static void test_maps_tearing_from_remap(int maps_fd,
 			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
 			       !strcmp(new_first_line.text, restored_first_line.text));
 		}
+
+		/* Check if PROCMAP_QUERY ioctl() finds the right VMA */
+		query_addr_at(maps_fd, mod_info->addr + page_size, &vma_start, &vma_end);
+		/*
+		 * The vma should either stay at the same address and have the
+		 * original size of 3 pages or we should find the remapped vma
+		 * at the remap destination address with size of 1 page.
+		 */
+		assert((vma_start == last_line->start_addr &&
+			vma_end - vma_start == page_size * 3) ||
+		       (vma_start == last_line->start_addr + page_size &&
+			vma_end - vma_start == page_size));
+
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
 
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v6 5/8] selftests/proc: add verbose mode for tests to facilitate debugging
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2025-07-04  6:07 ` [PATCH v6 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-04  6:07 ` [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned Suren Baghdasaryan
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Add a verbose mode to the proc-maps-race test to print debugging
information.
Usage: proc-maps-race -v

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 tools/testing/selftests/proc/proc-maps-race.c | 159 ++++++++++++++++--
 1 file changed, 146 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/proc/proc-maps-race.c b/tools/testing/selftests/proc/proc-maps-race.c
index 6acdafdac9db..5f912fedd6cf 100644
--- a/tools/testing/selftests/proc/proc-maps-race.c
+++ b/tools/testing/selftests/proc/proc-maps-race.c
@@ -39,6 +39,7 @@
 
 static unsigned long test_duration_sec = 5UL;
 static int page_size;
+static bool verbose;
 
 /* /proc/pid/maps parsing routines */
 struct page_content {
@@ -207,6 +208,99 @@ static void stop_vma_modifier(struct vma_modifier_info *mod_info)
 	signal_state(mod_info, SETUP_MODIFY_MAPS);
 }
 
+static void print_first_lines(char *text, int nr)
+{
+	const char *end = text;
+
+	while (nr && (end = strchr(end, '\n')) != NULL) {
+		nr--;
+		end++;
+	}
+
+	if (end) {
+		int offs = end - text;
+
+		text[offs] = '\0';
+		printf(text);
+		text[offs] = '\n';
+		printf("\n");
+	} else {
+		printf(text);
+	}
+}
+
+static void print_last_lines(char *text, int nr)
+{
+	const char *start = text + strlen(text);
+
+	nr++; /* to ignore the last newline */
+	while (nr) {
+		while (start > text && *start != '\n')
+			start--;
+		nr--;
+		start--;
+	}
+	printf(start);
+}
+
+static void print_boundaries(const char *title,
+			     struct page_content *page1,
+			     struct page_content *page2)
+{
+	if (!verbose)
+		return;
+
+	printf("%s", title);
+	/* Print 3 boundary lines from each page */
+	print_last_lines(page1->data, 3);
+	printf("-----------------page boundary-----------------\n");
+	print_first_lines(page2->data, 3);
+}
+
+static bool print_boundaries_on(bool condition, const char *title,
+				struct page_content *page1,
+				struct page_content *page2)
+{
+	if (verbose && condition)
+		print_boundaries(title, page1, page2);
+
+	return condition;
+}
+
+static void report_test_start(const char *name)
+{
+	if (verbose)
+		printf("==== %s ====\n", name);
+}
+
+static struct timespec print_ts;
+
+static void start_test_loop(struct timespec *ts)
+{
+	if (verbose)
+		print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_iteration(struct timespec *ts)
+{
+	if (!verbose)
+		return;
+
+	/* Update every second */
+	if (print_ts.tv_sec == ts->tv_sec)
+		return;
+
+	printf(".");
+	fflush(stdout);
+	print_ts.tv_sec = ts->tv_sec;
+}
+
+static void end_test_loop(void)
+{
+	if (verbose)
+		printf("\n");
+}
+
 static void capture_mod_pattern(int maps_fd,
 				struct vma_modifier_info *mod_info,
 				struct page_content *page1,
@@ -218,18 +312,24 @@ static void capture_mod_pattern(int maps_fd,
 				struct line_content *restored_last_line,
 				struct line_content *restored_first_line)
 {
+	print_boundaries("Before modification", page1, page2);
+
 	signal_state(mod_info, SETUP_MODIFY_MAPS);
 	wait_for_state(mod_info, SETUP_MAPS_MODIFIED);
 
 	/* Copy last line of the first page and first line of the last page */
 	read_boundary_lines(maps_fd, page1, page2, mod_last_line, mod_first_line);
 
+	print_boundaries("After modification", page1, page2);
+
 	signal_state(mod_info, SETUP_RESTORE_MAPS);
 	wait_for_state(mod_info, SETUP_MAPS_RESTORED);
 
 	/* Copy last line of the first page and first line of the last page */
 	read_boundary_lines(maps_fd, page1, page2, restored_last_line, restored_first_line);
 
+	print_boundaries("After restore", page1, page2);
+
 	mod_info->vma_mod_check(mod_last_line, mod_first_line,
 				restored_last_line, restored_first_line);
 
@@ -301,6 +401,7 @@ static void test_maps_tearing_from_split(int maps_fd,
 	mod_info->vma_restore = merge_vma;
 	mod_info->vma_mod_check = check_split_result;
 
+	report_test_start("Tearing from split");
 	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
 			    &split_last_line, &split_first_line,
 			    &restored_last_line, &restored_first_line);
@@ -313,6 +414,7 @@ static void test_maps_tearing_from_split(int maps_fd,
 	struct timespec start_ts, end_ts;
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	start_test_loop(&start_ts);
 	do {
 		bool last_line_changed;
 		bool first_line_changed;
@@ -332,12 +434,18 @@ static void test_maps_tearing_from_split(int maps_fd,
 			 * In that case new first line will be the same as the
 			 * last restored line.
 			 */
-			assert(!strcmp(new_first_line.text, split_first_line.text) ||
-			       !strcmp(new_first_line.text, restored_last_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, split_first_line.text) &&
+					strcmp(new_first_line.text, restored_last_line.text),
+					"Split result invalid", page1, page2));
 		} else {
 			/* The vmas should be consistent with merge results */
-			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
-			       !strcmp(new_first_line.text, restored_first_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_last_line.text, restored_last_line.text),
+					"Merge result invalid", page1, page2));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, restored_first_line.text),
+					"Merge result invalid", page1, page2));
 		}
 		/*
 		 * First and last lines should change in unison. If the last
@@ -362,7 +470,9 @@ static void test_maps_tearing_from_split(int maps_fd,
 			vma_end == split_first_line.end_addr));
 
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+		end_test_iteration(&end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+	end_test_loop();
 
 	/* Signal the modifier thread to stop and wait until it exits */
 	signal_state(mod_info, TEST_DONE);
@@ -409,6 +519,7 @@ static void test_maps_tearing_from_resize(int maps_fd,
 	mod_info->vma_restore = expand_vma;
 	mod_info->vma_mod_check = check_shrink_result;
 
+	report_test_start("Tearing from resize");
 	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
 			    &shrunk_last_line, &shrunk_first_line,
 			    &restored_last_line, &restored_first_line);
@@ -421,6 +532,7 @@ static void test_maps_tearing_from_resize(int maps_fd,
 	struct timespec start_ts, end_ts;
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	start_test_loop(&start_ts);
 	do {
 		unsigned long vma_start;
 		unsigned long vma_end;
@@ -437,12 +549,18 @@ static void test_maps_tearing_from_resize(int maps_fd,
 			 * again. In that case new first line will be the same
 			 * as the last restored line.
 			 */
-			assert(!strcmp(new_first_line.text, shrunk_first_line.text) ||
-			       !strcmp(new_first_line.text, restored_last_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, shrunk_first_line.text) &&
+					strcmp(new_first_line.text, restored_last_line.text),
+					"Shrink result invalid", page1, page2));
 		} else {
 			/* The vmas should be consistent with the original/restored state */
-			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
-			       !strcmp(new_first_line.text, restored_first_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_last_line.text, restored_last_line.text),
+					"Expand result invalid", page1, page2));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, restored_first_line.text),
+					"Expand result invalid", page1, page2));
 		}
 
 		/* Check if PROCMAP_QUERY ioctl() finds the right VMA */
@@ -456,7 +574,9 @@ static void test_maps_tearing_from_resize(int maps_fd,
 			vma_end - vma_start == page_size));
 
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+		end_test_iteration(&end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+	end_test_loop();
 
 	/* Signal the modifier thread to stop and wait until it exits */
 	signal_state(mod_info, TEST_DONE);
@@ -512,6 +632,7 @@ static void test_maps_tearing_from_remap(int maps_fd,
 	mod_info->vma_restore = patch_vma;
 	mod_info->vma_mod_check = check_remap_result;
 
+	report_test_start("Tearing from remap");
 	capture_mod_pattern(maps_fd, mod_info, page1, page2, last_line, first_line,
 			    &remapped_last_line, &remapped_first_line,
 			    &restored_last_line, &restored_first_line);
@@ -524,6 +645,7 @@ static void test_maps_tearing_from_remap(int maps_fd,
 	struct timespec start_ts, end_ts;
 
 	clock_gettime(CLOCK_MONOTONIC_COARSE, &start_ts);
+	start_test_loop(&start_ts);
 	do {
 		unsigned long vma_start;
 		unsigned long vma_end;
@@ -540,12 +662,18 @@ static void test_maps_tearing_from_remap(int maps_fd,
 			 * again. In that case new first line will be the same
 			 * as the last restored line.
 			 */
-			assert(!strcmp(new_first_line.text, remapped_first_line.text) ||
-			       !strcmp(new_first_line.text, restored_last_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, remapped_first_line.text) &&
+					strcmp(new_first_line.text, restored_last_line.text),
+					"Remap result invalid", page1, page2));
 		} else {
 			/* The vmas should be consistent with the original/restored state */
-			assert(!strcmp(new_last_line.text, restored_last_line.text) &&
-			       !strcmp(new_first_line.text, restored_first_line.text));
+			assert(!print_boundaries_on(
+					strcmp(new_last_line.text, restored_last_line.text),
+					"Remap restore result invalid", page1, page2));
+			assert(!print_boundaries_on(
+					strcmp(new_first_line.text, restored_first_line.text),
+					"Remap restore result invalid", page1, page2));
 		}
 
 		/* Check if PROCMAP_QUERY ioctl() finds the right VMA */
@@ -561,7 +689,9 @@ static void test_maps_tearing_from_remap(int maps_fd,
 			vma_end - vma_start == page_size));
 
 		clock_gettime(CLOCK_MONOTONIC_COARSE, &end_ts);
+		end_test_iteration(&end_ts);
 	} while (end_ts.tv_sec - start_ts.tv_sec < test_duration_sec);
+	end_test_loop();
 
 	/* Signal the modifier thread to stop and wait until it exits */
 	signal_state(mod_info, TEST_DONE);
@@ -571,6 +701,7 @@ int usage(void)
 {
 	fprintf(stderr, "Userland /proc/pid/{s}maps race test cases\n");
 	fprintf(stderr, "  -d: Duration for time-consuming tests\n");
+	fprintf(stderr, "  -v: Verbose mode\n");
 	fprintf(stderr, "  -h: Help screen\n");
 	exit(-1);
 }
@@ -588,9 +719,11 @@ int main(int argc, char **argv)
 	pid_t pid;
 	int opt;
 
-	while ((opt = getopt(argc, argv, "d:h")) != -1) {
+	while ((opt = getopt(argc, argv, "d:vh")) != -1) {
 		if (opt == 'd')
 			test_duration_sec = strtoul(optarg, NULL, 0);
+		else if (opt == 'v')
+			verbose = true;
 		else if (opt == 'h')
 			usage();
 	}
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2025-07-04  6:07 ` [PATCH v6 5/8] selftests/proc: add verbose mode for tests to facilitate debugging Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-07 15:01   ` Lorenzo Stoakes
  2025-07-08 17:37   ` Vlastimil Babka
  2025-07-04  6:07 ` [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock Suren Baghdasaryan
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Back in the 2.6 era, last_addr used to be stored in the
seq_file->version variable, which was an unsigned long. As a result,
the sentinels representing the gate vma and the end of all vmas used
unsigned values. More recent kernels no longer use seq_file->version,
so the conversion from loff_t into an unsigned type is not needed.
Similarly, the sentinel values don't need to be unsigned. Remove the
type conversion for the seq_file position and change the sentinel
values to signed.
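
As an aside for readers less familiar with the convention, here is a
small userspace-only sketch (not kernel code; the variable names are
made up for illustration) of why the signed position lets the sentinel
comparisons be written with plain signed constants:

#include <stdio.h>

int main(void)
{
	long long last_addr = -2;	/* loff_t is a signed 64-bit type */
	unsigned long old_addr = -2UL;	/* the old storage type */

	/* With a signed position the sentinels are plain signed constants. */
	if (last_addr == -2)
		printf("gate vma sentinel\n");

	/* The old code kept the position in an unsigned long and spelled
	 * the sentinels as -1UL / -2UL to match that type. */
	if (old_addr == -2UL)
		printf("old-style sentinel\n");

	return 0;
}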

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/task_mmu.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 751479eb128f..b8bc06d05a72 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -135,7 +135,7 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
 	if (vma) {
 		*ppos = vma->vm_start;
 	} else {
-		*ppos = -2UL;
+		*ppos = -2;
 		vma = get_gate_vma(priv->mm);
 	}
 
@@ -145,11 +145,11 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
 static void *m_start(struct seq_file *m, loff_t *ppos)
 {
 	struct proc_maps_private *priv = m->private;
-	unsigned long last_addr = *ppos;
+	loff_t last_addr = *ppos;
 	struct mm_struct *mm;
 
 	/* See m_next(). Zero at the start or after lseek. */
-	if (last_addr == -1UL)
+	if (last_addr == -1)
 		return NULL;
 
 	priv->task = get_proc_task(priv->inode);
@@ -170,9 +170,9 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return ERR_PTR(-EINTR);
 	}
 
-	vma_iter_init(&priv->iter, mm, last_addr);
+	vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
 	hold_task_mempolicy(priv);
-	if (last_addr == -2UL)
+	if (last_addr == -2)
 		return get_gate_vma(mm);
 
 	return proc_get_vma(priv, ppos);
@@ -180,8 +180,8 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 
 static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
 {
-	if (*ppos == -2UL) {
-		*ppos = -1UL;
+	if (*ppos == -2) {
+		*ppos = -1;
 		return NULL;
 	}
 	return proc_get_vma(m->private, ppos);
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (5 preceding siblings ...)
  2025-07-04  6:07 ` [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-07 16:51   ` Lorenzo Stoakes
                     ` (2 more replies)
  2025-07-04  6:07 ` [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
  2025-07-15  8:10 ` [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Lorenzo Stoakes
  8 siblings, 3 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

With maple_tree supporting vma tree traversal under RCU and per-vma
locks, /proc/pid/maps can be read while holding individual vma locks
instead of locking the entire address space. A completely lockless
approach (walking the vma tree under RCU) would be quite complex, with
the main issue being get_vma_name() using callbacks which might not
work correctly with a stable vma copy, requiring the original
(unstable) vma - see special_mapping_name() for an example.

When per-vma lock acquisition fails, we take the mmap_lock for reading,
lock the vma, release the mmap_lock and continue. This fallback to the
mmap read lock guarantees that the reader makes forward progress even
during lock contention. It will interfere with the writer, but only for
a very short time while we are acquiring the per-vma lock and only when
there was contention on the vma the reader is interested in. We
shouldn't see repeated fallbacks to mmap read locks in practice, as
this requires a very unlikely series of lock contentions (for instance
due to repeated vma split operations). However, even if this did
somehow happen, we would still make progress.

One case requiring special handling is when a vma changes between the
time it was found and the time it got locked. A problematic case would
be if the vma got shrunk so that its start moved higher in the address
space and a new vma was installed at the beginning:

reader found:               |--------VMA A--------|
VMA is modified:            |-VMA B-|----VMA A----|
reader locks modified VMA A
reader reports VMA A:       |  gap  |----VMA A----|

This would result in reporting a gap in the address space that does not
exist. To prevent this we retry the lookup after locking the vma;
however, we do that only when we identify a gap and detect that the
address space was changed after we found the vma.

This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such
as monitoring/data collection services) from blocking address space
updates. Note that this change has a userspace-visible disadvantage: it
allows for sub-page data tearing, as opposed to the previous mechanism
where data tearing could happen only between pages of generated output
data. Since current userspace considers data tearing between pages to
be acceptable, we assume it will be able to handle sub-page data
tearing as well.
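
To illustrate the userspace-visible assumption above, here is a trivial
reader sketch (not part of this patch, no error handling) that consumes
whatever lines are produced and does not expect a fully coherent
snapshot across lines:

#include <stdio.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return 1;
	/*
	 * Each line describes one vma as it was when that line was
	 * generated; with per-vma locking, consecutive lines may reflect
	 * slightly different points in time, so a consumer that needs a
	 * more consistent view should simply re-read the file.
	 */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}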

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/internal.h        |   5 ++
 fs/proc/task_mmu.c        | 118 ++++++++++++++++++++++++++++++++++----
 include/linux/mmap_lock.h |  11 ++++
 mm/madvise.c              |   3 +-
 mm/mmap_lock.c            |  88 ++++++++++++++++++++++++++++
 5 files changed, 214 insertions(+), 11 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 3d48ffe72583..7c235451c5ea 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -384,6 +384,11 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	loff_t last_pos;
+#ifdef CONFIG_PER_VMA_LOCK
+	bool mmap_locked;
+	struct vm_area_struct *locked_vma;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b8bc06d05a72..ff3fe488ce51 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -127,15 +127,107 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
 
-static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
-						loff_t *ppos)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static void unlock_vma(struct proc_maps_private *priv)
+{
+	if (priv->locked_vma) {
+		vma_end_read(priv->locked_vma);
+		priv->locked_vma = NULL;
+	}
+}
+
+static const struct seq_operations proc_pid_maps_op;
+
+static inline bool lock_vma_range(struct seq_file *m,
+				  struct proc_maps_private *priv)
+{
+	/*
+	 * smaps and numa_maps perform page table walk, therefore require
+	 * mmap_lock but maps can be read with locking just the vma.
+	 */
+	if (m->op != &proc_pid_maps_op) {
+		if (mmap_read_lock_killable(priv->mm))
+			return false;
+
+		priv->mmap_locked = true;
+	} else {
+		rcu_read_lock();
+		priv->locked_vma = NULL;
+		priv->mmap_locked = false;
+	}
+
+	return true;
+}
+
+static inline void unlock_vma_range(struct proc_maps_private *priv)
+{
+	if (priv->mmap_locked) {
+		mmap_read_unlock(priv->mm);
+	} else {
+		unlock_vma(priv);
+		rcu_read_unlock();
+	}
+}
+
+static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
+					   loff_t last_pos)
+{
+	struct vm_area_struct *vma;
+
+	if (priv->mmap_locked)
+		return vma_next(&priv->iter);
+
+	unlock_vma(priv);
+	vma = lock_next_vma(priv->mm, &priv->iter, last_pos);
+	if (!IS_ERR_OR_NULL(vma))
+		priv->locked_vma = vma;
+
+	return vma;
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline bool lock_vma_range(struct seq_file *m,
+				  struct proc_maps_private *priv)
 {
-	struct vm_area_struct *vma = vma_next(&priv->iter);
+	return mmap_read_lock_killable(priv->mm) == 0;
+}
+
+static inline void unlock_vma_range(struct proc_maps_private *priv)
+{
+	mmap_read_unlock(priv->mm);
+}
+
+static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
+					   loff_t last_pos)
+{
+	return vma_next(&priv->iter);
+}
 
+#endif /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma;
+
+	vma = get_next_vma(priv, *ppos);
+	/* EINTR is possible */
+	if (IS_ERR(vma))
+		return vma;
+
+	/* Store previous position to be able to restart if needed */
+	priv->last_pos = *ppos;
 	if (vma) {
-		*ppos = vma->vm_start;
+		/*
+		 * Track the end of the reported vma to ensure position changes
+		 * even if previous vma was merged with the next vma and we
+		 * found the extended vma with the same vm_start.
+		 */
+		*ppos = vma->vm_end;
 	} else {
-		*ppos = -2;
+		*ppos = -2; /* -2 indicates gate vma */
 		vma = get_gate_vma(priv->mm);
 	}
 
@@ -163,28 +255,34 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return NULL;
 	}
 
-	if (mmap_read_lock_killable(mm)) {
+	if (!lock_vma_range(m, priv)) {
 		mmput(mm);
 		put_task_struct(priv->task);
 		priv->task = NULL;
 		return ERR_PTR(-EINTR);
 	}
 
+	/*
+	 * Reset current position if last_addr was set before
+	 * and it's not a sentinel.
+	 */
+	if (last_addr > 0)
+		*ppos = last_addr = priv->last_pos;
 	vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
 	hold_task_mempolicy(priv);
 	if (last_addr == -2)
 		return get_gate_vma(mm);
 
-	return proc_get_vma(priv, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
 {
 	if (*ppos == -2) {
-		*ppos = -1;
+		*ppos = -1; /* -1 indicates no more vmas */
 		return NULL;
 	}
-	return proc_get_vma(m->private, ppos);
+	return proc_get_vma(m, ppos);
 }
 
 static void m_stop(struct seq_file *m, void *v)
@@ -196,7 +294,7 @@ static void m_stop(struct seq_file *m, void *v)
 		return;
 
 	release_task_mempolicy(priv);
-	mmap_read_unlock(mm);
+	unlock_vma_range(priv);
 	mmput(mm);
 	put_task_struct(priv->task);
 	priv->task = NULL;
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 5da384bd0a26..1f4f44951abe 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -309,6 +309,17 @@ void vma_mark_detached(struct vm_area_struct *vma);
 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 					  unsigned long address);
 
+/*
+ * Locks next vma pointed by the iterator. Confirms the locked vma has not
+ * been modified and will retry under mmap_lock protection if modification
+ * was detected. Should be called from read RCU section.
+ * Returns either a valid locked VMA, NULL if no more VMAs or -EINTR if the
+ * process was interrupted.
+ */
+struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
+				     struct vma_iterator *iter,
+				     unsigned long address);
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
diff --git a/mm/madvise.c b/mm/madvise.c
index a34c2c89a53b..e61e32b2cd91 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -108,7 +108,8 @@ void anon_vma_name_free(struct kref *kref)
 
 struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
 {
-	mmap_assert_locked(vma->vm_mm);
+	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
+		vma_assert_locked(vma);
 
 	return vma->anon_name;
 }
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 5f725cc67334..ed0e5e2171cd 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -178,6 +178,94 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 	count_vm_vma_lock_event(VMA_LOCK_ABORT);
 	return NULL;
 }
+
+static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
+						       struct vma_iterator *iter,
+						       unsigned long address)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	ret = mmap_read_lock_killable(mm);
+	if (ret)
+		return ERR_PTR(ret);
+
+	/* Lookup the vma at the last position again under mmap_read_lock */
+	vma_iter_init(iter, mm, address);
+	vma = vma_next(iter);
+	if (vma)
+		vma_start_read_locked(vma);
+
+	mmap_read_unlock(mm);
+
+	return vma;
+}
+
+struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
+				     struct vma_iterator *iter,
+				     unsigned long address)
+{
+	struct vm_area_struct *vma;
+	unsigned int mm_wr_seq;
+	bool mmap_unlocked;
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
+retry:
+	/* Start mmap_lock speculation in case we need to verify the vma later */
+	mmap_unlocked = mmap_lock_speculate_try_begin(mm, &mm_wr_seq);
+	vma = vma_next(iter);
+	if (!vma)
+		return NULL;
+
+	vma = vma_start_read(mm, vma);
+
+	if (IS_ERR_OR_NULL(vma)) {
+		/*
+		 * Retry immediately if the vma gets detached from under us.
+		 * Infinite loop should not happen because the vma we find will
+		 * have to be constantly knocked out from under us.
+		 */
+		if (PTR_ERR(vma) == -EAGAIN) {
+			vma_iter_init(iter, mm, address);
+			goto retry;
+		}
+
+		goto out;
+	}
+
+	/*
+	 * Verify the vma we locked belongs to the same address space and it's
+	 * not behind of the last search position.
+	 */
+	if (unlikely(vma->vm_mm != mm || address >= vma->vm_end))
+		goto out_unlock;
+
+	/*
+	 * vma can be ahead of the last search position but we need to verify
+	 * it was not shrunk after we found it and another vma has not been
+	 * installed ahead of it. Otherwise we might observe a gap that should
+	 * not be there.
+	 */
+	if (address < vma->vm_start) {
+		/* Verify only if the address space might have changed since vma lookup. */
+		if (!mmap_unlocked || mmap_lock_speculate_retry(mm, mm_wr_seq)) {
+			vma_iter_init(iter, mm, address);
+			if (vma != vma_next(iter))
+				goto out_unlock;
+		}
+	}
+
+	return vma;
+
+out_unlock:
+	vma_end_read(vma);
+out:
+	rcu_read_unlock();
+	vma = lock_vma_under_mmap_lock(mm, iter, address);
+	rcu_read_lock();
+
+	return vma;
+}
 #endif /* CONFIG_PER_VMA_LOCK */
 
 #ifdef CONFIG_LOCK_MM_AND_FIND_VMA
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (6 preceding siblings ...)
  2025-07-04  6:07 ` [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock Suren Baghdasaryan
@ 2025-07-04  6:07 ` Suren Baghdasaryan
  2025-07-07 16:54   ` Lorenzo Stoakes
  2025-07-07 18:26   ` Liam R. Howlett
  2025-07-15  8:10 ` [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Lorenzo Stoakes
  8 siblings, 2 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-04  6:07 UTC (permalink / raw)
  To: akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, vbabka, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest, surenb

Utilize per-vma locks to stabilize the vma after lookup without taking
mmap_lock during PROCMAP_QUERY ioctl execution. While we might take
mmap_lock for reading during contention, we do that only momentarily,
to lock the vma.

This change is designed to reduce mmap_lock contention and prevent
PROCMAP_QUERY ioctl calls from blocking address space updates.
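
For context, a minimal userspace sketch of the ioctl being optimized
here (simplified, no error handling; it assumes kernel headers that
provide PROCMAP_QUERY and struct procmap_query via <linux/fs.h>):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* struct procmap_query, PROCMAP_QUERY */

int main(void)
{
	struct procmap_query q;
	int fd = open("/proc/self/maps", O_RDONLY);

	if (fd < 0)
		return 1;
	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);
	q.query_addr = (unsigned long)&q;	/* any address of interest */
	q.query_flags = 0;			/* exact covering vma only */

	if (ioctl(fd, PROCMAP_QUERY, &q) == 0)
		printf("vma %llx-%llx\n",
		       (unsigned long long)q.vma_start,
		       (unsigned long long)q.vma_end);
	return 0;
}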

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/proc/task_mmu.c | 60 ++++++++++++++++++++++++++++++++++++----------
 1 file changed, 48 insertions(+), 12 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ff3fe488ce51..0496d5969a51 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -487,28 +487,64 @@ static int pid_maps_open(struct inode *inode, struct file *file)
 		PROCMAP_QUERY_VMA_FLAGS				\
 )
 
-static int query_vma_setup(struct mm_struct *mm)
+#ifdef CONFIG_PER_VMA_LOCK
+
+static int query_vma_setup(struct proc_maps_private *priv)
 {
-	return mmap_read_lock_killable(mm);
+	priv->locked_vma = NULL;
+	priv->mmap_locked = false;
+
+	return 0;
 }
 
-static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+static void query_vma_teardown(struct proc_maps_private *priv)
 {
-	mmap_read_unlock(mm);
+	unlock_vma(priv);
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
+						     unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	rcu_read_lock();
+	vma_iter_init(&priv->iter, priv->mm, addr);
+	vma = get_next_vma(priv, addr);
+	rcu_read_unlock();
+
+	return vma;
 }
 
-static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+#else /* CONFIG_PER_VMA_LOCK */
+
+static int query_vma_setup(struct proc_maps_private *priv)
 {
-	return find_vma(mm, addr);
+	return mmap_read_lock_killable(priv->mm);
 }
 
-static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
+static void query_vma_teardown(struct proc_maps_private *priv)
+{
+	mmap_read_unlock(priv->mm);
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
+						     unsigned long addr)
+{
+	return find_vma(priv->mm, addr);
+}
+
+#endif  /* CONFIG_PER_VMA_LOCK */
+
+static struct vm_area_struct *query_matching_vma(struct proc_maps_private *priv,
 						 unsigned long addr, u32 flags)
 {
 	struct vm_area_struct *vma;
 
 next_vma:
-	vma = query_vma_find_by_addr(mm, addr);
+	vma = query_vma_find_by_addr(priv, addr);
+	if (IS_ERR(vma))
+		return vma;
+
 	if (!vma)
 		goto no_vma;
 
@@ -584,13 +620,13 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	if (!mm || !mmget_not_zero(mm))
 		return -ESRCH;
 
-	err = query_vma_setup(mm);
+	err = query_vma_setup(priv);
 	if (err) {
 		mmput(mm);
 		return err;
 	}
 
-	vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
+	vma = query_matching_vma(priv, karg.query_addr, karg.query_flags);
 	if (IS_ERR(vma)) {
 		err = PTR_ERR(vma);
 		vma = NULL;
@@ -675,7 +711,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	}
 
 	/* unlock vma or mmap_lock, and put mm_struct before copying data to user */
-	query_vma_teardown(mm, vma);
+	query_vma_teardown(priv);
 	mmput(mm);
 
 	if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr),
@@ -695,7 +731,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	return 0;
 
 out:
-	query_vma_teardown(mm, vma);
+	query_vma_teardown(priv);
 	mmput(mm);
 	kfree(name_buf);
 	return err;
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  2025-07-04  6:07 ` [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned Suren Baghdasaryan
@ 2025-07-07 15:01   ` Lorenzo Stoakes
  2025-07-08 17:37   ` Vlastimil Babka
  1 sibling, 0 replies; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-07 15:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Thu, Jul 03, 2025 at 11:07:24PM -0700, Suren Baghdasaryan wrote:
> Back in 2.6 era, last_addr used to be stored in seq_file->version
> variable, which was unsigned long. As a result, sentinels to represent
> gate vma and end of all vmas used unsigned values. In more recent
> kernels we don't used seq_file->version anymore and therefore conversion
> from loff_t into unsigned type is not needed. Similarly, sentinel values
> don't need to be unsigned. Remove type conversion for set_file position
> and change sentinel values to signed.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Nice spot!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  fs/proc/task_mmu.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 751479eb128f..b8bc06d05a72 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -135,7 +135,7 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
>  	if (vma) {
>  		*ppos = vma->vm_start;
>  	} else {
> -		*ppos = -2UL;
> +		*ppos = -2;
>  		vma = get_gate_vma(priv->mm);
>  	}
>
> @@ -145,11 +145,11 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
>  static void *m_start(struct seq_file *m, loff_t *ppos)
>  {
>  	struct proc_maps_private *priv = m->private;
> -	unsigned long last_addr = *ppos;
> +	loff_t last_addr = *ppos;
>  	struct mm_struct *mm;
>
>  	/* See m_next(). Zero at the start or after lseek. */
> -	if (last_addr == -1UL)
> +	if (last_addr == -1)
>  		return NULL;
>
>  	priv->task = get_proc_task(priv->inode);
> @@ -170,9 +170,9 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
>  		return ERR_PTR(-EINTR);
>  	}
>
> -	vma_iter_init(&priv->iter, mm, last_addr);
> +	vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
>  	hold_task_mempolicy(priv);
> -	if (last_addr == -2UL)
> +	if (last_addr == -2)
>  		return get_gate_vma(mm);
>
>  	return proc_get_vma(priv, ppos);
> @@ -180,8 +180,8 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
>
>  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
>  {
> -	if (*ppos == -2UL) {
> -		*ppos = -1UL;
> +	if (*ppos == -2) {
> +		*ppos = -1;
>  		return NULL;
>  	}
>  	return proc_get_vma(m->private, ppos);
> --
> 2.50.0.727.gbf7dc18ff4-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-04  6:07 ` [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock Suren Baghdasaryan
@ 2025-07-07 16:51   ` Lorenzo Stoakes
  2025-07-07 23:10     ` Suren Baghdasaryan
  2025-07-07 18:20   ` Liam R. Howlett
  2025-07-09 10:03   ` Vlastimil Babka
  2 siblings, 1 reply; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-07 16:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Sorry I know it's annoying, but some petty commit msg nits:

On Thu, Jul 03, 2025 at 11:07:25PM -0700, Suren Baghdasaryan wrote:
> With maple_tree supporting vma tree traversal under RCU and per-vma
> locks, /proc/pid/maps can be read while holding individual vma locks
> instead of locking the entire address space.
> Completely lockless approach (walking vma tree under RCU) would be quite

Completely lockless approach -> A completely lockless approach

> complex with the main issue being get_vma_name() using callbacks which
> might not work correctly with a stable vma copy, requiring original
> (unstable) vma - see special_mapping_name() for an example.

NIT: for an example -> for example

> When per-vma lock acquisition fails, we take the mmap_lock for reading,
> lock the vma, release the mmap_lock and continue. This fallback to mmap
> read lock guarantees the reader to make forward progress even during
> lock contention. This will interfere with the writer but for a very
> short time while we are acquiring the per-vma lock and only when there
> was contention on the vma reader is interested in. We shouldn't see a

Can we separate out into a new paragraph?

> repeated fallback to mmap read locks in practice, as this require a
> very unlikely series of lock contentions (for instance due to repeated
> vma split operations). However even if this did somehow happen, we would
> still progress.
> One case requiring special handling is when vma changes between the

when vma changes -> when a vma changes

> time it was found and the time it got locked. A problematic case would
> be if vma got shrunk so that it's start moved higher in the address

vma -> a vma

it's start moved higher -> its vm_start moved higher

> space and a new vma was installed at the beginning:
>
> reader found:               |--------VMA A--------|
> VMA is modified:            |-VMA B-|----VMA A----|
> reader locks modified VMA A
> reader reports VMA A:       |  gap  |----VMA A----|
>
> This would result in reporting a gap in the address space that does not
> exist. To prevent this we retry the lookup after locking the vma, however
> we do that only when we identify a gap and detect that the address space
> was changed after we found the vma.

Can we separate out into a new paragraph?

> This change is designed to reduce mmap_lock contention and prevent a
> process reading /proc/pid/maps files (often a low priority task, such
> as monitoring/data collection services) from blocking address space
> updates. Note that this change has a userspace visible disadvantage:
> it allows for sub-page data tearing as opposed to the previous mechanism
> where data tearing could happen only between pages of generated output
> data. Since current userspace considers data tearing between pages to be
> acceptable, we assume is will be able to handle sub-page data tearing
> as well.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

OK this is looking pretty great now, I make a bunch of points below, but I
don't think anything is holding this up from being OK, so with those
addressed:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  fs/proc/internal.h        |   5 ++
>  fs/proc/task_mmu.c        | 118 ++++++++++++++++++++++++++++++++++----
>  include/linux/mmap_lock.h |  11 ++++
>  mm/madvise.c              |   3 +-
>  mm/mmap_lock.c            |  88 ++++++++++++++++++++++++++++
>  5 files changed, 214 insertions(+), 11 deletions(-)
>
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 3d48ffe72583..7c235451c5ea 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -384,6 +384,11 @@ struct proc_maps_private {
>  	struct task_struct *task;
>  	struct mm_struct *mm;
>  	struct vma_iterator iter;
> +	loff_t last_pos;
> +#ifdef CONFIG_PER_VMA_LOCK
> +	bool mmap_locked;
> +	struct vm_area_struct *locked_vma;
> +#endif
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *task_mempolicy;
>  #endif
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b8bc06d05a72..ff3fe488ce51 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -127,15 +127,107 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
>  }
>  #endif
>
> -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> -						loff_t *ppos)
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static void unlock_vma(struct proc_maps_private *priv)
> +{
> +	if (priv->locked_vma) {
> +		vma_end_read(priv->locked_vma);
> +		priv->locked_vma = NULL;
> +	}
> +}
> +
> +static const struct seq_operations proc_pid_maps_op;
> +
> +static inline bool lock_vma_range(struct seq_file *m,
> +				  struct proc_maps_private *priv)

OK this is a nice abstraction.

> +{
> +	/*
> +	 * smaps and numa_maps perform page table walk, therefore require
> +	 * mmap_lock but maps can be read with locking just the vma.

Probably worth mentioning that you hold the RCU read lock for the operation
also.

> +	 */
> +	if (m->op != &proc_pid_maps_op) {
> +		if (mmap_read_lock_killable(priv->mm))
> +			return false;
> +
> +		priv->mmap_locked = true;
> +	} else {
> +		rcu_read_lock();
> +		priv->locked_vma = NULL;
> +		priv->mmap_locked = false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline void unlock_vma_range(struct proc_maps_private *priv)

I guess the 'range' is either - the whole thing in case of mmap read
locked, or single VMA in case of per-VMA locks.

> +{
> +	if (priv->mmap_locked) {
> +		mmap_read_unlock(priv->mm);
> +	} else {
> +		unlock_vma(priv);
> +		rcu_read_unlock();
> +	}
> +}
> +
> +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> +					   loff_t last_pos)
> +{
> +	struct vm_area_struct *vma;
> +
> +	if (priv->mmap_locked)
> +		return vma_next(&priv->iter);
> +
> +	unlock_vma(priv);
> +	vma = lock_next_vma(priv->mm, &priv->iter, last_pos);
> +	if (!IS_ERR_OR_NULL(vma))
> +		priv->locked_vma = vma;
> +
> +	return vma;
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static inline bool lock_vma_range(struct seq_file *m,
> +				  struct proc_maps_private *priv)
>  {
> -	struct vm_area_struct *vma = vma_next(&priv->iter);
> +	return mmap_read_lock_killable(priv->mm) == 0;
> +}
> +
> +static inline void unlock_vma_range(struct proc_maps_private *priv)
> +{
> +	mmap_read_unlock(priv->mm);
> +}
> +
> +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> +					   loff_t last_pos)
> +{
> +	return vma_next(&priv->iter);
> +}
>
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> +{
> +	struct proc_maps_private *priv = m->private;
> +	struct vm_area_struct *vma;
> +
> +	vma = get_next_vma(priv, *ppos);
> +	/* EINTR is possible */
> +	if (IS_ERR(vma))
> +		return vma;
> +
> +	/* Store previous position to be able to restart if needed */
> +	priv->last_pos = *ppos;
>  	if (vma) {
> -		*ppos = vma->vm_start;
> +		/*
> +		 * Track the end of the reported vma to ensure position changes
> +		 * even if previous vma was merged with the next vma and we
> +		 * found the extended vma with the same vm_start.
> +		 */
> +		*ppos = vma->vm_end;
>  	} else {
> -		*ppos = -2;
> +		*ppos = -2; /* -2 indicates gate vma */
>  		vma = get_gate_vma(priv->mm);
>  	}
>
> @@ -163,28 +255,34 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
>  		return NULL;
>  	}
>
> -	if (mmap_read_lock_killable(mm)) {
> +	if (!lock_vma_range(m, priv)) {
>  		mmput(mm);
>  		put_task_struct(priv->task);
>  		priv->task = NULL;
>  		return ERR_PTR(-EINTR);
>  	}
>
> +	/*
> +	 * Reset current position if last_addr was set before
> +	 * and it's not a sentinel.
> +	 */
> +	if (last_addr > 0)
> +		*ppos = last_addr = priv->last_pos;
>  	vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
>  	hold_task_mempolicy(priv);
>  	if (last_addr == -2)
>  		return get_gate_vma(mm);
>
> -	return proc_get_vma(priv, ppos);
> +	return proc_get_vma(m, ppos);
>  }
>
>  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
>  {
>  	if (*ppos == -2) {
> -		*ppos = -1;
> +		*ppos = -1; /* -1 indicates no more vmas */
>  		return NULL;
>  	}
> -	return proc_get_vma(m->private, ppos);
> +	return proc_get_vma(m, ppos);
>  }
>
>  static void m_stop(struct seq_file *m, void *v)
> @@ -196,7 +294,7 @@ static void m_stop(struct seq_file *m, void *v)
>  		return;
>
>  	release_task_mempolicy(priv);
> -	mmap_read_unlock(mm);
> +	unlock_vma_range(priv);
>  	mmput(mm);
>  	put_task_struct(priv->task);
>  	priv->task = NULL;
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 5da384bd0a26..1f4f44951abe 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -309,6 +309,17 @@ void vma_mark_detached(struct vm_area_struct *vma);
>  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  					  unsigned long address);
>
> +/*
> + * Locks next vma pointed by the iterator. Confirms the locked vma has not
> + * been modified and will retry under mmap_lock protection if modification
> + * was detected. Should be called from read RCU section.
> + * Returns either a valid locked VMA, NULL if no more VMAs or -EINTR if the
> + * process was interrupted.
> + */
> +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> +				     struct vma_iterator *iter,
> +				     unsigned long address);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>
>  static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index a34c2c89a53b..e61e32b2cd91 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -108,7 +108,8 @@ void anon_vma_name_free(struct kref *kref)
>
>  struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
>  {
> -	mmap_assert_locked(vma->vm_mm);
> +	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> +		vma_assert_locked(vma);

This looks familiar ;)

>
>  	return vma->anon_name;
>  }
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index 5f725cc67334..ed0e5e2171cd 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -178,6 +178,94 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  	count_vm_vma_lock_event(VMA_LOCK_ABORT);
>  	return NULL;
>  }
> +
> +static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
> +						       struct vma_iterator *iter,

Nit, but we tend to call this vmi (yes Liam and I are addicted to 3 letter
abbreviations, we are evil beings)

> +						       unsigned long address)

I swear we already had a helper for this? Maybe misremembering

> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	ret = mmap_read_lock_killable(mm);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	/* Lookup the vma at the last position again under mmap_read_lock */
> +	vma_iter_init(iter, mm, address);
> +	vma = vma_next(iter);

Maybe worth calling this lock_next_under_mmap_lock() as we are grabbing the
next VMA here??

> +	if (vma)
> +		vma_start_read_locked(vma);
> +
> +	mmap_read_unlock(mm);
> +
> +	return vma;
> +}
> +
> +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> +				     struct vma_iterator *iter,
> +				     unsigned long address)

Slightly confusing this, I think last_pos would be better? Or last_address?

Otherwise it's not clear whether it's the address of the next VMA or the
end of the previous.

> +{
> +	struct vm_area_struct *vma;
> +	unsigned int mm_wr_seq;
> +	bool mmap_unlocked;
> +
> +	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
> +retry:
> +	/* Start mmap_lock speculation in case we need to verify the vma later */
> +	mmap_unlocked = mmap_lock_speculate_try_begin(mm, &mm_wr_seq);
> +	vma = vma_next(iter);
> +	if (!vma)
> +		return NULL;
> +
> +	vma = vma_start_read(mm, vma);
> +

Nit, but maybe erase this newline.

> +	if (IS_ERR_OR_NULL(vma)) {
> +		/*
> +		 * Retry immediately if the vma gets detached from under us.
> +		 * Infinite loop should not happen because the vma we find will
> +		 * have to be constantly knocked out from under us.
> +		 */
> +		if (PTR_ERR(vma) == -EAGAIN) {

Maybe worth a comment here stating that we intentionally retry getting the
next VMA, and therefore must reset to the last visited address each time.

> +			vma_iter_init(iter, mm, address);

Maybe Liam can confirm this is the best approach? Seems correct though.

> +			goto retry;
> +		}
> +
> +		goto out;
> +	}
> +
> +	/*
> +	 * Verify the vma we locked belongs to the same address space and it's
> +	 * not behind of the last search position.
> +	 */
> +	if (unlikely(vma->vm_mm != mm || address >= vma->vm_end))
> +		goto out_unlock;
> +
> +	/*
> +	 * vma can be ahead of the last search position but we need to verify
> +	 * it was not shrunk after we found it and another vma has not been
> +	 * installed ahead of it. Otherwise we might observe a gap that should
> +	 * not be there.
> +	 */
> +	if (address < vma->vm_start) {
> +		/* Verify only if the address space might have changed since vma lookup. */
> +		if (!mmap_unlocked || mmap_lock_speculate_retry(mm, mm_wr_seq)) {
> +			vma_iter_init(iter, mm, address);
> +			if (vma != vma_next(iter))
> +				goto out_unlock;
> +		}
> +	}
> +
> +	return vma;
> +
> +out_unlock:
> +	vma_end_read(vma);
> +out:

Maybe these labels should reflect the fact this is a fallback case?

Like fallback_unlock + fallback?

> +	rcu_read_unlock();
> +	vma = lock_vma_under_mmap_lock(mm, iter, address);
> +	rcu_read_lock();

OK I guess we hold the RCU lock _the whole time_ as we traverse except when
we lock under mmap lock.

> +
> +	return vma;
> +}
>  #endif /* CONFIG_PER_VMA_LOCK */
>
>  #ifdef CONFIG_LOCK_MM_AND_FIND_VMA
> --
> 2.50.0.727.gbf7dc18ff4-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
  2025-07-04  6:07 ` [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
@ 2025-07-07 16:54   ` Lorenzo Stoakes
  2025-07-07 18:26   ` Liam R. Howlett
  1 sibling, 0 replies; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-07 16:54 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Thu, Jul 03, 2025 at 11:07:26PM -0700, Suren Baghdasaryan wrote:
> Utilize per-vma locks to stabilize vma after lookup without taking
> mmap_lock during PROCMAP_QUERY ioctl execution. While we might take
> mmap_lock for reading during contention, we do that momentarily only
> to lock the vma.
> This change is designed to reduce mmap_lock contention and prevent
> PROCMAP_QUERY ioctl calls from blocking address space updates.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

This LGTM so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  fs/proc/task_mmu.c | 60 ++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 48 insertions(+), 12 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ff3fe488ce51..0496d5969a51 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -487,28 +487,64 @@ static int pid_maps_open(struct inode *inode, struct file *file)
>  		PROCMAP_QUERY_VMA_FLAGS				\
>  )
>
> -static int query_vma_setup(struct mm_struct *mm)
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static int query_vma_setup(struct proc_maps_private *priv)
>  {
> -	return mmap_read_lock_killable(mm);
> +	priv->locked_vma = NULL;
> +	priv->mmap_locked = false;
> +
> +	return 0;
>  }
>
> -static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> +static void query_vma_teardown(struct proc_maps_private *priv)
>  {
> -	mmap_read_unlock(mm);
> +	unlock_vma(priv);
> +}
> +
> +static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
> +						     unsigned long addr)
> +{
> +	struct vm_area_struct *vma;
> +
> +	rcu_read_lock();
> +	vma_iter_init(&priv->iter, priv->mm, addr);
> +	vma = get_next_vma(priv, addr);
> +	rcu_read_unlock();
> +
> +	return vma;
>  }
>
> -static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static int query_vma_setup(struct proc_maps_private *priv)
>  {
> -	return find_vma(mm, addr);
> +	return mmap_read_lock_killable(priv->mm);
>  }
>
> -static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> +static void query_vma_teardown(struct proc_maps_private *priv)
> +{
> +	mmap_read_unlock(priv->mm);
> +}
> +
> +static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
> +						     unsigned long addr)
> +{
> +	return find_vma(priv->mm, addr);
> +}
> +
> +#endif  /* CONFIG_PER_VMA_LOCK */
> +
> +static struct vm_area_struct *query_matching_vma(struct proc_maps_private *priv,
>  						 unsigned long addr, u32 flags)
>  {
>  	struct vm_area_struct *vma;
>
>  next_vma:
> -	vma = query_vma_find_by_addr(mm, addr);
> +	vma = query_vma_find_by_addr(priv, addr);
> +	if (IS_ERR(vma))
> +		return vma;
> +
>  	if (!vma)
>  		goto no_vma;
>
> @@ -584,13 +620,13 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>  	if (!mm || !mmget_not_zero(mm))
>  		return -ESRCH;
>
> -	err = query_vma_setup(mm);
> +	err = query_vma_setup(priv);
>  	if (err) {
>  		mmput(mm);
>  		return err;
>  	}
>
> -	vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
> +	vma = query_matching_vma(priv, karg.query_addr, karg.query_flags);
>  	if (IS_ERR(vma)) {
>  		err = PTR_ERR(vma);
>  		vma = NULL;
> @@ -675,7 +711,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>  	}
>
>  	/* unlock vma or mmap_lock, and put mm_struct before copying data to user */
> -	query_vma_teardown(mm, vma);
> +	query_vma_teardown(priv);
>  	mmput(mm);
>
>  	if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr),
> @@ -695,7 +731,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>  	return 0;
>
>  out:
> -	query_vma_teardown(mm, vma);
> +	query_vma_teardown(priv);
>  	mmput(mm);
>  	kfree(name_buf);
>  	return err;
> --
> 2.50.0.727.gbf7dc18ff4-goog
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-04  6:07 ` [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock Suren Baghdasaryan
  2025-07-07 16:51   ` Lorenzo Stoakes
@ 2025-07-07 18:20   ` Liam R. Howlett
  2025-07-07 23:12     ` Suren Baghdasaryan
  2025-07-09 10:03   ` Vlastimil Babka
  2 siblings, 1 reply; 47+ messages in thread
From: Liam R. Howlett @ 2025-07-07 18:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, lorenzo.stoakes, david, vbabka, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

* Suren Baghdasaryan <surenb@google.com> [250704 02:07]:
> With maple_tree supporting vma tree traversal under RCU and per-vma
> locks, /proc/pid/maps can be read while holding individual vma locks
> instead of locking the entire address space.
> Completely lockless approach (walking vma tree under RCU) would be quite
> complex with the main issue being get_vma_name() using callbacks which
> might not work correctly with a stable vma copy, requiring original
> (unstable) vma - see special_mapping_name() for an example.
> When per-vma lock acquisition fails, we take the mmap_lock for reading,
> lock the vma, release the mmap_lock and continue. This fallback to mmap
> read lock guarantees the reader to make forward progress even during
> lock contention. This will interfere with the writer but for a very
> short time while we are acquiring the per-vma lock and only when there
> was contention on the vma reader is interested in. We shouldn't see a
> repeated fallback to mmap read locks in practice, as this require a
> very unlikely series of lock contentions (for instance due to repeated
> vma split operations). However even if this did somehow happen, we would
> still progress.
> One case requiring special handling is when vma changes between the
> time it was found and the time it got locked. A problematic case would
> be if vma got shrunk so that it's start moved higher in the address
> space and a new vma was installed at the beginning:
> 
> reader found:               |--------VMA A--------|
> VMA is modified:            |-VMA B-|----VMA A----|
> reader locks modified VMA A
> reader reports VMA A:       |  gap  |----VMA A----|
> 
> This would result in reporting a gap in the address space that does not
> exist. To prevent this we retry the lookup after locking the vma, however
> we do that only when we identify a gap and detect that the address space
> was changed after we found the vma.
> This change is designed to reduce mmap_lock contention and prevent a
> process reading /proc/pid/maps files (often a low priority task, such
> as monitoring/data collection services) from blocking address space
> updates. Note that this change has a userspace visible disadvantage:
> it allows for sub-page data tearing as opposed to the previous mechanism
> where data tearing could happen only between pages of generated output
> data. Since current userspace considers data tearing between pages to be
> acceptable, we assume is will be able to handle sub-page data tearing
> as well.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  fs/proc/internal.h        |   5 ++
>  fs/proc/task_mmu.c        | 118 ++++++++++++++++++++++++++++++++++----
>  include/linux/mmap_lock.h |  11 ++++
>  mm/madvise.c              |   3 +-
>  mm/mmap_lock.c            |  88 ++++++++++++++++++++++++++++
>  5 files changed, 214 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 3d48ffe72583..7c235451c5ea 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -384,6 +384,11 @@ struct proc_maps_private {
>  	struct task_struct *task;
>  	struct mm_struct *mm;
>  	struct vma_iterator iter;
> +	loff_t last_pos;
> +#ifdef CONFIG_PER_VMA_LOCK
> +	bool mmap_locked;
> +	struct vm_area_struct *locked_vma;
> +#endif
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *task_mempolicy;
>  #endif
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b8bc06d05a72..ff3fe488ce51 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -127,15 +127,107 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
>  }
>  #endif
>  
> -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> -						loff_t *ppos)
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static void unlock_vma(struct proc_maps_private *priv)
> +{
> +	if (priv->locked_vma) {
> +		vma_end_read(priv->locked_vma);
> +		priv->locked_vma = NULL;
> +	}
> +}
> +
> +static const struct seq_operations proc_pid_maps_op;
> +
> +static inline bool lock_vma_range(struct seq_file *m,
> +				  struct proc_maps_private *priv)
> +{
> +	/*
> +	 * smaps and numa_maps perform page table walk, therefore require
> +	 * mmap_lock but maps can be read with locking just the vma.
> +	 */
> +	if (m->op != &proc_pid_maps_op) {
> +		if (mmap_read_lock_killable(priv->mm))
> +			return false;
> +
> +		priv->mmap_locked = true;
> +	} else {
> +		rcu_read_lock();
> +		priv->locked_vma = NULL;
> +		priv->mmap_locked = false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline void unlock_vma_range(struct proc_maps_private *priv)
> +{
> +	if (priv->mmap_locked) {
> +		mmap_read_unlock(priv->mm);
> +	} else {
> +		unlock_vma(priv);
> +		rcu_read_unlock();
> +	}
> +}
> +
> +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> +					   loff_t last_pos)
> +{
> +	struct vm_area_struct *vma;
> +
> +	if (priv->mmap_locked)
> +		return vma_next(&priv->iter);
> +
> +	unlock_vma(priv);
> +	vma = lock_next_vma(priv->mm, &priv->iter, last_pos);
> +	if (!IS_ERR_OR_NULL(vma))
> +		priv->locked_vma = vma;
> +
> +	return vma;
> +}
> +
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static inline bool lock_vma_range(struct seq_file *m,
> +				  struct proc_maps_private *priv)
>  {
> -	struct vm_area_struct *vma = vma_next(&priv->iter);
> +	return mmap_read_lock_killable(priv->mm) == 0;
> +}
> +
> +static inline void unlock_vma_range(struct proc_maps_private *priv)
> +{
> +	mmap_read_unlock(priv->mm);
> +}
> +
> +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> +					   loff_t last_pos)
> +{
> +	return vma_next(&priv->iter);
> +}
>  
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
> +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> +{
> +	struct proc_maps_private *priv = m->private;
> +	struct vm_area_struct *vma;
> +
> +	vma = get_next_vma(priv, *ppos);
> +	/* EINTR is possible */
> +	if (IS_ERR(vma))
> +		return vma;
> +
> +	/* Store previous position to be able to restart if needed */
> +	priv->last_pos = *ppos;
>  	if (vma) {
> -		*ppos = vma->vm_start;
> +		/*
> +		 * Track the end of the reported vma to ensure position changes
> +		 * even if previous vma was merged with the next vma and we
> +		 * found the extended vma with the same vm_start.
> +		 */
> +		*ppos = vma->vm_end;
>  	} else {
> -		*ppos = -2;
> +		*ppos = -2; /* -2 indicates gate vma */
>  		vma = get_gate_vma(priv->mm);
>  	}
>  
> @@ -163,28 +255,34 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
>  		return NULL;
>  	}
>  
> -	if (mmap_read_lock_killable(mm)) {
> +	if (!lock_vma_range(m, priv)) {
>  		mmput(mm);
>  		put_task_struct(priv->task);
>  		priv->task = NULL;
>  		return ERR_PTR(-EINTR);
>  	}
>  
> +	/*
> +	 * Reset current position if last_addr was set before
> +	 * and it's not a sentinel.
> +	 */
> +	if (last_addr > 0)
> +		*ppos = last_addr = priv->last_pos;
>  	vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
>  	hold_task_mempolicy(priv);
>  	if (last_addr == -2)
>  		return get_gate_vma(mm);
>  
> -	return proc_get_vma(priv, ppos);
> +	return proc_get_vma(m, ppos);
>  }
>  
>  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
>  {
>  	if (*ppos == -2) {
> -		*ppos = -1;
> +		*ppos = -1; /* -1 indicates no more vmas */
>  		return NULL;
>  	}
> -	return proc_get_vma(m->private, ppos);
> +	return proc_get_vma(m, ppos);
>  }
>  
>  static void m_stop(struct seq_file *m, void *v)
> @@ -196,7 +294,7 @@ static void m_stop(struct seq_file *m, void *v)
>  		return;
>  
>  	release_task_mempolicy(priv);
> -	mmap_read_unlock(mm);
> +	unlock_vma_range(priv);
>  	mmput(mm);
>  	put_task_struct(priv->task);
>  	priv->task = NULL;
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 5da384bd0a26..1f4f44951abe 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -309,6 +309,17 @@ void vma_mark_detached(struct vm_area_struct *vma);
>  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  					  unsigned long address);
>  
> +/*
> + * Locks next vma pointed by the iterator. Confirms the locked vma has not
> + * been modified and will retry under mmap_lock protection if modification
> + * was detected. Should be called from read RCU section.
> + * Returns either a valid locked VMA, NULL if no more VMAs or -EINTR if the
> + * process was interrupted.
> + */
> +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> +				     struct vma_iterator *iter,
> +				     unsigned long address);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index a34c2c89a53b..e61e32b2cd91 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -108,7 +108,8 @@ void anon_vma_name_free(struct kref *kref)
>  
>  struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
>  {
> -	mmap_assert_locked(vma->vm_mm);
> +	if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> +		vma_assert_locked(vma);
>  
>  	return vma->anon_name;
>  }
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> index 5f725cc67334..ed0e5e2171cd 100644
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -178,6 +178,94 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  	count_vm_vma_lock_event(VMA_LOCK_ABORT);
>  	return NULL;
>  }
> +
> +static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
> +						       struct vma_iterator *iter,
> +						       unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	ret = mmap_read_lock_killable(mm);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	/* Lookup the vma at the last position again under mmap_read_lock */
> +	vma_iter_init(iter, mm, address);
> +	vma = vma_next(iter);
> +	if (vma)
> +		vma_start_read_locked(vma);
> +
> +	mmap_read_unlock(mm);
> +
> +	return vma;
> +}
> +
> +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> +				     struct vma_iterator *iter,
> +				     unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned int mm_wr_seq;
> +	bool mmap_unlocked;
> +
> +	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
> +retry:
> +	/* Start mmap_lock speculation in case we need to verify the vma later */
> +	mmap_unlocked = mmap_lock_speculate_try_begin(mm, &mm_wr_seq);
> +	vma = vma_next(iter);
> +	if (!vma)
> +		return NULL;
> +
> +	vma = vma_start_read(mm, vma);
> +
> +	if (IS_ERR_OR_NULL(vma)) {
> +		/*
> +		 * Retry immediately if the vma gets detached from under us.
> +		 * Infinite loop should not happen because the vma we find will
> +		 * have to be constantly knocked out from under us.
> +		 */
> +		if (PTR_ERR(vma) == -EAGAIN) {
> +			vma_iter_init(iter, mm, address);
> +			goto retry;
> +		}
> +
> +		goto out;
> +	}
> +
> +	/*
> +	 * Verify the vma we locked belongs to the same address space and it's
> +	 * not behind of the last search position.
> +	 */
> +	if (unlikely(vma->vm_mm != mm || address >= vma->vm_end))
> +		goto out_unlock;
> +
> +	/*
> +	 * vma can be ahead of the last search position but we need to verify
> +	 * it was not shrunk after we found it and another vma has not been
> +	 * installed ahead of it. Otherwise we might observe a gap that should
> +	 * not be there.
> +	 */
> +	if (address < vma->vm_start) {
> +		/* Verify only if the address space might have changed since vma lookup. */
> +		if (!mmap_unlocked || mmap_lock_speculate_retry(mm, mm_wr_seq)) {
> +			vma_iter_init(iter, mm, address);
> +			if (vma != vma_next(iter))
> +				goto out_unlock;
> +		}
> +	}
> +
> +	return vma;
> +
> +out_unlock:
> +	vma_end_read(vma);
> +out:
> +	rcu_read_unlock();
> +	vma = lock_vma_under_mmap_lock(mm, iter, address);
> +	rcu_read_lock();
> +
> +	return vma;
> +}
>  #endif /* CONFIG_PER_VMA_LOCK */
>  
>  #ifdef CONFIG_LOCK_MM_AND_FIND_VMA
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks
  2025-07-04  6:07 ` [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
  2025-07-07 16:54   ` Lorenzo Stoakes
@ 2025-07-07 18:26   ` Liam R. Howlett
  1 sibling, 0 replies; 47+ messages in thread
From: Liam R. Howlett @ 2025-07-07 18:26 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, lorenzo.stoakes, david, vbabka, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

* Suren Baghdasaryan <surenb@google.com> [250704 02:07]:
> Utilize per-vma locks to stabilize vma after lookup without taking
> mmap_lock during PROCMAP_QUERY ioctl execution. While we might take
> mmap_lock for reading during contention, we do that momentarily only
> to lock the vma.
> This change is designed to reduce mmap_lock contention and prevent
> PROCMAP_QUERY ioctl calls from blocking address space updates.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  fs/proc/task_mmu.c | 60 ++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 48 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ff3fe488ce51..0496d5969a51 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -487,28 +487,64 @@ static int pid_maps_open(struct inode *inode, struct file *file)
>  		PROCMAP_QUERY_VMA_FLAGS				\
>  )
>  
> -static int query_vma_setup(struct mm_struct *mm)
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static int query_vma_setup(struct proc_maps_private *priv)
>  {
> -	return mmap_read_lock_killable(mm);
> +	priv->locked_vma = NULL;
> +	priv->mmap_locked = false;
> +
> +	return 0;
>  }
>  
> -static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> +static void query_vma_teardown(struct proc_maps_private *priv)
>  {
> -	mmap_read_unlock(mm);
> +	unlock_vma(priv);
> +}
> +
> +static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
> +						     unsigned long addr)
> +{
> +	struct vm_area_struct *vma;
> +
> +	rcu_read_lock();
> +	vma_iter_init(&priv->iter, priv->mm, addr);
> +	vma = get_next_vma(priv, addr);
> +	rcu_read_unlock();
> +
> +	return vma;
>  }
>  
> -static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> +#else /* CONFIG_PER_VMA_LOCK */
> +
> +static int query_vma_setup(struct proc_maps_private *priv)
>  {
> -	return find_vma(mm, addr);
> +	return mmap_read_lock_killable(priv->mm);
>  }
>  
> -static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> +static void query_vma_teardown(struct proc_maps_private *priv)
> +{
> +	mmap_read_unlock(priv->mm);
> +}
> +
> +static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_private *priv,
> +						     unsigned long addr)
> +{
> +	return find_vma(priv->mm, addr);
> +}
> +
> +#endif  /* CONFIG_PER_VMA_LOCK */
> +
> +static struct vm_area_struct *query_matching_vma(struct proc_maps_private *priv,
>  						 unsigned long addr, u32 flags)
>  {
>  	struct vm_area_struct *vma;
>  
>  next_vma:
> -	vma = query_vma_find_by_addr(mm, addr);
> +	vma = query_vma_find_by_addr(priv, addr);
> +	if (IS_ERR(vma))
> +		return vma;
> +
>  	if (!vma)
>  		goto no_vma;
>  
> @@ -584,13 +620,13 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>  	if (!mm || !mmget_not_zero(mm))
>  		return -ESRCH;
>  
> -	err = query_vma_setup(mm);
> +	err = query_vma_setup(priv);
>  	if (err) {
>  		mmput(mm);
>  		return err;
>  	}
>  
> -	vma = query_matching_vma(mm, karg.query_addr, karg.query_flags);
> +	vma = query_matching_vma(priv, karg.query_addr, karg.query_flags);
>  	if (IS_ERR(vma)) {
>  		err = PTR_ERR(vma);
>  		vma = NULL;
> @@ -675,7 +711,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>  	}
>  
>  	/* unlock vma or mmap_lock, and put mm_struct before copying data to user */
> -	query_vma_teardown(mm, vma);
> +	query_vma_teardown(priv);
>  	mmput(mm);
>  
>  	if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr),
> @@ -695,7 +731,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>  	return 0;
>  
>  out:
> -	query_vma_teardown(mm, vma);
> +	query_vma_teardown(priv);
>  	mmput(mm);
>  	kfree(name_buf);
>  	return err;
> -- 
> 2.50.0.727.gbf7dc18ff4-goog
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread
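
For readers who have not used the ioctl being discussed here, a rough userspace
sketch of a PROCMAP_QUERY call follows. Only query_addr, query_flags and the
vma_name_* fields are visible in the diff above; the size, vma_start and
vma_end fields and the PROCMAP_QUERY_COVERING_OR_NEXT_VMA flag are recalled
from the uapi header (<linux/fs.h>, kernel 6.11+) and should be double-checked
against it:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

static int query_vma(int maps_fd, unsigned long addr)
{
        struct procmap_query q;
        char name[256];

        memset(&q, 0, sizeof(q));
        q.size = sizeof(q);             /* lets the kernel handle older/newer layouts */
        q.query_addr = addr;
        /* flag name taken from the uapi header, double-check before relying on it */
        q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
        q.vma_name_addr = (uint64_t)(uintptr_t)name;    /* optionally ask for the vma name */
        q.vma_name_size = sizeof(name);

        if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0) {
                perror("PROCMAP_QUERY");
                return -1;
        }
        printf("vma [%llx-%llx] %s\n",
               (unsigned long long)q.vma_start,
               (unsigned long long)q.vma_end,
               q.vma_name_size ? name : "");
        return 0;
}

Here maps_fd is an fd obtained with open("/proc/<pid>/maps", O_RDONLY); with
this series each such query locks only the vma covering (or following) the
requested address instead of read-locking the whole address space.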

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-07 16:51   ` Lorenzo Stoakes
@ 2025-07-07 23:10     ` Suren Baghdasaryan
  2025-07-09  8:57       ` Vlastimil Babka
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-07 23:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Mon, Jul 7, 2025 at 9:52 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Sorry I know it's annoying, but some petty commit msg nits:
>
> On Thu, Jul 03, 2025 at 11:07:25PM -0700, Suren Baghdasaryan wrote:
> > With maple_tree supporting vma tree traversal under RCU and per-vma
> > locks, /proc/pid/maps can be read while holding individual vma locks
> > instead of locking the entire address space.
> > Completely lockless approach (walking vma tree under RCU) would be quite
>
> Completely lockless approach -> A completely lockless approach
>
> > complex with the main issue being get_vma_name() using callbacks which
> > might not work correctly with a stable vma copy, requiring original
> > (unstable) vma - see special_mapping_name() for an example.
>
> NIT: for an example -> for example

Ack.

>
> > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > lock the vma, release the mmap_lock and continue. This fallback to mmap
> > read lock guarantees the reader to make forward progress even during
> > lock contention. This will interfere with the writer but for a very
> > short time while we are acquiring the per-vma lock and only when there
> > was contention on the vma reader is interested in. We shouldn't see a
>
> Can we separate out into a new paragraph?

Will do.

>
> > repeated fallback to mmap read locks in practice, as this require a
> > very unlikely series of lock contentions (for instance due to repeated
> > vma split operations). However even if this did somehow happen, we would
> > still progress.
> > One case requiring special handling is when vma changes between the
>
> when vma changes -> when a vma chnages

Ack.

>
> > time it was found and the time it got locked. A problematic case would
> > be if vma got shrunk so that it's start moved higher in the address
>
> vma -> a vma

Ack.

>
> it's start moved higher -> its vm_start moved higher

Ack.

>
> > space and a new vma was installed at the beginning:
> >
> > reader found:               |--------VMA A--------|
> > VMA is modified:            |-VMA B-|----VMA A----|
> > reader locks modified VMA A
> > reader reports VMA A:       |  gap  |----VMA A----|
> >
> > This would result in reporting a gap in the address space that does not
> > exist. To prevent this we retry the lookup after locking the vma, however
> > we do that only when we identify a gap and detect that the address space
> > was changed after we found the vma.
>
> Can we separate out into a new paragraph?

Ack.

>
> > This change is designed to reduce mmap_lock contention and prevent a
> > process reading /proc/pid/maps files (often a low priority task, such
> > as monitoring/data collection services) from blocking address space
> > updates. Note that this change has a userspace visible disadvantage:
> > it allows for sub-page data tearing as opposed to the previous mechanism
> > where data tearing could happen only between pages of generated output
> > data. Since current userspace considers data tearing between pages to be
> > acceptable, we assume is will be able to handle sub-page data tearing
> > as well.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> OK this is looking pretty great now, I make a bunch of points below, but I
> don't think anything is holding this up from being OK, so with those
> addressed:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

>
> > ---
> >  fs/proc/internal.h        |   5 ++
> >  fs/proc/task_mmu.c        | 118 ++++++++++++++++++++++++++++++++++----
> >  include/linux/mmap_lock.h |  11 ++++
> >  mm/madvise.c              |   3 +-
> >  mm/mmap_lock.c            |  88 ++++++++++++++++++++++++++++
> >  5 files changed, 214 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 3d48ffe72583..7c235451c5ea 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -384,6 +384,11 @@ struct proc_maps_private {
> >       struct task_struct *task;
> >       struct mm_struct *mm;
> >       struct vma_iterator iter;
> > +     loff_t last_pos;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     bool mmap_locked;
> > +     struct vm_area_struct *locked_vma;
> > +#endif
> >  #ifdef CONFIG_NUMA
> >       struct mempolicy *task_mempolicy;
> >  #endif
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b8bc06d05a72..ff3fe488ce51 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -127,15 +127,107 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> >  }
> >  #endif
> >
> > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > -                                             loff_t *ppos)
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static void unlock_vma(struct proc_maps_private *priv)
> > +{
> > +     if (priv->locked_vma) {
> > +             vma_end_read(priv->locked_vma);
> > +             priv->locked_vma = NULL;
> > +     }
> > +}
> > +
> > +static const struct seq_operations proc_pid_maps_op;
> > +
> > +static inline bool lock_vma_range(struct seq_file *m,
> > +                               struct proc_maps_private *priv)
>
> OK this is a nice abstraction.
>
> > +{
> > +     /*
> > +      * smaps and numa_maps perform page table walk, therefore require
> > +      * mmap_lock but maps can be read with locking just the vma.
>
> Probably worth mentioning that you hold the RCU read lock for the operation
> also.

Ack.

>
> > +      */
> > +     if (m->op != &proc_pid_maps_op) {
> > +             if (mmap_read_lock_killable(priv->mm))
> > +                     return false;
> > +
> > +             priv->mmap_locked = true;
> > +     } else {
> > +             rcu_read_lock();
> > +             priv->locked_vma = NULL;
> > +             priv->mmap_locked = false;
> > +     }
> > +
> > +     return true;
> > +}
> > +
> > +static inline void unlock_vma_range(struct proc_maps_private *priv)
>
> I guess the 'range' is either the whole thing in the mmap read lock case,
> or a single VMA in the per-VMA lock case.

Correct.

>
> > +{
> > +     if (priv->mmap_locked) {
> > +             mmap_read_unlock(priv->mm);
> > +     } else {
> > +             unlock_vma(priv);
> > +             rcu_read_unlock();
> > +     }
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
> > +{
> > +     struct vm_area_struct *vma;
> > +
> > +     if (priv->mmap_locked)
> > +             return vma_next(&priv->iter);
> > +
> > +     unlock_vma(priv);
> > +     vma = lock_next_vma(priv->mm, &priv->iter, last_pos);
> > +     if (!IS_ERR_OR_NULL(vma))
> > +             priv->locked_vma = vma;
> > +
> > +     return vma;
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static inline bool lock_vma_range(struct seq_file *m,
> > +                               struct proc_maps_private *priv)
> >  {
> > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > +     return mmap_read_lock_killable(priv->mm) == 0;
> > +}
> > +
> > +static inline void unlock_vma_range(struct proc_maps_private *priv)
> > +{
> > +     mmap_read_unlock(priv->mm);
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
> > +{
> > +     return vma_next(&priv->iter);
> > +}
> >
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > +{
> > +     struct proc_maps_private *priv = m->private;
> > +     struct vm_area_struct *vma;
> > +
> > +     vma = get_next_vma(priv, *ppos);
> > +     /* EINTR is possible */
> > +     if (IS_ERR(vma))
> > +             return vma;
> > +
> > +     /* Store previous position to be able to restart if needed */
> > +     priv->last_pos = *ppos;
> >       if (vma) {
> > -             *ppos = vma->vm_start;
> > +             /*
> > +              * Track the end of the reported vma to ensure position changes
> > +              * even if previous vma was merged with the next vma and we
> > +              * found the extended vma with the same vm_start.
> > +              */
> > +             *ppos = vma->vm_end;
> >       } else {
> > -             *ppos = -2;
> > +             *ppos = -2; /* -2 indicates gate vma */
> >               vma = get_gate_vma(priv->mm);
> >       }
> >
> > @@ -163,28 +255,34 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> >               return NULL;
> >       }
> >
> > -     if (mmap_read_lock_killable(mm)) {
> > +     if (!lock_vma_range(m, priv)) {
> >               mmput(mm);
> >               put_task_struct(priv->task);
> >               priv->task = NULL;
> >               return ERR_PTR(-EINTR);
> >       }
> >
> > +     /*
> > +      * Reset current position if last_addr was set before
> > +      * and it's not a sentinel.
> > +      */
> > +     if (last_addr > 0)
> > +             *ppos = last_addr = priv->last_pos;
> >       vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
> >       hold_task_mempolicy(priv);
> >       if (last_addr == -2)
> >               return get_gate_vma(mm);
> >
> > -     return proc_get_vma(priv, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> >  {
> >       if (*ppos == -2) {
> > -             *ppos = -1;
> > +             *ppos = -1; /* -1 indicates no more vmas */
> >               return NULL;
> >       }
> > -     return proc_get_vma(m->private, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void m_stop(struct seq_file *m, void *v)
> > @@ -196,7 +294,7 @@ static void m_stop(struct seq_file *m, void *v)
> >               return;
> >
> >       release_task_mempolicy(priv);
> > -     mmap_read_unlock(mm);
> > +     unlock_vma_range(priv);
> >       mmput(mm);
> >       put_task_struct(priv->task);
> >       priv->task = NULL;
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index 5da384bd0a26..1f4f44951abe 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -309,6 +309,17 @@ void vma_mark_detached(struct vm_area_struct *vma);
> >  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >                                         unsigned long address);
> >
> > +/*
> > + * Locks next vma pointed by the iterator. Confirms the locked vma has not
> > + * been modified and will retry under mmap_lock protection if modification
> > + * was detected. Should be called from read RCU section.
> > + * Returns either a valid locked VMA, NULL if no more VMAs or -EINTR if the
> > + * process was interrupted.
> > + */
> > +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> > +                                  struct vma_iterator *iter,
> > +                                  unsigned long address);
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index a34c2c89a53b..e61e32b2cd91 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -108,7 +108,8 @@ void anon_vma_name_free(struct kref *kref)
> >
> >  struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
> >  {
> > -     mmap_assert_locked(vma->vm_mm);
> > +     if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> > +             vma_assert_locked(vma);
>
> This looks familiar ;)

Yep, that's your fix which I folded in.

>
> >
> >       return vma->anon_name;
> >  }
> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > index 5f725cc67334..ed0e5e2171cd 100644
> > --- a/mm/mmap_lock.c
> > +++ b/mm/mmap_lock.c
> > @@ -178,6 +178,94 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> >       return NULL;
> >  }
> > +
> > +static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
> > +                                                    struct vma_iterator *iter,
>
> Nit, but we tend to call this vmi (yes Liam and I are addicted to 3 letter
> abbreviations, we are evil beings)

Ok, I'll rename it.

>
> > +                                                    unsigned long address)
>
> I swear we already had a helper for this? Maybe misremembering

I think you might be confusing it with lock_vma_under_rcu()


>
> > +{
> > +     struct vm_area_struct *vma;
> > +     int ret;
> > +
> > +     ret = mmap_read_lock_killable(mm);
> > +     if (ret)
> > +             return ERR_PTR(ret);
> > +
> > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > +     vma_iter_init(iter, mm, address);
> > +     vma = vma_next(iter);
>
> Maybe worth calling this lock_next_under_mmap_lock() as we are grabbing the
> next VMA here??

Sure. lock_next_vma_under_mmap_lock() ?

>
> > +     if (vma)
> > +             vma_start_read_locked(vma);
> > +
> > +     mmap_read_unlock(mm);
> > +
> > +     return vma;
> > +}
> > +
> > +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> > +                                  struct vma_iterator *iter,
> > +                                  unsigned long address)
>
> This is slightly confusing; I think last_pos would be better? Or last_address?
>
> Otherwise it's not clear it's the address of the next VMA or the end of the
> previous.

Ok, last_address it is then.

>
> > +{
> > +     struct vm_area_struct *vma;
> > +     unsigned int mm_wr_seq;
> > +     bool mmap_unlocked;
> > +
> > +     RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
> > +retry:
> > +     /* Start mmap_lock speculation in case we need to verify the vma later */
> > +     mmap_unlocked = mmap_lock_speculate_try_begin(mm, &mm_wr_seq);
> > +     vma = vma_next(iter);
> > +     if (!vma)
> > +             return NULL;
> > +
> > +     vma = vma_start_read(mm, vma);
> > +
>
> Nit, but maybe erase this newline.

Ack.

>
> > +     if (IS_ERR_OR_NULL(vma)) {
> > +             /*
> > +              * Retry immediately if the vma gets detached from under us.
> > +              * Infinite loop should not happen because the vma we find will
> > +              * have to be constantly knocked out from under us.
> > +              */
> > +             if (PTR_ERR(vma) == -EAGAIN) {
>
> Maybe worth a comment here stating that we intentionally retry getting the
> next VMA, and therefore must reset to the last visited address each time.

Ack.

>
> > +                     vma_iter_init(iter, mm, address);
>
> Maybe Liam can confirm this is the best approach? Seems correct though.

Liam's Reviewed-by confirms correctness now :)

>
> > +                     goto retry;
> > +             }
> > +
> > +             goto out;
> > +     }
> > +
> > +     /*
> > +      * Verify the vma we locked belongs to the same address space and it's
> > +      * not behind of the last search position.
> > +      */
> > +     if (unlikely(vma->vm_mm != mm || address >= vma->vm_end))
> > +             goto out_unlock;
> > +
> > +     /*
> > +      * vma can be ahead of the last search position but we need to verify
> > +      * it was not shrunk after we found it and another vma has not been
> > +      * installed ahead of it. Otherwise we might observe a gap that should
> > +      * not be there.
> > +      */
> > +     if (address < vma->vm_start) {
> > +             /* Verify only if the address space might have changed since vma lookup. */
> > +             if (!mmap_unlocked || mmap_lock_speculate_retry(mm, mm_wr_seq)) {
> > +                     vma_iter_init(iter, mm, address);
> > +                     if (vma != vma_next(iter))
> > +                             goto out_unlock;
> > +             }
> > +     }
> > +
> > +     return vma;
> > +
> > +out_unlock:
> > +     vma_end_read(vma);
> > +out:
>
> Maybe these labels should reflect the fact this is a fallback case?
>
> Like fallback_unlock + fallback?

Ack.

>
> > +     rcu_read_unlock();
> > +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> > +     rcu_read_lock();
>
> OK I guess we hold the RCU lock _the whole time_ as we traverse except when
> we lock under mmap lock.

Correct.

>
> > +
> > +     return vma;
> > +}
> >  #endif /* CONFIG_PER_VMA_LOCK */
> >
> >  #ifdef CONFIG_LOCK_MM_AND_FIND_VMA
> > --
> > 2.50.0.727.gbf7dc18ff4-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-07 18:20   ` Liam R. Howlett
@ 2025-07-07 23:12     ` Suren Baghdasaryan
  0 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-07 23:12 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, akpm, lorenzo.stoakes, david,
	vbabka, peterx, jannh, hannes, mhocko, paulmck, shuah, adobriyan,
	brauner, josef, yebin10, linux, willy, osalvador, andrii,
	ryan.roberts, christophe.leroy, tjmercier, kaleshsingh, aha310510,
	linux-kernel, linux-fsdevel, linux-mm, linux-kselftest

On Mon, Jul 7, 2025 at 11:21 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250704 02:07]:
> > With maple_tree supporting vma tree traversal under RCU and per-vma
> > locks, /proc/pid/maps can be read while holding individual vma locks
> > instead of locking the entire address space.
> > Completely lockless approach (walking vma tree under RCU) would be quite
> > complex with the main issue being get_vma_name() using callbacks which
> > might not work correctly with a stable vma copy, requiring original
> > (unstable) vma - see special_mapping_name() for an example.
> > When per-vma lock acquisition fails, we take the mmap_lock for reading,
> > lock the vma, release the mmap_lock and continue. This fallback to mmap
> > read lock guarantees the reader to make forward progress even during
> > lock contention. This will interfere with the writer but for a very
> > short time while we are acquiring the per-vma lock and only when there
> > was contention on the vma reader is interested in. We shouldn't see a
> > repeated fallback to mmap read locks in practice, as this require a
> > very unlikely series of lock contentions (for instance due to repeated
> > vma split operations). However even if this did somehow happen, we would
> > still progress.
> > One case requiring special handling is when vma changes between the
> > time it was found and the time it got locked. A problematic case would
> > be if vma got shrunk so that it's start moved higher in the address
> > space and a new vma was installed at the beginning:
> >
> > reader found:               |--------VMA A--------|
> > VMA is modified:            |-VMA B-|----VMA A----|
> > reader locks modified VMA A
> > reader reports VMA A:       |  gap  |----VMA A----|
> >
> > This would result in reporting a gap in the address space that does not
> > exist. To prevent this we retry the lookup after locking the vma, however
> > we do that only when we identify a gap and detect that the address space
> > was changed after we found the vma.
> > This change is designed to reduce mmap_lock contention and prevent a
> > process reading /proc/pid/maps files (often a low priority task, such
> > as monitoring/data collection services) from blocking address space
> > updates. Note that this change has a userspace visible disadvantage:
> > it allows for sub-page data tearing as opposed to the previous mechanism
> > where data tearing could happen only between pages of generated output
> > data. Since current userspace considers data tearing between pages to be
> > acceptable, we assume is will be able to handle sub-page data tearing
> > as well.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Thanks!
I'll update addressing Lorenzo's nits and will repost in a couple
days. Hopefully by then I can get some reviews for the tests in the
series.

>
> > ---
> >  fs/proc/internal.h        |   5 ++
> >  fs/proc/task_mmu.c        | 118 ++++++++++++++++++++++++++++++++++----
> >  include/linux/mmap_lock.h |  11 ++++
> >  mm/madvise.c              |   3 +-
> >  mm/mmap_lock.c            |  88 ++++++++++++++++++++++++++++
> >  5 files changed, 214 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 3d48ffe72583..7c235451c5ea 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -384,6 +384,11 @@ struct proc_maps_private {
> >       struct task_struct *task;
> >       struct mm_struct *mm;
> >       struct vma_iterator iter;
> > +     loff_t last_pos;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +     bool mmap_locked;
> > +     struct vm_area_struct *locked_vma;
> > +#endif
> >  #ifdef CONFIG_NUMA
> >       struct mempolicy *task_mempolicy;
> >  #endif
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b8bc06d05a72..ff3fe488ce51 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -127,15 +127,107 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
> >  }
> >  #endif
> >
> > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > -                                             loff_t *ppos)
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static void unlock_vma(struct proc_maps_private *priv)
> > +{
> > +     if (priv->locked_vma) {
> > +             vma_end_read(priv->locked_vma);
> > +             priv->locked_vma = NULL;
> > +     }
> > +}
> > +
> > +static const struct seq_operations proc_pid_maps_op;
> > +
> > +static inline bool lock_vma_range(struct seq_file *m,
> > +                               struct proc_maps_private *priv)
> > +{
> > +     /*
> > +      * smaps and numa_maps perform page table walk, therefore require
> > +      * mmap_lock but maps can be read with locking just the vma.
> > +      */
> > +     if (m->op != &proc_pid_maps_op) {
> > +             if (mmap_read_lock_killable(priv->mm))
> > +                     return false;
> > +
> > +             priv->mmap_locked = true;
> > +     } else {
> > +             rcu_read_lock();
> > +             priv->locked_vma = NULL;
> > +             priv->mmap_locked = false;
> > +     }
> > +
> > +     return true;
> > +}
> > +
> > +static inline void unlock_vma_range(struct proc_maps_private *priv)
> > +{
> > +     if (priv->mmap_locked) {
> > +             mmap_read_unlock(priv->mm);
> > +     } else {
> > +             unlock_vma(priv);
> > +             rcu_read_unlock();
> > +     }
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
> > +{
> > +     struct vm_area_struct *vma;
> > +
> > +     if (priv->mmap_locked)
> > +             return vma_next(&priv->iter);
> > +
> > +     unlock_vma(priv);
> > +     vma = lock_next_vma(priv->mm, &priv->iter, last_pos);
> > +     if (!IS_ERR_OR_NULL(vma))
> > +             priv->locked_vma = vma;
> > +
> > +     return vma;
> > +}
> > +
> > +#else /* CONFIG_PER_VMA_LOCK */
> > +
> > +static inline bool lock_vma_range(struct seq_file *m,
> > +                               struct proc_maps_private *priv)
> >  {
> > -     struct vm_area_struct *vma = vma_next(&priv->iter);
> > +     return mmap_read_lock_killable(priv->mm) == 0;
> > +}
> > +
> > +static inline void unlock_vma_range(struct proc_maps_private *priv)
> > +{
> > +     mmap_read_unlock(priv->mm);
> > +}
> > +
> > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
> > +                                        loff_t last_pos)
> > +{
> > +     return vma_next(&priv->iter);
> > +}
> >
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
> > +{
> > +     struct proc_maps_private *priv = m->private;
> > +     struct vm_area_struct *vma;
> > +
> > +     vma = get_next_vma(priv, *ppos);
> > +     /* EINTR is possible */
> > +     if (IS_ERR(vma))
> > +             return vma;
> > +
> > +     /* Store previous position to be able to restart if needed */
> > +     priv->last_pos = *ppos;
> >       if (vma) {
> > -             *ppos = vma->vm_start;
> > +             /*
> > +              * Track the end of the reported vma to ensure position changes
> > +              * even if previous vma was merged with the next vma and we
> > +              * found the extended vma with the same vm_start.
> > +              */
> > +             *ppos = vma->vm_end;
> >       } else {
> > -             *ppos = -2;
> > +             *ppos = -2; /* -2 indicates gate vma */
> >               vma = get_gate_vma(priv->mm);
> >       }
> >
> > @@ -163,28 +255,34 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> >               return NULL;
> >       }
> >
> > -     if (mmap_read_lock_killable(mm)) {
> > +     if (!lock_vma_range(m, priv)) {
> >               mmput(mm);
> >               put_task_struct(priv->task);
> >               priv->task = NULL;
> >               return ERR_PTR(-EINTR);
> >       }
> >
> > +     /*
> > +      * Reset current position if last_addr was set before
> > +      * and it's not a sentinel.
> > +      */
> > +     if (last_addr > 0)
> > +             *ppos = last_addr = priv->last_pos;
> >       vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
> >       hold_task_mempolicy(priv);
> >       if (last_addr == -2)
> >               return get_gate_vma(mm);
> >
> > -     return proc_get_vma(priv, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> >  {
> >       if (*ppos == -2) {
> > -             *ppos = -1;
> > +             *ppos = -1; /* -1 indicates no more vmas */
> >               return NULL;
> >       }
> > -     return proc_get_vma(m->private, ppos);
> > +     return proc_get_vma(m, ppos);
> >  }
> >
> >  static void m_stop(struct seq_file *m, void *v)
> > @@ -196,7 +294,7 @@ static void m_stop(struct seq_file *m, void *v)
> >               return;
> >
> >       release_task_mempolicy(priv);
> > -     mmap_read_unlock(mm);
> > +     unlock_vma_range(priv);
> >       mmput(mm);
> >       put_task_struct(priv->task);
> >       priv->task = NULL;
> > diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> > index 5da384bd0a26..1f4f44951abe 100644
> > --- a/include/linux/mmap_lock.h
> > +++ b/include/linux/mmap_lock.h
> > @@ -309,6 +309,17 @@ void vma_mark_detached(struct vm_area_struct *vma);
> >  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >                                         unsigned long address);
> >
> > +/*
> > + * Locks next vma pointed by the iterator. Confirms the locked vma has not
> > + * been modified and will retry under mmap_lock protection if modification
> > + * was detected. Should be called from read RCU section.
> > + * Returns either a valid locked VMA, NULL if no more VMAs or -EINTR if the
> > + * process was interrupted.
> > + */
> > +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> > +                                  struct vma_iterator *iter,
> > +                                  unsigned long address);
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index a34c2c89a53b..e61e32b2cd91 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -108,7 +108,8 @@ void anon_vma_name_free(struct kref *kref)
> >
> >  struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
> >  {
> > -     mmap_assert_locked(vma->vm_mm);
> > +     if (!rwsem_is_locked(&vma->vm_mm->mmap_lock))
> > +             vma_assert_locked(vma);
> >
> >       return vma->anon_name;
> >  }
> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > index 5f725cc67334..ed0e5e2171cd 100644
> > --- a/mm/mmap_lock.c
> > +++ b/mm/mmap_lock.c
> > @@ -178,6 +178,94 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> >       return NULL;
> >  }
> > +
> > +static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
> > +                                                    struct vma_iterator *iter,
> > +                                                    unsigned long address)
> > +{
> > +     struct vm_area_struct *vma;
> > +     int ret;
> > +
> > +     ret = mmap_read_lock_killable(mm);
> > +     if (ret)
> > +             return ERR_PTR(ret);
> > +
> > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > +     vma_iter_init(iter, mm, address);
> > +     vma = vma_next(iter);
> > +     if (vma)
> > +             vma_start_read_locked(vma);
> > +
> > +     mmap_read_unlock(mm);
> > +
> > +     return vma;
> > +}
> > +
> > +struct vm_area_struct *lock_next_vma(struct mm_struct *mm,
> > +                                  struct vma_iterator *iter,
> > +                                  unsigned long address)
> > +{
> > +     struct vm_area_struct *vma;
> > +     unsigned int mm_wr_seq;
> > +     bool mmap_unlocked;
> > +
> > +     RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu read lock held");
> > +retry:
> > +     /* Start mmap_lock speculation in case we need to verify the vma later */
> > +     mmap_unlocked = mmap_lock_speculate_try_begin(mm, &mm_wr_seq);
> > +     vma = vma_next(iter);
> > +     if (!vma)
> > +             return NULL;
> > +
> > +     vma = vma_start_read(mm, vma);
> > +
> > +     if (IS_ERR_OR_NULL(vma)) {
> > +             /*
> > +              * Retry immediately if the vma gets detached from under us.
> > +              * Infinite loop should not happen because the vma we find will
> > +              * have to be constantly knocked out from under us.
> > +              */
> > +             if (PTR_ERR(vma) == -EAGAIN) {
> > +                     vma_iter_init(iter, mm, address);
> > +                     goto retry;
> > +             }
> > +
> > +             goto out;
> > +     }
> > +
> > +     /*
> > +      * Verify the vma we locked belongs to the same address space and it's
> > +      * not behind of the last search position.
> > +      */
> > +     if (unlikely(vma->vm_mm != mm || address >= vma->vm_end))
> > +             goto out_unlock;
> > +
> > +     /*
> > +      * vma can be ahead of the last search position but we need to verify
> > +      * it was not shrunk after we found it and another vma has not been
> > +      * installed ahead of it. Otherwise we might observe a gap that should
> > +      * not be there.
> > +      */
> > +     if (address < vma->vm_start) {
> > +             /* Verify only if the address space might have changed since vma lookup. */
> > +             if (!mmap_unlocked || mmap_lock_speculate_retry(mm, mm_wr_seq)) {
> > +                     vma_iter_init(iter, mm, address);
> > +                     if (vma != vma_next(iter))
> > +                             goto out_unlock;
> > +             }
> > +     }
> > +
> > +     return vma;
> > +
> > +out_unlock:
> > +     vma_end_read(vma);
> > +out:
> > +     rcu_read_unlock();
> > +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> > +     rcu_read_lock();
> > +
> > +     return vma;
> > +}
> >  #endif /* CONFIG_PER_VMA_LOCK */
> >
> >  #ifdef CONFIG_LOCK_MM_AND_FIND_VMA
> > --
> > 2.50.0.727.gbf7dc18ff4-goog
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  2025-07-04  6:07 ` [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned Suren Baghdasaryan
  2025-07-07 15:01   ` Lorenzo Stoakes
@ 2025-07-08 17:37   ` Vlastimil Babka
  2025-07-10  5:49     ` Suren Baghdasaryan
  1 sibling, 1 reply; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-08 17:37 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On 7/4/25 08:07, Suren Baghdasaryan wrote:
> Back in 2.6 era, last_addr used to be stored in seq_file->version
> variable, which was unsigned long. As a result, sentinels to represent
> gate vma and end of all vmas used unsigned values. In more recent
> kernels we don't used seq_file->version anymore and therefore conversion
> from loff_t into unsigned type is not needed. Similarly, sentinel values
> don't need to be unsigned. Remove type conversion for set_file position
> and change sentinel values to signed.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Some stuff in the code gave me pause, but it's out of scope here, so just in
case someone wants to do some extra churn...

> ---
>  fs/proc/task_mmu.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 751479eb128f..b8bc06d05a72 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -135,7 +135,7 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
>  	if (vma) {
>  		*ppos = vma->vm_start;
>  	} else {
> -		*ppos = -2UL;
> +		*ppos = -2;
>  		vma = get_gate_vma(priv->mm);
>  	}
>  
> @@ -145,11 +145,11 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
>  static void *m_start(struct seq_file *m, loff_t *ppos)
>  {
>  	struct proc_maps_private *priv = m->private;
> -	unsigned long last_addr = *ppos;
> +	loff_t last_addr = *ppos;
>  	struct mm_struct *mm;
>  
>  	/* See m_next(). Zero at the start or after lseek. */
> -	if (last_addr == -1UL)
> +	if (last_addr == -1)
>  		return NULL;
>  
>  	priv->task = get_proc_task(priv->inode);
> @@ -170,9 +170,9 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
>  		return ERR_PTR(-EINTR);
>  	}
>  
> -	vma_iter_init(&priv->iter, mm, last_addr);
> +	vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);

I wonder if this should rather be done only after dealing with the -2 case
below. It seems wrong to init the iterator with a bogus address. What if
vma_iter_init() acquires some sanity checks in the future?

>  	hold_task_mempolicy(priv);

It seems suboptimal to do that mempolicy refcount dance for numa_maps' sake
even when we're reading a different /proc file... maybe priv could have a flag
to determine that?

> -	if (last_addr == -2UL)
> +	if (last_addr == -2)
>  		return get_gate_vma(mm);

I think it only makes sense to init the iterator after the above?
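
For illustration, a rough sketch of the reordering being suggested above, with
a hypothetical need_mempolicy flag for the refcount-dance point (neither the
flag name nor this exact shape is from the patch):

        /* ... earlier part of m_start() unchanged ... */

        if (priv->need_mempolicy)       /* hypothetical flag, set only for numa_maps */
                hold_task_mempolicy(priv);
        if (last_addr == -2)
                return get_gate_vma(mm);

        /* init the iterator only once last_addr is known to be a real address */
        vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
        return proc_get_vma(priv, ppos);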

>  	return proc_get_vma(priv, ppos);
> @@ -180,8 +180,8 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
>  
>  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
>  {
> -	if (*ppos == -2UL) {
> -		*ppos = -1UL;
> +	if (*ppos == -2) {
> +		*ppos = -1;
>  		return NULL;
>  	}
>  	return proc_get_vma(m->private, ppos);



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-07 23:10     ` Suren Baghdasaryan
@ 2025-07-09  8:57       ` Vlastimil Babka
  2025-07-09 14:43         ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-09  8:57 UTC (permalink / raw)
  To: Suren Baghdasaryan, Lorenzo Stoakes
  Cc: akpm, Liam.Howlett, david, peterx, jannh, hannes, mhocko, paulmck,
	shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 7/8/25 01:10, Suren Baghdasaryan wrote:
>>> +     rcu_read_unlock();
>>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
>>> +     rcu_read_lock();
>> OK I guess we hold the RCU lock the whole time as we traverse except when
>> we lock under mmap lock.
> Correct.

I wonder if it's really necessary? Can't it be done just inside
lock_next_vma()? It would also avoid the unlock/lock dance quoted above.

Even if we later manage to extend this approach to smaps and employ rcu
locking to traverse the page tables, I'd think it's best to separate and
fine-grain the rcu lock usage for vma iterator and page tables, if only to
avoid too long time under the lock.


^ permalink raw reply	[flat|nested] 47+ messages in thread
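
For reference, the narrower scoping Vlastimil describes would be roughly
equivalent to wrapping each lookup as below (the wrapper name is made up and
this is illustrative only; as the discussion below works out, the iterator has
to be repositioned whenever RCU was dropped since its last use):

/* Illustrative wrapper only: scope the RCU read section to a single lookup. */
static struct vm_area_struct *lock_next_vma_scoped(struct mm_struct *mm,
                                                   struct vma_iterator *iter,
                                                   unsigned long address)
{
        struct vm_area_struct *vma;

        rcu_read_lock();
        /* RCU was not held since the last walk, so re-walk from the tree root */
        vma_iter_set(iter, address);
        vma = lock_next_vma(mm, iter, address);
        rcu_read_unlock();

        /* a locked vma (or error) can be returned outside the RCU section,
         * since the per-vma lock keeps it from being freed */
        return vma;
}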

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-04  6:07 ` [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock Suren Baghdasaryan
  2025-07-07 16:51   ` Lorenzo Stoakes
  2025-07-07 18:20   ` Liam R. Howlett
@ 2025-07-09 10:03   ` Vlastimil Babka
  2025-07-09 14:43     ` Suren Baghdasaryan
  2 siblings, 1 reply; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-09 10:03 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: Liam.Howlett, lorenzo.stoakes, david, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On 7/4/25 08:07, Suren Baghdasaryan wrote:
> --- a/mm/mmap_lock.c
> +++ b/mm/mmap_lock.c
> @@ -178,6 +178,94 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  	count_vm_vma_lock_event(VMA_LOCK_ABORT);
>  	return NULL;
>  }
> +
> +static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
> +						       struct vma_iterator *iter,
> +						       unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +	int ret;
> +
> +	ret = mmap_read_lock_killable(mm);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	/* Lookup the vma at the last position again under mmap_read_lock */
> +	vma_iter_init(iter, mm, address);
> +	vma = vma_next(iter);
> +	if (vma)
> +		vma_start_read_locked(vma);

This can in theory return false (refcount overflow?) so it should be handled?

> +
> +	mmap_read_unlock(mm);
> +
> +	return vma;
> +}
> +


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-09  8:57       ` Vlastimil Babka
@ 2025-07-09 14:43         ` Suren Baghdasaryan
  2025-07-09 15:03           ` Vlastimil Babka
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-09 14:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, akpm, Liam.Howlett, david, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> >>> +     rcu_read_unlock();
> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> >>> +     rcu_read_lock();
> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> >> we lock under mmap lock.
> > Correct.
>
> I wonder if it's really necessary? Can't it be done just inside
> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
>
> Even if we later manage to extend this approach to smaps and employ rcu
> locking to traverse the page tables, I'd think it's best to separate and
> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> avoid too long time under the lock.

I thought we would need to be in the same rcu read section while
traversing the maple tree using vma_next() but now looking at it,
maybe we can indeed enter it only while finding and locking the next
vma...
Liam, would that work? I see struct ma_state containing a node field.
Can it be freed from under us if we find a vma, exit the rcu read
section, then re-enter rcu and use the same iterator to find the next vma?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-09 10:03   ` Vlastimil Babka
@ 2025-07-09 14:43     ` Suren Baghdasaryan
  0 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-09 14:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On Wed, Jul 9, 2025 at 3:03 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/4/25 08:07, Suren Baghdasaryan wrote:
> > --- a/mm/mmap_lock.c
> > +++ b/mm/mmap_lock.c
> > @@ -178,6 +178,94 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >       count_vm_vma_lock_event(VMA_LOCK_ABORT);
> >       return NULL;
> >  }
> > +
> > +static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
> > +                                                    struct vma_iterator *iter,
> > +                                                    unsigned long address)
> > +{
> > +     struct vm_area_struct *vma;
> > +     int ret;
> > +
> > +     ret = mmap_read_lock_killable(mm);
> > +     if (ret)
> > +             return ERR_PTR(ret);
> > +
> > +     /* Lookup the vma at the last position again under mmap_read_lock */
> > +     vma_iter_init(iter, mm, address);
> > +     vma = vma_next(iter);
> > +     if (vma)
> > +             vma_start_read_locked(vma);
>
> This can in theory return false (refcount overflow?) so it should be handled?

Yes, I should handle it by falling back to mmap_lock. Thanks!
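
A minimal sketch of one way to surface that failure from
lock_vma_under_mmap_lock(), assuming vma_start_read_locked() returns a bool as
discussed; the -EAGAIN choice is only illustrative, not necessarily what the
reposted patch will do:

        /* Lookup the vma at the last position again under mmap_read_lock */
        vma_iter_init(iter, mm, address);
        vma = vma_next(iter);
        if (vma && !vma_start_read_locked(vma)) {
                /* e.g. vma refcount overflow: report failure instead of
                 * returning a vma that is not actually read-locked */
                vma = ERR_PTR(-EAGAIN);
        }
        mmap_read_unlock(mm);

        return vma;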

>
> > +
> > +     mmap_read_unlock(mm);
> > +
> > +     return vma;
> > +}
> > +


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-09 14:43         ` Suren Baghdasaryan
@ 2025-07-09 15:03           ` Vlastimil Babka
  2025-07-09 15:06             ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-09 15:03 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Lorenzo Stoakes, akpm, Liam.Howlett, david, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On 7/9/25 16:43, Suren Baghdasaryan wrote:
> On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 7/8/25 01:10, Suren Baghdasaryan wrote:
>> >>> +     rcu_read_unlock();
>> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
>> >>> +     rcu_read_lock();
>> >> OK I guess we hold the RCU lock the whole time as we traverse except when
>> >> we lock under mmap lock.
>> > Correct.
>>
>> I wonder if it's really necessary? Can't it be done just inside
>> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
>>
>> Even if we later manage to extend this approach to smaps and employ rcu
>> locking to traverse the page tables, I'd think it's best to separate and
>> fine-grain the rcu lock usage for vma iterator and page tables, if only to
>> avoid too long time under the lock.
> 
> I thought we would need to be in the same rcu read section while
> traversing the maple tree using vma_next() but now looking at it,
> maybe we can indeed enter only while finding and locking the next
> vma...
> Liam, would that work? I see struct ma_state containing a node field.
> Can it be freed from under us if we find a vma, exit rcu read section
> then re-enter rcu and use the same iterator to find the next vma?

If the rcu protection needs to be contiguous, and patch 8 avoids the issue by
always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
the issue or is it why we see the syzbot reports?) then I guess in the code
quoted above we also need a vma_iter_init() after the rcu_read_lock(),
because although the iterator was used briefly under mmap_lock protection,
that was then unlocked and there can be a race before the rcu_read_lock().


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-09 15:03           ` Vlastimil Babka
@ 2025-07-09 15:06             ` Suren Baghdasaryan
  2025-07-09 16:11               ` Liam R. Howlett
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-09 15:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, akpm, Liam.Howlett, david, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/9/25 16:43, Suren Baghdasaryan wrote:
> > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> >> >>> +     rcu_read_unlock();
> >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> >> >>> +     rcu_read_lock();
> >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> >> >> we lock under mmap lock.
> >> > Correct.
> >>
> >> I wonder if it's really necessary? Can't it be done just inside
> >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> >>
> >> Even if we later manage to extend this approach to smaps and employ rcu
> >> locking to traverse the page tables, I'd think it's best to separate and
> >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> >> avoid too long time under the lock.
> >
> > I thought we would need to be in the same rcu read section while
> > traversing the maple tree using vma_next() but now looking at it,
> > maybe we can indeed enter only while finding and locking the next
> > vma...
> > Liam, would that work? I see struct ma_state containing a node field.
> > Can it be freed from under us if we find a vma, exit rcu read section
> > then re-enter rcu and use the same iterator to find the next vma?
>
> If the rcu protection needs to be contigous, and patch 8 avoids the issue by
> always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> the issue or is it why we see the syzbot reports?) then I guess in the code
> quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> because although the iterator was used briefly under mmap_lock protection,
> that was then unlocked and there can be a race before the rcu_read_lock().

Quite true. So, let's wait for Liam's confirmation and based on his
answer I'll change the patch by either reducing the rcu read section
or adding the missing vma_iter_init() after we switch to mmap_lock.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-09 15:06             ` Suren Baghdasaryan
@ 2025-07-09 16:11               ` Liam R. Howlett
  2025-07-09 17:47                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Liam R. Howlett @ 2025-07-09 16:11 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Lorenzo Stoakes, akpm, david, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

* Suren Baghdasaryan <surenb@google.com> [250709 11:06]:
> On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 7/9/25 16:43, Suren Baghdasaryan wrote:
> > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> > >> >>> +     rcu_read_unlock();
> > >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> > >> >>> +     rcu_read_lock();
> > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> > >> >> we lock under mmap lock.
> > >> > Correct.
> > >>
> > >> I wonder if it's really necessary? Can't it be done just inside
> > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> > >>
> > >> Even if we later manage to extend this approach to smaps and employ rcu
> > >> locking to traverse the page tables, I'd think it's best to separate and
> > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> > >> avoid too long time under the lock.
> > >
> > > I thought we would need to be in the same rcu read section while
> > > traversing the maple tree using vma_next() but now looking at it,
> > > maybe we can indeed enter only while finding and locking the next
> > > vma...
> > > Liam, would that work? I see struct ma_state containing a node field.
> > > Can it be freed from under us if we find a vma, exit rcu read section
> > > then re-enter rcu and use the same iterator to find the next vma?
> >
> > If the rcu protection needs to be contigous, and patch 8 avoids the issue by
> > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> > the issue or is it why we see the syzbot reports?) then I guess in the code
> > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> > because although the iterator was used briefly under mmap_lock protection,
> > that was then unlocked and there can be a race before the rcu_read_lock().
> 
> Quite true. So, let's wait for Liam's confirmation and based on his
> answer I'll change the patch by either reducing the rcu read section
> or adding the missing vma_iter_init() after we switch to mmap_lock.

You need to either be under rcu or mmap lock to ensure the node in the
maple state hasn't been freed (and potentially, reallocated).

So in this case, at the higher level, if we hold the rcu read lock for
a series of walks and avoid re-walking the tree, then the performance
will be better.

When we return to userspace, we should drop the rcu read lock and
will need to vma_iter_set()/vma_iter_invalidate() on return.  I thought
this was being done (through vma_iter_init()), but syzbot seems to
indicate a path that was missed?

This is the same thing that needed to be done previously with the mmap
lock, but now under the rcu lock.

I'm not sure how to mitigate the issue with the page tables; maybe we
guess at a number of vmas, like we were doing for 4k blocks of output,
and just drop/reacquire at that point.  Probably a problem for another
day anyway.

Also, I think you can change the vma_iter_init() to vma_iter_set(),
which is slightly less code under the hood.  Vlastimil asked about this
and it's probably a better choice.

Thanks,
Liam
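
Putting those two points together, a rough sketch of how the fallback tail of
lock_next_vma() could end up looking (the label names follow Lorenzo's earlier
suggestion and the reposition target is a guess, so this is illustrative only,
not the reposted patch):

fallback_unlock:
        vma_end_read(vma);
fallback:
        rcu_read_unlock();
        vma = lock_vma_under_mmap_lock(mm, iter, address);
        rcu_read_lock();
        /* The iterator was last used outside this RCU read section, so
         * reposition it before the next walk: past the vma that was just
         * locked on success, or back to the last position otherwise. */
        vma_iter_set(iter, IS_ERR_OR_NULL(vma) ? address : vma->vm_end);

        return vma;
}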



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-09 16:11               ` Liam R. Howlett
@ 2025-07-09 17:47                 ` Suren Baghdasaryan
  2025-07-10  7:03                   ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-09 17:47 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, akpm, david, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250709 11:06]:
> > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
> > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >>
> > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> > > >> >>> +     rcu_read_unlock();
> > > >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> > > >> >>> +     rcu_read_lock();
> > > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> > > >> >> we lock under mmap lock.
> > > >> > Correct.
> > > >>
> > > >> I wonder if it's really necessary? Can't it be done just inside
> > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> > > >>
> > > >> Even if we later manage to extend this approach to smaps and employ rcu
> > > >> locking to traverse the page tables, I'd think it's best to separate and
> > > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> > > >> avoid too long time under the lock.
> > > >
> > > > I thought we would need to be in the same rcu read section while
> > > > traversing the maple tree using vma_next() but now looking at it,
> > > > maybe we can indeed enter only while finding and locking the next
> > > > vma...
> > > > Liam, would that work? I see struct ma_state containing a node field.
> > > > Can it be freed from under us if we find a vma, exit rcu read section
> > > > then re-enter rcu and use the same iterator to find the next vma?
> > >
> > > If the rcu protection needs to be contigous, and patch 8 avoids the issue by
> > > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> > > the issue or is it why we see the syzbot reports?) then I guess in the code
> > > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> > > because although the iterator was used briefly under mmap_lock protection,
> > > that was then unlocked and there can be a race before the rcu_read_lock().
> >
> > Quite true. So, let's wait for Liam's confirmation and based on his
> > answer I'll change the patch by either reducing the rcu read section
> > or adding the missing vma_iter_init() after we switch to mmap_lock.
>
> You need to either be under rcu or mmap lock to ensure the node in the
> maple state hasn't been freed (and potentially, reallocated).
>
> So in this case, in the higher level, we can hold the rcu read lock for
> a series of walks and avoid re-walking the tree then the performance
> would be better.

Got it. Thanks for confirming!

>
> When we return to userspace, then we should drop the rcu read lock and
> will need to vma_iter_set()/vma_iter_invalidate() on return.  I thought
> this was being done (through vma_iter_init()), but syzbot seems to
> indicate a path that was missed?

We do that in m_start()/m_stop() by calling
lock_vma_range()/unlock_vma_range(), but I think I have two problems
here:
1. As Vlastimil mentioned, I do not reset the iterator when falling
back to mmap_lock and exiting and then re-entering the rcu read section;
2. I do not reset the iterator after exiting the rcu read section in
m_stop() and re-entering it in m_start(), so the later call to
lock_next_vma() might be using an iterator with a node that was freed
(and possibly reallocated). See the sketch below for (2).
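
For (2), something along these lines in the m_start() path (sketch
only, untested; how last_addr is plumbed into lock_vma_range() is up to
the patch):

	/* in lock_vma_range(), called from m_start() */
	rcu_read_lock();
	/* rcu was dropped in m_stop(), so the node cached in the
	 * iterator may be stale; reset it before the first
	 * lock_next_vma() call */
	vma_iter_set(&priv->iter, last_addr);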

>
> This is the same thing that needed to be done previously with the mmap
> lock, but now under the rcu lock.
>
> I'm not sure how to mitigate the issue with the page table, maybe we
> guess on the number of vmas that we were doing for 4k blocks of output
> and just drop/reacquire then.  Probably a problem for another day
> anyways.
>
> Also, I think you can also change the vma_iter_init() to vma_iter_set(),
> which is slightly less code under the hood.  Vlastimil asked about this
> and it's probably a better choice.

Ack.
I'll update my series with these fixes and all comments I received so
far, will run the reproducers to confirm no issues and repost them
later today.
Thanks,
Suren.

>
> Thanks,
> Liam
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned
  2025-07-08 17:37   ` Vlastimil Babka
@ 2025-07-10  5:49     ` Suren Baghdasaryan
  0 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-10  5:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, david, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On Tue, Jul 8, 2025 at 10:37 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/4/25 08:07, Suren Baghdasaryan wrote:
> > Back in the 2.6 era, last_addr used to be stored in the seq_file->version
> > variable, which was unsigned long. As a result, sentinels to represent
> > the gate vma and the end of all vmas used unsigned values. In more recent
> > kernels we don't use seq_file->version anymore and therefore the conversion
> > from loff_t into an unsigned type is not needed. Similarly, sentinel values
> > don't need to be unsigned. Remove the type conversion for the seq_file
> > position and change the sentinel values to signed.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some stuff in the code gave me a pause but it's out of scope here so just in
> case someone wants to do some extra churn...
>
> > ---
> >  fs/proc/task_mmu.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 751479eb128f..b8bc06d05a72 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -135,7 +135,7 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> >       if (vma) {
> >               *ppos = vma->vm_start;
> >       } else {
> > -             *ppos = -2UL;
> > +             *ppos = -2;
> >               vma = get_gate_vma(priv->mm);
> >       }
> >
> > @@ -145,11 +145,11 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> >  static void *m_start(struct seq_file *m, loff_t *ppos)
> >  {
> >       struct proc_maps_private *priv = m->private;
> > -     unsigned long last_addr = *ppos;
> > +     loff_t last_addr = *ppos;
> >       struct mm_struct *mm;
> >
> >       /* See m_next(). Zero at the start or after lseek. */
> > -     if (last_addr == -1UL)
> > +     if (last_addr == -1)
> >               return NULL;
> >
> >       priv->task = get_proc_task(priv->inode);
> > @@ -170,9 +170,9 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> >               return ERR_PTR(-EINTR);
> >       }
> >
> > -     vma_iter_init(&priv->iter, mm, last_addr);
> > +     vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
>
> I wonder if this should rather be done only after dealing with the -2 case
> below. It seems wrong to init the iterator with a bogus address. What if it
> acquires some sanity checks?
>
> >       hold_task_mempolicy(priv);
>
> It seems suboptimal to do that mempolicy refcount dance for numa_maps sake
> even if we're reading a different /proc file... maybe priv could have a flag
> to determine?
>
> > -     if (last_addr == -2UL)
> > +     if (last_addr == -2)
> >               return get_gate_vma(mm);
>
> I think only after the above it makes sense to init the iterator?

Yes, that makes sense, but let me do that outside of this patchset as
it's rather unrelated.
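
For reference, the reordering would be something like (sketch only,
untested):

	hold_task_mempolicy(priv);
	if (last_addr == -2)
		return get_gate_vma(mm);

	vma_iter_init(&priv->iter, mm, (unsigned long)last_addr);
	return proc_get_vma(priv, ppos);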

>
> >       return proc_get_vma(priv, ppos);
> > @@ -180,8 +180,8 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
> >
> >  static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
> >  {
> > -     if (*ppos == -2UL) {
> > -             *ppos = -1UL;
> > +     if (*ppos == -2) {
> > +             *ppos = -1;
> >               return NULL;
> >       }
> >       return proc_get_vma(m->private, ppos);
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-09 17:47                 ` Suren Baghdasaryan
@ 2025-07-10  7:03                   ` Suren Baghdasaryan
  2025-07-10 17:02                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-10  7:03 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, akpm, david, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [250709 11:06]:
> > > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >
> > > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
> > > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > >>
> > > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> > > > >> >>> +     rcu_read_unlock();
> > > > >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> > > > >> >>> +     rcu_read_lock();
> > > > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> > > > >> >> we lock under mmap lock.
> > > > >> > Correct.
> > > > >>
> > > > >> I wonder if it's really necessary? Can't it be done just inside
> > > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> > > > >>
> > > > >> Even if we later manage to extend this approach to smaps and employ rcu
> > > > >> locking to traverse the page tables, I'd think it's best to separate and
> > > > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> > > > >> avoid too long time under the lock.
> > > > >
> > > > > I thought we would need to be in the same rcu read section while
> > > > > traversing the maple tree using vma_next() but now looking at it,
> > > > > maybe we can indeed enter only while finding and locking the next
> > > > > vma...
> > > > > Liam, would that work? I see struct ma_state containing a node field.
> > > > > Can it be freed from under us if we find a vma, exit rcu read section
> > > > > then re-enter rcu and use the same iterator to find the next vma?
> > > >
> > > > If the rcu protection needs to be contigous, and patch 8 avoids the issue by
> > > > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> > > > the issue or is it why we see the syzbot reports?) then I guess in the code
> > > > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> > > > because although the iterator was used briefly under mmap_lock protection,
> > > > that was then unlocked and there can be a race before the rcu_read_lock().
> > >
> > > Quite true. So, let's wait for Liam's confirmation and based on his
> > > answer I'll change the patch by either reducing the rcu read section
> > > or adding the missing vma_iter_init() after we switch to mmap_lock.
> >
> > You need to either be under rcu or mmap lock to ensure the node in the
> > maple state hasn't been freed (and potentially, reallocated).
> >
> > So in this case, in the higher level, we can hold the rcu read lock for
> > a series of walks and avoid re-walking the tree then the performance
> > would be better.
>
> Got it. Thanks for confirming!
>
> >
> > When we return to userspace, then we should drop the rcu read lock and
> > will need to vma_iter_set()/vma_iter_invalidate() on return.  I thought
> > this was being done (through vma_iter_init()), but syzbot seems to
> > indicate a path that was missed?
>
> We do that in m_start()/m_stop() by calling
> lock_vma_range()/unlock_vma_range() but I think I have two problems
> here:
> 1. As Vlastimil mentioned I do not reset the iterator when falling
> back to mmap_lock and exiting and then re-entering rcu read section;
> 2. I do not reset the iterator after exiting rcu read section in
> m_stop() and re-entering it in m_start(), so the later call to
> lock_next_vma() might be using an iterator with a node that was freed
> (and possibly reallocated).
>
> >
> > This is the same thing that needed to be done previously with the mmap
> > lock, but now under the rcu lock.
> >
> > I'm not sure how to mitigate the issue with the page table, maybe we
> > guess on the number of vmas that we were doing for 4k blocks of output
> > and just drop/reacquire then.  Probably a problem for another day
> > anyways.
> >
> > Also, I think you can also change the vma_iter_init() to vma_iter_set(),
> > which is slightly less code under the hood.  Vlastimil asked about this
> > and it's probably a better choice.
>
> Ack.
> I'll update my series with these fixes and all comments I received so
> far, will run the reproducers to confirm no issues and repost them
> later today.

I have the patchset ready but would like to test it some more. Will
post it tomorrow.

> Thanks,
> Suren.
>
> >
> > Thanks,
> > Liam
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-10  7:03                   ` Suren Baghdasaryan
@ 2025-07-10 17:02                     ` Suren Baghdasaryan
  2025-07-10 17:42                       ` Vlastimil Babka
  2025-07-15  8:16                       ` Vlastimil Babka
  0 siblings, 2 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-10 17:02 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, akpm, david, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Suren Baghdasaryan <surenb@google.com> [250709 11:06]:
> > > > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > >
> > > > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
> > > > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > > >>
> > > > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> > > > > >> >>> +     rcu_read_unlock();
> > > > > >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> > > > > >> >>> +     rcu_read_lock();
> > > > > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> > > > > >> >> we lock under mmap lock.
> > > > > >> > Correct.
> > > > > >>
> > > > > >> I wonder if it's really necessary? Can't it be done just inside
> > > > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> > > > > >>
> > > > > >> Even if we later manage to extend this approach to smaps and employ rcu
> > > > > >> locking to traverse the page tables, I'd think it's best to separate and
> > > > > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> > > > > >> avoid too long time under the lock.
> > > > > >
> > > > > > I thought we would need to be in the same rcu read section while
> > > > > > traversing the maple tree using vma_next() but now looking at it,
> > > > > > maybe we can indeed enter only while finding and locking the next
> > > > > > vma...
> > > > > > Liam, would that work? I see struct ma_state containing a node field.
> > > > > > Can it be freed from under us if we find a vma, exit rcu read section
> > > > > > then re-enter rcu and use the same iterator to find the next vma?
> > > > >
> > > > > If the rcu protection needs to be contigous, and patch 8 avoids the issue by
> > > > > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> > > > > the issue or is it why we see the syzbot reports?) then I guess in the code
> > > > > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> > > > > because although the iterator was used briefly under mmap_lock protection,
> > > > > that was then unlocked and there can be a race before the rcu_read_lock().
> > > >
> > > > Quite true. So, let's wait for Liam's confirmation and based on his
> > > > answer I'll change the patch by either reducing the rcu read section
> > > > or adding the missing vma_iter_init() after we switch to mmap_lock.
> > >
> > > You need to either be under rcu or mmap lock to ensure the node in the
> > > maple state hasn't been freed (and potentially, reallocated).
> > >
> > > So in this case, in the higher level, we can hold the rcu read lock for
> > > a series of walks and avoid re-walking the tree then the performance
> > > would be better.
> >
> > Got it. Thanks for confirming!
> >
> > >
> > > When we return to userspace, then we should drop the rcu read lock and
> > > will need to vma_iter_set()/vma_iter_invalidate() on return.  I thought
> > > this was being done (through vma_iter_init()), but syzbot seems to
> > > indicate a path that was missed?
> >
> > We do that in m_start()/m_stop() by calling
> > lock_vma_range()/unlock_vma_range() but I think I have two problems
> > here:
> > 1. As Vlastimil mentioned I do not reset the iterator when falling
> > back to mmap_lock and exiting and then re-entering rcu read section;
> > 2. I do not reset the iterator after exiting rcu read section in
> > m_stop() and re-entering it in m_start(), so the later call to
> > lock_next_vma() might be using an iterator with a node that was freed
> > (and possibly reallocated).
> >
> > >
> > > This is the same thing that needed to be done previously with the mmap
> > > lock, but now under the rcu lock.
> > >
> > > I'm not sure how to mitigate the issue with the page table, maybe we
> > > guess on the number of vmas that we were doing for 4k blocks of output
> > > and just drop/reacquire then.  Probably a problem for another day
> > > anyways.
> > >
> > > Also, I think you can also change the vma_iter_init() to vma_iter_set(),
> > > which is slightly less code under the hood.  Vlastimil asked about this
> > > and it's probably a better choice.
> >
> > Ack.
> > I'll update my series with these fixes and all comments I received so
> > far, will run the reproducers to confirm no issues and repost them
> > later today.
>
> I have the patchset ready but would like to test it some more. Will
> post it tomorrow.

Ok, I found a couple of issues using the syzbot reproducer [1] (which
is awesome BTW!):
1. rwsem_acquire_read() inside vma_start_read() at [2] should be moved
after the last check, otherwise the lock is considered taken on
vma->vm_refcnt overflow;
2. query_matching_vma() is missing an unlock_vma() call when it does
"goto next_vma;" and re-issues query_vma_find_by_addr(), so the previous
vma is left locked (see the sketch after the links below).

[1] https://syzkaller.appspot.com/x/repro.c?x=101edf70580000
[2] https://elixir.bootlin.com/linux/v6.15.5/source/include/linux/mm.h#L747
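
For (2), the fix is roughly the following (sketch only; the surrounding
condition is paraphrased and unlock_vma()/query_vma_find_by_addr() are
the helpers touched by patch 8, so the exact argument may differ):

	/* in query_matching_vma(), before retrying with the next address */
	if (want_next_vma) {		/* paraphrased condition */
		unlock_vma(priv);	/* was missing: drop the per-vma lock first */
		addr = vma->vm_end;
		goto next_vma;
	}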

After these fixes it's much harder to trigger a failure, but I still get
one more error, copied below. I will continue the investigation and will
hold off reposting until this is fixed. That will be next week since
I'll be out of town the rest of this week.

Andrew, could you please remove this patchset from mm-unstable for now
until I fix the issue and re-post the new version?

The error I got after these fixes is:

[   56.342886]
[   56.342910] ------------[ cut here ]------------
[   56.342934] WARNING: CPU: 46 PID: 5701 at lib/maple_tree.c:4734
mas_next_slot+0x552/0x840
[   56.344691] ================================================
[   56.344695] WARNING: lock held when returning to user space!
[   56.344698] 6.16.0-rc5-00321-g31d640f7b07c-dirty #379 Not tainted
[   56.344702] ------------------------------------------------
[   56.344704] syzbot_repro1/5700 is leaving the kernel with locks still held!
[   56.344715] 1 lock held by syzbot_repro1/5700:
[   56.344720]  #0: ffff93a8c2cea788 (vm_lock){++++}-{0:0}
[   56.349286] Modules linked in:
[   56.355569] , at: get_next_vma+0x91/0xe0
[   56.377452]
[   56.377929] CPU: 46 UID: 0 PID: 5701 Comm: syzbot_repro1 Not
tainted 6.16.0-rc5-00321-g31d640f7b07c-dirty #379 PREEMPT(voluntary)
[   56.381592] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   56.384664] RIP: 0010:mas_next_slot+0x552/0x840
[   56.386097] Code: 43 38 83 e8 02 83 f8 01 77 c4 e9 e5 fa ff ff 48
8b 43 28 e9 83 fb ff ff 49 8b 06 30 c0 49 39 c6 74 be c7 43 38 05 00
00 00 90 <0f> 0b 90 48 c7 04 24 00 001
[   56.392303] RSP: 0018:ffffa01188217cd8 EFLAGS: 00010206
[   56.393928] RAX: ffff93a8c2724e00 RBX: ffff93a8c1af2898 RCX: 1ffff27519183e61
[   56.396300] RDX: ffff93a8c2724e0e RSI: 0000000000000000 RDI: ffff93a8c1af2898
[   56.398515] RBP: 00007ffd83c2ffff R08: 00000000ffffffff R09: ffff93a8c8c1f308
[   56.400722] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
[   56.402935] R13: 0000000000000001 R14: ffff93a8c8c1f300 R15: ffff93a8c8c1f308
[   56.405222] FS:  00007ff71a3946c0(0000) GS:ffff93b83a9af000(0000)
knlGS:0000000000000000
[   56.408236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   56.409994] CR2: 00007ff71a393bb0 CR3: 0000000102f84000 CR4: 0000000000750ef0
[   56.412367] PKRU: 55555554
[   56.413231] Call Trace:
[   56.413955]  <TASK>
[   56.414672]  mas_find+0x5c/0x1c0
[   56.415713]  lock_next_vma+0x41/0x4d0
[   56.416869]  get_next_vma+0x91/0xe0
[   56.417954]  do_procmap_query+0x249/0xa90
[   56.419310]  ? do_procmap_query+0x1b8/0xa90
[   56.420591]  procfs_procmap_ioctl+0x20/0x40
[   56.421896]  __x64_sys_ioctl+0x8e/0xe0
[   56.423514]  do_syscall_64+0xbb/0x360
[   56.424715]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   56.426296] RIP: 0033:0x41a7e9
[   56.427254] Code: c0 79 93 eb d5 48 8d 7c 1d 00 eb 99 0f 1f 44 00
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c38
[   56.432893] RSP: 002b:00007ff71a3941f8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[   56.435222] RAX: ffffffffffffffda RBX: 00007ff71a394cdc RCX: 000000000041a7e9
[   56.437475] RDX: 0000200000000180 RSI: 00000000c0686611 RDI: 0000000000000003
[   56.440084] RBP: 00007ff71a394220 R08: 00007ff71a3946c0 R09: 00007ff71a394210
[   56.442345] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffd0
[   56.444545] R13: 0000000000000000 R14: 00007ffd83c4fe30 R15: 00007ffd83c4ff18
[   56.446732]  </TASK>
[   56.447436] Kernel panic - not syncing: kernel: panic_on_warn set ...
[   56.449433] CPU: 46 UID: 0 PID: 5701 Comm: syzbot_repro1 Not
tainted 6.16.0-rc5-00321-g31d640f7b07c-dirty #379 PREEMPT(voluntary)
[   56.453043] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   56.456564] Call Trace:
[   56.457340]  <TASK>
[   56.457969]  dump_stack_lvl+0x5d/0x80
[   56.459188]  ? mas_next_slot+0x510/0x840
[   56.460388]  panic+0x11a/0x2ce
[   56.461353]  ? mas_next_slot+0x552/0x840
[   56.462564]  check_panic_on_warn.cold+0xf/0x1e
[   56.463960]  __warn.cold+0xc3/0x153
[   56.465043]  ? mas_next_slot+0x552/0x840
[   56.466367]  report_bug+0xff/0x140
[   56.467441]  ? mas_next_slot+0x552/0x840
[   56.468993]  handle_bug+0x163/0x1e0
[   56.470236]  exc_invalid_op+0x17/0x70
[   56.471366]  asm_exc_invalid_op+0x1a/0x20
[   56.472629] RIP: 0010:mas_next_slot+0x552/0x840
[   56.474053] Code: 43 38 83 e8 02 83 f8 01 77 c4 e9 e5 fa ff ff 48
8b 43 28 e9 83 fb ff ff 49 8b 06 30 c0 49 39 c6 74 be c7 43 38 05 00
00 00 90 <0f> 0b 90 48 c7 04 24 00 001
[   56.479861] RSP: 0018:ffffa01188217cd8 EFLAGS: 00010206
[   56.481501] RAX: ffff93a8c2724e00 RBX: ffff93a8c1af2898 RCX: 1ffff27519183e61
[   56.483665] RDX: ffff93a8c2724e0e RSI: 0000000000000000 RDI: ffff93a8c1af2898
[   56.486323] RBP: 00007ffd83c2ffff R08: 00000000ffffffff R09: ffff93a8c8c1f308
[   56.488491] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
[   56.490634] R13: 0000000000000001 R14: ffff93a8c8c1f300 R15: ffff93a8c8c1f308
[   56.492856]  mas_find+0x5c/0x1c0
[   56.493888]  lock_next_vma+0x41/0x4d0
[   56.495022]  get_next_vma+0x91/0xe0
[   56.496204]  do_procmap_query+0x249/0xa90
[   56.497515]  ? do_procmap_query+0x1b8/0xa90
[   56.499232]  procfs_procmap_ioctl+0x20/0x40
[   56.500500]  __x64_sys_ioctl+0x8e/0xe0
[   56.501682]  do_syscall_64+0xbb/0x360
[   56.502845]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   56.504408] RIP: 0033:0x41a7e9
[   56.505362] Code: c0 79 93 eb d5 48 8d 7c 1d 00 eb 99 0f 1f 44 00
00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c38
[   56.511066] RSP: 002b:00007ff71a3941f8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[   56.513523] RAX: ffffffffffffffda RBX: 00007ff71a394cdc RCX: 000000000041a7e9
[   56.515977] RDX: 0000200000000180 RSI: 00000000c0686611 RDI: 0000000000000003
[   56.518440] RBP: 00007ff71a394220 R08: 00007ff71a3946c0 R09: 00007ff71a394210
[   56.520643] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffd0
[   56.522884] R13: 0000000000000000 R14: 00007ffd83c4fe30 R15: 00007ffd83c4ff18
[   56.525170]  </TASK>
[   56.527859] Kernel Offset: 0x1a00000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   56.531106] Rebooting in 86400 seconds..

>
> > Thanks,
> > Suren.
> >
> > >
> > > Thanks,
> > > Liam
> > >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-10 17:02                     ` Suren Baghdasaryan
@ 2025-07-10 17:42                       ` Vlastimil Babka
  2025-07-15  8:16                       ` Vlastimil Babka
  1 sibling, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-10 17:42 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Lorenzo Stoakes, akpm, david,
	peterx, jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner,
	josef, yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On 7/10/25 19:02, Suren Baghdasaryan wrote:
> On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>>
>> I have the patchset ready but would like to test it some more. Will
>> post it tomorrow.
> 
> Ok, I found a couple of issues using the syzbot reproducer [1] (which
> is awesome BTW!):
> 1. rwsem_acquire_read() inside vma_start_read() at [2] should be moved
> after the last check, otherwise the lock is considered taken on
> vma->vm_refcnt overflow;
> 2. query_matching_vma() is missing unlock_vma() call when it does
> "goto next_vma;" and re-issues query_vma_find_by_addr(). The previous
> vma is left locked;

How does that happen? query_vma_find_by_addr() does get_next_vma() which
does unlock_vma()?



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY
  2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
                   ` (7 preceding siblings ...)
  2025-07-04  6:07 ` [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
@ 2025-07-15  8:10 ` Lorenzo Stoakes
  8 siblings, 0 replies; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15  8:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, Liam.Howlett, david, vbabka, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

Hi Andrew,

This series is causing syzkaller bugs and needs to be fixed. Suren and
the rest of us are looking into this; could you drop the series for now?

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-10 17:02                     ` Suren Baghdasaryan
  2025-07-10 17:42                       ` Vlastimil Babka
@ 2025-07-15  8:16                       ` Vlastimil Babka
  2025-07-15  9:40                         ` Lorenzo Stoakes
  2025-07-15 20:13                         ` Suren Baghdasaryan
  1 sibling, 2 replies; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-15  8:16 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Lorenzo Stoakes, akpm, david,
	peterx, jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner,
	josef, yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On 7/10/25 19:02, Suren Baghdasaryan wrote:
> On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
>> >
>> > On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>> > >
>> > > * Suren Baghdasaryan <surenb@google.com> [250709 11:06]:
>> > > > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > > > >
>> > > > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
>> > > > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > > > > >>
>> > > > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
>> > > > > >> >>> +     rcu_read_unlock();
>> > > > > >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
>> > > > > >> >>> +     rcu_read_lock();
>> > > > > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
>> > > > > >> >> we lock under mmap lock.
>> > > > > >> > Correct.
>> > > > > >>
>> > > > > >> I wonder if it's really necessary? Can't it be done just inside
>> > > > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
>> > > > > >>
>> > > > > >> Even if we later manage to extend this approach to smaps and employ rcu
>> > > > > >> locking to traverse the page tables, I'd think it's best to separate and
>> > > > > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
>> > > > > >> avoid too long time under the lock.
>> > > > > >
>> > > > > > I thought we would need to be in the same rcu read section while
>> > > > > > traversing the maple tree using vma_next() but now looking at it,
>> > > > > > maybe we can indeed enter only while finding and locking the next
>> > > > > > vma...
>> > > > > > Liam, would that work? I see struct ma_state containing a node field.
>> > > > > > Can it be freed from under us if we find a vma, exit rcu read section
>> > > > > > then re-enter rcu and use the same iterator to find the next vma?
>> > > > >
>> > > > > If the rcu protection needs to be contigous, and patch 8 avoids the issue by
>> > > > > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
>> > > > > the issue or is it why we see the syzbot reports?) then I guess in the code
>> > > > > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
>> > > > > because although the iterator was used briefly under mmap_lock protection,
>> > > > > that was then unlocked and there can be a race before the rcu_read_lock().
>> > > >
>> > > > Quite true. So, let's wait for Liam's confirmation and based on his
>> > > > answer I'll change the patch by either reducing the rcu read section
>> > > > or adding the missing vma_iter_init() after we switch to mmap_lock.
>> > >
>> > > You need to either be under rcu or mmap lock to ensure the node in the
>> > > maple state hasn't been freed (and potentially, reallocated).
>> > >
>> > > So in this case, in the higher level, we can hold the rcu read lock for
>> > > a series of walks and avoid re-walking the tree then the performance
>> > > would be better.
>> >
>> > Got it. Thanks for confirming!
>> >
>> > >
>> > > When we return to userspace, then we should drop the rcu read lock and
>> > > will need to vma_iter_set()/vma_iter_invalidate() on return.  I thought
>> > > this was being done (through vma_iter_init()), but syzbot seems to
>> > > indicate a path that was missed?
>> >
>> > We do that in m_start()/m_stop() by calling
>> > lock_vma_range()/unlock_vma_range() but I think I have two problems
>> > here:
>> > 1. As Vlastimil mentioned I do not reset the iterator when falling
>> > back to mmap_lock and exiting and then re-entering rcu read section;
>> > 2. I do not reset the iterator after exiting rcu read section in
>> > m_stop() and re-entering it in m_start(), so the later call to
>> > lock_next_vma() might be using an iterator with a node that was freed
>> > (and possibly reallocated).
>> >
>> > >
>> > > This is the same thing that needed to be done previously with the mmap
>> > > lock, but now under the rcu lock.
>> > >
>> > > I'm not sure how to mitigate the issue with the page table, maybe we
>> > > guess on the number of vmas that we were doing for 4k blocks of output
>> > > and just drop/reacquire then.  Probably a problem for another day
>> > > anyways.
>> > >
>> > > Also, I think you can also change the vma_iter_init() to vma_iter_set(),
>> > > which is slightly less code under the hood.  Vlastimil asked about this
>> > > and it's probably a better choice.
>> >
>> > Ack.
>> > I'll update my series with these fixes and all comments I received so
>> > far, will run the reproducers to confirm no issues and repost them
>> > later today.
>>
>> I have the patchset ready but would like to test it some more. Will
>> post it tomorrow.
> 
> Ok, I found a couple of issues using the syzbot reproducer [1] (which
> is awesome BTW!):
> 1. rwsem_acquire_read() inside vma_start_read() at [2] should be moved
> after the last check, otherwise the lock is considered taken on
> vma->vm_refcnt overflow;

I think it's fine because if the last check fails there's a
vma_refcount_put() that includes rwsem_release(), no?

> 2. query_matching_vma() is missing unlock_vma() call when it does
> "goto next_vma;" and re-issues query_vma_find_by_addr(). The previous
> vma is left locked;
> 
> [1] https://syzkaller.appspot.com/x/repro.c?x=101edf70580000
> [2] https://elixir.bootlin.com/linux/v6.15.5/source/include/linux/mm.h#L747
> 
> After these fixes it's much harder to fail but I still get one more
> error copied below. I will continue the investigation and will hold
> off reposting until this is fixed. That will be next week since I'll
> be out of town the rest of this week.
> 
> Andrew, could you please remove this patchset from mm-unstable for now
> until I fix the issue and re-post the new version?

Andrew, can you do that please? We keep getting new syzbot reports.

> The error I got after these fixes is:

I suspect the root cause is that the ioctls are not serialized against each
other (probably not even against read()) and yet we treat m->private as safe
to work on. Now we have various fields that are dangerous to race on - for
example, racing on locked_vma and iter would explain a lot of this.

I suspect that as long as we used the pure seq_file workflow, it did the
right thing for us wrt serialization, but the ioctl addition violates that.
We should also recheck, even in the code before this series, whether
dangerous ioctl vs read() races are possible. And the ioctl implementation
should be refactored to use its own per-ioctl-call private context, not the
seq_file's per-file-open context.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15  8:16                       ` Vlastimil Babka
@ 2025-07-15  9:40                         ` Lorenzo Stoakes
  2025-07-15  9:52                           ` David Hildenbrand
  2025-07-15 20:13                         ` Suren Baghdasaryan
  1 sibling, 1 reply; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15  9:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, akpm, david, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Jul 15, 2025 at 10:16:41AM +0200, Vlastimil Babka wrote:
> > Andrew, could you please remove this patchset from mm-unstable for now
> > until I fix the issue and re-post the new version?
>
> Andrew can you do that please? We keep getting new syzbot reports.

I also pinged up top :P just to be extra specially clear...

>
> > The error I got after these fixes is:
>
> I suspect the root cause is the ioctls are not serialized against each other
> (probably not even against read()) and yet we treat m->private as safe to
> work on. Now we have various fields that are dangerous to race on - for
> example locked_vma and iter races would explain a lot of this.
>
> I suspect as long as we used purely seq_file workflow, it did the right
> thing for us wrt serialization, but the ioctl addition violates that. We
> should rather recheck even the code before this series, if dangerous ioctl
> vs read() races are possible. And the ioctl implementation should be
> refactored to use an own per-ioctl-call private context, not the seq_file's
> per-file-open context.

Entirely agree with this analysis. I had a look at the most recent report, see:

https://lore.kernel.org/linux-mm/f13cda37-06a0-4281-87d1-042678a38a6b@lucifer.local/

AFAICT we either have to lock around the ioctl or find a new way of storing
per-ioctl state.

We'd probably need to separate out the procmap query stuff to do that
though. Probably.

I don't think a per-priv lock, say, is necessarily _that_ egregious - are
people _really_ opening this one file and doing multiple parallel queries
all that much?
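
Purely illustrative - query_lock here is a hypothetical new member of
proc_maps_private (no such member exists in the series); everything else
matches the existing procfs_procmap_ioctl():

static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	struct seq_file *seq = file->private_data;
	struct proc_maps_private *priv = seq->private;
	long ret;

	switch (cmd) {
	case PROCMAP_QUERY:
		mutex_lock(&priv->query_lock);	/* hypothetical member */
		ret = do_procmap_query(priv, (void __user *)arg);
		mutex_unlock(&priv->query_lock);
		return ret;
	default:
		return -ENOIOCTLCMD;
	}
}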

Anyway this latest report seems entirely about the procmap stuff.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15  9:40                         ` Lorenzo Stoakes
@ 2025-07-15  9:52                           ` David Hildenbrand
  2025-07-15 10:16                             ` Lorenzo Stoakes
  2025-07-15 10:23                             ` Vlastimil Babka
  0 siblings, 2 replies; 47+ messages in thread
From: David Hildenbrand @ 2025-07-15  9:52 UTC (permalink / raw)
  To: Lorenzo Stoakes, Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, akpm, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On 15.07.25 11:40, Lorenzo Stoakes wrote:
> On Tue, Jul 15, 2025 at 10:16:41AM +0200, Vlastimil Babka wrote:
>>> Andrew, could you please remove this patchset from mm-unstable for now
>>> until I fix the issue and re-post the new version?
>>
>> Andrew can you do that please? We keep getting new syzbot reports.
> 
> I also pinged up top :P just to be extra specially clear...
> 
>>
>>> The error I got after these fixes is:
>>
>> I suspect the root cause is the ioctls are not serialized against each other
>> (probably not even against read()) and yet we treat m->private as safe to
>> work on. Now we have various fields that are dangerous to race on - for
>> example locked_vma and iter races would explain a lot of this.
>>
>> I suspect as long as we used purely seq_file workflow, it did the right
>> thing for us wrt serialization, but the ioctl addition violates that. We
>> should rather recheck even the code before this series, if dangerous ioctl
>> vs read() races are possible. And the ioctl implementation should be
>> refactored to use an own per-ioctl-call private context, not the seq_file's
>> per-file-open context.
> 
> Entirely agree with this analysis. I had a look at most recent report, see:
> 
> https://lore.kernel.org/linux-mm/f13cda37-06a0-4281-87d1-042678a38a6b@lucifer.local/
> 
> AFAICT we either have to lock around the ioctl or find a new way of storing
> per-ioctl state.
> 
> We'd probably need to separate out the procmap query stuff to do that
> though. Probably.

When I skimmed that series the first time, I was wondering "why are we even
caring about PROCMAP_QUERY in the context of this patch series".

Maybe that helps :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15  9:52                           ` David Hildenbrand
@ 2025-07-15 10:16                             ` Lorenzo Stoakes
  2025-07-15 10:23                             ` Vlastimil Babka
  1 sibling, 0 replies; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 10:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett, akpm,
	peterx, jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner,
	josef, yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Jul 15, 2025 at 11:52:49AM +0200, David Hildenbrand wrote:
> On 15.07.25 11:40, Lorenzo Stoakes wrote:
> > On Tue, Jul 15, 2025 at 10:16:41AM +0200, Vlastimil Babka wrote:
> > > > Andrew, could you please remove this patchset from mm-unstable for now
> > > > until I fix the issue and re-post the new version?
> > >
> > > Andrew can you do that please? We keep getting new syzbot reports.
> >
> > I also pinged up top :P just to be extra specially clear...
> >
> > >
> > > > The error I got after these fixes is:
> > >
> > > I suspect the root cause is the ioctls are not serialized against each other
> > > (probably not even against read()) and yet we treat m->private as safe to
> > > work on. Now we have various fields that are dangerous to race on - for
> > > example locked_vma and iter races would explain a lot of this.
> > >
> > > I suspect as long as we used purely seq_file workflow, it did the right
> > > thing for us wrt serialization, but the ioctl addition violates that. We
> > > should rather recheck even the code before this series, if dangerous ioctl
> > > vs read() races are possible. And the ioctl implementation should be
> > > refactored to use an own per-ioctl-call private context, not the seq_file's
> > > per-file-open context.
> >
> > Entirely agree with this analysis. I had a look at most recent report, see:
> >
> > https://lore.kernel.org/linux-mm/f13cda37-06a0-4281-87d1-042678a38a6b@lucifer.local/
> >
> > AFAICT we either have to lock around the ioctl or find a new way of storing
> > per-ioctl state.
> >
> > We'd probably need to separate out the procmap query stuff to do that
> > though. Probably.
>
> When I skimmed that series the first time, I was wondering "why are we even
> caring about PROCMAP_QUERY that in the context of this patch series".
>
> Maybe that helps :)

Haha well I think it's _still useful_ for avoiding contention on the mmap
lock. But we probably just need to bite the bullet and lock per-fd for this.

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15  9:52                           ` David Hildenbrand
  2025-07-15 10:16                             ` Lorenzo Stoakes
@ 2025-07-15 10:23                             ` Vlastimil Babka
  2025-07-15 10:31                               ` Lorenzo Stoakes
  1 sibling, 1 reply; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-15 10:23 UTC (permalink / raw)
  To: David Hildenbrand, Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Liam R. Howlett, akpm, peterx, jannh, hannes,
	mhocko, paulmck, shuah, adobriyan, brauner, josef, yebin10, linux,
	willy, osalvador, andrii, ryan.roberts, christophe.leroy,
	tjmercier, kaleshsingh, aha310510, linux-kernel, linux-fsdevel,
	linux-mm, linux-kselftest

On 7/15/25 11:52, David Hildenbrand wrote:
> On 15.07.25 11:40, Lorenzo Stoakes wrote:
>> On Tue, Jul 15, 2025 at 10:16:41AM +0200, Vlastimil Babka wrote:
>>>> Andrew, could you please remove this patchset from mm-unstable for now
>>>> until I fix the issue and re-post the new version?
>>>
>>> Andrew can you do that please? We keep getting new syzbot reports.
>> 
>> I also pinged up top :P just to be extra specially clear...
>> 
>>>
>>>> The error I got after these fixes is:
>>>
>>> I suspect the root cause is the ioctls are not serialized against each other
>>> (probably not even against read()) and yet we treat m->private as safe to
>>> work on. Now we have various fields that are dangerous to race on - for
>>> example locked_vma and iter races would explain a lot of this.
>>>
>>> I suspect as long as we used purely seq_file workflow, it did the right
>>> thing for us wrt serialization, but the ioctl addition violates that. We
>>> should rather recheck even the code before this series, if dangerous ioctl
>>> vs read() races are possible. And the ioctl implementation should be
>>> refactored to use an own per-ioctl-call private context, not the seq_file's
>>> per-file-open context.
>> 
>> Entirely agree with this analysis. I had a look at most recent report, see:
>> 
>> https://lore.kernel.org/linux-mm/f13cda37-06a0-4281-87d1-042678a38a6b@lucifer.local/
>> 
>> AFAICT we either have to lock around the ioctl or find a new way of storing
>> per-ioctl state.
>> 
>> We'd probably need to separate out the procmap query stuff to do that
>> though. Probably.
> 
> When I skimmed that series the first time, I was wondering "why are we 
> even caring about PROCMAP_QUERY that in the context of this patch series".
> 
> Maybe that helps :)

Yeah, it seems like before patch 8/8 the ioctl handling, specifically
do_procmap_query(), only looks at priv->mm and nothing else, so it should be
safe as that's a stable value.

So it should also be enough to drop just the last patch from mm for now, not
the whole series.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 10:23                             ` Vlastimil Babka
@ 2025-07-15 10:31                               ` Lorenzo Stoakes
  2025-07-15 10:51                                 ` Lorenzo Stoakes
  2025-07-15 17:05                                 ` Andrii Nakryiko
  0 siblings, 2 replies; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 10:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, Suren Baghdasaryan, Liam R. Howlett, akpm,
	peterx, jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner,
	josef, yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Jul 15, 2025 at 12:23:31PM +0200, Vlastimil Babka wrote:
> On 7/15/25 11:52, David Hildenbrand wrote:
> > On 15.07.25 11:40, Lorenzo Stoakes wrote:
> >> On Tue, Jul 15, 2025 at 10:16:41AM +0200, Vlastimil Babka wrote:
> >>>> Andrew, could you please remove this patchset from mm-unstable for now
> >>>> until I fix the issue and re-post the new version?
> >>>
> >>> Andrew can you do that please? We keep getting new syzbot reports.
> >>
> >> I also pinged up top :P just to be extra specially clear...
> >>
> >>>
> >>>> The error I got after these fixes is:
> >>>
> >>> I suspect the root cause is the ioctls are not serialized against each other
> >>> (probably not even against read()) and yet we treat m->private as safe to
> >>> work on. Now we have various fields that are dangerous to race on - for
> >>> example locked_vma and iter races would explain a lot of this.
> >>>
> >>> I suspect as long as we used purely seq_file workflow, it did the right
> >>> thing for us wrt serialization, but the ioctl addition violates that. We
> >>> should rather recheck even the code before this series, if dangerous ioctl
> >>> vs read() races are possible. And the ioctl implementation should be
> >>> refactored to use an own per-ioctl-call private context, not the seq_file's
> >>> per-file-open context.
> >>
> >> Entirely agree with this analysis. I had a look at most recent report, see:
> >>
> >> https://lore.kernel.org/linux-mm/f13cda37-06a0-4281-87d1-042678a38a6b@lucifer.local/
> >>
> >> AFAICT we either have to lock around the ioctl or find a new way of storing
> >> per-ioctl state.
> >>
> >> We'd probably need to separate out the procmap query stuff to do that
> >> though. Probably.
> >
> > When I skimmed that series the first time, I was wondering "why are we
> > even caring about PROCMAP_QUERY that in the context of this patch series".
> >
> > Maybe that helps :)
>
> Yeah seems like before patch 8/8 the ioctl handling, specifically
> do_procmap_query() only looks at priv->mm and nothing else so it should be
> safe as that's a stable value.
>
> So it should be also enough to drop the last patch from mm for now, not
> whole series.

Yeah to save the mothership we can ditch the landing craft :P

Maybe worth doing that, and figuring out in a follow-up how to fix this.

Or we could just sling in a cheeky spinlock


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 10:31                               ` Lorenzo Stoakes
@ 2025-07-15 10:51                                 ` Lorenzo Stoakes
  2025-07-15 17:05                                 ` Andrii Nakryiko
  1 sibling, 0 replies; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 10:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, Suren Baghdasaryan, Liam R. Howlett, akpm,
	peterx, jannh, hannes, mhocko, paulmck, shuah, adobriyan, brauner,
	josef, yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Jul 15, 2025 at 11:31:23AM +0100, Lorenzo Stoakes wrote:
> Or we could just sling in a cheeky spinlock

As politely pointed out to me by Vlasta off-list: s/spinlock/mutex/, since
we apparently allocate for anon_vma_name :'(.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 10:31                               ` Lorenzo Stoakes
  2025-07-15 10:51                                 ` Lorenzo Stoakes
@ 2025-07-15 17:05                                 ` Andrii Nakryiko
  2025-07-15 17:10                                   ` Lorenzo Stoakes
  1 sibling, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2025-07-15 17:05 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, David Hildenbrand, Suren Baghdasaryan,
	Liam R. Howlett, akpm, peterx, jannh, hannes, mhocko, paulmck,
	shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Tue, Jul 15, 2025 at 3:31 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Jul 15, 2025 at 12:23:31PM +0200, Vlastimil Babka wrote:
> > On 7/15/25 11:52, David Hildenbrand wrote:
> > > On 15.07.25 11:40, Lorenzo Stoakes wrote:
> > >> On Tue, Jul 15, 2025 at 10:16:41AM +0200, Vlastimil Babka wrote:
> > >>>> Andrew, could you please remove this patchset from mm-unstable for now
> > >>>> until I fix the issue and re-post the new version?
> > >>>
> > >>> Andrew can you do that please? We keep getting new syzbot reports.
> > >>
> > >> I also pinged up top :P just to be extra specially clear...
> > >>
> > >>>
> > >>>> The error I got after these fixes is:
> > >>>
> > >>> I suspect the root cause is the ioctls are not serialized against each other
> > >>> (probably not even against read()) and yet we treat m->private as safe to
> > >>> work on. Now we have various fields that are dangerous to race on - for
> > >>> example locked_vma and iter races would explain a lot of this.
> > >>>
> > >>> I suspect as long as we used purely seq_file workflow, it did the right
> > >>> thing for us wrt serialization, but the ioctl addition violates that. We
> > >>> should rather recheck even the code before this series, if dangerous ioctl
> > >>> vs read() races are possible. And the ioctl implementation should be
> > >>> refactored to use an own per-ioctl-call private context, not the seq_file's
> > >>> per-file-open context.
> > >>
> > >> Entirely agree with this analysis. I had a look at most recent report, see:
> > >>
> > >> https://lore.kernel.org/linux-mm/f13cda37-06a0-4281-87d1-042678a38a6b@lucifer.local/
> > >>
> > >> AFAICT we either have to lock around the ioctl or find a new way of storing
> > >> per-ioctl state.
> > >>
> > >> We'd probably need to separate out the procmap query stuff to do that
> > >> though. Probably.
> > >
> > > When I skimmed that series the first time, I was wondering "why are we
> > > even caring about PROCMAP_QUERY that in the context of this patch series".
> > >
> > > Maybe that helps :)
> >
> > Yeah seems like before patch 8/8 the ioctl handling, specifically
> > do_procmap_query() only looks at priv->mm and nothing else so it should be
> > safe as that's a stable value.
> >
> > So it should be also enough to drop the last patch from mm for now, not
> > whole series.
>
> Yeah to save the mothership we can ditch the landing craft :P
>
> Maybe worth doing that, and figure out in a follow up how to fix this.

For PROCMAP_QUERY, we need priv->mm, but the newly added mmap_locked
and locked_vma don't need to be persisted between ioctl calls. So we
can just add those two fields into a small struct, keep it in priv for
the seq_file case, but for PROCMAP_QUERY just have it on the stack.
The code can be written to accept this struct to maintain the state,
which for the PROCMAP_QUERY ioctl will be a very short-lived on-stack
one.

Would that work?

>
> Or we could just sling in a cheeky spinlock


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 17:05                                 ` Andrii Nakryiko
@ 2025-07-15 17:10                                   ` Lorenzo Stoakes
  2025-07-15 17:20                                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 17:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Vlastimil Babka, David Hildenbrand, Suren Baghdasaryan,
	Liam R. Howlett, akpm, peterx, jannh, hannes, mhocko, paulmck,
	shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Tue, Jul 15, 2025 at 10:05:49AM -0700, Andrii Nakryiko wrote:
> On Tue, Jul 15, 2025 at 3:31 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Jul 15, 2025 at 12:23:31PM +0200, Vlastimil Babka wrote:
> > > On 7/15/25 11:52, David Hildenbrand wrote:
> > > > On 15.07.25 11:40, Lorenzo Stoakes wrote:
> > > >> On Tue, Jul 15, 2025 at 10:16:41AM +0200, Vlastimil Babka wrote:
> > > >>>> Andrew, could you please remove this patchset from mm-unstable for now
> > > >>>> until I fix the issue and re-post the new version?
> > > >>>
> > > >>> Andrew can you do that please? We keep getting new syzbot reports.
> > > >>
> > > >> I also pinged up top :P just to be extra specially clear...
> > > >>
> > > >>>
> > > >>>> The error I got after these fixes is:
> > > >>>
> > > >>> I suspect the root cause is the ioctls are not serialized against each other
> > > >>> (probably not even against read()) and yet we treat m->private as safe to
> > > >>> work on. Now we have various fields that are dangerous to race on - for
> > > >>> example locked_vma and iter races would explain a lot of this.
> > > >>>
> > > >>> I suspect as long as we used purely seq_file workflow, it did the right
> > > >>> thing for us wrt serialization, but the ioctl addition violates that. We
> > > >>> should rather recheck even the code before this series, if dangerous ioctl
> > > >>> vs read() races are possible. And the ioctl implementation should be
> > > >>> refactored to use an own per-ioctl-call private context, not the seq_file's
> > > >>> per-file-open context.
> > > >>
> > > >> Entirely agree with this analysis. I had a look at most recent report, see:
> > > >>
> > > >> https://lore.kernel.org/linux-mm/f13cda37-06a0-4281-87d1-042678a38a6b@lucifer.local/
> > > >>
> > > >> AFAICT we either have to lock around the ioctl or find a new way of storing
> > > >> per-ioctl state.
> > > >>
> > > >> We'd probably need to separate out the procmap query stuff to do that
> > > >> though. Probably.
> > > >
> > > > When I skimmed that series the first time, I was wondering "why are we
> > > > even caring about PROCMAP_QUERY that in the context of this patch series".
> > > >
> > > > Maybe that helps :)
> > >
> > > Yeah seems like before patch 8/8 the ioctl handling, specifically
> > > do_procmap_query() only looks at priv->mm and nothing else so it should be
> > > safe as that's a stable value.
> > >
> > > So it should be also enough to drop the last patch from mm for now, not
> > > whole series.
> >
> > Yeah to save the mothership we can ditch the landing craft :P
> >
> > Maybe worth doing that, and figure out in a follow up how to fix this.
>
> For PROCMAP_QUERY, we need priv->mm, but the newly added locked_vma
> and locked_vma don't need to be persisted between ioctl calls. So we
> can just add those two fields into a small struct, and for seq_file
> case have it in priv, but for PROCMAP_QUERY just have it on the stack.
> The code can be written to accept this struct to maintain the state,
> which for PROCMAP_QUERY ioctl will be very short-lived on the stack
> one.
>
> Would that work?

Yeah that's a great idea actually, the stack would obviously give us the
per-query invocation thing. Nice!

I am kicking myself because I jokingly suggested (off-list) that a helper
struct would be the answer to everything (I do love them) and of
course... here we are :P


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 17:10                                   ` Lorenzo Stoakes
@ 2025-07-15 17:20                                     ` Lorenzo Stoakes
  2025-07-15 17:29                                       ` Andrii Nakryiko
  0 siblings, 1 reply; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 17:20 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Vlastimil Babka, David Hildenbrand, Suren Baghdasaryan,
	Liam R. Howlett, akpm, peterx, jannh, hannes, mhocko, paulmck,
	shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Tue, Jul 15, 2025 at 06:10:16PM +0100, Lorenzo Stoakes wrote:
> > For PROCMAP_QUERY, we need priv->mm, but the newly added locked_vma
> > and locked_vma don't need to be persisted between ioctl calls. So we
> > can just add those two fields into a small struct, and for seq_file
> > case have it in priv, but for PROCMAP_QUERY just have it on the stack.
> > The code can be written to accept this struct to maintain the state,
> > which for PROCMAP_QUERY ioctl will be very short-lived on the stack
> > one.
> >
> > Would that work?
>
> Yeah that's a great idea actually, the stack would obviously give us the
> per-query invocation thing. Nice!
>
> I am kicking myself because I jokingly suggested (off-list) that a helper
> struct would be the answer to everything (I do love them) and of
> course... here we are :P

Hm, but actually we'd have to invert things, I think. What I mean is: since
these fields can be updated at any time by racing threads, we can't have
_anything_ mutable in the priv struct.

So instead we should do something like:

struct proc_maps_state {
	const struct proc_maps_private *priv;
	bool mmap_locked;
	struct vm_area_struct *locked_vma;
	struct vma_iterator iter;
	loff_t last_pos;
};

static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
	struct seq_file *seq = file->private_data;
	struct proc_maps_private *priv = seq->private;
	struct proc_maps_state state = {
		.priv = priv,
	};

	switch (cmd) {
	case PROCMAP_QUERY:
		return do_procmap_query(state, (void __user *)arg);
	default:
		return -ENOIOCTLCMD;
	}
}

And then we have a stack-based thing with the bits that change and a
read-only pointer to the bits that must remain static. And we can enforce
that with const...

We'd have to move the VMI and last_pos out of priv too to make it const.

Anyway the general idea should work!
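
Continuing the sketch above, do_procmap_query() would then take a pointer
to the stack state (the dispatch would pass &state rather than state, so
the helpers can actually update mmap_locked/locked_vma). This is only an
illustration of the shape, not the actual follow-up patch, and it skips
the versioned handling the real ioctl does for struct procmap_query:

static int do_procmap_query(struct proc_maps_state *state, void __user *uarg)
{
	struct procmap_query karg;
	struct mm_struct *mm = state->priv->mm;	/* priv stays read-only */

	/* simplified: the real ioctl copies struct procmap_query by version/size */
	if (copy_from_user(&karg, uarg, sizeof(karg)))
		return -EFAULT;

	/*
	 * ... look up and per-vma-lock the VMA matching karg.query_addr,
	 * tracking it in state->locked_vma / state->mmap_locked, then fill
	 * karg from the VMA before unlocking ...
	 */

	if (copy_to_user(uarg, &karg, sizeof(karg)))
		return -EFAULT;
	return 0;
}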


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 17:20                                     ` Lorenzo Stoakes
@ 2025-07-15 17:29                                       ` Andrii Nakryiko
  2025-07-15 20:18                                         ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Andrii Nakryiko @ 2025-07-15 17:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, David Hildenbrand, Suren Baghdasaryan,
	Liam R. Howlett, akpm, peterx, jannh, hannes, mhocko, paulmck,
	shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Tue, Jul 15, 2025 at 10:21 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Jul 15, 2025 at 06:10:16PM +0100, Lorenzo Stoakes wrote:
> > > For PROCMAP_QUERY, we need priv->mm, but the newly added locked_vma
> > > and locked_vma don't need to be persisted between ioctl calls. So we
> > > can just add those two fields into a small struct, and for seq_file
> > > case have it in priv, but for PROCMAP_QUERY just have it on the stack.
> > > The code can be written to accept this struct to maintain the state,
> > > which for PROCMAP_QUERY ioctl will be very short-lived on the stack
> > > one.
> > >
> > > Would that work?
> >
> > Yeah that's a great idea actually, the stack would obviously give us the
> > per-query invocation thing. Nice!
> >
> > I am kicking myself because I jokingly suggested (off-list) that a helper
> > struct would be the answer to everything (I do love them) and of
> > course... here we are :P
>
> Hm but actually we'd have to invert things I think, what I mean is - since
> these fields can be updated at any time by racing threads, we can't have
> _anything_ in the priv struct that is mutable.
>

Exactly, and I guess I was just being incomplete by listing only two
of the fields that Suren makes use of in PROCMAP_QUERY. See below.

> So instead we should do something like:
>
> struct proc_maps_state {
>         const struct proc_maps_private *priv;
>         bool mmap_locked;
>         struct vm_area_struct *locked_vma;
>         struct vma_iterator iter;
>         loff_t last_pos;
> };
>
> static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> {
>         struct seq_file *seq = file->private_data;
>         struct proc_maps_private *priv = seq->private;
>         struct proc_maps_state state = {
>                 .priv = priv,
>         };
>
>         switch (cmd) {
>         case PROCMAP_QUERY:
>                 return do_procmap_query(state, (void __user *)arg);

I guess it's a matter of preference, but I'd actually just pass
seq->priv->mm and arg into do_procmap_query(), which will make it
super obvious that priv is not used or mutated. All the new stuff
that Suren needs for lockless VMA iteration, including iter (not sure
PROCMAP_QUERY needs last_pos, tbh), I'd just put into this new struct,
which do_procmap_query() can keep private to itself.

Ultimately, I think we are on the same page, it's just a matter of
structuring code and types.
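
A rough sketch of that alternative shape, with the state kept entirely
local to do_procmap_query() (the anonymous struct and its fields are
assumptions for illustration):

static int do_procmap_query(struct mm_struct *mm, void __user *uarg)
{
	/* hypothetical per-call state, never stored in the seq_file private data */
	struct {
		bool mmap_locked;
		struct vm_area_struct *locked_vma;
		struct vma_iterator iter;
	} state = {};

	/*
	 * ... vma_iter_init(&state.iter, mm, addr); find and per-vma-lock
	 * the matching VMA, tracking the lock in state, report it, unlock ...
	 */
	return 0;
}

and the dispatch in procfs_procmap_ioctl() shrinks to
return do_procmap_query(priv->mm, (void __user *)arg);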

>         default:
>                 return -ENOIOCTLCMD;
>         }
> }
>
> And then we have a stack-based thing with the bits that change and a
> read-only pointer to the bits that must remain static. And we can enforce
> that with const...
>
> We'd have to move the VMI and last_pos out too to make it const.
>
> Anyway the general idea should work!


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15  8:16                       ` Vlastimil Babka
  2025-07-15  9:40                         ` Lorenzo Stoakes
@ 2025-07-15 20:13                         ` Suren Baghdasaryan
  2025-07-16 14:00                           ` Lorenzo Stoakes
  1 sibling, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-15 20:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Lorenzo Stoakes, akpm, david, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Jul 15, 2025 at 1:16 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/10/25 19:02, Suren Baghdasaryan wrote:
> > On Thu, Jul 10, 2025 at 12:03 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Wed, Jul 9, 2025 at 10:47 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >> >
> >> > On Wed, Jul 9, 2025 at 4:12 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >> > >
> >> > > * Suren Baghdasaryan <surenb@google.com> [250709 11:06]:
> >> > > > On Wed, Jul 9, 2025 at 3:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> > > > >
> >> > > > > On 7/9/25 16:43, Suren Baghdasaryan wrote:
> >> > > > > > On Wed, Jul 9, 2025 at 1:57 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >> > > > > >>
> >> > > > > >> On 7/8/25 01:10, Suren Baghdasaryan wrote:
> >> > > > > >> >>> +     rcu_read_unlock();
> >> > > > > >> >>> +     vma = lock_vma_under_mmap_lock(mm, iter, address);
> >> > > > > >> >>> +     rcu_read_lock();
> >> > > > > >> >> OK I guess we hold the RCU lock the whole time as we traverse except when
> >> > > > > >> >> we lock under mmap lock.
> >> > > > > >> > Correct.
> >> > > > > >>
> >> > > > > >> I wonder if it's really necessary? Can't it be done just inside
> >> > > > > >> lock_next_vma()? It would also avoid the unlock/lock dance quoted above.
> >> > > > > >>
> >> > > > > >> Even if we later manage to extend this approach to smaps and employ rcu
> >> > > > > >> locking to traverse the page tables, I'd think it's best to separate and
> >> > > > > >> fine-grain the rcu lock usage for vma iterator and page tables, if only to
> >> > > > > >> avoid too long time under the lock.
> >> > > > > >
> >> > > > > > I thought we would need to be in the same rcu read section while
> >> > > > > > traversing the maple tree using vma_next() but now looking at it,
> >> > > > > > maybe we can indeed enter only while finding and locking the next
> >> > > > > > vma...
> >> > > > > > Liam, would that work? I see struct ma_state containing a node field.
> >> > > > > > Can it be freed from under us if we find a vma, exit rcu read section
> >> > > > > > then re-enter rcu and use the same iterator to find the next vma?
> >> > > > >
> >> > > > > If the rcu protection needs to be contigous, and patch 8 avoids the issue by
> >> > > > > always doing vma_iter_init() after rcu_read_lock() (but does it really avoid
> >> > > > > the issue or is it why we see the syzbot reports?) then I guess in the code
> >> > > > > quoted above we also need a vma_iter_init() after the rcu_read_lock(),
> >> > > > > because although the iterator was used briefly under mmap_lock protection,
> >> > > > > that was then unlocked and there can be a race before the rcu_read_lock().
> >> > > >
> >> > > > Quite true. So, let's wait for Liam's confirmation and based on his
> >> > > > answer I'll change the patch by either reducing the rcu read section
> >> > > > or adding the missing vma_iter_init() after we switch to mmap_lock.
> >> > >
> >> > > You need to either be under rcu or mmap lock to ensure the node in the
> >> > > maple state hasn't been freed (and potentially, reallocated).
> >> > >
> >> > > So in this case, in the higher level, we can hold the rcu read lock for
> >> > > a series of walks and avoid re-walking the tree then the performance
> >> > > would be better.
> >> >
> >> > Got it. Thanks for confirming!
> >> >
> >> > >
> >> > > When we return to userspace, then we should drop the rcu read lock and
> >> > > will need to vma_iter_set()/vma_iter_invalidate() on return.  I thought
> >> > > this was being done (through vma_iter_init()), but syzbot seems to
> >> > > indicate a path that was missed?
> >> >
> >> > We do that in m_start()/m_stop() by calling
> >> > lock_vma_range()/unlock_vma_range() but I think I have two problems
> >> > here:
> >> > 1. As Vlastimil mentioned I do not reset the iterator when falling
> >> > back to mmap_lock and exiting and then re-entering rcu read section;
> >> > 2. I do not reset the iterator after exiting rcu read section in
> >> > m_stop() and re-entering it in m_start(), so the later call to
> >> > lock_next_vma() might be using an iterator with a node that was freed
> >> > (and possibly reallocated).
> >> >
> >> > >
> >> > > This is the same thing that needed to be done previously with the mmap
> >> > > lock, but now under the rcu lock.
> >> > >
> >> > > I'm not sure how to mitigate the issue with the page table, maybe we
> >> > > guess on the number of vmas that we were doing for 4k blocks of output
> >> > > and just drop/reacquire then.  Probably a problem for another day
> >> > > anyways.
> >> > >
> >> > > Also, I think you can also change the vma_iter_init() to vma_iter_set(),
> >> > > which is slightly less code under the hood.  Vlastimil asked about this
> >> > > and it's probably a better choice.
> >> >
> >> > Ack.
> >> > I'll update my series with these fixes and all comments I received so
> >> > far, will run the reproducers to confirm no issues and repost them
> >> > later today.
> >>
> >> I have the patchset ready but would like to test it some more. Will
> >> post it tomorrow.
> >
> > Ok, I found a couple of issues using the syzbot reproducer [1] (which
> > is awesome BTW!):
> > 1. rwsem_acquire_read() inside vma_start_read() at [2] should be moved
> > after the last check, otherwise the lock is considered taken on
> > vma->vm_refcnt overflow;
>
> I think it's fine because if the last check fails there's a
> vma_refcount_put() that includes rwsem_release(), no?

Ah, yes, you are right. This is fine. Obviously trying to figure out
the issue right before a flight is not a good idea :)

>
> > 2. query_matching_vma() is missing unlock_vma() call when it does
> > "goto next_vma;" and re-issues query_vma_find_by_addr(). The previous
> > vma is left locked;
> >
> > [1] https://syzkaller.appspot.com/x/repro.c?x=101edf70580000
> > [2] https://elixir.bootlin.com/linux/v6.15.5/source/include/linux/mm.h#L747
> >
> > After these fixes it's much harder to fail but I still get one more
> > error copied below. I will continue the investigation and will hold
> > off reposting until this is fixed. That will be next week since I'll
> > be out of town the rest of this week.
> >
> > Andrew, could you please remove this patchset from mm-unstable for now
> > until I fix the issue and re-post the new version?
>
> Andrew can you do that please? We keep getting new syzbot reports.
>
> > The error I got after these fixes is:
>
> I suspect the root cause is the ioctls are not serialized against each other
> (probably not even against read()) and yet we treat m->private as safe to
> work on. Now we have various fields that are dangerous to race on - for
> example locked_vma and iter races would explain a lot of this.
>
> I suspect as long as we used purely seq_file workflow, it did the right
> thing for us wrt serialization, but the ioctl addition violates that. We
> should rather recheck even the code before this series, if dangerous ioctl
> vs read() races are possible. And the ioctl implementation should be
> refactored to use an own per-ioctl-call private context, not the seq_file's
> per-file-open context.

Huh, I completely failed to consider this. In hindsight it is quite
obvious... Thanks Vlastimil, I owe you a beer or two.
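
Stepping back to the iterator handling quoted earlier in this message, the
reset pattern being discussed would look roughly like this (a sketch only;
the wrapper name is made up, and the lock_vma_under_mmap_lock() signature
is assumed from the quoted snippet):

static struct vm_area_struct *fallback_to_mmap_lock(struct mm_struct *mm,
						    struct vma_iterator *iter,
						    unsigned long address)
{
	struct vm_area_struct *vma;

	rcu_read_unlock();
	vma = lock_vma_under_mmap_lock(mm, iter, address);
	rcu_read_lock();
	/*
	 * The maple node cached by the iterator may have been freed (and
	 * reallocated) while we were outside the rcu read section, so
	 * revalidate it before the next lock_next_vma() call.
	 */
	vma_iter_set(iter, address);

	return vma;
}

The same reset is needed in m_start() whenever m_stop() dropped the rcu
read section between reads.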


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 17:29                                       ` Andrii Nakryiko
@ 2025-07-15 20:18                                         ` Suren Baghdasaryan
  2025-07-16  1:50                                           ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-15 20:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Lorenzo Stoakes, Vlastimil Babka, David Hildenbrand,
	Liam R. Howlett, akpm, peterx, jannh, hannes, mhocko, paulmck,
	shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Tue, Jul 15, 2025 at 10:29 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Jul 15, 2025 at 10:21 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Jul 15, 2025 at 06:10:16PM +0100, Lorenzo Stoakes wrote:
> > > > For PROCMAP_QUERY, we need priv->mm, but the newly added locked_vma
> > > > and locked_vma don't need to be persisted between ioctl calls. So we
> > > > can just add those two fields into a small struct, and for seq_file
> > > > case have it in priv, but for PROCMAP_QUERY just have it on the stack.
> > > > The code can be written to accept this struct to maintain the state,
> > > > which for PROCMAP_QUERY ioctl will be very short-lived on the stack
> > > > one.
> > > >
> > > > Would that work?
> > >
> > > Yeah that's a great idea actually, the stack would obviously give us the
> > > per-query invocation thing. Nice!
> > >
> > > I am kicking myself because I jokingly suggested (off-list) that a helper
> > > struct would be the answer to everything (I do love them) and of
> > > course... here we are :P
> >
> > Hm but actually we'd have to invert things I think, what I mean is - since
> > these fields can be updated at any time by racing threads, we can't have
> > _anything_ in the priv struct that is mutable.
> >
>
> Exactly, and I guess I was just being incomplete with just listing two
> of the fields that Suren make use of in PROCMAP_QUERY. See below.
>
> > So instead we should do something like:
> >
> > struct proc_maps_state {
> >         const struct proc_maps_private *priv;
> >         bool mmap_locked;
> >         struct vm_area_struct *locked_vma;
> >         struct vma_iterator iter;
> >         loff_t last_pos;
> > };
> >
> > static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > {
> >         struct seq_file *seq = file->private_data;
> >         struct proc_maps_private *priv = seq->private;
> >         struct proc_maps_state state = {
> >                 .priv = priv,
> >         };
> >
> >         switch (cmd) {
> >         case PROCMAP_QUERY:
> >                 return do_procmap_query(state, (void __user *)arg);
>
> I guess it's a matter of preference, but I'd actually just pass
> seq->priv->mm and arg into do_procmap_query(), which will make it
> super obvious that priv is not used or mutated, and all the new stuff
> that Suren needs for lockless VMA iteration, including iter (not sure
> PROCMAP_QUERY needs last_pos, tbh), I'd just put into this new struct,
> which do_procmap_query() can keep private to itself.
>
> Ultimately, I think we are on the same page, it's just a matter of
> structuring code and types.

That sounds cleaner to me too.
I'll post a new version of my patchset today without the last patch to
keep PROCMAP_QUERY changes separate, and then a follow-up patch that
does this refactoring and changes PROCMAP_QUERY to use per-vma locks.

Thanks folks! It's good to be back from vacation with the problem
already figured out for me :)

>
> >         default:
> >                 return -ENOIOCTLCMD;
> >         }
> > }
> >
> > And then we have a stack-based thing with the bits that change and a
> > read-only pointer to the bits that must remain static. And we can enforce
> > that with const...
> >
> > We'd have to move the VMI and last_pos out too to make it const.
> >
> > Anyway the general idea should work!


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 20:18                                         ` Suren Baghdasaryan
@ 2025-07-16  1:50                                           ` Suren Baghdasaryan
  0 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-16  1:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Lorenzo Stoakes, Vlastimil Babka, David Hildenbrand,
	Liam R. Howlett, akpm, peterx, jannh, hannes, mhocko, paulmck,
	shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On Tue, Jul 15, 2025 at 1:18 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Jul 15, 2025 at 10:29 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Jul 15, 2025 at 10:21 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, Jul 15, 2025 at 06:10:16PM +0100, Lorenzo Stoakes wrote:
> > > > > For PROCMAP_QUERY, we need priv->mm, but the newly added locked_vma
> > > > > and locked_vma don't need to be persisted between ioctl calls. So we
> > > > > can just add those two fields into a small struct, and for seq_file
> > > > > case have it in priv, but for PROCMAP_QUERY just have it on the stack.
> > > > > The code can be written to accept this struct to maintain the state,
> > > > > which for PROCMAP_QUERY ioctl will be very short-lived on the stack
> > > > > one.
> > > > >
> > > > > Would that work?
> > > >
> > > > Yeah that's a great idea actually, the stack would obviously give us the
> > > > per-query invocation thing. Nice!
> > > >
> > > > I am kicking myself because I jokingly suggested (off-list) that a helper
> > > > struct would be the answer to everything (I do love them) and of
> > > > course... here we are :P
> > >
> > > Hm but actually we'd have to invert things I think, what I mean is - since
> > > these fields can be updated at any time by racing threads, we can't have
> > > _anything_ in the priv struct that is mutable.
> > >
> >
> > Exactly, and I guess I was just being incomplete with just listing two
> > of the fields that Suren make use of in PROCMAP_QUERY. See below.
> >
> > > So instead we should do something like:
> > >
> > > struct proc_maps_state {
> > >         const struct proc_maps_private *priv;
> > >         bool mmap_locked;
> > >         struct vm_area_struct *locked_vma;
> > >         struct vma_iterator iter;
> > >         loff_t last_pos;
> > > };
> > >
> > > static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > > {
> > >         struct seq_file *seq = file->private_data;
> > >         struct proc_maps_private *priv = seq->private;
> > >         struct proc_maps_state state = {
> > >                 .priv = priv,
> > >         };
> > >
> > >         switch (cmd) {
> > >         case PROCMAP_QUERY:
> > >                 return do_procmap_query(state, (void __user *)arg);
> >
> > I guess it's a matter of preference, but I'd actually just pass
> > seq->priv->mm and arg into do_procmap_query(), which will make it
> > super obvious that priv is not used or mutated, and all the new stuff
> > that Suren needs for lockless VMA iteration, including iter (not sure
> > PROCMAP_QUERY needs last_pos, tbh), I'd just put into this new struct,
> > which do_procmap_query() can keep private to itself.
> >
> > Ultimately, I think we are on the same page, it's just a matter of
> > structuring code and types.
>
> That sounds cleaner to me too.
> I'll post a new version of my patchset today without the last patch to
> keep PROCMAP_QUERY changes separate, and then a follow-up patch that
> does this refactoring and changes PROCMAP_QUERY to use per-vma locks.
>
> Thanks folks! It's good to be back from vacation with the problem
> already figured out for you :)

Yes, Vlastimil was correct. Once I refactored the last patch so that
do_procmap_query() does not use seq->private at all, the reproducer
stopped failing.
I'll post the patchset without the last patch shortly, and once Andrew
takes it into mm-unstable, I will post the last patch as a follow-up.
Thanks,
Suren.

>
> >
> > >         default:
> > >                 return -ENOIOCTLCMD;
> > >         }
> > > }
> > >
> > > And then we have a stack-based thing with the bits that change and a
> > > read-only pointer to the bits that must remain static. And we can enforce
> > > that with const...
> > >
> > > We'd have to move the VMI and last_pos out too to make it const.
> > >
> > > Anyway the general idea should work!


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-15 20:13                         ` Suren Baghdasaryan
@ 2025-07-16 14:00                           ` Lorenzo Stoakes
  2025-07-16 14:07                             ` Vlastimil Babka
  0 siblings, 1 reply; 47+ messages in thread
From: Lorenzo Stoakes @ 2025-07-16 14:00 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Liam R. Howlett, akpm, david, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Tue, Jul 15, 2025 at 01:13:36PM -0700, Suren Baghdasaryan wrote:
> Huh, I completely failed to consider this. In hindsight it is quite
> obvious... Thanks Vlastimil, I owe you a beer or two.

FYI - Vlasta prefers the 'starobrno' brand of beer, he says he can't get
enough :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-16 14:00                           ` Lorenzo Stoakes
@ 2025-07-16 14:07                             ` Vlastimil Babka
  2025-07-16 14:27                               ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Vlastimil Babka @ 2025-07-16 14:07 UTC (permalink / raw)
  To: Lorenzo Stoakes, Suren Baghdasaryan
  Cc: Liam R. Howlett, akpm, david, peterx, jannh, hannes, mhocko,
	paulmck, shuah, adobriyan, brauner, josef, yebin10, linux, willy,
	osalvador, andrii, ryan.roberts, christophe.leroy, tjmercier,
	kaleshsingh, aha310510, linux-kernel, linux-fsdevel, linux-mm,
	linux-kselftest

On 7/16/25 16:00, Lorenzo Stoakes wrote:
> On Tue, Jul 15, 2025 at 01:13:36PM -0700, Suren Baghdasaryan wrote:
>> Huh, I completely failed to consider this. In hindsight it is quite
>> obvious... Thanks Vlastimil, I owe you a beer or two.
> 
> FYI - Vlasta prefers the 'starobrno' brand of beer, he says he can't get
> enough :)

FYI - Lorenzo is a notorious liar :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock
  2025-07-16 14:07                             ` Vlastimil Babka
@ 2025-07-16 14:27                               ` Suren Baghdasaryan
  0 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-07-16 14:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, Liam R. Howlett, akpm, david, peterx, jannh,
	hannes, mhocko, paulmck, shuah, adobriyan, brauner, josef,
	yebin10, linux, willy, osalvador, andrii, ryan.roberts,
	christophe.leroy, tjmercier, kaleshsingh, aha310510, linux-kernel,
	linux-fsdevel, linux-mm, linux-kselftest

On Wed, Jul 16, 2025 at 7:07 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/16/25 16:00, Lorenzo Stoakes wrote:
> > On Tue, Jul 15, 2025 at 01:13:36PM -0700, Suren Baghdasaryan wrote:
> >> Huh, I completely failed to consider this. In hindsight it is quite
> >> obvious... Thanks Vlastimil, I owe you a beer or two.
> >
> > FYI - Vlasta prefers the 'starobrno' brand of beer, he says he can't get
> > enough :)
>
> FYI - Lorenzo is a notorious liar :)

A search for starobrno in Tokyo provides a couple of suggestions:
- Pilsen Alley in Ginza focuses on Czech Pilsner beer, and while they
feature Pilsner Urquell, they might also carry other Czech brands.
- Cerveza Japan ShibuyaAXSH, a casual restaurant specializing in
Spanish paella, also offers an extensive selection of imported craft
beers from various countries, including the Czech Republic.


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2025-07-16 14:27 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-04  6:07 [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 1/8] selftests/proc: add /proc/pid/maps tearing from vma split test Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 3/8] selftests/proc: extend /proc/pid/maps tearing test to include vma remapping Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 5/8] selftests/proc: add verbose more for tests to facilitate debugging Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 6/8] fs/proc/task_mmu: remove conversion of seq_file position to unsigned Suren Baghdasaryan
2025-07-07 15:01   ` Lorenzo Stoakes
2025-07-08 17:37   ` Vlastimil Babka
2025-07-10  5:49     ` Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 7/8] fs/proc/task_mmu: read proc/pid/maps under per-vma lock Suren Baghdasaryan
2025-07-07 16:51   ` Lorenzo Stoakes
2025-07-07 23:10     ` Suren Baghdasaryan
2025-07-09  8:57       ` Vlastimil Babka
2025-07-09 14:43         ` Suren Baghdasaryan
2025-07-09 15:03           ` Vlastimil Babka
2025-07-09 15:06             ` Suren Baghdasaryan
2025-07-09 16:11               ` Liam R. Howlett
2025-07-09 17:47                 ` Suren Baghdasaryan
2025-07-10  7:03                   ` Suren Baghdasaryan
2025-07-10 17:02                     ` Suren Baghdasaryan
2025-07-10 17:42                       ` Vlastimil Babka
2025-07-15  8:16                       ` Vlastimil Babka
2025-07-15  9:40                         ` Lorenzo Stoakes
2025-07-15  9:52                           ` David Hildenbrand
2025-07-15 10:16                             ` Lorenzo Stoakes
2025-07-15 10:23                             ` Vlastimil Babka
2025-07-15 10:31                               ` Lorenzo Stoakes
2025-07-15 10:51                                 ` Lorenzo Stoakes
2025-07-15 17:05                                 ` Andrii Nakryiko
2025-07-15 17:10                                   ` Lorenzo Stoakes
2025-07-15 17:20                                     ` Lorenzo Stoakes
2025-07-15 17:29                                       ` Andrii Nakryiko
2025-07-15 20:18                                         ` Suren Baghdasaryan
2025-07-16  1:50                                           ` Suren Baghdasaryan
2025-07-15 20:13                         ` Suren Baghdasaryan
2025-07-16 14:00                           ` Lorenzo Stoakes
2025-07-16 14:07                             ` Vlastimil Babka
2025-07-16 14:27                               ` Suren Baghdasaryan
2025-07-07 18:20   ` Liam R. Howlett
2025-07-07 23:12     ` Suren Baghdasaryan
2025-07-09 10:03   ` Vlastimil Babka
2025-07-09 14:43     ` Suren Baghdasaryan
2025-07-04  6:07 ` [PATCH v6 8/8] fs/proc/task_mmu: execute PROCMAP_QUERY ioctl under per-vma locks Suren Baghdasaryan
2025-07-07 16:54   ` Lorenzo Stoakes
2025-07-07 18:26   ` Liam R. Howlett
2025-07-15  8:10 ` [PATCH v6 0/8] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).