All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] vma count: fixes, test and improvements
@ 2025-09-15 16:36 Kalesh Singh
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
                   ` (7 more replies)
  0 siblings, 8 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

Hi all,

This is v2 to the VMA count patch I previously posted at:

https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/


I've split it into multiple patches to address the feedback.

The main changes in v2 are:

- Use a capacity-based check for VMA count limit, per Lorenzo.
- Rename map_count to vma_count, per David.
- Add assertions for exceeding the limit, per Pedro.
- Add tests for max_vma_count, per Liam.
- Emit a trace event for failure due to insufficient capacity for
  observability

Tested on x86_64 and arm64:

- Build test:
    - allyesconfig for rename

- Selftests:
      cd tools/testing/selftests/mm && \
          make && \
          ./run_vmtests.sh -t max_vma_count

       (With trace_max_vma_count_exceeded enabled)

- vma tests:
      cd tools/testing/vma && \
          make && \
	  ./vma

Thanks,
Kalesh

Kalesh Singh (7):
  mm: fix off-by-one error in VMA count limit checks
  mm/selftests: add max_vma_count tests
  mm: introduce vma_count_remaining()
  mm: rename mm_struct::map_count to vma_count
  mm: harden vma_count against direct modification
  mm: add assertion for VMA count limit
  mm/tracing: introduce max_vma_count_exceeded trace event

 fs/binfmt_elf.c                               |   2 +-
 fs/coredump.c                                 |   2 +-
 include/linux/mm.h                            |  35 +-
 include/linux/mm_types.h                      |   5 +-
 include/trace/events/vma.h                    |  32 +
 kernel/fork.c                                 |   2 +-
 mm/debug.c                                    |   2 +-
 mm/internal.h                                 |   1 +
 mm/mmap.c                                     |  28 +-
 mm/mremap.c                                   |  13 +-
 mm/nommu.c                                    |   8 +-
 mm/util.c                                     |   1 -
 mm/vma.c                                      |  88 ++-
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/max_vma_count_tests.c        | 709 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 tools/testing/vma/vma.c                       |  32 +-
 tools/testing/vma/vma_internal.h              |  44 +-
 18 files changed, 949 insertions(+), 61 deletions(-)
 create mode 100644 include/trace/events/vma.h
 create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c


base-commit: f83ec76bf285bea5727f478a68b894f5543ca76e
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
  2025-09-15 22:36   ` Andrew Morton
                     ` (5 more replies)
  2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
                   ` (6 subsequent siblings)
  7 siblings, 6 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, stable, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

The VMA count limit check in do_mmap() and do_brk_flags() uses a
strict inequality (>), which allows a process's VMA count to exceed
the configured sysctl_max_map_count limit by one.

A process with mm->map_count == sysctl_max_map_count will incorrectly
pass this check and then exceed the limit upon allocation of a new VMA
when its map_count is incremented.

Other VMA allocation paths, such as split_vma(), already use the
correct, inclusive (>=) comparison.

Fix this bug by changing the comparison to be inclusive in do_mmap()
and do_brk_flags(), bringing them in line with the correct behavior
of other allocation paths.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---

Chnages in v2:
 - Fix mmap check, per Pedro

 mm/mmap.c | 2 +-
 mm/vma.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 7306253cc3b5..e5370e7fcd8f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
 	/*
diff --git a/mm/vma.c b/mm/vma.c
index 3b12c7579831..033a388bc4b1 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (mm->map_count > sysctl_max_map_count)
+	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 2/7] mm/selftests: add max_vma_count tests
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
  2025-09-17 10:58   ` David Hildenbrand
  2025-09-18 14:42   ` Lorenzo Stoakes
  2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

Add a new selftest to verify that the max VMA count limit is correctly
enforced.

This test suite checks that various VMA operations (mmap, mprotect,
munmap, mremap) succeed or fail as expected when the number of VMAs is
close to the sysctl_max_map_count limit.

The test works by first creating a large number of VMAs to bring the
process close to the limit, and then performing various operations that
may or may not create new VMAs. The test then verifies that the
operations that would exceed the limit fail, and that the operations
that do not exceed the limit succeed.

NOTE: munmap is special as it's allowed to temporarily exceed the limit
by one for splits as this will decrease back to the limit once the unmap
succeeds.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---

Changes in v2:
 - Add tests, per Liam (note that the do_brk_flags() path is not easily
   tested from userspace, so it's not included here). Exceeding the limit there
   should be uncommon.

 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/max_vma_count_tests.c        | 709 ++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 3 files changed, 715 insertions(+)
 create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d13b3cef2a2b..00a4b04eab06 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += uffd-stress
 TEST_GEN_FILES += uffd-unit-tests
 TEST_GEN_FILES += uffd-wp-mremap
+TEST_GEN_FILES += max_vma_count_tests
 TEST_GEN_FILES += split_huge_page_test
 TEST_GEN_FILES += ksm_tests
 TEST_GEN_FILES += ksm_functional_tests
diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c
new file mode 100644
index 000000000000..c8401c03425c
--- /dev/null
+++ b/tools/testing/selftests/mm/max_vma_count_tests.c
@@ -0,0 +1,709 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2025 Google LLC
+ */
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <linux/prctl.h>  /* Definition of PR_* constants */
+#include <sys/prctl.h>
+
+#include "../kselftest.h"
+
+static int get_max_vma_count(void);
+static bool set_max_vma_count(int val);
+static int get_current_vma_count(void);
+static bool is_current_vma_count(const char *msg, int expected);
+static bool is_test_area_mapped(const char *msg);
+static void print_surrounding_maps(const char *msg);
+
+/* Globals initialized in test_suite_setup() */
+static int MAX_VMA_COUNT;
+static int ORIGINAL_MAX_VMA_COUNT;
+static int PAGE_SIZE;
+static int GUARD_SIZE;
+static int TEST_AREA_SIZE;
+static int EXTRA_MAP_SIZE;
+
+static int MAX_VMA_COUNT;
+
+static int NR_EXTRA_MAPS;
+
+static char *TEST_AREA;
+static char *EXTRA_MAPS;
+
+#define DEFAULT_MAX_MAP_COUNT	65530
+#define TEST_AREA_NR_PAGES	3
+/* 1 before test area + 1 after test area + 1 after extra mappings */
+#define NR_GUARDS		3
+#define TEST_AREA_PROT		(PROT_NONE)
+#define EXTRA_MAP_PROT		(PROT_NONE)
+
+/**
+ * test_suite_setup - Set up the VMA layout for VMA count testing.
+ *
+ * Sets up the following VMA layout:
+ *
+ * +----- base_addr
+ * |
+ * V
+ * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
+ * |  Guard Page  |                      |  Guard Page  |  Extra Map 1   | Unmapped Gap |  Extra Map 2   | Unmapped Gap | ... |  Extra Map N   | Unmapped Gap |
+ * |  (unmapped)  |      TEST_AREA       |  (unmapped)  | (mapped page)  |  (1 page)    | (mapped page)  |  (1 page)    | ... | (mapped page)  |  (1 page)    |
+ * |   (1 page)   | (unmapped, 3 pages)  |   (1 page)   |    (1 page)    |              |    (1 page)    |              |     |    (1 page)    |              |
+ * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
+ * ^              ^                      ^              ^                                                                  ^
+ * |              |                      |              |                                                                  |
+ * +--GUARD_SIZE--+                      |              +-- EXTRA_MAPS points here             Sufficient EXTRA_MAPS to ---+
+ *    (PAGE_SIZE) |                      |                                                         reach MAX_VMA_COUNT
+ *                |                      |
+ *                +--- TEST_AREA_SIZE ---+
+ *                |   (3 * PAGE_SIZE)    |
+ *                ^
+ *                |
+ *                +-- TEST_AREA starts here
+ *
+ * Populates TEST_AREA and other globals required for the tests.
+ * If successful, the current VMA count will be MAX_VMA_COUNT - 1.
+ *
+ * Return: true on success, false on failure.
+ */
+static bool test_suite_setup(void)
+{
+	int initial_vma_count;
+	size_t reservation_size;
+	void *base_addr = NULL;
+	char *ptr = NULL;
+
+	ksft_print_msg("Setting up vma_max_count test suite...\n");
+
+	/* Initialize globals */
+	PAGE_SIZE = sysconf(_SC_PAGESIZE);
+	TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
+	GUARD_SIZE = PAGE_SIZE;
+	EXTRA_MAP_SIZE = PAGE_SIZE;
+	MAX_VMA_COUNT = get_max_vma_count();
+
+	MAX_VMA_COUNT = get_max_vma_count();
+	if (MAX_VMA_COUNT < 0) {
+		ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
+		return false;
+	}
+
+	/*
+	 * If the current limit is higher than the kernel default,
+	 * we attempt to lower it to the default to ensure the test
+	 * can run with a reliably known boundary.
+	 */
+	ORIGINAL_MAX_VMA_COUNT = 0;
+
+	if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
+		ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
+
+		ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
+				MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
+
+		if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
+			ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)n",
+					DEFAULT_MAX_MAP_COUNT);
+			ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
+
+			MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
+
+			return false;
+		}
+
+		/* Update MAX_VMA_COUNT for the test run */
+		MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
+	}
+
+	initial_vma_count = get_current_vma_count();
+	if (initial_vma_count < 0) {
+		ksft_print_msg("Failed to read /proc/self/maps\n");
+		return false;
+	}
+
+	/*
+	 * Calculate how many extra mappings we need to create to reach
+	 * MAX_VMA_COUNT - 1 (excluding test area).
+	 */
+	NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
+
+	if (NR_EXTRA_MAPS < 1) {
+		ksft_print_msg("Not enough available maps to run test\n");
+		ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
+				MAX_VMA_COUNT, initial_vma_count);
+		return false;
+	}
+
+	/*
+	 * Reserve space for:
+	 * - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
+	 * - The test area itself (TEST_AREA_NR_PAGES)
+	 * - The guard pages (NR_GUARDS)
+	 */
+	reservation_size = ((NR_EXTRA_MAPS * 2) +
+				TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
+
+	base_addr = mmap(NULL, reservation_size, PROT_NONE,
+			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+	if (base_addr == MAP_FAILED) {
+		ksft_print_msg("Failed tommap initial reservation\n");
+		return false;
+	}
+
+	if (munmap(base_addr, reservation_size) == -1) {
+		ksft_print_msg("Failed to munmap initial reservation\n");
+		return false;
+	}
+
+	/* Get the addr of the test area */
+	TEST_AREA = (char *)base_addr + GUARD_SIZE;
+
+	/*
+	 * Get the addr of the region for extra mappings:
+	 *     test area + 1 guard.
+	 */
+	EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
+
+	/* Create single-page mappings separated by unmapped pages */
+	ptr = EXTRA_MAPS;
+	for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
+		if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
+			MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
+			-1, 0) == MAP_FAILED) {
+			perror("mmap in fill loop");
+			ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
+					NR_EXTRA_MAPS);
+			return false;
+		}
+
+		/* Advance pointer by 2 to leave a gap */
+		ptr += (2 * EXTRA_MAP_SIZE);
+	}
+
+	if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
+		return false;
+
+	ksft_print_msg("vma_max_count test suite setup done.\n");
+
+	return true;
+}
+
+static void test_suite_teardown(void)
+{
+	if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
+		if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
+			ksft_print_msg("Failed to restore max_map_count to %d\n",
+					ORIGINAL_MAX_VMA_COUNT);
+	}
+}
+
+/* --- Test Helper Functions --- */
+static bool mmap_anon(void)
+{
+	void *addr =  mmap(NULL, PAGE_SIZE, PROT_READ,
+			   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+
+	/*
+	 * Handle cleanup here as the runner doesn't track where this,
+	 *mapping is located.
+	 */
+	if (addr != MAP_FAILED)
+		munmap(addr, PAGE_SIZE);
+
+	return addr != MAP_FAILED;
+}
+
+static inline bool __mprotect(char *addr, int size)
+{
+	int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
+
+	return mprotect(addr, size, new_prot) == 0;
+}
+
+static bool mprotect_nosplit(void)
+{
+	return __mprotect(TEST_AREA, TEST_AREA_SIZE);
+}
+
+static bool mprotect_2way_split(void)
+{
+	return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+
+static bool mprotect_3way_split(void)
+{
+	return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+
+static inline bool __munmap(char *addr, int size)
+{
+	return munmap(addr, size) == 0;
+}
+
+static bool munmap_nosplit(void)
+{
+	return __munmap(TEST_AREA, TEST_AREA_SIZE);
+}
+
+static bool munmap_2way_split(void)
+{
+	return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+
+static bool munmap_3way_split(void)
+{
+	return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+
+/* mremap accounts for the worst case to fail early */
+static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
+
+static bool mremap_dontunmap(void)
+{
+	void *new_addr;
+
+	/*
+	 * Using MREMAP_DONTUNMAP will create a new mapping without
+	 * removing the old one, consuming one VMA slot.
+	 */
+	new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
+			  MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
+
+	if (new_addr != MAP_FAILED)
+		munmap(new_addr, TEST_AREA_SIZE);
+
+	return new_addr != MAP_FAILED;
+}
+
+struct test {
+	const char *name;
+	bool (*test)(void);
+	 /* How many VMA slots below the limit this test needs to start? */
+	int vma_slots_needed;
+	bool expect_success;
+};
+
+/* --- Test Cases --- */
+struct test tests[] = {
+	{
+		.name = "mmap_at_1_below_vma_count_limit",
+		.test = mmap_anon,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "mmap_at_vma_count_limit",
+		.test = mmap_anon,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "mprotect_nosplit_at_1_below_vma_count_limit",
+		.test = mprotect_nosplit,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_nosplit_at_vma_count_limit",
+		.test = mprotect_nosplit,
+		.vma_slots_needed = 0,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_2way_split_at_1_below_vma_count_limit",
+		.test = mprotect_2way_split,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_2way_split_at_vma_count_limit",
+		.test = mprotect_2way_split,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "mprotect_3way_split_at_2_below_vma_count_limit",
+		.test = mprotect_3way_split,
+		.vma_slots_needed = 2,
+		.expect_success = true,
+	},
+	{
+		.name = "mprotect_3way_split_at_1_below_vma_count_limit",
+		.test = mprotect_3way_split,
+		.vma_slots_needed = 1,
+		.expect_success = false,
+	},
+	{
+		.name = "mprotect_3way_split_at_vma_count_limit",
+		.test = mprotect_3way_split,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "munmap_nosplit_at_1_below_vma_count_limit",
+		.test = munmap_nosplit,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_nosplit_at_vma_count_limit",
+		.test = munmap_nosplit,
+		.vma_slots_needed = 0,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_2way_split_at_1_below_vma_count_limit",
+		.test = munmap_2way_split,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_2way_split_at_vma_count_limit",
+		.test = munmap_2way_split,
+		.vma_slots_needed = 0,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_3way_split_at_2_below_vma_count_limit",
+		.test = munmap_3way_split,
+		.vma_slots_needed = 2,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_3way_split_at_1_below_vma_count_limit",
+		.test = munmap_3way_split,
+		.vma_slots_needed = 1,
+		.expect_success = true,
+	},
+	{
+		.name = "munmap_3way_split_at_vma_count_limit",
+		.test = munmap_3way_split,
+		.vma_slots_needed = 0,
+		.expect_success = false,
+	},
+	{
+		.name = "mremap_dontunmap_at_required_vma_count_capcity",
+		.test = mremap_dontunmap,
+		.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
+		.expect_success = true,
+	},
+	{
+		.name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
+		.test = mremap_dontunmap,
+		.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
+		.expect_success = false,
+	},
+};
+
+/* --- Test Runner --- */
+int main(int argc, char **argv)
+{
+	int num_tests = ARRAY_SIZE(tests);
+	int failed_tests = 0;
+
+	ksft_set_plan(num_tests);
+
+	if (!test_suite_setup() != 0) {
+		if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
+			ksft_exit_skip("max_map_count too high and cannot be lowered\n"
+					"Please rerun as root.\n");
+		else
+			ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
+
+	}
+
+	for (int i = 0; i < num_tests; i++) {
+		int maps_to_unmap = tests[i].vma_slots_needed;
+		const char *name = tests[i].name;
+		bool test_passed;
+
+		errno = 0;
+
+		/* 1. Setup: TEST_AREA mapping */
+		if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
+			MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
+				== MAP_FAILED) {
+			ksft_test_result_fail(
+				"%s: Test setup failed to map TEST_AREA\n",
+				name);
+			maps_to_unmap = 0;
+			goto fail;
+		}
+
+		/* Label TEST_AREA to ease debugging */
+		if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
+			TEST_AREA_SIZE, "TEST_AREA")) {
+			ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
+					name);
+			ksft_print_msg(
+				"Continuing without named TEST_AREA mapping\n");
+		}
+
+		/* 2. Setup: Adjust VMA count based on test requirements */
+		if (maps_to_unmap > NR_EXTRA_MAPS) {
+			ksft_test_result_fail(
+				"%s: Test setup failed: Invalid VMA slots required %d\n",
+				name, tests[i].vma_slots_needed);
+			maps_to_unmap = 0;
+			goto fail;
+		}
+
+		/* Unmap extra mappings, accounting for the 1-page gap */
+		for (int j = 0; j < maps_to_unmap; j++)
+			munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
+				EXTRA_MAP_SIZE);
+
+		/*
+		 * 3. Verify the preconditions.
+		 *
+		 *    Sometimes there isn't an easy way to determine the cause
+		 *    of the test failure.
+		 *    e.g. an mprotect ENOMEM may be due to trying to protect
+		 *         unmapped area or due to hitting MAX_VMA_COUNT limit.
+		 *
+		 *    We verify the preconditions of the test to ensure any
+		 *    expected failures are from the expected cause and not
+		 *    coincidental.
+		 */
+		if (!is_current_vma_count(name,
+			MAX_VMA_COUNT - tests[i].vma_slots_needed))
+			goto fail;
+
+		if (!is_test_area_mapped(name))
+			goto fail;
+
+		/* 4. Run the test */
+		test_passed = (tests[i].test() == tests[i].expect_success);
+		if (test_passed) {
+			ksft_test_result_pass("%s\n", name);
+		} else {
+fail:
+			failed_tests++;
+			ksft_test_result_fail(
+				"%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
+				name, get_current_vma_count(), MAX_VMA_COUNT,
+				errno, strerror(errno));
+			print_surrounding_maps(name);
+		}
+
+		/* 5. Teardown: Unmap TEST_AREA. */
+		munmap(TEST_AREA, TEST_AREA_SIZE);
+
+		/* 6. Teardown: Restore extra mappings to test suite baseline */
+		for (int j = 0; j < maps_to_unmap; j++) {
+			/* Remap extra mappings, accounting for the gap */
+			mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
+				EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
+				MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
+				-1, 0);
+		}
+	}
+
+	test_suite_teardown();
+
+	if (failed_tests > 0)
+		ksft_exit_fail();
+	else
+		ksft_exit_pass();
+}
+
+/* --- Utilities --- */
+
+static int get_max_vma_count(void)
+{
+	int max_count;
+	FILE *f;
+
+	f = fopen("/proc/sys/vm/max_map_count", "r");
+	if (!f)
+		return -1;
+
+	if (fscanf(f, "%d", &max_count) != 1)
+		max_count = -1;
+
+
+	fclose(f);
+
+	return max_count;
+}
+
+static bool set_max_vma_count(int val)
+{
+	FILE *f;
+	bool success = false;
+
+	f = fopen("/proc/sys/vm/max_map_count", "w");
+	if (!f)
+		return false;
+
+	if (fprintf(f, "%d", val) > 0)
+		success = true;
+
+	fclose(f);
+	return success;
+}
+
+static int get_current_vma_count(void)
+{
+	char line[1024];
+	int count = 0;
+	FILE *f;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f)
+		return -1;
+
+	while (fgets(line, sizeof(line), f)) {
+		if (!strstr(line, "[vsyscall]"))
+			count++;
+	}
+
+	fclose(f);
+
+	return count;
+}
+
+static bool is_current_vma_count(const char *msg, int expected)
+{
+	int current = get_current_vma_count();
+
+	if (current == expected)
+		return true;
+
+	ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
+			expected);
+	return false;
+}
+
+static bool is_test_area_mapped(const char *msg)
+{
+	unsigned long search_start = (unsigned long)TEST_AREA;
+	unsigned long search_end = search_start + TEST_AREA_SIZE;
+	bool found = false;
+	char line[1024];
+	FILE *f;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f) {
+		ksft_print_msg("failed to open /proc/self/maps\n");
+		return false;
+	}
+
+	while (fgets(line, sizeof(line), f)) {
+		unsigned long start, end;
+
+		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
+			continue;
+
+		/* Check for an exact match of the range */
+		if (start == search_start && end == search_end) {
+			found = true;
+			break;
+		} else if (start > search_end) {
+			/*
+			 *Since maps are sorted, if we've passed the end, we
+			 * can stop searching.
+			 */
+			break;
+		}
+	}
+
+	fclose(f);
+
+	if (found)
+		return true;
+
+	/* Not found */
+	ksft_print_msg(
+		"%s: TEST_AREA is not mapped as a single contiguous block.\n",
+		msg);
+	print_surrounding_maps(msg);
+
+	return false;
+}
+
+static void print_surrounding_maps(const char *msg)
+{
+	unsigned long search_start = (unsigned long)TEST_AREA;
+	unsigned long search_end = search_start + TEST_AREA_SIZE;
+	unsigned long start;
+	unsigned long end;
+	char line[1024] = {};
+	int line_idx = 0;
+	int first_match_idx = -1;
+	int last_match_idx = -1;
+	FILE *f;
+
+	f = fopen("/proc/self/maps", "r");
+	if (!f)
+		return;
+
+	if (msg)
+		ksft_print_msg("%s\n", msg);
+
+	ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
+			TEST_AREA);
+
+	/* First pass: Read all lines and find the range of matching entries */
+	fseek(f, 0, SEEK_SET); /* Rewind file */
+	while (fgets(line, sizeof(line), f)) {
+		if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
+			line_idx++;
+			continue;
+		}
+
+		/* Check for any overlap */
+		if (start < search_end && end > search_start) {
+			if (first_match_idx == -1)
+				first_match_idx = line_idx;
+			last_match_idx = line_idx;
+		} else if (start > search_end) {
+			/*
+			 * Since maps are sorted, if we've passed the end, we
+			 * can stop searching.
+			 */
+			break;
+		}
+
+		line_idx++;
+	}
+
+	if (first_match_idx == -1) {
+		ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
+				TEST_AREA);
+	} else {
+		/* Second pass: Print the relevant lines */
+		fseek(f, 0, SEEK_SET); /* Rewind file */
+		line_idx = 0;
+		while (fgets(line, sizeof(line), f)) {
+			/* Print 2 lines before the first match */
+			if (line_idx >= first_match_idx - 2 &&
+				line_idx < first_match_idx)
+				ksft_print_msg("   %s", line);
+
+			/* Print all matching TEST_AREA entries */
+			if (line_idx >= first_match_idx &&
+				line_idx <= last_match_idx)
+				ksft_print_msg(">> %s", line);
+
+			/* Print 2 lines after the last match */
+			if (line_idx > last_match_idx &&
+				line_idx <= last_match_idx + 2)
+				ksft_print_msg("   %s", line);
+
+			line_idx++;
+		}
+	}
+
+	ksft_print_msg("--------------------------------------------------\n");
+
+	fclose(f);
+}
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 471e539d82b8..3794b50ec280 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -49,6 +49,8 @@ separated by spaces:
 	test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
 - madv_populate
 	test memadvise(2) MADV_POPULATE_{READ,WRITE} options
+- max_vma_count
+	tests for max vma_count
 - memfd_secret
 	test memfd_secret(2)
 - process_mrelease
@@ -417,6 +419,9 @@ fi # VADDR64
 # vmalloc stability smoke test
 CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
 
+# test operations against max vma count limit
+CATEGORY="max_vma_count" run_test ./max_vma_count_tests
+
 CATEGORY="mremap" run_test ./mremap_dontunmap
 
 CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
  2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
  2025-09-16 18:33   ` kernel test robot
                     ` (3 more replies)
  2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
                   ` (4 subsequent siblings)
  7 siblings, 4 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

The checks against sysctl_max_map_count are open-coded in multiple
places. While simple checks are manageable, the logic in places like
mremap.c involves arithmetic with magic numbers that can be difficult
to reason about. e.g. ... >= sysctl_max_map_count - 3

To improve readability and centralize the logic, introduce a new helper,
vma_count_remaining(). This function returns the VMA count headroom
available for a givine process.

The most common case of checking for a single new VMA can be done with
the convenience helper has_vma_count_remaining():

    if (!vma_count_remaining(mm))

And the complex checks in mremap.c become clearer by expressing the
required capacity directly:

    if (vma_count_remaining(mm) <  4)

While a capacity-based function could be misused (e.g., with an
incorrect '<' vs '<=' comparison), the improved readability at the call
sites makes such errors less likely than with the previous open-coded
arithmetic.

As part of this change, sysctl_max_map_count is made static to
mm/mmap.c to improve encapsulation.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---

Changes in v2:
  - Fix documentation comment for vma_count_remaining(), per Mike
  - Remove extern in header, per Mike and Pedro
  - Move declaration to mm/internal.h, per Mike
  - Replace exceeds_max_map_count() with capacity-based vma_count_remaining(),
       per Lorenzo.
  - Fix tools/testing/vma, per Lorenzo.

 include/linux/mm.h               |  2 --
 mm/internal.h                    |  2 ++
 mm/mmap.c                        | 21 ++++++++++++++++++++-
 mm/mremap.c                      |  7 ++++---
 mm/nommu.c                       |  2 +-
 mm/util.c                        |  1 -
 mm/vma.c                         | 10 +++++-----
 tools/testing/vma/vma_internal.h |  9 +++++++++
 8 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..138bab2988f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page)
 #define MAPCOUNT_ELF_CORE_MARGIN	(5)
 #define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
 
-extern int sysctl_max_map_count;
-
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc03..39f1c9535ae5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
 void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
 int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
 
+int vma_count_remaining(const struct mm_struct *mm);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index e5370e7fcd8f..af88ce1fbb5f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (mm->map_count >= sysctl_max_map_count)
+	if (!vma_count_remaining(mm))
 		return -ENOMEM;
 
 	/*
@@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
 int sysctl_legacy_va_layout;
 #endif
 
+static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+
+/**
+ * vma_count_remaining - Determine available VMA slots
+ * @mm: The memory descriptor for the process.
+ *
+ * Check how many more VMAs can be created for the given @mm
+ * before hitting the sysctl_max_map_count limit.
+ *
+ * Return: The number of new VMAs the process can accommodate.
+ */
+int vma_count_remaining(const struct mm_struct *mm)
+{
+	const int map_count = mm->map_count;
+	const int max_count = sysctl_max_map_count;
+
+	return (max_count > map_count) ? (max_count - map_count) : 0;
+}
+
 static const struct ctl_table mmap_table[] = {
 		{
 				.procname       = "max_map_count",
diff --git a/mm/mremap.c b/mm/mremap.c
index 35de0a7b910e..14d35d87e89b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
 	 * We'd prefer to avoid failure later on in do_munmap:
 	 * which may split one vma into three before unmapping.
 	 */
-	if (current->mm->map_count >= sysctl_max_map_count - 3)
+	if (vma_count_remaining(current->mm) < 4)
 		return -ENOMEM;
 
 	if (vma->vm_ops && vma->vm_ops->may_split) {
@@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 	 * split in 3 before unmapping it.
 	 * That means 2 more maps (1 for each) to the ones we already hold.
 	 * Check whether current map count plus 2 still leads us to 4 maps below
-	 * the threshold, otherwise return -ENOMEM here to be more safe.
+	 * the threshold. In other words, is the current map count + 6 at or
+	 * below the threshold? Otherwise return -ENOMEM here to be more safe.
 	 */
-	if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
+	if (vma_count_remaining(current->mm) < 6)
 		return -ENOMEM;
 
 	return 0;
diff --git a/mm/nommu.c b/mm/nommu.c
index 8b819fafd57b..dd75f2334812 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		return -ENOMEM;
 
 	mm = vma->vm_mm;
-	if (mm->map_count >= sysctl_max_map_count)
+	if (!vma_count_remaining(mm))
 		return -ENOMEM;
 
 	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/util.c b/mm/util.c
index f814e6a59ab1..b6e83922cafe 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy);
 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
 static int sysctl_overcommit_ratio __read_mostly = 50;
 static unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
 unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
 
diff --git a/mm/vma.c b/mm/vma.c
index 033a388bc4b1..df0e8409f63d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 }
 
 /*
- * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
- * has already been checked or doesn't make sense to fail.
+ * __split_vma() bypasses vma_count_remaining() checks. We use this where
+ * it has already been checked or doesn't make sense to fail.
  * VMA Iterator will point to the original VMA.
  */
 static __must_check int
@@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		     unsigned long addr, int new_below)
 {
-	if (vma->vm_mm->map_count >= sysctl_max_map_count)
+	if (!vma_count_remaining(vma->vm_mm))
 		return -ENOMEM;
 
 	return __split_vma(vmi, vma, addr, new_below);
@@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		 * its limit temporarily, to help free resources as expected.
 		 */
 		if (vms->end < vms->vma->vm_end &&
-		    vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
+		    !vma_count_remaining(vms->vma->vm_mm)) {
 			error = -ENOMEM;
 			goto map_count_exceeded;
 		}
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (mm->map_count >= sysctl_max_map_count)
+	if (!vma_count_remaining(mm))
 		return -ENOMEM;
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 3639aa8dd2b0..52cd7ddc73f4 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
 	return vm_flags;
 }
 
+/* Helper to get VMA count capacity */
+static int vma_count_remaining(const struct mm_struct *mm)
+{
+	const int map_count = mm->map_count;
+	const int max_count = sysctl_max_map_count;
+
+	return (max_count > map_count) ? (max_count - map_count) : 0;
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
                   ` (2 preceding siblings ...)
  2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
  2025-09-17 13:41   ` David Hildenbrand
                     ` (2 more replies)
  2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

A mechanical rename of the mm_struct->map_count field to
vma_count; no functional change is intended.

The name "map_count" is ambiguous within the memory management subsystem,
as it can be confused with the folio/page->_mapcount field, which tracks
PTE references.

The new name, vma_count, is more precise as this field has always
counted the number of vm_area_structs associated with an mm_struct.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---

 Changes in v2:
  - map_count is easily confused with _mapcount rename to vma_count, per David

 fs/binfmt_elf.c                  |  2 +-
 fs/coredump.c                    |  2 +-
 include/linux/mm_types.h         |  2 +-
 kernel/fork.c                    |  2 +-
 mm/debug.c                       |  2 +-
 mm/mmap.c                        |  6 +++---
 mm/nommu.c                       |  6 +++---
 mm/vma.c                         | 24 ++++++++++++------------
 tools/testing/vma/vma.c          | 32 ++++++++++++++++----------------
 tools/testing/vma/vma_internal.h |  6 +++---
 10 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 264fba0d44bd..52449dec12cb 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1643,7 +1643,7 @@ static int fill_files_note(struct memelfnote *note, struct coredump_params *cprm
 	data[0] = count;
 	data[1] = PAGE_SIZE;
 	/*
-	 * Count usually is less than mm->map_count,
+	 * Count usually is less than mm->vma_count,
 	 * we need to move filenames down.
 	 */
 	n = cprm->vma_count - count;
diff --git a/fs/coredump.c b/fs/coredump.c
index 60bc9685e149..8881459c53d9 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1731,7 +1731,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)

 	cprm->vma_data_size = 0;
 	gate_vma = get_gate_vma(mm);
-	cprm->vma_count = mm->map_count + (gate_vma ? 1 : 0);
+	cprm->vma_count = mm->vma_count + (gate_vma ? 1 : 0);

 	cprm->vma_meta = kvmalloc_array(cprm->vma_count, sizeof(*cprm->vma_meta), GFP_KERNEL);
 	if (!cprm->vma_meta) {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 08bc2442db93..4343be2f9e85 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1020,7 +1020,7 @@ struct mm_struct {
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
-		int map_count;			/* number of VMAs */
+		int vma_count;			/* number of VMAs */

 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd..8fcbbf947579 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm_pgtables_bytes_init(mm);
-	mm->map_count = 0;
+	mm->vma_count = 0;
 	mm->locked_vm = 0;
 	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
diff --git a/mm/debug.c b/mm/debug.c
index b4388f4dcd4d..40fc9425a84a 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -204,7 +204,7 @@ void dump_mm(const struct mm_struct *mm)
 		mm->pgd, atomic_read(&mm->mm_users),
 		atomic_read(&mm->mm_count),
 		mm_pgtables_bytes(mm),
-		mm->map_count,
+		mm->vma_count,
 		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
 		(u64)atomic64_read(&mm->pinned_vm),
 		mm->data_vm, mm->exec_vm, mm->stack_vm,
diff --git a/mm/mmap.c b/mm/mmap.c
index af88ce1fbb5f..c6769394a174 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = vma_next(&vmi);
 	} while (vma && likely(!xa_is_zero(vma)));
 
-	BUG_ON(count != mm->map_count);
+	BUG_ON(count != mm->vma_count);
 
 	trace_exit_mmap(mm);
 destroy:
@@ -1517,7 +1517,7 @@ static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
  */
 int vma_count_remaining(const struct mm_struct *mm)
 {
-	const int map_count = mm->map_count;
+	const int map_count = mm->vma_count;
 	const int max_count = sysctl_max_map_count;
 
 	return (max_count > map_count) ? (max_count - map_count) : 0;
@@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		 */
 		vma_iter_bulk_store(&vmi, tmp);
 
-		mm->map_count++;
+		mm->vma_count++;
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/nommu.c b/mm/nommu.c
index dd75f2334812..9ab2e5ca736d 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -576,7 +576,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
 
 static void cleanup_vma_from_mm(struct vm_area_struct *vma)
 {
-	vma->vm_mm->map_count--;
+	vma->vm_mm->vma_count--;
 	/* remove the VMA from the mapping */
 	if (vma->vm_file) {
 		struct address_space *mapping;
@@ -1198,7 +1198,7 @@ unsigned long do_mmap(struct file *file,
 		goto error_just_free;
 
 	setup_vma_to_mm(vma, current->mm);
-	current->mm->map_count++;
+	current->mm->vma_count++;
 	/* add the VMA to the tree */
 	vma_iter_store_new(&vmi, vma);
 
@@ -1366,7 +1366,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	setup_vma_to_mm(vma, mm);
 	setup_vma_to_mm(new, mm);
 	vma_iter_store_new(vmi, new);
-	mm->map_count++;
+	mm->vma_count++;
 	return 0;
 
 err_vmi_preallocate:
diff --git a/mm/vma.c b/mm/vma.c
index df0e8409f63d..64f4e7c867c3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		 * (it may either follow vma or precede it).
 		 */
 		vma_iter_store_new(vmi, vp->insert);
-		mm->map_count++;
+		mm->vma_count++;
 	}
 
 	if (vp->anon_vma) {
@@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		}
 		if (vp->remove->anon_vma)
 			anon_vma_merge(vp->vma, vp->remove);
-		mm->map_count--;
+		mm->vma_count--;
 		mpol_put(vma_policy(vp->remove));
 		if (!vp->remove2)
 			WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
@@ -683,13 +683,13 @@ void validate_mm(struct mm_struct *mm)
 		}
 #endif
 		/* Check for a infinite loop */
-		if (++i > mm->map_count + 10) {
+		if (++i > mm->vma_count + 10) {
 			i = -1;
 			break;
 		}
 	}
-	if (i != mm->map_count) {
-		pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i);
+	if (i != mm->vma_count) {
+		pr_emerg("vma_count %d vma iterator %d\n", mm->vma_count, i);
 		bug = 1;
 	}
 	VM_BUG_ON_MM(bug, mm);
@@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	struct mm_struct *mm;
 
 	mm = current->mm;
-	mm->map_count -= vms->vma_count;
+	mm->vma_count -= vms->vma_count;
 	mm->locked_vm -= vms->locked_vm;
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
@@ -1340,14 +1340,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->start > vms->vma->vm_start) {

 		/*
-		 * Make sure that map_count on return from munmap() will
+		 * Make sure that vma_count on return from munmap() will
 		 * not exceed its limit; but let map_count go just above
 		 * its limit temporarily, to help free resources as expected.
 		 */
 		if (vms->end < vms->vma->vm_end &&
 		    !vma_count_remaining(vms->vma->vm_mm)) {
 			error = -ENOMEM;
-			goto map_count_exceeded;
+			goto vma_count_exceeded;
 		}
 
 		/* Don't bother splitting the VMA if we can't unmap it anyway */
@@ -1461,7 +1461,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 modify_vma_failed:
 	reattach_vmas(mas_detach);
 start_split_failed:
-map_count_exceeded:
+vma_count_exceeded:
 	return error;
 }
 
@@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
 	vma_start_write(vma);
 	vma_iter_store_new(&vmi, vma);
 	vma_link_file(vma);
-	mm->map_count++;
+	mm->vma_count++;
 	validate_mm(mm);
 	return 0;
 }
@@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	/* Lock the VMA since it is modified after insertion into VMA tree */
 	vma_start_write(vma);
 	vma_iter_store_new(vmi, vma);
-	map->mm->map_count++;
+	map->mm->vma_count++;
 	vma_link_file(vma);

 	/*
@@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
 		goto mas_store_fail;

-	mm->map_count++;
+	mm->vma_count++;
 	validate_mm(mm);
 out:
 	perf_event_mmap(vma);
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 656e1c75b711..69fa7d14a6c2 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
 	}

 	mtree_destroy(&mm->mm_mt);
-	mm->map_count = 0;
+	mm->vma_count = 0;
 	return count;
 }

@@ -500,7 +500,7 @@ static bool test_merge_new(void)
 	INIT_LIST_HEAD(&vma_d->anon_vma_chain);
 	list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain);
 	ASSERT_FALSE(merged);
-	ASSERT_EQ(mm.map_count, 4);
+	ASSERT_EQ(mm.vma_count, 4);

 	/*
 	 * Merge BOTH sides.
@@ -519,7 +519,7 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->vm_pgoff, 0);
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 3);
+	ASSERT_EQ(mm.vma_count, 3);

 	/*
 	 * Merge to PREVIOUS VMA.
@@ -536,7 +536,7 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->vm_pgoff, 0);
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 3);
+	ASSERT_EQ(mm.vma_count, 3);

 	/*
 	 * Merge to NEXT VMA.
@@ -555,7 +555,7 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->vm_pgoff, 6);
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 3);
+	ASSERT_EQ(mm.vma_count, 3);

 	/*
 	 * Merge BOTH sides.
@@ -573,7 +573,7 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->vm_pgoff, 0);
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 2);
+	ASSERT_EQ(mm.vma_count, 2);

 	/*
 	 * Merge to NEXT VMA.
@@ -591,7 +591,7 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->vm_pgoff, 0xa);
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 2);
+	ASSERT_EQ(mm.vma_count, 2);

 	/*
 	 * Merge BOTH sides.
@@ -608,7 +608,7 @@ static bool test_merge_new(void)
 	ASSERT_EQ(vma->vm_pgoff, 0);
 	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 1);
+	ASSERT_EQ(mm.vma_count, 1);

 	/*
 	 * Final state.
@@ -967,7 +967,7 @@ static bool test_vma_merge_new_with_close(void)
 	ASSERT_EQ(vma->vm_pgoff, 0);
 	ASSERT_EQ(vma->vm_ops, &vm_ops);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 2);
+	ASSERT_EQ(mm.vma_count, 2);

 	cleanup_mm(&mm, &vmi);
 	return true;
@@ -1017,7 +1017,7 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma->vm_pgoff, 2);
 	ASSERT_TRUE(vma_write_started(vma));
 	ASSERT_TRUE(vma_write_started(vma_next));
-	ASSERT_EQ(mm.map_count, 2);
+	ASSERT_EQ(mm.vma_count, 2);

 	/* Clear down and reset. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1045,7 +1045,7 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_next->vm_pgoff, 2);
 	ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_next));
-	ASSERT_EQ(mm.map_count, 1);
+	ASSERT_EQ(mm.vma_count, 1);

 	/* Clear down and reset. We should have deleted vma. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1079,7 +1079,7 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma->vm_pgoff, 6);
 	ASSERT_TRUE(vma_write_started(vma_prev));
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 2);
+	ASSERT_EQ(mm.vma_count, 2);

 	/* Clear down and reset. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1108,7 +1108,7 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_prev->vm_pgoff, 0);
 	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_prev));
-	ASSERT_EQ(mm.map_count, 1);
+	ASSERT_EQ(mm.vma_count, 1);

 	/* Clear down and reset. We should have deleted vma. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1138,7 +1138,7 @@ static bool test_merge_existing(void)
 	ASSERT_EQ(vma_prev->vm_pgoff, 0);
 	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
 	ASSERT_TRUE(vma_write_started(vma_prev));
-	ASSERT_EQ(mm.map_count, 1);
+	ASSERT_EQ(mm.vma_count, 1);

 	/* Clear down and reset. We should have deleted prev and next. */
 	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1540,7 +1540,7 @@ static bool test_merge_extend(void)
 	ASSERT_EQ(vma->vm_end, 0x4000);
 	ASSERT_EQ(vma->vm_pgoff, 0);
 	ASSERT_TRUE(vma_write_started(vma));
-	ASSERT_EQ(mm.map_count, 1);
+	ASSERT_EQ(mm.vma_count, 1);

 	cleanup_mm(&mm, &vmi);
 	return true;
@@ -1652,7 +1652,7 @@ static bool test_mmap_region_basic(void)
 			     0x24d, NULL);
 	ASSERT_EQ(addr, 0x24d000);

-	ASSERT_EQ(mm.map_count, 2);
+	ASSERT_EQ(mm.vma_count, 2);

 	for_each_vma(vmi, vma) {
 		if (vma->vm_start == 0x300000) {
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 52cd7ddc73f4..15525b86145d 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -251,7 +251,7 @@ struct mutex {};

 struct mm_struct {
 	struct maple_tree mm_mt;
-	int map_count;			/* number of VMAs */
+	int vma_count;			/* number of VMAs */
 	unsigned long total_vm;	   /* Total pages mapped */
 	unsigned long locked_vm;   /* Pages that have PG_mlocked set */
 	unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
@@ -1520,10 +1520,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
 /* Helper to get VMA count capacity */
 static int vma_count_remaining(const struct mm_struct *mm)
 {
-	const int map_count = mm->map_count;
+	const int vma_count = mm->vma_count;
 	const int max_count = sysctl_max_map_count;

-	return (max_count > map_count) ? (max_count - map_count) : 0;
+	return (max_count > vma_count) ? (max_count - vma_count) : 0;
 }

 #endif	/* __MM_VMA_INTERNAL_H */
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 5/7] mm: harden vma_count against direct modification
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
                   ` (3 preceding siblings ...)
  2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
  2025-09-16  9:10   ` kernel test robot
  2025-09-18 14:52   ` Lorenzo Stoakes
  2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

To make VMA counting more robust, prevent direct modification of the
mm->vma_count field. This is achieved by making the public-facing
member const via a union and requiring all modifications to go
through a new set of helper functions the operate on a private
__vma_count.

While there are no other invariants tied to vma_count currently, this
structural change improves maintainability; as it creates a single,
centralized point for any future logic, such as adding debug checks
or updating related statistics (in subsequent patches).

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
 include/linux/mm.h               | 25 +++++++++++++++++++++++++
 include/linux/mm_types.h         |  5 ++++-
 kernel/fork.c                    |  2 +-
 mm/mmap.c                        |  2 +-
 mm/vma.c                         | 12 ++++++------
 tools/testing/vma/vma.c          |  2 +-
 tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++-
 7 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 138bab2988f8..8bad1454984c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)

 void snapshot_page(struct page_snapshot *ps, const struct page *page);

+static inline void vma_count_init(struct mm_struct *mm)
+{
+	ACCESS_PRIVATE(mm, __vma_count) = 0;
+}
+
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+	ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
+}
+
+static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
+{
+	vma_count_add(mm, -nr_vmas);
+}
+
+static inline void vma_count_inc(struct mm_struct *mm)
+{
+	vma_count_add(mm, 1);
+}
+
+static inline void vma_count_dec(struct mm_struct *mm)
+{
+	vma_count_sub(mm, 1);
+}
+
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4343be2f9e85..2ea8fc722aa2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1020,7 +1020,10 @@ struct mm_struct {
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
-		int vma_count;			/* number of VMAs */
+		union {
+			const int vma_count;		/* number of VMAs */
+			int __private __vma_count;
+		};

 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
diff --git a/kernel/fork.c b/kernel/fork.c
index 8fcbbf947579..ea9eff416e51 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm_pgtables_bytes_init(mm);
-	mm->vma_count = 0;
+	vma_count_init(mm);
 	mm->locked_vm = 0;
 	atomic64_set(&mm->pinned_vm, 0);
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
diff --git a/mm/mmap.c b/mm/mmap.c
index c6769394a174..30ddd550197e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		 */
 		vma_iter_bulk_store(&vmi, tmp);
 
-		mm->vma_count++;
+		vma_count_inc(mm);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff --git a/mm/vma.c b/mm/vma.c
index 64f4e7c867c3..0cd3cb472220 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		 * (it may either follow vma or precede it).
 		 */
 		vma_iter_store_new(vmi, vp->insert);
-		mm->vma_count++;
+		vma_count_inc(mm);
 	}
 
 	if (vp->anon_vma) {
@@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		}
 		if (vp->remove->anon_vma)
 			anon_vma_merge(vp->vma, vp->remove);
-		mm->vma_count--;
+		vma_count_dec(mm);
 		mpol_put(vma_policy(vp->remove));
 		if (!vp->remove2)
 			WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
@@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	struct mm_struct *mm;
 
 	mm = current->mm;
-	mm->vma_count -= vms->vma_count;
+	vma_count_sub(mm, vms->vma_count);
 	mm->locked_vm -= vms->locked_vm;
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
@@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
 	vma_start_write(vma);
 	vma_iter_store_new(&vmi, vma);
 	vma_link_file(vma);
-	mm->vma_count++;
+	vma_count_inc(mm);
 	validate_mm(mm);
 	return 0;
 }
@@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	/* Lock the VMA since it is modified after insertion into VMA tree */
 	vma_start_write(vma);
 	vma_iter_store_new(vmi, vma);
-	map->mm->vma_count++;
+	vma_count_inc(map->mm);
 	vma_link_file(vma);
 
 	/*
@@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
 		goto mas_store_fail;
 
-	mm->vma_count++;
+	vma_count_inc(mm);
 	validate_mm(mm);
 out:
 	perf_event_mmap(vma);
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 69fa7d14a6c2..ee5a1e2365e0 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
 	}
 
 	mtree_destroy(&mm->mm_mt);
-	mm->vma_count = 0;
+	vma_count_init(mm);
 	return count;
 }
 
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 15525b86145d..6e724ba1adf4 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -251,7 +251,10 @@ struct mutex {};
 
 struct mm_struct {
 	struct maple_tree mm_mt;
-	int vma_count;			/* number of VMAs */
+	union {
+		const int vma_count;		/* number of VMAs */
+		int __vma_count;
+	};
 	unsigned long total_vm;	   /* Total pages mapped */
 	unsigned long locked_vm;   /* Pages that have PG_mlocked set */
 	unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
@@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm)
 	return (max_count > vma_count) ? (max_count - vma_count) : 0;
 }
 
+static inline void vma_count_init(struct mm_struct *mm)
+{
+	mm->__vma_count = 0;
+}
+
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+	mm->__vma_count += nr_vmas;
+}
+
+static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
+{
+	vma_count_add(mm, -nr_vmas);
+}
+
+static inline void vma_count_inc(struct mm_struct *mm)
+{
+	vma_count_add(mm, 1);
+}
+
+static inline void vma_count_dec(struct mm_struct *mm)
+{
+	vma_count_sub(mm, 1);
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 6/7] mm: add assertion for VMA count limit
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
                   ` (4 preceding siblings ...)
  2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
  2025-09-17 13:44   ` David Hildenbrand
  2025-09-18 13:30   ` Lorenzo Stoakes
  2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
  2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
  7 siblings, 2 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
cases where the VMA count exceeds the sysctl_max_map_count limit.

This check will help catch future bugs or regressions where
the VMAs are allocated exceeding the limit.

The warning is placed in the main vma_count_*() helpers, while the
internal *_nocheck variants bypass it. _nocheck helpers are used to
ensure that the assertion does not trigger a false positive in
the legitimate case of a temporary VMA increase past the limit
by a VMA split in munmap().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---

Changes in v2:
  - Add assertions if exceeding max_vma_count limit, per Pedro

 include/linux/mm.h               | 12 ++++++--
 mm/internal.h                    |  1 -
 mm/vma.c                         | 49 +++++++++++++++++++++++++-------
 tools/testing/vma/vma_internal.h |  7 ++++-
 4 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bad1454984c..3a3749d7015c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
 
 void snapshot_page(struct page_snapshot *ps, const struct page *page);
 
+int vma_count_remaining(const struct mm_struct *mm);
+
 static inline void vma_count_init(struct mm_struct *mm)
 {
 	ACCESS_PRIVATE(mm, __vma_count) = 0;
 }
 
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
 {
 	ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
 }
 
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+	VM_WARN_ON_ONCE(!vma_count_remaining(mm));
+	__vma_count_add_nocheck(mm, nr_vmas);
+}
+
 static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
 {
-	vma_count_add(mm, -nr_vmas);
+	__vma_count_add_nocheck(mm, -nr_vmas);
 }
 
 static inline void vma_count_inc(struct mm_struct *mm)
diff --git a/mm/internal.h b/mm/internal.h
index 39f1c9535ae5..e0567a3b64fa 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1661,6 +1661,5 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
 void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
 int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
 
-int vma_count_remaining(const struct mm_struct *mm);
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/vma.c b/mm/vma.c
index 0cd3cb472220..0e4fcaebe209 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -323,15 +323,17 @@ static void vma_prepare(struct vma_prepare *vp)
 }
 
 /*
- * vma_complete- Helper function for handling the unlocking after altering VMAs,
- * or for inserting a VMA.
+ * This is the internal, unsafe version of vma_complete(). Unlike its
+ * wrapper, this function bypasses runtime checks for VMA count limits by
+ * using the _nocheck vma_count* helpers.
  *
- * @vp: The vma_prepare struct
- * @vmi: The vma iterator
- * @mm: The mm_struct
+ * Its use is restricted to __split_vma() where the VMA count can be
+ * temporarily higher than the sysctl_max_map_count limit.
+ *
+ * All other callers must use vma_complete().
  */
-static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
-			 struct mm_struct *mm)
+static void __vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
+			   struct mm_struct *mm)
 {
 	if (vp->file) {
 		if (vp->adj_next)
@@ -352,7 +354,11 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		 * (it may either follow vma or precede it).
 		 */
 		vma_iter_store_new(vmi, vp->insert);
-		vma_count_inc(mm);
+		/*
+		 * Explicitly allow vma_count to exceed the threshold to prevent,
+		 * blocking munmap() freeing resources.
+		 */
+		__vma_count_add_nocheck(mm, 1);
 	}
 
 	if (vp->anon_vma) {
@@ -403,6 +409,26 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
 		uprobe_mmap(vp->insert);
 }
 
+/*
+ * vma_complete- Helper function for handling the unlocking after altering VMAs,
+ * or for inserting a VMA.
+ *
+ * @vp: The vma_prepare struct
+ * @vmi: The vma iterator
+ * @mm: The mm_struct
+ */
+static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
+			 struct mm_struct *mm)
+{
+	/*
+	 * __vma_complete() explicitly foregoes checking the new
+	 * vma_count against the sysctl_max_map_count limit, so
+	 * do it here.
+	 */
+	VM_WARN_ON_ONCE(!vma_count_remaining(mm));
+	__vma_complete(vp, vmi, mm);
+}
+
 /*
  * init_vma_prep() - Initializer wrapper for vma_prepare struct
  * @vp: The vma_prepare struct
@@ -564,8 +590,11 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		vma->vm_end = addr;
 	}
 
-	/* vma_complete stores the new vma */
-	vma_complete(&vp, vmi, vma->vm_mm);
+	/*
+	 * __vma_complete stores the new vma without checking against the
+	 * sysctl_max_map_count (vma_count) limit.
+	 */
+	__vma_complete(&vp, vmi, vma->vm_mm);
 	validate_mm(vma->vm_mm);

 	/* Success. */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 6e724ba1adf4..d084b1eb2a5c 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1534,11 +1534,16 @@ static inline void vma_count_init(struct mm_struct *mm)
 	mm->__vma_count = 0;
 }

-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
 {
 	mm->__vma_count += nr_vmas;
 }

+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+	__vma_count_add_nocheck(mm, nr_vmas);
+}
+
 static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
 {
 	vma_count_add(mm, -nr_vmas);
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
                   ` (5 preceding siblings ...)
  2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
  2025-09-15 23:41   ` Steven Rostedt
  2025-09-18 13:42   ` Lorenzo Stoakes
  2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
  7 siblings, 2 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
  To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

Needed observability on in field devices can be collected with minimal
overhead and can be toggled on and off. Event driven telemetry can be
done with tracepoint BPF programs.

The process comm is provided for aggregation across devices and tgid is
to enable per-process aggregation per device.

This allows for observing the distribution of such problems in the
field, to deduce if there are legitimate bugs or if a bump to the limit is
warranted.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---

Chnages in v2:
  - Add needed observability for operations failing due to the vma count limit,
      per Minchan
    (Since there isn't a common point for debug logging due checks being
    external to the capacity based vma_count_remaining() helper. I used a
    trace event for low overhead and to facilitate event driven telemetry
    for in field devices)

 include/trace/events/vma.h | 32 ++++++++++++++++++++++++++++++++
 mm/mmap.c                  |  5 ++++-
 mm/mremap.c                | 10 ++++++++--
 mm/vma.c                   | 11 +++++++++--
 4 files changed, 53 insertions(+), 5 deletions(-)
 create mode 100644 include/trace/events/vma.h

diff --git a/include/trace/events/vma.h b/include/trace/events/vma.h
new file mode 100644
index 000000000000..2fed63b0d0a6
--- /dev/null
+++ b/include/trace/events/vma.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vma
+
+#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMA_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(max_vma_count_exceeded,
+
+	TP_PROTO(struct task_struct *task),
+
+	TP_ARGS(task),
+
+	TP_STRUCT__entry(
+		__string(comm,	task->comm)
+		__field(pid_t,	tgid)
+	),
+
+	TP_fast_assign(
+		__assign_str(comm);
+		__entry->tgid = task->tgid;
+	),
+
+	TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
+);
+
+#endif /*  _TRACE_VMA_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/mmap.c b/mm/mmap.c
index 30ddd550197e..0bb311bf48f3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -56,6 +56,7 @@
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/mmap.h>
+#include <trace/events/vma.h>
 
 #include "internal.h"
 
@@ -374,8 +375,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		return -EOVERFLOW;
 
 	/* Too many mappings? */
-	if (!vma_count_remaining(mm))
+	if (!vma_count_remaining(mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	/*
 	 * addr is returned from get_unmapped_area,
diff --git a/mm/mremap.c b/mm/mremap.c
index 14d35d87e89b..f42ac05f0069 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -30,6 +30,8 @@
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 
+#include <trace/events/vma.h>
+
 #include "internal.h"
 
 /* Classify the kind of remap operation being performed. */
@@ -1040,8 +1042,10 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
 	 * We'd prefer to avoid failure later on in do_munmap:
 	 * which may split one vma into three before unmapping.
 	 */
-	if (vma_count_remaining(current->mm) < 4)
+	if (vma_count_remaining(current->mm) < 4) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		if (vma->vm_start != old_addr)
@@ -1817,8 +1821,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
 	 * the threshold. In other words, is the current map count + 6 at or
 	 * below the threshold? Otherwise return -ENOMEM here to be more safe.
 	 */
-	if (vma_count_remaining(current->mm) < 6)
+	if (vma_count_remaining(current->mm) < 6) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	return 0;
 }
diff --git a/mm/vma.c b/mm/vma.c
index 0e4fcaebe209..692c33c3e84d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,6 +7,8 @@
 #include "vma_internal.h"
 #include "vma.h"
 
+#include <trace/events/vma.h>
+
 struct mmap_state {
 	struct mm_struct *mm;
 	struct vma_iterator *vmi;
@@ -621,8 +623,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		     unsigned long addr, int new_below)
 {
-	if (!vma_count_remaining(vma->vm_mm))
+	if (!vma_count_remaining(vma->vm_mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	return __split_vma(vmi, vma, addr, new_below);
 }
@@ -1375,6 +1379,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		 */
 		if (vms->end < vms->vma->vm_end &&
 		    !vma_count_remaining(vms->vma->vm_mm)) {
+			trace_max_vma_count_exceeded(current);
 			error = -ENOMEM;
 			goto vma_count_exceeded;
 		}
@@ -2801,8 +2806,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	if (!vma_count_remaining(mm))
+	if (!vma_count_remaining(mm)) {
+		trace_max_vma_count_exceeded(current);
 		return -ENOMEM;
+	}
 
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
-- 
2.51.0.384.g4c02a37b29-goog


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
                   ` (6 preceding siblings ...)
  2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
@ 2025-09-15 22:34 ` Andrew Morton
  2025-09-15 23:10   ` Kalesh Singh
  2025-09-16 10:12   ` Lorenzo Stoakes
  7 siblings, 2 replies; 68+ messages in thread
From: Andrew Morton @ 2025-09-15 22:34 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:

> Hi all,
> 
> This is v2 to the VMA count patch I previously posted at:
> 
> https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
> 
> 
> I've split it into multiple patches to address the feedback.
> 
> The main changes in v2 are:
> 
> - Use a capacity-based check for VMA count limit, per Lorenzo.
> - Rename map_count to vma_count, per David.
> - Add assertions for exceeding the limit, per Pedro.
> - Add tests for max_vma_count, per Liam.
> - Emit a trace event for failure due to insufficient capacity for
>   observability
> 
> Tested on x86_64 and arm64:
> 
> - Build test:
>     - allyesconfig for rename
> 
> - Selftests:
>       cd tools/testing/selftests/mm && \
>           make && \
>           ./run_vmtests.sh -t max_vma_count
> 
>        (With trace_max_vma_count_exceeded enabled)
> 
> - vma tests:
>       cd tools/testing/vma && \
>           make && \
> 	  ./vma

fwiw, there's nothing in the above which is usable in a [0/N] overview.

While useful, the "what changed since the previous version" info isn't
a suitable thing to carry in the permanent kernel record - it's
short-term treansient stuff, not helpful to someone who is looking at
the patchset in 2029.  

Similarly, the "how it was tested" material is also useful, but it
becomes irrelevant as soon as the code hits linux-next and mainline.


Anyhow, this -rc cycle has been quite the firehose in MM and I'm
feeling a need to slow things down for additional stabilization and so
people hopefully get additional bandwidth to digest the material we've
added this far.  So I think I'll just cherrypick [1/7] for now.  A
great flood of positive review activity would probably make me revisit
that ;)


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
@ 2025-09-15 22:36   ` Andrew Morton
  2025-09-16 14:20     ` Jonathan Corbet
  2025-09-16  9:45   ` Pedro Falcato
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2025-09-15 22:36 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, stable, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:

> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

lol.

x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
661

we really blew it that time!


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
@ 2025-09-15 23:10   ` Kalesh Singh
  2025-09-16  0:05     ` Andrew Morton
  2025-09-16 10:12   ` Lorenzo Stoakes
  1 sibling, 1 reply; 68+ messages in thread
From: Kalesh Singh @ 2025-09-15 23:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Mon, Sep 15, 2025 at 3:34 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > Hi all,
> >
> > This is v2 to the VMA count patch I previously posted at:
> >
> > https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
> >
> >
> > I've split it into multiple patches to address the feedback.
> >
> > The main changes in v2 are:
> >
> > - Use a capacity-based check for VMA count limit, per Lorenzo.
> > - Rename map_count to vma_count, per David.
> > - Add assertions for exceeding the limit, per Pedro.
> > - Add tests for max_vma_count, per Liam.
> > - Emit a trace event for failure due to insufficient capacity for
> >   observability
> >
> > Tested on x86_64 and arm64:
> >
> > - Build test:
> >     - allyesconfig for rename
> >
> > - Selftests:
> >       cd tools/testing/selftests/mm && \
> >           make && \
> >           ./run_vmtests.sh -t max_vma_count
> >
> >        (With trace_max_vma_count_exceeded enabled)
> >
> > - vma tests:
> >       cd tools/testing/vma && \
> >           make && \
> >         ./vma
>
> fwiw, there's nothing in the above which is usable in a [0/N] overview.
>
> While useful, the "what changed since the previous version" info isn't
> a suitable thing to carry in the permanent kernel record - it's
> short-term treansient stuff, not helpful to someone who is looking at
> the patchset in 2029.
>
> Similarly, the "how it was tested" material is also useful, but it
> becomes irrelevant as soon as the code hits linux-next and mainline.

Hi Andrew,

Thanks for the feedback. Do you mean the cover letter was not needed
in this case or that it lacked enough context?

>
>
> Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> feeling a need to slow things down for additional stabilization and so
> people hopefully get additional bandwidth to digest the material we've
> added this far.  So I think I'll just cherrypick [1/7] for now.  A
> great flood of positive review activity would probably make me revisit
> that ;)
>

I understand, yes 1/7 is all we need for now, since it prevents an
unrecoverable situation where we get over the limit and cannot recover
as munmap() will then always fail.

Thanks,
Kalesh

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
@ 2025-09-15 23:41   ` Steven Rostedt
  2025-09-16  1:19     ` Kalesh Singh
  2025-09-18 13:42   ` Lorenzo Stoakes
  1 sibling, 1 reply; 68+ messages in thread
From: Steven Rostedt @ 2025-09-15 23:41 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Mon, 15 Sep 2025 09:36:38 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> Needed observability on in field devices can be collected with minimal
> overhead and can be toggled on and off. Event driven telemetry can be
> done with tracepoint BPF programs.
> 
> The process comm is provided for aggregation across devices and tgid is
> to enable per-process aggregation per device.

What do you mean about comm being used to aggregation across devices?
What's special about this trace event that will make it used across devices?

Note, if BPF is being used, can't the BPF program just add the current
comm? Why waste space in the ring buffer for it?



> +
> +TRACE_EVENT(max_vma_count_exceeded,
> +
> +	TP_PROTO(struct task_struct *task),

Why pass in the task if it's always going to be current?

> +
> +	TP_ARGS(task),
> +
> +	TP_STRUCT__entry(
> +		__string(comm,	task->comm)

This could be:

		__string(comm, current)

But I still want to know what makes this trace event special over other
trace events to store this, and can't it be retrieved another way,
especially if BPF is being used to hook to it?

-- Steve


> +		__field(pid_t,	tgid)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(comm);
> +		__entry->tgid = task->tgid;
> +	),
> +
> +	TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
> +);
> +

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-15 23:10   ` Kalesh Singh
@ 2025-09-16  0:05     ` Andrew Morton
  2025-09-16  1:23       ` Kalesh Singh
  0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2025-09-16  0:05 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Mon, 15 Sep 2025 16:10:55 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:

> > fwiw, there's nothing in the above which is usable in a [0/N] overview.
> >
> > While useful, the "what changed since the previous version" info isn't
> > a suitable thing to carry in the permanent kernel record - it's
> > short-term treansient stuff, not helpful to someone who is looking at
> > the patchset in 2029.
> >
> > Similarly, the "how it was tested" material is also useful, but it
> > becomes irrelevant as soon as the code hits linux-next and mainline.
> 
> Hi Andrew,
> 
> Thanks for the feedback. Do you mean the cover letter was not needed
> in this case or that it lacked enough context?

The latter.  As I've split up the series, please put together some
words to describe the remaining 6 patches if/when resending.

Thanks.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-15 23:41   ` Steven Rostedt
@ 2025-09-16  1:19     ` Kalesh Singh
  2025-09-16 15:52       ` Steven Rostedt
  0 siblings, 1 reply; 68+ messages in thread
From: Kalesh Singh @ 2025-09-16  1:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Mon, Sep 15, 2025 at 4:41 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 15 Sep 2025 09:36:38 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > Needed observability on in field devices can be collected with minimal
> > overhead and can be toggled on and off. Event driven telemetry can be
> > done with tracepoint BPF programs.
> >
> > The process comm is provided for aggregation across devices and tgid is
> > to enable per-process aggregation per device.
>
> What do you mean about comm being used to aggregation across devices?
> What's special about this trace event that will make it used across devices?
>
> Note, if BPF is being used, can't the BPF program just add the current
> comm? Why waste space in the ring buffer for it?
>
>
>
> > +
> > +TRACE_EVENT(max_vma_count_exceeded,
> > +
> > +     TP_PROTO(struct task_struct *task),
>
> Why pass in the task if it's always going to be current?
>
> > +
> > +     TP_ARGS(task),
> > +
> > +     TP_STRUCT__entry(
> > +             __string(comm,  task->comm)
>
> This could be:
>
>                 __string(comm, current)
>
> But I still want to know what makes this trace event special over other
> trace events to store this, and can't it be retrieved another way,
> especially if BPF is being used to hook to it?

Hi Steve,

Thanks for the comments and suggestion you are right we can use bpf to
get the comm. There is nothing special about this trace event.  I will
drop comm in the next revision.

The reason I did the task_struct parameter (current): I believe there
is a limitation that we must  specify at least 1 parameter to the
TRACE_EVENT()  PROTO and ARGS macros.

Is there some way to use this without needing a parameter?

I hit the build failure below, with no parameters:

In file included from mm/vma.c:10:
./include/trace/events/vma.h:10:1: error: expected parameter declarator
   10 | TRACE_EVENT(max_vma_count_exceeded,
      | ^
...

Below is the code for reference:

/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM vma

#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_VMA_H

#include <linux/tracepoint.h>

TRACE_EVENT(max_vma_count_exceeded,

TP_PROTO(),

TP_ARGS(),

TP_STRUCT__entry(
__field(pid_t, tgid)
),

TP_fast_assign(
__entry->tgid = current->tgid;
),

TP_printk("tgid=%d", __entry->tgid)
);

#endif /*  _TRACE_VMA_H */

/* This part must be outside protection */
#include <trace/define_trace.h>

Thanks,
Kalesh

>
> -- Steve
>
>
> > +             __field(pid_t,  tgid)
> > +     ),
> > +
> > +     TP_fast_assign(
> > +             __assign_str(comm);
> > +             __entry->tgid = task->tgid;
> > +     ),
> > +
> > +     TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
> > +);
> > +

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-16  0:05     ` Andrew Morton
@ 2025-09-16  1:23       ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-16  1:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Mon, Sep 15, 2025 at 5:05 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 15 Sep 2025 16:10:55 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > > fwiw, there's nothing in the above which is usable in a [0/N] overview.
> > >
> > > While useful, the "what changed since the previous version" info isn't
> > > a suitable thing to carry in the permanent kernel record - it's
> > > short-term treansient stuff, not helpful to someone who is looking at
> > > the patchset in 2029.
> > >
> > > Similarly, the "how it was tested" material is also useful, but it
> > > becomes irrelevant as soon as the code hits linux-next and mainline.
> >
> > Hi Andrew,
> >
> > Thanks for the feedback. Do you mean the cover letter was not needed
> > in this case or that it lacked enough context?
>
> The latter.  As I've split up the series, please put together some
> words to describe the remaining 6 patches if/when resending.

Hi Andrew,

Thanks for clarifying. I'll make sure to fix that when resending.

--Kalesh

>
> Thanks.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 5/7] mm: harden vma_count against direct modification
  2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
@ 2025-09-16  9:10   ` kernel test robot
  2025-09-18 14:52   ` Lorenzo Stoakes
  1 sibling, 0 replies; 68+ messages in thread
From: kernel test robot @ 2025-09-16  9:10 UTC (permalink / raw)
  To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, david, Liam.Howlett,
	rppt, pfalcato
  Cc: oe-kbuild-all, kernel-team, android-mm, Kalesh Singh,
	Alexander Viro, Christian Brauner, Jan Kara, Kees Cook,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan

Hi Kalesh,

kernel test robot noticed the following build errors:

[auto build test ERROR on f83ec76bf285bea5727f478a68b894f5543ca76e]

url:    https://github.com/intel-lab-lkp/linux/commits/Kalesh-Singh/mm-fix-off-by-one-error-in-VMA-count-limit-checks/20250916-005018
base:   f83ec76bf285bea5727f478a68b894f5543ca76e
patch link:    https://lore.kernel.org/r/20250915163838.631445-6-kaleshsingh%40google.com
patch subject: [PATCH v2 5/7] mm: harden vma_count against direct modification
config: m68k-allnoconfig (https://download.01.org/0day-ci/archive/20250916/202509161708.EnaFVJbi-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250916/202509161708.EnaFVJbi-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509161708.EnaFVJbi-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/nommu.c: In function 'cleanup_vma_from_mm':
>> mm/nommu.c:579:30: error: decrement of read-only member 'vma_count'
     579 |         vma->vm_mm->vma_count--;
         |                              ^~
   mm/nommu.c: In function 'do_mmap':
>> mm/nommu.c:1201:31: error: increment of read-only member 'vma_count'
    1201 |         current->mm->vma_count++;
         |                               ^~
   mm/nommu.c: In function 'split_vma':
   mm/nommu.c:1369:22: error: increment of read-only member 'vma_count'
    1369 |         mm->vma_count++;
         |                      ^~


vim +/vma_count +579 mm/nommu.c

8220543df1489e Matthew Wilcox (Oracle  2022-09-06  576) 
8220543df1489e Matthew Wilcox (Oracle  2022-09-06  577) static void cleanup_vma_from_mm(struct vm_area_struct *vma)
8220543df1489e Matthew Wilcox (Oracle  2022-09-06  578) {
53045ecb29a61f Kalesh Singh            2025-09-15 @579  	vma->vm_mm->vma_count--;
^1da177e4c3f41 Linus Torvalds          2005-04-16  580  	/* remove the VMA from the mapping */
^1da177e4c3f41 Linus Torvalds          2005-04-16  581  	if (vma->vm_file) {
7964cf8caa4dfa Liam R. Howlett         2022-09-06  582  		struct address_space *mapping;
^1da177e4c3f41 Linus Torvalds          2005-04-16  583  		mapping = vma->vm_file->f_mapping;
^1da177e4c3f41 Linus Torvalds          2005-04-16  584  
83cde9e8ba95d1 Davidlohr Bueso         2014-12-12  585  		i_mmap_lock_write(mapping);
^1da177e4c3f41 Linus Torvalds          2005-04-16  586  		flush_dcache_mmap_lock(mapping);
6b2dbba8b6ac4d Michel Lespinasse       2012-10-08  587  		vma_interval_tree_remove(vma, &mapping->i_mmap);
^1da177e4c3f41 Linus Torvalds          2005-04-16  588  		flush_dcache_mmap_unlock(mapping);
83cde9e8ba95d1 Davidlohr Bueso         2014-12-12  589  		i_mmap_unlock_write(mapping);
^1da177e4c3f41 Linus Torvalds          2005-04-16  590  	}
8220543df1489e Matthew Wilcox (Oracle  2022-09-06  591) }
47d9644de92c1a Liam R. Howlett         2023-01-20  592  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
  2025-09-15 22:36   ` Andrew Morton
@ 2025-09-16  9:45   ` Pedro Falcato
  2025-09-17  7:44   ` SeongJae Park
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 68+ messages in thread
From: Pedro Falcato @ 2025-09-16  9:45 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	kernel-team, android-mm, stable, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
> 
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
> 
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
> 
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

Looks good, thanks!

-- 
Pedro

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
  2025-09-15 23:10   ` Kalesh Singh
@ 2025-09-16 10:12   ` Lorenzo Stoakes
  2025-09-16 17:47     ` Kalesh Singh
  2025-09-17  2:16     ` Andrew Morton
  1 sibling, 2 replies; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 10:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> feeling a need to slow things down for additional stabilization and so
> people hopefully get additional bandwidth to digest the material we've
> added this far.  So I think I'll just cherrypick [1/7] for now.  A
> great flood of positive review activity would probably make me revisit
> that ;)
>

Kalesh - I do intend to look at this series when I have a chance. My review
workload has been insane so it's hard to keep up at the moment.

Andrew - This cycle has been crazy, speaking from point of view of somebody
doing a lot of review, it's been very very exhausting from this side too,
and this kind of work can feel a little... thankless... sometimes :)

I feel like we maybe need a way to ask people to slow down, sometimes at
least.

Perhaps being less accepting of patches during merge window is one aspect,
as the merge window leading up to this cycle was almost the same review
load as when the cycle started.

Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
in all this too :)

(I am spekaing at Kernel Recipes then going on a very-badly-needed 2.5 week
vacataion afterwards over the merge window so I hope to stave off burnout
that way. Be good if I could keep mails upon return to 3 digits, but I have
my doubts :P)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 22:36   ` Andrew Morton
@ 2025-09-16 14:20     ` Jonathan Corbet
  2025-09-17  1:16       ` Andrew Morton
  0 siblings, 1 reply; 68+ messages in thread
From: Jonathan Corbet @ 2025-09-16 14:20 UTC (permalink / raw)
  To: Andrew Morton, Kalesh Singh
  Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, stable, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>
> lol.
>
> x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
> 661
>
> we really blew it that time!

A few years back I made a list of the most-fixed commits in the kernel
history; unsurprisingly that one (I call it the "original sin") came out
on top:

  https://lwn.net/Articles/914632/

Perhaps the time has come to update that analysis.

jon

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16  1:19     ` Kalesh Singh
@ 2025-09-16 15:52       ` Steven Rostedt
  2025-09-16 17:36         ` Kalesh Singh
  2025-09-18 11:38         ` Pedro Falcato
  0 siblings, 2 replies; 68+ messages in thread
From: Steven Rostedt @ 2025-09-16 15:52 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Mon, 15 Sep 2025 18:19:53 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:


> Hi Steve,
> 
> Thanks for the comments and suggestion you are right we can use bpf to
> get the comm. There is nothing special about this trace event.  I will
> drop comm in the next revision.
> 
> The reason I did the task_struct parameter (current): I believe there
> is a limitation that we must  specify at least 1 parameter to the
> TRACE_EVENT()  PROTO and ARGS macros.

OK, then this is another issue. We don't want tracepoint "markers".
Each tracepoint can take up to 5K in memory due to the code it
generates and the meta data to control it.

For something like that, we highly recommend dynamic probes (fprobes,
kprobes, etc).

The only purpose of a static tracepoint is to get data within a
function that is too difficult to get via a probe. It should never be
used as a trigger where its purpose is "we hit this path".

-- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 15:52       ` Steven Rostedt
@ 2025-09-16 17:36         ` Kalesh Singh
  2025-09-16 17:48           ` Steven Rostedt
  2025-09-18 11:38         ` Pedro Falcato
  1 sibling, 1 reply; 68+ messages in thread
From: Kalesh Singh @ 2025-09-16 17:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Tue, Sep 16, 2025 at 8:52 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 15 Sep 2025 18:19:53 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
>
> > Hi Steve,
> >
> > Thanks for the comments and suggestion you are right we can use bpf to
> > get the comm. There is nothing special about this trace event.  I will
> > drop comm in the next revision.
> >
> > The reason I did the task_struct parameter (current): I believe there
> > is a limitation that we must  specify at least 1 parameter to the
> > TRACE_EVENT()  PROTO and ARGS macros.
>
> OK, then this is another issue. We don't want tracepoint "markers".
> Each tracepoint can take up to 5K in memory due to the code it
> generates and the meta data to control it.
>
> For something like that, we highly recommend dynamic probes (fprobes,
> kprobes, etc).
>
> The only purpose of a static tracepoint is to get data within a
> function that is too difficult to get via a probe. It should never be
> used as a trigger where its purpose is "we hit this path".

Hi Steve,

I completely agree with the principle that static tracepoints
shouldn't be used as markers if a dynamic probe will suffice. The
intent here is to avoid introducing overhead in the common case to
avoid regressing mmap, munmap, and other syscall latencies; while
still providing observability for the max vma_count exceeded failure
condition.

The original centralized check (before previous review rounds) was
indeed in a dedicated function, exceeds_max_map_count(), where a
kprobe/fprobe could have been easily attached without impacting the
common path. This was changed due to previous review feedback to the
capacity based vma_count_remaining() which necessitated the check to
be done externally by the callers:

https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/

Would you be ok with something like:

trace_max_vma_count_exceeded(mm);

TP_STRUCT__entry(
__field(unsigned int, mm_id)
__field(unsigned int vma_count)
)

mm_id would be the hash of the mm_struct ptr similar to rss_stat and
the vma_count is the current vma count (some syscalls have different
requirements on the capacity remaining: mremap requires 6 available
slots, other syscalls require 1).

Thanks,
Kalesh

>
> -- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-16 10:12   ` Lorenzo Stoakes
@ 2025-09-16 17:47     ` Kalesh Singh
  2025-09-17  2:16     ` Andrew Morton
  1 sibling, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-16 17:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, minchan, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Tue, Sep 16, 2025 at 3:12 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> > Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> > feeling a need to slow things down for additional stabilization and so
> > people hopefully get additional bandwidth to digest the material we've
> > added this far.  So I think I'll just cherrypick [1/7] for now.  A
> > great flood of positive review activity would probably make me revisit
> > that ;)
> >
>
> Kalesh - I do intend to look at this series when I have a chance. My review
> workload has been insane so it's hard to keep up at the moment.
>
> Andrew - This cycle has been crazy, speaking from point of view of somebody
> doing a lot of review, it's been very very exhausting from this side too,
> and this kind of work can feel a little... thankless... sometimes :)
>
> I feel like we maybe need a way to ask people to slow down, sometimes at
> least.
>
> Perhaps being less accepting of patches during merge window is one aspect,
> as the merge window leading up to this cycle was almost the same review
> load as when the cycle started.
>
> Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
> in all this too :)
>
> (I am spekaing at Kernel Recipes then going on a very-badly-needed 2.5 week
> vacataion afterwards over the merge window so I hope to stave off burnout
> that way. Be good if I could keep mails upon return to 3 digits, but I have
> my doubts :P)

Hi Lorenzo,

Absolutely, take care of yourself. We all appreciate the  amount of
work you put into reviewing :)

Have a good talk and enjoy your vacation. Don't worry about the backlog.

-- Kalesh

>
> Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 17:36         ` Kalesh Singh
@ 2025-09-16 17:48           ` Steven Rostedt
  2025-09-16 17:57             ` Kalesh Singh
  0 siblings, 1 reply; 68+ messages in thread
From: Steven Rostedt @ 2025-09-16 17:48 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Tue, 16 Sep 2025 10:36:57 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> I completely agree with the principle that static tracepoints
> shouldn't be used as markers if a dynamic probe will suffice. The
> intent here is to avoid introducing overhead in the common case to
> avoid regressing mmap, munmap, and other syscall latencies; while
> still providing observability for the max vma_count exceeded failure
> condition.
> 
> The original centralized check (before previous review rounds) was
> indeed in a dedicated function, exceeds_max_map_count(), where a
> kprobe/fprobe could have been easily attached without impacting the
> common path. This was changed due to previous review feedback to the
> capacity based vma_count_remaining() which necessitated the check to
> be done externally by the callers:
> 
> https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
> 
> Would you be ok with something like:
> 
> trace_max_vma_count_exceeded(mm);
> 
> TP_STRUCT__entry(
> __field(unsigned int, mm_id)
> __field(unsigned int vma_count)
> )
> 
> mm_id would be the hash of the mm_struct ptr similar to rss_stat and
> the vma_count is the current vma count (some syscalls have different
> requirements on the capacity remaining: mremap requires 6 available
> slots, other syscalls require 1).
> 

BTW, why the hash of the mm pointer and not the pointer itself? We save
pointers in lots of places, and if it is the pointer, you could use an
eprobe to attache to the trace event to dereference its fields.

-- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 17:48           ` Steven Rostedt
@ 2025-09-16 17:57             ` Kalesh Singh
  2025-09-16 18:02               ` Steven Rostedt
  0 siblings, 1 reply; 68+ messages in thread
From: Kalesh Singh @ 2025-09-16 17:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Tue, Sep 16, 2025 at 10:47 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 Sep 2025 10:36:57 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > I completely agree with the principle that static tracepoints
> > shouldn't be used as markers if a dynamic probe will suffice. The
> > intent here is to avoid introducing overhead in the common case to
> > avoid regressing mmap, munmap, and other syscall latencies; while
> > still providing observability for the max vma_count exceeded failure
> > condition.
> >
> > The original centralized check (before previous review rounds) was
> > indeed in a dedicated function, exceeds_max_map_count(), where a
> > kprobe/fprobe could have been easily attached without impacting the
> > common path. This was changed due to previous review feedback to the
> > capacity based vma_count_remaining() which necessitated the check to
> > be done externally by the callers:
> >
> > https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
> >
> > Would you be ok with something like:
> >
> > trace_max_vma_count_exceeded(mm);
> >
> > TP_STRUCT__entry(
> > __field(unsigned int, mm_id)
> > __field(unsigned int vma_count)
> > )
> >
> > mm_id would be the hash of the mm_struct ptr similar to rss_stat and
> > the vma_count is the current vma count (some syscalls have different
> > requirements on the capacity remaining: mremap requires 6 available
> > slots, other syscalls require 1).
> >
>
> BTW, why the hash of the mm pointer and not the pointer itself? We save
> pointers in lots of places, and if it is the pointer, you could use an
> eprobe to attache to the trace event to dereference its fields.

In Android we try to avoid exposing raw kernel pointers to userspace
for security reasons: raising /proc/sys/kernel/kptr_restrict to 2
immediately after symbols are resolved for necessary telemetry tooling
during early boot. I believe this is also why rss_stat uses the hash
and not the raw pointer.

Thanks,
Kalesh

>
> -- Steve
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 17:57             ` Kalesh Singh
@ 2025-09-16 18:02               ` Steven Rostedt
  2025-09-16 18:23                 ` Kalesh Singh
  0 siblings, 1 reply; 68+ messages in thread
From: Steven Rostedt @ 2025-09-16 18:02 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Tue, 16 Sep 2025 10:57:43 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> > BTW, why the hash of the mm pointer and not the pointer itself? We save
> > pointers in lots of places, and if it is the pointer, you could use an
> > eprobe to attache to the trace event to dereference its fields.  
> 
> In Android we try to avoid exposing raw kernel pointers to userspace
> for security reasons: raising /proc/sys/kernel/kptr_restrict to 2
> immediately after symbols are resolved for necessary telemetry tooling
> during early boot. I believe this is also why rss_stat uses the hash
> and not the raw pointer.

When it comes to tracing, you already lost. If it goes into the ring buffer
it's a raw pointer. BPF doesn't use the output of the trace event, so you
are exposing nothing from that. It uses the proto directly.

Heck, if you enable function tracing, you are exposing every function
address it traces via the raw data output.

-- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 18:02               ` Steven Rostedt
@ 2025-09-16 18:23                 ` Kalesh Singh
  2025-09-16 18:51                   ` Steven Rostedt
  0 siblings, 1 reply; 68+ messages in thread
From: Kalesh Singh @ 2025-09-16 18:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Tue, Sep 16, 2025 at 11:01 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 Sep 2025 10:57:43 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > > BTW, why the hash of the mm pointer and not the pointer itself? We save
> > > pointers in lots of places, and if it is the pointer, you could use an
> > > eprobe to attache to the trace event to dereference its fields.
> >
> > In Android we try to avoid exposing raw kernel pointers to userspace
> > for security reasons: raising /proc/sys/kernel/kptr_restrict to 2
> > immediately after symbols are resolved for necessary telemetry tooling
> > during early boot. I believe this is also why rss_stat uses the hash
> > and not the raw pointer.
>
> When it comes to tracing, you already lost. If it goes into the ring buffer
> it's a raw pointer. BPF doesn't use the output of the trace event, so you
> are exposing nothing from that. It uses the proto directly.

My understanding is that the BPF tracepoint type uses the trace event
fields from TP_STRUCT__entry(); whereas the raw tracepoint type has
access to the proto arguments. Please CMIW: Isn't what we'd be adding
to the trace buffer is the hashed mm_id value?

>
> Heck, if you enable function tracing, you are exposing every function
> address it traces via the raw data output.

Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER  in
Android production kernels.

Thanks,
Kalesh

>
> -- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
@ 2025-09-16 18:33   ` kernel test robot
  2025-09-18 14:10     ` Lorenzo Stoakes
  2025-09-17 13:38   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 68+ messages in thread
From: kernel test robot @ 2025-09-16 18:33 UTC (permalink / raw)
  To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, david, Liam.Howlett,
	rppt, pfalcato
  Cc: oe-kbuild-all, kernel-team, android-mm, Kalesh Singh,
	Alexander Viro, Christian Brauner, Jan Kara, Kees Cook,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan

Hi Kalesh,

kernel test robot noticed the following build errors:

[auto build test ERROR on f83ec76bf285bea5727f478a68b894f5543ca76e]

url:    https://github.com/intel-lab-lkp/linux/commits/Kalesh-Singh/mm-fix-off-by-one-error-in-VMA-count-limit-checks/20250916-005018
base:   f83ec76bf285bea5727f478a68b894f5543ca76e
patch link:    https://lore.kernel.org/r/20250915163838.631445-4-kaleshsingh%40google.com
patch subject: [PATCH v2 3/7] mm: introduce vma_count_remaining()
config: s390-randconfig-r122-20250916 (https://download.01.org/0day-ci/archive/20250917/202509170238.NUURxWPI-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 13.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250917/202509170238.NUURxWPI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509170238.NUURxWPI-lkp@intel.com/

All errors (new ones prefixed by >>):

   s390-linux-ld: mm/mmap.o: in function `do_mmap':
>> mm/mmap.c:377:(.text+0x6566): undefined reference to `vma_count_remaining'
   s390-linux-ld: mm/mremap.o: in function `prep_move_vma':
>> mm/mremap.c:1043:(.text+0x199a): undefined reference to `vma_count_remaining'
   s390-linux-ld: mm/mremap.o: in function `check_mremap_params':
   mm/mremap.c:1820:(.text+0x4044): undefined reference to `vma_count_remaining'
   s390-linux-ld: mm/vma.o: in function `vms_gather_munmap_vmas':
>> mm/vma.c:1348:(.text+0x8ebc): undefined reference to `vma_count_remaining'
   s390-linux-ld: mm/vma.o: in function `split_vma':
   mm/vma.c:595:(.text+0xa2a2): undefined reference to `vma_count_remaining'
   s390-linux-ld: mm/vma.o:mm/vma.c:2775: more undefined references to `vma_count_remaining' follow


vim +377 mm/mmap.c

   276	
   277	/**
   278	 * do_mmap() - Perform a userland memory mapping into the current process
   279	 * address space of length @len with protection bits @prot, mmap flags @flags
   280	 * (from which VMA flags will be inferred), and any additional VMA flags to
   281	 * apply @vm_flags. If this is a file-backed mapping then the file is specified
   282	 * in @file and page offset into the file via @pgoff.
   283	 *
   284	 * This function does not perform security checks on the file and assumes, if
   285	 * @uf is non-NULL, the caller has provided a list head to track unmap events
   286	 * for userfaultfd @uf.
   287	 *
   288	 * It also simply indicates whether memory population is required by setting
   289	 * @populate, which must be non-NULL, expecting the caller to actually perform
   290	 * this task itself if appropriate.
   291	 *
   292	 * This function will invoke architecture-specific (and if provided and
   293	 * relevant, file system-specific) logic to determine the most appropriate
   294	 * unmapped area in which to place the mapping if not MAP_FIXED.
   295	 *
   296	 * Callers which require userland mmap() behaviour should invoke vm_mmap(),
   297	 * which is also exported for module use.
   298	 *
   299	 * Those which require this behaviour less security checks, userfaultfd and
   300	 * populate behaviour, and who handle the mmap write lock themselves, should
   301	 * call this function.
   302	 *
   303	 * Note that the returned address may reside within a merged VMA if an
   304	 * appropriate merge were to take place, so it doesn't necessarily specify the
   305	 * start of a VMA, rather only the start of a valid mapped range of length
   306	 * @len bytes, rounded down to the nearest page size.
   307	 *
   308	 * The caller must write-lock current->mm->mmap_lock.
   309	 *
   310	 * @file: An optional struct file pointer describing the file which is to be
   311	 * mapped, if a file-backed mapping.
   312	 * @addr: If non-zero, hints at (or if @flags has MAP_FIXED set, specifies) the
   313	 * address at which to perform this mapping. See mmap (2) for details. Must be
   314	 * page-aligned.
   315	 * @len: The length of the mapping. Will be page-aligned and must be at least 1
   316	 * page in size.
   317	 * @prot: Protection bits describing access required to the mapping. See mmap
   318	 * (2) for details.
   319	 * @flags: Flags specifying how the mapping should be performed, see mmap (2)
   320	 * for details.
   321	 * @vm_flags: VMA flags which should be set by default, or 0 otherwise.
   322	 * @pgoff: Page offset into the @file if file-backed, should be 0 otherwise.
   323	 * @populate: A pointer to a value which will be set to 0 if no population of
   324	 * the range is required, or the number of bytes to populate if it is. Must be
   325	 * non-NULL. See mmap (2) for details as to under what circumstances population
   326	 * of the range occurs.
   327	 * @uf: An optional pointer to a list head to track userfaultfd unmap events
   328	 * should unmapping events arise. If provided, it is up to the caller to manage
   329	 * this.
   330	 *
   331	 * Returns: Either an error, or the address at which the requested mapping has
   332	 * been performed.
   333	 */
   334	unsigned long do_mmap(struct file *file, unsigned long addr,
   335				unsigned long len, unsigned long prot,
   336				unsigned long flags, vm_flags_t vm_flags,
   337				unsigned long pgoff, unsigned long *populate,
   338				struct list_head *uf)
   339	{
   340		struct mm_struct *mm = current->mm;
   341		int pkey = 0;
   342	
   343		*populate = 0;
   344	
   345		mmap_assert_write_locked(mm);
   346	
   347		if (!len)
   348			return -EINVAL;
   349	
   350		/*
   351		 * Does the application expect PROT_READ to imply PROT_EXEC?
   352		 *
   353		 * (the exception is when the underlying filesystem is noexec
   354		 *  mounted, in which case we don't add PROT_EXEC.)
   355		 */
   356		if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
   357			if (!(file && path_noexec(&file->f_path)))
   358				prot |= PROT_EXEC;
   359	
   360		/* force arch specific MAP_FIXED handling in get_unmapped_area */
   361		if (flags & MAP_FIXED_NOREPLACE)
   362			flags |= MAP_FIXED;
   363	
   364		if (!(flags & MAP_FIXED))
   365			addr = round_hint_to_min(addr);
   366	
   367		/* Careful about overflows.. */
   368		len = PAGE_ALIGN(len);
   369		if (!len)
   370			return -ENOMEM;
   371	
   372		/* offset overflow? */
   373		if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
   374			return -EOVERFLOW;
   375	
   376		/* Too many mappings? */
 > 377		if (!vma_count_remaining(mm))
   378			return -ENOMEM;
   379	
   380		/*
   381		 * addr is returned from get_unmapped_area,
   382		 * There are two cases:
   383		 * 1> MAP_FIXED == false
   384		 *	unallocated memory, no need to check sealing.
   385		 * 1> MAP_FIXED == true
   386		 *	sealing is checked inside mmap_region when
   387		 *	do_vmi_munmap is called.
   388		 */
   389	
   390		if (prot == PROT_EXEC) {
   391			pkey = execute_only_pkey(mm);
   392			if (pkey < 0)
   393				pkey = 0;
   394		}
   395	
   396		/* Do simple checking here so the lower-level routines won't have
   397		 * to. we assume access permissions have been handled by the open
   398		 * of the memory object, so we don't do any here.
   399		 */
   400		vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(file, flags) |
   401				mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
   402	
   403		/* Obtain the address to map to. we verify (or select) it and ensure
   404		 * that it represents a valid section of the address space.
   405		 */
   406		addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
   407		if (IS_ERR_VALUE(addr))
   408			return addr;
   409	
   410		if (flags & MAP_FIXED_NOREPLACE) {
   411			if (find_vma_intersection(mm, addr, addr + len))
   412				return -EEXIST;
   413		}
   414	
   415		if (flags & MAP_LOCKED)
   416			if (!can_do_mlock())
   417				return -EPERM;
   418	
   419		if (!mlock_future_ok(mm, vm_flags, len))
   420			return -EAGAIN;
   421	
   422		if (file) {
   423			struct inode *inode = file_inode(file);
   424			unsigned long flags_mask;
   425			int err;
   426	
   427			if (!file_mmap_ok(file, inode, pgoff, len))
   428				return -EOVERFLOW;
   429	
   430			flags_mask = LEGACY_MAP_MASK;
   431			if (file->f_op->fop_flags & FOP_MMAP_SYNC)
   432				flags_mask |= MAP_SYNC;
   433	
   434			switch (flags & MAP_TYPE) {
   435			case MAP_SHARED:
   436				/*
   437				 * Force use of MAP_SHARED_VALIDATE with non-legacy
   438				 * flags. E.g. MAP_SYNC is dangerous to use with
   439				 * MAP_SHARED as you don't know which consistency model
   440				 * you will get. We silently ignore unsupported flags
   441				 * with MAP_SHARED to preserve backward compatibility.
   442				 */
   443				flags &= LEGACY_MAP_MASK;
   444				fallthrough;
   445			case MAP_SHARED_VALIDATE:
   446				if (flags & ~flags_mask)
   447					return -EOPNOTSUPP;
   448				if (prot & PROT_WRITE) {
   449					if (!(file->f_mode & FMODE_WRITE))
   450						return -EACCES;
   451					if (IS_SWAPFILE(file->f_mapping->host))
   452						return -ETXTBSY;
   453				}
   454	
   455				/*
   456				 * Make sure we don't allow writing to an append-only
   457				 * file..
   458				 */
   459				if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
   460					return -EACCES;
   461	
   462				vm_flags |= VM_SHARED | VM_MAYSHARE;
   463				if (!(file->f_mode & FMODE_WRITE))
   464					vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
   465				fallthrough;
   466			case MAP_PRIVATE:
   467				if (!(file->f_mode & FMODE_READ))
   468					return -EACCES;
   469				if (path_noexec(&file->f_path)) {
   470					if (vm_flags & VM_EXEC)
   471						return -EPERM;
   472					vm_flags &= ~VM_MAYEXEC;
   473				}
   474	
   475				if (!can_mmap_file(file))
   476					return -ENODEV;
   477				if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
   478					return -EINVAL;
   479				break;
   480	
   481			default:
   482				return -EINVAL;
   483			}
   484	
   485			/*
   486			 * Check to see if we are violating any seals and update VMA
   487			 * flags if necessary to avoid future seal violations.
   488			 */
   489			err = memfd_check_seals_mmap(file, &vm_flags);
   490			if (err)
   491				return (unsigned long)err;
   492		} else {
   493			switch (flags & MAP_TYPE) {
   494			case MAP_SHARED:
   495				if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
   496					return -EINVAL;
   497				/*
   498				 * Ignore pgoff.
   499				 */
   500				pgoff = 0;
   501				vm_flags |= VM_SHARED | VM_MAYSHARE;
   502				break;
   503			case MAP_DROPPABLE:
   504				if (VM_DROPPABLE == VM_NONE)
   505					return -ENOTSUPP;
   506				/*
   507				 * A locked or stack area makes no sense to be droppable.
   508				 *
   509				 * Also, since droppable pages can just go away at any time
   510				 * it makes no sense to copy them on fork or dump them.
   511				 *
   512				 * And don't attempt to combine with hugetlb for now.
   513				 */
   514				if (flags & (MAP_LOCKED | MAP_HUGETLB))
   515				        return -EINVAL;
   516				if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
   517				        return -EINVAL;
   518	
   519				vm_flags |= VM_DROPPABLE;
   520	
   521				/*
   522				 * If the pages can be dropped, then it doesn't make
   523				 * sense to reserve them.
   524				 */
   525				vm_flags |= VM_NORESERVE;
   526	
   527				/*
   528				 * Likewise, they're volatile enough that they
   529				 * shouldn't survive forks or coredumps.
   530				 */
   531				vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;
   532				fallthrough;
   533			case MAP_PRIVATE:
   534				/*
   535				 * Set pgoff according to addr for anon_vma.
   536				 */
   537				pgoff = addr >> PAGE_SHIFT;
   538				break;
   539			default:
   540				return -EINVAL;
   541			}
   542		}
   543	
   544		/*
   545		 * Set 'VM_NORESERVE' if we should not account for the
   546		 * memory use of this mapping.
   547		 */
   548		if (flags & MAP_NORESERVE) {
   549			/* We honor MAP_NORESERVE if allowed to overcommit */
   550			if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
   551				vm_flags |= VM_NORESERVE;
   552	
   553			/* hugetlb applies strict overcommit unless MAP_NORESERVE */
   554			if (file && is_file_hugepages(file))
   555				vm_flags |= VM_NORESERVE;
   556		}
   557	
   558		addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
   559		if (!IS_ERR_VALUE(addr) &&
   560		    ((vm_flags & VM_LOCKED) ||
   561		     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
   562			*populate = len;
   563		return addr;
   564	}
   565	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 18:23                 ` Kalesh Singh
@ 2025-09-16 18:51                   ` Steven Rostedt
  2025-09-16 20:08                     ` Kalesh Singh
  0 siblings, 1 reply; 68+ messages in thread
From: Steven Rostedt @ 2025-09-16 18:51 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Tue, 16 Sep 2025 11:23:03 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:

> > When it comes to tracing, you already lost. If it goes into the ring buffer
> > it's a raw pointer. BPF doesn't use the output of the trace event, so you
> > are exposing nothing from that. It uses the proto directly.  
> 
> My understanding is that the BPF tracepoint type uses the trace event
> fields from TP_STRUCT__entry(); whereas the raw tracepoint type has
> access to the proto arguments. Please CMIW: Isn't what we'd be adding
> to the trace buffer is the hashed mm_id value?

Ah, right. Can't the BPF infrastructure protect against it?

> 
> >
> > Heck, if you enable function tracing, you are exposing every function
> > address it traces via the raw data output.  
> 
> Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER  in
> Android production kernels.

Does it block all the other trace events that share pointers?

Like nmi handler tracepoints, x86_fpu tracepoints, page_fault kernel
tracepoint, tasklet tracepoints, cpu hot plug tracepoints, timer
tracepoints, work queue tracepoints, ipi tracepoints, scheduling
tracepoints, locking tracepoints, rcu tracepoints, dma tracepoints,
module tracepoints, xdp tracepoints, filemap tracepoints, page map
tracepoints, vmscan tracepoints, percpu tracepoints, kmem_cache
tracepoints, mmap lock tracepoints, file lock tracepoints, and many
more! (I got tired of looking them up).

Again, are you really protecting anything if this one trace point
hashes the pointer? Most other tracepoints expose this. If BPF can
access a tracepoint entry struct, it can access the raw address and
break KASLR.

-- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 18:51                   ` Steven Rostedt
@ 2025-09-16 20:08                     ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-16 20:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
	Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
	linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest

On Tue, Sep 16, 2025 at 11:51 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 Sep 2025 11:23:03 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > > When it comes to tracing, you already lost. If it goes into the ring buffer
> > > it's a raw pointer. BPF doesn't use the output of the trace event, so you
> > > are exposing nothing from that. It uses the proto directly.
> >
> > My understanding is that the BPF tracepoint type uses the trace event
> > fields from TP_STRUCT__entry(); whereas the raw tracepoint type has
> > access to the proto arguments. Please CMIW: Isn't what we'd be adding
> > to the trace buffer is the hashed mm_id value?
>
> Ah, right. Can't the BPF infrastructure protect against it?
>
> >
> > >
> > > Heck, if you enable function tracing, you are exposing every function
> > > address it traces via the raw data output.
> >
> > Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER  in
> > Android production kernels.
>
> Does it block all the other trace events that share pointers?
>
> Like nmi handler tracepoints, x86_fpu tracepoints, page_fault kernel
> tracepoint, tasklet tracepoints, cpu hot plug tracepoints, timer
> tracepoints, work queue tracepoints, ipi tracepoints, scheduling
> tracepoints, locking tracepoints, rcu tracepoints, dma tracepoints,
> module tracepoints, xdp tracepoints, filemap tracepoints, page map
> tracepoints, vmscan tracepoints, percpu tracepoints, kmem_cache
> tracepoints, mmap lock tracepoints, file lock tracepoints, and many
> more! (I got tired of looking them up).

Hi Steve,

I see your point :)  I'll use the raw pointer and handle not exposing
it from the BPF side.

Thanks for discussing.

--Kalesh

>
> Again, are you really protecting anything if this one trace point
> hashes the pointer? Most other tracepoints expose this. If BPF can
> access a tracepoint entry struct, it can access the raw address and
> break KASLR.

Thanks,
Kalesh
>
> -- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-16 14:20     ` Jonathan Corbet
@ 2025-09-17  1:16       ` Andrew Morton
  0 siblings, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2025-09-17  1:16 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Kalesh Singh, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, stable, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest, Dave Hansen

On Tue, 16 Sep 2025 08:20:14 -0600 Jonathan Corbet <corbet@lwn.net> wrote:

> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> >> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> >
> > lol.
> >
> > x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
> > 661
> >
> > we really blew it that time!
> 
> A few years back I made a list of the most-fixed commits in the kernel
> history; unsurprisingly that one (I call it the "original sin") came out
> on top:
> 
>   https://lwn.net/Articles/914632/
> 
> Perhaps the time has come to update that analysis.
> 

Hah, thanks, that's a fun article.

It's a shame it's so recent.  I do recall back in maybe 2002 a fresh
young developer popped up with a massive rework of the IDE drivers.  I
promptly took this into my tree because I was having a hate on the IDE
drivers at the time.  omfg, what a disaster - this was the buggiest
bunch of junk conceivable and it went away after a few weeks.

I'm glad to say that the developer (email handle at the time was
"haveblue") has become better since then.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-16 10:12   ` Lorenzo Stoakes
  2025-09-16 17:47     ` Kalesh Singh
@ 2025-09-17  2:16     ` Andrew Morton
  2025-09-17  5:36       ` Lorenzo Stoakes
  1 sibling, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2025-09-17  2:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Tue, 16 Sep 2025 11:12:03 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> > Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> > feeling a need to slow things down for additional stabilization and so
> > people hopefully get additional bandwidth to digest the material we've
> > added this far.  So I think I'll just cherrypick [1/7] for now.  A
> > great flood of positive review activity would probably make me revisit
> > that ;)
> >
> 
> Kalesh - I do intend to look at this series when I have a chance. My review
> workload has been insane so it's hard to keep up at the moment.
> 
> Andrew - This cycle has been crazy, speaking from point of view of somebody
> doing a lot of review, it's been very very exhausting from this side too,
> and this kind of work can feel a little... thankless... sometimes :)

I hear you.  I'm shedding most everything now, to give us a couple of
weeks to digest.

> I feel like we maybe need a way to ask people to slow down, sometimes at
> least.

Yup, I'm sending submitters private emails explaining the situation.

Maybe they should be public emails, I find it a hard call.

> Perhaps being less accepting of patches during merge window is one aspect,
> as the merge window leading up to this cycle was almost the same review
> load as when the cycle started.

I'm having trouble understanding what you said here?

> Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
> in all this too :)
> 
> (I am spekaing at Kernel Recipes then going on a very-badly-needed 2.5 week
> vacataion afterwards over the merge window so I hope to stave off burnout
> that way. Be good if I could keep mails upon return to 3 digits, but I have
> my doubts :P)

I'd blow that in three days ;)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-17  2:16     ` Andrew Morton
@ 2025-09-17  5:36       ` Lorenzo Stoakes
  2025-09-17 23:32         ` Andrew Morton
  0 siblings, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17  5:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Tue, Sep 16, 2025 at 07:16:45PM -0700, Andrew Morton wrote:
> On Tue, 16 Sep 2025 11:12:03 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> > > Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> > > feeling a need to slow things down for additional stabilization and so
> > > people hopefully get additional bandwidth to digest the material we've
> > > added this far.  So I think I'll just cherrypick [1/7] for now.  A
> > > great flood of positive review activity would probably make me revisit
> > > that ;)
> > >
> >
> > Kalesh - I do intend to look at this series when I have a chance. My review
> > workload has been insane so it's hard to keep up at the moment.
> >
> > Andrew - This cycle has been crazy, speaking from point of view of somebody
> > doing a lot of review, it's been very very exhausting from this side too,
> > and this kind of work can feel a little... thankless... sometimes :)
>
> I hear you.  I'm shedding most everything now, to give us a couple of
> weeks to digest.

Thanks, much appreciated! :)

>
> > I feel like we maybe need a way to ask people to slow down, sometimes at
> > least.
>
> Yup, I'm sending submitters private emails explaining the situation.

And again, much appreciated :)

>
> Maybe they should be public emails, I find it a hard call.

Yeah it can be hard to get that balance right. Maybe public is better when
we're deeper in the rc and there's a general load problem?

>
> > Perhaps being less accepting of patches during merge window is one aspect,
> > as the merge window leading up to this cycle was almost the same review
> > load as when the cycle started.
>
> I'm having trouble understanding what you said here?

Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
merge window, esp. now we have mm-new.

And last merge window my review load felt similar to during a cycle, which
was kind of crazy.

So I wonder if we should be less accommodating and simply say 'sorry it's
the merge window, no submissions accepted'?

Of course I'm being a bit selfish here as I'm on holiday in the next merge
window and hope forlornly to reduce the mail I come back to :P

>
> > Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
> > in all this too :)
> >
> > (I am spekaing at Kernel Recipes then going on a very-badly-needed 2.5 week
> > vacataion afterwards over the merge window so I hope to stave off burnout
> > that way. Be good if I could keep mails upon return to 3 digits, but I have
> > my doubts :P)
>
> I'd blow that in three days ;)

Haha yeah I bet :)

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
  2025-09-15 22:36   ` Andrew Morton
  2025-09-16  9:45   ` Pedro Falcato
@ 2025-09-17  7:44   ` SeongJae Park
  2025-09-17 10:52   ` David Hildenbrand
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 68+ messages in thread
From: SeongJae Park @ 2025-09-17  7:44 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: SeongJae Park, akpm, minchan, lorenzo.stoakes, david,
	Liam.Howlett, rppt, pfalcato, kernel-team, android-mm, stable,
	Alexander Viro, Christian Brauner, Jan Kara, Kees Cook,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:

> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
> 
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
> 
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
> 
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

Acked-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
                     ` (2 preceding siblings ...)
  2025-09-17  7:44   ` SeongJae Park
@ 2025-09-17 10:52   ` David Hildenbrand
  2025-09-18 11:31   ` Pedro Falcato
  2025-09-18 13:53   ` Lorenzo Stoakes
  5 siblings, 0 replies; 68+ messages in thread
From: David Hildenbrand @ 2025-09-17 10:52 UTC (permalink / raw)
  To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, stable, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On 15.09.25 18:36, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
> 
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
> 
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
> 
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
  2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
@ 2025-09-17 10:58   ` David Hildenbrand
  2025-09-17 16:49     ` Kalesh Singh
  2025-09-18 14:42   ` Lorenzo Stoakes
  1 sibling, 1 reply; 68+ messages in thread
From: David Hildenbrand @ 2025-09-17 10:58 UTC (permalink / raw)
  To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

> + * test_suite_setup - Set up the VMA layout for VMA count testing.
> + *
> + * Sets up the following VMA layout:
> + *
> + * +----- base_addr
> + * |
> + * V
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> + * |  Guard Page  |                      |  Guard Page  |  Extra Map 1   | Unmapped Gap |  Extra Map 2   | Unmapped Gap | ... |  Extra Map N   | Unmapped Gap |
> + * |  (unmapped)  |      TEST_AREA       |  (unmapped)  | (mapped page)  |  (1 page)    | (mapped page)  |  (1 page)    | ... | (mapped page)  |  (1 page)    |
> + * |   (1 page)   | (unmapped, 3 pages)  |   (1 page)   |    (1 page)    |              |    (1 page)    |              |     |    (1 page)    |              |
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> + * ^              ^                      ^              ^                                                                  ^
> + * |              |                      |              |                                                                  |
> + * +--GUARD_SIZE--+                      |              +-- EXTRA_MAPS points here             Sufficient EXTRA_MAPS to ---+
> + *    (PAGE_SIZE) |                      |                                                         reach MAX_VMA_COUNT
> + *                |                      |
> + *                +--- TEST_AREA_SIZE ---+
> + *                |   (3 * PAGE_SIZE)    |
> + *                ^
> + *                |
> + *                +-- TEST_AREA starts here
> + *
>

Just wondering if we could find a different name than "guard page" here, 
to not confuse stuff with guard ptes

Will the current "guard page" we a valid vma or just a hole?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
  2025-09-16 18:33   ` kernel test robot
@ 2025-09-17 13:38   ` David Hildenbrand
  2025-09-17 17:10     ` Kalesh Singh
  2025-09-18 13:20   ` Lorenzo Stoakes
  2025-09-18 14:31   ` Lorenzo Stoakes
  3 siblings, 1 reply; 68+ messages in thread
From: David Hildenbrand @ 2025-09-17 13:38 UTC (permalink / raw)
  To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On 15.09.25 18:36, Kalesh Singh wrote:
> The checks against sysctl_max_map_count are open-coded in multiple
> places. While simple checks are manageable, the logic in places like
> mremap.c involves arithmetic with magic numbers that can be difficult
> to reason about. e.g. ... >= sysctl_max_map_count - 3
> 
> To improve readability and centralize the logic, introduce a new helper,
> vma_count_remaining(). This function returns the VMA count headroom
> available for a givine process.

s/givine/given/

s/process/mm/

> 
> The most common case of checking for a single new VMA can be done with
> the convenience helper has_vma_count_remaining():
> 
>      if (!vma_count_remaining(mm))
> 
> And the complex checks in mremap.c become clearer by expressing the
> required capacity directly:
> 
>      if (vma_count_remaining(mm) <  4)
> 
> While a capacity-based function could be misused (e.g., with an
> incorrect '<' vs '<=' comparison), the improved readability at the call
> sites makes such errors less likely than with the previous open-coded
> arithmetic.
> 
> As part of this change, sysctl_max_map_count is made static to
> mm/mmap.c to improve encapsulation.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---

[...]

>   	/*
> @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
>   int sysctl_legacy_va_layout;
>   #endif
>   
> +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> +
> +/**
> + * vma_count_remaining - Determine available VMA slots
> + * @mm: The memory descriptor for the process.
> + *
> + * Check how many more VMAs can be created for the given @mm
> + * before hitting the sysctl_max_map_count limit.
> + *
> + * Return: The number of new VMAs the process can accommodate.
> + */
> +int vma_count_remaining(const struct mm_struct *mm)
> +{
> +	const int map_count = mm->map_count;
> +	const int max_count = sysctl_max_map_count;

If we worry about rare races (sysctl_max_map_count changing?) we should 
probably force a single read through READ_ONCE()?

Otherwise one might trick vma_count_remaining() into returning a 
negative number I assume.

> +
> +	return (max_count > map_count) ? (max_count - map_count) : 0;
> +}

Nothing else jumped at me.

Not sure what the buildbot complains about but I'm sure you'll figure it 
out :)

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
  2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
@ 2025-09-17 13:41   ` David Hildenbrand
  2025-09-18 11:46   ` Pedro Falcato
  2025-09-18 14:48   ` Lorenzo Stoakes
  2 siblings, 0 replies; 68+ messages in thread
From: David Hildenbrand @ 2025-09-17 13:41 UTC (permalink / raw)
  To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On 15.09.25 18:36, Kalesh Singh wrote:
> A mechanical rename of the mm_struct->map_count field to
> vma_count; no functional change is intended.
> 
> The name "map_count" is ambiguous within the memory management subsystem,
> as it can be confused with the folio/page->_mapcount field, which tracks
> PTE references.
> 
> The new name, vma_count, is more precise as this field has always
> counted the number of vm_area_structs associated with an mm_struct.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---


[...]

> +++ b/mm/mmap.c
> @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm)
>   		vma = vma_next(&vmi);
>   	} while (vma && likely(!xa_is_zero(vma)));
>   
> -	BUG_ON(count != mm->map_count);
> +	BUG_ON(count != mm->vma_count);
>   

While at it, best to change that to a WARN_ON_ONCE() or even a 
VM_WARN_ON_ONCE().

[ or remove it -- have we ever seen that firing? ]

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
  2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
@ 2025-09-17 13:44   ` David Hildenbrand
  2025-09-17 17:22     ` Kalesh Singh
  2025-09-18 11:48     ` Pedro Falcato
  2025-09-18 13:30   ` Lorenzo Stoakes
  1 sibling, 2 replies; 68+ messages in thread
From: David Hildenbrand @ 2025-09-17 13:44 UTC (permalink / raw)
  To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
	pfalcato
  Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On 15.09.25 18:36, Kalesh Singh wrote:
> Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
> cases where the VMA count exceeds the sysctl_max_map_count limit.
> 
> This check will help catch future bugs or regressions where
> the VMAs are allocated exceeding the limit.
> 
> The warning is placed in the main vma_count_*() helpers, while the
> internal *_nocheck variants bypass it. _nocheck helpers are used to
> ensure that the assertion does not trigger a false positive in
> the legitimate case of a temporary VMA increase past the limit
> by a VMA split in munmap().
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
> 
> Changes in v2:
>    - Add assertions if exceeding max_vma_count limit, per Pedro
> 
>   include/linux/mm.h               | 12 ++++++--
>   mm/internal.h                    |  1 -
>   mm/vma.c                         | 49 +++++++++++++++++++++++++-------
>   tools/testing/vma/vma_internal.h |  7 ++++-
>   4 files changed, 55 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bad1454984c..3a3749d7015c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
>   
>   void snapshot_page(struct page_snapshot *ps, const struct page *page);
>   
> +int vma_count_remaining(const struct mm_struct *mm);
> +
>   static inline void vma_count_init(struct mm_struct *mm)
>   {
>   	ACCESS_PRIVATE(mm, __vma_count) = 0;
>   }
>   
> -static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
>   {
>   	ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
>   }
>   
> +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +{
> +	VM_WARN_ON_ONCE(!vma_count_remaining(mm));

Can't that fire when changing the max count from user space at just the 
wrong time?

I assume we'll have to tolerated that and might just want to drop this 
patch from the series.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
  2025-09-17 10:58   ` David Hildenbrand
@ 2025-09-17 16:49     ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-17 16:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Wed, Sep 17, 2025 at 3:58 AM 'David Hildenbrand' via kernel-team
<kernel-team@android.com> wrote:
>
> > + * test_suite_setup - Set up the VMA layout for VMA count testing.
> > + *
> > + * Sets up the following VMA layout:
> > + *
> > + * +----- base_addr
> > + * |
> > + * V
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> > + * |  Guard Page  |                      |  Guard Page  |  Extra Map 1   | Unmapped Gap |  Extra Map 2   | Unmapped Gap | ... |  Extra Map N   | Unmapped Gap |
> > + * |  (unmapped)  |      TEST_AREA       |  (unmapped)  | (mapped page)  |  (1 page)    | (mapped page)  |  (1 page)    | ... | (mapped page)  |  (1 page)    |
> > + * |   (1 page)   | (unmapped, 3 pages)  |   (1 page)   |    (1 page)    |              |    (1 page)    |              |     |    (1 page)    |              |
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> > + * ^              ^                      ^              ^                                                                  ^
> > + * |              |                      |              |                                                                  |
> > + * +--GUARD_SIZE--+                      |              +-- EXTRA_MAPS points here             Sufficient EXTRA_MAPS to ---+
> > + *    (PAGE_SIZE) |                      |                                                         reach MAX_VMA_COUNT
> > + *                |                      |
> > + *                +--- TEST_AREA_SIZE ---+
> > + *                |   (3 * PAGE_SIZE)    |
> > + *                ^
> > + *                |
> > + *                +-- TEST_AREA starts here
> > + *
> >
>

Hi David,

Thanks for the reviews.

> Just wondering if we could find a different name than "guard page" here,
> to not confuse stuff with guard ptes
>
> Will the current "guard page" we a valid vma or just a hole?

It's a hole to prevent coincidental merging of adjacent VMAs. I can
rename it to HOLE in the next revision.

Thanks,
Kalesh

>
> --
> Cheers
>
> David / dhildenb
>
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-17 13:38   ` David Hildenbrand
@ 2025-09-17 17:10     ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-17 17:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Wed, Sep 17, 2025 at 6:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.09.25 18:36, Kalesh Singh wrote:
> > The checks against sysctl_max_map_count are open-coded in multiple
> > places. While simple checks are manageable, the logic in places like
> > mremap.c involves arithmetic with magic numbers that can be difficult
> > to reason about. e.g. ... >= sysctl_max_map_count - 3
> >
> > To improve readability and centralize the logic, introduce a new helper,
> > vma_count_remaining(). This function returns the VMA count headroom
> > available for a givine process.
>
> s/givine/given/
>
> s/process/mm/
>
> >
> > The most common case of checking for a single new VMA can be done with
> > the convenience helper has_vma_count_remaining():
> >
> >      if (!vma_count_remaining(mm))
> >
> > And the complex checks in mremap.c become clearer by expressing the
> > required capacity directly:
> >
> >      if (vma_count_remaining(mm) <  4)
> >
> > While a capacity-based function could be misused (e.g., with an
> > incorrect '<' vs '<=' comparison), the improved readability at the call
> > sites makes such errors less likely than with the previous open-coded
> > arithmetic.
> >
> > As part of this change, sysctl_max_map_count is made static to
> > mm/mmap.c to improve encapsulation.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
>
> [...]
>
> >       /*
> > @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
> >   int sysctl_legacy_va_layout;
> >   #endif
> >
> > +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> > +
> > +/**
> > + * vma_count_remaining - Determine available VMA slots
> > + * @mm: The memory descriptor for the process.
> > + *
> > + * Check how many more VMAs can be created for the given @mm
> > + * before hitting the sysctl_max_map_count limit.
> > + *
> > + * Return: The number of new VMAs the process can accommodate.
> > + */
> > +int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > +     const int map_count = mm->map_count;
> > +     const int max_count = sysctl_max_map_count;
>
> If we worry about rare races (sysctl_max_map_count changing?) we should
> probably force a single read through READ_ONCE()?
>
> Otherwise one might trick vma_count_remaining() into returning a
> negative number I assume.

Agreed, I'll add the READ_ONCE when resending.

Thanks,
Kalesh

>
> > +
> > +     return (max_count > map_count) ? (max_count - map_count) : 0;
> > +}
>
> Nothing else jumped at me.
>
> Not sure what the buildbot complains about but I'm sure you'll figure it
> out :)
>
> --
> Cheers
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
  2025-09-17 13:44   ` David Hildenbrand
@ 2025-09-17 17:22     ` Kalesh Singh
  2025-09-17 18:34       ` David Hildenbrand
  2025-09-18 11:48     ` Pedro Falcato
  1 sibling, 1 reply; 68+ messages in thread
From: Kalesh Singh @ 2025-09-17 17:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Wed, Sep 17, 2025 at 6:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.09.25 18:36, Kalesh Singh wrote:
> > Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
> > cases where the VMA count exceeds the sysctl_max_map_count limit.
> >
> > This check will help catch future bugs or regressions where
> > the VMAs are allocated exceeding the limit.
> >
> > The warning is placed in the main vma_count_*() helpers, while the
> > internal *_nocheck variants bypass it. _nocheck helpers are used to
> > ensure that the assertion does not trigger a false positive in
> > the legitimate case of a temporary VMA increase past the limit
> > by a VMA split in munmap().
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
> >
> > Changes in v2:
> >    - Add assertions if exceeding max_vma_count limit, per Pedro
> >
> >   include/linux/mm.h               | 12 ++++++--
> >   mm/internal.h                    |  1 -
> >   mm/vma.c                         | 49 +++++++++++++++++++++++++-------
> >   tools/testing/vma/vma_internal.h |  7 ++++-
> >   4 files changed, 55 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8bad1454984c..3a3749d7015c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
> >
> >   void snapshot_page(struct page_snapshot *ps, const struct page *page);
> >
> > +int vma_count_remaining(const struct mm_struct *mm);
> > +
> >   static inline void vma_count_init(struct mm_struct *mm)
> >   {
> >       ACCESS_PRIVATE(mm, __vma_count) = 0;
> >   }
> >
> > -static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
> >   {
> >       ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> >   }
> >
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > +     VM_WARN_ON_ONCE(!vma_count_remaining(mm));
>
> Can't that fire when changing the max count from user space at just the
> wrong time?

You are right: technically it's possible if it was raised between the
time of checking and when the new VMA is added.

>
> I assume we'll have to tolerated that and might just want to drop this
> patch from the series.
>

It is compiled out in !CONFIG_VM_DEBUG builds, would we still want to drop it?

Thanks,
Kalesh

> --
> Cheers
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
  2025-09-17 17:22     ` Kalesh Singh
@ 2025-09-17 18:34       ` David Hildenbrand
  2025-09-17 20:31         ` Kalesh Singh
  0 siblings, 1 reply; 68+ messages in thread
From: David Hildenbrand @ 2025-09-17 18:34 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

>> Can't that fire when changing the max count from user space at just the
>> wrong time?
> 
> You are right: technically it's possible if it was raised between the
> time of checking and when the new VMA is added.
> 
>>
>> I assume we'll have to tolerated that and might just want to drop this
>> patch from the series.
>>
> 
> It is compiled out in !CONFIG_VM_DEBUG builds, would we still want to drop it?

Due to the racy nature I think any kinds of assertions would be wrong. 
Without any such races possible I would agree that the checks would be 
nice to have.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
  2025-09-17 18:34       ` David Hildenbrand
@ 2025-09-17 20:31         ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-17 20:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Wed, Sep 17, 2025 at 11:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> Can't that fire when changing the max count from user space at just the
> >> wrong time?
> >
> > You are right: technically it's possible if it was raised between the
> > time of checking and when the new VMA is added.
> >
> >>
> >> I assume we'll have to tolerated that and might just want to drop this
> >> patch from the series.
> >>
> >
> > It is compiled out in !CONFIG_VM_DEBUG builds, would we still want to drop it?
>
> Due to the racy nature I think any kinds of assertions would be wrong.
> Without any such races possible I would agree that the checks would be
> nice to have.

Alright I'll drop this in the next revision.

Thanks,
Kalesh

>
> --
> Cheers
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-17  5:36       ` Lorenzo Stoakes
@ 2025-09-17 23:32         ` Andrew Morton
  2025-09-18 10:29           ` Lorenzo Stoakes
  0 siblings, 1 reply; 68+ messages in thread
From: Andrew Morton @ 2025-09-17 23:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> >
> > > Perhaps being less accepting of patches during merge window is one aspect,
> > > as the merge window leading up to this cycle was almost the same review
> > > load as when the cycle started.
> >
> > I'm having trouble understanding what you said here?
> 
> Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
> merge window, esp. now we have mm-new.
> 
> And last merge window my review load felt similar to during a cycle, which
> was kind of crazy.
> 
> So I wonder if we should be less accommodating and simply say 'sorry it's
> the merge window, no submissions accepted'?

hm, I always have a lot of emails piled up by the time mm-stable gets
merged upstream.  That ~1 week between "we merged" and "-rc1" is a nice
time to go through that material and add it to mm-new.  I think it
smooths things out.  I mean, this is peak time for people to be
considering the new material?

(ot, that backlog is always >400 emails and a lot of that gets tossed
out anyway - either it's just too old so I request a refresh or there
was a new version, or review was unpromising, etc).


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-17 23:32         ` Andrew Morton
@ 2025-09-18 10:29           ` Lorenzo Stoakes
  2025-09-18 12:07             ` David Hildenbrand
  0 siblings, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 10:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
> On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > >
> > > > Perhaps being less accepting of patches during merge window is one aspect,
> > > > as the merge window leading up to this cycle was almost the same review
> > > > load as when the cycle started.
> > >
> > > I'm having trouble understanding what you said here?
> >
> > Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
> > merge window, esp. now we have mm-new.
> >
> > And last merge window my review load felt similar to during a cycle, which
> > was kind of crazy.
> >
> > So I wonder if we should be less accommodating and simply say 'sorry it's
> > the merge window, no submissions accepted'?
>
> hm, I always have a lot of emails piled up by the time mm-stable gets
> merged upstream.  That ~1 week between "we merged" and "-rc1" is a nice
> time to go through that material and add it to mm-new.  I think it
> smooths things out.  I mean, this is peak time for people to be
> considering the new material?

I'm confused, why is the merge window a good time to consider new material?

People have the entirety of the cycle to submit new material, and they do
so.

And equally, people are sending new revisions of old code, engaging in
discussion from old series etc. during the merge window also.

What happens is you essentially have reviewers work 9 weeks instead of 7
for a cycle without much of a let up (+ so no break from it), based on
workload from the past cycle/merge window.

I mean I can only ask that perhaps we consider not doing this in mm (I
gather many other subsystems equally have a kinda 'freeze' during this
time).

>
> (ot, that backlog is always >400 emails and a lot of that gets tossed
> out anyway - either it's just too old so I request a refresh or there
> was a new version, or review was unpromising, etc).
>

Right, which actually makes everything a lot more uncertain from a
reviewer's point of view, as we don't definitely have a solid git base,
mm-new isn't synced with Linus's tree very much during this time etc.

Which makes the merge window actually even worse for this stuff.

From the point of view of avoiding burn out, it'd be good to manage
expectations a bit on this.

Personally I remember literally saying to David during the last merge
window 'well I hoped I could go off and work on my own stuff instead of
just review, guess not then' :)

THP is a particularly busy area at the moment which is part of all this.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
                     ` (3 preceding siblings ...)
  2025-09-17 10:52   ` David Hildenbrand
@ 2025-09-18 11:31   ` Pedro Falcato
  2025-09-18 13:53   ` Lorenzo Stoakes
  5 siblings, 0 replies; 68+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:31 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	kernel-team, android-mm, stable, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
> 
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
> 
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
> 
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>


Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-16 15:52       ` Steven Rostedt
  2025-09-16 17:36         ` Kalesh Singh
@ 2025-09-18 11:38         ` Pedro Falcato
  2025-09-18 14:48           ` Steven Rostedt
  1 sibling, 1 reply; 68+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Kalesh Singh, akpm, minchan, lorenzo.stoakes, david, Liam.Howlett,
	rppt, kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Tue, Sep 16, 2025 at 11:52:20AM -0400, Steven Rostedt wrote:
> On Mon, 15 Sep 2025 18:19:53 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
> 
> 
> > Hi Steve,
> > 
> > Thanks for the comments and suggestion you are right we can use bpf to
> > get the comm. There is nothing special about this trace event.  I will
> > drop comm in the next revision.
> > 
> > The reason I did the task_struct parameter (current): I believe there
> > is a limitation that we must  specify at least 1 parameter to the
> > TRACE_EVENT()  PROTO and ARGS macros.
> 
> OK, then this is another issue. We don't want tracepoint "markers".
> Each tracepoint can take up to 5K in memory due to the code it
> generates and the meta data to control it.
> 
> For something like that, we highly recommend dynamic probes (fprobes,
> kprobes, etc).
> 
> The only purpose of a static tracepoint is to get data within a
> function that is too difficult to get via a probe. It should never be
> used as a trigger where its purpose is "we hit this path".
>

Isn't the usual problem with that approach, that of static functions/static
inline functions? I was tracing through a problem a few months ago, and man
I really did think "wouldn't it be nice to have a tracepoint instead of fishing
around for kprobe spots".

Not that I particularly think a tracepoint is super worth it in this case, but,
y'know.

-- 
Pedro

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
  2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
  2025-09-17 13:41   ` David Hildenbrand
@ 2025-09-18 11:46   ` Pedro Falcato
  2025-09-18 14:48   ` Lorenzo Stoakes
  2 siblings, 0 replies; 68+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:46 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Mon, Sep 15, 2025 at 09:36:35AM -0700, Kalesh Singh wrote:
> A mechanical rename of the mm_struct->map_count field to
> vma_count; no functional change is intended.
> 
> The name "map_count" is ambiguous within the memory management subsystem,
> as it can be confused with the folio/page->_mapcount field, which tracks
> PTE references.
> 
> The new name, vma_count, is more precise as this field has always
> counted the number of vm_area_structs associated with an mm_struct.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
  2025-09-17 13:44   ` David Hildenbrand
  2025-09-17 17:22     ` Kalesh Singh
@ 2025-09-18 11:48     ` Pedro Falcato
  1 sibling, 0 replies; 68+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
	kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
	Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
	linux-trace-kernel, linux-kselftest

On Wed, Sep 17, 2025 at 03:44:27PM +0200, David Hildenbrand wrote:
> On 15.09.25 18:36, Kalesh Singh wrote:
> > Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
> > cases where the VMA count exceeds the sysctl_max_map_count limit.
> > 
> > This check will help catch future bugs or regressions where
> > the VMAs are allocated exceeding the limit.
> > 
> > The warning is placed in the main vma_count_*() helpers, while the
> > internal *_nocheck variants bypass it. _nocheck helpers are used to
> > ensure that the assertion does not trigger a false positive in
> > the legitimate case of a temporary VMA increase past the limit
> > by a VMA split in munmap().
> > 
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
> > 
> > Changes in v2:
> >    - Add assertions if exceeding max_vma_count limit, per Pedro
> > 
> >   include/linux/mm.h               | 12 ++++++--
> >   mm/internal.h                    |  1 -
> >   mm/vma.c                         | 49 +++++++++++++++++++++++++-------
> >   tools/testing/vma/vma_internal.h |  7 ++++-
> >   4 files changed, 55 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8bad1454984c..3a3749d7015c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
> >   void snapshot_page(struct page_snapshot *ps, const struct page *page);
> > +int vma_count_remaining(const struct mm_struct *mm);
> > +
> >   static inline void vma_count_init(struct mm_struct *mm)
> >   {
> >   	ACCESS_PRIVATE(mm, __vma_count) = 0;
> >   }
> > -static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
> >   {
> >   	ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> >   }
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > +	VM_WARN_ON_ONCE(!vma_count_remaining(mm));
> 
> Can't that fire when changing the max count from user space at just the
> wrong time?
> 
> I assume we'll have to tolerated that and might just want to drop this patch
> from the series.

Ah yes, of course, userspace can dynamically change it. Good catch. I guess
we'll need to kill the assertion idea then.

-- 
Pedro

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-18 10:29           ` Lorenzo Stoakes
@ 2025-09-18 12:07             ` David Hildenbrand
  2025-09-18 12:49               ` Lorenzo Stoakes
  0 siblings, 1 reply; 68+ messages in thread
From: David Hildenbrand @ 2025-09-18 12:07 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Kalesh Singh, minchan, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On 18.09.25 12:29, Lorenzo Stoakes wrote:
> On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
>> On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>>
>>>>
>>>>> Perhaps being less accepting of patches during merge window is one aspect,
>>>>> as the merge window leading up to this cycle was almost the same review
>>>>> load as when the cycle started.
>>>>
>>>> I'm having trouble understanding what you said here?
>>>
>>> Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
>>> merge window, esp. now we have mm-new.
>>>
>>> And last merge window my review load felt similar to during a cycle, which
>>> was kind of crazy.
>>>
>>> So I wonder if we should be less accommodating and simply say 'sorry it's
>>> the merge window, no submissions accepted'?
>>
>> hm, I always have a lot of emails piled up by the time mm-stable gets
>> merged upstream.  That ~1 week between "we merged" and "-rc1" is a nice
>> time to go through that material and add it to mm-new.  I think it
>> smooths things out.  I mean, this is peak time for people to be
>> considering the new material?
> 
> I'm confused, why is the merge window a good time to consider new material?
> 
> People have the entirety of the cycle to submit new material, and they do
> so.

My view is that if you are sending a cleanup/feature during the merge 
window you cannot expect a fast reply, and you should not keep sending 
new versions in that timeframe expecting that all people you CCed that 
should have a look actually did have a look.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-18 12:07             ` David Hildenbrand
@ 2025-09-18 12:49               ` Lorenzo Stoakes
  2025-09-18 20:59                 ` Andrew Morton
  0 siblings, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 12:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Kalesh Singh, minchan, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, Sep 18, 2025 at 02:07:09PM +0200, David Hildenbrand wrote:
> On 18.09.25 12:29, Lorenzo Stoakes wrote:
> > On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
> > > On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > >
> > > > > > Perhaps being less accepting of patches during merge window is one aspect,
> > > > > > as the merge window leading up to this cycle was almost the same review
> > > > > > load as when the cycle started.
> > > > >
> > > > > I'm having trouble understanding what you said here?
> > > >
> > > > Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
> > > > merge window, esp. now we have mm-new.
> > > >
> > > > And last merge window my review load felt similar to during a cycle, which
> > > > was kind of crazy.
> > > >
> > > > So I wonder if we should be less accommodating and simply say 'sorry it's
> > > > the merge window, no submissions accepted'?
> > >
> > > hm, I always have a lot of emails piled up by the time mm-stable gets
> > > merged upstream.  That ~1 week between "we merged" and "-rc1" is a nice
> > > time to go through that material and add it to mm-new.  I think it
> > > smooths things out.  I mean, this is peak time for people to be
> > > considering the new material?
> >
> > I'm confused, why is the merge window a good time to consider new material?
> >
> > People have the entirety of the cycle to submit new material, and they do
> > so.
>
> My view is that if you are sending a cleanup/feature during the merge window
> you cannot expect a fast reply, and you should not keep sending new versions
> in that timeframe expecting that all people you CCed that should have a look
> actually did have a look.

Yes exactly.

The problem is all the conversations and respins and etc. _do_ carry on as
normal, and often land in mm-new, queued up for mm-unstable etc. unless we
happen to notice them.

So it makes it impossible for us to just ignore until the next cycle (or need to
go through every thread that happened afterwards).

And people know that, so just keep on submitting as normal. That was _really_
palpable last merge window.

I mean I'm cheating by going on vacation for this merge window ;) but obviously
means I'll have 2 weeks of review to check when I get back + 1st week of cycle
to go too.

I think in some subsystems new series/respins are actively unwelcome during the
merge window. I wonder if we should be the same?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
  2025-09-16 18:33   ` kernel test robot
  2025-09-17 13:38   ` David Hildenbrand
@ 2025-09-18 13:20   ` Lorenzo Stoakes
  2025-09-18 13:26     ` Lorenzo Stoakes
  2025-09-18 14:31   ` Lorenzo Stoakes
  3 siblings, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:20 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
>  include/linux/mm.h               |  2 --
>  mm/internal.h                    |  2 ++
>  mm/mmap.c                        | 21 ++++++++++++++++++++-
>  mm/mremap.c                      |  7 ++++---
>  mm/nommu.c                       |  2 +-
>  mm/util.c                        |  1 -
>  mm/vma.c                         | 10 +++++-----
>  tools/testing/vma/vma_internal.h |  9 +++++++++

Will look into this properly later, but there's a conflict in vma_internal.h
atm for mm-new FYI.

> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 3639aa8dd2b0..52cd7ddc73f4 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
>  	return vm_flags;
>  }
>
> +/* Helper to get VMA count capacity */
> +static int vma_count_remaining(const struct mm_struct *mm)
> +{
> +	const int map_count = mm->map_count;
> +	const int max_count = sysctl_max_map_count;
> +
> +	return (max_count > map_count) ? (max_count - map_count) : 0;
> +}
> +
>  #endif	/* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>

Probably because stuff got added at the end... :)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-18 13:20   ` Lorenzo Stoakes
@ 2025-09-18 13:26     ` Lorenzo Stoakes
  0 siblings, 0 replies; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:26 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, Sep 18, 2025 at 02:20:03PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
> >  include/linux/mm.h               |  2 --
> >  mm/internal.h                    |  2 ++
> >  mm/mmap.c                        | 21 ++++++++++++++++++++-
> >  mm/mremap.c                      |  7 ++++---
> >  mm/nommu.c                       |  2 +-
> >  mm/util.c                        |  1 -
> >  mm/vma.c                         | 10 +++++-----
> >  tools/testing/vma/vma_internal.h |  9 +++++++++
>
> Will look into this properly later, but there's a conflict in vma_internal.h
> atm for mm-new FYI.
>
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 3639aa8dd2b0..52cd7ddc73f4 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> >  	return vm_flags;
> >  }
> >
> > +/* Helper to get VMA count capacity */
> > +static int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > +	const int map_count = mm->map_count;
> > +	const int max_count = sysctl_max_map_count;
> > +
> > +	return (max_count > map_count) ? (max_count - map_count) : 0;
> > +}
> > +
> >  #endif	/* __MM_VMA_INTERNAL_H */
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >
>
> Probably because stuff got added at the end... :)

Oh mm/internal.h too, seems that's my mmap_prepare series. FYI

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
  2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
  2025-09-17 13:44   ` David Hildenbrand
@ 2025-09-18 13:30   ` Lorenzo Stoakes
  1 sibling, 0 replies; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:30 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

I see that the review is generally to drop this one so it's moot, but this is
also not applying:

Patch failed at 0006 mm: add assertion for VMA count limit
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"

When trying to b4 shazam in mm-new.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
  2025-09-15 23:41   ` Steven Rostedt
@ 2025-09-18 13:42   ` Lorenzo Stoakes
  2025-09-18 13:51     ` Lorenzo Stoakes
  1 sibling, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:42 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
> Needed observability on in field devices can be collected with minimal
> overhead and can be toggled on and off. Event driven telemetry can be
> done with tracepoint BPF programs.
>
> The process comm is provided for aggregation across devices and tgid is
> to enable per-process aggregation per device.
>
> This allows for observing the distribution of such problems in the
> field, to deduce if there are legitimate bugs or if a bump to the limit is
> warranted.

It's not really a bug though is it? It's somebody running out of resources.

I'm not sure how useful this is really. But I'm open to being convinced!

I also wonder if this is better as a statistic? You'd figure out it was a
problem that way too right?

>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

This breaks the VMA tests, please make sure to always check them:

cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined    -c -o vma.o vma.c
In file included from vma.c:33:
../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory
   10 | #include <trace/events/vma.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [<builtin>: vma.o] Error 1

See below though, you've included this in the wrong place (I don't blame you,
perhaps we've not made this _very_ clear :)

> ---
>
> Chnages in v2:
>   - Add needed observability for operations failing due to the vma count limit,
>       per Minchan
>     (Since there isn't a common point for debug logging due checks being
>     external to the capacity based vma_count_remaining() helper. I used a
>     trace event for low overhead and to facilitate event driven telemetry
>     for in field devices)
>
>  include/trace/events/vma.h | 32 ++++++++++++++++++++++++++++++++

Do we really need a new file? We already have VMA-related tracepoints no?

Also if you add a new file you _have_ to update MAINTAINERS.

>  mm/mmap.c                  |  5 ++++-
>  mm/mremap.c                | 10 ++++++++--
>  mm/vma.c                   | 11 +++++++++--
>  4 files changed, 53 insertions(+), 5 deletions(-)
>  create mode 100644 include/trace/events/vma.h
>
> diff --git a/include/trace/events/vma.h b/include/trace/events/vma.h
> new file mode 100644
> index 000000000000..2fed63b0d0a6
> --- /dev/null
> +++ b/include/trace/events/vma.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM vma
> +
> +#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_VMA_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(max_vma_count_exceeded,
> +
> +	TP_PROTO(struct task_struct *task),
> +
> +	TP_ARGS(task),
> +
> +	TP_STRUCT__entry(
> +		__string(comm,	task->comm)
> +		__field(pid_t,	tgid)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(comm);
> +		__entry->tgid = task->tgid;
> +	),
> +
> +	TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
> +);
> +
> +#endif /*  _TRACE_VMA_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 30ddd550197e..0bb311bf48f3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -56,6 +56,7 @@
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/mmap.h>
> +#include <trace/events/vma.h>
>
>  #include "internal.h"
>
> @@ -374,8 +375,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		return -EOVERFLOW;
>
>  	/* Too many mappings? */
> -	if (!vma_count_remaining(mm))
> +	if (!vma_count_remaining(mm)) {
> +		trace_max_vma_count_exceeded(current);
>  		return -ENOMEM;
> +	}
>
>  	/*
>  	 * addr is returned from get_unmapped_area,
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 14d35d87e89b..f42ac05f0069 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -30,6 +30,8 @@
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
>
> +#include <trace/events/vma.h>
> +
>  #include "internal.h"
>
>  /* Classify the kind of remap operation being performed. */
> @@ -1040,8 +1042,10 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
>  	 * We'd prefer to avoid failure later on in do_munmap:
>  	 * which may split one vma into three before unmapping.
>  	 */
> -	if (vma_count_remaining(current->mm) < 4)
> +	if (vma_count_remaining(current->mm) < 4) {
> +		trace_max_vma_count_exceeded(current);

But we didn't exceed it, we're guessing we might?

>  		return -ENOMEM;
> +	}
>
>  	if (vma->vm_ops && vma->vm_ops->may_split) {
>  		if (vma->vm_start != old_addr)
> @@ -1817,8 +1821,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
>  	 * the threshold. In other words, is the current map count + 6 at or
>  	 * below the threshold? Otherwise return -ENOMEM here to be more safe.
>  	 */
> -	if (vma_count_remaining(current->mm) < 6)
> +	if (vma_count_remaining(current->mm) < 6) {
> +		trace_max_vma_count_exceeded(current);

Similar point here.

>  		return -ENOMEM;
> +	}
>
>  	return 0;
>  }
> diff --git a/mm/vma.c b/mm/vma.c
> index 0e4fcaebe209..692c33c3e84d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -7,6 +7,8 @@
>  #include "vma_internal.h"
>  #include "vma.h"
>
> +#include <trace/events/vma.h>

Nope you don't do this :)

vma.c is designed to be userland testable and _intentionally_ puts all its
includes in vma_internal.h.

So you'd need to add this include over there, and then update the vma tests
vma_internal.h to stub out the trace function.

> +
>  struct mmap_state {
>  	struct mm_struct *mm;
>  	struct vma_iterator *vmi;
> @@ -621,8 +623,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		     unsigned long addr, int new_below)
>  {
> -	if (!vma_count_remaining(vma->vm_mm))
> +	if (!vma_count_remaining(vma->vm_mm)) {
> +		trace_max_vma_count_exceeded(current);
>  		return -ENOMEM;
> +	}
>
>  	return __split_vma(vmi, vma, addr, new_below);
>  }
> @@ -1375,6 +1379,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  		 */
>  		if (vms->end < vms->vma->vm_end &&
>  		    !vma_count_remaining(vms->vma->vm_mm)) {
> +			trace_max_vma_count_exceeded(current);
>  			error = -ENOMEM;
>  			goto vma_count_exceeded;
>  		}
> @@ -2801,8 +2806,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>
> -	if (!vma_count_remaining(mm))
> +	if (!vma_count_remaining(mm)) {
> +		trace_max_vma_count_exceeded(current);
>  		return -ENOMEM;
> +	}
>
>  	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
>  		return -ENOMEM;
> --
> 2.51.0.384.g4c02a37b29-goog
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-18 13:42   ` Lorenzo Stoakes
@ 2025-09-18 13:51     ` Lorenzo Stoakes
  2025-09-18 15:55       ` Kalesh Singh
  0 siblings, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:51 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest


On Thu, Sep 18, 2025 at 02:42:16PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
> > Needed observability on in field devices can be collected with minimal
> > overhead and can be toggled on and off. Event driven telemetry can be
> > done with tracepoint BPF programs.
> >
> > The process comm is provided for aggregation across devices and tgid is
> > to enable per-process aggregation per device.
> >
> > This allows for observing the distribution of such problems in the
> > field, to deduce if there are legitimate bugs or if a bump to the limit is
> > warranted.
>
> It's not really a bug though is it? It's somebody running out of resources.
>
> I'm not sure how useful this is really. But I'm open to being convinced!
>
> I also wonder if this is better as a statistic? You'd figure out it was a
> problem that way too right?
>
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
>
> This breaks the VMA tests, please make sure to always check them:
>
> cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined    -c -o vma.o vma.c
> In file included from vma.c:33:
> ../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory
>    10 | #include <trace/events/vma.h>
>       |          ^~~~~~~~~~~~~~~~~~~~
> compilation terminated.
> make: *** [<builtin>: vma.o] Error 1

Trivial build fix:

----8<----
From fe4c30abbd302ccc628ec92381ac10cea31c6d85 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Thu, 18 Sep 2025 14:47:10 +0100
Subject: [PATCH] fix

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c                         | 2 --
 mm/vma_internal.h                | 2 ++
 tools/testing/vma/vma_internal.h | 4 ++++
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 26046b28cdda..a11d29a2ddc0 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,8 +7,6 @@
 #include "vma_internal.h"
 #include "vma.h"

-#include <trace/events/vma.h>
-
 struct mmap_state {
 	struct mm_struct *mm;
 	struct vma_iterator *vmi;
diff --git a/mm/vma_internal.h b/mm/vma_internal.h
index 2f05735ff190..2f5ba679f43d 100644
--- a/mm/vma_internal.h
+++ b/mm/vma_internal.h
@@ -47,6 +47,8 @@
 #include <linux/uprobes.h>
 #include <linux/userfaultfd_k.h>

+#include <trace/events/vma.h>
+
 #include <asm/current.h>
 #include <asm/tlb.h>

diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 07f4108c5e4c..c08c91861b9a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1661,4 +1661,8 @@ static inline void vma_count_dec(struct mm_struct *mm)
 	vma_count_sub(mm, 1);
 }

+static void trace_max_vma_count_exceeded(struct task_struct *task)
+{
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
--
2.51.0

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
  2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
                     ` (4 preceding siblings ...)
  2025-09-18 11:31   ` Pedro Falcato
@ 2025-09-18 13:53   ` Lorenzo Stoakes
  5 siblings, 0 replies; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:53 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, stable, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
>
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
>
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.

Nice spot :)

And also 'doh!'

>
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>
> Chnages in v2:
>  - Fix mmap check, per Pedro
>
>  mm/mmap.c | 2 +-
>  mm/vma.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7306253cc3b5..e5370e7fcd8f 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		return -EOVERFLOW;
>
>  	/* Too many mappings? */
> -	if (mm->map_count > sysctl_max_map_count)
> +	if (mm->map_count >= sysctl_max_map_count)
>  		return -ENOMEM;
>
>  	/*
> diff --git a/mm/vma.c b/mm/vma.c
> index 3b12c7579831..033a388bc4b1 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>
> -	if (mm->map_count > sysctl_max_map_count)
> +	if (mm->map_count >= sysctl_max_map_count)
>  		return -ENOMEM;
>
>  	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> --
> 2.51.0.384.g4c02a37b29-goog
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-16 18:33   ` kernel test robot
@ 2025-09-18 14:10     ` Lorenzo Stoakes
  0 siblings, 0 replies; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:10 UTC (permalink / raw)
  To: kernel test robot
  Cc: Kalesh Singh, akpm, minchan, david, Liam.Howlett, rppt, pfalcato,
	oe-kbuild-all, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan

On Wed, Sep 17, 2025 at 02:33:46AM +0800, kernel test robot wrote:
> Hi Kalesh,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on f83ec76bf285bea5727f478a68b894f5543ca76e]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Kalesh-Singh/mm-fix-off-by-one-error-in-VMA-count-limit-checks/20250916-005018
> base:   f83ec76bf285bea5727f478a68b894f5543ca76e
> patch link:    https://lore.kernel.org/r/20250915163838.631445-4-kaleshsingh%40google.com
> patch subject: [PATCH v2 3/7] mm: introduce vma_count_remaining()
> config: s390-randconfig-r122-20250916 (https://download.01.org/0day-ci/archive/20250917/202509170238.NUURxWPI-lkp@intel.com/config)
> compiler: s390-linux-gcc (GCC) 13.4.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250917/202509170238.NUURxWPI-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202509170238.NUURxWPI-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
>    s390-linux-ld: mm/mmap.o: in function `do_mmap':
> >> mm/mmap.c:377:(.text+0x6566): undefined reference to `vma_count_remaining'
>    s390-linux-ld: mm/mremap.o: in function `prep_move_vma':
> >> mm/mremap.c:1043:(.text+0x199a): undefined reference to `vma_count_remaining'
>    s390-linux-ld: mm/mremap.o: in function `check_mremap_params':
>    mm/mremap.c:1820:(.text+0x4044): undefined reference to `vma_count_remaining'
>    s390-linux-ld: mm/vma.o: in function `vms_gather_munmap_vmas':
> >> mm/vma.c:1348:(.text+0x8ebc): undefined reference to `vma_count_remaining'
>    s390-linux-ld: mm/vma.o: in function `split_vma':
>    mm/vma.c:595:(.text+0xa2a2): undefined reference to `vma_count_remaining'
>    s390-linux-ld: mm/vma.o:mm/vma.c:2775: more undefined references to `vma_count_remaining' follow

You've accidentally put vma_count_remaining() in an:

#ifdef CONFIG_SYSCTL

block, which this config doesn't have:

#ifdef CONFIG_SYSCTL
#if defined(HAVE_ARCH_PICK_MMAP_LAYOUT) || \
		defined(CONFIG_ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT)
int sysctl_legacy_va_layout;
#endif

static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;

/**
 * vma_count_remaining - Determine available VMA slots
 * @mm: The memory descriptor for the process.
 *
 * Check how many more VMAs can be created for the given @mm
 * before hitting the sysctl_max_map_count limit.
 *
 * Return: The number of new VMAs the process can accommodate.
 */
int vma_count_remaining(const struct mm_struct *mm)
{
	const int map_count = mm->vma_count;
	const int max_count = sysctl_max_map_count;

	return (max_count > map_count) ? (max_count - map_count) : 0;
}

...
#endif /* CONFIG_SYSCTL */

So you should move it out of here and that should fix the kernelbot report.

Cheers, Lorenzo

>
>
> vim +377 mm/mmap.c
>
>    276
>    277	/**
>    278	 * do_mmap() - Perform a userland memory mapping into the current process
>    279	 * address space of length @len with protection bits @prot, mmap flags @flags
>    280	 * (from which VMA flags will be inferred), and any additional VMA flags to
>    281	 * apply @vm_flags. If this is a file-backed mapping then the file is specified
>    282	 * in @file and page offset into the file via @pgoff.
>    283	 *
>    284	 * This function does not perform security checks on the file and assumes, if
>    285	 * @uf is non-NULL, the caller has provided a list head to track unmap events
>    286	 * for userfaultfd @uf.
>    287	 *
>    288	 * It also simply indicates whether memory population is required by setting
>    289	 * @populate, which must be non-NULL, expecting the caller to actually perform
>    290	 * this task itself if appropriate.
>    291	 *
>    292	 * This function will invoke architecture-specific (and if provided and
>    293	 * relevant, file system-specific) logic to determine the most appropriate
>    294	 * unmapped area in which to place the mapping if not MAP_FIXED.
>    295	 *
>    296	 * Callers which require userland mmap() behaviour should invoke vm_mmap(),
>    297	 * which is also exported for module use.
>    298	 *
>    299	 * Those which require this behaviour less security checks, userfaultfd and
>    300	 * populate behaviour, and who handle the mmap write lock themselves, should
>    301	 * call this function.
>    302	 *
>    303	 * Note that the returned address may reside within a merged VMA if an
>    304	 * appropriate merge were to take place, so it doesn't necessarily specify the
>    305	 * start of a VMA, rather only the start of a valid mapped range of length
>    306	 * @len bytes, rounded down to the nearest page size.
>    307	 *
>    308	 * The caller must write-lock current->mm->mmap_lock.
>    309	 *
>    310	 * @file: An optional struct file pointer describing the file which is to be
>    311	 * mapped, if a file-backed mapping.
>    312	 * @addr: If non-zero, hints at (or if @flags has MAP_FIXED set, specifies) the
>    313	 * address at which to perform this mapping. See mmap (2) for details. Must be
>    314	 * page-aligned.
>    315	 * @len: The length of the mapping. Will be page-aligned and must be at least 1
>    316	 * page in size.
>    317	 * @prot: Protection bits describing access required to the mapping. See mmap
>    318	 * (2) for details.
>    319	 * @flags: Flags specifying how the mapping should be performed, see mmap (2)
>    320	 * for details.
>    321	 * @vm_flags: VMA flags which should be set by default, or 0 otherwise.
>    322	 * @pgoff: Page offset into the @file if file-backed, should be 0 otherwise.
>    323	 * @populate: A pointer to a value which will be set to 0 if no population of
>    324	 * the range is required, or the number of bytes to populate if it is. Must be
>    325	 * non-NULL. See mmap (2) for details as to under what circumstances population
>    326	 * of the range occurs.
>    327	 * @uf: An optional pointer to a list head to track userfaultfd unmap events
>    328	 * should unmapping events arise. If provided, it is up to the caller to manage
>    329	 * this.
>    330	 *
>    331	 * Returns: Either an error, or the address at which the requested mapping has
>    332	 * been performed.
>    333	 */
>    334	unsigned long do_mmap(struct file *file, unsigned long addr,
>    335				unsigned long len, unsigned long prot,
>    336				unsigned long flags, vm_flags_t vm_flags,
>    337				unsigned long pgoff, unsigned long *populate,
>    338				struct list_head *uf)
>    339	{
>    340		struct mm_struct *mm = current->mm;
>    341		int pkey = 0;
>    342
>    343		*populate = 0;
>    344
>    345		mmap_assert_write_locked(mm);
>    346
>    347		if (!len)
>    348			return -EINVAL;
>    349
>    350		/*
>    351		 * Does the application expect PROT_READ to imply PROT_EXEC?
>    352		 *
>    353		 * (the exception is when the underlying filesystem is noexec
>    354		 *  mounted, in which case we don't add PROT_EXEC.)
>    355		 */
>    356		if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
>    357			if (!(file && path_noexec(&file->f_path)))
>    358				prot |= PROT_EXEC;
>    359
>    360		/* force arch specific MAP_FIXED handling in get_unmapped_area */
>    361		if (flags & MAP_FIXED_NOREPLACE)
>    362			flags |= MAP_FIXED;
>    363
>    364		if (!(flags & MAP_FIXED))
>    365			addr = round_hint_to_min(addr);
>    366
>    367		/* Careful about overflows.. */
>    368		len = PAGE_ALIGN(len);
>    369		if (!len)
>    370			return -ENOMEM;
>    371
>    372		/* offset overflow? */
>    373		if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
>    374			return -EOVERFLOW;
>    375
>    376		/* Too many mappings? */
>  > 377		if (!vma_count_remaining(mm))
>    378			return -ENOMEM;
>    379
>    380		/*
>    381		 * addr is returned from get_unmapped_area,
>    382		 * There are two cases:
>    383		 * 1> MAP_FIXED == false
>    384		 *	unallocated memory, no need to check sealing.
>    385		 * 1> MAP_FIXED == true
>    386		 *	sealing is checked inside mmap_region when
>    387		 *	do_vmi_munmap is called.
>    388		 */
>    389
>    390		if (prot == PROT_EXEC) {
>    391			pkey = execute_only_pkey(mm);
>    392			if (pkey < 0)
>    393				pkey = 0;
>    394		}
>    395
>    396		/* Do simple checking here so the lower-level routines won't have
>    397		 * to. we assume access permissions have been handled by the open
>    398		 * of the memory object, so we don't do any here.
>    399		 */
>    400		vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(file, flags) |
>    401				mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
>    402
>    403		/* Obtain the address to map to. we verify (or select) it and ensure
>    404		 * that it represents a valid section of the address space.
>    405		 */
>    406		addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
>    407		if (IS_ERR_VALUE(addr))
>    408			return addr;
>    409
>    410		if (flags & MAP_FIXED_NOREPLACE) {
>    411			if (find_vma_intersection(mm, addr, addr + len))
>    412				return -EEXIST;
>    413		}
>    414
>    415		if (flags & MAP_LOCKED)
>    416			if (!can_do_mlock())
>    417				return -EPERM;
>    418
>    419		if (!mlock_future_ok(mm, vm_flags, len))
>    420			return -EAGAIN;
>    421
>    422		if (file) {
>    423			struct inode *inode = file_inode(file);
>    424			unsigned long flags_mask;
>    425			int err;
>    426
>    427			if (!file_mmap_ok(file, inode, pgoff, len))
>    428				return -EOVERFLOW;
>    429
>    430			flags_mask = LEGACY_MAP_MASK;
>    431			if (file->f_op->fop_flags & FOP_MMAP_SYNC)
>    432				flags_mask |= MAP_SYNC;
>    433
>    434			switch (flags & MAP_TYPE) {
>    435			case MAP_SHARED:
>    436				/*
>    437				 * Force use of MAP_SHARED_VALIDATE with non-legacy
>    438				 * flags. E.g. MAP_SYNC is dangerous to use with
>    439				 * MAP_SHARED as you don't know which consistency model
>    440				 * you will get. We silently ignore unsupported flags
>    441				 * with MAP_SHARED to preserve backward compatibility.
>    442				 */
>    443				flags &= LEGACY_MAP_MASK;
>    444				fallthrough;
>    445			case MAP_SHARED_VALIDATE:
>    446				if (flags & ~flags_mask)
>    447					return -EOPNOTSUPP;
>    448				if (prot & PROT_WRITE) {
>    449					if (!(file->f_mode & FMODE_WRITE))
>    450						return -EACCES;
>    451					if (IS_SWAPFILE(file->f_mapping->host))
>    452						return -ETXTBSY;
>    453				}
>    454
>    455				/*
>    456				 * Make sure we don't allow writing to an append-only
>    457				 * file..
>    458				 */
>    459				if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
>    460					return -EACCES;
>    461
>    462				vm_flags |= VM_SHARED | VM_MAYSHARE;
>    463				if (!(file->f_mode & FMODE_WRITE))
>    464					vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
>    465				fallthrough;
>    466			case MAP_PRIVATE:
>    467				if (!(file->f_mode & FMODE_READ))
>    468					return -EACCES;
>    469				if (path_noexec(&file->f_path)) {
>    470					if (vm_flags & VM_EXEC)
>    471						return -EPERM;
>    472					vm_flags &= ~VM_MAYEXEC;
>    473				}
>    474
>    475				if (!can_mmap_file(file))
>    476					return -ENODEV;
>    477				if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
>    478					return -EINVAL;
>    479				break;
>    480
>    481			default:
>    482				return -EINVAL;
>    483			}
>    484
>    485			/*
>    486			 * Check to see if we are violating any seals and update VMA
>    487			 * flags if necessary to avoid future seal violations.
>    488			 */
>    489			err = memfd_check_seals_mmap(file, &vm_flags);
>    490			if (err)
>    491				return (unsigned long)err;
>    492		} else {
>    493			switch (flags & MAP_TYPE) {
>    494			case MAP_SHARED:
>    495				if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
>    496					return -EINVAL;
>    497				/*
>    498				 * Ignore pgoff.
>    499				 */
>    500				pgoff = 0;
>    501				vm_flags |= VM_SHARED | VM_MAYSHARE;
>    502				break;
>    503			case MAP_DROPPABLE:
>    504				if (VM_DROPPABLE == VM_NONE)
>    505					return -ENOTSUPP;
>    506				/*
>    507				 * A locked or stack area makes no sense to be droppable.
>    508				 *
>    509				 * Also, since droppable pages can just go away at any time
>    510				 * it makes no sense to copy them on fork or dump them.
>    511				 *
>    512				 * And don't attempt to combine with hugetlb for now.
>    513				 */
>    514				if (flags & (MAP_LOCKED | MAP_HUGETLB))
>    515				        return -EINVAL;
>    516				if (vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
>    517				        return -EINVAL;
>    518
>    519				vm_flags |= VM_DROPPABLE;
>    520
>    521				/*
>    522				 * If the pages can be dropped, then it doesn't make
>    523				 * sense to reserve them.
>    524				 */
>    525				vm_flags |= VM_NORESERVE;
>    526
>    527				/*
>    528				 * Likewise, they're volatile enough that they
>    529				 * shouldn't survive forks or coredumps.
>    530				 */
>    531				vm_flags |= VM_WIPEONFORK | VM_DONTDUMP;
>    532				fallthrough;
>    533			case MAP_PRIVATE:
>    534				/*
>    535				 * Set pgoff according to addr for anon_vma.
>    536				 */
>    537				pgoff = addr >> PAGE_SHIFT;
>    538				break;
>    539			default:
>    540				return -EINVAL;
>    541			}
>    542		}
>    543
>    544		/*
>    545		 * Set 'VM_NORESERVE' if we should not account for the
>    546		 * memory use of this mapping.
>    547		 */
>    548		if (flags & MAP_NORESERVE) {
>    549			/* We honor MAP_NORESERVE if allowed to overcommit */
>    550			if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
>    551				vm_flags |= VM_NORESERVE;
>    552
>    553			/* hugetlb applies strict overcommit unless MAP_NORESERVE */
>    554			if (file && is_file_hugepages(file))
>    555				vm_flags |= VM_NORESERVE;
>    556		}
>    557
>    558		addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
>    559		if (!IS_ERR_VALUE(addr) &&
>    560		    ((vm_flags & VM_LOCKED) ||
>    561		     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
>    562			*populate = len;
>    563		return addr;
>    564	}
>    565
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
                     ` (2 preceding siblings ...)
  2025-09-18 13:20   ` Lorenzo Stoakes
@ 2025-09-18 14:31   ` Lorenzo Stoakes
  2025-09-18 15:52     ` Kalesh Singh
  3 siblings, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:31 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
> The checks against sysctl_max_map_count are open-coded in multiple
> places. While simple checks are manageable, the logic in places like
> mremap.c involves arithmetic with magic numbers that can be difficult
> to reason about. e.g. ... >= sysctl_max_map_count - 3
>
> To improve readability and centralize the logic, introduce a new helper,
> vma_count_remaining(). This function returns the VMA count headroom
> available for a givine process.
>
> The most common case of checking for a single new VMA can be done with
> the convenience helper has_vma_count_remaining():
>
>     if (!vma_count_remaining(mm))
>
> And the complex checks in mremap.c become clearer by expressing the
> required capacity directly:
>
>     if (vma_count_remaining(mm) <  4)

Double space after 4.

>
> While a capacity-based function could be misused (e.g., with an
> incorrect '<' vs '<=' comparison), the improved readability at the call
> sites makes such errors less likely than with the previous open-coded
> arithmetic.
>
> As part of this change, sysctl_max_map_count is made static to
> mm/mmap.c to improve encapsulation.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

Generally logic looks ok, with various stuff fixed below + in commit msg we
should be good to go :)

> ---
>
> Changes in v2:
>   - Fix documentation comment for vma_count_remaining(), per Mike
>   - Remove extern in header, per Mike and Pedro
>   - Move declaration to mm/internal.h, per Mike
>   - Replace exceeds_max_map_count() with capacity-based vma_count_remaining(),
>        per Lorenzo.
>   - Fix tools/testing/vma, per Lorenzo.
>
>  include/linux/mm.h               |  2 --
>  mm/internal.h                    |  2 ++
>  mm/mmap.c                        | 21 ++++++++++++++++++++-
>  mm/mremap.c                      |  7 ++++---
>  mm/nommu.c                       |  2 +-
>  mm/util.c                        |  1 -
>  mm/vma.c                         | 10 +++++-----
>  tools/testing/vma/vma_internal.h |  9 +++++++++
>  8 files changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ae97a0b8ec7..138bab2988f8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page)
>  #define MAPCOUNT_ELF_CORE_MARGIN	(5)
>  #define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
>
> -extern int sysctl_max_map_count;
> -

Nice to get rid of this as a global! :)

>  extern unsigned long sysctl_user_reserve_kbytes;
>  extern unsigned long sysctl_admin_reserve_kbytes;
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 45b725c3dc03..39f1c9535ae5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
>  void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
>  int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
>
> +int vma_count_remaining(const struct mm_struct *mm);
> +
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index e5370e7fcd8f..af88ce1fbb5f 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		return -EOVERFLOW;
>
>  	/* Too many mappings? */
> -	if (mm->map_count >= sysctl_max_map_count)
> +	if (!vma_count_remaining(mm))
>  		return -ENOMEM;
>
>  	/*
> @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
>  int sysctl_legacy_va_layout;
>  #endif
>
> +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> +
> +/**
> + * vma_count_remaining - Determine available VMA slots
> + * @mm: The memory descriptor for the process.
> + *
> + * Check how many more VMAs can be created for the given @mm
> + * before hitting the sysctl_max_map_count limit.
> + *
> + * Return: The number of new VMAs the process can accommodate.
> + */
> +int vma_count_remaining(const struct mm_struct *mm)
> +{
> +	const int map_count = mm->map_count;
> +	const int max_count = sysctl_max_map_count;

David already commented on the READ_ONCE() here, seems wise.

> +
> +	return (max_count > map_count) ? (max_count - map_count) : 0;

Not a big deal but would prefer:

	if (map_count >= map_count)
		return 0;

	return max_count - map_count;

As the ternary here is a bit less clear, and it puts the 'failure' case first
and the 'success' case afterwards.

> +}
> +

As discussed in reply to the kernel bot, you've accidentally placed this in a
CONFIG_SYSCTL block, so need to move it :>)

>  static const struct ctl_table mmap_table[] = {
>  		{
>  				.procname       = "max_map_count",
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 35de0a7b910e..14d35d87e89b 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
>  	 * We'd prefer to avoid failure later on in do_munmap:
>  	 * which may split one vma into three before unmapping.
>  	 */
> -	if (current->mm->map_count >= sysctl_max_map_count - 3)
> +	if (vma_count_remaining(current->mm) < 4)
>  		return -ENOMEM;

This is much clearer.

>
>  	if (vma->vm_ops && vma->vm_ops->may_split) {
> @@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
>  	 * split in 3 before unmapping it.
>  	 * That means 2 more maps (1 for each) to the ones we already hold.
>  	 * Check whether current map count plus 2 still leads us to 4 maps below
> -	 * the threshold, otherwise return -ENOMEM here to be more safe.
> +	 * the threshold. In other words, is the current map count + 6 at or
> +	 * below the threshold? Otherwise return -ENOMEM here to be more safe.
>  	 */
> -	if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
> +	if (vma_count_remaining(current->mm) < 6)
>  		return -ENOMEM;

I hate that we do this silly check here, but the time to revisit it is another
series...

>
>  	return 0;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 8b819fafd57b..dd75f2334812 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		return -ENOMEM;
>
>  	mm = vma->vm_mm;
> -	if (mm->map_count >= sysctl_max_map_count)
> +	if (!vma_count_remaining(mm))
>  		return -ENOMEM;
>
>  	region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
> diff --git a/mm/util.c b/mm/util.c
> index f814e6a59ab1..b6e83922cafe 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy);
>  int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
>  static int sysctl_overcommit_ratio __read_mostly = 50;
>  static unsigned long sysctl_overcommit_kbytes __read_mostly;
> -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
>  unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
>  unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 033a388bc4b1..df0e8409f63d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
>  }
>
>  /*
> - * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
> - * has already been checked or doesn't make sense to fail.
> + * __split_vma() bypasses vma_count_remaining() checks. We use this where
> + * it has already been checked or doesn't make sense to fail.
>   * VMA Iterator will point to the original VMA.
>   */
>  static __must_check int
> @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		     unsigned long addr, int new_below)
>  {
> -	if (vma->vm_mm->map_count >= sysctl_max_map_count)
> +	if (!vma_count_remaining(vma->vm_mm))
>  		return -ENOMEM;
>
>  	return __split_vma(vmi, vma, addr, new_below);
> @@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  		 * its limit temporarily, to help free resources as expected.
>  		 */
>  		if (vms->end < vms->vma->vm_end &&
> -		    vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
> +		    !vma_count_remaining(vms->vma->vm_mm)) {
>  			error = -ENOMEM;
>  			goto map_count_exceeded;
>  		}
> @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
>  		return -ENOMEM;
>
> -	if (mm->map_count >= sysctl_max_map_count)
> +	if (!vma_count_remaining(mm))
>  		return -ENOMEM;
>
>  	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 3639aa8dd2b0..52cd7ddc73f4 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
>  	return vm_flags;
>  }
>
> +/* Helper to get VMA count capacity */
> +static int vma_count_remaining(const struct mm_struct *mm)
> +{
> +	const int map_count = mm->map_count;
> +	const int max_count = sysctl_max_map_count;
> +
> +	return (max_count > map_count) ? (max_count - map_count) : 0;
> +}
> +
>  #endif	/* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
  2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
  2025-09-17 10:58   ` David Hildenbrand
@ 2025-09-18 14:42   ` Lorenzo Stoakes
  2025-09-18 16:21     ` Kalesh Singh
  1 sibling, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:42 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:33AM -0700, Kalesh Singh wrote:
> Add a new selftest to verify that the max VMA count limit is correctly
> enforced.
>
> This test suite checks that various VMA operations (mmap, mprotect,
> munmap, mremap) succeed or fail as expected when the number of VMAs is
> close to the sysctl_max_map_count limit.
>
> The test works by first creating a large number of VMAs to bring the
> process close to the limit, and then performing various operations that
> may or may not create new VMAs. The test then verifies that the
> operations that would exceed the limit fail, and that the operations
> that do not exceed the limit succeed.
>
> NOTE: munmap is special as it's allowed to temporarily exceed the limit
> by one for splits as this will decrease back to the limit once the unmap
> succeeds.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
>
> Changes in v2:
>  - Add tests, per Liam (note that the do_brk_flags() path is not easily
>    tested from userspace, so it's not included here). Exceeding the limit there
>    should be uncommon.
>
>  tools/testing/selftests/mm/Makefile           |   1 +
>  .../selftests/mm/max_vma_count_tests.c        | 709 ++++++++++++++++++

If you add a new file, update MAINTAINERS. Should put in MEMORY MAPPING section.

>  tools/testing/selftests/mm/run_vmtests.sh     |   5 +

You're missing a .gitignore.

>  3 files changed, 715 insertions(+)
>  create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
>
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index d13b3cef2a2b..00a4b04eab06 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress
>  TEST_GEN_FILES += uffd-stress
>  TEST_GEN_FILES += uffd-unit-tests
>  TEST_GEN_FILES += uffd-wp-mremap
> +TEST_GEN_FILES += max_vma_count_tests
>  TEST_GEN_FILES += split_huge_page_test
>  TEST_GEN_FILES += ksm_tests
>  TEST_GEN_FILES += ksm_functional_tests
> diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c
> new file mode 100644
> index 000000000000..c8401c03425c
> --- /dev/null
> +++ b/tools/testing/selftests/mm/max_vma_count_tests.c
> @@ -0,0 +1,709 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2025 Google LLC
> + */
> +#define _GNU_SOURCE
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <stdbool.h>
> +#include <linux/prctl.h>  /* Definition of PR_* constants */
> +#include <sys/prctl.h>
> +
> +#include "../kselftest.h"
> +
> +static int get_max_vma_count(void);
> +static bool set_max_vma_count(int val);
> +static int get_current_vma_count(void);
> +static bool is_current_vma_count(const char *msg, int expected);
> +static bool is_test_area_mapped(const char *msg);
> +static void print_surrounding_maps(const char *msg);
> +
> +/* Globals initialized in test_suite_setup() */
> +static int MAX_VMA_COUNT;
> +static int ORIGINAL_MAX_VMA_COUNT;
> +static int PAGE_SIZE;
> +static int GUARD_SIZE;
> +static int TEST_AREA_SIZE;
> +static int EXTRA_MAP_SIZE;
> +
> +static int MAX_VMA_COUNT;
> +
> +static int NR_EXTRA_MAPS;
> +
> +static char *TEST_AREA;
> +static char *EXTRA_MAPS;
> +
> +#define DEFAULT_MAX_MAP_COUNT	65530
> +#define TEST_AREA_NR_PAGES	3
> +/* 1 before test area + 1 after test area + 1 after extra mappings */
> +#define NR_GUARDS		3
> +#define TEST_AREA_PROT		(PROT_NONE)
> +#define EXTRA_MAP_PROT		(PROT_NONE)
> +
> +/**
> + * test_suite_setup - Set up the VMA layout for VMA count testing.
> + *
> + * Sets up the following VMA layout:
> + *
> + * +----- base_addr
> + * |
> + * V
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> + * |  Guard Page  |                      |  Guard Page  |  Extra Map 1   | Unmapped Gap |  Extra Map 2   | Unmapped Gap | ... |  Extra Map N   | Unmapped Gap |
> + * |  (unmapped)  |      TEST_AREA       |  (unmapped)  | (mapped page)  |  (1 page)    | (mapped page)  |  (1 page)    | ... | (mapped page)  |  (1 page)    |
> + * |   (1 page)   | (unmapped, 3 pages)  |   (1 page)   |    (1 page)    |              |    (1 page)    |              |     |    (1 page)    |              |
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+

I have no idea what a 'guard page' is here. You're not using MADV_GUARD_INSTALL so it's presumably an arbitrary PROT_NONE page or something?

> + * ^              ^                      ^              ^                                                                  ^
> + * |              |                      |              |                                                                  |
> + * +--GUARD_SIZE--+                      |              +-- EXTRA_MAPS points here             Sufficient EXTRA_MAPS to ---+
> + *    (PAGE_SIZE) |                      |                                                         reach MAX_VMA_COUNT
> + *                |                      |
> + *                +--- TEST_AREA_SIZE ---+
> + *                |   (3 * PAGE_SIZE)    |
> + *                ^
> + *                |
> + *                +-- TEST_AREA starts here

Can we make this vertical, it's obv >80 chars and it's kind of hard to read.

Also isn't this hugely convoluted? Do we really need to set up such a
complicated set of VMAs for every test?

> + *
> + * Populates TEST_AREA and other globals required for the tests.
> + * If successful, the current VMA count will be MAX_VMA_COUNT - 1.
> + *
> + * Return: true on success, false on failure.
> + */
> +static bool test_suite_setup(void)
> +{
> +	int initial_vma_count;
> +	size_t reservation_size;
> +	void *base_addr = NULL;
> +	char *ptr = NULL;
> +
> +	ksft_print_msg("Setting up vma_max_count test suite...\n");
> +
> +	/* Initialize globals */
> +	PAGE_SIZE = sysconf(_SC_PAGESIZE);
> +	TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
> +	GUARD_SIZE = PAGE_SIZE;
> +	EXTRA_MAP_SIZE = PAGE_SIZE;
> +	MAX_VMA_COUNT = get_max_vma_count();

Dude we are not COBOL programmers :) What is it with these capitalised variable
names?

This is really horrible, please don't do that. And why on earth are you using
globals like this.

(Again) use the self test harness please :)

> +
> +	MAX_VMA_COUNT = get_max_vma_count();
> +	if (MAX_VMA_COUNT < 0) {
> +		ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
> +		return false;
> +	}
> +
> +	/*
> +	 * If the current limit is higher than the kernel default,
> +	 * we attempt to lower it to the default to ensure the test
> +	 * can run with a reliably known boundary.
> +	 */
> +	ORIGINAL_MAX_VMA_COUNT = 0;
> +
> +	if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
> +		ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
> +
> +		ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
> +				MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
> +
> +		if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
> +			ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)n",
> +					DEFAULT_MAX_MAP_COUNT);

Probably partly a product of you not using the harness, but you're violating the
max chars per line all over the place.

checkpatch.pl kinda has a fit with this file, can we just try and keep it to ~80 chars please?

> +			ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
> +
> +			MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
> +
> +			return false;
> +		}
> +
> +		/* Update MAX_VMA_COUNT for the test run */
> +		MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
> +	}
> +
> +	initial_vma_count = get_current_vma_count();
> +	if (initial_vma_count < 0) {
> +		ksft_print_msg("Failed to read /proc/self/maps\n");
> +		return false;
> +	}
> +
> +	/*
> +	 * Calculate how many extra mappings we need to create to reach
> +	 * MAX_VMA_COUNT - 1 (excluding test area).
> +	 */
> +	NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
> +
> +	if (NR_EXTRA_MAPS < 1) {
> +		ksft_print_msg("Not enough available maps to run test\n");
> +		ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
> +				MAX_VMA_COUNT, initial_vma_count);
> +		return false;
> +	}
> +
> +	/*
> +	 * Reserve space for:
> +	 * - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
> +	 * - The test area itself (TEST_AREA_NR_PAGES)
> +	 * - The guard pages (NR_GUARDS)
> +	 */
> +	reservation_size = ((NR_EXTRA_MAPS * 2) +
> +				TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
> +
> +	base_addr = mmap(NULL, reservation_size, PROT_NONE,
> +			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +	if (base_addr == MAP_FAILED) {
> +		ksft_print_msg("Failed tommap initial reservation\n");
> +		return false;
> +	}
> +
> +	if (munmap(base_addr, reservation_size) == -1) {
> +		ksft_print_msg("Failed to munmap initial reservation\n");
> +		return false;
> +	}
> +
> +	/* Get the addr of the test area */
> +	TEST_AREA = (char *)base_addr + GUARD_SIZE;
> +
> +	/*
> +	 * Get the addr of the region for extra mappings:
> +	 *     test area + 1 guard.
> +	 */
> +	EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
> +
> +	/* Create single-page mappings separated by unmapped pages */
> +	ptr = EXTRA_MAPS;
> +	for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
> +		if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
> +			MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> +			-1, 0) == MAP_FAILED) {
> +			perror("mmap in fill loop");
> +			ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
> +					NR_EXTRA_MAPS);
> +			return false;
> +		}
> +
> +		/* Advance pointer by 2 to leave a gap */
> +		ptr += (2 * EXTRA_MAP_SIZE);
> +	}
> +
> +	if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
> +		return false;
> +
> +	ksft_print_msg("vma_max_count test suite setup done.\n");
> +
> +	return true;
> +}

I mean it'll be fixed by the test harness, but even in test code let's please
not write MASSIVE functions that do a million things. It's not really readable.

> +
> +static void test_suite_teardown(void)
> +{
> +	if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
> +		if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
> +			ksft_print_msg("Failed to restore max_map_count to %d\n",
> +					ORIGINAL_MAX_VMA_COUNT);
> +	}
> +}
> +
> +/* --- Test Helper Functions --- */
> +static bool mmap_anon(void)
> +{
> +	void *addr =  mmap(NULL, PAGE_SIZE, PROT_READ,
> +			   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +
> +	/*
> +	 * Handle cleanup here as the runner doesn't track where this,
> +	 *mapping is located.
> +	 */
> +	if (addr != MAP_FAILED)
> +		munmap(addr, PAGE_SIZE);
> +
> +	return addr != MAP_FAILED;
> +}
> +
> +static inline bool __mprotect(char *addr, int size)
> +{
> +	int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
> +
> +	return mprotect(addr, size, new_prot) == 0;
> +}
> +
> +static bool mprotect_nosplit(void)
> +{
> +	return __mprotect(TEST_AREA, TEST_AREA_SIZE);
> +}
> +
> +static bool mprotect_2way_split(void)
> +{
> +	return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> +}
> +
> +static bool mprotect_3way_split(void)
> +{
> +	return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +static inline bool __munmap(char *addr, int size)
> +{
> +	return munmap(addr, size) == 0;
> +}
> +
> +static bool munmap_nosplit(void)
> +{
> +	return __munmap(TEST_AREA, TEST_AREA_SIZE);
> +}
> +
> +static bool munmap_2way_split(void)
> +{
> +	return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> +}
> +
> +static bool munmap_3way_split(void)
> +{
> +	return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +/* mremap accounts for the worst case to fail early */
> +static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
> +
> +static bool mremap_dontunmap(void)
> +{
> +	void *new_addr;
> +
> +	/*
> +	 * Using MREMAP_DONTUNMAP will create a new mapping without
> +	 * removing the old one, consuming one VMA slot.
> +	 */
> +	new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
> +			  MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
> +
> +	if (new_addr != MAP_FAILED)
> +		munmap(new_addr, TEST_AREA_SIZE);
> +
> +	return new_addr != MAP_FAILED;
> +}
> +
> +struct test {
> +	const char *name;
> +	bool (*test)(void);
> +	 /* How many VMA slots below the limit this test needs to start? */
> +	int vma_slots_needed;
> +	bool expect_success;
> +};
> +
> +/* --- Test Cases --- */
> +struct test tests[] = {
> +	{
> +		.name = "mmap_at_1_below_vma_count_limit",
> +		.test = mmap_anon,
> +		.vma_slots_needed = 1,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "mmap_at_vma_count_limit",
> +		.test = mmap_anon,
> +		.vma_slots_needed = 0,
> +		.expect_success = false,
> +	},
> +	{
> +		.name = "mprotect_nosplit_at_1_below_vma_count_limit",
> +		.test = mprotect_nosplit,
> +		.vma_slots_needed = 1,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "mprotect_nosplit_at_vma_count_limit",
> +		.test = mprotect_nosplit,
> +		.vma_slots_needed = 0,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "mprotect_2way_split_at_1_below_vma_count_limit",
> +		.test = mprotect_2way_split,
> +		.vma_slots_needed = 1,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "mprotect_2way_split_at_vma_count_limit",
> +		.test = mprotect_2way_split,
> +		.vma_slots_needed = 0,
> +		.expect_success = false,
> +	},
> +	{
> +		.name = "mprotect_3way_split_at_2_below_vma_count_limit",
> +		.test = mprotect_3way_split,
> +		.vma_slots_needed = 2,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "mprotect_3way_split_at_1_below_vma_count_limit",
> +		.test = mprotect_3way_split,
> +		.vma_slots_needed = 1,
> +		.expect_success = false,
> +	},
> +	{
> +		.name = "mprotect_3way_split_at_vma_count_limit",
> +		.test = mprotect_3way_split,
> +		.vma_slots_needed = 0,
> +		.expect_success = false,
> +	},
> +	{
> +		.name = "munmap_nosplit_at_1_below_vma_count_limit",
> +		.test = munmap_nosplit,
> +		.vma_slots_needed = 1,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "munmap_nosplit_at_vma_count_limit",
> +		.test = munmap_nosplit,
> +		.vma_slots_needed = 0,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "munmap_2way_split_at_1_below_vma_count_limit",
> +		.test = munmap_2way_split,
> +		.vma_slots_needed = 1,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "munmap_2way_split_at_vma_count_limit",
> +		.test = munmap_2way_split,
> +		.vma_slots_needed = 0,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "munmap_3way_split_at_2_below_vma_count_limit",
> +		.test = munmap_3way_split,
> +		.vma_slots_needed = 2,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "munmap_3way_split_at_1_below_vma_count_limit",
> +		.test = munmap_3way_split,
> +		.vma_slots_needed = 1,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "munmap_3way_split_at_vma_count_limit",
> +		.test = munmap_3way_split,
> +		.vma_slots_needed = 0,
> +		.expect_success = false,
> +	},
> +	{
> +		.name = "mremap_dontunmap_at_required_vma_count_capcity",
> +		.test = mremap_dontunmap,
> +		.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
> +		.expect_success = true,
> +	},
> +	{
> +		.name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
> +		.test = mremap_dontunmap,
> +		.vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
> +		.expect_success = false,
> +	},

This is horrible. We don't need to do it this way.

Use the kselftest_harness please. See the guard_regions.c test for an example of
how to use it.

> +};
> +
> +/* --- Test Runner --- */
> +int main(int argc, char **argv)
> +{
> +	int num_tests = ARRAY_SIZE(tests);
> +	int failed_tests = 0;
> +
> +	ksft_set_plan(num_tests);
> +
> +	if (!test_suite_setup() != 0) {
> +		if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
> +			ksft_exit_skip("max_map_count too high and cannot be lowered\n"
> +					"Please rerun as root.\n");
> +		else
> +			ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
> +
> +	}
> +
> +	for (int i = 0; i < num_tests; i++) {
> +		int maps_to_unmap = tests[i].vma_slots_needed;
> +		const char *name = tests[i].name;
> +		bool test_passed;
> +
> +		errno = 0;
> +
> +		/* 1. Setup: TEST_AREA mapping */
> +		if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
> +			MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
> +				== MAP_FAILED) {
> +			ksft_test_result_fail(
> +				"%s: Test setup failed to map TEST_AREA\n",
> +				name);
> +			maps_to_unmap = 0;
> +			goto fail;
> +		}
> +
> +		/* Label TEST_AREA to ease debugging */
> +		if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
> +			TEST_AREA_SIZE, "TEST_AREA")) {
> +			ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
> +					name);
> +			ksft_print_msg(
> +				"Continuing without named TEST_AREA mapping\n");
> +		}
> +
> +		/* 2. Setup: Adjust VMA count based on test requirements */
> +		if (maps_to_unmap > NR_EXTRA_MAPS) {
> +			ksft_test_result_fail(
> +				"%s: Test setup failed: Invalid VMA slots required %d\n",
> +				name, tests[i].vma_slots_needed);
> +			maps_to_unmap = 0;
> +			goto fail;
> +		}
> +
> +		/* Unmap extra mappings, accounting for the 1-page gap */
> +		for (int j = 0; j < maps_to_unmap; j++)
> +			munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> +				EXTRA_MAP_SIZE);
> +
> +		/*
> +		 * 3. Verify the preconditions.
> +		 *
> +		 *    Sometimes there isn't an easy way to determine the cause
> +		 *    of the test failure.
> +		 *    e.g. an mprotect ENOMEM may be due to trying to protect
> +		 *         unmapped area or due to hitting MAX_VMA_COUNT limit.
> +		 *
> +		 *    We verify the preconditions of the test to ensure any
> +		 *    expected failures are from the expected cause and not
> +		 *    coincidental.
> +		 */
> +		if (!is_current_vma_count(name,
> +			MAX_VMA_COUNT - tests[i].vma_slots_needed))
> +			goto fail;
> +
> +		if (!is_test_area_mapped(name))
> +			goto fail;
> +
> +		/* 4. Run the test */
> +		test_passed = (tests[i].test() == tests[i].expect_success);
> +		if (test_passed) {
> +			ksft_test_result_pass("%s\n", name);
> +		} else {
> +fail:
> +			failed_tests++;
> +			ksft_test_result_fail(
> +				"%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
> +				name, get_current_vma_count(), MAX_VMA_COUNT,
> +				errno, strerror(errno));
> +			print_surrounding_maps(name);
> +		}
> +
> +		/* 5. Teardown: Unmap TEST_AREA. */
> +		munmap(TEST_AREA, TEST_AREA_SIZE);
> +
> +		/* 6. Teardown: Restore extra mappings to test suite baseline */
> +		for (int j = 0; j < maps_to_unmap; j++) {
> +			/* Remap extra mappings, accounting for the gap */
> +			mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> +				EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
> +				MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> +				-1, 0);
> +		}
> +	}
> +
> +	test_suite_teardown();
> +
> +	if (failed_tests > 0)
> +		ksft_exit_fail();
> +	else
> +		ksft_exit_pass();
> +}
> +
> +/* --- Utilities --- */
> +
> +static int get_max_vma_count(void)
> +{
> +	int max_count;
> +	FILE *f;
> +
> +	f = fopen("/proc/sys/vm/max_map_count", "r");
> +	if (!f)
> +		return -1;
> +
> +	if (fscanf(f, "%d", &max_count) != 1)
> +		max_count = -1;
> +
> +
> +	fclose(f);
> +
> +	return max_count;
> +}
> +
> +static bool set_max_vma_count(int val)
> +{
> +	FILE *f;
> +	bool success = false;
> +
> +	f = fopen("/proc/sys/vm/max_map_count", "w");
> +	if (!f)
> +		return false;
> +
> +	if (fprintf(f, "%d", val) > 0)
> +		success = true;
> +
> +	fclose(f);
> +	return success;
> +}
> +
> +static int get_current_vma_count(void)
> +{
> +	char line[1024];
> +	int count = 0;
> +	FILE *f;
> +
> +	f = fopen("/proc/self/maps", "r");
> +	if (!f)
> +		return -1;
> +
> +	while (fgets(line, sizeof(line), f)) {
> +		if (!strstr(line, "[vsyscall]"))
> +			count++;
> +	}
> +
> +	fclose(f);
> +
> +	return count;
> +}
> +
> +static bool is_current_vma_count(const char *msg, int expected)
> +{
> +	int current = get_current_vma_count();
> +
> +	if (current == expected)
> +		return true;
> +
> +	ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
> +			expected);
> +	return false;
> +}
> +
> +static bool is_test_area_mapped(const char *msg)
> +{
> +	unsigned long search_start = (unsigned long)TEST_AREA;
> +	unsigned long search_end = search_start + TEST_AREA_SIZE;
> +	bool found = false;
> +	char line[1024];
> +	FILE *f;
> +
> +	f = fopen("/proc/self/maps", "r");
> +	if (!f) {
> +		ksft_print_msg("failed to open /proc/self/maps\n");
> +		return false;
> +	}
> +
> +	while (fgets(line, sizeof(line), f)) {
> +		unsigned long start, end;
> +
> +		if (sscanf(line, "%lx-%lx", &start, &end) != 2)
> +			continue;
> +
> +		/* Check for an exact match of the range */
> +		if (start == search_start && end == search_end) {
> +			found = true;
> +			break;
> +		} else if (start > search_end) {
> +			/*
> +			 *Since maps are sorted, if we've passed the end, we
> +			 * can stop searching.
> +			 */
> +			break;
> +		}
> +	}
> +

There's helpers for this kind of thing in vm_util.h, let's not duplicate
code. Also you can use the proc query ioctl thingy (also helpers for that in
vm_util.h) rather than having to spelunk /proc/$pid/maps.

> +	fclose(f);
> +
> +	if (found)
> +		return true;
> +
> +	/* Not found */
> +	ksft_print_msg(
> +		"%s: TEST_AREA is not mapped as a single contiguous block.\n",
> +		msg);
> +	print_surrounding_maps(msg);
> +
> +	return false;
> +}
> +
> +static void print_surrounding_maps(const char *msg)

WHy are you printing a bunch of noise in a test?

> +{
> +	unsigned long search_start = (unsigned long)TEST_AREA;
> +	unsigned long search_end = search_start + TEST_AREA_SIZE;
> +	unsigned long start;
> +	unsigned long end;
> +	char line[1024] = {};
> +	int line_idx = 0;
> +	int first_match_idx = -1;
> +	int last_match_idx = -1;
> +	FILE *f;
> +
> +	f = fopen("/proc/self/maps", "r");
> +	if (!f)
> +		return;
> +
> +	if (msg)
> +		ksft_print_msg("%s\n", msg);
> +
> +	ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
> +			TEST_AREA);
> +
> +	/* First pass: Read all lines and find the range of matching entries */
> +	fseek(f, 0, SEEK_SET); /* Rewind file */
> +	while (fgets(line, sizeof(line), f)) {
> +		if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
> +			line_idx++;
> +			continue;
> +		}
> +
> +		/* Check for any overlap */
> +		if (start < search_end && end > search_start) {
> +			if (first_match_idx == -1)
> +				first_match_idx = line_idx;
> +			last_match_idx = line_idx;
> +		} else if (start > search_end) {
> +			/*
> +			 * Since maps are sorted, if we've passed the end, we
> +			 * can stop searching.
> +			 */
> +			break;
> +		}
> +
> +		line_idx++;
> +	}
> +
> +	if (first_match_idx == -1) {
> +		ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
> +				TEST_AREA);
> +	} else {
> +		/* Second pass: Print the relevant lines */
> +		fseek(f, 0, SEEK_SET); /* Rewind file */
> +		line_idx = 0;
> +		while (fgets(line, sizeof(line), f)) {
> +			/* Print 2 lines before the first match */
> +			if (line_idx >= first_match_idx - 2 &&
> +				line_idx < first_match_idx)
> +				ksft_print_msg("   %s", line);
> +
> +			/* Print all matching TEST_AREA entries */
> +			if (line_idx >= first_match_idx &&
> +				line_idx <= last_match_idx)
> +				ksft_print_msg(">> %s", line);
> +
> +			/* Print 2 lines after the last match */
> +			if (line_idx > last_match_idx &&
> +				line_idx <= last_match_idx + 2)
> +				ksft_print_msg("   %s", line);
> +
> +			line_idx++;
> +		}
> +	}
> +
> +	ksft_print_msg("--------------------------------------------------\n");
> +
> +	fclose(f);
> +}
> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> index 471e539d82b8..3794b50ec280 100755
> --- a/tools/testing/selftests/mm/run_vmtests.sh
> +++ b/tools/testing/selftests/mm/run_vmtests.sh
> @@ -49,6 +49,8 @@ separated by spaces:
>  	test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
>  - madv_populate
>  	test memadvise(2) MADV_POPULATE_{READ,WRITE} options
> +- max_vma_count
> +	tests for max vma_count
>  - memfd_secret
>  	test memfd_secret(2)
>  - process_mrelease
> @@ -417,6 +419,9 @@ fi # VADDR64
>  # vmalloc stability smoke test
>  CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
>
> +# test operations against max vma count limit
> +CATEGORY="max_vma_count" run_test ./max_vma_count_tests
> +
>  CATEGORY="mremap" run_test ./mremap_dontunmap
>
>  CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
> --
> 2.51.0.384.g4c02a37b29-goog
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-18 11:38         ` Pedro Falcato
@ 2025-09-18 14:48           ` Steven Rostedt
  0 siblings, 0 replies; 68+ messages in thread
From: Steven Rostedt @ 2025-09-18 14:48 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Kalesh Singh, akpm, minchan, lorenzo.stoakes, david, Liam.Howlett,
	rppt, kernel-team, android-mm, Alexander Viro, Christian Brauner,
	Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, 18 Sep 2025 12:38:56 +0100
Pedro Falcato <pfalcato@suse.de> wrote:

> Isn't the usual problem with that approach, that of static functions/static
> inline functions? I was tracing through a problem a few months ago, and man
> I really did think "wouldn't it be nice to have a tracepoint instead of fishing
> around for kprobe spots".
> 
> Not that I particularly think a tracepoint is super worth it in this case, but,
> y'know.

Yes, it would be useful. The issue is that tracepoints are not free. They
do increase the I$ hit and take up memory.

If you're going to inject a tracepoint somewhere, at least extract some
useful information from that spot. If you can't think of anything to track,
then it's not worth a tracepoint.

If one really wants a way to track something, they could add a static
branch that would call a function when enabled that could be traced. They
would also need a way to enable that static branch.

-- Steve

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
  2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
  2025-09-17 13:41   ` David Hildenbrand
  2025-09-18 11:46   ` Pedro Falcato
@ 2025-09-18 14:48   ` Lorenzo Stoakes
  2 siblings, 0 replies; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:48 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:35AM -0700, Kalesh Singh wrote:
> A mechanical rename of the mm_struct->map_count field to
> vma_count; no functional change is intended.
>
> The name "map_count" is ambiguous within the memory management subsystem,
> as it can be confused with the folio/page->_mapcount field, which tracks
> PTE references.
>
> The new name, vma_count, is more precise as this field has always
> counted the number of vm_area_structs associated with an mm_struct.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

Yeah this is nice thanks, LGTM so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

One small comment below, but I think that change can wait for another series
probably.

> ---
>
>  Changes in v2:
>   - map_count is easily confused with _mapcount rename to vma_count, per David
>
>  fs/binfmt_elf.c                  |  2 +-
>  fs/coredump.c                    |  2 +-
>  include/linux/mm_types.h         |  2 +-
>  kernel/fork.c                    |  2 +-
>  mm/debug.c                       |  2 +-
>  mm/mmap.c                        |  6 +++---
>  mm/nommu.c                       |  6 +++---
>  mm/vma.c                         | 24 ++++++++++++------------
>  tools/testing/vma/vma.c          | 32 ++++++++++++++++----------------
>  tools/testing/vma/vma_internal.h |  6 +++---
>  10 files changed, 42 insertions(+), 42 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 264fba0d44bd..52449dec12cb 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1643,7 +1643,7 @@ static int fill_files_note(struct memelfnote *note, struct coredump_params *cprm
>  	data[0] = count;
>  	data[1] = PAGE_SIZE;
>  	/*
> -	 * Count usually is less than mm->map_count,
> +	 * Count usually is less than mm->vma_count,
>  	 * we need to move filenames down.
>  	 */
>  	n = cprm->vma_count - count;
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 60bc9685e149..8881459c53d9 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -1731,7 +1731,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
>
>  	cprm->vma_data_size = 0;
>  	gate_vma = get_gate_vma(mm);
> -	cprm->vma_count = mm->map_count + (gate_vma ? 1 : 0);
> +	cprm->vma_count = mm->vma_count + (gate_vma ? 1 : 0);
>
>  	cprm->vma_meta = kvmalloc_array(cprm->vma_count, sizeof(*cprm->vma_meta), GFP_KERNEL);
>  	if (!cprm->vma_meta) {
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 08bc2442db93..4343be2f9e85 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1020,7 +1020,7 @@ struct mm_struct {
>  #ifdef CONFIG_MMU
>  		atomic_long_t pgtables_bytes;	/* size of all page tables */
>  #endif
> -		int map_count;			/* number of VMAs */
> +		int vma_count;			/* number of VMAs */
>
>  		spinlock_t page_table_lock; /* Protects page tables and some
>  					     * counters
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4ada32598bd..8fcbbf947579 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	mmap_init_lock(mm);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm_pgtables_bytes_init(mm);
> -	mm->map_count = 0;
> +	mm->vma_count = 0;
>  	mm->locked_vm = 0;
>  	atomic64_set(&mm->pinned_vm, 0);
>  	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> diff --git a/mm/debug.c b/mm/debug.c
> index b4388f4dcd4d..40fc9425a84a 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -204,7 +204,7 @@ void dump_mm(const struct mm_struct *mm)
>  		mm->pgd, atomic_read(&mm->mm_users),
>  		atomic_read(&mm->mm_count),
>  		mm_pgtables_bytes(mm),
> -		mm->map_count,
> +		mm->vma_count,
>  		mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
>  		(u64)atomic64_read(&mm->pinned_vm),
>  		mm->data_vm, mm->exec_vm, mm->stack_vm,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index af88ce1fbb5f..c6769394a174 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm)
>  		vma = vma_next(&vmi);
>  	} while (vma && likely(!xa_is_zero(vma)));
>
> -	BUG_ON(count != mm->map_count);
> +	BUG_ON(count != mm->vma_count);

Be nice to switch this to a WARN_ON_ONCE()... think David mentioned it though.

But maybe not one for this series...

>
>  	trace_exit_mmap(mm);
>  destroy:
> @@ -1517,7 +1517,7 @@ static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
>   */
>  int vma_count_remaining(const struct mm_struct *mm)
>  {
> -	const int map_count = mm->map_count;
> +	const int map_count = mm->vma_count;
>  	const int max_count = sysctl_max_map_count;
>
>  	return (max_count > map_count) ? (max_count - map_count) : 0;
> @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  		 */
>  		vma_iter_bulk_store(&vmi, tmp);
>
> -		mm->map_count++;
> +		mm->vma_count++;
>
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index dd75f2334812..9ab2e5ca736d 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -576,7 +576,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
>
>  static void cleanup_vma_from_mm(struct vm_area_struct *vma)
>  {
> -	vma->vm_mm->map_count--;
> +	vma->vm_mm->vma_count--;
>  	/* remove the VMA from the mapping */
>  	if (vma->vm_file) {
>  		struct address_space *mapping;
> @@ -1198,7 +1198,7 @@ unsigned long do_mmap(struct file *file,
>  		goto error_just_free;
>
>  	setup_vma_to_mm(vma, current->mm);
> -	current->mm->map_count++;
> +	current->mm->vma_count++;
>  	/* add the VMA to the tree */
>  	vma_iter_store_new(&vmi, vma);
>
> @@ -1366,7 +1366,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	setup_vma_to_mm(vma, mm);
>  	setup_vma_to_mm(new, mm);
>  	vma_iter_store_new(vmi, new);
> -	mm->map_count++;
> +	mm->vma_count++;
>  	return 0;
>
>  err_vmi_preallocate:
> diff --git a/mm/vma.c b/mm/vma.c
> index df0e8409f63d..64f4e7c867c3 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>  		 * (it may either follow vma or precede it).
>  		 */
>  		vma_iter_store_new(vmi, vp->insert);
> -		mm->map_count++;
> +		mm->vma_count++;
>  	}
>
>  	if (vp->anon_vma) {
> @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>  		}
>  		if (vp->remove->anon_vma)
>  			anon_vma_merge(vp->vma, vp->remove);
> -		mm->map_count--;
> +		mm->vma_count--;
>  		mpol_put(vma_policy(vp->remove));
>  		if (!vp->remove2)
>  			WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
> @@ -683,13 +683,13 @@ void validate_mm(struct mm_struct *mm)
>  		}
>  #endif
>  		/* Check for a infinite loop */
> -		if (++i > mm->map_count + 10) {
> +		if (++i > mm->vma_count + 10) {
>  			i = -1;
>  			break;
>  		}
>  	}
> -	if (i != mm->map_count) {
> -		pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i);
> +	if (i != mm->vma_count) {
> +		pr_emerg("vma_count %d vma iterator %d\n", mm->vma_count, i);
>  		bug = 1;
>  	}
>  	VM_BUG_ON_MM(bug, mm);
> @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
>  	struct mm_struct *mm;
>
>  	mm = current->mm;
> -	mm->map_count -= vms->vma_count;
> +	mm->vma_count -= vms->vma_count;
>  	mm->locked_vm -= vms->locked_vm;
>  	if (vms->unlock)
>  		mmap_write_downgrade(mm);
> @@ -1340,14 +1340,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  	if (vms->start > vms->vma->vm_start) {
>
>  		/*
> -		 * Make sure that map_count on return from munmap() will
> +		 * Make sure that vma_count on return from munmap() will
>  		 * not exceed its limit; but let map_count go just above
>  		 * its limit temporarily, to help free resources as expected.
>  		 */
>  		if (vms->end < vms->vma->vm_end &&
>  		    !vma_count_remaining(vms->vma->vm_mm)) {
>  			error = -ENOMEM;
> -			goto map_count_exceeded;
> +			goto vma_count_exceeded;
>  		}
>
>  		/* Don't bother splitting the VMA if we can't unmap it anyway */
> @@ -1461,7 +1461,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  modify_vma_failed:
>  	reattach_vmas(mas_detach);
>  start_split_failed:
> -map_count_exceeded:
> +vma_count_exceeded:
>  	return error;
>  }
>
> @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
>  	vma_start_write(vma);
>  	vma_iter_store_new(&vmi, vma);
>  	vma_link_file(vma);
> -	mm->map_count++;
> +	mm->vma_count++;
>  	validate_mm(mm);
>  	return 0;
>  }
> @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	/* Lock the VMA since it is modified after insertion into VMA tree */
>  	vma_start_write(vma);
>  	vma_iter_store_new(vmi, vma);
> -	map->mm->map_count++;
> +	map->mm->vma_count++;
>  	vma_link_file(vma);
>
>  	/*
> @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
>  		goto mas_store_fail;
>
> -	mm->map_count++;
> +	mm->vma_count++;
>  	validate_mm(mm);
>  out:
>  	perf_event_mmap(vma);
> diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> index 656e1c75b711..69fa7d14a6c2 100644
> --- a/tools/testing/vma/vma.c
> +++ b/tools/testing/vma/vma.c
> @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
>  	}
>
>  	mtree_destroy(&mm->mm_mt);
> -	mm->map_count = 0;
> +	mm->vma_count = 0;
>  	return count;
>  }
>
> @@ -500,7 +500,7 @@ static bool test_merge_new(void)
>  	INIT_LIST_HEAD(&vma_d->anon_vma_chain);
>  	list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain);
>  	ASSERT_FALSE(merged);
> -	ASSERT_EQ(mm.map_count, 4);
> +	ASSERT_EQ(mm.vma_count, 4);
>
>  	/*
>  	 * Merge BOTH sides.
> @@ -519,7 +519,7 @@ static bool test_merge_new(void)
>  	ASSERT_EQ(vma->vm_pgoff, 0);
>  	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 3);
> +	ASSERT_EQ(mm.vma_count, 3);
>
>  	/*
>  	 * Merge to PREVIOUS VMA.
> @@ -536,7 +536,7 @@ static bool test_merge_new(void)
>  	ASSERT_EQ(vma->vm_pgoff, 0);
>  	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 3);
> +	ASSERT_EQ(mm.vma_count, 3);
>
>  	/*
>  	 * Merge to NEXT VMA.
> @@ -555,7 +555,7 @@ static bool test_merge_new(void)
>  	ASSERT_EQ(vma->vm_pgoff, 6);
>  	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 3);
> +	ASSERT_EQ(mm.vma_count, 3);
>
>  	/*
>  	 * Merge BOTH sides.
> @@ -573,7 +573,7 @@ static bool test_merge_new(void)
>  	ASSERT_EQ(vma->vm_pgoff, 0);
>  	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 2);
> +	ASSERT_EQ(mm.vma_count, 2);
>
>  	/*
>  	 * Merge to NEXT VMA.
> @@ -591,7 +591,7 @@ static bool test_merge_new(void)
>  	ASSERT_EQ(vma->vm_pgoff, 0xa);
>  	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 2);
> +	ASSERT_EQ(mm.vma_count, 2);
>
>  	/*
>  	 * Merge BOTH sides.
> @@ -608,7 +608,7 @@ static bool test_merge_new(void)
>  	ASSERT_EQ(vma->vm_pgoff, 0);
>  	ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 1);
> +	ASSERT_EQ(mm.vma_count, 1);
>
>  	/*
>  	 * Final state.
> @@ -967,7 +967,7 @@ static bool test_vma_merge_new_with_close(void)
>  	ASSERT_EQ(vma->vm_pgoff, 0);
>  	ASSERT_EQ(vma->vm_ops, &vm_ops);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 2);
> +	ASSERT_EQ(mm.vma_count, 2);
>
>  	cleanup_mm(&mm, &vmi);
>  	return true;
> @@ -1017,7 +1017,7 @@ static bool test_merge_existing(void)
>  	ASSERT_EQ(vma->vm_pgoff, 2);
>  	ASSERT_TRUE(vma_write_started(vma));
>  	ASSERT_TRUE(vma_write_started(vma_next));
> -	ASSERT_EQ(mm.map_count, 2);
> +	ASSERT_EQ(mm.vma_count, 2);
>
>  	/* Clear down and reset. */
>  	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> @@ -1045,7 +1045,7 @@ static bool test_merge_existing(void)
>  	ASSERT_EQ(vma_next->vm_pgoff, 2);
>  	ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma_next));
> -	ASSERT_EQ(mm.map_count, 1);
> +	ASSERT_EQ(mm.vma_count, 1);
>
>  	/* Clear down and reset. We should have deleted vma. */
>  	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1079,7 +1079,7 @@ static bool test_merge_existing(void)
>  	ASSERT_EQ(vma->vm_pgoff, 6);
>  	ASSERT_TRUE(vma_write_started(vma_prev));
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 2);
> +	ASSERT_EQ(mm.vma_count, 2);
>
>  	/* Clear down and reset. */
>  	ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> @@ -1108,7 +1108,7 @@ static bool test_merge_existing(void)
>  	ASSERT_EQ(vma_prev->vm_pgoff, 0);
>  	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma_prev));
> -	ASSERT_EQ(mm.map_count, 1);
> +	ASSERT_EQ(mm.vma_count, 1);
>
>  	/* Clear down and reset. We should have deleted vma. */
>  	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1138,7 +1138,7 @@ static bool test_merge_existing(void)
>  	ASSERT_EQ(vma_prev->vm_pgoff, 0);
>  	ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
>  	ASSERT_TRUE(vma_write_started(vma_prev));
> -	ASSERT_EQ(mm.map_count, 1);
> +	ASSERT_EQ(mm.vma_count, 1);
>
>  	/* Clear down and reset. We should have deleted prev and next. */
>  	ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1540,7 +1540,7 @@ static bool test_merge_extend(void)
>  	ASSERT_EQ(vma->vm_end, 0x4000);
>  	ASSERT_EQ(vma->vm_pgoff, 0);
>  	ASSERT_TRUE(vma_write_started(vma));
> -	ASSERT_EQ(mm.map_count, 1);
> +	ASSERT_EQ(mm.vma_count, 1);
>
>  	cleanup_mm(&mm, &vmi);
>  	return true;
> @@ -1652,7 +1652,7 @@ static bool test_mmap_region_basic(void)
>  			     0x24d, NULL);
>  	ASSERT_EQ(addr, 0x24d000);
>
> -	ASSERT_EQ(mm.map_count, 2);
> +	ASSERT_EQ(mm.vma_count, 2);
>
>  	for_each_vma(vmi, vma) {
>  		if (vma->vm_start == 0x300000) {
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 52cd7ddc73f4..15525b86145d 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -251,7 +251,7 @@ struct mutex {};
>
>  struct mm_struct {
>  	struct maple_tree mm_mt;
> -	int map_count;			/* number of VMAs */
> +	int vma_count;			/* number of VMAs */
>  	unsigned long total_vm;	   /* Total pages mapped */
>  	unsigned long locked_vm;   /* Pages that have PG_mlocked set */
>  	unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
> @@ -1520,10 +1520,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
>  /* Helper to get VMA count capacity */
>  static int vma_count_remaining(const struct mm_struct *mm)
>  {
> -	const int map_count = mm->map_count;
> +	const int vma_count = mm->vma_count;
>  	const int max_count = sysctl_max_map_count;
>
> -	return (max_count > map_count) ? (max_count - map_count) : 0;
> +	return (max_count > vma_count) ? (max_count - vma_count) : 0;
>  }
>
>  #endif	/* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 5/7] mm: harden vma_count against direct modification
  2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
  2025-09-16  9:10   ` kernel test robot
@ 2025-09-18 14:52   ` Lorenzo Stoakes
  2025-09-18 15:43     ` Kalesh Singh
  1 sibling, 1 reply; 68+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:52 UTC (permalink / raw)
  To: Kalesh Singh
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Mon, Sep 15, 2025 at 09:36:36AM -0700, Kalesh Singh wrote:
> To make VMA counting more robust, prevent direct modification of the
> mm->vma_count field. This is achieved by making the public-facing
> member const via a union and requiring all modifications to go
> through a new set of helper functions the operate on a private
> __vma_count.
>
> While there are no other invariants tied to vma_count currently, this
> structural change improves maintainability; as it creates a single,
> centralized point for any future logic, such as adding debug checks
> or updating related statistics (in subsequent patches).
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>

Hmmm I"m not sure about this one.

I think this is a 'we don't need it' situation, and it's making everything a bit
uglier.

I especially hate vma_count_add() and vma_count_sub(). You're essentially
overridding the whole concept in these cases to make stuff that's already in
place work in those cases

I don't think this really adds much honestly.

(You're also clearly missing cases as the kernel bot has found issues)

> ---
>  include/linux/mm.h               | 25 +++++++++++++++++++++++++
>  include/linux/mm_types.h         |  5 ++++-
>  kernel/fork.c                    |  2 +-
>  mm/mmap.c                        |  2 +-
>  mm/vma.c                         | 12 ++++++------
>  tools/testing/vma/vma.c          |  2 +-
>  tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++-
>  7 files changed, 67 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 138bab2988f8..8bad1454984c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
>
>  void snapshot_page(struct page_snapshot *ps, const struct page *page);
>
> +static inline void vma_count_init(struct mm_struct *mm)
> +{
> +	ACCESS_PRIVATE(mm, __vma_count) = 0;
> +}
> +
> +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +{
> +	ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> +}
> +
> +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> +{
> +	vma_count_add(mm, -nr_vmas);
> +}
> +
> +static inline void vma_count_inc(struct mm_struct *mm)
> +{
> +	vma_count_add(mm, 1);
> +}
> +
> +static inline void vma_count_dec(struct mm_struct *mm)
> +{
> +	vma_count_sub(mm, 1);
> +}
> +
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4343be2f9e85..2ea8fc722aa2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1020,7 +1020,10 @@ struct mm_struct {
>  #ifdef CONFIG_MMU
>  		atomic_long_t pgtables_bytes;	/* size of all page tables */
>  #endif
> -		int vma_count;			/* number of VMAs */
> +		union {
> +			const int vma_count;		/* number of VMAs */
> +			int __private __vma_count;
> +		};
>
>  		spinlock_t page_table_lock; /* Protects page tables and some
>  					     * counters
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8fcbbf947579..ea9eff416e51 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>  	mmap_init_lock(mm);
>  	INIT_LIST_HEAD(&mm->mmlist);
>  	mm_pgtables_bytes_init(mm);
> -	mm->vma_count = 0;
> +	vma_count_init(mm);
>  	mm->locked_vm = 0;
>  	atomic64_set(&mm->pinned_vm, 0);
>  	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c6769394a174..30ddd550197e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  		 */
>  		vma_iter_bulk_store(&vmi, tmp);
>
> -		mm->vma_count++;
> +		vma_count_inc(mm);
>
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
> diff --git a/mm/vma.c b/mm/vma.c
> index 64f4e7c867c3..0cd3cb472220 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>  		 * (it may either follow vma or precede it).
>  		 */
>  		vma_iter_store_new(vmi, vp->insert);
> -		mm->vma_count++;
> +		vma_count_inc(mm);
>  	}
>
>  	if (vp->anon_vma) {
> @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
>  		}
>  		if (vp->remove->anon_vma)
>  			anon_vma_merge(vp->vma, vp->remove);
> -		mm->vma_count--;
> +		vma_count_dec(mm);
>  		mpol_put(vma_policy(vp->remove));
>  		if (!vp->remove2)
>  			WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
> @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
>  	struct mm_struct *mm;
>
>  	mm = current->mm;
> -	mm->vma_count -= vms->vma_count;
> +	vma_count_sub(mm, vms->vma_count);
>  	mm->locked_vm -= vms->locked_vm;
>  	if (vms->unlock)
>  		mmap_write_downgrade(mm);
> @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
>  	vma_start_write(vma);
>  	vma_iter_store_new(&vmi, vma);
>  	vma_link_file(vma);
> -	mm->vma_count++;
> +	vma_count_inc(mm);
>  	validate_mm(mm);
>  	return 0;
>  }
> @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	/* Lock the VMA since it is modified after insertion into VMA tree */
>  	vma_start_write(vma);
>  	vma_iter_store_new(vmi, vma);
> -	map->mm->vma_count++;
> +	vma_count_inc(map->mm);
>  	vma_link_file(vma);
>
>  	/*
> @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
>  		goto mas_store_fail;
>
> -	mm->vma_count++;
> +	vma_count_inc(mm);
>  	validate_mm(mm);
>  out:
>  	perf_event_mmap(vma);
> diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> index 69fa7d14a6c2..ee5a1e2365e0 100644
> --- a/tools/testing/vma/vma.c
> +++ b/tools/testing/vma/vma.c
> @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
>  	}
>
>  	mtree_destroy(&mm->mm_mt);
> -	mm->vma_count = 0;
> +	vma_count_init(mm);
>  	return count;
>  }
>
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 15525b86145d..6e724ba1adf4 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -251,7 +251,10 @@ struct mutex {};
>
>  struct mm_struct {
>  	struct maple_tree mm_mt;
> -	int vma_count;			/* number of VMAs */
> +	union {
> +		const int vma_count;		/* number of VMAs */
> +		int __vma_count;
> +	};
>  	unsigned long total_vm;	   /* Total pages mapped */
>  	unsigned long locked_vm;   /* Pages that have PG_mlocked set */
>  	unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
> @@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm)
>  	return (max_count > vma_count) ? (max_count - vma_count) : 0;
>  }
>
> +static inline void vma_count_init(struct mm_struct *mm)
> +{
> +	mm->__vma_count = 0;
> +}
> +
> +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +{
> +	mm->__vma_count += nr_vmas;
> +}
> +
> +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> +{
> +	vma_count_add(mm, -nr_vmas);
> +}
> +
> +static inline void vma_count_inc(struct mm_struct *mm)
> +{
> +	vma_count_add(mm, 1);
> +}
> +
> +static inline void vma_count_dec(struct mm_struct *mm)
> +{
> +	vma_count_sub(mm, 1);
> +}
> +
>  #endif	/* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 5/7] mm: harden vma_count against direct modification
  2025-09-18 14:52   ` Lorenzo Stoakes
@ 2025-09-18 15:43     ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-18 15:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, Sep 18, 2025 at 7:52 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 09:36:36AM -0700, Kalesh Singh wrote:
> > To make VMA counting more robust, prevent direct modification of the
> > mm->vma_count field. This is achieved by making the public-facing
> > member const via a union and requiring all modifications to go
> > through a new set of helper functions the operate on a private
> > __vma_count.
> >
> > While there are no other invariants tied to vma_count currently, this
> > structural change improves maintainability; as it creates a single,
> > centralized point for any future logic, such as adding debug checks
> > or updating related statistics (in subsequent patches).
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
>
> Hmmm I"m not sure about this one.
>
> I think this is a 'we don't need it' situation, and it's making everything a bit
> uglier.
>
> I especially hate vma_count_add() and vma_count_sub(). You're essentially
> overridding the whole concept in these cases to make stuff that's already in
> place work in those cases
>
> I don't think this really adds much honestly.

Hi Lorenzo,

Thanks for reviewing.

This was done to facilitate adding the assertions. So I'll drop this
along with that in the next version.

Thanks,
Kalesh

>
> (You're also clearly missing cases as the kernel bot has found issues)
>
> > ---
> >  include/linux/mm.h               | 25 +++++++++++++++++++++++++
> >  include/linux/mm_types.h         |  5 ++++-
> >  kernel/fork.c                    |  2 +-
> >  mm/mmap.c                        |  2 +-
> >  mm/vma.c                         | 12 ++++++------
> >  tools/testing/vma/vma.c          |  2 +-
> >  tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++-
> >  7 files changed, 67 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 138bab2988f8..8bad1454984c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
> >
> >  void snapshot_page(struct page_snapshot *ps, const struct page *page);
> >
> > +static inline void vma_count_init(struct mm_struct *mm)
> > +{
> > +     ACCESS_PRIVATE(mm, __vma_count) = 0;
> > +}
> > +
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > +     ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> > +}
> > +
> > +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> > +{
> > +     vma_count_add(mm, -nr_vmas);
> > +}
> > +
> > +static inline void vma_count_inc(struct mm_struct *mm)
> > +{
> > +     vma_count_add(mm, 1);
> > +}
> > +
> > +static inline void vma_count_dec(struct mm_struct *mm)
> > +{
> > +     vma_count_sub(mm, 1);
> > +}
> > +
> >  #endif /* _LINUX_MM_H */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 4343be2f9e85..2ea8fc722aa2 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1020,7 +1020,10 @@ struct mm_struct {
> >  #ifdef CONFIG_MMU
> >               atomic_long_t pgtables_bytes;   /* size of all page tables */
> >  #endif
> > -             int vma_count;                  /* number of VMAs */
> > +             union {
> > +                     const int vma_count;            /* number of VMAs */
> > +                     int __private __vma_count;
> > +             };
> >
> >               spinlock_t page_table_lock; /* Protects page tables and some
> >                                            * counters
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 8fcbbf947579..ea9eff416e51 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >       mmap_init_lock(mm);
> >       INIT_LIST_HEAD(&mm->mmlist);
> >       mm_pgtables_bytes_init(mm);
> > -     mm->vma_count = 0;
> > +     vma_count_init(mm);
> >       mm->locked_vm = 0;
> >       atomic64_set(&mm->pinned_vm, 0);
> >       memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c6769394a174..30ddd550197e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >                */
> >               vma_iter_bulk_store(&vmi, tmp);
> >
> > -             mm->vma_count++;
> > +             vma_count_inc(mm);
> >
> >               if (tmp->vm_ops && tmp->vm_ops->open)
> >                       tmp->vm_ops->open(tmp);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 64f4e7c867c3..0cd3cb472220 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >                * (it may either follow vma or precede it).
> >                */
> >               vma_iter_store_new(vmi, vp->insert);
> > -             mm->vma_count++;
> > +             vma_count_inc(mm);
> >       }
> >
> >       if (vp->anon_vma) {
> > @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >               }
> >               if (vp->remove->anon_vma)
> >                       anon_vma_merge(vp->vma, vp->remove);
> > -             mm->vma_count--;
> > +             vma_count_dec(mm);
> >               mpol_put(vma_policy(vp->remove));
> >               if (!vp->remove2)
> >                       WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
> > @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> >       struct mm_struct *mm;
> >
> >       mm = current->mm;
> > -     mm->vma_count -= vms->vma_count;
> > +     vma_count_sub(mm, vms->vma_count);
> >       mm->locked_vm -= vms->locked_vm;
> >       if (vms->unlock)
> >               mmap_write_downgrade(mm);
> > @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> >       vma_start_write(vma);
> >       vma_iter_store_new(&vmi, vma);
> >       vma_link_file(vma);
> > -     mm->vma_count++;
> > +     vma_count_inc(mm);
> >       validate_mm(mm);
> >       return 0;
> >  }
> > @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >       /* Lock the VMA since it is modified after insertion into VMA tree */
> >       vma_start_write(vma);
> >       vma_iter_store_new(vmi, vma);
> > -     map->mm->vma_count++;
> > +     vma_count_inc(map->mm);
> >       vma_link_file(vma);
> >
> >       /*
> > @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> >               goto mas_store_fail;
> >
> > -     mm->vma_count++;
> > +     vma_count_inc(mm);
> >       validate_mm(mm);
> >  out:
> >       perf_event_mmap(vma);
> > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> > index 69fa7d14a6c2..ee5a1e2365e0 100644
> > --- a/tools/testing/vma/vma.c
> > +++ b/tools/testing/vma/vma.c
> > @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
> >       }
> >
> >       mtree_destroy(&mm->mm_mt);
> > -     mm->vma_count = 0;
> > +     vma_count_init(mm);
> >       return count;
> >  }
> >
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 15525b86145d..6e724ba1adf4 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -251,7 +251,10 @@ struct mutex {};
> >
> >  struct mm_struct {
> >       struct maple_tree mm_mt;
> > -     int vma_count;                  /* number of VMAs */
> > +     union {
> > +             const int vma_count;            /* number of VMAs */
> > +             int __vma_count;
> > +     };
> >       unsigned long total_vm;    /* Total pages mapped */
> >       unsigned long locked_vm;   /* Pages that have PG_mlocked set */
> >       unsigned long data_vm;     /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
> > @@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm)
> >       return (max_count > vma_count) ? (max_count - vma_count) : 0;
> >  }
> >
> > +static inline void vma_count_init(struct mm_struct *mm)
> > +{
> > +     mm->__vma_count = 0;
> > +}
> > +
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > +     mm->__vma_count += nr_vmas;
> > +}
> > +
> > +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> > +{
> > +     vma_count_add(mm, -nr_vmas);
> > +}
> > +
> > +static inline void vma_count_inc(struct mm_struct *mm)
> > +{
> > +     vma_count_add(mm, 1);
> > +}
> > +
> > +static inline void vma_count_dec(struct mm_struct *mm)
> > +{
> > +     vma_count_sub(mm, 1);
> > +}
> > +
> >  #endif       /* __MM_VMA_INTERNAL_H */
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
  2025-09-18 14:31   ` Lorenzo Stoakes
@ 2025-09-18 15:52     ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-18 15:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, Sep 18, 2025 at 7:32 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
> > The checks against sysctl_max_map_count are open-coded in multiple
> > places. While simple checks are manageable, the logic in places like
> > mremap.c involves arithmetic with magic numbers that can be difficult
> > to reason about. e.g. ... >= sysctl_max_map_count - 3
> >
> > To improve readability and centralize the logic, introduce a new helper,
> > vma_count_remaining(). This function returns the VMA count headroom
> > available for a givine process.
> >
> > The most common case of checking for a single new VMA can be done with
> > the convenience helper has_vma_count_remaining():
> >
> >     if (!vma_count_remaining(mm))
> >
> > And the complex checks in mremap.c become clearer by expressing the
> > required capacity directly:
> >
> >     if (vma_count_remaining(mm) <  4)
>
> Double space after 4.

Will fix in the resend.

>
> >
> > While a capacity-based function could be misused (e.g., with an
> > incorrect '<' vs '<=' comparison), the improved readability at the call
> > sites makes such errors less likely than with the previous open-coded
> > arithmetic.
> >
> > As part of this change, sysctl_max_map_count is made static to
> > mm/mmap.c to improve encapsulation.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
>
> Generally logic looks ok, with various stuff fixed below + in commit msg we
> should be good to go :)
>
> > ---
> >
> > Changes in v2:
> >   - Fix documentation comment for vma_count_remaining(), per Mike
> >   - Remove extern in header, per Mike and Pedro
> >   - Move declaration to mm/internal.h, per Mike
> >   - Replace exceeds_max_map_count() with capacity-based vma_count_remaining(),
> >        per Lorenzo.
> >   - Fix tools/testing/vma, per Lorenzo.
> >
> >  include/linux/mm.h               |  2 --
> >  mm/internal.h                    |  2 ++
> >  mm/mmap.c                        | 21 ++++++++++++++++++++-
> >  mm/mremap.c                      |  7 ++++---
> >  mm/nommu.c                       |  2 +-
> >  mm/util.c                        |  1 -
> >  mm/vma.c                         | 10 +++++-----
> >  tools/testing/vma/vma_internal.h |  9 +++++++++
> >  8 files changed, 41 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1ae97a0b8ec7..138bab2988f8 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page)
> >  #define MAPCOUNT_ELF_CORE_MARGIN     (5)
> >  #define DEFAULT_MAX_MAP_COUNT        (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
> >
> > -extern int sysctl_max_map_count;
> > -
>
> Nice to get rid of this as a global! :)
>
> >  extern unsigned long sysctl_user_reserve_kbytes;
> >  extern unsigned long sysctl_admin_reserve_kbytes;
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 45b725c3dc03..39f1c9535ae5 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
> >  void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
> >  int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
> >
> > +int vma_count_remaining(const struct mm_struct *mm);
> > +
> >  #endif       /* __MM_INTERNAL_H */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index e5370e7fcd8f..af88ce1fbb5f 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> >               return -EOVERFLOW;
> >
> >       /* Too many mappings? */
> > -     if (mm->map_count >= sysctl_max_map_count)
> > +     if (!vma_count_remaining(mm))
> >               return -ENOMEM;
> >
> >       /*
> > @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
> >  int sysctl_legacy_va_layout;
> >  #endif
> >
> > +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> > +
> > +/**
> > + * vma_count_remaining - Determine available VMA slots
> > + * @mm: The memory descriptor for the process.
> > + *
> > + * Check how many more VMAs can be created for the given @mm
> > + * before hitting the sysctl_max_map_count limit.
> > + *
> > + * Return: The number of new VMAs the process can accommodate.
> > + */
> > +int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > +     const int map_count = mm->map_count;
> > +     const int max_count = sysctl_max_map_count;
>
> David already commented on the READ_ONCE() here, seems wise.
>
> > +
> > +     return (max_count > map_count) ? (max_count - map_count) : 0;
>
> Not a big deal but would prefer:
>
>         if (map_count >= map_count)
>                 return 0;
>
>         return max_count - map_count;
>
> As the ternary here is a bit less clear, and it puts the 'failure' case first
> and the 'success' case afterwards.
>
> > +}
> > +
>
> As discussed in reply to the kernel bot, you've accidentally placed this in a
> CONFIG_SYSCTL block, so need to move it :>)

Yeah I unintentionally placed it there, I've moved this out in a newer
version. Thanks.

>
> >  static const struct ctl_table mmap_table[] = {
> >               {
> >                               .procname       = "max_map_count",
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 35de0a7b910e..14d35d87e89b 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
> >        * We'd prefer to avoid failure later on in do_munmap:
> >        * which may split one vma into three before unmapping.
> >        */
> > -     if (current->mm->map_count >= sysctl_max_map_count - 3)
> > +     if (vma_count_remaining(current->mm) < 4)
> >               return -ENOMEM;
>
> This is much clearer.
>
> >
> >       if (vma->vm_ops && vma->vm_ops->may_split) {
> > @@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> >        * split in 3 before unmapping it.
> >        * That means 2 more maps (1 for each) to the ones we already hold.
> >        * Check whether current map count plus 2 still leads us to 4 maps below
> > -      * the threshold, otherwise return -ENOMEM here to be more safe.
> > +      * the threshold. In other words, is the current map count + 6 at or
> > +      * below the threshold? Otherwise return -ENOMEM here to be more safe.
> >        */
> > -     if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
> > +     if (vma_count_remaining(current->mm) < 6)
> >               return -ENOMEM;
>
> I hate that we do this silly check here, but the time to revisit it is another
> series...

Agreed, I also wondered if this check is more conservative than it
needs to be in all cases.

-- Kalesh

>
> >
> >       return 0;
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 8b819fafd57b..dd75f2334812 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >               return -ENOMEM;
> >
> >       mm = vma->vm_mm;
> > -     if (mm->map_count >= sysctl_max_map_count)
> > +     if (!vma_count_remaining(mm))
> >               return -ENOMEM;
> >
> >       region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
> > diff --git a/mm/util.c b/mm/util.c
> > index f814e6a59ab1..b6e83922cafe 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy);
> >  int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
> >  static int sysctl_overcommit_ratio __read_mostly = 50;
> >  static unsigned long sysctl_overcommit_kbytes __read_mostly;
> > -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> >  unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
> >  unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 033a388bc4b1..df0e8409f63d 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> >  }
> >
> >  /*
> > - * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
> > - * has already been checked or doesn't make sense to fail.
> > + * __split_vma() bypasses vma_count_remaining() checks. We use this where
> > + * it has already been checked or doesn't make sense to fail.
> >   * VMA Iterator will point to the original VMA.
> >   */
> >  static __must_check int
> > @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >                    unsigned long addr, int new_below)
> >  {
> > -     if (vma->vm_mm->map_count >= sysctl_max_map_count)
> > +     if (!vma_count_remaining(vma->vm_mm))
> >               return -ENOMEM;
> >
> >       return __split_vma(vmi, vma, addr, new_below);
> > @@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> >                * its limit temporarily, to help free resources as expected.
> >                */
> >               if (vms->end < vms->vma->vm_end &&
> > -                 vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
> > +                 !vma_count_remaining(vms->vma->vm_mm)) {
> >                       error = -ENOMEM;
> >                       goto map_count_exceeded;
> >               }
> > @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
> >               return -ENOMEM;
> >
> > -     if (mm->map_count >= sysctl_max_map_count)
> > +     if (!vma_count_remaining(mm))
> >               return -ENOMEM;
> >
> >       if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 3639aa8dd2b0..52cd7ddc73f4 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> >       return vm_flags;
> >  }
> >
> > +/* Helper to get VMA count capacity */
> > +static int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > +     const int map_count = mm->map_count;
> > +     const int max_count = sysctl_max_map_count;
> > +
> > +     return (max_count > map_count) ? (max_count - map_count) : 0;
> > +}
> > +
> >  #endif       /* __MM_VMA_INTERNAL_H */
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
  2025-09-18 13:51     ` Lorenzo Stoakes
@ 2025-09-18 15:55       ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-18 15:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, Sep 18, 2025 at 6:52 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
>
> On Thu, Sep 18, 2025 at 02:42:16PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
> > > Needed observability on in field devices can be collected with minimal
> > > overhead and can be toggled on and off. Event driven telemetry can be
> > > done with tracepoint BPF programs.
> > >
> > > The process comm is provided for aggregation across devices and tgid is
> > > to enable per-process aggregation per device.
> > >
> > > This allows for observing the distribution of such problems in the
> > > field, to deduce if there are legitimate bugs or if a bump to the limit is
> > > warranted.
> >
> > It's not really a bug though is it? It's somebody running out of resources.
> >
> > I'm not sure how useful this is really. But I'm open to being convinced!
> >
> > I also wonder if this is better as a statistic? You'd figure out it was a
> > problem that way too right?
> >
> > >
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: Mike Rapoport <rppt@kernel.org>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Pedro Falcato <pfalcato@suse.de>
> > > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> >
> > This breaks the VMA tests, please make sure to always check them:
> >
> > cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined    -c -o vma.o vma.c
> > In file included from vma.c:33:
> > ../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory
> >    10 | #include <trace/events/vma.h>
> >       |          ^~~~~~~~~~~~~~~~~~~~
> > compilation terminated.
> > make: *** [<builtin>: vma.o] Error 1
>
> Trivial build fix:
>
> ----8<----
> From fe4c30abbd302ccc628ec92381ac10cea31c6d85 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Thu, 18 Sep 2025 14:47:10 +0100
> Subject: [PATCH] fix
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/vma.c                         | 2 --
>  mm/vma_internal.h                | 2 ++
>  tools/testing/vma/vma_internal.h | 4 ++++
>  3 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 26046b28cdda..a11d29a2ddc0 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -7,8 +7,6 @@
>  #include "vma_internal.h"
>  #include "vma.h"
>
> -#include <trace/events/vma.h>
> -
>  struct mmap_state {
>         struct mm_struct *mm;
>         struct vma_iterator *vmi;
> diff --git a/mm/vma_internal.h b/mm/vma_internal.h
> index 2f05735ff190..2f5ba679f43d 100644
> --- a/mm/vma_internal.h
> +++ b/mm/vma_internal.h
> @@ -47,6 +47,8 @@
>  #include <linux/uprobes.h>
>  #include <linux/userfaultfd_k.h>
>
> +#include <trace/events/vma.h>
> +
>  #include <asm/current.h>
>  #include <asm/tlb.h>
>
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 07f4108c5e4c..c08c91861b9a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -1661,4 +1661,8 @@ static inline void vma_count_dec(struct mm_struct *mm)
>         vma_count_sub(mm, 1);
>  }
>
> +static void trace_max_vma_count_exceeded(struct task_struct *task)
> +{
> +}
> +
>  #endif /* __MM_VMA_INTERNAL_H */

I made a point to build and run your tests, seems I forgot to actually
test it with this last patch.

Thanks for the fix.

--Kalesh

> --
> 2.51.0

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
  2025-09-18 14:42   ` Lorenzo Stoakes
@ 2025-09-18 16:21     ` Kalesh Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Kalesh Singh @ 2025-09-18 16:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
	android-mm, Alexander Viro, Christian Brauner, Jan Kara,
	Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, Sep 18, 2025 at 7:42 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 09:36:33AM -0700, Kalesh Singh wrote:
> > Add a new selftest to verify that the max VMA count limit is correctly
> > enforced.
> >
> > This test suite checks that various VMA operations (mmap, mprotect,
> > munmap, mremap) succeed or fail as expected when the number of VMAs is
> > close to the sysctl_max_map_count limit.
> >
> > The test works by first creating a large number of VMAs to bring the
> > process close to the limit, and then performing various operations that
> > may or may not create new VMAs. The test then verifies that the
> > operations that would exceed the limit fail, and that the operations
> > that do not exceed the limit succeed.
> >
> > NOTE: munmap is special as it's allowed to temporarily exceed the limit
> > by one for splits as this will decrease back to the limit once the unmap
> > succeeds.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
> >
> > Changes in v2:
> >  - Add tests, per Liam (note that the do_brk_flags() path is not easily
> >    tested from userspace, so it's not included here). Exceeding the limit there
> >    should be uncommon.
> >
> >  tools/testing/selftests/mm/Makefile           |   1 +
> >  .../selftests/mm/max_vma_count_tests.c        | 709 ++++++++++++++++++
>
> If you add a new file, update MAINTAINERS. Should put in MEMORY MAPPING section.
>
> >  tools/testing/selftests/mm/run_vmtests.sh     |   5 +
>
> You're missing a .gitignore.
>
> >  3 files changed, 715 insertions(+)
> >  create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
> >
> > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> > index d13b3cef2a2b..00a4b04eab06 100644
> > --- a/tools/testing/selftests/mm/Makefile
> > +++ b/tools/testing/selftests/mm/Makefile
> > @@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress
> >  TEST_GEN_FILES += uffd-stress
> >  TEST_GEN_FILES += uffd-unit-tests
> >  TEST_GEN_FILES += uffd-wp-mremap
> > +TEST_GEN_FILES += max_vma_count_tests
> >  TEST_GEN_FILES += split_huge_page_test
> >  TEST_GEN_FILES += ksm_tests
> >  TEST_GEN_FILES += ksm_functional_tests
> > diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c
> > new file mode 100644
> > index 000000000000..c8401c03425c
> > --- /dev/null
> > +++ b/tools/testing/selftests/mm/max_vma_count_tests.c
> > @@ -0,0 +1,709 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright 2025 Google LLC
> > + */
> > +#define _GNU_SOURCE
> > +
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <errno.h>
> > +#include <stdbool.h>
> > +#include <linux/prctl.h>  /* Definition of PR_* constants */
> > +#include <sys/prctl.h>
> > +
> > +#include "../kselftest.h"
> > +
> > +static int get_max_vma_count(void);
> > +static bool set_max_vma_count(int val);
> > +static int get_current_vma_count(void);
> > +static bool is_current_vma_count(const char *msg, int expected);
> > +static bool is_test_area_mapped(const char *msg);
> > +static void print_surrounding_maps(const char *msg);
> > +
> > +/* Globals initialized in test_suite_setup() */
> > +static int MAX_VMA_COUNT;
> > +static int ORIGINAL_MAX_VMA_COUNT;
> > +static int PAGE_SIZE;
> > +static int GUARD_SIZE;
> > +static int TEST_AREA_SIZE;
> > +static int EXTRA_MAP_SIZE;
> > +
> > +static int MAX_VMA_COUNT;
> > +
> > +static int NR_EXTRA_MAPS;
> > +
> > +static char *TEST_AREA;
> > +static char *EXTRA_MAPS;
> > +
> > +#define DEFAULT_MAX_MAP_COUNT        65530
> > +#define TEST_AREA_NR_PAGES   3
> > +/* 1 before test area + 1 after test area + 1 after extra mappings */
> > +#define NR_GUARDS            3
> > +#define TEST_AREA_PROT               (PROT_NONE)
> > +#define EXTRA_MAP_PROT               (PROT_NONE)
> > +
> > +/**
> > + * test_suite_setup - Set up the VMA layout for VMA count testing.
> > + *
> > + * Sets up the following VMA layout:
> > + *
> > + * +----- base_addr
> > + * |
> > + * V
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> > + * |  Guard Page  |                      |  Guard Page  |  Extra Map 1   | Unmapped Gap |  Extra Map 2   | Unmapped Gap | ... |  Extra Map N   | Unmapped Gap |
> > + * |  (unmapped)  |      TEST_AREA       |  (unmapped)  | (mapped page)  |  (1 page)    | (mapped page)  |  (1 page)    | ... | (mapped page)  |  (1 page)    |
> > + * |   (1 page)   | (unmapped, 3 pages)  |   (1 page)   |    (1 page)    |              |    (1 page)    |              |     |    (1 page)    |              |
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
>
> I have no idea what a 'guard page' is here. You're not using MADV_GUARD_INSTALL so it's presumably an arbitrary PROT_NONE page or something?

I've changed this to "hole" per David's earlier feedbac.

>
> > + * ^              ^                      ^              ^                                                                  ^
> > + * |              |                      |              |                                                                  |
> > + * +--GUARD_SIZE--+                      |              +-- EXTRA_MAPS points here             Sufficient EXTRA_MAPS to ---+
> > + *    (PAGE_SIZE) |                      |                                                         reach MAX_VMA_COUNT
> > + *                |                      |
> > + *                +--- TEST_AREA_SIZE ---+
> > + *                |   (3 * PAGE_SIZE)    |
> > + *                ^
> > + *                |
> > + *                +-- TEST_AREA starts here
>
> Can we make this vertical, it's obv >80 chars and it's kind of hard to read.

Will do.

>
> Also isn't this hugely convoluted? Do we really need to set up such a
> complicated set of VMAs for every test?

I thought about playing with max_map_count to achieve the needed
conditions for each test. But since this is global lowering it too
significantly will introduce issues in other system processes while
running the test (if the new max count) is lower than the current of
any process.

We only set this up once for all the tests.

>
> > + *
> > + * Populates TEST_AREA and other globals required for the tests.
> > + * If successful, the current VMA count will be MAX_VMA_COUNT - 1.
> > + *
> > + * Return: true on success, false on failure.
> > + */
> > +static bool test_suite_setup(void)
> > +{
> > +     int initial_vma_count;
> > +     size_t reservation_size;
> > +     void *base_addr = NULL;
> > +     char *ptr = NULL;
> > +
> > +     ksft_print_msg("Setting up vma_max_count test suite...\n");
> > +
> > +     /* Initialize globals */
> > +     PAGE_SIZE = sysconf(_SC_PAGESIZE);
> > +     TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
> > +     GUARD_SIZE = PAGE_SIZE;
> > +     EXTRA_MAP_SIZE = PAGE_SIZE;
> > +     MAX_VMA_COUNT = get_max_vma_count();
>
> Dude we are not COBOL programmers :) What is it with these capitalised variable
> names?
>
> This is really horrible, please don't do that. And why on earth are you using
> globals like this.

My intention was to avoid passing a ton of parameters to each test. I
guess I could have thread them through some struct as well :)

>
> (Again) use the self test harness please :)

Let me rework it with the harness ...
>
> > +
> > +     MAX_VMA_COUNT = get_max_vma_count();
> > +     if (MAX_VMA_COUNT < 0) {
> > +             ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
> > +             return false;
> > +     }
> > +
> > +     /*
> > +      * If the current limit is higher than the kernel default,
> > +      * we attempt to lower it to the default to ensure the test
> > +      * can run with a reliably known boundary.
> > +      */
> > +     ORIGINAL_MAX_VMA_COUNT = 0;
> > +
> > +     if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
> > +             ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
> > +
> > +             ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
> > +                             MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
> > +
> > +             if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
> > +                     ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)n",
> > +                                     DEFAULT_MAX_MAP_COUNT);
>
> Probably partly a product of you not using the harness, but you're violating the
> max chars per line all over the place.
>
> checkpatch.pl kinda has a fit with this file, can we just try and keep it to ~80 chars please?
>
> > +                     ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
> > +
> > +                     MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
> > +
> > +                     return false;
> > +             }
> > +
> > +             /* Update MAX_VMA_COUNT for the test run */
> > +             MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
> > +     }
> > +
> > +     initial_vma_count = get_current_vma_count();
> > +     if (initial_vma_count < 0) {
> > +             ksft_print_msg("Failed to read /proc/self/maps\n");
> > +             return false;
> > +     }
> > +
> > +     /*
> > +      * Calculate how many extra mappings we need to create to reach
> > +      * MAX_VMA_COUNT - 1 (excluding test area).
> > +      */
> > +     NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
> > +
> > +     if (NR_EXTRA_MAPS < 1) {
> > +             ksft_print_msg("Not enough available maps to run test\n");
> > +             ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
> > +                             MAX_VMA_COUNT, initial_vma_count);
> > +             return false;
> > +     }
> > +
> > +     /*
> > +      * Reserve space for:
> > +      * - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
> > +      * - The test area itself (TEST_AREA_NR_PAGES)
> > +      * - The guard pages (NR_GUARDS)
> > +      */
> > +     reservation_size = ((NR_EXTRA_MAPS * 2) +
> > +                             TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
> > +
> > +     base_addr = mmap(NULL, reservation_size, PROT_NONE,
> > +                     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > +     if (base_addr == MAP_FAILED) {
> > +             ksft_print_msg("Failed tommap initial reservation\n");
> > +             return false;
> > +     }
> > +
> > +     if (munmap(base_addr, reservation_size) == -1) {
> > +             ksft_print_msg("Failed to munmap initial reservation\n");
> > +             return false;
> > +     }
> > +
> > +     /* Get the addr of the test area */
> > +     TEST_AREA = (char *)base_addr + GUARD_SIZE;
> > +
> > +     /*
> > +      * Get the addr of the region for extra mappings:
> > +      *     test area + 1 guard.
> > +      */
> > +     EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
> > +
> > +     /* Create single-page mappings separated by unmapped pages */
> > +     ptr = EXTRA_MAPS;
> > +     for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
> > +             if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
> > +                     MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> > +                     -1, 0) == MAP_FAILED) {
> > +                     perror("mmap in fill loop");
> > +                     ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
> > +                                     NR_EXTRA_MAPS);
> > +                     return false;
> > +             }
> > +
> > +             /* Advance pointer by 2 to leave a gap */
> > +             ptr += (2 * EXTRA_MAP_SIZE);
> > +     }
> > +
> > +     if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
> > +             return false;
> > +
> > +     ksft_print_msg("vma_max_count test suite setup done.\n");
> > +
> > +     return true;
> > +}
>
> I mean it'll be fixed by the test harness, but even in test code let's please
> not write MASSIVE functions that do a million things. It's not really readable.

I'll try breaking this up in the next revision.

>
> > +
> > +static void test_suite_teardown(void)
> > +{
> > +     if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
> > +             if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
> > +                     ksft_print_msg("Failed to restore max_map_count to %d\n",
> > +                                     ORIGINAL_MAX_VMA_COUNT);
> > +     }
> > +}
> > +
> > +/* --- Test Helper Functions --- */
> > +static bool mmap_anon(void)
> > +{
> > +     void *addr =  mmap(NULL, PAGE_SIZE, PROT_READ,
> > +                        MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > +
> > +     /*
> > +      * Handle cleanup here as the runner doesn't track where this,
> > +      *mapping is located.
> > +      */
> > +     if (addr != MAP_FAILED)
> > +             munmap(addr, PAGE_SIZE);
> > +
> > +     return addr != MAP_FAILED;
> > +}
> > +
> > +static inline bool __mprotect(char *addr, int size)
> > +{
> > +     int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
> > +
> > +     return mprotect(addr, size, new_prot) == 0;
> > +}
> > +
> > +static bool mprotect_nosplit(void)
> > +{
> > +     return __mprotect(TEST_AREA, TEST_AREA_SIZE);
> > +}
> > +
> > +static bool mprotect_2way_split(void)
> > +{
> > +     return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> > +}
> > +
> > +static bool mprotect_3way_split(void)
> > +{
> > +     return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> > +}
> > +
> > +static inline bool __munmap(char *addr, int size)
> > +{
> > +     return munmap(addr, size) == 0;
> > +}
> > +
> > +static bool munmap_nosplit(void)
> > +{
> > +     return __munmap(TEST_AREA, TEST_AREA_SIZE);
> > +}
> > +
> > +static bool munmap_2way_split(void)
> > +{
> > +     return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> > +}
> > +
> > +static bool munmap_3way_split(void)
> > +{
> > +     return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> > +}
> > +
> > +/* mremap accounts for the worst case to fail early */
> > +static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
> > +
> > +static bool mremap_dontunmap(void)
> > +{
> > +     void *new_addr;
> > +
> > +     /*
> > +      * Using MREMAP_DONTUNMAP will create a new mapping without
> > +      * removing the old one, consuming one VMA slot.
> > +      */
> > +     new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
> > +                       MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
> > +
> > +     if (new_addr != MAP_FAILED)
> > +             munmap(new_addr, TEST_AREA_SIZE);
> > +
> > +     return new_addr != MAP_FAILED;
> > +}
> > +
> > +struct test {
> > +     const char *name;
> > +     bool (*test)(void);
> > +      /* How many VMA slots below the limit this test needs to start? */
> > +     int vma_slots_needed;
> > +     bool expect_success;
> > +};
> > +
> > +/* --- Test Cases --- */
> > +struct test tests[] = {
> > +     {
> > +             .name = "mmap_at_1_below_vma_count_limit",
> > +             .test = mmap_anon,
> > +             .vma_slots_needed = 1,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "mmap_at_vma_count_limit",
> > +             .test = mmap_anon,
> > +             .vma_slots_needed = 0,
> > +             .expect_success = false,
> > +     },
> > +     {
> > +             .name = "mprotect_nosplit_at_1_below_vma_count_limit",
> > +             .test = mprotect_nosplit,
> > +             .vma_slots_needed = 1,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "mprotect_nosplit_at_vma_count_limit",
> > +             .test = mprotect_nosplit,
> > +             .vma_slots_needed = 0,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "mprotect_2way_split_at_1_below_vma_count_limit",
> > +             .test = mprotect_2way_split,
> > +             .vma_slots_needed = 1,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "mprotect_2way_split_at_vma_count_limit",
> > +             .test = mprotect_2way_split,
> > +             .vma_slots_needed = 0,
> > +             .expect_success = false,
> > +     },
> > +     {
> > +             .name = "mprotect_3way_split_at_2_below_vma_count_limit",
> > +             .test = mprotect_3way_split,
> > +             .vma_slots_needed = 2,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "mprotect_3way_split_at_1_below_vma_count_limit",
> > +             .test = mprotect_3way_split,
> > +             .vma_slots_needed = 1,
> > +             .expect_success = false,
> > +     },
> > +     {
> > +             .name = "mprotect_3way_split_at_vma_count_limit",
> > +             .test = mprotect_3way_split,
> > +             .vma_slots_needed = 0,
> > +             .expect_success = false,
> > +     },
> > +     {
> > +             .name = "munmap_nosplit_at_1_below_vma_count_limit",
> > +             .test = munmap_nosplit,
> > +             .vma_slots_needed = 1,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "munmap_nosplit_at_vma_count_limit",
> > +             .test = munmap_nosplit,
> > +             .vma_slots_needed = 0,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "munmap_2way_split_at_1_below_vma_count_limit",
> > +             .test = munmap_2way_split,
> > +             .vma_slots_needed = 1,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "munmap_2way_split_at_vma_count_limit",
> > +             .test = munmap_2way_split,
> > +             .vma_slots_needed = 0,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "munmap_3way_split_at_2_below_vma_count_limit",
> > +             .test = munmap_3way_split,
> > +             .vma_slots_needed = 2,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "munmap_3way_split_at_1_below_vma_count_limit",
> > +             .test = munmap_3way_split,
> > +             .vma_slots_needed = 1,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "munmap_3way_split_at_vma_count_limit",
> > +             .test = munmap_3way_split,
> > +             .vma_slots_needed = 0,
> > +             .expect_success = false,
> > +     },
> > +     {
> > +             .name = "mremap_dontunmap_at_required_vma_count_capcity",
> > +             .test = mremap_dontunmap,
> > +             .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
> > +             .expect_success = true,
> > +     },
> > +     {
> > +             .name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
> > +             .test = mremap_dontunmap,
> > +             .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
> > +             .expect_success = false,
> > +     },
>
> This is horrible. We don't need to do it this way.
>
> Use the kselftest_harness please. See the guard_regions.c test for an example of
> how to use it.

Will do, thanks for the reference.

>
> > +};
> > +
> > +/* --- Test Runner --- */
> > +int main(int argc, char **argv)
> > +{
> > +     int num_tests = ARRAY_SIZE(tests);
> > +     int failed_tests = 0;
> > +
> > +     ksft_set_plan(num_tests);
> > +
> > +     if (!test_suite_setup() != 0) {
> > +             if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
> > +                     ksft_exit_skip("max_map_count too high and cannot be lowered\n"
> > +                                     "Please rerun as root.\n");
> > +             else
> > +                     ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
> > +
> > +     }
> > +
> > +     for (int i = 0; i < num_tests; i++) {
> > +             int maps_to_unmap = tests[i].vma_slots_needed;
> > +             const char *name = tests[i].name;
> > +             bool test_passed;
> > +
> > +             errno = 0;
> > +
> > +             /* 1. Setup: TEST_AREA mapping */
> > +             if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
> > +                     MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
> > +                             == MAP_FAILED) {
> > +                     ksft_test_result_fail(
> > +                             "%s: Test setup failed to map TEST_AREA\n",
> > +                             name);
> > +                     maps_to_unmap = 0;
> > +                     goto fail;
> > +             }
> > +
> > +             /* Label TEST_AREA to ease debugging */
> > +             if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
> > +                     TEST_AREA_SIZE, "TEST_AREA")) {
> > +                     ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
> > +                                     name);
> > +                     ksft_print_msg(
> > +                             "Continuing without named TEST_AREA mapping\n");
> > +             }
> > +
> > +             /* 2. Setup: Adjust VMA count based on test requirements */
> > +             if (maps_to_unmap > NR_EXTRA_MAPS) {
> > +                     ksft_test_result_fail(
> > +                             "%s: Test setup failed: Invalid VMA slots required %d\n",
> > +                             name, tests[i].vma_slots_needed);
> > +                     maps_to_unmap = 0;
> > +                     goto fail;
> > +             }
> > +
> > +             /* Unmap extra mappings, accounting for the 1-page gap */
> > +             for (int j = 0; j < maps_to_unmap; j++)
> > +                     munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> > +                             EXTRA_MAP_SIZE);
> > +
> > +             /*
> > +              * 3. Verify the preconditions.
> > +              *
> > +              *    Sometimes there isn't an easy way to determine the cause
> > +              *    of the test failure.
> > +              *    e.g. an mprotect ENOMEM may be due to trying to protect
> > +              *         unmapped area or due to hitting MAX_VMA_COUNT limit.
> > +              *
> > +              *    We verify the preconditions of the test to ensure any
> > +              *    expected failures are from the expected cause and not
> > +              *    coincidental.
> > +              */
> > +             if (!is_current_vma_count(name,
> > +                     MAX_VMA_COUNT - tests[i].vma_slots_needed))
> > +                     goto fail;
> > +
> > +             if (!is_test_area_mapped(name))
> > +                     goto fail;
> > +
> > +             /* 4. Run the test */
> > +             test_passed = (tests[i].test() == tests[i].expect_success);
> > +             if (test_passed) {
> > +                     ksft_test_result_pass("%s\n", name);
> > +             } else {
> > +fail:
> > +                     failed_tests++;
> > +                     ksft_test_result_fail(
> > +                             "%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
> > +                             name, get_current_vma_count(), MAX_VMA_COUNT,
> > +                             errno, strerror(errno));
> > +                     print_surrounding_maps(name);
> > +             }
> > +
> > +             /* 5. Teardown: Unmap TEST_AREA. */
> > +             munmap(TEST_AREA, TEST_AREA_SIZE);
> > +
> > +             /* 6. Teardown: Restore extra mappings to test suite baseline */
> > +             for (int j = 0; j < maps_to_unmap; j++) {
> > +                     /* Remap extra mappings, accounting for the gap */
> > +                     mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> > +                             EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
> > +                             MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> > +                             -1, 0);
> > +             }
> > +     }
> > +
> > +     test_suite_teardown();
> > +
> > +     if (failed_tests > 0)
> > +             ksft_exit_fail();
> > +     else
> > +             ksft_exit_pass();
> > +}
> > +
> > +/* --- Utilities --- */
> > +
> > +static int get_max_vma_count(void)
> > +{
> > +     int max_count;
> > +     FILE *f;
> > +
> > +     f = fopen("/proc/sys/vm/max_map_count", "r");
> > +     if (!f)
> > +             return -1;
> > +
> > +     if (fscanf(f, "%d", &max_count) != 1)
> > +             max_count = -1;
> > +
> > +
> > +     fclose(f);
> > +
> > +     return max_count;
> > +}
> > +
> > +static bool set_max_vma_count(int val)
> > +{
> > +     FILE *f;
> > +     bool success = false;
> > +
> > +     f = fopen("/proc/sys/vm/max_map_count", "w");
> > +     if (!f)
> > +             return false;
> > +
> > +     if (fprintf(f, "%d", val) > 0)
> > +             success = true;
> > +
> > +     fclose(f);
> > +     return success;
> > +}
> > +
> > +static int get_current_vma_count(void)
> > +{
> > +     char line[1024];
> > +     int count = 0;
> > +     FILE *f;
> > +
> > +     f = fopen("/proc/self/maps", "r");
> > +     if (!f)
> > +             return -1;
> > +
> > +     while (fgets(line, sizeof(line), f)) {
> > +             if (!strstr(line, "[vsyscall]"))
> > +                     count++;
> > +     }
> > +
> > +     fclose(f);
> > +
> > +     return count;
> > +}
> > +
> > +static bool is_current_vma_count(const char *msg, int expected)
> > +{
> > +     int current = get_current_vma_count();
> > +
> > +     if (current == expected)
> > +             return true;
> > +
> > +     ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
> > +                     expected);
> > +     return false;
> > +}
> > +
> > +static bool is_test_area_mapped(const char *msg)
> > +{
> > +     unsigned long search_start = (unsigned long)TEST_AREA;
> > +     unsigned long search_end = search_start + TEST_AREA_SIZE;
> > +     bool found = false;
> > +     char line[1024];
> > +     FILE *f;
> > +
> > +     f = fopen("/proc/self/maps", "r");
> > +     if (!f) {
> > +             ksft_print_msg("failed to open /proc/self/maps\n");
> > +             return false;
> > +     }
> > +
> > +     while (fgets(line, sizeof(line), f)) {
> > +             unsigned long start, end;
> > +
> > +             if (sscanf(line, "%lx-%lx", &start, &end) != 2)
> > +                     continue;
> > +
> > +             /* Check for an exact match of the range */
> > +             if (start == search_start && end == search_end) {
> > +                     found = true;
> > +                     break;
> > +             } else if (start > search_end) {
> > +                     /*
> > +                      *Since maps are sorted, if we've passed the end, we
> > +                      * can stop searching.
> > +                      */
> > +                     break;
> > +             }
> > +     }
> > +
>
> There's helpers for this kind of thing in vm_util.h, let's not duplicate
> code. Also you can use the proc query ioctl thingy (also helpers for that in
> vm_util.h) rather than having to spelunk /proc/$pid/maps.

Thanks I'll use the helpers there instead.

>
> > +     fclose(f);
> > +
> > +     if (found)
> > +             return true;
> > +
> > +     /* Not found */
> > +     ksft_print_msg(
> > +             "%s: TEST_AREA is not mapped as a single contiguous block.\n",
> > +             msg);
> > +     print_surrounding_maps(msg);
> > +
> > +     return false;
> > +}
> > +
> > +static void print_surrounding_maps(const char *msg)
>
> WHy are you printing a bunch of noise in a test?

This is only printed in the failure case, it makes debugging much
easier. I can update it to a TH_LOG() instead once it's moved to the
test harness.

Thanks,
Kalesh
>
> > +{
> > +     unsigned long search_start = (unsigned long)TEST_AREA;
> > +     unsigned long search_end = search_start + TEST_AREA_SIZE;
> > +     unsigned long start;
> > +     unsigned long end;
> > +     char line[1024] = {};
> > +     int line_idx = 0;
> > +     int first_match_idx = -1;
> > +     int last_match_idx = -1;
> > +     FILE *f;
> > +
> > +     f = fopen("/proc/self/maps", "r");
> > +     if (!f)
> > +             return;
> > +
> > +     if (msg)
> > +             ksft_print_msg("%s\n", msg);
> > +
> > +     ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
> > +                     TEST_AREA);
> > +
> > +     /* First pass: Read all lines and find the range of matching entries */
> > +     fseek(f, 0, SEEK_SET); /* Rewind file */
> > +     while (fgets(line, sizeof(line), f)) {
> > +             if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
> > +                     line_idx++;
> > +                     continue;
> > +             }
> > +
> > +             /* Check for any overlap */
> > +             if (start < search_end && end > search_start) {
> > +                     if (first_match_idx == -1)
> > +                             first_match_idx = line_idx;
> > +                     last_match_idx = line_idx;
> > +             } else if (start > search_end) {
> > +                     /*
> > +                      * Since maps are sorted, if we've passed the end, we
> > +                      * can stop searching.
> > +                      */
> > +                     break;
> > +             }
> > +
> > +             line_idx++;
> > +     }
> > +
> > +     if (first_match_idx == -1) {
> > +             ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
> > +                             TEST_AREA);
> > +     } else {
> > +             /* Second pass: Print the relevant lines */
> > +             fseek(f, 0, SEEK_SET); /* Rewind file */
> > +             line_idx = 0;
> > +             while (fgets(line, sizeof(line), f)) {
> > +                     /* Print 2 lines before the first match */
> > +                     if (line_idx >= first_match_idx - 2 &&
> > +                             line_idx < first_match_idx)
> > +                             ksft_print_msg("   %s", line);
> > +
> > +                     /* Print all matching TEST_AREA entries */
> > +                     if (line_idx >= first_match_idx &&
> > +                             line_idx <= last_match_idx)
> > +                             ksft_print_msg(">> %s", line);
> > +
> > +                     /* Print 2 lines after the last match */
> > +                     if (line_idx > last_match_idx &&
> > +                             line_idx <= last_match_idx + 2)
> > +                             ksft_print_msg("   %s", line);
> > +
> > +                     line_idx++;
> > +             }
> > +     }
> > +
> > +     ksft_print_msg("--------------------------------------------------\n");
> > +
> > +     fclose(f);
> > +}
> > diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> > index 471e539d82b8..3794b50ec280 100755
> > --- a/tools/testing/selftests/mm/run_vmtests.sh
> > +++ b/tools/testing/selftests/mm/run_vmtests.sh
> > @@ -49,6 +49,8 @@ separated by spaces:
> >       test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
> >  - madv_populate
> >       test memadvise(2) MADV_POPULATE_{READ,WRITE} options
> > +- max_vma_count
> > +     tests for max vma_count
> >  - memfd_secret
> >       test memfd_secret(2)
> >  - process_mrelease
> > @@ -417,6 +419,9 @@ fi # VADDR64
> >  # vmalloc stability smoke test
> >  CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
> >
> > +# test operations against max vma count limit
> > +CATEGORY="max_vma_count" run_test ./max_vma_count_tests
> > +
> >  CATEGORY="mremap" run_test ./mremap_dontunmap
> >
> >  CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
  2025-09-18 12:49               ` Lorenzo Stoakes
@ 2025-09-18 20:59                 ` Andrew Morton
  0 siblings, 0 replies; 68+ messages in thread
From: Andrew Morton @ 2025-09-18 20:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Kalesh Singh, minchan, Liam.Howlett, rppt,
	pfalcato, kernel-team, android-mm, Alexander Viro,
	Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
	Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
	linux-kselftest

On Thu, 18 Sep 2025 13:49:33 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> > > I'm confused, why is the merge window a good time to consider new material?
> > >
> > > People have the entirety of the cycle to submit new material, and they do
> > > so.
> >
> > My view is that if you are sending a cleanup/feature during the merge window
> > you cannot expect a fast reply, and you should not keep sending new versions
> > in that timeframe expecting that all people you CCed that should have a look
> > actually did have a look.
> 
> Yes exactly.
> 
> The problem is all the conversations and respins and etc. _do_ carry on as
> normal, and often land in mm-new, queued up for mm-unstable etc. unless we
> happen to notice them.
> 
> So it makes it impossible for us to just ignore until the next cycle (or need to
> go through every thread that happened afterwards).
> 
> And people know that, so just keep on submitting as normal. That was _really_
> palpable last merge window.
> 

Well, what else do we have to do during the merge window?  The previous
cycle's batch is merged up and there may be some fallout from that, but
it isn't very time-consuming.

If you're proposing that we start to use that period as a break for
sanity purposes then OK, didn't see that one coming, don't know how
widespread this desire is.

But perhaps a better time for this quiet period is during -rc6 and -rc7
when the rate of merging is throttled right back.  Or perhaps from -rc7
to mid-merge-window.


^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2025-09-18 20:59 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
2025-09-15 22:36   ` Andrew Morton
2025-09-16 14:20     ` Jonathan Corbet
2025-09-17  1:16       ` Andrew Morton
2025-09-16  9:45   ` Pedro Falcato
2025-09-17  7:44   ` SeongJae Park
2025-09-17 10:52   ` David Hildenbrand
2025-09-18 11:31   ` Pedro Falcato
2025-09-18 13:53   ` Lorenzo Stoakes
2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
2025-09-17 10:58   ` David Hildenbrand
2025-09-17 16:49     ` Kalesh Singh
2025-09-18 14:42   ` Lorenzo Stoakes
2025-09-18 16:21     ` Kalesh Singh
2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
2025-09-16 18:33   ` kernel test robot
2025-09-18 14:10     ` Lorenzo Stoakes
2025-09-17 13:38   ` David Hildenbrand
2025-09-17 17:10     ` Kalesh Singh
2025-09-18 13:20   ` Lorenzo Stoakes
2025-09-18 13:26     ` Lorenzo Stoakes
2025-09-18 14:31   ` Lorenzo Stoakes
2025-09-18 15:52     ` Kalesh Singh
2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
2025-09-17 13:41   ` David Hildenbrand
2025-09-18 11:46   ` Pedro Falcato
2025-09-18 14:48   ` Lorenzo Stoakes
2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
2025-09-16  9:10   ` kernel test robot
2025-09-18 14:52   ` Lorenzo Stoakes
2025-09-18 15:43     ` Kalesh Singh
2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
2025-09-17 13:44   ` David Hildenbrand
2025-09-17 17:22     ` Kalesh Singh
2025-09-17 18:34       ` David Hildenbrand
2025-09-17 20:31         ` Kalesh Singh
2025-09-18 11:48     ` Pedro Falcato
2025-09-18 13:30   ` Lorenzo Stoakes
2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
2025-09-15 23:41   ` Steven Rostedt
2025-09-16  1:19     ` Kalesh Singh
2025-09-16 15:52       ` Steven Rostedt
2025-09-16 17:36         ` Kalesh Singh
2025-09-16 17:48           ` Steven Rostedt
2025-09-16 17:57             ` Kalesh Singh
2025-09-16 18:02               ` Steven Rostedt
2025-09-16 18:23                 ` Kalesh Singh
2025-09-16 18:51                   ` Steven Rostedt
2025-09-16 20:08                     ` Kalesh Singh
2025-09-18 11:38         ` Pedro Falcato
2025-09-18 14:48           ` Steven Rostedt
2025-09-18 13:42   ` Lorenzo Stoakes
2025-09-18 13:51     ` Lorenzo Stoakes
2025-09-18 15:55       ` Kalesh Singh
2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
2025-09-15 23:10   ` Kalesh Singh
2025-09-16  0:05     ` Andrew Morton
2025-09-16  1:23       ` Kalesh Singh
2025-09-16 10:12   ` Lorenzo Stoakes
2025-09-16 17:47     ` Kalesh Singh
2025-09-17  2:16     ` Andrew Morton
2025-09-17  5:36       ` Lorenzo Stoakes
2025-09-17 23:32         ` Andrew Morton
2025-09-18 10:29           ` Lorenzo Stoakes
2025-09-18 12:07             ` David Hildenbrand
2025-09-18 12:49               ` Lorenzo Stoakes
2025-09-18 20:59                 ` Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.