* [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
2025-09-15 22:36 ` Andrew Morton
` (5 more replies)
2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
` (6 subsequent siblings)
7 siblings, 6 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Kalesh Singh, stable, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
The VMA count limit check in do_mmap() and do_brk_flags() uses a
strict inequality (>), which allows a process's VMA count to exceed
the configured sysctl_max_map_count limit by one.
A process with mm->map_count == sysctl_max_map_count will incorrectly
pass this check and then exceed the limit upon allocation of a new VMA
when its map_count is incremented.
Other VMA allocation paths, such as split_vma(), already use the
correct, inclusive (>=) comparison.
Fix this bug by changing the comparison to be inclusive in do_mmap()
and do_brk_flags(), bringing them in line with the correct behavior
of other allocation paths.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Fix mmap check, per Pedro
mm/mmap.c | 2 +-
mm/vma.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 7306253cc3b5..e5370e7fcd8f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
return -EOVERFLOW;
/* Too many mappings? */
- if (mm->map_count > sysctl_max_map_count)
+ if (mm->map_count >= sysctl_max_map_count)
return -ENOMEM;
/*
diff --git a/mm/vma.c b/mm/vma.c
index 3b12c7579831..033a388bc4b1 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
return -ENOMEM;
- if (mm->map_count > sysctl_max_map_count)
+ if (mm->map_count >= sysctl_max_map_count)
return -ENOMEM;
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
@ 2025-09-15 22:36 ` Andrew Morton
2025-09-16 14:20 ` Jonathan Corbet
2025-09-16 9:45 ` Pedro Falcato
` (4 subsequent siblings)
5 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2025-09-15 22:36 UTC (permalink / raw)
To: Kalesh Singh
Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, stable, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
lol.
x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
661
we really blew it that time!
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 22:36 ` Andrew Morton
@ 2025-09-16 14:20 ` Jonathan Corbet
2025-09-17 1:16 ` Andrew Morton
0 siblings, 1 reply; 65+ messages in thread
From: Jonathan Corbet @ 2025-09-16 14:20 UTC (permalink / raw)
To: Andrew Morton, Kalesh Singh
Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, stable, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
Andrew Morton <akpm@linux-foundation.org> writes:
> On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
>
>> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
>
> lol.
>
> x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
> 661
>
> we really blew it that time!
A few years back I made a list of the most-fixed commits in the kernel
history; unsurprisingly that one (I call it the "original sin") came out
on top:
https://lwn.net/Articles/914632/
Perhaps the time has come to update that analysis.
jon
^ permalink raw reply [flat|nested] 65+ messages in thread* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-16 14:20 ` Jonathan Corbet
@ 2025-09-17 1:16 ` Andrew Morton
0 siblings, 0 replies; 65+ messages in thread
From: Andrew Morton @ 2025-09-17 1:16 UTC (permalink / raw)
To: Jonathan Corbet
Cc: Kalesh Singh, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, stable, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest, Dave Hansen
On Tue, 16 Sep 2025 08:20:14 -0600 Jonathan Corbet <corbet@lwn.net> wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
> > On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> >
> >> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> >
> > lol.
> >
> > x1:/usr/src/25> grep "Fixes.*1da177e4c3f4" ../gitlog|wc -l
> > 661
> >
> > we really blew it that time!
>
> A few years back I made a list of the most-fixed commits in the kernel
> history; unsurprisingly that one (I call it the "original sin") came out
> on top:
>
> https://lwn.net/Articles/914632/
>
> Perhaps the time has come to update that analysis.
>
Hah, thanks, that's a fun article.
It's a shame it's so recent. I do recall back in maybe 2002 a fresh
young developer popped up with a massive rework of the IDE drivers. I
promptly took this into my tree because I was having a hate on the IDE
drivers at the time. omfg, what a disaster - this was the buggiest
bunch of junk conceivable and it went away after a few weeks.
I'm glad to say that the developer (email handle at the time was
"haveblue") has become better since then.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
2025-09-15 22:36 ` Andrew Morton
@ 2025-09-16 9:45 ` Pedro Falcato
2025-09-17 7:44 ` SeongJae Park
` (3 subsequent siblings)
5 siblings, 0 replies; 65+ messages in thread
From: Pedro Falcato @ 2025-09-16 9:45 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
kernel-team, android-mm, stable, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
>
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
>
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
>
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Looks good, thanks!
--
Pedro
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
2025-09-15 22:36 ` Andrew Morton
2025-09-16 9:45 ` Pedro Falcato
@ 2025-09-17 7:44 ` SeongJae Park
2025-09-17 10:52 ` David Hildenbrand
` (2 subsequent siblings)
5 siblings, 0 replies; 65+ messages in thread
From: SeongJae Park @ 2025-09-17 7:44 UTC (permalink / raw)
To: Kalesh Singh
Cc: SeongJae Park, akpm, minchan, lorenzo.stoakes, david,
Liam.Howlett, rppt, pfalcato, kernel-team, android-mm, stable,
Alexander Viro, Christian Brauner, Jan Kara, Kees Cook,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, 15 Sep 2025 09:36:32 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
>
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
>
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
>
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
` (2 preceding siblings ...)
2025-09-17 7:44 ` SeongJae Park
@ 2025-09-17 10:52 ` David Hildenbrand
2025-09-18 11:31 ` Pedro Falcato
2025-09-18 13:53 ` Lorenzo Stoakes
5 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-09-17 10:52 UTC (permalink / raw)
To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, stable, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On 15.09.25 18:36, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
>
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
>
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
>
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
` (3 preceding siblings ...)
2025-09-17 10:52 ` David Hildenbrand
@ 2025-09-18 11:31 ` Pedro Falcato
2025-09-18 13:53 ` Lorenzo Stoakes
5 siblings, 0 replies; 65+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:31 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
kernel-team, android-mm, stable, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
>
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
>
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
>
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
--
Pedro
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
` (4 preceding siblings ...)
2025-09-18 11:31 ` Pedro Falcato
@ 2025-09-18 13:53 ` Lorenzo Stoakes
5 siblings, 0 replies; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:53 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, stable, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:32AM -0700, Kalesh Singh wrote:
> The VMA count limit check in do_mmap() and do_brk_flags() uses a
> strict inequality (>), which allows a process's VMA count to exceed
> the configured sysctl_max_map_count limit by one.
>
> A process with mm->map_count == sysctl_max_map_count will incorrectly
> pass this check and then exceed the limit upon allocation of a new VMA
> when its map_count is incremented.
>
> Other VMA allocation paths, such as split_vma(), already use the
> correct, inclusive (>=) comparison.
Nice spot :)
And also 'doh!'
>
> Fix this bug by changing the comparison to be inclusive in do_mmap()
> and do_brk_flags(), bringing them in line with the correct behavior
> of other allocation paths.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>
> Changes in v2:
> - Fix mmap check, per Pedro
>
> mm/mmap.c | 2 +-
> mm/vma.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7306253cc3b5..e5370e7fcd8f 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> return -EOVERFLOW;
>
> /* Too many mappings? */
> - if (mm->map_count > sysctl_max_map_count)
> + if (mm->map_count >= sysctl_max_map_count)
> return -ENOMEM;
>
> /*
> diff --git a/mm/vma.c b/mm/vma.c
> index 3b12c7579831..033a388bc4b1 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
> return -ENOMEM;
>
> - if (mm->map_count > sysctl_max_map_count)
> + if (mm->map_count >= sysctl_max_map_count)
> return -ENOMEM;
>
> if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> --
> 2.51.0.384.g4c02a37b29-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 2/7] mm/selftests: add max_vma_count tests
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
2025-09-17 10:58 ` David Hildenbrand
2025-09-18 14:42 ` Lorenzo Stoakes
2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
` (5 subsequent siblings)
7 siblings, 2 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
Add a new selftest to verify that the max VMA count limit is correctly
enforced.
This test suite checks that various VMA operations (mmap, mprotect,
munmap, mremap) succeed or fail as expected when the number of VMAs is
close to the sysctl_max_map_count limit.
The test works by first creating a large number of VMAs to bring the
process close to the limit, and then performing various operations that
may or may not create new VMAs. The test then verifies that the
operations that would exceed the limit fail, and that the operations
that do not exceed the limit succeed.
NOTE: munmap is special: it is allowed to temporarily exceed the limit
by one when splitting, because the count drops back to (or below) the
limit once the unmap completes.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Add tests, per Liam (note that the do_brk_flags() path is not easily
tested from userspace, so it's not included here). Exceeding the limit there
should be uncommon.
tools/testing/selftests/mm/Makefile | 1 +
.../selftests/mm/max_vma_count_tests.c | 709 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 5 +
3 files changed, 715 insertions(+)
create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index d13b3cef2a2b..00a4b04eab06 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress
TEST_GEN_FILES += uffd-stress
TEST_GEN_FILES += uffd-unit-tests
TEST_GEN_FILES += uffd-wp-mremap
+TEST_GEN_FILES += max_vma_count_tests
TEST_GEN_FILES += split_huge_page_test
TEST_GEN_FILES += ksm_tests
TEST_GEN_FILES += ksm_functional_tests
diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c
new file mode 100644
index 000000000000..c8401c03425c
--- /dev/null
+++ b/tools/testing/selftests/mm/max_vma_count_tests.c
@@ -0,0 +1,709 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2025 Google LLC
+ */
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdbool.h>
+#include <linux/prctl.h> /* Definition of PR_* constants */
+#include <sys/prctl.h>
+
+#include "../kselftest.h"
+
+static int get_max_vma_count(void);
+static bool set_max_vma_count(int val);
+static int get_current_vma_count(void);
+static bool is_current_vma_count(const char *msg, int expected);
+static bool is_test_area_mapped(const char *msg);
+static void print_surrounding_maps(const char *msg);
+
+/* Globals initialized in test_suite_setup() */
+static int MAX_VMA_COUNT;
+static int ORIGINAL_MAX_VMA_COUNT;
+static int PAGE_SIZE;
+static int GUARD_SIZE;
+static int TEST_AREA_SIZE;
+static int EXTRA_MAP_SIZE;
+
+
+static int NR_EXTRA_MAPS;
+
+static char *TEST_AREA;
+static char *EXTRA_MAPS;
+
+#define DEFAULT_MAX_MAP_COUNT 65530
+#define TEST_AREA_NR_PAGES 3
+/* 1 before test area + 1 after test area + 1 after extra mappings */
+#define NR_GUARDS 3
+#define TEST_AREA_PROT (PROT_NONE)
+#define EXTRA_MAP_PROT (PROT_NONE)
+
+/**
+ * test_suite_setup - Set up the VMA layout for VMA count testing.
+ *
+ * Sets up the following VMA layout:
+ *
+ * +----- base_addr
+ * |
+ * V
+ * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
+ * | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
+ * | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
+ * | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
+ * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
+ * ^ ^ ^ ^ ^
+ * | | | | |
+ * +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
+ * (PAGE_SIZE) | | reach MAX_VMA_COUNT
+ * | |
+ * +--- TEST_AREA_SIZE ---+
+ * | (3 * PAGE_SIZE) |
+ * ^
+ * |
+ * +-- TEST_AREA starts here
+ *
+ * Populates TEST_AREA and other globals required for the tests.
+ * If successful, the current VMA count will be MAX_VMA_COUNT - 1.
+ *
+ * Return: true on success, false on failure.
+ */
+static bool test_suite_setup(void)
+{
+ int initial_vma_count;
+ size_t reservation_size;
+ void *base_addr = NULL;
+ char *ptr = NULL;
+
+ ksft_print_msg("Setting up max_vma_count test suite...\n");
+
+ /* Initialize globals */
+ PAGE_SIZE = sysconf(_SC_PAGESIZE);
+ TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
+ GUARD_SIZE = PAGE_SIZE;
+ EXTRA_MAP_SIZE = PAGE_SIZE;
+
+ MAX_VMA_COUNT = get_max_vma_count();
+ if (MAX_VMA_COUNT < 0) {
+ ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
+ return false;
+ }
+
+ /*
+ * If the current limit is higher than the kernel default,
+ * we attempt to lower it to the default to ensure the test
+ * can run with a reliably known boundary.
+ */
+ ORIGINAL_MAX_VMA_COUNT = 0;
+
+ if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
+ ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
+
+ ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
+ MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
+
+ if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
+ ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)\n",
+ DEFAULT_MAX_MAP_COUNT);
+ ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
+
+ MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
+
+ return false;
+ }
+
+ /* Update MAX_VMA_COUNT for the test run */
+ MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
+ }
+
+ initial_vma_count = get_current_vma_count();
+ if (initial_vma_count < 0) {
+ ksft_print_msg("Failed to read /proc/self/maps\n");
+ return false;
+ }
+
+ /*
+ * Calculate how many extra mappings we need to create to reach
+ * MAX_VMA_COUNT - 1 (excluding test area).
+ */
+ NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
+
+ if (NR_EXTRA_MAPS < 1) {
+ ksft_print_msg("Not enough available maps to run test\n");
+ ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
+ MAX_VMA_COUNT, initial_vma_count);
+ return false;
+ }
+
+ /*
+ * Reserve space for:
+ * - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
+ * - The test area itself (TEST_AREA_NR_PAGES)
+ * - The guard pages (NR_GUARDS)
+ */
+ reservation_size = ((NR_EXTRA_MAPS * 2) +
+ TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
+
+ base_addr = mmap(NULL, reservation_size, PROT_NONE,
+ MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+ if (base_addr == MAP_FAILED) {
+ ksft_print_msg("Failed to mmap initial reservation\n");
+ return false;
+ }
+
+ if (munmap(base_addr, reservation_size) == -1) {
+ ksft_print_msg("Failed to munmap initial reservation\n");
+ return false;
+ }
+
+ /* Get the addr of the test area */
+ TEST_AREA = (char *)base_addr + GUARD_SIZE;
+
+ /*
+ * Get the addr of the region for extra mappings:
+ * test area + 1 guard.
+ */
+ EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
+
+ /* Create single-page mappings separated by unmapped pages */
+ ptr = EXTRA_MAPS;
+ for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
+ if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
+ MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
+ -1, 0) == MAP_FAILED) {
+ perror("mmap in fill loop");
+ ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
+ NR_EXTRA_MAPS);
+ return false;
+ }
+
+ /* Advance pointer by 2 to leave a gap */
+ ptr += (2 * EXTRA_MAP_SIZE);
+ }
+
+ if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
+ return false;
+
+ ksft_print_msg("max_vma_count test suite setup done.\n");
+
+ return true;
+}
+
+static void test_suite_teardown(void)
+{
+ if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
+ if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
+ ksft_print_msg("Failed to restore max_map_count to %d\n",
+ ORIGINAL_MAX_VMA_COUNT);
+ }
+}
+
+/* --- Test Helper Functions --- */
+static bool mmap_anon(void)
+{
+ void *addr = mmap(NULL, PAGE_SIZE, PROT_READ,
+ MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
+
+ /*
+ * Handle cleanup here as the runner doesn't track where this
+ * mapping is located.
+ */
+ if (addr != MAP_FAILED)
+ munmap(addr, PAGE_SIZE);
+
+ return addr != MAP_FAILED;
+}
+
+static inline bool __mprotect(char *addr, int size)
+{
+ int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
+
+ return mprotect(addr, size, new_prot) == 0;
+}
+
+static bool mprotect_nosplit(void)
+{
+ return __mprotect(TEST_AREA, TEST_AREA_SIZE);
+}
+
+static bool mprotect_2way_split(void)
+{
+ return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+
+static bool mprotect_3way_split(void)
+{
+ return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+
+static inline bool __munmap(char *addr, int size)
+{
+ return munmap(addr, size) == 0;
+}
+
+static bool munmap_nosplit(void)
+{
+ return __munmap(TEST_AREA, TEST_AREA_SIZE);
+}
+
+static bool munmap_2way_split(void)
+{
+ return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
+}
+
+static bool munmap_3way_split(void)
+{
+ return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
+}
+
+/* mremap accounts for the worst case to fail early */
+static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
+
+static bool mremap_dontunmap(void)
+{
+ void *new_addr;
+
+ /*
+ * Using MREMAP_DONTUNMAP will create a new mapping without
+ * removing the old one, consuming one VMA slot.
+ */
+ new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
+ MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
+
+ if (new_addr != MAP_FAILED)
+ munmap(new_addr, TEST_AREA_SIZE);
+
+ return new_addr != MAP_FAILED;
+}
+
+struct test {
+ const char *name;
+ bool (*test)(void);
+ /* How many VMA slots below the limit this test needs to start? */
+ int vma_slots_needed;
+ bool expect_success;
+};
+
+/* --- Test Cases --- */
+struct test tests[] = {
+ {
+ .name = "mmap_at_1_below_vma_count_limit",
+ .test = mmap_anon,
+ .vma_slots_needed = 1,
+ .expect_success = true,
+ },
+ {
+ .name = "mmap_at_vma_count_limit",
+ .test = mmap_anon,
+ .vma_slots_needed = 0,
+ .expect_success = false,
+ },
+ {
+ .name = "mprotect_nosplit_at_1_below_vma_count_limit",
+ .test = mprotect_nosplit,
+ .vma_slots_needed = 1,
+ .expect_success = true,
+ },
+ {
+ .name = "mprotect_nosplit_at_vma_count_limit",
+ .test = mprotect_nosplit,
+ .vma_slots_needed = 0,
+ .expect_success = true,
+ },
+ {
+ .name = "mprotect_2way_split_at_1_below_vma_count_limit",
+ .test = mprotect_2way_split,
+ .vma_slots_needed = 1,
+ .expect_success = true,
+ },
+ {
+ .name = "mprotect_2way_split_at_vma_count_limit",
+ .test = mprotect_2way_split,
+ .vma_slots_needed = 0,
+ .expect_success = false,
+ },
+ {
+ .name = "mprotect_3way_split_at_2_below_vma_count_limit",
+ .test = mprotect_3way_split,
+ .vma_slots_needed = 2,
+ .expect_success = true,
+ },
+ {
+ .name = "mprotect_3way_split_at_1_below_vma_count_limit",
+ .test = mprotect_3way_split,
+ .vma_slots_needed = 1,
+ .expect_success = false,
+ },
+ {
+ .name = "mprotect_3way_split_at_vma_count_limit",
+ .test = mprotect_3way_split,
+ .vma_slots_needed = 0,
+ .expect_success = false,
+ },
+ {
+ .name = "munmap_nosplit_at_1_below_vma_count_limit",
+ .test = munmap_nosplit,
+ .vma_slots_needed = 1,
+ .expect_success = true,
+ },
+ {
+ .name = "munmap_nosplit_at_vma_count_limit",
+ .test = munmap_nosplit,
+ .vma_slots_needed = 0,
+ .expect_success = true,
+ },
+ {
+ .name = "munmap_2way_split_at_1_below_vma_count_limit",
+ .test = munmap_2way_split,
+ .vma_slots_needed = 1,
+ .expect_success = true,
+ },
+ {
+ .name = "munmap_2way_split_at_vma_count_limit",
+ .test = munmap_2way_split,
+ .vma_slots_needed = 0,
+ .expect_success = true,
+ },
+ {
+ .name = "munmap_3way_split_at_2_below_vma_count_limit",
+ .test = munmap_3way_split,
+ .vma_slots_needed = 2,
+ .expect_success = true,
+ },
+ {
+ .name = "munmap_3way_split_at_1_below_vma_count_limit",
+ .test = munmap_3way_split,
+ .vma_slots_needed = 1,
+ .expect_success = true,
+ },
+ {
+ .name = "munmap_3way_split_at_vma_count_limit",
+ .test = munmap_3way_split,
+ .vma_slots_needed = 0,
+ .expect_success = false,
+ },
+ {
+ .name = "mremap_dontunmap_at_required_vma_count_capacity",
+ .test = mremap_dontunmap,
+ .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
+ .expect_success = true,
+ },
+ {
+ .name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
+ .test = mremap_dontunmap,
+ .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
+ .expect_success = false,
+ },
+};
+
+/* --- Test Runner --- */
+int main(int argc, char **argv)
+{
+ int num_tests = ARRAY_SIZE(tests);
+ int failed_tests = 0;
+
+ ksft_set_plan(num_tests);
+
+ if (!test_suite_setup()) {
+ if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
+ ksft_exit_skip("max_map_count too high and cannot be lowered\n"
+ "Please rerun as root.\n");
+ else
+ ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
+ }
+
+ for (int i = 0; i < num_tests; i++) {
+ int maps_to_unmap = tests[i].vma_slots_needed;
+ const char *name = tests[i].name;
+ bool test_passed;
+
+ errno = 0;
+
+ /* 1. Setup: TEST_AREA mapping */
+ if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
+ MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
+ == MAP_FAILED) {
+ ksft_test_result_fail(
+ "%s: Test setup failed to map TEST_AREA\n",
+ name);
+ maps_to_unmap = 0;
+ goto fail;
+ }
+
+ /* Label TEST_AREA to ease debugging */
+ if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
+ TEST_AREA_SIZE, "TEST_AREA")) {
+ ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
+ name);
+ ksft_print_msg(
+ "Continuing without named TEST_AREA mapping\n");
+ }
+
+ /* 2. Setup: Adjust VMA count based on test requirements */
+ if (maps_to_unmap > NR_EXTRA_MAPS) {
+ ksft_test_result_fail(
+ "%s: Test setup failed: Invalid VMA slots required %d\n",
+ name, tests[i].vma_slots_needed);
+ maps_to_unmap = 0;
+ goto fail;
+ }
+
+ /* Unmap extra mappings, accounting for the 1-page gap */
+ for (int j = 0; j < maps_to_unmap; j++)
+ munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
+ EXTRA_MAP_SIZE);
+
+ /*
+ * 3. Verify the preconditions.
+ *
+ * Sometimes there isn't an easy way to determine the cause
+ * of the test failure.
+ * e.g. an mprotect ENOMEM may be due to trying to protect an
+ * unmapped area or due to hitting the MAX_VMA_COUNT limit.
+ *
+ * We verify the preconditions of the test to ensure any
+ * expected failures are from the expected cause and not
+ * coincidental.
+ */
+ if (!is_current_vma_count(name,
+ MAX_VMA_COUNT - tests[i].vma_slots_needed))
+ goto fail;
+
+ if (!is_test_area_mapped(name))
+ goto fail;
+
+ /* 4. Run the test */
+ test_passed = (tests[i].test() == tests[i].expect_success);
+ if (test_passed) {
+ ksft_test_result_pass("%s\n", name);
+ } else {
+fail:
+ failed_tests++;
+ ksft_test_result_fail(
+ "%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
+ name, get_current_vma_count(), MAX_VMA_COUNT,
+ errno, strerror(errno));
+ print_surrounding_maps(name);
+ }
+
+ /* 5. Teardown: Unmap TEST_AREA. */
+ munmap(TEST_AREA, TEST_AREA_SIZE);
+
+ /* 6. Teardown: Restore extra mappings to test suite baseline */
+ for (int j = 0; j < maps_to_unmap; j++) {
+ /* Remap extra mappings, accounting for the gap */
+ mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
+ EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
+ MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
+ -1, 0);
+ }
+ }
+
+ test_suite_teardown();
+
+ if (failed_tests > 0)
+ ksft_exit_fail();
+ else
+ ksft_exit_pass();
+}
+
+/* --- Utilities --- */
+
+static int get_max_vma_count(void)
+{
+ int max_count;
+ FILE *f;
+
+ f = fopen("/proc/sys/vm/max_map_count", "r");
+ if (!f)
+ return -1;
+
+ if (fscanf(f, "%d", &max_count) != 1)
+ max_count = -1;
+
+ fclose(f);
+
+ return max_count;
+}
+
+static bool set_max_vma_count(int val)
+{
+ FILE *f;
+ bool success = false;
+
+ f = fopen("/proc/sys/vm/max_map_count", "w");
+ if (!f)
+ return false;
+
+ if (fprintf(f, "%d", val) > 0)
+ success = true;
+
+ fclose(f);
+ return success;
+}
+
+static int get_current_vma_count(void)
+{
+ char line[1024];
+ int count = 0;
+ FILE *f;
+
+ f = fopen("/proc/self/maps", "r");
+ if (!f)
+ return -1;
+
+ while (fgets(line, sizeof(line), f)) {
+ if (!strstr(line, "[vsyscall]"))
+ count++;
+ }
+
+ fclose(f);
+
+ return count;
+}
+
+static bool is_current_vma_count(const char *msg, int expected)
+{
+ int current = get_current_vma_count();
+
+ if (current == expected)
+ return true;
+
+ ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
+ expected);
+ return false;
+}
+
+static bool is_test_area_mapped(const char *msg)
+{
+ unsigned long search_start = (unsigned long)TEST_AREA;
+ unsigned long search_end = search_start + TEST_AREA_SIZE;
+ bool found = false;
+ char line[1024];
+ FILE *f;
+
+ f = fopen("/proc/self/maps", "r");
+ if (!f) {
+ ksft_print_msg("failed to open /proc/self/maps\n");
+ return false;
+ }
+
+ while (fgets(line, sizeof(line), f)) {
+ unsigned long start, end;
+
+ if (sscanf(line, "%lx-%lx", &start, &end) != 2)
+ continue;
+
+ /* Check for an exact match of the range */
+ if (start == search_start && end == search_end) {
+ found = true;
+ break;
+ } else if (start > search_end) {
+ /*
+ * Since maps are sorted, if we've passed the end, we
+ * can stop searching.
+ */
+ break;
+ }
+ }
+
+ fclose(f);
+
+ if (found)
+ return true;
+
+ /* Not found */
+ ksft_print_msg(
+ "%s: TEST_AREA is not mapped as a single contiguous block.\n",
+ msg);
+ print_surrounding_maps(msg);
+
+ return false;
+}
+
+static void print_surrounding_maps(const char *msg)
+{
+ unsigned long search_start = (unsigned long)TEST_AREA;
+ unsigned long search_end = search_start + TEST_AREA_SIZE;
+ unsigned long start;
+ unsigned long end;
+ char line[1024] = {};
+ int line_idx = 0;
+ int first_match_idx = -1;
+ int last_match_idx = -1;
+ FILE *f;
+
+ f = fopen("/proc/self/maps", "r");
+ if (!f)
+ return;
+
+ if (msg)
+ ksft_print_msg("%s\n", msg);
+
+ ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
+ TEST_AREA);
+
+ /* First pass: Read all lines and find the range of matching entries */
+ fseek(f, 0, SEEK_SET); /* Rewind file */
+ while (fgets(line, sizeof(line), f)) {
+ if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
+ line_idx++;
+ continue;
+ }
+
+ /* Check for any overlap */
+ if (start < search_end && end > search_start) {
+ if (first_match_idx == -1)
+ first_match_idx = line_idx;
+ last_match_idx = line_idx;
+ } else if (start > search_end) {
+ /*
+ * Since maps are sorted, if we've passed the end, we
+ * can stop searching.
+ */
+ break;
+ }
+
+ line_idx++;
+ }
+
+ if (first_match_idx == -1) {
+ ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
+ TEST_AREA);
+ } else {
+ /* Second pass: Print the relevant lines */
+ fseek(f, 0, SEEK_SET); /* Rewind file */
+ line_idx = 0;
+ while (fgets(line, sizeof(line), f)) {
+ /* Print 2 lines before the first match */
+ if (line_idx >= first_match_idx - 2 &&
+ line_idx < first_match_idx)
+ ksft_print_msg(" %s", line);
+
+ /* Print all matching TEST_AREA entries */
+ if (line_idx >= first_match_idx &&
+ line_idx <= last_match_idx)
+ ksft_print_msg(">> %s", line);
+
+ /* Print 2 lines after the last match */
+ if (line_idx > last_match_idx &&
+ line_idx <= last_match_idx + 2)
+ ksft_print_msg(" %s", line);
+
+ line_idx++;
+ }
+ }
+
+ ksft_print_msg("--------------------------------------------------\n");
+
+ fclose(f);
+}
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 471e539d82b8..3794b50ec280 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -49,6 +49,8 @@ separated by spaces:
test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
- madv_populate
test memadvise(2) MADV_POPULATE_{READ,WRITE} options
+- max_vma_count
+ tests for the max VMA count limit
- memfd_secret
test memfd_secret(2)
- process_mrelease
@@ -417,6 +419,9 @@ fi # VADDR64
# vmalloc stability smoke test
CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
+# test operations against max vma count limit
+CATEGORY="max_vma_count" run_test ./max_vma_count_tests
+
CATEGORY="mremap" run_test ./mremap_dontunmap
CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread

* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
@ 2025-09-17 10:58 ` David Hildenbrand
2025-09-17 16:49 ` Kalesh Singh
2025-09-18 14:42 ` Lorenzo Stoakes
1 sibling, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-09-17 10:58 UTC (permalink / raw)
To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
> + * test_suite_setup - Set up the VMA layout for VMA count testing.
> + *
> + * Sets up the following VMA layout:
> + *
> + * +----- base_addr
> + * |
> + * V
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> + * | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
> + * | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
> + * | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> + * ^ ^ ^ ^ ^
> + * | | | | |
> + * +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
> + * (PAGE_SIZE) | | reach MAX_VMA_COUNT
> + * | |
> + * +--- TEST_AREA_SIZE ---+
> + * | (3 * PAGE_SIZE) |
> + * ^
> + * |
> + * +-- TEST_AREA starts here
> + *
>
Just wondering if we could find a different name than "guard page" here,
to not confuse stuff with guard ptes
Will the current "guard page" be a valid vma or just a hole?
--
Cheers
David / dhildenb
* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
2025-09-17 10:58 ` David Hildenbrand
@ 2025-09-17 16:49 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-17 16:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Wed, Sep 17, 2025 at 3:58 AM 'David Hildenbrand' via kernel-team
<kernel-team@android.com> wrote:
>
> > + * test_suite_setup - Set up the VMA layout for VMA count testing.
> > + *
> > + * Sets up the following VMA layout:
> > + *
> > + * +----- base_addr
> > + * |
> > + * V
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> > + * | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
> > + * | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
> > + * | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> > + * ^ ^ ^ ^ ^
> > + * | | | | |
> > + * +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
> > + * (PAGE_SIZE) | | reach MAX_VMA_COUNT
> > + * | |
> > + * +--- TEST_AREA_SIZE ---+
> > + * | (3 * PAGE_SIZE) |
> > + * ^
> > + * |
> > + * +-- TEST_AREA starts here
> > + *
> >
>
Hi David,
Thanks for the reviews.
> Just wondering if we could find a different name than "guard page" here,
> to not confuse stuff with guard ptes
>
> Will the current "guard page" we a valid vma or just a hole?
It's a hole to prevent coincidental merging of adjacent VMAs. I can
rename it to HOLE in the next revision.
Thanks,
Kalesh
>
> --
> Cheers
>
> David / dhildenb
* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
2025-09-17 10:58 ` David Hildenbrand
@ 2025-09-18 14:42 ` Lorenzo Stoakes
2025-09-18 16:21 ` Kalesh Singh
1 sibling, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:42 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:33AM -0700, Kalesh Singh wrote:
> Add a new selftest to verify that the max VMA count limit is correctly
> enforced.
>
> This test suite checks that various VMA operations (mmap, mprotect,
> munmap, mremap) succeed or fail as expected when the number of VMAs is
> close to the sysctl_max_map_count limit.
>
> The test works by first creating a large number of VMAs to bring the
> process close to the limit, and then performing various operations that
> may or may not create new VMAs. The test then verifies that the
> operations that would exceed the limit fail, and that the operations
> that do not exceed the limit succeed.
>
> NOTE: munmap is special as it's allowed to temporarily exceed the limit
> by one for splits as this will decrease back to the limit once the unmap
> succeeds.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
>
> Changes in v2:
> - Add tests, per Liam (note that the do_brk_flags() path is not easily
> tested from userspace, so it's not included here). Exceeding the limit there
> should be uncommon.
>
> tools/testing/selftests/mm/Makefile | 1 +
> .../selftests/mm/max_vma_count_tests.c | 709 ++++++++++++++++++
If you add a new file, update MAINTAINERS. Should put in MEMORY MAPPING section.
> tools/testing/selftests/mm/run_vmtests.sh | 5 +
You're missing a .gitignore.
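(For reference, a sketch of what the missing entry would look like — the
binary name is taken from the Makefile hunk above; its placement within
the existing .gitignore is a guess:)

```
# tools/testing/selftests/mm/.gitignore
max_vma_count_tests
```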
> 3 files changed, 715 insertions(+)
> create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
>
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index d13b3cef2a2b..00a4b04eab06 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress
> TEST_GEN_FILES += uffd-stress
> TEST_GEN_FILES += uffd-unit-tests
> TEST_GEN_FILES += uffd-wp-mremap
> +TEST_GEN_FILES += max_vma_count_tests
> TEST_GEN_FILES += split_huge_page_test
> TEST_GEN_FILES += ksm_tests
> TEST_GEN_FILES += ksm_functional_tests
> diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c
> new file mode 100644
> index 000000000000..c8401c03425c
> --- /dev/null
> +++ b/tools/testing/selftests/mm/max_vma_count_tests.c
> @@ -0,0 +1,709 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2025 Google LLC
> + */
> +#define _GNU_SOURCE
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <stdbool.h>
> +#include <linux/prctl.h> /* Definition of PR_* constants */
> +#include <sys/prctl.h>
> +
> +#include "../kselftest.h"
> +
> +static int get_max_vma_count(void);
> +static bool set_max_vma_count(int val);
> +static int get_current_vma_count(void);
> +static bool is_current_vma_count(const char *msg, int expected);
> +static bool is_test_area_mapped(const char *msg);
> +static void print_surrounding_maps(const char *msg);
> +
> +/* Globals initialized in test_suite_setup() */
> +static int MAX_VMA_COUNT;
> +static int ORIGINAL_MAX_VMA_COUNT;
> +static int PAGE_SIZE;
> +static int GUARD_SIZE;
> +static int TEST_AREA_SIZE;
> +static int EXTRA_MAP_SIZE;
> +
> +static int MAX_VMA_COUNT;
> +
> +static int NR_EXTRA_MAPS;
> +
> +static char *TEST_AREA;
> +static char *EXTRA_MAPS;
> +
> +#define DEFAULT_MAX_MAP_COUNT 65530
> +#define TEST_AREA_NR_PAGES 3
> +/* 1 before test area + 1 after test area + 1 after extra mappings */
> +#define NR_GUARDS 3
> +#define TEST_AREA_PROT (PROT_NONE)
> +#define EXTRA_MAP_PROT (PROT_NONE)
> +
> +/**
> + * test_suite_setup - Set up the VMA layout for VMA count testing.
> + *
> + * Sets up the following VMA layout:
> + *
> + * +----- base_addr
> + * |
> + * V
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> + * | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
> + * | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
> + * | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
> + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
I have no idea what a 'guard page' is here. You're not using MADV_GUARD_INSTALL so it's presumably an arbitrary PROT_NONE page or something?
> + * ^ ^ ^ ^ ^
> + * | | | | |
> + * +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
> + * (PAGE_SIZE) | | reach MAX_VMA_COUNT
> + * | |
> + * +--- TEST_AREA_SIZE ---+
> + * | (3 * PAGE_SIZE) |
> + * ^
> + * |
> + * +-- TEST_AREA starts here
Can we make this vertical, it's obv >80 chars and it's kind of hard to read.
Also isn't this hugely convoluted? Do we really need to set up such a
complicated set of VMAs for every test?
> + *
> + * Populates TEST_AREA and other globals required for the tests.
> + * If successful, the current VMA count will be MAX_VMA_COUNT - 1.
> + *
> + * Return: true on success, false on failure.
> + */
> +static bool test_suite_setup(void)
> +{
> + int initial_vma_count;
> + size_t reservation_size;
> + void *base_addr = NULL;
> + char *ptr = NULL;
> +
> + ksft_print_msg("Setting up vma_max_count test suite...\n");
> +
> + /* Initialize globals */
> + PAGE_SIZE = sysconf(_SC_PAGESIZE);
> + TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
> + GUARD_SIZE = PAGE_SIZE;
> + EXTRA_MAP_SIZE = PAGE_SIZE;
> + MAX_VMA_COUNT = get_max_vma_count();
Dude we are not COBOL programmers :) What is it with these capitalised variable
names?
This is really horrible, please don't do that. And why on earth are you using
globals like this.
(Again) use the self test harness please :)
> +
> + MAX_VMA_COUNT = get_max_vma_count();
> + if (MAX_VMA_COUNT < 0) {
> + ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
> + return false;
> + }
> +
> + /*
> + * If the current limit is higher than the kernel default,
> + * we attempt to lower it to the default to ensure the test
> + * can run with a reliably known boundary.
> + */
> + ORIGINAL_MAX_VMA_COUNT = 0;
> +
> + if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
> + ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
> +
> + ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
> + MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
> +
> + if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
> + ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)n",
> + DEFAULT_MAX_MAP_COUNT);
Probably partly a product of you not using the harness, but you're violating the
max chars per line all over the place.
checkpatch.pl kinda has a fit with this file, can we just try and keep it to ~80 chars please?
> + ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
> +
> + MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
> +
> + return false;
> + }
> +
> + /* Update MAX_VMA_COUNT for the test run */
> + MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
> + }
> +
> + initial_vma_count = get_current_vma_count();
> + if (initial_vma_count < 0) {
> + ksft_print_msg("Failed to read /proc/self/maps\n");
> + return false;
> + }
> +
> + /*
> + * Calculate how many extra mappings we need to create to reach
> + * MAX_VMA_COUNT - 1 (excluding test area).
> + */
> + NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
> +
> + if (NR_EXTRA_MAPS < 1) {
> + ksft_print_msg("Not enough available maps to run test\n");
> + ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
> + MAX_VMA_COUNT, initial_vma_count);
> + return false;
> + }
> +
> + /*
> + * Reserve space for:
> + * - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
> + * - The test area itself (TEST_AREA_NR_PAGES)
> + * - The guard pages (NR_GUARDS)
> + */
> + reservation_size = ((NR_EXTRA_MAPS * 2) +
> + TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
> +
> + base_addr = mmap(NULL, reservation_size, PROT_NONE,
> + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> + if (base_addr == MAP_FAILED) {
> + ksft_print_msg("Failed tommap initial reservation\n");
> + return false;
> + }
> +
> + if (munmap(base_addr, reservation_size) == -1) {
> + ksft_print_msg("Failed to munmap initial reservation\n");
> + return false;
> + }
> +
> + /* Get the addr of the test area */
> + TEST_AREA = (char *)base_addr + GUARD_SIZE;
> +
> + /*
> + * Get the addr of the region for extra mappings:
> + * test area + 1 guard.
> + */
> + EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
> +
> + /* Create single-page mappings separated by unmapped pages */
> + ptr = EXTRA_MAPS;
> + for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
> + if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
> + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> + -1, 0) == MAP_FAILED) {
> + perror("mmap in fill loop");
> + ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
> + NR_EXTRA_MAPS);
> + return false;
> + }
> +
> + /* Advance pointer by 2 to leave a gap */
> + ptr += (2 * EXTRA_MAP_SIZE);
> + }
> +
> + if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
> + return false;
> +
> + ksft_print_msg("vma_max_count test suite setup done.\n");
> +
> + return true;
> +}
I mean it'll be fixed by the test harness, but even in test code let's please
not write MASSIVE functions that do a million things. It's not really readable.
> +
> +static void test_suite_teardown(void)
> +{
> + if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
> + if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
> + ksft_print_msg("Failed to restore max_map_count to %d\n",
> + ORIGINAL_MAX_VMA_COUNT);
> + }
> +}
> +
> +/* --- Test Helper Functions --- */
> +static bool mmap_anon(void)
> +{
> + void *addr = mmap(NULL, PAGE_SIZE, PROT_READ,
> + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> +
> + /*
> + * Handle cleanup here as the runner doesn't track where this,
> + *mapping is located.
> + */
> + if (addr != MAP_FAILED)
> + munmap(addr, PAGE_SIZE);
> +
> + return addr != MAP_FAILED;
> +}
> +
> +static inline bool __mprotect(char *addr, int size)
> +{
> + int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
> +
> + return mprotect(addr, size, new_prot) == 0;
> +}
> +
> +static bool mprotect_nosplit(void)
> +{
> + return __mprotect(TEST_AREA, TEST_AREA_SIZE);
> +}
> +
> +static bool mprotect_2way_split(void)
> +{
> + return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> +}
> +
> +static bool mprotect_3way_split(void)
> +{
> + return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +static inline bool __munmap(char *addr, int size)
> +{
> + return munmap(addr, size) == 0;
> +}
> +
> +static bool munmap_nosplit(void)
> +{
> + return __munmap(TEST_AREA, TEST_AREA_SIZE);
> +}
> +
> +static bool munmap_2way_split(void)
> +{
> + return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> +}
> +
> +static bool munmap_3way_split(void)
> +{
> + return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +/* mremap accounts for the worst case to fail early */
> +static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
> +
> +static bool mremap_dontunmap(void)
> +{
> + void *new_addr;
> +
> + /*
> + * Using MREMAP_DONTUNMAP will create a new mapping without
> + * removing the old one, consuming one VMA slot.
> + */
> + new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
> + MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
> +
> + if (new_addr != MAP_FAILED)
> + munmap(new_addr, TEST_AREA_SIZE);
> +
> + return new_addr != MAP_FAILED;
> +}
> +
> +struct test {
> + const char *name;
> + bool (*test)(void);
> + /* How many VMA slots below the limit this test needs to start? */
> + int vma_slots_needed;
> + bool expect_success;
> +};
> +
> +/* --- Test Cases --- */
> +struct test tests[] = {
> + {
> + .name = "mmap_at_1_below_vma_count_limit",
> + .test = mmap_anon,
> + .vma_slots_needed = 1,
> + .expect_success = true,
> + },
> + {
> + .name = "mmap_at_vma_count_limit",
> + .test = mmap_anon,
> + .vma_slots_needed = 0,
> + .expect_success = false,
> + },
> + {
> + .name = "mprotect_nosplit_at_1_below_vma_count_limit",
> + .test = mprotect_nosplit,
> + .vma_slots_needed = 1,
> + .expect_success = true,
> + },
> + {
> + .name = "mprotect_nosplit_at_vma_count_limit",
> + .test = mprotect_nosplit,
> + .vma_slots_needed = 0,
> + .expect_success = true,
> + },
> + {
> + .name = "mprotect_2way_split_at_1_below_vma_count_limit",
> + .test = mprotect_2way_split,
> + .vma_slots_needed = 1,
> + .expect_success = true,
> + },
> + {
> + .name = "mprotect_2way_split_at_vma_count_limit",
> + .test = mprotect_2way_split,
> + .vma_slots_needed = 0,
> + .expect_success = false,
> + },
> + {
> + .name = "mprotect_3way_split_at_2_below_vma_count_limit",
> + .test = mprotect_3way_split,
> + .vma_slots_needed = 2,
> + .expect_success = true,
> + },
> + {
> + .name = "mprotect_3way_split_at_1_below_vma_count_limit",
> + .test = mprotect_3way_split,
> + .vma_slots_needed = 1,
> + .expect_success = false,
> + },
> + {
> + .name = "mprotect_3way_split_at_vma_count_limit",
> + .test = mprotect_3way_split,
> + .vma_slots_needed = 0,
> + .expect_success = false,
> + },
> + {
> + .name = "munmap_nosplit_at_1_below_vma_count_limit",
> + .test = munmap_nosplit,
> + .vma_slots_needed = 1,
> + .expect_success = true,
> + },
> + {
> + .name = "munmap_nosplit_at_vma_count_limit",
> + .test = munmap_nosplit,
> + .vma_slots_needed = 0,
> + .expect_success = true,
> + },
> + {
> + .name = "munmap_2way_split_at_1_below_vma_count_limit",
> + .test = munmap_2way_split,
> + .vma_slots_needed = 1,
> + .expect_success = true,
> + },
> + {
> + .name = "munmap_2way_split_at_vma_count_limit",
> + .test = munmap_2way_split,
> + .vma_slots_needed = 0,
> + .expect_success = true,
> + },
> + {
> + .name = "munmap_3way_split_at_2_below_vma_count_limit",
> + .test = munmap_3way_split,
> + .vma_slots_needed = 2,
> + .expect_success = true,
> + },
> + {
> + .name = "munmap_3way_split_at_1_below_vma_count_limit",
> + .test = munmap_3way_split,
> + .vma_slots_needed = 1,
> + .expect_success = true,
> + },
> + {
> + .name = "munmap_3way_split_at_vma_count_limit",
> + .test = munmap_3way_split,
> + .vma_slots_needed = 0,
> + .expect_success = false,
> + },
> + {
> + .name = "mremap_dontunmap_at_required_vma_count_capcity",
> + .test = mremap_dontunmap,
> + .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
> + .expect_success = true,
> + },
> + {
> + .name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
> + .test = mremap_dontunmap,
> + .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
> + .expect_success = false,
> + },
This is horrible. We don't need to do it this way.
Use the kselftest_harness please. See the guard_regions.c test for an example of
how to use it.
> +};
> +
> +/* --- Test Runner --- */
> +int main(int argc, char **argv)
> +{
> + int num_tests = ARRAY_SIZE(tests);
> + int failed_tests = 0;
> +
> + ksft_set_plan(num_tests);
> +
> + if (!test_suite_setup() != 0) {
> + if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
> + ksft_exit_skip("max_map_count too high and cannot be lowered\n"
> + "Please rerun as root.\n");
> + else
> + ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
> +
> + }
> +
> + for (int i = 0; i < num_tests; i++) {
> + int maps_to_unmap = tests[i].vma_slots_needed;
> + const char *name = tests[i].name;
> + bool test_passed;
> +
> + errno = 0;
> +
> + /* 1. Setup: TEST_AREA mapping */
> + if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
> + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
> + == MAP_FAILED) {
> + ksft_test_result_fail(
> + "%s: Test setup failed to map TEST_AREA\n",
> + name);
> + maps_to_unmap = 0;
> + goto fail;
> + }
> +
> + /* Label TEST_AREA to ease debugging */
> + if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
> + TEST_AREA_SIZE, "TEST_AREA")) {
> + ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
> + name);
> + ksft_print_msg(
> + "Continuing without named TEST_AREA mapping\n");
> + }
> +
> + /* 2. Setup: Adjust VMA count based on test requirements */
> + if (maps_to_unmap > NR_EXTRA_MAPS) {
> + ksft_test_result_fail(
> + "%s: Test setup failed: Invalid VMA slots required %d\n",
> + name, tests[i].vma_slots_needed);
> + maps_to_unmap = 0;
> + goto fail;
> + }
> +
> + /* Unmap extra mappings, accounting for the 1-page gap */
> + for (int j = 0; j < maps_to_unmap; j++)
> + munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> + EXTRA_MAP_SIZE);
> +
> + /*
> + * 3. Verify the preconditions.
> + *
> + * Sometimes there isn't an easy way to determine the cause
> + * of the test failure.
> + * e.g. an mprotect ENOMEM may be due to trying to protect
> + * unmapped area or due to hitting MAX_VMA_COUNT limit.
> + *
> + * We verify the preconditions of the test to ensure any
> + * expected failures are from the expected cause and not
> + * coincidental.
> + */
> + if (!is_current_vma_count(name,
> + MAX_VMA_COUNT - tests[i].vma_slots_needed))
> + goto fail;
> +
> + if (!is_test_area_mapped(name))
> + goto fail;
> +
> + /* 4. Run the test */
> + test_passed = (tests[i].test() == tests[i].expect_success);
> + if (test_passed) {
> + ksft_test_result_pass("%s\n", name);
> + } else {
> +fail:
> + failed_tests++;
> + ksft_test_result_fail(
> + "%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
> + name, get_current_vma_count(), MAX_VMA_COUNT,
> + errno, strerror(errno));
> + print_surrounding_maps(name);
> + }
> +
> + /* 5. Teardown: Unmap TEST_AREA. */
> + munmap(TEST_AREA, TEST_AREA_SIZE);
> +
> + /* 6. Teardown: Restore extra mappings to test suite baseline */
> + for (int j = 0; j < maps_to_unmap; j++) {
> + /* Remap extra mappings, accounting for the gap */
> + mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> + EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
> + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> + -1, 0);
> + }
> + }
> +
> + test_suite_teardown();
> +
> + if (failed_tests > 0)
> + ksft_exit_fail();
> + else
> + ksft_exit_pass();
> +}
> +
> +/* --- Utilities --- */
> +
> +static int get_max_vma_count(void)
> +{
> + int max_count;
> + FILE *f;
> +
> + f = fopen("/proc/sys/vm/max_map_count", "r");
> + if (!f)
> + return -1;
> +
> + if (fscanf(f, "%d", &max_count) != 1)
> + max_count = -1;
> +
> + fclose(f);
> +
> + return max_count;
> +}
> +
> +static bool set_max_vma_count(int val)
> +{
> + FILE *f;
> + bool success = false;
> +
> + f = fopen("/proc/sys/vm/max_map_count", "w");
> + if (!f)
> + return false;
> +
> + if (fprintf(f, "%d", val) > 0)
> + success = true;
> +
> + fclose(f);
> + return success;
> +}
> +
> +static int get_current_vma_count(void)
> +{
> + char line[1024];
> + int count = 0;
> + FILE *f;
> +
> + f = fopen("/proc/self/maps", "r");
> + if (!f)
> + return -1;
> +
> + while (fgets(line, sizeof(line), f)) {
> + if (!strstr(line, "[vsyscall]"))
> + count++;
> + }
> +
> + fclose(f);
> +
> + return count;
> +}
> +
> +static bool is_current_vma_count(const char *msg, int expected)
> +{
> + int current = get_current_vma_count();
> +
> + if (current == expected)
> + return true;
> +
> + ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
> + expected);
> + return false;
> +}
> +
> +static bool is_test_area_mapped(const char *msg)
> +{
> + unsigned long search_start = (unsigned long)TEST_AREA;
> + unsigned long search_end = search_start + TEST_AREA_SIZE;
> + bool found = false;
> + char line[1024];
> + FILE *f;
> +
> + f = fopen("/proc/self/maps", "r");
> + if (!f) {
> + ksft_print_msg("failed to open /proc/self/maps\n");
> + return false;
> + }
> +
> + while (fgets(line, sizeof(line), f)) {
> + unsigned long start, end;
> +
> + if (sscanf(line, "%lx-%lx", &start, &end) != 2)
> + continue;
> +
> + /* Check for an exact match of the range */
> + if (start == search_start && end == search_end) {
> + found = true;
> + break;
> + } else if (start > search_end) {
> + /*
> + * Since maps are sorted, if we've passed the end, we
> + * can stop searching.
> + */
> + break;
> + }
> + }
> +
There's helpers for this kind of thing in vm_util.h, let's not duplicate
code. Also you can use the proc query ioctl thingy (also helpers for that in
vm_util.h) rather than having to spelunk /proc/$pid/maps.
> + fclose(f);
> +
> + if (found)
> + return true;
> +
> + /* Not found */
> + ksft_print_msg(
> + "%s: TEST_AREA is not mapped as a single contiguous block.\n",
> + msg);
> + print_surrounding_maps(msg);
> +
> + return false;
> +}
> +
> +static void print_surrounding_maps(const char *msg)
Why are you printing a bunch of noise in a test?
> +{
> + unsigned long search_start = (unsigned long)TEST_AREA;
> + unsigned long search_end = search_start + TEST_AREA_SIZE;
> + unsigned long start;
> + unsigned long end;
> + char line[1024] = {};
> + int line_idx = 0;
> + int first_match_idx = -1;
> + int last_match_idx = -1;
> + FILE *f;
> +
> + f = fopen("/proc/self/maps", "r");
> + if (!f)
> + return;
> +
> + if (msg)
> + ksft_print_msg("%s\n", msg);
> +
> + ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
> + TEST_AREA);
> +
> + /* First pass: Read all lines and find the range of matching entries */
> + fseek(f, 0, SEEK_SET); /* Rewind file */
> + while (fgets(line, sizeof(line), f)) {
> + if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
> + line_idx++;
> + continue;
> + }
> +
> + /* Check for any overlap */
> + if (start < search_end && end > search_start) {
> + if (first_match_idx == -1)
> + first_match_idx = line_idx;
> + last_match_idx = line_idx;
> + } else if (start > search_end) {
> + /*
> + * Since maps are sorted, if we've passed the end, we
> + * can stop searching.
> + */
> + break;
> + }
> +
> + line_idx++;
> + }
> +
> + if (first_match_idx == -1) {
> + ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
> + TEST_AREA);
> + } else {
> + /* Second pass: Print the relevant lines */
> + fseek(f, 0, SEEK_SET); /* Rewind file */
> + line_idx = 0;
> + while (fgets(line, sizeof(line), f)) {
> + /* Print 2 lines before the first match */
> + if (line_idx >= first_match_idx - 2 &&
> + line_idx < first_match_idx)
> + ksft_print_msg(" %s", line);
> +
> + /* Print all matching TEST_AREA entries */
> + if (line_idx >= first_match_idx &&
> + line_idx <= last_match_idx)
> + ksft_print_msg(">> %s", line);
> +
> + /* Print 2 lines after the last match */
> + if (line_idx > last_match_idx &&
> + line_idx <= last_match_idx + 2)
> + ksft_print_msg(" %s", line);
> +
> + line_idx++;
> + }
> + }
> +
> + ksft_print_msg("--------------------------------------------------\n");
> +
> + fclose(f);
> +}
> diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> index 471e539d82b8..3794b50ec280 100755
> --- a/tools/testing/selftests/mm/run_vmtests.sh
> +++ b/tools/testing/selftests/mm/run_vmtests.sh
> @@ -49,6 +49,8 @@ separated by spaces:
> test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
> - madv_populate
> test memadvise(2) MADV_POPULATE_{READ,WRITE} options
> +- max_vma_count
> + tests for max vma_count
> - memfd_secret
> test memfd_secret(2)
> - process_mrelease
> @@ -417,6 +419,9 @@ fi # VADDR64
> # vmalloc stability smoke test
> CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
>
> +# test operations against max vma count limit
> +CATEGORY="max_vma_count" run_test ./max_vma_count_tests
> +
> CATEGORY="mremap" run_test ./mremap_dontunmap
>
> CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
> --
> 2.51.0.384.g4c02a37b29-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread

* Re: [PATCH v2 2/7] mm/selftests: add max_vma_count tests
2025-09-18 14:42 ` Lorenzo Stoakes
@ 2025-09-18 16:21 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-18 16:21 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, Sep 18, 2025 at 7:42 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 09:36:33AM -0700, Kalesh Singh wrote:
> > Add a new selftest to verify that the max VMA count limit is correctly
> > enforced.
> >
> > This test suite checks that various VMA operations (mmap, mprotect,
> > munmap, mremap) succeed or fail as expected when the number of VMAs is
> > close to the sysctl_max_map_count limit.
> >
> > The test works by first creating a large number of VMAs to bring the
> > process close to the limit, and then performing various operations that
> > may or may not create new VMAs. The test then verifies that the
> > operations that would exceed the limit fail, and that the operations
> > that do not exceed the limit succeed.
> >
> > NOTE: munmap is special as it's allowed to temporarily exceed the limit
> > by one for splits as this will decrease back to the limit once the unmap
> > succeeds.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
> >
> > Changes in v2:
> > - Add tests, per Liam (note that the do_brk_flags() path is not easily
> > tested from userspace, so it's not included here). Exceeding the limit there
> > should be uncommon.
> >
> > tools/testing/selftests/mm/Makefile | 1 +
> > .../selftests/mm/max_vma_count_tests.c | 709 ++++++++++++++++++
>
> If you add a new file, update MAINTAINERS. Should put in MEMORY MAPPING section.
>
> > tools/testing/selftests/mm/run_vmtests.sh | 5 +
>
> You're missing a .gitignore.
>
> > 3 files changed, 715 insertions(+)
> > create mode 100644 tools/testing/selftests/mm/max_vma_count_tests.c
> >
> > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> > index d13b3cef2a2b..00a4b04eab06 100644
> > --- a/tools/testing/selftests/mm/Makefile
> > +++ b/tools/testing/selftests/mm/Makefile
> > @@ -91,6 +91,7 @@ TEST_GEN_FILES += transhuge-stress
> > TEST_GEN_FILES += uffd-stress
> > TEST_GEN_FILES += uffd-unit-tests
> > TEST_GEN_FILES += uffd-wp-mremap
> > +TEST_GEN_FILES += max_vma_count_tests
> > TEST_GEN_FILES += split_huge_page_test
> > TEST_GEN_FILES += ksm_tests
> > TEST_GEN_FILES += ksm_functional_tests
> > diff --git a/tools/testing/selftests/mm/max_vma_count_tests.c b/tools/testing/selftests/mm/max_vma_count_tests.c
> > new file mode 100644
> > index 000000000000..c8401c03425c
> > --- /dev/null
> > +++ b/tools/testing/selftests/mm/max_vma_count_tests.c
> > @@ -0,0 +1,709 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright 2025 Google LLC
> > + */
> > +#define _GNU_SOURCE
> > +
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +#include <errno.h>
> > +#include <stdbool.h>
> > +#include <linux/prctl.h> /* Definition of PR_* constants */
> > +#include <sys/prctl.h>
> > +
> > +#include "../kselftest.h"
> > +
> > +static int get_max_vma_count(void);
> > +static bool set_max_vma_count(int val);
> > +static int get_current_vma_count(void);
> > +static bool is_current_vma_count(const char *msg, int expected);
> > +static bool is_test_area_mapped(const char *msg);
> > +static void print_surrounding_maps(const char *msg);
> > +
> > +/* Globals initialized in test_suite_setup() */
> > +static int MAX_VMA_COUNT;
> > +static int ORIGINAL_MAX_VMA_COUNT;
> > +static int PAGE_SIZE;
> > +static int GUARD_SIZE;
> > +static int TEST_AREA_SIZE;
> > +static int EXTRA_MAP_SIZE;
> > +
> > +static int NR_EXTRA_MAPS;
> > +
> > +static char *TEST_AREA;
> > +static char *EXTRA_MAPS;
> > +
> > +#define DEFAULT_MAX_MAP_COUNT 65530
> > +#define TEST_AREA_NR_PAGES 3
> > +/* 1 before test area + 1 after test area + 1 after extra mappings */
> > +#define NR_GUARDS 3
> > +#define TEST_AREA_PROT (PROT_NONE)
> > +#define EXTRA_MAP_PROT (PROT_NONE)
> > +
> > +/**
> > + * test_suite_setup - Set up the VMA layout for VMA count testing.
> > + *
> > + * Sets up the following VMA layout:
> > + *
> > + * +----- base_addr
> > + * |
> > + * V
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
> > + * | Guard Page | | Guard Page | Extra Map 1 | Unmapped Gap | Extra Map 2 | Unmapped Gap | ... | Extra Map N | Unmapped Gap |
> > + * | (unmapped) | TEST_AREA | (unmapped) | (mapped page) | (1 page) | (mapped page) | (1 page) | ... | (mapped page) | (1 page) |
> > + * | (1 page) | (unmapped, 3 pages) | (1 page) | (1 page) | | (1 page) | | | (1 page) | |
> > + * +--------------+----------------------+--------------+----------------+--------------+----------------+--------------+-----+----------------+--------------+
>
> I have no idea what a 'guard page' is here. You're not using MADV_GUARD_INSTALL so it's presumably an arbitrary PROT_NONE page or something?
I've changed this to "hole" per David's earlier feedback.
>
> > + * ^ ^ ^ ^ ^
> > + * | | | | |
> > + * +--GUARD_SIZE--+ | +-- EXTRA_MAPS points here Sufficient EXTRA_MAPS to ---+
> > + * (PAGE_SIZE) | | reach MAX_VMA_COUNT
> > + * | |
> > + * +--- TEST_AREA_SIZE ---+
> > + * | (3 * PAGE_SIZE) |
> > + * ^
> > + * |
> > + * +-- TEST_AREA starts here
>
> Can we make this vertical, it's obv >80 chars and it's kind of hard to read.
Will do.
>
> Also isn't this hugely convoluted? Do we really need to set up such a
> complicated set of VMAs for every test?
I thought about adjusting max_map_count to achieve the needed
conditions for each test. But since the sysctl is global, lowering it
too far would cause problems for other system processes while the test
runs (if the new limit falls below any process's current map count).
We only set this layout up once for all the tests.
>
> > + *
> > + * Populates TEST_AREA and other globals required for the tests.
> > + * If successful, the current VMA count will be MAX_VMA_COUNT - 1.
> > + *
> > + * Return: true on success, false on failure.
> > + */
> > +static bool test_suite_setup(void)
> > +{
> > + int initial_vma_count;
> > + size_t reservation_size;
> > + void *base_addr = NULL;
> > + char *ptr = NULL;
> > +
> > + ksft_print_msg("Setting up vma_max_count test suite...\n");
> > +
> > + /* Initialize globals */
> > + PAGE_SIZE = sysconf(_SC_PAGESIZE);
> > + TEST_AREA_SIZE = TEST_AREA_NR_PAGES * PAGE_SIZE;
> > + GUARD_SIZE = PAGE_SIZE;
> > + EXTRA_MAP_SIZE = PAGE_SIZE;
>
> Dude we are not COBOL programmers :) What is it with these capitalised variable
> names?
>
> This is really horrible, please don't do that. And why on earth are you using
> globals like this.
My intention was to avoid passing a ton of parameters to each test. I
guess I could have threaded them through a struct as well :)
>
> (Again) use the self test harness please :)
Let me rework it with the harness ...
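As a rough sketch of the shape I have in mind (not the actual rework;
fixture and test names are illustrative, and this only builds inside
the kernel tree next to kselftest_harness.h):

```c
#include <sys/mman.h>
#include <unistd.h>

#include "../kselftest_harness.h"

/* Per-test state replaces the capitalised globals */
FIXTURE(max_vma_count)
{
	int max_vma_count;	/* limit read from vm.max_map_count */
	char *test_area;	/* 3-page window the tests operate on */
};

FIXTURE_SETUP(max_vma_count)
{
	/* read the sysctl, build the extra-mapping layout, etc. */
}

FIXTURE_TEARDOWN(max_vma_count)
{
	/* restore vm.max_map_count and unmap the layout */
}

TEST_F(max_vma_count, mmap_at_limit_fails)
{
	/* at the limit, a fresh anonymous mapping must be refused */
	ASSERT_EQ(MAP_FAILED, mmap(NULL, getpagesize(), PROT_READ,
				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
}

TEST_HARNESS_MAIN
```

That also gets rid of the hand-rolled runner loop, since the harness
does setup/teardown around every TEST_F() automatically.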
>
> > +
> > + MAX_VMA_COUNT = get_max_vma_count();
> > + if (MAX_VMA_COUNT < 0) {
> > + ksft_print_msg("Failed to read /proc/sys/vm/max_map_count\n");
> > + return false;
> > + }
> > +
> > + /*
> > + * If the current limit is higher than the kernel default,
> > + * we attempt to lower it to the default to ensure the test
> > + * can run with a reliably known boundary.
> > + */
> > + ORIGINAL_MAX_VMA_COUNT = 0;
> > +
> > + if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT) {
> > + ORIGINAL_MAX_VMA_COUNT = MAX_VMA_COUNT;
> > +
> > + ksft_print_msg("Max VMA count is %d, lowering to default %d for test...\n",
> > + MAX_VMA_COUNT, DEFAULT_MAX_MAP_COUNT);
> > +
> > + if (!set_max_vma_count(DEFAULT_MAX_MAP_COUNT)) {
> > + ksft_print_msg("WARNING: Failed to lower max_map_count to %d (requires root)\n",
> > + DEFAULT_MAX_MAP_COUNT);
>
> Probably partly a product of you not using the harness, but you're violating the
> max chars per line all over the place.
>
> checkpatch.pl kinda has a fit with this file, can we just try and keep it to ~80 chars please?
>
> > + ksft_print_msg("Skipping test. Please run as root: limit needs adjustment\n");
> > +
> > + MAX_VMA_COUNT = ORIGINAL_MAX_VMA_COUNT;
> > +
> > + return false;
> > + }
> > +
> > + /* Update MAX_VMA_COUNT for the test run */
> > + MAX_VMA_COUNT = DEFAULT_MAX_MAP_COUNT;
> > + }
> > +
> > + initial_vma_count = get_current_vma_count();
> > + if (initial_vma_count < 0) {
> > + ksft_print_msg("Failed to read /proc/self/maps\n");
> > + return false;
> > + }
> > +
> > + /*
> > + * Calculate how many extra mappings we need to create to reach
> > + * MAX_VMA_COUNT - 1 (excluding test area).
> > + */
> > + NR_EXTRA_MAPS = MAX_VMA_COUNT - 1 - initial_vma_count;
> > +
> > + if (NR_EXTRA_MAPS < 1) {
> > + ksft_print_msg("Not enough available maps to run test\n");
> > + ksft_print_msg("max_vma_count=%d, current_vma_count=%d\n",
> > + MAX_VMA_COUNT, initial_vma_count);
> > + return false;
> > + }
> > +
> > + /*
> > + * Reserve space for:
> > + * - Extra mappings with a 1-page gap after each (NR_EXTRA_MAPS * 2)
> > + * - The test area itself (TEST_AREA_NR_PAGES)
> > + * - The guard pages (NR_GUARDS)
> > + */
> > + reservation_size = ((NR_EXTRA_MAPS * 2) +
> > + TEST_AREA_NR_PAGES + NR_GUARDS) * PAGE_SIZE;
> > +
> > + base_addr = mmap(NULL, reservation_size, PROT_NONE,
> > + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > + if (base_addr == MAP_FAILED) {
> > + ksft_print_msg("Failed to mmap initial reservation\n");
> > + return false;
> > + }
> > +
> > + if (munmap(base_addr, reservation_size) == -1) {
> > + ksft_print_msg("Failed to munmap initial reservation\n");
> > + return false;
> > + }
> > +
> > + /* Get the addr of the test area */
> > + TEST_AREA = (char *)base_addr + GUARD_SIZE;
> > +
> > + /*
> > + * Get the addr of the region for extra mappings:
> > + * test area + 1 guard.
> > + */
> > + EXTRA_MAPS = TEST_AREA + TEST_AREA_SIZE + GUARD_SIZE;
> > +
> > + /* Create single-page mappings separated by unmapped pages */
> > + ptr = EXTRA_MAPS;
> > + for (int i = 0; i < NR_EXTRA_MAPS; ++i) {
> > + if (mmap(ptr, PAGE_SIZE, EXTRA_MAP_PROT,
> > + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> > + -1, 0) == MAP_FAILED) {
> > + perror("mmap in fill loop");
> > + ksft_print_msg("Failed on mapping #%d of %d\n", i + 1,
> > + NR_EXTRA_MAPS);
> > + return false;
> > + }
> > +
> > + /* Advance pointer by 2 to leave a gap */
> > + ptr += (2 * EXTRA_MAP_SIZE);
> > + }
> > +
> > + if (!is_current_vma_count("test_suite_setup", MAX_VMA_COUNT - 1))
> > + return false;
> > +
> > + ksft_print_msg("vma_max_count test suite setup done.\n");
> > +
> > + return true;
> > +}
>
> I mean it'll be fixed by the test harness, but even in test code let's please
> not write MASSIVE functions that do a million things. It's not really readable.
I'll try breaking this up in the next revision.
>
> > +
> > +static void test_suite_teardown(void)
> > +{
> > + if (ORIGINAL_MAX_VMA_COUNT && MAX_VMA_COUNT != ORIGINAL_MAX_VMA_COUNT) {
> > + if (!set_max_vma_count(ORIGINAL_MAX_VMA_COUNT))
> > + ksft_print_msg("Failed to restore max_map_count to %d\n",
> > + ORIGINAL_MAX_VMA_COUNT);
> > + }
> > +}
> > +
> > +/* --- Test Helper Functions --- */
> > +static bool mmap_anon(void)
> > +{
> > + void *addr = mmap(NULL, PAGE_SIZE, PROT_READ,
> > + MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > +
> > + /*
> > + * Handle cleanup here, as the runner doesn't track where this
> > + * mapping is located.
> > + */
> > + if (addr != MAP_FAILED)
> > + munmap(addr, PAGE_SIZE);
> > +
> > + return addr != MAP_FAILED;
> > +}
> > +
> > +static inline bool __mprotect(char *addr, int size)
> > +{
> > + int new_prot = ~TEST_AREA_PROT & (PROT_READ | PROT_WRITE | PROT_EXEC);
> > +
> > + return mprotect(addr, size, new_prot) == 0;
> > +}
> > +
> > +static bool mprotect_nosplit(void)
> > +{
> > + return __mprotect(TEST_AREA, TEST_AREA_SIZE);
> > +}
> > +
> > +static bool mprotect_2way_split(void)
> > +{
> > + return __mprotect(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> > +}
> > +
> > +static bool mprotect_3way_split(void)
> > +{
> > + return __mprotect(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> > +}
> > +
> > +static inline bool __munmap(char *addr, int size)
> > +{
> > + return munmap(addr, size) == 0;
> > +}
> > +
> > +static bool munmap_nosplit(void)
> > +{
> > + return __munmap(TEST_AREA, TEST_AREA_SIZE);
> > +}
> > +
> > +static bool munmap_2way_split(void)
> > +{
> > + return __munmap(TEST_AREA, TEST_AREA_SIZE - PAGE_SIZE);
> > +}
> > +
> > +static bool munmap_3way_split(void)
> > +{
> > + return __munmap(TEST_AREA + PAGE_SIZE, PAGE_SIZE);
> > +}
> > +
> > +/* mremap accounts for the worst case to fail early */
> > +static const int MREMAP_REQUIRED_VMA_SLOTS = 6;
> > +
> > +static bool mremap_dontunmap(void)
> > +{
> > + void *new_addr;
> > +
> > + /*
> > + * Using MREMAP_DONTUNMAP will create a new mapping without
> > + * removing the old one, consuming one VMA slot.
> > + */
> > + new_addr = mremap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_SIZE,
> > + MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
> > +
> > + if (new_addr != MAP_FAILED)
> > + munmap(new_addr, TEST_AREA_SIZE);
> > +
> > + return new_addr != MAP_FAILED;
> > +}
> > +
> > +struct test {
> > + const char *name;
> > + bool (*test)(void);
> > + /* How many VMA slots below the limit this test needs to start? */
> > + int vma_slots_needed;
> > + bool expect_success;
> > +};
> > +
> > +/* --- Test Cases --- */
> > +struct test tests[] = {
> > + {
> > + .name = "mmap_at_1_below_vma_count_limit",
> > + .test = mmap_anon,
> > + .vma_slots_needed = 1,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "mmap_at_vma_count_limit",
> > + .test = mmap_anon,
> > + .vma_slots_needed = 0,
> > + .expect_success = false,
> > + },
> > + {
> > + .name = "mprotect_nosplit_at_1_below_vma_count_limit",
> > + .test = mprotect_nosplit,
> > + .vma_slots_needed = 1,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "mprotect_nosplit_at_vma_count_limit",
> > + .test = mprotect_nosplit,
> > + .vma_slots_needed = 0,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "mprotect_2way_split_at_1_below_vma_count_limit",
> > + .test = mprotect_2way_split,
> > + .vma_slots_needed = 1,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "mprotect_2way_split_at_vma_count_limit",
> > + .test = mprotect_2way_split,
> > + .vma_slots_needed = 0,
> > + .expect_success = false,
> > + },
> > + {
> > + .name = "mprotect_3way_split_at_2_below_vma_count_limit",
> > + .test = mprotect_3way_split,
> > + .vma_slots_needed = 2,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "mprotect_3way_split_at_1_below_vma_count_limit",
> > + .test = mprotect_3way_split,
> > + .vma_slots_needed = 1,
> > + .expect_success = false,
> > + },
> > + {
> > + .name = "mprotect_3way_split_at_vma_count_limit",
> > + .test = mprotect_3way_split,
> > + .vma_slots_needed = 0,
> > + .expect_success = false,
> > + },
> > + {
> > + .name = "munmap_nosplit_at_1_below_vma_count_limit",
> > + .test = munmap_nosplit,
> > + .vma_slots_needed = 1,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "munmap_nosplit_at_vma_count_limit",
> > + .test = munmap_nosplit,
> > + .vma_slots_needed = 0,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "munmap_2way_split_at_1_below_vma_count_limit",
> > + .test = munmap_2way_split,
> > + .vma_slots_needed = 1,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "munmap_2way_split_at_vma_count_limit",
> > + .test = munmap_2way_split,
> > + .vma_slots_needed = 0,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "munmap_3way_split_at_2_below_vma_count_limit",
> > + .test = munmap_3way_split,
> > + .vma_slots_needed = 2,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "munmap_3way_split_at_1_below_vma_count_limit",
> > + .test = munmap_3way_split,
> > + .vma_slots_needed = 1,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "munmap_3way_split_at_vma_count_limit",
> > + .test = munmap_3way_split,
> > + .vma_slots_needed = 0,
> > + .expect_success = false,
> > + },
> > + {
> > + .name = "mremap_dontunmap_at_required_vma_count_capacity",
> > + .test = mremap_dontunmap,
> > + .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS,
> > + .expect_success = true,
> > + },
> > + {
> > + .name = "mremap_dontunmap_at_1_below_required_vma_count_capacity",
> > + .test = mremap_dontunmap,
> > + .vma_slots_needed = MREMAP_REQUIRED_VMA_SLOTS - 1,
> > + .expect_success = false,
> > + },
>
> This is horrible. We don't need to do it this way.
>
> Use the kselftest_harness please. See the guard_regions.c test for an example of
> how to use it.
Will do, thanks for the reference.
>
> > +};
> > +
> > +/* --- Test Runner --- */
> > +int main(int argc, char **argv)
> > +{
> > + int num_tests = ARRAY_SIZE(tests);
> > + int failed_tests = 0;
> > +
> > + ksft_set_plan(num_tests);
> > +
> > + if (!test_suite_setup()) {
> > + if (MAX_VMA_COUNT > DEFAULT_MAX_MAP_COUNT)
> > + ksft_exit_skip("max_map_count too high and cannot be lowered\n"
> > + "Please rerun as root.\n");
> > + else
> > + ksft_exit_fail_msg("Test suite setup failed. Aborting.\n");
> > +
> > + }
> > +
> > + for (int i = 0; i < num_tests; i++) {
> > + int maps_to_unmap = tests[i].vma_slots_needed;
> > + const char *name = tests[i].name;
> > + bool test_passed;
> > +
> > + errno = 0;
> > +
> > + /* 1. Setup: TEST_AREA mapping */
> > + if (mmap(TEST_AREA, TEST_AREA_SIZE, TEST_AREA_PROT,
> > + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
> > + == MAP_FAILED) {
> > + ksft_test_result_fail(
> > + "%s: Test setup failed to map TEST_AREA\n",
> > + name);
> > + maps_to_unmap = 0;
> > + goto fail;
> > + }
> > +
> > + /* Label TEST_AREA to ease debugging */
> > + if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, TEST_AREA,
> > + TEST_AREA_SIZE, "TEST_AREA")) {
> > + ksft_print_msg("WARNING: [%s] prctl(PR_SET_VMA) failed\n",
> > + name);
> > + ksft_print_msg(
> > + "Continuing without named TEST_AREA mapping\n");
> > + }
> > +
> > + /* 2. Setup: Adjust VMA count based on test requirements */
> > + if (maps_to_unmap > NR_EXTRA_MAPS) {
> > + ksft_test_result_fail(
> > + "%s: Test setup failed: Invalid VMA slots required %d\n",
> > + name, tests[i].vma_slots_needed);
> > + maps_to_unmap = 0;
> > + goto fail;
> > + }
> > +
> > + /* Unmap extra mappings, accounting for the 1-page gap */
> > + for (int j = 0; j < maps_to_unmap; j++)
> > + munmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> > + EXTRA_MAP_SIZE);
> > +
> > + /*
> > + * 3. Verify the preconditions.
> > + *
> > + * Sometimes there isn't an easy way to determine the cause
> > + * of the test failure.
> > + * e.g. an mprotect ENOMEM may be due to trying to protect
> > + * unmapped area or due to hitting MAX_VMA_COUNT limit.
> > + *
> > + * We verify the preconditions of the test to ensure any
> > + * expected failures are from the expected cause and not
> > + * coincidental.
> > + */
> > + if (!is_current_vma_count(name,
> > + MAX_VMA_COUNT - tests[i].vma_slots_needed))
> > + goto fail;
> > +
> > + if (!is_test_area_mapped(name))
> > + goto fail;
> > +
> > + /* 4. Run the test */
> > + test_passed = (tests[i].test() == tests[i].expect_success);
> > + if (test_passed) {
> > + ksft_test_result_pass("%s\n", name);
> > + } else {
> > +fail:
> > + failed_tests++;
> > + ksft_test_result_fail(
> > + "%s: current_vma_count=%d,max_vma_count=%d: errno: %d (%s)\n",
> > + name, get_current_vma_count(), MAX_VMA_COUNT,
> > + errno, strerror(errno));
> > + print_surrounding_maps(name);
> > + }
> > +
> > + /* 5. Teardown: Unmap TEST_AREA. */
> > + munmap(TEST_AREA, TEST_AREA_SIZE);
> > +
> > + /* 6. Teardown: Restore extra mappings to test suite baseline */
> > + for (int j = 0; j < maps_to_unmap; j++) {
> > + /* Remap extra mappings, accounting for the gap */
> > + mmap(EXTRA_MAPS + (j * 2 * EXTRA_MAP_SIZE),
> > + EXTRA_MAP_SIZE, EXTRA_MAP_PROT,
> > + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
> > + -1, 0);
> > + }
> > + }
> > +
> > + test_suite_teardown();
> > +
> > + if (failed_tests > 0)
> > + ksft_exit_fail();
> > + else
> > + ksft_exit_pass();
> > +}
> > +
> > +/* --- Utilities --- */
> > +
> > +static int get_max_vma_count(void)
> > +{
> > + int max_count;
> > + FILE *f;
> > +
> > + f = fopen("/proc/sys/vm/max_map_count", "r");
> > + if (!f)
> > + return -1;
> > +
> > + if (fscanf(f, "%d", &max_count) != 1)
> > + max_count = -1;
> > +
> > + fclose(f);
> > +
> > + return max_count;
> > +}
> > +
> > +static bool set_max_vma_count(int val)
> > +{
> > + FILE *f;
> > + bool success = false;
> > +
> > + f = fopen("/proc/sys/vm/max_map_count", "w");
> > + if (!f)
> > + return false;
> > +
> > + if (fprintf(f, "%d", val) > 0)
> > + success = true;
> > +
> > + fclose(f);
> > + return success;
> > +}
> > +
> > +static int get_current_vma_count(void)
> > +{
> > + char line[1024];
> > + int count = 0;
> > + FILE *f;
> > +
> > + f = fopen("/proc/self/maps", "r");
> > + if (!f)
> > + return -1;
> > +
> > + while (fgets(line, sizeof(line), f)) {
> > + if (!strstr(line, "[vsyscall]"))
> > + count++;
> > + }
> > +
> > + fclose(f);
> > +
> > + return count;
> > +}
> > +
> > +static bool is_current_vma_count(const char *msg, int expected)
> > +{
> > + int current = get_current_vma_count();
> > +
> > + if (current == expected)
> > + return true;
> > +
> > + ksft_print_msg("%s: vma count is %d, expected %d\n", msg, current,
> > + expected);
> > + return false;
> > +}
> > +
> > +static bool is_test_area_mapped(const char *msg)
> > +{
> > + unsigned long search_start = (unsigned long)TEST_AREA;
> > + unsigned long search_end = search_start + TEST_AREA_SIZE;
> > + bool found = false;
> > + char line[1024];
> > + FILE *f;
> > +
> > + f = fopen("/proc/self/maps", "r");
> > + if (!f) {
> > + ksft_print_msg("failed to open /proc/self/maps\n");
> > + return false;
> > + }
> > +
> > + while (fgets(line, sizeof(line), f)) {
> > + unsigned long start, end;
> > +
> > + if (sscanf(line, "%lx-%lx", &start, &end) != 2)
> > + continue;
> > +
> > + /* Check for an exact match of the range */
> > + if (start == search_start && end == search_end) {
> > + found = true;
> > + break;
> > + } else if (start > search_end) {
> > + /*
> > + * Since maps are sorted, if we've passed the end, we
> > + * can stop searching.
> > + */
> > + break;
> > + }
> > + }
> > +
>
> There's helpers for this kind of thing in vm_util.h, let's not duplicate
> code. Also you can use the proc query ioctl thingy (also helpers for that in
> vm_util.h) rather than having to spelunk /proc/$pid/maps.
Thanks I'll use the helpers there instead.
>
> > + fclose(f);
> > +
> > + if (found)
> > + return true;
> > +
> > + /* Not found */
> > + ksft_print_msg(
> > + "%s: TEST_AREA is not mapped as a single contiguous block.\n",
> > + msg);
> > + print_surrounding_maps(msg);
> > +
> > + return false;
> > +}
> > +
> > +static void print_surrounding_maps(const char *msg)
>
> Why are you printing a bunch of noise in a test?
This is only printed in the failure case; it makes debugging much
easier. I can update it to a TH_LOG() instead once it's moved to the
test harness.
Thanks,
Kalesh
>
> > +{
> > + unsigned long search_start = (unsigned long)TEST_AREA;
> > + unsigned long search_end = search_start + TEST_AREA_SIZE;
> > + unsigned long start;
> > + unsigned long end;
> > + char line[1024] = {};
> > + int line_idx = 0;
> > + int first_match_idx = -1;
> > + int last_match_idx = -1;
> > + FILE *f;
> > +
> > + f = fopen("/proc/self/maps", "r");
> > + if (!f)
> > + return;
> > +
> > + if (msg)
> > + ksft_print_msg("%s\n", msg);
> > +
> > + ksft_print_msg("--- Surrounding VMA entries for TEST_AREA (%p) ---\n",
> > + TEST_AREA);
> > +
> > + /* First pass: Read all lines and find the range of matching entries */
> > + fseek(f, 0, SEEK_SET); /* Rewind file */
> > + while (fgets(line, sizeof(line), f)) {
> > + if (sscanf(line, "%lx-%lx", &start, &end) != 2) {
> > + line_idx++;
> > + continue;
> > + }
> > +
> > + /* Check for any overlap */
> > + if (start < search_end && end > search_start) {
> > + if (first_match_idx == -1)
> > + first_match_idx = line_idx;
> > + last_match_idx = line_idx;
> > + } else if (start > search_end) {
> > + /*
> > + * Since maps are sorted, if we've passed the end, we
> > + * can stop searching.
> > + */
> > + break;
> > + }
> > +
> > + line_idx++;
> > + }
> > +
> > + if (first_match_idx == -1) {
> > + ksft_print_msg("TEST_AREA (%p) is not currently mapped.\n",
> > + TEST_AREA);
> > + } else {
> > + /* Second pass: Print the relevant lines */
> > + fseek(f, 0, SEEK_SET); /* Rewind file */
> > + line_idx = 0;
> > + while (fgets(line, sizeof(line), f)) {
> > + /* Print 2 lines before the first match */
> > + if (line_idx >= first_match_idx - 2 &&
> > + line_idx < first_match_idx)
> > + ksft_print_msg(" %s", line);
> > +
> > + /* Print all matching TEST_AREA entries */
> > + if (line_idx >= first_match_idx &&
> > + line_idx <= last_match_idx)
> > + ksft_print_msg(">> %s", line);
> > +
> > + /* Print 2 lines after the last match */
> > + if (line_idx > last_match_idx &&
> > + line_idx <= last_match_idx + 2)
> > + ksft_print_msg(" %s", line);
> > +
> > + line_idx++;
> > + }
> > + }
> > +
> > + ksft_print_msg("--------------------------------------------------\n");
> > +
> > + fclose(f);
> > +}
> > diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
> > index 471e539d82b8..3794b50ec280 100755
> > --- a/tools/testing/selftests/mm/run_vmtests.sh
> > +++ b/tools/testing/selftests/mm/run_vmtests.sh
> > @@ -49,6 +49,8 @@ separated by spaces:
> > test madvise(2) MADV_GUARD_INSTALL and MADV_GUARD_REMOVE options
> > - madv_populate
> > test memadvise(2) MADV_POPULATE_{READ,WRITE} options
> > +- max_vma_count
> > + tests for max vma_count
> > - memfd_secret
> > test memfd_secret(2)
> > - process_mrelease
> > @@ -417,6 +419,9 @@ fi # VADDR64
> > # vmalloc stability smoke test
> > CATEGORY="vmalloc" run_test bash ./test_vmalloc.sh smoke
> >
> > +# test operations against max vma count limit
> > +CATEGORY="max_vma_count" run_test ./max_vma_count_tests
> > +
> > CATEGORY="mremap" run_test ./mremap_dontunmap
> >
> > CATEGORY="hmm" run_test bash ./test_hmm.sh smoke
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 3/7] mm: introduce vma_count_remaining()
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
2025-09-15 16:36 ` [PATCH v2 1/7] mm: fix off-by-one error in VMA count limit checks Kalesh Singh
2025-09-15 16:36 ` [PATCH v2 2/7] mm/selftests: add max_vma_count tests Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
2025-09-17 13:38 ` David Hildenbrand
` (2 more replies)
2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
` (4 subsequent siblings)
7 siblings, 3 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
The checks against sysctl_max_map_count are open-coded in multiple
places. While simple checks are manageable, the logic in places like
mremap.c involves arithmetic with magic numbers that can be difficult
to reason about. e.g. ... >= sysctl_max_map_count - 3
To improve readability and centralize the logic, introduce a new helper,
vma_count_remaining(). This function returns the VMA count headroom
available for a givine process.
The most common case of checking for a single new VMA can be done by
testing the return value of vma_count_remaining() directly:
if (!vma_count_remaining(mm))
And the complex checks in mremap.c become clearer by expressing the
required capacity directly:
if (vma_count_remaining(mm) < 4)
While a capacity-based function could be misused (e.g., with an
incorrect '<' vs '<=' comparison), the improved readability at the call
sites makes such errors less likely than with the previous open-coded
arithmetic.
As part of this change, sysctl_max_map_count is made static to
mm/mmap.c to improve encapsulation.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Fix documentation comment for vma_count_remaining(), per Mike
- Remove extern in header, per Mike and Pedro
- Move declaration to mm/internal.h, per Mike
- Replace exceeds_max_map_count() with capacity-based vma_count_remaining(),
per Lorenzo.
- Fix tools/testing/vma, per Lorenzo.
include/linux/mm.h | 2 --
mm/internal.h | 2 ++
mm/mmap.c | 21 ++++++++++++++++++++-
mm/mremap.c | 7 ++++---
mm/nommu.c | 2 +-
mm/util.c | 1 -
mm/vma.c | 10 +++++-----
tools/testing/vma/vma_internal.h | 9 +++++++++
8 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..138bab2988f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page)
#define MAPCOUNT_ELF_CORE_MARGIN (5)
#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-extern int sysctl_max_map_count;
-
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc03..39f1c9535ae5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
+int vma_count_remaining(const struct mm_struct *mm);
+
#endif /* __MM_INTERNAL_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index e5370e7fcd8f..af88ce1fbb5f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
return -EOVERFLOW;
/* Too many mappings? */
- if (mm->map_count >= sysctl_max_map_count)
+ if (!vma_count_remaining(mm))
return -ENOMEM;
/*
@@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
int sysctl_legacy_va_layout;
#endif
+static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+
+/**
+ * vma_count_remaining - Determine available VMA slots
+ * @mm: The memory descriptor for the process.
+ *
+ * Check how many more VMAs can be created for the given @mm
+ * before hitting the sysctl_max_map_count limit.
+ *
+ * Return: The number of new VMAs the process can accommodate.
+ */
+int vma_count_remaining(const struct mm_struct *mm)
+{
+ const int map_count = mm->map_count;
+ const int max_count = sysctl_max_map_count;
+
+ return (max_count > map_count) ? (max_count - map_count) : 0;
+}
+
static const struct ctl_table mmap_table[] = {
{
.procname = "max_map_count",
diff --git a/mm/mremap.c b/mm/mremap.c
index 35de0a7b910e..14d35d87e89b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
* We'd prefer to avoid failure later on in do_munmap:
* which may split one vma into three before unmapping.
*/
- if (current->mm->map_count >= sysctl_max_map_count - 3)
+ if (vma_count_remaining(current->mm) < 4)
return -ENOMEM;
if (vma->vm_ops && vma->vm_ops->may_split) {
@@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
* split in 3 before unmapping it.
* That means 2 more maps (1 for each) to the ones we already hold.
* Check whether current map count plus 2 still leads us to 4 maps below
- * the threshold, otherwise return -ENOMEM here to be more safe.
+ * the threshold. In other words, is the current map count + 6 at or
+ * below the threshold? Otherwise return -ENOMEM here to be more safe.
*/
- if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
+ if (vma_count_remaining(current->mm) < 6)
return -ENOMEM;
return 0;
diff --git a/mm/nommu.c b/mm/nommu.c
index 8b819fafd57b..dd75f2334812 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
return -ENOMEM;
mm = vma->vm_mm;
- if (mm->map_count >= sysctl_max_map_count)
+ if (!vma_count_remaining(mm))
return -ENOMEM;
region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
diff --git a/mm/util.c b/mm/util.c
index f814e6a59ab1..b6e83922cafe 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy);
int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
static int sysctl_overcommit_ratio __read_mostly = 50;
static unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
diff --git a/mm/vma.c b/mm/vma.c
index 033a388bc4b1..df0e8409f63d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
}
/*
- * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
- * has already been checked or doesn't make sense to fail.
+ * __split_vma() bypasses vma_count_remaining() checks. We use this where
+ * it has already been checked or doesn't make sense to fail.
* VMA Iterator will point to the original VMA.
*/
static __must_check int
@@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long addr, int new_below)
{
- if (vma->vm_mm->map_count >= sysctl_max_map_count)
+ if (!vma_count_remaining(vma->vm_mm))
return -ENOMEM;
return __split_vma(vmi, vma, addr, new_below);
@@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
* its limit temporarily, to help free resources as expected.
*/
if (vms->end < vms->vma->vm_end &&
- vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
+ !vma_count_remaining(vms->vma->vm_mm)) {
error = -ENOMEM;
goto map_count_exceeded;
}
@@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
return -ENOMEM;
- if (mm->map_count >= sysctl_max_map_count)
+ if (!vma_count_remaining(mm))
return -ENOMEM;
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 3639aa8dd2b0..52cd7ddc73f4 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
return vm_flags;
}
+/* Helper to get VMA count capacity */
+static int vma_count_remaining(const struct mm_struct *mm)
+{
+ const int map_count = mm->map_count;
+ const int max_count = sysctl_max_map_count;
+
+ return (max_count > map_count) ? (max_count - map_count) : 0;
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related	[flat|nested] 65+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
@ 2025-09-17 13:38 ` David Hildenbrand
2025-09-17 17:10 ` Kalesh Singh
2025-09-18 13:20 ` Lorenzo Stoakes
2025-09-18 14:31 ` Lorenzo Stoakes
2 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-09-17 13:38 UTC (permalink / raw)
To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On 15.09.25 18:36, Kalesh Singh wrote:
> The checks against sysctl_max_map_count are open-coded in multiple
> places. While simple checks are manageable, the logic in places like
> mremap.c involves arithmetic with magic numbers that can be difficult
> to reason about. e.g. ... >= sysctl_max_map_count - 3
>
> To improve readability and centralize the logic, introduce a new helper,
> vma_count_remaining(). This function returns the VMA count headroom
> available for a givine process.
s/givine/given/
s/process/mm/
>
> > The most common case of checking for a single new VMA can be done by
> > testing the return value of vma_count_remaining() directly:
>
> if (!vma_count_remaining(mm))
>
> And the complex checks in mremap.c become clearer by expressing the
> required capacity directly:
>
> if (vma_count_remaining(mm) < 4)
>
> While a capacity-based function could be misused (e.g., with an
> incorrect '<' vs '<=' comparison), the improved readability at the call
> sites makes such errors less likely than with the previous open-coded
> arithmetic.
>
> As part of this change, sysctl_max_map_count is made static to
> mm/mmap.c to improve encapsulation.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
[...]
> /*
> @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
> int sysctl_legacy_va_layout;
> #endif
>
> +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> +
> +/**
> + * vma_count_remaining - Determine available VMA slots
> + * @mm: The memory descriptor for the process.
> + *
> + * Check how many more VMAs can be created for the given @mm
> + * before hitting the sysctl_max_map_count limit.
> + *
> + * Return: The number of new VMAs the process can accommodate.
> + */
> +int vma_count_remaining(const struct mm_struct *mm)
> +{
> + const int map_count = mm->map_count;
> + const int max_count = sysctl_max_map_count;
If we worry about rare races (sysctl_max_map_count changing?) we should
probably force a single read through READ_ONCE()?
Otherwise one might trick vma_count_remaining() into returning a
negative number I assume.
> +
> + return (max_count > map_count) ? (max_count - map_count) : 0;
> +}
Nothing else jumped at me.
Not sure what the buildbot complains about but I'm sure you'll figure it
out :)
--
Cheers
David / dhildenb
^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
2025-09-17 13:38 ` David Hildenbrand
@ 2025-09-17 17:10 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-17 17:10 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Wed, Sep 17, 2025 at 6:38 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.09.25 18:36, Kalesh Singh wrote:
> > The checks against sysctl_max_map_count are open-coded in multiple
> > places. While simple checks are manageable, the logic in places like
> > mremap.c involves arithmetic with magic numbers that can be difficult
> > to reason about. e.g. ... >= sysctl_max_map_count - 3
> >
> > To improve readability and centralize the logic, introduce a new helper,
> > vma_count_remaining(). This function returns the VMA count headroom
> > available for a givine process.
>
> s/givine/given/
>
> s/process/mm/
>
> >
> > The most common case of checking for a single new VMA can be done by
> > testing the return value of vma_count_remaining() directly:
> >
> > if (!vma_count_remaining(mm))
> >
> > And the complex checks in mremap.c become clearer by expressing the
> > required capacity directly:
> >
> > if (vma_count_remaining(mm) < 4)
> >
> > While a capacity-based function could be misused (e.g., with an
> > incorrect '<' vs '<=' comparison), the improved readability at the call
> > sites makes such errors less likely than with the previous open-coded
> > arithmetic.
> >
> > As part of this change, sysctl_max_map_count is made static to
> > mm/mmap.c to improve encapsulation.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
>
> [...]
>
> > /*
> > @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
> > int sysctl_legacy_va_layout;
> > #endif
> >
> > +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> > +
> > +/**
> > + * vma_count_remaining - Determine available VMA slots
> > + * @mm: The memory descriptor for the process.
> > + *
> > + * Check how many more VMAs can be created for the given @mm
> > + * before hitting the sysctl_max_map_count limit.
> > + *
> > + * Return: The number of new VMAs the process can accommodate.
> > + */
> > +int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > + const int map_count = mm->map_count;
> > + const int max_count = sysctl_max_map_count;
>
> If we worry about rare races (sysctl_max_map_count changing?) we should
> probably force a single read through READ_ONCE()?
>
> Otherwise one might trick vma_count_remaining() into returning a
> negative number I assume.
Agreed, I'll add the READ_ONCE when resending.
Thanks,
Kalesh
>
> > +
> > + return (max_count > map_count) ? (max_count - map_count) : 0;
> > +}
>
> Nothing else jumped at me.
>
> Not sure what the buildbot complains about but I'm sure you'll figure it
> out :)
>
> --
> Cheers
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
2025-09-17 13:38 ` David Hildenbrand
@ 2025-09-18 13:20 ` Lorenzo Stoakes
2025-09-18 13:26 ` Lorenzo Stoakes
2025-09-18 14:31 ` Lorenzo Stoakes
2 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:20 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
> include/linux/mm.h | 2 --
> mm/internal.h | 2 ++
> mm/mmap.c | 21 ++++++++++++++++++++-
> mm/mremap.c | 7 ++++---
> mm/nommu.c | 2 +-
> mm/util.c | 1 -
> mm/vma.c | 10 +++++-----
> tools/testing/vma/vma_internal.h | 9 +++++++++
Will look into this properly later, but there's a conflict in vma_internal.h
atm for mm-new FYI.
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 3639aa8dd2b0..52cd7ddc73f4 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> return vm_flags;
> }
>
> +/* Helper to get VMA count capacity */
> +static int vma_count_remaining(const struct mm_struct *mm)
> +{
> + const int map_count = mm->map_count;
> + const int max_count = sysctl_max_map_count;
> +
> + return (max_count > map_count) ? (max_count - map_count) : 0;
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>
Probably because stuff got added at the end... :)
^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
2025-09-18 13:20 ` Lorenzo Stoakes
@ 2025-09-18 13:26 ` Lorenzo Stoakes
0 siblings, 0 replies; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:26 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, Sep 18, 2025 at 02:20:03PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
> > include/linux/mm.h | 2 --
> > mm/internal.h | 2 ++
> > mm/mmap.c | 21 ++++++++++++++++++++-
> > mm/mremap.c | 7 ++++---
> > mm/nommu.c | 2 +-
> > mm/util.c | 1 -
> > mm/vma.c | 10 +++++-----
> > tools/testing/vma/vma_internal.h | 9 +++++++++
>
> Will look into this properly later, but there's a conflict in vma_internal.h
> atm for mm-new FYI.
>
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 3639aa8dd2b0..52cd7ddc73f4 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> > return vm_flags;
> > }
> >
> > +/* Helper to get VMA count capacity */
> > +static int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > + const int map_count = mm->map_count;
> > + const int max_count = sysctl_max_map_count;
> > +
> > + return (max_count > map_count) ? (max_count - map_count) : 0;
> > +}
> > +
> > #endif /* __MM_VMA_INTERNAL_H */
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >
>
> Probably because stuff got added at the end... :)
Oh mm/internal.h too, seems that's my mmap_prepare series. FYI
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
2025-09-17 13:38 ` David Hildenbrand
2025-09-18 13:20 ` Lorenzo Stoakes
@ 2025-09-18 14:31 ` Lorenzo Stoakes
2025-09-18 15:52 ` Kalesh Singh
2 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:31 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
> The checks against sysctl_max_map_count are open-coded in multiple
> places. While simple checks are manageable, the logic in places like
> mremap.c involves arithmetic with magic numbers that can be difficult
> to reason about. e.g. ... >= sysctl_max_map_count - 3
>
> To improve readability and centralize the logic, introduce a new helper,
> vma_count_remaining(). This function returns the VMA count headroom
> available for a givine process.
>
> > The most common case of checking for a single new VMA can be done by
> > testing the return value of vma_count_remaining() directly:
>
> if (!vma_count_remaining(mm))
>
> And the complex checks in mremap.c become clearer by expressing the
> required capacity directly:
>
> if (vma_count_remaining(mm) < 4)
Double space after 4.
>
> While a capacity-based function could be misused (e.g., with an
> incorrect '<' vs '<=' comparison), the improved readability at the call
> sites makes such errors less likely than with the previous open-coded
> arithmetic.
>
> As part of this change, sysctl_max_map_count is made static to
> mm/mmap.c to improve encapsulation.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Generally logic looks ok, with various stuff fixed below + in commit msg we
should be good to go :)
> ---
>
> Changes in v2:
> - Fix documentation comment for vma_count_remaining(), per Mike
> - Remove extern in header, per Mike and Pedro
> - Move declaration to mm/internal.h, per Mike
> - Replace exceeds_max_map_count() with capacity-based vma_count_remaining(),
> per Lorenzo.
> - Fix tools/testing/vma, per Lorenzo.
>
> include/linux/mm.h | 2 --
> mm/internal.h | 2 ++
> mm/mmap.c | 21 ++++++++++++++++++++-
> mm/mremap.c | 7 ++++---
> mm/nommu.c | 2 +-
> mm/util.c | 1 -
> mm/vma.c | 10 +++++-----
> tools/testing/vma/vma_internal.h | 9 +++++++++
> 8 files changed, 41 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ae97a0b8ec7..138bab2988f8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page)
> #define MAPCOUNT_ELF_CORE_MARGIN (5)
> #define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
>
> -extern int sysctl_max_map_count;
> -
Nice to get rid of this as a global! :)
> extern unsigned long sysctl_user_reserve_kbytes;
> extern unsigned long sysctl_admin_reserve_kbytes;
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 45b725c3dc03..39f1c9535ae5 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
> void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
> int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
>
> +int vma_count_remaining(const struct mm_struct *mm);
> +
> #endif /* __MM_INTERNAL_H */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index e5370e7fcd8f..af88ce1fbb5f 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> return -EOVERFLOW;
>
> /* Too many mappings? */
> - if (mm->map_count >= sysctl_max_map_count)
> + if (!vma_count_remaining(mm))
> return -ENOMEM;
>
> /*
> @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
> int sysctl_legacy_va_layout;
> #endif
>
> +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> +
> +/**
> + * vma_count_remaining - Determine available VMA slots
> + * @mm: The memory descriptor for the process.
> + *
> + * Check how many more VMAs can be created for the given @mm
> + * before hitting the sysctl_max_map_count limit.
> + *
> + * Return: The number of new VMAs the process can accommodate.
> + */
> +int vma_count_remaining(const struct mm_struct *mm)
> +{
> + const int map_count = mm->map_count;
> + const int max_count = sysctl_max_map_count;
David already commented on the READ_ONCE() here, seems wise.
> +
> + return (max_count > map_count) ? (max_count - map_count) : 0;
Not a big deal but would prefer:
	if (map_count >= max_count)
return 0;
return max_count - map_count;
As the ternary here is a bit less clear, and it puts the 'failure' case first
and the 'success' case afterwards.
> +}
> +
As discussed in reply to the kernel bot, you've accidentally placed this in a
CONFIG_SYSCTL block, so need to move it :>)
> static const struct ctl_table mmap_table[] = {
> {
> .procname = "max_map_count",
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 35de0a7b910e..14d35d87e89b 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
> * We'd prefer to avoid failure later on in do_munmap:
> * which may split one vma into three before unmapping.
> */
> - if (current->mm->map_count >= sysctl_max_map_count - 3)
> + if (vma_count_remaining(current->mm) < 4)
> return -ENOMEM;
This is much clearer.
>
> if (vma->vm_ops && vma->vm_ops->may_split) {
> @@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> * split in 3 before unmapping it.
> * That means 2 more maps (1 for each) to the ones we already hold.
> * Check whether current map count plus 2 still leads us to 4 maps below
> - * the threshold, otherwise return -ENOMEM here to be more safe.
> + * the threshold. In other words, is the current map count + 6 at or
> + * below the threshold? Otherwise return -ENOMEM here to be more safe.
> */
> - if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
> + if (vma_count_remaining(current->mm) < 6)
> return -ENOMEM;
I hate that we do this silly check here, but the time to revisit it is another
series...
>
> return 0;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 8b819fafd57b..dd75f2334812 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> return -ENOMEM;
>
> mm = vma->vm_mm;
> - if (mm->map_count >= sysctl_max_map_count)
> + if (!vma_count_remaining(mm))
> return -ENOMEM;
>
> region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
> diff --git a/mm/util.c b/mm/util.c
> index f814e6a59ab1..b6e83922cafe 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy);
> int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
> static int sysctl_overcommit_ratio __read_mostly = 50;
> static unsigned long sysctl_overcommit_kbytes __read_mostly;
> -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
> unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 033a388bc4b1..df0e8409f63d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> }
>
> /*
> - * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
> - * has already been checked or doesn't make sense to fail.
> + * __split_vma() bypasses vma_count_remaining() checks. We use this where
> + * it has already been checked or doesn't make sense to fail.
> * VMA Iterator will point to the original VMA.
> */
> static __must_check int
> @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> unsigned long addr, int new_below)
> {
> - if (vma->vm_mm->map_count >= sysctl_max_map_count)
> + if (!vma_count_remaining(vma->vm_mm))
> return -ENOMEM;
>
> return __split_vma(vmi, vma, addr, new_below);
> @@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> * its limit temporarily, to help free resources as expected.
> */
> if (vms->end < vms->vma->vm_end &&
> - vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
> + !vma_count_remaining(vms->vma->vm_mm)) {
> error = -ENOMEM;
> goto map_count_exceeded;
> }
> @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
> return -ENOMEM;
>
> - if (mm->map_count >= sysctl_max_map_count)
> + if (!vma_count_remaining(mm))
> return -ENOMEM;
>
> if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 3639aa8dd2b0..52cd7ddc73f4 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> return vm_flags;
> }
>
> +/* Helper to get VMA count capacity */
> +static int vma_count_remaining(const struct mm_struct *mm)
> +{
> + const int map_count = mm->map_count;
> + const int max_count = sysctl_max_map_count;
> +
> + return (max_count > map_count) ? (max_count - map_count) : 0;
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>
^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH v2 3/7] mm: introduce vma_count_remaining()
2025-09-18 14:31 ` Lorenzo Stoakes
@ 2025-09-18 15:52 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-18 15:52 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, Sep 18, 2025 at 7:32 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 09:36:34AM -0700, Kalesh Singh wrote:
> > The checks against sysctl_max_map_count are open-coded in multiple
> > places. While simple checks are manageable, the logic in places like
> > mremap.c involves arithmetic with magic numbers that can be difficult
> > to reason about. e.g. ... >= sysctl_max_map_count - 3
> >
> > To improve readability and centralize the logic, introduce a new helper,
> > vma_count_remaining(). This function returns the VMA count headroom
> > available for a given process.
> >
> > The most common case of checking for a single new VMA can be done
> > with a simple zero check:
> >
> > if (!vma_count_remaining(mm))
> >
> > And the complex checks in mremap.c become clearer by expressing the
> > required capacity directly:
> >
> > if (vma_count_remaining(mm) < 4)
>
> Double space after 4.
Will fix in the resend.
>
> >
> > While a capacity-based function could be misused (e.g., with an
> > incorrect '<' vs '<=' comparison), the improved readability at the call
> > sites makes such errors less likely than with the previous open-coded
> > arithmetic.
> >
> > As part of this change, sysctl_max_map_count is made static to
> > mm/mmap.c to improve encapsulation.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
>
> Generally logic looks ok, with various stuff fixed below + in commit msg we
> should be good to go :)
>
> > ---
> >
> > Changes in v2:
> > - Fix documentation comment for vma_count_remaining(), per Mike
> > - Remove extern in header, per Mike and Pedro
> > - Move declaration to mm/internal.h, per Mike
> > - Replace exceeds_max_map_count() with capacity-based vma_count_remaining(),
> > per Lorenzo.
> > - Fix tools/testing/vma, per Lorenzo.
> >
> > include/linux/mm.h | 2 --
> > mm/internal.h | 2 ++
> > mm/mmap.c | 21 ++++++++++++++++++++-
> > mm/mremap.c | 7 ++++---
> > mm/nommu.c | 2 +-
> > mm/util.c | 1 -
> > mm/vma.c | 10 +++++-----
> > tools/testing/vma/vma_internal.h | 9 +++++++++
> > 8 files changed, 41 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1ae97a0b8ec7..138bab2988f8 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -192,8 +192,6 @@ static inline void __mm_zero_struct_page(struct page *page)
> > #define MAPCOUNT_ELF_CORE_MARGIN (5)
> > #define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
> >
> > -extern int sysctl_max_map_count;
> > -
>
> Nice to get rid of this as a global! :)
>
> > extern unsigned long sysctl_user_reserve_kbytes;
> > extern unsigned long sysctl_admin_reserve_kbytes;
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 45b725c3dc03..39f1c9535ae5 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1661,4 +1661,6 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
> > void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
> > int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
> >
> > +int vma_count_remaining(const struct mm_struct *mm);
> > +
> > #endif /* __MM_INTERNAL_H */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index e5370e7fcd8f..af88ce1fbb5f 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -374,7 +374,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > return -EOVERFLOW;
> >
> > /* Too many mappings? */
> > - if (mm->map_count >= sysctl_max_map_count)
> > + if (!vma_count_remaining(mm))
> > return -ENOMEM;
> >
> > /*
> > @@ -1504,6 +1504,25 @@ struct vm_area_struct *_install_special_mapping(
> > int sysctl_legacy_va_layout;
> > #endif
> >
> > +static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> > +
> > +/**
> > + * vma_count_remaining - Determine available VMA slots
> > + * @mm: The memory descriptor for the process.
> > + *
> > + * Check how many more VMAs can be created for the given @mm
> > + * before hitting the sysctl_max_map_count limit.
> > + *
> > + * Return: The number of new VMAs the process can accommodate.
> > + */
> > +int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > + const int map_count = mm->map_count;
> > + const int max_count = sysctl_max_map_count;
>
> David already commented on the READ_ONCE() here, seems wise.
>
> > +
> > + return (max_count > map_count) ? (max_count - map_count) : 0;
>
> Not a big deal but would prefer:
>
> if (map_count >= max_count)
> return 0;
>
> return max_count - map_count;
>
> As the ternary here is a bit less clear, and it puts the 'failure' case first
> and the 'success' case afterwards.
>
> > +}
> > +
>
> As discussed in reply to the kernel bot, you've accidentally placed this in a
> CONFIG_SYSCTL block, so need to move it :>)
Yeah I unintentionally placed it there, I've moved this out in a newer
version. Thanks.
>
> > static const struct ctl_table mmap_table[] = {
> > {
> > .procname = "max_map_count",
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 35de0a7b910e..14d35d87e89b 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -1040,7 +1040,7 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
> > * We'd prefer to avoid failure later on in do_munmap:
> > * which may split one vma into three before unmapping.
> > */
> > - if (current->mm->map_count >= sysctl_max_map_count - 3)
> > + if (vma_count_remaining(current->mm) < 4)
> > return -ENOMEM;
>
> This is much clearer.
>
> >
> > if (vma->vm_ops && vma->vm_ops->may_split) {
> > @@ -1814,9 +1814,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> > * split in 3 before unmapping it.
> > * That means 2 more maps (1 for each) to the ones we already hold.
> > * Check whether current map count plus 2 still leads us to 4 maps below
> > - * the threshold, otherwise return -ENOMEM here to be more safe.
> > + * the threshold. In other words, is the current map count + 6 at or
> > + * below the threshold? Otherwise return -ENOMEM here to be more safe.
> > */
> > - if ((current->mm->map_count + 2) >= sysctl_max_map_count - 3)
> > + if (vma_count_remaining(current->mm) < 6)
> > return -ENOMEM;
>
> I hate that we do this silly check here, but the time to revisit it is another
> series...
Agreed, I also wondered if this check is more conservative than it
needs to be in all cases.
-- Kalesh
>
> >
> > return 0;
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 8b819fafd57b..dd75f2334812 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -1316,7 +1316,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > return -ENOMEM;
> >
> > mm = vma->vm_mm;
> > - if (mm->map_count >= sysctl_max_map_count)
> > + if (!vma_count_remaining(mm))
> > return -ENOMEM;
> >
> > region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
> > diff --git a/mm/util.c b/mm/util.c
> > index f814e6a59ab1..b6e83922cafe 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -751,7 +751,6 @@ EXPORT_SYMBOL(folio_mc_copy);
> > int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
> > static int sysctl_overcommit_ratio __read_mostly = 50;
> > static unsigned long sysctl_overcommit_kbytes __read_mostly;
> > -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> > unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
> > unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 033a388bc4b1..df0e8409f63d 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -491,8 +491,8 @@ void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > }
> >
> > /*
> > - * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
> > - * has already been checked or doesn't make sense to fail.
> > + * __split_vma() bypasses vma_count_remaining() checks. We use this where
> > + * it has already been checked or doesn't make sense to fail.
> > * VMA Iterator will point to the original VMA.
> > */
> > static __must_check int
> > @@ -592,7 +592,7 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > unsigned long addr, int new_below)
> > {
> > - if (vma->vm_mm->map_count >= sysctl_max_map_count)
> > + if (!vma_count_remaining(vma->vm_mm))
> > return -ENOMEM;
> >
> > return __split_vma(vmi, vma, addr, new_below);
> > @@ -1345,7 +1345,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > * its limit temporarily, to help free resources as expected.
> > */
> > if (vms->end < vms->vma->vm_end &&
> > - vms->vma->vm_mm->map_count >= sysctl_max_map_count) {
> > + !vma_count_remaining(vms->vma->vm_mm)) {
> > error = -ENOMEM;
> > goto map_count_exceeded;
> > }
> > @@ -2772,7 +2772,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
> > return -ENOMEM;
> >
> > - if (mm->map_count >= sysctl_max_map_count)
> > + if (!vma_count_remaining(mm))
> > return -ENOMEM;
> >
> > if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 3639aa8dd2b0..52cd7ddc73f4 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -1517,4 +1517,13 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> > return vm_flags;
> > }
> >
> > +/* Helper to get VMA count capacity */
> > +static int vma_count_remaining(const struct mm_struct *mm)
> > +{
> > + const int map_count = mm->map_count;
> > + const int max_count = sysctl_max_map_count;
> > +
> > + return (max_count > map_count) ? (max_count - map_count) : 0;
> > +}
> > +
> > #endif /* __MM_VMA_INTERNAL_H */
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
` (2 preceding siblings ...)
2025-09-15 16:36 ` [PATCH v2 3/7] mm: introduce vma_count_remaining() Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
2025-09-17 13:41 ` David Hildenbrand
` (2 more replies)
2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
` (3 subsequent siblings)
7 siblings, 3 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
A mechanical rename of the mm_struct->map_count field to
vma_count; no functional change is intended.
The name "map_count" is ambiguous within the memory management subsystem,
as it can be confused with the folio/page->_mapcount field, which tracks
PTE references.
The new name, vma_count, is more precise as this field has always
counted the number of vm_area_structs associated with an mm_struct.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- map_count is easily confused with _mapcount; rename to vma_count, per David
fs/binfmt_elf.c | 2 +-
fs/coredump.c | 2 +-
include/linux/mm_types.h | 2 +-
kernel/fork.c | 2 +-
mm/debug.c | 2 +-
mm/mmap.c | 6 +++---
mm/nommu.c | 6 +++---
mm/vma.c | 24 ++++++++++++------------
tools/testing/vma/vma.c | 32 ++++++++++++++++----------------
tools/testing/vma/vma_internal.h | 6 +++---
10 files changed, 42 insertions(+), 42 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 264fba0d44bd..52449dec12cb 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1643,7 +1643,7 @@ static int fill_files_note(struct memelfnote *note, struct coredump_params *cprm
data[0] = count;
data[1] = PAGE_SIZE;
/*
- * Count usually is less than mm->map_count,
+ * Count usually is less than mm->vma_count,
* we need to move filenames down.
*/
n = cprm->vma_count - count;
diff --git a/fs/coredump.c b/fs/coredump.c
index 60bc9685e149..8881459c53d9 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1731,7 +1731,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
cprm->vma_data_size = 0;
gate_vma = get_gate_vma(mm);
- cprm->vma_count = mm->map_count + (gate_vma ? 1 : 0);
+ cprm->vma_count = mm->vma_count + (gate_vma ? 1 : 0);
cprm->vma_meta = kvmalloc_array(cprm->vma_count, sizeof(*cprm->vma_meta), GFP_KERNEL);
if (!cprm->vma_meta) {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 08bc2442db93..4343be2f9e85 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1020,7 +1020,7 @@ struct mm_struct {
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* size of all page tables */
#endif
- int map_count; /* number of VMAs */
+ int vma_count; /* number of VMAs */
spinlock_t page_table_lock; /* Protects page tables and some
* counters
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd..8fcbbf947579 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
mm_pgtables_bytes_init(mm);
- mm->map_count = 0;
+ mm->vma_count = 0;
mm->locked_vm = 0;
atomic64_set(&mm->pinned_vm, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
diff --git a/mm/debug.c b/mm/debug.c
index b4388f4dcd4d..40fc9425a84a 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -204,7 +204,7 @@ void dump_mm(const struct mm_struct *mm)
mm->pgd, atomic_read(&mm->mm_users),
atomic_read(&mm->mm_count),
mm_pgtables_bytes(mm),
- mm->map_count,
+ mm->vma_count,
mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
(u64)atomic64_read(&mm->pinned_vm),
mm->data_vm, mm->exec_vm, mm->stack_vm,
diff --git a/mm/mmap.c b/mm/mmap.c
index af88ce1fbb5f..c6769394a174 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm)
vma = vma_next(&vmi);
} while (vma && likely(!xa_is_zero(vma)));
- BUG_ON(count != mm->map_count);
+ BUG_ON(count != mm->vma_count);
trace_exit_mmap(mm);
destroy:
@@ -1517,7 +1517,7 @@ static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
*/
int vma_count_remaining(const struct mm_struct *mm)
{
- const int map_count = mm->map_count;
+ const int map_count = mm->vma_count;
const int max_count = sysctl_max_map_count;
return (max_count > map_count) ? (max_count - map_count) : 0;
@@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
*/
vma_iter_bulk_store(&vmi, tmp);
- mm->map_count++;
+ mm->vma_count++;
if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
diff --git a/mm/nommu.c b/mm/nommu.c
index dd75f2334812..9ab2e5ca736d 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -576,7 +576,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
static void cleanup_vma_from_mm(struct vm_area_struct *vma)
{
- vma->vm_mm->map_count--;
+ vma->vm_mm->vma_count--;
/* remove the VMA from the mapping */
if (vma->vm_file) {
struct address_space *mapping;
@@ -1198,7 +1198,7 @@ unsigned long do_mmap(struct file *file,
goto error_just_free;
setup_vma_to_mm(vma, current->mm);
- current->mm->map_count++;
+ current->mm->vma_count++;
/* add the VMA to the tree */
vma_iter_store_new(&vmi, vma);
@@ -1366,7 +1366,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
setup_vma_to_mm(vma, mm);
setup_vma_to_mm(new, mm);
vma_iter_store_new(vmi, new);
- mm->map_count++;
+ mm->vma_count++;
return 0;
err_vmi_preallocate:
diff --git a/mm/vma.c b/mm/vma.c
index df0e8409f63d..64f4e7c867c3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
* (it may either follow vma or precede it).
*/
vma_iter_store_new(vmi, vp->insert);
- mm->map_count++;
+ mm->vma_count++;
}
if (vp->anon_vma) {
@@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
}
if (vp->remove->anon_vma)
anon_vma_merge(vp->vma, vp->remove);
- mm->map_count--;
+ mm->vma_count--;
mpol_put(vma_policy(vp->remove));
if (!vp->remove2)
WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
@@ -683,13 +683,13 @@ void validate_mm(struct mm_struct *mm)
}
#endif
/* Check for a infinite loop */
- if (++i > mm->map_count + 10) {
+ if (++i > mm->vma_count + 10) {
i = -1;
break;
}
}
- if (i != mm->map_count) {
- pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i);
+ if (i != mm->vma_count) {
+ pr_emerg("vma_count %d vma iterator %d\n", mm->vma_count, i);
bug = 1;
}
VM_BUG_ON_MM(bug, mm);
@@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct mm_struct *mm;
mm = current->mm;
- mm->map_count -= vms->vma_count;
+ mm->vma_count -= vms->vma_count;
mm->locked_vm -= vms->locked_vm;
if (vms->unlock)
mmap_write_downgrade(mm);
@@ -1340,14 +1340,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->start > vms->vma->vm_start) {
/*
- * Make sure that map_count on return from munmap() will
+ * Make sure that vma_count on return from munmap() will
* not exceed its limit; but let map_count go just above
* its limit temporarily, to help free resources as expected.
*/
if (vms->end < vms->vma->vm_end &&
!vma_count_remaining(vms->vma->vm_mm)) {
error = -ENOMEM;
- goto map_count_exceeded;
+ goto vma_count_exceeded;
}
/* Don't bother splitting the VMA if we can't unmap it anyway */
@@ -1461,7 +1461,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
modify_vma_failed:
reattach_vmas(mas_detach);
start_split_failed:
-map_count_exceeded:
+vma_count_exceeded:
return error;
}
@@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
vma_start_write(vma);
vma_iter_store_new(&vmi, vma);
vma_link_file(vma);
- mm->map_count++;
+ mm->vma_count++;
validate_mm(mm);
return 0;
}
@@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
vma_iter_store_new(vmi, vma);
- map->mm->map_count++;
+ map->mm->vma_count++;
vma_link_file(vma);
/*
@@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
goto mas_store_fail;
- mm->map_count++;
+ mm->vma_count++;
validate_mm(mm);
out:
perf_event_mmap(vma);
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 656e1c75b711..69fa7d14a6c2 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
}
mtree_destroy(&mm->mm_mt);
- mm->map_count = 0;
+ mm->vma_count = 0;
return count;
}
@@ -500,7 +500,7 @@ static bool test_merge_new(void)
INIT_LIST_HEAD(&vma_d->anon_vma_chain);
list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain);
ASSERT_FALSE(merged);
- ASSERT_EQ(mm.map_count, 4);
+ ASSERT_EQ(mm.vma_count, 4);
/*
* Merge BOTH sides.
@@ -519,7 +519,7 @@ static bool test_merge_new(void)
ASSERT_EQ(vma->vm_pgoff, 0);
ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 3);
+ ASSERT_EQ(mm.vma_count, 3);
/*
* Merge to PREVIOUS VMA.
@@ -536,7 +536,7 @@ static bool test_merge_new(void)
ASSERT_EQ(vma->vm_pgoff, 0);
ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 3);
+ ASSERT_EQ(mm.vma_count, 3);
/*
* Merge to NEXT VMA.
@@ -555,7 +555,7 @@ static bool test_merge_new(void)
ASSERT_EQ(vma->vm_pgoff, 6);
ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 3);
+ ASSERT_EQ(mm.vma_count, 3);
/*
* Merge BOTH sides.
@@ -573,7 +573,7 @@ static bool test_merge_new(void)
ASSERT_EQ(vma->vm_pgoff, 0);
ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 2);
+ ASSERT_EQ(mm.vma_count, 2);
/*
* Merge to NEXT VMA.
@@ -591,7 +591,7 @@ static bool test_merge_new(void)
ASSERT_EQ(vma->vm_pgoff, 0xa);
ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 2);
+ ASSERT_EQ(mm.vma_count, 2);
/*
* Merge BOTH sides.
@@ -608,7 +608,7 @@ static bool test_merge_new(void)
ASSERT_EQ(vma->vm_pgoff, 0);
ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 1);
+ ASSERT_EQ(mm.vma_count, 1);
/*
* Final state.
@@ -967,7 +967,7 @@ static bool test_vma_merge_new_with_close(void)
ASSERT_EQ(vma->vm_pgoff, 0);
ASSERT_EQ(vma->vm_ops, &vm_ops);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 2);
+ ASSERT_EQ(mm.vma_count, 2);
cleanup_mm(&mm, &vmi);
return true;
@@ -1017,7 +1017,7 @@ static bool test_merge_existing(void)
ASSERT_EQ(vma->vm_pgoff, 2);
ASSERT_TRUE(vma_write_started(vma));
ASSERT_TRUE(vma_write_started(vma_next));
- ASSERT_EQ(mm.map_count, 2);
+ ASSERT_EQ(mm.vma_count, 2);
/* Clear down and reset. */
ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1045,7 +1045,7 @@ static bool test_merge_existing(void)
ASSERT_EQ(vma_next->vm_pgoff, 2);
ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma_next));
- ASSERT_EQ(mm.map_count, 1);
+ ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted vma. */
ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1079,7 +1079,7 @@ static bool test_merge_existing(void)
ASSERT_EQ(vma->vm_pgoff, 6);
ASSERT_TRUE(vma_write_started(vma_prev));
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 2);
+ ASSERT_EQ(mm.vma_count, 2);
/* Clear down and reset. */
ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
@@ -1108,7 +1108,7 @@ static bool test_merge_existing(void)
ASSERT_EQ(vma_prev->vm_pgoff, 0);
ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma_prev));
- ASSERT_EQ(mm.map_count, 1);
+ ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted vma. */
ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1138,7 +1138,7 @@ static bool test_merge_existing(void)
ASSERT_EQ(vma_prev->vm_pgoff, 0);
ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
ASSERT_TRUE(vma_write_started(vma_prev));
- ASSERT_EQ(mm.map_count, 1);
+ ASSERT_EQ(mm.vma_count, 1);
/* Clear down and reset. We should have deleted prev and next. */
ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
@@ -1540,7 +1540,7 @@ static bool test_merge_extend(void)
ASSERT_EQ(vma->vm_end, 0x4000);
ASSERT_EQ(vma->vm_pgoff, 0);
ASSERT_TRUE(vma_write_started(vma));
- ASSERT_EQ(mm.map_count, 1);
+ ASSERT_EQ(mm.vma_count, 1);
cleanup_mm(&mm, &vmi);
return true;
@@ -1652,7 +1652,7 @@ static bool test_mmap_region_basic(void)
0x24d, NULL);
ASSERT_EQ(addr, 0x24d000);
- ASSERT_EQ(mm.map_count, 2);
+ ASSERT_EQ(mm.vma_count, 2);
for_each_vma(vmi, vma) {
if (vma->vm_start == 0x300000) {
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 52cd7ddc73f4..15525b86145d 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -251,7 +251,7 @@ struct mutex {};
struct mm_struct {
struct maple_tree mm_mt;
- int map_count; /* number of VMAs */
+ int vma_count; /* number of VMAs */
unsigned long total_vm; /* Total pages mapped */
unsigned long locked_vm; /* Pages that have PG_mlocked set */
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
@@ -1520,10 +1520,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
/* Helper to get VMA count capacity */
static int vma_count_remaining(const struct mm_struct *mm)
{
- const int map_count = mm->map_count;
+ const int vma_count = mm->vma_count;
const int max_count = sysctl_max_map_count;
- return (max_count > map_count) ? (max_count - map_count) : 0;
+ return (max_count > vma_count) ? (max_count - vma_count) : 0;
}
#endif /* __MM_VMA_INTERNAL_H */
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related	[flat|nested] 65+ messages in thread
* Re: [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
@ 2025-09-17 13:41 ` David Hildenbrand
2025-09-18 11:46 ` Pedro Falcato
2025-09-18 14:48 ` Lorenzo Stoakes
2 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-09-17 13:41 UTC (permalink / raw)
To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On 15.09.25 18:36, Kalesh Singh wrote:
> A mechanical rename of the mm_struct->map_count field to
> vma_count; no functional change is intended.
>
> The name "map_count" is ambiguous within the memory management subsystem,
> as it can be confused with the folio/page->_mapcount field, which tracks
> PTE references.
>
> The new name, vma_count, is more precise as this field has always
> counted the number of vm_area_structs associated with an mm_struct.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
[...]
> +++ b/mm/mmap.c
> @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm)
> vma = vma_next(&vmi);
> } while (vma && likely(!xa_is_zero(vma)));
>
> - BUG_ON(count != mm->map_count);
> + BUG_ON(count != mm->vma_count);
>
While at it, best to change that to a WARN_ON_ONCE() or even a
VM_WARN_ON_ONCE().
[ or remove it -- have we ever seen that firing? ]
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
2025-09-17 13:41 ` David Hildenbrand
@ 2025-09-18 11:46 ` Pedro Falcato
2025-09-18 14:48 ` Lorenzo Stoakes
2 siblings, 0 replies; 65+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:46 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Mon, Sep 15, 2025 at 09:36:35AM -0700, Kalesh Singh wrote:
> A mechanical rename of the mm_struct->map_count field to
> vma_count; no functional change is intended.
>
> The name "map_count" is ambiguous within the memory management subsystem,
> as it can be confused with the folio/page->_mapcount field, which tracks
> PTE references.
>
> The new name, vma_count, is more precise as this field has always
> counted the number of vm_area_structs associated with an mm_struct.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
--
Pedro
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count
2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
2025-09-17 13:41 ` David Hildenbrand
2025-09-18 11:46 ` Pedro Falcato
@ 2025-09-18 14:48 ` Lorenzo Stoakes
2 siblings, 0 replies; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:48 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:35AM -0700, Kalesh Singh wrote:
> A mechanical rename of the mm_struct->map_count field to
> vma_count; no functional change is intended.
>
> The name "map_count" is ambiguous within the memory management subsystem,
> as it can be confused with the folio/page->_mapcount field, which tracks
> PTE references.
>
> The new name, vma_count, is more precise as this field has always
> counted the number of vm_area_structs associated with an mm_struct.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Yeah this is nice thanks, LGTM so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
One small comment below, but I think that change can wait for another series
probably.
> ---
>
> Changes in v2:
> - map_count is easily confused with _mapcount rename to vma_count, per David
>
> fs/binfmt_elf.c | 2 +-
> fs/coredump.c | 2 +-
> include/linux/mm_types.h | 2 +-
> kernel/fork.c | 2 +-
> mm/debug.c | 2 +-
> mm/mmap.c | 6 +++---
> mm/nommu.c | 6 +++---
> mm/vma.c | 24 ++++++++++++------------
> tools/testing/vma/vma.c | 32 ++++++++++++++++----------------
> tools/testing/vma/vma_internal.h | 6 +++---
> 10 files changed, 42 insertions(+), 42 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 264fba0d44bd..52449dec12cb 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1643,7 +1643,7 @@ static int fill_files_note(struct memelfnote *note, struct coredump_params *cprm
> data[0] = count;
> data[1] = PAGE_SIZE;
> /*
> - * Count usually is less than mm->map_count,
> + * Count usually is less than mm->vma_count,
> * we need to move filenames down.
> */
> n = cprm->vma_count - count;
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 60bc9685e149..8881459c53d9 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -1731,7 +1731,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
>
> cprm->vma_data_size = 0;
> gate_vma = get_gate_vma(mm);
> - cprm->vma_count = mm->map_count + (gate_vma ? 1 : 0);
> + cprm->vma_count = mm->vma_count + (gate_vma ? 1 : 0);
>
> cprm->vma_meta = kvmalloc_array(cprm->vma_count, sizeof(*cprm->vma_meta), GFP_KERNEL);
> if (!cprm->vma_meta) {
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 08bc2442db93..4343be2f9e85 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1020,7 +1020,7 @@ struct mm_struct {
> #ifdef CONFIG_MMU
> atomic_long_t pgtables_bytes; /* size of all page tables */
> #endif
> - int map_count; /* number of VMAs */
> + int vma_count; /* number of VMAs */
>
> spinlock_t page_table_lock; /* Protects page tables and some
> * counters
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4ada32598bd..8fcbbf947579 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> mmap_init_lock(mm);
> INIT_LIST_HEAD(&mm->mmlist);
> mm_pgtables_bytes_init(mm);
> - mm->map_count = 0;
> + mm->vma_count = 0;
> mm->locked_vm = 0;
> atomic64_set(&mm->pinned_vm, 0);
> memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> diff --git a/mm/debug.c b/mm/debug.c
> index b4388f4dcd4d..40fc9425a84a 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -204,7 +204,7 @@ void dump_mm(const struct mm_struct *mm)
> mm->pgd, atomic_read(&mm->mm_users),
> atomic_read(&mm->mm_count),
> mm_pgtables_bytes(mm),
> - mm->map_count,
> + mm->vma_count,
> mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
> (u64)atomic64_read(&mm->pinned_vm),
> mm->data_vm, mm->exec_vm, mm->stack_vm,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index af88ce1fbb5f..c6769394a174 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1308,7 +1308,7 @@ void exit_mmap(struct mm_struct *mm)
> vma = vma_next(&vmi);
> } while (vma && likely(!xa_is_zero(vma)));
>
> - BUG_ON(count != mm->map_count);
> + BUG_ON(count != mm->vma_count);
Be nice to switch this to a WARN_ON_ONCE()... think David mentioned it though.
But maybe not one for this series...
>
> trace_exit_mmap(mm);
> destroy:
> @@ -1517,7 +1517,7 @@ static int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> */
> int vma_count_remaining(const struct mm_struct *mm)
> {
> - const int map_count = mm->map_count;
> + const int map_count = mm->vma_count;
> const int max_count = sysctl_max_map_count;
>
> return (max_count > map_count) ? (max_count - map_count) : 0;
> @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> */
> vma_iter_bulk_store(&vmi, tmp);
>
> - mm->map_count++;
> + mm->vma_count++;
>
> if (tmp->vm_ops && tmp->vm_ops->open)
> tmp->vm_ops->open(tmp);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index dd75f2334812..9ab2e5ca736d 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -576,7 +576,7 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
>
> static void cleanup_vma_from_mm(struct vm_area_struct *vma)
> {
> - vma->vm_mm->map_count--;
> + vma->vm_mm->vma_count--;
> /* remove the VMA from the mapping */
> if (vma->vm_file) {
> struct address_space *mapping;
> @@ -1198,7 +1198,7 @@ unsigned long do_mmap(struct file *file,
> goto error_just_free;
>
> setup_vma_to_mm(vma, current->mm);
> - current->mm->map_count++;
> + current->mm->vma_count++;
> /* add the VMA to the tree */
> vma_iter_store_new(&vmi, vma);
>
> @@ -1366,7 +1366,7 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> setup_vma_to_mm(vma, mm);
> setup_vma_to_mm(new, mm);
> vma_iter_store_new(vmi, new);
> - mm->map_count++;
> + mm->vma_count++;
> return 0;
>
> err_vmi_preallocate:
> diff --git a/mm/vma.c b/mm/vma.c
> index df0e8409f63d..64f4e7c867c3 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> * (it may either follow vma or precede it).
> */
> vma_iter_store_new(vmi, vp->insert);
> - mm->map_count++;
> + mm->vma_count++;
> }
>
> if (vp->anon_vma) {
> @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> }
> if (vp->remove->anon_vma)
> anon_vma_merge(vp->vma, vp->remove);
> - mm->map_count--;
> + mm->vma_count--;
> mpol_put(vma_policy(vp->remove));
> if (!vp->remove2)
> WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
> @@ -683,13 +683,13 @@ void validate_mm(struct mm_struct *mm)
> }
> #endif
> /* Check for a infinite loop */
> - if (++i > mm->map_count + 10) {
> + if (++i > mm->vma_count + 10) {
> i = -1;
> break;
> }
> }
> - if (i != mm->map_count) {
> - pr_emerg("map_count %d vma iterator %d\n", mm->map_count, i);
> + if (i != mm->vma_count) {
> + pr_emerg("vma_count %d vma iterator %d\n", mm->vma_count, i);
> bug = 1;
> }
> VM_BUG_ON_MM(bug, mm);
> @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> struct mm_struct *mm;
>
> mm = current->mm;
> - mm->map_count -= vms->vma_count;
> + mm->vma_count -= vms->vma_count;
> mm->locked_vm -= vms->locked_vm;
> if (vms->unlock)
> mmap_write_downgrade(mm);
> @@ -1340,14 +1340,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> if (vms->start > vms->vma->vm_start) {
>
> /*
> - * Make sure that map_count on return from munmap() will
> + * Make sure that vma_count on return from munmap() will
> * not exceed its limit; but let map_count go just above
> * its limit temporarily, to help free resources as expected.
> */
> if (vms->end < vms->vma->vm_end &&
> !vma_count_remaining(vms->vma->vm_mm)) {
> error = -ENOMEM;
> - goto map_count_exceeded;
> + goto vma_count_exceeded;
> }
>
> /* Don't bother splitting the VMA if we can't unmap it anyway */
> @@ -1461,7 +1461,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> modify_vma_failed:
> reattach_vmas(mas_detach);
> start_split_failed:
> -map_count_exceeded:
> +vma_count_exceeded:
> return error;
> }
>
> @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> vma_start_write(vma);
> vma_iter_store_new(&vmi, vma);
> vma_link_file(vma);
> - mm->map_count++;
> + mm->vma_count++;
> validate_mm(mm);
> return 0;
> }
> @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> /* Lock the VMA since it is modified after insertion into VMA tree */
> vma_start_write(vma);
> vma_iter_store_new(vmi, vma);
> - map->mm->map_count++;
> + map->mm->vma_count++;
> vma_link_file(vma);
>
> /*
> @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> goto mas_store_fail;
>
> - mm->map_count++;
> + mm->vma_count++;
> validate_mm(mm);
> out:
> perf_event_mmap(vma);
> diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> index 656e1c75b711..69fa7d14a6c2 100644
> --- a/tools/testing/vma/vma.c
> +++ b/tools/testing/vma/vma.c
> @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
> }
>
> mtree_destroy(&mm->mm_mt);
> - mm->map_count = 0;
> + mm->vma_count = 0;
> return count;
> }
>
> @@ -500,7 +500,7 @@ static bool test_merge_new(void)
> INIT_LIST_HEAD(&vma_d->anon_vma_chain);
> list_add(&dummy_anon_vma_chain_d.same_vma, &vma_d->anon_vma_chain);
> ASSERT_FALSE(merged);
> - ASSERT_EQ(mm.map_count, 4);
> + ASSERT_EQ(mm.vma_count, 4);
>
> /*
> * Merge BOTH sides.
> @@ -519,7 +519,7 @@ static bool test_merge_new(void)
> ASSERT_EQ(vma->vm_pgoff, 0);
> ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 3);
> + ASSERT_EQ(mm.vma_count, 3);
>
> /*
> * Merge to PREVIOUS VMA.
> @@ -536,7 +536,7 @@ static bool test_merge_new(void)
> ASSERT_EQ(vma->vm_pgoff, 0);
> ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 3);
> + ASSERT_EQ(mm.vma_count, 3);
>
> /*
> * Merge to NEXT VMA.
> @@ -555,7 +555,7 @@ static bool test_merge_new(void)
> ASSERT_EQ(vma->vm_pgoff, 6);
> ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 3);
> + ASSERT_EQ(mm.vma_count, 3);
>
> /*
> * Merge BOTH sides.
> @@ -573,7 +573,7 @@ static bool test_merge_new(void)
> ASSERT_EQ(vma->vm_pgoff, 0);
> ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 2);
> + ASSERT_EQ(mm.vma_count, 2);
>
> /*
> * Merge to NEXT VMA.
> @@ -591,7 +591,7 @@ static bool test_merge_new(void)
> ASSERT_EQ(vma->vm_pgoff, 0xa);
> ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 2);
> + ASSERT_EQ(mm.vma_count, 2);
>
> /*
> * Merge BOTH sides.
> @@ -608,7 +608,7 @@ static bool test_merge_new(void)
> ASSERT_EQ(vma->vm_pgoff, 0);
> ASSERT_EQ(vma->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 1);
> + ASSERT_EQ(mm.vma_count, 1);
>
> /*
> * Final state.
> @@ -967,7 +967,7 @@ static bool test_vma_merge_new_with_close(void)
> ASSERT_EQ(vma->vm_pgoff, 0);
> ASSERT_EQ(vma->vm_ops, &vm_ops);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 2);
> + ASSERT_EQ(mm.vma_count, 2);
>
> cleanup_mm(&mm, &vmi);
> return true;
> @@ -1017,7 +1017,7 @@ static bool test_merge_existing(void)
> ASSERT_EQ(vma->vm_pgoff, 2);
> ASSERT_TRUE(vma_write_started(vma));
> ASSERT_TRUE(vma_write_started(vma_next));
> - ASSERT_EQ(mm.map_count, 2);
> + ASSERT_EQ(mm.vma_count, 2);
>
> /* Clear down and reset. */
> ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> @@ -1045,7 +1045,7 @@ static bool test_merge_existing(void)
> ASSERT_EQ(vma_next->vm_pgoff, 2);
> ASSERT_EQ(vma_next->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma_next));
> - ASSERT_EQ(mm.map_count, 1);
> + ASSERT_EQ(mm.vma_count, 1);
>
> /* Clear down and reset. We should have deleted vma. */
> ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1079,7 +1079,7 @@ static bool test_merge_existing(void)
> ASSERT_EQ(vma->vm_pgoff, 6);
> ASSERT_TRUE(vma_write_started(vma_prev));
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 2);
> + ASSERT_EQ(mm.vma_count, 2);
>
> /* Clear down and reset. */
> ASSERT_EQ(cleanup_mm(&mm, &vmi), 2);
> @@ -1108,7 +1108,7 @@ static bool test_merge_existing(void)
> ASSERT_EQ(vma_prev->vm_pgoff, 0);
> ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma_prev));
> - ASSERT_EQ(mm.map_count, 1);
> + ASSERT_EQ(mm.vma_count, 1);
>
> /* Clear down and reset. We should have deleted vma. */
> ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1138,7 +1138,7 @@ static bool test_merge_existing(void)
> ASSERT_EQ(vma_prev->vm_pgoff, 0);
> ASSERT_EQ(vma_prev->anon_vma, &dummy_anon_vma);
> ASSERT_TRUE(vma_write_started(vma_prev));
> - ASSERT_EQ(mm.map_count, 1);
> + ASSERT_EQ(mm.vma_count, 1);
>
> /* Clear down and reset. We should have deleted prev and next. */
> ASSERT_EQ(cleanup_mm(&mm, &vmi), 1);
> @@ -1540,7 +1540,7 @@ static bool test_merge_extend(void)
> ASSERT_EQ(vma->vm_end, 0x4000);
> ASSERT_EQ(vma->vm_pgoff, 0);
> ASSERT_TRUE(vma_write_started(vma));
> - ASSERT_EQ(mm.map_count, 1);
> + ASSERT_EQ(mm.vma_count, 1);
>
> cleanup_mm(&mm, &vmi);
> return true;
> @@ -1652,7 +1652,7 @@ static bool test_mmap_region_basic(void)
> 0x24d, NULL);
> ASSERT_EQ(addr, 0x24d000);
>
> - ASSERT_EQ(mm.map_count, 2);
> + ASSERT_EQ(mm.vma_count, 2);
>
> for_each_vma(vmi, vma) {
> if (vma->vm_start == 0x300000) {
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 52cd7ddc73f4..15525b86145d 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -251,7 +251,7 @@ struct mutex {};
>
> struct mm_struct {
> struct maple_tree mm_mt;
> - int map_count; /* number of VMAs */
> + int vma_count; /* number of VMAs */
> unsigned long total_vm; /* Total pages mapped */
> unsigned long locked_vm; /* Pages that have PG_mlocked set */
> unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
> @@ -1520,10 +1520,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
> /* Helper to get VMA count capacity */
> static int vma_count_remaining(const struct mm_struct *mm)
> {
> - const int map_count = mm->map_count;
> + const int vma_count = mm->vma_count;
> const int max_count = sysctl_max_map_count;
>
> - return (max_count > map_count) ? (max_count - map_count) : 0;
> + return (max_count > vma_count) ? (max_count - vma_count) : 0;
> }
>
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 5/7] mm: harden vma_count against direct modification
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
` (3 preceding siblings ...)
2025-09-15 16:36 ` [PATCH v2 4/7] mm: rename mm_struct::map_count to vma_count Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
2025-09-18 14:52 ` Lorenzo Stoakes
2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
` (2 subsequent siblings)
7 siblings, 1 reply; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
To make VMA counting more robust, prevent direct modification of the
mm->vma_count field. This is achieved by making the public-facing
member const via a union and requiring all modifications to go
through a new set of helper functions that operate on a private
__vma_count.
While there are no other invariants tied to vma_count currently, this
structural change improves maintainability, as it creates a single,
centralized point for any future logic, such as adding debug checks
or updating related statistics (in subsequent patches).
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
include/linux/mm.h | 25 +++++++++++++++++++++++++
include/linux/mm_types.h | 5 ++++-
kernel/fork.c | 2 +-
mm/mmap.c | 2 +-
mm/vma.c | 12 ++++++------
tools/testing/vma/vma.c | 2 +-
tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++-
7 files changed, 67 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 138bab2988f8..8bad1454984c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+static inline void vma_count_init(struct mm_struct *mm)
+{
+ ACCESS_PRIVATE(mm, __vma_count) = 0;
+}
+
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+ ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
+}
+
+static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
+{
+ vma_count_add(mm, -nr_vmas);
+}
+
+static inline void vma_count_inc(struct mm_struct *mm)
+{
+ vma_count_add(mm, 1);
+}
+
+static inline void vma_count_dec(struct mm_struct *mm)
+{
+ vma_count_sub(mm, 1);
+}
+
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4343be2f9e85..2ea8fc722aa2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1020,7 +1020,10 @@ struct mm_struct {
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* size of all page tables */
#endif
- int vma_count; /* number of VMAs */
+ union {
+ const int vma_count; /* number of VMAs */
+ int __private __vma_count;
+ };
spinlock_t page_table_lock; /* Protects page tables and some
* counters
diff --git a/kernel/fork.c b/kernel/fork.c
index 8fcbbf947579..ea9eff416e51 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mmap_init_lock(mm);
INIT_LIST_HEAD(&mm->mmlist);
mm_pgtables_bytes_init(mm);
- mm->vma_count = 0;
+ vma_count_init(mm);
mm->locked_vm = 0;
atomic64_set(&mm->pinned_vm, 0);
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
diff --git a/mm/mmap.c b/mm/mmap.c
index c6769394a174..30ddd550197e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
*/
vma_iter_bulk_store(&vmi, tmp);
- mm->vma_count++;
+ vma_count_inc(mm);
if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
diff --git a/mm/vma.c b/mm/vma.c
index 64f4e7c867c3..0cd3cb472220 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
* (it may either follow vma or precede it).
*/
vma_iter_store_new(vmi, vp->insert);
- mm->vma_count++;
+ vma_count_inc(mm);
}
if (vp->anon_vma) {
@@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
}
if (vp->remove->anon_vma)
anon_vma_merge(vp->vma, vp->remove);
- mm->vma_count--;
+ vma_count_dec(mm);
mpol_put(vma_policy(vp->remove));
if (!vp->remove2)
WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
@@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct mm_struct *mm;
mm = current->mm;
- mm->vma_count -= vms->vma_count;
+ vma_count_sub(mm, vms->vma_count);
mm->locked_vm -= vms->locked_vm;
if (vms->unlock)
mmap_write_downgrade(mm);
@@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
vma_start_write(vma);
vma_iter_store_new(&vmi, vma);
vma_link_file(vma);
- mm->vma_count++;
+ vma_count_inc(mm);
validate_mm(mm);
return 0;
}
@@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
vma_iter_store_new(vmi, vma);
- map->mm->vma_count++;
+ vma_count_inc(map->mm);
vma_link_file(vma);
/*
@@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
goto mas_store_fail;
- mm->vma_count++;
+ vma_count_inc(mm);
validate_mm(mm);
out:
perf_event_mmap(vma);
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 69fa7d14a6c2..ee5a1e2365e0 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
}
mtree_destroy(&mm->mm_mt);
- mm->vma_count = 0;
+ vma_count_init(mm);
return count;
}
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 15525b86145d..6e724ba1adf4 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -251,7 +251,10 @@ struct mutex {};
struct mm_struct {
struct maple_tree mm_mt;
- int vma_count; /* number of VMAs */
+ union {
+ const int vma_count; /* number of VMAs */
+ int __vma_count;
+ };
unsigned long total_vm; /* Total pages mapped */
unsigned long locked_vm; /* Pages that have PG_mlocked set */
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
@@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm)
return (max_count > vma_count) ? (max_count - vma_count) : 0;
}
+static inline void vma_count_init(struct mm_struct *mm)
+{
+ mm->__vma_count = 0;
+}
+
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+ mm->__vma_count += nr_vmas;
+}
+
+static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
+{
+ vma_count_add(mm, -nr_vmas);
+}
+
+static inline void vma_count_inc(struct mm_struct *mm)
+{
+ vma_count_add(mm, 1);
+}
+
+static inline void vma_count_dec(struct mm_struct *mm)
+{
+ vma_count_sub(mm, 1);
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related	[flat|nested] 65+ messages in thread
* Re: [PATCH v2 5/7] mm: harden vma_count against direct modification
2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
@ 2025-09-18 14:52 ` Lorenzo Stoakes
2025-09-18 15:43 ` Kalesh Singh
0 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 14:52 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:36AM -0700, Kalesh Singh wrote:
> To make VMA counting more robust, prevent direct modification of the
> mm->vma_count field. This is achieved by making the public-facing
> member const via a union and requiring all modifications to go
> through a new set of helper functions that operate on a private
> __vma_count.
>
> While there are no other invariants tied to vma_count currently, this
> structural change improves maintainability, as it creates a single,
> centralized point for any future logic, such as adding debug checks
> or updating related statistics (in subsequent patches).
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Hmmm, I'm not sure about this one.
I think this is a 'we don't need it' situation, and it's making everything a bit
uglier.
I especially hate vma_count_add() and vma_count_sub(). You're essentially
overriding the whole concept in these cases to make stuff that's already in
place work in those cases.
I don't think this really adds much honestly.
(You're also clearly missing cases as the kernel bot has found issues)
> ---
> include/linux/mm.h | 25 +++++++++++++++++++++++++
> include/linux/mm_types.h | 5 ++++-
> kernel/fork.c | 2 +-
> mm/mmap.c | 2 +-
> mm/vma.c | 12 ++++++------
> tools/testing/vma/vma.c | 2 +-
> tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++-
> 7 files changed, 67 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 138bab2988f8..8bad1454984c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
>
> void snapshot_page(struct page_snapshot *ps, const struct page *page);
>
> +static inline void vma_count_init(struct mm_struct *mm)
> +{
> + ACCESS_PRIVATE(mm, __vma_count) = 0;
> +}
> +
> +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +{
> + ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> +}
> +
> +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> +{
> + vma_count_add(mm, -nr_vmas);
> +}
> +
> +static inline void vma_count_inc(struct mm_struct *mm)
> +{
> + vma_count_add(mm, 1);
> +}
> +
> +static inline void vma_count_dec(struct mm_struct *mm)
> +{
> + vma_count_sub(mm, 1);
> +}
> +
> #endif /* _LINUX_MM_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4343be2f9e85..2ea8fc722aa2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1020,7 +1020,10 @@ struct mm_struct {
> #ifdef CONFIG_MMU
> atomic_long_t pgtables_bytes; /* size of all page tables */
> #endif
> - int vma_count; /* number of VMAs */
> + union {
> + const int vma_count; /* number of VMAs */
> + int __private __vma_count;
> + };
>
> spinlock_t page_table_lock; /* Protects page tables and some
> * counters
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8fcbbf947579..ea9eff416e51 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> mmap_init_lock(mm);
> INIT_LIST_HEAD(&mm->mmlist);
> mm_pgtables_bytes_init(mm);
> - mm->vma_count = 0;
> + vma_count_init(mm);
> mm->locked_vm = 0;
> atomic64_set(&mm->pinned_vm, 0);
> memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c6769394a174..30ddd550197e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> */
> vma_iter_bulk_store(&vmi, tmp);
>
> - mm->vma_count++;
> + vma_count_inc(mm);
>
> if (tmp->vm_ops && tmp->vm_ops->open)
> tmp->vm_ops->open(tmp);
> diff --git a/mm/vma.c b/mm/vma.c
> index 64f4e7c867c3..0cd3cb472220 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> * (it may either follow vma or precede it).
> */
> vma_iter_store_new(vmi, vp->insert);
> - mm->vma_count++;
> + vma_count_inc(mm);
> }
>
> if (vp->anon_vma) {
> @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> }
> if (vp->remove->anon_vma)
> anon_vma_merge(vp->vma, vp->remove);
> - mm->vma_count--;
> + vma_count_dec(mm);
> mpol_put(vma_policy(vp->remove));
> if (!vp->remove2)
> WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
> @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> struct mm_struct *mm;
>
> mm = current->mm;
> - mm->vma_count -= vms->vma_count;
> + vma_count_sub(mm, vms->vma_count);
> mm->locked_vm -= vms->locked_vm;
> if (vms->unlock)
> mmap_write_downgrade(mm);
> @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> vma_start_write(vma);
> vma_iter_store_new(&vmi, vma);
> vma_link_file(vma);
> - mm->vma_count++;
> + vma_count_inc(mm);
> validate_mm(mm);
> return 0;
> }
> @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> /* Lock the VMA since it is modified after insertion into VMA tree */
> vma_start_write(vma);
> vma_iter_store_new(vmi, vma);
> - map->mm->vma_count++;
> + vma_count_inc(map->mm);
> vma_link_file(vma);
>
> /*
> @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> goto mas_store_fail;
>
> - mm->vma_count++;
> + vma_count_inc(mm);
> validate_mm(mm);
> out:
> perf_event_mmap(vma);
> diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> index 69fa7d14a6c2..ee5a1e2365e0 100644
> --- a/tools/testing/vma/vma.c
> +++ b/tools/testing/vma/vma.c
> @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
> }
>
> mtree_destroy(&mm->mm_mt);
> - mm->vma_count = 0;
> + vma_count_init(mm);
> return count;
> }
>
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 15525b86145d..6e724ba1adf4 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -251,7 +251,10 @@ struct mutex {};
>
> struct mm_struct {
> struct maple_tree mm_mt;
> - int vma_count; /* number of VMAs */
> + union {
> + const int vma_count; /* number of VMAs */
> + int __vma_count;
> + };
> unsigned long total_vm; /* Total pages mapped */
> unsigned long locked_vm; /* Pages that have PG_mlocked set */
> unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
> @@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm)
> return (max_count > vma_count) ? (max_count - vma_count) : 0;
> }
>
> +static inline void vma_count_init(struct mm_struct *mm)
> +{
> + mm->__vma_count = 0;
> +}
> +
> +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +{
> + mm->__vma_count += nr_vmas;
> +}
> +
> +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> +{
> + vma_count_add(mm, -nr_vmas);
> +}
> +
> +static inline void vma_count_inc(struct mm_struct *mm)
> +{
> + vma_count_add(mm, 1);
> +}
> +
> +static inline void vma_count_dec(struct mm_struct *mm)
> +{
> + vma_count_sub(mm, 1);
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.51.0.384.g4c02a37b29-goog
>
^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH v2 5/7] mm: harden vma_count against direct modification
2025-09-18 14:52 ` Lorenzo Stoakes
@ 2025-09-18 15:43 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-18 15:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, Sep 18, 2025 at 7:52 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 09:36:36AM -0700, Kalesh Singh wrote:
> > To make VMA counting more robust, prevent direct modification of the
> > mm->vma_count field. This is achieved by making the public-facing
> > member const via a union and requiring all modifications to go
> > through a new set of helper functions that operate on a private
> > __vma_count.
> >
> > While there are no other invariants tied to vma_count currently, this
> > structural change improves maintainability, as it creates a single,
> > centralized point for any future logic, such as adding debug checks
> > or updating related statistics (in subsequent patches).
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
>
> Hmmm I'm not sure about this one.
>
> I think this is a 'we don't need it' situation, and it's making everything a bit
> uglier.
>
> I especially hate vma_count_add() and vma_count_sub(). You're essentially
> overriding the whole concept in these cases to make stuff that's already in
> place work in those cases.
>
> I don't think this really adds much honestly.
Hi Lorenzo,
Thanks for reviewing.
This was done to facilitate adding the assertions. So I'll drop this
along with that in the next version.
Thanks,
Kalesh
>
> (You're also clearly missing cases as the kernel bot has found issues)
>
> > ---
> > include/linux/mm.h | 25 +++++++++++++++++++++++++
> > include/linux/mm_types.h | 5 ++++-
> > kernel/fork.c | 2 +-
> > mm/mmap.c | 2 +-
> > mm/vma.c | 12 ++++++------
> > tools/testing/vma/vma.c | 2 +-
> > tools/testing/vma/vma_internal.h | 30 +++++++++++++++++++++++++++++-
> > 7 files changed, 67 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 138bab2988f8..8bad1454984c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4219,4 +4219,29 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
> >
> > void snapshot_page(struct page_snapshot *ps, const struct page *page);
> >
> > +static inline void vma_count_init(struct mm_struct *mm)
> > +{
> > + ACCESS_PRIVATE(mm, __vma_count) = 0;
> > +}
> > +
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > + ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> > +}
> > +
> > +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> > +{
> > + vma_count_add(mm, -nr_vmas);
> > +}
> > +
> > +static inline void vma_count_inc(struct mm_struct *mm)
> > +{
> > + vma_count_add(mm, 1);
> > +}
> > +
> > +static inline void vma_count_dec(struct mm_struct *mm)
> > +{
> > + vma_count_sub(mm, 1);
> > +}
> > +
> > #endif /* _LINUX_MM_H */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 4343be2f9e85..2ea8fc722aa2 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1020,7 +1020,10 @@ struct mm_struct {
> > #ifdef CONFIG_MMU
> > atomic_long_t pgtables_bytes; /* size of all page tables */
> > #endif
> > - int vma_count; /* number of VMAs */
> > + union {
> > + const int vma_count; /* number of VMAs */
> > + int __private __vma_count;
> > + };
> >
> > spinlock_t page_table_lock; /* Protects page tables and some
> > * counters
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 8fcbbf947579..ea9eff416e51 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1037,7 +1037,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> > mmap_init_lock(mm);
> > INIT_LIST_HEAD(&mm->mmlist);
> > mm_pgtables_bytes_init(mm);
> > - mm->vma_count = 0;
> > + vma_count_init(mm);
> > mm->locked_vm = 0;
> > atomic64_set(&mm->pinned_vm, 0);
> > memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c6769394a174..30ddd550197e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1828,7 +1828,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > */
> > vma_iter_bulk_store(&vmi, tmp);
> >
> > - mm->vma_count++;
> > + vma_count_inc(mm);
> >
> > if (tmp->vm_ops && tmp->vm_ops->open)
> > tmp->vm_ops->open(tmp);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 64f4e7c867c3..0cd3cb472220 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -352,7 +352,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> > * (it may either follow vma or precede it).
> > */
> > vma_iter_store_new(vmi, vp->insert);
> > - mm->vma_count++;
> > + vma_count_inc(mm);
> > }
> >
> > if (vp->anon_vma) {
> > @@ -383,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> > }
> > if (vp->remove->anon_vma)
> > anon_vma_merge(vp->vma, vp->remove);
> > - mm->vma_count--;
> > + vma_count_dec(mm);
> > mpol_put(vma_policy(vp->remove));
> > if (!vp->remove2)
> > WARN_ON_ONCE(vp->vma->vm_end < vp->remove->vm_end);
> > @@ -1266,7 +1266,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > struct mm_struct *mm;
> >
> > mm = current->mm;
> > - mm->vma_count -= vms->vma_count;
> > + vma_count_sub(mm, vms->vma_count);
> > mm->locked_vm -= vms->locked_vm;
> > if (vms->unlock)
> > mmap_write_downgrade(mm);
> > @@ -1795,7 +1795,7 @@ int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
> > vma_start_write(vma);
> > vma_iter_store_new(&vmi, vma);
> > vma_link_file(vma);
> > - mm->vma_count++;
> > + vma_count_inc(mm);
> > validate_mm(mm);
> > return 0;
> > }
> > @@ -2495,7 +2495,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > /* Lock the VMA since it is modified after insertion into VMA tree */
> > vma_start_write(vma);
> > vma_iter_store_new(vmi, vma);
> > - map->mm->vma_count++;
> > + vma_count_inc(map->mm);
> > vma_link_file(vma);
> >
> > /*
> > @@ -2810,7 +2810,7 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
> > goto mas_store_fail;
> >
> > - mm->vma_count++;
> > + vma_count_inc(mm);
> > validate_mm(mm);
> > out:
> > perf_event_mmap(vma);
> > diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
> > index 69fa7d14a6c2..ee5a1e2365e0 100644
> > --- a/tools/testing/vma/vma.c
> > +++ b/tools/testing/vma/vma.c
> > @@ -261,7 +261,7 @@ static int cleanup_mm(struct mm_struct *mm, struct vma_iterator *vmi)
> > }
> >
> > mtree_destroy(&mm->mm_mt);
> > - mm->vma_count = 0;
> > + vma_count_init(mm);
> > return count;
> > }
> >
> > diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> > index 15525b86145d..6e724ba1adf4 100644
> > --- a/tools/testing/vma/vma_internal.h
> > +++ b/tools/testing/vma/vma_internal.h
> > @@ -251,7 +251,10 @@ struct mutex {};
> >
> > struct mm_struct {
> > struct maple_tree mm_mt;
> > - int vma_count; /* number of VMAs */
> > + union {
> > + const int vma_count; /* number of VMAs */
> > + int __vma_count;
> > + };
> > unsigned long total_vm; /* Total pages mapped */
> > unsigned long locked_vm; /* Pages that have PG_mlocked set */
> > unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
> > @@ -1526,4 +1529,29 @@ static int vma_count_remaining(const struct mm_struct *mm)
> > return (max_count > vma_count) ? (max_count - vma_count) : 0;
> > }
> >
> > +static inline void vma_count_init(struct mm_struct *mm)
> > +{
> > + mm->__vma_count = 0;
> > +}
> > +
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > + mm->__vma_count += nr_vmas;
> > +}
> > +
> > +static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
> > +{
> > + vma_count_add(mm, -nr_vmas);
> > +}
> > +
> > +static inline void vma_count_inc(struct mm_struct *mm)
> > +{
> > + vma_count_add(mm, 1);
> > +}
> > +
> > +static inline void vma_count_dec(struct mm_struct *mm)
> > +{
> > + vma_count_sub(mm, 1);
> > +}
> > +
> > #endif /* __MM_VMA_INTERNAL_H */
> > --
> > 2.51.0.384.g4c02a37b29-goog
> >
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 6/7] mm: add assertion for VMA count limit
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
` (4 preceding siblings ...)
2025-09-15 16:36 ` [PATCH v2 5/7] mm: harden vma_count against direct modification Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
2025-09-17 13:44 ` David Hildenbrand
2025-09-18 13:30 ` Lorenzo Stoakes
2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
7 siblings, 2 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
cases where the VMA count exceeds the sysctl_max_map_count limit.
This check will help catch future bugs or regressions where
VMAs are allocated in excess of the limit.
The warning is placed in the main vma_count_*() helpers, while the
internal *_nocheck variants bypass it. _nocheck helpers are used to
ensure that the assertion does not trigger a false positive in
the legitimate case of a temporary VMA increase past the limit
by a VMA split in munmap().
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Add assertions if exceeding max_vma_count limit, per Pedro
include/linux/mm.h | 12 ++++++--
mm/internal.h | 1 -
mm/vma.c | 49 +++++++++++++++++++++++++-------
tools/testing/vma/vma_internal.h | 7 ++++-
4 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bad1454984c..3a3749d7015c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
void snapshot_page(struct page_snapshot *ps, const struct page *page);
+int vma_count_remaining(const struct mm_struct *mm);
+
static inline void vma_count_init(struct mm_struct *mm)
{
ACCESS_PRIVATE(mm, __vma_count) = 0;
}
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
{
ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
}
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+ VM_WARN_ON_ONCE(!vma_count_remaining(mm));
+ __vma_count_add_nocheck(mm, nr_vmas);
+}
+
static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
{
- vma_count_add(mm, -nr_vmas);
+ __vma_count_add_nocheck(mm, -nr_vmas);
}
static inline void vma_count_inc(struct mm_struct *mm)
diff --git a/mm/internal.h b/mm/internal.h
index 39f1c9535ae5..e0567a3b64fa 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1661,6 +1661,5 @@ static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
-int vma_count_remaining(const struct mm_struct *mm);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/vma.c b/mm/vma.c
index 0cd3cb472220..0e4fcaebe209 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -323,15 +323,17 @@ static void vma_prepare(struct vma_prepare *vp)
}
/*
- * vma_complete- Helper function for handling the unlocking after altering VMAs,
- * or for inserting a VMA.
+ * This is the internal, unsafe version of vma_complete(). Unlike its
+ * wrapper, this function bypasses runtime checks for VMA count limits by
+ * using the _nocheck vma_count* helpers.
*
- * @vp: The vma_prepare struct
- * @vmi: The vma iterator
- * @mm: The mm_struct
+ * Its use is restricted to __split_vma() where the VMA count can be
+ * temporarily higher than the sysctl_max_map_count limit.
+ *
+ * All other callers must use vma_complete().
*/
-static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
- struct mm_struct *mm)
+static void __vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
+ struct mm_struct *mm)
{
if (vp->file) {
if (vp->adj_next)
@@ -352,7 +354,11 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
* (it may either follow vma or precede it).
*/
vma_iter_store_new(vmi, vp->insert);
- vma_count_inc(mm);
+ /*
+ * Explicitly allow vma_count to exceed the threshold, to avoid
+ * blocking munmap() from freeing resources.
+ */
+ __vma_count_add_nocheck(mm, 1);
}
if (vp->anon_vma) {
@@ -403,6 +409,26 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
uprobe_mmap(vp->insert);
}
+/*
+ * vma_complete- Helper function for handling the unlocking after altering VMAs,
+ * or for inserting a VMA.
+ *
+ * @vp: The vma_prepare struct
+ * @vmi: The vma iterator
+ * @mm: The mm_struct
+ */
+static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
+ struct mm_struct *mm)
+{
+ /*
+ * __vma_complete() explicitly foregoes checking the new
+ * vma_count against the sysctl_max_map_count limit, so
+ * do it here.
+ */
+ VM_WARN_ON_ONCE(!vma_count_remaining(mm));
+ __vma_complete(vp, vmi, mm);
+}
+
/*
* init_vma_prep() - Initializer wrapper for vma_prepare struct
* @vp: The vma_prepare struct
@@ -564,8 +590,11 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma->vm_end = addr;
}
- /* vma_complete stores the new vma */
- vma_complete(&vp, vmi, vma->vm_mm);
+ /*
+ * __vma_complete stores the new vma without checking against the
+ * sysctl_max_map_count (vma_count) limit.
+ */
+ __vma_complete(&vp, vmi, vma->vm_mm);
validate_mm(vma->vm_mm);
/* Success. */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 6e724ba1adf4..d084b1eb2a5c 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1534,11 +1534,16 @@ static inline void vma_count_init(struct mm_struct *mm)
mm->__vma_count = 0;
}
-static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
{
mm->__vma_count += nr_vmas;
}
+static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
+{
+ __vma_count_add_nocheck(mm, nr_vmas);
+}
+
static inline void vma_count_sub(struct mm_struct *mm, int nr_vmas)
{
vma_count_add(mm, -nr_vmas);
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread

* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
@ 2025-09-17 13:44 ` David Hildenbrand
2025-09-17 17:22 ` Kalesh Singh
2025-09-18 11:48 ` Pedro Falcato
2025-09-18 13:30 ` Lorenzo Stoakes
1 sibling, 2 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-09-17 13:44 UTC (permalink / raw)
To: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On 15.09.25 18:36, Kalesh Singh wrote:
> Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
> cases where the VMA count exceeds the sysctl_max_map_count limit.
>
> This check will help catch future bugs or regressions where
> the VMAs are allocated exceeding the limit.
>
> The warning is placed in the main vma_count_*() helpers, while the
> internal *_nocheck variants bypass it. _nocheck helpers are used to
> ensure that the assertion does not trigger a false positive in
> the legitimate case of a temporary VMA increase past the limit
> by a VMA split in munmap().
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
>
> Changes in v2:
> - Add assertions if exceeding max_vma_count limit, per Pedro
>
> include/linux/mm.h | 12 ++++++--
> mm/internal.h | 1 -
> mm/vma.c | 49 +++++++++++++++++++++++++-------
> tools/testing/vma/vma_internal.h | 7 ++++-
> 4 files changed, 55 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bad1454984c..3a3749d7015c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
>
> void snapshot_page(struct page_snapshot *ps, const struct page *page);
>
> +int vma_count_remaining(const struct mm_struct *mm);
> +
> static inline void vma_count_init(struct mm_struct *mm)
> {
> ACCESS_PRIVATE(mm, __vma_count) = 0;
> }
>
> -static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
> {
> ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> }
>
> +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> +{
> + VM_WARN_ON_ONCE(!vma_count_remaining(mm));
Can't that fire when changing the max count from user space at just the
wrong time?
I assume we'll have to tolerate that and might just want to drop this
patch from the series.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 65+ messages in thread* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
2025-09-17 13:44 ` David Hildenbrand
@ 2025-09-17 17:22 ` Kalesh Singh
2025-09-17 18:34 ` David Hildenbrand
2025-09-18 11:48 ` Pedro Falcato
1 sibling, 1 reply; 65+ messages in thread
From: Kalesh Singh @ 2025-09-17 17:22 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Wed, Sep 17, 2025 at 6:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.09.25 18:36, Kalesh Singh wrote:
> > Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
> > cases where the VMA count exceeds the sysctl_max_map_count limit.
> >
> > This check will help catch future bugs or regressions where
> > the VMAs are allocated exceeding the limit.
> >
> > The warning is placed in the main vma_count_*() helpers, while the
> > internal *_nocheck variants bypass it. _nocheck helpers are used to
> > ensure that the assertion does not trigger a false positive in
> > the legitimate case of a temporary VMA increase past the limit
> > by a VMA split in munmap().
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
> >
> > Changes in v2:
> > - Add assertions if exceeding max_vma_count limit, per Pedro
> >
> > include/linux/mm.h | 12 ++++++--
> > mm/internal.h | 1 -
> > mm/vma.c | 49 +++++++++++++++++++++++++-------
> > tools/testing/vma/vma_internal.h | 7 ++++-
> > 4 files changed, 55 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8bad1454984c..3a3749d7015c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
> >
> > void snapshot_page(struct page_snapshot *ps, const struct page *page);
> >
> > +int vma_count_remaining(const struct mm_struct *mm);
> > +
> > static inline void vma_count_init(struct mm_struct *mm)
> > {
> > ACCESS_PRIVATE(mm, __vma_count) = 0;
> > }
> >
> > -static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
> > {
> > ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> > }
> >
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > + VM_WARN_ON_ONCE(!vma_count_remaining(mm));
>
> Can't that fire when changing the max count from user space at just the
> wrong time?
You are right: technically it's possible if it was raised between the
time of checking and when the new VMA is added.
>
> I assume we'll have to tolerate that and might just want to drop this
> patch from the series.
>
It is compiled out in !CONFIG_DEBUG_VM builds; would we still want to drop it?
Thanks,
Kalesh
> --
> Cheers
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 65+ messages in thread* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
2025-09-17 17:22 ` Kalesh Singh
@ 2025-09-17 18:34 ` David Hildenbrand
2025-09-17 20:31 ` Kalesh Singh
0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-09-17 18:34 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
>> Can't that fire when changing the max count from user space at just the
>> wrong time?
>
> You are right: technically it's possible if it was raised between the
> time of checking and when the new VMA is added.
>
>>
>> I assume we'll have to tolerate that and might just want to drop this
>> patch from the series.
>>
>
> It is compiled out in !CONFIG_DEBUG_VM builds; would we still want to drop it?
Due to the racy nature, I think any kind of assertion would be wrong.
Without any such races possible I would agree that the checks would be
nice to have.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
2025-09-17 18:34 ` David Hildenbrand
@ 2025-09-17 20:31 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-17 20:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Wed, Sep 17, 2025 at 11:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> >> Can't that fire when changing the max count from user space at just the
> >> wrong time?
> >
> > You are right: technically it's possible if it was raised between the
> > time of checking and when the new VMA is added.
> >
> >>
> >> I assume we'll have to tolerate that and might just want to drop this
> >> patch from the series.
> >>
> >
> > It is compiled out in !CONFIG_DEBUG_VM builds; would we still want to drop it?
>
> > Due to the racy nature, I think any kind of assertion would be wrong.
> Without any such races possible I would agree that the checks would be
> nice to have.
Alright I'll drop this in the next revision.
Thanks,
Kalesh
>
> --
> Cheers
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
2025-09-17 13:44 ` David Hildenbrand
2025-09-17 17:22 ` Kalesh Singh
@ 2025-09-18 11:48 ` Pedro Falcato
1 sibling, 0 replies; 65+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:48 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kalesh Singh, akpm, minchan, lorenzo.stoakes, Liam.Howlett, rppt,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Wed, Sep 17, 2025 at 03:44:27PM +0200, David Hildenbrand wrote:
> On 15.09.25 18:36, Kalesh Singh wrote:
> > Building on the vma_count helpers, add a VM_WARN_ON_ONCE() to detect
> > cases where the VMA count exceeds the sysctl_max_map_count limit.
> >
> > This check will help catch future bugs or regressions where
> > the VMAs are allocated exceeding the limit.
> >
> > The warning is placed in the main vma_count_*() helpers, while the
> > internal *_nocheck variants bypass it. _nocheck helpers are used to
> > ensure that the assertion does not trigger a false positive in
> > the legitimate case of a temporary VMA increase past the limit
> > by a VMA split in munmap().
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > ---
> >
> > Changes in v2:
> > - Add assertions if exceeding max_vma_count limit, per Pedro
> >
> > include/linux/mm.h | 12 ++++++--
> > mm/internal.h | 1 -
> > mm/vma.c | 49 +++++++++++++++++++++++++-------
> > tools/testing/vma/vma_internal.h | 7 ++++-
> > 4 files changed, 55 insertions(+), 14 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8bad1454984c..3a3749d7015c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4219,19 +4219,27 @@ static inline bool snapshot_page_is_faithful(const struct page_snapshot *ps)
> > void snapshot_page(struct page_snapshot *ps, const struct page *page);
> > +int vma_count_remaining(const struct mm_struct *mm);
> > +
> > static inline void vma_count_init(struct mm_struct *mm)
> > {
> > ACCESS_PRIVATE(mm, __vma_count) = 0;
> > }
> > -static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +static inline void __vma_count_add_nocheck(struct mm_struct *mm, int nr_vmas)
> > {
> > ACCESS_PRIVATE(mm, __vma_count) += nr_vmas;
> > }
> > +static inline void vma_count_add(struct mm_struct *mm, int nr_vmas)
> > +{
> > + VM_WARN_ON_ONCE(!vma_count_remaining(mm));
>
> Can't that fire when changing the max count from user space at just the
> wrong time?
>
> I assume we'll have to tolerate that and might just want to drop this patch
> from the series.
Ah yes, of course, userspace can dynamically change it. Good catch. I guess
we'll need to kill the assertion idea then.
--
Pedro
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 6/7] mm: add assertion for VMA count limit
2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
2025-09-17 13:44 ` David Hildenbrand
@ 2025-09-18 13:30 ` Lorenzo Stoakes
1 sibling, 0 replies; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:30 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
I see that the review is generally to drop this one so it's moot, but this is
also not applying:
Patch failed at 0006 mm: add assertion for VMA count limit
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
When trying to b4 shazam in mm-new.
^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
` (5 preceding siblings ...)
2025-09-15 16:36 ` [PATCH v2 6/7] mm: add assertion for VMA count limit Kalesh Singh
@ 2025-09-15 16:36 ` Kalesh Singh
2025-09-15 23:41 ` Steven Rostedt
2025-09-18 13:42 ` Lorenzo Stoakes
2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
7 siblings, 2 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 16:36 UTC (permalink / raw)
To: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato
Cc: kernel-team, android-mm, Kalesh Singh, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
The needed observability on in-field devices can be collected with
minimal overhead and toggled on and off. Event-driven telemetry can be
done with tracepoint BPF programs.
The process comm is provided for aggregation across devices, and the
tgid enables per-process aggregation on a given device.
This allows observing the distribution of such failures in the field,
to deduce whether there are legitimate bugs or whether a bump to the
limit is warranted.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
Changes in v2:
- Add needed observability for operations failing due to the vma count limit,
per Minchan
(Since there isn't a common point for debug logging, the checks being
external to the capacity-based vma_count_remaining() helper, I used a
trace event for low overhead and to facilitate event-driven telemetry
on in-field devices.)
include/trace/events/vma.h | 32 ++++++++++++++++++++++++++++++++
mm/mmap.c | 5 ++++-
mm/mremap.c | 10 ++++++++--
mm/vma.c | 11 +++++++++--
4 files changed, 53 insertions(+), 5 deletions(-)
create mode 100644 include/trace/events/vma.h
diff --git a/include/trace/events/vma.h b/include/trace/events/vma.h
new file mode 100644
index 000000000000..2fed63b0d0a6
--- /dev/null
+++ b/include/trace/events/vma.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vma
+
+#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMA_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(max_vma_count_exceeded,
+
+ TP_PROTO(struct task_struct *task),
+
+ TP_ARGS(task),
+
+ TP_STRUCT__entry(
+ __string(comm, task->comm)
+ __field(pid_t, tgid)
+ ),
+
+ TP_fast_assign(
+ __assign_str(comm);
+ __entry->tgid = task->tgid;
+ ),
+
+ TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
+);
+
+#endif /* _TRACE_VMA_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/mmap.c b/mm/mmap.c
index 30ddd550197e..0bb311bf48f3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -56,6 +56,7 @@
#define CREATE_TRACE_POINTS
#include <trace/events/mmap.h>
+#include <trace/events/vma.h>
#include "internal.h"
@@ -374,8 +375,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
return -EOVERFLOW;
/* Too many mappings? */
- if (!vma_count_remaining(mm))
+ if (!vma_count_remaining(mm)) {
+ trace_max_vma_count_exceeded(current);
return -ENOMEM;
+ }
/*
* addr is returned from get_unmapped_area,
diff --git a/mm/mremap.c b/mm/mremap.c
index 14d35d87e89b..f42ac05f0069 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -30,6 +30,8 @@
#include <asm/tlb.h>
#include <asm/pgalloc.h>
+#include <trace/events/vma.h>
+
#include "internal.h"
/* Classify the kind of remap operation being performed. */
@@ -1040,8 +1042,10 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
* We'd prefer to avoid failure later on in do_munmap:
* which may split one vma into three before unmapping.
*/
- if (vma_count_remaining(current->mm) < 4)
+ if (vma_count_remaining(current->mm) < 4) {
+ trace_max_vma_count_exceeded(current);
return -ENOMEM;
+ }
if (vma->vm_ops && vma->vm_ops->may_split) {
if (vma->vm_start != old_addr)
@@ -1817,8 +1821,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
* the threshold. In other words, is the current map count + 6 at or
* below the threshold? Otherwise return -ENOMEM here to be more safe.
*/
- if (vma_count_remaining(current->mm) < 6)
+ if (vma_count_remaining(current->mm) < 6) {
+ trace_max_vma_count_exceeded(current);
return -ENOMEM;
+ }
return 0;
}
diff --git a/mm/vma.c b/mm/vma.c
index 0e4fcaebe209..692c33c3e84d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,6 +7,8 @@
#include "vma_internal.h"
#include "vma.h"
+#include <trace/events/vma.h>
+
struct mmap_state {
struct mm_struct *mm;
struct vma_iterator *vmi;
@@ -621,8 +623,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long addr, int new_below)
{
- if (!vma_count_remaining(vma->vm_mm))
+ if (!vma_count_remaining(vma->vm_mm)) {
+ trace_max_vma_count_exceeded(current);
return -ENOMEM;
+ }
return __split_vma(vmi, vma, addr, new_below);
}
@@ -1375,6 +1379,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
*/
if (vms->end < vms->vma->vm_end &&
!vma_count_remaining(vms->vma->vm_mm)) {
+ trace_max_vma_count_exceeded(current);
error = -ENOMEM;
goto vma_count_exceeded;
}
@@ -2801,8 +2806,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
return -ENOMEM;
- if (!vma_count_remaining(mm))
+ if (!vma_count_remaining(mm)) {
+ trace_max_vma_count_exceeded(current);
return -ENOMEM;
+ }
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
@ 2025-09-15 23:41 ` Steven Rostedt
2025-09-16 1:19 ` Kalesh Singh
2025-09-18 13:42 ` Lorenzo Stoakes
1 sibling, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2025-09-15 23:41 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Mon, 15 Sep 2025 09:36:38 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:
> Needed observability on in field devices can be collected with minimal
> overhead and can be toggled on and off. Event driven telemetry can be
> done with tracepoint BPF programs.
>
> The process comm is provided for aggregation across devices and tgid is
> to enable per-process aggregation per device.
What do you mean about comm being used for aggregation across devices?
What's special about this trace event that will make it used across devices?
Note, if BPF is being used, can't the BPF program just add the current
comm? Why waste space in the ring buffer for it?
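For instance, a bpftrace consumer attached to the event can pull comm and
pid from the current task itself (illustrative one-liner, assuming this
patch's event name):

```shell
# Count limit hits per process without storing comm in the event:
bpftrace -e 'tracepoint:vma:max_vma_count_exceeded { @hits[comm, pid] = count(); }'
```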
> +
> +TRACE_EVENT(max_vma_count_exceeded,
> +
> + TP_PROTO(struct task_struct *task),
Why pass in the task if it's always going to be current?
> +
> + TP_ARGS(task),
> +
> + TP_STRUCT__entry(
> + __string(comm, task->comm)
This could be:
__string(comm, current)
But I still want to know what makes this trace event special over other
trace events to store this, and can't it be retrieved another way,
especially if BPF is being used to hook to it?
-- Steve
> + __field(pid_t, tgid)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(comm);
> + __entry->tgid = task->tgid;
> + ),
> +
> + TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
> +);
> +
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-15 23:41 ` Steven Rostedt
@ 2025-09-16 1:19 ` Kalesh Singh
2025-09-16 15:52 ` Steven Rostedt
0 siblings, 1 reply; 65+ messages in thread
From: Kalesh Singh @ 2025-09-16 1:19 UTC (permalink / raw)
To: Steven Rostedt
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Mon, Sep 15, 2025 at 4:41 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 15 Sep 2025 09:36:38 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > Needed observability on in field devices can be collected with minimal
> > overhead and can be toggled on and off. Event driven telemetry can be
> > done with tracepoint BPF programs.
> >
> > The process comm is provided for aggregation across devices and tgid is
> > to enable per-process aggregation per device.
>
> What do you mean about comm being used to aggregation across devices?
> What's special about this trace event that will make it used across devices?
>
> Note, if BPF is being used, can't the BPF program just add the current
> comm? Why waste space in the ring buffer for it?
>
>
>
> > +
> > +TRACE_EVENT(max_vma_count_exceeded,
> > +
> > + TP_PROTO(struct task_struct *task),
>
> Why pass in the task if it's always going to be current?
>
> > +
> > + TP_ARGS(task),
> > +
> > + TP_STRUCT__entry(
> > + __string(comm, task->comm)
>
> This could be:
>
> __string(comm, current)
>
> But I still want to know what makes this trace event special over other
> trace events to store this, and can't it be retrieved another way,
> especially if BPF is being used to hook to it?
Hi Steve,
Thanks for the comments and the suggestion. You are right, we can use BPF
to get the comm. There is nothing special about this trace event. I will
drop comm in the next revision.
The reason I passed the task_struct parameter (current): I believe there
is a limitation that we must specify at least one parameter to the
TRACE_EVENT() PROTO and ARGS macros.
Is there some way to use this without needing a parameter?
I hit the build failure below, with no parameters:
In file included from mm/vma.c:10:
./include/trace/events/vma.h:10:1: error: expected parameter declarator
10 | TRACE_EVENT(max_vma_count_exceeded,
| ^
...
Below is the code for reference:
/* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM vma
#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_VMA_H
#include <linux/tracepoint.h>
TRACE_EVENT(max_vma_count_exceeded,
TP_PROTO(),
TP_ARGS(),
TP_STRUCT__entry(
__field(pid_t, tgid)
),
TP_fast_assign(
__entry->tgid = current->tgid;
),
TP_printk("tgid=%d", __entry->tgid)
);
#endif /* _TRACE_VMA_H */
/* This part must be outside protection */
#include <trace/define_trace.h>
Thanks,
Kalesh
>
> -- Steve
>
>
> > + __field(pid_t, tgid)
> > + ),
> > +
> > + TP_fast_assign(
> > + __assign_str(comm);
> > + __entry->tgid = task->tgid;
> > + ),
> > +
> > + TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
> > +);
> > +
^ permalink raw reply [flat|nested] 65+ messages in thread* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 1:19 ` Kalesh Singh
@ 2025-09-16 15:52 ` Steven Rostedt
2025-09-16 17:36 ` Kalesh Singh
2025-09-18 11:38 ` Pedro Falcato
0 siblings, 2 replies; 65+ messages in thread
From: Steven Rostedt @ 2025-09-16 15:52 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Mon, 15 Sep 2025 18:19:53 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:
> Hi Steve,
>
> Thanks for the comments and suggestion you are right we can use bpf to
> get the comm. There is nothing special about this trace event. I will
> drop comm in the next revision.
>
> The reason I did the task_struct parameter (current): I believe there
> is a limitation that we must specify at least 1 parameter to the
> TRACE_EVENT() PROTO and ARGS macros.
OK, then this is another issue. We don't want tracepoint "markers".
Each tracepoint can take up to 5K in memory due to the code it
generates and the meta data to control it.
For something like that, we highly recommend dynamic probes (fprobes,
kprobes, etc).
The only purpose of a static tracepoint is to get data within a
function that is too difficult to get via a probe. It should never be
used as a trigger where its purpose is "we hit this path".
-- Steve
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 15:52 ` Steven Rostedt
@ 2025-09-16 17:36 ` Kalesh Singh
2025-09-16 17:48 ` Steven Rostedt
2025-09-18 11:38 ` Pedro Falcato
1 sibling, 1 reply; 65+ messages in thread
From: Kalesh Singh @ 2025-09-16 17:36 UTC (permalink / raw)
To: Steven Rostedt
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Tue, Sep 16, 2025 at 8:52 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 15 Sep 2025 18:19:53 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
>
> > Hi Steve,
> >
> > Thanks for the comments and suggestion you are right we can use bpf to
> > get the comm. There is nothing special about this trace event. I will
> > drop comm in the next revision.
> >
> > The reason I did the task_struct parameter (current): I believe there
> > is a limitation that we must specify at least 1 parameter to the
> > TRACE_EVENT() PROTO and ARGS macros.
>
> OK, then this is another issue. We don't want tracepoint "markers".
> Each tracepoint can take up to 5K in memory due to the code it
> generates and the meta data to control it.
>
> For something like that, we highly recommend dynamic probes (fprobes,
> kprobes, etc).
>
> The only purpose of a static tracepoint is to get data within a
> function that is too difficult to get via a probe. It should never be
> used as a trigger where its purpose is "we hit this path".
Hi Steve,
I completely agree with the principle that static tracepoints
shouldn't be used as markers if a dynamic probe will suffice. The
intent here is to avoid introducing overhead in the common case to
avoid regressing mmap, munmap, and other syscall latencies; while
still providing observability for the max vma_count exceeded failure
condition.
The original centralized check (before previous review rounds) was
indeed in a dedicated function, exceeds_max_map_count(), where a
kprobe/fprobe could have been easily attached without impacting the
common path. This was changed due to previous review feedback to the
capacity based vma_count_remaining() which necessitated the check to
be done externally by the callers:
https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
Would you be ok with something like:
trace_max_vma_count_exceeded(mm);
TP_STRUCT__entry(
__field(unsigned int, mm_id)
__field(unsigned int, vma_count)
)
mm_id would be the hash of the mm_struct pointer, similar to rss_stat, and
vma_count is the current VMA count (some syscalls have different
requirements on the remaining capacity: mremap requires 6 available
slots, while other syscalls require 1).
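The rss_stat-style id can be illustrated with a userspace analogue (a hedged
sketch in Python; blake2b is only a stand-in here, the kernel's pointer
obfuscation uses its own siphash-based scheme, and the function name is
hypothetical):

```python
import hashlib

def mm_id(mm_ptr: int, boot_secret: bytes) -> int:
    """Illustrative analogue of an rss_stat-style mm id: a keyed,
    non-reversible 32-bit hash of a kernel pointer. With one key per
    boot, the same mm always maps to the same id, so events can be
    aggregated per process without exposing the raw address."""
    digest = hashlib.blake2b(mm_ptr.to_bytes(8, "little"),
                             key=boot_secret, digest_size=4).digest()
    return int.from_bytes(digest, "little")
```

The same pointer always yields the same id within a boot, while the raw
address cannot be recovered from the event stream.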
Thanks,
Kalesh
>
> -- Steve
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 17:36 ` Kalesh Singh
@ 2025-09-16 17:48 ` Steven Rostedt
2025-09-16 17:57 ` Kalesh Singh
0 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2025-09-16 17:48 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Tue, 16 Sep 2025 10:36:57 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:
> I completely agree with the principle that static tracepoints
> shouldn't be used as markers if a dynamic probe will suffice. The
> intent here is to avoid introducing overhead in the common case to
> avoid regressing mmap, munmap, and other syscall latencies; while
> still providing observability for the max vma_count exceeded failure
> condition.
>
> The original centralized check (before previous review rounds) was
> indeed in a dedicated function, exceeds_max_map_count(), where a
> kprobe/fprobe could have been easily attached without impacting the
> common path. This was changed due to previous review feedback to the
> capacity based vma_count_remaining() which necessitated the check to
> be done externally by the callers:
>
> https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
>
> Would you be ok with something like:
>
> trace_max_vma_count_exceeded(mm);
>
> TP_STRUCT__entry(
> __field(unsigned int, mm_id)
> > __field(unsigned int, vma_count)
> )
>
> mm_id would be the hash of the mm_struct ptr similar to rss_stat and
> the vma_count is the current vma count (some syscalls have different
> requirements on the capacity remaining: mremap requires 6 available
> slots, other syscalls require 1).
>
BTW, why the hash of the mm pointer and not the pointer itself? We save
pointers in lots of places, and if it is the pointer, you could use an
eprobe to attach to the trace event to dereference its fields.
-- Steve
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 17:48 ` Steven Rostedt
@ 2025-09-16 17:57 ` Kalesh Singh
2025-09-16 18:02 ` Steven Rostedt
0 siblings, 1 reply; 65+ messages in thread
From: Kalesh Singh @ 2025-09-16 17:57 UTC (permalink / raw)
To: Steven Rostedt
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Tue, Sep 16, 2025 at 10:47 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 Sep 2025 10:36:57 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > I completely agree with the principle that static tracepoints
> > shouldn't be used as markers if a dynamic probe will suffice. The
> > intent here is to avoid introducing overhead in the common case to
> > avoid regressing mmap, munmap, and other syscall latencies; while
> > still providing observability for the max vma_count exceeded failure
> > condition.
> >
> > The original centralized check (before previous review rounds) was
> > indeed in a dedicated function, exceeds_max_map_count(), where a
> > kprobe/fprobe could have been easily attached without impacting the
> > common path. This was changed due to previous review feedback to the
> > capacity based vma_count_remaining() which necessitated the check to
> > be done externally by the callers:
> >
> > https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
> >
> > Would you be ok with something like:
> >
> > trace_max_vma_count_exceeded(mm);
> >
> > TP_STRUCT__entry(
> > __field(unsigned int, mm_id)
> > > __field(unsigned int, vma_count)
> > )
> >
> > mm_id would be the hash of the mm_struct ptr similar to rss_stat and
> > the vma_count is the current vma count (some syscalls have different
> > requirements on the capacity remaining: mremap requires 6 available
> > slots, other syscalls require 1).
> >
>
> BTW, why the hash of the mm pointer and not the pointer itself? We save
> pointers in lots of places, and if it is the pointer, you could use an
> eprobe to attache to the trace event to dereference its fields.
In Android we try to avoid exposing raw kernel pointers to userspace
for security reasons: we raise /proc/sys/kernel/kptr_restrict to 2
immediately after symbols are resolved for the necessary telemetry
tooling during early boot. I believe this is also why rss_stat uses the
hash and not the raw pointer.
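For reference, the kptr_restrict levels being referred to (a configuration
sketch, shown for context):

```shell
# kernel.kptr_restrict levels:
#   0 - no extra restriction on %pK pointers
#   1 - %pK shown as zeros to readers without CAP_SYSLOG
#   2 - %pK shown as zeros unconditionally
sysctl -w kernel.kptr_restrict=2
```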
Thanks,
Kalesh
>
> -- Steve
>
>
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 17:57 ` Kalesh Singh
@ 2025-09-16 18:02 ` Steven Rostedt
2025-09-16 18:23 ` Kalesh Singh
0 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2025-09-16 18:02 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Tue, 16 Sep 2025 10:57:43 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:
> > BTW, why the hash of the mm pointer and not the pointer itself? We save
> > pointers in lots of places, and if it is the pointer, you could use an
> > eprobe to attache to the trace event to dereference its fields.
>
> In Android we try to avoid exposing raw kernel pointers to userspace
> for security reasons: raising /proc/sys/kernel/kptr_restrict to 2
> immediately after symbols are resolved for necessary telemetry tooling
> during early boot. I believe this is also why rss_stat uses the hash
> and not the raw pointer.
When it comes to tracing, you already lost. If it goes into the ring buffer
it's a raw pointer. BPF doesn't use the output of the trace event, so you
are exposing nothing from that. It uses the proto directly.
Heck, if you enable function tracing, you are exposing every function
address it traces via the raw data output.
-- Steve
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 18:02 ` Steven Rostedt
@ 2025-09-16 18:23 ` Kalesh Singh
2025-09-16 18:51 ` Steven Rostedt
0 siblings, 1 reply; 65+ messages in thread
From: Kalesh Singh @ 2025-09-16 18:23 UTC (permalink / raw)
To: Steven Rostedt
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Tue, Sep 16, 2025 at 11:01 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 Sep 2025 10:57:43 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > > BTW, why the hash of the mm pointer and not the pointer itself? We save
> > > pointers in lots of places, and if it is the pointer, you could use an
> > > eprobe to attache to the trace event to dereference its fields.
> >
> > In Android we try to avoid exposing raw kernel pointers to userspace
> > for security reasons: raising /proc/sys/kernel/kptr_restrict to 2
> > immediately after symbols are resolved for necessary telemetry tooling
> > during early boot. I believe this is also why rss_stat uses the hash
> > and not the raw pointer.
>
> When it comes to tracing, you already lost. If it goes into the ring buffer
> it's a raw pointer. BPF doesn't use the output of the trace event, so you
> are exposing nothing from that. It uses the proto directly.
My understanding is that the BPF tracepoint type uses the trace event
fields from TP_STRUCT__entry(), whereas the raw tracepoint type has
access to the proto arguments. Please correct me if I'm wrong: isn't
what we'd be adding to the trace buffer the hashed mm_id value?
>
> Heck, if you enable function tracing, you are exposing every function
> address it traces via the raw data output.
Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER in
Android production kernels.
Thanks,
Kalesh
>
> -- Steve
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 18:23 ` Kalesh Singh
@ 2025-09-16 18:51 ` Steven Rostedt
2025-09-16 20:08 ` Kalesh Singh
0 siblings, 1 reply; 65+ messages in thread
From: Steven Rostedt @ 2025-09-16 18:51 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Tue, 16 Sep 2025 11:23:03 -0700
Kalesh Singh <kaleshsingh@google.com> wrote:
> > When it comes to tracing, you already lost. If it goes into the ring buffer
> > it's a raw pointer. BPF doesn't use the output of the trace event, so you
> > are exposing nothing from that. It uses the proto directly.
>
> My understanding is that the BPF tracepoint type uses the trace event
> fields from TP_STRUCT__entry(); whereas the raw tracepoint type has
> access to the proto arguments. Please CMIW: Isn't what we'd be adding
> to the trace buffer is the hashed mm_id value?
Ah, right. Can't the BPF infrastructure protect against it?
>
> >
> > Heck, if you enable function tracing, you are exposing every function
> > address it traces via the raw data output.
>
> Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER in
> Android production kernels.
Does it block all the other trace events that share pointers?
Like nmi handler tracepoints, x86_fpu tracepoints, page_fault kernel
tracepoint, tasklet tracepoints, cpu hot plug tracepoints, timer
tracepoints, work queue tracepoints, ipi tracepoints, scheduling
tracepoints, locking tracepoints, rcu tracepoints, dma tracepoints,
module tracepoints, xdp tracepoints, filemap tracepoints, page map
tracepoints, vmscan tracepoints, percpu tracepoints, kmem_cache
tracepoints, mmap lock tracepoints, file lock tracepoints, and many
more! (I got tired of looking them up).
Again, are you really protecting anything if this one trace point
hashes the pointer? Most other tracepoints expose this. If BPF can
access a tracepoint entry struct, it can access the raw address and
break KASLR.
-- Steve
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 18:51 ` Steven Rostedt
@ 2025-09-16 20:08 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-16 20:08 UTC (permalink / raw)
To: Steven Rostedt
Cc: akpm, minchan, lorenzo.stoakes, david, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Ben Segall, Mel Gorman,
Valentin Schneider, Jann Horn, Shuah Khan, linux-kernel,
linux-fsdevel, linux-mm, linux-trace-kernel, linux-kselftest
On Tue, Sep 16, 2025 at 11:51 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 Sep 2025 11:23:03 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > > When it comes to tracing, you already lost. If it goes into the ring buffer
> > > it's a raw pointer. BPF doesn't use the output of the trace event, so you
> > > are exposing nothing from that. It uses the proto directly.
> >
> > My understanding is that the BPF tracepoint type uses the trace event
> > fields from TP_STRUCT__entry(); whereas the raw tracepoint type has
> > access to the proto arguments. Please CMIW: Isn't what we'd be adding
> > to the trace buffer is the hashed mm_id value?
>
> Ah, right. Can't the BPF infrastructure protect against it?
>
> >
> > >
> > > Heck, if you enable function tracing, you are exposing every function
> > > address it traces via the raw data output.
> >
> > Right, security doesn't allow compiling CONFIG_FUNCTION_TRACER in
> > Android production kernels.
>
> Does it block all the other trace events that share pointers?
>
> Like nmi handler tracepoints, x86_fpu tracepoints, page_fault kernel
> tracepoint, tasklet tracepoints, cpu hot plug tracepoints, timer
> tracepoints, work queue tracepoints, ipi tracepoints, scheduling
> tracepoints, locking tracepoints, rcu tracepoints, dma tracepoints,
> module tracepoints, xdp tracepoints, filemap tracepoints, page map
> tracepoints, vmscan tracepoints, percpu tracepoints, kmem_cache
> tracepoints, mmap lock tracepoints, file lock tracepoints, and many
> more! (I got tired of looking them up).
Hi Steve,
I see your point :) I'll use the raw pointer and handle not exposing
it from the BPF side.
Thanks for discussing.
--Kalesh
>
> Again, are you really protecting anything if this one trace point
> hashes the pointer? Most other tracepoints expose this. If BPF can
> access a tracepoint entry struct, it can access the raw address and
> break KASLR.
Thanks,
Kalesh
>
> -- Steve
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-16 15:52 ` Steven Rostedt
2025-09-16 17:36 ` Kalesh Singh
@ 2025-09-18 11:38 ` Pedro Falcato
2025-09-18 14:48 ` Steven Rostedt
1 sibling, 1 reply; 65+ messages in thread
From: Pedro Falcato @ 2025-09-18 11:38 UTC (permalink / raw)
To: Steven Rostedt
Cc: Kalesh Singh, akpm, minchan, lorenzo.stoakes, david, Liam.Howlett,
rppt, kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Tue, Sep 16, 2025 at 11:52:20AM -0400, Steven Rostedt wrote:
> On Mon, 15 Sep 2025 18:19:53 -0700
> Kalesh Singh <kaleshsingh@google.com> wrote:
>
>
> > Hi Steve,
> >
> > Thanks for the comments and suggestion you are right we can use bpf to
> > get the comm. There is nothing special about this trace event. I will
> > drop comm in the next revision.
> >
> > The reason I did the task_struct parameter (current): I believe there
> > is a limitation that we must specify at least 1 parameter to the
> > TRACE_EVENT() PROTO and ARGS macros.
>
> OK, then this is another issue. We don't want tracepoint "markers".
> Each tracepoint can take up to 5K in memory due to the code it
> generates and the meta data to control it.
>
> For something like that, we highly recommend dynamic probes (fprobes,
> kprobes, etc).
>
> The only purpose of a static tracepoint is to get data within a
> function that is too difficult to get via a probe. It should never be
> used as a trigger where its purpose is "we hit this path".
>
Isn't the usual problem with that approach that of static functions/static
inline functions? I was tracing through a problem a few months ago, and man,
I really did think "wouldn't it be nice to have a tracepoint instead of fishing
around for kprobe spots".
Not that I particularly think a tracepoint is super worth it in this case, but,
y'know.
--
Pedro
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-18 11:38 ` Pedro Falcato
@ 2025-09-18 14:48 ` Steven Rostedt
0 siblings, 0 replies; 65+ messages in thread
From: Steven Rostedt @ 2025-09-18 14:48 UTC (permalink / raw)
To: Pedro Falcato
Cc: Kalesh Singh, akpm, minchan, lorenzo.stoakes, david, Liam.Howlett,
rppt, kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, 18 Sep 2025 12:38:56 +0100
Pedro Falcato <pfalcato@suse.de> wrote:
> Isn't the usual problem with that approach, that of static functions/static
> inline functions? I was tracing through a problem a few months ago, and man
> I really did think "wouldn't it be nice to have a tracepoint instead of fishing
> around for kprobe spots".
>
> Not that I particularly think a tracepoint is super worth it in this case, but,
> y'know.
Yes, it would be useful. The issue is that tracepoints are not free. They
do increase the I$ hit and take up memory.
If you're going to inject a tracepoint somewhere, at least extract some
useful information from that spot. If you can't think of anything to track,
then it's not worth a tracepoint.
If one really wants a way to track something, they could add a static
branch that would call a function when enabled that could be traced. They
would also need a way to enable that static branch.
-- Steve
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
2025-09-15 23:41 ` Steven Rostedt
@ 2025-09-18 13:42 ` Lorenzo Stoakes
2025-09-18 13:51 ` Lorenzo Stoakes
1 sibling, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:42 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
> Needed observability on in-field devices can be collected with minimal
> overhead and can be toggled on and off. Event driven telemetry can be
> done with tracepoint BPF programs.
>
> The process comm is provided for aggregation across devices and tgid is
> to enable per-process aggregation per device.
>
> This allows for observing the distribution of such problems in the
> field, to deduce if there are legitimate bugs or if a bump to the limit is
> warranted.
It's not really a bug though is it? It's somebody running out of resources.
I'm not sure how useful this is really. But I'm open to being convinced!
I also wonder if this is better as a statistic? You'd figure out it was a
problem that way too right?
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
This breaks the VMA tests, please make sure to always check them:
cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -c -o vma.o vma.c
In file included from vma.c:33:
../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory
10 | #include <trace/events/vma.h>
| ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [<builtin>: vma.o] Error 1
See below though, you've included this in the wrong place (I don't blame you,
perhaps we've not made this _very_ clear :)
> ---
>
> Changes in v2:
> - Add needed observability for operations failing due to the vma count limit,
> per Minchan
> (Since there isn't a common point for debug logging due to checks being
> external to the capacity-based vma_count_remaining() helper. I used a
> trace event for low overhead and to facilitate event driven telemetry
> for in field devices)
>
> include/trace/events/vma.h | 32 ++++++++++++++++++++++++++++++++
Do we really need a new file? We already have VMA-related tracepoints no?
Also if you add a new file you _have_ to update MAINTAINERS.
> mm/mmap.c | 5 ++++-
> mm/mremap.c | 10 ++++++++--
> mm/vma.c | 11 +++++++++--
> 4 files changed, 53 insertions(+), 5 deletions(-)
> create mode 100644 include/trace/events/vma.h
>
> diff --git a/include/trace/events/vma.h b/include/trace/events/vma.h
> new file mode 100644
> index 000000000000..2fed63b0d0a6
> --- /dev/null
> +++ b/include/trace/events/vma.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM vma
> +
> +#if !defined(_TRACE_VMA_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_VMA_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(max_vma_count_exceeded,
> +
> + TP_PROTO(struct task_struct *task),
> +
> + TP_ARGS(task),
> +
> + TP_STRUCT__entry(
> + __string(comm, task->comm)
> + __field(pid_t, tgid)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(comm);
> + __entry->tgid = task->tgid;
> + ),
> +
> + TP_printk("comm=%s tgid=%d", __get_str(comm), __entry->tgid)
> +);
> +
> +#endif /* _TRACE_VMA_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 30ddd550197e..0bb311bf48f3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -56,6 +56,7 @@
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/mmap.h>
> +#include <trace/events/vma.h>
>
> #include "internal.h"
>
> @@ -374,8 +375,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> return -EOVERFLOW;
>
> /* Too many mappings? */
> - if (!vma_count_remaining(mm))
> + if (!vma_count_remaining(mm)) {
> + trace_max_vma_count_exceeded(current);
> return -ENOMEM;
> + }
>
> /*
> * addr is returned from get_unmapped_area,
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 14d35d87e89b..f42ac05f0069 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -30,6 +30,8 @@
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
>
> +#include <trace/events/vma.h>
> +
> #include "internal.h"
>
> /* Classify the kind of remap operation being performed. */
> @@ -1040,8 +1042,10 @@ static unsigned long prep_move_vma(struct vma_remap_struct *vrm)
> * We'd prefer to avoid failure later on in do_munmap:
> * which may split one vma into three before unmapping.
> */
> - if (vma_count_remaining(current->mm) < 4)
> + if (vma_count_remaining(current->mm) < 4) {
> + trace_max_vma_count_exceeded(current);
But we didn't exceed it, we're guessing we might?
> return -ENOMEM;
> + }
>
> if (vma->vm_ops && vma->vm_ops->may_split) {
> if (vma->vm_start != old_addr)
> @@ -1817,8 +1821,10 @@ static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> * the threshold. In other words, is the current map count + 6 at or
> * below the threshold? Otherwise return -ENOMEM here to be more safe.
> */
> - if (vma_count_remaining(current->mm) < 6)
> + if (vma_count_remaining(current->mm) < 6) {
> + trace_max_vma_count_exceeded(current);
Similar point here.
> return -ENOMEM;
> + }
>
> return 0;
> }
> diff --git a/mm/vma.c b/mm/vma.c
> index 0e4fcaebe209..692c33c3e84d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -7,6 +7,8 @@
> #include "vma_internal.h"
> #include "vma.h"
>
> +#include <trace/events/vma.h>
Nope you don't do this :)
vma.c is designed to be userland testable and _intentionally_ puts all its
includes in vma_internal.h.
So you'd need to add this include over there, and then update the vma tests
vma_internal.h to stub out the trace function.
> +
> struct mmap_state {
> struct mm_struct *mm;
> struct vma_iterator *vmi;
> @@ -621,8 +623,10 @@ __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> unsigned long addr, int new_below)
> {
> - if (!vma_count_remaining(vma->vm_mm))
> + if (!vma_count_remaining(vma->vm_mm)) {
> + trace_max_vma_count_exceeded(current);
> return -ENOMEM;
> + }
>
> return __split_vma(vmi, vma, addr, new_below);
> }
> @@ -1375,6 +1379,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> */
> if (vms->end < vms->vma->vm_end &&
> !vma_count_remaining(vms->vma->vm_mm)) {
> + trace_max_vma_count_exceeded(current);
> error = -ENOMEM;
> goto vma_count_exceeded;
> }
> @@ -2801,8 +2806,10 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT))
> return -ENOMEM;
>
> - if (!vma_count_remaining(mm))
> + if (!vma_count_remaining(mm)) {
> + trace_max_vma_count_exceeded(current);
> return -ENOMEM;
> + }
>
> if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> return -ENOMEM;
> --
> 2.51.0.384.g4c02a37b29-goog
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-18 13:42 ` Lorenzo Stoakes
@ 2025-09-18 13:51 ` Lorenzo Stoakes
2025-09-18 15:55 ` Kalesh Singh
0 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 13:51 UTC (permalink / raw)
To: Kalesh Singh
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, Sep 18, 2025 at 02:42:16PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
> > Needed observability on in-field devices can be collected with minimal
> > overhead and can be toggled on and off. Event driven telemetry can be
> > done with tracepoint BPF programs.
> >
> > The process comm is provided for aggregation across devices and tgid is
> > to enable per-process aggregation per device.
> >
> > This allows for observing the distribution of such problems in the
> > field, to deduce if there are legitimate bugs or if a bump to the limit is
> > warranted.
>
> It's not really a bug though is it? It's somebody running out of resources.
>
> I'm not sure how useful this is really. But I'm open to being convinced!
>
> I also wonder if this is better as a statistic? You'd figure out it was a
> problem that way too right?
>
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Mike Rapoport <rppt@kernel.org>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Pedro Falcato <pfalcato@suse.de>
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
>
> This breaks the VMA tests, please make sure to always check them:
>
> cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -c -o vma.o vma.c
> In file included from vma.c:33:
> ../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory
> 10 | #include <trace/events/vma.h>
> | ^~~~~~~~~~~~~~~~~~~~
> compilation terminated.
> make: *** [<builtin>: vma.o] Error 1
Trivial build fix:
----8<----
From fe4c30abbd302ccc628ec92381ac10cea31c6d85 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Thu, 18 Sep 2025 14:47:10 +0100
Subject: [PATCH] fix
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 2 --
mm/vma_internal.h | 2 ++
tools/testing/vma/vma_internal.h | 4 ++++
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 26046b28cdda..a11d29a2ddc0 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,8 +7,6 @@
#include "vma_internal.h"
#include "vma.h"
-#include <trace/events/vma.h>
-
struct mmap_state {
struct mm_struct *mm;
struct vma_iterator *vmi;
diff --git a/mm/vma_internal.h b/mm/vma_internal.h
index 2f05735ff190..2f5ba679f43d 100644
--- a/mm/vma_internal.h
+++ b/mm/vma_internal.h
@@ -47,6 +47,8 @@
#include <linux/uprobes.h>
#include <linux/userfaultfd_k.h>
+#include <trace/events/vma.h>
+
#include <asm/current.h>
#include <asm/tlb.h>
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 07f4108c5e4c..c08c91861b9a 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -1661,4 +1661,8 @@ static inline void vma_count_dec(struct mm_struct *mm)
vma_count_sub(mm, 1);
}
+static void trace_max_vma_count_exceeded(struct task_struct *task)
+{
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.51.0
^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event
2025-09-18 13:51 ` Lorenzo Stoakes
@ 2025-09-18 15:55 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-18 15:55 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, minchan, david, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, Sep 18, 2025 at 6:52 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
>
> On Thu, Sep 18, 2025 at 02:42:16PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 15, 2025 at 09:36:38AM -0700, Kalesh Singh wrote:
> > > Needed observability on in-field devices can be collected with minimal
> > > overhead and can be toggled on and off. Event driven telemetry can be
> > > done with tracepoint BPF programs.
> > >
> > > The process comm is provided for aggregation across devices and tgid is
> > > to enable per-process aggregation per device.
> > >
> > > This allows for observing the distribution of such problems in the
> > > field, to deduce if there are legitimate bugs or if a bump to the limit is
> > > warranted.
> >
> > It's not really a bug though is it? It's somebody running out of resources.
> >
> > I'm not sure how useful this is really. But I'm open to being convinced!
> >
> > I also wonder if this is better as a statistic? You'd figure out it was a
> > problem that way too right?
> >
> > >
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: Mike Rapoport <rppt@kernel.org>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Pedro Falcato <pfalcato@suse.de>
> > > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> >
> > This breaks the VMA tests, please make sure to always check them:
> >
> > cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -c -o vma.o vma.c
> > In file included from vma.c:33:
> > ../../../mm/vma.c:10:10: fatal error: trace/events/vma.h: No such file or directory
> > 10 | #include <trace/events/vma.h>
> > | ^~~~~~~~~~~~~~~~~~~~
> > compilation terminated.
> > make: *** [<builtin>: vma.o] Error 1
>
> Trivial build fix:
>
> ----8<----
> From fe4c30abbd302ccc628ec92381ac10cea31c6d85 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Thu, 18 Sep 2025 14:47:10 +0100
> Subject: [PATCH] fix
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/vma.c | 2 --
> mm/vma_internal.h | 2 ++
> tools/testing/vma/vma_internal.h | 4 ++++
> 3 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 26046b28cdda..a11d29a2ddc0 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -7,8 +7,6 @@
> #include "vma_internal.h"
> #include "vma.h"
>
> -#include <trace/events/vma.h>
> -
> struct mmap_state {
> struct mm_struct *mm;
> struct vma_iterator *vmi;
> diff --git a/mm/vma_internal.h b/mm/vma_internal.h
> index 2f05735ff190..2f5ba679f43d 100644
> --- a/mm/vma_internal.h
> +++ b/mm/vma_internal.h
> @@ -47,6 +47,8 @@
> #include <linux/uprobes.h>
> #include <linux/userfaultfd_k.h>
>
> +#include <trace/events/vma.h>
> +
> #include <asm/current.h>
> #include <asm/tlb.h>
>
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 07f4108c5e4c..c08c91861b9a 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -1661,4 +1661,8 @@ static inline void vma_count_dec(struct mm_struct *mm)
> vma_count_sub(mm, 1);
> }
>
> +static void trace_max_vma_count_exceeded(struct task_struct *task)
> +{
> +}
> +
> #endif /* __MM_VMA_INTERNAL_H */
I made a point to build and run your tests; it seems I forgot to actually
test with this last patch.
Thanks for the fix.
--Kalesh
> --
> 2.51.0
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-15 16:36 [PATCH v2 0/7] vma count: fixes, test and improvements Kalesh Singh
` (6 preceding siblings ...)
2025-09-15 16:36 ` [PATCH v2 7/7] mm/tracing: introduce max_vma_count_exceeded trace event Kalesh Singh
@ 2025-09-15 22:34 ` Andrew Morton
2025-09-15 23:10 ` Kalesh Singh
2025-09-16 10:12 ` Lorenzo Stoakes
7 siblings, 2 replies; 65+ messages in thread
From: Andrew Morton @ 2025-09-15 22:34 UTC (permalink / raw)
To: Kalesh Singh
Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> Hi all,
>
> This is v2 to the VMA count patch I previously posted at:
>
> https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
>
>
> I've split it into multiple patches to address the feedback.
>
> The main changes in v2 are:
>
> - Use a capacity-based check for VMA count limit, per Lorenzo.
> - Rename map_count to vma_count, per David.
> - Add assertions for exceeding the limit, per Pedro.
> - Add tests for max_vma_count, per Liam.
> - Emit a trace event for failure due to insufficient capacity for
> observability
>
> Tested on x86_64 and arm64:
>
> - Build test:
> - allyesconfig for rename
>
> - Selftests:
> cd tools/testing/selftests/mm && \
> make && \
> ./run_vmtests.sh -t max_vma_count
>
> (With trace_max_vma_count_exceeded enabled)
>
> - vma tests:
> cd tools/testing/vma && \
> make && \
> ./vma
fwiw, there's nothing in the above which is usable in a [0/N] overview.
While useful, the "what changed since the previous version" info isn't
a suitable thing to carry in the permanent kernel record - it's
short-term transient stuff, not helpful to someone who is looking at
the patchset in 2029.
Similarly, the "how it was tested" material is also useful, but it
becomes irrelevant as soon as the code hits linux-next and mainline.
Anyhow, this -rc cycle has been quite the firehose in MM and I'm
feeling a need to slow things down for additional stabilization and so
people hopefully get additional bandwidth to digest the material we've
added this far. So I think I'll just cherrypick [1/7] for now. A
great flood of positive review activity would probably make me revisit
that ;)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
@ 2025-09-15 23:10 ` Kalesh Singh
2025-09-16 0:05 ` Andrew Morton
2025-09-16 10:12 ` Lorenzo Stoakes
1 sibling, 1 reply; 65+ messages in thread
From: Kalesh Singh @ 2025-09-15 23:10 UTC (permalink / raw)
To: Andrew Morton
Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Mon, Sep 15, 2025 at 3:34 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 15 Sep 2025 09:36:31 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > Hi all,
> >
> > This is v2 to the VMA count patch I previously posted at:
> >
> > https://lore.kernel.org/r/20250903232437.1454293-1-kaleshsingh@google.com/
> >
> >
> > I've split it into multiple patches to address the feedback.
> >
> > The main changes in v2 are:
> >
> > - Use a capacity-based check for VMA count limit, per Lorenzo.
> > - Rename map_count to vma_count, per David.
> > - Add assertions for exceeding the limit, per Pedro.
> > - Add tests for max_vma_count, per Liam.
> > - Emit a trace event for failure due to insufficient capacity for
> > observability
> >
> > Tested on x86_64 and arm64:
> >
> > - Build test:
> > - allyesconfig for rename
> >
> > - Selftests:
> > cd tools/testing/selftests/mm && \
> > make && \
> > ./run_vmtests.sh -t max_vma_count
> >
> > (With trace_max_vma_count_exceeded enabled)
> >
> > - vma tests:
> > cd tools/testing/vma && \
> > make && \
> > ./vma
>
> fwiw, there's nothing in the above which is usable in a [0/N] overview.
>
> While useful, the "what changed since the previous version" info isn't
> a suitable thing to carry in the permanent kernel record - it's
> short-term treansient stuff, not helpful to someone who is looking at
> the patchset in 2029.
>
> Similarly, the "how it was tested" material is also useful, but it
> becomes irrelevant as soon as the code hits linux-next and mainline.
Hi Andrew,
Thanks for the feedback. Do you mean the cover letter was not needed
in this case or that it lacked enough context?
>
>
> Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> feeling a need to slow things down for additional stabilization and so
> people hopefully get additional bandwidth to digest the material we've
> added this far. So I think I'll just cherrypick [1/7] for now. A
> great flood of positive review activity would probably make me revisit
> that ;)
>
I understand, yes 1/7 is all we need for now, since it prevents an
unrecoverable situation where we get over the limit and cannot recover
as munmap() will then always fail.
Thanks,
Kalesh
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-15 23:10 ` Kalesh Singh
@ 2025-09-16 0:05 ` Andrew Morton
2025-09-16 1:23 ` Kalesh Singh
0 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2025-09-16 0:05 UTC (permalink / raw)
To: Kalesh Singh
Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Mon, 15 Sep 2025 16:10:55 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
> > fwiw, there's nothing in the above which is usable in a [0/N] overview.
> >
> > While useful, the "what changed since the previous version" info isn't
> > a suitable thing to carry in the permanent kernel record - it's
> > short-term transient stuff, not helpful to someone who is looking at
> > the patchset in 2029.
> >
> > Similarly, the "how it was tested" material is also useful, but it
> > becomes irrelevant as soon as the code hits linux-next and mainline.
>
> Hi Andrew,
>
> Thanks for the feedback. Do you mean the cover letter was not needed
> in this case or that it lacked enough context?
The latter. As I've split up the series, please put together some
words to describe the remaining 6 patches if/when resending.
Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-16 0:05 ` Andrew Morton
@ 2025-09-16 1:23 ` Kalesh Singh
0 siblings, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-16 1:23 UTC (permalink / raw)
To: Andrew Morton
Cc: minchan, lorenzo.stoakes, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Mon, Sep 15, 2025 at 5:05 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 15 Sep 2025 16:10:55 -0700 Kalesh Singh <kaleshsingh@google.com> wrote:
>
> > > fwiw, there's nothing in the above which is usable in a [0/N] overview.
> > >
> > > While useful, the "what changed since the previous version" info isn't
> > > a suitable thing to carry in the permanent kernel record - it's
> > > short-term transient stuff, not helpful to someone who is looking at
> > > the patchset in 2029.
> > >
> > > Similarly, the "how it was tested" material is also useful, but it
> > > becomes irrelevant as soon as the code hits linux-next and mainline.
> >
> > Hi Andrew,
> >
> > Thanks for the feedback. Do you mean the cover letter was not needed
> > in this case or that it lacked enough context?
>
> The latter. As I've split up the series, please put together some
> words to describe the remaining 6 patches if/when resending.
Hi Andrew,
Thanks for clarifying. I'll make sure to fix that when resending.
--Kalesh
>
> Thanks.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-15 22:34 ` [PATCH v2 0/7] vma count: fixes, test and improvements Andrew Morton
2025-09-15 23:10 ` Kalesh Singh
@ 2025-09-16 10:12 ` Lorenzo Stoakes
2025-09-16 17:47 ` Kalesh Singh
2025-09-17 2:16 ` Andrew Morton
1 sibling, 2 replies; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-16 10:12 UTC (permalink / raw)
To: Andrew Morton
Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> feeling a need to slow things down for additional stabilization and so
> people hopefully get additional bandwidth to digest the material we've
> added this far. So I think I'll just cherrypick [1/7] for now. A
> great flood of positive review activity would probably make me revisit
> that ;)
>
Kalesh - I do intend to look at this series when I have a chance. My review
workload has been insane so it's hard to keep up at the moment.
Andrew - This cycle has been crazy, speaking from point of view of somebody
doing a lot of review, it's been very very exhausting from this side too,
and this kind of work can feel a little... thankless... sometimes :)
I feel like we maybe need a way to ask people to slow down, sometimes at
least.
Perhaps being less accepting of patches during merge window is one aspect,
as the merge window leading up to this cycle was almost the same review
load as when the cycle started.
Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
in all this too :)
(I am speaking at Kernel Recipes then going on a very-badly-needed 2.5 week
vacation afterwards over the merge window so I hope to stave off burnout
that way. Be good if I could keep mails upon return to 3 digits, but I have
my doubts :P)
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-16 10:12 ` Lorenzo Stoakes
@ 2025-09-16 17:47 ` Kalesh Singh
2025-09-17 2:16 ` Andrew Morton
1 sibling, 0 replies; 65+ messages in thread
From: Kalesh Singh @ 2025-09-16 17:47 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, minchan, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Tue, Sep 16, 2025 at 3:12 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> > Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> > feeling a need to slow things down for additional stabilization and so
> > people hopefully get additional bandwidth to digest the material we've
> > added this far. So I think I'll just cherrypick [1/7] for now. A
> > great flood of positive review activity would probably make me revisit
> > that ;)
> >
>
> Kalesh - I do intend to look at this series when I have a chance. My review
> workload has been insane so it's hard to keep up at the moment.
>
> Andrew - This cycle has been crazy, speaking from point of view of somebody
> doing a lot of review, it's been very very exhausting from this side too,
> and this kind of work can feel a little... thankless... sometimes :)
>
> I feel like we maybe need a way to ask people to slow down, sometimes at
> least.
>
> Perhaps being less accepting of patches during merge window is one aspect,
> as the merge window leading up to this cycle was almost the same review
> load as when the cycle started.
>
> Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
> in all this too :)
>
> (I am speaking at Kernel Recipes then going on a very-badly-needed 2.5 week
> vacation afterwards over the merge window so I hope to stave off burnout
> that way. Be good if I could keep mails upon return to 3 digits, but I have
> my doubts :P)
Hi Lorenzo,
Absolutely, take care of yourself. We all appreciate the amount of
work you put into reviewing :)
Have a good talk and enjoy your vacation. Don't worry about the backlog.
-- Kalesh
>
> Cheers, Lorenzo
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-16 10:12 ` Lorenzo Stoakes
2025-09-16 17:47 ` Kalesh Singh
@ 2025-09-17 2:16 ` Andrew Morton
2025-09-17 5:36 ` Lorenzo Stoakes
1 sibling, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2025-09-17 2:16 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Tue, 16 Sep 2025 11:12:03 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> > Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> > feeling a need to slow things down for additional stabilization and so
> > people hopefully get additional bandwidth to digest the material we've
> > added this far. So I think I'll just cherrypick [1/7] for now. A
> > great flood of positive review activity would probably make me revisit
> > that ;)
> >
>
> Kalesh - I do intend to look at this series when I have a chance. My review
> workload has been insane so it's hard to keep up at the moment.
>
> Andrew - This cycle has been crazy, speaking from point of view of somebody
> doing a lot of review, it's been very very exhausting from this side too,
> and this kind of work can feel a little... thankless... sometimes :)
I hear you. I'm shedding most everything now, to give us a couple of
weeks to digest.
> I feel like we maybe need a way to ask people to slow down, sometimes at
> least.
Yup, I'm sending submitters private emails explaining the situation.
Maybe they should be public emails, I find it a hard call.
> Perhaps being less accepting of patches during merge window is one aspect,
> as the merge window leading up to this cycle was almost the same review
> load as when the cycle started.
I'm having trouble understanding what you said here?
> Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
> in all this too :)
>
> (I am speaking at Kernel Recipes then going on a very-badly-needed 2.5 week
> vacation afterwards over the merge window so I hope to stave off burnout
> that way. Be good if I could keep mails upon return to 3 digits, but I have
> my doubts :P)
I'd blow that in three days ;)
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-17 2:16 ` Andrew Morton
@ 2025-09-17 5:36 ` Lorenzo Stoakes
2025-09-17 23:32 ` Andrew Morton
0 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-17 5:36 UTC (permalink / raw)
To: Andrew Morton
Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Tue, Sep 16, 2025 at 07:16:45PM -0700, Andrew Morton wrote:
> On Tue, 16 Sep 2025 11:12:03 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > On Mon, Sep 15, 2025 at 03:34:01PM -0700, Andrew Morton wrote:
> > > Anyhow, this -rc cycle has been quite the firehose in MM and I'm
> > > feeling a need to slow things down for additional stabilization and so
> > > people hopefully get additional bandwidth to digest the material we've
> > > added this far. So I think I'll just cherrypick [1/7] for now. A
> > > great flood of positive review activity would probably make me revisit
> > > that ;)
> > >
> >
> > Kalesh - I do intend to look at this series when I have a chance. My review
> > workload has been insane so it's hard to keep up at the moment.
> >
> > Andrew - This cycle has been crazy, speaking from point of view of somebody
> > doing a lot of review, it's been very very exhausting from this side too,
> > and this kind of work can feel a little... thankless... sometimes :)
>
> I hear you. I'm shedding most everything now, to give us a couple of
> weeks to digest.
Thanks, much appreciated! :)
>
> > I feel like we maybe need a way to ask people to slow down, sometimes at
> > least.
>
> Yup, I'm sending submitters private emails explaining the situation.
And again, much appreciated :)
>
> Maybe they should be public emails, I find it a hard call.
Yeah it can be hard to get that balance right. Maybe public is better when
we're deeper in the rc and there's a general load problem?
>
> > Perhaps being less accepting of patches during merge window is one aspect,
> > as the merge window leading up to this cycle was almost the same review
> > load as when the cycle started.
>
> I'm having trouble understanding what you said here?
Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
merge window, esp. now we have mm-new.
And last merge window my review load felt similar to during a cycle, which
was kind of crazy.
So I wonder if we should be less accommodating and simply say 'sorry it's
the merge window, no submissions accepted'?
Of course I'm being a bit selfish here as I'm on holiday in the next merge
window and hope forlornly to reduce the mail I come back to :P
>
> > Anyway, TL;DR: I think we need to be mindful of reviewer sanity as a factor
> > in all this too :)
> >
> > (I am speaking at Kernel Recipes then going on a very-badly-needed 2.5 week
> > vacation afterwards over the merge window so I hope to stave off burnout
> > that way. Be good if I could keep mails upon return to 3 digits, but I have
> > my doubts :P)
>
> I'd blow that in three days ;)
Haha yeah I bet :)
Cheers, Lorenzo
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-17 5:36 ` Lorenzo Stoakes
@ 2025-09-17 23:32 ` Andrew Morton
2025-09-18 10:29 ` Lorenzo Stoakes
0 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2025-09-17 23:32 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > > Perhaps being less accepting of patches during merge window is one aspect,
> > > as the merge window leading up to this cycle was almost the same review
> > > load as when the cycle started.
> >
> > I'm having trouble understanding what you said here?
>
> Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
> merge window, esp. now we have mm-new.
>
> And last merge window my review load felt similar to during a cycle, which
> was kind of crazy.
>
> So I wonder if we should be less accommodating and simply say 'sorry it's
> the merge window, no submissions accepted'?
hm, I always have a lot of emails piled up by the time mm-stable gets
merged upstream. That ~1 week between "we merged" and "-rc1" is a nice
time to go through that material and add it to mm-new. I think it
smooths things out. I mean, this is peak time for people to be
considering the new material?
(ot, that backlog is always >400 emails and a lot of that gets tossed
out anyway - either it's just too old so I request a refresh or there
was a new version, or review was unpromising, etc).
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-17 23:32 ` Andrew Morton
@ 2025-09-18 10:29 ` Lorenzo Stoakes
2025-09-18 12:07 ` David Hildenbrand
0 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 10:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Kalesh Singh, minchan, david, Liam.Howlett, rppt, pfalcato,
kernel-team, android-mm, Alexander Viro, Christian Brauner,
Jan Kara, Kees Cook, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Ben Segall, Mel Gorman, Valentin Schneider,
Jann Horn, Shuah Khan, linux-kernel, linux-fsdevel, linux-mm,
linux-trace-kernel, linux-kselftest
On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
> On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > >
> > > > Perhaps being less accepting of patches during merge window is one aspect,
> > > > as the merge window leading up to this cycle was almost the same review
> > > > load as when the cycle started.
> > >
> > > I'm having trouble understanding what you said here?
> >
> > Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
> > merge window, esp. now we have mm-new.
> >
> > And last merge window my review load felt similar to during a cycle, which
> > was kind of crazy.
> >
> > So I wonder if we should be less accommodating and simply say 'sorry it's
> > the merge window, no submissions accepted'?
>
> hm, I always have a lot of emails piled up by the time mm-stable gets
> merged upstream. That ~1 week between "we merged" and "-rc1" is a nice
> time to go through that material and add it to mm-new. I think it
> smooths things out. I mean, this is peak time for people to be
> considering the new material?
I'm confused, why is the merge window a good time to consider new material?
People have the entirety of the cycle to submit new material, and they do
so.
And equally, people are sending new revisions of old code, engaging in
discussion from old series etc. during the merge window also.
What happens is you essentially have reviewers work 9 weeks instead of 7
for a cycle without much of a let up (+ so no break from it), based on
workload from the past cycle/merge window.
I mean I can only ask that perhaps we consider not doing this in mm (I
gather many other subsystems equally have a kinda 'freeze' during this
time).
>
> (ot, that backlog is always >400 emails and a lot of that gets tossed
> out anyway - either it's just too old so I request a refresh or there
> was a new version, or review was unpromising, etc).
>
Right, which actually makes everything a lot more uncertain from a
reviewer's point of view, as we don't necessarily have a solid git base,
mm-new isn't synced with Linus's tree very much during this time etc.
Which makes the merge window actually even worse for this stuff.
From the point of view of avoiding burn out, it'd be good to manage
expectations a bit on this.
Personally I remember literally saying to David during the last merge
window 'well I hoped I could go off and work on my own stuff instead of
just review, guess not then' :)
THP is a particularly busy area at the moment which is part of all this.
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-18 10:29 ` Lorenzo Stoakes
@ 2025-09-18 12:07 ` David Hildenbrand
2025-09-18 12:49 ` Lorenzo Stoakes
0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-09-18 12:07 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Kalesh Singh, minchan, Liam.Howlett, rppt, pfalcato, kernel-team,
android-mm, Alexander Viro, Christian Brauner, Jan Kara,
Kees Cook, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Ben Segall, Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On 18.09.25 12:29, Lorenzo Stoakes wrote:
> On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
>> On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>>
>>>>
>>>>> Perhaps being less accepting of patches during merge window is one aspect,
>>>>> as the merge window leading up to this cycle was almost the same review
>>>>> load as when the cycle started.
>>>>
>>>> I'm having trouble understanding what you said here?
>>>
>>> Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
>>> merge window, esp. now we have mm-new.
>>>
>>> And last merge window my review load felt similar to during a cycle, which
>>> was kind of crazy.
>>>
>>> So I wonder if we should be less accommodating and simply say 'sorry it's
>>> the merge window, no submissions accepted'?
>>
>> hm, I always have a lot of emails piled up by the time mm-stable gets
>> merged upstream. That ~1 week between "we merged" and "-rc1" is a nice
>> time to go through that material and add it to mm-new. I think it
>> smooths things out. I mean, this is peak time for people to be
>> considering the new material?
>
> I'm confused, why is the merge window a good time to consider new material?
>
> People have the entirety of the cycle to submit new material, and they do
> so.
My view is that if you are sending a cleanup/feature during the merge
window you cannot expect a fast reply, and you should not keep sending
new versions in that timeframe expecting that all people you CCed that
should have a look actually did have a look.
--
Cheers
David / dhildenb
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-18 12:07 ` David Hildenbrand
@ 2025-09-18 12:49 ` Lorenzo Stoakes
2025-09-18 20:59 ` Andrew Morton
0 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-09-18 12:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Kalesh Singh, minchan, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, Sep 18, 2025 at 02:07:09PM +0200, David Hildenbrand wrote:
> On 18.09.25 12:29, Lorenzo Stoakes wrote:
> > On Wed, Sep 17, 2025 at 04:32:31PM -0700, Andrew Morton wrote:
> > > On Wed, 17 Sep 2025 06:36:34 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > > >
> > > > > > Perhaps being less accepting of patches during merge window is one aspect,
> > > > > > as the merge window leading up to this cycle was almost the same review
> > > > > > load as when the cycle started.
> > > > >
> > > > > I'm having trouble understanding what you said here?
> > > >
> > > > Sorry, what I mean to say is that in mm we're pretty open to taking stuff in the
> > > > merge window, esp. now we have mm-new.
> > > >
> > > > And last merge window my review load felt similar to during a cycle, which
> > > > was kind of crazy.
> > > >
> > > > So I wonder if we should be less accommodating and simply say 'sorry it's
> > > > the merge window, no submissions accepted'?
> > >
> > > hm, I always have a lot of emails piled up by the time mm-stable gets
> > > merged upstream. That ~1 week between "we merged" and "-rc1" is a nice
> > > time to go through that material and add it to mm-new. I think it
> > > smooths things out. I mean, this is peak time for people to be
> > > considering the new material?
> >
> > I'm confused, why is the merge window a good time to consider new material?
> >
> > People have the entirety of the cycle to submit new material, and they do
> > so.
>
> My view is that if you are sending a cleanup/feature during the merge window
> you cannot expect a fast reply, and you should not keep sending new versions
> in that timeframe expecting that all people you CCed that should have a look
> actually did have a look.
Yes exactly.
The problem is all the conversations and respins and etc. _do_ carry on as
normal, and often land in mm-new, queued up for mm-unstable etc. unless we
happen to notice them.
So it makes it impossible for us to just ignore until the next cycle (or need to
go through every thread that happened afterwards).
And people know that, so just keep on submitting as normal. That was _really_
palpable last merge window.
I mean I'm cheating by going on vacation for this merge window ;) but that obviously
means I'll have 2 weeks of review to check when I get back + 1st week of cycle
to go too.
I think in some subsystems new series/respins are actively unwelcome during the
merge window. I wonder if we should be the same?
>
> --
> Cheers
>
> David / dhildenb
>
Cheers, Lorenzo
* Re: [PATCH v2 0/7] vma count: fixes, test and improvements
2025-09-18 12:49 ` Lorenzo Stoakes
@ 2025-09-18 20:59 ` Andrew Morton
0 siblings, 0 replies; 65+ messages in thread
From: Andrew Morton @ 2025-09-18 20:59 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: David Hildenbrand, Kalesh Singh, minchan, Liam.Howlett, rppt,
pfalcato, kernel-team, android-mm, Alexander Viro,
Christian Brauner, Jan Kara, Kees Cook, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Ben Segall,
Mel Gorman, Valentin Schneider, Jann Horn, Shuah Khan,
linux-kernel, linux-fsdevel, linux-mm, linux-trace-kernel,
linux-kselftest
On Thu, 18 Sep 2025 13:49:33 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> > > I'm confused, why is the merge window a good time to consider new material?
> > >
> > > People have the entirety of the cycle to submit new material, and they do
> > > so.
> >
> > My view is that if you are sending a cleanup/feature during the merge window
> > you cannot expect a fast reply, and you should not keep sending new versions
> > in that timeframe expecting that all people you CCed that should have a look
> > actually did have a look.
>
> Yes exactly.
>
> The problem is all the conversations and respins and etc. _do_ carry on as
> normal, and often land in mm-new, queued up for mm-unstable etc. unless we
> happen to notice them.
>
> So it makes it impossible for us to just ignore until the next cycle (or need to
> go through every thread that happened afterwards).
>
> And people know that, so just keep on submitting as normal. That was _really_
> palpable last merge window.
>
Well, what else do we have to do during the merge window? The previous
cycle's batch is merged up and there may be some fallout from that, but
it isn't very time-consuming.
If you're proposing that we start to use that period as a break for
sanity purposes then OK, didn't see that one coming, don't know how
widespread this desire is.
But perhaps a better time for this quiet period is during -rc6 and -rc7
when the rate of merging is throttled right back. Or perhaps from -rc7
to mid-merge-window.