public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
@ 2026-03-22  7:08 Suren Baghdasaryan
  2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
  2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
  0 siblings, 2 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2026-03-22  7:08 UTC (permalink / raw)
  To: akpm
  Cc: hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt, willy,
	Liam.Howlett, ljs, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel, Suren Baghdasaryan

When shrinking lruvec, MGLRU pins the address space before walking it.
This is excessive since all it needs for walking the page range is
a stable mm_struct to be able to take and release mmap_read_lock and
a stable mm->mm_mt tree to walk. This address space pinning results
in delays when releasing the memory of a dying process. It also
prevents mm reapers (both the in-kernel oom-reaper and userspace
process_mrelease()) from doing their job during an MGLRU scan because
they check task_will_free_mem(), which will yield a negative result due
to the elevated mm->mm_users.

Replace the unnecessary address space pinning with mm_struct pinning by
replacing the mmget/mmput calls with mmgrab/mmdrop. mm_mt is contained
within mm_struct itself, therefore it won't be freed as long as the
mm_struct is stable, and it won't change during the walk because
mmap_read_lock is being held.
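The distinction between the two pins can be sketched as a hypothetical
userspace model of the two mm_struct refcounts (this is NOT kernel code;
the struct and function names merely mirror the kernel ones):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the two mm_struct refcounts:
 *  - users models mm->mm_users: pins the address space (mmget/mmput);
 *    dropping the last user tears down the VMAs (exit_mmap()).
 *  - count models mm->mm_count: pins only the mm_struct itself
 *    (mmgrab/mmdrop); dropping the last count frees the struct.
 */
struct mm_model {
	int users;          /* mm_users: address space pin */
	int count;          /* mm_count: struct lifetime pin */
	bool mmap_done;     /* set once exit_mmap() has run */
	bool freed;         /* set once __mmdrop() has run */
};

static void model_mmgrab(struct mm_model *mm) { mm->count++; }

static void model_mmdrop(struct mm_model *mm)
{
	if (--mm->count == 0)
		mm->freed = true;      /* __mmdrop(): cheap metadata free */
}

static void model_mmput(struct mm_model *mm)
{
	if (--mm->users == 0) {
		mm->mmap_done = true;  /* exit_mmap(): expensive teardown */
		model_mmdrop(mm);      /* mm_users holds one mm_count ref */
	}
}

/* task_will_free_mem()-style check: reapers bail if mm_users > 1 */
static bool model_task_will_free_mem(struct mm_model *mm)
{
	return mm->users == 1;
}
```

In this model, a walker holding only a count reference no longer makes
the reapers' mm_users check fail, and the exiting task's final mmput()
releases the memory without waiting on the walker.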

Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/vmscan.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33287ba4a500..68e8e90e38f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2863,8 +2863,9 @@ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
 		return NULL;
 
 	clear_bit(key, &mm->lru_gen.bitmap);
+	mmgrab(mm);
 
-	return mmget_not_zero(mm) ? mm : NULL;
+	return mm;
 }
 
 void lru_gen_add_mm(struct mm_struct *mm)
@@ -3064,7 +3065,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
 		reset_bloom_filter(mm_state, walk->seq + 1);
 
 	if (*iter)
-		mmput_async(*iter);
+		mmdrop(*iter);
 
 	*iter = mm;
 

base-commit: 8c65073d94c8b7cc3170de31af38edc9f5d96f0e
-- 
2.53.0.1018.g2bb0e51243-goog



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-22  7:08 [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space Suren Baghdasaryan
@ 2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
  2026-03-23 16:19   ` Suren Baghdasaryan
  2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
  1 sibling, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-23 13:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> When shrinking lruvec, MGLRU pins address space before walking it.
> This is excessive since all it needs for walking the page range is
> a stable mm_struct to be able to take and release mmap_read_lock and
> a stable mm->mm_mt tree to walk. This address space pinning results

Hmm, I guess exit_mmap() calls __mt_destroy(), but that'll just destroy
allocated state and leave the tree empty right, so traversal of that tree
at that point would just do nothing?

> in delays when releasing the memory of a dying process. This also
> prevents mm reapers (both in-kernel oom-reaper and userspace
> process_mrelease()) from doing their job during MGLRU scan because
> they check task_will_free_mem() which will yield negative result due
> to the elevated mm->mm_users.
>
> Replace unnecessary address space pinning with mm_struct pinning by
> replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> within mm_struct itself, therefore it won't be freed as long as
> mm_struct is stable and it won't change during the walk because
> mmap_read_lock is being held.
>
> Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..68e8e90e38f5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2863,8 +2863,9 @@ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
>  		return NULL;

Not related to this series, but I really don't like how coupled MGLRU is to
the rest of the 'classic' reclaim code.

Just in the middle of vmscan you walk into generic mm walker logic and the
only hint it's MGLRU is you see lru_gen_xxx stuff (I'm also annoyed that we
call it MGLRU but it's called lru_gen_xxx in the kernel :)

>
>  	clear_bit(key, &mm->lru_gen.bitmap);
> +	mmgrab(mm);

Is the mm somehow pinned here or, on destruction, would it be removed
from the mm list, meaning that we can safely assume we have something
sane in mm-> to grab? I guess this must have already been the case for
mmget_not_zero() to have been used before though.

>
> -	return mmget_not_zero(mm) ? mm : NULL;
> +	return mm;
>  }
>
>  void lru_gen_add_mm(struct mm_struct *mm)
> @@ -3064,7 +3065,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
>  		reset_bloom_filter(mm_state, walk->seq + 1);
>
>  	if (*iter)
> -		mmput_async(*iter);
> +		mmdrop(*iter);

This will now be a blocking call that could free the mm (via __mmdrop()),
so could take a while, is that ok?

If before the code was intentionally deferring work here, doesn't that
imply that being slow here might be an issue, somehow? Or was it just
because they could? :)

>
>  	*iter = mm;
>
>
> base-commit: 8c65073d94c8b7cc3170de31af38edc9f5d96f0e
> --
> 2.53.0.1018.g2bb0e51243-goog
>

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-22  7:08 [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space Suren Baghdasaryan
  2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
  2026-03-23 16:26   ` Suren Baghdasaryan
  1 sibling, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-23 13:43 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> When shrinking lruvec, MGLRU pins address space before walking it.
> This is excessive since all it needs for walking the page range is
> a stable mm_struct to be able to take and release mmap_read_lock and
> a stable mm->mm_mt tree to walk. This address space pinning results
> in delays when releasing the memory of a dying process. This also
> prevents mm reapers (both in-kernel oom-reaper and userspace
> process_mrelease()) from doing their job during MGLRU scan because
> they check task_will_free_mem() which will yield negative result due
> to the elevated mm->mm_users.
>
> Replace unnecessary address space pinning with mm_struct pinning by
> replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> within mm_struct itself, therefore it won't be freed as long as
> mm_struct is stable and it won't change during the walk because
> mmap_read_lock is being held.
>
> Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Oh and sorry, missed this - do we need a 'Fixes' tag for something that's
just a perf improvement?

This doesn't seem buggy?

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 16:19   ` Suren Baghdasaryan
  2026-03-23 17:06     ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 9+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 16:19 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Mon, Mar 23, 2026 at 6:43 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> > When shrinking lruvec, MGLRU pins address space before walking it.
> > This is excessive since all it needs for walking the page range is
> > a stable mm_struct to be able to take and release mmap_read_lock and
> > a stable mm->mm_mt tree to walk. This address space pinning results
>
> Hmm, I guess exit_mmap() calls __mt_destroy(), but that'll just destroy
> allocated state and leave the tree empty right, so traversal of that tree
> at that point would just do nothing?

Correct. And __mt_destroy() happens under mmap_write_lock while
traversal under mmap_read_lock, so they should not race.

>
> > in delays when releasing the memory of a dying process. This also
> > prevents mm reapers (both in-kernel oom-reaper and userspace
> > process_mrelease()) from doing their job during MGLRU scan because
> > they check task_will_free_mem() which will yield negative result due
> > to the elevated mm->mm_users.
> >
> > Replace unnecessary address space pinning with mm_struct pinning by
> > replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> > within mm_struct itself, therefore it won't be freed as long as
> > mm_struct is stable and it won't change during the walk because
> > mmap_read_lock is being held.
> >
> > Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/vmscan.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 33287ba4a500..68e8e90e38f5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2863,8 +2863,9 @@ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
> >               return NULL;
>
> Not related to this series, but I really don't like how coupled MGLRU is to
> the rest of the 'classic' reclaim code.
>
> Just in the middle of vmscan you walk into generic mm walker logic and the
> only hint it's MGLRU is you see lru_gen_xxx stuff (I'm also annoyed that we
> call it MGLRU but it's called lru_gen_xxx in the kernel :)

I don't have a strong opinion on this. Perhaps the naming can be
changed outside of this series.

>
> >
> >       clear_bit(key, &mm->lru_gen.bitmap);
> > +     mmgrab(mm);
>
> Is the mm somehow pinned here or, on destruction, would move it from the mm
> list meaning that we can safely assume we have something sane in mm-> to
> grab? I guess this must have already been the case for mmget_not_zero() to
> have been used before though.

Yes, mm is stable because it's fetched from mm_list. When mm is added
to this list via lru_gen_add_mm(mm) it is referenced and that
reference is dropped only after lru_gen_del_mm(mm) removes the mm from
this list (see https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1185
and https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1187).
Addition, removal and retrieval from that list happen under
mm_list->lock which prevents races.
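That guarantee can be modelled as follows (hypothetical userspace
sketch, not kernel code): every mm on the list holds one reference,
and add/fetch both run under the list lock, so anything fetched from
the list provably has a nonzero count and is safe to grab.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_node {
	int count;              /* models mm->mm_count */
	struct mm_node *next;
	bool on_list;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct mm_node *mm_list;

static void model_add_mm(struct mm_node *mm)  /* lru_gen_add_mm() */
{
	pthread_mutex_lock(&list_lock);
	mm->count++;            /* the list's own reference */
	mm->on_list = true;
	mm->next = mm_list;
	mm_list = mm;
	pthread_mutex_unlock(&list_lock);
}

/* get_next_mm()-style fetch: under the lock, count is >= 1 */
static struct mm_node *model_get_mm(void)
{
	struct mm_node *mm;

	pthread_mutex_lock(&list_lock);
	mm = mm_list;
	if (mm)
		mm->count++;    /* the mmgrab() the patch adds */
	pthread_mutex_unlock(&list_lock);
	return mm;
}
```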

>
> >
> > -     return mmget_not_zero(mm) ? mm : NULL;
> > +     return mm;
> >  }
> >
> >  void lru_gen_add_mm(struct mm_struct *mm)
> > @@ -3064,7 +3065,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
> >               reset_bloom_filter(mm_state, walk->seq + 1);
> >
> >       if (*iter)
> > -             mmput_async(*iter);
> > +             mmdrop(*iter);
>
> This will now be a blocking call that could free the mm (via __mmdrop()),
> so could take a while, is that ok?

mmdrop() should not be a heavy-weight operation. It simply destroys
the metadata associated with mm_struct. mmput() OTOH will call
exit_mmap() if it drops the last reference and that can take a while
because that's when we free the memory of the process. I believe
that's why mmput_async() was used here.

>
> If before the code was intentionally deferring work here, doesn't that
> imply that being slow here might be an issue, somehow? Or was it just
> because they could? :)

I think the reason was the possibility of calling mmput() -> __mmput()
-> exit_mmap(mm) which could indeed block us for a while.

>
> >
> >       *iter = mm;
> >
> >
> > base-commit: 8c65073d94c8b7cc3170de31af38edc9f5d96f0e
> > --
> > 2.53.0.1018.g2bb0e51243-goog
> >
>
> Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 16:26   ` Suren Baghdasaryan
  2026-03-23 17:02     ` Lorenzo Stoakes (Oracle)
  0 siblings, 1 reply; 9+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 16:26 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Mon, Mar 23, 2026 at 6:43 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> > When shrinking lruvec, MGLRU pins address space before walking it.
> > This is excessive since all it needs for walking the page range is
> > a stable mm_struct to be able to take and release mmap_read_lock and
> > a stable mm->mm_mt tree to walk. This address space pinning results
> > in delays when releasing the memory of a dying process. This also
> > prevents mm reapers (both in-kernel oom-reaper and userspace
> > process_mrelease()) from doing their job during MGLRU scan because
> > they check task_will_free_mem() which will yield negative result due
> > to the elevated mm->mm_users.
> >
> > Replace unnecessary address space pinning with mm_struct pinning by
> > replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> > within mm_struct itself, therefore it won't be freed as long as
> > mm_struct is stable and it won't change during the walk because
> > mmap_read_lock is being held.
> >
> > Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Oh and sorry, missed this - do we need a 'Fixed' for something that's just a
> perf improvement?
>
> This doesn't seem buggy?

Well, if we are pinning something we should not, is that a bug? I
didn't CC stable, so I don't expect it to be backported (though if
it's accepted, we will backport it in Android as it causes real
problems with reclaim), however if someone is looking for MGLRU fixes
I would like this patch to be discoverable. If you feel strongly about
it, I'll remove the tag.

Thanks for the review!

>
> Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-23 16:26   ` Suren Baghdasaryan
@ 2026-03-23 17:02     ` Lorenzo Stoakes (Oracle)
  2026-03-23 17:43       ` Suren Baghdasaryan
  0 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-23 17:02 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Mon, Mar 23, 2026 at 09:26:02AM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 23, 2026 at 6:43 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> > > When shrinking lruvec, MGLRU pins address space before walking it.
> > > This is excessive since all it needs for walking the page range is
> > > a stable mm_struct to be able to take and release mmap_read_lock and
> > > a stable mm->mm_mt tree to walk. This address space pinning results
> > > in delays when releasing the memory of a dying process. This also
> > > prevents mm reapers (both in-kernel oom-reaper and userspace
> > > process_mrelease()) from doing their job during MGLRU scan because
> > > they check task_will_free_mem() which will yield negative result due
> > > to the elevated mm->mm_users.
> > >
> > > Replace unnecessary address space pinning with mm_struct pinning by
> > > replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> > > within mm_struct itself, therefore it won't be freed as long as
> > > mm_struct is stable and it won't change during the walk because
> > > mmap_read_lock is being held.
> > >
> > > Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >
> > Oh and sorry, missed this - do we need a 'Fixed' for something that's just a
> > perf improvement?
> >
> > This doesn't seem buggy?
>
> Well, if we are pinning something we should not is that a bug? I
> didn't CC stable, so I don't expect it to be backported (though if
> it's accepted, we will backport it in Android as it causes real
> problems with reclaim), however if someone is looking for MGLRU fixes
> I would like this patch to be discoverable. If you feel strongly about
> it, I'll remove the tag.

I don't feel that strongly about it :>)

>
> Thanks for the review!

No problem! Working through your vma killable series atm also :)

>
> >
> > Thanks, Lorenzo

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-23 16:19   ` Suren Baghdasaryan
@ 2026-03-23 17:06     ` Lorenzo Stoakes (Oracle)
  2026-03-23 17:24       ` Suren Baghdasaryan
  0 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes (Oracle) @ 2026-03-23 17:06 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Mon, Mar 23, 2026 at 09:19:04AM -0700, Suren Baghdasaryan wrote:
> On Mon, Mar 23, 2026 at 6:43 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> > > When shrinking lruvec, MGLRU pins address space before walking it.
> > > This is excessive since all it needs for walking the page range is
> > > a stable mm_struct to be able to take and release mmap_read_lock and
> > > a stable mm->mm_mt tree to walk. This address space pinning results
> >
> > Hmm, I guess exit_mmap() calls __mt_destroy(), but that'll just destroy
> > allocated state and leave the tree empty right, so traversal of that tree
> > at that point would just do nothing?
>
> Correct. And __mt_destroy() happens under mmap_write_lock while
> traversal under mmap_read_lock, so they should not race.

Yeah that's fair.

>
> >
> > > in delays when releasing the memory of a dying process. This also
> > > prevents mm reapers (both in-kernel oom-reaper and userspace
> > > process_mrelease()) from doing their job during MGLRU scan because
> > > they check task_will_free_mem() which will yield negative result due
> > > to the elevated mm->mm_users.
> > >
> > > Replace unnecessary address space pinning with mm_struct pinning by
> > > replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> > > within mm_struct itself, therefore it won't be freed as long as
> > > mm_struct is stable and it won't change during the walk because
> > > mmap_read_lock is being held.
> > >
> > > Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>

Given you have cleared up my concerns, this LGTM, so:

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

> > > ---
> > >  mm/vmscan.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 33287ba4a500..68e8e90e38f5 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2863,8 +2863,9 @@ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
> > >               return NULL;
> >
> > Not related to this series, but I really don't like how coupled MGLRU is to
> > the rest of the 'classic' reclaim code.
> >
> > Just in the middle of vmscan you walk into generic mm walker logic and the
> > only hint it's MGLRU is you see lru_gen_xxx stuff (I'm also annoyed that we
> > call it MGLRU but it's called lru_gen_xxx in the kernel :)
>
> I don't have a strong opinion on this. Perhaps the naming can be
> changed outside of this series.

I was thinking more of a new file for mglru :>)

I believe we also need some more active maintainership... but that's
another issue ;)

>
> >
> > >
> > >       clear_bit(key, &mm->lru_gen.bitmap);
> > > +     mmgrab(mm);
> >
> > Is the mm somehow pinned here or, on destruction, would move it from the mm
> > list meaning that we can safely assume we have something sane in mm-> to
> > grab? I guess this must have already been the case for mmget_not_zero() to
> > have been used before though.
>
> Yes, mm is stable because it's fetched from mm_list. When mm is added
> to this list via lru_gen_add_mm(mm) it is referenced and that
> reference is dropped only after lru_gen_del_mm(mm) removes the mm from
> this list (see https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1185
> and https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1187).
> Addition, removal and retrieval from that list happen under
> mm_list->lock which prevents races.

Ack, thanks!

>
> >
> > >
> > > -     return mmget_not_zero(mm) ? mm : NULL;
> > > +     return mm;
> > >  }
> > >
> > >  void lru_gen_add_mm(struct mm_struct *mm)
> > > @@ -3064,7 +3065,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
> > >               reset_bloom_filter(mm_state, walk->seq + 1);
> > >
> > >       if (*iter)
> > > -             mmput_async(*iter);
> > > +             mmdrop(*iter);
> >
> > This will now be a blocking call that could free the mm (via __mmdrop()),
> > so could take a while, is that ok?
>
> mmdrop() should not be a heavy-weight operation. It simply destroys
> the metadata associated with mm_struct. mmput() OTOH will call
> exit_mmap() if it drops the last reference and that can take a while
> because that's when we free the memory of the process. I believe
> that's why mmput_async() was used here.

Yeah that's fair enough! Thanks.

>
> >
> > If before the code was intentionally deferring work here, doesn't that
> > imply that being slow here might be an issue, somehow? Or was it just
> > because they could? :)
>
> I think the reason was the possibility of calling mmput() -> __mmput()
> -> exit_mmap(mm) which could indeed block us for a while.

Yeah fair :)

>
> >
> > >
> > >       *iter = mm;
> > >
> > >
> > > base-commit: 8c65073d94c8b7cc3170de31af38edc9f5d96f0e
> > > --
> > > 2.53.0.1018.g2bb0e51243-goog
> > >
> >
> > Thanks, Lorenzo

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-23 17:06     ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 17:24       ` Suren Baghdasaryan
  0 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 17:24 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Mon, Mar 23, 2026 at 10:06 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Mon, Mar 23, 2026 at 09:19:04AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Mar 23, 2026 at 6:43 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> > > > When shrinking lruvec, MGLRU pins address space before walking it.
> > > > This is excessive since all it needs for walking the page range is
> > > > a stable mm_struct to be able to take and release mmap_read_lock and
> > > > a stable mm->mm_mt tree to walk. This address space pinning results
> > >
> > > Hmm, I guess exit_mmap() calls __mt_destroy(), but that'll just destroy
> > > allocated state and leave the tree empty right, so traversal of that tree
> > > at that point would just do nothing?
> >
> > Correct. And __mt_destroy() happens under mmap_write_lock while
> > traversal under mmap_read_lock, so they should not race.
>
> Yeah that's fair.
>
> >
> > >
> > > > in delays when releasing the memory of a dying process. This also
> > > > prevents mm reapers (both in-kernel oom-reaper and userspace
> > > > process_mrelease()) from doing their job during MGLRU scan because
> > > > they check task_will_free_mem() which will yield negative result due
> > > > to the elevated mm->mm_users.
> > > >
> > > > Replace unnecessary address space pinning with mm_struct pinning by
> > > > replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> > > > within mm_struct itself, therefore it won't be freed as long as
> > > > mm_struct is stable and it won't change during the walk because
> > > > mmap_read_lock is being held.
> > > >
> > > > Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> Given you have cleared up my concerns, this LGTM, so:
>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Thanks!

>
> > > > ---
> > > >  mm/vmscan.c | 5 +++--
> > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 33287ba4a500..68e8e90e38f5 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2863,8 +2863,9 @@ static struct mm_struct *get_next_mm(struct lru_gen_mm_walk *walk)
> > > >               return NULL;
> > >
> > > Not related to this series, but I really don't like how coupled MGLRU is to
> > > the rest of the 'classic' reclaim code.
> > >
> > > Just in the middle of vmscan you walk into generic mm walker logic and the
> > > only hint it's MGLRU is you see lru_gen_xxx stuff (I'm also annoyed that we
> > > call it MGLRU but it's called lru_gen_xxx in the kernel :)
> >
> > I don't have a strong opinion on this. Perhaps the naming can be
> > changed outside of this series.
>
> I was thinking more of a new file for mglru :>)
>
> I believe we also need some more active maintainership also... but that's
> another issue ;)

Yes, in my team I volunteered one developer to actively review and
support MGLRU. Over time I hope he will be in a position to help with
maintenance.

>
> >
> > >
> > > >
> > > >       clear_bit(key, &mm->lru_gen.bitmap);
> > > > +     mmgrab(mm);
> > >
> > > Is the mm somehow pinned here or, on destruction, would move it from the mm
> > > list meaning that we can safely assume we have something sane in mm-> to
> > > grab? I guess this must have already been the case for mmget_not_zero() to
> > > have been used before though.
> >
> > Yes, mm is stable because it's fetched from mm_list. When mm is added
> > to this list via lru_gen_add_mm(mm) it is referenced and that
> > reference is dropped only after lru_gen_del_mm(mm) removes the mm from
> > this list (see https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1185
> > and https://elixir.bootlin.com/linux/v7.0-rc4/source/kernel/fork.c#L1187).
> > Addition, removal and retrieval from that list happen under
> > mm_list->lock which prevents races.
>
> Ack, thanks!
>
> >
> > >
> > > >
> > > > -     return mmget_not_zero(mm) ? mm : NULL;
> > > > +     return mm;
> > > >  }
> > > >
> > > >  void lru_gen_add_mm(struct mm_struct *mm)
> > > > @@ -3064,7 +3065,7 @@ static bool iterate_mm_list(struct lru_gen_mm_walk *walk, struct mm_struct **ite
> > > >               reset_bloom_filter(mm_state, walk->seq + 1);
> > > >
> > > >       if (*iter)
> > > > -             mmput_async(*iter);
> > > > +             mmdrop(*iter);
> > >
> > > This will now be a blocking call that could free the mm (via __mmdrop()),
> > > so could take a while, is that ok?
> >
> > mmdrop() should not be a heavy-weight operation. It simply destroys
> > the metadata associated with mm_struct. mmput() OTOH will call
> > exit_mmap() if it drops the last reference and that can take a while
> > because that's when we free the memory of the process. I believe
> > that's why mmput_async() was used here.
>
> Yeah that's fair enough! Thanks.
>
> >
> > >
> > > If before the code was intentionally deferring work here, doesn't that
> > > imply that being slow here might be an issue, somehow? Or was it just
> > > because they could? :)
> >
> > I think the reason was the possibility of calling mmput() -> __mmput()
> > -> exit_mmap(mm) which could indeed block us for a while.
>
> Yeah fair :)
>
> >
> > >
> > > >
> > > >       *iter = mm;
> > > >
> > > >
> > > > base-commit: 8c65073d94c8b7cc3170de31af38edc9f5d96f0e
> > > > --
> > > > 2.53.0.1018.g2bb0e51243-goog
> > > >
> > >
> > > Thanks, Lorenzo
>
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space
  2026-03-23 17:02     ` Lorenzo Stoakes (Oracle)
@ 2026-03-23 17:43       ` Suren Baghdasaryan
  0 siblings, 0 replies; 9+ messages in thread
From: Suren Baghdasaryan @ 2026-03-23 17:43 UTC (permalink / raw)
  To: Lorenzo Stoakes (Oracle)
  Cc: akpm, hannes, david, mhocko, zhengqi.arch, yuzhao, shakeel.butt,
	willy, Liam.Howlett, axelrasmussen, yuanchu, weixugc, linux-mm,
	linux-kernel

On Mon, Mar 23, 2026 at 10:02 AM Lorenzo Stoakes (Oracle)
<ljs@kernel.org> wrote:
>
> On Mon, Mar 23, 2026 at 09:26:02AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Mar 23, 2026 at 6:43 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> > >
> > > On Sun, Mar 22, 2026 at 12:08:43AM -0700, Suren Baghdasaryan wrote:
> > > > When shrinking lruvec, MGLRU pins address space before walking it.
> > > > This is excessive since all it needs for walking the page range is
> > > > a stable mm_struct to be able to take and release mmap_read_lock and
> > > > a stable mm->mm_mt tree to walk. This address space pinning results
> > > > in delays when releasing the memory of a dying process. This also
> > > > prevents mm reapers (both in-kernel oom-reaper and userspace
> > > > process_mrelease()) from doing their job during MGLRU scan because
> > > > they check task_will_free_mem() which will yield negative result due
> > > > to the elevated mm->mm_users.
> > > >
> > > > Replace unnecessary address space pinning with mm_struct pinning by
> > > > replacing mmget/mmput with mmgrab/mmdrop calls. mm_mt is contained
> > > > within mm_struct itself, therefore it won't be freed as long as
> > > > mm_struct is stable and it won't change during the walk because
> > > > mmap_read_lock is being held.
> > > >
> > > > Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > >
> > > Oh and sorry, missed this - do we need a 'Fixed' for something that's just a
> > > perf improvement?
> > >
> > > This doesn't seem buggy?
> >
> > Well, if we are pinning something we should not is that a bug? I
> > didn't CC stable, so I don't expect it to be backported (though if
> > it's accepted, we will backport it in Android as it causes real
> > problems with reclaim), however if someone is looking for MGLRU fixes
> > I would like this patch to be discoverable. If you feel strongly about
> > it, I'll remove the tag.
>
> I don't feel that strongly about it :>)
>
> >
> > Thanks for the review!
>
> No problem! Working through your vma killable series atm also :)

Thanks!
I will need to respin v5 after Sashiko found a couple of legit issues
at [1] but I'll wait for you to post your feedback before sending it.

[1] https://sashiko.dev/#/patchset/20260322054309.898214-1-surenb@google.com

>
> >
> > >
> > > Thanks, Lorenzo
>
> Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-03-23 17:43 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-22  7:08 [PATCH 1/1] mm/vmscan: prevent MGLRU reclaim from pinning address space Suren Baghdasaryan
2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
2026-03-23 16:19   ` Suren Baghdasaryan
2026-03-23 17:06     ` Lorenzo Stoakes (Oracle)
2026-03-23 17:24       ` Suren Baghdasaryan
2026-03-23 13:43 ` Lorenzo Stoakes (Oracle)
2026-03-23 16:26   ` Suren Baghdasaryan
2026-03-23 17:02     ` Lorenzo Stoakes (Oracle)
2026-03-23 17:43       ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox