* [PATCH] mm: skip lru_note_cost() when scanning only file or anon
@ 2025-07-11 15:50 Roman Gushchin
2025-07-11 17:20 ` Johannes Weiner
2025-07-11 18:18 ` Roman Gushchin
0 siblings, 2 replies; 10+ messages in thread
From: Roman Gushchin @ 2025-07-11 15:50 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Shakeel Butt, Lorenzo Stoakes, Michal Hocko,
David Hildenbrand, linux-mm, linux-kernel, Roman Gushchin
lru_note_cost() records relative cost of incurring io and cpu spent
on lru rotations, which is used to balance the pressure on file and
anon memory. The applied pressure is inversely proportional to the
recorded cost of reclaiming, but only within 2/3 of the range
(swappiness aside).
This is useful when both anon and file memory are reclaimable; however,
in many cases that isn't so: e.g. there might be no swap,
proactive reclaim can target anon memory specifically,
the memory pressure can come from cgroup v1's memsw limit, etc.
In all these cases recording the cost only biases all subsequent
reclaim, potentially even outside of the scope of the original memcg.
So it's better not to record the cost if it comes from the initially
biased reclaim.
lru_note_cost() is a relatively expensive function, which traverses
the memcg tree up to the root and takes the lruvec lock on each level.
Overall it's responsible for about 50% of cycles spent on lruvec lock,
which might be a non-trivial number overall under heavy memory
pressure. So optimizing out a large number of lru_note_cost() calls
is also beneficial from the performance perspective.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
mm/vmscan.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c86a2495138a..33fd67eab0c9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -71,6 +71,13 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>
+enum scan_balance {
+ SCAN_EQUAL,
+ SCAN_FRACT,
+ SCAN_ANON,
+ SCAN_FILE,
+};
+
struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
@@ -90,6 +97,7 @@ struct scan_control {
/*
* Scan pressure balancing between anon and file LRUs
*/
+ enum scan_balance scan_balance;
unsigned long anon_cost;
unsigned long file_cost;
@@ -1988,6 +1996,17 @@ static int current_may_throttle(void)
return !(current->flags & PF_LOCAL_THROTTLE);
}
+static bool scan_balance_biased(struct scan_control *sc)
+{
+ switch (sc->scan_balance) {
+ case SCAN_EQUAL:
+ case SCAN_FRACT:
+ return false;
+ default:
+ return true;
+ }
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -2054,7 +2073,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
spin_unlock_irq(&lruvec->lru_lock);
- lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
+ if (!scan_balance_biased(&sc))
+ lru_note_cost(lruvec, file, stat.nr_pageout,
+ nr_scanned - nr_reclaimed);
/*
* If dirty folios are scanned that are not queued for IO, it
@@ -2202,7 +2223,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&lruvec->lru_lock);
- if (nr_rotated)
+ if (nr_rotated && !scan_balance_biased(&sc))
lru_note_cost(lruvec, file, 0, nr_rotated);
trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
nr_deactivate, nr_rotated, sc->priority, file);
@@ -2327,13 +2348,6 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
return inactive * inactive_ratio < active;
}
-enum scan_balance {
- SCAN_EQUAL,
- SCAN_FRACT,
- SCAN_ANON,
- SCAN_FILE,
-};
-
static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
{
unsigned long file;
@@ -2613,6 +2627,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
calculate_pressure_balance(sc, swappiness, fraction, &denominator);
out:
+ sc->scan_balance = scan_balance;
+
for_each_evictable_lru(lru) {
bool file = is_file_lru(lru);
unsigned long lruvec_size;
--
2.50.0
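
For reference, the "2/3 of the range" mentioned in the changelog above falls out of how the
recorded costs feed the scan balance: each side's effective cost includes the combined total,
so neither LRU can receive less than roughly a third of the pressure. Below is a minimal
userspace sketch of that arithmetic, loosely mirroring calculate_pressure_balance() with
made-up cost values and swappiness treated as neutral; it is an illustration, not kernel code.

#include <stdio.h>

/*
 * Toy model of the anon/file pressure balancing done by
 * calculate_pressure_balance() in mm/vmscan.c.  Simplified and
 * userspace-only; swappiness is treated as neutral so that only the
 * recorded costs matter, and the cost values are made up.
 */
static void balance(unsigned long anon_cost, unsigned long file_cost)
{
	/* Each side's effective cost includes the total, which bounds the skew. */
	unsigned long total = anon_cost + file_cost;
	unsigned long a = total + anon_cost;	/* ranges from total to 2*total */
	unsigned long f = total + file_cost;	/* ranges from total to 2*total */
	unsigned long sum = a + f;

	/* Pressure is inversely proportional to the effective cost. */
	unsigned long ap = 100 * (sum + 1) / (a + 1);
	unsigned long fp = 100 * (sum + 1) / (f + 1);

	printf("anon_cost=%4lu file_cost=%4lu -> anon %lu%%, file %lu%%\n",
	       anon_cost, file_cost, 100 * ap / (ap + fp), 100 * fp / (ap + fp));
}

int main(void)
{
	balance(0, 1000);	/* all cost on file: anon takes ~2/3 of the pressure */
	balance(1000, 1000);	/* equal costs: 50/50 */
	balance(1000, 0);	/* all cost on anon: anon takes only ~1/3 */
	return 0;
}

Running it prints roughly 66/33, 50/50 and 33/66 splits, which is the bounded bias the
changelog refers to.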
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-11 15:50 [PATCH] mm: skip lru_note_cost() when scanning only file or anon Roman Gushchin
@ 2025-07-11 17:20 ` Johannes Weiner
2025-07-11 17:55 ` Roman Gushchin
2025-07-11 18:18 ` Roman Gushchin
1 sibling, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2025-07-11 17:20 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, Shakeel Butt, Lorenzo Stoakes, Michal Hocko,
David Hildenbrand, linux-mm, linux-kernel
On Fri, Jul 11, 2025 at 08:50:44AM -0700, Roman Gushchin wrote:
> lru_note_cost() records relative cost of incurring io and cpu spent
> on lru rotations, which is used to balance the pressure on file and
> anon memory. The applied pressure is inversely proportional to the
> recorded cost of reclaiming, but only within 2/3 of the range
> (swappiness aside).
>
> This is useful when both anon and file memory are reclaimable; however,
> in many cases that isn't so: e.g. there might be no swap,
> proactive reclaim can target anon memory specifically,
> the memory pressure can come from cgroup v1's memsw limit, etc.
> In all these cases recording the cost only biases all subsequent
> reclaim, potentially even outside of the scope of the original memcg.
>
> So it's better not to record the cost if it comes from the initially
> biased reclaim.
>
> lru_note_cost() is a relatively expensive function, which traverses
> the memcg tree up to the root and takes the lruvec lock on each level.
> Overall it's responsible for about 50% of cycles spent on lruvec lock,
> which might be a non-trivial number overall under heavy memory
> pressure. So optimizing out a large number of lru_note_cost() calls
> is also beneficial from the performance perspective.
Does it actually help? It's under elevated pressure, when lru locks
are the most contended, that we also usually scan both lists.
The caveat with this patch is that, aside from the static noswap
scenario, modes can switch back and forth abruptly or even overlap.
So if you leave a pressure scenario and go back to cache trimming, you
will no longer age the cost information anymore. The next spike could
be starting out with potentially quite stale information.
Or say proactive reclaim recently already targeted anon, and there
were rotations and pageouts; that would be useful data for a reactive
reclaimer doing work at around the same time, or shortly thereafter.
So for everything but the static noswap case, the patch makes me
nervous. And I'm not sure it actually helps in the cases where it
would matter the most.
It might make more sense to look into the cost (ha) of the cost
recording itself. Can we turn it into a vmstat item? That would make
it lockless, would get rstat batching up the cgroup tree etc. This
doesn't need to be 100% precise and race free after all.
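
To illustrate the direction suggested here, below is a rough sketch of what vmstat-based cost
recording could look like. mod_lruvec_state() and lruvec_page_state() are existing kernel
helpers, but NR_LRU_COST_ANON/NR_LRU_COST_FILE are hypothetical names for node_stat items
that would have to be added; this is an outline of the idea, not a tested or complete patch.

/*
 * Hypothetical sketch only: NR_LRU_COST_ANON / NR_LRU_COST_FILE stand in
 * for two new enum node_stat_item entries that do not exist today.
 */
static void note_cost_lockless(struct lruvec *lruvec, bool file,
			       unsigned int nr_io, unsigned int nr_rotated)
{
	long cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated;

	if (!cost)
		return;
	/*
	 * mod_lruvec_state() updates a per-cpu counter and lets rstat
	 * propagate it up the memcg tree, so no lruvec->lru_lock is
	 * taken on any level.
	 */
	mod_lruvec_state(lruvec, file ? NR_LRU_COST_FILE : NR_LRU_COST_ANON, cost);
}

/* The scan balancing code would then sample (possibly slightly stale) totals: */
static unsigned long lru_cost(struct lruvec *lruvec, bool file)
{
	return lruvec_page_state(lruvec, file ? NR_LRU_COST_FILE : NR_LRU_COST_ANON);
}

The read side would then be subject to rstat's flush batching, which is the concern raised in
the reply below.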
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-11 17:20 ` Johannes Weiner
@ 2025-07-11 17:55 ` Roman Gushchin
2025-07-14 15:22 ` Johannes Weiner
0 siblings, 1 reply; 10+ messages in thread
From: Roman Gushchin @ 2025-07-11 17:55 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Shakeel Butt, Lorenzo Stoakes, Michal Hocko,
David Hildenbrand, linux-mm, linux-kernel
Johannes Weiner <hannes@cmpxchg.org> writes:
> On Fri, Jul 11, 2025 at 08:50:44AM -0700, Roman Gushchin wrote:
>> lru_note_cost() records relative cost of incurring io and cpu spent
>> on lru rotations, which is used to balance the pressure on file and
>> anon memory. The applied pressure is inversely proportional to the
>> recorded cost of reclaiming, but only within 2/3 of the range
>> (swappiness aside).
>>
>> This is useful when both anon and file memory are reclaimable; however,
>> in many cases that isn't so: e.g. there might be no swap,
>> proactive reclaim can target anon memory specifically,
>> the memory pressure can come from cgroup v1's memsw limit, etc.
>> In all these cases recording the cost only biases all subsequent
>> reclaim, potentially even outside of the scope of the original memcg.
>>
>> So it's better not to record the cost if it comes from the initially
>> biased reclaim.
>>
>> lru_note_cost() is a relatively expensive function, which traverses
>> the memcg tree up to the root and takes the lruvec lock on each level.
>> Overall it's responsible for about 50% of cycles spent on lruvec lock,
>> which might be a non-trivial number overall under heavy memory
>> pressure. So optimizing out a large number of lru_note_cost() calls
>> is also beneficial from the performance perspective.
>
> Does it actually help? It's under elevated pressure, when lru locks
> are the most contended, that we also usually scan both lists.
It does, but it's mostly about the !swap or memsw limit case.
It's also pronounced mostly during high memory pressure,
when there are many rotations.
And it's pretty significant in our case: I'm actually trying to address
a production regression caused by commit 0538a82c39e9 ("mm: vmscan: make
rotations a secondary factor in balancing anon vs file"), which
added another lru_note_cost() call.
> The caveat with this patch is that, aside from the static noswap
> scenario, modes can switch back and forth abruptly or even overlap.
>
> So if you leave a pressure scenario and go back to cache trimming, you
> will no longer age the cost information anymore. The next spike could
> be starting out with potentially quite stale information.
>
> Or say proactive reclaim recently already targeted anon, and there
> were rotations and pageouts; that would be useful data for a reactive
> reclaimer doing work at around the same time, or shortly thereafter.
Agree, but at the same time it's possible to come up with a scenario
where it's not good:
        A
       / \
      B   C   memory.max=X
         / \
        D   E
Let's say we have a cgroup structure like this: we apply a lot of
proactive anon pressure on E, and then the pressure on D from C's
limit will be biased towards file without a good reason.
Or, as in my case, if a cgroup has memory.memsw.limit set and is
thrashing, does it make sense to bias the rest of the system
into anon reclaim? The recorded cost can be really large.
>
> So for everything but the static noswap case, the patch makes me
> nervous. And I'm not sure it actually helps in the cases where it
> would matter the most.
I understand, but do you think it's acceptable with some additional
conditions, e.g. narrowing it down to only very high scanning priorities?
Or to the !sc.may_swap case?
In the end, we have the following code in get_scan_count(), so at
least on priority 0 we ignore all costs anyway.
if (!sc->priority && swappiness) {
scan_balance = SCAN_EQUAL;
goto out;
}
Wdyt?
>
> It might make more sense to look into the cost (ha) of the cost
> recording itself. Can we turn it into a vmstat item? That would make
> it lockless, would get rstat batching up the cgroup tree etc. This
> doesn't need to be 100% precise and race free after all.
Idk, maybe yes, but rstat flushing has been a source of issues as well,
and now it's mostly ratelimited, so I'm concerned that we'll have
sudden changes in the reclaim behavior every 2 seconds.
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-11 15:50 [PATCH] mm: skip lru_note_cost() when scanning only file or anon Roman Gushchin
2025-07-11 17:20 ` Johannes Weiner
@ 2025-07-11 18:18 ` Roman Gushchin
2025-07-13 19:57 ` Hugh Dickins
1 sibling, 1 reply; 10+ messages in thread
From: Roman Gushchin @ 2025-07-11 18:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Shakeel Butt, Lorenzo Stoakes, Michal Hocko,
David Hildenbrand, linux-mm, linux-kernel
Sorry, sent a wrong version with a trivial bug. Below is the correct
one. The only difference is s/&sc/sc when calling scan_balance_biased().
--
From c06530edfb8a11139f2d7878ce3956b9238cc702 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <roman.gushchin@linux.dev>
Subject: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
lru_note_cost() records relative cost of incurring io and cpu spent
on lru rotations, which is used to balance the pressure on file and
anon memory. The applied pressure is inversely proportional to the
recorded cost of reclaiming, but only within 2/3 of the range
(swappiness aside).
This is useful when both anon and file memory are reclaimable; however,
in many cases that isn't so: e.g. there might be no swap,
proactive reclaim can target anon memory specifically,
the memory pressure can come from cgroup v1's memsw limit, etc.
In all these cases recording the cost only biases all subsequent
reclaim, potentially even outside of the scope of the original memcg.
So it's better not to record the cost if it comes from the initially
biased reclaim.
lru_note_cost() is a relatively expensive function, which traverses
the memcg tree up to the root and takes the lruvec lock on each level.
Overall it's responsible for about 50% of cycles spent on lruvec lock,
which might be a non-trivial number overall under heavy memory
pressure. So optimizing out a large number of lru_note_cost() calls
is also beneficial from the performance perspective.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
mm/vmscan.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c86a2495138a..7d08606b08ea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -71,6 +71,13 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>
+enum scan_balance {
+ SCAN_EQUAL,
+ SCAN_FRACT,
+ SCAN_ANON,
+ SCAN_FILE,
+};
+
struct scan_control {
/* How many pages shrink_list() should reclaim */
unsigned long nr_to_reclaim;
@@ -90,6 +97,7 @@ struct scan_control {
/*
* Scan pressure balancing between anon and file LRUs
*/
+ enum scan_balance scan_balance;
unsigned long anon_cost;
unsigned long file_cost;
@@ -1988,6 +1996,17 @@ static int current_may_throttle(void)
return !(current->flags & PF_LOCAL_THROTTLE);
}
+static bool scan_balance_biased(struct scan_control *sc)
+{
+ switch (sc->scan_balance) {
+ case SCAN_EQUAL:
+ case SCAN_FRACT:
+ return false;
+ default:
+ return true;
+ }
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -2054,7 +2073,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
spin_unlock_irq(&lruvec->lru_lock);
- lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
+ if (!scan_balance_biased(sc))
+ lru_note_cost(lruvec, file, stat.nr_pageout,
+ nr_scanned - nr_reclaimed);
/*
* If dirty folios are scanned that are not queued for IO, it
@@ -2202,7 +2223,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&lruvec->lru_lock);
- if (nr_rotated)
+ if (nr_rotated && !scan_balance_biased(sc))
lru_note_cost(lruvec, file, 0, nr_rotated);
trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
nr_deactivate, nr_rotated, sc->priority, file);
@@ -2327,13 +2348,6 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
return inactive * inactive_ratio < active;
}
-enum scan_balance {
- SCAN_EQUAL,
- SCAN_FRACT,
- SCAN_ANON,
- SCAN_FILE,
-};
-
static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
{
unsigned long file;
@@ -2613,6 +2627,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
calculate_pressure_balance(sc, swappiness, fraction, &denominator);
out:
+ sc->scan_balance = scan_balance;
+
for_each_evictable_lru(lru) {
bool file = is_file_lru(lru);
unsigned long lruvec_size;
--
2.50.0
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-11 18:18 ` Roman Gushchin
@ 2025-07-13 19:57 ` Hugh Dickins
2025-07-14 15:25 ` Johannes Weiner
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Hugh Dickins @ 2025-07-13 19:57 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Lorenzo Stoakes,
Michal Hocko, David Hildenbrand, linux-mm, linux-kernel
On Fri, 11 Jul 2025, Roman Gushchin wrote:
> Sorry, sent a wrong version with a trivial bug. Below is the correct
> one. The only difference is s/&sc/sc when calling scan_balance_biased().
>
> --
>
> From c06530edfb8a11139f2d7878ce3956b9238cc702 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <roman.gushchin@linux.dev>
> Subject: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
>
> lru_note_cost() records relative cost of incurring io and cpu spent
> on lru rotations, which is used to balance the pressure on file and
> anon memory. The applied pressure is inversely proportional to the
> recorded cost of reclaiming, but only within 2/3 of the range
> (swappiness aside).
>
> This is useful when both anon and file memory are reclaimable; however,
> in many cases that isn't so: e.g. there might be no swap,
> proactive reclaim can target anon memory specifically,
> the memory pressure can come from cgroup v1's memsw limit, etc.
> In all these cases recording the cost only biases all subsequent
> reclaim, potentially even outside of the scope of the original memcg.
>
> So it's better not to record the cost if it comes from the initially
> biased reclaim.
>
> lru_note_cost() is a relatively expensive function, which traverses
> the memcg tree up to the root and takes the lruvec lock on each level.
> Overall it's responsible for about 50% of cycles spent on lruvec lock,
> which might be a non-trivial number overall under heavy memory
> pressure. So optimizing out a large number of lru_note_cost() calls
> is also beneficial from the performance perspective.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
> mm/vmscan.c | 34 +++++++++++++++++++++++++---------
> 1 file changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c86a2495138a..7d08606b08ea 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -71,6 +71,13 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/vmscan.h>
>
> +enum scan_balance {
> + SCAN_EQUAL,
> + SCAN_FRACT,
> + SCAN_ANON,
> + SCAN_FILE,
> +};
> +
> struct scan_control {
> /* How many pages shrink_list() should reclaim */
> unsigned long nr_to_reclaim;
> @@ -90,6 +97,7 @@ struct scan_control {
> /*
> * Scan pressure balancing between anon and file LRUs
> */
> + enum scan_balance scan_balance;
> unsigned long anon_cost;
> unsigned long file_cost;
>
> @@ -1988,6 +1996,17 @@ static int current_may_throttle(void)
> return !(current->flags & PF_LOCAL_THROTTLE);
> }
>
> +static bool scan_balance_biased(struct scan_control *sc)
> +{
> + switch (sc->scan_balance) {
> + case SCAN_EQUAL:
> + case SCAN_FRACT:
> + return false;
> + default:
> + return true;
> + }
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> * of reclaimed pages
> @@ -2054,7 +2073,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> spin_unlock_irq(&lruvec->lru_lock);
>
> - lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
> + if (!scan_balance_biased(sc))
> + lru_note_cost(lruvec, file, stat.nr_pageout,
> + nr_scanned - nr_reclaimed);
>
> /*
> * If dirty folios are scanned that are not queued for IO, it
> @@ -2202,7 +2223,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
> __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> spin_unlock_irq(&lruvec->lru_lock);
>
> - if (nr_rotated)
> + if (nr_rotated && !scan_balance_biased(sc))
> lru_note_cost(lruvec, file, 0, nr_rotated);
> trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
> nr_deactivate, nr_rotated, sc->priority, file);
> @@ -2327,13 +2348,6 @@ static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
> return inactive * inactive_ratio < active;
> }
>
> -enum scan_balance {
> - SCAN_EQUAL,
> - SCAN_FRACT,
> - SCAN_ANON,
> - SCAN_FILE,
> -};
> -
> static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
> {
> unsigned long file;
> @@ -2613,6 +2627,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> calculate_pressure_balance(sc, swappiness, fraction, &denominator);
>
> out:
> + sc->scan_balance = scan_balance;
> +
> for_each_evictable_lru(lru) {
> bool file = is_file_lru(lru);
> unsigned long lruvec_size;
> --
> 2.50.0
Roman, I'm expressing no opinion on your patch above, but please may I
throw the patch below (against 6.16-rc) over the wall to you, to add as a
1/2 or 2/2 to yours (as it stands, it does conflict slightly with yours).
My attention needs to be on other things; but five years ago in
https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2009211440570.5214@eggly.anvils/
I noted how lru_note_cost() became more costly with per-memcg lru_lock,
but did nothing about it at the time. Apparently now is the time.
Thanks,
Hugh
[PATCH] mm: lru_note_cost_unlock_irq()
Dropping a lock, just to demand it again for an afterthought, cannot be
good if contended: convert lru_note_cost() to lru_note_cost_unlock_irq().
Signed-off-by: Hugh Dickins <hughd@google.com>
---
include/linux/swap.h | 5 +++--
mm/swap.c | 26 +++++++++++++++++++-------
mm/vmscan.c | 8 +++-----
3 files changed, 25 insertions(+), 14 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..a64a87cda960 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -376,8 +376,9 @@ extern unsigned long totalreserve_pages;
/* linux/mm/swap.c */
-void lru_note_cost(struct lruvec *lruvec, bool file,
- unsigned int nr_io, unsigned int nr_rotated);
+void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
+ unsigned int nr_io, unsigned int nr_rotated)
+ __releases(lruvec->lru_lock);
void lru_note_cost_refault(struct folio *);
void folio_add_lru(struct folio *);
void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
diff --git a/mm/swap.c b/mm/swap.c
index 4fc322f7111a..37053f222a6e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -237,8 +237,9 @@ void folio_rotate_reclaimable(struct folio *folio)
folio_batch_add_and_move(folio, lru_move_tail, true);
}
-void lru_note_cost(struct lruvec *lruvec, bool file,
- unsigned int nr_io, unsigned int nr_rotated)
+void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
+ unsigned int nr_io, unsigned int nr_rotated)
+ __releases(lruvec->lru_lock)
{
unsigned long cost;
@@ -250,8 +251,12 @@ void lru_note_cost(struct lruvec *lruvec, bool file,
* different between them, adjust scan balance for CPU work.
*/
cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated;
+ if (!cost) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ return;
+ }
- do {
+ for (;;) {
unsigned long lrusize;
/*
@@ -261,7 +266,6 @@ void lru_note_cost(struct lruvec *lruvec, bool file,
* rcu lock, so would be safe even if the page was on the LRU
* and could move simultaneously to a new lruvec).
*/
- spin_lock_irq(&lruvec->lru_lock);
/* Record cost event */
if (file)
lruvec->file_cost += cost;
@@ -285,14 +289,22 @@ void lru_note_cost(struct lruvec *lruvec, bool file,
lruvec->file_cost /= 2;
lruvec->anon_cost /= 2;
}
+
spin_unlock_irq(&lruvec->lru_lock);
- } while ((lruvec = parent_lruvec(lruvec)));
+ lruvec = parent_lruvec(lruvec);
+ if (!lruvec)
+ break;
+ spin_lock_irq(&lruvec->lru_lock);
+ }
}
void lru_note_cost_refault(struct folio *folio)
{
- lru_note_cost(folio_lruvec(folio), folio_is_file_lru(folio),
- folio_nr_pages(folio), 0);
+ struct lruvec *lruvec;
+
+ lruvec = folio_lruvec_lock_irq(folio);
+ lru_note_cost_unlock_irq(lruvec, folio_is_file_lru(folio),
+ folio_nr_pages(folio), 0);
}
static void lru_activate(struct lruvec *lruvec, struct folio *folio)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f8dfd2864bbf..5ba49f884bc0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2059,9 +2059,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
__count_vm_events(item, nr_reclaimed);
count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
- spin_unlock_irq(&lruvec->lru_lock);
- lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
+ lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
+ nr_scanned - nr_reclaimed);
/*
* If dirty folios are scanned that are not queued for IO, it
@@ -2207,10 +2207,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&lruvec->lru_lock);
- if (nr_rotated)
- lru_note_cost(lruvec, file, 0, nr_rotated);
+ lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
nr_deactivate, nr_rotated, sc->priority, file);
}
--
2.43.0
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-11 17:55 ` Roman Gushchin
@ 2025-07-14 15:22 ` Johannes Weiner
2025-07-14 16:21 ` Roman Gushchin
0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2025-07-14 15:22 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, Shakeel Butt, Lorenzo Stoakes, Michal Hocko,
David Hildenbrand, linux-mm, linux-kernel
On Fri, Jul 11, 2025 at 10:55:48AM -0700, Roman Gushchin wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
> > The caveat with this patch is that, aside from the static noswap
> > scenario, modes can switch back and forth abruptly or even overlap.
> >
> > So if you leave a pressure scenario and go back to cache trimming, you
> > will no longer age the cost information anymore. The next spike could
> > be starting out with potentially quite stale information.
> >
> > Or say proactive reclaim recently already targeted anon, and there
> > were rotations and pageouts; that would be useful data for a reactive
> > reclaimer doing work at around the same time, or shortly thereafter.
>
> Agree, but at the same time it's possible to come up with a scenario
> where it's not good:
>         A
>        / \
>       B   C   memory.max=X
>          / \
>         D   E
>
> Let's say we have a cgroup structure like this: we apply a lot of
> proactive anon pressure on E, and then the pressure on D from C's
> limit will be biased towards file without a good reason.
No, this is on purpose. D and E are not independent. They're in the
same memory domain, C. So if you want to reclaim C, and a subset of
its anon has already been pressured to resistance, then a larger part
of the reclaim candidates in C will need to come from file.
> Or, as in my case, if a cgroup has memory.memsw.limit set and is
> thrashing, does it make sense to bias the rest of the system
> into anon reclaim? The recorded cost can be really large.
>
> >
> > So for everything but the static noswap case, the patch makes me
> > nervous. And I'm not sure it actually helps in the cases where it
> > would matter the most.
>
> I understand, but do you think it's acceptable with some additional
> conditions: e.g. narrow it down to only very high scanning priorities?
> Or !sc.may_swap case?
>
> In the end, we have the following code in get_scan_count(), so at
> least on priority 0 we ignore all costs anyway.
> if (!sc->priority && swappiness) {
> scan_balance = SCAN_EQUAL;
> goto out;
> }
>
> Wdyt?
I think relitigating a proven aging mechanism after half a decade in
production is going to be tough and require extensive testing.
If your primary problem is the cost of the locking, I'd focus on that.
> > It might make more sense to look into the cost (ha) of the cost
> > recording itself. Can we turn it into a vmstat item? That would make
> > it lockless, would get rstat batching up the cgroup tree etc. This
> > doesn't need to be 100% precise and race free after all.
>
> Idk, maybe yes, but rstat flushing has been a source of issues as well,
> and now it's mostly ratelimited, so I'm concerned that we'll have
> sudden changes in the reclaim behavior every 2 seconds.
That's not a new hazard, though. prepare_scan_control() decisions are
already subject to this, as is the lru cost aging itself.
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-13 19:57 ` Hugh Dickins
@ 2025-07-14 15:25 ` Johannes Weiner
2025-07-14 17:59 ` Roman Gushchin
2025-07-14 20:28 ` Shakeel Butt
2 siblings, 0 replies; 10+ messages in thread
From: Johannes Weiner @ 2025-07-14 15:25 UTC (permalink / raw)
To: Hugh Dickins
Cc: Roman Gushchin, Andrew Morton, Shakeel Butt, Lorenzo Stoakes,
Michal Hocko, David Hildenbrand, linux-mm, linux-kernel
On Sun, Jul 13, 2025 at 12:57:18PM -0700, Hugh Dickins wrote:
> [PATCH] mm: lru_note_cost_unlock_irq()
>
> Dropping a lock, just to demand it again for an afterthought, cannot be
> good if contended: convert lru_note_cost() to lru_note_cost_unlock_irq().
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-14 15:22 ` Johannes Weiner
@ 2025-07-14 16:21 ` Roman Gushchin
0 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2025-07-14 16:21 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Shakeel Butt, Lorenzo Stoakes, Michal Hocko,
David Hildenbrand, linux-mm, linux-kernel
Johannes Weiner <hannes@cmpxchg.org> writes:
> On Fri, Jul 11, 2025 at 10:55:48AM -0700, Roman Gushchin wrote:
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>> > The caveat with this patch is that, aside from the static noswap
>> > scenario, modes can switch back and forth abruptly or even overlap.
>> >
>> > So if you leave a pressure scenario and go back to cache trimming, you
>> > will no longer age the cost information anymore. The next spike could
>> > be starting out with potentially quite stale information.
>> >
>> > Or say proactive reclaim recently already targeted anon, and there
>> > were rotations and pageouts; that would be useful data for a reactive
>> > reclaimer doing work at around the same time, or shortly thereafter.
>>
>> Agree, but at the same time it's possible to come up with a scenario
>> where it's not good:
>>         A
>>        / \
>>       B   C   memory.max=X
>>          / \
>>         D   E
>>
>> Let's say we have a cgroup structure like this: we apply a lot of
>> proactive anon pressure on E, and then the pressure on D from C's
>> limit will be biased towards file without a good reason.
>
> No, this is on purpose. D and E are not independent. They're in the
> same memory domain, C. So if you want to reclaim C, and a subset of
> its anon has already been pressured to resistance, then a larger part
> of the reclaim candidates in C will need to come from file.
So, basically you can create a tiny memcg without swap in a large
system, create a ton of memory pressure there and bias the global
memory reclaim? That's strange.
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-13 19:57 ` Hugh Dickins
2025-07-14 15:25 ` Johannes Weiner
@ 2025-07-14 17:59 ` Roman Gushchin
2025-07-14 20:28 ` Shakeel Butt
2 siblings, 0 replies; 10+ messages in thread
From: Roman Gushchin @ 2025-07-14 17:59 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Lorenzo Stoakes,
Michal Hocko, David Hildenbrand, linux-mm, linux-kernel
Hugh Dickins <hughd@google.com> writes:
> Roman, I'm expressing no opinion on your patch above, but please may I
> throw the patch below (against 6.16-rc) over the wall to you, to add as a
> 1/2 or 2/2 to yours (as it stands, it does conflict slightly with yours).
>
> My attention needs to be on other things; but five years ago in
> https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2009211440570.5214@eggly.anvils/
> I noted how lru_note_cost() became more costly with per-memcg lru_lock,
> but did nothing about it at the time. Apparently now is the time.
>
> Thanks,
> Hugh
>
> [PATCH] mm: lru_note_cost_unlock_irq()
>
> Dropping a lock, just to demand it again for an afterthought, cannot be
> good if contended: convert lru_note_cost() to lru_note_cost_unlock_irq().
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Tested-by: Roman Gushchin <roman.gushchin@linux.dev>
Thanks, Hugh!
Actually your patch helps quite a lot with the lruvec lock contention
in my test, so I think we can stop looking at my change from the
performance point of view.
* Re: [PATCH] mm: skip lru_note_cost() when scanning only file or anon
2025-07-13 19:57 ` Hugh Dickins
2025-07-14 15:25 ` Johannes Weiner
2025-07-14 17:59 ` Roman Gushchin
@ 2025-07-14 20:28 ` Shakeel Butt
2 siblings, 0 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-07-14 20:28 UTC (permalink / raw)
To: Hugh Dickins
Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Lorenzo Stoakes,
Michal Hocko, David Hildenbrand, linux-mm, linux-kernel
On Sun, Jul 13, 2025 at 12:57:18PM -0700, Hugh Dickins wrote:
> On Fri, 11 Jul 2025, Roman Gushchin wrote:
>
[...]
>
> [PATCH] mm: lru_note_cost_unlock_irq()
>
> Dropping a lock, just to demand it again for an afterthought, cannot be
> good if contended: convert lru_note_cost() to lru_note_cost_unlock_irq().
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>