[RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim

linux-fsdevel.vger.kernel.org archive mirror

* [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
@ 2023-02-02 23:32 Yosry Ahmed
  2023-02-02 23:32 ` [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state Yosry Ahmed
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Yosry Ahmed @ 2023-02-02 23:32 UTC
  To: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle), Miaohe Lin,
	David Hildenbrand, Johannes Weiner, Peter Xu, NeilBrown,
	Shakeel Butt, Michal Hocko
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Yosry Ahmed

Pages reclaimed by means other than LRU-based reclaim are tracked
through reclaim_state in struct scan_control, which is stashed in the
current task_struct. These pages are added to the number of pages
reclaimed through LRUs. For memcg reclaim, these pages generally cannot
be linked to the memcg under reclaim and can cause an overestimated
count of reclaimed pages. This short series tries to address that.

Patch 1 is just refactoring updating reclaim_state into a helper
function, and renames reclaimed_slab to just reclaimed, with a comment
describing its true purpose.

Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
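
To make the mechanism concrete, here is a minimal userspace sketch of
the accounting the two patches describe. It is a model, not the kernel
code: current_reclaim_state stands in for current->reclaim_state, the
memcg_reclaim flag stands in for cgroup_reclaim(sc), and
fold_reclaim_state() is a hypothetical name for the few lines at the
end of shrink_node().

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel structures named in the series. */
struct reclaim_state {
	unsigned long reclaimed;	/* pages freed outside of LRU-based reclaim */
};

struct scan_control {
	unsigned long nr_reclaimed;	/* total reported back to the caller */
	bool memcg_reclaim;		/* assumption: models cgroup_reclaim(sc) */
};

/* Models current->reclaim_state; NULL when the task is not in reclaim. */
static struct reclaim_state *current_reclaim_state;

/* Patch 1: one helper replaces the open-coded counter updates. */
static void report_freed_pages(unsigned long pages)
{
	if (current_reclaim_state)
		current_reclaim_state->reclaimed += pages;
}

/* Patch 2: fold the non-LRU count in for non-memcg reclaim only. */
static void fold_reclaim_state(struct scan_control *sc)
{
	struct reclaim_state *rs = current_reclaim_state;

	if (rs && !sc->memcg_reclaim) {
		sc->nr_reclaimed += rs->reclaimed;
		rs->reclaimed = 0;
	}
}
```

With a reclaim_state installed, report_freed_pages(4) followed by
fold_reclaim_state() adds 4 to nr_reclaimed for global reclaim and
leaves nr_reclaimed untouched for memcg reclaim.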

The original draft was a little bit different. It also kept track of
uncharged objcg pages, and reported them only in memcg reclaim and only
if the uncharged memcg is in the subtree of the memcg under reclaim.
This was an attempt to make reporting of memcg reclaim even more
accurate, but was dropped due to questionable complexity vs benefit
tradeoff. It can be revived if there is interest.

Yosry Ahmed (2):
  mm: vmscan: refactor updating reclaimed pages in reclaim_state
  mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim

 fs/inode.c           |  3 +--
 fs/xfs/xfs_buf.c     |  3 +--
 include/linux/swap.h |  5 ++++-
 mm/slab.c            |  3 +--
 mm/slob.c            |  6 ++----
 mm/slub.c            |  5 ++---
 mm/vmscan.c          | 19 ++++++++++++++++---
 7 files changed, 27 insertions(+), 17 deletions(-)

-- 
2.39.1.519.gcb327c4b5f-goog


* [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state
  2023-02-02 23:32 [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
@ 2023-02-02 23:32 ` Yosry Ahmed
  2023-02-03 16:22   ` Matthew Wilcox
  2023-02-02 23:32 ` [RFC PATCH v1 2/2] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
  2023-02-03  0:00 ` [RFC PATCH v1 0/2] Ignore " Dave Chinner
  2 siblings, 1 reply; 11+ messages in thread
From: Yosry Ahmed @ 2023-02-02 23:32 UTC
  To: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle), Miaohe Lin,
	David Hildenbrand, Johannes Weiner, Peter Xu, NeilBrown,
	Shakeel Butt, Michal Hocko
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Yosry Ahmed

During reclaim, we keep track of pages reclaimed by means other than
LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab;
a pointer to the reclaim_state is stashed in the current task_struct.

However, we keep track of more than just reclaimed slab pages through
this. We also use it for clean file pages dropped through pruned inodes,
and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add
a helper function that wraps updating it through current.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 fs/inode.c           |  3 +--
 fs/xfs/xfs_buf.c     |  3 +--
 include/linux/swap.h |  5 ++++-
 mm/slab.c            |  3 +--
 mm/slob.c            |  6 ++----
 mm/slub.c            |  5 ++---
 mm/vmscan.c          | 17 +++++++++++++++--
 7 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f453eb58fd03..adf0a7725054 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -863,8 +863,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 				__count_vm_events(KSWAPD_INODESTEAL, reap);
 			else
 				__count_vm_events(PGINODESTEAL, reap);
-			if (current->reclaim_state)
-				current->reclaim_state->reclaimed_slab += reap;
+			report_freed_pages(reap);
 		}
 		iput(inode);
 		spin_lock(lru_lock);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 54c774af6e1c..060079f1e966 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -286,8 +286,7 @@ xfs_buf_free_pages(
 		if (bp->b_pages[i])
 			__free_page(bp->b_pages[i]);
 	}
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += bp->b_page_count;
+	report_freed_pages(bp->b_page_count);
 
 	if (bp->b_pages != bp->b_page_array)
 		kmem_free(bp->b_pages);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2787b84eaf12..bc1d8b326453 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -153,13 +153,16 @@ union swap_header {
  * memory reclaim
  */
 struct reclaim_state {
-	unsigned long reclaimed_slab;
+	/* pages reclaimed outside of LRU-based reclaim */
+	unsigned long reclaimed;
 #ifdef CONFIG_LRU_GEN
 	/* per-thread mm walk data */
 	struct lru_gen_mm_walk *mm_walk;
 #endif
 };
 
+void report_freed_pages(unsigned long pages);
+
 #ifdef __KERNEL__
 
 struct address_space;
diff --git a/mm/slab.c b/mm/slab.c
index 29300fc1289a..452db5913356 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1395,8 +1395,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
 	smp_wmb();
 	__folio_clear_slab(folio);
 
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
+	report_freed_pages(1 << order);
 	unaccount_slab(slab, order, cachep);
 	__free_pages(folio_page(folio, 0), order);
 }
diff --git a/mm/slob.c b/mm/slob.c
index fe567fcfa3a3..71ee00e9dd46 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -61,7 +61,7 @@
 #include <linux/slab.h>
 
 #include <linux/mm.h>
-#include <linux/swap.h> /* struct reclaim_state */
+#include <linux/swap.h> /* report_freed_pages() */
 #include <linux/cache.h>
 #include <linux/init.h>
 #include <linux/export.h>
@@ -211,9 +211,7 @@ static void slob_free_pages(void *b, int order)
 {
 	struct page *sp = virt_to_page(b);
 
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
-
+	report_freed_pages(1 << order);
 	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
 			    -(PAGE_SIZE << order));
 	__free_pages(sp, order);
diff --git a/mm/slub.c b/mm/slub.c
index 13459c69095a..5145ad2467e9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -11,7 +11,7 @@
  */
 
 #include <linux/mm.h>
-#include <linux/swap.h> /* struct reclaim_state */
+#include <linux/swap.h> /* report_freed_pages() */
 #include <linux/module.h>
 #include <linux/bit_spinlock.h>
 #include <linux/interrupt.h>
@@ -2063,8 +2063,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
 	/* Make the mapping reset visible before clearing the flag */
 	smp_wmb();
 	__folio_clear_slab(folio);
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += pages;
+	report_freed_pages(pages);
 	unaccount_slab(slab, order, s);
 	__free_pages(folio_page(folio, 0), order);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd6637fcd8f9..63a27d2f6f31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,6 +204,19 @@ static void set_task_reclaim_state(struct task_struct *task,
 	task->reclaim_state = rs;
 }
 
+/*
+ * report_freed_pages: report pages freed outside of LRU-based reclaim
+ * @pages: number of pages freed
+ *
+ * If the current process is undergoing a reclaim operation,
+ * increment the number of reclaimed pages by @pages.
+ */
+void report_freed_pages(unsigned long pages)
+{
+	if (current->reclaim_state)
+		current->reclaim_state->reclaimed += pages;
+}
+
 LIST_HEAD(shrinker_list);
 DECLARE_RWSEM(shrinker_rwsem);
 
@@ -6169,8 +6182,8 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	shrink_node_memcgs(pgdat, sc);
 
 	if (reclaim_state) {
-		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-		reclaim_state->reclaimed_slab = 0;
+		sc->nr_reclaimed += reclaim_state->reclaimed;
+		reclaim_state->reclaimed = 0;
 	}
 
 	/* Record the subtree's reclaim efficiency */
-- 
2.39.1.519.gcb327c4b5f-goog



* Re: [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state
  2023-02-02 23:32 ` [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state Yosry Ahmed
@ 2023-02-03 16:22   ` Matthew Wilcox
  2023-02-03 22:30     ` Yosry Ahmed
  0 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2023-02-03 16:22 UTC
  To: Yosry Ahmed
  Cc: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Miaohe Lin, David Hildenbrand, Johannes Weiner,
	Peter Xu, NeilBrown, Shakeel Butt, Michal Hocko, linux-fsdevel,
	linux-kernel, linux-xfs, linux-mm

On Thu, Feb 02, 2023 at 11:32:28PM +0000, Yosry Ahmed wrote:
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 54c774af6e1c..060079f1e966 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -286,8 +286,7 @@ xfs_buf_free_pages(
>  		if (bp->b_pages[i])
>  			__free_page(bp->b_pages[i]);
>  	}
> -	if (current->reclaim_state)
> -		current->reclaim_state->reclaimed_slab += bp->b_page_count;
> +	report_freed_pages(bp->b_page_count);

XFS can be built as a module

> +++ b/mm/vmscan.c
> @@ -204,6 +204,19 @@ static void set_task_reclaim_state(struct task_struct *task,
>  	task->reclaim_state = rs;
>  }
>  
> +/*
> + * reclaim_report_freed_pages: report pages freed outside of LRU-based reclaim
> + * @pages: number of pages freed
> + *
> + * If the current process is undergoing a reclaim operation,
> + * increment the number of reclaimed pages by @pages.
> + */
> +void report_freed_pages(unsigned long pages)
> +{
> +	if (current->reclaim_state)
> +		current->reclaim_state->reclaimed += pages;
> +}
> +

report_freed_pages is not EXPORT_SYMBOLed



* Re: [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state
  2023-02-03 16:22   ` Matthew Wilcox
@ 2023-02-03 22:30     ` Yosry Ahmed
  0 siblings, 0 replies; 11+ messages in thread
From: Yosry Ahmed @ 2023-02-03 22:30 UTC
  To: Matthew Wilcox
  Cc: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Miaohe Lin, David Hildenbrand, Johannes Weiner,
	Peter Xu, NeilBrown, Shakeel Butt, Michal Hocko, linux-fsdevel,
	linux-kernel, linux-xfs, linux-mm

On Fri, Feb 3, 2023 at 8:22 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Feb 02, 2023 at 11:32:28PM +0000, Yosry Ahmed wrote:
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 54c774af6e1c..060079f1e966 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -286,8 +286,7 @@ xfs_buf_free_pages(
> >               if (bp->b_pages[i])
> >                       __free_page(bp->b_pages[i]);
> >       }
> > -     if (current->reclaim_state)
> > -             current->reclaim_state->reclaimed_slab += bp->b_page_count;
> > +     report_freed_pages(bp->b_page_count);
>
> XFS can be built as a module

I didn't know that, thanks for pointing it out!

>
> > +++ b/mm/vmscan.c
> > @@ -204,6 +204,19 @@ static void set_task_reclaim_state(struct task_struct *task,
> >       task->reclaim_state = rs;
> >  }
> >
> > +/*
> > + * reclaim_report_freed_pages: report pages freed outside of LRU-based reclaim
> > + * @pages: number of pages freed
> > + *
> > + * If the current process is undergoing a reclaim operation,
> > + * increment the number of reclaimed pages by @pages.
> > + */
> > +void report_freed_pages(unsigned long pages)
> > +{
> > +     if (current->reclaim_state)
> > +             current->reclaim_state->reclaimed += pages;
> > +}
> > +
>
> report_freed_pages is not EXPORT_SYMBOLed

Will do that for the next version, thanks!
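
For reference, a rough sketch of what the export could look like. The
choice of EXPORT_SYMBOL() over EXPORT_SYMBOL_GPL() is an assumption
here, not something settled in this thread. The stub macro, the
fake_task object and the `current` define exist only so the fragment
compiles in userspace; in the kernel they come from the usual headers.

```c
/*
 * Userspace stand-ins (assumptions for illustration only). In the
 * kernel, EXPORT_SYMBOL() is provided by <linux/export.h> and
 * `current` refers to the running task.
 */
#define EXPORT_SYMBOL(sym) struct export_stub_##sym

struct reclaim_state {
	unsigned long reclaimed;
};

struct task_struct {
	struct reclaim_state *reclaim_state;
};

static struct task_struct fake_task;
#define current (&fake_task)

/* Same body as the helper in patch 1. */
void report_freed_pages(unsigned long pages)
{
	if (current->reclaim_state)
		current->reclaim_state->reclaimed += pages;
}
/* Exporting the symbol lets modular callers such as XFS link against it. */
EXPORT_SYMBOL(report_freed_pages);
```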


* [RFC PATCH v1 2/2] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  2023-02-02 23:32 [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
  2023-02-02 23:32 ` [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state Yosry Ahmed
@ 2023-02-02 23:32 ` Yosry Ahmed
  2023-02-03  0:00 ` [RFC PATCH v1 0/2] Ignore " Dave Chinner
  2 siblings, 0 replies; 11+ messages in thread
From: Yosry Ahmed @ 2023-02-02 23:32 UTC
  To: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle), Miaohe Lin,
	David Hildenbrand, Johannes Weiner, Peter Xu, NeilBrown,
	Shakeel Butt, Michal Hocko
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Yosry Ahmed

We keep track of different types of reclaimed pages through
reclaim_state->reclaimed, and we add them to the reported number of
reclaimed pages. For non-memcg reclaim, this makes sense. For memcg
reclaim, we have no clue if those pages are charged to the memcg under
reclaim.

Slab pages are shared by different memcgs, so a freed slab page may have
only been partially charged to the memcg under reclaim. The same goes
for clean file pages from pruned inodes or xfs buffer pages: there is no
way to link them to the memcg under reclaim.

Stop reporting those freed pages as reclaimed pages during memcg
reclaim. This should make the return value of writing to memory.reclaim
more accurate, and may help reduce unnecessary reclaim retries during
memcg charging.

Generally, this should make the return value of
try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
when a freed slab page was mostly charged to the memcg under reclaim),
the return value of try_to_free_mem_cgroup_pages() can be
underestimated, but this should be fine as it is mostly called in a
retry loop.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 63a27d2f6f31..207998b16e5f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6181,7 +6181,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)

 	shrink_node_memcgs(pgdat, sc);

-	if (reclaim_state) {
+	if (reclaim_state && !cgroup_reclaim(sc)) {
 		sc->nr_reclaimed += reclaim_state->reclaimed;
 		reclaim_state->reclaimed = 0;
 	}
-- 
2.39.1.519.gcb327c4b5f-goog


* Re: [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
  2023-02-02 23:32 [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
  2023-02-02 23:32 ` [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state Yosry Ahmed
  2023-02-02 23:32 ` [RFC PATCH v1 2/2] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
@ 2023-02-03  0:00 ` Dave Chinner
  2023-02-03  0:17   ` Yosry Ahmed
  2 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2023-02-03  0:00 UTC
  To: Yosry Ahmed
  Cc: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle), Miaohe Lin,
	David Hildenbrand, Johannes Weiner, Peter Xu, NeilBrown,
	Shakeel Butt, Michal Hocko, linux-fsdevel, linux-kernel,
	linux-xfs, linux-mm

On Thu, Feb 02, 2023 at 11:32:27PM +0000, Yosry Ahmed wrote:
> Pages reclaimed by means other than LRU-based reclaim are tracked
> through reclaim_state in struct scan_control, which is stashed in the
> current task_struct. These pages are added to the number of pages
> reclaimed through LRUs. For memcg reclaim, these pages generally cannot
> be linked to the memcg under reclaim and can cause an overestimated
> count of reclaimed pages. This short series tries to address that.

Can you explain why memcg specific reclaim is calling shrinkers that
are not marked with SHRINKER_MEMCG_AWARE?

i.e. only objects that are directly associated with memcg aware
shrinkers should be accounted to the memcg, right? If the cache is
global (e.g. the xfs buffer cache) then they aren't marked with
SHRINKER_MEMCG_AWARE and so should only be called for root memcg
(i.e. global) reclaim contexts.

So if you are having accounting problems caused by memcg specific
reclaim on global caches freeing non-memcg accounted memory, isn't
the problem the way the shrinkers are being called?
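
The expectation described above can be sketched as a small userspace
model. This is an illustration under assumptions, not the kernel's
shrink_slab(): the struct layout, the flag value, the shrinker names
and the helpers shrinker_runs() and count_called() are all made up for
the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Modelled flag; the value is arbitrary and not the kernel's. */
#define SHRINKER_MEMCG_AWARE (1u << 0)

struct shrinker {
	const char *name;
	unsigned int flags;
};

/* Should this shrinker be invoked for the given reclaim context? */
static bool shrinker_runs(const struct shrinker *s, bool root_reclaim)
{
	/* Root (global) reclaim walks every registered shrinker. */
	if (root_reclaim)
		return true;
	/* Memcg-specific reclaim only calls memcg-aware shrinkers. */
	return (s->flags & SHRINKER_MEMCG_AWARE) != 0;
}

/* Count how many shrinkers in @tbl would be invoked. */
static size_t count_called(const struct shrinker *tbl, size_t n,
			   bool root_reclaim)
{
	size_t i, called = 0;

	for (i = 0; i < n; i++)
		if (shrinker_runs(&tbl[i], root_reclaim))
			called++;
	return called;
}
```

In this model a global cache such as the xfs buffer cache, registered
without the flag, is skipped whenever root_reclaim is false.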

> Patch 1 is just refactoring updating reclaim_state into a helper
> function, and renames reclaimed_slab to just reclaimed, with a comment
> describing its true purpose.
> 
> Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> 
> The original draft was a little bit different. It also kept track of
> uncharged objcg pages, and reported them only in memcg reclaim and only
> if the uncharged memcg is in the subtree of the memcg under reclaim.
> This was an attempt to make reporting of memcg reclaim even more
> accurate, but was dropped due to questionable complexity vs benefit
> tradeoff. It can be revived if there is interest.
> 
> Yosry Ahmed (2):
>   mm: vmscan: refactor updating reclaimed pages in reclaim_state
>   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
> 
>  fs/inode.c           |  3 +--

Inodes and inode mapping pages are directly charged to the memcg
that allocated them and the shrinker is correctly marked as
SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will
account them correctly to the related memcg, regardless of which
memcg is triggering the reclaim.  Hence I'm not sure that skipping
the accounting of the reclaimed memory is even correct in this case;
I think the code should still be accounting for all pages that
belong to the memcg being scanned that are reclaimed, not ignoring
them altogether...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
  2023-02-03  0:00 ` [RFC PATCH v1 0/2] Ignore " Dave Chinner
@ 2023-02-03  0:17   ` Yosry Ahmed
  2023-02-03 15:11     ` Johannes Weiner
  0 siblings, 1 reply; 11+ messages in thread
From: Yosry Ahmed @ 2023-02-03  0:17 UTC
  To: Dave Chinner
  Cc: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle), Miaohe Lin,
	David Hildenbrand, Johannes Weiner, Peter Xu, NeilBrown,
	Shakeel Butt, Michal Hocko, linux-fsdevel, linux-kernel,
	linux-xfs, linux-mm

On Thu, Feb 2, 2023 at 4:01 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Feb 02, 2023 at 11:32:27PM +0000, Yosry Ahmed wrote:
> > Pages reclaimed by means other than LRU-based reclaim are tracked
> > through reclaim_state in struct scan_control, which is stashed in the
> > current task_struct. These pages are added to the number of pages
> > reclaimed through LRUs. For memcg reclaim, these pages generally cannot
> > be linked to the memcg under reclaim and can cause an overestimated
> > count of reclaimed pages. This short series tries to address that.
>
> Can you explain why memcg specific reclaim is calling shrinkers that
> are not marked with SHRINKER_MEMCG_AWARE?
>
> i.e. only objects that are directly associated with memcg aware
> shrinkers should be accounted to the memcg, right? If the cache is
> global (e.g. the xfs buffer cache) then they aren't marked with
> SHRINKER_MEMCG_AWARE and so should only be called for root memcg
> (i.e. global) reclaim contexts.
>
> So if you are having accounting problems caused by memcg specific
> reclaim on global caches freeing non-memcg accounted memory, isn't
> the problem the way the shrinkers are being called?

Not necessarily, according to my understanding.

My understanding is that we will only free slab objects accounted to
the memcg under reclaim (or one of its descendants), because we call
memcg aware shrinkers, as you pointed out. The point here is slab page
sharing. Ever since we started doing per-object accounting, a slab
page may have objects accounted to different memcgs. IIUC, if we free
a slab object charged to the memcg under reclaim, and this object
happened to be the last object on the page, we will free the slab
page, and count the entire page as reclaimed memory for the purpose of
memcg reclaim, which is where the inaccuracy is coming from.
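
A small worked model of the over-reporting, as I understand it. The
page size, object size, struct names and the helper
shrink_objects_of() are assumptions chosen for illustration; real slab
pages and objcg charging are more involved.

```c
#define PAGE_SIZE_BYTES	4096UL
#define OBJ_SIZE_BYTES	512UL	/* assumed object size */
#define OBJS_PER_PAGE	(PAGE_SIZE_BYTES / OBJ_SIZE_BYTES)

struct slab_page {
	/* memcg id charged for each object slot, 0 means the slot is free */
	int owner[OBJS_PER_PAGE];
};

struct free_result {
	unsigned long uncharged_bytes;	/* what the target memcg got back */
	unsigned long reported_pages;	/* what is reported as reclaimed */
};

/* Free every object on @page that is charged to @memcg. */
static struct free_result shrink_objects_of(struct slab_page *page, int memcg)
{
	struct free_result res = { 0, 0 };
	unsigned long i, in_use = 0;

	for (i = 0; i < OBJS_PER_PAGE; i++) {
		if (page->owner[i] == memcg) {
			page->owner[i] = 0;
			res.uncharged_bytes += OBJ_SIZE_BYTES;
		}
		if (page->owner[i] != 0)
			in_use++;
	}
	/* Last object gone: the whole page is freed and reported. */
	if (res.uncharged_bytes && in_use == 0)
		res.reported_pages = 1;
	return res;
}
```

If the target memcg owns only the last remaining object on a page, it
is uncharged 512 bytes in this model while a full 4096-byte page is
reported as reclaimed.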

Please correct me if I am wrong.

>
> > Patch 1 is just refactoring updating reclaim_state into a helper
> > function, and renames reclaimed_slab to just reclaimed, with a comment
> > describing its true purpose.
> >
> > Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> >
> > The original draft was a little bit different. It also kept track of
> > uncharged objcg pages, and reported them only in memcg reclaim and only
> > if the uncharged memcg is in the subtree of the memcg under reclaim.
> > This was an attempt to make reporting of memcg reclaim even more
> > accurate, but was dropped due to questionable complexity vs benefit
> > tradeoff. It can be revived if there is interest.
> >
> > Yosry Ahmed (2):
> >   mm: vmscan: refactor updating reclaimed pages in reclaim_state
> >   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
> >
> >  fs/inode.c           |  3 +--
>
> Inodes and inode mapping pages are directly charged to the memcg
> that allocated them and the shrinker is correctly marked as
> SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will
> account them correctly to the related memcg, regardless of which
> memcg is triggering the reclaim.  Hence I'm not sure that skipping
> the accounting of the reclaimed memory is even correct in this case;

Please note that we are not skipping any accounting here. The pages
are still uncharged from the memcgs they are charged to (the allocator
memcgs as you pointed out). We just do not report them in the return
value of try_to_free_mem_cgroup_pages(), to avoid over-reporting.

> I think the code should still be accounting for all pages that
> belong to the memcg being scanned that are reclaimed, not ignoring
> them altogether...

100% agree. Ideally I would want to:
- For pruned inodes: report all freed pages for global reclaim, and
only report pages charged to the memcg under reclaim for memcg
reclaim.
- For slab: report all freed pages for global reclaim, and only report
uncharged objcg pages from the memcg under reclaim for memcg reclaim.

The only problem is that I thought people would think this is too much
complexity and not worth it. If people agree this should be the
approach to follow, I can prepare patches for this. I originally
implemented this for slab pages, but held off on sending it.

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com


* Re: [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
  2023-02-03  0:17   ` Yosry Ahmed
@ 2023-02-03 15:11     ` Johannes Weiner
  2023-02-03 15:28       ` Yosry Ahmed
  0 siblings, 1 reply; 11+ messages in thread
From: Johannes Weiner @ 2023-02-03 15:11 UTC
  To: Yosry Ahmed
  Cc: Dave Chinner, Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle), Miaohe Lin,
	David Hildenbrand, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Thu, Feb 02, 2023 at 04:17:18PM -0800, Yosry Ahmed wrote:
> On Thu, Feb 2, 2023 at 4:01 PM Dave Chinner <david@fromorbit.com> wrote:
> > > Patch 1 is just refactoring updating reclaim_state into a helper
> > > function, and renames reclaimed_slab to just reclaimed, with a comment
> > > describing its true purpose.
> > >
> > > Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> > >
> > > The original draft was a little bit different. It also kept track of
> > > uncharged objcg pages, and reported them only in memcg reclaim and only
> > > if the uncharged memcg is in the subtree of the memcg under reclaim.
> > > This was an attempt to make reporting of memcg reclaim even more
> > > accurate, but was dropped due to questionable complexity vs benefit
> > > tradeoff. It can be revived if there is interest.
> > >
> > > Yosry Ahmed (2):
> > >   mm: vmscan: refactor updating reclaimed pages in reclaim_state
> > >   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
> > >
> > >  fs/inode.c           |  3 +--
> >
> > Inodes and inode mapping pages are directly charged to the memcg
> > that allocated them and the shrinker is correctly marked as
> > SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will
> > account them correctly to the related memcg, regardless of which
> > memcg is triggering the reclaim.  Hence I'm not sure that skipping
> > the accounting of the reclaimed memory is even correct in this case;
> 
> Please note that we are not skipping any accounting here. The pages
> are still uncharged from the memcgs they are charged to (the allocator
> memcgs as you pointed out). We just do not report them in the return
> value of try_to_free_mem_cgroup_pages(), to avoid over-reporting.

I was wondering the same thing as Dave, reading through this. But
you're right, we'll catch the accounting during uncharge. Can you
please add a comment on the !cgroup_reclaim() explaining this?

There is one wrinkle with this, though. We have the following
(simplified) sequence during charging:

	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
						    gfp_mask, reclaim_options);

	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
		goto retry;

	/*
	 * Even though the limit is exceeded at this point, reclaim
	 * may have been able to free some pages.  Retry the charge
	 * before killing the task.
	 *
	 * Only for regular pages, though: huge pages are rather
	 * unlikely to succeed so close to the limit, and we fall back
	 * to regular pages anyway in case of failure.
	 */
	if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
		goto retry;

So in the unlikely scenario where the first call doesn't make the
necessary headroom, and the shrinkers are the only thing that made
forward progress, we would OOM prematurely.

Not that an OOM would seem that far away in that scenario, anyway. But I
remember long discussions with DavidR on probabilistic OOM regressions ;)

> > I think the code should still be accounting for all pages that
> > belong to the memcg being scanned that are reclaimed, not ignoring
> > them altogether...
> 
> 100% agree. Ideally I would want to:
> - For pruned inodes: report all freed pages for global reclaim, and
> only report pages charged to the memcg under reclaim for memcg
> reclaim.

This only happens on highmem systems at this point, as elsewhere
populated inodes aren't on the shrinker LRUs anymore. We'd probably be
ok with a comment noting the inaccuracy in the proactive reclaim stats
for the time being, until somebody actually cares about that combination.

> - For slab: report all freed pages for global reclaim, and only report
> uncharged objcg pages from the memcg under reclaim for memcg reclaim.
> 
> The only problem is that I thought people would think this is too much
> complexity and not worth it. If people agree this should be the
> approach to follow, I can prepare patches for this. I originally
> implemented this for slab pages, but held off on sending it.

I'd be curious to see the code!


* Re: [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
  2023-02-03 15:11     ` Johannes Weiner
@ 2023-02-03 15:28       ` Yosry Ahmed
  2023-02-04  0:26         ` Shakeel Butt
  0 siblings, 1 reply; 11+ messages in thread
From: Yosry Ahmed @ 2023-02-03 15:28 UTC
  To: Johannes Weiner
  Cc: Dave Chinner, Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle), Miaohe Lin,
	David Hildenbrand, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Fri, Feb 3, 2023 at 7:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Feb 02, 2023 at 04:17:18PM -0800, Yosry Ahmed wrote:
> > On Thu, Feb 2, 2023 at 4:01 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > Patch 1 is just refactoring updating reclaim_state into a helper
> > > > function, and renames reclaimed_slab to just reclaimed, with a comment
> > > > describing its true purpose.
> > > >
> > > > Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> > > >
> > > > The original draft was a little bit different. It also kept track of
> > > > uncharged objcg pages, and reported them only in memcg reclaim and only
> > > > if the uncharged memcg is in the subtree of the memcg under reclaim.
> > > > This was an attempt to make reporting of memcg reclaim even more
> > > > accurate, but was dropped due to questionable complexity vs benefit
> > > > tradeoff. It can be revived if there is interest.
> > > >
> > > > Yosry Ahmed (2):
> > > >   mm: vmscan: refactor updating reclaimed pages in reclaim_state
> > > >   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
> > > >
> > > >  fs/inode.c           |  3 +--
> > >
> > > Inodes and inode mapping pages are directly charged to the memcg
> > > that allocated them and the shrinker is correctly marked as
> > > SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will
> > > account them correctly to the related memcg, regardless of which
> > > memcg is triggering the reclaim.  Hence I'm not sure that skipping
> > > the accounting of the reclaimed memory is even correct in this case;
> >
> > Please note that we are not skipping any accounting here. The pages
> > are still uncharged from the memcgs they are charged to (the allocator
> > memcgs as you pointed out). We just do not report them in the return
> > value of try_to_free_mem_cgroup_pages(), to avoid over-reporting.
>
> I was wondering the same thing as Dave, reading through this. But
> you're right, we'll catch the accounting during uncharge. Can you
> please add a comment on the !cgroup_reclaim() explaining this?

Sure! If we settle on this implementation I will send another version
with a comment and fix the build problem in patch 2.

>
> There is one wrinkle with this, though. We have the following
> (simplified) sequence during charging:
>
>         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>                                                     gfp_mask, reclaim_options);
>
>         if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
>                 goto retry;
>
>         /*
>          * Even though the limit is exceeded at this point, reclaim
>          * may have been able to free some pages.  Retry the charge
>          * before killing the task.
>          *
>          * Only for regular pages, though: huge pages are rather
>          * unlikely to succeed so close to the limit, and we fall back
>          * to regular pages anyway in case of failure.
>          */
>         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
>                 goto retry;
>
> So in the unlikely scenario where the first call doesn't make the
> necessary headroom, and the shrinkers are the only thing that made
> forward progress, we would OOM prematurely.
>
> Not that an OOM would seem that far away in that scenario, anyway. But I
> remember long discussions with DavidR on probabilistic OOM regressions ;)
>

Above the if (nr_reclaimed...) check we have:

if (gfp_mask & __GFP_NORETRY)
    goto nomem;

and below it we have:

if (nr_retries--)
    goto retry;

So IIUC we only OOM prematurely if we either have __GFP_NORETRY and
cannot reclaim any LRU pages on the first try, or if the scenario
where only the shrinkers managed to reclaim anything happens on the
last retry. Right?
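
The ordering of those checks can be sketched as a small userspace model.
This is only an illustration of the checks quoted in this thread, not the
kernel's actual charge path: struct charge_attempt, enum charge_step and
next_step() are hypothetical names, COSTLY_ORDER is assumed to be 3 to
mirror PAGE_ALLOC_COSTLY_ORDER, and every other step of the real function
is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define COSTLY_ORDER 3 /* assumption: mirrors PAGE_ALLOC_COSTLY_ORDER */

/* Hypothetical outcomes of one pass through the post-reclaim checks. */
enum charge_step { STEP_RETRY, STEP_NOMEM, STEP_OOM };

/* Hypothetical snapshot of the state right after a reclaim attempt. */
struct charge_attempt {
	unsigned long margin;       /* room left under the limit */
	unsigned long nr_reclaimed; /* pages reported by reclaim */
	unsigned long nr_pages;     /* pages we are trying to charge */
	bool noretry;               /* models gfp_mask & __GFP_NORETRY */
	int nr_retries;             /* retries left */
};

static enum charge_step next_step(struct charge_attempt *a)
{
	assert(a->nr_pages > 0);

	/* Enough headroom was created, whoever reported it. */
	if (a->margin >= a->nr_pages)
		return STEP_RETRY;

	/* __GFP_NORETRY gives up without another attempt. */
	if (a->noretry)
		return STEP_NOMEM;

	/* Reported progress earns a retry for small charges. */
	if (a->nr_reclaimed && a->nr_pages <= (1UL << COSTLY_ORDER))
		return STEP_RETRY;

	/* Otherwise fall back on the remaining retry budget. */
	if (a->nr_retries-- > 0)
		return STEP_RETRY;

	return STEP_OOM;
}
```

In this model a shrinker-only reclaim shows up as nr_reclaimed == 0, so
the outcome then depends only on the margin and on the retries left.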

> > > I think the code should still be accounting for all pages that
> > > belong to the memcg being scanned that are reclaimed, not ignoring
> > > them altogether...
> >
> > 100% agree. Ideally I would want to:
> > - For pruned inodes: report all freed pages for global reclaim, and
> > only report pages charged to the memcg under reclaim for memcg
> > reclaim.
>
> This only happens on highmem systems at this point, as elsewhere
> populated inodes aren't on the shrinker LRUs anymore. We'd probably be
> ok with a comment noting the inaccuracy in the proactive reclaim stats
> for the time being, until somebody actually cares about that combination.

Interesting, I did not realize this. I guess in this case we may get
away with just ignoring non-LRU reclaimed pages in memcg reclaim
completely, or go a bit further and report uncharged objcg pages in
memcg reclaim. See below.

>
> > - For slab: report all freed pages for global reclaim, and only report
> > uncharged objcg pages from the memcg under reclaim for memcg reclaim.
> >
> > The only problem is that I thought people would think this is too much
> > complexity and not worth it. If people agree this should be the
> > approach to follow, I can prepare patches for this. I originally
> > implemented this for slab pages, but held off on sending it.
>
> I'd be curious to see the code!

I think it is small enough to paste here. Basically, instead of just
ignoring reclaim_state->reclaimed completely as in patch 2, in memcg
reclaim I counted uncharged objcg pages instead of freed slab pages,
and ignored pruned inode pages. So I guess we can go with one of:
- Just ignore freed slab pages and pages from pruned inodes in memcg
reclaim (current RFC).
- Ignore pruned inodes in memcg reclaim (as you explain above), and
use the following diff instead of patch 2 for slab.
- Use the following diff for slab AND properly report freed pages from
pruned inodes if they are relevant to the memcg under reclaim.

Let me know what you think is best.

diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc1d8b326453..37f799901dfb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -162,6 +162,7 @@ struct reclaim_state {
 };

 void report_freed_pages(unsigned long pages);
+bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg);

 #ifdef __KERNEL__

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab457f0394ab..a886ace70648 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3080,6 +3080,13 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
        memcg_account_kmem(memcg, -nr_pages);
        refill_stock(memcg, nr_pages);

+       /*
+        * If undergoing memcg reclaim, report uncharged pages and drain local
+        * stock to update the memcg usage.
+        */
+       if (report_uncharged_pages(nr_pages, memcg))
+               drain_local_stock(NULL);
+
        css_put(&memcg->css);
 }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 207998b16e5f..d4eced2b884b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,17 +204,54 @@ static void set_task_reclaim_state(struct task_struct *task,
        task->reclaim_state = rs;
 }

+static bool cgroup_reclaim(struct scan_control *sc);
+
 /*
  * reclaim_report_freed_pages: report pages freed outside of LRU-based reclaim
  * @pages: number of pages freed
  *
- * If the current process is undergoing a reclaim operation,
+ * If the current process is undergoing a non-cgroup reclaim operation,
  * increment the number of reclaimed pages by @pages.
  */
 void report_freed_pages(unsigned long pages)
 {
-       if (current->reclaim_state)
-               current->reclaim_state->reclaimed += pages;
+       struct reclaim_state *rs = current->reclaim_state;
+       struct scan_control *sc;
+
+       if (!rs)
+               return;
+
+       sc = container_of(rs, struct scan_control, reclaim_state);
+       if (!cgroup_reclaim(sc))
+               rs->reclaimed += pages;
+}
+
+/*
+ * report_uncharged_pages: report pages uncharged outside of LRU-based reclaim
+ * @pages: number of pages uncharged
+ * @memcg: memcg pages were uncharged from
+ *
+ * If the current process is undergoing a cgroup reclaim operation, increment
+ * the number of reclaimed pages by @pages, if the memcg under reclaim is
+ * @memcg or an ancestor of it.
+ *
+ * Returns true if an update was made.
+ */
+bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg)
+{
+       struct reclaim_state *rs = current->reclaim_state;
+       struct scan_control *sc;
+
+       if (!rs)
+               return false;
+
+       sc = container_of(rs, struct scan_control, reclaim_state);
+       if (cgroup_reclaim(sc) &&
+           mem_cgroup_is_descendant(memcg, sc->target_mem_cgroup)) {
+               rs->reclaimed += pages;
+               return true;
+       }
+       return false;
 }

 LIST_HEAD(shrinker_list);
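
The container_of() lookup and the subtree check in the diff above can be
tried out in a userspace sketch. Everything here is a simplified stand-in,
not kernel code: struct memcg, is_descendant() and report_uncharged() are
hypothetical, the reclaim state is passed in explicitly instead of being
read from current, and a NULL target is assumed to model global reclaim.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct memcg {
	struct memcg *parent; /* NULL for a root */
};

struct reclaim_state {
	unsigned long reclaimed;
};

struct scan_control {
	struct memcg *target; /* NULL models global reclaim */
	struct reclaim_state reclaim_state; /* embedded, as the diff assumes */
};

/* True if @memcg is @root or sits anywhere below it. */
static bool is_descendant(const struct memcg *memcg, const struct memcg *root)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg == root)
			return true;
	return false;
}

/* Count @pages only when @memcg is inside the subtree under reclaim. */
static bool report_uncharged(struct reclaim_state *rs, unsigned long pages,
			     struct memcg *memcg)
{
	struct scan_control *sc;

	assert(memcg != NULL);
	if (!rs)
		return false;

	/* Recover the enclosing scan_control from its embedded member. */
	sc = container_of(rs, struct scan_control, reclaim_state);
	if (sc->target && is_descendant(memcg, sc->target)) {
		rs->reclaimed += pages;
		return true;
	}
	return false;
}
```

The lookup is only valid because the reclaim state is embedded in the
scan control; a reclaim_state allocated on its own would make the
container_of() result meaningless.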


* Re: [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
  2023-02-03 15:28       ` Yosry Ahmed
@ 2023-02-04  0:26         ` Shakeel Butt
  2023-02-08 22:28           ` Yosry Ahmed
  0 siblings, 1 reply; 11+ messages in thread
From: Shakeel Butt @ 2023-02-04  0:26 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Dave Chinner, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Peter Xu, NeilBrown, Michal Hocko,
	linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Fri, Feb 03, 2023 at 07:28:49AM -0800, Yosry Ahmed wrote:
> On Fri, Feb 3, 2023 at 7:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Feb 02, 2023 at 04:17:18PM -0800, Yosry Ahmed wrote:
> > > On Thu, Feb 2, 2023 at 4:01 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > Patch 1 is just refactoring updating reclaim_state into a helper
> > > > > function, and renames reclaimed_slab to just reclaimed, with a comment
> > > > > describing its true purpose.
> > > > >
> > > > > Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> > > > >
> > > > > The original draft was a little bit different. It also kept track of
> > > > > uncharged objcg pages, and reported them only in memcg reclaim and only
> > > > > if the uncharged memcg is in the subtree of the memcg under reclaim.
> > > > > This was an attempt to make reporting of memcg reclaim even more
> > > > > accurate, but was dropped due to questionable complexity vs benefit
> > > > > tradeoff. It can be revived if there is interest.
> > > > >
> > > > > Yosry Ahmed (2):
> > > > >   mm: vmscan: refactor updating reclaimed pages in reclaim_state
> > > > >   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
> > > > >
> > > > >  fs/inode.c           |  3 +--
> > > >
> > > > Inodes and inode mapping pages are directly charged to the memcg
> > > > that allocated them and the shrinker is correctly marked as
> > > > SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will
> > > > account them correctly to the related memcg, regardless of which
> > > > memcg is triggering the reclaim.  Hence I'm not sure that skipping
> > > > the accounting of the reclaimed memory is even correct in this case;
> > >
> > > Please note that we are not skipping any accounting here. The pages
> > > are still uncharged from the memcgs they are charged to (the allocator
> > > memcgs as you pointed out). We just do not report them in the return
> > > value of try_to_free_mem_cgroup_pages(), to avoid over-reporting.
> >
> > I was wondering the same thing as Dave, reading through this. But
> > you're right, we'll catch the accounting during uncharge. Can you
> > please add a comment on the !cgroup_reclaim() explaining this?
> 
> Sure! If we settle on this implementation I will send another version
> with a comment and fix the build problem in patch 2.
> 
> >
> > There is one wrinkle with this, though. We have the following
> > (simplified) sequence during charging:
> >
> >         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> >                                                     gfp_mask, reclaim_options);
> >
> >         if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> >                 goto retry;
> >
> >         /*
> >          * Even though the limit is exceeded at this point, reclaim
> >          * may have been able to free some pages.  Retry the charge
> >          * before killing the task.
> >          *
> >          * Only for regular pages, though: huge pages are rather
> >          * unlikely to succeed so close to the limit, and we fall back
> >          * to regular pages anyway in case of failure.
> >          */
> >         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> >                 goto retry;
> >
> > So in the unlikely scenario where the first call doesn't make the
> > necessary headroom, and the shrinkers are the only thing that made
> > forward progress, we would OOM prematurely.
> >
> > Not that an OOM would seem that far away in that scenario, anyway. But I
> > remember long discussions with DavidR on probabilistic OOM regressions ;)
> >
> 
> Above the if (nr_reclaimed...) check we have:
> 
> if (gfp_mask & __GFP_NORETRY)
>     goto nomem;
> 
> and below it we have:
> 
> if (nr_retries--)
>     goto retry;
> 
> So IIUC we only OOM prematurely if we either have __GFP_NORETRY and
> cannot reclaim any LRU pages on the first try, or if the scenario
> where only the shrinkers managed to reclaim anything happens on the
> last retry. Right?
> 

We don't call the oom-killer for __GFP_NORETRY. Also note that the retry
(from nr_retries) after the reclaim includes page_counter_try_charge().
So, even if try_to_free_mem_cgroup_pages() has returned 0 after
reclaiming the slab memory of the memcg, the page_counter_try_charge()
should succeed if the reclaimed slab objects have created enough margin.
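
A toy model of that point, assuming nothing about the real page_counter
implementation: struct counter, counter_try_charge() and
reclaim_slab_only() are hypothetical names used only to show that the
uncharge lowers usage even when reclaim reports 0 pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a memcg's usage counter and its limit. */
struct counter {
	unsigned long usage;
	unsigned long max;
};

/* Charge @nr_pages if they fit under the limit. */
static bool counter_try_charge(struct counter *c, unsigned long nr_pages)
{
	assert(c->usage <= c->max);
	if (nr_pages > c->max - c->usage)
		return false;
	c->usage += nr_pages;
	return true;
}

/*
 * Models a memcg reclaim pass where only shrinkers made progress:
 * the freed slab pages are uncharged, but nothing is reported back.
 */
static unsigned long reclaim_slab_only(struct counter *c,
				       unsigned long slab_pages)
{
	unsigned long freed = slab_pages < c->usage ? slab_pages : c->usage;

	c->usage -= freed; /* the uncharge still happens */
	return 0;          /* not reported in memcg reclaim */
}
```

A charge that fails at the limit succeeds on the next attempt once such a
pass has run, even though the pass itself returned 0.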

> > > > I think the code should still be accounting for all pages that
> > > > belong to the memcg being scanned that are reclaimed, not ignoring
> > > > them altogether...
> > >
> > > 100% agree. Ideally I would want to:
> > > - For pruned inodes: report all freed pages for global reclaim, and
> > > only report pages charged to the memcg under reclaim for memcg
> > > reclaim.
> >
> > This only happens on highmem systems at this point, as elsewhere
> > populated inodes aren't on the shrinker LRUs anymore. We'd probably be
> > ok with a comment noting the inaccuracy in the proactive reclaim stats
> > for the time being, until somebody actually cares about that combination.
> 
> Interesting, I did not realize this. I guess in this case we may get
> away with just ignoring non-LRU reclaimed pages in memcg reclaim
> completely, or go a bit further and report uncharged objcg pages in
> memcg reclaim. See below.
> 
> >
> > > - For slab: report all freed pages for global reclaim, and only report
> > > uncharged objcg pages from the memcg under reclaim for memcg reclaim.
> > >
> > > The only problem is that I thought people would think this is too much
> > > complexity and not worth it. If people agree this should be the
> > > approach to follow, I can prepare patches for this. I originally
> > > implemented this for slab pages, but held off on sending it.
> >
> > I'd be curious to see the code!
> 
> I think it is small enough to paste here. Basically, instead of just
> ignoring reclaim_state->reclaimed completely as in patch 2, in memcg
> reclaim I counted uncharged objcg pages instead of freed slab pages,
> and ignored pruned inode pages. So I guess we can go with one of:
> - Just ignore freed slab pages and pages from pruned inodes in memcg
> reclaim (current RFC).
> - Ignore pruned inodes in memcg reclaim (as you explain above), and
> use the following diff instead of patch 2 for slab.
> - Use the following diff for slab AND properly report freed pages from
> pruned inodes if they are relevant to the memcg under reclaim.
> 
> Let me know what you think is best.
> 

I would prefer the current RFC over the other two options. Those
options slow down the uncharge code path (and add complexity to it)
for accuracy that no one really needs or should care about.
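
For comparison, the behaviour of the current RFC as described in this
thread can be modelled in a few lines. This is a sketch under stated
assumptions, not patch 2 itself: struct scan_model and finish_reclaim()
are hypothetical names, and the only point is that the non-LRU count is
added to the result outside of cgroup reclaim and dropped otherwise, with
no work added to the uncharge path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down view of the reclaim bookkeeping. */
struct scan_model {
	bool cgroup_reclaim;        /* true for memcg reclaim */
	unsigned long nr_reclaimed; /* pages reclaimed from the LRUs */
	unsigned long non_lru;      /* pages freed by shrinkers and friends */
};

/* Fold the non-LRU count into the result, but only for global reclaim. */
static unsigned long finish_reclaim(struct scan_model *sc)
{
	assert(sc != NULL);

	/*
	 * In memcg reclaim these pages may belong to any memcg; they are
	 * still uncharged from their owners, just not reported here.
	 */
	if (!sc->cgroup_reclaim)
		sc->nr_reclaimed += sc->non_lru;
	sc->non_lru = 0;

	return sc->nr_reclaimed;
}
```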

> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index bc1d8b326453..37f799901dfb 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -162,6 +162,7 @@ struct reclaim_state {
>  };
> 
>  void report_freed_pages(unsigned long pages);
> +bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg);
> 
>  #ifdef __KERNEL__
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab457f0394ab..a886ace70648 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3080,6 +3080,13 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
>         memcg_account_kmem(memcg, -nr_pages);
>         refill_stock(memcg, nr_pages);
> 
> +       /*
> +        * If undergoing memcg reclaim, report uncharged pages and drain local
> +        * stock to update the memcg usage.
> +        */
> +       if (report_uncharged_pages(nr_pages, memcg))
> +               drain_local_stock(NULL);
> +
>         css_put(&memcg->css);
>  }
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 207998b16e5f..d4eced2b884b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -204,17 +204,54 @@ static void set_task_reclaim_state(struct task_struct *task,
>         task->reclaim_state = rs;
>  }
> 
> +static bool cgroup_reclaim(struct scan_control *sc);
> +
>  /*
>   * reclaim_report_freed_pages: report pages freed outside of LRU-based reclaim
>   * @pages: number of pages freed
>   *
> - * If the current process is undergoing a reclaim operation,
> + * If the current process is undergoing a non-cgroup reclaim operation,
>   * increment the number of reclaimed pages by @pages.
>   */
>  void report_freed_pages(unsigned long pages)
>  {
> -       if (current->reclaim_state)
> -               current->reclaim_state->reclaimed += pages;
> +       struct reclaim_state *rs = current->reclaim_state;
> +       struct scan_control *sc;
> +
> +       if (!rs)
> +               return;
> +
> +       sc = container_of(rs, struct scan_control, reclaim_state);
> +       if (!cgroup_reclaim(sc))
> +               rs->reclaimed += pages;
> +}
> +
> +/*
> + * report_uncharged_pages: report pages uncharged outside of LRU-based reclaim
> + * @pages: number of pages uncharged
> + * @memcg: memcg pages were uncharged from
> + *
> + * If the current process is undergoing a cgroup reclaim operation, increment
> + * the number of reclaimed pages by @pages, if the memcg under reclaim is
> + * @memcg or an ancestor of it.
> + *
> + * Returns true if an update was made.
> + */
> +bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg)
> +{
> +       struct reclaim_state *rs = current->reclaim_state;
> +       struct scan_control *sc;
> +
> +       if (!rs)
> +               return false;
> +
> +       sc = container_of(rs, struct scan_control, reclaim_state);
> +       if (cgroup_reclaim(sc) &&
> +           mem_cgroup_is_descendant(memcg, sc->target_mem_cgroup)) {
> +               rs->reclaimed += pages;
> +               return true;
> +       }
> +       return false;
>  }
> 
>  LIST_HEAD(shrinker_list);


* Re: [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim
  2023-02-04  0:26         ` Shakeel Butt
@ 2023-02-08 22:28           ` Yosry Ahmed
  0 siblings, 0 replies; 11+ messages in thread
From: Yosry Ahmed @ 2023-02-08 22:28 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Dave Chinner, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Peter Xu, NeilBrown, Michal Hocko,
	linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Fri, Feb 3, 2023 at 4:26 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Feb 03, 2023 at 07:28:49AM -0800, Yosry Ahmed wrote:
> > On Fri, Feb 3, 2023 at 7:11 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, Feb 02, 2023 at 04:17:18PM -0800, Yosry Ahmed wrote:
> > > > On Thu, Feb 2, 2023 at 4:01 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > Patch 1 is just refactoring updating reclaim_state into a helper
> > > > > > function, and renames reclaimed_slab to just reclaimed, with a comment
> > > > > > describing its true purpose.
> > > > > >
> > > > > > Patch 2 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
> > > > > >
> > > > > > The original draft was a little bit different. It also kept track of
> > > > > > uncharged objcg pages, and reported them only in memcg reclaim and only
> > > > > > if the uncharged memcg is in the subtree of the memcg under reclaim.
> > > > > > This was an attempt to make reporting of memcg reclaim even more
> > > > > > accurate, but was dropped due to questionable complexity vs benefit
> > > > > > tradeoff. It can be revived if there is interest.
> > > > > >
> > > > > > Yosry Ahmed (2):
> > > > > >   mm: vmscan: refactor updating reclaimed pages in reclaim_state
> > > > > >   mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
> > > > > >
> > > > > >  fs/inode.c           |  3 +--
> > > > >
> > > > > Inodes and inode mapping pages are directly charged to the memcg
> > > > > that allocated them and the shrinker is correctly marked as
> > > > > SHRINKER_MEMCG_AWARE. Freeing the pages attached to the inode will
> > > > > account them correctly to the related memcg, regardless of which
> > > > > memcg is triggering the reclaim.  Hence I'm not sure that skipping
> > > > > the accounting of the reclaimed memory is even correct in this case;
> > > >
> > > > Please note that we are not skipping any accounting here. The pages
> > > > are still uncharged from the memcgs they are charged to (the allocator
> > > > memcgs as you pointed out). We just do not report them in the return
> > > > value of try_to_free_mem_cgroup_pages(), to avoid over-reporting.
> > >
> > > I was wondering the same thing as Dave, reading through this. But
> > > you're right, we'll catch the accounting during uncharge. Can you
> > > please add a comment on the !cgroup_reclaim() explaining this?
> >
> > Sure! If we settle on this implementation I will send another version
> > with a comment and fix the build problem in patch 2.
> >
> > >
> > > There is one wrinkle with this, though. We have the following
> > > (simplified) sequence during charging:
> > >
> > >         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> > >                                                     gfp_mask, reclaim_options);
> > >
> > >         if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> > >                 goto retry;
> > >
> > >         /*
> > >          * Even though the limit is exceeded at this point, reclaim
> > >          * may have been able to free some pages.  Retry the charge
> > >          * before killing the task.
> > >          *
> > >          * Only for regular pages, though: huge pages are rather
> > >          * unlikely to succeed so close to the limit, and we fall back
> > >          * to regular pages anyway in case of failure.
> > >          */
> > >         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> > >                 goto retry;
> > >
> > > So in the unlikely scenario where the first call doesn't make the
> > > necessary headroom, and the shrinkers are the only thing that made
> > > forward progress, we would OOM prematurely.
> > >
> > > Not that an OOM would seem that far away in that scenario, anyway. But I
> > > remember long discussions with DavidR on probabilistic OOM regressions ;)
> > >
> >
> > Above the if (nr_reclaimed...) check we have:
> >
> > if (gfp_mask & __GFP_NORETRY)
> >     goto nomem;
> >
> > and below it we have:
> >
> > if (nr_retries--)
> >     goto retry;
> >
> > So IIUC we only OOM prematurely if we either have __GFP_NORETRY and
> > cannot reclaim any LRU pages on the first try, or if the scenario
> > where only the shrinkers managed to reclaim anything happens on the
> > last retry. Right?
> >
>
> We don't call the oom-killer for __GFP_NORETRY. Also note that the retry
> (from nr_retries) after the reclaim includes page_counter_try_charge().
> So, even if try_to_free_mem_cgroup_pages() has returned 0 after
> reclaiming the slab memory of the memcg, the page_counter_try_charge()
> should succeed if the reclaimed slab objects have created enough margin.
>
> > > > > I think the code should still be accounting for all pages that
> > > > > belong to the memcg being scanned that are reclaimed, not ignoring
> > > > > them altogether...
> > > >
> > > > 100% agree. Ideally I would want to:
> > > > - For pruned inodes: report all freed pages for global reclaim, and
> > > > only report pages charged to the memcg under reclaim for memcg
> > > > reclaim.
> > >
> > > This only happens on highmem systems at this point, as elsewhere
> > > populated inodes aren't on the shrinker LRUs anymore. We'd probably be
> > > ok with a comment noting the inaccuracy in the proactive reclaim stats
> > > for the time being, until somebody actually cares about that combination.
> >
> > Interesting, I did not realize this. I guess in this case we may get
> > away with just ignoring non-LRU reclaimed pages in memcg reclaim
> > completely, or go a bit further and report uncharged objcg pages in
> > memcg reclaim. See below.
> >
> > >
> > > > - For slab: report all freed pages for global reclaim, and only report
> > > > uncharged objcg pages from the memcg under reclaim for memcg reclaim.
> > > >
> > > > The only problem is that I thought people would think this is too much
> > > > complexity and not worth it. If people agree this should be the
> > > > approach to follow, I can prepare patches for this. I originally
> > > > implemented this for slab pages, but held off on sending it.
> > >
> > > I'd be curious to see the code!
> >
> > I think it is small enough to paste here. Basically, instead of just
> > ignoring reclaim_state->reclaimed completely as in patch 2, in memcg
> > reclaim I counted uncharged objcg pages instead of freed slab pages,
> > and ignored pruned inode pages. So I guess we can go with one of:
> > - Just ignore freed slab pages and pages from pruned inodes in memcg
> > reclaim (current RFC).
> > - Ignore pruned inodes in memcg reclaim (as you explain above), and
> > use the following diff instead of patch 2 for slab.
> > - Use the following diff for slab AND properly report freed pages from
> > pruned inodes if they are relevant to the memcg under reclaim.
> >
> > Let me know what you think is best.
> >
>
> I would prefer the current RFC over the other two options. Those
> options slow down the uncharge code path (and add complexity to it)
> for accuracy that no one really needs or should care about.
>
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index bc1d8b326453..37f799901dfb 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -162,6 +162,7 @@ struct reclaim_state {
> >  };
> >
> >  void report_freed_pages(unsigned long pages);
> > +bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg);
> >
> >  #ifdef __KERNEL__
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ab457f0394ab..a886ace70648 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3080,6 +3080,13 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> >         memcg_account_kmem(memcg, -nr_pages);
> >         refill_stock(memcg, nr_pages);
> >
> > +       /*
> > +        * If undergoing memcg reclaim, report uncharged pages and drain local
> > +        * stock to update the memcg usage.
> > +        */
> > +       if (report_uncharged_pages(nr_pages, memcg))
> > +               drain_local_stock(NULL);
> > +
> >         css_put(&memcg->css);
> >  }
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 207998b16e5f..d4eced2b884b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -204,17 +204,54 @@ static void set_task_reclaim_state(struct task_struct *task,
> >         task->reclaim_state = rs;
> >  }
> >
> > +static bool cgroup_reclaim(struct scan_control *sc);
> > +
> >  /*
> >   * reclaim_report_freed_pages: report pages freed outside of LRU-based reclaim
> >   * @pages: number of pages freed
> >   *
> > - * If the current process is undergoing a reclaim operation,
> > + * If the current process is undergoing a non-cgroup reclaim operation,
> >   * increment the number of reclaimed pages by @pages.
> >   */
> >  void report_freed_pages(unsigned long pages)
> >  {
> > -       if (current->reclaim_state)
> > -               current->reclaim_state->reclaimed += pages;
> > +       struct reclaim_state *rs = current->reclaim_state;
> > +       struct scan_control *sc;
> > +
> > +       if (!rs)
> > +               return;
> > +
> > +       sc = container_of(rs, struct scan_control, reclaim_state);
> > +       if (!cgroup_reclaim(sc))
> > +               rs->reclaimed += pages;
> > +}
> > +
> > +/*
> > + * report_uncharged_pages: report pages uncharged outside of LRU-based reclaim
> > + * @pages: number of pages uncharged
> > + * @memcg: memcg pages were uncharged from
> > + *
> > + * If the current process is undergoing a cgroup reclaim operation, increment
> > + * the number of reclaimed pages by @pages, if the memcg under reclaim is
> > + * @memcg or an ancestor of it.
> > + *
> > + * Returns true if an update was made.
> > + */
> > +bool report_uncharged_pages(unsigned long pages, struct mem_cgroup *memcg)
> > +{
> > +       struct reclaim_state *rs = current->reclaim_state;
> > +       struct scan_control *sc;
> > +
> > +       if (!rs)
> > +               return false;
> > +
> > +       sc = container_of(rs, struct scan_control, reclaim_state);
> > +       if (cgroup_reclaim(sc) &&
> > +           mem_cgroup_is_descendant(memcg, sc->target_mem_cgroup)) {
> > +               rs->reclaimed += pages;
> > +               return true;
> > +       }
> > +       return false;
> >  }
> >
> >  LIST_HEAD(shrinker_list);

Any further thoughts on this, whether to refresh the current RFC with
added comments (based on Johannes's feedback) and exporting
report_freed_pages() (based on Matthew's feedback), or to send a new
version with the code above that accurately counts objcg uncharged
pages in memcg reclaim?



Thread overview: 11+ messages
2023-02-02 23:32 [RFC PATCH v1 0/2] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
2023-02-02 23:32 ` [RFC PATCH v1 1/2] mm: vmscan: refactor updating reclaimed pages in reclaim_state Yosry Ahmed
2023-02-03 16:22   ` Matthew Wilcox
2023-02-03 22:30     ` Yosry Ahmed
2023-02-02 23:32 ` [RFC PATCH v1 2/2] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
2023-02-03  0:00 ` [RFC PATCH v1 0/2] Ignore " Dave Chinner
2023-02-03  0:17   ` Yosry Ahmed
2023-02-03 15:11     ` Johannes Weiner
2023-02-03 15:28       ` Yosry Ahmed
2023-02-04  0:26         ` Shakeel Butt
2023-02-08 22:28           ` Yosry Ahmed
