linux-mm.kvack.org archive mirror
* [PATCH v3 0/4] introduce uncharged file mapped folios
@ 2025-08-19  0:36 Boris Burkov
  2025-08-19  0:36 ` [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED Boris Burkov
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Boris Burkov @ 2025-08-19  0:36 UTC (permalink / raw)
  To: akpm
  Cc: linux-btrfs, linux-mm, linux-fsdevel, kernel-team, shakeel.butt,
	wqu, willy, mhocko, muchun.song, roman.gushchin, hannes

I would like to revisit Qu's proposal to not charge btrfs extent_buffer
allocations to the user's cgroup.

https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com/

I believe it is detrimental to account these global pages to whichever
cgroup happens to be using them, which is essentially random. There is a
bit more justification and explanation in the patches themselves.

---
Changelog:
v3:
- use mod_node_page_state since we will never count cgroup stats
- include Shakeel's patch that removes a WARNING triggered by this series
v2:
- switch from filemap_add_folio_nocharge() to AS_UNCHARGED on the
  address_space.
- fix an interrupt safety bug in the vmstat patch.
- fix some foolish build errors for CONFIG_MEMCG=n


Boris Burkov (3):
  mm/filemap: add AS_UNCHARGED
  mm: add vmstat for cgroup uncharged pages
  btrfs: set AS_UNCHARGED on the btree_inode

Shakeel Butt (1):
  memcg: remove warning from folio_lruvec

 fs/btrfs/disk-io.c         |  1 +
 include/linux/memcontrol.h |  5 +----
 include/linux/mmzone.h     |  3 +++
 include/linux/pagemap.h    |  1 +
 mm/filemap.c               | 29 +++++++++++++++++++++++++----
 mm/vmstat.c                |  3 +++
 6 files changed, 34 insertions(+), 8 deletions(-)

-- 
2.50.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-19  0:36 [PATCH v3 0/4] introduce uncharged file mapped folios Boris Burkov
@ 2025-08-19  0:36 ` Boris Burkov
  2025-08-19  2:46   ` Matthew Wilcox
  2025-08-20 22:06   ` Klara Modin
  2025-08-19  0:36 ` [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages Boris Burkov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 22+ messages in thread
From: Boris Burkov @ 2025-08-19  0:36 UTC (permalink / raw)
  To: akpm
  Cc: linux-btrfs, linux-mm, linux-fsdevel, kernel-team, shakeel.butt,
	wqu, willy, mhocko, muchun.song, roman.gushchin, hannes

Btrfs currently tracks its metadata pages in the page cache, using a
fake inode (fs_info->btree_inode) with offsets corresponding to where
the metadata is stored in the filesystem's full logical address space.

A consequence of this is that when btrfs uses filemap_add_folio(), this
usage is charged to the cgroup of whichever task happens to be running
at the time. These folios don't belong to any particular user cgroup, so
I don't think it makes much sense for them to be charged in that way.
This has some negative consequences:
- A task can be holding important btrfs locks, then need to look up
  some metadata and go into reclaim, extending the duration for which it
  holds those locks and unfairly pushing its own reclaim pain onto other
  cgroups.
- If that cgroup goes into reclaim, it might reclaim these folios that a
  different, non-reclaiming cgroup might need soon. This is naturally
  offset by LRU reclaim, but still.

A very similar proposal to use the root cgroup was previously made by
Qu, where he eventually proposed the idea of setting it per
address_space. This makes good sense for the btrfs use case, as the
uncharged behavior should apply to all use of the address_space, not
select allocations. I.e., if someone adds another filemap_add_folio()
call using btrfs's btree_inode, we would almost certainly want the
uncharged behavior.

Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com/
Suggested-by: Qu Wenruo <wqu@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Boris Burkov <boris@bur.io>
---
 include/linux/pagemap.h |  1 +
 mm/filemap.c            | 12 ++++++++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c9ba69e02e3e..06dc3fae8124 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -211,6 +211,7 @@ enum mapping_flags {
 				   folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
 	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
+	AS_UNCHARGED = 10,	/* Do not charge usage to a cgroup */
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
diff --git a/mm/filemap.c b/mm/filemap.c
index e4a5a46db89b..5004a2cfa0cc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -960,15 +960,19 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 {
 	void *shadow = NULL;
 	int ret;
+	bool charge_mem_cgroup = !test_bit(AS_UNCHARGED, &mapping->flags);
 
-	ret = mem_cgroup_charge(folio, NULL, gfp);
-	if (ret)
-		return ret;
+	if (charge_mem_cgroup) {
+		ret = mem_cgroup_charge(folio, NULL, gfp);
+		if (ret)
+			return ret;
+	}
 
 	__folio_set_locked(folio);
 	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
 	if (unlikely(ret)) {
-		mem_cgroup_uncharge(folio);
+		if (charge_mem_cgroup)
+			mem_cgroup_uncharge(folio);
 		__folio_clear_locked(folio);
 	} else {
 		/*
-- 
2.50.1




* [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-19  0:36 [PATCH v3 0/4] introduce uncharged file mapped folios Boris Burkov
  2025-08-19  0:36 ` [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED Boris Burkov
@ 2025-08-19  0:36 ` Boris Burkov
  2025-08-19  2:50   ` Matthew Wilcox
  2025-08-19  0:36 ` [PATCH v3 3/4] btrfs: set AS_UNCHARGED on the btree_inode Boris Burkov
  2025-08-19  0:36 ` [PATCH v3 4/4] memcg: remove warning from folio_lruvec Boris Burkov
  3 siblings, 1 reply; 22+ messages in thread
From: Boris Burkov @ 2025-08-19  0:36 UTC (permalink / raw)
  To: akpm
  Cc: linux-btrfs, linux-mm, linux-fsdevel, kernel-team, shakeel.butt,
	wqu, willy, mhocko, muchun.song, roman.gushchin, hannes

Uncharged pages are tricky to track by their essential "uncharged"
nature. To maintain good accounting, introduce a vmstat counter tracking
all uncharged pages. Since this is only meaningful when cgroups are
configured, only expose the counter when CONFIG_MEMCG is set.

Confirmed that this works as expected at a high level by mounting a
btrfs filesystem using AS_UNCHARGED for metadata pages and observing the
counter rise with fs usage, drop back to a minimal level after
drop_caches, and finally reach 0 after unmounting the fs.

Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Boris Burkov <boris@bur.io>
---
 include/linux/mmzone.h |  3 +++
 mm/filemap.c           | 17 +++++++++++++++++
 mm/vmstat.c            |  3 +++
 3 files changed, 23 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fe13ad175fed..8166dadb6a33 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -259,6 +259,9 @@ enum node_stat_item {
 	NR_HUGETLB,
 #endif
 	NR_BALLOON_PAGES,
+#ifdef CONFIG_MEMCG
+	NR_UNCHARGED_FILE_PAGES,
+#endif
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 5004a2cfa0cc..514ccf7c8aa3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -146,6 +146,19 @@ static void page_cache_delete(struct address_space *mapping,
 	mapping->nrpages -= nr;
 }
 
+#ifdef CONFIG_MEMCG
+static void filemap_mod_uncharged_vmstat(struct folio *folio, int sign)
+{
+	long nr = folio_nr_pages(folio) * sign;
+
+	mod_node_page_state(folio_pgdat(folio), NR_UNCHARGED_FILE_PAGES, nr);
+}
+#else
+static void filemap_mod_uncharged_vmstat(struct folio *folio, int sign)
+{
+}
+#endif
+
 static void filemap_unaccount_folio(struct address_space *mapping,
 		struct folio *folio)
 {
@@ -190,6 +203,8 @@ static void filemap_unaccount_folio(struct address_space *mapping,
 		__lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr);
 		filemap_nr_thps_dec(mapping);
 	}
+	if (test_bit(AS_UNCHARGED, &folio->mapping->flags))
+		filemap_mod_uncharged_vmstat(folio, -1);
 
 	/*
 	 * At this point folio must be either written or cleaned by
@@ -987,6 +1002,8 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
 		if (!(gfp & __GFP_WRITE) && shadow)
 			workingset_refault(folio, shadow);
 		folio_add_lru(folio);
+		if (!charge_mem_cgroup)
+			filemap_mod_uncharged_vmstat(folio, 1);
 	}
 	return ret;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e74f0b2a1021..135f7bab4de6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1290,6 +1290,9 @@ const char * const vmstat_text[] = {
 	[I(NR_HUGETLB)]				= "nr_hugetlb",
 #endif
 	[I(NR_BALLOON_PAGES)]			= "nr_balloon_pages",
+#ifdef CONFIG_MEMCG
+	[I(NR_UNCHARGED_FILE_PAGES)]		= "nr_uncharged_file_pages",
+#endif
 #undef I
 
 	/* system-wide enum vm_stat_item counters */
-- 
2.50.1




* [PATCH v3 3/4] btrfs: set AS_UNCHARGED on the btree_inode
  2025-08-19  0:36 [PATCH v3 0/4] introduce uncharged file mapped folios Boris Burkov
  2025-08-19  0:36 ` [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED Boris Burkov
  2025-08-19  0:36 ` [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages Boris Burkov
@ 2025-08-19  0:36 ` Boris Burkov
  2025-08-19  0:36 ` [PATCH v3 4/4] memcg: remove warning from folio_lruvec Boris Burkov
  3 siblings, 0 replies; 22+ messages in thread
From: Boris Burkov @ 2025-08-19  0:36 UTC (permalink / raw)
  To: akpm
  Cc: linux-btrfs, linux-mm, linux-fsdevel, kernel-team, shakeel.butt,
	wqu, willy, mhocko, muchun.song, roman.gushchin, hannes

extent_buffers are global and shared, so their pages should not belong
to any particular cgroup (currently they are charged to whichever cgroup
happens to allocate the extent_buffer).

Btrfs tree operations should not arbitrarily block on cgroup reclaim or
have the shared extent_buffer pages on a cgroup's reclaim lists.

Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 70fc4e7cc5a0..24ac4006dfe2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1930,6 +1930,7 @@ static int btrfs_init_btree_inode(struct super_block *sb)
 	BTRFS_I(inode)->root = btrfs_grab_root(fs_info->tree_root);
 	set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags);
 	__insert_inode_hash(inode, hash);
+	set_bit(AS_UNCHARGED, &inode->i_mapping->flags);
 	fs_info->btree_inode = inode;
 
 	return 0;
-- 
2.50.1




* [PATCH v3 4/4] memcg: remove warning from folio_lruvec
  2025-08-19  0:36 [PATCH v3 0/4] introduce uncharged file mapped folios Boris Burkov
                   ` (2 preceding siblings ...)
  2025-08-19  0:36 ` [PATCH v3 3/4] btrfs: set AS_UNCHARGED on the btree_inode Boris Burkov
@ 2025-08-19  0:36 ` Boris Burkov
  2025-08-19  2:41   ` Matthew Wilcox
  3 siblings, 1 reply; 22+ messages in thread
From: Boris Burkov @ 2025-08-19  0:36 UTC (permalink / raw)
  To: akpm
  Cc: linux-btrfs, linux-mm, linux-fsdevel, kernel-team, shakeel.butt,
	wqu, willy, mhocko, muchun.song, roman.gushchin, hannes

From: Shakeel Butt <shakeel.butt@linux.dev>

Commit a4055888629bc ("mm/memcg: warning on !memcg after readahead page
charged") added the warning in folio_lruvec (older name was
mem_cgroup_page_lruvec) for !memcg when charging of readahead pages were
added to the kernel. Basically lru pages on a memcg enabled system were
always expected to be charged to a memcg.

However a recent functionality to allow metadata of btrfs, which is in
page cache, to be uncharged is added to the kernel. We can either change
the condition to only check anon pages or file pages which does not have
AS_UNCHARGED in their mapping. Instead of such complicated check, let's
just remove the warning as it is not really helpful anymore.

Closes: https://ci.syzbot.org/series/15fd2538-1138-43c0-b4d6-6d7f53b0be69
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9fa3afc90dd5..fae105a9cb46 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -729,10 +729,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
  */
 static inline struct lruvec *folio_lruvec(struct folio *folio)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
-
-	VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled(), folio);
-	return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
+	return mem_cgroup_lruvec(folio_memcg(folio), folio_pgdat(folio));
 }
 
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-- 
2.50.1




* Re: [PATCH v3 4/4] memcg: remove warning from folio_lruvec
  2025-08-19  0:36 ` [PATCH v3 4/4] memcg: remove warning from folio_lruvec Boris Burkov
@ 2025-08-19  2:41   ` Matthew Wilcox
  2025-08-19  5:20     ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2025-08-19  2:41 UTC (permalink / raw)
  To: Boris Burkov
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Mon, Aug 18, 2025 at 05:36:56PM -0700, Boris Burkov wrote:
> Commit a4055888629bc ("mm/memcg: warning on !memcg after readahead page
> charged") added the warning in folio_lruvec (older name was
> mem_cgroup_page_lruvec) for !memcg when charging of readahead pages were
> added to the kernel. Basically lru pages on a memcg enabled system were
> always expected to be charged to a memcg.
> 
> However a recent functionality to allow metadata of btrfs, which is in
> page cache, to be uncharged is added to the kernel. We can either change
> the condition to only check anon pages or file pages which does not have
> AS_UNCHARGED in their mapping. Instead of such complicated check, let's
> just remove the warning as it is not really helpful anymore.

This has to go before patch 3 (and I'd put it before patch 1) in order
to preserve bisectability..  That requires changing the tenses in the
commit message, but that's perfectly acceptable.



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-19  0:36 ` [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED Boris Burkov
@ 2025-08-19  2:46   ` Matthew Wilcox
  2025-08-19  3:57     ` Boris Burkov
  2025-08-20 22:06   ` Klara Modin
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2025-08-19  2:46 UTC (permalink / raw)
  To: Boris Burkov
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Mon, Aug 18, 2025 at 05:36:53PM -0700, Boris Burkov wrote:
> +++ b/mm/filemap.c
> @@ -960,15 +960,19 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
>  {
>  	void *shadow = NULL;
>  	int ret;
> +	bool charge_mem_cgroup = !test_bit(AS_UNCHARGED, &mapping->flags);
>  
> -	ret = mem_cgroup_charge(folio, NULL, gfp);
> -	if (ret)
> -		return ret;
> +	if (charge_mem_cgroup) {
> +		ret = mem_cgroup_charge(folio, NULL, gfp);
> +		if (ret)
> +			return ret;
> +	}
>  
>  	__folio_set_locked(folio);
>  	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
>  	if (unlikely(ret)) {
> -		mem_cgroup_uncharge(folio);
> +		if (charge_mem_cgroup)
> +			mem_cgroup_uncharge(folio);
>  		__folio_clear_locked(folio);

This is unnecessarily complex; mem_cgroup_uncharge() is a no-op if
the folio is not charged.  Sure, it's a wasted function call, but
this is a rare error path, so minimising size is more important than
minimising time.

So you can drop the 'bool' as well:

+	if (!test_bit(AS_UNCHARGED, &mapping->flags)) {



* Re: [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-19  0:36 ` [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages Boris Burkov
@ 2025-08-19  2:50   ` Matthew Wilcox
  2025-08-19  4:05     ` Boris Burkov
  2025-08-19 15:53     ` Shakeel Butt
  0 siblings, 2 replies; 22+ messages in thread
From: Matthew Wilcox @ 2025-08-19  2:50 UTC (permalink / raw)
  To: Boris Burkov
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Mon, Aug 18, 2025 at 05:36:54PM -0700, Boris Burkov wrote:
> Uncharged pages are tricky to track by their essential "uncharged"
> nature. To maintain good accounting, introduce a vmstat counter tracking
> all uncharged pages. Since this is only meaningful when cgroups are
> configured, only expose the counter when CONFIG_MEMCG is set.

I don't understand why this is needed.  Maybe Shakeel had better
reasoning that wasn't captured in the commit message.

If they're unaccounted, then you can get a good estimate of them
just by subtracting the number of accounted pages from the number of
file pages.  Sure there's a small race between the two numbers being
updated, so you might be off by a bit.



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-19  2:46   ` Matthew Wilcox
@ 2025-08-19  3:57     ` Boris Burkov
  0 siblings, 0 replies; 22+ messages in thread
From: Boris Burkov @ 2025-08-19  3:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Tue, Aug 19, 2025 at 03:46:42AM +0100, Matthew Wilcox wrote:
> On Mon, Aug 18, 2025 at 05:36:53PM -0700, Boris Burkov wrote:
> > +++ b/mm/filemap.c
> > @@ -960,15 +960,19 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
> >  {
> >  	void *shadow = NULL;
> >  	int ret;
> > +	bool charge_mem_cgroup = !test_bit(AS_UNCHARGED, &mapping->flags);
> >  
> > -	ret = mem_cgroup_charge(folio, NULL, gfp);
> > -	if (ret)
> > -		return ret;
> > +	if (charge_mem_cgroup) {
> > +		ret = mem_cgroup_charge(folio, NULL, gfp);
> > +		if (ret)
> > +			return ret;
> > +	}
> >  
> >  	__folio_set_locked(folio);
> >  	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
> >  	if (unlikely(ret)) {
> > -		mem_cgroup_uncharge(folio);
> > +		if (charge_mem_cgroup)
> > +			mem_cgroup_uncharge(folio);
> >  		__folio_clear_locked(folio);
> 
> This is unnecessarily complex; mem_cgroup_uncharge() is a no-op if
> the folio is not charged.  Sure, it's a wasted function call, but
> this is a rare error path, so minimising size is more important than
> minimising time.

Makes sense, thanks.

> 
> So you can drop the 'bool' as well:
> 
> +	if (!test_bit(AS_UNCHARGED, &mapping->flags)) {

I do use it once more for the stats, so if we do ultimately decide those
are worthwhile, should I keep the bool?

I can also put the checking in the stats function, in which case we
could drop it after all.



* Re: [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-19  2:50   ` Matthew Wilcox
@ 2025-08-19  4:05     ` Boris Burkov
  2025-08-19 15:53     ` Shakeel Butt
  1 sibling, 0 replies; 22+ messages in thread
From: Boris Burkov @ 2025-08-19  4:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Tue, Aug 19, 2025 at 03:50:49AM +0100, Matthew Wilcox wrote:
> On Mon, Aug 18, 2025 at 05:36:54PM -0700, Boris Burkov wrote:
> > Uncharged pages are tricky to track by their essential "uncharged"
> > nature. To maintain good accounting, introduce a vmstat counter tracking
> > all uncharged pages. Since this is only meaningful when cgroups are
> > configured, only expose the counter when CONFIG_MEMCG is set.
> 
> I don't understand why this is needed.  Maybe Shakeel had better
> reasoning that wasn't captured in the commit message.
> 
> If they're unaccounted, then you can get a good estimate of them
> just by subtracting the number of accounted pages from the number of
> file pages.  Sure there's a small race between the two numbers being
> updated, so you might be off by a bit.

I don't think there is any elaborate reasoning beyond the idea that
being precise and accurate with stats is a good thing.

In my experience, implicit calculations like the one you propose tend to
lead to metrics that drift into incorrectness. Today's correct
calculation is tomorrow's wrong one, when the invariants from which it
was derived shift again. We have seen this in practice before with
kernel memory usage at Meta.

I don't think this costs a lot, and it has an easy to understand
definition. Are you concerned that there is only a single user in btrfs,
so that doesn't merit defining a new stat?

Sorry to put words in your mouth, just trying to guess what might be
objectionable about it.

Thanks for the reviews, by the way.



* Re: [PATCH v3 4/4] memcg: remove warning from folio_lruvec
  2025-08-19  2:41   ` Matthew Wilcox
@ 2025-08-19  5:20     ` Andrew Morton
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2025-08-19  5:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boris Burkov, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Tue, 19 Aug 2025 03:41:30 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Mon, Aug 18, 2025 at 05:36:56PM -0700, Boris Burkov wrote:
> > Commit a4055888629bc ("mm/memcg: warning on !memcg after readahead page
> > charged") added the warning in folio_lruvec (older name was
> > mem_cgroup_page_lruvec) for !memcg when charging of readahead pages were
> > added to the kernel. Basically lru pages on a memcg enabled system were
> > always expected to be charged to a memcg.
> > 
> > However a recent functionality to allow metadata of btrfs, which is in
> > page cache, to be uncharged is added to the kernel. We can either change
> > the condition to only check anon pages or file pages which does not have
> > AS_UNCHARGED in their mapping. Instead of such complicated check, let's
> > just remove the warning as it is not really helpful anymore.
> 
> This has to go before patch 3 (and I'd put it before patch 1) in order
> to preserve bisectability.

Thanks, I'll move it to [2/4]

?
> That requires changing the tenses in the
> commit message, but that's perfectly acceptable.

I'm not spotting this.  a4055888629bc was added in 2020 and the btrfs
change is in a preceding series.  I'll describe that by name, so

: Commit a4055888629bc ("mm/memcg: warning on !memcg after readahead page
: charged") added the warning in folio_lruvec (older name was
: mem_cgroup_page_lruvec) for !memcg when charging of readahead pages were
: added to the kernel.  Basically lru pages on a memcg enabled system were
: always expected to be charged to a memcg.
: 
: However a recent functionality to allow metadata of btrfs, which is in
: page cache, to be uncharged was added to the kernel
: ("btrfs-set-as_uncharged-on-the-btree_inode.patch").  We can either change
: the condition to only check anon pages or file pages which does not have
: AS_UNCHARGED in their mapping.  Instead of such complicated check, let's
: just remove the warning as it is not really helpful anymore.




* Re: [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-19  2:50   ` Matthew Wilcox
  2025-08-19  4:05     ` Boris Burkov
@ 2025-08-19 15:53     ` Shakeel Butt
  2025-08-19 23:46       ` Matthew Wilcox
  1 sibling, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-08-19 15:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boris Burkov, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Tue, Aug 19, 2025 at 03:50:49AM +0100, Matthew Wilcox wrote:
> On Mon, Aug 18, 2025 at 05:36:54PM -0700, Boris Burkov wrote:
> > Uncharged pages are tricky to track by their essential "uncharged"
> > nature. To maintain good accounting, introduce a vmstat counter tracking
> > all uncharged pages. Since this is only meaningful when cgroups are
> > configured, only expose the counter when CONFIG_MEMCG is set.
> 
> I don't understand why this is needed.  Maybe Shakeel had better
> reasoning that wasn't captured in the commit message.
> 
> If they're unaccounted, then you can get a good estimate of them
> just by subtracting the number of accounted pages from the number of
> file pages.  Sure there's a small race between the two numbers being
> updated, so you might be off by a bit.

My initial thinking was based on Qu's original proposal which was using
root memcg where there will not be any difference between accounted
file pages and system wide file pages. However with Boris's change, we
can actually get the estimate, as you pointed out, by subtracting the
number of accounted file pages from system wide number of file pages.

However I still think we should keep this new metric because of
performance reason. To get accounted file pages, we need to read
memory.stat of the root memcg which can be very expensive. Basically it
may have to flush the rstat update trees on all the CPUs on the system.
Since this new metric will be used to calculate system overhead, the
high cost will limit how frequently a user can query the latest stat.

I do know there are use-cases where users want to query the system
overhead at high frequency. One such use-case is keeping a safe buffer
on a memory overcommitted system (Google does this and measure system
overhead every second).



* Re: [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-19 15:53     ` Shakeel Butt
@ 2025-08-19 23:46       ` Matthew Wilcox
  2025-08-20  1:25         ` Shakeel Butt
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2025-08-19 23:46 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Boris Burkov, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Tue, Aug 19, 2025 at 08:53:59AM -0700, Shakeel Butt wrote:
> My initial thinking was based on Qu's original proposal which was using
> root memcg where there will not be any difference between accounted
> file pages and system wide file pages. However with Boris's change, we
> can actually get the estimate, as you pointed out, by subtracting the
> number of accounted file pages from system wide number of file pages.
> 
> However I still think we should keep this new metric because of
> performance reason. To get accounted file pages, we need to read
> memory.stat of the root memcg which can be very expensive. Basically it
> may have to flush the rstat update trees on all the CPUs on the system.
> Since this new metric will be used to calculate system overhead, the
> high cost will limit how frequently a user can query the latest stat.

OK, but couldn't we make that argument for anything else?  Like slab,
say.  Why's "file" memory different?



* Re: [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-19 23:46       ` Matthew Wilcox
@ 2025-08-20  1:25         ` Shakeel Butt
  2025-08-20 13:19           ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-08-20  1:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boris Burkov, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Wed, Aug 20, 2025 at 12:46:43AM +0100, Matthew Wilcox wrote:
> On Tue, Aug 19, 2025 at 08:53:59AM -0700, Shakeel Butt wrote:
> > My initial thinking was based on Qu's original proposal which was using
> > root memcg where there will not be any difference between accounted
> > file pages and system wide file pages. However with Boris's change, we
> > can actually get the estimate, as you pointed out, by subtracting the
> > number of accounted file pages from system wide number of file pages.
> > 
> > However I still think we should keep this new metric because of
> > performance reason. To get accounted file pages, we need to read
> > memory.stat of the root memcg which can be very expensive. Basically it
> > may have to flush the rstat update trees on all the CPUs on the system.
> > Since this new metric will be used to calculate system overhead, the
> > high cost will limit how frequently a user can query the latest stat.
> 
> OK, but couldn't we make that argument for anything else?  Like slab,
> say.  Why's "file" memory different?

Good point, and I think it does apply to other memory types too. I would
call "file" memory more important as it is one of the largest consumers
of DRAM on, at least, Meta's infra. Slab needs a bit more thought.
At the system level (i.e. /proc/meminfo), we account at the page (or
slab) level while for memcg, we account per-object (plus obj_cgroup
pointer).



* Re: [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-20  1:25         ` Shakeel Butt
@ 2025-08-20 13:19           ` Matthew Wilcox
  2025-08-20 16:21             ` Shakeel Butt
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2025-08-20 13:19 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Boris Burkov, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Tue, Aug 19, 2025 at 06:25:36PM -0700, Shakeel Butt wrote:
> On Wed, Aug 20, 2025 at 12:46:43AM +0100, Matthew Wilcox wrote:
> > OK, but couldn't we make that argument for anything else?  Like slab,
> > say.  Why's "file" memory different?
> 
> Good point and I think it does apply to other memory types too. I would
> call "file" memory to be more important as it is one of the largest
> consumer of DRAM on, at least, Meta infra. Slab needs a bit more thought.
> At the system level (i.e. /proc/meminfo), we account at the page (or
> slab) level while for memcg, we account per-object (plus obj_cgroup
> pointer).

That was supposed to be a reductio ad absurdum, not an invitation to
add more counters.

Look, if this is information you really need, I think you should come
up with a better way of collecting it than by adding new counters and
new complexity to everything involved in GFP_ACCOUNT activities.

The unaccounted address_spaces are a very tiny percentage of file
memory, at least as far as this patch set goes.  I don't think this
patch is justifiable on its face.



* Re: [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages
  2025-08-20 13:19           ` Matthew Wilcox
@ 2025-08-20 16:21             ` Shakeel Butt
  0 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-08-20 16:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Boris Burkov, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, mhocko, muchun.song, roman.gushchin, hannes

On Wed, Aug 20, 2025 at 02:19:24PM +0100, Matthew Wilcox wrote:
> On Tue, Aug 19, 2025 at 06:25:36PM -0700, Shakeel Butt wrote:
> > On Wed, Aug 20, 2025 at 12:46:43AM +0100, Matthew Wilcox wrote:
> > > OK, but couldn't we make that argument for anything else?  Like slab,
> > > say.  Why's "file" memory different?
> > 
> > Good point and I think it does apply to other memory types too. I would
> > call "file" memory to be more important as it is one of the largest
> > consumer of DRAM on, at least, Meta infra. Slab needs a bit more thought.
> > At the system level (i.e. /proc/meminfo), we account at the page (or
> > slab) level while for memcg, we account per-object (plus obj_cgroup
> > pointer).
> 
> That was supposed to be a reductio ad absurdum, not an invitation to
> add more counters.
> 
> Look, if this is information you really need, I think you should come
> up with a better way of collecting it than by adding new counters and
> new complexity to everything involved in GFP_ACCOUNT activities.
> 

Please elaborate on this complexity. To me, particularly for this
specific case, a dedicated counter seems cleaner than the error-prone
and costly alternatives; I am not seeing the complexity argument.

> The unaccounted address_spaces are a very tiny percentage of file
> memory, at least as far as this patch set goes.

From [1], Qu noted "On a real world system, the metadata itself can
easily go hundreds of GiBs...".

This does not seem tiny.

> I don't think this
> patch is justifiable on its face.

I think I have provided enough justification. However, I don't want to
force this through until I fully understand your concerns. This will
become part of the API, and I don't want a situation where we regret it later.

[1] https://lore.kernel.org/linux-mm/08ccb40d-6261-4757-957d-537d295d2cf5@suse.com/



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-19  0:36 ` [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED Boris Burkov
  2025-08-19  2:46   ` Matthew Wilcox
@ 2025-08-20 22:06   ` Klara Modin
  2025-08-20 22:22     ` Shakeel Butt
  2025-08-20 22:52     ` Boris Burkov
  1 sibling, 2 replies; 22+ messages in thread
From: Klara Modin @ 2025-08-20 22:06 UTC (permalink / raw)
  To: Boris Burkov
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, willy, mhocko, muchun.song, roman.gushchin,
	hannes

Hi,

On 2025-08-18 17:36:53 -0700, Boris Burkov wrote:
> Btrfs currently tracks its metadata pages in the page cache, using a
> fake inode (fs_info->btree_inode) with offsets corresponding to where
> the metadata is stored in the filesystem's full logical address space.
> 
> A consequence of this is that when btrfs uses filemap_add_folio(), this
> usage is charged to the cgroup of whichever task happens to be running
> at the time. These folios don't belong to any particular user cgroup, so
> I don't think it makes much sense for them to be charged in that way.
> Some negative consequences as a result:
> - A task can be holding some important btrfs locks, then need to lookup
>   some metadata and go into reclaim, extending the duration it holds
>   that lock for, and unfairly pushing its own reclaim pain onto other
>   cgroups.
> - If that cgroup goes into reclaim, it might reclaim these folios a
>   different non-reclaiming cgroup might need soon. This is naturally
>   offset by LRU reclaim, but still.
> 
> A very similar proposal to use the root cgroup was previously made by
> Qu, where he eventually proposed the idea of setting it per
> address_space. This makes good sense for the btrfs use case, as the
> uncharged behavior should apply to all use of the address_space, not
> select allocations. I.e., if someone adds another filemap_add_folio()
> call using btrfs's btree_inode, we would almost certainly want the
> uncharged behavior.
> 
> Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com/
> Suggested-by: Qu Wenruo <wqu@suse.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: syzbot@syzkaller.appspotmail.com
> Signed-off-by: Boris Burkov <boris@bur.io>

I bisected the following null-dereference to 3f31e0d9912d ("btrfs: set
AS_UNCHARGED on the btree_inode") in mm-new, but I believe it's a result of
this patch:

 Oops [#1]
 CPU: 4 UID: 0 PID: 87 Comm: kswapd0 Not tainted 6.17.0-rc2-next-20250820-00349-gd6ecef4f9566 #511 PREEMPTLAZY
 Hardware name: Banana Pi BPI-F3 (DT)
 epc : workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) 
 ra : __remove_mapping (mm/vmscan.c:805) 
 epc : ffffffff802e6de8 ra : ffffffff802b4114 sp : ffffffc6006c3670
  gp : ffffffff8227dad8 tp : ffffffd701a2cb00 t0 : ffffffff80027d00
  t1 : 0000000000000000 t2 : 0000000000000001 s0 : ffffffc6006c3680
  s1 : ffffffc50415a540 a0 : 0000000000000001 a1 : ffffffd700b70048
  a2 : 0000000000000000 a3 : 0000000000000000 a4 : 00000000000003f0
  a5 : ffffffd700b70430 a6 : 0000000000000000 a7 : ffffffd77ffd1dc0
  s2 : ffffffd705a483d8 s3 : ffffffd705a483e0 s4 : 0000000000000001
  s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000001
  s8 : ffffffd705a483d8 s9 : ffffffc6006c3760 s10: ffffffc50415a548
  s11: ffffffff81e000e0 t3 : 0000000000000000 t4 : 0000000000000001
  t5 : 0000000000000003 t6 : 0000000000000003
 status: 0000000200000100 badaddr: 00000000000000d0 cause: 000000000000000d
 workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) 
 __remove_mapping (mm/vmscan.c:805) 
 shrink_folio_list (mm/vmscan.c:1545 (discriminator 2)) 
 evict_folios (mm/vmscan.c:4738) 
 try_to_shrink_lruvec (mm/vmscan.c:4901) 
 shrink_one (mm/vmscan.c:4947) 
 shrink_node (include/asm-generic/preempt.h:54 (discriminator 1) include/linux/rcupdate.h:93 (discriminator 1) include/linux/rcupdate.h:839 (discriminator 1) mm/vmscan.c:5010 (discriminator 1) mm/vmscan.c:5086 (discriminator 1) mm/vmscan.c:6073 (discriminator 1)) 
 balance_pgdat (mm/vmscan.c:6942 mm/vmscan.c:7116) 
 kswapd (mm/vmscan.c:7381) 
 kthread (kernel/kthread.c:463) 
 ret_from_fork_kernel (include/linux/entry-common.h:155 (discriminator 4) include/linux/entry-common.h:210 (discriminator 4) arch/riscv/kernel/process.c:216 (discriminator 4)) 
 ret_from_fork_kernel_asm (arch/riscv/kernel/entry.S:328) 
 Code: 0987 060a 6633 01c6 97ba b02f 01d7 0001 0013 0000 (5503) 0d08
 All code
 ========
    0:	060a0987          	.insn	4, 0x060a0987
    4:	01c66633          	or	a2,a2,t3
    8:	97ba                	.insn	2, 0x97ba
    a:	01d7b02f          	amoadd.d	zero,t4,(a5)
    e:	0001                	.insn	2, 0x0001
   10:	00000013          	addi	zero,zero,0
   14:*	0d085503          	lhu	a0,208(a6)		<-- trapping instruction
 
 Code starting with the faulting instruction
 ===========================================
    0:	0d085503          	lhu	a0,208(a6)
 ---[ end trace 0000000000000000 ]---
 note: kswapd0[87] exited with irqs disabled
 note: kswapd0[87] exited with preempt_count 2

> ---
>  include/linux/pagemap.h |  1 +
>  mm/filemap.c            | 12 ++++++++----
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index c9ba69e02e3e..06dc3fae8124 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -211,6 +211,7 @@ enum mapping_flags {
>  				   folio contents */
>  	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
>  	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
> +	AS_UNCHARGED = 10,	/* Do not charge usage to a cgroup */
>  	/* Bits 16-25 are used for FOLIO_ORDER */
>  	AS_FOLIO_ORDER_BITS = 5,
>  	AS_FOLIO_ORDER_MIN = 16,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e4a5a46db89b..5004a2cfa0cc 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -960,15 +960,19 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
>  {
>  	void *shadow = NULL;
>  	int ret;
> +	bool charge_mem_cgroup = !test_bit(AS_UNCHARGED, &mapping->flags);
>  
> -	ret = mem_cgroup_charge(folio, NULL, gfp);
> -	if (ret)
> -		return ret;
> +	if (charge_mem_cgroup) {
> +		ret = mem_cgroup_charge(folio, NULL, gfp);
> +		if (ret)
> +			return ret;
> +	}
>  
>  	__folio_set_locked(folio);
>  	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
>  	if (unlikely(ret)) {
> -		mem_cgroup_uncharge(folio);
> +		if (charge_mem_cgroup)
> +			mem_cgroup_uncharge(folio);
>  		__folio_clear_locked(folio);
>  	} else {
>  		/*
> -- 
> 2.50.1
> 

This means that not all folios will have a memcg attached even when
memcg is enabled. In lru_gen_eviction(), mem_cgroup_id() is called
without a NULL check, which then leads to the null-dereference.

The following diff resolves the issue for me:

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fae105a9cb46..c70e789201fc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -809,7 +809,7 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_disabled() || !memcg)
 		return 0;
 
 	return memcg->id.id;

However, it's mentioned in folio_memcg() that it can return NULL, so this
might be an existing bug that this patch just makes more obvious.

There's also workingset_eviction(), which instead gets the memcg from the
lruvec. Doing that in lru_gen_eviction() also resolves the issue for me:

diff --git a/mm/workingset.c b/mm/workingset.c
index 68a76a91111f..e805eadf0ec7 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -243,6 +243,7 @@ static void *lru_gen_eviction(struct folio *folio)
 	int tier = lru_tier_from_refs(refs, workingset);
 	struct mem_cgroup *memcg = folio_memcg(folio);
 	struct pglist_data *pgdat = folio_pgdat(folio);
+	int memcgid;
 
 	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
 
@@ -254,7 +255,9 @@ static void *lru_gen_eviction(struct folio *folio)
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
 
-	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
+
+	return pack_shadow(memcgid, pgdat, token, workingset);
 }
 
 /*

I don't really know what I'm doing here, though.

Regards,
Klara Modin



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-20 22:06   ` Klara Modin
@ 2025-08-20 22:22     ` Shakeel Butt
  2025-08-20 22:52     ` Boris Burkov
  1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-08-20 22:22 UTC (permalink / raw)
  To: Klara Modin
  Cc: Boris Burkov, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, willy, mhocko, muchun.song, roman.gushchin,
	hannes

On Thu, Aug 21, 2025 at 12:06:42AM +0200, Klara Modin wrote:
[...]
> 
> The following diff resolves the issue for me:
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fae105a9cb46..c70e789201fc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -809,7 +809,7 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  
>  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
>  {
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_disabled() || !memcg)
>  		return 0;
>  
>  	return memcg->id.id;
> 
> However, it's mentioned in folio_memcg() that it can return NULL so this
> might be an existing bug which this patch just makes more obvious.
> 
> There's also workingset_eviction() which instead gets the memcg from
> lruvec. Doing that in lru_gen_eviction() also resolves the issue for me:
> 
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 68a76a91111f..e805eadf0ec7 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -243,6 +243,7 @@ static void *lru_gen_eviction(struct folio *folio)
>  	int tier = lru_tier_from_refs(refs, workingset);
>  	struct mem_cgroup *memcg = folio_memcg(folio);
>  	struct pglist_data *pgdat = folio_pgdat(folio);
> +	int memcgid;
>  
>  	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
>  
> @@ -254,7 +255,9 @@ static void *lru_gen_eviction(struct folio *folio)
>  	hist = lru_hist_from_seq(min_seq);
>  	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
>  
> -	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
> +	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> +
> +	return pack_shadow(memcgid, pgdat, token, workingset);
>  }
>  
>  /*
> 
> I don't really know what I'm doing here, though.

Thanks a lot for the report. I think we will prefer your second
approach, which makes lru_gen_eviction() similar to
workingset_eviction(). Can you please send a formal patch so that Boris
can add it to his series?



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-20 22:06   ` Klara Modin
  2025-08-20 22:22     ` Shakeel Butt
@ 2025-08-20 22:52     ` Boris Burkov
  2025-08-20 23:15       ` Klara Modin
  2025-08-20 23:53       ` Shakeel Butt
  1 sibling, 2 replies; 22+ messages in thread
From: Boris Burkov @ 2025-08-20 22:52 UTC (permalink / raw)
  To: Klara Modin
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, willy, mhocko, muchun.song, roman.gushchin,
	hannes

On Thu, Aug 21, 2025 at 12:06:42AM +0200, Klara Modin wrote:
> Hi,
> 
> On 2025-08-18 17:36:53 -0700, Boris Burkov wrote:
> > Btrfs currently tracks its metadata pages in the page cache, using a
> > fake inode (fs_info->btree_inode) with offsets corresponding to where
> > the metadata is stored in the filesystem's full logical address space.
> > 
> > A consequence of this is that when btrfs uses filemap_add_folio(), this
> > usage is charged to the cgroup of whichever task happens to be running
> > at the time. These folios don't belong to any particular user cgroup, so
> > I don't think it makes much sense for them to be charged in that way.
> > Some negative consequences as a result:
> > - A task can be holding some important btrfs locks, then need to lookup
> >   some metadata and go into reclaim, extending the duration it holds
> >   that lock for, and unfairly pushing its own reclaim pain onto other
> >   cgroups.
> > - If that cgroup goes into reclaim, it might reclaim these folios a
> >   different non-reclaiming cgroup might need soon. This is naturally
> >   offset by LRU reclaim, but still.
> > 
> > A very similar proposal to use the root cgroup was previously made by
> > Qu, where he eventually proposed the idea of setting it per
> > address_space. This makes good sense for the btrfs use case, as the
> > uncharged behavior should apply to all use of the address_space, not
> > select allocations. I.e., if someone adds another filemap_add_folio()
> > call using btrfs's btree_inode, we would almost certainly want the
> > uncharged behavior.
> > 
> > Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com/
> > Suggested-by: Qu Wenruo <wqu@suse.com>
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Tested-by: syzbot@syzkaller.appspotmail.com
> > Signed-off-by: Boris Burkov <boris@bur.io>
> 
> I bisected the following null-dereference to 3f31e0d9912d ("btrfs: set
> AS_UNCHARGED on the btree_inode") in mm-new but I believe it's a result of
> this patch:
> 
>  Oops [#1]
>  CPU: 4 UID: 0 PID: 87 Comm: kswapd0 Not tainted 6.17.0-rc2-next-20250820-00349-gd6ecef4f9566 #511 PREEMPTLAZY
>  Hardware name: Banana Pi BPI-F3 (DT)
>  epc : workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) 
>  ra : __remove_mapping (mm/vmscan.c:805) 
>  epc : ffffffff802e6de8 ra : ffffffff802b4114 sp : ffffffc6006c3670
>   gp : ffffffff8227dad8 tp : ffffffd701a2cb00 t0 : ffffffff80027d00
>   t1 : 0000000000000000 t2 : 0000000000000001 s0 : ffffffc6006c3680
>   s1 : ffffffc50415a540 a0 : 0000000000000001 a1 : ffffffd700b70048
>   a2 : 0000000000000000 a3 : 0000000000000000 a4 : 00000000000003f0
>   a5 : ffffffd700b70430 a6 : 0000000000000000 a7 : ffffffd77ffd1dc0
>   s2 : ffffffd705a483d8 s3 : ffffffd705a483e0 s4 : 0000000000000001
>   s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000001
>   s8 : ffffffd705a483d8 s9 : ffffffc6006c3760 s10: ffffffc50415a548
>   s11: ffffffff81e000e0 t3 : 0000000000000000 t4 : 0000000000000001
>   t5 : 0000000000000003 t6 : 0000000000000003
>  status: 0000000200000100 badaddr: 00000000000000d0 cause: 000000000000000d
>  workingset_eviction (include/linux/memcontrol.h:815 mm/workingset.c:257 mm/workingset.c:394) 
>  __remove_mapping (mm/vmscan.c:805) 
>  shrink_folio_list (mm/vmscan.c:1545 (discriminator 2)) 
>  evict_folios (mm/vmscan.c:4738) 
>  try_to_shrink_lruvec (mm/vmscan.c:4901) 
>  shrink_one (mm/vmscan.c:4947) 
>  shrink_node (include/asm-generic/preempt.h:54 (discriminator 1) include/linux/rcupdate.h:93 (discriminator 1) include/linux/rcupdate.h:839 (discriminator 1) mm/vmscan.c:5010 (discriminator 1) mm/vmscan.c:5086 (discriminator 1) mm/vmscan.c:6073 (discriminator 1)) 
>  balance_pgdat (mm/vmscan.c:6942 mm/vmscan.c:7116) 
>  kswapd (mm/vmscan.c:7381) 
>  kthread (kernel/kthread.c:463) 
>  ret_from_fork_kernel (include/linux/entry-common.h:155 (discriminator 4) include/linux/entry-common.h:210 (discriminator 4) arch/riscv/kernel/process.c:216 (discriminator 4)) 
>  ret_from_fork_kernel_asm (arch/riscv/kernel/entry.S:328) 
>  Code: 0987 060a 6633 01c6 97ba b02f 01d7 0001 0013 0000 (5503) 0d08
>  All code
>  ========
>     0:	060a0987          	.insn	4, 0x060a0987
>     4:	01c66633          	or	a2,a2,t3
>     8:	97ba                	.insn	2, 0x97ba
>     a:	01d7b02f          	amoadd.d	zero,t4,(a5)
>     e:	0001                	.insn	2, 0x0001
>    10:	00000013          	addi	zero,zero,0
>    14:*	0d085503          	lhu	a0,208(a6)		<-- trapping instruction
>  
>  Code starting with the faulting instruction
>  ===========================================
>     0:	0d085503          	lhu	a0,208(a6)
>  ---[ end trace 0000000000000000 ]---
>  note: kswapd0[87] exited with irqs disabled
>  note: kswapd0[87] exited with preempt_count 2
> 
> > ---
> >  include/linux/pagemap.h |  1 +
> >  mm/filemap.c            | 12 ++++++++----
> >  2 files changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index c9ba69e02e3e..06dc3fae8124 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -211,6 +211,7 @@ enum mapping_flags {
> >  				   folio contents */
> >  	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
> >  	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
> > +	AS_UNCHARGED = 10,	/* Do not charge usage to a cgroup */
> >  	/* Bits 16-25 are used for FOLIO_ORDER */
> >  	AS_FOLIO_ORDER_BITS = 5,
> >  	AS_FOLIO_ORDER_MIN = 16,
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index e4a5a46db89b..5004a2cfa0cc 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -960,15 +960,19 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
> >  {
> >  	void *shadow = NULL;
> >  	int ret;
> > +	bool charge_mem_cgroup = !test_bit(AS_UNCHARGED, &mapping->flags);
> >  
> > -	ret = mem_cgroup_charge(folio, NULL, gfp);
> > -	if (ret)
> > -		return ret;
> > +	if (charge_mem_cgroup) {
> > +		ret = mem_cgroup_charge(folio, NULL, gfp);
> > +		if (ret)
> > +			return ret;
> > +	}
> >  
> >  	__folio_set_locked(folio);
> >  	ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
> >  	if (unlikely(ret)) {
> > -		mem_cgroup_uncharge(folio);
> > +		if (charge_mem_cgroup)
> > +			mem_cgroup_uncharge(folio);
> >  		__folio_clear_locked(folio);
> >  	} else {
> >  		/*
> > -- 
> > 2.50.1
> > 
> 
> This means that not all folios will have a memcg attached also when
> memcg is enabled. In lru_gen_eviction() mem_cgroup_id() is called
> without a NULL check which then leads to the null-dereference.
> 
> The following diff resolves the issue for me:
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index fae105a9cb46..c70e789201fc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -809,7 +809,7 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  
>  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
>  {
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_disabled() || !memcg)
>  		return 0;
>  
>  	return memcg->id.id;
> 
> However, it's mentioned in folio_memcg() that it can return NULL so this
> might be an existing bug which this patch just makes more obvious.
> 
> There's also workingset_eviction() which instead gets the memcg from
> lruvec. Doing that in lru_gen_eviction() also resolves the issue for me:
> 
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 68a76a91111f..e805eadf0ec7 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -243,6 +243,7 @@ static void *lru_gen_eviction(struct folio *folio)
>  	int tier = lru_tier_from_refs(refs, workingset);
>  	struct mem_cgroup *memcg = folio_memcg(folio);
>  	struct pglist_data *pgdat = folio_pgdat(folio);
> +	int memcgid;
>  
>  	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
>  
> @@ -254,7 +255,9 @@ static void *lru_gen_eviction(struct folio *folio)
>  	hist = lru_hist_from_seq(min_seq);
>  	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
>  
> -	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
> +	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> +
> +	return pack_shadow(memcgid, pgdat, token, workingset);
>  }
>  
>  /*
> 
> I don't really know what I'm doing here, though.

Me neither, clearly :)

Thanks so much for the report and fix! I fear there might be other
paths that try to get the memcg from the lruvec or the folio without
checking it. In this exact case, I would want to go to the first sign
of trouble and fix it at lruvec_memcg(). But then who knows what else
we've missed.

May I ask what you were running to trigger this? My fstests run (clearly
not exercising enough interesting memory paths) did not hit it.

This does make me wonder if the superior approach to the original patch
isn't just to go back to the very first thing Qu did and account these
to the root cgroup rather than do the whole uncharged thing.

Boris

> 
> Regards,
> Klara Modin



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-20 22:52     ` Boris Burkov
@ 2025-08-20 23:15       ` Klara Modin
  2025-08-20 23:53       ` Shakeel Butt
  1 sibling, 0 replies; 22+ messages in thread
From: Klara Modin @ 2025-08-20 23:15 UTC (permalink / raw)
  To: Boris Burkov
  Cc: akpm, linux-btrfs, linux-mm, linux-fsdevel, kernel-team,
	shakeel.butt, wqu, willy, mhocko, muchun.song, roman.gushchin,
	hannes

Hi,

On 2025-08-20 15:52:22 -0700, Boris Burkov wrote:
> On Thu, Aug 21, 2025 at 12:06:42AM +0200, Klara Modin wrote:
> > Hi,
> > 
> > On 2025-08-18 17:36:53 -0700, Boris Burkov wrote:
> > > Btrfs currently tracks its metadata pages in the page cache, using a
> > > fake inode (fs_info->btree_inode) with offsets corresponding to where
> > > the metadata is stored in the filesystem's full logical address space.
> > > 
> > > A consequence of this is that when btrfs uses filemap_add_folio(), this
> > > usage is charged to the cgroup of whichever task happens to be running
> > > at the time. These folios don't belong to any particular user cgroup, so
> > > I don't think it makes much sense for them to be charged in that way.
> > > Some negative consequences as a result:
> > > - A task can be holding some important btrfs locks, then need to lookup
> > >   some metadata and go into reclaim, extending the duration it holds
> > >   that lock for, and unfairly pushing its own reclaim pain onto other
> > >   cgroups.
> > > - If that cgroup goes into reclaim, it might reclaim these folios a
> > >   different non-reclaiming cgroup might need soon. This is naturally
> > >   offset by LRU reclaim, but still.
> > > 
> > > A very similar proposal to use the root cgroup was previously made by
> > > Qu, where he eventually proposed the idea of setting it per
> > > address_space. This makes good sense for the btrfs use case, as the
> > > uncharged behavior should apply to all use of the address_space, not
> > > select allocations. I.e., if someone adds another filemap_add_folio()
> > > call using btrfs's btree_inode, we would almost certainly want the
> > > uncharged behavior.
> > > 
> > > Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com/
> > > Suggested-by: Qu Wenruo <wqu@suse.com>
> > > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > Tested-by: syzbot@syzkaller.appspotmail.com
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > 
> > I bisected the following null-dereference to 3f31e0d9912d ("btrfs: set
> > AS_UNCHARGED on the btree_inode") in mm-new but I believe it's a result of
> > this patch:
> > 

...

> > 
> > This means that not all folios will have a memcg attached also when
> > memcg is enabled. In lru_gen_eviction() mem_cgroup_id() is called
> > without a NULL check which then leads to the null-dereference.
> > 
> > The following diff resolves the issue for me:
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index fae105a9cb46..c70e789201fc 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -809,7 +809,7 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
> >  
> >  static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
> >  {
> > -	if (mem_cgroup_disabled())
> > +	if (mem_cgroup_disabled() || !memcg)
> >  		return 0;
> >  
> >  	return memcg->id.id;
> > 
> > However, it's mentioned in folio_memcg() that it can return NULL so this
> > might be an existing bug which this patch just makes more obvious.
> > 
> > There's also workingset_eviction() which instead gets the memcg from
> > lruvec. Doing that in lru_gen_eviction() also resolves the issue for me:
> > 
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 68a76a91111f..e805eadf0ec7 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -243,6 +243,7 @@ static void *lru_gen_eviction(struct folio *folio)
> >  	int tier = lru_tier_from_refs(refs, workingset);
> >  	struct mem_cgroup *memcg = folio_memcg(folio);
> >  	struct pglist_data *pgdat = folio_pgdat(folio);
> > +	int memcgid;
> >  
> >  	BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
> >  
> > @@ -254,7 +255,9 @@ static void *lru_gen_eviction(struct folio *folio)
> >  	hist = lru_hist_from_seq(min_seq);
> >  	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
> >  
> > -	return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
> > +	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
> > +
> > +	return pack_shadow(memcgid, pgdat, token, workingset);
> >  }
> >  
> >  /*
> > 
> > I don't really know what I'm doing here, though.
> 
> Me neither, clearly :)
> 
> Thanks so much for the report and fix! I fear there might be some other
> paths that try to get memcg from lruvec or folio or whatever without
> checking it. I feel like in this exact case, I would want to go to the
> first sign of trouble and fix it at lruvec_memcg(). But then who knows
> what else we've missed.
> 
> May I ask what you were running to trigger this? My fstests run (clearly
> not exercising enough interesting memory paths) did not hit it.
> 
> This does make me wonder if the superior approach to the original patch
> isn't just to go back to the very first thing Qu did and account these
> to the root cgroup rather than do the whole uncharged thing.
> 
> Boris
> 
> > 
> > Regards,
> > Klara Modin

For me it's easiest to trigger when cloning a large repository, e.g. the
kernel or gcc, with a low-ish amount of RAM (maybe 1-4 GiB), i.e. under
memory pressure. Also:

CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y


Shakeel:

I think I'll wait a little before submitting a patch to see if there are
any more comments.

Regards,
Klara Modin



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-20 22:52     ` Boris Burkov
  2025-08-20 23:15       ` Klara Modin
@ 2025-08-20 23:53       ` Shakeel Butt
  2025-08-21 19:37         ` Shakeel Butt
  1 sibling, 1 reply; 22+ messages in thread
From: Shakeel Butt @ 2025-08-20 23:53 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Klara Modin, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, willy, mhocko, muchun.song, roman.gushchin,
	hannes

On Wed, Aug 20, 2025 at 03:52:22PM -0700, Boris Burkov wrote:
[...]
> 
> Thanks so much for the report and fix! I fear there might be some other
> paths that try to get memcg from lruvec or folio or whatever without
> checking it. I feel like in this exact case, I would want to go to the
> first sign of trouble and fix it at lruvec_memcg(). But then who knows
> what else we've missed.

lruvec_memcg() is not an issue, but folio_memcg() can be. I found the
following instances of folio_memcg() which are problematic (on
next-20250819):

mm/memcontrol.c:3246:   css_get_many(&__folio_memcg(folio)->css, new_refs);

include/trace/events/writeback.h:269:           __entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);

mm/workingset.c:244:    struct mem_cgroup *memcg = folio_memcg(folio);

mm/huge_memory.c:4020:  WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
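
Call sites like these can be surfaced mechanically with a tree-wide
search for folio_memcg() results that are dereferenced on the spot. A
rough sketch of the audit — the sample file stands in for a kernel tree,
and the pattern is an illustrative assumption (it will not catch a
dereference that happens on a later line):

```shell
# Self-contained demo; against a real tree you would grep mm/ include/.
tmp=$(mktemp -d)
cat > "$tmp/sample.c" <<'EOF'
css_get_many(&folio_memcg(folio)->css, refs);   /* immediate deref */
struct mem_cgroup *memcg = folio_memcg(folio);  /* deref happens later */
EOF
# Flag calls whose result is dereferenced immediately, with no NULL check.
grep -n 'folio_memcg([a-z_]*)->' "$tmp/sample.c"
rm -rf "$tmp"
```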

> 
> May I ask what you were running to trigger this? My fstests run (clearly
> not exercising enough interesting memory paths) did not hit it.
> 
> This does make me wonder if the superior approach to the original patch
> isn't just to go back to the very first thing Qu did and account these
> to the root cgroup rather than do the whole uncharged thing.

I don't have any strong preference one way or the other.



* Re: [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED
  2025-08-20 23:53       ` Shakeel Butt
@ 2025-08-21 19:37         ` Shakeel Butt
  0 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2025-08-21 19:37 UTC (permalink / raw)
  To: Boris Burkov
  Cc: Klara Modin, akpm, linux-btrfs, linux-mm, linux-fsdevel,
	kernel-team, wqu, willy, mhocko, muchun.song, roman.gushchin,
	hannes

On Wed, Aug 20, 2025 at 04:53:08PM -0700, Shakeel Butt wrote:
> On Wed, Aug 20, 2025 at 03:52:22PM -0700, Boris Burkov wrote:
> [...]
> > 
> > Thanks so much for the report and fix! I fear there might be some other
> > paths that try to get memcg from lruvec or folio or whatever without
> > checking it. I feel like in this exact case, I would want to go to the
> > first sign of trouble and fix it at lruvec_memcg(). But then who knows
> > what else we've missed.
> 
> lruvec_memcg() is not an issue but folio_memcg() can be. I found
> following instances of folio_memcg() which are problematic (on
> next-20250819):
> 
> mm/memcontrol.c:3246:   css_get_many(&__folio_memcg(folio)->css, new_refs);
> 
> include/trace/events/writeback.h:269:           __entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);
> 
> mm/workingset.c:244:    struct mem_cgroup *memcg = folio_memcg(folio);
> 
> mm/huge_memory.c:4020:  WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
> 
> > 
> > May I ask what you were running to trigger this? My fstests run (clearly
> > not exercising enough interesting memory paths) did not hit it.
> > 
> > This does make me wonder if the superior approach to the original patch
> > isn't just to go back to the very first thing Qu did and account these
> > to the root cgroup rather than do the whole uncharged thing.
> 
> I don't have any strong preference one way or the other.

After thinking a bit more, I think Qu's root memcg approach should be
preferred. Using that, we avoid this unnecessary code churn for NULL
memcg checks, and I am pretty sure I have missed some places beyond
those I listed above.



end of thread, other threads:[~2025-08-21 19:37 UTC | newest]

Thread overview: 22+ messages
2025-08-19  0:36 [PATCH v3 0/4] introduce uncharged file mapped folios Boris Burkov
2025-08-19  0:36 ` [PATCH v3 1/4] mm/filemap: add AS_UNCHARGED Boris Burkov
2025-08-19  2:46   ` Matthew Wilcox
2025-08-19  3:57     ` Boris Burkov
2025-08-20 22:06   ` Klara Modin
2025-08-20 22:22     ` Shakeel Butt
2025-08-20 22:52     ` Boris Burkov
2025-08-20 23:15       ` Klara Modin
2025-08-20 23:53       ` Shakeel Butt
2025-08-21 19:37         ` Shakeel Butt
2025-08-19  0:36 ` [PATCH v3 2/4] mm: add vmstat for cgroup uncharged pages Boris Burkov
2025-08-19  2:50   ` Matthew Wilcox
2025-08-19  4:05     ` Boris Burkov
2025-08-19 15:53     ` Shakeel Butt
2025-08-19 23:46       ` Matthew Wilcox
2025-08-20  1:25         ` Shakeel Butt
2025-08-20 13:19           ` Matthew Wilcox
2025-08-20 16:21             ` Shakeel Butt
2025-08-19  0:36 ` [PATCH v3 3/4] btrfs: set AS_UNCHARGED on the btree_inode Boris Burkov
2025-08-19  0:36 ` [PATCH v3 4/4] memcg: remove warning from folio_lruvec Boris Burkov
2025-08-19  2:41   ` Matthew Wilcox
2025-08-19  5:20     ` Andrew Morton
