From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
Qu Wenruo <wqu@suse.com>,
linux-btrfs@vger.kernel.org, linux-mm@kvack.org,
linux-fsdevel@vger.kernel.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Cgroups <cgroups@vger.kernel.org>
Subject: Re: [PATCH 0/2] mm: skip memcg for certain address space
Date: Thu, 25 Jul 2024 18:30:53 +0930 [thread overview]
Message-ID: <f84a9639-fbc4-466f-822a-d151ac4db8e6@gmx.com> (raw)
In-Reply-To: <8faa191c-a216-4da0-a92c-2456521dcf08@kernel.org>
On 2024/7/18 01:25, Vlastimil Babka (SUSE) wrote:
> Hi,
>
> you should have Ccd people according to get_maintainers script to get a
> reply faster. Let me Cc the MEMCG section.
>
> On 7/10/24 3:07 AM, Qu Wenruo wrote:
>> Recently I'm hitting soft lockup if adding an order 2 folio to a
>> filemap using GFP_NOFS | __GFP_NOFAIL. The softlockup happens at memcg
>> charge code, and I guess that's exactly what __GFP_NOFAIL is expected to
>> do, wait indefinitely until the request can be met.
>
> Seems like a bug to me, as the charging of __GFP_NOFAIL in
> try_charge_memcg() should proceed to the force: part AFAICS and just go over
> the limit.
After more reproductions of the bug (and thus more logs), it turned out
to be a corner case specific to mixed folio sizes, not to the memory
cgroup.
We have something like this:

retry:
	ret = filemap_add_folio();
	if (!ret)
		goto out;
	existing_folio = filemap_lock_folio();
	if (IS_ERR(existing_folio))
		goto retry;
This causes an infinite loop if the filemap has the following layout:

	|<-    folio range    ->|
	|     |     |/////|/////|

Where |/////| marks a slot already occupied by an existing (order 0)
page.

In the above case, filemap_add_folio() returns -EEXIST because of the
two conflicting pages.

Meanwhile filemap_lock_folio() always returns -ENOENT, because there is
no page at the folio index itself.
The symptom only looks cgroup related because we spend a lot of time
inside the cgroup code; the cause is not cgroup at all.

This is not a problem today because the existing code always uses
order 0 folios, so the above layout cannot happen.

Once larger folio support is enabled and mixed folio sizes are allowed,
it will hit this problem sooner or later.
I'll still push the memcg opt-out as an optimization, but since the
root cause is now pinned down, I'll no longer include it in the larger
folio enablement.
Thanks for all the help, and sorry for the extra noise.
Qu
>
> I was suspecting mem_cgroup_oom() a bit earlier return true, causing the
> retry loop, due to GFP_NOFS. But it seems out_of_memory() should be
> specifically proceeding for GFP_NOFS if it's memcg oom. But I might be
> missing something else. Anyway we should know what exactly is going first.
>
>> On the other hand, if we do not use __GFP_NOFAIL, we can be limited by
>> memcg at a lot of critical location, and lead to unnecessary transaction
>> abort just due to memcg limit.
>>
>> However for that specific btrfs call site, there is really no need charge
>> the memcg, as that address space belongs to btree inode, which is not
>> accessible to any end user, and that btree inode is a shared pool for
>> all metadata of a btrfs.
>>
>> So this patchset introduces a new address space flag, AS_NO_MEMCG, so
>> that folios added to that address space will not trigger any memcg
>> charge.
>>
>> This would be the basis for future btrfs changes, like removing
>> __GFP_NOFAIL completely and larger metadata folios.
>>
>> Qu Wenruo (2):
>> mm: make lru_gen_eviction() to handle folios without memcg info
>> mm: allow certain address space to be not accounted by memcg
>>
>> fs/btrfs/disk-io.c | 1 +
>> include/linux/pagemap.h | 1 +
>> mm/filemap.c | 12 +++++++++---
>> mm/workingset.c | 2 +-
>> 4 files changed, 12 insertions(+), 4 deletions(-)
>>
>
>