From: David Hildenbrand <david@redhat.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>,
Joanne Koong <joannelkoong@gmail.com>, Zi Yan <ziy@nvidia.com>,
miklos@szeredi.hu, linux-fsdevel@vger.kernel.org,
jefflexu@linux.alibaba.com, josef@toxicpanda.com,
linux-mm@kvack.org, kernel-team@meta.com,
Matthew Wilcox <willy@infradead.org>,
Oscar Salvador <osalvador@suse.de>,
Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
Date: Tue, 24 Dec 2024 13:37:49 +0100
Message-ID: <c91b6836-fa30-44a9-bc15-afc829acaba9@redhat.com>
In-Reply-To: <kyn5ji73biubd5fqbpycu4xsheqvomb3cu45ufw7u2paj5rmhr@bhnlclvuujcu>
On 23.12.24 23:14, Shakeel Butt wrote:
> On Sat, Dec 21, 2024 at 05:18:20PM +0100, David Hildenbrand wrote:
> [...]
>>
>> Yes, so I can see fuse
>>
>> (1) Breaking memory reclaim (memory cannot get freed up)
>>
>> (2) Breaking page migration (memory cannot be migrated)
>>
>> Due to (1) we might experience bigger memory pressure in the system I guess.
>> A handful of these pages don't really hurt, I have no idea how bad having
>> many of these pages can be. But yes, inherently we cannot throw away the
>> data as long as it is dirty without causing harm. (maybe we could move it to
>> some other cache, like swap/zswap; but that smells like a big and
>> complicated project)
>>
>> Due to (2) we turn pages that are supposed to be movable possibly for a long
>> time unmovable. Even a *single* such page will mean that CMA allocations /
>> memory unplug can start failing.
>>
>> We have similar situations with page pinning. With things like O_DIRECT, our
>> assumption/experience so far is that it will only take a couple of seconds
>> max, and retry loops are sufficient to handle it. That's why only long-term
>> pinning ("indeterminate", e.g., vfio) migrates these pages out of
>> ZONE_MOVABLE/MIGRATE_CMA areas in order to long-term pin them.
>>
>>
>> The biggest concern I have is that timeouts, while likely reasonable in many
>> scenarios, might not be desirable even for some sane workloads, and the
>> default in all systems will be "no timeout", letting the clueless admin of
>> each and every system out there that might support fuse to make a decision.
>>
>> I might have misunderstood something, in which case I am very sorry, but we
>> also don't want CMA allocations to start failing simply because a network
>> connection is down for a couple of minutes such that a fuse daemon cannot
>> make progress.
>>
>
> I think you have valid concerns but these are not new and not unique to
> fuse. Any filesystem with a potential arbitrary stall can have similar
> issues. The arbitrary stall can be caused due to network issues or some
> faulty local storage.

What concerns me more is that this can be triggered even by
unprivileged user space, and that there is no default protection as far
as I understood, because timeouts cannot be set universally to sane
defaults.

Again, please correct me if I got that wrong.

BTW, I just looked at NFS out of interest, in particular
nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
canceling writeback. IIUC, there are default timeouts for UDP and TCP,
whereby the TCP default one seems to be around 60s (* retrans?), and the
privileged user that mounts it can set higher ones. I guess one could
run into similar writeback issues?

So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?

Not sure if I grasped all details about NFS and writeback and when it
would redirty+end writeback, and if there is some other handling in there.

>
> Regarding the reclaim, I wouldn't say fuse or similar filesystems are
> breaking memory reclaim as the kernel has a mechanism to throttle the
> threads dirtying the file memory to reduce the chance of situations
> where most of memory becomes unreclaimable due to being dirty.

Yes, likely even cgroups can easily limit the amount.

>
> Please note that such filesystems are mostly used in environments like
> data center or hyperscaler and usually have more advanced mechanisms to
> handle and avoid situations like long delays. For such environments,
> network unavailability is a larger issue than some cma allocation
> failure. My point is: let's not assume the disastrous situation is normal
> and overcomplicate the solution.

Let me summarize my main point: ZONE_MOVABLE/MIGRATE_CMA must only be
used for movable allocations.

Mechanisms that possibly turn these folios unmovable for a
long/indeterminate time must either fail or migrate these folios out of
these regions, otherwise we start violating the very semantics for which
ZONE_MOVABLE/MIGRATE_CMA were added in the first place.

Yes, there are corner cases where we cannot guarantee movability (e.g.,
OOM when allocating a migration destination), but these are not cases
that can be triggered by (unprivileged) user space easily.

That's why FOLL_LONGTERM pinning does exactly that: even if user space
would promise that this is really only "short-term", we will treat it as
"possibly forever", because it's under user-space control.

Instead of having more subsystems violate these semantics because
"performance" ... I would hope we would do better. Maybe it's an issue
for NFS as well ("at least" only for privileged user space)? In which
case, again, I would hope we would do better.
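
To make the "must either fail or migrate" rule above a bit more concrete, here is a toy user-space model of it. This is only a sketch and not kernel code: Zone, Folio, migrate_out_of_movable() and pin_longterm() are hypothetical names made up for illustration, and Zone.MOVABLE merely stands in for ZONE_MOVABLE/MIGRATE_CMA. The can_allocate_target flag models the corner case where no migration destination can be allocated.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Zone(Enum):
    NORMAL = auto()   # unmovable allocations are acceptable here
    MOVABLE = auto()  # stand-in for ZONE_MOVABLE / MIGRATE_CMA


@dataclass
class Folio:
    zone: Zone
    pinned_longterm: bool = False


def migrate_out_of_movable(folio: Folio, can_allocate_target: bool) -> bool:
    """Try to move the folio to Zone.NORMAL; False models a failed migration."""
    if not can_allocate_target:
        return False
    folio.zone = Zone.NORMAL
    return True


def pin_longterm(folio: Folio, can_allocate_target: bool = True) -> bool:
    """Pin for an indeterminate time: migrate first, otherwise fail.

    The pin never succeeds while the folio still sits in Zone.MOVABLE.
    """
    if folio.zone is Zone.MOVABLE:
        if not migrate_out_of_movable(folio, can_allocate_target):
            return False
    folio.pinned_longterm = True
    return True
```

In this model the invariant is simply that no folio is ever both in Zone.MOVABLE and pinned long-term: the operation either relocates the folio first or reports failure to the caller.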

Anyhow, I'm hoping there will be more feedback from other MM folks, but
likely right now a lot of people are out (just like I should ;) ).

If I end up being the only one with these concerns, then likely people
can feel free to ignore them. ;)

--
Cheers,
David / dhildenb