From: Jeff Layton <jlayton@kernel.org>
To: David Hildenbrand <david@redhat.com>,
	Shakeel Butt <shakeel.butt@linux.dev>
Cc: Miklos Szeredi <miklos@szeredi.hu>,
	Joanne Koong <joannelkoong@gmail.com>,
	 Bernd Schubert <bernd.schubert@fastmail.fm>,
	Zi Yan <ziy@nvidia.com>,
	linux-fsdevel@vger.kernel.org,  jefflexu@linux.alibaba.com,
	josef@toxicpanda.com, linux-mm@kvack.org,  kernel-team@meta.com,
	Matthew Wilcox <willy@infradead.org>,
	Oscar Salvador	 <osalvador@suse.de>,
	Michal Hocko <mhocko@kernel.org>
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
Date: Mon, 13 Jan 2025 16:44:26 -0500
Message-ID: <dfd5427e2b4434355dd75d5fbe2460a656aba94e.camel@kernel.org>
In-Reply-To: <2848b566-3cae-4e89-916c-241508054402@redhat.com>

On Mon, 2025-01-13 at 16:27 +0100, David Hildenbrand wrote:
> On 10.01.25 23:00, Shakeel Butt wrote:
> > On Fri, Jan 10, 2025 at 10:13:17PM +0100, David Hildenbrand wrote:
> > > On 10.01.25 21:28, Jeff Layton wrote:
> > > > On Thu, 2025-01-09 at 12:22 +0100, David Hildenbrand wrote:
> > > > > On 07.01.25 19:07, Shakeel Butt wrote:
> > > > > > On Tue, Jan 07, 2025 at 09:34:49AM +0100, David Hildenbrand wrote:
> > > > > > > On 06.01.25 19:17, Shakeel Butt wrote:
> > > > > > > > On Mon, Jan 06, 2025 at 11:19:42AM +0100, Miklos Szeredi wrote:
> > > > > > > > > On Fri, 3 Jan 2025 at 21:31, David Hildenbrand <david@redhat.com> wrote:
> > > > > > > > > > > In any case, having movable pages be turned unmovable due to persistent
> > > > > > > > > > > writeback is something that must be fixed, not worked around. Likely a
> > > > > > > > > > > good topic for LSF/MM.
> > > > > > > > > 
> > > > > > > > > Yes, this seems a good cross fs-mm topic.
> > > > > > > > > 
> > > > > > > > > > So the issue discussed here is that movable pages used for fuse
> > > > > > > > > > page-cache cause problems when memory needs to be compacted. The
> > > > > > > > > > problem is either that
> > > > > > > > > 
> > > > > > > > >      - the page is skipped, leaving the physical memory block unmovable
> > > > > > > > > 
> > > > > > > > >      - the compaction is blocked for an unbounded time
> > > > > > > > > 
> > > > > > > > > While the new AS_WRITEBACK_INDETERMINATE could potentially make things
> > > > > > > > > worse, the same thing happens on readahead, since the new page can be
> > > > > > > > > locked for an indeterminate amount of time, which can also block
> > > > > > > > > compaction, right?
> > > > > > > 
> > > > > > > Yes, as memory hotplug + virtio-mem maintainer my bigger concern is these
> > > > > > > pages residing in ZONE_MOVABLE / MIGRATE_CMA areas where there *must not be
> > > > > > > > unmovable pages ever*. Not triggered by an untrusted source, not triggered
> > > > > > > > by a trusted source.
> > > > > > > 
> > > > > > > It's a violation of core-mm principles.
> > > > > > 
> > > > > > > The "must not be unmovable pages ever" is a very strong statement and we
> > > > > > > are violating it today and will keep violating it in future. Any
> > > > > > > page/folio that is under lock or writeback, has a reference taken, or
> > > > > > > has been isolated from its LRU is unmovable (most of the time for a
> > > > > > > small period of time).
> > > > > 
> > > > > ^ this: "small period of time" is what I meant.
> > > > > 
> > > > > Most of these things are known to not be problematic: retrying a couple
> > > > > of times makes it work, that's why migration keeps retrying.
> > > > > 
> > > > > Again, as an example, we allow short-term O_DIRECT but disallow
> > > > > long-term page pinning. I think there were concerns at some point if
> > > > > O_DIRECT might also be problematic (I/O might take a while), but so far
> > > > > it was not a problem in practice that would make CMA allocations easily
> > > > > fail.
> > > > > 
> > > > > vmsplice() is a known problem, because it behaves like O_DIRECT but
> > > > > actually triggers long-term pinning; IIRC David Howells has this on his
> > > > > todo list to fix. [I recall that seccomp disallows vmsplice by default
> > > > > right now]
> > > > > 
> > > > > > > These operations are being done all over the place in kernel.
> > > > > > Miklos gave an example of readahead.
> > > > > 
> > > > > I assume you mean "unmovable for a short time", correct, or can you
> > > > > point me at that specific example; I think I missed that.
> > 
> > Please see https://lore.kernel.org/all/CAJfpegthP2enc9o1hV-izyAG9nHcD_tT8dKFxxzhdQws6pcyhQ@mail.gmail.com/
> > 
> > > > > 
> > > > > > > The per-CPU LRU caches are another
> > > > > > > case where folios can get stuck for a long period of time.
> > > > > 
> > > > > Which is why memory offlining disables the lru cache. See
> > > > > lru_cache_disable(). Other users that care about that drain the LRU on
> > > > > all cpus.
> > > > > 
> > > > > > Reclaim and
> > > > > > compaction can isolate a lot of folios that they need to have
> > > > > > too_many_isolated() checks. So, "must not be unmovable pages ever" is
> > > > > > impractical.
> > > > > 
> > > > > "must only be short-term unmovable", better?
> > 
> > Yes and you have clarified further below of the actual amount.
> > 
> > > > > 
> > > > 
> > > > Still a little ambiguous.
> > > > 
> > > > How short is "short-term"? Are we talking milliseconds or minutes?
> > > 
> > > Usually a couple of seconds, max. For memory offlining, slightly longer
> > > times are acceptable; other things (in particular compaction or CMA
> > > allocations) will give up much faster.
> > > 
> > > > 
> > > > Imposing a hard timeout on writeback requests to unprivileged FUSE
> > > > servers might give us a better guarantee of forward-progress, but it
> > > > would probably have to be on the order of at least a minute or so to be
> > > > workable.
> > > 
> > > Yes, and that might already be a bit too much, especially if stuck on
> > > waiting for folio writeback ... so ideally we could find a way to migrate
> > > these folios that are under writeback and it's not your ordinary disk driver
> > > that responds rather quickly.
> > > 
> > > Right now we do it via these temp pages, and I can see how that's
> > > undesirable.
> > > 
> > > For NFS etc. we probably never ran into this, because it's all used in
> > > fairly well managed environments and, well, I assume NFS easily outdates CMA
> > > and ZONE_MOVABLE :)
> > > 
> > > > > > > 
> > > > > > The point is that, yes we should aim to improve things but in iterations
> > > > > > and "must not be unmovable pages ever" is not something we can achieve
> > > > > > in one step.
> > > > > 
> > > > > I agree with the "improve things in iterations", but as
> > > > > AS_WRITEBACK_INDETERMINATE has the FOLL_LONGTERM smell to it, I think we
> > > > > are making things worse.
> > 
> > AS_WRITEBACK_INDETERMINATE is really a bad name we picked as it is still
> > causing confusion. It is a simple flag to avoid deadlock in the reclaim
> > code path and does not say anything about movability.
> > 
> > > > > 
> > > > > And as this discussion has been going on for too long, to summarize my
> > > > > point: there exist conditions where pages are short-term unmovable, and
> > > > > possibly some to be fixed that turn pages long-term unmovable (e.g.,
> > > > > vmsplice); that does not mean that we can freely add new conditions that
> > > > > turn movable pages unmovable long-term or even forever.
> > > > > 
> > > > > > Again, this might be a good LSF/MM topic. If I had the capacity I
> > > > > > would suggest a topic around which things are known to cause pages to be
> > > > > > short-term or long-term unmovable/unsplittable, and which can be
> > > > > > handled, which not. Maybe I'll find the time to propose that as a topic.
> > > > > 
> > > > 
> > > > 
> > > > This does sound like great LSF/MM fodder! I predict that this session
> > > > will run long! ;)
> > > 
> > > Heh, fully agreed! :)
> > 
> > I would like a more targeted topic and for that I want us to at least
> > agree where we are disagreeing. Let me write down two statements and
> > please tell me where you disagree:
> 
> I think we're mostly in agreement!
> 
> > 
> > 1. For a normally running FUSE server (without tmp pages), the lifetime of
> > the writeback state of fuse folios falls under the "short-term unmovable"
> > bucket, as it does not differ in any way from how any other filesystem
> > handles writeback folios.
> 
> That's the expectation, yes. As long as the FUSE server is able to make 
> progress, the expectation is that it's just like NFS etc. If it isn't 
> able to make progress (i.e., crash), the expectation is that everything 
> will get cleaned up either way.
> 
> I wonder if there could be valid scenario where the FUSE server is no 
> longer able to make progress (ignoring network outages), or the progress 
> might start being extremely slow such that it becomes a problem. In 
> contrast to in-kernel FSs, one can do some fancy stuff with fuse where 
> writing a page could possibly consume a lot of memory in user-space. 
> Likely, in this case we might just blame it on the admin that agreed to 
> running this (trusted) fuse server.
> 
> > 
> > 2. For a buggy or untrusted FUSE server (without tmp pages), the
> > lifetime of writeback state of fuse folios can be arbitrarily long and
> > we need some mechanism to limit it.
> 
> Yes.
> 
> 
> Especially in 1), we really want to wait for writeback to finish, just 
> like for any other filesystem. For 2), we want a way so writeback will 
> not get stuck for a long time, but are able to make progress and migrate 
> these pages.
> 

What if we were to allow the kernel to kill off an unprivileged FUSE
server that was "misbehaving" [1], clean any dirty pagecache pages that
it has, and set writeback errors on the corresponding FUSE inodes [2]?
We'd still need a rather long timeout (on the order of at least a
minute or so, by default).

Would that be enough to assuage concerns about unprivileged servers
pinning pages indefinitely? Buggy servers are still a problem, but
there's not much we can do about that.

There are a lot of details we'd have to sort out, so I'm also
interested in whether anyone (Miklos? Bernd?) would find this basic
approach objectionable.

[1]: for some definition of misbehavior. Probably a writeback
timeout of some sort but maybe there would be other criteria too.

[2]: or maybe just make them eligible to be cleaned without talking to
the server, should the VM wish it.
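
A minimal user-space sketch of the policy proposed above, under stated assumptions: it is not kernel code and uses no real FUSE API. The types `struct wb_request` and `struct fuse_server_model`, the helper `check_writeback_timeout()`, the 60-second `WB_TIMEOUT_SECS` default, and the choice of `-EIO` as the recorded writeback error are all hypothetical, chosen only to illustrate "time out, abort the unprivileged server, end its writeback with an error".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed default, following "at least a minute or so". */
#define WB_TIMEOUT_SECS 60

/* Hypothetical model of one folio's writeback request. */
struct wb_request {
	uint64_t start_secs;	/* when writeback was handed to the server */
	bool in_flight;		/* still waiting for the server's reply */
	int error;		/* 0, or negative errno recorded for the inode */
};

/* Hypothetical model of a FUSE server and its outstanding writeback. */
struct fuse_server_model {
	bool privileged;	/* trusted servers are never killed here */
	bool aborted;		/* set once the server has been killed off */
	struct wb_request *reqs;
	size_t nreqs;
};

/* "Misbehaving" [1]: an in-flight request older than the timeout. */
static bool server_misbehaving(const struct fuse_server_model *s, uint64_t now)
{
	if (s->privileged)
		return false;
	for (size_t i = 0; i < s->nreqs; i++) {
		const struct wb_request *r = &s->reqs[i];

		if (r->in_flight && now >= r->start_secs &&
		    now - r->start_secs >= WB_TIMEOUT_SECS)
			return true;
	}
	return false;
}

/*
 * Kill the server, end every in-flight writeback and record an error
 * for it, so the folios are no longer under writeback and can be
 * migrated. Returns the number of requests that were ended.
 */
static size_t abort_server(struct fuse_server_model *s)
{
	size_t ended = 0;

	assert(!s->aborted);
	s->aborted = true;
	for (size_t i = 0; i < s->nreqs; i++) {
		struct wb_request *r = &s->reqs[i];

		if (!r->in_flight)
			continue;
		r->in_flight = false;
		r->error = -EIO;
		ended++;
	}
	return ended;
}

/* Periodic check: returns how many writeback requests were ended. */
static size_t check_writeback_timeout(struct fuse_server_model *s, uint64_t now)
{
	if (s->aborted || !server_misbehaving(s, now))
		return 0;
	return abort_server(s);
}
```

In this model a single timed-out request is enough to abort the whole server, which also ends its younger requests; whether that is the right granularity is one of the details that would have to be sorted out.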
-- 
Jeff Layton <jlayton@kernel.org>

