linux-fsdevel.vger.kernel.org archive mirror
From: "Zi Yan" <ziy@nvidia.com>
To: "Joanne Koong" <joannelkoong@gmail.com>,
	"Shakeel Butt" <shakeel.butt@linux.dev>
Cc: "David Hildenbrand" <david@redhat.com>,
	"Bernd Schubert" <bernd.schubert@fastmail.fm>,
	<miklos@szeredi.hu>, <linux-fsdevel@vger.kernel.org>,
	<jefflexu@linux.alibaba.com>, <josef@toxicpanda.com>,
	<linux-mm@kvack.org>, <kernel-team@meta.com>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Oscar Salvador" <osalvador@suse.de>,
	"Michal Hocko" <mhocko@kernel.org>
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
Date: Thu, 02 Jan 2025 15:26:09 -0500
Message-ID: <D6RVBFDFZ177.2XJG7IX6PHJBS@nvidia.com>
In-Reply-To: <CAJnrk1bmjd_yE0LO=Qdff==Zk5neunvUbnsEVYqNPPDsSJUudw@mail.gmail.com>

On Thu Jan 2, 2025 at 2:59 PM EST, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> > > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > Thanks David for the response.
> >
> > > >
> > > > >> BTW, I just looked at NFS out of interest, in particular
> > > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > > > >> privileged user that mounts it can set higher ones. I guess one could run
> > > > >> into similar writeback issues?
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > sorry for the late reply.
> > > >
> > > > > Yes, I think so.
> > > > >
> > > > >>
> > > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > > > >
> > > > > I feel like INDETERMINATE in the name is the main cause of confusion.
> > > >
> > > > We are adding logic that says "unconditionally, never wait on writeback
> > > > for these folios, not even any sync migration". That's the main problem
> > > > I have.
> > > >
> > > > Your explanation below is helpful. Because ...
> > > >
> > > > > So, let me explain why it is required (but later I will tell you how it
> > > > > can be avoided). The FUSE thread which is actively handling writeback of
> > > > > a given folio can cause memory allocation either through syscall or page
> > > > > fault. That memory allocation can trigger global reclaim synchronously
> > > > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > > > > > folio whose writeback it is supposed to end, causing a deadlock. So,
> > > > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> > > > > >
> > > > > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > > > allocations. The userspace fs can also use a similar approach which is
> > > > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> > > > > told that it is hard to use as it is per-thread flag and has to be set
> > > > > for all the threads handling writeback which can be error prone if the
> > > > > threadpool is dynamic. Second it is very coarse such that all the
> > > > > allocations from those threads (e.g. page faults) become NOFS which
> > > > > makes userspace very unreliable on highly utilized machine as NOFS can
> > > > > not reclaim potentially a lot of memory and can not trigger oom-kill.
> > > > >
> > > >
> > > > ... now I understand that we want to prevent a deadlock in one specific
> > > > scenario only?
> > > >
> > > > What sounds plausible for me is:
> > > >
> > > > a) Make this only affect the actual deadlock path: sync migration
> > > >     during compaction. Communicate it either using some "context"
> > > >     information or with a new MIGRATE_SYNC_COMPACTION.
> > > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > > >      that very deadlock problem.
> > > > > c) Leave all other sync migration users alone for now
> > >
> > > The deadlock path is separate from sync migration. The deadlock arises
> > > from a corner case where cgroupv1 reclaim waits on a folio under
> > > writeback where that writeback itself is blocked on reclaim.
> > >
> >
> > Joanne, let's drop the patch to migrate.c completely and let's rename
> > the flag to something like what David is suggesting and only handle in
> > the reclaim path.
> >
> > > >
> > > > > Would that prevent the deadlock? Even *better* would be to be able to
> > > > > ask the fs if starting writeback on a specific folio could deadlock.
> > > > > Because in most cases, as I understand, we'll not actually run into the
> > > > deadlock and would just want to wait for writeback to just complete
> > > > (esp. compaction).
> > > >
> > > > (I still think having folios under writeback for a long time might be a
> > > > problem, but that's indeed something to sort out separately in the
> > > > future, because I suspect NFS has similar issues. We'd want to "wait
> > > > with timeout" and e.g., cancel writeback during memory
> > > > offlining/alloc_cma ...)
> >
> > Thanks David and yes let's handle the folios under writeback issue
> > separately.
> >
> > >
> > > I'm looking back at some of the discussions in v2 [1] and I'm still
> > > not clear on how memory fragmentation for non-movable pages differs
> > > from memory fragmentation from movable pages and whether one is worse
> > > than the other.
> >
> > I think the fragmentation due to movable pages becoming unmovable is
> > worse as that situation is unexpected and the kernel can waste a lot of
> > CPU to defrag the block containing those folios. For non-movable blocks,
> > the kernel will not even try to defrag. Now we can have a situation
> > where almost all memory is backed by non-movable blocks and higher order
> > allocations start failing even when there is enough free memory. For
> > such situations either system needs to be restarted (or workloads
> > restarted if they are cause of high non-movable memory) or the admin
> > needs to setup ZONE_MOVABLE where non-movable allocations don't go.
>
> Thanks for the explanations.
>
> The reason I ask is because I'm trying to figure out if having a time
> interval wait or retry mechanism instead of skipping migration would
> be a viable solution. Where when attempting the migration for folios
> with the as_writeback_indeterminate flag that are under writeback,
> it'll wait on folio writeback for a certain amount of time and then
> skip the migration if no progress has been made and the folio is still
> under writeback.
>
> There are two cases for fuse folios under writeback (for folios not
> under writeback, migration will work as is):
> a) normal case: server is not malicious or buggy, writeback is
> completed in a timely manner.
> For this case, migration would be successful and there'd be no
> difference for this between having no temp pages vs temp pages
>
>
> b) server is malicious or buggy:
> eg the server never completes writeback
>
> With no temp pages:
> The folio under writeback prevents a memory block (not sure how big
> this usually is?) from being compacted, leading to memory
> fragmentation

It is called a pageblock. Its size is usually the same as a PMD-sized THP
(e.g., 2MB on x86_64).

With no temp pages, the folios under writeback can be spread across
multiple pageblocks, fragmenting all of them.

>
> With temp pages:
> fuse allocates a non-movable page for every page it needs to write
> back, which worsens memory usage, these pages will never get freed
> since the server never finishes writeback on them. The non-movable
> pages could also fragment memory blocks like in the scenario with no
> temp pages.

Since the temp pages all come from MIGRATE_UNMOVABLE pageblocks,
which are far fewer, the fragmentation is much more limited.

>
>
> Is the b) case with no temp pages worse for memory health than the
> scenario with temp pages? For the cpu usage issue (eg kernel keeps
> trying to defrag blocks containing these problematic folios), it seems
> like this could be potentially mitigated by marking these blocks as
> uncompactable?

With no temp pages, folios under writeback can potentially fragment many
more pageblocks, if not all of them, than with temp pages. Temp pages come
from MIGRATE_UNMOVABLE pageblocks, which are used for unmovable page
allocations such as kernel data and are supposed to be far fewer than the
MIGRATE_MOVABLE pageblocks in the system.

>
>
> Thanks,
> Joanne
>
> >
> > > Currently fuse uses movable temp pages (allocated with
> > > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > > issue where a buggy/malicious server may never complete writeback.
> >
> > So, these temp pages are not an issue for fragmenting the movable blocks
> > but if there is no limit on temp pages, the whole system can become
> > non-movable (there is a case where movable blocks on non-ZONE_MOVABLE
> > can be converted into non-movable blocks under low memory). ZONE_MOVABLE
> > will avoid such scenario but tuning the right size of ZONE_MOVABLE is
> > not easy.
> >
> > > This has the same effect of fragmenting memory and has a worse memory
> > > cost to the system in terms of memory used. With not having temp pages
> > > though, now in this scenario, pages allocated in a movable page block
> > > can't be compacted and that memory is fragmented. My (basic and maybe
> > > incorrect) understanding is that memory gets allocated through a buddy
> > > allocator and moveable vs nonmovable pages get allocated to
> > > corresponding blocks that match their type, but there's no other
> > > difference otherwise. Is this understanding correct? Or is there some
> > > substantial difference between fragmentation for movable vs nonmovable
> > > blocks?
> >
> > The main difference is the fallback of high order allocation which can
> > trigger compaction or background compaction through kcompactd. The
> > kernel will only try to defrag the movable blocks.
> >




-- 
Best Regards,
Yan, Zi

