From: Matthew Wilcox <willy@infradead.org>
To: Chris Li <chrisl@kernel.org>
Cc: Karim Manaouil <kmanaouil.dev@gmail.com>, Jan Kara <jack@suse.cz>,
Chuanhua Han <hanchuanhua@oppo.com>,
linux-mm <linux-mm@kvack.org>,
lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
21cnbao@gmail.com, david@redhat.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
Date: Fri, 31 May 2024 04:12:37 +0100
Message-ID: <ZllAJbLaYGQkrPyV@casper.infradead.org>
In-Reply-To: <CAF8kJuO3BxYuQm7d1drw3spb0CxGYZ6OigzXDzLqjtgWYVF7jw@mail.gmail.com>
On Thu, May 30, 2024 at 03:53:49PM -0700, Chris Li wrote:
> On Wed, May 29, 2024 at 5:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > Whereas in the anonymous memory case, the dirty page does not have
> > > to be written to swap. It is optional, so which page you choose to
> > > swap out is critical: you want to swap out the coldest page, the page
> > > that is least likely to get swapped in. Therefore, the LRU makes sense.
> >
> > Disagree. There are two things you want and the LRU serves neither
> > particularly well. One is that when you want to reclaim memory, you
> > want to find some memory that is likely to not be accessed in the next
> > few seconds/minutes/hours. It doesn't need to be the coldest, just in
> > (say) the coldest 10% or so of memory. And it needs to already be clean,
> > otherwise you have to wait for it to writeback, and you can't afford that.
>
> Do you disagree that LRU is necessary or the way we use the LRU?
I think we should switch to a scheme where we just don't use an LRU at
all.
> In order to get the coldest 10% or so of pages, I assume you still need
> to maintain an LRU, no?
I don't think that's true. If you reframe the problem as "we need to
find some of the coldest pages in the system", then you can use a
different scheme.
> > The second thing you need to be able to do is find pages which are
> > already dirty, and not likely to be written to soon, and write those
> > back so they join the pool of clean pages which are eligible for reclaim.
> > Again, the LRU isn't really the best tool for the job.
>
> It seems you need the LRU to find which pages qualify for write back.
> They should be both dirty and cold.
>
> The question is, can you do the reclaim write back without LRU for
> anonymous pages?
> If LRU is unavoidable, then it is a necessary evil.
The point I was trying to make is that a simple physical scan is 40x
faster than an LRU scan. So if you just scan N pages, starting from
wherever you left off the scan last time, and even 1/10 of them are
eligible for reclaiming (not referenced since the last time the clock
hand swept past them, perhaps), you're still reclaiming 4x as many
pages as an LRU scan would in the same time.
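
To spell out the arithmetic behind the 4x figure, here is a small sketch.
The 40x and 1/10 numbers come from the paragraph above; the assumption
that an LRU scan reclaims every page it visits (the best case for the
LRU) and the variable names are mine, purely for illustration.

```python
from fractions import Fraction

# Figures taken from the message.
SCAN_SPEEDUP = Fraction(40)          # physical scan visits 40x as many pages per unit time
ELIGIBLE_FRACTION = Fraction(1, 10)  # "even 1/10 of them are eligible"

# Assumption (best case for the LRU): every page an LRU scan visits is reclaimable.
LRU_ELIGIBLE_FRACTION = Fraction(1)

# Pages reclaimed per unit time, normalised so the LRU scan visits 1 page.
lru_reclaimed = 1 * LRU_ELIGIBLE_FRACTION
physical_reclaimed = SCAN_SPEEDUP * ELIGIBLE_FRACTION

speedup_ratio = physical_reclaimed / lru_reclaimed
print(f"physical scan reclaims {speedup_ratio}x as many pages")
```

Under these assumptions the physical scan stays ahead as long as more
than 1 in 40 of the scanned pages turns out to be eligible.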
> > > In VMA swap out, the question is, which VMA you choose from first? To
> > > make things more complicated, the same page can map into different
> > > processes in more than one VMA as well.
> >
> > This is why we have the anon_vma, to handle the same pages mapped from
> > multiple VMAs.
>
> Can you clarify when you use anon_vma to organize the swap out and
> swap in, do you want to write a range of pages rather than just one
> page at a time? Will write back a sub list of the LRU work for you?
> Ideally we shouldn't write back pages that are hot. anon_vma alone
> does not give us that information.
So filesystems do write back all pages in an inode that are dirty,
regardless of whether they're hot. But, as noted, we do like to
get the pagecache written back periodically even if the pages are
going to be redirtied soon. And this is somewhere that I think there's
a difference between anon & file pages. So maybe the algorithm looks
something like this:
A: write page fault causes page to be created
B: scan unmaps page, marks it dirty, does not start writeout
C: scan finds dirty, unmapped anon page, starts writeout
D: scan finds clean unmapped anon page, frees it
So it will actually take three trips around the whole of memory for
the physical scan to evict an anon page. That should be adequate
time for a workload to fault back in a page that's actually hot.
(If a page fault finds a page in state B, it transitions back to state
A and gets three more trips around the clock.)
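
The A/B/C/D steps above can be sketched as a tiny state machine for a
single page. This is only a toy model, not kernel code: the state names,
`scan_step`, `fault` and `sweeps_until_freed` are hypothetical, and it
assumes writeout started in step C has completed by the time the clock
hand comes around again.

```python
from enum import Enum, auto


class State(Enum):
    MAPPED = auto()          # after A: write fault created the page
    UNMAPPED_DIRTY = auto()  # after B: scan unmapped it, no writeout yet
    UNMAPPED_CLEAN = auto()  # after C: writeout started (assumed done by next sweep)
    FREE = auto()            # after D: scan freed the clean, unmapped page


def scan_step(state: State) -> State:
    """Advance one page by one visit of the clock hand."""
    if state is State.MAPPED:
        return State.UNMAPPED_DIRTY   # step B
    if state is State.UNMAPPED_DIRTY:
        return State.UNMAPPED_CLEAN   # step C
    return State.FREE                 # step D (or already free)


def fault(state: State) -> State:
    """A page fault on a page in state B puts it back in state A.

    The message only describes the state-B case; other states are
    left unchanged in this sketch.
    """
    if state is State.UNMAPPED_DIRTY:
        return State.MAPPED
    return state


def sweeps_until_freed(fault_after=None) -> int:
    """Count clock sweeps until the page is freed.

    If fault_after is given, the workload touches the page right
    after that many sweeps.
    """
    state = State.MAPPED
    sweeps = 0
    while state is not State.FREE:
        state = scan_step(state)
        sweeps += 1
        if fault_after is not None and sweeps == fault_after:
            state = fault(state)
    return sweeps


print(sweeps_until_freed())               # untouched page
print(sweeps_until_freed(fault_after=1))  # page faulted back in while in state B
```

In this model an untouched page is freed on the third sweep, and a page
that is faulted back in while in state B gets three more sweeps from that
point, matching the description above.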