From: Minchan Kim <minchan@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: Kairui Song <ryncsn@gmail.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
"Huang, Ying" <ying.huang@intel.com>,
Chris Li <chrisl@kernel.org>, Yu Zhao <yuzhao@google.com>,
Barry Song <v-songbaohua@oppo.com>, SeongJae Park <sj@kernel.org>,
Hugh Dickins <hughd@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Matthew Wilcox <willy@infradead.org>,
Michal Hocko <mhocko@suse.com>,
Yosry Ahmed <yosryahmed@google.com>,
stable@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm/swap: fix race when skipping swapcache
Date: Thu, 15 Feb 2024 12:55:11 -0800
Message-ID: <Zc56L6oL4JmxqaFN@google.com>
In-Reply-To: <4c651673-132f-4cd8-997e-175f586fd2e6@redhat.com>
Hi David,
On Thu, Feb 15, 2024 at 09:03:28PM +0100, David Hildenbrand wrote:
< snip >
> > >
> > > We would detect later that the PTE changed, but we would temporarily
> > > mess with that swap slot that we might no longer "own".
> > >
> > > I was thinking about alternatives, it's tricky because of the concurrent
> > > MADV_DONTNEED possibility. Something with another fake-swap entry type
> > > (similar to migration entries) might work, but would require more changes.
> >
> > Yeah, in the long term I also think more work is needed for the swap subsystem.
> >
> > In my opinion, for this particular issue, or, for cache bypassed
> > swapin, a new swap map value similar to SWAP_MAP_BAD/SWAP_MAP_SHMEM
> > might be needed, that may even help to simplify the swap count release
> > routine for cache bypassed swapin, and improve the performance.
>
> The question is if we really want to track that in the swapcache and not
> rather in the page table.
>
> Imagine the following:
>
> (1) allocate the folio and lock it (we do that already)
>
> (2) take the page table lock. If the PTE is still the same, insert a new
> "swapin_in_process" fake swp entry that references the locked folio.
>
> (3) read the folio from swap. This will unlock the folio IIUC. (we do that
> already)
>
> (4) relock the folio. (we do that already, might not want to fail)
>
> (5) take the PTE lock. If the PTE did not change, turn it into a present PTE
> entry. Otherwise, clean up.
>
>
> Any concurrent swap-in users would spot the new "swapin_in_process" fake swp
> entry and wait for the page lock (just like we do with migration entries).
>
> Zap code would mostly only clear the "swapin_in_process" fake swp entry and
> leave the cleanup to (5) above. Fortunately, concurrent fork() is impossible
> as that cannot race with page faults.
>
> There might be one minor thing to optimize with the folio lock above. But in
> essence, it would work just like migration entries, just that they are
> installed only while we actually do read the content from disk etc.
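The proposed sequence can be modeled as a small state machine over a single PTE. The sketch below is single-threaded and purely illustrative: every name in it (struct pte_model, PTE_SWAPIN_MARKER, swapin_begin(), swapin_read(), swapin_finish(), zap_pte()) is hypothetical and not a kernel API, and the page table lock and folio lock are represented only by comments and a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PTE states for this model only; not kernel types. */
enum pte_kind { PTE_NONE, PTE_SWAP, PTE_SWAPIN_MARKER, PTE_PRESENT };

struct folio_model {
    bool locked;    /* stands in for the folio lock */
    bool uptodate;  /* set once the read from swap completed */
};

struct pte_model {
    enum pte_kind kind;
    unsigned long swap_slot;    /* meaningful for PTE_SWAP */
    struct folio_model *folio;  /* meaningful for marker / present */
};

enum fault_result { FAULT_DONE, FAULT_RETRY, FAULT_CLEANUP };

/* Allocate/lock the folio, then (under the PTE lock, elided) install the
 * marker only if the PTE still holds the swap entry we faulted on. */
static enum fault_result swapin_begin(struct pte_model *pte,
                                      unsigned long slot,
                                      struct folio_model *folio)
{
    folio->locked = true;
    if (pte->kind != PTE_SWAP || pte->swap_slot != slot) {
        folio->locked = false;  /* caller drops the folio and retries */
        return FAULT_RETRY;
    }
    pte->kind = PTE_SWAPIN_MARKER;
    pte->folio = folio;
    return FAULT_DONE;
}

/* Concurrent faults seeing the marker would sleep on the folio lock,
 * as they do for migration entries. */
static bool must_wait_for_swapin(const struct pte_model *pte)
{
    return pte->kind == PTE_SWAPIN_MARKER;
}

/* Read from swap: completion unlocks the folio, then we relock it. */
static void swapin_read(struct folio_model *folio)
{
    folio->locked = false;
    folio->uptodate = true;
    folio->locked = true;
}

/* Zap only clears the marker; the swap-in path owns the cleanup. */
static void zap_pte(struct pte_model *pte)
{
    pte->kind = PTE_NONE;
    pte->folio = NULL;
}

/* Final step (under the PTE lock, elided): map the folio if the marker is
 * still ours, otherwise tell the caller to free the folio. */
static enum fault_result swapin_finish(struct pte_model *pte,
                                       struct folio_model *folio)
{
    enum fault_result ret;

    assert(folio->locked && folio->uptodate);
    if (pte->kind == PTE_SWAPIN_MARKER && pte->folio == folio) {
        pte->kind = PTE_PRESENT;
        ret = FAULT_DONE;
    } else {
        ret = FAULT_CLEANUP;
    }
    folio->locked = false;  /* would wake any waiters in the real design */
    return ret;
}
```

In this model zap_pte() never frees the folio; the swap-in path notices the change when it re-checks the PTE in its final step, which is the division of labor the quoted proposal describes.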

That's a great idea. I had been thinking about doing the synchronization
in the page table, but couldn't come up with the non_swap_entry idea.

My only concern with this approach is that it would be harder to get
the fix into the stable trees. If there is no strong objection, I'd
prefer to take Kairui's original solution first (with a scheduler tweak
if necessary) and then pursue your idea on the latest tree.
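Kairui's alternative from the quoted text, a new swap map value next to SWAP_MAP_BAD/SWAP_MAP_SHMEM, would keep the synchronization in the swap map rather than in the page table. A minimal sketch, assuming the map holds a plain per-slot count (the real swap_map byte also carries flag bits such as SWAP_HAS_CACHE, which are ignored here) and using a hypothetical marker value HYPO_SWAP_MAP_SWAPIN that is not a kernel constant:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker value for this model only; not a kernel constant. */
#define HYPO_SWAP_MAP_SWAPIN ((unsigned char)0xff)

/* Claim an exclusively owned slot (count == 1) for cache-bypassed swapin.
 * A racing swapin of the same entry sees the marker and fails the claim. */
static bool swapin_claim_slot(_Atomic unsigned char *swap_map, size_t slot)
{
    unsigned char expected = 1;

    return atomic_compare_exchange_strong(&swap_map[slot], &expected,
                                          HYPO_SWAP_MAP_SWAPIN);
}

/* After the PTE has been updated, drop the last reference: slot is free. */
static void swapin_release_slot(_Atomic unsigned char *swap_map, size_t slot)
{
    unsigned char old = atomic_exchange(&swap_map[slot], 0);

    assert(old == HYPO_SWAP_MAP_SWAPIN); /* only the claimer may release */
    (void)old;
}
```

The compare-and-exchange from count 1 means only one cache-bypassing swap-in can own the slot at a time; a racing fault that loses the exchange would have to back off and retry instead of touching a slot it no longer owns.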