linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>, Jens Axboe <axboe@kernel.dk>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Hannes Reineke <hare@suse.de>,
	David Hildenbrand <david@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	linux-mm@kvack.org, linux-block@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] block: Remove special-casing of compound pages
Date: Fri, 15 Sep 2023 15:48:46 -0700 (PDT)	[thread overview]
Message-ID: <1fa6119e-6e5-5e8-21d8-571814f6a99e@google.com> (raw)
In-Reply-To: <ZQRoWVntO22VWL8K@casper.infradead.org>

On Fri, 15 Sep 2023, Matthew Wilcox wrote:
> On Wed, Aug 16, 2023 at 01:27:17PM -0700, Hugh Dickins wrote:
> > > This problem predates the folio work; it could for example have been
> > > triggered by mmaping a THP in tmpfs and using that as the target of an
> > > O_DIRECT read.
> > > 
> > > Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
> > 
> > No. It's a good catch, but bug looks specific to the folio work to me.
> > 
> > Almost all shmem pages are dirty from birth, even as soon as they are
> > brought back from swap; so it is not necessary to re-mark them dirty.
> > 
> > The exceptions are pages allocated to holes when faulted: so you did
> > get me worried as to whether khugepaged could collapse a pmd-ful of
> > those into a THP without marking the result as dirty.
> > 
> > But no, in v6.5-rc6 the collapse_file() success path has
> > 	if (is_shmem)
> > 		folio_mark_dirty(folio);
> > and in v5.10 the same appears as
> > 		if (is_shmem)
> > 			set_page_dirty(new_page);
> > 
> > (IIRC, that or marking pmd dirty was missed from early shmem THP
> > support, but fairly soon corrected, and backported to stable then.
> > I have a faint memory of versions which assembled pmd_dirty from
> > collected pte_dirtys.)
> > 
> > And the !is_shmem case is for CONFIG_READ_ONLY_THP_FOR_FS: writing
> > into those pages, by direct IO or whatever, is already prohibited.
> > 
> > It's dem dirty (or not dirty) folios dat's the trouble!
> 
> Thanks for the correction!  Could it happen with anon THP?
> They're not kept dirty from birth ... are they?

Anon pages, THP or other, are not marked dirty from birth, right.
But nor are they considered for freeing without writeout:
shrink_folio_list() does add_to_swap() on them without considering
dirtiness, and add_to_swap() does an unconditional folio_mark_dirty().
Well, not quite unconditional: it is conditional on allocating swap,
but shrink_folio_list() just reactivates when swap is not allocated.

So, I see no opportunity for data loss there.

When it's read back from swap later on, the folio will be left clean
while it matches swap: I haven't bothered to recheck the latest details
of what happens when it's CoWed, or the swap is deleted, those details
won't matter given the behavior above.  But might there be a direct IO
problem while that anon folio (large or small) remains clean in swapcache,
when reclaim's pageout() might be liable to free it without rewriting?

There ought not to be: get_user_pages()/follow_page_pte() have taken
care of that for many years with the FOLL_TOUCH+FOLL_WRITE
	if (flags & FOLL_TOUCH) {
		if ((flags & FOLL_WRITE) &&
		    !pte_dirty(pte) && !PageDirty(page))
			set_page_dirty(page);
and follow_trans_huge_pmd() dirties the pmd when FOLL_TOUCH+FOLL_WRITE.
I forget why follow_page_pte() prefers to dirty page rather than pte,
but either approach should be good enough to avoid the data loss.

However, I don't see equivalent FOLL_TOUCH+FOLL_WRITE code where
get_user_pages_fast() goes its fast route; nor pin_user_pages_fast()
used by lib/iov_iter.c; and it looks odd that pin_user_pages_remote()
and pin_user_pages_unlocked() use FOLL_TOUCH but pin_user_pages() not.

Problem?  Not particularly for anon or for large, but for any?  Or not
a problem because of final set_page_dirty() or folio_mark_dirty() on
release - only a problem while that PageCompound check remains?

(Of course filesystems hate behind-the-back dirtying for other reasons,
that they may lose the data without proper notice: separate discussion
we'd better not get back into here.)

I've spent much longer trying to answer this than I could afford,
expect no more from me, back to you and GUP+PIN experts.

Hugh

  reply	other threads:[~2023-09-15 22:51 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-14 14:41 [PATCH] block: Remove special-casing of compound pages Matthew Wilcox (Oracle)
2023-08-14 14:48 ` Hannes Reinecke
2023-08-16 17:03 ` Fix rare user data corruption when using THP Matthew Wilcox
2023-08-16 20:27 ` [PATCH] block: Remove special-casing of compound pages Hugh Dickins
2023-09-15 14:21   ` Matthew Wilcox
2023-09-15 22:48     ` Hugh Dickins [this message]
2023-12-07 21:04 ` Jens Axboe
2024-02-29 18:25   ` Greg Edwards
2024-02-29 19:37     ` Matthew Wilcox
2024-02-29 20:05       ` Greg Edwards
2023-12-07 22:10 ` Keith Busch
2023-12-07 23:57   ` Matthew Wilcox

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1fa6119e-6e5-5e8-21d8-571814f6a99e@google.com \
    --to=hughd@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=david@redhat.com \
    --cc=hare@suse.de \
    --cc=jhubbard@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).