linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Matt Whitlock <kernel@mattwhitlock.name>
To: Matthew Wilcox <willy@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>,
	<linux-fsdevel@vger.kernel.org>, Christoph Hellwig <hch@lst.de>,
	David Howells <dhowells@redhat.com>, Jens Axboe <axboe@kernel.dk>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
Date: Tue, 27 Jun 2023 14:38:32 -0400	[thread overview]
Message-ID: <992abc31-c052-435d-b9e4-a55dccb03a73@mattwhitlock.name> (raw)
In-Reply-To: <ZJq6nJBoX1m6Po9+@casper.infradead.org>

On Tuesday, 27 June 2023 06:31:56 EDT, Matthew Wilcox wrote:
> Perhaps the problem is that splice() appears to copy, but really just
> takes the reference.  Perhaps splice needs to actually copy if it
> sees a multi-page folio and isn't going to take all of it.  I'm not
> an expert in splice-ology, so let's cc some people who know more about
> splice than I do.

As a naive userspace coder, my mental model of splice is that it takes 
references to the pages backing the input file/pipe and appends those 
references to the output pipe's queue. Crucially, my expectation is that 
the cloned pages are marked copy-on-write, such that if some other process 
subsequently modifies the pages, then those pages will be copied at the 
time of modification (i.e., copy-on-write), and the references in the 
spliced-into pipe buffer will still refer to the unmodified originals.

If splice is merely taking a reference without caring to implement 
copy-on-write semantics, then it's not just FALLOC_FL_PUNCH_HOLE that will 
cause problems. Indeed, *any* write to the input file in the range where 
pages were spliced into a pipe will impact the pages already in the pipe 
buffer. I'm sure that makes perfect sense to a kernel developer, but it 
violates the implied contract of splice, which is to act *as though* the 
pages had been *copied* into the pipe.

Obviously, I'm not asking for splice to actually copy the pages, as 
zero-copy semantics are a large part of the motivation for preferring 
splice in the first place, but I do believe that the pages should be made 
copy-on-write such that subsequent writes to those pages in the page cache, 
whether by write or mmap or fallocate or otherwise, will force a copy of 
the pages to be made if the copy-on-write reference count of the modified 
pages is non-zero. Note, I say, "copy-on-write reference count," as 
distinct from the ordinary reference count, as you would need to properly 
handle the case where there are multiple *shared* references to a page that 
*should* observe any changes made to it, while there may also be multiple 
*private* references to the page that *must not* observe any changes made 
to it.

Just a thought: if you're going to implement bi-level reference counting 
semantics, then you could make MMAP_PRIVATE actually take a snapshot of the 
mapped pages at the time of mapping. Currently, mmap(2) says, "It is 
unspecified whether changes made to the file after the mmap() call are 
visible in the mapped region," so you have the latitude to make the private 
mapping truly isolated from subsequent changes in the backing file, using 
the same plumbing as you could use to isolate spliced pages.

  reply	other threads:[~2023-06-27 18:41 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-27  1:12 [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE Matt Whitlock
2023-06-27  5:47 ` Dave Chinner
2023-06-27  6:20   ` Dave Chinner
2023-06-27 10:31   ` Matthew Wilcox
2023-06-27 18:38     ` Matt Whitlock [this message]
2023-06-27 21:49     ` Dave Chinner
2023-06-27 18:14   ` Matt Whitlock
2023-06-27 21:51     ` Dave Chinner
2023-06-28  6:30   ` David Howells
2023-06-28  8:15     ` Dave Chinner
2023-06-28  9:33     ` David Howells
2023-06-28 12:51       ` Matthew Wilcox
2023-06-28 14:19       ` David Howells
2023-06-28 22:41       ` Dave Chinner
2023-06-28 17:35     ` Matt Whitlock
2023-06-28 18:27     ` David Howells
2023-06-28 22:17       ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=992abc31-c052-435d-b9e4-a55dccb03a73@mattwhitlock.name \
    --to=kernel@mattwhitlock.name \
    --cc=axboe@kernel.dk \
    --cc=david@fromorbit.com \
    --cc=dhowells@redhat.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).