linux-fsdevel.vger.kernel.org archive mirror
* [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
@ 2023-06-27  1:12 Matt Whitlock
  2023-06-27  5:47 ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Matt Whitlock @ 2023-06-27  1:12 UTC (permalink / raw)
  To: linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2403 bytes --]

Hello, all. I am experiencing a data corruption issue on Linux 6.1.24 when 
calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages that have 
just been spliced into a pipe. It appears that the fallocate call can zero 
out the pages that are sitting in the pipe buffer, before those pages are 
read from the pipe.

Simplified code excerpt (eliding error checking):

int fd = /* open file descriptor referring to some disk file */;
for (off_t consumed = 0;;) {
   ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
   if (n <= 0) break;
   consumed += n;
   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
}

Expected behavior:
Punching holes in a file after splicing pages out of that file into a pipe 
should not corrupt the spliced-out pages in the pipe buffer.

Observed behavior:
Some of the pages that have been spliced into the pipe get zeroed out by 
the subsequent fallocate call before they can be consumed from the read 
side of the pipe.


Steps to reproduce:

1. Save the attached ones.c, dontneed.c, and consume.c.

2. gcc -o ones ones.c
   gcc -o dontneed dontneed.c
   gcc -o consume consume.c

3. Fill a file with 32 MiB of 0xFF:
   ./ones | head -c$((1<<25)) >testfile

4. Evict the pages of the file from the page cache:
   sync testfile && ./dontneed testfile

5. Splice the file into a pipe, punching out batches of pages after 
splicing them:
   ./consume testfile | hexdump -C

The hexdump output should show 32 MiB of 0xFF. Indeed, on my system, if I 
omit the POSIX_FADV_DONTNEED advice, then I do get the expected output. 
However, if the pages of the file are not already present in the page 
cache (i.e., if the splice call faults them in from disk), then the 
hexdump output shows some pages full of 0xFF and some pages full of 
0x00.

Note #1: I am running a fairly antiquated x86_64 system. You may need to 
use a file larger than 32 MiB to reproduce the misbehavior on a more modern 
system. In particular, even on my system I cannot reproduce the problem 
with a 16 MiB file. Conversely, with a very large file (too large to fit 
entirely in the page cache), I don't need the "dontneed" call to reproduce 
the problem.

Note #2: My test file sits on an XFS file system on a hardware RAID volume 
at /dev/sda1. I'm not sure whether that's relevant.

[-- Attachment #2: ones.c --]
[-- Type: text/x-csrc, Size: 295 bytes --]

#include <string.h>

#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

int main() {
	char buf[4096];
	memset(buf, 0xFF, sizeof buf);
	/* Write 0xFF pages until the consumer closes the pipe. */
	for (;;)
		if (write(STDOUT_FILENO, buf, sizeof buf) < 0)
			error(EX_IOERR, errno, "write");
	return EX_OK;
}

[-- Attachment #3: dontneed.c --]
[-- Type: text/x-csrc, Size: 654 bytes --]

#include <stdio.h>

#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
	if (argc < 2) {
		printf("usage: %s <file> [...]\n", argc > 0 ? *argv : "dontneed");
		return EX_USAGE;
	}
	int ret = EX_OK;
	while (++argv, --argc > 0) {
		int fd = open(*argv, O_RDONLY);
		if (fd < 0) {
			error(0, errno, "%s", *argv);
			ret = ret ?: EX_NOINPUT;
			continue;
		}
		if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) < 0) {
			error(0, errno, "%s: posix_fadvise", *argv);
			ret = EX_OSERR;
		}
		if (close(fd) < 0)
			error(EX_OSERR, errno, "%s: close", *argv);
	}
	return ret;
}

[-- Attachment #4: consume.c --]
[-- Type: text/x-csrc, Size: 888 bytes --]

#define _GNU_SOURCE

#include <stdint.h>

#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
	++argv, --argc;
	do {
		const char *filename = "<stdin>";
		if (argc > 0) {
			close(STDIN_FILENO);
			if (open(filename = *argv, O_RDWR) != STDIN_FILENO)
				error(EX_NOINPUT, errno, "%s", filename);
		}
	for (off_t consumed = 0;;) {
			/* Splice file pages into the pipe on stdout... */
			ssize_t n = splice(STDIN_FILENO, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
			if (n <= 0) {
				if (n < 0)
					error(EX_IOERR, errno, "%s: splice", filename);
				break;
			}
			/* ...then punch out everything spliced so far.  This is the
			 * call that zeroes pages still sitting in the pipe buffer. */
			if (fallocate(STDIN_FILENO, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed += n) < 0)
				error(EX_OSERR, errno, "%s: fallocate", filename);
		}
		if (argc > 0 && unlink(filename) < 0)
			error(EX_OSERR, errno, "%s: unlink", filename);
	} while (++argv, --argc > 0);
	return EX_OK;
}


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27  1:12 [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE Matt Whitlock
@ 2023-06-27  5:47 ` Dave Chinner
  2023-06-27  6:20   ` Dave Chinner
                     ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Dave Chinner @ 2023-06-27  5:47 UTC (permalink / raw)
  To: Matt Whitlock; +Cc: linux-fsdevel, Matthew Wilcox

On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> Hello, all. I am experiencing a data corruption issue on Linux 6.1.24 when
> calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages that have
> just been spliced into a pipe. It appears that the fallocate call can zero
> out the pages that are sitting in the pipe buffer, before those pages are
> read from the pipe.
> 
> Simplified code excerpt (eliding error checking):
> 
> int fd = /* open file descriptor referring to some disk file */;
> for (off_t consumed = 0;;) {
>   ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
>   if (n <= 0) break;
>   consumed += n;
>   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> }

Huh. Never seen that pattern before - what are you trying to
implement with this?

> Expected behavior:
> Punching holes in a file after splicing pages out of that file into a pipe
> should not corrupt the spliced-out pages in the pipe buffer.

splice is a nasty, tricky beast that should never have been
inflicted on the world...

> Observed behavior:
> Some of the pages that have been spliced into the pipe get zeroed out by the
> subsequent fallocate call before they can be consumed from the read side of
> the pipe.

Which implies the splice is not copying the page cache pages but
simply taking a reference to them.

> 
> 
> Steps to reproduce:
> 
> 1. Save the attached ones.c, dontneed.c, and consume.c.
> 
> 2. gcc -o ones ones.c
>   gcc -o dontneed dontneed.c
>   gcc -o consume consume.c
> 
> 3. Fill a file with 32 MiB of 0xFF:
>   ./ones | head -c$((1<<25)) >testfile
>
> 4. Evict the pages of the file from the page cache:
>   sync testfile && ./dontneed testfile

To save everyone some time, this one liner:

# xfs_io -ft -c "pwrite -S 0xff 0 32M" -c fsync -c "fadvise -d 0 32M" testfile

Does the same thing as steps 3 and 4. I also reproduced it with much
smaller files - 1MB is large enough to see multiple corruption
events every time I've run it.

> 5. Splice the file into a pipe, punching out batches of pages after splicing
> them:
>   ./consume testfile | hexdump -C
> 
> The expected output from hexdump should show 32 MiB of 0xFF. Indeed, on my
> system, if I omit the POSIX_FADV_DONTNEED advice, then I do get the expected
> output. However, if the pages of the file are not already present in the
> page cache (i.e., if the splice call faults them in from disk), then the
> hexdump output shows some pages full of 0xFF and some pages full of 0x00.

Like so:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b0a000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b10000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b12000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b18000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b19000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b20000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b22000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b24000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b25000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b28000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01b29000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
01b2c000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Hmmm. The corruption, more often than not, starts on a high-order
aligned file offset. Tracing indicates data is being populated in
the page cache by readahead, which would be using high-order folios
in XFS.

All the splice operations return byte counts that are 4kB
aligned, so the punch is doing filesystem-block-aligned punches. The
extent freeing traces indicate the filesystem is removing exactly
the right ranges from the file, and so the page cache invalidation
calls it is doing are also going to be for the correct ranges.

This smells of a partial high-order folio invalidation problem,
or at least a problem with splice working on pages rather than
folios, with the two not being properly coherent as a result of
partial folio invalidation.

To confirm, I removed all the mapping_set_large_folios() calls in
XFS, and the data corruption goes away. Hence, at minimum, large
folios look like a trigger for the problem.

Willy, over to you.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27  5:47 ` Dave Chinner
@ 2023-06-27  6:20   ` Dave Chinner
  2023-06-27 10:31   ` Matthew Wilcox
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2023-06-27  6:20 UTC (permalink / raw)
  To: Matt Whitlock; +Cc: linux-fsdevel, Matthew Wilcox

On Tue, Jun 27, 2023 at 03:47:57PM +1000, Dave Chinner wrote:
> On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> > Hello, all. I am experiencing a data corruption issue on Linux 6.1.24 when
> > calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages that have
> > just been spliced into a pipe. It appears that the fallocate call can zero
> > out the pages that are sitting in the pipe buffer, before those pages are
> > read from the pipe.
> > 
> > Simplified code excerpt (eliding error checking):
> > 
> > int fd = /* open file descriptor referring to some disk file */;
> > for (off_t consumed = 0;;) {
> >   ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
> >   if (n <= 0) break;
> >   consumed += n;
> >   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> > }
> 
> Huh. Never seen that pattern before - what are you trying to
> implement with this?
> 
> > Expected behavior:
> > Punching holes in a file after splicing pages out of that file into a pipe
> > should not corrupt the spliced-out pages in the pipe buffer.
> 
> splice is a nasty, tricky beast that should never have been
> inflicted on the world...
> 
> > Observed behavior:
> > Some of the pages that have been spliced into the pipe get zeroed out by the
> > subsequent fallocate call before they can be consumed from the read side of
> > the pipe.
> 
> Which implies the splice is not copying the page cache pages but
> simply taking a reference to them.
> 
> > 
> > 
> > Steps to reproduce:
> > 
> > 1. Save the attached ones.c, dontneed.c, and consume.c.
> > 
> > 2. gcc -o ones ones.c
> >   gcc -o dontneed dontneed.c
> >   gcc -o consume consume.c
> > 
> > 3. Fill a file with 32 MiB of 0xFF:
> >   ./ones | head -c$((1<<25)) >testfile
> >
> > 4. Evict the pages of the file from the page cache:
> >   sync testfile && ./dontneed testfile
> 
> To save everyone some time, this one liner:
> 
> # xfs_io -ft -c "pwrite -S 0xff 0 32M" -c fsync -c "fadvise -d 0 32M" testfile
> 
> Does the same thing as steps 3 and 4. I also reproduced it with much
> smaller files - 1MB is large enough to see multiple corruption
> events every time I've run it.
> 
> > 5. Splice the file into a pipe, punching out batches of pages after splicing
> > them:
> >   ./consume testfile | hexdump -C
> > 
> > The expected output from hexdump should show 32 MiB of 0xFF. Indeed, on my
> > system, if I omit the POSIX_FADV_DONTNEED advice, then I do get the expected
> > output. However, if the pages of the file are not already present in the
> > page cache (i.e., if the splice call faults them in from disk), then the
> > hexdump output shows some pages full of 0xFF and some pages full of 0x00.
> 
> Like so:
> 
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b0a000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b10000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b12000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b18000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b19000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b20000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b22000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b24000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b25000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b28000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 01b29000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
> *
> 01b2c000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 
> Hmmm. The corruption, more often than not, starts on a high-order
> aligned file offset. Tracing indicates data is being populated in
> the page cache by readahead, which would be using high-order folios
> in XFS.
> 
> All the splice operations return byte counts that are 4kB
> aligned, so the punch is doing filesystem-block-aligned punches. The
> extent freeing traces indicate the filesystem is removing exactly
> the right ranges from the file, and so the page cache invalidation
> calls it is doing are also going to be for the correct ranges.
> 
> This smells of a partial high-order folio invalidation problem,
> or at least a problem with splice working on pages rather than
> folios, with the two not being properly coherent as a result of
> partial folio invalidation.
> 
> To confirm, I removed all the mapping_set_large_folios() calls in
> XFS, and the data corruption goes away. Hence, at minimum, large
> folios look like a trigger for the problem.
> 
> Willy, over to you.

Just to follow up, splice ends up in the iov_iter code with
ITER_PIPE as the destination. We then end up in
copy_page_to_iter(), which, if the iter is a pipe, hands off to
copy_page_to_iter_pipe(). This does not copy the page contents;
it simply grabs a reference to the page and then stuffs it into the
pipe buffer, where it sits until it is read.

IOWs, the splice-to-pipe operation takes a reference to the
page cache page, then we drop all the locks that protect it and
return to userspace, which then triggers an invalidation of the page
with a hole punch, and the data disappears from the page that is
referenced in the pipe.

So copy_page_to_iter_pipe() is simply broken. It's making the
assumption that it can just take a reference to a page cache page
and the state/contents of the page will not change until the page
is released (i.e. consumed by a pipe reader). This is not true
for file-backed pages - if the page is not locked, then anything
can happen to it while it is sitting on the pipe...

Why this requires large folios to trigger is something I don't
understand - the code looks broken even for single page objects in
the page cache. It's probably just a timing issue that high order
folios tickle much more easily with different readahead and
invalidation patterns.

Anyway, this needs iov_iter/folio expertise at this point...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27  5:47 ` Dave Chinner
  2023-06-27  6:20   ` Dave Chinner
@ 2023-06-27 10:31   ` Matthew Wilcox
  2023-06-27 18:38     ` Matt Whitlock
  2023-06-27 21:49     ` Dave Chinner
  2023-06-27 18:14   ` Matt Whitlock
  2023-06-28  6:30   ` David Howells
  3 siblings, 2 replies; 17+ messages in thread
From: Matthew Wilcox @ 2023-06-27 10:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matt Whitlock, linux-fsdevel, Christoph Hellwig, David Howells,
	Jens Axboe, Al Viro

On Tue, Jun 27, 2023 at 03:47:57PM +1000, Dave Chinner wrote:
> On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> > Hello, all. I am experiencing a data corruption issue on Linux 6.1.24 when
> > calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages that have
> > just been spliced into a pipe. It appears that the fallocate call can zero
> > out the pages that are sitting in the pipe buffer, before those pages are
> > read from the pipe.
> > 
> > Simplified code excerpt (eliding error checking):
> > 
> > int fd = /* open file descriptor referring to some disk file */;
> > for (off_t consumed = 0;;) {
> >   ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
> >   if (n <= 0) break;
> >   consumed += n;
> >   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> > }
> 
> Huh. Never seen that pattern before - what are you trying to
> implement with this?
> 
> > Expected behavior:
> > Punching holes in a file after splicing pages out of that file into a pipe
> > should not corrupt the spliced-out pages in the pipe buffer.
> 
> splice is a nasty, tricky beast that should never have been
> inflicted on the world...

Indeed.  I understand the problem, I just don't know if it's a bug.

> > Observed behavior:
> > Some of the pages that have been spliced into the pipe get zeroed out by the
> > subsequent fallocate call before they can be consumed from the read side of
> > the pipe.
> 
> Which implies the splice is not copying the page cache pages but
> simply taking a reference to them.

Yup.

> Hmmm. The corruption, more often than not, starts on a high-order
> aligned file offset. Tracing indicates data is being populated in
> the page cache by readahead, which would be using high-order folios
> in XFS.
> 
> All the splice operations return byte counts that are 4kB
> aligned, so the punch is doing filesystem-block-aligned punches. The
> extent freeing traces indicate the filesystem is removing exactly
> the right ranges from the file, and so the page cache invalidation
> calls it is doing are also going to be for the correct ranges.
> 
> This smells of a partial high-order folio invalidation problem,
> or at least a problem with splice working on pages rather than
> folios, with the two not being properly coherent as a result of
> partial folio invalidation.
> 
> To confirm, I removed all the mapping_set_large_folios() calls in
> XFS, and the data corruption goes away. Hence, at minimum, large
> folios look like a trigger for the problem.

If you do a PUNCH HOLE, documented behaviour is:

       Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
       in mode deallocates space (i.e., creates a  hole)  in  the  byte  range
       starting  at offset and continuing for len bytes.  Within the specified
       range, partial filesystem  blocks  are  zeroed,  and  whole  filesystem
       blocks  are removed from the file.  After a successful call, subsequent
       reads from this range will return zeros.

So we have, let's say, an order-4 folio and the user tries to PUNCH_HOLE
page 3 of it.  We try to split it, but that fails because the pipe holds
a reference.  The filesystem has removed the underlying data from the
storage medium.  What is the page cache to do?  It must memset() so that
subsequent reads return zeroes.  And now the page in the pipe has the
hole punched into it.

I think you can reproduce this problem without large folios by using a
512-byte block size filesystem and punching holes that are sub page
size.  The page cache must behave similarly.
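
An untested sketch of that configuration (assuming a disposable scratch
device at /dev/scratch; the fpunch offset and length are deliberately
smaller than a page):

# mkfs.xfs -f -b size=512 /dev/scratch
# mount /dev/scratch /mnt
# xfs_io -ft -c "pwrite -S 0xff 0 1M" -c fsync -c "fadvise -d 0 1M" /mnt/testfile
# xfs_io -c "fpunch 512 1024" /mnt/testfile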

Perhaps the problem is that splice() appears to copy, but really just
takes the reference.  Perhaps splice needs to actually copy if it
sees a multi-page folio and isn't going to take all of it.  I'm not
an expert in splice-ology, so let's cc some people who know more about
splice than I do.
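
In the meantime, one userspace mitigation is to make the copy explicit.
A sketch of a copy-based replacement for the file-to-stdout splice in
consume.c ('consumed' and 'filename' as in that program; short writes
and EINTR handling elided):

	/* copy-based replacement for the splice() call: the punch can no
	 * longer reach data that has already been handed downstream */
	char buf[65536];
	ssize_t n = pread(STDIN_FILENO, buf, sizeof buf, consumed);
	if (n > 0 && write(STDOUT_FILENO, buf, n) != n)
		error(EX_IOERR, errno, "%s: write", filename);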


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27  5:47 ` Dave Chinner
  2023-06-27  6:20   ` Dave Chinner
  2023-06-27 10:31   ` Matthew Wilcox
@ 2023-06-27 18:14   ` Matt Whitlock
  2023-06-27 21:51     ` Dave Chinner
  2023-06-28  6:30   ` David Howells
  3 siblings, 1 reply; 17+ messages in thread
From: Matt Whitlock @ 2023-06-27 18:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 2472 bytes --]

On Tuesday, 27 June 2023 01:47:57 EDT, Dave Chinner wrote:
> On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
>> Hello, all. I am experiencing a data corruption issue on Linux 6.1.24 when
>> calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages that have
>> just been spliced into a pipe. It appears that the fallocate call can zero
>> out the pages that are sitting in the pipe buffer, before those pages are
>> read from the pipe.
>> 
>> Simplified code excerpt (eliding error checking):
>> 
>> int fd = /* open file descriptor referring to some disk file */;
>> for (off_t consumed = 0;;) {
>>   ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
>>   if (n <= 0) break;
>>   consumed += n;
>>   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
>> }
>
> Huh. Never seen that pattern before - what are you trying to
> implement with this?

It's part of a tool I wrote that implements an indefinitely expandable 
user-space pipe buffer backed by an unlinked-on-creation disk file. It's 
very useful as a shim in a pipeline between a process that produces a large 
but finite amount of data quickly and a process that consumes data slowly. 
My canonical use case is in my nightly backup cronjob, where I have 'tar 
-c' piped into 'xz' piped into a tool that uploads its stdin to an Amazon 
S3-compatible data store. I insert my diskbuffer tool between tar and xz so 
that the tar process can complete as quickly as possible (thus achieving a 
snapshot as close to atomic as practically possible without freezing the 
file system) and the xz process can take its sweet time crunching the 
tarball down as small as possible, yet the disk space consumed by the 
temporary (uncompressed) tarball can be released progressively as xz 
consumes the tarball, and all of the disk space will be released 
automatically if the pipeline dies, such as if the system is rebooted 
during a backup run.

Conceptually:

tar -c /home | diskbuffer /var/tmp | xz -9e | s3upload ...

The 'diskbuffer' tool creates an unlinked temporary file (using 
O_TMPFILE|O_EXCL) in the specified directory and then enters a loop, 
splicing bytes from stdin into the file and splicing bytes from the file 
into stdout, incrementally punching out the file as it successfully splices 
to stdout.

Source code for the tool is attached. (It has a bit more functionality than 
I have described here.)


[-- Attachment #2: diskbuffer.c --]
[-- Type: text/x-csrc, Size: 3190 bytes --]

#define _GNU_SOURCE

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <poll.h>
#include <sysexits.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/statfs.h>
#include <sys/types.h>


static inline size_t splice_len(off_t len) {
	return (size_t) (len > INT_MAX ? INT_MAX : len);
}

int main(int argc, char *argv[]) {
	bool defer = false, keep = false;
	if (argc > 1 && strcmp(argv[1], "--defer") == 0) {
		defer = true;
		argv[1] = argv[0];
		--argc, ++argv;
	}
	if (argc > 1 && strcmp(argv[1], "--keep") == 0) {
		keep = true;
		argv[1] = argv[0];
		--argc, ++argv;
	}
	if (argc > 2 || (keep && argc < 2)) {
		fprintf(stderr, "usage: %s [--defer] { [<tmpdir>] | --keep <file> }\n", argv[0]);
		return EX_USAGE;
	}
	const char *path;
	int fd;
	if (keep) {
		/* record the path for error reporting below */
		if ((fd = open(path = argv[1], O_RDWR | O_CREAT | O_EXCL, 0666)) < 0) {
			error(EX_NOINPUT, errno, "%s", argv[1]);
		}
	}
	else {
		if (argc > 1) {
			path = argv[1];
		}
		else if (!(path = getenv("TMPDIR"))) {
			path = "/tmp";
		}
		if ((fd = open(path, O_RDWR | O_TMPFILE | O_EXCL, 0000)) < 0) {
			error(EX_NOINPUT, errno, "%s", path);
		}
	}

	bool eof = false, punch = !keep;
	off_t n_in = 0, n_out = 0, max = 0, mask = 0;
	for (;;) {
		off_t n_buf = n_in - n_out;
		if (n_buf == max) {
			struct statfs f;
			if (fstatfs(fd, &f) < 0) {
				error(EX_NOINPUT, errno, "%s", path);
			}
			max = f.f_bavail / 2 * f.f_bsize;
			mask = ~((off_t) f.f_bsize - 1);
		}
		struct pollfd pfds[2] = { };
		nfds_t nfds = 0;
		if (!eof && n_buf < max) {
			pfds[nfds].fd = STDIN_FILENO;
			pfds[nfds].events = POLLIN;
			++nfds;
		}
		if (n_buf > 0 && (!defer || eof)) {
			pfds[nfds].fd = STDOUT_FILENO;
			pfds[nfds].events = POLLOUT;
			++nfds;
		}
		if (nfds == 0) {
			if (eof) {
				break;
			}
			sleep(1);
			continue;
		}
		if (poll(pfds, nfds, -1) > 0) {
			for (nfds_t i = 0; i < nfds; ++i) {
				if (pfds[i].revents) {
					if (pfds[i].fd == STDIN_FILENO) {
						ssize_t n = splice(STDIN_FILENO, NULL, fd, &n_in, splice_len(max - n_buf), SPLICE_F_MOVE | SPLICE_F_NONBLOCK | SPLICE_F_MORE);
						if (n <= 0) {
							if (n == 0) {
								eof = true;
							}
							else if (errno != EAGAIN) {
								error(EX_IOERR, errno, "%s: splice", "<stdin>");
							}
						}
						else {
							n_buf += n;
						}
					}
					else if (pfds[i].fd == STDOUT_FILENO) {
						ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, splice_len(n_buf), SPLICE_F_MOVE | SPLICE_F_NONBLOCK | SPLICE_F_MORE);
						if (n <= 0) {
							if (n < 0 && errno != EAGAIN) {
								error(EX_IOERR, errno, "%s: splice", "<stdout>");
							}
						}
						else if (((n_out += n) & mask) && punch && fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, n_out & mask) < 0) {
							if (errno != EOPNOTSUPP && errno != ENOSYS) {
								error(EX_IOERR, errno, "%s: fallocate", path);
							}
							fprintf(stderr, "WARNING: temp file in %s does not support FALLOC_FL_PUNCH_HOLE\n", path);
							punch = false;
						}
					}
				}
			}
		}
		else if (errno != EINTR) {
			error(EX_OSERR, errno, "poll");
		}
	}
	return EX_OK;
}


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27 10:31   ` Matthew Wilcox
@ 2023-06-27 18:38     ` Matt Whitlock
  2023-06-27 21:49     ` Dave Chinner
  1 sibling, 0 replies; 17+ messages in thread
From: Matt Whitlock @ 2023-06-27 18:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, linux-fsdevel, Christoph Hellwig, David Howells,
	Jens Axboe, Al Viro

On Tuesday, 27 June 2023 06:31:56 EDT, Matthew Wilcox wrote:
> Perhaps the problem is that splice() appears to copy, but really just
> takes the reference.  Perhaps splice needs to actually copy if it
> sees a multi-page folio and isn't going to take all of it.  I'm not
> an expert in splice-ology, so let's cc some people who know more about
> splice than I do.

As a naive userspace coder, my mental model of splice is that it takes 
references to the pages backing the input file/pipe and appends those 
references to the output pipe's queue. Crucially, my expectation is that 
the cloned pages are marked copy-on-write, such that if some other process 
subsequently modifies the pages, then those pages will be copied at the 
time of modification (i.e., copy-on-write), and the references in the 
spliced-into pipe buffer will still refer to the unmodified originals.

If splice is merely taking a reference without caring to implement 
copy-on-write semantics, then it's not just FALLOC_FL_PUNCH_HOLE that will 
cause problems. Indeed, *any* write to the input file in the range where 
pages were spliced into a pipe will impact the pages already in the pipe 
buffer. I'm sure that makes perfect sense to a kernel developer, but it 
violates the implied contract of splice, which is to act *as though* the 
pages had been *copied* into the pipe.

Obviously, I'm not asking for splice to actually copy the pages, as 
zero-copy semantics are a large part of the motivation for preferring 
splice in the first place, but I do believe that the pages should be made 
copy-on-write such that subsequent writes to those pages in the page cache, 
whether by write or mmap or fallocate or otherwise, will force a copy of 
the pages to be made if the copy-on-write reference count of the modified 
pages is non-zero. Note, I say, "copy-on-write reference count," as 
distinct from the ordinary reference count, as you would need to properly 
handle the case where there are multiple *shared* references to a page that 
*should* observe any changes made to it, while there may also be multiple 
*private* references to the page that *must not* observe any changes made 
to it.

Just a thought: if you're going to implement bi-level reference counting 
semantics, then you could make MAP_PRIVATE actually take a snapshot of the 
mapped pages at the time of mapping. Currently, mmap(2) says, "It is 
unspecified whether changes made to the file after the mmap() call are 
visible in the mapped region," so you have the latitude to make the private 
mapping truly isolated from subsequent changes in the backing file, using 
the same plumbing as you could use to isolate spliced pages.


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27 10:31   ` Matthew Wilcox
  2023-06-27 18:38     ` Matt Whitlock
@ 2023-06-27 21:49     ` Dave Chinner
  1 sibling, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2023-06-27 21:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Matt Whitlock, linux-fsdevel, Christoph Hellwig, David Howells,
	Jens Axboe, Al Viro

On Tue, Jun 27, 2023 at 11:31:56AM +0100, Matthew Wilcox wrote:
> On Tue, Jun 27, 2023 at 03:47:57PM +1000, Dave Chinner wrote:
> > On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> > > Hello, all. I am experiencing a data corruption issue on Linux 6.1.24 when
> > > calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages that have
> > > just been spliced into a pipe. It appears that the fallocate call can zero
> > > out the pages that are sitting in the pipe buffer, before those pages are
> > > read from the pipe.
> > > 
> > > Simplified code excerpt (eliding error checking):
> > > 
> > > int fd = /* open file descriptor referring to some disk file */;
> > > for (off_t consumed = 0;;) {
> > >   ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
> > >   if (n <= 0) break;
> > >   consumed += n;
> > >   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> > > }
> > 
> > Huh. Never seen that pattern before - what are you trying to
> > implement with this?
> > 
> > > Expected behavior:
> > > Punching holes in a file after splicing pages out of that file into a pipe
> > > should not corrupt the spliced-out pages in the pipe buffer.
> > 
> > splice is a nasty, tricky beast that should never have been
> > inflicted on the world...
> 
> Indeed.  I understand the problem, I just don't know if it's a bug.
> 
> > > Observed behavior:
> > > Some of the pages that have been spliced into the pipe get zeroed out by the
> > > subsequent fallocate call before they can be consumed from the read side of
> > > the pipe.
> > 
> > Which implies the splice is not copying the page cache pages but
> > simply taking a reference to them.
> 
> Yup.
> 
> > Hmmm. The corruption, more often than not, starts on a high-order
> > aligned file offset. Tracing indicates data is being populated in
> > the page cache by readahead, which would be using high-order folios
> > in XFS.
> > 
> > All the splice operations return byte counts that are 4kB
> > aligned, so the punch is doing filesystem-block-aligned punches. The
> > extent freeing traces indicate the filesystem is removing exactly
> > the right ranges from the file, and so the page cache invalidation
> > calls it is doing are also going to be for the correct ranges.
> > 
> > This smells of a partial high-order folio invalidation problem,
> > or at least a problem with splice working on pages rather than
> > folios, with the two not being properly coherent as a result of
> > partial folio invalidation.
> > 
> > To confirm, I removed all the mapping_set_large_folios() calls in
> > XFS, and the data corruption goes away. Hence, at minimum, large
> > folios look like a trigger for the problem.
> 
> If you do a PUNCH HOLE, documented behaviour is:
> 
>        Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
>        in mode deallocates space (i.e., creates a  hole)  in  the  byte  range
>        starting  at offset and continuing for len bytes.  Within the specified
>        range, partial filesystem  blocks  are  zeroed,  and  whole  filesystem
>        blocks  are removed from the file.  After a successful call, subsequent
>        reads from this range will return zeros.
> 
> So we have, let's say, an order-4 folio and the user tries to PUNCH_HOLE
> page 3 of it.  We try to split it, but that fails because the pipe holds
> a reference.  The filesystem has removed the underlying data from the
> storage medium.  What is the page cache to do?  It must memset() so that
> subsequent reads return zeroes.  And now the page in the pipe has the
> hole punched into it.

Ok, that's what I suspected.

> I think you can reproduce this problem without large folios by using a
> 512-byte block size filesystem and punching holes that are sub page
> size.  The page cache must behave similarly.

Not on XFS. See xfs_flush_unmap_range(), that is run on fallocate
ranges before we do the operation:

	rounding = max_t(xfs_off_t, mp->m_sb.sb_blocksize, PAGE_SIZE);
	start = round_down(offset, rounding);
	end = round_up(offset + len, rounding) - 1;
	....
	truncate_pagecache_range(inode, start, end);

The invalidation is rounded to the larger of PAGE_SIZE or the
filesystem block size, so that we, at minimum, invalidate entire
pages in the page cache. The block size case is for doing the right
thing with block size > page size.
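
(Worked example: with 4 KiB pages and 512-byte blocks, punching
offset 1536, length 512 gives rounding = max(512, 4096) = 4096,
start = round_down(1536, 4096) = 0 and end = round_up(2048, 4096) - 1
= 4095, so truncate_pagecache_range() invalidates the whole page
rather than zeroing 512 bytes inside it.)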

Hence XFS will not do sub-page invalidations and so avoids touching
the contents of the page in this case. However, with large folios,
we cannot invalidate entire objects in the page cache like this any
more, so invalidation touches the page contents and that shows up in
the pages that are held in the pipe...

> Perhaps the problem is that splice() appears to copy, but really just
> takes the reference.  Perhaps splice needs to actually copy if it
> sees a multi-page folio and isn't going to take all of it.  I'm not
> an expert in splice-ology, so let's cc some people who know more about
> splice than I do.

Yup, that's pretty much my conclusion - if the destination is page
based, we copy the data. If the destination is a pipe, we simply
take references to the source pages instead of copying the data -
see my followup email.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27 18:14   ` Matt Whitlock
@ 2023-06-27 21:51     ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2023-06-27 21:51 UTC (permalink / raw)
  To: Matt Whitlock; +Cc: linux-fsdevel, Matthew Wilcox

On Tue, Jun 27, 2023 at 02:14:41PM -0400, Matt Whitlock wrote:
> On Tuesday, 27 June 2023 01:47:57 EDT, Dave Chinner wrote:
> > On Mon, Jun 26, 2023 at 09:12:52PM -0400, Matt Whitlock wrote:
> > > Hello, all. I am experiencing a data corruption issue on Linux 6.1.24 when
> > > calling fallocate with FALLOC_FL_PUNCH_HOLE to punch out pages that have
> > > just been spliced into a pipe. It appears that the fallocate call can zero
> > > out the pages that are sitting in the pipe buffer, before those pages are
> > > read from the pipe.
> > > 
> > > Simplified code excerpt (eliding error checking):
> > > 
> > > int fd = /* open file descriptor referring to some disk file */;
> > > for (off_t consumed = 0;;) {
> > >   ssize_t n = splice(fd, NULL, STDOUT_FILENO, NULL, SIZE_MAX, 0);
> > >   if (n <= 0) break;
> > >   consumed += n;
> > >   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, consumed);
> > > }
> > 
> > Huh. Never seen that pattern before - what are you trying to
> > implement with this?
> 
> It's part of a tool I wrote that implements an indefinitely expandable
> user-space pipe buffer backed by an unlinked-on-creation disk file. It's
> very useful as a shim in a pipeline between a process that produces a large
> but finite amount of data quickly and a process that consumes data slowly.
> My canonical use case is in my nightly backup cronjob, where I have 'tar -c'
> piped into 'xz' piped into a tool that uploads its stdin to an Amazon
> S3-compatible data store.

Neat trick.

I think that what you really want for this is something like blksnap
so you can have temporary atomic snapshots of the filesystem to take
the backup from :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-27  5:47 ` Dave Chinner
                     ` (2 preceding siblings ...)
  2023-06-27 18:14   ` Matt Whitlock
@ 2023-06-28  6:30   ` David Howells
  2023-06-28  8:15     ` Dave Chinner
                       ` (3 more replies)
  3 siblings, 4 replies; 17+ messages in thread
From: David Howells @ 2023-06-28  6:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Dave Chinner, Matt Whitlock, linux-fsdevel,
	Christoph Hellwig, Jens Axboe, Al Viro

Matthew Wilcox <willy@infradead.org> wrote:

> > > Expected behavior:
> > > Punching holes in a file after splicing pages out of that file into a pipe
> > > should not corrupt the spliced-out pages in the pipe buffer.

I think this bit is the key.  Why would this be the expected behaviour?  As
you say, splice is allowed to stuff parts of the pagecache into a pipe and
these may get transferred, say, to a network card at the end to transmit
directly from.  It's a form of direct I/O.  If someone has the pages mmapped,
they can change the data that will be transmitted; if someone does a write(),
they can change that data too.  The point of splice is to avoid the copy - but
it comes with a tradeoff.

David



* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28  6:30   ` David Howells
@ 2023-06-28  8:15     ` Dave Chinner
  2023-06-28  9:33     ` David Howells
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2023-06-28  8:15 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Matt Whitlock, linux-fsdevel, Christoph Hellwig,
	Jens Axboe, Al Viro

On Wed, Jun 28, 2023 at 07:30:50AM +0100, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > > > Expected behavior:
> > > > Punching holes in a file after splicing pages out of that file into a pipe
> > > > should not corrupt the spliced-out pages in the pipe buffer.
> 
> I think this bit is the key.  Why would this be the expected behaviour?  As
> you say, splice is allowed to stuff parts of the pagecache into a pipe and
> these may get transferred, say, to a network card at the end to transmit
> directly from.  It's a form of direct I/O.  If someone has the pages mmapped,
> they can change the data that will be transmitted; if someone does a write(),
> they can change that data too.  The point of splice is to avoid the copy - but
> it comes with a tradeoff.

I wouldn't call "post-splice filesystem modifications randomly
corrupt pipe data" a tradeoff. I call that a bug.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28  6:30   ` David Howells
  2023-06-28  8:15     ` Dave Chinner
@ 2023-06-28  9:33     ` David Howells
  2023-06-28 12:51       ` Matthew Wilcox
                         ` (2 more replies)
  2023-06-28 17:35     ` Matt Whitlock
  2023-06-28 18:27     ` David Howells
  3 siblings, 3 replies; 17+ messages in thread
From: David Howells @ 2023-06-28  9:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dhowells, Matthew Wilcox, Matt Whitlock, linux-fsdevel,
	Christoph Hellwig, Jens Axboe, Al Viro

Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jun 28, 2023 at 07:30:50AM +0100, David Howells wrote:
> > Matthew Wilcox <willy@infradead.org> wrote:
> > > > > Expected behavior:
> > > > > Punching holes in a file after splicing pages out of that file into
> > > > > a pipe should not corrupt the spliced-out pages in the pipe buffer.
> > 
> > I think this bit is the key.  Why would this be the expected behaviour?
> > As you say, splice is allowed to stuff parts of the pagecache into a pipe
> > and these may get transferred, say, to a network card at the end to
> > transmit directly from.  It's a form of direct I/O.

Actually, it's a form of zerocopy, not direct I/O.

> > If someone has the pages mmapped, they can change the data that will be
> > transmitted; if someone does a write(), they can change that data too.
> > The point of splice is to avoid the copy - but it comes with a tradeoff.
> 
> I wouldn't call "post-splice filesystem modifications randomly
> corrupts pipe data" a tradeoff. I call that a bug.

Would you consider it a kernel bug, then, if you use sendmsg(MSG_ZEROCOPY) to
send some data from a file mmapping that some other userspace then corrupts by
altering the file before the kernel has managed to send it?

Anyway, if you think the splice thing is a bug, we have to fix splice from a
buffered file that is shared-writably mmapped as well as fixing
fallocate()-driven mangling.  There are a number of options:

 (0) Document the bug as a feature: "If this is a problem, don't use splice".

 (1) Always copy the data into the pipe.

 (2) Always unmap and steal the pages from the pagecache, copying if we can't.

 (3) R/O-protect any PTEs mapping those pages and implement CoW.

 (4) Disallow splice() from any region that's mmapped, disallow mmap() on or
     make page_mkwrite wait for any region that's currently spliced.  Disallow
     fallocate() on or make fallocate() wait for any pages that are spliced.

With recent changes, I think there are only two places that need fixing:
filemap_splice_read() and shmem_splice_read().  However, I wonder what the
performance effect of having to do a PTE hunt in splice() will be.

And then there's vmsplice()...

Also, I do wonder what happens if you do MSG_ZEROCOPY to a loopback network
address and then splice out of the other end.  I'm guessing you'll get the
zerocopied pages out into your pipe as I think it just moves the sent skbuffs
to the receive queue on the other end.

David



* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28  9:33     ` David Howells
@ 2023-06-28 12:51       ` Matthew Wilcox
  2023-06-28 14:19       ` David Howells
  2023-06-28 22:41       ` Dave Chinner
  2 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2023-06-28 12:51 UTC (permalink / raw)
  To: David Howells
  Cc: Dave Chinner, Matt Whitlock, linux-fsdevel, Christoph Hellwig,
	Jens Axboe, Al Viro

On Wed, Jun 28, 2023 at 10:33:24AM +0100, David Howells wrote:
> Would you consider it a kernel bug, then, if you use sendmsg(MSG_ZEROCOPY) to
> send some data from a file mmapping that some other userspace then corrupts by
> altering the file before the kernel has managed to send it?

I think there's a difference in that sendmsg() will block until it
returns (... right?)  splice() returns.  That implies the copy is done.
Then the same thread modifies the file, but the pipe sees the new
data, not the old.  Doesn't that feel like a bug to you?



* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28  9:33     ` David Howells
  2023-06-28 12:51       ` Matthew Wilcox
@ 2023-06-28 14:19       ` David Howells
  2023-06-28 22:41       ` Dave Chinner
  2 siblings, 0 replies; 17+ messages in thread
From: David Howells @ 2023-06-28 14:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Dave Chinner, Matt Whitlock, linux-fsdevel,
	Christoph Hellwig, Jens Axboe, Al Viro

Matthew Wilcox <willy@infradead.org> wrote:

> On Wed, Jun 28, 2023 at 10:33:24AM +0100, David Howells wrote:
> > Would you consider it a kernel bug, then, if you use sendmsg(MSG_ZEROCOPY) to
> > send some data from a file mmapping that some other userspace then corrupts by
> > altering the file before the kernel has managed to send it?
> 
> I think there's a difference in that sendmsg() will block until it
> returns (... right?

Nope - sendmsg returns once the data is in the Tx queue.  It tells you about
the progress made in transmitting your zerocopy data by pasting error-report
messages with SO_EE_ORIGIN_ZEROCOPY set into the msg_control buffer when you
call recvmsg().  That tells you how many of the bytes you sent with
MSG_ZEROCOPY have so far been transmitted.  You're expected to work out from
that which of your buffers are now freed up.
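
For reference, a minimal sketch of that completion handshake (as
described in Documentation/networking/msg_zerocopy.rst; not from this
thread; error handling elided; fd is assumed to be a connected TCP
socket):

	#include <string.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <linux/errqueue.h>

	static void send_zc(int fd, const void *buf, size_t len)
	{
		int one = 1;
		setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof one);
		send(fd, buf, len, MSG_ZEROCOPY);	/* returns before the NIC reads buf */

		/* the completion arrives on the socket error queue */
		char control[128];
		struct msghdr msg = { .msg_control = control,
				      .msg_controllen = sizeof control };
		while (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
			;	/* real code would poll(POLLERR) instead of spinning */

		for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
		     cm = CMSG_NXTHDR(&msg, cm)) {
			struct sock_extended_err *ee =
				(struct sock_extended_err *) CMSG_DATA(cm);
			if (ee->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
				/* sends ee_info..ee_data have completed; buf
				 * is only now safe to reuse or modify */
			}
		}
	}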

Further, even if the process that issued the sendmsg() might be blocked, it
doesn't mean that some other process can modify the contents of the buffer by
write() or writing through a shared-writable mmap.

> splice() returns.  That implies the copy is done.

I thought the whole point of splice was to *avoid* the copy.  Or is it only to
avoid user<->kernel copies?

> Then the same thread modifies the file, but the pipe sees the new data, not
> the old.  Doesn't that feel like a bug to you?

It can be argued either way.  It you want to see it as a bug, then what
solution do you want?

 (1) Always copy the data into the pipe.

 (2) Always unmap and steal the pages from the pagecache, copying if we can't.

 (3) R/O-protect any PTEs mapping those pages and implement CoW.

 (4) Disallow splice() from any region that's mmapped, disallow mmap() on or
     make page_mkwrite wait for any region that's currently spliced.  Disallow
     fallocate() on or make fallocate() wait for any pages that are spliced.

2 and 3 are likely to have a performance degradation because we'll have to
zap/modify PTEs before completing the splice.  4 will have some functionality
degradation.

3 sounds particularly messy.  The problem is that if a page in the pagecache
is spliced into a pipe, we have to discard the page from the pagecache if we
would otherwise want to modify it.  I guess we'd need a splice counter added
to struct folio?  (Or at least a PG_was_once_spliced bit).

Maybe (2a) Steal the page if unmapped, unpinned and only in use by the
pagecache, copy it otherwise?

Maybe you have a better solution?

David



* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28  6:30   ` David Howells
  2023-06-28  8:15     ` Dave Chinner
  2023-06-28  9:33     ` David Howells
@ 2023-06-28 17:35     ` Matt Whitlock
  2023-06-28 18:27     ` David Howells
  3 siblings, 0 replies; 17+ messages in thread
From: Matt Whitlock @ 2023-06-28 17:35 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Dave Chinner, linux-fsdevel, Christoph Hellwig,
	Jens Axboe, Al Viro

On Wednesday, 28 June 2023 02:30:50 EDT, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
>>>> Expected behavior:
>>>> Punching holes in a file after splicing pages out of that 
>>>> file into a pipe
>>>> should not corrupt the spliced-out pages in the pipe buffer.
>
> I think this bit is the key.  Why would this be the expected behaviour?

Why? Because SPLICE_F_MOVE exists. Even though that flag is a no-op as of 
Linux 2.6.21, its existence implies that calling splice() *without* 
specifying SPLICE_F_MOVE performs a *copy*. The kernel is, of course, free 
*not* to copy pages in pursuit of better performance, but it must behave as 
though it did copy unless SPLICE_F_MOVE is specified, in which case 
userspace is explicitly acknowledging that subsequent modification of the 
spliced pages may impact the spliced data. Effectively, SPLICE_F_MOVE is a 
promise by userspace that the moved pages will not be subsequently 
modified, and if they are, then all bets are off.

In other words, the currently implemented behavior is appropriate for 
SPLICE_F_MOVE, but it is not appropriate for ~SPLICE_F_MOVE.


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28  6:30   ` David Howells
                       ` (2 preceding siblings ...)
  2023-06-28 17:35     ` Matt Whitlock
@ 2023-06-28 18:27     ` David Howells
  2023-06-28 22:17       ` Dave Chinner
  3 siblings, 1 reply; 17+ messages in thread
From: David Howells @ 2023-06-28 18:27 UTC (permalink / raw)
  To: Matt Whitlock
  Cc: dhowells, Matthew Wilcox, Dave Chinner, linux-fsdevel,
	Christoph Hellwig, Jens Axboe, Al Viro

Matt Whitlock <kernel@mattwhitlock.name> wrote:

> In other words, the currently implemented behavior is appropriate for
> SPLICE_F_MOVE, but it is not appropriate for ~SPLICE_F_MOVE.

The problem with SPLICE_F_MOVE is that it's only applicable to splicing *out*
of a pipe.  By the time you get that far the pages can already be corrupted by
a shared-writable mmap or write().

David



* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28 18:27     ` David Howells
@ 2023-06-28 22:17       ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2023-06-28 22:17 UTC (permalink / raw)
  To: David Howells
  Cc: Matt Whitlock, Matthew Wilcox, linux-fsdevel, Christoph Hellwig,
	Jens Axboe, Al Viro

On Wed, Jun 28, 2023 at 07:27:26PM +0100, David Howells wrote:
> Matt Whitlock <kernel@mattwhitlock.name> wrote:
> 
> > In other words, the currently implemented behavior is appropriate for
> > SPLICE_F_MOVE, but it is not appropriate for ~SPLICE_F_MOVE.
> 
> The problems with SPLICE_F_MOVE is that it's only applicable to splicing *out*
> of a pipe.  By the time you get that far the pages can already be corrupted by
> a shared-writable mmap or write().

That's not documented in the man page.

Indeed, I think Matt's point - and mine, too, for that matter - is
that the splice(2) man page documents *none* of this
"copy-by-reference" behaviour or it's side effects. All the man page
documents is that the data is *copied in kernel-space* rather than
needing kernel->user->kernel data movement to copy it from one fd to
the other.

The man page *heavily implies* that splice is a "fast immediate
data copy". It most definitely does not describe any "zero-copy with
whacky post-completion data stream corrupting side effects"
mechanisms. There's not even an entry in the "notes" or "bugs"
section to warn users that they cannot trust the contents of the
source or destination pipe to be what they think they might be as
the "data copy" implied by the pipe buffer might not occur until
some arbitrary time in the future.

Hence, according to the man page, what it is doing right now is
definitely contrary to the behaviour implied by the documentation...

i.e. If the data that is "copied" to the destination pipe is not
resolved until some future action by some 3rd party process is
performed, then the man page must tell users they cannot use this
for any sort of data stream where they require that the data being
transferred remain stable as of the time of the splice
operation.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE
  2023-06-28  9:33     ` David Howells
  2023-06-28 12:51       ` Matthew Wilcox
  2023-06-28 14:19       ` David Howells
@ 2023-06-28 22:41       ` Dave Chinner
  2 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2023-06-28 22:41 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, Matt Whitlock, linux-fsdevel, Christoph Hellwig,
	Jens Axboe, Al Viro

On Wed, Jun 28, 2023 at 10:33:24AM +0100, David Howells wrote:
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Wed, Jun 28, 2023 at 07:30:50AM +0100, David Howells wrote:
> > > Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > Expected behavior:
> > > > > > Punching holes in a file after splicing pages out of that file into
> > > > > > a pipe should not corrupt the spliced-out pages in the pipe buffer.
> > > 
> > > I think this bit is the key.  Why would this be the expected behaviour?
> > > As you say, splice is allowed to stuff parts of the pagecache into a pipe
> > > and these may get transferred, say, to a network card at the end to
> > > transmit directly from.  It's a form of direct I/O.
> 
> Actually, it's a form of zerocopy, not direct I/O.
> 
> > > If someone has the pages mmapped, they can change the data that will be
> > > transmitted; if someone does a write(), they can change that data too.
> > > The point of splice is to avoid the copy - but it comes with a tradeoff.
> > 
> > I wouldn't call "post-splice filesystem modifications randomly
> > corrupts pipe data" a tradeoff. I call that a bug.
> 
> Would you consider it a kernel bug, then, if you use sendmsg(MSG_ZEROCOPY) to
> send some data from a file mmapping that some other userspace then corrupts by
> altering the file before the kernel has managed to send it?

Yes.

We had this exact discussion a few months ago w.r.t. OTW checksums
being invalid because data was changing mid-checksum.

I consider that a bug, and in most cases this is avoided by the
hardware checksum offloads. i.e. the checksum doesn't occur until
after the data is DMAd to the hardware out of the referenced page
so it is immune to tearing caused by 3rd party access to the data.

That doesn't mean the data being delivered is valid or correct;
fundamentally this is a use-after-free situation.

> Anyway, if you think the splice thing is a bug, we have to fix splice from a
> buffered file that is shared-writably mmapped as well as fixing
> fallocate()-driven mangling.  There are a number of options:
> 
>  (0) Document the bug as a feature: "If this is a problem, don't use splice".

That's the primary issue right now - the behaviour is not documented
anywhere.

> 
>  (1) Always copy the data into the pipe.

We always copy non-pipe data. Only pipes are considered special
here...

>  (2) Always unmap and steal the pages from the pagecache, copying if we can't.

That's effectively the definition of SPLICE_F_GIFT behaviour, yes?

>  (3) R/O-protect any PTEs mapping those pages and implement CoW.

Another mechanism for stable pages, but can we do CoW for kernel-only
mappings easily?

>  (4) Disallow splice() from any region that's mmapped, disallow mmap() on or
>      make page_mkwrite wait for any region that's currently spliced.  Disallow
>      fallocate() on or make fallocate() wait for any pages that are spliced.

fallocate() is already protected against splice(2) operations via
exclusive i_rwsem, mapping->invalidate_lock, DIO drains and internal
filesystem locking.

The problem is that splicing to a pipe hands page cache pages to a
pipe that is not under the control of (or even visible to) the
filesystem, so the filesystem can do *nothing* to serialise against
these anonymous references that the pipe holds to page cache pages.

So the filesystem would require a page/folio based synchronisation
mechanism, kinda like writeback and stable pages, and we'd have to
issue that wait everywhere we currently have a folio_wait_stable()
IO barrier. fallocate() would need this, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 17+ messages
2023-06-27  1:12 [Reproducer] Corruption, possible race between splice and FALLOC_FL_PUNCH_HOLE Matt Whitlock
2023-06-27  5:47 ` Dave Chinner
2023-06-27  6:20   ` Dave Chinner
2023-06-27 10:31   ` Matthew Wilcox
2023-06-27 18:38     ` Matt Whitlock
2023-06-27 21:49     ` Dave Chinner
2023-06-27 18:14   ` Matt Whitlock
2023-06-27 21:51     ` Dave Chinner
2023-06-28  6:30   ` David Howells
2023-06-28  8:15     ` Dave Chinner
2023-06-28  9:33     ` David Howells
2023-06-28 12:51       ` Matthew Wilcox
2023-06-28 14:19       ` David Howells
2023-06-28 22:41       ` Dave Chinner
2023-06-28 17:35     ` Matt Whitlock
2023-06-28 18:27     ` David Howells
2023-06-28 22:17       ` Dave Chinner
