From: Noah Misch <noah@leadboat.com>
To: linux-xfs@vger.kernel.org
Subject: After block device error, FICLONE and sync_file_range() make NULs, unlike read()
Date: Tue, 8 Nov 2022 09:24:36 -0800
Message-ID: <20221108172436.GA3613139@rfd.leadboat.com>
Scenario: due to a block device error, the kernel fails to persist some file
content. Even so, read() always returns the file content accurately. The
first FICLONE returns EIO, but every subsequent FICLONE or copy_file_range()
operates as though the file were all zeros. How feasible is it to change FICLONE
and copy_file_range() such that they instead find the bytes that read() finds?
- Kernel is 6.0.0-1-sparc64-smp from Debian sid, running in a Solaris-hosted VM.
- The VM is gcc202 from https://cfarm.tetaneutral.net/machines/list/.
  Accounts are available.
- The outcome is still reproducible in FICLONE issued two days after the
  original block device error. I haven't checked whether it survives a
  reboot.
- The "sync" command did not help.
- The block device errors have been ongoing for years. If curious, see
  https://postgr.es/m/CA+hUKGKfrXnuyk0Z24m8x4_eziuC3kLSaCmEeKPO1DVU9t-qtQ@mail.gmail.com
  for details. (Fixing the sunvdc driver is out of scope for this thread.)
  Other known symptoms are failures in truncate() and fsync(). The system has
  been generally usable for applications not requiring persistence. I saw the
  FICLONE problem after the system updated coreutils from 8.32-4.1 to 9.1-1.
  That introduced a "cp" that uses FICLONE. My current workaround is to place
  a "cp" in my PATH that does 'exec /usr/bin/cp --reflink=never "$@"'.
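For anyone wanting to reproduce the workaround, it could be set up roughly as
follows. This is only a sketch: the function name install_cp_wrapper and the
example directory $HOME/bin are assumptions for illustration; the only part
taken from the description above is the 'exec' line.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): write a "cp" wrapper into the
# directory given as $1, so that it shadows /usr/bin/cp once that directory
# comes first in PATH.
install_cp_wrapper() {
    dir=$1
    mkdir -p "$dir" || return 1
    # The quoted heredoc delimiter keeps "$@" literal in the generated script.
    cat >"$dir/cp" <<'EOF'
#!/bin/sh
# Workaround from the report: never attempt a reflink.
exec /usr/bin/cp --reflink=never "$@"
EOF
    chmod +x "$dir/cp"
}

# Example use (directory is an assumption):
#   install_cp_wrapper "$HOME/bin" && PATH=$HOME/bin:$PATH
```

The wrapper only affects programs that find "cp" through PATH; anything that
invokes /usr/bin/cp by absolute path would still take the default code path.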
The trouble emerged at a "cp". To capture more details, I replaced "cp" with
"trace-cp" containing:
sum "$1"
strace cp "$@" 2>&1 | sed -n '/^geteuid/,$p'
sum "$2"
Output from that follows. FICLONE returns EIO. "cp" then falls back to
copy_file_range(), which yields an all-zeros file:
47831 16384 pg_wal/000000030000000000000003
geteuid() = 1450
openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
fstatat64(AT_FDCWD, "pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
openat(AT_FDCWD, "pg_wal/000000030000000000000003", O_RDONLY) = 4
fstatat64(4, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_WRONLY|O_CREAT|O_EXCL, 0600) = 5
ioctl(5, BTRFS_IOC_CLONE or FICLONE, 4) = -1 EIO (Input/output error)
fstatat64(5, "", {st_mode=S_IFREG|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
fadvise64_64(4, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 16777216
copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 0
close(5) = 0
close(4) = 0
_llseek(0, 0, [0], SEEK_CUR) = 0
close(0) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
00000 16384 /home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003
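The 00000 checksum from "sum" is what shows the destination to be all zeros.
A more direct check might look like the sketch below. The function name
is_all_zeros is an assumption, not something used in the original test; note
that it reads the file with ordinary read() calls via "tr", so it is meant for
the destination file rather than the broken source.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): succeed only if the file is
# non-empty and every byte in it is NUL.
is_all_zeros() {
    file=$1
    size=$(wc -c <"$file") || return 2   # unreadable or missing file
    [ "$size" -gt 0 ] || return 1        # an empty file is not "all zeros"
    # Delete NUL bytes and count what is left; nothing left means all NULs.
    remaining=$(tr -d '\000' <"$file" | wc -c)
    [ "$remaining" -eq 0 ]
}

# Example use:
#   is_all_zeros "$dest" && echo "destination is all NULs"
```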
Subsequent FICLONE returns 0 and yields an all-zeros file. Test script:
set -x
broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
dest=$HOME/tmp/discard
sum "$broken_source"
: 'FICLONE returns 0 and yields an all-zeros file'
strace cp --reflink=always "$broken_source" "$dest" 2>&1 | sed -n '/^geteuid/,$p'
sum "$dest"; rm "$dest"
: 'copy_file_range() returns 0 and yields an all-zeros file'
strace -e copy_file_range cat "$broken_source" >"$dest"
sum "$dest"; rm "$dest"
: 'read() gets the intended bytes'
cat "$broken_source" | cat >"$dest"
sum "$dest"; rm "$dest"
Test script output:
+ broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
+ dest=/home/nm/tmp/discard
+ sum t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
49522 16384 t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
+ : FICLONE returns 0 and yields an all-zeros file
+ strace cp --reflink=always t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003 /home/nm/tmp/discard
+ sed -n /^geteuid/,$p
geteuid() = 1450
openat(AT_FDCWD, "/home/nm/tmp/discard", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
fstatat64(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
openat(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", O_RDONLY) = 3
fstatat64(3, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "/home/nm/tmp/discard", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = 0
close(4) = 0
close(3) = 0
_llseek(0, 0, 0x7feffddf1c0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
close(0) = 0
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
+ sum /home/nm/tmp/discard
00000 16384 /home/nm/tmp/discard
+ rm /home/nm/tmp/discard
+ : copy_file_range() returns 0 and yields an all-zeros file
+ strace -e copy_file_range cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 16777216
copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 0
+++ exited with 0 +++
+ sum /home/nm/tmp/discard
00000 16384 /home/nm/tmp/discard
+ rm /home/nm/tmp/discard
+ : read() gets the intended bytes
+ cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
+ cat
+ sum /home/nm/tmp/discard
49522 16384 /home/nm/tmp/discard
+ rm /home/nm/tmp/discard
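Since read() still returns the intended bytes, a caller that cannot avoid the
affected files could compare each copy against the read() view and fall back
to a pipe-based copy on mismatch. The following is a sketch under stated
assumptions: the function name verified_copy is hypothetical, and it assumes
that "cmp" reads both files with plain read() calls, which I have not verified
on this system.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): copy $1 to $2, then confirm
# that the copy matches what reading the source returns.
verified_copy() {
    src=$1
    dest=$2
    # First attempt: ordinary cp, which may use FICLONE or copy_file_range().
    if cp "$src" "$dest" 2>/dev/null && cmp -s "$src" "$dest"; then
        return 0
    fi
    # Fallback: route the data through a pipe, as in the last test case above.
    rm -f "$dest"
    cat "$src" | cat >"$dest" && cmp -s "$src" "$dest"
}

# Example use:
#   verified_copy "$broken_source" "$dest" || echo "copy failed verification"
```

This costs an extra full read of both files per copy, so it is a diagnostic
aid rather than a replacement for fixing the underlying behavior.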