public inbox for linux-xfs@vger.kernel.org
From: Noah Misch <noah@leadboat.com>
To: linux-xfs@vger.kernel.org
Subject: After block device error, FICLONE and sync_file_range() make NULs, unlike read()
Date: Tue, 8 Nov 2022 09:24:36 -0800
Message-ID: <20221108172436.GA3613139@rfd.leadboat.com>

Scenario: due to a block device error, the kernel fails to persist some file
content.  Even so, read() always returns the file content accurately.  The
first FICLONE returns EIO, but every subsequent FICLONE or copy_file_range()
operates as though the file were all zeros.  How feasible is it to change
FICLONE and copy_file_range() such that they instead find the bytes that
read() finds?

- Kernel is 6.0.0-1-sparc64-smp from Debian sid, running in a Solaris-hosted VM.

- The VM is gcc202 from https://cfarm.tetaneutral.net/machines/list/.
  Accounts are available.

- The outcome is still reproducible with a FICLONE issued two days after the
  original block device error.  I haven't checked whether it survives a
  reboot.

- The "sync" command did not help.

- The block device errors have been ongoing for years.  If curious, see
  https://postgr.es/m/CA+hUKGKfrXnuyk0Z24m8x4_eziuC3kLSaCmEeKPO1DVU9t-qtQ@mail.gmail.com
  for details.  (Fixing the sunvdc driver is out of scope for this thread.)
  Other known symptoms are failures in truncate() and fsync().  The system has
  been generally usable for applications not requiring persistence.  I saw the
  FICLONE problem after the system updated coreutils from 8.32-4.1 to 9.1-1.
  That introduced a "cp" that uses FICLONE.  My current workaround is to place
  a "cp" in my PATH that does 'exec /usr/bin/cp --reflink=never "$@"'.
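
A minimal sketch of that workaround, for anyone who wants to reproduce the
setup.  The helper name install_cp_wrapper and the target directory are
hypothetical; the only line taken from the report is the exec line.  It
assumes the chosen directory precedes /usr/bin in PATH:

```shell
#!/bin/sh
# Hypothetical helper: write a "cp" wrapper into DIR (for example $HOME/bin).
# DIR is assumed to come before /usr/bin in PATH.
install_cp_wrapper() {
    dir=$1
    mkdir -p "$dir" || return 1
    # Quoted heredoc delimiter keeps "$@" literal in the generated script.
    cat >"$dir/cp" <<'EOF'
#!/bin/sh
# Ask the real cp not to reflink, then pass all arguments through.
exec /usr/bin/cp --reflink=never "$@"
EOF
    chmod +x "$dir/cp"
}
```

Usage would be along the lines of 'install_cp_wrapper "$HOME/bin"', after
which "command -v cp" should report the wrapper if PATH is ordered as assumed.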


The trouble emerged at a "cp".  To capture more details, I replaced "cp" with
"trace-cp" containing:

  sum "$1"
  strace cp "$@" 2>&1 | sed -n '/^geteuid/,$p'
  sum "$2"

Output from that follows.  FICLONE returns EIO.  "cp" then falls back to
copy_file_range(), which yields an all-zeros file:

  47831 16384 pg_wal/000000030000000000000003
  geteuid()                               = 1450
  openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
  fstatat64(AT_FDCWD, "pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
  openat(AT_FDCWD, "pg_wal/000000030000000000000003", O_RDONLY) = 4
  fstatat64(4, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
  openat(AT_FDCWD, "/home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003", O_WRONLY|O_CREAT|O_EXCL, 0600) = 5
  ioctl(5, BTRFS_IOC_CLONE or FICLONE, 4) = -1 EIO (Input/output error)
  fstatat64(5, "", {st_mode=S_IFREG|0600, st_size=0, ...}, AT_EMPTY_PATH) = 0
  fadvise64_64(4, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
  copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 16777216
  copy_file_range(4, NULL, 5, NULL, 9223372035781033984, 0) = 0
  close(5)                                = 0
  close(4)                                = 0
  _llseek(0, 0, [0], SEEK_CUR)            = 0
  close(0)                                = 0
  close(1)                                = 0
  close(2)                                = 0
  exit_group(0)                           = ?
  +++ exited with 0 +++
  00000 16384 /home/nm/src/pg/backbranch/extra/src/test/recovery/tmp_check/t_028_pitr_timelines_primary_data/archives/000000030000000000000003
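
The 00000 checksum in the last line is what "sum" prints for a file made up
entirely of NUL bytes, which is how the all-zeros result shows up here.  A
more direct check can be sketched as follows, assuming a shell with "tr",
"head -c" and "wc" available; the function name is_all_zeros is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if FILE contains nothing but NUL bytes.
# An empty file also counts as "all zeros" under this definition.
is_all_zeros() {
    # Delete every NUL byte, keep at most one surviving byte, and count it.
    # A count of 0 means no non-NUL byte exists in the file.
    [ "$(tr -d '\000' <"$1" | head -c 1 | wc -c)" -eq 0 ]
}
```

For example, 'is_all_zeros "$2" && echo "destination is all NULs"' could be
appended to the trace-cp script in place of eyeballing the checksum.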

Subsequent FICLONE returns 0 and yields an all-zeros file.  Test script:

  set -x
  broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  dest=$HOME/tmp/discard
  sum "$broken_source"
  : 'FICLONE returns 0 and yields an all-zeros file'
  strace cp --reflink=always "$broken_source" "$dest" 2>&1 | sed -n '/^geteuid/,$p'
  sum "$dest"; rm "$dest"
  : 'copy_file_range() returns 0 and yields an all-zeros file'
  strace -e copy_file_range cat "$broken_source" >"$dest"
  sum "$dest"; rm "$dest"
  : 'read() gets the intended bytes'
  cat "$broken_source" | cat >"$dest"
  sum "$dest"; rm "$dest"

Test script output:

  + broken_source=t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  + dest=/home/nm/tmp/discard
  + sum t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  49522 16384 t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  + : FICLONE returns 0 and yields an all-zeros file
  + strace cp --reflink=always t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003 /home/nm/tmp/discard
  + sed -n /^geteuid/,$p
  geteuid()                               = 1450
  openat(AT_FDCWD, "/home/nm/tmp/discard", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
  fstatat64(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", {st_mode=S_IFREG|0600, st_size=16777216, ...}, 0) = 0
  openat(AT_FDCWD, "t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003", O_RDONLY) = 3
  fstatat64(3, "", {st_mode=S_IFREG|0600, st_size=16777216, ...}, AT_EMPTY_PATH) = 0
  openat(AT_FDCWD, "/home/nm/tmp/discard", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = 0
  close(4)                                = 0
  close(3)                                = 0
  _llseek(0, 0, 0x7feffddf1c0, SEEK_CUR)  = -1 ESPIPE (Illegal seek)
  close(0)                                = 0
  close(1)                                = 0
  close(2)                                = 0
  exit_group(0)                           = ?
  +++ exited with 0 +++
  + sum /home/nm/tmp/discard
  00000 16384 /home/nm/tmp/discard
  + rm /home/nm/tmp/discard
  + : copy_file_range() returns 0 and yields an all-zeros file
  + strace -e copy_file_range cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 16777216
  copy_file_range(3, NULL, 1, NULL, 9223372035781033984, 0) = 0
  +++ exited with 0 +++
  + sum /home/nm/tmp/discard
  00000 16384 /home/nm/tmp/discard
  + rm /home/nm/tmp/discard
  + : read() gets the intended bytes
  + cat t_028_pitr_timelines_node_pitr_data/pgdata/pg_wal/000000030000000000000003
  + cat
  + sum /home/nm/tmp/discard
  49522 16384 /home/nm/tmp/discard
  + rm /home/nm/tmp/discard
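
The comparison made by hand above (a clone of the file versus the bytes
that read() returns) can be condensed into one check.  This is only a sketch:
the function name check_clone_matches_read and its exit codes are
hypothetical, and it assumes GNU cp with --reflink support and a scratch
directory on the same filesystem as the source:

```shell
#!/bin/sh
# Hypothetical helper: clone SRC with FICLONE (via cp --reflink=always) and
# compare the clone against SRC as read through the normal read() path.
# Exit status: 0 = identical, 1 = clone differs, 2 = clone could not be made.
# Optional second argument: scratch directory (default: SRC's directory).
check_clone_matches_read() {
    src=$1
    scratch=${2:-$(dirname "$src")}
    workdir=$(mktemp -d "$scratch/clonecheck.XXXXXX") || return 2
    status=0
    if cp --reflink=always "$src" "$workdir/clone" 2>/dev/null; then
        # cmp reads both files, so a zeroed clone shows up as a mismatch.
        cmp -s "$src" "$workdir/clone" || status=1
    else
        # EIO, an unsupported filesystem, or any other cp failure.
        status=2
    fi
    rm -rf "$workdir"
    return "$status"
}
```

On the affected file, the expectation from the output above is status 1 once
FICLONE has started returning 0, and status 2 for the first FICLONE that
fails with EIO.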

