Linux Btrfs filesystem development
* [PATCH] btrfs-progs: scrub: add extra warning for scrub on single device
@ 2022-08-22  7:15 Qu Wenruo
  2022-08-22  8:00 ` Graham Cobb
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2022-08-22  7:15 UTC (permalink / raw)
  To: linux-btrfs

[PROBLEM]

The command "btrfs scrub start" can be executed with either a single
device or a mount point.

If executed with single device, it will only start a scrub for that
device.

This is fine for SINGLE/DUP/RAID1*/RAID10/RAID0, as they are all
mirror/simple stripe based profiles.

But this can be a problem for RAID56: during a RAID56 scrub, data
stripes are treated as RAID0, and a full stripe read/rebuild is only
attempted when scrubbing the P/Q stripes or rebuilding data.

This means, for the following case:

Dev 1:  |<--- Data stripe 1 (good) -->|
Dev 2:  |<--- Data stripe 2 (good) -->|
Dev 3:  |<--- Parity stripe (bad)  -->|

If we only scrub dev 1 or dev 2, the corrupted parity on dev 3 will not
be repaired, breaking the one corrupted/missing device tolerance
(if we then lose device 1 or 2, no data can be recovered for this full
stripe).

Unfortunately, among existing btrfs RAID56 users there seems to be a
trend of using single-device scrub instead of full filesystem scrub.

[CAUSE]

This is due to the bad design of btrfs_scrub_dev(), which treats data
stripes just like RAID0.
This means a full filesystem scrub will read each data stripe several
times (2x for RAID5, 3x for RAID6), not only causing more IO but also
much slower reads for each device.

At least the end users should be fully informed and can choose their
poison to drink (until a better ioctl is introduced).

[WORKAROUND]

This patch will introduce the following workarounds:

- Extra warning for "btrfs scrub start <dev>"
  If we detect a single-device scrub on a fs with the RAID56 feature,
  we output a warning for the user
  (but still allow the command to execute).

- Extra man page warning.
  Now there is an extra section for scrub and RAID56.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
BTW there is a WIP new ioctl, btrfs_scrub_fs(), that is going to change
the situation for RAID56, not only by reducing the IO, but also by
providing better error reporting (including P/Q mismatch cases) and
simpler, streamlined verification code.

Thus in the future we may switch to the new ioctl, and in that case
this warning will no longer be needed.
---
 Documentation/btrfs-scrub.rst | 13 +++++++++++++
 cmds/scrub.c                  | 29 +++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/Documentation/btrfs-scrub.rst b/Documentation/btrfs-scrub.rst
index 75079eecc9ef..bbf342d4b6c8 100644
--- a/Documentation/btrfs-scrub.rst
+++ b/Documentation/btrfs-scrub.rst
@@ -48,6 +48,19 @@ start [-BdqrRf] [-c <ioprio_class> -n <ioprio_classdata>] <path>|<device>
         configured similar to the ``ionice(1)`` syntax using *-c* and *-n*
         options.  Note that not all IO schedulers honor the ionice settings.
 
+        .. warning::
+                Using ``btrfs scrub start <device>`` on a RAID56 fs is strongly
+                discouraged.
+
+                Due to the current design of RAID56 scrub, single-device
+                scrubbing treats the data stripes as RAID0; full stripe
+                checks are done only for data recovery (data stripes with
+                bad csums) or for the P/Q stripes.
+
+                This means that, when running ``btrfs scrub start <device>``,
+                corruption in the P/Q stripes has a high chance of not being
+                repaired, greatly reducing the robustness of RAID56.
+
         ``Options``
 
         -B
diff --git a/cmds/scrub.c b/cmds/scrub.c
index 7c2d9b79c275..a69de8c1402b 100644
--- a/cmds/scrub.c
+++ b/cmds/scrub.c
@@ -42,6 +42,7 @@
 #include "kernel-shared/volumes.h"
 #include "kernel-shared/disk-io.h"
 #include "common/open-utils.h"
+#include "common/path-utils.h"
 #include "common/units.h"
 #include "cmds/commands.h"
 #include "common/help.h"
@@ -1143,6 +1144,7 @@ static int scrub_start(const struct cmd_struct *cmd, int argc, char **argv,
 		       bool resume)
 {
 	int fdmnt;
+	int sysfsfd = -1;
 	int prg_fd = -1;
 	int fdres = -1;
 	int ret;
@@ -1188,6 +1190,8 @@ static int scrub_start(const struct cmd_struct *cmd, int argc, char **argv,
 	DIR *dirstream = NULL;
 	int force = 0;
 	int nothing_to_resume = 0;
+	bool is_block_dev = false;
+	bool is_raid56 = false;
 
 	while ((c = getopt(argc, argv, "BdqrRc:n:f")) != -1) {
 		switch (c) {
@@ -1242,6 +1246,9 @@ static int scrub_start(const struct cmd_struct *cmd, int argc, char **argv,
 
 	path = argv[optind];
 
+	if (path_is_block_device(path))
+		is_block_dev = true;
+
 	fdmnt = open_path_or_dev_mnt(path, &dirstream, !do_quiet);
 	if (fdmnt < 0)
 		return 1;
@@ -1254,6 +1261,28 @@ static int scrub_start(const struct cmd_struct *cmd, int argc, char **argv,
 		err = 1;
 		goto out;
 	}
+
+	sysfsfd = sysfs_open_fsid_file(fdmnt, "features/raid56");
+	if (sysfsfd >= 0) {
+		is_raid56 = true;
+		close(sysfsfd);
+	}
+
+	/*
+	 * If we're scrubbing a single device of a RAID56 fs, that scrub
+	 * will not verify the data on the other devices unless we're
+	 * scrubbing a P/Q stripe.
+	 * Due to the current scrub_dev ioctl design, a full scrub causes
+	 * much more IO for data stripes, thus full scrub on RAID56 is slow.
+	 *
+	 * Some users run single-device scrub to speed things up, but the
+	 * hidden cost of corrupted/unscrubbed data must be made visible.
+	 */
+	if (is_raid56 && is_block_dev) {
+		warning("scrubbing a single device of a RAID56 btrfs is not recommended");
+		warning("check 'btrfs-scrub'(8) for more details");
+	}
+
 	if (!fi_args.num_devices) {
 		error_on(!do_quiet, "no devices found");
 		err = 1;
-- 
2.37.2




Thread overview: 3+ messages
2022-08-22  7:15 [PATCH] btrfs-progs: scrub: add extra warning for scrub on single device Qu Wenruo
2022-08-22  8:00 ` Graham Cobb
2022-08-22  8:07   ` Qu Wenruo
