public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
@ 2025-05-15  8:00 Qu Wenruo
  2025-05-15  8:00 ` [PATCH v2 1/6] " Qu Wenruo
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Qu Wenruo @ 2025-05-15  8:00 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v2:
- Rename the subcommand to "fix-data-checksum"
  It's better to use full name in the command name

- Remove unused members inside corrupted_block
  The old @extent_bytenr and @extent_len is no longer needed, even for
  the future file-deletion action.

- Fix the bitmap size off-by-1 bug
  We must use the bit 0 to represent mirror 1, or the bitmap size will
  exceed num_mirrors.

- Introduce -i|--interactive mode
  Will ask the user for the action on the corrupted block, including:

  * Ignore
    The default behavior if no command is provided

  * Use specified mirror to update the data checksum item
    The user must input a number inside range [1, num_mirrors].

- Introduce -m|--mirror <num> mode
  Use specified mirror for all corrupted blocks.
  The value <num> must be >= 1. And if the value is larger than the
  actual max mirror number, the real mirror number will be
  `num % (num_mirror + 1)`.

We have a long history of data csum mismatch, caused by direct IO and
buffered being modified during writeback.

Although the problem is worked around in v6.15 (and being backported),
for the affected fs there is no good way to fix them, other than complex
manually find out which files are affected and delete them.

This series introduce the initial implementation of "btrfs rescue
fix-data-checksum", which is designed to fix such problem by either:

- Update the csum items using the data from specified copy

The subcommand has 3 modes so far:

- Readonly mode
  Only report all corrupted blocks and affected files, no repair is
  done.

- Interactive mode
  Ask for what to do, including

  * Ignore (the default)
  * Use certain mirror to update the checksum item

- Mirror mode
  Use specified mirror to update the checksum item, the batch mode of
  the interactive one.

In the future, there will be one more mode:

- Delete mode
  Delete all involved files.

  There are still some points to address before implementing this mode.

Qu Wenruo (6):
  btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  btrfs-progs: fix a bug in btrfs_find_item()
  btrfs-progs: fix-data-checksum: show affected files
  btrfs-progs: fix-data-checksum: introduce interactive mode
  btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch
  btrfs-progs: fix-data-checksum: introduce -m|--mirror option

 Documentation/btrfs-rescue.rst  |  28 ++
 Makefile                        |   2 +-
 cmds/rescue-fix-data-checksum.c | 511 ++++++++++++++++++++++++++++++++
 cmds/rescue.c                   |  65 ++++
 cmds/rescue.h                   |  10 +
 kernel-shared/ctree.c           |  17 +-
 kernel-shared/file-item.c       |   2 +-
 kernel-shared/file-item.h       |   5 +
 8 files changed, 635 insertions(+), 5 deletions(-)
 create mode 100644 cmds/rescue-fix-data-checksum.c

--
2.49.0


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
@ 2025-05-15  8:00 ` Qu Wenruo
  2025-05-30 11:10   ` David Sterba
  2025-05-15  8:00 ` [PATCH v2 2/6] btrfs-progs: fix a bug in btrfs_find_item() Qu Wenruo
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2025-05-15  8:00 UTC (permalink / raw)
  To: linux-btrfs

It's a long known problem that direct IO can lead to data checksum
mismatches if the user space is also modifying its buffer during
writeback.

Although it's fixed in the recent v6.15 release (and backported to older
kernels), we still need a user friendly way to fix those problems.

This patch introduce the dryrun version of "btrfs rescue
fix-data-checksum", which reports the logical bytenr and corrupted mirrors.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 Documentation/btrfs-rescue.rst  |  19 +++
 Makefile                        |   2 +-
 cmds/rescue-fix-data-checksum.c | 288 ++++++++++++++++++++++++++++++++
 cmds/rescue.c                   |  48 ++++++
 cmds/rescue.h                   |   7 +
 5 files changed, 363 insertions(+), 1 deletion(-)
 create mode 100644 cmds/rescue-fix-data-checksum.c

diff --git a/Documentation/btrfs-rescue.rst b/Documentation/btrfs-rescue.rst
index f52e6c26359a..7a237ba8ebbb 100644
--- a/Documentation/btrfs-rescue.rst
+++ b/Documentation/btrfs-rescue.rst
@@ -50,6 +50,25 @@ fix-device-size <device>
 
                 WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
 
+fix-data-checksum <device>
+	fix data checksum mismatch
+
+	There is a long existing problem that if a user space program is doing
+	direct IO and modifies the buffer before the write back finished, it
+	can lead to data checksum mismatches.
+
+	This problem is known but not fixed until upstream release v6.15
+	(backported to older kernels). So it's possible to hit false data
+	checksum mismatch for any long running btrfs.
+
+	In that case this program can be utilized to repair such problem.
+
+        ``Options``
+
+	-r|--readonly
+		readonly mode, only scan and report for data checksum mismatch,
+		do no repair
+
 .. _man-rescue-clear-ino-cache:
 
 clear-ino-cache <device>
diff --git a/Makefile b/Makefile
index 7e36aa425736..523b83495524 100644
--- a/Makefile
+++ b/Makefile
@@ -256,7 +256,7 @@ cmds_objects = cmds/subvolume.o cmds/subvolume-list.o \
 	       cmds/inspect.o cmds/balance.o cmds/send.o cmds/receive.o \
 	       cmds/quota.o cmds/qgroup.o cmds/replace.o check/main.o \
 	       cmds/restore.o cmds/rescue.o cmds/rescue-chunk-recover.o \
-	       cmds/rescue-super-recover.o \
+	       cmds/rescue-super-recover.o cmds/rescue-fix-data-checksum.o \
 	       cmds/property.o cmds/filesystem-usage.o cmds/inspect-dump-tree.o \
 	       cmds/inspect-dump-super.o cmds/inspect-tree-stats.o cmds/filesystem-du.o \
 	       cmds/reflink.o \
diff --git a/cmds/rescue-fix-data-checksum.c b/cmds/rescue-fix-data-checksum.c
new file mode 100644
index 000000000000..7e613ba57f76
--- /dev/null
+++ b/cmds/rescue-fix-data-checksum.c
@@ -0,0 +1,288 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "kerncompat.h"
+#include "kernel-shared/disk-io.h"
+#include "kernel-shared/ctree.h"
+#include "kernel-shared/volumes.h"
+#include "common/messages.h"
+#include "common/open-utils.h"
+#include "cmds/rescue.h"
+
+/*
+ * Record one corrupted data blocks.
+ *
+ * We do not report immediately, this is for future file deleting support.
+ */
+struct corrupted_block {
+	struct list_head list;
+	/* The logical bytenr of the exact corrupted block. */
+	u64 logical;
+
+	/* The amount of mirrors above logical have. */
+	unsigned int num_mirrors;
+
+	/*
+	 * Which mirror failed.
+	 *
+	 * Note, bit 0 means mirror 1, since mirror 0 means choosing a
+	 * live mirror, and we never utilized that mirror 0.
+	 */
+	unsigned long *error_mirror_bitmap;
+};
+
+static int global_repair_mode;
+LIST_HEAD(corrupted_blocks);
+
+static int add_corrupted_block(struct btrfs_fs_info *fs_info, u64 logical,
+			       unsigned int mirror, unsigned int num_mirrors)
+{
+	struct corrupted_block *last;
+	if (list_empty(&corrupted_blocks))
+		goto add;
+
+	last = list_entry(corrupted_blocks.prev, struct corrupted_block, list);
+	/* The last entry is the same, just set update the error mirror bitmap. */
+	if (last->logical == logical) {
+		UASSERT(last->error_mirror_bitmap);
+		set_bit(mirror, last->error_mirror_bitmap);
+		return 0;
+	}
+add:
+	last = calloc(1, sizeof(*last));
+	if (!last)
+		return -ENOMEM;
+	last->error_mirror_bitmap = calloc(1, BITS_TO_LONGS(num_mirrors));
+	if (!last->error_mirror_bitmap) {
+		free(last);
+		return -ENOMEM;
+	}
+	set_bit(mirror - 1, last->error_mirror_bitmap);
+	last->logical = logical;
+	last->num_mirrors = num_mirrors;
+
+	list_add_tail(&last->list, &corrupted_blocks);
+	return 0;
+}
+
+/*
+ * Verify all mirrors for @logical.
+ *
+ * If something critical happened, return <0 and should end the run immediately.
+ * Otherwise return 0, including data checksum mismatch or read failure.
+ */
+static int verify_one_data_block(struct btrfs_fs_info *fs_info,
+				 struct extent_buffer *leaf,
+				 unsigned long leaf_offset, u64 logical,
+				 unsigned int num_mirrors)
+{
+	const u32 blocksize = fs_info->sectorsize;
+	const u32 csum_size = fs_info->csum_size;
+	u8 *buf;
+	u8 csum[BTRFS_CSUM_SIZE];
+	u8 csum_expected[BTRFS_CSUM_SIZE];
+	int ret = 0;
+
+	buf = malloc(blocksize);
+	if (!buf)
+		return -ENOMEM;
+
+	for (int mirror = 1; mirror <= num_mirrors; mirror++) {
+		u64 read_len = blocksize;
+
+		ret = read_data_from_disk(fs_info, buf, logical, &read_len, mirror);
+		if (ret < 0) {
+			/* IO error, add one record. */
+			ret = add_corrupted_block(fs_info, logical, mirror, num_mirrors);
+			if (ret < 0)
+				break;
+		}
+		/* Verify the data checksum. */
+		btrfs_csum_data(fs_info, fs_info->csum_type, buf, csum, blocksize);
+		read_extent_buffer(leaf, csum_expected, leaf_offset, csum_size);
+		if (memcmp(csum_expected, csum, csum_size) != 0) {
+			ret = add_corrupted_block(fs_info, logical, mirror, num_mirrors);
+			if (ret < 0)
+				break;
+		}
+	}
+
+	free(buf);
+	return ret;
+}
+
+static int iterate_one_csum_item(struct btrfs_fs_info *fs_info, struct btrfs_path *path)
+{
+	struct btrfs_key key;
+	const unsigned long item_ptr_off = btrfs_item_ptr_offset(path->nodes[0],
+								 path->slots[0]);
+	const u32 blocksize = fs_info->sectorsize;
+	int num_mirrors;
+	u64 data_size;
+	u64 cur;
+	char *buf;
+	int ret = 0;
+
+	buf = malloc(blocksize);
+	if (!buf)
+		return -ENOMEM;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	data_size = btrfs_item_size(path->nodes[0], path->slots[0]) /
+		    fs_info->csum_size * blocksize;
+	num_mirrors = btrfs_num_copies(fs_info, key.offset, data_size);
+
+	for (cur = 0; cur < data_size; cur += blocksize) {
+		const unsigned long leaf_offset = item_ptr_off +
+			cur / blocksize * fs_info->csum_size;
+
+		ret = verify_one_data_block(fs_info, path->nodes[0], leaf_offset,
+					    key.offset + cur, num_mirrors);
+		if (ret < 0)
+			break;
+	}
+	free(buf);
+	return ret;
+}
+
+static int iterate_csum_root(struct btrfs_fs_info *fs_info, struct btrfs_root *csum_root)
+{
+	struct btrfs_path path = { 0 };
+	struct btrfs_key key;
+	int ret;
+
+	key.objectid = 0;
+	key.type = 0;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, csum_root, &key, &path, 0, 0);
+	if (ret < 0) {
+		errno = -ret;
+		error("failed to get the first tree block of csum tree: %m");
+		return ret;
+	}
+	UASSERT(ret > 0);
+	while (true) {
+		btrfs_item_key_to_cpu(path.nodes[0], &key, path.slots[0]);
+		if (key.type != BTRFS_EXTENT_CSUM_KEY)
+			goto next;
+		ret = iterate_one_csum_item(fs_info, &path);
+		if (ret < 0)
+			break;
+next:
+		ret = btrfs_next_item(csum_root, &path);
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+		if (ret < 0) {
+			errno = -ret;
+			error("failed to get next csum item: %m");
+		}
+	}
+	btrfs_release_path(&path);
+	return ret;
+}
+
+static void report_corrupted_blocks(void)
+{
+	struct corrupted_block *entry;
+
+	if (list_empty(&corrupted_blocks)) {
+		printf("No data checksum mismatch found\n");
+		return;
+	}
+
+	list_for_each_entry(entry, &corrupted_blocks, list) {
+		bool has_printed = false;
+
+		printf("logical=%llu corrtuped mirrors=", entry->logical);
+		/* Poor man's bitmap print. */
+		for (int i = 0; i < entry->num_mirrors; i++) {
+			if (test_bit(i, entry->error_mirror_bitmap)) {
+				if (has_printed)
+					printf(",");
+				/*
+				 * Bit 0 means mirror 1, thus we need to increase
+				 * the value by 1.
+				 */
+				printf("%d", i + 1);
+				has_printed=true;
+			}
+		}
+		printf("\n");
+	}
+}
+
+static void free_corrupted_blocks(void)
+{
+	while (!list_empty(&corrupted_blocks)) {
+		struct corrupted_block *entry;
+
+		entry = list_entry(corrupted_blocks.next, struct corrupted_block, list);
+		list_del_init(&entry->list);
+		free(entry->error_mirror_bitmap);
+		free(entry);
+	}
+}
+
+int btrfs_recover_fix_data_checksum(const char *path,
+				    enum btrfs_fix_data_checksum_mode mode)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_root *csum_root;
+	struct open_ctree_args oca = { 0 };
+	int ret;
+
+	if (mode >= BTRFS_FIX_DATA_CSUMS_LAST)
+		return -EINVAL;
+
+	ret = check_mounted(path);
+	if (ret < 0) {
+		errno = -ret;
+		error("could not check mount status: %m");
+		return ret;
+	}
+	if (ret > 0) {
+		error("%s is currently mounted", path);
+		return -EBUSY;
+	}
+
+	global_repair_mode = mode;
+	oca.filename = path;
+	oca.flags = OPEN_CTREE_WRITES;
+	fs_info = open_ctree_fs_info(&oca);
+	if (!fs_info) {
+		error("failed to open btrfs at %s", path);
+		return -EIO;
+	}
+	csum_root = btrfs_csum_root(fs_info, 0);
+	if (!csum_root) {
+		error("failed to get csum root");
+		ret = -EIO;
+		goto out_close;
+	}
+	ret = iterate_csum_root(fs_info, csum_root);
+	if (ret) {
+		errno = -ret;
+		error("failed to iterate csum tree: %m");
+	}
+	report_corrupted_blocks();
+out_close:
+	free_corrupted_blocks();
+	close_ctree_fs_info(fs_info);
+	return ret;
+}
diff --git a/cmds/rescue.c b/cmds/rescue.c
index c60bf11675b9..0d3f86f7e8d6 100644
--- a/cmds/rescue.c
+++ b/cmds/rescue.c
@@ -22,6 +22,7 @@
 #include <errno.h>
 #include <stdbool.h>
 #include <unistd.h>
+#include <getopt.h>
 #include "kernel-lib/list.h"
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/volumes.h"
@@ -275,6 +276,52 @@ out:
 }
 static DEFINE_SIMPLE_COMMAND(rescue_fix_device_size, "fix-device-size");
 
+static const char * const cmd_rescue_fix_data_checksum_usage[] = {
+	"btrfs rescue fix-data-checksum <device>",
+	"Fix data checksum mismatches.",
+	"",
+	OPTLINE("-r", "readonly mode, only report errors without repair"),
+	HELPINFO_INSERT_GLOBALS,
+	HELPINFO_INSERT_VERBOSE,
+	NULL
+};
+
+static int cmd_rescue_fix_data_checksum(const struct cmd_struct *cmd,
+					int argc, char **argv)
+{
+	enum btrfs_fix_data_checksum_mode mode = BTRFS_FIX_DATA_CSUMS_READONLY;
+	int ret;
+	optind = 0;
+
+	while (1) {
+		int c;
+		enum { GETOPT_VAL_DRYRUN = GETOPT_VAL_FIRST, };
+		static const struct option long_options [] = {
+			{"readonly", no_argument, NULL, 'r'},
+			{"NULL", 0, NULL, 0},
+		};
+		c = getopt_long(argc, argv, "r", long_options, NULL);
+		if (c < 0)
+			break;
+		switch (c) {
+		case 'r':
+			mode = BTRFS_FIX_DATA_CSUMS_READONLY;
+			break;
+		default:
+			usage_unknown_option(cmd, argv);
+		}
+	}
+	if (check_argc_min(argc - optind, 1))
+		return 1;
+	ret = btrfs_recover_fix_data_checksum(argv[optind], mode);
+	if (ret < 0) {
+		errno = -ret;
+		error("failed to fix data checksums: %m");
+	}
+	return !!ret;
+}
+static DEFINE_SIMPLE_COMMAND(rescue_fix_data_checksum, "fix-data-checksum");
+
 static const char * const cmd_rescue_create_control_device_usage[] = {
 	"btrfs rescue create-control-device",
 	"Create /dev/btrfs-control (see 'CONTROL DEVICE' in btrfs(5))",
@@ -527,6 +574,7 @@ static const struct cmd_group rescue_cmd_group = {
 		&cmd_struct_rescue_super_recover,
 		&cmd_struct_rescue_zero_log,
 		&cmd_struct_rescue_fix_device_size,
+		&cmd_struct_rescue_fix_data_checksum,
 		&cmd_struct_rescue_create_control_device,
 		&cmd_struct_rescue_clear_ino_cache,
 		&cmd_struct_rescue_clear_space_cache,
diff --git a/cmds/rescue.h b/cmds/rescue.h
index 5a9e46b7aae0..331b595f1c6f 100644
--- a/cmds/rescue.h
+++ b/cmds/rescue.h
@@ -20,7 +20,14 @@
 #ifndef __BTRFS_RESCUE_H__
 #define __BTRFS_RESCUE_H__
 
+enum btrfs_fix_data_checksum_mode {
+	BTRFS_FIX_DATA_CSUMS_READONLY,
+	BTRFS_FIX_DATA_CSUMS_LAST,
+};
+
 int btrfs_recover_superblocks(const char *path, int yes);
 int btrfs_recover_chunk_tree(const char *path, int yes);
+int btrfs_recover_fix_data_checksum(const char *path,
+				    enum btrfs_fix_data_checksum_mode mode);
 
 #endif
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/6] btrfs-progs: fix a bug in btrfs_find_item()
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
  2025-05-15  8:00 ` [PATCH v2 1/6] " Qu Wenruo
@ 2025-05-15  8:00 ` Qu Wenruo
  2025-05-30 11:11   ` David Sterba
  2025-05-15  8:00 ` [PATCH v2 3/6] btrfs-progs: fix-data-checksum: show affected files Qu Wenruo
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2025-05-15  8:00 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There is a seldomly utilized function, btrfs_find_item(), which has no
document and is not behaving correctly.

Inside backref.c, iterate_inode_refs() and btrfs_ref_to_path() both
utilize this function, to find the parent inode.

However btrfs_find_item() will never return 0 if @ioff is passed as 0
for such usage, result early failure for all kinds of inode iteration
functions.

[CAUSE]
Both functions pass 0 as the @ioff parameter initially, e.g:

 We have the following fs tree root:

  	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		generation 3 transid 9 size 6 nbytes 16384
		block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
		sequence 1 flags 0x0(none)
	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
		index 0 namelen 2 name: ..
	item 2 key (256 DIR_ITEM 2507850652) itemoff 16078 itemsize 33
		location key (257 INODE_ITEM 0) type FILE
		transid 9 data_len 0 name_len 3
		name: foo
	item 3 key (256 DIR_INDEX 2) itemoff 16045 itemsize 33
		location key (257 INODE_ITEM 0) type FILE
		transid 9 data_len 0 name_len 3
		name: foo
	item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
		generation 9 transid 9 size 16384 nbytes 16384
		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
		sequence 4 flags 0x0(none)
	item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
		index 2 namelen 3 name: foo
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		generation 9 type 1 (regular)
		extent data disk byte 13631488 nr 16384
		extent data offset 0 nr 16384 ram 16384
		extent compression 0 (none)

  Then we call paths_from_inode() with:
  - @inum = 257
  - ipath = {.fs_root = 5}

  Then we got the following sequence:

  iterate_irefs(257, fs_root, inode_to_path)
  |- iterate_inode_refs()
     |- inode_ref_info()
        |- btrfs_find_item(257, 0, fs_root)
	|  Returned 1, with @found_key updated to
	|  (257, INODE_REF, 256).

  This makes iterate_irefs() exit immediately, but obviously that
  btrfs_find_item() call is to find any INODE_REF, not to find the
  exact match.

[FIX]
If btrfs_find_item() found an item matching the objectid and type, then
it should return 0 other than 1.

Fix it and keep the behavior the same across btrfs-progs and the kernel.

Since we're here, also add some comments explaining the function.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 kernel-shared/ctree.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel-shared/ctree.c b/kernel-shared/ctree.c
index 3184c916175e..f90de606e7b1 100644
--- a/kernel-shared/ctree.c
+++ b/kernel-shared/ctree.c
@@ -1246,6 +1246,17 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
 	}
 }
 
+/*
+ * Find the first key in @fs_root that matches all the following conditions:
+ *
+ * - key.obojectid == @iobjectid
+ * - key.type == @key_type
+ * - key.offset >= ioff
+ *
+ * Return 0 if such key can be found, and @found_key is updated.
+ * Return >0 if no such key can be found.
+ * Return <0 for critical errors.
+ */
 int btrfs_find_item(struct btrfs_root *fs_root, struct btrfs_path *found_path,
 		u64 iobjectid, u64 ioff, u8 key_type,
 		struct btrfs_key *found_key)
@@ -1280,10 +1291,10 @@ int btrfs_find_item(struct btrfs_root *fs_root, struct btrfs_path *found_path,
 
 	btrfs_item_key_to_cpu(eb, found_key, path->slots[0]);
 	if (found_key->type != key.type ||
-			found_key->objectid != key.objectid) {
+	    found_key->objectid != key.objectid)
 		ret = 1;
-		goto out;
-	}
+	else
+		ret = 0;
 
 out:
 	if (path != found_path)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/6] btrfs-progs: fix-data-checksum: show affected files
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
  2025-05-15  8:00 ` [PATCH v2 1/6] " Qu Wenruo
  2025-05-15  8:00 ` [PATCH v2 2/6] btrfs-progs: fix a bug in btrfs_find_item() Qu Wenruo
@ 2025-05-15  8:00 ` Qu Wenruo
  2025-05-30 11:14   ` David Sterba
  2025-05-15  8:00 ` [PATCH v2 4/6] btrfs-progs: fix-data-checksum: introduce interactive mode Qu Wenruo
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2025-05-15  8:00 UTC (permalink / raw)
  To: linux-btrfs

Previously "btrfs rescue fix-data-checksum" only show affected logical
bytenr, which is not helpful to determine which files are affected.

Enhance the output by also outputting the affected subvolumes (in
numeric form), and the file paths inside that subvolume.

The example looks like this:

  logical=13631488 corrtuped mirrors=1 affected files:
    (subvolume 5)/foo
    (subvolume 5)/dir/bar
  logical=13635584 corrtuped mirrors=1 affected files:
    (subvolume 5)/foo
    (subvolume 5)/dir/bar

Although the end result is still not perfect, it's still much easier to
find out which files are affected.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 cmds/rescue-fix-data-checksum.c | 59 +++++++++++++++++++++++++++++++--
 1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/cmds/rescue-fix-data-checksum.c b/cmds/rescue-fix-data-checksum.c
index 7e613ba57f76..bf3a65b31c71 100644
--- a/cmds/rescue-fix-data-checksum.c
+++ b/cmds/rescue-fix-data-checksum.c
@@ -18,6 +18,7 @@
 #include "kernel-shared/disk-io.h"
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/volumes.h"
+#include "kernel-shared/backref.h"
 #include "common/messages.h"
 #include "common/open-utils.h"
 #include "cmds/rescue.h"
@@ -158,6 +159,49 @@ static int iterate_one_csum_item(struct btrfs_fs_info *fs_info, struct btrfs_pat
 	return ret;
 }
 
+static int print_filenames(u64 ino, u64 offset, u64 rootid, void *ctx)
+{
+	struct btrfs_fs_info *fs_info = ctx;
+	struct btrfs_root *root;
+	struct btrfs_key key;
+	struct inode_fs_paths *ipath;
+	struct btrfs_path path = { 0 };
+	int ret;
+
+	key.objectid = rootid;
+	key.type = BTRFS_ROOT_ITEM_KEY;
+	key.offset = (u64)-1;
+
+	root = btrfs_read_fs_root(fs_info, &key);
+	if (IS_ERR(root)) {
+		ret = PTR_ERR(root);
+		errno = -ret;
+		error("failed to get subvolume %llu: %m", rootid);
+		return ret;
+	}
+	ipath = init_ipath(128 * BTRFS_PATH_NAME_MAX, root, &path);
+	if (IS_ERR(ipath)) {
+		ret = PTR_ERR(ipath);
+		errno = -ret;
+		error("failed to initialize ipath: %m");
+		return ret;
+	}
+	ret = paths_from_inode(ino, ipath);
+	if (ret < 0) {
+		errno = -ret;
+		error("failed to resolve root %llu ino %llu to paths: %m", rootid, ino);
+		goto out;
+	}
+	for (int i = 0; i < ipath->fspath->elem_cnt; i++)
+		printf("  (subvolume %llu)/%s\n", rootid, (char *)ipath->fspath->val[i]);
+	if (ipath->fspath->elem_missed)
+		printf("  (subvolume %llu) %d files not printed\n", rootid,
+		       ipath->fspath->elem_missed);
+out:
+	free_ipath(ipath);
+	return ret;
+}
+
 static int iterate_csum_root(struct btrfs_fs_info *fs_info, struct btrfs_root *csum_root)
 {
 	struct btrfs_path path = { 0 };
@@ -197,9 +241,10 @@ next:
 	return ret;
 }
 
-static void report_corrupted_blocks(void)
+static void report_corrupted_blocks(struct btrfs_fs_info *fs_info)
 {
 	struct corrupted_block *entry;
+	struct btrfs_path path = { 0 };
 
 	if (list_empty(&corrupted_blocks)) {
 		printf("No data checksum mismatch found\n");
@@ -208,6 +253,7 @@ static void report_corrupted_blocks(void)
 
 	list_for_each_entry(entry, &corrupted_blocks, list) {
 		bool has_printed = false;
+		int ret;
 
 		printf("logical=%llu corrtuped mirrors=", entry->logical);
 		/* Poor man's bitmap print. */
@@ -223,7 +269,14 @@ static void report_corrupted_blocks(void)
 				has_printed=true;
 			}
 		}
-		printf("\n");
+		printf(" affected files:\n");
+		ret = iterate_inodes_from_logical(entry->logical, fs_info, &path,
+						  print_filenames, fs_info);
+		if (ret < 0) {
+			errno = -ret;
+			error("failed to iterate involved files: %m");
+			break;
+		}
 	}
 }
 
@@ -280,7 +333,7 @@ int btrfs_recover_fix_data_checksum(const char *path,
 		errno = -ret;
 		error("failed to iterate csum tree: %m");
 	}
-	report_corrupted_blocks();
+	report_corrupted_blocks(fs_info);
 out_close:
 	free_corrupted_blocks();
 	close_ctree_fs_info(fs_info);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 4/6] btrfs-progs: fix-data-checksum: introduce interactive mode
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-05-15  8:00 ` [PATCH v2 3/6] btrfs-progs: fix-data-checksum: show affected files Qu Wenruo
@ 2025-05-15  8:00 ` Qu Wenruo
  2025-05-30 11:18   ` David Sterba
  2025-05-15  8:00 ` [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2025-05-15  8:00 UTC (permalink / raw)
  To: linux-btrfs

This mode will ask user for how to fix each block.

User input can match the first letter or the whole action name to
specify given action, the input is verified case insensitive.

If no user input is provided, the default action is to ignore the
corrupted block.

If the input matches no action, a warning is outputted and user must
retry until a valid input is provided.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 Documentation/btrfs-rescue.rst  |  3 ++
 cmds/rescue-fix-data-checksum.c | 69 ++++++++++++++++++++++++++++++++-
 cmds/rescue.c                   |  9 ++++-
 cmds/rescue.h                   |  1 +
 4 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-rescue.rst b/Documentation/btrfs-rescue.rst
index 7a237ba8ebbb..0e6da4397f51 100644
--- a/Documentation/btrfs-rescue.rst
+++ b/Documentation/btrfs-rescue.rst
@@ -69,6 +69,9 @@ fix-data-checksum <device>
 		readonly mode, only scan and report for data checksum mismatch,
 		do no repair
 
+	-i|--interactive
+		interactive mode, ask for how to repair, ignore the error by default
+
 .. _man-rescue-clear-ino-cache:
 
 clear-ino-cache <device>
diff --git a/cmds/rescue-fix-data-checksum.c b/cmds/rescue-fix-data-checksum.c
index bf3a65b31c71..9a4f3fd73aaf 100644
--- a/cmds/rescue-fix-data-checksum.c
+++ b/cmds/rescue-fix-data-checksum.c
@@ -14,6 +14,7 @@
  * Boston, MA 021110-1307, USA.
  */
 
+#include <ctype.h>
 #include "kerncompat.h"
 #include "kernel-shared/disk-io.h"
 #include "kernel-shared/ctree.h"
@@ -45,6 +46,21 @@ struct corrupted_block {
 	unsigned long *error_mirror_bitmap;
 };
 
+enum fix_data_checksum_action_value {
+	ACTION_IGNORE,
+	ACTION_LAST,
+};
+
+static const struct fix_data_checksum_action {
+	enum fix_data_checksum_action_value value;
+	const char *string;
+} actions[] = {
+	[ACTION_IGNORE] = {
+		.value = ACTION_IGNORE,
+		.string = "ignore",
+	},
+};
+
 static int global_repair_mode;
 LIST_HEAD(corrupted_blocks);
 
@@ -241,10 +257,49 @@ next:
 	return ret;
 }
 
-static void report_corrupted_blocks(struct btrfs_fs_info *fs_info)
+#define ASK_ACTION_BUFSIZE	(32)
+static enum fix_data_checksum_action_value ask_action()
+{
+	char buf[ASK_ACTION_BUFSIZE] = { 0 };
+	bool printed;
+
+again:
+	printed = false;
+	for (int i = 0; i < ACTION_LAST; i++) {
+		if (printed)
+			printf("/");
+		/* Mark Ignore as default */
+		if (i == ACTION_IGNORE)
+			printf("<<%c>>%s", toupper(actions[i].string[0]),
+			       actions[i].string + 1);
+		else
+			printf("<%c>%s", toupper(actions[i].string[0]),
+			       actions[i].string + 1);
+	}
+	printf(":");
+	fflush(stdout);
+	/* Default to Ignore if no action provided. */
+	if (!fgets(buf, sizeof(buf) - 1, stdin))
+		return ACTION_IGNORE;
+	if (buf[0] == '\n')
+		return ACTION_IGNORE;
+	/* Check exact match or matching the initial letter. */
+	for (int i = 0; i < ACTION_LAST; i++) {
+		if (strncasecmp(buf, actions[i].string, 1) == 0 ||
+		    strncasecmp(buf, actions[i].string, ASK_ACTION_BUFSIZE) == 0)
+			return actions[i].value;
+	}
+	/* No valid action found, retry. */
+	warning("invalid action, please retry");
+	goto again;
+}
+
+static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
+				    enum btrfs_fix_data_checksum_mode mode)
 {
 	struct corrupted_block *entry;
 	struct btrfs_path path = { 0 };
+	enum fix_data_checksum_action_value action;
 
 	if (list_empty(&corrupted_blocks)) {
 		printf("No data checksum mismatch found\n");
@@ -277,6 +332,16 @@ static void report_corrupted_blocks(struct btrfs_fs_info *fs_info)
 			error("failed to iterate involved files: %m");
 			break;
 		}
+		switch (mode) {
+		case BTRFS_FIX_DATA_CSUMS_INTERACTIVE:
+			action = ask_action();
+			UASSERT(action == ACTION_IGNORE);
+			fallthrough;
+		case BTRFS_FIX_DATA_CSUMS_READONLY:
+			break;
+		default:
+			UASSERT(0);
+		}
 	}
 }
 
@@ -333,7 +398,7 @@ int btrfs_recover_fix_data_checksum(const char *path,
 		errno = -ret;
 		error("failed to iterate csum tree: %m");
 	}
-	report_corrupted_blocks(fs_info);
+	report_corrupted_blocks(fs_info, mode);
 out_close:
 	free_corrupted_blocks();
 	close_ctree_fs_info(fs_info);
diff --git a/cmds/rescue.c b/cmds/rescue.c
index 0d3f86f7e8d6..0c19de47895a 100644
--- a/cmds/rescue.c
+++ b/cmds/rescue.c
@@ -280,7 +280,8 @@ static const char * const cmd_rescue_fix_data_checksum_usage[] = {
 	"btrfs rescue fix-data-checksum <device>",
 	"Fix data checksum mismatches.",
 	"",
-	OPTLINE("-r", "readonly mode, only report errors without repair"),
+	OPTLINE("-r|--readonly", "readonly mode, only report errors without repair"),
+	OPTLINE("-i|--interactive", "interactive mode, ignore the error by default."),
 	HELPINFO_INSERT_GLOBALS,
 	HELPINFO_INSERT_VERBOSE,
 	NULL
@@ -298,15 +299,19 @@ static int cmd_rescue_fix_data_checksum(const struct cmd_struct *cmd,
 		enum { GETOPT_VAL_DRYRUN = GETOPT_VAL_FIRST, };
 		static const struct option long_options [] = {
 			{"readonly", no_argument, NULL, 'r'},
+			{"interactive", no_argument, NULL, 'i'},
 			{"NULL", 0, NULL, 0},
 		};
-		c = getopt_long(argc, argv, "r", long_options, NULL);
+		c = getopt_long(argc, argv, "ri", long_options, NULL);
 		if (c < 0)
 			break;
 		switch (c) {
 		case 'r':
 			mode = BTRFS_FIX_DATA_CSUMS_READONLY;
 			break;
+		case 'i':
+			mode = BTRFS_FIX_DATA_CSUMS_INTERACTIVE;
+			break;
 		default:
 			usage_unknown_option(cmd, argv);
 		}
diff --git a/cmds/rescue.h b/cmds/rescue.h
index 331b595f1c6f..4ae43cbb4cd4 100644
--- a/cmds/rescue.h
+++ b/cmds/rescue.h
@@ -22,6 +22,7 @@
 
 enum btrfs_fix_data_checksum_mode {
 	BTRFS_FIX_DATA_CSUMS_READONLY,
+	BTRFS_FIX_DATA_CSUMS_INTERACTIVE,
 	BTRFS_FIX_DATA_CSUMS_LAST,
 };
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
                   ` (3 preceding siblings ...)
  2025-05-15  8:00 ` [PATCH v2 4/6] btrfs-progs: fix-data-checksum: introduce interactive mode Qu Wenruo
@ 2025-05-15  8:00 ` Qu Wenruo
  2025-05-30 11:19   ` David Sterba
  2025-05-15  8:00 ` [PATCH v2 6/6] btrfs-progs: fix-data-checksum: introduce -m|--mirror option Qu Wenruo
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2025-05-15  8:00 UTC (permalink / raw)
  To: linux-btrfs

This adds a new group of action in the interactive mode to fix a csum
mismatch.

The output looks like this:

  logical=13631488 corrtuped mirrors=1 affected files:
    (subvolume 5)/foo
    (subvolume 5)/dir/bar
  <<I>>gnore/<1>:1
  Csum item for logical 13631488 updated using data from mirror 1

In the interactive mode, the update-csum-item behavior is outputted as
all available mirror numbers.

Considering all the existing (and future) action should starts with an
alphabet, it's pretty easy to distinguish mirror number from other
actions.

The update-csum-item action itself is pretty straight-forward, just read
out the data from specified mirror, then calculate a new checksum, and
update the corresponding csum item in csum tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 cmds/rescue-fix-data-checksum.c | 119 +++++++++++++++++++++++++++++---
 kernel-shared/file-item.c       |   2 +-
 kernel-shared/file-item.h       |   5 ++
 3 files changed, 114 insertions(+), 12 deletions(-)

diff --git a/cmds/rescue-fix-data-checksum.c b/cmds/rescue-fix-data-checksum.c
index 9a4f3fd73aaf..86924d61823e 100644
--- a/cmds/rescue-fix-data-checksum.c
+++ b/cmds/rescue-fix-data-checksum.c
@@ -20,6 +20,8 @@
 #include "kernel-shared/ctree.h"
 #include "kernel-shared/volumes.h"
 #include "kernel-shared/backref.h"
+#include "kernel-shared/transaction.h"
+#include "kernel-shared/file-item.h"
 #include "common/messages.h"
 #include "common/open-utils.h"
 #include "cmds/rescue.h"
@@ -48,6 +50,7 @@ struct corrupted_block {
 
 enum fix_data_checksum_action_value {
 	ACTION_IGNORE,
+	ACTION_UPDATE_CSUM,
 	ACTION_LAST,
 };
 
@@ -59,6 +62,10 @@ static const struct fix_data_checksum_action {
 		.value = ACTION_IGNORE,
 		.string = "ignore",
 	},
+	[ACTION_UPDATE_CSUM] = {
+		.value = ACTION_UPDATE_CSUM,
+		.string = "update-csum",
+	},
 };
 
 static int global_repair_mode;
@@ -258,10 +265,13 @@ next:
 }
 
 #define ASK_ACTION_BUFSIZE	(32)
-static enum fix_data_checksum_action_value ask_action()
+static enum fix_data_checksum_action_value ask_action(unsigned int num_mirrors,
+						      unsigned int *mirror_ret)
 {
+	unsigned long ret;
 	char buf[ASK_ACTION_BUFSIZE] = { 0 };
 	bool printed;
+	char *endptr;
 
 again:
 	printed = false;
@@ -269,12 +279,22 @@ again:
 		if (printed)
 			printf("/");
 		/* Mark Ignore as default */
-		if (i == ACTION_IGNORE)
+		if (i == ACTION_IGNORE) {
 			printf("<<%c>>%s", toupper(actions[i].string[0]),
 			       actions[i].string + 1);
-		else
+		} else if (i == ACTION_UPDATE_CSUM) {
+			/*
+			 * For update-csum action, we need a mirror number,
+			 * so output all valid mirrors numbers instead.
+			 */
+			for (int cur_mirror = 1; cur_mirror <= num_mirrors;
+			     cur_mirror++)
+				printf("<%u>", cur_mirror);
+		} else {
 			printf("<%c>%s", toupper(actions[i].string[0]),
 			       actions[i].string + 1);
+		}
+		printed = true;
 	}
 	printf(":");
 	fflush(stdout);
@@ -285,13 +305,79 @@ again:
 		return ACTION_IGNORE;
 	/* Check exact match or matching the initial letter. */
 	for (int i = 0; i < ACTION_LAST; i++) {
-		if (strncasecmp(buf, actions[i].string, 1) == 0 ||
-		    strncasecmp(buf, actions[i].string, ASK_ACTION_BUFSIZE) == 0)
+		if ((strncasecmp(buf, actions[i].string, 1) == 0 ||
+		     strncasecmp(buf, actions[i].string, ASK_ACTION_BUFSIZE) == 0) &&
+		     actions[i].value != ACTION_UPDATE_CSUM)
 			return actions[i].value;
 	}
-	/* No valid action found, retry. */
-	warning("invalid action, please retry");
-	goto again;
+	/* No match, check if it's some numeric string. */
+	ret = strtoul(buf, &endptr, 10);
+	if (endptr == buf || ret == ULONG_MAX) {
+		/* No valid action found, retry. */
+		warning("invalid action, please retry");
+		goto again;
+	}
+	if (ret > num_mirrors || ret == 0) {
+		warning("invalid mirror number %lu, must be in range [1, %d], please retry",
+			ret, num_mirrors);
+		goto again;
+	}
+	*mirror_ret = ret;
+	return ACTION_UPDATE_CSUM;
+}
+
+static int update_csum_item(struct btrfs_fs_info *fs_info, u64 logical,
+			    unsigned int mirror)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *csum_root = btrfs_csum_root(fs_info, logical);
+	struct btrfs_path path = { 0 };
+	struct btrfs_csum_item *citem;
+	u64 read_len = fs_info->sectorsize;
+	u8 csum[BTRFS_CSUM_SIZE] = { 0 };
+	u8 *buf;
+	int ret;
+
+	buf = malloc(fs_info->sectorsize);
+	if (!buf)
+		return -ENOMEM;
+	ret = read_data_from_disk(fs_info, buf, logical, &read_len, mirror);
+	if (ret < 0) {
+		errno = -ret;
+		error("failed to read block at logical %llu mirror %u: %m",
+			logical, mirror);
+		goto out;
+	}
+	trans = btrfs_start_transaction(csum_root, 1);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		errno = -ret;
+		error_msg(ERROR_MSG_START_TRANS, "%m");
+		goto out;
+	}
+	citem = btrfs_lookup_csum(trans, csum_root, &path, logical,
+				  BTRFS_EXTENT_CSUM_OBJECTID, fs_info->csum_type, 1);
+	if (IS_ERR(citem)) {
+		ret = PTR_ERR(citem);
+		errno = -ret;
+		error("failed to find csum item for logical %llu: $m", logical);
+		btrfs_abort_transaction(trans, ret);
+		goto out;
+	}
+	btrfs_csum_data(fs_info, fs_info->csum_type, buf, csum, fs_info->sectorsize);
+	write_extent_buffer(path.nodes[0], csum, (unsigned long)citem, fs_info->csum_size);
+	btrfs_release_path(&path);
+	ret = btrfs_commit_transaction(trans, csum_root);
+	if (ret < 0) {
+		errno = -ret;
+		error_msg(ERROR_MSG_COMMIT_TRANS, "%m");
+	}
+	printf("Csum item for logical %llu updated using data from mirror %u\n",
+		logical, mirror);
+out:
+	free(buf);
+	btrfs_release_path(&path);
+	return ret;
 }
 
 static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
@@ -307,6 +393,7 @@ static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
 	}
 
 	list_for_each_entry(entry, &corrupted_blocks, list) {
+		unsigned int mirror;
 		bool has_printed = false;
 		int ret;
 
@@ -334,10 +421,20 @@ static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
 		}
 		switch (mode) {
 		case BTRFS_FIX_DATA_CSUMS_INTERACTIVE:
-			action = ask_action();
-			UASSERT(action == ACTION_IGNORE);
-			fallthrough;
+			action = ask_action(entry->num_mirrors, &mirror);
+			break;
 		case BTRFS_FIX_DATA_CSUMS_READONLY:
+			action = ACTION_IGNORE;
+			break;
+		default:
+			UASSERT(0);
+		}
+
+		switch (action) {
+		case ACTION_IGNORE:
+			break;
+		case ACTION_UPDATE_CSUM:
+			ret = update_csum_item(fs_info, entry->logical, mirror);
 			break;
 		default:
 			UASSERT(0);
diff --git a/kernel-shared/file-item.c b/kernel-shared/file-item.c
index 18791c0647b7..503ad657c661 100644
--- a/kernel-shared/file-item.c
+++ b/kernel-shared/file-item.c
@@ -112,7 +112,7 @@ fail:
 	return err;
 }
 
-static struct btrfs_csum_item *
+struct btrfs_csum_item *
 btrfs_lookup_csum(struct btrfs_trans_handle *trans,
 		  struct btrfs_root *root,
 		  struct btrfs_path *path,
diff --git a/kernel-shared/file-item.h b/kernel-shared/file-item.h
index cab0bc4e9ce0..5a5d8da10266 100644
--- a/kernel-shared/file-item.h
+++ b/kernel-shared/file-item.h
@@ -89,6 +89,11 @@ int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_file_extent_item *stack_fi);
 int btrfs_csum_file_block(struct btrfs_trans_handle *trans, u64 logical,
 			  u64 csum_objectid, u32 csum_type, const char *data);
+struct btrfs_csum_item *
+btrfs_lookup_csum(struct btrfs_trans_handle *trans,
+		  struct btrfs_root *root,
+		  struct btrfs_path *path,
+		  u64 bytenr, u64 csum_objectid, u16 csum_type, int cow);
 int btrfs_insert_inline_extent(struct btrfs_trans_handle *trans,
 			       struct btrfs_root *root, u64 objectid,
 			       u64 offset, const char *buffer, size_t size,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 6/6] btrfs-progs: fix-data-checksum: introduce -m|--mirror option
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
                   ` (4 preceding siblings ...)
  2025-05-15  8:00 ` [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch Qu Wenruo
@ 2025-05-15  8:00 ` Qu Wenruo
  2025-05-30 11:07 ` [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" David Sterba
  2025-05-31 16:58 ` Goffredo Baroncelli
  7 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2025-05-15  8:00 UTC (permalink / raw)
  To: linux-btrfs

This option allows "btrfs rescue fix-data-checksum" to use the specified
mirror number to update checksum item for all corrupted mirrors.

If the specified number is larger than the max number of mirrors, the
real mirror number will be `num % (num_mirrors + 1)`.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 Documentation/btrfs-rescue.rst  |  6 ++++++
 cmds/rescue-fix-data-checksum.c | 16 ++++++++++++----
 cmds/rescue.c                   | 16 ++++++++++++++--
 cmds/rescue.h                   |  4 +++-
 4 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/Documentation/btrfs-rescue.rst b/Documentation/btrfs-rescue.rst
index 0e6da4397f51..7fc2bde590d2 100644
--- a/Documentation/btrfs-rescue.rst
+++ b/Documentation/btrfs-rescue.rst
@@ -72,6 +72,12 @@ fix-data-checksum <device>
 	-i|--interactive
 		interactive mode, ask for how to repair, ignore the error by default
 
+	-m|--mirror <num>
+		use specified mirror to update the checksum item for all corrupted blocks.
+
+		The value must be >= 1, and if the corrupted block has less mirrors than
+		the value, the mirror number will be `num % (num_mirrors + 1)`.
+
 .. _man-rescue-clear-ino-cache:
 
 clear-ino-cache <device>
diff --git a/cmds/rescue-fix-data-checksum.c b/cmds/rescue-fix-data-checksum.c
index 86924d61823e..23b59fffe2f7 100644
--- a/cmds/rescue-fix-data-checksum.c
+++ b/cmds/rescue-fix-data-checksum.c
@@ -381,7 +381,8 @@ out:
 }
 
 static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
-				    enum btrfs_fix_data_checksum_mode mode)
+				    enum btrfs_fix_data_checksum_mode mode,
+				    unsigned int mirror)
 {
 	struct corrupted_block *entry;
 	struct btrfs_path path = { 0 };
@@ -393,7 +394,6 @@ static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
 	}
 
 	list_for_each_entry(entry, &corrupted_blocks, list) {
-		unsigned int mirror;
 		bool has_printed = false;
 		int ret;
 
@@ -426,6 +426,10 @@ static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
 		case BTRFS_FIX_DATA_CSUMS_READONLY:
 			action = ACTION_IGNORE;
 			break;
+		case BTRFS_FIX_DATA_CSUMS_UPDATE_CSUM_ITEM:
+			action = ACTION_UPDATE_CSUM;
+			mirror = mirror % (entry->num_mirrors + 1);
+			break;
 		default:
 			UASSERT(0);
 		}
@@ -434,6 +438,7 @@ static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,
 		case ACTION_IGNORE:
 			break;
 		case ACTION_UPDATE_CSUM:
+			UASSERT(mirror > 0 && mirror <= entry->num_mirrors);
 			ret = update_csum_item(fs_info, entry->logical, mirror);
 			break;
 		default:
@@ -455,7 +460,8 @@ static void free_corrupted_blocks(void)
 }
 
 int btrfs_recover_fix_data_checksum(const char *path,
-				    enum btrfs_fix_data_checksum_mode mode)
+				    enum btrfs_fix_data_checksum_mode mode,
+				    unsigned int mirror)
 {
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_root *csum_root;
@@ -465,6 +471,8 @@ int btrfs_recover_fix_data_checksum(const char *path,
 	if (mode >= BTRFS_FIX_DATA_CSUMS_LAST)
 		return -EINVAL;
 
+	if (mode == BTRFS_FIX_DATA_CSUMS_UPDATE_CSUM_ITEM)
+		UASSERT(mirror > 0);
 	ret = check_mounted(path);
 	if (ret < 0) {
 		errno = -ret;
@@ -495,7 +503,7 @@ int btrfs_recover_fix_data_checksum(const char *path,
 		errno = -ret;
 		error("failed to iterate csum tree: %m");
 	}
-	report_corrupted_blocks(fs_info, mode);
+	report_corrupted_blocks(fs_info, mode, mirror);
 out_close:
 	free_corrupted_blocks();
 	close_ctree_fs_info(fs_info);
diff --git a/cmds/rescue.c b/cmds/rescue.c
index 0c19de47895a..f575646c735a 100644
--- a/cmds/rescue.c
+++ b/cmds/rescue.c
@@ -31,6 +31,7 @@
 #include "kernel-shared/extent_io.h"
 #include "kernel-shared/accessors.h"
 #include "kernel-shared/uapi/btrfs_tree.h"
+#include "common/string-utils.h"
 #include "common/messages.h"
 #include "common/utils.h"
 #include "common/help.h"
@@ -282,6 +283,7 @@ static const char * const cmd_rescue_fix_data_checksum_usage[] = {
 	"",
 	OPTLINE("-r|--readonly", "readonly mode, only report errors without repair"),
 	OPTLINE("-i|--interactive", "interactive mode, ignore the error by default."),
+	OPTLINE("-m|--mirror <mirror>", "update csum item using specified mirror"),
 	HELPINFO_INSERT_GLOBALS,
 	HELPINFO_INSERT_VERBOSE,
 	NULL
@@ -291,6 +293,7 @@ static int cmd_rescue_fix_data_checksum(const struct cmd_struct *cmd,
 					int argc, char **argv)
 {
 	enum btrfs_fix_data_checksum_mode mode = BTRFS_FIX_DATA_CSUMS_READONLY;
+	unsigned int mirror = 0;
 	int ret;
 	optind = 0;
 
@@ -300,9 +303,10 @@ static int cmd_rescue_fix_data_checksum(const struct cmd_struct *cmd,
 		static const struct option long_options [] = {
 			{"readonly", no_argument, NULL, 'r'},
 			{"interactive", no_argument, NULL, 'i'},
+			{"mirror", required_argument, NULL, 'm'},
 			{"NULL", 0, NULL, 0},
 		};
-		c = getopt_long(argc, argv, "ri", long_options, NULL);
+		c = getopt_long(argc, argv, "rim:", long_options, NULL);
 		if (c < 0)
 			break;
 		switch (c) {
@@ -312,13 +316,21 @@ static int cmd_rescue_fix_data_checksum(const struct cmd_struct *cmd,
 		case 'i':
 			mode = BTRFS_FIX_DATA_CSUMS_INTERACTIVE;
 			break;
+		case 'm':
+			mode = BTRFS_FIX_DATA_CSUMS_UPDATE_CSUM_ITEM;
+			mirror = arg_strtou64(optarg);
+			if (mirror == 0) {
+				error("invalid mirror number %u, must be >= 1", mirror);
+				return 1;
+			}
+			break;
 		default:
 			usage_unknown_option(cmd, argv);
 		}
 	}
 	if (check_argc_min(argc - optind, 1))
 		return 1;
-	ret = btrfs_recover_fix_data_checksum(argv[optind], mode);
+	ret = btrfs_recover_fix_data_checksum(argv[optind], mode, mirror);
 	if (ret < 0) {
 		errno = -ret;
 		error("failed to fix data checksums: %m");
diff --git a/cmds/rescue.h b/cmds/rescue.h
index 4ae43cbb4cd4..f78ec436a908 100644
--- a/cmds/rescue.h
+++ b/cmds/rescue.h
@@ -23,12 +23,14 @@
 enum btrfs_fix_data_checksum_mode {
 	BTRFS_FIX_DATA_CSUMS_READONLY,
 	BTRFS_FIX_DATA_CSUMS_INTERACTIVE,
+	BTRFS_FIX_DATA_CSUMS_UPDATE_CSUM_ITEM,
 	BTRFS_FIX_DATA_CSUMS_LAST,
 };
 
 int btrfs_recover_superblocks(const char *path, int yes);
 int btrfs_recover_chunk_tree(const char *path, int yes);
 int btrfs_recover_fix_data_checksum(const char *path,
-				    enum btrfs_fix_data_checksum_mode mode);
+				    enum btrfs_fix_data_checksum_mode mode,
+				    unsigned int mirror);
 
 #endif
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
                   ` (5 preceding siblings ...)
  2025-05-15  8:00 ` [PATCH v2 6/6] btrfs-progs: fix-data-checksum: introduce -m|--mirror option Qu Wenruo
@ 2025-05-30 11:07 ` David Sterba
  2025-05-31 16:58 ` Goffredo Baroncelli
  7 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2025-05-30 11:07 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 15, 2025 at 05:30:15PM +0930, Qu Wenruo wrote:
> [CHANGELOG]
> v2:
> - Rename the subcommand to "fix-data-checksum"
>   It's better to use full name in the command name
> 
> - Remove unused members inside corrupted_block
>   The old @extent_bytenr and @extent_len is no longer needed, even for
>   the future file-deletion action.
> 
> - Fix the bitmap size off-by-1 bug
>   We must use the bit 0 to represent mirror 1, or the bitmap size will
>   exceed num_mirrors.
> 
> - Introduce -i|--interactive mode
>   Will ask the user for the action on the corrupted block, including:
> 
>   * Ignore
>     The default behavior if no command is provided
> 
>   * Use specified mirror to update the data checksum item
>     The user must input a number inside range [1, num_mirrors].
> 
> - Introduce -m|--mirror <num> mode
>   Use specified mirror for all corrupted blocks.
>   The value <num> must be >= 1. And if the value is larger than the
>   actual max mirror number, the real mirror number will be
>   `num % (num_mirror + 1)`.
> 
> We have a long history of data csum mismatch, caused by direct IO and
> buffered being modified during writeback.
> 
> Although the problem is worked around in v6.15 (and being backported),
> for the affected fs there is no good way to fix them, other than complex
> manually find out which files are affected and delete them.
> 
> This series introduce the initial implementation of "btrfs rescue
> fix-data-checksum", which is designed to fix such problem by either:
> 
> - Update the csum items using the data from specified copy
> 
> The subcommand has 3 modes so far:
> 
> - Readonly mode
>   Only report all corrupted blocks and affected files, no repair is
>   done.
> 
> - Interactive mode
>   Ask for what to do, including
> 
>   * Ignore (the default)
>   * Use certain mirror to update the checksum item
> 
> - Mirror mode
>   Use specified mirror to update the checksum item, the batch mode of
>   the interactive one.
> 
> In the future, there will be one more mode:
> 
> - Delete mode
>   Delete all involved files.
> 
>   There are still some points to address before implementing this mode.
> 
> Qu Wenruo (6):
>   btrfs-progs: introduce "btrfs rescue fix-data-checksum"
>   btrfs-progs: fix a bug in btrfs_find_item()
>   btrfs-progs: fix-data-checksum: show affected files
>   btrfs-progs: fix-data-checksum: introduce interactive mode
>   btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch
>   btrfs-progs: fix-data-checksum: introduce -m|--mirror option

Thanks, this is certainly useful and is a reasonable compromise to have
a way to fix the checksums. I did some conding style fixups in the devel
branch, I'll comment under the patches.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  2025-05-15  8:00 ` [PATCH v2 1/6] " Qu Wenruo
@ 2025-05-30 11:10   ` David Sterba
  0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2025-05-30 11:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 15, 2025 at 05:30:16PM +0930, Qu Wenruo wrote:
> +static void report_corrupted_blocks(void)
> +{
> +	struct corrupted_block *entry;
> +
> +	if (list_empty(&corrupted_blocks)) {
> +		printf("No data checksum mismatch found\n");

printf(...) -> pr_verbose(LOG_DEFAULT, ...) so the verbosity options are
respected. The semantics is the same as printf, so there's no implicit "\n"
and you can glue the lines as you need.

> +		return;
> +	}
> +
> +	list_for_each_entry(entry, &corrupted_blocks, list) {
> +		bool has_printed = false;
> +
> +		printf("logical=%llu corrtuped mirrors=", entry->logical);
> +		/* Poor man's bitmap print. */
> +		for (int i = 0; i < entry->num_mirrors; i++) {
> +			if (test_bit(i, entry->error_mirror_bitmap)) {
> +				if (has_printed)
> +					printf(",");
> +				/*
> +				 * Bit 0 means mirror 1, thus we need to increase
> +				 * the value by 1.
> +				 */
> +				printf("%d", i + 1);
> +				has_printed=true;
> +			}
> +		}
> +		printf("\n");
> +	}
> +}

> +static int cmd_rescue_fix_data_checksum(const struct cmd_struct *cmd,
> +					int argc, char **argv)
> +{
> +	enum btrfs_fix_data_checksum_mode mode = BTRFS_FIX_DATA_CSUMS_READONLY;
> +	int ret;
> +	optind = 0;
> +
> +	while (1) {
> +		int c;
> +		enum { GETOPT_VAL_DRYRUN = GETOPT_VAL_FIRST, };

The ending "," does not need to be there

> +		static const struct option long_options [] = {
> +			{"readonly", no_argument, NULL, 'r'},
> +			{"NULL", 0, NULL, 0},
> +		};

Missing newline

> +		c = getopt_long(argc, argv, "r", long_options, NULL);
> +		if (c < 0)
> +			break;
> +		switch (c) {
> +		case 'r':
> +			mode = BTRFS_FIX_DATA_CSUMS_READONLY;
> +			break;
> +		default:
> +			usage_unknown_option(cmd, argv);
> +		}
> +	}
> +	if (check_argc_min(argc - optind, 1))
> +		return 1;
> +	ret = btrfs_recover_fix_data_checksum(argv[optind], mode);
> +	if (ret < 0) {
> +		errno = -ret;
> +		error("failed to fix data checksums: %m");
> +	}
> +	return !!ret;
> +}
> +static DEFINE_SIMPLE_COMMAND(rescue_fix_data_checksum, "fix-data-checksum");

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 2/6] btrfs-progs: fix a bug in btrfs_find_item()
  2025-05-15  8:00 ` [PATCH v2 2/6] btrfs-progs: fix a bug in btrfs_find_item() Qu Wenruo
@ 2025-05-30 11:11   ` David Sterba
  0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2025-05-30 11:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 15, 2025 at 05:30:17PM +0930, Qu Wenruo wrote:
> [BUG]
> There is a seldomly utilized function, btrfs_find_item(), which has no
> document and is not behaving correctly.
> 
> Inside backref.c, iterate_inode_refs() and btrfs_ref_to_path() both
> utilize this function, to find the parent inode.
> 
> However btrfs_find_item() will never return 0 if @ioff is passed as 0
> for such usage, result early failure for all kinds of inode iteration
> functions.
> 
> [CAUSE]
> Both functions pass 0 as the @ioff parameter initially, e.g:
> 
>  We have the following fs tree root:
> 
>   	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
> 		generation 3 transid 9 size 6 nbytes 16384
> 		block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
> 		sequence 1 flags 0x0(none)
> 	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
> 		index 0 namelen 2 name: ..
> 	item 2 key (256 DIR_ITEM 2507850652) itemoff 16078 itemsize 33
> 		location key (257 INODE_ITEM 0) type FILE
> 		transid 9 data_len 0 name_len 3
> 		name: foo
> 	item 3 key (256 DIR_INDEX 2) itemoff 16045 itemsize 33
> 		location key (257 INODE_ITEM 0) type FILE
> 		transid 9 data_len 0 name_len 3
> 		name: foo
> 	item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
> 		generation 9 transid 9 size 16384 nbytes 16384
> 		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> 		sequence 4 flags 0x0(none)
> 	item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
> 		index 2 namelen 3 name: foo
> 	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
> 		generation 9 type 1 (regular)
> 		extent data disk byte 13631488 nr 16384
> 		extent data offset 0 nr 16384 ram 16384
> 		extent compression 0 (none)
> 
>   Then we call paths_from_inode() with:
>   - @inum = 257
>   - ipath = {.fs_root = 5}
> 
>   Then we got the following sequence:
> 
>   iterate_irefs(257, fs_root, inode_to_path)
>   |- iterate_inode_refs()
>      |- inode_ref_info()
>         |- btrfs_find_item(257, 0, fs_root)
> 	|  Returned 1, with @found_key updated to
> 	|  (257, INODE_REF, 256).
> 
>   This makes iterate_irefs() exit immediately, but obviously that
>   btrfs_find_item() call is to find any INODE_REF, not to find the
>   exact match.
> 
> [FIX]
> If btrfs_find_item() found an item matching the objectid and type, then
> it should return 0 other than 1.
> 
> Fix it and keep the behavior the same across btrfs-progs and the kernel.
> 
> Since we're here, also add some comments explaining the function.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  kernel-shared/ctree.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel-shared/ctree.c b/kernel-shared/ctree.c
> index 3184c916175e..f90de606e7b1 100644
> --- a/kernel-shared/ctree.c
> +++ b/kernel-shared/ctree.c
> @@ -1246,6 +1246,17 @@ static void reada_for_search(struct btrfs_fs_info *fs_info,
>  	}
>  }
>  
> +/*
> + * Find the first key in @fs_root that matches all the following conditions:
> + *
> + * - key.obojectid == @iobjectid
> + * - key.type == @key_type
> + * - key.offset >= ioff
> + *
> + * Return 0 if such key can be found, and @found_key is updated.
> + * Return >0 if no such key can be found.
> + * Return <0 for critical errors.
> + */
>  int btrfs_find_item(struct btrfs_root *fs_root, struct btrfs_path *found_path,
>  		u64 iobjectid, u64 ioff, u8 key_type,
>  		struct btrfs_key *found_key)
> @@ -1280,10 +1291,10 @@ int btrfs_find_item(struct btrfs_root *fs_root, struct btrfs_path *found_path,
>  
>  	btrfs_item_key_to_cpu(eb, found_key, path->slots[0]);
>  	if (found_key->type != key.type ||
> -			found_key->objectid != key.objectid) {
> +	    found_key->objectid != key.objectid)

This fits one line. We don't have perfect 1:1 line matching with kernel
so the style can be fixed.

>  		ret = 1;
> -		goto out;
> -	}
> +	else
> +		ret = 0;
>  
>  out:
>  	if (path != found_path)
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 3/6] btrfs-progs: fix-data-checksum: show affected files
  2025-05-15  8:00 ` [PATCH v2 3/6] btrfs-progs: fix-data-checksum: show affected files Qu Wenruo
@ 2025-05-30 11:14   ` David Sterba
  0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2025-05-30 11:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 15, 2025 at 05:30:18PM +0930, Qu Wenruo wrote:
> Previously "btrfs rescue fix-data-checksum" only show affected logical
> bytenr, which is not helpful to determine which files are affected.
> 
> Enhance the output by also outputting the affected subvolumes (in
> numeric form), and the file paths inside that subvolume.
> 
> The example looks like this:
> 
>   logical=13631488 corrtuped mirrors=1 affected files:
>     (subvolume 5)/foo
>     (subvolume 5)/dir/bar
>   logical=13635584 corrtuped mirrors=1 affected files:
>     (subvolume 5)/foo
>     (subvolume 5)/dir/bar
> 
> Although the end result is still not perfect, it's still much easier to
> find out which files are affected.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  cmds/rescue-fix-data-checksum.c | 59 +++++++++++++++++++++++++++++++--
>  1 file changed, 56 insertions(+), 3 deletions(-)
> 
> diff --git a/cmds/rescue-fix-data-checksum.c b/cmds/rescue-fix-data-checksum.c
> index 7e613ba57f76..bf3a65b31c71 100644
> --- a/cmds/rescue-fix-data-checksum.c
> +++ b/cmds/rescue-fix-data-checksum.c
> @@ -18,6 +18,7 @@
>  #include "kernel-shared/disk-io.h"
>  #include "kernel-shared/ctree.h"
>  #include "kernel-shared/volumes.h"
> +#include "kernel-shared/backref.h"
>  #include "common/messages.h"
>  #include "common/open-utils.h"
>  #include "cmds/rescue.h"
> @@ -158,6 +159,49 @@ static int iterate_one_csum_item(struct btrfs_fs_info *fs_info, struct btrfs_pat
>  	return ret;
>  }
>  
> +static int print_filenames(u64 ino, u64 offset, u64 rootid, void *ctx)
> +{
> +	struct btrfs_fs_info *fs_info = ctx;
> +	struct btrfs_root *root;
> +	struct btrfs_key key;
> +	struct inode_fs_paths *ipath;
> +	struct btrfs_path path = { 0 };
> +	int ret;
> +
> +	key.objectid = rootid;
> +	key.type = BTRFS_ROOT_ITEM_KEY;
> +	key.offset = (u64)-1;
> +
> +	root = btrfs_read_fs_root(fs_info, &key);
> +	if (IS_ERR(root)) {
> +		ret = PTR_ERR(root);
> +		errno = -ret;
> +		error("failed to get subvolume %llu: %m", rootid);
> +		return ret;
> +	}
> +	ipath = init_ipath(128 * BTRFS_PATH_NAME_MAX, root, &path);

128 is a magic constant, I've left it like that but we can use the
PATH_MAX and some named definition for the count.

> +	if (IS_ERR(ipath)) {
> +		ret = PTR_ERR(ipath);
> +		errno = -ret;
> +		error("failed to initialize ipath: %m");
> +		return ret;
> +	}
> +	ret = paths_from_inode(ino, ipath);
> +	if (ret < 0) {
> +		errno = -ret;
> +		error("failed to resolve root %llu ino %llu to paths: %m", rootid, ino);
> +		goto out;
> +	}
> +	for (int i = 0; i < ipath->fspath->elem_cnt; i++)
> +		printf("  (subvolume %llu)/%s\n", rootid, (char *)ipath->fspath->val[i]);
> +	if (ipath->fspath->elem_missed)
> +		printf("  (subvolume %llu) %d files not printed\n", rootid,
> +		       ipath->fspath->elem_missed);
> +out:
> +	free_ipath(ipath);
> +	return ret;
> +}
> +

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/6] btrfs-progs: fix-data-checksum: introduce interactive mode
  2025-05-15  8:00 ` [PATCH v2 4/6] btrfs-progs: fix-data-checksum: introduce interactive mode Qu Wenruo
@ 2025-05-30 11:18   ` David Sterba
  0 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2025-05-30 11:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 15, 2025 at 05:30:19PM +0930, Qu Wenruo wrote:
> --- a/cmds/rescue-fix-data-checksum.c
> +++ b/cmds/rescue-fix-data-checksum.c
> @@ -14,6 +14,7 @@
>   * Boston, MA 021110-1307, USA.
>   */
>  
> +#include <ctype.h>
>  #include "kerncompat.h"

The kerncompat.h needs to be first as there are workarounds for some
system includes. This has been unified everywhere else so please stick
to the ordering.

>  #include "kernel-shared/disk-io.h"
>  #include "kernel-shared/ctree.h"
> @@ -45,6 +46,21 @@ struct corrupted_block {
>  	unsigned long *error_mirror_bitmap;
>  };
>  
> +enum fix_data_checksum_action_value {
> +	ACTION_IGNORE,
> +	ACTION_LAST,
> +};
> +
> +static const struct fix_data_checksum_action {
> +	enum fix_data_checksum_action_value value;
> +	const char *string;
> +} actions[] = {
> +	[ACTION_IGNORE] = {
> +		.value = ACTION_IGNORE,
> +		.string = "ignore",
> +	},
> +};
> +
>  static int global_repair_mode;
>  LIST_HEAD(corrupted_blocks);
>  
> @@ -241,10 +257,49 @@ next:
>  	return ret;
>  }
>  
> @@ -277,6 +332,16 @@ static void report_corrupted_blocks(struct btrfs_fs_info *fs_info)
>  			error("failed to iterate involved files: %m");
>  			break;
>  		}
> +		switch (mode) {
> +		case BTRFS_FIX_DATA_CSUMS_INTERACTIVE:
> +			action = ask_action();
> +			UASSERT(action == ACTION_IGNORE);
> +			fallthrough;
> +		case BTRFS_FIX_DATA_CSUMS_READONLY:
> +			break;
> +		default:
> +			UASSERT(0);

I haven't fixed that but please don't do the ASSERT(0), we have
INTERNAL_ERROR for runtime errors or handle it and return with a hard
exit.

> +		}
>  	}
>  }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch
  2025-05-15  8:00 ` [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch Qu Wenruo
@ 2025-05-30 11:19   ` David Sterba
  2025-05-30 22:23     ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: David Sterba @ 2025-05-30 11:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 15, 2025 at 05:30:20PM +0930, Qu Wenruo wrote:
> +static int update_csum_item(struct btrfs_fs_info *fs_info, u64 logical,
> +			    unsigned int mirror)
> +{
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_root *csum_root = btrfs_csum_root(fs_info, logical);
> +	struct btrfs_path path = { 0 };
> +	struct btrfs_csum_item *citem;
> +	u64 read_len = fs_info->sectorsize;
> +	u8 csum[BTRFS_CSUM_SIZE] = { 0 };
> +	u8 *buf;
> +	int ret;
> +
> +	buf = malloc(fs_info->sectorsize);
> +	if (!buf)
> +		return -ENOMEM;

Not fixed, but the block buffers can be on stack as well if they're just
for the single function, the size is bounded.

> +	ret = read_data_from_disk(fs_info, buf, logical, &read_len, mirror);
> +	if (ret < 0) {
> +		errno = -ret;
> +		error("failed to read block at logical %llu mirror %u: %m",
> +			logical, mirror);
> +		goto out;
> +	}
> +	trans = btrfs_start_transaction(csum_root, 1);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		errno = -ret;
> +		error_msg(ERROR_MSG_START_TRANS, "%m");
> +		goto out;
> +	}
> +	citem = btrfs_lookup_csum(trans, csum_root, &path, logical,
> +				  BTRFS_EXTENT_CSUM_OBJECTID, fs_info->csum_type, 1);
> +	if (IS_ERR(citem)) {
> +		ret = PTR_ERR(citem);
> +		errno = -ret;
> +		error("failed to find csum item for logical %llu: $m", logical);
> +		btrfs_abort_transaction(trans, ret);
> +		goto out;
> +	}
> +	btrfs_csum_data(fs_info, fs_info->csum_type, buf, csum, fs_info->sectorsize);
> +	write_extent_buffer(path.nodes[0], csum, (unsigned long)citem, fs_info->csum_size);
> +	btrfs_release_path(&path);
> +	ret = btrfs_commit_transaction(trans, csum_root);
> +	if (ret < 0) {
> +		errno = -ret;
> +		error_msg(ERROR_MSG_COMMIT_TRANS, "%m");
> +	}
> +	printf("Csum item for logical %llu updated using data from mirror %u\n",
> +		logical, mirror);
> +out:
> +	free(buf);
> +	btrfs_release_path(&path);
> +	return ret;
>  }
>  
>  static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch
  2025-05-30 11:19   ` David Sterba
@ 2025-05-30 22:23     ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2025-05-30 22:23 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



在 2025/5/30 20:49, David Sterba 写道:
> On Thu, May 15, 2025 at 05:30:20PM +0930, Qu Wenruo wrote:
>> +static int update_csum_item(struct btrfs_fs_info *fs_info, u64 logical,
>> +			    unsigned int mirror)
>> +{
>> +	struct btrfs_trans_handle *trans;
>> +	struct btrfs_root *csum_root = btrfs_csum_root(fs_info, logical);
>> +	struct btrfs_path path = { 0 };
>> +	struct btrfs_csum_item *citem;
>> +	u64 read_len = fs_info->sectorsize;
>> +	u8 csum[BTRFS_CSUM_SIZE] = { 0 };
>> +	u8 *buf;
>> +	int ret;
>> +
>> +	buf = malloc(fs_info->sectorsize);
>> +	if (!buf)
>> +		return -ENOMEM;
> 
> Not fixed, but the block buffers can be on stack as well if they're just
> for the single function, the size is bounded.

Unfortunately the sectorsize is still runtime determined, thus it can 
not be on-stack.

Thanks,
Qu>
>> +	ret = read_data_from_disk(fs_info, buf, logical, &read_len, mirror);
>> +	if (ret < 0) {
>> +		errno = -ret;
>> +		error("failed to read block at logical %llu mirror %u: %m",
>> +			logical, mirror);
>> +		goto out;
>> +	}
>> +	trans = btrfs_start_transaction(csum_root, 1);
>> +	if (IS_ERR(trans)) {
>> +		ret = PTR_ERR(trans);
>> +		errno = -ret;
>> +		error_msg(ERROR_MSG_START_TRANS, "%m");
>> +		goto out;
>> +	}
>> +	citem = btrfs_lookup_csum(trans, csum_root, &path, logical,
>> +				  BTRFS_EXTENT_CSUM_OBJECTID, fs_info->csum_type, 1);
>> +	if (IS_ERR(citem)) {
>> +		ret = PTR_ERR(citem);
>> +		errno = -ret;
>> +		error("failed to find csum item for logical %llu: $m", logical);
>> +		btrfs_abort_transaction(trans, ret);
>> +		goto out;
>> +	}
>> +	btrfs_csum_data(fs_info, fs_info->csum_type, buf, csum, fs_info->sectorsize);
>> +	write_extent_buffer(path.nodes[0], csum, (unsigned long)citem, fs_info->csum_size);
>> +	btrfs_release_path(&path);
>> +	ret = btrfs_commit_transaction(trans, csum_root);
>> +	if (ret < 0) {
>> +		errno = -ret;
>> +		error_msg(ERROR_MSG_COMMIT_TRANS, "%m");
>> +	}
>> +	printf("Csum item for logical %llu updated using data from mirror %u\n",
>> +		logical, mirror);
>> +out:
>> +	free(buf);
>> +	btrfs_release_path(&path);
>> +	return ret;
>>   }
>>   
>>   static void report_corrupted_blocks(struct btrfs_fs_info *fs_info,


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
                   ` (6 preceding siblings ...)
  2025-05-30 11:07 ` [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" David Sterba
@ 2025-05-31 16:58 ` Goffredo Baroncelli
  2025-05-31 22:07   ` Qu Wenruo
  7 siblings, 1 reply; 18+ messages in thread
From: Goffredo Baroncelli @ 2025-05-31 16:58 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 15/05/2025 10.00, Qu Wenruo wrote:
> [CHANGELOG]
> v2:
> - Rename the subcommand to "fix-data-checksum"
>    It's better to use full name in the command name
> 
> - Remove unused members inside corrupted_block
>    The old @extent_bytenr and @extent_len is no longer needed, even for
>    the future file-deletion action.
> 
> - Fix the bitmap size off-by-1 bug
>    We must use the bit 0 to represent mirror 1, or the bitmap size will
>    exceed num_mirrors.
> 
> - Introduce -i|--interactive mode
>    Will ask the user for the action on the corrupted block, including:
> 
>    * Ignore
>      The default behavior if no command is provided
> 
>    * Use specified mirror to update the data checksum item
>      The user must input a number inside range [1, num_mirrors].
> 
> - Introduce -m|--mirror <num> mode
>    Use specified mirror for all corrupted blocks.
>    The value <num> must be >= 1. And if the value is larger than the
>    actual max mirror number, the real mirror number will be
>    `num % (num_mirror + 1)`.
> 
> We have a long history of data csum mismatch, caused by direct IO and
> buffered being modified during writeback.
> 
What about having an ioctl (on a mounted fs) which allow us to read data from a file even
if the csum doesn't match ? I asked that because the problem usually happen on a specific file
and not to an entire filesystem. In this case I think that it would be more practical to read the block
using the IOCTL, and then rewrite the block, at the specific offset (to update the checksum).

Of course there are several tradeoff: the "unmounted" version, doesn't duplicate a shared block,
where my approach (read the data using an ioctl to avoid the CSUM mismatch and rewrite it) make
a fork if the block is shared. However, as told before, the problem is related to specifics file and it seems
a waste of resource reading an entire filesystem to correct few files. No to mention that the IOCTL
can be done on a "live" filesystem.

Said that, of course "btrfs rescue fix-data-checksum" is better than nothing.

BR
G.Baroncelli

> Although the problem is worked around in v6.15 (and being backported),
> for the affected fs there is no good way to fix them, other than complex
> manually find out which files are affected and delete them.
> 
> This series introduce the initial implementation of "btrfs rescue
> fix-data-checksum", which is designed to fix such problem by either:
> 
> - Update the csum items using the data from specified copy
> 
> The subcommand has 3 modes so far:
> 
> - Readonly mode
>    Only report all corrupted blocks and affected files, no repair is
>    done.
> 
> - Interactive mode
>    Ask for what to do, including
> 
>    * Ignore (the default)
>    * Use certain mirror to update the checksum item
> 
> - Mirror mode
>    Use specified mirror to update the checksum item, the batch mode of
>    the interactive one.
> 
> In the future, there will be one more mode:
> 
> - Delete mode
>    Delete all involved files.
> 
>    There are still some points to address before implementing this mode.
> 
> Qu Wenruo (6):
>    btrfs-progs: introduce "btrfs rescue fix-data-checksum"
>    btrfs-progs: fix a bug in btrfs_find_item()
>    btrfs-progs: fix-data-checksum: show affected files
>    btrfs-progs: fix-data-checksum: introduce interactive mode
>    btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch
>    btrfs-progs: fix-data-checksum: introduce -m|--mirror option
> 
>   Documentation/btrfs-rescue.rst  |  28 ++
>   Makefile                        |   2 +-
>   cmds/rescue-fix-data-checksum.c | 511 ++++++++++++++++++++++++++++++++
>   cmds/rescue.c                   |  65 ++++
>   cmds/rescue.h                   |  10 +
>   kernel-shared/ctree.c           |  17 +-
>   kernel-shared/file-item.c       |   2 +-
>   kernel-shared/file-item.h       |   5 +
>   8 files changed, 635 insertions(+), 5 deletions(-)
>   create mode 100644 cmds/rescue-fix-data-checksum.c
> 
> --
> 2.49.0
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  2025-05-31 16:58 ` Goffredo Baroncelli
@ 2025-05-31 22:07   ` Qu Wenruo
  2025-06-02 17:28     ` Goffredo Baroncelli
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2025-05-31 22:07 UTC (permalink / raw)
  To: kreijack, Qu Wenruo, linux-btrfs



在 2025/6/1 02:28, Goffredo Baroncelli 写道:
> On 15/05/2025 10.00, Qu Wenruo wrote:
>> [CHANGELOG]
>> v2:
>> - Rename the subcommand to "fix-data-checksum"
>>    It's better to use full name in the command name
>>
>> - Remove unused members inside corrupted_block
>>    The old @extent_bytenr and @extent_len is no longer needed, even for
>>    the future file-deletion action.
>>
>> - Fix the bitmap size off-by-1 bug
>>    We must use the bit 0 to represent mirror 1, or the bitmap size will
>>    exceed num_mirrors.
>>
>> - Introduce -i|--interactive mode
>>    Will ask the user for the action on the corrupted block, including:
>>
>>    * Ignore
>>      The default behavior if no command is provided
>>
>>    * Use specified mirror to update the data checksum item
>>      The user must input a number inside range [1, num_mirrors].
>>
>> - Introduce -m|--mirror <num> mode
>>    Use specified mirror for all corrupted blocks.
>>    The value <num> must be >= 1. And if the value is larger than the
>>    actual max mirror number, the real mirror number will be
>>    `num % (num_mirror + 1)`.
>>
>> We have a long history of data csum mismatch, caused by direct IO and
>> buffered being modified during writeback.
>>
> What about having an ioctl (on a mounted fs) which allow us to read data 
> from a file even
> if the csum doesn't match ?

The problem is again with mirrors.

Unless you ask the end user to provide a mirror number, for a with 
multiple mirrors, and the mirrors doesn't match each other, the behavior 
will be a mess.

That's why I'm planning to add something like a mirror priority, with a 
new mirror "best" to try use the mirror with the most matches.


Despite the mirror number problem (we need to ask the end user for 
mirror number), I think it's possible to implement the feature (mostly) 
in user space.

E.g. combine fiemap result with btrfs-map-logical, then read from the 
disk directly.

> I asked that because the problem usually 
> happen on a specific file
> and not to an entire filesystem. In this case I think that it would be 
> more practical to read the block
> using the IOCTL, and then rewrite the block, at the specific offset (to 
> update the checksum).

I'm fine with the idea of reading from raw data idea (although prefer to 
implement it in user space), but not a huge fan to simply re-write with COW.

E.g. the bad csum is still there, can still be exposed by scrub, even 
it's not referred by any file anymore.

> 
> Of course there are several tradeoff: the "unmounted" version, doesn't 
> duplicate a shared block,
> where my approach (read the data using an ioctl to avoid the CSUM 
> mismatch and rewrite it) make
> a fork if the block is shared. However, as told before, the problem is 
> related to specifics file and it seems
> a waste of resource reading an entire filesystem to correct few files. 
> No to mention that the IOCTL
> can be done on a "live" filesystem.

I do not think we should treat the csum error so easily, it's still a 
big problem.

Even it's known that direct IO with buffer modified halfway can lead to 
corruption, the csum mismatch is still a huge problem.
The end user should check if their problem is properly written, or use 
the latest kernel with proper backport.

If it's really caused by hardware (memory or disk or whatever), it's a 
huge deal and definitely needs proper inspection and verification.

So overall, if one hits a csum mismatch, especially after the direct IO 
fix, then it should not be treated as "something that can be easily 
fixed online", it should be something huge, at least needs some tool to 
handle it offline.

Thanks,
Qu

> 
> Said that, of course "btrfs rescue fix-data-checksum" is better than 
> nothing.
> 
> BR
> G.Baroncelli
> 
>> Although the problem is worked around in v6.15 (and being backported),
>> for the affected fs there is no good way to fix them, other than complex
>> manually find out which files are affected and delete them.
>>
>> This series introduce the initial implementation of "btrfs rescue
>> fix-data-checksum", which is designed to fix such problem by either:
>>
>> - Update the csum items using the data from specified copy
>>
>> The subcommand has 3 modes so far:
>>
>> - Readonly mode
>>    Only report all corrupted blocks and affected files, no repair is
>>    done.
>>
>> - Interactive mode
>>    Ask for what to do, including
>>
>>    * Ignore (the default)
>>    * Use certain mirror to update the checksum item
>>
>> - Mirror mode
>>    Use specified mirror to update the checksum item, the batch mode of
>>    the interactive one.
>>
>> In the future, there will be one more mode:
>>
>> - Delete mode
>>    Delete all involved files.
>>
>>    There are still some points to address before implementing this mode.
>>
>> Qu Wenruo (6):
>>    btrfs-progs: introduce "btrfs rescue fix-data-checksum"
>>    btrfs-progs: fix a bug in btrfs_find_item()
>>    btrfs-progs: fix-data-checksum: show affected files
>>    btrfs-progs: fix-data-checksum: introduce interactive mode
>>    btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch
>>    btrfs-progs: fix-data-checksum: introduce -m|--mirror option
>>
>>   Documentation/btrfs-rescue.rst  |  28 ++
>>   Makefile                        |   2 +-
>>   cmds/rescue-fix-data-checksum.c | 511 ++++++++++++++++++++++++++++++++
>>   cmds/rescue.c                   |  65 ++++
>>   cmds/rescue.h                   |  10 +
>>   kernel-shared/ctree.c           |  17 +-
>>   kernel-shared/file-item.c       |   2 +-
>>   kernel-shared/file-item.h       |   5 +
>>   8 files changed, 635 insertions(+), 5 deletions(-)
>>   create mode 100644 cmds/rescue-fix-data-checksum.c
>>
>> -- 
>> 2.49.0
>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  2025-05-31 22:07   ` Qu Wenruo
@ 2025-06-02 17:28     ` Goffredo Baroncelli
  2025-06-02 22:55       ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Goffredo Baroncelli @ 2025-06-02 17:28 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 01/06/2025 00.07, Qu Wenruo wrote:
> 
> 
> 在 2025/6/1 02:28, Goffredo Baroncelli 写道:
>> On 15/05/2025 10.00, Qu Wenruo wrote:
[...]
>>> We have a long history of data csum mismatch, caused by direct IO and
>>> buffered being modified during writeback.
>>>
>> What about having an ioctl (on a mounted fs) which allow us to read data from a file even
>> if the csum doesn't match ?
> 
> The problem is again with mirrors.
> 
> Unless you ask the end user to provide a mirror number, for a with multiple mirrors, and the mirrors doesn't match each other, the behavior will be a mess.
> 
> That's why I'm planning to add something like a mirror priority, with a new mirror "best" to try use the mirror with the most matches.
> 
> 
> Despite the mirror number problem (we need to ask the end user for mirror number), I think it's possible to implement the feature (mostly) in user space.
> 
> E.g. combine fiemap result with btrfs-map-logical, then read from the disk directly.
> 
This would expose to the user space the complexity of the filesystem; this means having two code path doing the same thing in two subtle different ways. Pay attention to consider also a raid5/6 case with a missing disks.

I don't know how important is select a valid mirror. Or at least, I don't see a realistic case where a mirror may be better than another one.

After your consideration, I would like to suggest the following:
1) read the data from a mirror, the csum matches, return data
2) read the data from a mirror, the csum DOES NOT match:
    2.1) read the data from another mirror(s), the csum matches,
	- rebuild the csum
	- return the data
    2.2) read the data from another mirror(s), the csum still doesn't match, return error
    2.3) (NEW, alternatively to 2.2) read the data from another mirror(s), the csum still doesn't match:
	- take the latest read attempt as valid
         - rebuild the csum
         - return the last read attempt

2.3) mode should be a enabled by an IOCTL on a specific fd. The assumption here is that if the csum doesn't match any mirror may be equally bad/good. But in the IOCTL we could add a parameter to specify the order of the mirrors (still we need a way to specify the order of the mirror in the raid5 case).

>> I asked that because the problem usually happen on a specific file
>> and not to an entire filesystem. In this case I think that it would be more practical to read the block
>> using the IOCTL, and then rewrite the block, at the specific offset (to update the checksum).
> 
> I'm fine with the idea of reading from raw data idea (although prefer to implement it in user space), but not a huge fan to simply re-write with COW.
> 
> E.g. the bad csum is still there, can still be exposed by scrub, even it's not referred by any file anymore.
> 

If I understood correctly, you are saying that due to the fact that even if an extent is partially referred, all data in the extent is tracked and has his checksum. A rewriting in the middle would generate a new extent with a new checksum, but the old data still exists in the extent with his dedicated (and un-matching ) csum. (I repeated this because it seems to me a technical detail not so obvious but very important)

>>
>> Of course there are several tradeoff: the "unmounted" version, doesn't duplicate a shared block,
>> where my approach (read the data using an ioctl to avoid the CSUM mismatch and rewrite it) make
>> a fork if the block is shared. However, as told before, the problem is related to specifics file and it seems
>> a waste of resource reading an entire filesystem to correct few files. No to mention that the IOCTL
>> can be done on a "live" filesystem.
> 
> I do not think we should treat the csum error so easily, it's still a big problem.
> 
> Even it's known that direct IO with buffer modified halfway can lead to corruption, the csum mismatch is still a huge problem.
> The end user should check if their problem is properly written, or use the latest kernel with proper backport.
> 
> If it's really caused by hardware (memory or disk or whatever), it's a huge deal and definitely needs proper inspection and verification.
> 
> So overall, if one hits a csum mismatch, especially after the direct IO fix, then it should not be treated as "something that can be easily fixed online", it should be something huge, at least needs some tool to handle it offline.
> 

Even tough I agree about the severity (e.g. the corruption happened due to an hw error), I less agree about the fact that this is a justification to not have a simpler interface (form an user interface POV) that don't requires an un-mount/mount of the filesystem.

Un-mounting a root filesystem is a not so easy operation, so we should provide a different way to access the data.

However I agree that we should allow the user to select a different "mirror", considering that a "mirror" may be any valid combination of disk to rebuild the data (e.g. when we will define the interface, we should consider raid 6).

> Thanks,
> Qu
> 
>>

BR
Goffredo

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
  2025-06-02 17:28     ` Goffredo Baroncelli
@ 2025-06-02 22:55       ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2025-06-02 22:55 UTC (permalink / raw)
  To: kreijack, Qu Wenruo, linux-btrfs



在 2025/6/3 02:58, Goffredo Baroncelli 写道:
> On 01/06/2025 00.07, Qu Wenruo wrote:
>>
>>
>> 在 2025/6/1 02:28, Goffredo Baroncelli 写道:
>>> On 15/05/2025 10.00, Qu Wenruo wrote:
> [...]
>>>> We have a long history of data csum mismatch, caused by direct IO and
>>>> buffered being modified during writeback.
>>>>
>>> What about having an ioctl (on a mounted fs) which allow us to read 
>>> data from a file even
>>> if the csum doesn't match ?
>>
>> The problem is again with mirrors.
>>
>> Unless you ask the end user to provide a mirror number, for a with 
>> multiple mirrors, and the mirrors doesn't match each other, the 
>> behavior will be a mess.
>>
>> That's why I'm planning to add something like a mirror priority, with 
>> a new mirror "best" to try use the mirror with the most matches.
>>
>>
>> Despite the mirror number problem (we need to ask the end user for 
>> mirror number), I think it's possible to implement the feature 
>> (mostly) in user space.
>>
>> E.g. combine fiemap result with btrfs-map-logical, then read from the 
>> disk directly.
>>
> This would expose to the user space the complexity of the filesystem; 

Let me be clear, the filesystem is complex.

If there is a csum mismatch, the end user should dig into the rabbit 
hole to find out why, fixing it with some tool should not be the first 
thing they want to do.

> this means having two code path doing the same thing in two subtle 
> different ways.

It's fine to have two code paths doing the same thing.

One for offline and one for online.

> Pay attention to consider also a raid5/6 case with a 
> missing disks.
> 
> I don't know how important is select a valid mirror. Or at least, I 
> don't see a realistic case where a mirror may be better than another one.

RAID1C3, RAID1C4, where one can have 2 or more mirrors matches each 
other but not matching the csum.

> 
> After your consideration, I would like to suggest the following:
> 1) read the data from a mirror, the csum matches, return data
> 2) read the data from a mirror, the csum DOES NOT match:
>     2.1) read the data from another mirror(s), the csum matches,
>      - rebuild the csum
>      - return the data
>     2.2) read the data from another mirror(s), the csum still doesn't 
> match, return error
>     2.3) (NEW, alternatively to 2.2) read the data from another 
> mirror(s), the csum still doesn't match:
>      - take the latest read attempt as valid
>          - rebuild the csum
>          - return the last read attempt
> 
> 2.3) mode should be a enabled by an IOCTL on a specific fd.

Again, we should not even put a complex csum repair into kernel mode and 
allow it to be done online.

Back to the first point, csum mismatch should be investigated, it's not 
something one should hit commonly, and we should not make fixing it 
easier either.


Csum mismatch should be treated as tree-checker errors.
I have not yet seen anyone suggesting there should be some fancy ioctl 
to fix a tree-checker error.

Then there should not be any easy tool to fix csum error online.


> The 
> assumption here is that if the csum doesn't match any mirror may be 
> equally bad/good. But in the IOCTL we could add a parameter to specify 
> the order of the mirrors (still we need a way to specify the order of 
> the mirror in the raid5 case).
> 
>>> I asked that because the problem usually happen on a specific file
>>> and not to an entire filesystem. In this case I think that it would 
>>> be more practical to read the block
>>> using the IOCTL, and then rewrite the block, at the specific offset 
>>> (to update the checksum).
>>
>> I'm fine with the idea of reading from raw data idea (although prefer 
>> to implement it in user space), but not a huge fan to simply re-write 
>> with COW.
>>
>> E.g. the bad csum is still there, can still be exposed by scrub, even 
>> it's not referred by any file anymore.
>>
> 
> If I understood correctly, you are saying that due to the fact that even 
> if an extent is partially referred, all data in the extent is tracked 
> and has his checksum. A rewriting in the middle would generate a new 
> extent with a new checksum, but the old data still exists in the extent 
> with his dedicated (and un-matching ) csum. (I repeated this because it 
> seems to me a technical detail not so obvious but very important)

Yes.


>> So overall, if one hits a csum mismatch, especially after the direct 
>> IO fix, then it should not be treated as "something that can be easily 
>> fixed online", it should be something huge, at least needs some tool 
>> to handle it offline.
>>
> 
> Even tough I agree about the severity (e.g. the corruption happened due 
> to an hw error), I less agree about the fact that this is a 
> justification to not have a simpler interface (form an user interface 
> POV) that don't requires an un-mount/mount of the filesystem.
> 
> Un-mounting a root filesystem is a not so easy operation, so we should 
> provide a different way to access the data.

If you can not trust your hardware, online repair is just going to cause 
more problems.

Again, replace csum error with tree-checker.

Thanks,
Qu

> 
> However I agree that we should allow the user to select a different 
> "mirror", considering that a "mirror" may be any valid combination of 
> disk to rebuild the data (e.g. when we will define the interface, we 
> should consider raid 6).
> 
>> Thanks,
>> Qu
>>
>>>
> 
> BR
> Goffredo
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-06-02 22:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
2025-05-15  8:00 ` [PATCH v2 1/6] " Qu Wenruo
2025-05-30 11:10   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 2/6] btrfs-progs: fix a bug in btrfs_find_item() Qu Wenruo
2025-05-30 11:11   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 3/6] btrfs-progs: fix-data-checksum: show affected files Qu Wenruo
2025-05-30 11:14   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 4/6] btrfs-progs: fix-data-checksum: introduce interactive mode Qu Wenruo
2025-05-30 11:18   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch Qu Wenruo
2025-05-30 11:19   ` David Sterba
2025-05-30 22:23     ` Qu Wenruo
2025-05-15  8:00 ` [PATCH v2 6/6] btrfs-progs: fix-data-checksum: introduce -m|--mirror option Qu Wenruo
2025-05-30 11:07 ` [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" David Sterba
2025-05-31 16:58 ` Goffredo Baroncelli
2025-05-31 22:07   ` Qu Wenruo
2025-06-02 17:28     ` Goffredo Baroncelli
2025-06-02 22:55       ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox