[PATCH 0/6] btrfs: scrub: refine the error messages

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/6] btrfs: scrub: refine the error messages
@ 2024-03-14  9:50 Qu Wenruo
  2024-03-14  9:50 ` [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address Qu Wenruo
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs

During some support sessions, I found older kernels are always report
very unuseful scrub error messages like:

 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv

There are two problems:
- No proper details about the corruption
  No metadata or data indication, nor the filename or the tree id.
  Even the involved kernel (and newer kernels after the scrub rework)
  has the ability to do backref walk and find out the file path or the
  tree backref.

  My guess is, for data sometimes the corrupted sector is no longer
  referred by any data extent.

- Too noisy and useless error message from
  btrfs_dev_stat_inc_and_print()
  I'd argue we should not output any error message just for
  btrfs_dev_stat_inc_and_print().

After the series, the error message would look like this:

 BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

This involves the following enhancement:

- Single line
  And we ensure we output at least one line for the error we hit.
  No more unrelated btrfs_dev_stat_inc_and_print() output.

- Proper fall backs
  We have the following different fall back options
  * Repaired
    Just a line of something repaired for logical/physical address.

  * Detailed data info
    Including the following elements (if possible), and if higher
    priority ones can not be fetched, it would be skipped and try
    lower priority items:
    + file path (can have multiple ones)
    + root/inode number and offset
    + logical/physical address (always output)

  * Detaile metadata info
    The priority list is:
    + root ref/level
    + logical/physical address (always output)

  For the worst case of data corruption, we would still have some like:

   BTRFS warning (device dm-2): checksum error at data, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

  And similar ones for metadata:

   BTRFS warning (device dm-2): checksum error at meta, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

The first patch is fixing a regression in the error message, which leads
to bad logical/physical bytenr.
The second one would reduce the log level for
btrfs_dev_stat_inc_and_print().

Path 3~4 are cleanups to remove unnecessary parameters.

The remaining reworks the format and refine the error message frequency.

Qu Wenruo (7):
  btrfs: scrub: fix incorrectly reported logical/physical address
  btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
  btrfs: scrub: remove unused is_super parameter from
    scrub_print_common_warning()
  btrfs: scrub: remove unnecessary dev/physical lookup for
    scrub_stripe_report_errors()
  btrfs: scrub: simplify the inode iteration output
  btrfs: scrub: unify and shorten the error message
  btrfs: scrub: fix the frequency of error messages

 fs/btrfs/scrub.c   | 185 ++++++++++++++++-----------------------------
 fs/btrfs/volumes.c |   2 +-
 2 files changed, 66 insertions(+), 121 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
@ 2024-03-14  9:50 ` Qu Wenruo
  2024-03-14 12:24   ` Anand Jain
  2024-03-14 17:10   ` Filipe Manana
  2024-03-14  9:50 ` [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() Qu Wenruo
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
Scrub is not reporting the correct logical/physical address, it can be
verified by the following script:

 # mkfs.btrfs -f $dev1
 # mount $dev1 $mnt
 # xfs_io -f -c "pwrite -S 0xaa 0 128k" $mnt/file1
 # umount $mnt
 # xfs_io -f -c "pwrite -S 0xff 13647872 4k" $dev1
 # mount $dev1 $mnt
 # btrfs scrub start -fB $mnt
 # umount $mnt

Note above 13647872 is the physical address for logical 13631488 + 4K.

Scrub would report the following error:

 BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
 BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file1)

On the other hand, "btrfs check --check-data-csum" is reporting the
correct logical/physical address:

 Checking filesystem on /dev/test/scratch1
 UUID: db2eb621-b09d-4f24-8199-da17dc7b3201
 [5/7] checking csums against data
 mirror 1 bytenr 13647872 csum 0x13fec125 expected csum 0x656bd64e
 ERROR: errors found in csum tree

[CAUSE]
In the function scrub_stripe_report_errors(), we always use the
stripe->logical and its physical address to print the error message, not
taking the sector number into consideration at all.

[FIX]
Fix the error reporting function by calculating logical/physical with
the sector number.

Now the scrub report is correct:

 BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
 BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)

Fixes: 0096580713ff ("btrfs: scrub: introduce error reporting functionality for scrub_stripe")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index fa25004ab04e..119e98797b21 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -870,7 +870,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 				      DEFAULT_RATELIMIT_BURST);
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct btrfs_device *dev = NULL;
-	u64 physical = 0;
+	u64 stripe_physical = stripe->physical;
 	int nr_data_sectors = 0;
 	int nr_meta_sectors = 0;
 	int nr_nodatacsum_sectors = 0;
@@ -903,13 +903,17 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 		 */
 		if (ret < 0)
 			goto skip;
-		physical = bioc->stripes[stripe_index].physical;
+		stripe_physical = bioc->stripes[stripe_index].physical;
 		dev = bioc->stripes[stripe_index].dev;
 		btrfs_put_bioc(bioc);
 	}
 
 skip:
 	for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
+		u64 logical = stripe->logical +
+			      (sector_nr << fs_info->sectorsize_bits);
+		u64 physical = stripe_physical +
+			      (sector_nr << fs_info->sectorsize_bits);
 		bool repaired = false;
 
 		if (stripe->sectors[sector_nr].is_metadata) {
@@ -938,12 +942,12 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 			if (dev) {
 				btrfs_err_rl_in_rcu(fs_info,
 			"fixed up error at logical %llu on dev %s physical %llu",
-					    stripe->logical, btrfs_dev_name(dev),
+					    logical, btrfs_dev_name(dev),
 					    physical);
 			} else {
 				btrfs_err_rl_in_rcu(fs_info,
 			"fixed up error at logical %llu on mirror %u",
-					    stripe->logical, stripe->mirror_num);
+					    logical, stripe->mirror_num);
 			}
 			continue;
 		}
@@ -952,26 +956,26 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 		if (dev) {
 			btrfs_err_rl_in_rcu(fs_info,
 	"unable to fixup (regular) error at logical %llu on dev %s physical %llu",
-					    stripe->logical, btrfs_dev_name(dev),
+					    logical, btrfs_dev_name(dev),
 					    physical);
 		} else {
 			btrfs_err_rl_in_rcu(fs_info,
 	"unable to fixup (regular) error at logical %llu on mirror %u",
-					    stripe->logical, stripe->mirror_num);
+					    logical, stripe->mirror_num);
 		}
 
 		if (test_bit(sector_nr, &stripe->io_error_bitmap))
 			if (__ratelimit(&rs) && dev)
 				scrub_print_common_warning("i/o error", dev, false,
-						     stripe->logical, physical);
+						     logical, physical);
 		if (test_bit(sector_nr, &stripe->csum_error_bitmap))
 			if (__ratelimit(&rs) && dev)
 				scrub_print_common_warning("checksum error", dev, false,
-						     stripe->logical, physical);
+						     logical, physical);
 		if (test_bit(sector_nr, &stripe->meta_error_bitmap))
 			if (__ratelimit(&rs) && dev)
 				scrub_print_common_warning("header error", dev, false,
-						     stripe->logical, physical);
+						     logical, physical);
 	}
 
 	spin_lock(&sctx->stat_lock);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
  2024-03-14  9:50 ` [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address Qu Wenruo
@ 2024-03-14  9:50 ` Qu Wenruo
  2024-03-14 17:17   ` Filipe Manana
  2024-03-14  9:50 ` [PATCH 3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() Qu Wenruo
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs; +Cc: stable

Currently when we increase the device statistics, it would always lead
to an error message in the kernel log.

I would argue this behavior is not ideal:

- It would flood the dmesg and bury real important messages
  One common scenario is scrub.
  If scrub hit some errors, it would cause both scrub and
  btrfs_dev_stat_inc_and_print() to print error messages.

  And in that case, btrfs_dev_stat_inc_and_print() is completely
  useless.

- The results of btrfs_dev_stat_inc_and_print() is mostly for history
  monitoring, doesn't has enough details

  If we trigger the errors during regular read, such messages from
  btrfs_dev_stat_inc_and_print() won't help us to locate the cause
  either.

The real usage for the btrfs device statistics is for some user space
daemon to check if there is any new errors, acting like some checks on
SMART, thus we don't really need/want those messages in dmesg.

This patch would reduce the log level to debug (disabled by default) for
btrfs_dev_stat_inc_and_print().
For users really want to utilize btrfs devices statistics, they should
go check "btrfs device stats" periodically, and we should focus the
kernel error messages to more important things.

CC: stable@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e49935a54da0..126145950ed3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7828,7 +7828,7 @@ void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)

 	if (!dev->dev_stats_valid)
 		return;
-	btrfs_err_rl_in_rcu(dev->fs_info,
+	btrfs_debug_rl_in_rcu(dev->fs_info,
 		"bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
 			   btrfs_dev_name(dev),
 			   btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
-- 
2.44.0

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning()
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
  2024-03-14  9:50 ` [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address Qu Wenruo
  2024-03-14  9:50 ` [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() Qu Wenruo
@ 2024-03-14  9:50 ` Qu Wenruo
  2024-03-14 17:19   ` Filipe Manana
  2024-03-14  9:50 ` [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() Qu Wenruo
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs

Since commit 2a2dc22f7e9d ("btrfs: scrub: use dedicated super block
verification function to scrub one super block"), the super block
scrubbing is handled in a dedicated helper, thus
scrub_print_common_warning() no longer needs to print warning for super
block errors.

Just remove the parameter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 119e98797b21..5e371ffdb37b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -476,7 +476,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 }
 
 static void scrub_print_common_warning(const char *errstr, struct btrfs_device *dev,
-				       bool is_super, u64 logical, u64 physical)
+				       u64 logical, u64 physical)
 {
 	struct btrfs_fs_info *fs_info = dev->fs_info;
 	struct btrfs_path *path;
@@ -488,12 +488,6 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 	u32 item_size;
 	int ret;
 
-	/* Super block error, no need to search extent tree. */
-	if (is_super) {
-		btrfs_warn_in_rcu(fs_info, "%s on device %s, physical %llu",
-				  errstr, btrfs_dev_name(dev), physical);
-		return;
-	}
 	path = btrfs_alloc_path();
 	if (!path)
 		return;
@@ -966,15 +960,15 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 
 		if (test_bit(sector_nr, &stripe->io_error_bitmap))
 			if (__ratelimit(&rs) && dev)
-				scrub_print_common_warning("i/o error", dev, false,
+				scrub_print_common_warning("i/o error", dev,
 						     logical, physical);
 		if (test_bit(sector_nr, &stripe->csum_error_bitmap))
 			if (__ratelimit(&rs) && dev)
-				scrub_print_common_warning("checksum error", dev, false,
+				scrub_print_common_warning("checksum error", dev,
 						     logical, physical);
 		if (test_bit(sector_nr, &stripe->meta_error_bitmap))
 			if (__ratelimit(&rs) && dev)
-				scrub_print_common_warning("header error", dev, false,
+				scrub_print_common_warning("header error", dev,
 						     logical, physical);
 	}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors()
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
                   ` (2 preceding siblings ...)
  2024-03-14  9:50 ` [PATCH 3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() Qu Wenruo
@ 2024-03-14  9:50 ` Qu Wenruo
  2024-03-14 17:26   ` Filipe Manana
  2024-03-14  9:50 ` [PATCH 5/7] btrfs: scrub: simplify the inode iteration output Qu Wenruo
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs

The @stripe passed into scrub_stripe_report_errors() either has
stripe->dev and stripe->physical properly populated (regular data
stripes) or stripe->flags would have SCRUB_STRIPE_FLAG_NO_REPORT
(RAID56 P/Q stripes).

Thus there is no need to go with btrfs_map_block() to get the
dev/physical.

Just add an extra ASSERT() to make sure we get stripe->dev populated
correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 59 ++++++------------------------------------------
 1 file changed, 7 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 5e371ffdb37b..277583464371 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -860,10 +860,8 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
 static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 				       struct scrub_stripe *stripe)
 {
-	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
-	struct btrfs_device *dev = NULL;
+	struct btrfs_device *dev = stripe->dev;
 	u64 stripe_physical = stripe->physical;
 	int nr_data_sectors = 0;
 	int nr_meta_sectors = 0;
@@ -874,35 +872,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 	if (test_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state))
 		return;
 
-	/*
-	 * Init needed infos for error reporting.
-	 *
-	 * Although our scrub_stripe infrastructure is mostly based on btrfs_submit_bio()
-	 * thus no need for dev/physical, error reporting still needs dev and physical.
-	 */
-	if (!bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) {
-		u64 mapped_len = fs_info->sectorsize;
-		struct btrfs_io_context *bioc = NULL;
-		int stripe_index = stripe->mirror_num - 1;
-		int ret;
-
-		/* For scrub, our mirror_num should always start at 1. */
-		ASSERT(stripe->mirror_num >= 1);
-		ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
-				      stripe->logical, &mapped_len, &bioc,
-				      NULL, NULL);
-		/*
-		 * If we failed, dev will be NULL, and later detailed reports
-		 * will just be skipped.
-		 */
-		if (ret < 0)
-			goto skip;
-		stripe_physical = bioc->stripes[stripe_index].physical;
-		dev = bioc->stripes[stripe_index].dev;
-		btrfs_put_bioc(bioc);
-	}
-
-skip:
+	ASSERT(dev);
 	for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
 		u64 logical = stripe->logical +
 			      (sector_nr << fs_info->sectorsize_bits);
@@ -933,42 +903,27 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 		 * output the message of repaired message.
 		 */
 		if (repaired) {
-			if (dev) {
-				btrfs_err_rl_in_rcu(fs_info,
+			btrfs_err_rl_in_rcu(fs_info,
 			"fixed up error at logical %llu on dev %s physical %llu",
 					    logical, btrfs_dev_name(dev),
 					    physical);
-			} else {
-				btrfs_err_rl_in_rcu(fs_info,
-			"fixed up error at logical %llu on mirror %u",
-					    logical, stripe->mirror_num);
-			}
 			continue;
 		}
 
 		/* The remaining are all for unrepaired. */
-		if (dev) {
-			btrfs_err_rl_in_rcu(fs_info,
+		btrfs_err_rl_in_rcu(fs_info,
 	"unable to fixup (regular) error at logical %llu on dev %s physical %llu",
 					    logical, btrfs_dev_name(dev),
 					    physical);
-		} else {
-			btrfs_err_rl_in_rcu(fs_info,
-	"unable to fixup (regular) error at logical %llu on mirror %u",
-					    logical, stripe->mirror_num);
-		}
 
 		if (test_bit(sector_nr, &stripe->io_error_bitmap))
-			if (__ratelimit(&rs) && dev)
-				scrub_print_common_warning("i/o error", dev,
+			scrub_print_common_warning("i/o error", dev,
 						     logical, physical);
 		if (test_bit(sector_nr, &stripe->csum_error_bitmap))
-			if (__ratelimit(&rs) && dev)
-				scrub_print_common_warning("checksum error", dev,
+			scrub_print_common_warning("checksum error", dev,
 						     logical, physical);
 		if (test_bit(sector_nr, &stripe->meta_error_bitmap))
-			if (__ratelimit(&rs) && dev)
-				scrub_print_common_warning("header error", dev,
+			scrub_print_common_warning("header error", dev,
 						     logical, physical);
 	}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/7] btrfs: scrub: simplify the inode iteration output
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
                   ` (3 preceding siblings ...)
  2024-03-14  9:50 ` [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() Qu Wenruo
@ 2024-03-14  9:50 ` Qu Wenruo
  2024-03-14 17:29   ` Filipe Manana
  2024-03-14  9:50 ` [PATCH 6/7] btrfs: scrub: unify and shorten the error message Qu Wenruo
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs

The following two output are not really needed:

- nlinks
  Normally file inodes should have nlinks as 1, for those inodes have
  multiple hard links, the inode/root number is still enough to pin it
  down to certain inode.

- size
  The size is always fixed to sector size.

By removing the nlinks output, we can reduce one inode item lookup.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 27 +--------------------------
 1 file changed, 1 insertion(+), 26 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 277583464371..18b2ee3b1616 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -388,17 +388,13 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
 static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 				     u64 root, void *warn_ctx)
 {
-	u32 nlink;
 	int ret;
 	int i;
 	unsigned nofs_flag;
-	struct extent_buffer *eb;
-	struct btrfs_inode_item *inode_item;
 	struct scrub_warning *swarn = warn_ctx;
 	struct btrfs_fs_info *fs_info = swarn->dev->fs_info;
 	struct inode_fs_paths *ipath = NULL;
 	struct btrfs_root *local_root;
-	struct btrfs_key key;
 
 	local_root = btrfs_get_fs_root(fs_info, root, true);
 	if (IS_ERR(local_root)) {
@@ -406,26 +402,6 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 		goto err;
 	}
 
-	/*
-	 * this makes the path point to (inum INODE_ITEM ioff)
-	 */
-	key.objectid = inum;
-	key.type = BTRFS_INODE_ITEM_KEY;
-	key.offset = 0;
-
-	ret = btrfs_search_slot(NULL, local_root, &key, swarn->path, 0, 0);
-	if (ret) {
-		btrfs_put_root(local_root);
-		btrfs_release_path(swarn->path);
-		goto err;
-	}
-
-	eb = swarn->path->nodes[0];
-	inode_item = btrfs_item_ptr(eb, swarn->path->slots[0],
-					struct btrfs_inode_item);
-	nlink = btrfs_inode_nlink(eb, inode_item);
-	btrfs_release_path(swarn->path);
-
 	/*
 	 * init_path might indirectly call vmalloc, or use GFP_KERNEL. Scrub
 	 * uses GFP_NOFS in this context, so we keep it consistent but it does
@@ -451,12 +427,11 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 	 */
 	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
 		btrfs_warn_in_rcu(fs_info,
-"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %u, links %u (path: %s)",
+"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, path: %s",
 				  swarn->errstr, swarn->logical,
 				  btrfs_dev_name(swarn->dev),
 				  swarn->physical,
 				  root, inum, offset,
-				  fs_info->sectorsize, nlink,
 				  (char *)(unsigned long)ipath->fspath->val[i]);
 
 	btrfs_put_root(local_root);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/7] btrfs: scrub: unify and shorten the error message
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
                   ` (4 preceding siblings ...)
  2024-03-14  9:50 ` [PATCH 5/7] btrfs: scrub: simplify the inode iteration output Qu Wenruo
@ 2024-03-14  9:50 ` Qu Wenruo
  2024-03-14 17:40   ` Filipe Manana
                     ` (2 more replies)
  2024-03-14  9:50 ` [PATCH 7/7] btrfs: scrub: fix the frequency of error messages Qu Wenruo
  2024-03-14 17:35 ` [PATCH 0/6] btrfs: scrub: refine the " Boris Burkov
  7 siblings, 3 replies; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs

Currently the scrub error report is pretty long:

 BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
 BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)

Since we have so many things to output, it's not a surprise we got long
error lines.

To make the lines a little shorter, and make important info more
obvious, change the unify output to something like this:

 BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
 BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

The idea is, to put large values/small values/string separated meanwhile
skip the some descriptor directly:

- "logical LOGICAL(MIRROR)"
  LOGICAL is always a large value, meanwhile MIRROR is always a single
  digit.
  Thus combining them together, human can still split them instinctively.

- "physical DEVID(DEVPATH)PHYSICAL"
  DEVID is normally a short number, DEVPATH is a long string, and PHYSICAL
  is a normally a large number.
  And for most end users, the most important device path is still easy
  to catch, even surrounded by some large numbers.

- inode ROOT/INO(PATH)
  To locate a btrfs inode, we have to provide both root and inode
  number. Normally ROOT should be a much smaller number (3 digits)
  meanwhile the INO can be pretty large.

  However for the end user, the most important thing is the PATH, which
  is still not hard to locate.

Any better ideas would be appreciated to make those lines shorter.

However what we really need is a proper way to report all those info
(maybe put the file/dev name resolution into the user space) to user
space reliably (without rate limit).

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 55 +++++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 18b2ee3b1616..17e492076bf8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -226,6 +226,7 @@ struct scrub_warning {
 	u64			physical;
 	u64			logical;
 	struct btrfs_device	*dev;
+	int			mirror_num;
 };
 
 static void release_scrub_stripe(struct scrub_stripe *stripe)
@@ -427,12 +428,12 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 	 */
 	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
 		btrfs_warn_in_rcu(fs_info,
-"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, path: %s",
-				  swarn->errstr, swarn->logical,
-				  btrfs_dev_name(swarn->dev),
-				  swarn->physical,
-				  root, inum, offset,
-				  (char *)(unsigned long)ipath->fspath->val[i]);
+"%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
+				  swarn->errstr, root, inum,
+				  (char *)ipath->fspath->val[i], offset,
+				  swarn->logical, swarn->mirror_num,
+				  swarn->dev->devid, btrfs_dev_name(swarn->dev),
+				  swarn->physical);
 
 	btrfs_put_root(local_root);
 	free_ipath(ipath);
@@ -440,18 +441,17 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 
 err:
 	btrfs_warn_in_rcu(fs_info,
-			  "%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu: path resolving failed with ret=%d",
-			  swarn->errstr, swarn->logical,
-			  btrfs_dev_name(swarn->dev),
-			  swarn->physical,
-			  root, inum, offset, ret);
-
+	"%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
+			  swarn->errstr, root, inum, offset,
+			  swarn->logical, swarn->mirror_num,
+			  swarn->dev->devid, btrfs_dev_name(swarn->dev),
+			  swarn->physical);
 	free_ipath(ipath);
 	return 0;
 }
 
 static void scrub_print_common_warning(const char *errstr, struct btrfs_device *dev,
-				       u64 logical, u64 physical)
+				       u64 logical, u64 physical, int mirror_num)
 {
 	struct btrfs_fs_info *fs_info = dev->fs_info;
 	struct btrfs_path *path;
@@ -471,6 +471,7 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 	swarn.logical = logical;
 	swarn.errstr = errstr;
 	swarn.dev = NULL;
+	swarn.mirror_num = mirror_num;
 
 	ret = extent_from_logical(fs_info, swarn.logical, path, &found_key,
 				  &flags);
@@ -501,10 +502,11 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 			if (ret > 0)
 				break;
 			btrfs_warn_in_rcu(fs_info,
-"%s at logical %llu on dev %s, physical %llu: metadata %s (level %d) in tree %llu",
-				errstr, swarn.logical, btrfs_dev_name(dev),
-				swarn.physical, (ref_level ? "node" : "leaf"),
-				ref_level, ref_root);
+"%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu",
+				errstr, ref_root, ref_level,
+				swarn.logical, swarn.mirror_num,
+				dev->devid, btrfs_dev_name(dev),
+				swarn.physical);
 		}
 		btrfs_release_path(path);
 	} else {
@@ -879,27 +881,32 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 		 */
 		if (repaired) {
 			btrfs_err_rl_in_rcu(fs_info,
-			"fixed up error at logical %llu on dev %s physical %llu",
-					    logical, btrfs_dev_name(dev),
+			"fixed up error at logical %llu(%u) physical %llu(%s)%llu",
+					    logical, stripe->mirror_num,
+					    dev->devid, btrfs_dev_name(dev),
 					    physical);
 			continue;
 		}
 
 		/* The remaining are all for unrepaired. */
 		btrfs_err_rl_in_rcu(fs_info,
-	"unable to fixup (regular) error at logical %llu on dev %s physical %llu",
-					    logical, btrfs_dev_name(dev),
+	"unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
+					    logical, stripe->mirror_num,
+					    dev->devid, btrfs_dev_name(dev),
 					    physical);
 
 		if (test_bit(sector_nr, &stripe->io_error_bitmap))
 			scrub_print_common_warning("i/o error", dev,
-						     logical, physical);
+						     logical, physical,
+						     stripe->mirror_num);
 		if (test_bit(sector_nr, &stripe->csum_error_bitmap))
 			scrub_print_common_warning("checksum error", dev,
-						     logical, physical);
+						     logical, physical,
+						     stripe->mirror_num);
 		if (test_bit(sector_nr, &stripe->meta_error_bitmap))
 			scrub_print_common_warning("header error", dev,
-						     logical, physical);
+						     logical, physical,
+						     stripe->mirror_num);
 	}
 
 	spin_lock(&sctx->stat_lock);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/7] btrfs: scrub: fix the frequency of error messages
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
                   ` (5 preceding siblings ...)
  2024-03-14  9:50 ` [PATCH 6/7] btrfs: scrub: unify and shorten the error message Qu Wenruo
@ 2024-03-14  9:50 ` Qu Wenruo
  2024-03-14 17:51   ` Filipe Manana
  2024-03-14 17:35 ` [PATCH 0/6] btrfs: scrub: refine the " Boris Burkov
  7 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14  9:50 UTC (permalink / raw)
  To: linux-btrfs

For btrfs scrub error messages, I have hit a lot of support cases on
older kernels where no filename is outputted at all, with only error
messages like:

 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
 BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
 BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0

The "unable to fixup (regular) error" line shows we hit an unrepairable
error, then normally we would do data/metadata backref walk to grab the
correct info.

But we can hit cases like the inode is already orphan (unlinked from any
parent inode), or even the subvolume is orphan (unlinked and waiting to
be deleted).

In that case we're not sure what's the proper way to continue (Is it
some critical data/metadata? Would it prevent the system from booting?)

To improve the situation, this patch would:

- Ensure we at least output one message for each unrepairable error
  If by somehow we didn't output any error message for the error, we
  always fallback to the basic logical/physical error output, with its
  type (data or metadata).

- Remove the "unable to fixup" error message
  Since we have ensured at least one error message is outputted for each
  unrepairable corruption, there is no need for that fallback one.

Now one unrepairable corruption would output a more refined and less
duplicated error message:

 BTRFS info (device dm-2): scrub: started on devid 1
 BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
 BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 17e492076bf8..84617a64b2d4 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -227,6 +227,7 @@ struct scrub_warning {
 	u64			logical;
 	struct btrfs_device	*dev;
 	int			mirror_num;
+	bool			message_printed;
 };
 
 static void release_scrub_stripe(struct scrub_stripe *stripe)
@@ -426,26 +427,29 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
 	 * we deliberately ignore the bit ipath might have been too small to
 	 * hold all of the paths here
 	 */
-	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
-		btrfs_warn_in_rcu(fs_info,
+	for (i = 0; i < ipath->fspath->elem_cnt; ++i) {
+		btrfs_warn_rl_in_rcu(fs_info,
 "%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
 				  swarn->errstr, root, inum,
 				  (char *)ipath->fspath->val[i], offset,
 				  swarn->logical, swarn->mirror_num,
 				  swarn->dev->devid, btrfs_dev_name(swarn->dev),
 				  swarn->physical);
+		swarn->message_printed = true;
+	}
 
 	btrfs_put_root(local_root);
 	free_ipath(ipath);
 	return 0;
 
 err:
-	btrfs_warn_in_rcu(fs_info,
+	btrfs_warn_rl_in_rcu(fs_info,
 	"%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
 			  swarn->errstr, root, inum, offset,
 			  swarn->logical, swarn->mirror_num,
 			  swarn->dev->devid, btrfs_dev_name(swarn->dev),
 			  swarn->physical);
+	swarn->message_printed = true;
 	free_ipath(ipath);
 	return 0;
 }
@@ -472,6 +476,7 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 	swarn.errstr = errstr;
 	swarn.dev = NULL;
 	swarn.mirror_num = mirror_num;
+	swarn.message_printed = false;
 
 	ret = extent_from_logical(fs_info, swarn.logical, path, &found_key,
 				  &flags);
@@ -488,26 +493,31 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 		unsigned long ptr = 0;
 		u8 ref_level;
 		u64 ref_root;
+		bool message_printed = false;
 
 		while (true) {
 			ret = tree_backref_for_extent(&ptr, eb, &found_key, ei,
 						      item_size, &ref_root,
 						      &ref_level);
-			if (ret < 0) {
-				btrfs_warn(fs_info,
-				"failed to resolve tree backref for logical %llu: %d",
-						  swarn.logical, ret);
+			if (ret < 0)
 				break;
-			}
 			if (ret > 0)
 				break;
-			btrfs_warn_in_rcu(fs_info,
+			btrfs_warn_rl_in_rcu(fs_info,
 "%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu",
 				errstr, ref_root, ref_level,
 				swarn.logical, swarn.mirror_num,
 				dev->devid, btrfs_dev_name(dev),
 				swarn.physical);
+			message_printed = true;
 		}
+		if (!message_printed)
+			btrfs_warn_rl_in_rcu(fs_info,
+"%s at metadata, logical %llu(%u) physical %llu(%s)%llu",
+				errstr, swarn.logical, swarn.mirror_num,
+				dev->devid, btrfs_dev_name(dev),
+				swarn.physical);
+
 		btrfs_release_path(path);
 	} else {
 		struct btrfs_backref_walk_ctx ctx = { 0 };
@@ -522,8 +532,14 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
 		swarn.dev = dev;
 
 		iterate_extent_inodes(&ctx, true, scrub_print_warning_inode, &swarn);
+		if (!swarn.message_printed) {
+			btrfs_warn_rl_in_rcu(fs_info,
+"%s at data, unresolved path name, logical %llu(%u) physical %llu(%s)%llu",
+				errstr, swarn.logical, swarn.mirror_num,
+				dev->devid, btrfs_dev_name(dev),
+				swarn.physical);
+		}
 	}
-
 out:
 	btrfs_free_path(path);
 }
@@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
 		}
 
 		/* The remaining are all for unrepaired. */
-		btrfs_err_rl_in_rcu(fs_info,
-	"unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
-					    logical, stripe->mirror_num,
-					    dev->devid, btrfs_dev_name(dev),
-					    physical);
-
 		if (test_bit(sector_nr, &stripe->io_error_bitmap))
 			scrub_print_common_warning("i/o error", dev,
 						     logical, physical,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address
  2024-03-14  9:50 ` [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address Qu Wenruo
@ 2024-03-14 12:24   ` Anand Jain
  2024-03-14 20:15     ` Qu Wenruo
  2024-03-14 17:10   ` Filipe Manana
  1 sibling, 1 reply; 31+ messages in thread
From: Anand Jain @ 2024-03-14 12:24 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 3/14/24 15:20, Qu Wenruo wrote:
> [BUG]
> Scrub is not reporting the correct logical/physical address, it can be
> verified by the following script:
> 
>   # mkfs.btrfs -f $dev1
>   # mount $dev1 $mnt
>   # xfs_io -f -c "pwrite -S 0xaa 0 128k" $mnt/file1
>   # umount $mnt
>   # xfs_io -f -c "pwrite -S 0xff 13647872 4k" $dev1
>   # mount $dev1 $mnt
>   # btrfs scrub start -fB $mnt
>   # umount $mnt
> 
> Note above 13647872 is the physical address for logical 13631488 + 4K.
> 
> Scrub would report the following error:
> 
>   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
>   BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file1)
> 
> On the other hand, "btrfs check --check-data-csum" is reporting the
> correct logical/physical address:
> 
>   Checking filesystem on /dev/test/scratch1
>   UUID: db2eb621-b09d-4f24-8199-da17dc7b3201
>   [5/7] checking csums against data
>   mirror 1 bytenr 13647872 csum 0x13fec125 expected csum 0x656bd64e
>   ERROR: errors found in csum tree
> 
> [CAUSE]
> In the function scrub_stripe_report_errors(), we always use the
> stripe->logical and its physical address to print the error message, not
> taking the sector number into consideration at all.
> 
> [FIX]
> Fix the error reporting function by calculating logical/physical with
> the sector number.
> 
> Now the scrub report is correct:
> 
>   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
>   BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)
> 
> Fixes: 0096580713ff ("btrfs: scrub: introduce error reporting functionality for scrub_stripe")
> Signed-off-by: Qu Wenruo <wqu@suse.com>


To help accurate debug on stable kernel, we could also add
CC: stable@vger.kernel.org #6.4+

Not too particular though.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Thanks.


> ---
>   fs/btrfs/scrub.c | 22 +++++++++++++---------
>   1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index fa25004ab04e..119e98797b21 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -870,7 +870,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>   				      DEFAULT_RATELIMIT_BURST);
>   	struct btrfs_fs_info *fs_info = sctx->fs_info;
>   	struct btrfs_device *dev = NULL;
> -	u64 physical = 0;
> +	u64 stripe_physical = stripe->physical;
>   	int nr_data_sectors = 0;
>   	int nr_meta_sectors = 0;
>   	int nr_nodatacsum_sectors = 0;
> @@ -903,13 +903,17 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>   		 */
>   		if (ret < 0)
>   			goto skip;
> -		physical = bioc->stripes[stripe_index].physical;
> +		stripe_physical = bioc->stripes[stripe_index].physical;
>   		dev = bioc->stripes[stripe_index].dev;
>   		btrfs_put_bioc(bioc);
>   	}
>   
>   skip:
>   	for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> +		u64 logical = stripe->logical +
> +			      (sector_nr << fs_info->sectorsize_bits);
> +		u64 physical = stripe_physical +
> +			      (sector_nr << fs_info->sectorsize_bits);
>   		bool repaired = false;
>   
>   		if (stripe->sectors[sector_nr].is_metadata) {
> @@ -938,12 +942,12 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>   			if (dev) {
>   				btrfs_err_rl_in_rcu(fs_info,
>   			"fixed up error at logical %llu on dev %s physical %llu",
> -					    stripe->logical, btrfs_dev_name(dev),
> +					    logical, btrfs_dev_name(dev),
>   					    physical);
>   			} else {
>   				btrfs_err_rl_in_rcu(fs_info,
>   			"fixed up error at logical %llu on mirror %u",
> -					    stripe->logical, stripe->mirror_num);
> +					    logical, stripe->mirror_num);
>   			}
>   			continue;
>   		}
> @@ -952,26 +956,26 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>   		if (dev) {
>   			btrfs_err_rl_in_rcu(fs_info,
>   	"unable to fixup (regular) error at logical %llu on dev %s physical %llu",
> -					    stripe->logical, btrfs_dev_name(dev),
> +					    logical, btrfs_dev_name(dev),
>   					    physical);
>   		} else {
>   			btrfs_err_rl_in_rcu(fs_info,
>   	"unable to fixup (regular) error at logical %llu on mirror %u",
> -					    stripe->logical, stripe->mirror_num);
> +					    logical, stripe->mirror_num);
>   		}
>   
>   		if (test_bit(sector_nr, &stripe->io_error_bitmap))
>   			if (__ratelimit(&rs) && dev)
>   				scrub_print_common_warning("i/o error", dev, false,
> -						     stripe->logical, physical);
> +						     logical, physical);
>   		if (test_bit(sector_nr, &stripe->csum_error_bitmap))
>   			if (__ratelimit(&rs) && dev)
>   				scrub_print_common_warning("checksum error", dev, false,
> -						     stripe->logical, physical);
> +						     logical, physical);
>   		if (test_bit(sector_nr, &stripe->meta_error_bitmap))
>   			if (__ratelimit(&rs) && dev)
>   				scrub_print_common_warning("header error", dev, false,
> -						     stripe->logical, physical);
> +						     logical, physical);
>   	}
>   
>   	spin_lock(&sctx->stat_lock);


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address
  2024-03-14  9:50 ` [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address Qu Wenruo
  2024-03-14 12:24   ` Anand Jain
@ 2024-03-14 17:10   ` Filipe Manana
  1 sibling, 0 replies; 31+ messages in thread
From: Filipe Manana @ 2024-03-14 17:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 9:50 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> Scrub is not reporting the correct logical/physical address, it can be
> verified by the following script:
>
>  # mkfs.btrfs -f $dev1
>  # mount $dev1 $mnt
>  # xfs_io -f -c "pwrite -S 0xaa 0 128k" $mnt/file1
>  # umount $mnt
>  # xfs_io -f -c "pwrite -S 0xff 13647872 4k" $dev1
>  # mount $dev1 $mnt
>  # btrfs scrub start -fB $mnt
>  # umount $mnt
>
> Note above 13647872 is the physical address for logical 13631488 + 4K.
>
> Scrub would report the following error:
>
>  BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
>  BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file1)
>
> On the other hand, "btrfs check --check-data-csum" is reporting the
> correct logical/physical address:
>
>  Checking filesystem on /dev/test/scratch1
>  UUID: db2eb621-b09d-4f24-8199-da17dc7b3201
>  [5/7] checking csums against data
>  mirror 1 bytenr 13647872 csum 0x13fec125 expected csum 0x656bd64e
>  ERROR: errors found in csum tree
>
> [CAUSE]
> In the function scrub_stripe_report_errors(), we always use the
> stripe->logical and its physical address to print the error message, not
> taking the sector number into consideration at all.
>
> [FIX]
> Fix the error reporting function by calculating logical/physical with
> the sector number.
>
> Now the scrub report is correct:
>
>  BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
>  BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)
>
> Fixes: 0096580713ff ("btrfs: scrub: introduce error reporting functionality for scrub_stripe")
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, just one minor and optional comment below.

> ---
>  fs/btrfs/scrub.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index fa25004ab04e..119e98797b21 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -870,7 +870,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                                       DEFAULT_RATELIMIT_BURST);
>         struct btrfs_fs_info *fs_info = sctx->fs_info;
>         struct btrfs_device *dev = NULL;
> -       u64 physical = 0;
> +       u64 stripe_physical = stripe->physical;
>         int nr_data_sectors = 0;
>         int nr_meta_sectors = 0;
>         int nr_nodatacsum_sectors = 0;
> @@ -903,13 +903,17 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                  */
>                 if (ret < 0)
>                         goto skip;
> -               physical = bioc->stripes[stripe_index].physical;
> +               stripe_physical = bioc->stripes[stripe_index].physical;
>                 dev = bioc->stripes[stripe_index].dev;
>                 btrfs_put_bioc(bioc);
>         }
>
>  skip:
>         for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> +               u64 logical = stripe->logical +
> +                             (sector_nr << fs_info->sectorsize_bits);
> +               u64 physical = stripe_physical +
> +                             (sector_nr << fs_info->sectorsize_bits);

These could be made const to make it clear they're not supposed to change.

Thanks.

>                 bool repaired = false;
>
>                 if (stripe->sectors[sector_nr].is_metadata) {
> @@ -938,12 +942,12 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                         if (dev) {
>                                 btrfs_err_rl_in_rcu(fs_info,
>                         "fixed up error at logical %llu on dev %s physical %llu",
> -                                           stripe->logical, btrfs_dev_name(dev),
> +                                           logical, btrfs_dev_name(dev),
>                                             physical);
>                         } else {
>                                 btrfs_err_rl_in_rcu(fs_info,
>                         "fixed up error at logical %llu on mirror %u",
> -                                           stripe->logical, stripe->mirror_num);
> +                                           logical, stripe->mirror_num);
>                         }
>                         continue;
>                 }
> @@ -952,26 +956,26 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                 if (dev) {
>                         btrfs_err_rl_in_rcu(fs_info,
>         "unable to fixup (regular) error at logical %llu on dev %s physical %llu",
> -                                           stripe->logical, btrfs_dev_name(dev),
> +                                           logical, btrfs_dev_name(dev),
>                                             physical);
>                 } else {
>                         btrfs_err_rl_in_rcu(fs_info,
>         "unable to fixup (regular) error at logical %llu on mirror %u",
> -                                           stripe->logical, stripe->mirror_num);
> +                                           logical, stripe->mirror_num);
>                 }
>
>                 if (test_bit(sector_nr, &stripe->io_error_bitmap))
>                         if (__ratelimit(&rs) && dev)
>                                 scrub_print_common_warning("i/o error", dev, false,
> -                                                    stripe->logical, physical);
> +                                                    logical, physical);
>                 if (test_bit(sector_nr, &stripe->csum_error_bitmap))
>                         if (__ratelimit(&rs) && dev)
>                                 scrub_print_common_warning("checksum error", dev, false,
> -                                                    stripe->logical, physical);
> +                                                    logical, physical);
>                 if (test_bit(sector_nr, &stripe->meta_error_bitmap))
>                         if (__ratelimit(&rs) && dev)
>                                 scrub_print_common_warning("header error", dev, false,
> -                                                    stripe->logical, physical);
> +                                                    logical, physical);
>         }
>
>         spin_lock(&sctx->stat_lock);
> --
> 2.44.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
  2024-03-14  9:50 ` [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() Qu Wenruo
@ 2024-03-14 17:17   ` Filipe Manana
  2024-03-14 20:26     ` Qu Wenruo
  0 siblings, 1 reply; 31+ messages in thread
From: Filipe Manana @ 2024-03-14 17:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, stable

On Thu, Mar 14, 2024 at 9:54 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently when we increase the device statistics, it would always lead
> to an error message in the kernel log.
>
> I would argue this behavior is not ideal:
>
> - It would flood the dmesg and bury real important messages
>   One common scenario is scrub.
>   If scrub hit some errors, it would cause both scrub and
>   btrfs_dev_stat_inc_and_print() to print error messages.
>
>   And in that case, btrfs_dev_stat_inc_and_print() is completely
>   useless.
>
> - The results of btrfs_dev_stat_inc_and_print() is mostly for history
>   monitoring, doesn't has enough details
>
>   If we trigger the errors during regular read, such messages from
>   btrfs_dev_stat_inc_and_print() won't help us to locate the cause
>   either.
>
> The real usage for the btrfs device statistics is for some user space
> daemon to check if there is any new errors, acting like some checks on
> SMART, thus we don't really need/want those messages in dmesg.
>
> This patch would reduce the log level to debug (disabled by default) for
> btrfs_dev_stat_inc_and_print().
> For users really want to utilize btrfs devices statistics, they should
> go check "btrfs device stats" periodically, and we should focus the
> kernel error messages to more important things.

Not sure if this is the right thing to do.

In the scrub context it can be annoying for sure.
Other cases I'm not so sure about, because having error messages in
dmesg/syslog may help notice issues more quickly.

>
> CC: stable@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e49935a54da0..126145950ed3 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7828,7 +7828,7 @@ void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
>
>         if (!dev->dev_stats_valid)
>                 return;
> -       btrfs_err_rl_in_rcu(dev->fs_info,
> +       btrfs_debug_rl_in_rcu(dev->fs_info,
>                 "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
>                            btrfs_dev_name(dev),
>                            btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
> --
> 2.44.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning()
  2024-03-14  9:50 ` [PATCH 3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() Qu Wenruo
@ 2024-03-14 17:19   ` Filipe Manana
  0 siblings, 0 replies; 31+ messages in thread
From: Filipe Manana @ 2024-03-14 17:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 9:50 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Since commit 2a2dc22f7e9d ("btrfs: scrub: use dedicated super block
> verification function to scrub one super block"), the super block
> scrubbing is handled in a dedicated helper, thus
> scrub_print_common_warning() no longer needs to print warning for super
> block errors.
>
> Just remove the parameter.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/scrub.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 119e98797b21..5e371ffdb37b 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -476,7 +476,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>  }
>
>  static void scrub_print_common_warning(const char *errstr, struct btrfs_device *dev,
> -                                      bool is_super, u64 logical, u64 physical)
> +                                      u64 logical, u64 physical)
>  {
>         struct btrfs_fs_info *fs_info = dev->fs_info;
>         struct btrfs_path *path;
> @@ -488,12 +488,6 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>         u32 item_size;
>         int ret;
>
> -       /* Super block error, no need to search extent tree. */
> -       if (is_super) {
> -               btrfs_warn_in_rcu(fs_info, "%s on device %s, physical %llu",
> -                                 errstr, btrfs_dev_name(dev), physical);
> -               return;
> -       }
>         path = btrfs_alloc_path();
>         if (!path)
>                 return;
> @@ -966,15 +960,15 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>
>                 if (test_bit(sector_nr, &stripe->io_error_bitmap))
>                         if (__ratelimit(&rs) && dev)
> -                               scrub_print_common_warning("i/o error", dev, false,
> +                               scrub_print_common_warning("i/o error", dev,
>                                                      logical, physical);
>                 if (test_bit(sector_nr, &stripe->csum_error_bitmap))
>                         if (__ratelimit(&rs) && dev)
> -                               scrub_print_common_warning("checksum error", dev, false,
> +                               scrub_print_common_warning("checksum error", dev,
>                                                      logical, physical);
>                 if (test_bit(sector_nr, &stripe->meta_error_bitmap))
>                         if (__ratelimit(&rs) && dev)
> -                               scrub_print_common_warning("header error", dev, false,
> +                               scrub_print_common_warning("header error", dev,
>                                                      logical, physical);
>         }
>
> --
> 2.44.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors()
  2024-03-14  9:50 ` [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() Qu Wenruo
@ 2024-03-14 17:26   ` Filipe Manana
  2024-03-14 20:28     ` Qu Wenruo
  0 siblings, 1 reply; 31+ messages in thread
From: Filipe Manana @ 2024-03-14 17:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 9:50 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The @stripe passed into scrub_stripe_report_errors() either has
> stripe->dev and stripe->physical properly populated (regular data
> stripes) or stripe->flags would have SCRUB_STRIPE_FLAG_NO_REPORT
> (RAID56 P/Q stripes).
>
> Thus there is no need to go with btrfs_map_block() to get the
> dev/physical.
>
> Just add an extra ASSERT() to make sure we get stripe->dev populated
> correctly.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/scrub.c | 59 ++++++------------------------------------------
>  1 file changed, 7 insertions(+), 52 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 5e371ffdb37b..277583464371 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -860,10 +860,8 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
>  static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                                        struct scrub_stripe *stripe)
>  {
> -       static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> -                                     DEFAULT_RATELIMIT_BURST);
>         struct btrfs_fs_info *fs_info = sctx->fs_info;
> -       struct btrfs_device *dev = NULL;
> +       struct btrfs_device *dev = stripe->dev;
>         u64 stripe_physical = stripe->physical;
>         int nr_data_sectors = 0;
>         int nr_meta_sectors = 0;
> @@ -874,35 +872,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>         if (test_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state))
>                 return;
>
> -       /*
> -        * Init needed infos for error reporting.
> -        *
> -        * Although our scrub_stripe infrastructure is mostly based on btrfs_submit_bio()
> -        * thus no need for dev/physical, error reporting still needs dev and physical.
> -        */
> -       if (!bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) {
> -               u64 mapped_len = fs_info->sectorsize;
> -               struct btrfs_io_context *bioc = NULL;
> -               int stripe_index = stripe->mirror_num - 1;
> -               int ret;
> -
> -               /* For scrub, our mirror_num should always start at 1. */
> -               ASSERT(stripe->mirror_num >= 1);
> -               ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
> -                                     stripe->logical, &mapped_len, &bioc,
> -                                     NULL, NULL);
> -               /*
> -                * If we failed, dev will be NULL, and later detailed reports
> -                * will just be skipped.
> -                */
> -               if (ret < 0)
> -                       goto skip;
> -               stripe_physical = bioc->stripes[stripe_index].physical;
> -               dev = bioc->stripes[stripe_index].dev;
> -               btrfs_put_bioc(bioc);
> -       }
> -
> -skip:
> +       ASSERT(dev);
>         for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
>                 u64 logical = stripe->logical +
>                               (sector_nr << fs_info->sectorsize_bits);
> @@ -933,42 +903,27 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                  * output the message of repaired message.
>                  */
>                 if (repaired) {
> -                       if (dev) {
> -                               btrfs_err_rl_in_rcu(fs_info,
> +                       btrfs_err_rl_in_rcu(fs_info,
>                         "fixed up error at logical %llu on dev %s physical %llu",
>                                             logical, btrfs_dev_name(dev),
>                                             physical);
> -                       } else {
> -                               btrfs_err_rl_in_rcu(fs_info,
> -                       "fixed up error at logical %llu on mirror %u",
> -                                           logical, stripe->mirror_num);
> -                       }
>                         continue;
>                 }
>
>                 /* The remaining are all for unrepaired. */
> -               if (dev) {
> -                       btrfs_err_rl_in_rcu(fs_info,
> +               btrfs_err_rl_in_rcu(fs_info,
>         "unable to fixup (regular) error at logical %llu on dev %s physical %llu",
>                                             logical, btrfs_dev_name(dev),
>                                             physical);
> -               } else {
> -                       btrfs_err_rl_in_rcu(fs_info,
> -       "unable to fixup (regular) error at logical %llu on mirror %u",
> -                                           logical, stripe->mirror_num);
> -               }
>
>                 if (test_bit(sector_nr, &stripe->io_error_bitmap))
> -                       if (__ratelimit(&rs) && dev)
> -                               scrub_print_common_warning("i/o error", dev,
> +                       scrub_print_common_warning("i/o error", dev,
>                                                      logical, physical);
>                 if (test_bit(sector_nr, &stripe->csum_error_bitmap))
> -                       if (__ratelimit(&rs) && dev)
> -                               scrub_print_common_warning("checksum error", dev,
> +                       scrub_print_common_warning("checksum error", dev,
>                                                      logical, physical);
>                 if (test_bit(sector_nr, &stripe->meta_error_bitmap))
> -                       if (__ratelimit(&rs) && dev)
> -                               scrub_print_common_warning("header error", dev,
> +                       scrub_print_common_warning("header error", dev,

Why are we removing the rate limiting?
This seems like an unrelated change.

Everything else looks ok.

Thanks.

>                                                      logical, physical);
>         }
>
> --
> 2.44.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/7] btrfs: scrub: simplify the inode iteration output
  2024-03-14  9:50 ` [PATCH 5/7] btrfs: scrub: simplify the inode iteration output Qu Wenruo
@ 2024-03-14 17:29   ` Filipe Manana
  0 siblings, 0 replies; 31+ messages in thread
From: Filipe Manana @ 2024-03-14 17:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The following two output are not really needed:
>
> - nlinks
>   Normally file inodes should have nlinks as 1, for those inodes have
>   multiple hard links, the inode/root number is still enough to pin it
>   down to certain inode.
>
> - size
>   The size is always fixed to sector size.
>
> By removing the nlinks output, we can reduce one inode item lookup.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Seems reasonable, thanks.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/scrub.c | 27 +--------------------------
>  1 file changed, 1 insertion(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 277583464371..18b2ee3b1616 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -388,17 +388,13 @@ static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
>  static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>                                      u64 root, void *warn_ctx)
>  {
> -       u32 nlink;
>         int ret;
>         int i;
>         unsigned nofs_flag;
> -       struct extent_buffer *eb;
> -       struct btrfs_inode_item *inode_item;
>         struct scrub_warning *swarn = warn_ctx;
>         struct btrfs_fs_info *fs_info = swarn->dev->fs_info;
>         struct inode_fs_paths *ipath = NULL;
>         struct btrfs_root *local_root;
> -       struct btrfs_key key;
>
>         local_root = btrfs_get_fs_root(fs_info, root, true);
>         if (IS_ERR(local_root)) {
> @@ -406,26 +402,6 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>                 goto err;
>         }
>
> -       /*
> -        * this makes the path point to (inum INODE_ITEM ioff)
> -        */
> -       key.objectid = inum;
> -       key.type = BTRFS_INODE_ITEM_KEY;
> -       key.offset = 0;
> -
> -       ret = btrfs_search_slot(NULL, local_root, &key, swarn->path, 0, 0);
> -       if (ret) {
> -               btrfs_put_root(local_root);
> -               btrfs_release_path(swarn->path);
> -               goto err;
> -       }
> -
> -       eb = swarn->path->nodes[0];
> -       inode_item = btrfs_item_ptr(eb, swarn->path->slots[0],
> -                                       struct btrfs_inode_item);
> -       nlink = btrfs_inode_nlink(eb, inode_item);
> -       btrfs_release_path(swarn->path);
> -
>         /*
>          * init_path might indirectly call vmalloc, or use GFP_KERNEL. Scrub
>          * uses GFP_NOFS in this context, so we keep it consistent but it does
> @@ -451,12 +427,11 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>          */
>         for (i = 0; i < ipath->fspath->elem_cnt; ++i)
>                 btrfs_warn_in_rcu(fs_info,
> -"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %u, links %u (path: %s)",
> +"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, path: %s",
>                                   swarn->errstr, swarn->logical,
>                                   btrfs_dev_name(swarn->dev),
>                                   swarn->physical,
>                                   root, inum, offset,
> -                                 fs_info->sectorsize, nlink,
>                                   (char *)(unsigned long)ipath->fspath->val[i]);
>
>         btrfs_put_root(local_root);
> --
> 2.44.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/6] btrfs: scrub: refine the error messages
  2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
                   ` (6 preceding siblings ...)
  2024-03-14  9:50 ` [PATCH 7/7] btrfs: scrub: fix the frequency of error messages Qu Wenruo
@ 2024-03-14 17:35 ` Boris Burkov
  2024-03-14 20:30   ` Qu Wenruo
  7 siblings, 1 reply; 31+ messages in thread
From: Boris Burkov @ 2024-03-14 17:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 08:20:13PM +1030, Qu Wenruo wrote:
> During some support sessions, I found older kernels are always report
> very unuseful scrub error messages like:
> 
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
> 
> There are two problems:
> - No proper details about the corruption
>   No metadata or data indication, nor the filename or the tree id.
>   Even the involved kernel (and newer kernels after the scrub rework)
>   has the ability to do backref walk and find out the file path or the
>   tree backref.
> 
>   My guess is, for data sometimes the corrupted sector is no longer
>   referred by any data extent.
> 
> - Too noisy and useless error message from
>   btrfs_dev_stat_inc_and_print()
>   I'd argue we should not output any error message just for
>   btrfs_dev_stat_inc_and_print().
> 
> After the series, the error message would look like this:
> 
>  BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

Stoked on this series, thanks for doing it!

qq: would it be helpful to also include the actual/expected csum? I
think it's particularly helpful when one or the other is either zeros or
the checksum of the zero block.

> 
> This involves the following enhancement:
> 
> - Single line
>   And we ensure we output at least one line for the error we hit.
>   No more unrelated btrfs_dev_stat_inc_and_print() output.
> 
> - Proper fall backs
>   We have the following different fall back options
>   * Repaired
>     Just a line of something repaired for logical/physical address.
> 
>   * Detailed data info
>     Including the following elements (if possible), and if higher
>     priority ones can not be fetched, it would be skipped and try
>     lower priority items:
>     + file path (can have multiple ones)
>     + root/inode number and offset
>     + logical/physical address (always output)
> 
>   * Detaile metadata info
>     The priority list is:
>     + root ref/level
>     + logical/physical address (always output)
> 
>   For the worst case of data corruption, we would still have some like:
> 
>    BTRFS warning (device dm-2): checksum error at data, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
> 
>   And similar ones for metadata:
> 
>    BTRFS warning (device dm-2): checksum error at meta, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
> 
> The first patch is fixing a regression in the error message, which leads
> to bad logical/physical bytenr.
> The second one would reduce the log level for
> btrfs_dev_stat_inc_and_print().
> 
> Path 3~4 are cleanups to remove unnecessary parameters.
> 
> The remaining reworks the format and refine the error message frequency.
> 
> Qu Wenruo (7):
>   btrfs: scrub: fix incorrectly reported logical/physical address
>   btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
>   btrfs: scrub: remove unused is_super parameter from
>     scrub_print_common_warning()
>   btrfs: scrub: remove unnecessary dev/physical lookup for
>     scrub_stripe_report_errors()
>   btrfs: scrub: simplify the inode iteration output
>   btrfs: scrub: unify and shorten the error message
>   btrfs: scrub: fix the frequency of error messages
> 
>  fs/btrfs/scrub.c   | 185 ++++++++++++++++-----------------------------
>  fs/btrfs/volumes.c |   2 +-
>  2 files changed, 66 insertions(+), 121 deletions(-)
> 
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
  2024-03-14  9:50 ` [PATCH 6/7] btrfs: scrub: unify and shorten the error message Qu Wenruo
@ 2024-03-14 17:40   ` Filipe Manana
  2024-03-14 20:56     ` Qu Wenruo
  2024-03-14 23:05   ` kernel test robot
  2024-03-15 11:44   ` kernel test robot
  2 siblings, 1 reply; 31+ messages in thread
From: Filipe Manana @ 2024-03-14 17:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Currently the scrub error report is pretty long:
>
>  BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
>  BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)
>
> Since we have so many things to output, it's not a surprise we got long
> error lines.
>
> To make the lines a little shorter, and make important info more
> obvious, change the unify output to something like this:
>
>  BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>  BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872

I find that hard to read because:

1) Please leave spaces before opening parenthesis.
   That makes things less cluttered and easier to the eyes, more
consistent with what we generally do and it's the formal way to do
things in English (see
https://www.scribens.com/grammar-rules/typography/spacing.html);

2) Instead of "inode 5/257(file1)", something a lot easier to
understand like "inode 5 root 257 path \"file1\"", which leaves no
margin for doubt about what each number is;

3) Same with "logical 13647872(1)" - what is the 1? That will be the
question anyone reading such a log message will likely have.
    Something like "logical 13647872 mirror 1" makes it clear;

4) Same with "physical 1(/dev/mapper/test-scratch1)13647872".
    Something like "physical 13647872 device ID 1 device path
\"/dev/mapper/test-scratch1\"", makes it clear what each number is and
easier to read.

Thanks.

>
> The idea is, to put large values/small values/string separated meanwhile
> skip the some descriptor directly:
>
> - "logical LOGICAL(MIRROR)"
>   LOGICAL is always a large value, meanwhile MIRROR is always a single
>   digit.
>   Thus combining them together, human can still split them instinctively.
>
> - "physical DEVID(DEVPATH)PHYSICAL"
>   DEVID is normally a short number, DEVPATH is a long string, and PHYSICAL
>   is a normally a large number.
>   And for most end users, the most important device path is still easy
>   to catch, even surrounded by some large numbers.
>
> - inode ROOT/INO(PATH)
>   To locate a btrfs inode, we have to provide both root and inode
>   number. Normally ROOT should be a much smaller number (3 digits)
>   meanwhile the INO can be pretty large.
>
>   However for the end user, the most important thing is the PATH, which
>   is still not hard to locate.
>
> Any better ideas would be appreciated to make those lines shorter.
>
> However what we really need is a proper way to report all those info
> (maybe put the file/dev name resolution into the user space) to user
> space reliably (without rate limit).
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/scrub.c | 55 +++++++++++++++++++++++++++---------------------
>  1 file changed, 31 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 18b2ee3b1616..17e492076bf8 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -226,6 +226,7 @@ struct scrub_warning {
>         u64                     physical;
>         u64                     logical;
>         struct btrfs_device     *dev;
> +       int                     mirror_num;
>  };
>
>  static void release_scrub_stripe(struct scrub_stripe *stripe)
> @@ -427,12 +428,12 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>          */
>         for (i = 0; i < ipath->fspath->elem_cnt; ++i)
>                 btrfs_warn_in_rcu(fs_info,
> -"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, path: %s",
> -                                 swarn->errstr, swarn->logical,
> -                                 btrfs_dev_name(swarn->dev),
> -                                 swarn->physical,
> -                                 root, inum, offset,
> -                                 (char *)(unsigned long)ipath->fspath->val[i]);
> +"%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
> +                                 swarn->errstr, root, inum,
> +                                 (char *)ipath->fspath->val[i], offset,
> +                                 swarn->logical, swarn->mirror_num,
> +                                 swarn->dev->devid, btrfs_dev_name(swarn->dev),
> +                                 swarn->physical);
>
>         btrfs_put_root(local_root);
>         free_ipath(ipath);
> @@ -440,18 +441,17 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>
>  err:
>         btrfs_warn_in_rcu(fs_info,
> -                         "%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu: path resolving failed with ret=%d",
> -                         swarn->errstr, swarn->logical,
> -                         btrfs_dev_name(swarn->dev),
> -                         swarn->physical,
> -                         root, inum, offset, ret);
> -
> +       "%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
> +                         swarn->errstr, root, inum, offset,
> +                         swarn->logical, swarn->mirror_num,
> +                         swarn->dev->devid, btrfs_dev_name(swarn->dev),
> +                         swarn->physical);
>         free_ipath(ipath);
>         return 0;
>  }
>
>  static void scrub_print_common_warning(const char *errstr, struct btrfs_device *dev,
> -                                      u64 logical, u64 physical)
> +                                      u64 logical, u64 physical, int mirror_num)
>  {
>         struct btrfs_fs_info *fs_info = dev->fs_info;
>         struct btrfs_path *path;
> @@ -471,6 +471,7 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>         swarn.logical = logical;
>         swarn.errstr = errstr;
>         swarn.dev = NULL;
> +       swarn.mirror_num = mirror_num;
>
>         ret = extent_from_logical(fs_info, swarn.logical, path, &found_key,
>                                   &flags);
> @@ -501,10 +502,11 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>                         if (ret > 0)
>                                 break;
>                         btrfs_warn_in_rcu(fs_info,
> -"%s at logical %llu on dev %s, physical %llu: metadata %s (level %d) in tree %llu",
> -                               errstr, swarn.logical, btrfs_dev_name(dev),
> -                               swarn.physical, (ref_level ? "node" : "leaf"),
> -                               ref_level, ref_root);
> +"%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu",
> +                               errstr, ref_root, ref_level,
> +                               swarn.logical, swarn.mirror_num,
> +                               dev->devid, btrfs_dev_name(dev),
> +                               swarn.physical);
>                 }
>                 btrfs_release_path(path);
>         } else {
> @@ -879,27 +881,32 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                  */
>                 if (repaired) {
>                         btrfs_err_rl_in_rcu(fs_info,
> -                       "fixed up error at logical %llu on dev %s physical %llu",
> -                                           logical, btrfs_dev_name(dev),
> +                       "fixed up error at logical %llu(%u) physical %llu(%s)%llu",
> +                                           logical, stripe->mirror_num,
> +                                           dev->devid, btrfs_dev_name(dev),
>                                             physical);
>                         continue;
>                 }
>
>                 /* The remaining are all for unrepaired. */
>                 btrfs_err_rl_in_rcu(fs_info,
> -       "unable to fixup (regular) error at logical %llu on dev %s physical %llu",
> -                                           logical, btrfs_dev_name(dev),
> +       "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
> +                                           logical, stripe->mirror_num,
> +                                           dev->devid, btrfs_dev_name(dev),
>                                             physical);
>
>                 if (test_bit(sector_nr, &stripe->io_error_bitmap))
>                         scrub_print_common_warning("i/o error", dev,
> -                                                    logical, physical);
> +                                                    logical, physical,
> +                                                    stripe->mirror_num);
>                 if (test_bit(sector_nr, &stripe->csum_error_bitmap))
>                         scrub_print_common_warning("checksum error", dev,
> -                                                    logical, physical);
> +                                                    logical, physical,
> +                                                    stripe->mirror_num);
>                 if (test_bit(sector_nr, &stripe->meta_error_bitmap))
>                         scrub_print_common_warning("header error", dev,
> -                                                    logical, physical);
> +                                                    logical, physical,
> +                                                    stripe->mirror_num);
>         }
>
>         spin_lock(&sctx->stat_lock);
> --
> 2.44.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] btrfs: scrub: fix the frequency of error messages
  2024-03-14  9:50 ` [PATCH 7/7] btrfs: scrub: fix the frequency of error messages Qu Wenruo
@ 2024-03-14 17:51   ` Filipe Manana
  2024-03-14 20:32     ` Qu Wenruo
  0 siblings, 1 reply; 31+ messages in thread
From: Filipe Manana @ 2024-03-14 17:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote:
>
> For btrfs scrub error messages, I have hit a lot of support cases on
> older kernels where no filename is outputted at all, with only error
> messages like:
>
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
>  BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
>  BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2825, gen 0
>
> The "unable to fixup (regular) error" line shows we hit an unrepairable
> error, then normally we would do data/metadata backref walk to grab the
> correct info.
>
> But we can hit cases like the inode is already orphan (unlinked from any
> parent inode), or even the subvolume is orphan (unlinked and waiting to
> be deleted).
>
> In that case we're not sure what's the proper way to continue (Is it
> some critical data/metadata? Would it prevent the system from booting?)
>
> To improve the situation, this patch would:
>
> - Ensure we at least output one message for each unrepairable error
>   If by somehow we didn't output any error message for the error, we
>   always fallback to the basic logical/physical error output, with its
>   type (data or metadata).
>
> - Remove the "unable to fixup" error message
>   Since we have ensured at least one error message is outputted for each
>   unrepairable corruption, there is no need for that fallback one.
>
> Now one unrepairable corruption would output a more refined and less
> duplicated error message:
>
>  BTRFS info (device dm-2): scrub: started on devid 1
>  BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>  BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/scrub.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 17e492076bf8..84617a64b2d4 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -227,6 +227,7 @@ struct scrub_warning {
>         u64                     logical;
>         struct btrfs_device     *dev;
>         int                     mirror_num;
> +       bool                    message_printed;
>  };
>
>  static void release_scrub_stripe(struct scrub_stripe *stripe)
> @@ -426,26 +427,29 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
>          * we deliberately ignore the bit ipath might have been too small to
>          * hold all of the paths here
>          */
> -       for (i = 0; i < ipath->fspath->elem_cnt; ++i)
> -               btrfs_warn_in_rcu(fs_info,
> +       for (i = 0; i < ipath->fspath->elem_cnt; ++i) {
> +               btrfs_warn_rl_in_rcu(fs_info,
>  "%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
>                                   swarn->errstr, root, inum,
>                                   (char *)ipath->fspath->val[i], offset,
>                                   swarn->logical, swarn->mirror_num,
>                                   swarn->dev->devid, btrfs_dev_name(swarn->dev),
>                                   swarn->physical);
> +               swarn->message_printed = true;
> +       }
>
>         btrfs_put_root(local_root);
>         free_ipath(ipath);
>         return 0;
>
>  err:
> -       btrfs_warn_in_rcu(fs_info,
> +       btrfs_warn_rl_in_rcu(fs_info,
>         "%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
>                           swarn->errstr, root, inum, offset,
>                           swarn->logical, swarn->mirror_num,
>                           swarn->dev->devid, btrfs_dev_name(swarn->dev),
>                           swarn->physical);
> +       swarn->message_printed = true;
>         free_ipath(ipath);
>         return 0;
>  }
> @@ -472,6 +476,7 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>         swarn.errstr = errstr;
>         swarn.dev = NULL;
>         swarn.mirror_num = mirror_num;
> +       swarn.message_printed = false;
>
>         ret = extent_from_logical(fs_info, swarn.logical, path, &found_key,
>                                   &flags);
> @@ -488,26 +493,31 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>                 unsigned long ptr = 0;
>                 u8 ref_level;
>                 u64 ref_root;
> +               bool message_printed = false;
>
>                 while (true) {
>                         ret = tree_backref_for_extent(&ptr, eb, &found_key, ei,
>                                                       item_size, &ref_root,
>                                                       &ref_level);
> -                       if (ret < 0) {
> -                               btrfs_warn(fs_info,
> -                               "failed to resolve tree backref for logical %llu: %d",
> -                                                 swarn.logical, ret);
> +                       if (ret < 0)
>                                 break;
> -                       }
>                         if (ret > 0)
>                                 break;
> -                       btrfs_warn_in_rcu(fs_info,
> +                       btrfs_warn_rl_in_rcu(fs_info,
>  "%s at metadata in tree %llu(%u), logical %llu(%u) physical %llu(%s)%llu",
>                                 errstr, ref_root, ref_level,
>                                 swarn.logical, swarn.mirror_num,
>                                 dev->devid, btrfs_dev_name(dev),
>                                 swarn.physical);
> +                       message_printed = true;
>                 }
> +               if (!message_printed)
> +                       btrfs_warn_rl_in_rcu(fs_info,
> +"%s at metadata, logical %llu(%u) physical %llu(%s)%llu",
> +                               errstr, swarn.logical, swarn.mirror_num,
> +                               dev->devid, btrfs_dev_name(dev),
> +                               swarn.physical);
> +
>                 btrfs_release_path(path);
>         } else {
>                 struct btrfs_backref_walk_ctx ctx = { 0 };
> @@ -522,8 +532,14 @@ static void scrub_print_common_warning(const char *errstr, struct btrfs_device *
>                 swarn.dev = dev;
>
>                 iterate_extent_inodes(&ctx, true, scrub_print_warning_inode, &swarn);
> +               if (!swarn.message_printed) {
> +                       btrfs_warn_rl_in_rcu(fs_info,
> +"%s at data, unresolved path name, logical %llu(%u) physical %llu(%s)%llu",
> +                               errstr, swarn.logical, swarn.mirror_num,
> +                               dev->devid, btrfs_dev_name(dev),
> +                               swarn.physical);
> +               }

No need for the curly braces here, it's a single statement, so we
follow the style of dropping them.


>         }
> -
>  out:
>         btrfs_free_path(path);
>  }
> @@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>                 }
>
>                 /* The remaining are all for unrepaired. */
> -               btrfs_err_rl_in_rcu(fs_info,
> -       "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
> -                                           logical, stripe->mirror_num,
> -                                           dev->devid, btrfs_dev_name(dev),
> -                                           physical);
> -

This hunk seems unintentional?

Thanks.

>                 if (test_bit(sector_nr, &stripe->io_error_bitmap))
>                         scrub_print_common_warning("i/o error", dev,
>                                                      logical, physical,
> --
> 2.44.0
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address
  2024-03-14 12:24   ` Anand Jain
@ 2024-03-14 20:15     ` Qu Wenruo
  2024-04-04 20:01       ` David Sterba
  0 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14 20:15 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs



在 2024/3/14 22:54, Anand Jain 写道:
> On 3/14/24 15:20, Qu Wenruo wrote:
>> [BUG]
>> Scrub is not reporting the correct logical/physical address, it can be
>> verified by the following script:
>>
>>   # mkfs.btrfs -f $dev1
>>   # mount $dev1 $mnt
>>   # xfs_io -f -c "pwrite -S 0xaa 0 128k" $mnt/file1
>>   # umount $mnt
>>   # xfs_io -f -c "pwrite -S 0xff 13647872 4k" $dev1
>>   # mount $dev1 $mnt
>>   # btrfs scrub start -fB $mnt
>>   # umount $mnt
>>
>> Note above 13647872 is the physical address for logical 13631488 + 4K.
>>
>> Scrub would report the following error:
>>
>>   BTRFS error (device dm-2): unable to fixup (regular) error at 
>> logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
>>   BTRFS warning (device dm-2): checksum error at logical 13631488 on 
>> dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, 
>> offset 0, length 4096, links 1 (path: file1)
>>
>> On the other hand, "btrfs check --check-data-csum" is reporting the
>> correct logical/physical address:
>>
>>   Checking filesystem on /dev/test/scratch1
>>   UUID: db2eb621-b09d-4f24-8199-da17dc7b3201
>>   [5/7] checking csums against data
>>   mirror 1 bytenr 13647872 csum 0x13fec125 expected csum 0x656bd64e
>>   ERROR: errors found in csum tree
>>
>> [CAUSE]
>> In the function scrub_stripe_report_errors(), we always use the
>> stripe->logical and its physical address to print the error message, not
>> taking the sector number into consideration at all.
>>
>> [FIX]
>> Fix the error reporting function by calculating logical/physical with
>> the sector number.
>>
>> Now the scrub report is correct:
>>
>>   BTRFS error (device dm-2): unable to fixup (regular) error at 
>> logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
>>   BTRFS warning (device dm-2): checksum error at logical 13647872 on 
>> dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, 
>> offset 16384, length 4096, links 1 (path: file1)
>>
>> Fixes: 0096580713ff ("btrfs: scrub: introduce error reporting 
>> functionality for scrub_stripe")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> 
> To help accurate debug on stable kernel, we could also add
> CC: stable@vger.kernel.org #6.4+

IIRC Fixes: tag is more accurate than just CC stable, as stable has the 
correct bots catching the fixes line.

Thanks,
Qu
> 
> Not too particular though.
> 
> Reviewed-by: Anand Jain <anand.jain@oracle.com>
> 
> Thanks.
> 
> 
>> ---
>>   fs/btrfs/scrub.c | 22 +++++++++++++---------
>>   1 file changed, 13 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index fa25004ab04e..119e98797b21 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -870,7 +870,7 @@ static void scrub_stripe_report_errors(struct 
>> scrub_ctx *sctx,
>>                         DEFAULT_RATELIMIT_BURST);
>>       struct btrfs_fs_info *fs_info = sctx->fs_info;
>>       struct btrfs_device *dev = NULL;
>> -    u64 physical = 0;
>> +    u64 stripe_physical = stripe->physical;
>>       int nr_data_sectors = 0;
>>       int nr_meta_sectors = 0;
>>       int nr_nodatacsum_sectors = 0;
>> @@ -903,13 +903,17 @@ static void scrub_stripe_report_errors(struct 
>> scrub_ctx *sctx,
>>            */
>>           if (ret < 0)
>>               goto skip;
>> -        physical = bioc->stripes[stripe_index].physical;
>> +        stripe_physical = bioc->stripes[stripe_index].physical;
>>           dev = bioc->stripes[stripe_index].dev;
>>           btrfs_put_bioc(bioc);
>>       }
>>   skip:
>>       for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, 
>> stripe->nr_sectors) {
>> +        u64 logical = stripe->logical +
>> +                  (sector_nr << fs_info->sectorsize_bits);
>> +        u64 physical = stripe_physical +
>> +                  (sector_nr << fs_info->sectorsize_bits);
>>           bool repaired = false;
>>           if (stripe->sectors[sector_nr].is_metadata) {
>> @@ -938,12 +942,12 @@ static void scrub_stripe_report_errors(struct 
>> scrub_ctx *sctx,
>>               if (dev) {
>>                   btrfs_err_rl_in_rcu(fs_info,
>>               "fixed up error at logical %llu on dev %s physical %llu",
>> -                        stripe->logical, btrfs_dev_name(dev),
>> +                        logical, btrfs_dev_name(dev),
>>                           physical);
>>               } else {
>>                   btrfs_err_rl_in_rcu(fs_info,
>>               "fixed up error at logical %llu on mirror %u",
>> -                        stripe->logical, stripe->mirror_num);
>> +                        logical, stripe->mirror_num);
>>               }
>>               continue;
>>           }
>> @@ -952,26 +956,26 @@ static void scrub_stripe_report_errors(struct 
>> scrub_ctx *sctx,
>>           if (dev) {
>>               btrfs_err_rl_in_rcu(fs_info,
>>       "unable to fixup (regular) error at logical %llu on dev %s 
>> physical %llu",
>> -                        stripe->logical, btrfs_dev_name(dev),
>> +                        logical, btrfs_dev_name(dev),
>>                           physical);
>>           } else {
>>               btrfs_err_rl_in_rcu(fs_info,
>>       "unable to fixup (regular) error at logical %llu on mirror %u",
>> -                        stripe->logical, stripe->mirror_num);
>> +                        logical, stripe->mirror_num);
>>           }
>>           if (test_bit(sector_nr, &stripe->io_error_bitmap))
>>               if (__ratelimit(&rs) && dev)
>>                   scrub_print_common_warning("i/o error", dev, false,
>> -                             stripe->logical, physical);
>> +                             logical, physical);
>>           if (test_bit(sector_nr, &stripe->csum_error_bitmap))
>>               if (__ratelimit(&rs) && dev)
>>                   scrub_print_common_warning("checksum error", dev, 
>> false,
>> -                             stripe->logical, physical);
>> +                             logical, physical);
>>           if (test_bit(sector_nr, &stripe->meta_error_bitmap))
>>               if (__ratelimit(&rs) && dev)
>>                   scrub_print_common_warning("header error", dev, false,
>> -                             stripe->logical, physical);
>> +                             logical, physical);
>>       }
>>       spin_lock(&sctx->stat_lock);
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
  2024-03-14 17:17   ` Filipe Manana
@ 2024-03-14 20:26     ` Qu Wenruo
  2024-03-18 19:54       ` Filipe Manana
  0 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14 20:26 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, stable



在 2024/3/15 03:47, Filipe Manana 写道:
> On Thu, Mar 14, 2024 at 9:54 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Currently when we increase the device statistics, it would always lead
>> to an error message in the kernel log.
>>
>> I would argue this behavior is not ideal:
>>
>> - It would flood the dmesg and bury real important messages
>>    One common scenario is scrub.
>>    If scrub hit some errors, it would cause both scrub and
>>    btrfs_dev_stat_inc_and_print() to print error messages.
>>
>>    And in that case, btrfs_dev_stat_inc_and_print() is completely
>>    useless.
>>
>> - The results of btrfs_dev_stat_inc_and_print() is mostly for history
>>    monitoring, doesn't has enough details
>>
>>    If we trigger the errors during regular read, such messages from
>>    btrfs_dev_stat_inc_and_print() won't help us to locate the cause
>>    either.
>>
>> The real usage for the btrfs device statistics is for some user space
>> daemon to check if there is any new errors, acting like some checks on
>> SMART, thus we don't really need/want those messages in dmesg.
>>
>> This patch would reduce the log level to debug (disabled by default) for
>> btrfs_dev_stat_inc_and_print().
>> For users really want to utilize btrfs devices statistics, they should
>> go check "btrfs device stats" periodically, and we should focus the
>> kernel error messages to more important things.
>
> Not sure if this is the right thing to do.
>
> In the scrub context it can be annoying for sure.
> Other cases I'm not so sure about, because having error messages in
> dmesg/syslog may help notice issues more quickly.

For non-scrub cases, I'd argue we already have enough output:

No matter if the error is fixed or not, every time a mirror got csum
mismatch or other errors, we already have error message output:

  Data: btrfs_print_data_csum_error()
  Meta: btrfs_validate_extent_buffer()

For repaired ones, we have extra output from bio layer for both metadata
and data:
  btrfs_repair_io_failure()

So I'd say the dev_stat ones are already duplicated.

Thanks,
Qu
>
>>
>> CC: stable@vger.kernel.org
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/volumes.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e49935a54da0..126145950ed3 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -7828,7 +7828,7 @@ void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
>>
>>          if (!dev->dev_stats_valid)
>>                  return;
>> -       btrfs_err_rl_in_rcu(dev->fs_info,
>> +       btrfs_debug_rl_in_rcu(dev->fs_info,
>>                  "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
>>                             btrfs_dev_name(dev),
>>                             btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
>> --
>> 2.44.0
>>
>>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors()
  2024-03-14 17:26   ` Filipe Manana
@ 2024-03-14 20:28     ` Qu Wenruo
  2024-03-18 16:16       ` Filipe Manana
  0 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14 20:28 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2024/3/15 03:56, Filipe Manana 写道:
> On Thu, Mar 14, 2024 at 9:50 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> The @stripe passed into scrub_stripe_report_errors() either has
>> stripe->dev and stripe->physical properly populated (regular data
>> stripes) or stripe->flags would have SCRUB_STRIPE_FLAG_NO_REPORT
>> (RAID56 P/Q stripes).
>>
>> Thus there is no need to go with btrfs_map_block() to get the
>> dev/physical.
>>
>> Just add an extra ASSERT() to make sure we get stripe->dev populated
>> correctly.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/scrub.c | 59 ++++++------------------------------------------
>>   1 file changed, 7 insertions(+), 52 deletions(-)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index 5e371ffdb37b..277583464371 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -860,10 +860,8 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
>>   static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>                                         struct scrub_stripe *stripe)
>>   {
>> -       static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
>> -                                     DEFAULT_RATELIMIT_BURST);
>>          struct btrfs_fs_info *fs_info = sctx->fs_info;
>> -       struct btrfs_device *dev = NULL;
>> +       struct btrfs_device *dev = stripe->dev;
>>          u64 stripe_physical = stripe->physical;
>>          int nr_data_sectors = 0;
>>          int nr_meta_sectors = 0;
>> @@ -874,35 +872,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>          if (test_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state))
>>                  return;
>>
>> -       /*
>> -        * Init needed infos for error reporting.
>> -        *
>> -        * Although our scrub_stripe infrastructure is mostly based on btrfs_submit_bio()
>> -        * thus no need for dev/physical, error reporting still needs dev and physical.
>> -        */
>> -       if (!bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) {
>> -               u64 mapped_len = fs_info->sectorsize;
>> -               struct btrfs_io_context *bioc = NULL;
>> -               int stripe_index = stripe->mirror_num - 1;
>> -               int ret;
>> -
>> -               /* For scrub, our mirror_num should always start at 1. */
>> -               ASSERT(stripe->mirror_num >= 1);
>> -               ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
>> -                                     stripe->logical, &mapped_len, &bioc,
>> -                                     NULL, NULL);
>> -               /*
>> -                * If we failed, dev will be NULL, and later detailed reports
>> -                * will just be skipped.
>> -                */
>> -               if (ret < 0)
>> -                       goto skip;
>> -               stripe_physical = bioc->stripes[stripe_index].physical;
>> -               dev = bioc->stripes[stripe_index].dev;
>> -               btrfs_put_bioc(bioc);
>> -       }
>> -
>> -skip:
>> +       ASSERT(dev);
>>          for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
>>                  u64 logical = stripe->logical +
>>                                (sector_nr << fs_info->sectorsize_bits);
>> @@ -933,42 +903,27 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>                   * output the message of repaired message.
>>                   */
>>                  if (repaired) {
>> -                       if (dev) {
>> -                               btrfs_err_rl_in_rcu(fs_info,
>> +                       btrfs_err_rl_in_rcu(fs_info,
>>                          "fixed up error at logical %llu on dev %s physical %llu",
>>                                              logical, btrfs_dev_name(dev),
>>                                              physical);
>> -                       } else {
>> -                               btrfs_err_rl_in_rcu(fs_info,
>> -                       "fixed up error at logical %llu on mirror %u",
>> -                                           logical, stripe->mirror_num);
>> -                       }
>>                          continue;
>>                  }
>>
>>                  /* The remaining are all for unrepaired. */
>> -               if (dev) {
>> -                       btrfs_err_rl_in_rcu(fs_info,
>> +               btrfs_err_rl_in_rcu(fs_info,
>>          "unable to fixup (regular) error at logical %llu on dev %s physical %llu",
>>                                              logical, btrfs_dev_name(dev),
>>                                              physical);
>> -               } else {
>> -                       btrfs_err_rl_in_rcu(fs_info,
>> -       "unable to fixup (regular) error at logical %llu on mirror %u",
>> -                                           logical, stripe->mirror_num);
>> -               }
>>
>>                  if (test_bit(sector_nr, &stripe->io_error_bitmap))
>> -                       if (__ratelimit(&rs) && dev)
>> -                               scrub_print_common_warning("i/o error", dev,
>> +                       scrub_print_common_warning("i/o error", dev,
>>                                                       logical, physical);
>>                  if (test_bit(sector_nr, &stripe->csum_error_bitmap))
>> -                       if (__ratelimit(&rs) && dev)
>> -                               scrub_print_common_warning("checksum error", dev,
>> +                       scrub_print_common_warning("checksum error", dev,
>>                                                       logical, physical);
>>                  if (test_bit(sector_nr, &stripe->meta_error_bitmap))
>> -                       if (__ratelimit(&rs) && dev)
>> -                               scrub_print_common_warning("header error", dev,
>> +                       scrub_print_common_warning("header error", dev,
>
> Why are we removing the rate limiting?
> This seems like an unrelated change.

Because the ratelimit is not consistent.

E.g. the "fixed up error" line is not rated limited by @rs, but by
btrfs_err_rl_in_rcu().

And we would have scrub_print_common_warning() to go btrfs_*_rl() helper
in the coming patches.

Thanks,
Qu
>
> Everything else looks ok.
>
> Thanks.
>
>>                                                       logical, physical);
>>          }
>>
>> --
>> 2.44.0
>>
>>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/6] btrfs: scrub: refine the error messages
  2024-03-14 17:35 ` [PATCH 0/6] btrfs: scrub: refine the " Boris Burkov
@ 2024-03-14 20:30   ` Qu Wenruo
  0 siblings, 0 replies; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14 20:30 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs



在 2024/3/15 04:05, Boris Burkov 写道:
> On Thu, Mar 14, 2024 at 08:20:13PM +1030, Qu Wenruo wrote:
>> During some support sessions, I found older kernels are always report
>> very unuseful scrub error messages like:
>>
>>   BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2823, gen 0
>>   BTRFS error (device dm-0): unable to fixup (regular) error at logical 1563504640 on dev /dev/mapper/sys-rootlv
>>   BTRFS error (device dm-0): bdev /dev/mapper/sys-rootlv errs: wr 0, rd 0, flush 0, corrupt 2824, gen 0
>>   BTRFS error (device dm-0): unable to fixup (regular) error at logical 1579016192 on dev /dev/mapper/sys-rootlv
>>
>> There are two problems:
>> - No proper details about the corruption
>>    No metadata or data indication, nor the filename or the tree id.
>>    Even the involved kernel (and newer kernels after the scrub rework)
>>    has the ability to do backref walk and find out the file path or the
>>    tree backref.
>>
>>    My guess is, for data sometimes the corrupted sector is no longer
>>    referred by any data extent.
>>
>> - Too noisy and useless error message from
>>    btrfs_dev_stat_inc_and_print()
>>    I'd argue we should not output any error message just for
>>    btrfs_dev_stat_inc_and_print().
>>
>> After the series, the error message would look like this:
>>
>>   BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>
> Stoked on this series, thanks for doing it!
>
> qq: would it be helpful to also include the actual/expected csum? I
> think it's particularly helpful when one or the other is either zeros or
> the checksum of the zero block.

It's a little too long to add (especially considering SHA256).
And even with CRC32C, I have difficulty to expose any all zero/one checksum.
(Maybe I did too few support recently?)

Thanks,
Qu

>
>>
>> This involves the following enhancement:
>>
>> - Single line
>>    And we ensure we output at least one line for the error we hit.
>>    No more unrelated btrfs_dev_stat_inc_and_print() output.
>>
>> - Proper fall backs
>>    We have the following different fall back options
>>    * Repaired
>>      Just a line of something repaired for logical/physical address.
>>
>>    * Detailed data info
>>      Including the following elements (if possible), and if higher
>>      priority ones can not be fetched, it would be skipped and try
>>      lower priority items:
>>      + file path (can have multiple ones)
>>      + root/inode number and offset
>>      + logical/physical address (always output)
>>
>>    * Detaile metadata info
>>      The priority list is:
>>      + root ref/level
>>      + logical/physical address (always output)
>>
>>    For the worst case of data corruption, we would still have some like:
>>
>>     BTRFS warning (device dm-2): checksum error at data, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>>
>>    And similar ones for metadata:
>>
>>     BTRFS warning (device dm-2): checksum error at meta, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>>
>> The first patch is fixing a regression in the error message, which leads
>> to bad logical/physical bytenr.
>> The second one would reduce the log level for
>> btrfs_dev_stat_inc_and_print().
>>
>> Path 3~4 are cleanups to remove unnecessary parameters.
>>
>> The remaining reworks the format and refine the error message frequency.
>>
>> Qu Wenruo (7):
>>    btrfs: scrub: fix incorrectly reported logical/physical address
>>    btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
>>    btrfs: scrub: remove unused is_super parameter from
>>      scrub_print_common_warning()
>>    btrfs: scrub: remove unnecessary dev/physical lookup for
>>      scrub_stripe_report_errors()
>>    btrfs: scrub: simplify the inode iteration output
>>    btrfs: scrub: unify and shorten the error message
>>    btrfs: scrub: fix the frequency of error messages
>>
>>   fs/btrfs/scrub.c   | 185 ++++++++++++++++-----------------------------
>>   fs/btrfs/volumes.c |   2 +-
>>   2 files changed, 66 insertions(+), 121 deletions(-)
>>
>> --
>> 2.44.0
>>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/7] btrfs: scrub: fix the frequency of error messages
  2024-03-14 17:51   ` Filipe Manana
@ 2024-03-14 20:32     ` Qu Wenruo
  0 siblings, 0 replies; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14 20:32 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2024/3/15 04:21, Filipe Manana 写道:
[...]
>> +               }
>
> No need for the curly braces here, it's a single statement, so we
> follow the style of dropping them.
>
>
>>          }
>> -
>>   out:
>>          btrfs_free_path(path);
>>   }
>> @@ -889,12 +905,6 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>                  }
>>
>>                  /* The remaining are all for unrepaired. */
>> -               btrfs_err_rl_in_rcu(fs_info,
>> -       "unable to fixup (regular) error at logical %llu(%u) physical %llu(%s)%llu",
>> -                                           logical, stripe->mirror_num,
>> -                                           dev->devid, btrfs_dev_name(dev),
>> -                                           physical);
>> -
>
> This hunk seems unintentional?

This is intentional:

  - Remove the "unable to fixup" error message
   Since we have ensured at least one error message is outputted for each
   unrepairable corruption, there is no need for that fallback one.

Since we have ensured we would at least output one line for each
unrepaired corruption (before rate limit though), that line is duplicated.

Thanks,
Qu
>
> Thanks.
>
>>                  if (test_bit(sector_nr, &stripe->io_error_bitmap))
>>                          scrub_print_common_warning("i/o error", dev,
>>                                                       logical, physical,
>> --
>> 2.44.0
>>
>>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
  2024-03-14 17:40   ` Filipe Manana
@ 2024-03-14 20:56     ` Qu Wenruo
  2024-03-18 16:26       ` Filipe Manana
  0 siblings, 1 reply; 31+ messages in thread
From: Qu Wenruo @ 2024-03-14 20:56 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

在 2024/3/15 04:10, Filipe Manana 写道:
> On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> Currently the scrub error report is pretty long:
>>
>>   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
>>   BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)
>>
>> Since we have so many things to output, it's not a surprise we got long
>> error lines.
>>
>> To make the lines a little shorter, and make important info more
>> obvious, change the unify output to something like this:
>>
>>   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>>   BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
> 
> I find that hard to read because:
> 
> 1) Please leave spaces before opening parenthesis.
>     That makes things less cluttered and easier to the eyes, more
> consistent with what we generally do and it's the formal way to do
> things in English (see
> https://www.scribens.com/grammar-rules/typography/spacing.html);

Yep, I can do that, but I also want to keep the involved members 
together thus closer.

Not sure if adding spaces would still keep the close relationships 
between the values.

E.g: inode 5/256 (file1) fileoff 16384, logical 123456 (1) physical 1 
(scratch1) 123456

It makes it a little harder to indicate that "(1)" is related to logical 
address (thus mirror number).

> 
> 2) Instead of "inode 5/257(file1)", something a lot easier to
> understand like "inode 5 root 257 path \"file1\"", which leaves no
> margin for doubt about what each number is;
> 
> 3) Same with "logical 13647872(1)" - what is the 1? That will be the
> question anyone reading such a log message will likely have.
>      Something like "logical 13647872 mirror 1" makes it clear;
> 
> 4) Same with "physical 1(/dev/mapper/test-scratch1)13647872".
>      Something like "physical 13647872 device ID 1 device path
> \"/dev/mapper/test-scratch1\"", makes it clear what each number is and
> easier to read.

I totally understand the original output format is much easier to read 
on its own.

However if we have hundreds lines of similar output, it's a different 
story, a small change in any of the value is much harder to catch, and 
the extra helpful prompt is in fact a burden other than a bless.

(That's why things like spreadsheet is much easier to reader for 
multiple similarly structured data, and in that case it's much better to 
craft a script to convert the lines into a csv)

Unfortunately we don't have the benefit (at least yet) to structure all 
these info into a structured output.

But what I'm doing here is to reduce the prompts to minimal (at most 4 
prompts, "inode", "fileoff", "logical", "physical"), so less duplicated 
strings for multiple lines of similar error messages.

The patchset is in the middle ground between fully detailed output (the 
old one, focusing on single line) and the fully script orient output (no 
prompt at all, just all numbers/strings, focusing on CSV like output).

Furthermore, I would also argue that, for entry level end users, they 
can still catch the critical info (like file path and device path) 
without much difficult, and those info is enough for them to take action.
(E.g. deleting the file, or replace the disk)

Yes, they would get confused on the devid or rootid, but that's fine and 
everyone can learn something new.

For experienced users who understand basics of btrfs, or us developers, 
the split in values are already arranged in a way similar sized values 
are never close together (aka, string/large/small value split).

Thanks,
Qu

> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
  2024-03-14  9:50 ` [PATCH 6/7] btrfs: scrub: unify and shorten the error message Qu Wenruo
  2024-03-14 17:40   ` Filipe Manana
@ 2024-03-14 23:05   ` kernel test robot
  2024-03-15 11:44   ` kernel test robot
  2 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2024-03-14 23:05 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: oe-kbuild-all

Hi Qu,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on linus/master v6.8 next-20240314]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-fix-incorrectly-reported-logical-physical-address/20240314-215457
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/6ba44b940e4e3eea573cad667ab8c0b2dd8f2c06.1710409033.git.wqu%40suse.com
patch subject: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
config: m68k-defconfig (https://download.01.org/0day-ci/archive/20240315/202403150650.dNFtHzxf-lkp@intel.com/config)
compiler: m68k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240315/202403150650.dNFtHzxf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403150650.dNFtHzxf-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from fs/btrfs/extent_io.h:16,
                    from fs/btrfs/locking.h:13,
                    from fs/btrfs/ctree.h:19,
                    from fs/btrfs/scrub.c:10:
   fs/btrfs/scrub.c: In function 'scrub_print_warning_inode':
>> fs/btrfs/scrub.c:433:35: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     433 |                                   (char *)ipath->fspath->val[i], offset,
         |                                   ^
   fs/btrfs/messages.h:27:39: note: in definition of macro 'btrfs_printk'
      27 |         _btrfs_printk(fs_info, fmt, ##args)
         |                                       ^~~~
   fs/btrfs/messages.h:66:9: note: in expansion of macro 'btrfs_printk_in_rcu'
      66 |         btrfs_printk_in_rcu(fs_info, KERN_WARNING fmt, ##args)
         |         ^~~~~~~~~~~~~~~~~~~
   fs/btrfs/scrub.c:430:17: note: in expansion of macro 'btrfs_warn_in_rcu'
     430 |                 btrfs_warn_in_rcu(fs_info,
         |                 ^~~~~~~~~~~~~~~~~


vim +433 fs/btrfs/scrub.c

   388	
   389	static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
   390					     u64 root, void *warn_ctx)
   391	{
   392		int ret;
   393		int i;
   394		unsigned nofs_flag;
   395		struct scrub_warning *swarn = warn_ctx;
   396		struct btrfs_fs_info *fs_info = swarn->dev->fs_info;
   397		struct inode_fs_paths *ipath = NULL;
   398		struct btrfs_root *local_root;
   399	
   400		local_root = btrfs_get_fs_root(fs_info, root, true);
   401		if (IS_ERR(local_root)) {
   402			ret = PTR_ERR(local_root);
   403			goto err;
   404		}
   405	
   406		/*
   407		 * init_path might indirectly call vmalloc, or use GFP_KERNEL. Scrub
   408		 * uses GFP_NOFS in this context, so we keep it consistent but it does
   409		 * not seem to be strictly necessary.
   410		 */
   411		nofs_flag = memalloc_nofs_save();
   412		ipath = init_ipath(4096, local_root, swarn->path);
   413		memalloc_nofs_restore(nofs_flag);
   414		if (IS_ERR(ipath)) {
   415			btrfs_put_root(local_root);
   416			ret = PTR_ERR(ipath);
   417			ipath = NULL;
   418			goto err;
   419		}
   420		ret = paths_from_inode(inum, ipath);
   421	
   422		if (ret < 0)
   423			goto err;
   424	
   425		/*
   426		 * we deliberately ignore the bit ipath might have been too small to
   427		 * hold all of the paths here
   428		 */
   429		for (i = 0; i < ipath->fspath->elem_cnt; ++i)
   430			btrfs_warn_in_rcu(fs_info,
   431	"%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
   432					  swarn->errstr, root, inum,
 > 433					  (char *)ipath->fspath->val[i], offset,
   434					  swarn->logical, swarn->mirror_num,
   435					  swarn->dev->devid, btrfs_dev_name(swarn->dev),
   436					  swarn->physical);
   437	
   438		btrfs_put_root(local_root);
   439		free_ipath(ipath);
   440		return 0;
   441	
   442	err:
   443		btrfs_warn_in_rcu(fs_info,
   444		"%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
   445				  swarn->errstr, root, inum, offset,
   446				  swarn->logical, swarn->mirror_num,
   447				  swarn->dev->devid, btrfs_dev_name(swarn->dev),
   448				  swarn->physical);
   449		free_ipath(ipath);
   450		return 0;
   451	}
   452	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
  2024-03-14  9:50 ` [PATCH 6/7] btrfs: scrub: unify and shorten the error message Qu Wenruo
  2024-03-14 17:40   ` Filipe Manana
  2024-03-14 23:05   ` kernel test robot
@ 2024-03-15 11:44   ` kernel test robot
  2 siblings, 0 replies; 31+ messages in thread
From: kernel test robot @ 2024-03-15 11:44 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: oe-kbuild-all

Hi Qu,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kdave/for-next]
[also build test WARNING on linus/master v6.8 next-20240315]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qu-Wenruo/btrfs-scrub-fix-incorrectly-reported-logical-physical-address/20240314-215457
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-next
patch link:    https://lore.kernel.org/r/6ba44b940e4e3eea573cad667ab8c0b2dd8f2c06.1710409033.git.wqu%40suse.com
patch subject: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
config: hexagon-randconfig-r111-20240315 (https://download.01.org/0day-ci/archive/20240315/202403151954.2aDLBPQm-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 8f68022f8e6e54d1aeae4ed301f5a015963089b7)
reproduce: (https://download.01.org/0day-ci/archive/20240315/202403151954.2aDLBPQm-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202403151954.2aDLBPQm-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> fs/btrfs/scrub.c:430:17: sparse: sparse: non size-preserving integer to pointer cast

vim +430 fs/btrfs/scrub.c

a2de733c78fa7a Arne Jansen    2011-03-08  388  
c7499a64dcf62b Filipe Manana  2022-11-01  389  static int scrub_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
c7499a64dcf62b Filipe Manana  2022-11-01  390  				     u64 root, void *warn_ctx)
558540c17771ea Jan Schmidt    2011-06-13  391  {
558540c17771ea Jan Schmidt    2011-06-13  392  	int ret;
558540c17771ea Jan Schmidt    2011-06-13  393  	int i;
de2491fdefe7e5 David Sterba   2017-05-31  394  	unsigned nofs_flag;
ff023aac31198e Stefan Behrens 2012-11-06  395  	struct scrub_warning *swarn = warn_ctx;
fb456252d3d9c0 Jeff Mahoney   2016-06-22  396  	struct btrfs_fs_info *fs_info = swarn->dev->fs_info;
558540c17771ea Jan Schmidt    2011-06-13  397  	struct inode_fs_paths *ipath = NULL;
558540c17771ea Jan Schmidt    2011-06-13  398  	struct btrfs_root *local_root;
558540c17771ea Jan Schmidt    2011-06-13  399  
56e9357a1e8167 David Sterba   2020-05-15  400  	local_root = btrfs_get_fs_root(fs_info, root, true);
558540c17771ea Jan Schmidt    2011-06-13  401  	if (IS_ERR(local_root)) {
558540c17771ea Jan Schmidt    2011-06-13  402  		ret = PTR_ERR(local_root);
558540c17771ea Jan Schmidt    2011-06-13  403  		goto err;
558540c17771ea Jan Schmidt    2011-06-13  404  	}
558540c17771ea Jan Schmidt    2011-06-13  405  
de2491fdefe7e5 David Sterba   2017-05-31  406  	/*
de2491fdefe7e5 David Sterba   2017-05-31  407  	 * init_path might indirectly call vmalloc, or use GFP_KERNEL. Scrub
de2491fdefe7e5 David Sterba   2017-05-31  408  	 * uses GFP_NOFS in this context, so we keep it consistent but it does
de2491fdefe7e5 David Sterba   2017-05-31  409  	 * not seem to be strictly necessary.
de2491fdefe7e5 David Sterba   2017-05-31  410  	 */
de2491fdefe7e5 David Sterba   2017-05-31  411  	nofs_flag = memalloc_nofs_save();
558540c17771ea Jan Schmidt    2011-06-13  412  	ipath = init_ipath(4096, local_root, swarn->path);
de2491fdefe7e5 David Sterba   2017-05-31  413  	memalloc_nofs_restore(nofs_flag);
26bdef541d26fd Dan Carpenter  2011-11-16  414  	if (IS_ERR(ipath)) {
0024652895e347 Josef Bacik    2020-01-24  415  		btrfs_put_root(local_root);
26bdef541d26fd Dan Carpenter  2011-11-16  416  		ret = PTR_ERR(ipath);
26bdef541d26fd Dan Carpenter  2011-11-16  417  		ipath = NULL;
26bdef541d26fd Dan Carpenter  2011-11-16  418  		goto err;
26bdef541d26fd Dan Carpenter  2011-11-16  419  	}
558540c17771ea Jan Schmidt    2011-06-13  420  	ret = paths_from_inode(inum, ipath);
558540c17771ea Jan Schmidt    2011-06-13  421  
558540c17771ea Jan Schmidt    2011-06-13  422  	if (ret < 0)
558540c17771ea Jan Schmidt    2011-06-13  423  		goto err;
558540c17771ea Jan Schmidt    2011-06-13  424  
558540c17771ea Jan Schmidt    2011-06-13  425  	/*
558540c17771ea Jan Schmidt    2011-06-13  426  	 * we deliberately ignore the bit ipath might have been too small to
558540c17771ea Jan Schmidt    2011-06-13  427  	 * hold all of the paths here
558540c17771ea Jan Schmidt    2011-06-13  428  	 */
558540c17771ea Jan Schmidt    2011-06-13  429  	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
5d163e0e68ce74 Jeff Mahoney   2016-09-20 @430  		btrfs_warn_in_rcu(fs_info,
023af88ce10a69 Qu Wenruo      2024-03-14  431  "%s at inode %lld/%llu(%s) fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
023af88ce10a69 Qu Wenruo      2024-03-14  432  				  swarn->errstr, root, inum,
023af88ce10a69 Qu Wenruo      2024-03-14  433  				  (char *)ipath->fspath->val[i], offset,
023af88ce10a69 Qu Wenruo      2024-03-14  434  				  swarn->logical, swarn->mirror_num,
023af88ce10a69 Qu Wenruo      2024-03-14  435  				  swarn->dev->devid, btrfs_dev_name(swarn->dev),
023af88ce10a69 Qu Wenruo      2024-03-14  436  				  swarn->physical);
558540c17771ea Jan Schmidt    2011-06-13  437  
0024652895e347 Josef Bacik    2020-01-24  438  	btrfs_put_root(local_root);
558540c17771ea Jan Schmidt    2011-06-13  439  	free_ipath(ipath);
558540c17771ea Jan Schmidt    2011-06-13  440  	return 0;
558540c17771ea Jan Schmidt    2011-06-13  441  
558540c17771ea Jan Schmidt    2011-06-13  442  err:
5d163e0e68ce74 Jeff Mahoney   2016-09-20  443  	btrfs_warn_in_rcu(fs_info,
023af88ce10a69 Qu Wenruo      2024-03-14  444  	"%s at inode %lld/%llu fileoff %llu, logical %llu(%u) physical %llu(%s)%llu",
023af88ce10a69 Qu Wenruo      2024-03-14  445  			  swarn->errstr, root, inum, offset,
023af88ce10a69 Qu Wenruo      2024-03-14  446  			  swarn->logical, swarn->mirror_num,
023af88ce10a69 Qu Wenruo      2024-03-14  447  			  swarn->dev->devid, btrfs_dev_name(swarn->dev),
023af88ce10a69 Qu Wenruo      2024-03-14  448  			  swarn->physical);
558540c17771ea Jan Schmidt    2011-06-13  449  	free_ipath(ipath);
558540c17771ea Jan Schmidt    2011-06-13  450  	return 0;
558540c17771ea Jan Schmidt    2011-06-13  451  }
558540c17771ea Jan Schmidt    2011-06-13  452  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors()
  2024-03-14 20:28     ` Qu Wenruo
@ 2024-03-18 16:16       ` Filipe Manana
  2024-03-18 19:53         ` Qu Wenruo
  0 siblings, 1 reply; 31+ messages in thread
From: Filipe Manana @ 2024-03-18 16:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Thu, Mar 14, 2024 at 8:28 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/3/15 03:56, Filipe Manana 写道:
> > On Thu, Mar 14, 2024 at 9:50 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> The @stripe passed into scrub_stripe_report_errors() either has
> >> stripe->dev and stripe->physical properly populated (regular data
> >> stripes) or stripe->flags would have SCRUB_STRIPE_FLAG_NO_REPORT
> >> (RAID56 P/Q stripes).
> >>
> >> Thus there is no need to go with btrfs_map_block() to get the
> >> dev/physical.
> >>
> >> Just add an extra ASSERT() to make sure we get stripe->dev populated
> >> correctly.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>   fs/btrfs/scrub.c | 59 ++++++------------------------------------------
> >>   1 file changed, 7 insertions(+), 52 deletions(-)
> >>
> >> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> >> index 5e371ffdb37b..277583464371 100644
> >> --- a/fs/btrfs/scrub.c
> >> +++ b/fs/btrfs/scrub.c
> >> @@ -860,10 +860,8 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
> >>   static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
> >>                                         struct scrub_stripe *stripe)
> >>   {
> >> -       static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
> >> -                                     DEFAULT_RATELIMIT_BURST);
> >>          struct btrfs_fs_info *fs_info = sctx->fs_info;
> >> -       struct btrfs_device *dev = NULL;
> >> +       struct btrfs_device *dev = stripe->dev;
> >>          u64 stripe_physical = stripe->physical;
> >>          int nr_data_sectors = 0;
> >>          int nr_meta_sectors = 0;
> >> @@ -874,35 +872,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
> >>          if (test_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state))
> >>                  return;
> >>
> >> -       /*
> >> -        * Init needed infos for error reporting.
> >> -        *
> >> -        * Although our scrub_stripe infrastructure is mostly based on btrfs_submit_bio()
> >> -        * thus no need for dev/physical, error reporting still needs dev and physical.
> >> -        */
> >> -       if (!bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) {
> >> -               u64 mapped_len = fs_info->sectorsize;
> >> -               struct btrfs_io_context *bioc = NULL;
> >> -               int stripe_index = stripe->mirror_num - 1;
> >> -               int ret;
> >> -
> >> -               /* For scrub, our mirror_num should always start at 1. */
> >> -               ASSERT(stripe->mirror_num >= 1);
> >> -               ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
> >> -                                     stripe->logical, &mapped_len, &bioc,
> >> -                                     NULL, NULL);
> >> -               /*
> >> -                * If we failed, dev will be NULL, and later detailed reports
> >> -                * will just be skipped.
> >> -                */
> >> -               if (ret < 0)
> >> -                       goto skip;
> >> -               stripe_physical = bioc->stripes[stripe_index].physical;
> >> -               dev = bioc->stripes[stripe_index].dev;
> >> -               btrfs_put_bioc(bioc);
> >> -       }
> >> -
> >> -skip:
> >> +       ASSERT(dev);
> >>          for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
> >>                  u64 logical = stripe->logical +
> >>                                (sector_nr << fs_info->sectorsize_bits);
> >> @@ -933,42 +903,27 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
> >>                   * output the message of repaired message.
> >>                   */
> >>                  if (repaired) {
> >> -                       if (dev) {
> >> -                               btrfs_err_rl_in_rcu(fs_info,
> >> +                       btrfs_err_rl_in_rcu(fs_info,
> >>                          "fixed up error at logical %llu on dev %s physical %llu",
> >>                                              logical, btrfs_dev_name(dev),
> >>                                              physical);
> >> -                       } else {
> >> -                               btrfs_err_rl_in_rcu(fs_info,
> >> -                       "fixed up error at logical %llu on mirror %u",
> >> -                                           logical, stripe->mirror_num);
> >> -                       }
> >>                          continue;
> >>                  }
> >>
> >>                  /* The remaining are all for unrepaired. */
> >> -               if (dev) {
> >> -                       btrfs_err_rl_in_rcu(fs_info,
> >> +               btrfs_err_rl_in_rcu(fs_info,
> >>          "unable to fixup (regular) error at logical %llu on dev %s physical %llu",
> >>                                              logical, btrfs_dev_name(dev),
> >>                                              physical);
> >> -               } else {
> >> -                       btrfs_err_rl_in_rcu(fs_info,
> >> -       "unable to fixup (regular) error at logical %llu on mirror %u",
> >> -                                           logical, stripe->mirror_num);
> >> -               }
> >>
> >>                  if (test_bit(sector_nr, &stripe->io_error_bitmap))
> >> -                       if (__ratelimit(&rs) && dev)
> >> -                               scrub_print_common_warning("i/o error", dev,
> >> +                       scrub_print_common_warning("i/o error", dev,
> >>                                                       logical, physical);
> >>                  if (test_bit(sector_nr, &stripe->csum_error_bitmap))
> >> -                       if (__ratelimit(&rs) && dev)
> >> -                               scrub_print_common_warning("checksum error", dev,
> >> +                       scrub_print_common_warning("checksum error", dev,
> >>                                                       logical, physical);
> >>                  if (test_bit(sector_nr, &stripe->meta_error_bitmap))
> >> -                       if (__ratelimit(&rs) && dev)
> >> -                               scrub_print_common_warning("header error", dev,
> >> +                       scrub_print_common_warning("header error", dev,
> >
> > Why are we removing the rate limiting?
> > This seems like an unrelated change.
>
> Because the ratelimit is not consistent.
>
> E.g. the "fixed up error" line is not rated limited by @rs, but by
> btrfs_err_rl_in_rcu().

That I don't understand.

What I see is that scrub_print_common_warning() calls
btrfs_warn_in_rcu() or btrfs_warn() or scrub_print_warning_inode()
(which calls btrfs_warn_in_rcu()).
I don't see where btrfs_err_rl_in_rcu() is called. So to me it seems
@rs is doing the rate limiting.

>
> And we would have scrub_print_common_warning() to go btrfs_*_rl() helper
> in the coming patches.

Ok, but if @rs is indeed doing the rate limiting here, then from an
organization point of view, it should be removed in the later patch
that makes scrub_print_common_warning() use the btrfs_*_rl() helpers.

Thanks.

>
> Thanks,
> Qu
> >
> > Everything else looks ok.
> >
> > Thanks.
> >
> >>                                                       logical, physical);
> >>          }
> >>
> >> --
> >> 2.44.0
> >>
> >>
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
  2024-03-14 20:56     ` Qu Wenruo
@ 2024-03-18 16:26       ` Filipe Manana
  2024-03-18 20:00         ` Qu Wenruo
  0 siblings, 1 reply; 31+ messages in thread
From: Filipe Manana @ 2024-03-18 16:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Mar 14, 2024 at 8:56 PM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> 在 2024/3/15 04:10, Filipe Manana 写道:
> > On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> Currently the scrub error report is pretty long:
> >>
> >>   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
> >>   BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)
> >>
> >> Since we have so many things to output, it's not a surprise we got long
> >> error lines.
> >>
> >> To make the lines a little shorter, and make important info more
> >> obvious, change the unify output to something like this:
> >>
> >>   BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
> >>   BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
> >
> > I find that hard to read because:
> >
> > 1) Please leave spaces before opening parenthesis.
> >     That makes things less cluttered and easier to the eyes, more
> > consistent with what we generally do and it's the formal way to do
> > things in English (see
> > https://www.scribens.com/grammar-rules/typography/spacing.html);
>
> Yep, I can do that, but I also want to keep the involved members
> together thus closer.
>
> Not sure if adding spaces would still keep the close relationships
> between the values.
>
> E.g: inode 5/256 (file1) fileoff 16384, logical 123456 (1) physical 1
> (scratch1) 123456
>
> It makes it a little harder to indicate that "(1)" is related to logical
> address (thus mirror number).
>
> >
> > 2) Instead of "inode 5/257(file1)", something a lot easier to
> > understand like "inode 5 root 257 path \"file1\"", which leaves no
> > margin for doubt about what each number is;
> >
> > 3) Same with "logical 13647872(1)" - what is the 1? That will be the
> > question anyone reading such a log message will likely have.
> >      Something like "logical 13647872 mirror 1" makes it clear;
> >
> > 4) Same with "physical 1(/dev/mapper/test-scratch1)13647872".
> >      Something like "physical 13647872 device ID 1 device path
> > \"/dev/mapper/test-scratch1\"", makes it clear what each number is and
> > easier to read.
>
> I totally understand the original output format is much easier to read
> on its own.
>
> However if we have hundreds lines of similar output, it's a different
> story, a small change in any of the value is much harder to catch, and
> the extra helpful prompt is in fact a burden other than a bless.
>
> (That's why things like spreadsheet is much easier to reader for
> multiple similarly structured data, and in that case it's much better to
> craft a script to convert the lines into a csv)
>
> Unfortunately we don't have the benefit (at least yet) to structure all
> these info into a structured output.
>
>
> But what I'm doing here is to reduce the prompts to minimal (at most 4
> prompts, "inode", "fileoff", "logical", "physical"), so less duplicated
> strings for multiple lines of similar error messages.
>
> The patchset is in the middle ground between fully detailed output (the
> old one, focusing on single line) and the fully script orient output (no
> prompt at all, just all numbers/strings, focusing on CSV like output).
>
>
> Furthermore, I would also argue that, for entry level end users, they
> can still catch the critical info (like file path and device path)
> without much difficult, and those info is enough for them to take action.
> (E.g. deleting the file, or replace the disk)
>
> Yes, they would get confused on the devid or rootid, but that's fine and
> everyone can learn something new.
>
> For experienced users who understand basics of btrfs, or us developers,
> the split in values are already arranged in a way similar sized values
> are never close together (aka, string/large/small value split).

Well, I'm a developer and I can tell you if I ever see such log
messages, I'll have to go to the source code to figure out what each
value is supposed to be.
I'll be able to memorize for some hours or even a few days, but after
that I'll forget again and have to look at the source code again.


>
> Thanks,
> Qu
>
> >
> > Thanks.
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors()
  2024-03-18 16:16       ` Filipe Manana
@ 2024-03-18 19:53         ` Qu Wenruo
  0 siblings, 0 replies; 31+ messages in thread
From: Qu Wenruo @ 2024-03-18 19:53 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs



在 2024/3/19 02:46, Filipe Manana 写道:
> On Thu, Mar 14, 2024 at 8:28 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> 在 2024/3/15 03:56, Filipe Manana 写道:
>>> On Thu, Mar 14, 2024 at 9:50 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> The @stripe passed into scrub_stripe_report_errors() either has
>>>> stripe->dev and stripe->physical properly populated (regular data
>>>> stripes) or stripe->flags would have SCRUB_STRIPE_FLAG_NO_REPORT
>>>> (RAID56 P/Q stripes).
>>>>
>>>> Thus there is no need to go with btrfs_map_block() to get the
>>>> dev/physical.
>>>>
>>>> Just add an extra ASSERT() to make sure we get stripe->dev populated
>>>> correctly.
>>>>
>>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>>> ---
>>>>    fs/btrfs/scrub.c | 59 ++++++------------------------------------------
>>>>    1 file changed, 7 insertions(+), 52 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>>> index 5e371ffdb37b..277583464371 100644
>>>> --- a/fs/btrfs/scrub.c
>>>> +++ b/fs/btrfs/scrub.c
>>>> @@ -860,10 +860,8 @@ static void scrub_stripe_submit_repair_read(struct scrub_stripe *stripe,
>>>>    static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>>>                                          struct scrub_stripe *stripe)
>>>>    {
>>>> -       static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
>>>> -                                     DEFAULT_RATELIMIT_BURST);
>>>>           struct btrfs_fs_info *fs_info = sctx->fs_info;
>>>> -       struct btrfs_device *dev = NULL;
>>>> +       struct btrfs_device *dev = stripe->dev;
>>>>           u64 stripe_physical = stripe->physical;
>>>>           int nr_data_sectors = 0;
>>>>           int nr_meta_sectors = 0;
>>>> @@ -874,35 +872,7 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>>>           if (test_bit(SCRUB_STRIPE_FLAG_NO_REPORT, &stripe->state))
>>>>                   return;
>>>>
>>>> -       /*
>>>> -        * Init needed infos for error reporting.
>>>> -        *
>>>> -        * Although our scrub_stripe infrastructure is mostly based on btrfs_submit_bio()
>>>> -        * thus no need for dev/physical, error reporting still needs dev and physical.
>>>> -        */
>>>> -       if (!bitmap_empty(&stripe->init_error_bitmap, stripe->nr_sectors)) {
>>>> -               u64 mapped_len = fs_info->sectorsize;
>>>> -               struct btrfs_io_context *bioc = NULL;
>>>> -               int stripe_index = stripe->mirror_num - 1;
>>>> -               int ret;
>>>> -
>>>> -               /* For scrub, our mirror_num should always start at 1. */
>>>> -               ASSERT(stripe->mirror_num >= 1);
>>>> -               ret = btrfs_map_block(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
>>>> -                                     stripe->logical, &mapped_len, &bioc,
>>>> -                                     NULL, NULL);
>>>> -               /*
>>>> -                * If we failed, dev will be NULL, and later detailed reports
>>>> -                * will just be skipped.
>>>> -                */
>>>> -               if (ret < 0)
>>>> -                       goto skip;
>>>> -               stripe_physical = bioc->stripes[stripe_index].physical;
>>>> -               dev = bioc->stripes[stripe_index].dev;
>>>> -               btrfs_put_bioc(bioc);
>>>> -       }
>>>> -
>>>> -skip:
>>>> +       ASSERT(dev);
>>>>           for_each_set_bit(sector_nr, &stripe->extent_sector_bitmap, stripe->nr_sectors) {
>>>>                   u64 logical = stripe->logical +
>>>>                                 (sector_nr << fs_info->sectorsize_bits);
>>>> @@ -933,42 +903,27 @@ static void scrub_stripe_report_errors(struct scrub_ctx *sctx,
>>>>                    * output the message of repaired message.
>>>>                    */
>>>>                   if (repaired) {
>>>> -                       if (dev) {
>>>> -                               btrfs_err_rl_in_rcu(fs_info,
>>>> +                       btrfs_err_rl_in_rcu(fs_info,
>>>>                           "fixed up error at logical %llu on dev %s physical %llu",
>>>>                                               logical, btrfs_dev_name(dev),
>>>>                                               physical);
>>>> -                       } else {
>>>> -                               btrfs_err_rl_in_rcu(fs_info,
>>>> -                       "fixed up error at logical %llu on mirror %u",
>>>> -                                           logical, stripe->mirror_num);
>>>> -                       }
>>>>                           continue;
>>>>                   }
>>>>
>>>>                   /* The remaining are all for unrepaired. */
>>>> -               if (dev) {
>>>> -                       btrfs_err_rl_in_rcu(fs_info,
>>>> +               btrfs_err_rl_in_rcu(fs_info,
>>>>           "unable to fixup (regular) error at logical %llu on dev %s physical %llu",
>>>>                                               logical, btrfs_dev_name(dev),
>>>>                                               physical);
>>>> -               } else {
>>>> -                       btrfs_err_rl_in_rcu(fs_info,
>>>> -       "unable to fixup (regular) error at logical %llu on mirror %u",
>>>> -                                           logical, stripe->mirror_num);
>>>> -               }
>>>>
>>>>                   if (test_bit(sector_nr, &stripe->io_error_bitmap))
>>>> -                       if (__ratelimit(&rs) && dev)
>>>> -                               scrub_print_common_warning("i/o error", dev,
>>>> +                       scrub_print_common_warning("i/o error", dev,
>>>>                                                        logical, physical);
>>>>                   if (test_bit(sector_nr, &stripe->csum_error_bitmap))
>>>> -                       if (__ratelimit(&rs) && dev)
>>>> -                               scrub_print_common_warning("checksum error", dev,
>>>> +                       scrub_print_common_warning("checksum error", dev,
>>>>                                                        logical, physical);
>>>>                   if (test_bit(sector_nr, &stripe->meta_error_bitmap))
>>>> -                       if (__ratelimit(&rs) && dev)
>>>> -                               scrub_print_common_warning("header error", dev,
>>>> +                       scrub_print_common_warning("header error", dev,
>>>
>>> Why are we removing the rate limiting?
>>> This seems like an unrelated change.
>>
>> Because the ratelimit is not consistent.
>>
>> E.g. the "fixed up error" line is not rated limited by @rs, but by
>> btrfs_err_rl_in_rcu().
>
> That I don't understand.
>
> What I see is that scrub_print_common_warning() calls
> btrfs_warn_in_rcu() or btrfs_warn() or scrub_print_warning_inode()
> (which calls btrfs_warn_in_rcu()).
> I don't see where btrfs_err_rl_in_rcu() is called. So to me it seems
> @rs is doing the rate limiting.
>
>>
>> And we would have scrub_print_common_warning() to go btrfs_*_rl() helper
>> in the coming patches.
>
> Ok, but if @rs is indeed doing the rate limiting here, then from an
> organization point of view, it should be removed in the later patch
> that makes scrub_print_common_warning() use the btrfs_*_rl() helpers.

OK, I'll rearrange the patches so that @rs would only be removed after
we have migrate all the messages are migrated to rate limited versions.

Thanks,
Qu
>
> Thanks.
>
>>
>> Thanks,
>> Qu
>>>
>>> Everything else looks ok.
>>>
>>> Thanks.
>>>
>>>>                                                        logical, physical);
>>>>           }
>>>>
>>>> --
>>>> 2.44.0
>>>>
>>>>
>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print()
  2024-03-14 20:26     ` Qu Wenruo
@ 2024-03-18 19:54       ` Filipe Manana
  0 siblings, 0 replies; 31+ messages in thread
From: Filipe Manana @ 2024-03-18 19:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, stable

On Thu, Mar 14, 2024 at 8:26 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/3/15 03:47, Filipe Manana 写道:
> > On Thu, Mar 14, 2024 at 9:54 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> Currently when we increase the device statistics, it would always lead
> >> to an error message in the kernel log.
> >>
> >> I would argue this behavior is not ideal:
> >>
> >> - It would flood the dmesg and bury real important messages
> >>    One common scenario is scrub.
> >>    If scrub hit some errors, it would cause both scrub and
> >>    btrfs_dev_stat_inc_and_print() to print error messages.
> >>
> >>    And in that case, btrfs_dev_stat_inc_and_print() is completely
> >>    useless.
> >>
> >> - The results of btrfs_dev_stat_inc_and_print() is mostly for history
> >>    monitoring, doesn't has enough details
> >>
> >>    If we trigger the errors during regular read, such messages from
> >>    btrfs_dev_stat_inc_and_print() won't help us to locate the cause
> >>    either.
> >>
> >> The real usage for the btrfs device statistics is for some user space
> >> daemon to check if there is any new errors, acting like some checks on
> >> SMART, thus we don't really need/want those messages in dmesg.
> >>
> >> This patch would reduce the log level to debug (disabled by default) for
> >> btrfs_dev_stat_inc_and_print().
> >> For users really want to utilize btrfs devices statistics, they should
> >> go check "btrfs device stats" periodically, and we should focus the
> >> kernel error messages to more important things.
> >
> > Not sure if this is the right thing to do.
> >
> > In the scrub context it can be annoying for sure.
> > Other cases I'm not so sure about, because having error messages in
> > dmesg/syslog may help notice issues more quickly.
>
> For non-scrub cases, I'd argue we already have enough output:
>
> No matter if the error is fixed or not, every time a mirror got csum
> mismatch or other errors, we already have error message output:
>
>   Data: btrfs_print_data_csum_error()
>   Meta: btrfs_validate_extent_buffer()
>
> For repaired ones, we have extra output from bio layer for both metadata
> and data:
>   btrfs_repair_io_failure()
>
> So I'd say the dev_stat ones are already duplicated.

Ok, I suppose it's fine then.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
> Thanks,
> Qu
> >
> >>
> >> CC: stable@vger.kernel.org
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>   fs/btrfs/volumes.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >> index e49935a54da0..126145950ed3 100644
> >> --- a/fs/btrfs/volumes.c
> >> +++ b/fs/btrfs/volumes.c
> >> @@ -7828,7 +7828,7 @@ void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index)
> >>
> >>          if (!dev->dev_stats_valid)
> >>                  return;
> >> -       btrfs_err_rl_in_rcu(dev->fs_info,
> >> +       btrfs_debug_rl_in_rcu(dev->fs_info,
> >>                  "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
> >>                             btrfs_dev_name(dev),
> >>                             btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
> >> --
> >> 2.44.0
> >>
> >>
> >

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/7] btrfs: scrub: unify and shorten the error message
  2024-03-18 16:26       ` Filipe Manana
@ 2024-03-18 20:00         ` Qu Wenruo
  0 siblings, 0 replies; 31+ messages in thread
From: Qu Wenruo @ 2024-03-18 20:00 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



在 2024/3/19 02:56, Filipe Manana 写道:
> On Thu, Mar 14, 2024 at 8:56 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> 在 2024/3/15 04:10, Filipe Manana 写道:
>>> On Thu, Mar 14, 2024 at 9:51 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>> Currently the scrub error report is pretty long:
>>>>
>>>>    BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
>>>>    BTRFS warning (device dm-2): checksum error at logical 13647872 on dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, offset 16384, length 4096, links 1 (path: file1)
>>>>
>>>> Since we have so many things to output, it's not a surprise we got long
>>>> error lines.
>>>>
>>>> To make the lines a little shorter, and make important info more
>>>> obvious, change the unify output to something like this:
>>>>
>>>>    BTRFS error (device dm-2): unable to fixup (regular) error at logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>>>>    BTRFS warning (device dm-2): checksum error at inode 5/257(file1) fileoff 16384, logical 13647872(1) physical 1(/dev/mapper/test-scratch1)13647872
>>>
>>> I find that hard to read because:
>>>
>>> 1) Please leave spaces before opening parenthesis.
>>>      That makes things less cluttered and easier to the eyes, more
>>> consistent with what we generally do and it's the formal way to do
>>> things in English (see
>>> https://www.scribens.com/grammar-rules/typography/spacing.html);
>>
>> Yep, I can do that, but I also want to keep the involved members
>> together thus closer.
>>
>> Not sure if adding spaces would still keep the close relationships
>> between the values.
>>
>> E.g: inode 5/256 (file1) fileoff 16384, logical 123456 (1) physical 1
>> (scratch1) 123456
>>
>> It makes it a little harder to indicate that "(1)" is related to logical
>> address (thus mirror number).
>>
>>>
>>> 2) Instead of "inode 5/257(file1)", something a lot easier to
>>> understand like "inode 5 root 257 path \"file1\"", which leaves no
>>> margin for doubt about what each number is;
>>>
>>> 3) Same with "logical 13647872(1)" - what is the 1? That will be the
>>> question anyone reading such a log message will likely have.
>>>       Something like "logical 13647872 mirror 1" makes it clear;
>>>
>>> 4) Same with "physical 1(/dev/mapper/test-scratch1)13647872".
>>>       Something like "physical 13647872 device ID 1 device path
>>> \"/dev/mapper/test-scratch1\"", makes it clear what each number is and
>>> easier to read.
>>
>> I totally understand the original output format is much easier to read
>> on its own.
>>
>> However if we have hundreds lines of similar output, it's a different
>> story, a small change in any of the value is much harder to catch, and
>> the extra helpful prompt is in fact a burden other than a bless.
>>
>> (That's why things like spreadsheet is much easier to reader for
>> multiple similarly structured data, and in that case it's much better to
>> craft a script to convert the lines into a csv)
>>
>> Unfortunately we don't have the benefit (at least yet) to structure all
>> these info into a structured output.
>>
>>
>> But what I'm doing here is to reduce the prompts to minimal (at most 4
>> prompts, "inode", "fileoff", "logical", "physical"), so less duplicated
>> strings for multiple lines of similar error messages.
>>
>> The patchset is in the middle ground between fully detailed output (the
>> old one, focusing on single line) and the fully script orient output (no
>> prompt at all, just all numbers/strings, focusing on CSV like output).
>>
>>
>> Furthermore, I would also argue that, for entry level end users, they
>> can still catch the critical info (like file path and device path)
>> without much difficult, and those info is enough for them to take action.
>> (E.g. deleting the file, or replace the disk)
>>
>> Yes, they would get confused on the devid or rootid, but that's fine and
>> everyone can learn something new.
>>
>> For experienced users who understand basics of btrfs, or us developers,
>> the split in values are already arranged in a way similar sized values
>> are never close together (aka, string/large/small value split).
>
> Well, I'm a developer and I can tell you if I ever see such log
> messages, I'll have to go to the source code to figure out what each
> value is supposed to be.
> I'll be able to memorize for some hours or even a few days, but after
> that I'll forget again and have to look at the source code again.

As the core problem is, if you just see one such error, and you need to
investigate, yes it needs more digging.

But in the real world, the support cases I hit are all have tons of
similar lines almost flooding the dmesg (well, in that case they don't
include any useful info either).

In that case, digging the format once is not a big deal at all, in fact
the main heavy lifting part is to determine how severe the corruption is.

And I'm not going full csv-like output, we still have several prompts to
provide some help.

Thanks,
Qu

>
>
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks.
>>>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address
  2024-03-14 20:15     ` Qu Wenruo
@ 2024-04-04 20:01       ` David Sterba
  0 siblings, 0 replies; 31+ messages in thread
From: David Sterba @ 2024-04-04 20:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Anand Jain, Qu Wenruo, linux-btrfs

On Fri, Mar 15, 2024 at 06:45:04AM +1030, Qu Wenruo wrote:
> 
> 
> 在 2024/3/14 22:54, Anand Jain 写道:
> > On 3/14/24 15:20, Qu Wenruo wrote:
> >> [BUG]
> >> Scrub is not reporting the correct logical/physical address, it can be
> >> verified by the following script:
> >>
> >>   # mkfs.btrfs -f $dev1
> >>   # mount $dev1 $mnt
> >>   # xfs_io -f -c "pwrite -S 0xaa 0 128k" $mnt/file1
> >>   # umount $mnt
> >>   # xfs_io -f -c "pwrite -S 0xff 13647872 4k" $dev1
> >>   # mount $dev1 $mnt
> >>   # btrfs scrub start -fB $mnt
> >>   # umount $mnt
> >>
> >> Note above 13647872 is the physical address for logical 13631488 + 4K.
> >>
> >> Scrub would report the following error:
> >>
> >>   BTRFS error (device dm-2): unable to fixup (regular) error at 
> >> logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
> >>   BTRFS warning (device dm-2): checksum error at logical 13631488 on 
> >> dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, 
> >> offset 0, length 4096, links 1 (path: file1)
> >>
> >> On the other hand, "btrfs check --check-data-csum" is reporting the
> >> correct logical/physical address:
> >>
> >>   Checking filesystem on /dev/test/scratch1
> >>   UUID: db2eb621-b09d-4f24-8199-da17dc7b3201
> >>   [5/7] checking csums against data
> >>   mirror 1 bytenr 13647872 csum 0x13fec125 expected csum 0x656bd64e
> >>   ERROR: errors found in csum tree
> >>
> >> [CAUSE]
> >> In the function scrub_stripe_report_errors(), we always use the
> >> stripe->logical and its physical address to print the error message, not
> >> taking the sector number into consideration at all.
> >>
> >> [FIX]
> >> Fix the error reporting function by calculating logical/physical with
> >> the sector number.
> >>
> >> Now the scrub report is correct:
> >>
> >>   BTRFS error (device dm-2): unable to fixup (regular) error at 
> >> logical 13647872 on dev /dev/mapper/test-scratch1 physical 13647872
> >>   BTRFS warning (device dm-2): checksum error at logical 13647872 on 
> >> dev /dev/mapper/test-scratch1, physical 13647872, root 5, inode 257, 
> >> offset 16384, length 4096, links 1 (path: file1)
> >>
> >> Fixes: 0096580713ff ("btrfs: scrub: introduce error reporting 
> >> functionality for scrub_stripe")
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> > 
> > 
> > To help accurate debug on stable kernel, we could also add
> > CC: stable@vger.kernel.org #6.4+
> 
> IIRC Fixes: tag is more accurate than just CC stable, as stable has the 
> correct bots catching the fixes line.

Fixes: + CC: is the most reliable way to get patches to stable, CC: is
next but just Fixes: is not guaranteed as it depends on manual
processing.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-04-04 20:08 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-14  9:50 [PATCH 0/6] btrfs: scrub: refine the error messages Qu Wenruo
2024-03-14  9:50 ` [PATCH 1/7] btrfs: scrub: fix incorrectly reported logical/physical address Qu Wenruo
2024-03-14 12:24   ` Anand Jain
2024-03-14 20:15     ` Qu Wenruo
2024-04-04 20:01       ` David Sterba
2024-03-14 17:10   ` Filipe Manana
2024-03-14  9:50 ` [PATCH 2/7] btrfs: reduce the log level for btrfs_dev_stat_inc_and_print() Qu Wenruo
2024-03-14 17:17   ` Filipe Manana
2024-03-14 20:26     ` Qu Wenruo
2024-03-18 19:54       ` Filipe Manana
2024-03-14  9:50 ` [PATCH 3/7] btrfs: scrub: remove unused is_super parameter from scrub_print_common_warning() Qu Wenruo
2024-03-14 17:19   ` Filipe Manana
2024-03-14  9:50 ` [PATCH 4/7] btrfs: scrub: remove unnecessary dev/physical lookup for scrub_stripe_report_errors() Qu Wenruo
2024-03-14 17:26   ` Filipe Manana
2024-03-14 20:28     ` Qu Wenruo
2024-03-18 16:16       ` Filipe Manana
2024-03-18 19:53         ` Qu Wenruo
2024-03-14  9:50 ` [PATCH 5/7] btrfs: scrub: simplify the inode iteration output Qu Wenruo
2024-03-14 17:29   ` Filipe Manana
2024-03-14  9:50 ` [PATCH 6/7] btrfs: scrub: unify and shorten the error message Qu Wenruo
2024-03-14 17:40   ` Filipe Manana
2024-03-14 20:56     ` Qu Wenruo
2024-03-18 16:26       ` Filipe Manana
2024-03-18 20:00         ` Qu Wenruo
2024-03-14 23:05   ` kernel test robot
2024-03-15 11:44   ` kernel test robot
2024-03-14  9:50 ` [PATCH 7/7] btrfs: scrub: fix the frequency of error messages Qu Wenruo
2024-03-14 17:51   ` Filipe Manana
2024-03-14 20:32     ` Qu Wenruo
2024-03-14 17:35 ` [PATCH 0/6] btrfs: scrub: refine the " Boris Burkov
2024-03-14 20:30   ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox