public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs@vger.kernel.org
Cc: Nikolay Borisov <nborisov@suse.com>
Subject: [PATCH URGENT v1.1 2/2] btrfs-progs: disk-io: Flush to ensure super block write is FUA
Date: Wed, 27 Mar 2019 17:46:52 +0800	[thread overview]
Message-ID: <20190327094652.16078-3-wqu@suse.com> (raw)
In-Reply-To: <20190327094652.16078-1-wqu@suse.com>

[BUG]
There are tons of reports of btrfs-progs screwing up the fs, the most
recent one is "btrfs check --clear-space-cache v1" triggered BUG_ON()
and then leaving the fs with transid mismatch problem.

[CAUSE]
In kernel, we have block layer handing the flush work, even on devices
without FUA support (like most SATA device using default libata
settings), kernel handles FUA write by flushing the device, then normal
write, and finish it with another flush.

The pre-flush, write, post-flush works pretty well to implement FUA
write.

However in btrfs-progs we just use pwrite(), there is nothing keeping
the write order.

So even for basic v1 free space cache clearing, we have different vision
on the write sequence from kernel bio layer (by dm-log-writes) and user
space pwrite() calls.

In btrfs-progs, with extra debug output in write_tree_block() and
write_dev_supers(), we can see btrfs-progs follows the right write
sequence:

  Opening filesystem to check...
  Checking filesystem on /dev/mapper/log
  UUID: 3feb3c8b-4eb3-42f3-8e9c-0af22dd58ecf
  write tree block start=1708130304 gen=39
  write tree block start=1708146688 gen=39
  write tree block start=1708163072 gen=39
  write super devid=1 gen=39
  write tree block start=1708179456 gen=40
  write tree block start=1708195840 gen=40
  write super devid=1 gen=40
  write tree block start=1708130304 gen=41
  write tree block start=1708146688 gen=41
  write tree block start=1708228608 gen=41
  write super devid=1 gen=41
  write tree block start=1708163072 gen=42
  write tree block start=1708179456 gen=42
  write super devid=1 gen=42
  write tree block start=1708130304 gen=43
  write tree block start=1708146688 gen=43
  write super devid=1 gen=43
  Free space cache cleared

But from dm-log-writes, the bio sequence is a different story:

  replaying 1742: sector 131072, size 4096, flags 0(NONE)
  replaying 1743: sector 128, size 4096, flags 0(NONE) <<< Only one sb write
  replaying 1744: sector 2828480, size 4096, flags 0(NONE)
  replaying 1745: sector 2828488, size 4096, flags 0(NONE)
  replaying 1746: sector 2828496, size 4096, flags 0(NONE)
  replaying 1787: sector 2304120, size 4096, flags 0(NONE)
  ......
  replaying 1790: sector 2304144, size 4096, flags 0(NONE)
  replaying 1791: sector 2304152, size 4096, flags 0(NONE)
  replaying 1792: sector 0, size 0, flags 8(MARK)

During the free space cache clearing, we committed 3 transaction but
dm-log-write only caught one super block write.

This means all the 3 writes were merged into the last super block write.
And the super block write was the 2nd write, before all tree block
writes, completely screwing up the metadata CoW protection.

No wonder crashed btrfs-progs can make things worse.

[FIX]
Fix this super serious problem by implementing pre and post flush for
the primary super block in btrfs-progs.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
---
 disk-io.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/disk-io.c b/disk-io.c
index f7fb7026cd94..fd56436d70ef 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1585,6 +1585,17 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 	u32 crc;
 	int i, ret;
 
+	/*
+	 * We need to write super block after all metadata written.
+	 * This is the equivalent of kernel pre-flush for FUA.
+	 */
+	ret = fsync(device->fd);
+	if (ret < 0) {
+		error(
+		"failed to write super block for devid %llu: flush error: %m",
+			device->devid);
+		return -errno;
+	}
 	if (fs_info->super_bytenr != BTRFS_SUPER_INFO_OFFSET) {
 		btrfs_set_super_bytenr(sb, fs_info->super_bytenr);
 		crc = ~(u32)0;
@@ -1606,6 +1617,13 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 				device->devid);
 			return -EIO;
 		}
+		ret = fsync(device->fd);
+		if (ret < 0) {
+			error(
+		"failed to write super block for devid %llu: flush error: %m",
+				device->devid);
+			return -errno;
+		}
 		return 0;
 	}
 
@@ -1634,6 +1652,19 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
 				device->devid);
 			return -errno;
 		}
+		/*
+		 * Flush after the primary sb write, this is the equivalent of
+		 * kernel post-flush for FUA write.
+		 */
+		if (i == 0) {
+			ret = fsync(device->fd);
+			if (ret < 0) {
+				error(
+		"failed to write super block for devid %llu: flush error: %m",
+					device->devid);
+				return -errno;
+			}
+		}
 	}
 
 	return 0;
-- 
2.21.0


  parent reply	other threads:[~2019-03-27  9:47 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-27  9:46 [PATCH URGENT v1.1 0/2] btrfs-progs: Fix the nobarrier behavior of write Qu Wenruo
2019-03-27  9:46 ` [PATCH URGENT v1.1 1/2] btrfs-progs: disk-io: Make super block write error easier to read Qu Wenruo
2019-03-27 11:34   ` Nikolay Borisov
2019-03-27  9:46 ` Qu Wenruo [this message]
2019-03-27 14:07 ` [PATCH URGENT v1.1 0/2] btrfs-progs: Fix the nobarrier behavior of write Adam Borowski
2019-03-27 14:17   ` Hugo Mills
2019-03-27 14:39   ` Qu Wenruo
2019-03-27 14:42     ` Qu Wenruo
2019-03-27 14:48   ` Qu Wenruo
2019-03-31 14:42     ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190327094652.16078-3-wqu@suse.com \
    --to=wqu@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=nborisov@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox