[PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
@ 2016-03-10 11:10 Alex Lyakas
  2016-03-11 22:19 ` Nicholas D Steeves
  2016-03-21 18:51 ` Filipe Manana
  0 siblings, 2 replies; 6+ messages in thread
From: Alex Lyakas @ 2016-03-10 11:10 UTC (permalink / raw)
  To: linux-btrfs; +Cc: fdmanana, Alex Lyakas

csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
but it is better than to have a silent metadata corruption on disk.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
---
 fs/btrfs/disk-io.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4420ab2..cf85714 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -506,23 +506,34 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
 	u64 start = page_offset(page);
 	u64 found_start;
 	struct extent_buffer *eb;
 
 	eb = (struct extent_buffer *)page->private;
 	if (page != eb->pages[0])
 		return 0;
+
 	found_start = btrfs_header_bytenr(eb);
-	if (WARN_ON(found_start != start || !PageUptodate(page)))
-		return 0;
+	/*
+	 * Please do not consolidate these warnings into a single if.
+	 * It is useful to know what went wrong.
+	 */
+	if (WARN_ON(found_start != start))
+		return -EUCLEAN;
+	if (WARN_ON(!PageUptodate(page)))
+		return -EUCLEAN;
+
+	ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+			btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
+
 	return csum_tree_block(fs_info, eb, 0);
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 				 struct extent_buffer *eb)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	u8 fsid[BTRFS_UUID_SIZE];
 	int ret = 1;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
  2016-03-10 11:10 [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk Alex Lyakas
@ 2016-03-11 22:19 ` Nicholas D Steeves
  2016-03-13  9:51   ` Alex Lyakas
  2016-03-21 18:51 ` Filipe Manana
  1 sibling, 1 reply; 6+ messages in thread
From: Nicholas D Steeves @ 2016-03-11 22:19 UTC (permalink / raw)
  To: Btrfs BTRFS

On 10 March 2016 at 06:10, Alex Lyakas <alex.bolshoy@gmail.com> wrote:
> csum_dirty_buffer was issuing a warning in case the extent buffer
> did not look alright, but was still returning success.
> Let's return error in this case, and also add an additional sanity
> check on the extent buffer header.
> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
> but it is better than to have a silent metadata corruption on disk.

Does this mean there is a good chance that everyone has corrupted
metadata?  Is there any way to verify/rebuild it without
wipefs+mkfs+restore from backups?

Best regards,
Nicholas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
  2016-03-11 22:19 ` Nicholas D Steeves
@ 2016-03-13  9:51   ` Alex Lyakas
  2016-04-06  4:04     ` Nicholas D Steeves
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Lyakas @ 2016-03-13  9:51 UTC (permalink / raw)
  To: Nicholas D Steeves; +Cc: Btrfs BTRFS

Nicholas,

On Sat, Mar 12, 2016 at 12:19 AM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
> On 10 March 2016 at 06:10, Alex Lyakas <alex.bolshoy@gmail.com> wrote:
>> csum_dirty_buffer was issuing a warning in case the extent buffer
>> did not look alright, but was still returning success.
>> Let's return error in this case, and also add an additional sanity
>> check on the extent buffer header.
>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
>> but it is better than to have a silent metadata corruption on disk.
>
> Does this mean there is a good chance that everyone has corrupted
> metadata?
No, this definitely does not.

The code that I added prevents btrfs from writing a metadata block, if
it somehow got corrupted before being sent to disk. If it happens, it
indicates a bug somewhere in the kernel. For example, if some other
kernel module erroneously uses a page-cache entry, which does not
belong to it (and contains btrfs metadata block or part of it).

> Is there any way to verify/rebuild it without wipefs+mkfs+restore from backups?
To verify btrfs metadata: unmount the filesystem and run "btrfs check
...". Do not specify the "repair" parameter. Another way to verify is
to run "btrfs-debug-tree" and redirect its standard output to
/dev/null. It should not print anything to standard error. But "btrfs
check" is faster.

Thanks,
Alex.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
  2016-03-13  9:51   ` Alex Lyakas
@ 2016-04-06  4:04     ` Nicholas D Steeves
  2016-04-06  7:01       ` Duncan
  0 siblings, 1 reply; 6+ messages in thread
From: Nicholas D Steeves @ 2016-04-06  4:04 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi Alex,

On 13 March 2016 at 05:51, Alex Lyakas <alex@zadarastorage.com> wrote:
> Nicholas,
>
> On Sat, Mar 12, 2016 at 12:19 AM, Nicholas D Steeves <nsteeves@gmail.com> wrote:
>> On 10 March 2016 at 06:10, Alex Lyakas <alex.bolshoy@gmail.com> wrote:
>> Does this mean there is a good chance that everyone has corrupted
>> metadata?
> No, this definitely does not.
>
> The code that I added prevents btrfs from writing a metadata block, if
> it somehow got corrupted before being sent to disk. If it happens, it
> indicates a bug somewhere in the kernel. For example, if some other
> kernel module erroneously uses a page-cache entry, which does not
> belong to it (and contains btrfs metadata block or part of it).

Oh wow, I didn't know that was possible.  If I understand correctly,
this patch makes using bcache a little bit safer?  (I don't use it
since I'm too short on free time to what is--I suspect-- something
that radically increases the chances of having to restore from backup)

>> Is there any way to verify/rebuild it without wipefs+mkfs+restore from backups?
> To verify btrfs metadata: unmount the filesystem and run "btrfs check
> ...". Do not specify the "repair" parameter. Another way to verify is
> to run "btrfs-debug-tree" and redirect its standard output to
> /dev/null. It should not print anything to standard error. But "btrfs
> check" is faster.

Ah, that's exactly what I was looking for!  Thank you.  It took
forever, and brought me back to what it was like to fsck large ext2
volumes.  Is btrfs check conceptually identical to a read-only fsck of
a ext2 volume?  If now how does it defer?

Are the following sort of errors still an issue?:
Extent back ref already exists for 2148837945344 parent 0 root 257
leaf parent key incorrect 504993210368
bad block 504993210368
( https://btrfs.wiki.kernel.org/index.php/Btrfsck )

Cheers,
Nicholas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
  2016-04-06  4:04     ` Nicholas D Steeves
@ 2016-04-06  7:01       ` Duncan
  0 siblings, 0 replies; 6+ messages in thread
From: Duncan @ 2016-04-06  7:01 UTC (permalink / raw)
  To: linux-btrfs

Nicholas D Steeves posted on Wed, 06 Apr 2016 00:04:36 -0400 as excerpted:

> Ah, that's exactly what I was looking for!  Thank you.  It took forever,
> and brought me back to what it was like to fsck large ext2 volumes.  Is
> btrfs check conceptually identical to a read-only fsck of a ext2 volume?
> If now how does it defer?

That question had me confused for a moment as I couldn't figure out what 
was "deferred", until I realized you must have meant "differ". =:^)

At a suitably high level, yes, btrfs check (without --repair) is 
comparable to a read-only fsck of an ext* volume.  They're both quite 
deep and thorough checks of their respective filesystems.

But while ext2 had no journal and thus required such a deep fsck on 
reboot to recover after a bad shutdown, the later generations had a 
journal and avoided that, tho running an occasional fsck was still 
recommended.  What they run automatically (on boot) is a very brief check 
that just checks some basics and that the filesystem looks like it 
should, replaying the journal if shutdown wasn't safely done, but nothing 
very thorough at all.

And btrfs, being an atomic operation copy-on-write (cow) filesystem, is 
to a journaled filesystem like ext3/4, what a journaled filesystem like 
ext3/4 is to an earlier unjournaled filesystem like ext2.  Which is why 
btrfs check doesn't normally need to be run -- because in general, the 
atomic cow nature of btrfs means that a commit is done atomically, you 
either get the state of the previous commit, or of the new one.  There's 
no way to get a half-way state where only part of the data was written, 
which was the problem ext2 had, that the ext3 and later ext4 journal was 
designed to alleviate, but that an atomic cow filesystem such as btrfs 
does away with entirely.

(Tho btrfs does have a journal, its use is far more limited.  Normally, 
commits happen only every 30 seconds or so, and the btrfs journal simply 
replays any fsynced file changes between commits so they still hit the 
filesystem in the case of a crash before a later commit did it the normal 
way.  IOW, it's only a shortcut to faster fsyncs, which would otherwise 
force a full filesystem commit before returning, much as they did on 
ext3.  But even without that replay, a btrfs should be self-consistent to 
one of the two latest commits, with only the fsyncs done between commits 
lost if the log isn't replayed at all.)

Of course all that assumes no critical bugs.  There's a reason btrfs is 
still considered stabilizing, not fully stable and mature yet, as there 
are still bugs being found and fixed that prevent this ideal from being 
reality in more cases than we'd like, tho the filesystem is in general 
stable enough for many to use daily, as many including myself do, as long 
as they have backups just in case, and their world won't end if they 
actually have to use them.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
  2016-03-10 11:10 [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk Alex Lyakas
  2016-03-11 22:19 ` Nicholas D Steeves
@ 2016-03-21 18:51 ` Filipe Manana
  1 sibling, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2016-03-21 18:51 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs@vger.kernel.org, Alex Lyakas

On Thu, Mar 10, 2016 at 11:10 AM, Alex Lyakas <alex.bolshoy@gmail.com> wrote:
> csum_dirty_buffer was issuing a warning in case the extent buffer
> did not look alright, but was still returning success.
> Let's return error in this case, and also add an additional sanity
> check on the extent buffer header.
> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio will,
> but it is better than to have a silent metadata corruption on disk.
>
> Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>

> ---
>  fs/btrfs/disk-io.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 4420ab2..cf85714 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -506,23 +506,34 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
>
>  static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>  {
>         u64 start = page_offset(page);
>         u64 found_start;
>         struct extent_buffer *eb;
>
>         eb = (struct extent_buffer *)page->private;
>         if (page != eb->pages[0])
>                 return 0;
> +
>         found_start = btrfs_header_bytenr(eb);
> -       if (WARN_ON(found_start != start || !PageUptodate(page)))
> -               return 0;
> +       /*
> +        * Please do not consolidate these warnings into a single if.
> +        * It is useful to know what went wrong.
> +        */
> +       if (WARN_ON(found_start != start))
> +               return -EUCLEAN;
> +       if (WARN_ON(!PageUptodate(page)))
> +               return -EUCLEAN;
> +
> +       ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
> +                       btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
> +
>         return csum_tree_block(fs_info, eb, 0);
>  }
>
>  static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>                                  struct extent_buffer *eb)
>  {
>         struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>         u8 fsid[BTRFS_UUID_SIZE];
>         int ret = 1;
>
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-04-06  7:01 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-10 11:10 [PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk Alex Lyakas
2016-03-11 22:19 ` Nicholas D Steeves
2016-03-13  9:51   ` Alex Lyakas
2016-04-06  4:04     ` Nicholas D Steeves
2016-04-06  7:01       ` Duncan
2016-03-21 18:51 ` Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).