Re: File system corruption, btrfsck abort


From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Christophe de Dinechin <dinechin@redhat.com>,
	<linux-btrfs@vger.kernel.org>
Subject: Re: File system corruption, btrfsck abort
Date: Fri, 28 Apr 2017 08:45:37 +0800
Message-ID: <43eda9a8-e7be-4bc3-bd37-4df44f93f321@cn.fujitsu.com>
In-Reply-To: <D64BF81E-58B3-4B19-AD65-F5056768466C@redhat.com>



At 04/26/2017 01:50 AM, Christophe de Dinechin wrote:
> Hi,
> 
> 
> I’ve been trying to run btrfs as my primary work filesystem for about 3-4 months now on Fedora 25 systems. I have run into filesystem corruption a few times. At least one case I attributed to a damaged disk, but the last one is with a brand new 3T disk that reports no SMART errors. Worse yet, in at least three cases, the filesystem corruption caused btrfsck to crash.
> 
> The last filesystem corruption is documented here: https://bugzilla.redhat.com/show_bug.cgi?id=1444821. The dmesg log is in there.

According to the bugzilla, the btrfs-progs version seems to be too old by
btrfs standards.

What about using the latest btrfs-progs v4.10.2?

Furthermore, in v4.10.2, btrfs check provides a new mode called lowmem.
You could try "btrfs check --mode=lowmem" to see whether the problem can be
avoided.

For the kernel bug, it seems to be related to a wrongly inserted delayed
ref, but I could be totally wrong.

Thanks,
Qu
> 
> The btrfsck crash is here: https://bugzilla.redhat.com/show_bug.cgi?id=1435567. I have two crash modes: either an abort or a SIGSEGV. I checked that both still happen on master as of today.
> 
> The cause of the abort is that we call set_extent_dirty from check_extent_refs with rec->max_size == 0. I’ve instrumented to try to see where we set this to 0 (see https://github.com/c3d/btrfs-progs/tree/rhbz1435567), and indeed, we do sometimes see max_size set to 0 in a few locations. My instrumentation shows this:
> 
> 78655 [1.792241:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139eb80 max_size 16384 tmpl 0x7fffffffd120
> 78657 [1.792242:0x451cb8] MAX_SIZE_ZERO: Set max size 0 for rec 0x139ec50 from tmpl 0x7fffffffcf80
> 78660 [1.792244:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139ed50 max_size 16384 tmpl 0x7fffffffd120
> 
> I don’t really know what to make of it.
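One way to see why max_size == 0 can be fatal: if the range handed to set_extent_dirty is inclusive and its end is computed as start + max_size - 1 (an assumption here, the message does not show the call), a zero length yields an end that lies before the start. The sketch below only illustrates that arithmetic; inclusive_range_end is a hypothetical helper, not a btrfs-progs function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of btrfs-progs): compute the inclusive end
 * of the byte range [start, start + len - 1].
 * Returns false for an empty range (len == 0) or if the end would overflow,
 * so the caller can skip or report the record instead of aborting.
 */
static bool inclusive_range_end(uint64_t start, uint64_t len, uint64_t *end)
{
    if (len == 0)
        return false;               /* start + 0 - 1 would precede start */
    if (start > UINT64_MAX - (len - 1))
        return false;               /* end would wrap around */
    *end = start + len - 1;
    return true;
}
```

A caller would check the return value before marking the range dirty, and treat a false result as a corrupted extent record. Whether skipping such a record is the right repair strategy is exactly the open question above.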
> 
> The cause of the SIGSEGV is that we try to free a list entry that has its next set to NULL.
> 
> #0  list_del (entry=0x555555db0420) at /usr/src/debug/btrfs-progs-v4.10.1/kernel-lib/list.h:125
> #1  free_all_extent_backrefs (rec=0x555555db0350) at cmds-check.c:5386
> #2  maybe_free_extent_rec (extent_cache=0x7fffffffd990, rec=0x555555db0350) at cmds-check.c:5417
> #3  0x00005555555b308f in check_block (flags=<optimized out>, buf=0x55557b87cdf0, extent_cache=0x7fffffffd990, root=0x55555587d570) at cmds-check.c:5851
> #4  run_next_block (root=root@entry=0x55555587d570, bits=bits@entry=0x5555558841
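The crash pattern in the backtrace, unlinking a list entry whose next pointer is NULL, can be reproduced and guarded against in isolation. The following is a minimal sketch with a kernel-style list node; list_del_checked and the other function names are hypothetical and do not exist in btrfs-progs, and the sketch assumes the entry was either never linked or already unlinked, which the message does not confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal doubly linked list node, modelled on the kernel-style list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points to itself. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry just before head, i.e. at the tail of the list. */
static void list_add_tail_sketch(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/*
 * Hypothetical guarded unlink: an unguarded list_del would dereference
 * entry->next and fault when it is NULL. This variant reports the bad
 * entry to the caller instead.
 */
static bool list_del_checked(struct list_head *entry)
{
    if (entry->next == NULL || entry->prev == NULL)
        return false;               /* not linked, or already unlinked */
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;             /* mark as unlinked */
    entry->prev = NULL;
    return true;
}
```

A guard like this only hides the symptom. It would still be worth finding out why the backref entry reaches free_all_extent_backrefs in that state.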
> 
> I don’t know if the two problems are related, but they seem to be pretty consistent on this specific disk, so I think that we have a good opportunity to improve btrfsck to make it more robust to this specific form of corruption. But I don’t want to haphazardly modify code I don’t really understand. So I would appreciate a suggestion on what the right strategy should be when we have max_size == 0, or on how to avoid it in the first place.
> 
> I don’t know if this is relevant at all, but all the machines that failed that way were used to run VMs with KVM/QEMU. Disk activity tends to be somewhat intense on occasions, since the VMs running there are part of a personal Jenkins ring that automatically builds various projects. Nominally, there are between three and five guests running (Windows XP, Windows 10, macOS, Fedora 25, Ubuntu 16.04).
> 
> 
> Thanks
> Christophe de Dinechin
