From: Qu Wenruo <wqu@suse.com>
To: Kai Krakow <hurikhan77@gmail.com>, Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>, Oliver Wien <ow@netactive.de>
Subject: Re: btrfs crashes during routine btrfs-balance-least-used
Date: Mon, 15 Jul 2024 14:30:05 +0930 [thread overview]
Message-ID: <d30126ba-a9fc-44ea-bba8-1f42b242ca91@suse.com> (raw)
In-Reply-To: <CAMthOuNk273pZUSU1Npr-Zx7LscBxOsXeyWcQCgbYP=82TfreA@mail.gmail.com>
On 2024/7/15 13:59, Kai Krakow wrote:
> Hello Qu!
>
> Thanks for looking into this. We already started the restore, so we no
> longer have any access to the corrupted disk images.
>
> Am So., 14. Juli 2024 um 23:54 Uhr schrieb Qu Wenruo <quwenruo.btrfs@gmx.com>:
>>
>>
>>
>> On 2024/7/15 01:43, Kai Krakow wrote:
>>> Hello btrfs list!
>>>
>>> (also reported in irc)
>>>
>>> Our btrfs pool crashed during a routine btrfs-balance-least-used.
>>> Maybe of interest: bees is also running on this filesystem, snapper
>>> takes hourly snapshots with retention policy.
>>>
>>> I'm currently still collecting diagnostics, "btrfs check" log is
>>> already 3 GB and growing.
>>>
>>> The btrfs runs on three devices vd{c,e,f}1 with data=single meta=raid1.
>>>
>>> Here's an excerpt from dmesg (full log https://gist.tnonline.net/TE):
>>
>> Unfortunately the full log is not really full.
>>
>> There should be an extent leaf dump, and after that dump, the reason
>> why we believe it's a problem.
>>
>> Is there a truly full dmesg dump?
>
> Yes, sorry. The gist has been truncated - mea culpa. I repasted it:
> https://gist.tnonline.net/6Q
Thanks a lot!
That contains (almost) all the info we need.
The offending bytenr is 402811572224, and in the dump we indeed have the
item for it:
[1143913.108184] item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
[1143913.108185]         extent refs 3 gen 3678544 flags 1
[1143913.108186]         ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1
The last line is the problem.

Firstly, we shouldn't even have a root with such a high value.
Secondly, that root number 13835058055282163977 is a very weird value in
hex too (0xc000000000000109); the leading 0xc means it's definitely not a
simple bitflip.
Furthermore, the objectid is also very weird (0xffffa11315e3).

No wonder the extent tree fails to handle it correctly.
But I have no idea why this happened; the block passes its csum, thus I
assume it's runtime (in-memory) corruption.
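As a quick cross-check, the decimal values from the item dump convert to exactly the hex values discussed above (plain Python, nothing btrfs-specific):

```python
# Cross-check of the values quoted from the item dump.
root = 13835058055282163977      # dref root from the item dump
objectid = 281473384125923       # dref objectid from the item dump

print(hex(root))      # 0xc000000000000109
print(hex(objectid))  # 0xffffa11315e3

# In practice subvolume ids are allocated sequentially starting from 256,
# so a root value with the top bits set is clearly bogus.
```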
[...]
>
>> The other thing is: does the server have ECC memory?
>> It's not uncommon to see bitflips causing various problems (almost
>> monthly reports).
>
> I don't know the exact hosting environment, we are inside of a QEMU
> VM. But I'm pretty sure it is ECC.
And since it's a virtualization environment, I assume you do not have any
out-of-tree modules?
>
> The disk images are hosted on DRBD, with two redundant remote block
> devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
> from within the VM. Because the lower storage layer is redundant, we
> are not running a data raid profile in btrfs but we use multiple block
> devices because we are seeing better latency behavior that way.
>
>> If the machine doesn't have ECC memory, then a memtest would be preferable.
>
> I'll ask our data center operators about ECC but I'm pretty sure the
> answer will be: yes, it's ECC.
>
> We have been using their data centers for 20+ years and have never
> seen a bit flip or storage failure.
Yeah, at this point I do not think it's hardware corruption either.
>
> I wonder if parallel use of snapper (hourly, with thinning after 24h),
> bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in the
> hosting services)
Snapshot creation happens at a very specific point in time (the end of a
transaction commit), thus it should not interfere with balance operations.
> and btrfs-balance-least-used somehow triggered this.
> I remember some old reports where bees could trigger corruption in
> balance or scrub, and evading that by pausing if it detected it. I
> don't know if this is an issue any longer (kernel 6.6.30 LTS).
No recent bug comes to mind immediately, and even if there were one, this
corruption looks too unusual.
The item size and ref count still match, but the data in the middle got
overwritten with what looks like garbage.
Since the higher bits of a u64 are stored at higher addresses on x86_64
(little-endian), the corruption appears to cover the following bytes:
 0                       8                      16
| le64 root             | le64 objectid        |
|09 01 00 00 00 00 00 c0|e3 15 13 a1 ff ff 00 00|
                      =======================
16                      24          28
| le64 offset           | le32 refs |
|00 90 da 04 00 00 00 00|01 00 00 00|
So far the corruption looks like it starts at byte 7 and ends at byte 14.
In theory, if you had kept the image, we could spend enough time to work
out the correct values, but so far it really looks like garbage was
written across that range.
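The little-endian layout above can be decoded mechanically. The following sketch reconstructs the backref bytes directly from the decimal values in the item dump (treat those decimal values as authoritative if they disagree with any hand-typed byte listing) and unpacks them in the field order of struct btrfs_extent_data_ref:

```python
import struct

# The corrupted extent data backref, reconstructed byte-for-byte from the
# decimal values in the item dump (root 13835058055282163977,
# objectid 281473384125923, offset 81432576, count 1).
# On-disk field order follows struct btrfs_extent_data_ref:
#   le64 root, le64 objectid, le64 offset, le32 count
raw = bytes.fromhex(
    "09 01 00 00 00 00 00 c0"   # le64 root     -> 0xc000000000000109
    "e3 15 13 a1 ff ff 00 00"   # le64 objectid -> 0xffffa11315e3
    "00 90 da 04 00 00 00 00"   # le64 offset   -> 81432576
    "01 00 00 00"               # le32 count    -> 1
)
root, objectid, offset, count = struct.unpack("<QQQI", raw)
print(hex(root), hex(objectid), offset, count)
# 0xc000000000000109 0xffffa11315e3 81432576 1
```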
For now, what I can do is add extra checks (e.g. making sure the root
number looks valid), but that won't catch all randomly corrupted data.
And as a final guess: did this VM experience any (live) migration?
That's the only non-btrfs mechanism I can think of that could corrupt
data at runtime.
Thanks,
Qu
>
>
> Thanks,
> Kai
>
Thread overview: 11+ messages
2024-07-14 16:13 btrfs crashes during routine btrfs-balance-least-used Kai Krakow
2024-07-14 21:53 ` Qu Wenruo
2024-07-15 4:29 ` Kai Krakow
2024-07-15 5:00 ` Qu Wenruo [this message]
2024-07-15 5:31 ` Kai Krakow
2024-07-15 5:50 ` Qu Wenruo
2024-07-16 6:51 ` Kai Krakow
2024-07-16 9:09 ` Qu Wenruo
2024-07-16 13:25 ` Kai Krakow
2024-07-16 22:18 ` Qu Wenruo
2024-07-17 8:09 ` Kai Krakow