Linux Btrfs filesystem development
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Kai Krakow <hurikhan77@gmail.com>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>, Oliver Wien <ow@netactive.de>
Subject: Re: btrfs crashes during routine btrfs-balance-least-used
Date: Mon, 15 Jul 2024 15:20:31 +0930
Message-ID: <1728bb6e-9dd0-4a2c-be16-41cd01231484@gmx.com>
In-Reply-To: <CAMthOuMWvFWNEJkOWQiP2eNwWap8H4z7Gb6qzS8M-WkTUk2=Mw@mail.gmail.com>



On 2024/7/15 15:01, Kai Krakow wrote:
> On Mon, Jul 15, 2024 at 07:00, Qu Wenruo <wqu@suse.com> wrote:
[...]
>> The last line is the problem.
>>
>> Firstly, we shouldn't even have a root with such a high value.
>> Secondly, that root number 13835058055282163977 is a very weird value
>> in hex too (0xc000000000000109); the leading '0xc' means it's
>> definitely not a simple bitflip.
>
> Oh wow, I didn't even notice that. I skimmed through the logs to
> resolve the root numbers to subvolids, hoping for an idea where the
> "corrupted extents" are - but they were all over the place in
> seemingly unrelated subvols.
>
> This host runs several systemd-nspawn containers with various
> generations of PHP, MySQL containers (MySQL data itself is on a
> different partition because btrfs and mysql don't play well), a huge
> maildir storage, a huge web vhost storage, and some mail filter / mail
> services containers. Just to give you an idea of what kind of data and
> workload is used...

That shouldn't be a problem, as the only thing that can access btrfs
metadata is btrfs itself.

Although in theory, anything that can access the host's memory can also
access the kernel memory of the guest...

>
>> Furthermore, the objectid value is also very weird (0xffffa11315e3).
>> No wonder the extent tree is not going to handle it correctly.
>>
>> But I have no idea why this happens; it passes csum, thus I'm assuming
>> it's runtime corruption.
>
> We had some crashes in the past due to OOM, sometimes btrfs has been
> involved. This was largely solved by disabling huge pages, updating
> from kernel 6.1 to 6.6, and running with a bees patch that reduces the
> memory used for ref lookups:
> https://github.com/Zygo/bees/issues/260
>
>> [...]
>>>
>>>> The other thing is, does the server have ECC memory?
>>>> It's not uncommon to see bitflips causing various problems (almost
>>>> monthly reports).
>>>
>>> I don't know the exact hosting environment; we are inside a QEMU
>>> VM. But I'm pretty sure it is ECC.
>>
>> And considering it's a virtualization environment, you do not have
>> any out-of-tree modules?
>
> No, the system is running Gentoo, the kernel is manually configured
> and runs without module support. Everything is baked in.
>
>>> The disk images are hosted on DRBD, with two redundant remote block
>>> devices on NVMe RAID. Our VM runs on KVM/QEMU. We are not seeing DRBD
>>> from within the VM. Because the lower storage layer is redundant, we
>>> are not running a data raid profile in btrfs but we use multiple block
>>> devices because we are seeing better latency behavior that way.
>>>
>>>> If the machine doesn't have ECC memory, then a memtest would be preferable.
>>>
>>> I'll ask our data center operators about ECC but I'm pretty sure the
>>> answer will be: yes, it's ECC.
>>>
>>> We have been using their data centers for 20+ years and have never
>>> seen a bit flip or storage failure.
>>
>> Yeah, I do not think it's hardware corruption either now.
>
> Yes, what you found above looks really messed up - that's not a bitflip.
>
>>> I wonder if parallel use of snapper (hourly, with thinning after 24h),
>>> bees (we are seeing dedup rates of 2:1 - 3:1 for some datasets in the
>>> hosting services)
>>
>> Snapshotting happens at a very specific time (the end of a transaction
>> commit), thus it should not conflict with balance operations.
>>
>>> and btrfs-balance-least-used somehow triggered this.
>>> I remember some old reports where bees could trigger corruption during
>>> balance or scrub, and that it avoided this by pausing itself when it
>>> detected those operations. I don't know if this is still an issue
>>> (kernel 6.6.30 LTS).
>>
>> No recent bugs come to mind immediately, but even if we had one, this
>> corruption looks too special.
>> It still matches the item size and ref count, but in the middle the
>> data got corrupted with what seems to be garbage.
>
> I think Zygo has some notes of it in the bees github:
> https://github.com/Zygo/bees/blob/master/docs/btrfs-kernel.md

Wow, this is the first time I've seen such a well-maintained matrix of
various problems.

>
> I think it was about btrfs-send and dedupe running at the same time...
> Memory fades faster as one gets older... ;-)

Nope, this is a completely different one.

>
>> As the higher bits of a u64 are stored at higher addresses on x86_64
>> (little-endian), the corruption looks to cover the following bytes:
>>
>> 0                       8                       16
>> |        le64 root      |      le64 objectid    |
>> |09 01 00 00 00 00 00 c0|e3 15 13 a1 ff ff 00 00|
>>                         ====================
>> 16                      24          28
>> |        le64 offset    | le32 refs |
>> |00 09 da 04 00 00 00 00|01 00 00 00|
>>
>> So far the corruption looks like it starts at byte 7 and ends at byte 14.
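
For reference, the item dissected above is a data backref. Its on-disk
layout, sketched after struct btrfs_extent_data_ref in
include/uapi/linux/btrfs_tree.h (what the dump prints as "refs" is the
struct's "count" member):

struct btrfs_extent_data_ref {
	__le64 root;      /* id of the subvolume tree holding the reference */
	__le64 objectid;  /* inode number inside that tree */
	__le64 offset;    /* file offset the extent is mapped at */
	__le32 count;     /* number of refs from this (root, inode, offset) */
} __attribute__ ((__packed__));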
>>
>> In theory, if you have kept the image, we could spend enough time to
>> find out the correct values, but so far it really looks like some
>> garbage filled the range.
>>
>> For now, what I can do is add extra checks (making sure the root
>> number looks valid), but that won't catch all randomly corrupted data.
>
> This could be useful. Will this be in btrfs-check, or in the kernel?

Kernel, and it would catch the error at write time, so that such obvious
corruption would not even reach the disk.
Although it would still make the fs RO, it would not cause any corruption
on-disk.
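
Roughly the shape such a write-time check could take (only a sketch of
the idea, not the actual patch; the objectid constants are the real ones
from include/uapi/linux/btrfs_tree.h):

#include <stdbool.h>
#include <stdint.h>

#define BTRFS_FS_TREE_OBJECTID		5ULL	/* the default file tree */
#define BTRFS_FIRST_FREE_OBJECTID	256ULL	/* first possible subvolume id */

/*
 * A data backref must name a file tree: either the default tree (id 5)
 * or a subvolume/snapshot tree (id >= 256).  Comparing as signed also
 * rejects huge values like the 0xc000000000000109 in this report, which
 * is negative as an s64 and can never be a real subvolume id.
 */
static bool dref_root_plausible(uint64_t root)
{
	if (root == BTRFS_FS_TREE_OBJECTID)
		return true;
	return (int64_t)root >= (int64_t)BTRFS_FIRST_FREE_OBJECTID;
}

Such a check would run from the tree-checker's item validation at write
time, before the block is csummed and submitted, so a bad ref would flip
the fs read-only instead of hitting the disk.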

For btrfs-progs, it already detects such a mismatch, but by then it's
already too late, isn't it?

>
>> And as a final straw to grasp at: did this VM experience any migration?
>
> No. It has been sitting there for 4 or 5 years. But we are slowly
> approaching the capacity of the hardware. What happens then: our data
> center operator would shut the VM down, rewire the DRBD to another
> machine, and boot it again. No live migration, if that's what you meant.

OK, then definitely something weird happened inside the btrfs code.
But I'm out of clues...

>
> I'm not sure how reliable DRBD is, but I researched it a little while
> ago and it seems to prefer reliability over speed, so it looks very
> solid. I don't think anything broke there, and even then 8 bytes of
> garbage looks strange. Well, that's 64 bits of garbage - if that means
> anything.

Even if it really is some lower-level storage corruption, it has to pass
the btrfs metadata csum first.
You would need a super lucky random corruption that still matches the
csum (with the default crc32c, roughly a one-in-2^32 chance per block).

Then it would still need to pass the tree-checker, which doesn't sound
reasonable to me at all.
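
To put a number on "super lucky": every metadata block stores its csum
in the first 32 bytes, computed over the rest of the block. A minimal
sketch of that gate (illustrative only; a plain CRC32C, the kernel's
exact seeding conventions may differ):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTRFS_CSUM_SIZE 32	/* csum area at the start of each metadata block */

/* Bitwise CRC32C (Castagnoli polynomial), good enough for illustration. */
static uint32_t crc32c(const uint8_t *p, size_t len)
{
	uint32_t crc = ~0U;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78U & -(crc & 1));
	}
	return ~crc;
}

/* Random corruption elsewhere in the block survives with ~2^-32 odds. */
static bool metadata_csum_ok(const uint8_t *block, size_t nodesize)
{
	uint32_t stored;

	memcpy(&stored, block, sizeof(stored));
	return stored == crc32c(block + BTRFS_CSUM_SIZE,
				nodesize - BTRFS_CSUM_SIZE);
}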

>
>> As that's the only non-btrfs thing I can think of that would corrupt
>> data at runtime.
>
> I'm using btrfs in various workloads, and even with partially broken
> hardware (raid1 on spinning rust with broken sectors). And while it
> had its flaws like 10+ years ago, it has been rock solid for me during
> the last 5 years at least - even if I handled the hardware very
> impatiently (like hitting the reset button or cycling power). Of
> course, this system we're talking about has been handled seriously.
> ;-)

And we're also improving our handling of bad hardware (except when it
cheats on FLUSH); the biggest example is the tree-checker.

But I have really run out of ideas for such a huge corruption.
Your setup really rules out almost everything, from out-of-tree modules
(Nv*dia drivers) to a bad live-migration memory implementation.

I'll let you know when the new tree-checker patch comes out; hopefully
it can catch the root cause.

Thanks,
Qu
>
> Thanks,
> Kai


Thread overview: 11+ messages
2024-07-14 16:13 btrfs crashes during routine btrfs-balance-least-used Kai Krakow
2024-07-14 21:53 ` Qu Wenruo
2024-07-15  4:29   ` Kai Krakow
2024-07-15  5:00     ` Qu Wenruo
2024-07-15  5:31       ` Kai Krakow
2024-07-15  5:50         ` Qu Wenruo [this message]
2024-07-16  6:51           ` Kai Krakow
2024-07-16  9:09             ` Qu Wenruo
2024-07-16 13:25               ` Kai Krakow
2024-07-16 22:18                 ` Qu Wenruo
2024-07-17  8:09                   ` Kai Krakow
