public inbox for linux-btrfs@vger.kernel.org
From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Sam Edwards <cfsworks@gmail.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>, linux-btrfs@vger.kernel.org
Subject: Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
Date: Sat, 11 Sep 2021 12:56:34 -0400
Message-ID: <20210911165634.GK29026@hungrycats.org>
In-Reply-To: <CAH5Ym4jNgs1iJufbmCDOS6N=k+YH4nZTSQ7j-MrM3mp8M0Yn2g@mail.gmail.com>

On Fri, Sep 10, 2021 at 11:07:04PM -0600, Sam Edwards wrote:
> Pretty much everything you've said scans with what I'm seeing over here.
>
> Sadly I can't easily replace, remove, or augment the drive. It's
> inside a laptop that doesn't look like it was designed for user
> serviceability. I don't even have any assurances that it isn't
> soldered to the board.
>
> I included the SMART information when I posted my Gist. Take whatever
> info you need.

For the list, that data is:

	Model Number:                       NVMe CA6-8D1024
	Serial Number:                      <redacted>
	Firmware Version:                   ERA0901
	PCI Vendor ID:                      0x1e95
	PCI Vendor Subsystem ID:            0x1dbe
	Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]

It's not one I've seen previously reported, but there's a huge variety
of SSD firmware in the field.
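
For anyone checking their own drive, something like this should pull the
same fields (a sketch; the nvme0 device name is an assumption):

	# model, serial, firmware via smartmontools
	smartctl -i /dev/nvme0
	# or the raw identify-controller fields via nvme-cli
	nvme id-ctrl /dev/nvme0 | grep -iE '^(mn|fr|sn)'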

> Best,
> Sam
>
>
> On Fri, Sep 10, 2021 at 10:24 PM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Fri, Sep 10, 2021 at 01:44:51AM -0600, Sam Edwards wrote:
> > > Hi Qu,
> > >
> > > Thank you very much for your insight. It is also helpful to know that
> > > I appear to be understanding the btrfs internals properly. :)
> > >
> > > On Fri, Sep 10, 2021 at 12:17 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > > > But for SSD/HDD, we're more cautious about any transid error.
> > > >
> > > > There are already known bad disks which don't honor FLUSH requests,
> > > > thus leading to super easy transid mismatches on power loss:
> > > >
> > > > https://btrfs.wiki.kernel.org/index.php/Hardware_bugs
> > >
> > > I strongly believe now that the SSD hardware/firmware is not the
> > > issue. Not only was the last "power loss"-type event on the 5th (which
> > > appears to have caused no damage), but the issue appeared while the
> > > system was running -- an SSD firmware problem of that magnitude should
> > > have been apparent long before I upgraded to 5.14.1.
> >
> > Have you reported the model and firmware revision of the SSD?
> > This is critical information for identifying firmware problems.
> >
> > Write cache firmware failures are not only triggered by power losses.
> > They can be triggered by underlying media failures and SSD embedded DRAM
> > failures.  This kind of firmware bug would show zero symptoms before
> > (and sometimes during) the triggering failure event.  When it happens,
> > there may be no indication of failure other than a whole lot of "parent
> > transid verify failed" and related errors that appear out of nowhere
> > during otherwise normal operation.  Such failures are often accompanied
> > by bus timeouts and resets, but it can also happen with no external
> > indication if there is no write in progress at the time of failure.
> >
> > There are roughly 40 distinct block addresses affected in your check log,
> > clustered in two separate 256 MiB blocks.  That looks fairly normal--the
> > btrfs allocator packs related updates close together to take advantage
> > of sequential write performance, so all of a dropped transaction's
> > writes would have nearby addresses.  The variation in transid is also
> > normal--the new tree pages refer to old tree pages of various ages and
> > replace old tree pages of various ages themselves.
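> >
> > (If you want to see that clustering directly, a rough sketch over the
> > saved check output -- "check.log" is a stand-in filename -- is to bucket
> > every bytenr-sized number into 256 MiB slices and count hits per slice:
> >
> > 	grep -oE '[0-9]{10,}' check.log \
> > 		| awk '{ print int($1 / (256 * 1024 * 1024)) }' \
> > 		| sort -n | uniq -c
> >
> > Nearly all of the hits should land in two buckets.)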
> >
> > The parent transid verify messages appear in identical pairs, which is
> > consistent with a dropped cache write on a single drive with dup metadata.
> > Btrfs will try to read both mirror copies if the first one is wrong.
> > Both writes occur at the same time, so both writes can be in the drive's
> > write cache when the contents of the drive's DRAM are lost.
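> >
> > (A quick way to confirm the metadata profile on a mounted filesystem:
> >
> > 	btrfs filesystem df /
> >
> > The Metadata line shows DUP, single, etc.  mkfs defaults have varied by
> > btrfs-progs version and device type, so it's worth checking rather than
> > assuming.)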
> >
> > The other errors are consistent with consequences of these failures.
> > Leaf pages of the metadata trees are no longer reachable because a path
> > to these leaf nodes, or the nodes themselves, do not exist on disk.
> > Items within the missing leaf pages (like csums, inodes, dirents, and
> > backrefs) cannot be found when references to these items from other
> > parts of the filesystem are checked, so you get a lot of missing
> > directory entries, backrefs, and csums.
> >
> > Other failure modes can generate this pattern (e.g. host RAM corruption
> > and kernel bugs) but other failure modes often leave other indications
> > that are not present in these logs.  There's no information here
> > that points to some failure mode other than a write cache failure.
> > btrfs is designed to reliably detect write cache failures, so based on
> > the information so far I wouldn't suspect anything else.
> >
> > > > > So, question: does the btrfs module use 256 MiB as a default size for
> > > > > write-back cache pages anywhere?
> > > >
> > > > No.
> > > >
> > > > > Or might this problem reside deeper
> > > > > in the block storage system?
> > > >
> > > > So if possible, please test without dm-crypt to make sure it's really
> > > > btrfs causing the problem.
> > >
> > > Since you don't find the 256 MiB size evocative of any component of
> > > btrfs, I am also willing to say that this likely isn't a fault in
> > > btrfs. So, I think this is now an exercise in ruling out btrfs
> > > completely (before I move on to investigating devicemapper, the crypto
> > > target, the NVMe driver, and the block storage system as a whole) and
> > > further studying the filesystem to gain insight as to how this
> > > happened.
> > >
> > > > > Also, for what it's worth: the last generation to be written before
> > > > > these inconsistencies seems to be 66303. Files written as part of that
> > > > > transaction have a birth date of Sep. 7. That's during the same boot
> > > > > as the failure to read-only,
> > > >
> > > > That's interesting.
> > > >
> > > > Btrfs aborts the transaction when it hits critical metadata problems to
> > > > prevent further damage, but I still remember some possible corruption
> > > > caused by aborted transactions in the old days.
> > > > But those corruptions should not exist in v5.14.
> > >
> > > I understand "aborts" to mean, "if btrfs is in the middle of a write
> > > transaction, and in the process discovers existing metadata corruption
> > > (e.g. tree node generation mismatch) already on-disk, it will undo any
> > > earlier modifications already written to disk as part of that
> > > transaction" -- is this correct? If so, I don't believe there was any
> > > existing metadata corruption on the disk beforehand, so I doubt the
> > > abort mechanism is at fault here.
> > >
> > > > Furthermore, the read-only flip itself is more important; do you have
> > > > the dmesg of that incident?
> > >
> > > Sadly, I wasn't able to save the dmesg (for obvious reasons), but I do
> > > recall logs complaining of checksum failures and transid mismatches. I
> > > did make note of the block bytenrs, which were in the interval
> > > [1065332047872, 1065391243264] which itself appears to be part of the
> > > same 256 MiB "slice" as I found before.
> > >
> > > > Recently we have also seen corrupted free space caches causing problems.
> > >
> > > I infer "problems" to mean, "free space cache indicates block X is
> > > free though it is not, so btrfs allocates a new tree node or extent at
> > > that location, overwriting necessary data"?
> > >
> > > If so, I don't think this is the case here either. If an older node is
> > > overwritten with a newer one, I would expect the transid mismatches to
> > > be that the found IDs are *greater* than the wanted IDs. I am finding
> > > valid, but *older*, tree nodes in the affected region. This seems like
> > > a "lost writes" kind of problem to me, not overwrites.
> > >
> > > However I did note some free space cache errors in my timeline.txt
> > > file over on the Github Gist. If you believe that this is still a
> > > likelihood, I can investigate further.
> > >
> > > > If you're going to re-test this, would you mind using space cache v2 for
> > > > the tests?
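> > > >
> > > > (One way to switch an existing filesystem, assuming a reasonably
> > > > recent btrfs-progs -- unmount first, then:
> > > >
> > > > 	btrfs check --clear-space-cache v1 /dev/mapper/yourdev
> > > > 	mount -o space_cache=v2 /dev/mapper/yourdev /mnt
> > > >
> > > > "yourdev" is a placeholder; the v2 free space tree persists once
> > > > created.)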
> > > >
> > > > > which suggests that the cached buffer
> > > > > wasn't merely "not saved before a previous shutdown" but rather
> > > > > "discarded without being written back." Thoughts?
> > > >
> > > > This should not be a thing for btrfs.
> > > >
> > > > Btrfs COW protects its metadata; as long as there is no problem in
> > > > btrfs itself allocating new tree blocks to existing blocks in the
> > > > same transaction, we're completely safe.
> > >
> > > Again, making sure I understand you correctly: Btrfs, by design,
> > > *never* modifies an active tree node/leaf. It will instead copy the
> > > node/leaf it wishes to modify into memory, perform the modifications
> > > there, save the modified copy at a free location on disk, and then
> > > recursively modify parents (using the same process) all the way back
> > > to the superblock(s).
> > >
> > > And so, if power is lost in the middle of this process (i.e. before
> > > the superblock is updated), then the old paths down the trees are
> > > intact. The transaction is lost, but existing metadata is protected.
> > > Correct?
> >
> > If btrfs stops writing to the filesystem immediately, either because a
> > power failure stopped the kernel running, or a metadata integrity failure
> > error was detected, then existing metadata is protected.
> >
> > If btrfs does _not_ stop writing immediately (e.g. the disk does not
> > notify btrfs of write failure) and commits a transaction after writes are
> > dropped, then metadata will be destroyed by later transactions because
> > they will assume the previously written data is present on disk when it
> > is not.
> >
> > > > But considering you're reporting some RO incidents, then I guess the COW
> > > > mechanism may be broken at that time already.
> > > > (Maybe abort leads to some corrupted free space cache?)
> > >
> > > Hypothetically, suppose one were to try this experiment:
> > > 1. Set up a brand-new machine, which has 2 independent but equally-sized disks.
> > > 2. On disk A, create a btrfs and mount it. Use it as a normal rootfs
> > > for a while, and then unmount it.
> > > 3. Image disk A to disk B (perhaps obfuscating the image so that it
> > > doesn't confuse btrfs's RAID logic).
> > > 4. Remount disk A and begin using it heavily (reads and writes). At a
> > > random moment, instantly copy a 256 MiB slice from disk B onto disk A,
> > > choosing an offset that has recently been written on disk A, while
> > > disk A is still mounted.
> > >
> > > Would this cause csum failures, transid mismatches, a RO incident, and
> > > the general output from "btrfs check" that I posted?
> >
> > It would, but in far larger quantities (thousands of blocks instead of
> > just a few dozen).
> >
> > If you have two disks you could run btrfs raid1 on them.  Failures on
> > one disk will be corrected by reading from the other.  Pairing disks by
> > different vendors in large numbers is an excellent way to quickly learn
> > which models are bad.
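> >
> > For example (device names are placeholders):
> >
> > 	mkfs.btrfs -m raid1 -d raid1 /dev/sdX /dev/sdY
> >
> > A periodic scrub will then repair any bad copies from the good mirror:
> >
> > 	btrfs scrub start /mnt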
> >
> > > If so, I think the problem is that a write-back cache, somewhere
> > > deeper in the block storage stack (i.e. not btrfs) simply...
> > > *discarded* a contiguous 256 MiB worth of data that btrfs thought had
> > > been successfully written, and the RO incident was due to btrfs
> > > suddenly seeing old data resurface. Does this seem like the best
> > > working hypothesis currently?
> >
> > No, we would expect _much_ more damage than your check run reported.
> >
> > > Also: Per your expert opinion, what would be the best way to repair
> > > disk A (without the benefit of disk B, heh) in the experiment above? I
> > > imagine the affected slice itself is not salvageable, and will have to
> > > be forcibly marked as free space, in the process throwing out any
> > > references to tree nodes that happened to reside there, meaning I'll
> > > lose a whole bunch of extents, inodes, csum tree entries, and so on...
> > > and in turn I'd lose a random assortment of files. I'm actually okay
> > > with that, because I can diff the surviving files against my last
> > > backup and restore whatever was deleted.
> >
> > You'd have to reinsert all the surviving (and up to date) leaf nodes
> > into new trees, then remove any dangling references, then verify all
> > the surviving data.  That's more total IO than mkfs + restore, and
> > requires significant memory and CPU as well.
> >
> > > Or is the whole thing FUBAR at this point and I should just zero the
> > > partition and restore from backups?
> >
> > If you have backups, it's usually faster to restore them.
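> >
> > (If anything newer than the backup is worth salvaging first, "btrfs
> > restore" can copy files out of the damaged filesystem without writing
> > to it; a sketch, with placeholder paths:
> >
> > 	btrfs restore -D /dev/loop0 /mnt/salvage    # dry run: list only
> > 	btrfs restore /dev/loop0 /mnt/salvage
> > )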
> >
> > I'd put that drive on probation for a while--maybe use it only in raid1
> > setups, or turn off its write cache, or run a stress test for a while
> > to see if it fails again.
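> >
> > Turning off the volatile write cache on NVMe can be done with nvme-cli
> > (feature 0x6 is the volatile write cache; the setting usually reverts
> > at power cycle, so script it at boot):
> >
> > 	nvme set-feature /dev/nvme0 -f 6 -v 0
> >
> > The SATA equivalent would be "hdparm -W 0 /dev/sdX".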
> >
> > Certainly if you suspect the 5.14 kernel then it would be worthwhile to
> > retest the drive with an older kernel, or test different drives with a
> > newer kernel.  If you can reproduce the issue with multiple drives that
> > is always an interesting bug report.
> >
> > We want to know your drive model and firmware revision.  If everyone who
> > owns one of these drives reports the same problem, we know the problem
> > is in the drive firmware.  If we see the problem across multiple drives
> > then we know it's somewhere else.
> >
> > > Thank you for your time,
> > > Sam
> > >
> > > > Thanks,
> > > > Qu
> > > >
> > > > >
> > > > > Cheers,
> > > > > Sam
> > > > >
> > > > >
> > > > > On Wed, Sep 8, 2021 at 6:47 PM Sam Edwards <cfsworks@gmail.com> wrote:
> > > > >>
> > > > >> Hello list,
> > > > >>
> > > > >> First, I should say that there's no urgency here on my part.
> > > > >> Everything important is very well backed up, and even the
> > > > >> "unimportant" files (various configs) seem readable. I imaged the
> > > > >> partition without even attempting a repair. Normally, my inclination
> > > > >> would be to shrug this off and recreate the filesystem.
> > > > >>
> > > > >> However, I'd like to help investigate the root cause, because:
> > > > >> 1. This happened suspiciously soon (see my timeline in the link below)
> > > > >> after upgrading to kernel 5.14.1, so this may be a serious regression.
> > > > >> 2. The filesystem was created less than 5 weeks ago, so the possible
> > > > >> causes are relatively few.
> > > > >> 3. My last successful btrfs scrub was just before upgrading to 5.14.1,
> > > > >> hopefully narrowing possible root causes even more.
> > > > >> 4. I have imaged the partition and am thus willing to attempt risky
> > > > >> experimental repairs. (Mostly for the sake of reporting if they work.)
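> > > > >>
> > > > >> (For experiments, the image can be attached as a loop device
> > > > >> without touching the original partition -- the path is a stand-in:
> > > > >>
> > > > >> 	losetup --find --show /path/to/partition.img
> > > > >> 	btrfs check /dev/loopN    # whichever device losetup printed
> > > > >> )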
> > > > >>
> > > > >> Disk setup: NVMe SSD, GPT partition, dm-crypt, btrfs as root fs (no LVM)
> > > > >> OS: Gentoo
> > > > >> Earliest kernel ever used: 5.10.52-gentoo
> > > > >> First kernel version used for "real" usage: 5.13.8
> > > > >> Relevant information: See my Gist,
> > > > >> https://gist.github.com/CFSworks/650280371fc266b2712d02aa2f4c24e8
> > > > >> Misc. notes: I have run "fstrim /" on occasion, but don't have
> > > > >> discards enabled automatically. I doubt TRIM is the culprit, but I
> > > > >> can't rule it out.
> > > > >>
> > > > >> My primary hypothesis is that there's some write bug in Linux 5.14.1.
> > > > >> I installed some package updates right before btrfs detected the
> > > > >> problem, and most of the files in the `btrfs check` output look like
> > > > >> they were created as part of those updates.
> > > > >>
> > > > >> My secondary hypothesis is that creating and/or using the swapfile
> > > > >> caused some kind of silent corruption that didn't become a detectable
> > > > >> issue until several writes later.
> > > > >>
> > > > >> Let me know if there's anything else I should try/provide!
> > > > >>
> > > > >> Regards,
> > > > >> Sam

Thread overview: 17+ messages
2021-09-09  0:47 Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old Sam Edwards
2021-09-09  7:12 ` Nikolay Borisov
2021-09-09  7:30   ` Sam Edwards
2021-09-10  5:39 ` Sam Edwards
2021-09-10  6:16   ` Qu Wenruo
2021-09-10  7:44     ` Sam Edwards
2021-09-10  8:31       ` Qu Wenruo
2021-09-10 22:34         ` Sam Edwards
2021-09-11  1:05           ` Qu Wenruo
2021-09-11  4:24       ` Zygo Blaxell
2021-09-11  5:07         ` Sam Edwards
2021-09-11  7:38           ` Forza
2021-09-11 16:56           ` Zygo Blaxell [this message]
2021-09-12  6:12             ` Sam Edwards
2021-09-12 23:07               ` Zygo Blaxell
2021-09-14 23:58                 ` Sam Edwards
2021-09-15 18:10                   ` Zygo Blaxell
