Date: Sun, 12 Sep 2021 19:07:19 -0400
From: Zygo Blaxell
To: Sam Edwards
Cc: Qu Wenruo, linux-btrfs@vger.kernel.org
Subject: Re: Corruption suspiciously soon after upgrade to 5.14.1; filesystem less than 5 weeks old
Message-ID: <20210912230719.GL29026@hungrycats.org>
References: <20210911042414.GJ29026@hungrycats.org> <20210911165634.GK29026@hungrycats.org>

On Sun, Sep 12, 2021 at 12:12:13AM -0600, Sam Edwards wrote:
> On Sat, Sep 11, 2021 at 10:56 AM Zygo Blaxell wrote:
> > It's not one I've seen previously reported, but there's a huge variety
> > of SSD firmware in the field.
>
> It seems to be a very newly released SSD. It's possible that the
> reason nobody else has reported issues with it yet is that nobody else
> who owns one of these has yet met the conditions for this problem to
> occur. All the more reason to figure this out, I say.
>
> I've been working to verify what you've said previously (and to rule
> out any contrary hypotheses - like chunks momentarily having the wrong
> physical offset). One point I can't corroborate is:
>
> > There are roughly 40 distinct block addresses affected in your check log,
> > clustered in two separate 256 MB blocks.
>
> The only missing writes that I see are in a single 256 MiB cluster
> (belonging to chunk 1065173909504). What is the other 256 MiB cluster
> that you are seeing?

Hex addresses from your parent transid verify failed messages:

        f80ad70000 f80ad74000 f80c11c000 f80c6dc000
        f80c6e8000 f80c704000 f80dd04000 f80e320000
        f80e3a4000 f80e3cc000 f8167c8000 f8167e0000
        f818a68000 f818a6c000 f818a70000 f818a74000
        f818a7c000 f818a80000 f818a84000 f818a8c000

"leaf parent key incorrect" has a similar distribution.

There is less than 256MB distance from the first to the last, but they
occupy two separate 256MB-aligned regions (f800000000 and f810000000).
If there is dup metadata then these blocks occupy four separate 256MB
regions, as there is some space between duplicate regions (in logical
address space).

256MB is far too large for a plausible erase block size, 1GB is even
less likely.

> What shows that writes to that range went
> missing, too? (Or by "affected" do you only mean "involved in the
> damaged transactions in some way"?)

"parent transid verify failed" is btrfs for "there is a reference to a
block at this location, but the block was not written and a different
(usually older) block was found instead." "leaf parent key incorrect"
is a synonym. The named block is almost always a missing write.

I don't see any signs of more than one transaction being affected.
With 20 distinct pages of metadata lost we'd expect up to 60,000 items
with reference problems, but there are only about 9000 unique error
items. If there was more damage to leaf pages, especially from other
transactions, I'd expect more unique errors.

> I do find it interesting that, of a few dozen missing writes, all of
> them are clustered together, while other writes in the same
> transactions appear to have had a perfect success rate. My expectation
> for drive cache failure would have been that *all* writes (during the
> incident) get the same probability of being dropped. All of the
> failures being grouped like that can only mean one thing... I just
> don't know what it is. :)

That's not how write caches work--they are not giant FIFO queues. Write
caches reorder writes in close temporal proximity (i.e. writes that
occurred close together in time rather than in space). There is
necessarily reordering in the failure case--if there was no reordering,
in a strictly FIFO cache, there would be no btrfs errors because without
reordering the last persisted transaction would be completely persisted
with all its metadata intact (there may be later transactions that were
lost, but if none of their writes persisted then those transactions
didn't really happen).

The cache does not occupy all of the DRAM in the device--usually it is
only a handful of pages, because the firmware will drain the cache to
flash as quickly as possible. btrfs quite often has to stop to read
data from disk during a transaction, giving the drive time to catch up
flushing out the cache. This results in a small number of writes in
flight.

So there might be e.g. writes to blocks A, B, C, D, E, F, barrier, G,
and those are reordered in write cache as A, D, E, F, G, B, C, and then
the last 2 writes are lost by a drive problem.
Writes to locations A, D, E, F, and G are persisted (G might be the
superblock update, completing the transaction and updating root pointers
to all the trees) while writes to B and C are not (these are the "leaf
parent key incorrect" and "parent transid verify failed" errors, where
a parent node in a tree points to a block that does not contain the
matching child).

Reordering like the above is allowed, but only if the drive can
guarantee results equivalent to the original ordering with the barrier
constraint (e.g. it has big capacitors or SLC write cache). The drive
can legally write G earlier if and only if it can guarantee it will
finish writing all of A-F, even if power is lost, host sends a reset,
or media fails.

> So, the prime suspect at this point is the SSD firmware. Once I have a
> little more information, I'll (try to) share what I find with the
> vendor. Ideally I'd like to narrow down which of 3 components of the
> firmware apparently contains the fault:
> 1. Write back cache: Most likely, although not certain at this point.
> If I turn off the write cache and the problem goes away, I'll know.

Disabling write cache is the most common workaround, and often
successful if the drive is still healthy.

> 2. NVMe command queues: Perhaps there is some race condition where 2
> writes submitted on different queues will, under some circumstances,
> cause one/both of the writes to be ignored.

That often happens to drives as they fail. Firmware reboots due to
a bug in error handling code or hardware fault, and doesn't remember
what was in its write cache. Also if there is a transport failure and
the host resets the bus, the drive firmware might have its write cache
forcibly erased before being able to write it. The spec says that
doesn't happen, but some vendors are demonstrably unable to follow spec.

> 3. LBA mapper: Given the pattern of torn writes, it's possible that
> some LBAs were not updated to the new PBAs after some of the writes.
> I find this pretty unlikely for a handful of reasons (trying to write a
> non-erased block should result in an internal error, old PBA should be
> erased, ...)

That's an SSD-specific restatement of #1 (failure to persist data before
reporting successfully completed write to the host, and returning
previous versions of data on later reads of the same address). SSDs
don't necessarily erase old blocks immediately--a large, empty or
frequently discarded SSD might not erase old blocks for months.

All of the above fit into the general category of "drive drops some
writes, out of order, when some triggering failure occurs."

If you have access to the drive's firmware on github, you could check
out the code, determine which bug is occurring, and send a pull request
with the fix. If you don't, usually the practical solution is to choose
a different drive vendor, unless you're ordering enough units to cause
drive manufacturer shareholders to panic when you stop.

Also you need to be _really_ sure it's the drive, and this information
casts some doubt on that theory:

> However, even if this is a firmware/hardware issue, I remain
> unconvinced that it's purely coincidence just how quickly this
> happened after the upgrade to 5.14.x. In addition to this corruption,
> there are the 2 incidents where the system became unresponsive under
> I/O load (and the second was purely reads from trying to image the
> SSD). Those problems didn't occur when booting a rescue USB with an
> older kernel. So some change which landed in 5.14.x may have changed
> the drive command pattern in some important way to trigger the SSD
> fault (esp, in the case of possibility #2 above). That gives me hope
> that, if nothing else, we may be able to add a device quirk to Linux
> and minimize future damage that way. :)

Yeah, if something horrible happened in the Linux 5.14 baremetal NVME
hardware drivers or PCIe subsystem in general, then it could produce
symptoms like these.
It wouldn't be the first time a regression in other parts of Linux was
detected by a flood of btrfs errors.

Device resets might trigger write cache losses and then all the above
"firmware" symptoms (but the firmware is not at fault, it is getting
disrupted by the host) (unless you are a stickler for the letter of
the spec that says write cache must be immune to host action).

Linux 5.14 btrfs on VMs seems OK. I run tests continuously on new
Linux kernels to detect btrfs and lvm regressions early, and nothing
like this has happened on my humble fleet. My test coverage is
limited--it won't detect a baremetal NVME transport issue, as that's
handled by the host kernel not the VM guest.

> Bayes calls out from beyond the grave and demands that, before I try
> any experiments, I first establish the base rate of these corruptions
> under current conditions. So that means rebuilding my filesystem from
> backups and continuing to use it exactly as I have been, prepared for
> this problem to happen again. Being prepared means stepping up my
> backup frequency, so I'll first set up a btrbk server that can accept
> hourly backups.

Sound methodology.

> Wish me luck,

Good luck!

> Sam
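
For reference, the 256MB region arithmetic discussed earlier in this
message can be reproduced with a short Python sketch. It assumes the
20 hex values are logical byte addresses exactly as listed above and
that "256MB" means 2**28 bytes; the names REGION, ADDRS and region_base
are illustrative only and do not come from btrfs or any tool.

```python
# Assumption: "256MB" in the message means 2**28 bytes (256 MiB).
REGION = 1 << 28

# The 20 addresses quoted from the "parent transid verify failed" messages.
ADDRS = [
    0xf80ad70000, 0xf80ad74000, 0xf80c11c000, 0xf80c6dc000,
    0xf80c6e8000, 0xf80c704000, 0xf80dd04000, 0xf80e320000,
    0xf80e3a4000, 0xf80e3cc000, 0xf8167c8000, 0xf8167e0000,
    0xf818a68000, 0xf818a6c000, 0xf818a70000, 0xf818a74000,
    0xf818a7c000, 0xf818a80000, 0xf818a84000, 0xf818a8c000,
]


def region_base(addr, size=REGION):
    """Round addr down to the start of its size-aligned region (hypothetical helper)."""
    return addr & ~(size - 1)


# Distinct 256 MiB-aligned regions touched by the failed blocks.
regions = sorted({region_base(a) for a in ADDRS})

# Distance in bytes from the lowest to the highest affected address.
span = max(ADDRS) - min(ADDRS)

print("aligned regions:", [f"{r:x}" for r in regions])
print("span below 256 MiB:", span < REGION)
```

Both statements can hold at once because the addresses straddle an
alignment boundary: the total span is under one region in size, yet the
blocks fall on either side of f810000000.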