Date: Sat, 21 Dec 2019 15:06:32 -0500
From: Zygo Blaxell
To: Chris Murphy
Cc: Btrfs BTRFS, Qu Wenruo, Marc Lehmann
Subject: Re: btrfs dev del not transaction protected?
Message-ID: <20191221200632.GB13306@hungrycats.org>

On Fri, Dec 20, 2019 at 01:24:02PM -0700, Chris Murphy wrote:
> On Fri, Dec 20, 2019 at 9:53 AM Marc Lehmann wrote:
> >
> > On Fri, Dec 20, 2019 at 09:41:15PM +0800, Qu Wenruo wrote:
>
> > > Consider all these insane things, I tend to believe there is some
> > > FUA/FLUSH related hardware problem.
> >
> > Please don't - I honestly think btrfs developers are way too fast to blame
> > hardware for problems.
>
> That's because they have a lot of evidence of this, in a way that's
> only inferable with other file systems. This has long been suspected
> by, and demonstrated, well before Btrfs with ZFS development.
>
> A reasonable criticism of Btrfs development is the state of the file
> system check repair, which still has danger warnings. But it's also a
> case of damned if they do, and damned if they don't provide it. It
> might be the best chance of recovery, so why not provide it?
> Conversely, the reality is that the file system is complicated enough,
> and the file system checker too slow, that the effort needs to be on
> (what I call) file system autopsy tools, to figure out why the
> corruption happened, and prevent that from happening. The repair is
> often too difficult.
>
> Take, for example, the recent 5.2.0-5.2.14 corruption bug. That was
> self-reported once it was discovered and fixed, which took longer than
> usual, and developers apologized. What else can they do? It's not like
> the developers are blaming hardware for their own bugs. They have
> consistently taken responsibility for Btrfs bugs.
>
>
> > I currently lose btrfs filesystems about once every
> > 6 months, and other than the occasional user error, it's always the kernel
> > (e.g. 4.11 corrupting data, dmcache and/or bcache corrupting things,
> > low-memory situations etc. - none of these seem to be centric to btrfs,
> > but none of those are hardware errors either). I know it's the kernel in
> > most cases because in those cases, I can identify the fix in a later
> > kernel, or the mitigating circumstances don't appear (e.g. freezes).
>
> Usually Btrfs developers do mention the possibility of other software
> layers contributing to the problem, it's a valid observation that this
> possibility be stated.

Also note that not all btrfs developers will agree on a failure analysis.
Some patience is required. Be prepared to support your bug report with
working reproducers and relevant evidence, possibly many times, with
fresh backtraces on each new kernel release in which the bug still
appears.

> However, if it's exclusively a software problem, then it should be
> reproducible on other systems.
>
>
> > In any case if it is a hardware problem, then linux and/or btrfs has
> > to work around them, because it affects many different controllers on
> > different boards:
>
> How do you propose Btrfs work around it? In particular when there are
> additional software layers over which it has no control?
>
> Have you tried disabling the (drives') write cache?

Apparently many sysadmins disable write cache proactively on all drives,
instead of waiting until the drive drops some data to learn that there's
a problem with the firmware.
That's a reasonable tradeoff for btrfs, which already has a heavily
optimized write path (most of the IO time in btrfs commit is spent
_reading_ metadata).

> > Just while I was writing this mail, on 5.4.5, the _newly created_ btrfs
> > filesystem I restored to went into readonly mode with ENOSPC. Another
> > hardware problem?
>
> > [41801.618887] CPU: 2 PID: 5713 Comm: kworker/u8:15 Tainted: P           OE     5.4.5-050405-generic #201912181630
>
> Why is this kernel tainted? The point of pointing this out isn't to
> blame whatever is tainting the kernel, but to point out that
> identifying the cause of your problems is made a lot more difficult. I
> think you need to simplify the setup, a lot, in order to reduce the
> surface area of possible problems. Any bug hunt is made way harder
> when there's complication.
>
>
>
> > [41801.618888] Hardware name: MSI MS-7816/Z97-G43 (MS-7816), BIOS V17.8 12/24/2014
> > [41801.618903] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> > [41801.618916] RIP: 0010:btrfs_finish_ordered_io+0x730/0x820 [btrfs]
> > [41801.618917] Code: 49 8b 46 50 f0 48 0f ba a8 40 ce 00 00 02 72 1c 8b 45 b0 83 f8 fb 0f 84 d4 00 00 00 89 c6 48 c7 c7 48 33 62 c0 e8 eb 9c 91 d5 <0f> 0b 8b 4d b0 ba 57 0c 00 00 48 c7 c6 40 67 61 c0 4c 89 f7 bb 01
> > [41801.618918] RSP: 0018:ffffc18b40edfd80 EFLAGS: 00010282
> > [41801.618921] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> > [41801.618922] RAX: 0000000000000000 RBX: ffff9f8b7b2e3800 RCX: 0000000000000006
> > [41801.618922] BTRFS info (device dm-35): forced readonly
> > [41801.618924] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff9f8bbeb17440
> > [41801.618924] RBP: ffffc18b40edfdf8 R08: 00000000000005a6 R09: ffffffff979a4d90
> > [41801.618925] R10: ffffffff97983fa8 R11: ffffc18b40edfbe8 R12: ffff9f8ad8b4ab60
> > [41801.618926] R13: ffff9f867ddb53c0 R14: ffff9f8bbb0446e8 R15: 0000000000000000
> > [41801.618927] FS: 0000000000000000(0000) GS:ffff9f8bbeb00000(0000) knlGS:0000000000000000
> > [41801.618928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [41801.618929] CR2: 00007f8ab728fc30 CR3: 000000049080a002 CR4: 00000000001606e0
> > [41801.618930] Call Trace:
> > [41801.618943] finish_ordered_fn+0x15/0x20 [btrfs]
> > [41801.618957] normal_work_helper+0xbd/0x2f0 [btrfs]
> > [41801.618959] ? __schedule+0x2eb/0x740
> > [41801.618973] btrfs_endio_write_helper+0x12/0x20 [btrfs]
> > [41801.618975] process_one_work+0x1ec/0x3a0
> > [41801.618977] worker_thread+0x4d/0x400
> > [41801.618979] kthread+0x104/0x140
> > [41801.618980] ? process_one_work+0x3a0/0x3a0
> > [41801.618982] ? kthread_park+0x90/0x90
> > [41801.618984] ret_from_fork+0x1f/0x40
> > [41801.618985] ---[ end trace 35086266bf39c897 ]---
> > [41801.618987] BTRFS: error (device dm-35) in btrfs_finish_ordered_io:3159: errno=-28 No space left
> >
> > unmount/remount seems to make it work again, and it is full (df) yet has
> > 3TB of unallocated space left. No clue what to do now, do I have to start
> > over restoring again?
> >
> > Filesystem               Size  Used Avail Use% Mounted on
> > /dev/mapper/xmnt-cold15   27T   23T     0 100% /cold1
>
> Clearly a bug, possibly more than one. This problem is being discussed
> in other threads on df misreporting with recent kernels, and a fix is
> pending.
>
> As for the ENOSPC, also clearly a bug. But not clear why or where.
>
>
> > Please, don't always chalk it up to hardware problems - btrfs is a
> > wonderful filesystem for many reasons, one reason I like is that it can
> > detect corruption much earlier than other filesystems. This feature alone
> > makes it impossible for me to go back to xfs. However, I had corruption
> > on ext4, xfs, reiserfs over the years, but btrfs *is* simply way buggier
> > still than those - before btrfs (and even now) I kept md5sums of all
> > archived files (~200TB), and xfs and ext4 _do_ a much better job at not
> > corrupting data than btrfs on the same hardware - while I get filesystem
> > problems about every half a year with btrfs, I had (silent) corruption
> > problems maybe once every three to four years with xfs or ext4 (and not
> > yet on the boxes I use currently).
>
> I can't really parse the suggestion that you are seeing md5 mismatches
> (indicating data changes) on Btrfs, where Btrfs doesn't produce a csum
> warning along with EIO on those files? Are these files nodatacow,
> either by mount option nodatasum or nodatacow, or using chattr +C on
> these files?
>
> A mechanism explaining this anecdote isn't clear. Not even crc32c
> checksum collision would explain more than maybe one instance of it.
>
> I'm curious what Zygo thinks about this.

Hardware bugs and failures are certainly common, and fleetwide hardware
failures do happen. They're also recognizable as hardware bugs--some
specific failure modes (e.g. single-bit data value errors, parent transid
verify failure after crashes) are definitely hardware and can be easily
spotted with only a few lines of kernel logs. Some components of btrfs
(e.g. scrubs, csum verification, raid1 corruption recovery) are very
reliable detectors of hardware or firmware misbehavior (although sometimes
it is not trivial to identify _which_ hardware is at fault). Some parts
of btrfs (like free space management) are purely btrfs's own logic, and
cannot be affected by hardware failures without destroying the entire
filesystem.

On the other hand, it's not like btrfs or the Linux kernel has been bug
free either, and a lot of serious but hard to detect bugs are 5-10 years
old when they get fixed.
All kernels before 5.1 had silent data corruption bugs for compressed
data at hole boundaries. Kernels 5.1 to 5.4 have use-after-free bugs in
btrfs that lead to metadata corruption (5.1), transaction aborts due to
self-detected metadata corruption (5.2), and crashes (5.3 and 5.4).
5.2 also had a second metadata corruption with deadlock bug. Other
parts of the kernel are hard on data as well: somewhere around 4.7 a
year-old kernel memory corruption bug was found in the r8169 network
driver, and 4.0, 4.19, and 5.1 all had famous block-layer bugs that
would destroy any filesystem under certain conditions.

I test every upstream kernel release thoroughly before deploying to
production, because every upstream Linux kernel release has thousands
of bugs (btrfs is usually about 1-2% of those). I am still waiting for
the very first upstream kernel release for btrfs that can run our full
production stress test workload without any backported fixes and without
crashing or corrupting data or metadata for 30 days. So far that goal
has never been met. We upgrade kernels when a new release gets better
than an old one, but the median uptime under stress is still an order of
magnitude short of the 30 day mark, and our testing on 5.4.5+fixes
isn't done yet.

Unfortunately, due to the nature of crashing bugs, we can only work on
the most frequently occurring bug at any time, and each one has to be
fixed before the next most frequently occurring bug can be discovered,
making these fixes a very sequential process. Then there's the two-month
lag to get patches from the mailing list into stable kernels, which is
plenty of time for new regressions to appear, and we start over again
with a fresh set of bugs to fix.

btrfs dev del bugs are not crashing bugs, so they are so far down my
priority list that I haven't bothered to test for them, or even to report
them when I find one accidentally.
There are a few bugs there though, especially if you are low on metadata
space (which is a likely event if you just removed an entire disk's
worth of storage) or btrfs has a bug in that kernel version that just
makes btrfs _think_ it is low on metadata space, and the transaction
aborts during the delete.

Occasionally I hit one of these in an array and work around it with a
patch like this one:

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 56e35d2e957c..b16539fd2c23 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7350,6 +7350,8 @@ int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info)
 #if 0
 		ret = -EINVAL;
 		goto error;
+#else
+		btrfs_set_super_num_devices(fs_info->super_copy, total_dev);
 #endif
 	}
 	if (btrfs_super_total_bytes(fs_info->super_copy) <

Probably not a good idea for general use, but it may solve an immediate
problem if the problem is simply that the wrong number of devices is
stored in the superblock.

>
> >
> > Please take these issues seriously - the trend of "it's a hardware
> > problem" will not remove the "unstable" stigma from btrfs as long as btrfs
> > is clearly more buggy than other filesystems.
>
> > Sorry to be so blunt, but I am a bit sensitive with always being told
> > "it's probably a hardware problem" when it clearly affects practically any
> > server and any laptop I administrate. I believe in btrfs, and detecting
> > corruption early is a feature to me.
>
> The problem with the anecdotal method of arguing in favor of software
> bugs as the explanation? It directly goes against my own experience,
> also anecdote. I've had no problems that I can attribute to Btrfs. All
> were hardware or user sabotage. And I've had zero data loss, outside
> of user sabotage.

You are definitely not testing hard enough. ;)

At one point in 2016 there were 145 active bugs that are known today.
About 10 of those 145 were discovered in the last few months alone
(i.e. it was broken in 2016, and we only know now how broken it was then
after 3 years of hindsight).

	https://imgur.com/a/A2sXcQB

Thankfully, many of those bugs were mostly harmless, but some were not:
I've found at least 5 distinct data or metadata corrupting bugs since
2014, and confirmed the existence of several more in regression testing.

> I have seen device UNC read errors, corrected automatically by Btrfs.
> And I have seen devices return bad data that Btrfs caught, that would
> otherwise have been silent corruption of either metadata or data, and
> this was corrected in the raid1 cases, and merely reported in the
> non-raid cases. And I've also seen considerable corruption reported
> upon SD Cards in the midst of implosion and becoming read only. But
> even read only, I was able to get all the data out.

btrfs data recovery on raid1 from csum and UNC sector failures is
excellent. I've seen no issues there since 3.18ish. I do test that
from time to time with VMs and fault injection and also with real
disk failures.

btrfs on raid5 (internal or external raid5 implementation), device delete,
and some unfortunate degraded mode behaviors still need some work.

> But in your case, practically every server and laptop? That's weird and
> unexpected. And it makes me wonder what's in common. Btrfs is much
> fussier than other file systems because the by far largest target for
> corruption, isn't file system metadata, but data. The actual payload
> of a file system isn't the file system. And Btrfs is the only Linux
> native file system that checksums data. The other file systems check
> only metadata, and only somewhat recently, depending on the
> distribution you're using.

If the "corruption" consists of large quantities of zeros, the problem
might be using the (default) noflushoncommit mount option, or using
applications that don't fsync() religiously. This is correct filesystem
behavior, though maybe not behavior any application developer wants.
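
For illustration only, here is a minimal sketch of what "fsync()
religiously" can look like on the application side, assuming Python on
Linux. The helper name durable_write and the ".tmp" suffix are made up
for this example and do not come from any tool in this thread. It writes
the new contents to a temporary file, fsyncs that file, renames it over
the target, and then fsyncs the containing directory so the rename is
persisted too.

```python
import os


def durable_write(path, data):
    """Replace `path` with `data`, syncing file and directory to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the temp file

    # Write the new contents to a temporary file and force them to disk.
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()             # drain Python's userspace buffer
        os.fsync(tmp.fileno())  # ask the kernel to persist the file data

    # Atomically swap the new file into place.
    os.replace(tmp_path, path)

    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

An application that skips the two fsync calls can still see a correctly
named file after a crash, just without the data it expected in it, which
matches the "large quantities of zeros" symptom described above.
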

If the corruption affects compressed data adjacent to holes, then it's
a known problem fixed in 5.1 and later.

If the corruption is specifically and only parent transid verify failures
after a crash, UNC sector read, or power failure, then we'd be looking
for drive firmware issues or non-default kernel settings to get a
fleetwide effect.

If the corruption is general metadata corruption without metadata page
csum failures, then it could be host RAM failure, general kernel memory
corruption (i.e. you have to look at all the other device drivers in the
system), or known bugs in btrfs kernel 5.1 and later.

If the corruption is all csum failures, then there's a long list of
drive issues that could cause it, or the partition could be trampled by
other software (BIOSes are sometimes surprisingly bad at this).

> > I understand it can be frustrating to be confronted with hard to explain
> > accidents, and I understand if you can't find the bug with the sparse info
> > I gave, especially as the bug might not even be in btrfs. But keep in mind
> > that the people who boldly/dumbly use btrfs in production and restore
> > dozens of terabytes from backup every so and so many months are also being
> > frustrated if they present evidence from multiple machines and get told
> > "its probably a hardware problem".
>
> For sure. But take the contrary case that other file systems have
> depended on for more than a decade: assuming the hardware is returning
> valid data. This is intrinsic to their design. And go back before they
> had metadata checksumming, and you'd see it stated on their lists that
> they do assume this, and if your devices return any bad data, it's not
> the file system's fault at all. Not even the lack of reporting any
> kind of problem whatsoever. How is that better?
>
> Well indeed, not long after Btrfs was demonstrating these are actually
> more common problems than suspected, metadata checksumming started
> creeping into other file systems, finally becoming the default (a
> while ago on XFS, and very recently on ext4). And they are catching a
> lot of these same kinds of layer and hardware bugs. Hardware does not
> just mean the drive, it can be power supply, logic board, controller,
> cables, drive write caches, drive firmware, and other drive internals.
>
> And the only way any problem can be fixed, is to understand how, when
> and where it happened.
>
> --
> Chris Murphy