From: Chris Murphy
Date: Sun, 1 Apr 2018 15:11:04 -0600
Subject: Re: Status of RAID5/6
To: Chris Murphy
Cc: Zygo Blaxell, Goffredo Baroncelli, Christoph Anton Mitterer,
 Btrfs BTRFS
References: <1521662556.4312.39.camel@scientia.net>
 <20180329215011.GC2446@hungrycats.org>
 <389bce3c-92ac-390a-1719-5b9591c9b85c@libero.it>
 <20180331050345.GE2446@hungrycats.org>
 <20180401034544.GA28769@hungrycats.org>

(I hate it when my palm rubs the trackpad and hits send prematurely...)

On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy wrote:

>> Users can run scrub immediately after _every_ unclean shutdown to
>> reduce the risk of inconsistent parity and unrecoverable data should
>> a disk fail later, but this can only prevent future write hole events,
>> not recover data lost during past events.
>
> Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> such a leaf containing EXTENT_CSUM means that

EXTENT_CSUM is assumed to be correct. But in fact it could be stale.
It's just as possible that the metadata and superblock updates are
what's missing due to the interruption, while both the data and parity
strip writes succeeded. The window in which either the data or parity
write can fail is far shorter than the span covered by the numerous
metadata writes followed by the superblock update. In that case the
old metadata is what's pointed to, including EXTENT_CSUM, so scrub
would always show a csum error even when both data and parity are
correct. You'd have to rebuild the csum tree (btrfs check
--init-csum-tree) in that case, I suppose.

Pretty much it's RMW with a (partial) stripe overwrite upending COW,
and with it the atomicity, and thus the consistency, of Btrfs in the
raid56 case whenever any portion of the transaction is interrupted.
And this is amplified if metadata is also raid56.

ZFS avoids the problem, at the expense of probably a ton of
fragmentation, by taking e.g. a 4KiB RMW and writing it as a
full-length stripe of 8KiB, fully COW, rather than modifying a stripe
in place with an overwrite. It can do that because it has dynamic
stripe lengths. For Btrfs to always do COW would mean a 4KiB change
goes into a new full stripe, 64KiB * num devices, assuming no other
changes are ready at commit time.

So yeah, avoiding the problem is best. But if it's going to be a
journal, it's going to make things pretty damn slow, I'd think, unless
the journal can be explicitly placed on something faster than the
array, like an SSD/NVMe device. And that's what mdadm allows and
expects.

--
Chris Murphy
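
P.S. To make the write hole concrete, here's a toy sketch of the XOR
arithmetic. It's not Btrfs code, just a made-up minimal raid5 of two
data strips plus one parity strip, 4 bytes each:

    # Toy raid5 write hole: parity = d0 XOR d1.
    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d0 = bytes([0xAA] * 4)     # data strip, disk 0
    d1 = bytes([0xBB] * 4)     # data strip, disk 1
    p = xor(d0, d1)            # parity strip, disk 2 -- consistent

    # RMW update of d0: the data write lands, then the interruption
    # hits before the matching parity write completes.
    d0 = bytes([0x11] * 4)
    # p = xor(d0, d1)          # <- this parity write was lost

    # Later, disk 1 dies. Reconstruct d1 from the survivors:
    d1_rebuilt = xor(d0, p)
    print(d1_rebuilt == bytes([0xBB] * 4))  # False: stale parity
                                            # reconstructs garbage

Nothing on disk flags the stale parity until the reconstruction has
already produced the wrong bytes, which is why the scrub-after-crash
advice above only helps with future events.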
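
To put numbers on the always-COW cost, assuming a hypothetical
6-device raid5 and the 64KiB strip size:

    # Back-of-envelope: every 4KiB change forced into a fresh stripe.
    strip = 64 * 1024                  # strip (stripe element) size
    n_devices = 6                      # hypothetical: 5 data + 1 parity
    change = 4 * 1024
    full_stripe = strip * n_devices    # 393216 bytes written
    print(full_stripe // change)       # 96x write amplification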
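
And for reference, the mdadm journal I mean is the one set up at
create time, something like this (syntax from memory, so check
mdadm(8) for your version; the device names are made up):

    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/sd[b-e] --write-journal /dev/nvme0n1p1

Every raid5/6 write hits the journal device first, so an interrupted
partial-stripe write can be replayed instead of leaving data and
parity out of sync -- at the cost of funneling all writes through that
one device, which is why it wants to be NVMe rather than one of the
array's own disks.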