From: Nix
To: Mukund Sivaraman
Cc: Wols Lists, linux-raid@vger.kernel.org
Subject: Re: RAID-6 and write hole with write-intent bitmap
References: <20201124072039.GA381531@jurassic.vpn.mukund.org> <5FBCDC18.9050809@youngman.org.uk> <20201124185004.GA27132@jurassic.vpn.mukund.org>
Emacs: the definitive fritterware.
Date: Sat, 28 Nov 2020 01:57:33 +0000
In-Reply-To: <20201124185004.GA27132@jurassic.vpn.mukund.org> (Mukund Sivaraman's message of "Wed, 25 Nov 2020 00:20:04 +0530")
Message-ID: <878samckfm.fsf@esperi.org.uk>

On 24 Nov 2020, Mukund Sivaraman told this:

[...]

> (a) With RAID-5, assuming there are 4 member disks A, B, C, D, a write
>     operation with its data on disk A and stripe's parity on disk B may
>     involve:
>
>     1. a read of the stripe
>     2. update of data on A
>     3. computation and update of parity A^C^D on B
>
>     These are not atomic updates. If power is lost between steps 2 and 3,

The writes usually proceed in parallel (because anything else would be
abominably slow). But the problem is that the writes to the component
disks are also not atomic, and will likely not proceed at the same rate:
only spindle-synched drives give anything like a guarantee of that, and
those have been unobtainable for decades.

So a power loss could well leave 500 sectors of the stripe written on
disk A but only 430 sectors written on disk B... and the sectors between
430 and 500 are not consistent. (Disk C might well be up around sector
600 and disk D around sector 450, and there's no *way* mere parity or
RAID-6 syndromes can recover from the wildly-varying mess between
sectors 430 and 600: it's not as if it gets recorded anywhere how far a
disk's write had got before the power went out, either. But the journal
avoids this in the usual fashion for a journal, by writing out the whole
thing first and committing it to stable storage, so that on restart the
incomplete writes can just be replayed.)

-- 
NULL && (void)
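P.S. A rough, purely illustrative Python sketch of the torn-stripe
scenario above. The stripe geometry and the per-disk write progress
(500/430/600/450 sectors) are just the numbers from the discussion, not
anything a real array would report:

```python
# Toy model of a torn RAID-5 full-stripe write across 4 member disks.
# All sizes and "progress" values are invented for illustration only.
import os
from functools import reduce

SECTORS = 700          # sectors per chunk in this toy stripe
SECTOR = 512           # bytes per sector

def xor(blocks):
    """Byte-wise XOR of equal-length byte strings (RAID-5 parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Old, consistent stripe: data on A, C, D; parity on B.
old = {d: os.urandom(SECTORS * SECTOR) for d in "ACD"}
old["B"] = xor([old["A"], old["C"], old["D"]])

# New stripe being written out in parallel to all four disks.
new = {d: os.urandom(SECTORS * SECTOR) for d in "ACD"}
new["B"] = xor([new["A"], new["C"], new["D"]])

# Power loss: each disk got a different distance through its write.
progress = {"A": 500, "B": 430, "C": 600, "D": 450}
torn = {d: new[d][: progress[d] * SECTOR] + old[d][progress[d] * SECTOR:]
        for d in "ABCD"}

# Check parity sector by sector: it only holds where every disk has
# either all-new data (below 430) or all-old data (600 and above).
bad = [s for s in range(SECTORS)
       if any(xor([torn[d][s * SECTOR:(s + 1) * SECTOR] for d in "ABCD"]))]
print(bad[0], bad[-1])   # the inconsistent window: 430 through 599
```

Parity can tell you those sectors are bad, but with a mix of old and new
data on every disk there is nothing to reconstruct *from*; the journal
sidesteps that by keeping a complete copy to replay.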