From: Goffredo Baroncelli
Reply-To: kreijack@inwind.it
Date: Tue, 7 Jun 2022 19:36:30 +0200
Subject: Re: [PATCH DRAFT] btrfs: RAID56J journal on-disk format draft
To: Qu Wenruo, Lukas Straub
Cc: Martin Raiber, Paul Jones, Wang Yugui, "linux-btrfs@vger.kernel.org"

On 07/06/2022 03.27, Qu Wenruo wrote:
>
>
> On 2022/6/7 02:10, Goffredo Baroncelli wrote:
[...]
>>
>> But with a battery backup (i.e. no power failure), the likelihood of b)
>> becomes negligible.
>>
>> This is to say that a write-intent bitmap will provide a huge
>> improvement to the resilience of btrfs raid5, and in turn raid6.
>>
>> My only suggestion is to find a way to store the intent bitmap not in the
>> raid5/6 block group, but in a separate block group, with the appropriate
>> level of redundancy.
>
> That's why I want to reject RAID56 as metadata, and just store the
> write-intent tree into the metadata, like what we did for fsync (log tree).
>

My suggestion was not to use the btrfs metadata to store the "write-intent",
but to track the space used by the write-intent storage area with a bg.
Then the write intent can be handled not with a btrfs btree, but (e.g.) by
simply writing a bitmap of the used blocks, or the pairs [start, length]...

I really like the idea of storing the write intent in a btree. I find it very
elegant. However, I don't think that it is convenient.
The write-intent disk format is not performance critical: you don't need to
seek inside it, and it is small. You need to read it (entirely) only in case
of a power failure, and in any case the biggest cost is scrubbing the last
updated blocks. So a btree is not needed.

Moreover, the handling of raid5/6 is a layer below the btree. I think that
updating the write-intent btree would be a performance bottleneck. I am quite
sure that the write intent likely requires less than one metadata page
(16K today); however, to store this page you need to update the metadata page
tracking...

>>
>> This for two main reasons:
>> 1) in the future BTRFS may get the ability to allocate this block group
>> on a dedicated disk set. I see two main cases:
>> a) in case of raid6, we can store the intent bitmap (or the journal) in a
>> raid1C3 BG allocated on the faster disks. The con is that each block
>> has to be written 3x2 times. But if you have a hybrid disk set (some ssd
>> and some hdd), you get a noticeable performance gain.
>
> In fact, for 4 disk usage, RAID10 has good enough chance to tolerate 2
> missing disks.
>
> In fact, the chance to tolerate two missing devices for 4 disks RAID10 is:
>
> 4 / 6 = 66.7%
>
> 4 is the total valid combinations, no order involved, including:
> (1, 3), (1, 4), (2, 3), (2, 4).
> (Or 4C2 - 2)
>
> 6 is the 4C2.
>
> So really no need to go RAID1C3 unless you really want to ensure 2
> disks tolerance.

I don't get the point: I started talking about raid6.
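The 4 / 6 figure quoted above can be checked by brute-force enumeration. This is only a minimal sketch in Python, assuming the 4-disk RAID10 is built from two mirror pairs, (1, 2) and (3, 4), which is what the quoted list of surviving combinations implies; the names MIRROR_PAIRS and survives() are made up for the illustration and have nothing to do with btrfs code:

```python
from itertools import combinations

# Assumed layout: two mirror pairs striped together (4-disk RAID10).
MIRROR_PAIRS = [{1, 2}, {3, 4}]
DISKS = sorted(set().union(*MIRROR_PAIRS))


def survives(failed):
    """The array survives if no mirror pair has lost both of its disks."""
    return all(not pair <= set(failed) for pair in MIRROR_PAIRS)


# Every unordered way of losing two of the four disks (4C2).
two_disk_failures = list(combinations(DISKS, 2))
survivable = [f for f in two_disk_failures if survives(f)]

print(len(two_disk_failures))   # 6
print(survivable)               # [(1, 3), (1, 4), (2, 3), (2, 4)]
print(len(survivable) / len(two_disk_failures))
```

The two combinations that are filtered out, (1, 2) and (3, 4), each take out both halves of one mirror, so under this assumed layout a second disk failure is survivable in 4 cases out of 6 but not guaranteed.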
RAID6 is two-failure proof (you need three failures to see the problem...
in theory).

If P is the probability of a disk failure (with P << 1), the likelihood of a
RAID6 failure is O(P^3). The same holds for RAID1C3.

The RAID10 failure likelihood, instead, is only a bit lower than that of two
disk failures: RAID10 (4 disks) failure is O(0.66 * P^2) ~ O(P^2).

Because P << 1, P^3 << 0.66 * P^2.

>
>> b) another option is to spread the intent bitmap (or the journal) across
>> *all* disks, where each disk contains only the related data (if we
>> update only disk #1 and disk #2, we have to update only the intent bitmap
>> (or the journal) on disk #1 and disk #2)
>
> That's my initial per-device reservation method.
>
> But for write-intent tree, I tend to not go that way, but with a
> RO-compatible flag instead, as it's much simpler and more back compatible.
>
> Thanks,
> Qu
>>
>>
>> 2) having a dedicated bg for the intent bitmap (or the journal) has
>> another big advantage: you don't need to change the meaning of the
>> raid5/6 bg. This means that an older kernel can read/write a raid5/6
>> filesystem: it is sufficient to ignore the intent bitmap (or the journal)
>>
>>
>>
>>>
>>> Furthermore, this even allows us to go something like bitmap tree, for
>>> such write-intent bitmap.
>>> And as long as the user is not using RAID56 for metadata (maybe even
>>> it's OK to use RAID56 for metadata), it should be pretty safe against
>>> most write-hole (for metadata and CoW data only though, nocow data is
>>> still affected).
>>>
>>> Thus I believe this can be a valid path to explore, and even have a
>>> higher priority than full journal.
>>>
>>> Thanks,
>>> Qu
>>>
>>
>>
>>

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5