From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B8A75C433FE for ; Tue, 24 May 2022 07:32:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233945AbiEXHca (ORCPT ); Tue, 24 May 2022 03:32:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48824 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235091AbiEXHcW (ORCPT ); Tue, 24 May 2022 03:32:22 -0400 Received: from verein.lst.de (verein.lst.de [213.95.11.211]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7AB01496B6 for ; Tue, 24 May 2022 00:32:20 -0700 (PDT) Received: by verein.lst.de (Postfix, from userid 2407) id E4E0A68AFE; Tue, 24 May 2022 09:32:16 +0200 (CEST) Date: Tue, 24 May 2022 09:32:16 +0200 From: Christoph Hellwig To: Qu Wenruo Cc: Christoph Hellwig , Chris Mason , Josef Bacik , David Sterba , Qu Wenruo , linux-btrfs@vger.kernel.org Subject: Re: [PATCH 8/8] btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio Message-ID: <20220524073216.GB26145@lst.de> References: <20220522114754.173685-1-hch@lst.de> <20220522114754.173685-9-hch@lst.de> <20220522123108.GA23355@lst.de> <20220522125337.GB24032@lst.de> <8a6fb996-64c3-63b3-7f9c-aec78e83504e@gmx.com> <20220523062636.GA29750@lst.de> <84b022dc-6310-1d52-b8e3-33f915a4fee7@gmx.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <84b022dc-6310-1d52-b8e3-33f915a4fee7@gmx.com> User-Agent: Mutt/1.5.17 (2007-11-01) Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Mon, May 23, 2022 at 03:46:02PM +0800, Qu Wenruo wrote: >> Becasue btrfs_repair_io_failure can't handle multiple-page I/O. It >> is also is rather cumbersome because it bypassed the normal bio >> mapping. As a follow on I'd rather move it over to btrfs_map_bio >> with a special flag for the single mirror parity write rather than that >> hack. > > In fact so far for all callers of btrfs_repair_io_failure(), we are > always handling things inside one stripe. > > Thus we can easily enhance that function to handle multi page ranges. > > Although a dedicated btrfs_map_bio() flags seems more generic and better. I did think of moving btrfs_repair_io_failure over to my new infrastructure in fact, because it seems inherently possible. Just not the highest priority right now. >> Because the whole bio at this point is all the bad sectors. There >> is no point in writing only parts of the bio because that would leave >> corruption on disk. >> >>> The only reason I can think of is, we're still trying to do some >>> "optimization". >>> >>> But all our bio submission is already synchronous, I doubt such >>> "optimization" would make much difference. >> >> Can you explain what you mean here? > > We wait for the read bio anyway, I doubt the batched write part is that > important. I still don't understand the point. Once we read more than a single page, writing it back as a patch is completely trivial as shown by this series. Why would we not do it? > > If you really want, I can try to make the write part asynchronous, while > still keep the read part synchronous, and easier to read. Asynchronous writes gets us back into all the I/O completion handler complexities, which was the whole reason to start on the synchronous repair. > In your current version, the do {} while() loop iterates through all > mirrors. > > But for the following case, we will hit problems thanks to RAID1C3 again: > > Mirror 1 |X|X|X|X| > Mirror 2 |X| |X| | > Mirror 3 | |X| |X| > > We hit mirror 1 initially, thus @initial_mirror is 1. > > Then when we try mirror 2, since the first sector is still bad, we jump > to the next mirror. > > For mirror 3, we fixed the first sector only. Then 2nd sector is still > from mirror 3 and didn't pass. > Now we have no more mirrors, and still return -EIO. Can you share a test case? The code resets initial_mirror as soon as we made any progress so that should not happen. > So my points still stand, if we want to do batched handling, either we > go bitmap or we give up. Why? For the very common case of clustered corruption or entirely failing reads it is significantly faster than a simple synchronous read of each sector, and also much better than the existing code. It also is a lot less code than the existing code base, and (maybe I'm biassed) a lot more readable. Bitmaps only help you with randomly splattered corruption, which simply is not how SSDs or hard drives actually fail. > Such hacky bandage seems to work at first glance and will pass your new > test cases, but it doesn't do it any better than sector-by-sector waiting. > (Forgot to mention, the new RAID1C3 test case may also be flawed, as any > read on other mirrors will cause read-repair, screwing up our later > retry, thus we must check pid first before doing any read.) The updated version uses the read from mirror loop from btrfs/142 that cleverly used bash internals to not issue the read if it would be done using the wrong mirror. Which also really nicely speeds up the tests including the exist 140 and 141 ones.