From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14387C433F5 for ; Thu, 12 May 2022 17:20:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1356760AbiELRUu (ORCPT ); Thu, 12 May 2022 13:20:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37432 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1356972AbiELRUs (ORCPT ); Thu, 12 May 2022 13:20:48 -0400 Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E4E6113D1E for ; Thu, 12 May 2022 10:20:46 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id A014D21B4C; Thu, 12 May 2022 17:20:45 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1652376045; h=from:from:reply-to:reply-to:date:date:message-id:message-id:to:to: cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=rx1v3luk2lpwgawsxEeKozeb8bwAW0/IhD1mAnVDYug=; b=yGu2odCwfPoFQk7nvWP5BnqwERrPcc7SJi+FOApbrJNnDaECS55/3iFsFoUdJmGVPvekmY bcKTVhCpdKvyl8Xn5tEPpGE5fFKz0IO9uhRlVcrgSNhxnx9y9lZwe/iKYSRjExHTvb2RCc X+bzI3tOPtKv4V4/VYq4ATaWl927ztE= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1652376045; h=from:from:reply-to:reply-to:date:date:message-id:message-id:to:to: cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=rx1v3luk2lpwgawsxEeKozeb8bwAW0/IhD1mAnVDYug=; b=VTM9yI2emUBjMsSAlacssPiUzv2/XL0XxrST/JZv+1iB33uwgNjphhmFcVDhLIZhpPoIpU OYOrrpkl7Q98vnCg== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 743EC13ABE; Thu, 12 May 2022 17:20:45 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id e1ByG+1BfWLSIgAAMHmgww (envelope-from ); Thu, 12 May 2022 17:20:45 +0000 Date: Thu, 12 May 2022 19:16:29 +0200 From: David Sterba To: Qu Wenruo Cc: Christoph Hellwig , Qu Wenruo , linux-btrfs@vger.kernel.org Subject: Re: [PATCH 05/13] btrfs: add btrfs_read_repair_ctrl to record corrupted sectors Message-ID: <20220512171629.GT18596@twin.jikos.cz> Reply-To: dsterba@suse.cz Mail-Followup-To: dsterba@suse.cz, Qu Wenruo , Christoph Hellwig , Qu Wenruo , linux-btrfs@vger.kernel.org References: <3f9b82f1bcc955fbb689a469e749cf1534857ea1.1651559986.git.wqu@suse.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23.1-rc1 (2014-03-12) Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Thu, May 05, 2022 at 06:40:31AM +0800, Qu Wenruo wrote: > On 2022/5/4 22:05, Christoph Hellwig wrote: > > On Wed, May 04, 2022 at 09:12:50AM +0800, Qu Wenruo wrote: > >>> This is a really bad idea. Now every read needs to pay the quite > >>> large price for corner case of a read repair. As I mentioned last time > >>> I think a mempool with a few entries that any read repair can dip into > >>> is a much better choice here. > >> > >> The problem is, can mempool provide a variable length entry? > > > > It can't. But you can allocate a few different bucket sizes as a > > starter or do a multi-level setup where you don't have a bitmap that > > operates on the entire bio. > > That sounds way more complex than either the original method (just fail > to repair) or the current pre-allocation method. > > We need to cover a pretty wide length variable, from the minimal 4K > (only need one bit), to much larger ones (observed bio over 16M, but > only for write). > > It would make sense if we have a hard limit on how large a read bio can > be (I have only observed 4M as the biggest bio size for read, and if > limited to 4M, we only need 128bytes for the bitmap and can go mempool). > > > In fact, the original one (just error out if failed to allocate memory) > is way more robust than you think. > > The point here is, if a large read bio failed, VFS will retry with much > smaller block size (normally just one page or one block), and if we even > failed to allocate memory for an u32, we're really screwed up. > > Do we really need to go down the rabbit hole of using mempool for > variable length situation? Especially we're not that eager to ensure the > memory allocation. I'm reluctant to use mempools in filesystem on any other layer than for bios and better just wrapped under the biosets. Anywhere else it creates potential for high-level deadlocks between various threads. In this particular case of repair it does not seem that risky, as the only user of the mempool is the repair submission and does only that, ie. no interaction with other threads. I think mempool in this case is a hammer too big, we could have a per-device preallocated chunk of memory for repair purposes and manage that by a lock and/or wait queue. There would be only a linear dependency: request (use) -> io -> wait -> endio (return. If the repair resource is not available any other request waits. Effectively what mempools do but much simpler.