Subject: Re: Questions about BTRFS balance and scrub on non-RAID setup
To: Lionel Bouton, linux-btrfs@vger.kernel.org
From: Andrej Friesen
Date: Tue, 31 Aug 2021 10:17:07 +0200
In-Reply-To: <04941c75-3ea5-32de-5978-efe5c5681ee2@bouton.name>

Hi,

thanks for the useful information Lionel. That already helped a lot!

Scrub:

> Partially. Ceph replication/scrub/repair will cover individual disk/OSD
> server faults but not faults at the origin of the data being stored.
>
> We provide the same service for a customer. Several years ago the VM
> hosting the NFS server for this customer ran on hardware that developed
> a fault, the result was silent corruption of the data written by the NFS
> server *before* being handed to Ceph for storage (probably memory or CPU
> related, we threw the server out of the cluster and never looked back...).
> - ceph scrubbing was of no use there because from its point of view the
> replicated blocks were all fine.
> - we launch btrfs scrub monthly by default and this is how we detected
> the corruption.

This is a really good point!
Even though we might not be able to automatically let btrfs repair the corrupted files during the scrub it would be nice to know that this happened and act accordingly.
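
For illustration, a minimal sketch of a scheduled scrub check in the spirit of the monthly scrub described in the quote. It assumes only that a foreground `btrfs scrub start -B` exits non-zero when the scrub failed or found errors; the exact exit codes should be checked in btrfs-scrub(8) for the installed btrfs-progs. The function names and the mount point in the usage note are hypothetical.

```python
import subprocess
from typing import List


def scrub_command(mountpoint: str) -> List[str]:
    """Build a foreground scrub command; -B waits until the scrub is done."""
    return ["btrfs", "scrub", "start", "-B", mountpoint]


def scrub_needs_attention(returncode: int) -> bool:
    """Assumption: any non-zero exit status means a failure or found errors."""
    return returncode != 0


def run_scrub(mountpoint: str) -> bool:
    """Run the scrub (needs root) and report whether someone should look at it."""
    result = subprocess.run(scrub_command(mountpoint), check=False)
    return scrub_needs_attention(result.returncode)
```

A monthly cron job or systemd timer could call `run_scrub("/mnt/data")` and send an alert when it returns `True`; details of what was found would still have to be read from `btrfs scrub status` and the kernel log.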

> We make regular rbd snapshots so we could :
> - switch the NFS server to an existing read-only replica (that could not
> be corrupted by the same fault as it was replicated using simple
> file-level content synchronization),
> - restart the original NFS server using the last known good snapshot,
> - rsync fresh data from the replica to the original server to catch up,
> - switch back.

We also wanted to take some rbd snapshots to have some kind of disaster recovery if something happens. Just in case.

Our idea was also to offer quick file-based "backups" with btrfs snapshots. This would help if a file was created correctly at first and later writes to that file got corrupted because of hardware failures.

But to guard against filesystem corruption we also wanted to keep some rbd snapshots, you never know.

Balance:

> Full balance is probably overkill in any situation and can sunk your I/O
> bandwidth. With recent kernels it seems there is less need for
> balancing. We still use an automatic balancing script that tries to
> limit the amount of free space allocated to nearly empty allocation
> groups (by using "usage=50+" filters) and cancels the balance if it is
> too long (to avoid limiting IO performance for too long, waiting for a
> next call to continue) but I'm not sure if it's still worth it. In our
> case we have been bitten by out of space situations with old kernels
> brought by over-allocation of free space due to temporary large space
> usages so we consider it an additional safeguard.

In order to solve the file system full "problem" we wanted to create a large block device and use a quota of, let's say, 80 % of that for the data subvolume.

We could also make the block device double the size of the subvolume and the quota we offer. Because it is thin provisioned on the ceph side, we do not lose any storage.
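
As an illustration of the filtered, time-limited balance described in the quote above, here is a rough sketch. It is not Lionel's script: the function names, the 50 % thresholds and the deadline handling are assumptions. It relies only on `btrfs balance start` running in the foreground with `-dusage`/`-musage` filters and on `btrfs balance cancel` stopping a running balance.

```python
import subprocess
from typing import List


def balance_command(mountpoint: str, data_usage: int = 50,
                    metadata_usage: int = 50) -> List[str]:
    """Balance only chunks that are at most the given percentage full."""
    return ["btrfs", "balance", "start",
            f"-dusage={data_usage}", f"-musage={metadata_usage}", mountpoint]


def cancel_command(mountpoint: str) -> List[str]:
    """Command that asks the kernel to stop a running balance."""
    return ["btrfs", "balance", "cancel", mountpoint]


def run_balance_with_deadline(mountpoint: str, max_seconds: float) -> bool:
    """Return True if the balance exited before the deadline (whatever its
    exit status), False if it had to be cancelled. Needs root."""
    proc = subprocess.Popen(balance_command(mountpoint))
    try:
        proc.wait(timeout=max_seconds)
        return True
    except subprocess.TimeoutExpired:
        # Too slow: cancel and let a later run pick up the remaining chunks.
        subprocess.run(cancel_command(mountpoint), check=False)
        proc.wait()
        return False
```

Calling `run_balance_with_deadline("/mnt/data", 600)` from a timer would cap each run at ten minutes; the mount point and the limit are placeholders.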

We have tested discard/trim with btrfs and ceph and everything worked fine :-)

Is there any metric we could/should measure in order to see if a balance would give us some benefit?

Did you only do the balance because of the file system full problem?

I saw a recommendation to run this balance daily:

`btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`

Source:
https://github.com/netdata/netdata/issues/3203#issuecomment-356026930

Is that still a valid recommendation today?
If so, why does the FAQ not have this information?

I am happy to put something in the wiki, if needed.

Defragmentation:

> You probably want to use autodefrag or a custom defragmentation solution
> too. We weren't satisfied with autodefrag in some situations (were
> clearly fragmentation crept in and IO performance suffered until a
> manual defrag) and developed our own scheduler for triggering
> defragmentation based on file writes and slow full filesystem scans,

The ceph cluster only uses SSDs, so I guess we do not suffer from fragmentation problems the way we would with HDDs. At least as far as I understand SSDs.

-- 
Andrej Friesen
https://www.ajfriesen.com/