From mboxrd@z Thu Jan  1 00:00:00 1970
From: Filipe Manana
Date: Thu, 6 Oct 2022 20:56:03 +0100
Subject: Re: [PATCH 1/5] btrfs: 1G falloc extents
To: Boris Burkov
Cc: Qu Wenruo, linux-btrfs@vger.kernel.org, kernel-team@fb.com
X-Mailing-List: linux-btrfs@vger.kernel.org

On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov wrote:
>
> On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> > On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo wrote:
> > >
> > > On 2022/10/6 03:49, Boris Burkov wrote:
> > > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > > extents. Our data block groups are 1GiB, so a more natural maximum size
> > > > is 1GiB, so that we tend to allocate and fully use block groups rather
> > > > than fragmenting the file around.
> > > >
> > > > This is especially useful if large fallocates tend to be for "round"
> > > > amounts, which strikes me as a reasonable assumption.
> > > >
> > > > While moving to size classes reduces the value of this change, it is
> > > > also good to compare potential allocator algorithms against just 1G
> > > > extents.
> > >
> > > Btrfs extent booking is already causing a lot of wasted space, is this
> > > larger extent size really a good idea?
> > >
> > > E.g. after a lot of random writes, we may have only a very small part of
> > > the original 1G still being referenced.
> > > (The first write into the pre-allocated range will not be COWed, but the
> > > next one over the same range will be COWed)
> > >
> > > But the full 1G can only be freed if none of its sectors is referenced.
> > > Thus this would make preallocated space much harder to free, and
> > > snapshots/reflink can make it even worse.
> > >
> > > So wouldn't such an enlarged preallocated extent size cause more pressure?
> >
> > I agree, increasing the max extent size here does not seem like a good
> > thing to do.
> >
> > If an application fallocates space, then it generally expects to write to
> > all that space. However future random partial writes may not rewrite the
> > entire extent for a very long time, therefore making us keep a 1G extent
> > for a very long time (or forever in the worst case).
> >
> > Even for NOCOW files, it's still an issue if snapshots are used.
> >
>
> I see your point, and agree 1GiB is worse with respect to bookend
> extents. Since the patchset doesn't really rely on this, I don't mind
> dropping the change. I was mostly trying to rule this out as a trivial
> fix that would obviate the need for other changes.
>
> However, I'm not completely convinced by the argument, for two reasons.
>
> The first is essentially Qu's last comment. If you guys are right, then
> 256MiB is probably a really bad value for this as well, and we should be
> reaping the wins of making it smaller.

Well, that's not a reason to increase the size, quite the opposite.

> The second is that I'm not convinced of how the regression is going to
> happen here in practice.

Over the years, every now and then users report that their free space is
mysteriously going away and they have no clue why, even when not using
snapshots. More experienced users help out and eventually find it's caused
by many bookend extents, and the usual workaround is to defrag the files.
You can frequently see such reports in threads on this mailing list or in
the IRC channel.

> Let's say someone does a 2GiB falloc and writes the file out once. In the
> old code that will be 8 256MiB extents, in the new code, 2 1GiB extents.
> Then, to have this be a regression, the user would have to fully overwrite
> one of the 256MiB extents, but not a whole 1GiB extent. Are there a lot of
> workloads that don't use nocow, and which randomly overwrite all of a
> 256MiB extent of a larger file? Maybe...

Assuming that all or most workloads that use fallocate also set nocow on
the files is quite a big stretch IMO.

It's true that for performance critical applications like traditional
relational databases, people usually set nocow, or the application or some
other software does it automatically for them. But there are also many
applications that use fallocate that people are not aware of, where neither
they nor anything else sets nocow on the files. Especially if those
applications are not IO intensive, they may go unnoticed, and so space
wasted due to bookend extents is more likely to happen.

Having a bunch of very small extents, say 4K to 512K, is clearly bad
compared to having just a few 256M extents. But having a few 1G extents
instead of a group of 256M extents probably doesn't make as big a
difference as in the former case.
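To put rough numbers on the bookend concern, here is a minimal userspace
sketch (purely illustrative, not kernel code; the file name and the 4KiB
"leftover" figure are just assumptions for a mostly overwritten
preallocated extent):

/* bookend_waste.c - back-of-the-envelope numbers, illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define SZ_256M (256ULL * 1024 * 1024)
#define SZ_1G   (1024ULL * 1024 * 1024)

/*
 * A data extent can only be freed once no file references any of its
 * sectors, so if "referenced" bytes of a preallocated extent are still
 * live, the rest of the extent stays pinned on disk.
 */
static uint64_t pinned_waste(uint64_t extent_size, uint64_t referenced)
{
	return referenced ? extent_size - referenced : 0;
}

int main(void)
{
	uint64_t leftover = 4096; /* assume one 4KiB block still referenced */

	printf("256MiB extent, 4KiB left: %llu MiB wasted\n",
	       (unsigned long long)(pinned_waste(SZ_256M, leftover) >> 20));
	printf("1GiB extent,   4KiB left: %llu MiB wasted\n",
	       (unsigned long long)(pinned_waste(SZ_1G, leftover) >> 20));
	return 0;
}

With a single 4KiB block still referenced, that prints roughly 255 MiB of
pinned space for a 256MiB extent versus 1023 MiB for a 1GiB one: the same
failure mode either way, each instance is just up to four times more
expensive.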
Does the 1G extent size benefit some Facebook-specific workload, or some
well-known workload?

> Since I'm making the change, it's incumbent on me to prove it's safe, so
> with that in mind, I would reiterate I'm fine to drop it.
>
> > >
> > > In fact, the original 256M is already too large to me.
> > >
> > > Thanks,
> > > Qu
> > > >
> > > > Signed-off-by: Boris Burkov
> > > > ---
> > > >  fs/btrfs/inode.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > index 45ebef8d3ea8..fd66586ae2fc 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> > > >  	if (trans)
> > > >  		own_trans = false;
> > > >  	while (num_bytes > 0) {
> > > > -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > > > +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
> > > >  		cur_bytes = max(cur_bytes, min_size);
> > > >  		/*
> > > >  		 * If we are severely fragmented we could end up with really
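For completeness, the effect of the one-liner above can be eyeballed with a
tiny userspace sketch of the same loop shape (not the kernel code: SZ_256M
and SZ_1G are redefined locally and the min_size handling is omitted):

/* prealloc_chunks.c - userspace sketch of the chunking loop in the diff. */
#include <stdio.h>
#include <stdint.h>

#define SZ_256M (256ULL * 1024 * 1024)
#define SZ_1G   (1024ULL * 1024 * 1024)

/* Count how many extents a fallocate of num_bytes is split into when each
 * extent is capped at max_extent bytes. */
static unsigned int count_extents(uint64_t num_bytes, uint64_t max_extent)
{
	unsigned int extents = 0;

	while (num_bytes > 0) {
		uint64_t cur_bytes = num_bytes < max_extent ? num_bytes : max_extent;

		num_bytes -= cur_bytes;
		extents++;
	}
	return extents;
}

int main(void)
{
	uint64_t falloc_bytes = 2 * SZ_1G; /* the 2GiB example from the thread */

	printf("2GiB fallocate: %u extents with a 256MiB cap, %u with a 1GiB cap\n",
	       count_extents(falloc_bytes, SZ_256M),
	       count_extents(falloc_bytes, SZ_1G));
	return 0;
}

For the 2GiB fallocate discussed above it prints 8 extents with the 256MiB
cap and 2 with the 1GiB cap.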