Date: Tue, 15 Mar 2022 17:06:55 -0400
From: Zygo Blaxell
To: Phillip Susi
Cc: Qu Wenruo, Jan Ziak <0xe2.0x9a.0x9b@gmail.com>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit

On Tue, Mar 15, 2022 at 02:28:46PM -0400, Phillip Susi wrote:
>
> Zygo Blaxell writes:
>
> > btrfs extents are immutable, so the filesystem can't extend an existing
> > extent with new data.  Instead, a new extent must be created that contains
> > both the old and new data to replace the old extent.
> > At least one new
>
> Wait, what?  How is an extent immutable?  Why isn't a new tree written
> out with a larger extent and once the transaction commits, bam... you've
> enlarged your extent?  Just like modifying any other data.

If the extent is compressed, you have to write a new extent, because
there's no other way to atomically update a compressed extent.

If it's reflinked or snapshotted, you can't overwrite the data in place
as long as a second reference to the data exists.  This is what makes
nodatacow and prealloc slow--on every write, they have to check whether
the blocks being written are shared or not, and that check is expensive
because it's a linear search of every reference for overlapping block
ranges, and it can't exit the search early until it has proven there
are no shared references.  Contrast with datacow, which allocates a new
unshared extent that it knows it can write to, and only has to check
overwritten extents when they are completely overwritten (and only has
to check for the existence of one reference, not enumerate them all).

When a file refers to an extent, it refers to the entire extent from the
file's subvol tree, even if only a single byte of the extent is contained
in the file.  There's no mechanism in btrfs extent tree v1 for atomically
replacing an extent with separately referenceable objects, and updating
all the pointers to parts of the old object to point to the new one.
Any such update could cascade into updates across all reflinks and
snapshots of the extent, so the write multiplier can be arbitrarily large.

There is an extent tree v2 project which provides for splitting
uncompressed extents (compressed extents are always immutable) by storing
all the overlapping references as objects in the extent tree.  It does
reference tracking by creating an extent item for every referenced block
range, so changing one reference's position or length (e.g.
by overwriting or deleting part of an extent reference in a file) doesn't
affect any other reference.  In theory it could also append to the end
of an existing extent, if that case ever came up.

That brings us to the next problem:  mutable extents won't help with
the appending case without also teaching the allocator how to spread out
files all over the disk so there's physical space available at file EOF.
Normally in btrfs, if you write to 3 files, whatever you wrote is packed
into 3 physically contiguous and adjacent extents.  If you then want
to append to the first or second file, you'll need a new extent, because
there's no physical space between the files.

> And do you mean to say that before the new data can be written, the old
> data must first be read in and moved to the new extent?  That seems
> horridly inefficient.

Normally btrfs doesn't read anything when it writes.  New writes create
new extents for the new data, and delete only extents that are completely
replaced by the new extents.

A series of sequential small writes creates a lot of small extents,
and small extents are sometimes undesirable.  Defrag gathers these small
extents when they are logically adjacent, reads them into memory, writes
a new physically contiguous extent to replace them, then deletes the
old extents.  Autodefrag is a process that runs defrag in near real time
on extents that were written recently.

Defrag isn't the only way to resolve the small-extents issue.  If the
file is only read once (e.g. a log file that is rotated and compressed
with a high-performance compressor like xz) then defrag is a waste of
read/write cycles--it's better to leave the small fragments where they
are until they are deleted by an application.
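
To make the write and defrag behaviour described above concrete, here is
a toy Python model.  It is only a sketch: ToyFS, Extent, append and defrag
are made-up names, the allocator is a simple bump pointer, and none of it
reflects actual btrfs code or on-disk structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Toy immutable extent: physical start block and length in blocks."""
    phys: int
    length: int


class ToyFS:
    """Hypothetical model: every write is packed at the next free block."""

    def __init__(self):
        self.next_free = 0
        self.files = {}          # file name -> list of Extent, logical order
        self.blocks_written = 0
        self.blocks_read = 0

    def _alloc(self, length):
        # Bump allocation: new data always lands right after the last write.
        # Freed space is never reused in this toy model.
        ext = Extent(self.next_free, length)
        self.next_free += length
        self.blocks_written += length
        return ext

    def append(self, name, length):
        # An append reads nothing and leaves old extents untouched;
        # it only adds a new extent to the end of the file.
        self.files.setdefault(name, []).append(self._alloc(length))

    def defrag(self, name):
        # Read every small extent of the file, then write one contiguous
        # extent that replaces them all.
        total = sum(e.length for e in self.files[name])
        self.blocks_read += total
        self.files[name] = [self._alloc(total)]


def demo():
    fs = ToyFS()
    for _ in range(3):                 # three rounds of small appends
        for name in ("a", "b", "c"):   # interleaved across three files
            fs.append(name, 1)
    return fs
```

In this model, three rounds of one-block appends to a, b and c leave each
file with three single-block extents interleaved with the other files'
data, because there is never free space directly after a file's last
extent.  Calling defrag on one file then costs three blocks of reads and
three blocks of writes to produce a single contiguous extent, which only
pays off if the file is read often enough afterwards.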