From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0026FC77B7A for ; Wed, 17 May 2023 12:21:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230100AbjEQMV6 (ORCPT ); Wed, 17 May 2023 08:21:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37412 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230370AbjEQMV5 (ORCPT ); Wed, 17 May 2023 08:21:57 -0400 Received: from verein.lst.de (verein.lst.de [213.95.11.211]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B2F28193 for ; Wed, 17 May 2023 05:21:55 -0700 (PDT) Received: by verein.lst.de (Postfix, from userid 2407) id 6812768BEB; Wed, 17 May 2023 14:21:51 +0200 (CEST) Date: Wed, 17 May 2023 14:21:50 +0200 From: Christoph Hellwig To: Filipe Manana Cc: Christoph Hellwig , Chris Mason , Josef Bacik , David Sterba , linux-btrfs@vger.kernel.org Subject: Re: [PATCH 1/6] btrfs: use a linked list for tracking per-transaction/log dirty buffers Message-ID: <20230517122150.GA17334@lst.de> References: <20230515192256.29006-1-hch@lst.de> <20230515192256.29006-2-hch@lst.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.17 (2007-11-01) Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On Wed, May 17, 2023 at 11:40:14AM +0100, Filipe Manana wrote: > > This patch instead switches tracking to one linked list per transaction > > and two to each root for the two tree logs which link a new object that > > points directly to the buffer. Note that the list_head can't directly be > > embedded into the extent_buffer structure given that a buffer can be part > > of more than one transaction or tree_log. This also means the existing > > error propagation based off eb->log_index never fully worked, as this > > index would get overwritten once a buffer is added to a new dirty tree. > > If an extent buffer is part of 2 transactions, it means that when it > was allocated > for the next one, it was already cleaned up in the previous one, so > its ->log_index is > no longer used by the previous transaction and can be safely > overwritten by the next transaction. > > Or did you find a case where that is not true? At least with a previous version of this patch where the list_head was embedded into the extent_buffer structure I could very easily reproduce cases where one buffer was added to another list while still on another one. The most common case was a tree log and a transaction, but I think I've also seen two transactions or two tree logs. "Cleanup up" means written back and waited for writeback or dirty canceled I guess? Or is there some other aspect I should look for? > > @@ -202,7 +202,8 @@ struct btrfs_root { > > struct btrfs_root_item root_item; > > struct btrfs_key root_key; > > struct btrfs_fs_info *fs_info; > > - struct extent_io_tree dirty_log_pages; > > + struct list_head dirty_buffers[2]; > > As this is for the log tree, I'd prefer to have its name reflect that, > like we had before. > Something like "log_dirty_buffers" for example. Ok. Given that the btrfs_root structure isn't specific to log_trees that absolutely makes sense. > 1) With the io tree approach, if we allocate multiple extent buffers > that are adjacent, we get a single entry to represent them, due the > merging done by the io tree code. > With this new approach we don't have any merging at all, using more > memory and keeping a longer list, which will take longer to iterate > and sort. At least in theory yes. But we also save a whole lot of lookups by going directly to the object instead of indirecting through the pages xarray and then again through the buffers array for the sub-block case. > For example if a 1G metadata block group is allocated, and then we > allocate all metadata extents from it, we get 65536 struct > dirty_buffer allocated, while with the io tree approach we would get a > single struct extent_state record. But that will only get you to the filemap_fdtawrite and fdatawait. After that we're still looking up every single page in that range while with this series (the patch alone isn't enough, the rest of the work comes in later patches) the list gets us straight to the extent_buffer. > 2) We now need to keep references on the extent buffers. This means > they can't be released from memory until the transaction commits. > Before we didn't do this, and if an extent buffer was allocated and > freed in the same transaction, we wouldn't need to keep it in memory > until the transaction commits. > We would not need to do it as well if its writeback is started and > completed before the transaction commit. True. But at the same time it will allow to get rid of all the extent_buffer_under_io hacks. Note that if we figure out what causes buffers to be added to multiple transactions/tree_logs and just have a list in the object we could just deleted it when cleaned and have the best of both worlds. > 3) Looking a bit below, we now need to sort the list, which can be > huge, especially taking into account the fact that adjacent extents > are not merged anymore as mentioned before, potentially making anyone > who's waiting on the transaction commit to wait for longer. > On the other end, when a task allocates a new buffer the insertion is > faster, as it's just appending to a list and can reduce the latency of > many syscalls (creat, mkdir, rmdir, rename, link/unlink, reflinks, > etc) > > Not saying that in the end this approach isn't often or generally > better, but at the very least I would like to see all these > differences explicitly mentioned in the changelog. > The changelog gives the impression that there are no tradeoffs and the > new solution is better in every aspect. > > I would say more tests/benchmarks results would be good to have too, > other than just fs_mark, and have the tests mentioned in the changelog > (results before and after this change, command lines, fio configs for > example). Do you have any particular workload you think would be useful to test?