Date: Fri, 11 Mar 2022 22:48:12 -0500
From: Zygo Blaxell
To: Qu Wenruo
Cc: Jan Ziak <0xe2.0x9a.0x9b@gmail.com>, linux-btrfs@vger.kernel.org
Subject: Re: Btrfs autodefrag wrote 5TB in one day to a 0.5TB SSD without a measurable benefit
References: <7fc9f5b4-ddb6-bd3b-bb02-2bd4af703e3b@gmx.com> <078f9f05-3f8f-eef1-8b0b-7d2a26bf1f97@gmx.com> <59c57200-9c77-3b8a-ab9d-11aef96da852@gmx.com>
In-Reply-To: <59c57200-9c77-3b8a-ab9d-11aef96da852@gmx.com>
X-Mailing-List: linux-btrfs@vger.kernel.org

On Sat, Mar 12, 2022 at 11:24:18AM +0800, Qu Wenruo wrote:
>
>
> On 2022/3/12 10:43, Zygo Blaxell wrote:
> > On Sat, Mar 12, 2022 at 12:28:10AM +0100, Jan Ziak wrote:
> > > On Sat, Mar 12, 2022 at 12:04 AM Qu Wenruo wrote:
> > > > As stated before, autodefrag is not really that useful for database.
> > > >
> > > Do you realize that you are claiming that btrfs autodefrag should not
> > > - by design - be effective in the case of high-fragmentation files? If
> > > it isn't supposed to be useful for high-fragmentation files then where
> > > is it supposed to be useful? Low-fragmentation files?
> >
> > IMHO it's best to deprecate the in-kernel autodefrag option, and start
> > over with a better approach. The kernel is the wrong place to solve
> > this problem, and the undesirable and unfixable things in autodefrag
> > are a consequence of that early design error.
>
> I'm having the same feeling exactly.
>
> Especially since the current autodefrag imposes its own policy (transid
> filter) without providing a mechanism that user space can use.
>
> Exactly the opposite of what we should do: provide a mechanism, not a
> policy.
>
> Not to mention there are quite a few limitations in the current policy.
>
>
> But unfortunately, even if we deprecate it right now, it will take a
> long time to really remove it from the kernel.

Agree that we have to keep it around until everyone has moved over to
the new thing; however, we can stop developing the old thing much sooner,
and work on the new thing immediately.

> While on the other hand, we also need to introduce new parameters like
> @newer_than, and @max_to_defrag to the ioctl interface.
>
> Which may already eat up the unused bytes (only 16 bytes, while
> newer_than needs u64, max_to_defrag may also need to be u64).
>
> And the user space tool lacks one critical piece of info: where the
> small writes are.

Userspace can find new extents pretty fast if it's keeping up with writes
in real time. bees scans do a search for all new extents in the last
30 seconds (not just the small ones) and finish in tenths of milliseconds
with a hot cache.
This is orders of magnitude faster than the actual defragmentation,
which has to do all the data IO twice, copy all the modified metadata
pages, delayed extent refs, and pay the seek costs for re-reading the
fragmented data and writing it somewhere else.

The kernel could maintain the list of autodefrag inodes and simply
provide them to userspace on demand, but honestly I don't think this
list is worth even the tiny amount of memory that it uses.

> So even though I couldn't be happier to deprecate autodefrag, we still
> need to hang on to it for a pretty long time, until a user space tool
> can do everything autodefrag does.
>
> Thanks,
> Qu
>
> >
> > As far as I can tell, in-kernel autodefrag's only purpose is to provide
> > exposure to new and exciting bugs on each kernel release, and a lot of
> > uncontrolled IO demands even when it's working perfectly. Inevitably,
> > re-reading old fragments that are no longer in memory will consume RAM
> > and iops during writeback activity, when memory and IO bandwidth are least
> > available. If we avoid expensive re-reading of extents, then we don't
> > get a useful rate of reduction of fragmentation, because we can't coalesce
> > small new extents with small existing ones. If we try to fix these issues
> > one at a time, the feature would inevitably grow a lot of complicated
> > and brittle configuration knobs to turn it off selectively, because it's
> > so awful without extensive filtering.
> >
> > All the above criticism applies to abstract ideal in-kernel autodefrag,
> > _before_ considering whether a concrete implementation might have
> > limitations or bugs which make it worse than the already-bad best case.
> > 5.16 happened to have a lot of examples of these, but fixing the
> > regressions can only restore autodefrag's relative harmlessness, not
> > add utility within the constraints the kernel is under.
> >
> > The right place to do autodefrag is userspace. Interfaces already
> > exist for userspace to 1) discover new extents and their neighbors,
> > quickly and safely, across the entire filesystem; 2) invoke defrag_range
> > on file extent ranges found in step 1; and 3) run a while (true)
> > loop that periodically performs steps 1 and 2. Indeed, the existing
> > kernel autodefrag implementation is already using the same back-end
> > infrastructure for parts 1 and 2, so all that would be required for
> > userspace is to reimplement (and start improving upon) part 3.
> >
> > A command-line utility or daemon can locate new extents immediately with
> > tree_search queries, either at filesystem-wide scales, or directed at
> > user-chosen file subsets. Tools can quickly assess whether new extents
> > are good candidates for defrag, then coalesce them with their neighbors.
> >
> > The user can choose between different tools to decide basic policy
> > questions like: whether to run once in a batch job or continuously in
> > the background, what amounts of IO bandwidth and memory to consume,
> > whether to recompress data with a more aggressive algorithm/level, which
> > reference to a snapshot-shared extent should be preferred for defrag,
> > file-type-specific layout optimizations to apply, or any custom or
> > experimental selection, scheduling, or optimization logic desired.
> >
> > Implementations can be kept simple because it's not necessary for
> > userspace tools to pile every possible option into a single implementation,
> > and support every released option forever (as required for the kernel).
> > A specialist implementation can discard existing code with impunity or
> > start from scratch with an experimental algorithm, and spend its life
> > in a fork of the main userspace autodefrag project with niche users
> > who never have to cope with generic users' use cases and vice versa.
> > This efficiently distributes development and maintenance costs.
> >
> > Userspace autodefrag can be implemented today in any programming language
> > with btrfs ioctl support, and run on any kernel released in the last
> > 6 years. Alas, I don't know of anybody who's released a userspace
> > autodefrag tool yet, and it hasn't been important enough to me to build
> > one myself (other than a few proof-of-concept prototypes).
> >
> > For now, I do defrag mostly ad-hoc with 'btrfs fi defrag' on the most
> > severely fragmented files (top N list of files with the highest extent
> > counts on the filesystem), and ignore fragmentation everywhere else.
> >
> >
> > > -Jan
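
The three steps described in the quoted text (discover new extents,
call defrag_range on them, repeat) could be sketched in userspace
roughly as follows. This is only a minimal Python sketch under stated
assumptions, not anyone's actual tool. plan_defrag_ranges,
pack_defrag_args, defrag_range and the 256 KiB small_extent cutoff are
hypothetical names and values chosen for illustration. The extent list
is assumed to come from a tree_search scan that is not shown here, and
the ioctl number assumes the generic _IOW encoding used on x86 and arm.
The struct layout follows struct btrfs_ioctl_defrag_range_args in
linux/btrfs.h, which is also where the 16 unused bytes come from.

```python
import fcntl
import struct

# struct btrfs_ioctl_defrag_range_args (linux/btrfs.h):
#   u64 start, u64 len, u64 flags, u32 extent_thresh, u32 compress_type,
#   u32 unused[4]   <- the 16 spare bytes discussed in the thread
DEFRAG_RANGE_ARGS = struct.Struct("=QQQII4I")

# _IOW(BTRFS_IOCTL_MAGIC, 16, struct btrfs_ioctl_defrag_range_args)
# Assumption: generic ioctl encoding (dir << 30 | size << 16 | type << 8 | nr).
BTRFS_IOCTL_MAGIC = 0x94
_IOC_WRITE = 1
BTRFS_IOC_DEFRAG_RANGE = (
    (_IOC_WRITE << 30)
    | (DEFRAG_RANGE_ARGS.size << 16)
    | (BTRFS_IOCTL_MAGIC << 8)
    | 16
)


def plan_defrag_ranges(extents, small_extent=256 * 1024):
    """Hypothetical policy: pick runs of two or more contiguous small extents.

    extents is an iterable of (file_offset, length) pairs for one file,
    assumed to come from a new-extent scan (step 1, not shown).
    Returns a list of (start, length) ranges worth coalescing.
    """
    ranges = []
    run = []  # current run of contiguous small extents
    for offset, length in sorted(extents):
        contiguous = bool(run) and run[-1][0] + run[-1][1] == offset
        if length < small_extent and (not run or contiguous):
            run.append((offset, length))
            continue
        # The run ended: keep it only if there is something to coalesce.
        if len(run) >= 2:
            ranges.append((run[0][0], run[-1][0] + run[-1][1] - run[0][0]))
        run = [(offset, length)] if length < small_extent else []
    if len(run) >= 2:
        ranges.append((run[0][0], run[-1][0] + run[-1][1] - run[0][0]))
    return ranges


def pack_defrag_args(start, length, extent_thresh=0, flags=0, compress_type=0):
    """Pack the ioctl argument; the four unused words are left zeroed."""
    return DEFRAG_RANGE_ARGS.pack(
        start, length, flags, extent_thresh, compress_type, 0, 0, 0, 0
    )


def defrag_range(fd, start, length, extent_thresh=0):
    """Step 2: ask the kernel to defragment one byte range of an open file."""
    fcntl.ioctl(fd, BTRFS_IOC_DEFRAG_RANGE,
                pack_defrag_args(start, length, extent_thresh))
```

Step 3 would then be a loop that rescans, calls plan_defrag_ranges on
each file's new extents, and passes every returned range to
defrag_range on a file descriptor opened for writing. Keeping the
policy function separate from the ioctl wrapper is what lets different
tools swap in their own selection logic.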