Date: Thu, 22 Dec 2022 09:03:22 -0500
From: Brian Foster
To: Kent Overstreet
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH RFC] bcachefs: use inode as write point index instead of task
References: <20221212190602.1388127-1-bfoster@redhat.com> <20221213183743.3m6ntfnu7n3yebng@moria.home.lan>

On Mon, Dec 19, 2022 at 08:02:16PM -0500, Kent Overstreet wrote:
> On Mon, Dec 19, 2022 at 10:27:23AM -0500, Brian Foster wrote:
> > A couple of the more common optimizations XFS uses are speculative
> > preallocation and extent size hints. The former is designed to mitigate
> > fragmentation, particularly in the case of concurrent sustained writes.
> > Basically it will selectively increase the size of appending delalloc
> > reservations beyond current eof in anticipation of further writes. In
> > the meantime, writeback will attempt to allocate maximal sized physical
> > extents for contiguous delalloc ranges. Finally, the allocation itself
> > will start off with a simple hint based on the physical location of the
> > inode. This helps ensure extents are eventually maximally sized whenever
> > sufficient contiguous free extents are available and similarly ensures
> > as related inodes are removed, contiguous extents are freed together.
> > Excess/unused prealloc blocks are eventually reclaimed in the background
> > or as needed.
> >
> > Extent size hints are more for random write/allocation scenarios and
> > must be set by the user. For example, consider a sparse vdisk image
> > seeing random small writes all over the place.
> > If we allocate single blocks at a time, fragmentation and the extent
> > count can eventually explode out of control. An extent size hint of 1MB
> > or so ensures every new allocation is sized/aligned as such and so helps
> > mitigate that sort of problem as more of the file is allocated.
> >
> > Of course XFS is fundamentally different in that it's not a COW fs, so
> > might have different concerns. It supports reflinks, but that's a
> > relatively recent feature compared to the allocation heuristics and not
> > something they were designed around or significantly updated for (since
> > COW is not default behavior, although I believe an always_cow mode does
> > exist).
>
> *nod* Yeah, I've been wondering how much this stuff makes sense in the context
> of a COW filesystem.
>
> But we do have nocow mode, complete with unwritten extents. If we need to go the
> delalloc route, I think the existing allocator design should be able to support
> that (we can pin space purely as an in memory operation, but we do have a fixed
> number of those so we have to be careful about introducing deadlocks).
>
> It sounds like the optimizations XFS is doing are trying to ensure that writes
> remain contiguous on disk even when buffered writeback isn't batching them up as
> much as we'd like? Is that something we still feel is important?
> Pagecache/system memory size keeps going up but seek times do not (and go down
> in the case of flash); it's not clear to me that this is still important today.
>

That's a good question and I don't really know the answer. I suspect
there is more to it than the fundamental principles of hardware and
related improvements. In practice these sorts of things still improve fs
efficiency, performance, scalability, aging (perhaps under less than
ideal hardware/workload/resource conditions), etc. Given the relatively
low cost (in terms of complexity) of the implementation, they certainly
aren't things I see going away from filesystems like XFS any time soon.
The underlying concepts may just not be as generically relevant (i.e.
useful across different fs implementations) as perhaps they might have
been in the past.

When you think about it, it is kind of amusing to see things like the fs
attempt to create as large/contiguous mappings as possible, only for
writeback to subsequently have to explicitly break them up into smaller
I/O requests because otherwise the massive amount of in-core metadata
status updates that result (i.e. clearing per-page writeback state)
leads to excessive completion latency. ;) OTOH, if that eventually leads
to more use of things like large folios, then perhaps that's an overall
win.

Anyways, I just bring these things up here for reference and discussion
purposes..

> > Ok. Based on the above, it kind of sounds like a worst case scenario
> > might be something like N files allocated by the same task in such a way
> > that each bucket ends up split between the N files, and then some number
> > of files end up removed. Rinse and repeat that sort of thing across new
> > sets of files and then presumably we'd have an increasing amount of free
> > space in partially used buckets that cannot be allocated..?
>
> Yep, that would do it.
>

Ok.

> >
> > Is copygc responsible for cleaning things up in such a case in order to
> > create more usable free space (hence the excessive copygc comment
> > below)?
>
> Correct. Copygc finds buckets that are mostly but not completely empty and
> evacuates them - writes the data in them to new buckets.
>
> Copygc doesn't do any file-level defragmentation, but now that we have
> backpointers it could.
>

Cool.

> > Hmm.. Ok, that gives me another area to look into re: copygc. ;) Thanks
> > for all of the feedback and context..
>
> Feel free to hit me up on IRC as you're looking at code. I'm also currently
> working on the copygc code - we have a persistent fragmentation index about to
> land, which will be a drastic improvement to copygc scalability.
> Not relevant to what you're looking at, but the code is at least fresh in
> my mind :)
>

Thanks. Appreciate the feedback here and in the other subthread. I'm
currently mostly trying to grok core concepts and map them to areas of
code and such, but will undoubtedly have more questions once I get more
into details..

Brian
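
For reference, the worst case discussed above (one task's writes
interleaving N files into shared buckets, followed by removal of some of
those files) can be sketched with a toy model. This is only an
illustration and not bcachefs code: the bucket size, the file counts,
the 25% "mostly empty" threshold and every function name below are
assumptions made up for the example. The per_file_fill() variant is
likewise only a guess at the simplest possible effect of keying the
write point on the inode rather than the task.

```python
# Toy model only. Bucket size, file counts, the copygc threshold and all
# names here are hypothetical; none of this is bcachefs code.

BUCKET_SECTORS = 8      # assumed bucket size, in sectors
N_FILES = 4             # files written at the same time by one task
SECTORS_PER_FILE = 16   # size of each file, in sectors


def shared_fill(n_files, sectors_per_file, bucket_sectors):
    """One write point shared by every file: sectors from the files are
    interleaved round-robin into the same run of buckets."""
    buckets = [[]]
    for _ in range(sectors_per_file):
        for f in range(n_files):
            if len(buckets[-1]) == bucket_sectors:
                buckets.append([])
            buckets[-1].append(f)
    return buckets


def per_file_fill(n_files, sectors_per_file, bucket_sectors):
    """One write point per file: each file fills its own buckets."""
    buckets = []
    for f in range(n_files):
        for start in range(0, sectors_per_file, bucket_sectors):
            count = min(bucket_sectors, sectors_per_file - start)
            buckets.append([f] * count)
    return buckets


def delete_files(buckets, doomed):
    """Mark every sector owned by a removed file as dead (None)."""
    return [[None if f in doomed else f for f in b] for b in buckets]


def summarize(buckets, max_live_fraction=0.25):
    """Return (fully_free, copygc_candidates) bucket counts. A candidate
    still holds some live data, so it cannot be reused as-is, but it is
    empty enough that evacuating it would be worthwhile."""
    fully_free = 0
    candidates = 0
    for b in buckets:
        live = sum(f is not None for f in b)
        if live == 0:
            fully_free += 1
        elif live <= max_live_fraction * len(b):
            candidates += 1
    return fully_free, candidates


doomed = {0, 1, 2}  # remove three of the four files
shared = summarize(delete_files(
    shared_fill(N_FILES, SECTORS_PER_FILE, BUCKET_SECTORS), doomed))
split = summarize(delete_files(
    per_file_fill(N_FILES, SECTORS_PER_FILE, BUCKET_SECTORS), doomed))

print("shared write point:   fully_free=%d copygc_candidates=%d" % shared)
print("per-file write point: fully_free=%d copygc_candidates=%d" % split)
```

With these made-up numbers, the shared case leaves all 8 buckets a
quarter live, so nothing is reusable until copygc evacuates them, while
the per-file case frees 6 buckets outright and leaves nothing for copygc
to do. Real behavior depends on write sizes, bucket sizes and timing
that this model ignores.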