Date: Mon, 18 Sep 2023 10:51:20 -0700
From: Luis Chamberlain
To: Dave Chinner
Cc: Matthew Wilcox, hch@infradead.org, djwong@kernel.org,
	dchinner@redhat.com, kbusch@kernel.org, sagi@grimberg.me,
	axboe@fb.com, brauner@kernel.org, hare@suse.de,
	ritesh.list@gmail.com, rgoldwyn@suse.com, jack@suse.cz,
	ziy@nvidia.com, ryan.roberts@arm.com, patches@lists.linux.dev,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-block@vger.kernel.org, p.raghav@samsung.com,
	da.gomez@samsung.com, dan.helmick@samsung.com
Subject: Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
References: <20230915213254.2724586-1-mcgrof@kernel.org>

On Mon, Sep 18, 2023 at 12:49:36PM +1000, Dave Chinner wrote:
> On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> > On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > > LBS devices. This in turn allows filesystems which support bs > 4k
> > > > > > to be enabled on a 4k PAGE_SIZE world on LBS block devices. This
> > > > > > allows LBS devices to take advantage of the recently posted work
> > > > > > to enable LBS support for filesystems [0].
> > > > >
> > > > > Why do we need LBS devices to support bs > ps in XFS?
> > > >
> > > > It's the other way round -- we need the support in the page cache to
> > > > reject sub-block-size folios (which is in the other patches) before we
> > > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > > >
> > > > Even XFS, or for that matter ext2, which support 16k block sizes on
> > > > CONFIG_PAGE_SIZE_16K (or 64K) kernels, need that support first.
> > >
> > > Well, yes, I know that. But the statement above implies that we
> > > can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> > > systems. If it's meant to mean the exact opposite, then it is
> > > extremely poorly worded....
> >
> > Let's ignore the above statement for a second just to clarify the
> > goal here. The patches posted by Pankaj enable bs > ps even on 4k
> > sector drives.
>
> Yes. They also enable XFS to support bs > ps on devices up to 32kB
> sector sizes, too. All the sector size does is define the minimum
> filesystem block size that can be supported by the filesystem on
> that device and apart from that we just don't care what the sector
> size on the underlying device is.

Ah yes, but on a system with 4k page size, my point is that you still
can't use a 32k sector size drive.
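The blocker is the block device cache itself: set_blocksize() refuses
anything above PAGE_SIZE, so a drive whose logical block size is 32k
can never get its block device set up with a matching block size on a
4k page size system. Quoting the checks in block/bdev.c from memory,
so the exact form may be slightly off:

int set_blocksize(struct block_device *bdev, int size)
{
        /* Size must be a power of two, and between 512 and PAGE_SIZE */
        if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
                return -EINVAL;

        /* Size cannot be smaller than the size supported by the device */
        if (size < bdev_logical_block_size(bdev))
                return -EINVAL;
        ...
}

With a 32k LBA format those two checks can never both pass on a 4k
PAGE_SIZE kernel, and that is the limit this series lifts.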
> > This patch series by definition is suggesting that an LBS device is one
> > where the minimum sector size is > ps, in practice I'm talking about
> > devices where the logical block size is > 4k, or where the physical
> > block size can be > 4k.
>
> Which XFS with bs > ps just doesn't care about. As long as the
> logical sector size is a power of 2 between 512 bytes and 32kB, it
> will just work.
>
> > There are two situations in the NVMe world where
> > this can happen. One is where the LBA format used is > 4k, the other is
> > where the npwg & awupf | nawupf is > 4k. The first makes the logical
>
> FWIW, I have no idea what these acronyms actually mean....

Sorry, I've described them here now:

https://kernelnewbies.org/KernelProjects/large-block-size#LBS_NVMe_support

What matters in practice is that it is a combination of two values
which today also allows the nvme driver to report a physical block
size larger than 4k.
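To make that concrete, this is roughly what the driver does today, as
I recall it from nvme_update_disk_info() in drivers/nvme/host/core.c,
simplified here so the details may be a bit off:

        /*
         * bs is the logical block size from the LBA format in use.
         * npwg and nawupf are 0's based counts of logical blocks, so
         * e.g. npwg = 7 with a 4k LBA format yields a 32k physical
         * block size.
         */
        atomic_bs = phys_bs = bs;
        if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf)
                atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
        if (id->nsfeat & NVME_NS_FEAT_IO_OPT)
                phys_bs = (1 + le16_to_cpu(id->npwg)) * bs;

        blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));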
> > block size > 4k, the latter allows the physical block size to be > 4k.
> > The first *needs* an aops which groks > ps on the block device cache.
> > The second lets you remain backward compatible with 4k sector size, but
> > if you want to use a larger sector size you can too, but that also
> > requires a block device cache which groks > ps. Using > ps for the
> > logical block size or the physical block size is what I am suggesting
> > we should call LBS devices.
>
> Sure. LBS means sector size > page size for the block device. That
> much has always been clear.
>
> But telling me that again doesn't explain what LBS support has to do
> with the filesystem implementation. mkfs.xfs doesn't require LBS
> support to make a new XFS filesystem with a 32kB sector size and
> 64kB filesystem block size. For the mounted filesystem that
> supports bs > ps, it also doesn't care that the device sector size
> is smaller than what mkfs told it to use. e.g. this is how we run 4kB
> sector size filesystem testing on 512 byte sector devices today....
>
> What I'm asking is why LBS support even mentions filesystems?

It's just that without this patchset you can mkfs a filesystem with
bs > ps, but you will only be able to mount it if the sector size
matches the page size. This patchset allows such filesystems to be
mounted and used where sector size > ps. The main issue there is just
the block device cache aops and the current size limitation on
set_blocksize().

> It's
> not necessary for filesystems to support bs > ps for LBS to be
> implemented, and it's not necessary for LBS to be supported for
> filesystems to implement bs > ps. Conflating them seems a recipe
> for confusion....

I see, so we should just limit the scope to enabling LBS devices on
the block device cache?

> > > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> >
> > But not for the block device cache.
>
> That's why I'm suggesting that you implement support for bufferheads
> through the existing iomap infrastructure instead of trying to
> dynamically switch aops structures....
>
> > > Hence I suspect that a better solution is to always use iomap and
> > > the same aops, but just switch from iomap page state to buffer heads
> > > in the bdev mapping
> >
> > Not sure what this means, any chance I can trouble you to clarify a
> > bit more?
>
> bdev_use_bufferheads(struct block_device *bdev, bool turn_on_bufferheads)
> {
>         /*
>          * Set the bufferhead bit atomically with the invalidation
>          * emptying the page cache to prevent repopulation races.
>          */
>         filemap_invalidate_lock(bdev->bd_inode->i_mapping);
>         invalidate_bdev(bdev);
>         if (turn_on_bufferheads)
>                 set_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
>         else
>                 clear_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
>         filemap_invalidate_unlock(bdev->bd_inode->i_mapping);
> }
>
> bdev_iomap_begin()
> {
>         .....
>         if (test_bit(BDEV_USE_BUFFERHEADS, &bdev->state))
>                 iomap->flags |= IOMAP_F_BUFFER_HEAD;
> }
>
> /*
>  * If an indexing switch happened while processing the iomap, make
>  * sure to get the iomap marked stale to force a new mapping to be
>  * looked up.
>  */
> bdev_iomap_valid()
> {
>         bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
>         bool use_bhs = test_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
>
>         return need_bhs == use_bhs;
> }

Will give this a shot, thanks!

  Luis