From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 17 Sep 2023 19:04:24 -0700
From: Luis Chamberlain
To: Dave Chinner
Cc: Pankaj Raghav, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	p.raghav@samsung.com, da.gomez@samsung.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, willy@infradead.org, djwong@kernel.org,
	linux-mm@kvack.org, chandan.babu@oracle.com, gost.dev@samsung.com
Subject: Re: [RFC 00/23] Enable block size > page size in XFS
References: <20230915183848.1018717-1-kernel@pankajraghav.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > From: Pankaj Raghav
> >
> > There have been efforts over the last 16 years to enable Large
> > Block Sizes (LBS), that is, block sizes in filesystems where bs > page
> > size [1] [2]. Through these efforts we have learned that one of the
> > main blockers to supporting bs > ps in filesystems has been a way to
> > allocate pages that are at least the filesystem block size in the page
> > cache when bs > ps [3]. Another blocker was the changes filesystems
> > needed due to buffer-heads. Thanks to these previous efforts, the
> > surgery by Matthew Wilcox in the page cache to adopt xarray's
> > multi-index support, and iomap support, supporting bs > ps in XFS
> > becomes possible with only a few lines of change to XFS. Most of the
> > changes are to the page cache, to support a minimum folio order for
> > the target block size on the filesystem.
> >
> > A new motivation for LBS today is to support high-capacity (many
> > terabytes) QLC SSDs whose internal Indirection Unit (IU) is typically
> > greater than 4k [4], which helps reduce DRAM and in turn cost and
> > space. In practice this allows different architectures to use a base
> > page size of 4k while still supporting block sizes aligned to the
> > larger IUs, by relying on high-order folios in the page cache when
> > needed. It also makes it possible to take advantage of these same
> > drives' support for atomics larger than 4k with buffered IO support
> > in Linux.
> > As described this year at LSFMM, support for atomics greater than
> > 4k enables databases to remove the need to rely on their own
> > journaling, so they can disable double buffered writes [5], a
> > feature that different cloud providers are already innovating on and
> > enabling for customers through custom storage solutions.
> >
> > This series still needs some polishing and fixes for some crashes,
> > but it is mainly targeted at getting initial feedback from the
> > community and enabling initial experimentation, hence the RFC. It is
> > being posted now because the results from our testing are proving
> > much better than expected, and we hope to polish this up together
> > with the community. After all, this has been a 16-year effort and
> > none of this would have been possible without it.
> >
> > Implementation:
> >
> > This series only adds the notion of a minimum folio order in the
> > page cache, as initially proposed by Willy. The minimum folio order
> > requirement is set during inode creation and will typically
> > correspond to the filesystem block size. The page cache will in
> > turn respect the minimum folio order requirement when allocating a
> > folio. This series mainly changes the page cache's filemap,
> > readahead, and truncation code to allocate and align folios to the
> > minimum order set for the filesystem inode's address space mapping.
> >
> > Only XFS was enabled and tested as part of this series, as it has
> > supported block sizes up to 64k and sector sizes up to 32k for
> > years. The only thing missing was the page cache magic to enable
> > bs > ps. However, any filesystem that does not depend on
> > buffer-heads and already supports larger block sizes should be able
> > to leverage this effort to also support LBS, bs > ps.
> >
> > This also paves the way for supporting block devices whose logical
> > block size > page size in the future, by leveraging the iomap
> > address space operations added to the block device cache by
> > Christoph Hellwig [6]. We have work to enable support for this,
> > enabling LBAs > 4k on NVMe, while at the same time allowing
> > coexistence with buffer-heads on the same block device, so that a
> > drive can switch between filesystems which may depend on
> > buffer-heads or need the iomap address space operations for the
> > block device cache. Patches for this will be posted shortly after
> > this patch series.
>
> Do you have a git tree branch that I can pull this from somewhere?
>
> As it is, I'd really prefer stuff that adds significant XFS
> functionality that we need to test to be based on a current Linus
> TOT kernel so that we can test it without being impacted by all
> the random unrelated breakages that regularly happen in linux-next
> kernels....

That's understandable! I just rebased onto Linus' tree; this only has
the bs > ps support on a 4k sector size:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev

I just did a cursory build / boot / fsx test with a 16k block size /
4k sector size on this tree only. I haven't run fstests on it. Just a
heads up: using a 512-byte sector size will fail for now, it's a
regression we have to fix. Likewise, block sizes of 1k and 2k will
also regress on fsx right now. These are regressions we are aware of
but haven't had time yet to bisect / fix.

Luis
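For anyone wanting to reproduce the cursory fsx smoke test mentioned above, something along these lines should do it on a kernel built from that branch. This is a sketch, not the exact commands used: the scratch device and mount point are placeholders, and it assumes xfsprogs plus the fsx exerciser (from xfstests/LTP) are installed.

```shell
DEV=/dev/vdb          # scratch block device (placeholder)
MNT=/mnt/lbs-test     # mount point (placeholder)

# 16k filesystem blocks on a 4k sector size, per the tested configuration
mkfs.xfs -f -b size=16k -s size=4k "$DEV"
mkdir -p "$MNT"
mount "$DEV" "$MNT"

# Short fsx run against a file on the LBS filesystem
fsx -N 10000 "$MNT/fsx-file"

umount "$MNT"
```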