From: Luis Chamberlain <mcgrof@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
hch@infradead.org, djwong@kernel.org, dchinner@redhat.com,
kbusch@kernel.org, sagi@grimberg.me, axboe@fb.com,
brauner@kernel.org, hare@suse.de, ritesh.list@gmail.com,
rgoldwyn@suse.com, jack@suse.cz, ziy@nvidia.com,
ryan.roberts@arm.com, patches@lists.linux.dev,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org, p.raghav@samsung.com,
da.gomez@samsung.com, dan.helmick@samsung.com
Subject: Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
Date: Mon, 18 Sep 2023 10:51:20 -0700
Message-ID: <ZQiOGA18+fpFd6ly@bombadil.infradead.org>
In-Reply-To: <ZQe6wBxRFE7yY3eQ@dread.disaster.area>
On Mon, Sep 18, 2023 at 12:49:36PM +1000, Dave Chinner wrote:
> On Sun, Sep 17, 2023 at 06:13:40PM -0700, Luis Chamberlain wrote:
> > On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> > > On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > > > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This then allows LBS
> > > > > > devices to take advantage of the recently posted work to enable
> > > > > > LBS support for filesystems [0].
> > > > >
> > > > > Why do we need LBS devices to support bs > ps in XFS?
> > > >
> > > > It's the other way round -- we need the support in the page cache to
> > > > reject sub-block-size folios (which is in the other patches) before we
> > > > can sensibly talk about enabling any filesystems on top of LBS devices.
> > > >
> > > > Even XFS, or for that matter ext2, which support 16k block sizes on
> > > > CONFIG_PAGE_SIZE_16K (or 64K) kernels, need that support first.
> > >
> > > Well, yes, I know that. But the statement above implies that we
> > > can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> > > systems. If it's meant to mean the exact opposite, then it is
> > > extremely poorly worded....
> >
> > Let's ignore the above statement for a second just to clarify the goal here.
> > The patches posted by Pankaj enable bs > ps even on 4k sector drives.
>
> Yes. They also enable XFS to support bs > ps on devices with sector
> sizes up to 32kB, too. All the sector size does is define the minimum
> filesystem block size that can be supported by the filesystem on
> that device and apart from that we just don't care what the sector
> size on the underlying device is.
Ah yes, but my point is that on a system with a 4k page size you still
can't use a 32k sector size drive.
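To make that concrete: the bdev page cache would have to guarantee that
every folio in the mapping is at least as large as the logical block,
i.e. a minimum folio order. Roughly something like the sketch below,
assuming a mapping_set_folio_min_order()-style helper from the bs > ps
series; the helper name and where it gets called are illustrative, not
from the series itself:

/*
 * Sketch only: make the block device's page cache refuse folios
 * smaller than the logical block size.  mapping_set_folio_min_order()
 * is assumed here from the bs > ps work; names may differ.
 */
static void bdev_cache_set_min_order(struct block_device *bdev)
{
        unsigned int lbs = bdev_logical_block_size(bdev);

        /* e.g. 32k logical blocks on a 4k PAGE_SIZE kernel -> order 3 folios */
        mapping_set_folio_min_order(bdev->bd_inode->i_mapping, get_order(lbs));
}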
> > This patch series by definition is suggesting that an LBS device is one
> > where the minimum sector size is > ps. In practice I'm talking about
> > devices where the logical block size is > 4k, or where the physical
> > block size can be > 4k.
>
> Which XFS with bs > ps just doesn't care about. As long as the
> logical sector size is a power of 2 between 512 bytes and 32kB, it
> will just work.
>
> > There are two situations in the NVMe world where
> > this can happen. One is where the LBA format used is > 4k, the other is
> > where the npwg & awupf | nawupf is > 4k. The first makes the logical
>
> FWIW, I have no idea what these acronyms actually mean....
Sorry, I've described them here now:
https://kernelnewbies.org/KernelProjects/large-block-size#LBS_NVMe_support
What matters though is that, in practice, it is a combination of two
values which today also allows the nvme driver to report a physical
block size higher than 4k.
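To spell that out, this is roughly how those fields feed into the
physical block size the driver reports; a simplified sketch along the
lines of what nvme_update_disk_info() does, not the exact mainline code:

/*
 * Simplified sketch, not verbatim driver code: derive a physical block
 * size larger than the logical block size from NPWG and (N)AWUPF.
 */
static u32 nvme_example_phys_bs(struct nvme_ns *ns, struct nvme_id_ns *id)
{
        u32 bs = 1U << ns->lba_shift;           /* from the LBA format in use */
        u32 phys_bs = bs, atomic_bs = bs;

        /* NPWG: Namespace Preferred Write Granularity */
        if (id->nsfeat & NVME_NS_FEAT_IO_OPT)
                phys_bs = bs * (1 + le16_to_cpu(id->npwg));

        /* NAWUPF / AWUPF: atomic write unit (power fail), namespace or controller wide */
        if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf)
                atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
        else
                atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;

        /* the reported physical block size is bounded by the atomic unit */
        return min(phys_bs, atomic_bs);
}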
> > block size > 4k, the latter allows the physical block size to be > 4k.
> > The first *needs* an aops which groks > ps on the block device cache.
> > The second lets you remain backward compatible with a 4k sector size,
> > but if you want to use a larger sector size you can too, though that
> > also requires a block device cache which groks > ps. Devices where the
> > logical block size or the physical block size is > ps are what I am
> > suggesting we should call LBS devices.
>
> Sure. LBS means sector size > page size for the block device. That
> much has always been clear.
>
> But telling me that again doesn't explain what LBS support has to do
> with the filesystem implementation. mkfs.xfs doesn't require LBS
> support to make a new XFS filesystem with a 32kB sector size and
> 64kB filesystem block size. For the mounted filesystem that
> supports bs > ps, it also doesn't care if the device sector size
> is smaller than what mkfs told it to use. e.g. this is how I run 4kB
> sector size filesystem testing on 512 byte sector devices today....
>
> What I'm asking is why LBS support even mentions filesystems?
It's just that without this patchset you can mkfs filesystems with
bs > ps, but you will only be able to mount them if the sector size
matches the page size. This patchset allows them to be mounted and used
where sector size > ps. The main issue there is just the block device
cache aops and the current size limitation on set_blocksize().
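For context, the set_blocksize() limitation is essentially these checks
(paraphrasing block/bdev.c, with the cache invalidation it does on a
size change left out):

int set_blocksize(struct block_device *bdev, int size)
{
        /* the block size is capped at PAGE_SIZE, which is the LBS problem */
        if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
                return -EINVAL;

        /* and it can never be smaller than the device's logical block size */
        if (size < bdev_logical_block_size(bdev))
                return -EINVAL;

        /* ... sync and invalidate the bdev cache if the size changed ... */
        return 0;
}

A device with a 32k logical block size can never satisfy both checks on
a 4k PAGE_SIZE kernel, which is why such a mount fails today.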
> It's
> not necessary for filesystems to support bs > ps for LBS to be
> implemented, and it's not necessary for LBS to be supported for
> filesystems to implement bs > ps. Conflating them seems a recipe
> for confusion....
I see, so we should just limit the scope to enabling LBS devices on
the block device cache?
> > > iomap supports bufferheads as a transitional thing (e.g. for gfs2).
> >
> > But not for the block device cache.
>
> That's why I'm suggesting that you implement support for bufferheads
> through the existing iomap infrastructure instead of trying to
> dynamically switch aops structures....
>
> > > Hence I suspect that a better solution is to always use iomap and
> > > the same aops, but just switch from iomap page state to buffer heads
> > > in the bdev mapping
> >
> > Not sure what this means, any chance I can trouble you to clarify a bit more?
>
> bdev_use_buggerheads(struct block_device *bdev, bool turn_on_bufferheads)
> {
>         /*
>          * Set the bufferhead bit atomically with the invalidation that
>          * empties the page cache, to prevent repopulation races.
>          */
>         filemap_invalidate_lock(bdev->bd_inode->i_mapping);
>         invalidate_bdev(bdev);
>         if (turn_on_bufferheads)
>                 set_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
>         else
>                 clear_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
>         filemap_invalidate_unlock(bdev->bd_inode->i_mapping);
> }
>
> bdev_iomap_begin()
> {
>         .....
>         if (test_bit(BDEV_USE_BUFFERHEADS, &bdev->state))
>                 iomap->flags |= IOMAP_F_BUFFER_HEAD;
> }
>
> /*
>  * If an indexing switch happened while processing the iomap, make
>  * sure to get the iomap marked stale to force a new mapping to be
>  * looked up.
>  */
> bdev_iomap_valid()
> {
>         bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
>         bool use_bhs = test_bit(BDEV_USE_BUFFERHEADS, &bdev->state);
>
>         return need_bhs == use_bhs;
> }
Will give this a shot, thanks!
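For reference, here is a rough sketch of how I'd expect those hooks to
be wired up on the bdev side, assuming the iomap_folio_ops ->iomap_valid
hook; the function names and the bdev->state / BDEV_USE_BUFFERHEADS bit
are illustrative and carried over from your sketch, not from the series:

static bool bdev_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
        struct block_device *bdev = I_BDEV(inode);
        bool need_bhs = iomap->flags & IOMAP_F_BUFFER_HEAD;
        bool use_bhs = test_bit(BDEV_USE_BUFFERHEADS, &bdev->state);

        /* a racing indexing switch forces the caller to remap */
        return need_bhs == use_bhs;
}

static const struct iomap_folio_ops bdev_iomap_folio_ops = {
        .iomap_valid    = bdev_iomap_valid,
};

static int bdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
                            unsigned int flags, struct iomap *iomap,
                            struct iomap *srcmap)
{
        struct block_device *bdev = I_BDEV(inode);
        loff_t isize = i_size_read(inode);

        iomap->bdev = bdev;
        iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
        if (iomap->offset >= isize)
                return -EIO;
        iomap->type = IOMAP_MAPPED;
        iomap->addr = iomap->offset;
        iomap->length = isize - iomap->offset;
        iomap->folio_ops = &bdev_iomap_folio_ops;

        /* pick the indexing mode set by bdev_use_buggerheads() above */
        if (test_bit(BDEV_USE_BUFFERHEADS, &bdev->state))
                iomap->flags |= IOMAP_F_BUFFER_HEAD;
        return 0;
}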
Luis