From: Luis Chamberlain <mcgrof@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
hch@infradead.org, djwong@kernel.org, dchinner@redhat.com,
kbusch@kernel.org, sagi@grimberg.me, axboe@fb.com,
brauner@kernel.org, hare@suse.de, ritesh.list@gmail.com,
rgoldwyn@suse.com, jack@suse.cz, ziy@nvidia.com,
ryan.roberts@arm.com, patches@lists.linux.dev,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-block@vger.kernel.org, p.raghav@samsung.com,
da.gomez@samsung.com, dan.helmick@samsung.com
Subject: Re: [RFC v2 00/10] bdev: LBS devices support to coexist with buffer-heads
Date: Sun, 17 Sep 2023 18:13:40 -0700 [thread overview]
Message-ID: <ZQekRLRXpoGu7VQ+@bombadil.infradead.org> (raw)
In-Reply-To: <ZQeg2+0X6yzGL1Mx@dread.disaster.area>
On Mon, Sep 18, 2023 at 10:59:07AM +1000, Dave Chinner wrote:
> On Mon, Sep 18, 2023 at 12:14:48AM +0100, Matthew Wilcox wrote:
> > On Mon, Sep 18, 2023 at 08:38:37AM +1000, Dave Chinner wrote:
> > > On Fri, Sep 15, 2023 at 02:32:44PM -0700, Luis Chamberlain wrote:
> > > > LBS devices. This in turn allows filesystems which support bs > 4k to be
> > > > enabled on a 4k PAGE_SIZE world on LBS block devices. This allows LBS
> > > > devices to take advantage of the recently posted work to enable
> > > > LBS support for filesystems [0].
> > >
> > > Why do we need LBS devices to support bs > ps in XFS?
> >
> > It's the other way round -- we need the support in the page cache to
> > reject sub-block-size folios (which is in the other patches) before we
> > can sensibly talk about enabling any filesystems on top of LBS devices.
> >
> > Even XFS, or for that matter ext2 which support 16k block sizes on
> > CONFIG_PAGE_SIZE_16K (or 64K) kernels need that support first.
>
> Well, yes, I know that. But the statement above implies that we
> can't use bs > ps filesystems without LBS support on 4kB PAGE_SIZE
> systems. If it's meant to mean the exact opposite, then it is
> extremely poorly worded....
Let's ignore the above statement for a second, just to clarify the goal here.

The patches posted by Pankaj enable bs > ps even on 4k sector drives.
This patch series, by definition, suggests that an LBS device is one
where the minimum sector size is > ps; in practice I'm talking about
devices where the logical block size is > 4k, or where the physical
block size can be > 4k. There are two situations in the NVMe world where
this can happen. One is where the LBA format used is > 4k, the other is
where npwg & (awupf | nawupf) is > 4k. The first makes the logical
block size > 4k, the latter allows the physical block size to be > 4k.
The first *needs* an aops which groks > ps on the block device cache.
The second lets you remain backward compatible with a 4k sector size,
but if you want to use a larger sector size you can, though that also
requires a block device cache which groks > ps. Devices using > ps for
the logical block size or the physical block size are what I am
suggesting we call LBS devices.

I'm happy for us to use other names, it just seemed like a natural
progression to me.
> > > > There might be a better way to do this than to deal with the switching
> > > > of the aops dynamically, ideas welcomed!
> > >
> > > Is it even safe to switch aops dynamically? We know there are
> > > inherent race conditions in doing this w.r.t. mmap and page faults,
> > > as the write fault part of the processing is directly dependent
> > > on the page being correctly initialised during the initial
> > > population of the page data (the "read fault" side of the write
> > > fault).
> > >
> > > Hence it's not generally considered safe to change aops from one
> > > mechanism to another dynamically. Block devices can be mmap()d, but
> > > I don't see anything in this patch set that ensures there are no
> > > other users of the block device when the swaps are done. What am I
> > > missing?
> >
> > We need to evict all pages from the page cache before switching aops to
> > prevent misinterpretation of folio->private.
>
> Yes, but if the device is mapped, even after an invalidation, we can
> still race with a new fault instantiating a page whilst the aops are
> being swapped, right? That was the problem that sunk dynamic
> swapping of the aops when turning DAX on and off on an inode, right?
I was not aware of that, thanks!
> > If switching aops is even
> > the right thing to do. I don't see the problem with allowing buffer heads
> > on block devices, but I haven't been involved with the discussion here.
>
> iomap supports bufferheads as a transitional thing (e.g. for gfs2).
But not for the block device cache.
> Hence I suspect that a better solution is to always use iomap and
> the same aops, but just switch from iomap page state to buffer heads
> in the bdev mapping
Not sure what this means; any chance I can trouble you to clarify a bit more?
> interface via a synchronised invalidation +
> setting/clearing IOMAP_F_BUFFER_HEAD in all new mapping requests...
I'm happy to use whatever makes more sense and is safer; just keep in
mind that we can't default to the iomap aops, otherwise buffer-head
based filesystems won't work. I tried that. The first aops the
block device cache needs, if we want coexistence, is buffer-heads.
If buffer-heads get large folio support then this is a non-issue as
far as I can tell, but only because XFS does not use buffer-heads for
metadata. If and when we do have a filesystem which uses
buffer-heads for metadata and wants to support LBS, it could be a problem, I think.
Luis