From: Dave Chinner <david@fromorbit.com>
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Pankaj Raghav <kernel@pankajraghav.com>,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	p.raghav@samsung.com, da.gomez@samsung.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	willy@infradead.org, djwong@kernel.org, linux-mm@kvack.org,
	chandan.babu@oracle.com, gost.dev@samsung.com
Subject: Re: [RFC 00/23] Enable block size > page size in XFS
Date: Mon, 18 Sep 2023 15:07:42 +1000
Message-ID: <ZQfbHloBUpDh+zCg@dread.disaster.area>
In-Reply-To: <ZQewKIfRYcApEYXt@bombadil.infradead.org>

On Sun, Sep 17, 2023 at 07:04:24PM -0700, Luis Chamberlain wrote:
> On Mon, Sep 18, 2023 at 08:05:20AM +1000, Dave Chinner wrote:
> > On Fri, Sep 15, 2023 at 08:38:25PM +0200, Pankaj Raghav wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > 
> > > There have been efforts over the last 16 years to enable Large
> > > Block Sizes (LBS), that is, filesystem block sizes where bs > page
> > > size [1] [2]. Through these efforts we have learned that one of the
> > > main blockers to supporting bs > ps in filesystems has been the lack
> > > of a way to allocate pages in the page cache that are at least the
> > > filesystem block size [3]. Another blocker was the changes required
> > > in filesystems because of buffer-heads. Thanks to these previous
> > > efforts, the surgery by Matthew Wilcox on the page cache to adopt
> > > xarray's multi-index support, and iomap support, supporting bs > ps
> > > in XFS is now possible with only a few lines of change to XFS
> > > itself. Most of the changes are to the page cache, to support a
> > > minimum folio order for the target block size on the filesystem.
> > > 
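> > > For context, xarray multi-index support is what lets a single folio
> > > occupy a naturally aligned range of 2^order slots in the page
> > > cache. A minimal sketch of the idea (not the exact code in this
> > > series; locking and error handling elided):
> > > 
> > > 	/* Store one folio over 2^order consecutive page cache
> > > 	 * slots; index must be naturally aligned to 1 << order. */
> > > 	XA_STATE(xas, &mapping->i_pages, index);
> > > 
> > > 	xas_set_order(&xas, index, folio_order(folio));
> > > 	xas_store(&xas, folio);
> > > 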
> > > A new motivation for LBS today is to support high-capacity
> > > (multi-terabyte) QLC SSDs where the internal Indirection Unit (IU)
> > > is typically greater than 4k [4], to help reduce DRAM usage and so
> > > in turn cost and space. In practice this allows different
> > > architectures to keep a base page size of 4k while still supporting
> > > block sizes aligned to the larger IUs, by relying on high-order
> > > folios in the page cache when needed. It also makes it possible to
> > > take advantage of these same drives' support for atomic writes
> > > larger than 4k with buffered IO in Linux. As described this year at
> > > LSFMM, supporting atomics greater than 4k lets databases remove the
> > > need to rely on their own journaling, so they can disable double
> > > buffered writes [5], a feature different cloud providers are
> > > already enabling for customers through custom storage solutions.
> > > 
> > > This series still needs some polishing and some crash fixes, but it
> > > is mainly targeted at getting initial feedback from the community
> > > and enabling initial experimentation, hence the RFC. It is being
> > > posted now because our testing is showing much better results than
> > > expected, and we hope to polish this up together with the community.
> > > After all, this has been a 16-year effort and none of this would
> > > have been possible without it.
> > > 
> > > Implementation:
> > > 
> > > This series only adds the notion of a minimum folio order in the
> > > page cache, as initially proposed by Willy. The minimum folio order
> > > requirement is set during inode creation and will typically
> > > correspond to the filesystem block size. The page cache in turn
> > > respects the minimum folio order requirement when allocating a
> > > folio. This series mainly changes the page cache's filemap,
> > > readahead, and truncation code to allocate and align folios to the
> > > minimum order set for the filesystem inode's address space mapping.
> > > 
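> > > A minimal sketch of the shape of this (the helper names here are
> > > illustrative, not necessarily the final API in these patches):
> > > 
> > > 	/* At inode init: pin the minimum folio order to the fs
> > > 	 * block size so the page cache never allocates smaller
> > > 	 * folios for this mapping. */
> > > 	unsigned int min_order = inode->i_blkbits - PAGE_SHIFT;
> > > 
> > > 	mapping_set_folio_min_order(inode->i_mapping, min_order);
> > > 
> > > 	/* Page cache side: align indices down to the minimum order
> > > 	 * before lookup/allocation so every folio starts on a
> > > 	 * block-size boundary. */
> > > 	index = round_down(index, 1UL << min_order);
> > > 	folio = filemap_alloc_folio(gfp, min_order);
> > > 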
> > > Only XFS was enabled and tested as part of this series, as it has
> > > supported block sizes up to 64k and sector sizes up to 32k for
> > > years. The only thing missing was the page cache magic to enable
> > > bs > ps. However, any filesystem that doesn't depend on buffer-heads
> > > and already supports larger block sizes should be able to leverage
> > > this effort to also support LBS, bs > ps.
> > > 
> > > This also paves the way for supporting block devices whose logical
> > > block size > page size in the future, by leveraging the iomap
> > > address space operations added to the block device cache by
> > > Christoph Hellwig [6]. We have work in progress to enable this,
> > > supporting LBAs > 4k on NVMe, while at the same time allowing
> > > coexistence with buffer-heads on the same block device, so that a
> > > drive can be switched between filesystems that depend on
> > > buffer-heads and filesystems that need the iomap address space
> > > operations for the block device cache. Patches for this will be
> > > posted shortly after this patch series.
> > 
> > Do you have a git tree branch that I can pull this from
> > somewhere?
> > 
> > As it is, I'd really prefer stuff that adds significant XFS
> > functionality that we need to test to be based on a current Linus
> > TOT kernel so that we can test it without being impacted by all
> > the random unrelated breakages that regularly happen in linux-next
> > kernels....
> 
> That's understandable! I just rebased onto Linus' tree, this only
> has the bs > ps support on 4k sector size:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=v6.6-rc2-lbs-nobdev

> I just did a cursory build / boot / fsx test with 16k block size / 4k
> sector size with this tree only. I haven't run fstests on it.
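> 
> (For reference, a filesystem with that geometry can be created with
> something like
> 
> 	mkfs.xfs -f -b size=16k -s size=4k $DEV
> 
> where $DEV stands in for whatever test device is used.)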

W/ 64k block size, generic/042 fails (maybe just a test block size
thing), generic/091 fails (data corruption on read after ~70 ops)
and then generic/095 hung with a crash in iomap_readpage_iter()
during readahead.

Looks like a null folio was passed to ifs_alloc(), which implies the
iomap_readpage_ctx didn't have a folio attached to it. Something
isn't working properly in the readahead code, which would also
explain the quick fsx failure...
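
For reference, ifs_alloc() dereferences the folio right at the top, so
a NULL folio from the readahead context oopses immediately. Roughly
(simplified sketch from fs/iomap/buffered-io.c of this era, allocation
details elided):

	static struct iomap_folio_state *ifs_alloc(struct inode *inode,
			struct folio *folio, unsigned int flags)
	{
		/* a NULL folio crashes here, before any checks */
		struct iomap_folio_state *ifs = folio->private;
		unsigned int nr_blocks = i_blocks_per_folio(inode, folio);

		if (ifs || nr_blocks <= 1)
			return ifs;
		/* ... per-folio state allocation follows ... */
	}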

> Just a heads up: using a 512 byte sector size will fail for now; it's
> a regression we have to fix. Likewise, block sizes of 1k and 2k will
> also regress on fsx right now. These are regressions we are aware of
> but haven't had time yet to bisect / fix.

I'm betting that the recently added sub-folio dirty tracking code
got broken by this patchset....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 55+ messages
2023-09-15 18:38 [RFC 00/23] Enable block size > page size in XFS Pankaj Raghav
2023-09-15 18:38 ` [RFC 01/23] fs: Allow fine-grained control of folio sizes Pankaj Raghav
2023-09-15 19:03   ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 02/23] pagemap: use mapping_min_order in fgf_set_order() Pankaj Raghav
2023-09-15 18:55   ` Matthew Wilcox
2023-09-20  7:46     ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 03/23] filemap: add folio with at least mapping_min_order in __filemap_get_folio Pankaj Raghav
2023-09-15 19:00   ` Matthew Wilcox
2023-09-20  8:06     ` Pankaj Raghav
2023-09-15 18:38 ` [RFC 04/23] filemap: set the order of the index in page_cache_delete_batch() Pankaj Raghav
2023-09-15 19:43   ` Matthew Wilcox
2023-09-18 18:20     ` Luis Chamberlain
2023-09-15 18:38 ` [RFC 05/23] filemap: align index to mapping_min_order in filemap_range_has_page() Pankaj Raghav
2023-09-15 19:45   ` Matthew Wilcox
2023-09-18 18:25     ` Luis Chamberlain
2023-09-15 18:38 ` [RFC 06/23] mm: call xas_set_order() in replace_page_cache_folio() Pankaj Raghav
2023-09-15 19:46   ` Matthew Wilcox
2023-09-18 18:27     ` Luis Chamberlain
2023-09-15 18:38 ` [RFC 07/23] filemap: align the index to mapping_min_order in __filemap_add_folio() Pankaj Raghav
2023-09-15 19:48   ` Matthew Wilcox
2023-09-18 18:32     ` Luis Chamberlain
2023-09-15 18:38 ` [RFC 08/23] filemap: align the index to mapping_min_order in filemap_get_folios_tag() Pankaj Raghav
2023-09-15 19:50   ` Matthew Wilcox
2023-09-18 18:36     ` Luis Chamberlain
2023-09-15 18:38 ` [RFC 09/23] filemap: use mapping_min_order while allocating folios Pankaj Raghav
2023-09-15 19:54   ` Matthew Wilcox
2023-09-15 18:38 ` [RFC 10/23] filemap: align the index to mapping_min_order in filemap_get_pages() Pankaj Raghav
2023-09-15 18:38 ` [RFC 11/23] filemap: align the index to mapping_min_order in do_[a]sync_mmap_readahead Pankaj Raghav
2023-09-15 18:38 ` [RFC 12/23] filemap: align index to mapping_min_order in filemap_fault() Pankaj Raghav
2023-09-15 18:38 ` [RFC 13/23] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav
2023-09-15 18:38 ` [RFC 14/23] readahead: allocate folios with mapping_min_order in ra_unbounded() Pankaj Raghav
2023-09-15 18:38 ` [RFC 15/23] readahead: align with mapping_min_order in force_page_cache_ra() Pankaj Raghav
2023-09-15 18:38 ` [RFC 16/23] readahead: add folio with at least mapping_min_order in page_cache_ra_order Pankaj Raghav
2023-09-15 18:38 ` [RFC 17/23] readahead: set the minimum ra size in get_(init|next)_ra Pankaj Raghav
2023-09-15 18:38 ` [RFC 18/23] readahead: align ra start and size to mapping_min_order in ondemand_ra() Pankaj Raghav
2023-09-15 18:38 ` [RFC 19/23] truncate: align index to mapping_min_order Pankaj Raghav
2023-09-15 18:38 ` [RFC 20/23] mm: round down folio split requirements Pankaj Raghav
2023-09-15 18:38 ` [RFC 21/23] xfs: expose block size in stat Pankaj Raghav
2023-09-15 18:38 ` [RFC 22/23] xfs: enable block size larger than page size support Pankaj Raghav
2023-09-15 18:38 ` [RFC 23/23] xfs: set minimum order folio for page cache based on blocksize Pankaj Raghav
2023-09-15 18:50 ` [RFC 00/23] Enable block size > page size in XFS Matthew Wilcox
2023-09-18 12:35   ` Pankaj Raghav
2023-09-17 22:05 ` Dave Chinner
2023-09-18  2:04   ` Luis Chamberlain
2023-09-18  5:07     ` Dave Chinner [this message]
2023-09-18 12:29       ` Pankaj Raghav
2023-09-19 11:56         ` Ritesh Harjani
2023-09-19 21:15           ` Luis Chamberlain
2023-09-21  3:00         ` Luis Chamberlain
2023-09-21  4:57           ` Luis Chamberlain
2023-09-21  6:03             ` Dave Chinner
2023-09-21  7:18               ` Luis Chamberlain
2023-09-21  7:20                 ` Luis Chamberlain
2023-09-22  5:03                 ` Dave Chinner
2023-09-22 19:38               ` Matthew Wilcox
