* [RFC 00/23] Enable block size > page size in XFS
@ 2023-09-15 18:38 Pankaj Raghav
From: Pankaj Raghav @ 2023-09-15 18:38 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: p.raghav, david, da.gomez, akpm, linux-kernel, willy, djwong,
	linux-mm, chandan.babu, mcgrof, gost.dev

From: Pankaj Raghav <p.raghav@samsung.com>

There have been efforts over the last 16 years to enable Large Block
Sizes (LBS), that is block sizes in filesystems where bs > page
size [1] [2]. Through these efforts we have learned that one of the
main blockers to supporting bs > ps in filesystems has been a way to
allocate pages that are at least the filesystem block size in the page
cache where bs > ps [3]. Another blocker was the changes required in
filesystems due to buffer-heads. Thanks to these previous efforts, the
surgery by Matthew Wilcox in the page cache to adopt xarray's
multi-index support, and iomap support, supporting bs > ps in XFS is now
possible with only a few lines of change to XFS itself. Most of the
changes are to the page cache to support a minimum folio order matching
the target block size on the filesystem.

A new motivation for LBS today is to support high-capacity (many
terabytes) QLC SSDs where the internal Indirection Unit (IU) is
typically greater than 4k [4], to help reduce DRAM usage and in turn
cost and space. In practice this allows different architectures to use
a base page size of 4k while still enabling support for block sizes
aligned to the larger IUs, by relying on high order folios in the page
cache when needed. It also makes it possible to take advantage of these
same drives' support for atomics larger than 4k with buffered IO
support in Linux. As described this year at LSFMM, supporting large
atomics greater than 4k enables databases to remove the need to rely on
their own journaling, so they can disable double buffered writes [5], a
feature which different cloud providers are already enabling for
customers through custom storage solutions.

This series still needs some polishing and fixing of some crashes, but
it is mainly targeted at getting initial feedback from the community
and enabling initial experimentation, hence the RFC. It's being posted
now given that our testing is showing much better results than
expected, and we hope to polish this up together with the community.
After all, this has been a 16 year effort and none of this could have
been possible without it.

Implementation:

This series only adds the notion of a minimum folio order in the
page cache, as initially proposed by Willy. The minimum folio order
requirement is set during inode creation. The minimum order will
typically correspond to the filesystem block size. The page cache will
in turn respect the minimum folio order requirement while allocating a
folio. This series mainly changes the page cache's filemap, readahead,
and truncation code to allocate and align folios to the minimum order
set for the inode's address space mapping.
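
To make the idea concrete, below is a minimal sketch of the page cache
side. The helper mapping_min_folio_order() is an assumed name used for
illustration only and may differ from what the series introduces; the
point is just that the index is aligned down to a minimum-order
boundary and no folio smaller than the minimum order is allocated.

/*
 * Illustrative sketch only; mapping_min_folio_order() is an assumed
 * helper name, not necessarily the one added by this series.
 */
static struct folio *add_min_order_folio(struct address_space *mapping,
					 pgoff_t index, gfp_t gfp)
{
	unsigned int min_order = mapping_min_folio_order(mapping); /* assumed */
	struct folio *folio;
	int err;

	/* Align the index down to a boundary of the minimum folio order. */
	index = round_down(index, 1UL << min_order);

	/* Never allocate a folio smaller than the minimum order. */
	folio = filemap_alloc_folio(gfp, min_order);
	if (!folio)
		return ERR_PTR(-ENOMEM);

	err = filemap_add_folio(mapping, folio, index, gfp);
	if (err) {
		folio_put(folio);
		return ERR_PTR(err);
	}
	return folio;
}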

Only XFS was enabled and tested as a part of this series as it has
supported block sizes up to 64k and sector sizes up to 32k for years.
The only thing missing was the page cache magic to enable bs > ps.
However, any filesystem that doesn't depend on buffer-heads and already
supports larger block sizes should be able to leverage this effort to
also support LBS, bs > ps.
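
As a rough illustration of the filesystem side, a sketch of how XFS
might derive the minimum folio order from its block size at inode
setup follows. mapping_set_folio_min_order() is an assumed helper name
here; the actual interface used by the series may differ.

/*
 * Illustrative sketch; mapping_set_folio_min_order() is an assumed
 * helper name. The point is only that the minimum order is derived
 * from the filesystem block size when it exceeds the page size.
 */
static void xfs_setup_min_folio_order(struct inode *inode,
				      struct xfs_mount *mp)
{
	unsigned int order = 0;

	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
		order = mp->m_sb.sb_blocklog - PAGE_SHIFT;

	mapping_set_folio_min_order(inode->i_mapping, order); /* assumed */
}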

This also paves the way for supporting block devices whose logical
block size > page size in the future, by leveraging iomap's address
space operations added to the block device cache by Christoph Hellwig
[6]. We have work in progress to enable support for this, enabling
LBAs > 4k on NVMe, while at the same time allowing coexistence with
buffer-heads on the same block device, so that a drive can switch
between filesystems which depend on buffer-heads and filesystems which
need the iomap address space operations for the block device cache.
Patches for this will be posted shortly after this patch series.

Testing:

The test results show this isn't so scary. There are only a few
regressions so far on XFS with CRCs disabled on block sizes smaller
than 4k, and some generic tests crash the system for bs > 4k. The
crashes are at most a handful at this point. This series has been
cleaned up 3 times now after we passed our first billion fsx ops on
different block sizes. Not surprisingly there are a few test bugs for
the bs > ps world.

We first established a baseline against linux-next with the 14
different XFS test profiles as maintained in kdevops [7]:

xfs_crc
xfs_reflink
xfs_reflink_normapbt
xfs_reflink_1024
xfs_reflink_2k
xfs_reflink_4k
xfs_nocrc
xfs_nocrc_512
xfs_nocrc_1k
xfs_nocrc_2k
xfs_nocrc_4k
xfs_logdev
xfs_rtdev
xfs_rtlogdev

We established a high confidence baseline for linux-next first and have
kept following that to ensure we don't regress against it. The majority
of regressions are fsx ops on no-CRC block sizes of 512 and 2k, and we
plan to fix that, but we welcome others to jump in and collaborate at
this point.

The list of known possible regressions can then be seen in kdevops
with git grep:

git grep regression workflows/fstests/expunges/6.6.0-rc1-large-block-20230914/ | awk -F"unassigned/" '{print $2}'
xfs_nocrc_2k.txt:generic/075 # possible regression
xfs_nocrc_2k.txt:generic/112 # possible regression
xfs_nocrc_2k.txt:generic/127 # possible regression
xfs_nocrc_2k.txt:generic/231 # possible regression
xfs_nocrc_2k.txt:generic/263 # possible regression
xfs_nocrc_2k.txt:generic/469 # possible regression
xfs_nocrc_512.txt:generic/075 # possible regression
xfs_nocrc_512.txt:generic/112 # possible regression
xfs_nocrc_512.txt:generic/127 # possible regression
xfs_nocrc_512.txt:generic/231 # possible regression
xfs_nocrc_512.txt:generic/263 # possible regression
xfs_nocrc_512.txt:generic/469 # possible regression
xfs_reflink_1024.txt:generic/457 # possible regression crash https://gist.github.com/mcgrof/f182b250a9d091f77dc85782a83224b3
xfs_rtdev.txt:generic/333 # might crash might be a regression, takes forever...

A billion fsx ops have passed with a 16k block size, and hundreds of
millions of fsx ops have so far also been successful against 32k and
64k block sizes with a 4k sector size.

To verify that larger IOs are being used, we have been using Daniel
Gomez's lbs-ctl tool, which uses eBPF to observe IO counts at the
block layer. That tool will soon be published.

For more details please refer to the kernel newbies page on LBS [8].

[1] https://lwn.net/Articles/231793/
[2] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-1-david@fromorbit.com/
[3] https://lore.kernel.org/linux-mm/20230308075952.GU2825702@dread.disaster.area/
[4] https://cdrdv2-public.intel.com/605724/Achieving_Optimal_Perf_IU_SSDs-338395-003US.pdf
[5] https://lwn.net/Articles/932900/
[6] https://lore.kernel.org/lkml/20230801172201.1923299-2-hch@lst.de/T/
[7] https://github.com/linux-kdevops/kdevops/blob/master/playbooks/roles/fstests/templates/xfs/xfs.config
[8] https://kernelnewbies.org/KernelProjects/large-block-size

--
Regards,
Pankaj
Luis

Dave Chinner (1):
  xfs: expose block size in stat

Luis Chamberlain (12):
  filemap: set the order of the index in page_cache_delete_batch()
  filemap: align index to mapping_min_order in filemap_range_has_page()
  mm: call xas_set_order() in replace_page_cache_folio()
  filemap: align the index to mapping_min_order in __filemap_add_folio()
  filemap: align the index to mapping_min_order in
    filemap_get_folios_tag()
  filemap: align the index to mapping_min_order in filemap_get_pages()
  readahead: set file_ra_state->ra_pages to be at least
    mapping_min_order
  readahead: add folio with at least mapping_min_order in
    page_cache_ra_order
  readahead: set the minimum ra size in get_(init|next)_ra
  readahead: align ra start and size to mapping_min_order in
    ondemand_ra()
  truncate: align index to mapping_min_order
  mm: round down folio split requirements

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (9):
  pagemap: use mapping_min_order in fgf_set_order()
  filemap: add folio with at least mapping_min_order in
    __filemap_get_folio
  filemap: use mapping_min_order while allocating folios
  filemap: align the index to mapping_min_order in
    do_[a]sync_mmap_readahead
  filemap: align index to mapping_min_order in filemap_fault()
  readahead: allocate folios with mapping_min_order in ra_unbounded()
  readahead: align with mapping_min_order in force_page_cache_ra()
  xfs: enable block size larger than page size support
  xfs: set minimum order folio for page cache based on blocksize

 fs/iomap/buffered-io.c  |  2 +-
 fs/xfs/xfs_icache.c     |  8 +++-
 fs/xfs/xfs_iops.c       |  4 +-
 fs/xfs/xfs_mount.c      |  9 ++++-
 fs/xfs/xfs_super.c      |  7 +---
 include/linux/pagemap.h | 87 ++++++++++++++++++++++++++++++-----------
 mm/filemap.c            | 87 +++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c        | 14 +++++--
 mm/readahead.c          | 86 ++++++++++++++++++++++++++++++++++------
 mm/truncate.c           | 34 +++++++++++-----
 10 files changed, 263 insertions(+), 75 deletions(-)


base-commit: e143016b56ecb0fcda5bb6026b0a25fe55274f56
-- 
2.40.1



