* [PATCH v2 0/8] enable bs > ps for block devices
@ 2025-02-04 23:12 Luis Chamberlain
2025-02-04 23:12 ` [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
` (7 more replies)
0 siblings, 8 replies; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
This v2 addresses feedback from the first RFC on enabling bs > ps for
block devices [0], after which I split the async read buffer-head work
into its own series [1]. This unifies the series again now that the
buffer-head work is greatly simplified and the generalized block size
check is merged upstream in v6.14-rc1.
Changes in this series:
- Simplifies block_read_full_folio() with bh_offset() and moves this
  to be the first patch
- Re-orders the negative shift patch to go first, as otherwise
  the blocks_per_folio changes don't make any sense
- Simplifies the amount of changes in the patch
  "enable large folio support for large logical block sizes" as most
  of the required changes are now upstream
- Drops the NVMe patch as it's no longer needed
- Keeps nr_pages at 1 on readahead for buffer-head folios,
  as suggested by Matthew
- Takes the approach suggested by Matthew Wilcox on async reads,
  replacing the batched read with a straightforward iteration
- Tons of cosmetic updates as requested by folks
- Rebases on top of v6.14-rc1
- Tested with both fstests on ext4 and blktests using the latest
  changes posted to support bs > ps for block devices just now [2]
- Updates the rationale for why we use 64k as the current limit:
  test and validation
If you want this on a tree, this is available on the kdevops linux
large-block-buffer-heads-for-next branch [3].
[0] https://lkml.kernel.org/r/20241113094727.1497722-1-mcgrof@kernel.org
[1] https://lkml.kernel.org/r/20241218022626.3668119-1-mcgrof@kernel.org
[2] https://lkml.kernel.org/r/20250204225729.422949-1-mcgrof@kernel.org
[3] https://github.com/linux-kdevops/linux/tree/large-block-buffer-heads-for-next
Hannes Reinecke (3):
fs/mpage: avoid negative shift for large blocksize
fs/mpage: use blocks_per_folio instead of blocks_per_page
block/bdev: enable large folio support for large logical block sizes
Luis Chamberlain (4):
fs/buffer: simplify block_read_full_folio() with bh_offset()
fs/buffer fs/mpage: remove large folio restriction
block/bdev: lift block size restrictions to 64k
bdev: use bdev_io_min() for statx block size
Matthew Wilcox (1):
fs/buffer: remove batching from async read
block/bdev.c | 11 ++++----
fs/buffer.c | 58 +++++++++++++++++-------------------------
fs/mpage.c | 45 +++++++++++++++-----------------
include/linux/blkdev.h | 9 ++++++-
4 files changed, 58 insertions(+), 65 deletions(-)
--
2.45.2
* [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-05 16:18 ` Hannes Reinecke
2025-02-04 23:12 ` [PATCH v2 2/8] fs/buffer: remove batching from async read Luis Chamberlain
` (6 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
When we read over all buffers in a folio we currently compute each
buffer's offset from its index within the folio and the blocksize.
Simplify this with bh_offset(), which derives the offset from the
buffer-head itself. This simplifies the loop while making no
functional changes.
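
In a nutshell, both forms compute the same byte offset of a buffer
within its folio:

    /* before: offset tracked with a separate loop counter */
    folio_zero_range(folio, i * blocksize, blocksize);
    /* after: offset derived from the buffer_head itself */
    folio_zero_range(folio, bh_offset(bh), blocksize);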
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index cc8452f60251..b99560e8a142 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2381,7 +2381,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
nr = 0;
- i = 0;
do {
if (buffer_uptodate(bh))
@@ -2398,7 +2397,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
page_error = true;
}
if (!buffer_mapped(bh)) {
- folio_zero_range(folio, i * blocksize,
+ folio_zero_range(folio, bh_offset(bh),
blocksize);
if (!err)
set_buffer_uptodate(bh);
@@ -2412,7 +2411,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
continue;
}
arr[nr++] = bh;
- } while (i++, iblock++, (bh = bh->b_this_page) != head);
+ } while (iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
folio_set_mappedtodisk(folio);
--
2.45.2
* [PATCH v2 2/8] fs/buffer: remove batching from async read
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
2025-02-04 23:12 ` [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-05 16:21 ` Hannes Reinecke
2025-02-17 21:40 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 3/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
` (5 subsequent siblings)
7 siblings, 2 replies; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
From: Matthew Wilcox <willy@infradead.org>
The current implementation of an async folio read in block_read_full_folio()
first batches all buffer-heads which need IO issued by putting them on an
array of max size MAX_BUF_PER_PAGE. After collection it locks the batched
buffer-heads and finally submits the pending reads. On systems with CPUs
where the system page size is quite large, like Hexagon with 256 KiB, this
batching can lead to stack growth warnings, so we want to avoid that.

Note the use of folio_end_read() throughout block_read_full_folio(): it's
used either when the folio is determined to be fully uptodate and no
pending read is needed, when an IO error happened in get_block(), or when
an out-of-bounds read raced against batch collection to make our required
reads uptodate.

We can simplify this logic considerably and remove the stack growth
issues of MAX_BUF_PER_PAGE by replacing the batched logic with one
which only issues IO for the previous buffer-head. Keeping in mind that
we'll always have one buffer-head (the current one) on the folio with
the async flag set, this prevents any premature calls to folio_end_read().

So we accomplish two things with this:

o Avoid large stack arrays with MAX_BUF_PER_PAGE
o Make the need for folio_end_read() explicit and easier to read
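
For clarity, here is a reduced sketch of the new loop; needs_read() is a
hypothetical stand-in for the uptodate/mapping checks in the actual
patch below:

    struct buffer_head *bh = head, *prev = NULL;

    do {
            if (!needs_read(bh))     /* stand-in: uptodate/mapping checks */
                    continue;
            lock_buffer(bh);
            if (buffer_uptodate(bh)) {  /* raced to uptodate under the lock */
                    unlock_buffer(bh);
                    continue;
            }
            mark_buffer_async_read(bh);
            if (prev)                /* safe: bh still carries the async flag */
                    submit_bh(REQ_OP_READ, prev);
            prev = bh;
    } while ((bh = bh->b_this_page) != head);

    if (prev)                        /* submit the last outstanding read */
            submit_bh(REQ_OP_READ, prev);
    else                             /* nothing needed reading: finish the folio */
            folio_end_read(folio, !page_error);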
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 51 +++++++++++++++++++++------------------------------
1 file changed, 21 insertions(+), 30 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index b99560e8a142..167fa3e33566 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,9 +2361,8 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head, *prev = NULL;
size_t blocksize;
- int nr, i;
int fully_mapped = 1;
bool page_error = false;
loff_t limit = i_size_read(inode);
@@ -2380,7 +2379,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
iblock = div_u64(folio_pos(folio), blocksize);
lblock = div_u64(limit + blocksize - 1, blocksize);
bh = head;
- nr = 0;
do {
if (buffer_uptodate(bh))
@@ -2410,40 +2408,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (buffer_uptodate(bh))
continue;
}
- arr[nr++] = bh;
+
+ lock_buffer(bh);
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ continue;
+ }
+
+ mark_buffer_async_read(bh);
+ if (prev)
+ submit_bh(REQ_OP_READ, prev);
+ prev = bh;
} while (iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
folio_set_mappedtodisk(folio);
- if (!nr) {
- /*
- * All buffers are uptodate or get_block() returned an
- * error when trying to map them - we can finish the read.
- */
- folio_end_read(folio, !page_error);
- return 0;
- }
-
- /* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- lock_buffer(bh);
- mark_buffer_async_read(bh);
- }
-
/*
- * Stage 3: start the IO. Check for uptodateness
- * inside the buffer lock in case another process reading
- * the underlying blockdev brought it uptodate (the sct fix).
+ * All buffers are uptodate or get_block() returned an error
+ * when trying to map them - we must finish the read because
+ * end_buffer_async_read() will never be called on any buffer
+ * in this folio.
*/
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
- submit_bh(REQ_OP_READ, bh);
- }
+ if (prev)
+ submit_bh(REQ_OP_READ, prev);
+ else
+ folio_end_read(folio, !page_error);
+
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
--
2.45.2
* [PATCH v2 3/8] fs/mpage: avoid negative shift for large blocksize
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
2025-02-04 23:12 ` [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
2025-02-04 23:12 ` [PATCH v2 2/8] fs/buffer: remove batching from async read Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-17 21:48 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
` (4 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof, Hannes Reinecke
From: Hannes Reinecke <hare@kernel.org>
For large blocksizes the number of block bits is larger than PAGE_SHIFT,
so instead use folio_pos(folio) >> blkbits to calculate the sector
number; shifting by the now-negative (PAGE_SHIFT - blkbits) would be
undefined behaviour in C. This is required to enable large folios with
buffer-heads.
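
To make the failure mode concrete, consider a 16 KiB block size
(blkbits = 14) on a 4 KiB page system (PAGE_SHIFT = 12):

    block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
                                    /* shift count is -2: undefined */
    block_in_file = folio_pos(folio) >> blkbits;
                                    /* byte offset / 16384: well defined */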
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
fs/mpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index 82aecf372743..a3c82206977f 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -181,7 +181,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (folio_buffers(folio))
goto confused;
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
last_block = block_in_file + args->nr_pages * blocks_per_page;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
@@ -527,7 +527,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
* The page has no buffers: map it to disk
*/
BUG_ON(!folio_test_uptodate(folio));
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
/*
* Whole page beyond EOF? Skip allocating blocks to avoid leaking
* space.
--
2.45.2
* [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
` (2 preceding siblings ...)
2025-02-04 23:12 ` [PATCH v2 3/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-17 21:58 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
` (3 subsequent siblings)
7 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
From: Hannes Reinecke <hare@suse.de>
Convert mpage to folios and associate the number of blocks with
a folio and not a page.
[mcgrof: keep 1 page request on mpage_read_folio()]
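
To illustrate the difference with concrete numbers (a sketch assuming
1 KiB blocks, i.e. blkbits = 10, on a 4 KiB page system):

    blocks_per_page  = PAGE_SIZE >> blkbits;          /* always 4 */
    blocks_per_folio = folio_size(folio) >> blkbits;  /* 4, 8, 16, ...
                                                         scales with the
                                                         folio order */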
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/mpage.c | 38 +++++++++++++++++++-------------------
1 file changed, 19 insertions(+), 19 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index a3c82206977f..c17d7a724e4b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -107,7 +107,7 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
* don't make any buffers if there is only one buffer on
* the folio and the folio just needs to be set up to date
*/
- if (inode->i_blkbits == PAGE_SHIFT &&
+ if (inode->i_blkbits == folio_shift(folio) &&
buffer_uptodate(bh)) {
folio_mark_uptodate(folio);
return;
@@ -153,7 +153,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
struct folio *folio = args->folio;
struct inode *inode = folio->mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
const unsigned blocksize = 1 << blkbits;
struct buffer_head *map_bh = &args->map_bh;
sector_t block_in_file;
@@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
sector_t last_block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_hole = blocks_per_page;
+ unsigned first_hole = blocks_per_folio;
struct block_device *bdev = NULL;
int length;
int fully_mapped = 1;
@@ -182,7 +182,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
block_in_file = folio_pos(folio) >> blkbits;
- last_block = block_in_file + args->nr_pages * blocks_per_page;
+ last_block = block_in_file + args->nr_pages * blocks_per_folio;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
last_block = last_block_in_file;
@@ -204,7 +204,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
clear_buffer_mapped(map_bh);
break;
}
- if (page_block == blocks_per_page)
+ if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -216,7 +216,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
* Then do more get_blocks calls until we are done with this folio.
*/
map_bh->b_folio = folio;
- while (page_block < blocks_per_page) {
+ while (page_block < blocks_per_folio) {
map_bh->b_state = 0;
map_bh->b_size = 0;
@@ -229,7 +229,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (!buffer_mapped(map_bh)) {
fully_mapped = 0;
- if (first_hole == blocks_per_page)
+ if (first_hole == blocks_per_folio)
first_hole = page_block;
page_block++;
block_in_file++;
@@ -247,7 +247,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
}
- if (first_hole != blocks_per_page)
+ if (first_hole != blocks_per_folio)
goto confused; /* hole -> non-hole */
/* Contiguous blocks? */
@@ -260,7 +260,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (relative_block == nblocks) {
clear_buffer_mapped(map_bh);
break;
- } else if (page_block == blocks_per_page)
+ } else if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -268,7 +268,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
bdev = map_bh->b_bdev;
}
- if (first_hole != blocks_per_page) {
+ if (first_hole != blocks_per_folio) {
folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
if (first_hole == 0) {
folio_mark_uptodate(folio);
@@ -303,10 +303,10 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
relative_block = block_in_file - args->first_logical_block;
nblocks = map_bh->b_size >> blkbits;
if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
- (first_hole != blocks_per_page))
+ (first_hole != blocks_per_folio))
args->bio = mpage_bio_submit_read(args->bio);
else
- args->last_block_in_bio = first_block + blocks_per_page - 1;
+ args->last_block_in_bio = first_block + blocks_per_folio - 1;
out:
return args->bio;
@@ -456,12 +456,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
struct address_space *mapping = folio->mapping;
struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
sector_t last_block;
sector_t block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_unmapped = blocks_per_page;
+ unsigned first_unmapped = blocks_per_folio;
struct block_device *bdev = NULL;
int boundary = 0;
sector_t boundary_block = 0;
@@ -486,12 +486,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
*/
if (buffer_dirty(bh))
goto confused;
- if (first_unmapped == blocks_per_page)
+ if (first_unmapped == blocks_per_folio)
first_unmapped = page_block;
continue;
}
- if (first_unmapped != blocks_per_page)
+ if (first_unmapped != blocks_per_folio)
goto confused; /* hole -> non-hole */
if (!buffer_dirty(bh) || !buffer_uptodate(bh))
@@ -536,7 +536,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
goto page_is_mapped;
last_block = (i_size - 1) >> blkbits;
map_bh.b_folio = folio;
- for (page_block = 0; page_block < blocks_per_page; ) {
+ for (page_block = 0; page_block < blocks_per_folio; ) {
map_bh.b_state = 0;
map_bh.b_size = 1 << blkbits;
@@ -618,14 +618,14 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
BUG_ON(folio_test_writeback(folio));
folio_start_writeback(folio);
folio_unlock(folio);
- if (boundary || (first_unmapped != blocks_per_page)) {
+ if (boundary || (first_unmapped != blocks_per_folio)) {
bio = mpage_bio_submit_write(bio);
if (boundary_block) {
write_boundary_block(boundary_bdev,
boundary_block, 1 << blkbits);
}
} else {
- mpd->last_block_in_bio = first_block + blocks_per_page - 1;
+ mpd->last_block_in_bio = first_block + blocks_per_folio - 1;
}
goto out;
--
2.45.2
* [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
` (3 preceding siblings ...)
2025-02-04 23:12 ` [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-05 16:21 ` Hannes Reinecke
2025-02-17 21:59 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 6/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
` (2 subsequent siblings)
7 siblings, 2 replies; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
Now that buffer-heads have been converted over to support large folios
we can remove the built-in VM_BUG_ON_FOLIO() checks which prevented
their use.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
fs/buffer.c | 2 --
fs/mpage.c | 3 ---
2 files changed, 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 167fa3e33566..194eacbefc95 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2371,8 +2371,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
limit = inode->i_sb->s_maxbytes;
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
head = folio_create_buffers(folio, inode, 0);
blocksize = head->b_size;
diff --git a/fs/mpage.c b/fs/mpage.c
index c17d7a724e4b..031230531a2a 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -170,9 +170,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
unsigned relative_block;
gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
- /* MAX_BUF_PER_PAGE, for example */
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
if (args->is_readahead) {
opf |= REQ_RAHEAD;
gfp |= __GFP_NORETRY | __GFP_NOWARN;
--
2.45.2
* [PATCH v2 6/8] block/bdev: enable large folio support for large logical block sizes
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
` (4 preceding siblings ...)
2025-02-04 23:12 ` [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-17 21:59 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 7/8] block/bdev: lift block size restrictions to 64k Luis Chamberlain
2025-02-04 23:12 ` [PATCH v2 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
7 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
From: Hannes Reinecke <hare@suse.de>
Call mapping_set_folio_min_order() when modifying the logical block
size to ensure folios are allocated with the correct size.
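
A sketch of the effect, assuming 4 KiB pages and a 16 KiB logical
block size:

    /* get_order(16384) == 2, so the page cache only allocates folios
     * of order >= 2 (16 KiB) for this mapping, and every folio covers
     * a whole number of logical blocks.
     */
    mapping_set_folio_min_order(inode->i_mapping, get_order(size));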
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
block/bdev.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block/bdev.c b/block/bdev.c
index 9d73a8fbf7f9..8aadf1f23cb4 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -148,6 +148,8 @@ static void set_init_blocksize(struct block_device *bdev)
bsize <<= 1;
}
BD_INODE(bdev)->i_blkbits = blksize_bits(bsize);
+ mapping_set_folio_min_order(BD_INODE(bdev)->i_mapping,
+ get_order(bsize));
}
int set_blocksize(struct file *file, int size)
@@ -169,6 +171,7 @@ int set_blocksize(struct file *file, int size)
if (inode->i_blkbits != blksize_bits(size)) {
sync_blockdev(bdev);
inode->i_blkbits = blksize_bits(size);
+ mapping_set_folio_min_order(inode->i_mapping, get_order(size));
kill_bdev(bdev);
}
return 0;
--
2.45.2
* [PATCH v2 7/8] block/bdev: lift block size restrictions to 64k
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
` (5 preceding siblings ...)
2025-02-04 23:12 ` [PATCH v2 6/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-17 22:01 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
7 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
We can now support blocksizes larger than PAGE_SIZE, so in theory
we should be able to lift the restriction up to the max supported page
cache order. However, we bound ourselves to what we can currently
validate and test. Through blktests and fstests we can validate up to
64k today.
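
The resulting validation semantics, for illustration:

    /* Accepted: the powers of two 512, 1024, ..., 65536 (SZ_64K).
     * Rejected with -EINVAL: anything below 512, above
     * BLK_MAX_BLOCK_SIZE, or not a power of two (e.g. 3072).
     */
    if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
            return -EINVAL;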
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
block/bdev.c | 3 +--
include/linux/blkdev.h | 9 ++++++++-
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 8aadf1f23cb4..22806ce11e1d 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -183,8 +183,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
{
if (set_blocksize(sb->s_bdev_file, size))
return 0;
- /* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ /* If we get here, we know size is validated */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..a89513302977 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -26,6 +26,7 @@
#include <linux/xarray.h>
#include <linux/file.h>
#include <linux/lockdep.h>
+#include <linux/pagemap.h>
struct module;
struct request_queue;
@@ -267,10 +268,16 @@ static inline dev_t disk_devt(struct gendisk *disk)
return MKDEV(disk->major, disk->first_minor);
}
+/*
+ * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
+ * however we constrain this to what we can validate and test.
+ */
+#define BLK_MAX_BLOCK_SIZE SZ_64K
+
/* blk_validate_limits() validates bsize, so drivers don't usually need to */
static inline int blk_validate_block_size(unsigned long bsize)
{
- if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize))
+ if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE || !is_power_of_2(bsize))
return -EINVAL;
return 0;
--
2.45.2
* [PATCH v2 8/8] bdev: use bdev_io_min() for statx block size
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
` (6 preceding siblings ...)
2025-02-04 23:12 ` [PATCH v2 7/8] block/bdev: lift block size restrictions to 64k Luis Chamberlain
@ 2025-02-04 23:12 ` Luis Chamberlain
2025-02-05 16:22 ` Hannes Reinecke
7 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-04 23:12 UTC (permalink / raw)
To: hare, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel,
mcgrof
You can use lsblk to query a block device's minimum IO size:
lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
4096
The min-io is the minimum IO size the block device prefers for optimal
performance. In turn we map this to the block device block size.
The block size currently exposed even for block devices with an
LBA format of 16k is 4k. Likewise, devices which support a 4k LBA format
but have a larger Indirection Unit of 16k have an exposed block size
of 4k.

This incurs read-modify-writes on direct IO against devices with a
min-io larger than the page size. To fix this, use the block device
min-io, which is the minimal optimal IO size the device prefers.
With this we now get:
lsblk -o MIN-IO /dev/nvme0n1
MIN-IO
16384
And so userspace gets the appropriate information it needs for optimal
performance. This is verified with blkalgn against mkfs against a
device with LBA format of 4k but an NPWG of 16k (min io size)
mkfs.xfs -f -b size=16k /dev/nvme3n1
blkalgn -d nvme3n1 --ops Write
Block size : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 66 |****************************************|
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 2 |* |
Block size: 14 - 66
Block size: 17 - 2
Algn size : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 66 |****************************************|
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 2 |* |
Algn size: 14 - 66
Algn size: 17 - 2
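
Userspace can also confirm the exposed value via stat(2); a minimal
sketch, with the device path as an example:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            struct stat st;

            if (stat("/dev/nvme3n1", &st) != 0)
                    return 1;
            /* with this patch this reports bdev_io_min(), e.g. 16384 */
            printf("st_blksize = %ld\n", (long)st.st_blksize);
            return 0;
    }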
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
block/bdev.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index 22806ce11e1d..3bd948e6438d 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1276,9 +1276,6 @@ void bdev_statx(struct path *path, struct kstat *stat,
struct inode *backing_inode;
struct block_device *bdev;
- if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
- return;
-
backing_inode = d_backing_inode(path->dentry);
/*
@@ -1305,6 +1302,8 @@ void bdev_statx(struct path *path, struct kstat *stat,
queue_atomic_write_unit_max_bytes(bd_queue));
}
+ stat->blksize = bdev_io_min(bdev);
+
blkdev_put_no_open(bdev);
}
--
2.45.2
* Re: [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-02-04 23:12 ` [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
@ 2025-02-05 16:18 ` Hannes Reinecke
2025-02-05 22:03 ` Matthew Wilcox
0 siblings, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-05 16:18 UTC (permalink / raw)
To: Luis Chamberlain, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 2/5/25 00:12, Luis Chamberlain wrote:
> When we read over all buffers in a folio we currently compute each
> buffer's offset from its index within the folio and the blocksize.
> Simplify this with bh_offset(), which derives the offset from the
> buffer-head itself. This simplifies the loop while making no
> functional changes.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index cc8452f60251..b99560e8a142 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2381,7 +2381,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> lblock = div_u64(limit + blocksize - 1, blocksize);
> bh = head;
> nr = 0;
> - i = 0;
>
> do {
> if (buffer_uptodate(bh))
> @@ -2398,7 +2397,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> page_error = true;
> }
> if (!buffer_mapped(bh)) {
> - folio_zero_range(folio, i * blocksize,
> + folio_zero_range(folio, bh_offset(bh),
> blocksize);
> if (!err)
> set_buffer_uptodate(bh);
> @@ -2412,7 +2411,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> continue;
> }
> arr[nr++] = bh;
> - } while (i++, iblock++, (bh = bh->b_this_page) != head);
> + } while (iblock++, (bh = bh->b_this_page) != head);
>
> if (fully_mapped)
> folio_set_mappedtodisk(folio);
One wonders: shouldn't we use plugging here to make I/O more efficient?
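
For reference, the pattern in question would look roughly like this
sketch (not a concrete proposal for this exact function):

    struct blk_plug plug;

    blk_start_plug(&plug);
    /* ... loop issuing submit_bh(REQ_OP_READ, bh) ... */
    blk_finish_plug(&plug);    /* dispatch the merged/batched IO */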
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 2/8] fs/buffer: remove batching from async read
2025-02-04 23:12 ` [PATCH v2 2/8] fs/buffer: remove batching from async read Luis Chamberlain
@ 2025-02-05 16:21 ` Hannes Reinecke
2025-02-07 7:08 ` Hannes Reinecke
2025-02-17 21:40 ` Matthew Wilcox
1 sibling, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-05 16:21 UTC (permalink / raw)
To: Luis Chamberlain, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 2/5/25 00:12, Luis Chamberlain wrote:
> From: Matthew Wilcox <willy@infradead.org>
>
> The current implementation of an async folio read in block_read_full_folio()
> first batches all buffer-heads which need IO issued by putting them on an
> array of max size MAX_BUF_PER_PAGE. After collection it locks the batched
> buffer-heads and finally submits the pending reads. On systems with CPUs
> where the system page size is quite large, like Hexagon with 256 KiB, this
> batching can lead to stack growth warnings, so we want to avoid that.
>
> Note the use of folio_end_read() throughout block_read_full_folio(): it's
> used either when the folio is determined to be fully uptodate and no
> pending read is needed, when an IO error happened in get_block(), or when
> an out-of-bounds read raced against batch collection to make our required
> reads uptodate.
>
> We can simplify this logic considerably and remove the stack growth
> issues of MAX_BUF_PER_PAGE by replacing the batched logic with one
> which only issues IO for the previous buffer-head. Keeping in mind that
> we'll always have one buffer-head (the current one) on the folio with
> the async flag set, this prevents any premature calls to folio_end_read().
>
> So we accomplish two things with this:
>
> o Avoid large stack arrays with MAX_BUF_PER_PAGE
> o Make the need for folio_end_read() explicit and easier to read
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 51 +++++++++++++++++++++------------------------------
> 1 file changed, 21 insertions(+), 30 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b99560e8a142..167fa3e33566 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,9 +2361,8 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> {
> struct inode *inode = folio->mapping->host;
> sector_t iblock, lblock;
> - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> + struct buffer_head *bh, *head, *prev = NULL;
> size_t blocksize;
> - int nr, i;
> int fully_mapped = 1;
> bool page_error = false;
> loff_t limit = i_size_read(inode);
> @@ -2380,7 +2379,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> iblock = div_u64(folio_pos(folio), blocksize);
> lblock = div_u64(limit + blocksize - 1, blocksize);
> bh = head;
> - nr = 0;
>
> do {
> if (buffer_uptodate(bh))
> @@ -2410,40 +2408,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> if (buffer_uptodate(bh))
> continue;
> }
> - arr[nr++] = bh;
> +
> + lock_buffer(bh);
> + if (buffer_uptodate(bh)) {
> + unlock_buffer(bh);
> + continue;
> + }
> +
> + mark_buffer_async_read(bh);
> + if (prev)
> + submit_bh(REQ_OP_READ, prev);
> + prev = bh;
> } while (iblock++, (bh = bh->b_this_page) != head);
>
> if (fully_mapped)
> folio_set_mappedtodisk(folio);
>
> - if (!nr) {
> - /*
> - * All buffers are uptodate or get_block() returned an
> - * error when trying to map them - we can finish the read.
> - */
> - folio_end_read(folio, !page_error);
> - return 0;
> - }
> -
> - /* Stage two: lock the buffers */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - lock_buffer(bh);
> - mark_buffer_async_read(bh);
> - }
> -
> /*
> - * Stage 3: start the IO. Check for uptodateness
> - * inside the buffer lock in case another process reading
> - * the underlying blockdev brought it uptodate (the sct fix).
> + * All buffers are uptodate or get_block() returned an error
> + * when trying to map them - we must finish the read because
> + * end_buffer_async_read() will never be called on any buffer
> + * in this folio.
> */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - if (buffer_uptodate(bh))
> - end_buffer_async_read(bh, 1);
> - else
> - submit_bh(REQ_OP_READ, bh);
> - }
> + if (prev)
> + submit_bh(REQ_OP_READ, prev);
> + else
> + folio_end_read(folio, !page_error);
> +
> return 0;
> }
> EXPORT_SYMBOL(block_read_full_folio);
Similar here; as we've now removed batching (which technically could result
in I/O being completed while executing the various stages) there really
is nothing preventing us from using plugging here, no?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction
2025-02-04 23:12 ` [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
@ 2025-02-05 16:21 ` Hannes Reinecke
2025-02-17 21:59 ` Matthew Wilcox
1 sibling, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-05 16:21 UTC (permalink / raw)
To: Luis Chamberlain, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 2/5/25 00:12, Luis Chamberlain wrote:
> Now that buffer-heads have been converted over to support large folios
> we can remove the built-in VM_BUG_ON_FOLIO() checks which prevented
> their use.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> fs/buffer.c | 2 --
> fs/mpage.c | 3 ---
> 2 files changed, 5 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 8/8] bdev: use bdev_io_min() for statx block size
2025-02-04 23:12 ` [PATCH v2 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
@ 2025-02-05 16:22 ` Hannes Reinecke
0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-05 16:22 UTC (permalink / raw)
To: Luis Chamberlain, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 2/5/25 00:12, Luis Chamberlain wrote:
> You can use lsblk to query a block device's minimum IO size:
>
> lsblk -o MIN-IO /dev/nvme0n1
> MIN-IO
> 4096
>
> The min-io is the minimum IO size the block device prefers for optimal
> performance. In turn we map this to the block device block size.
> The block size currently exposed even for block devices with an
> LBA format of 16k is 4k. Likewise, devices which support a 4k LBA format
> but have a larger Indirection Unit of 16k have an exposed block size
> of 4k.
>
> This incurs read-modify-writes on direct IO against devices with a
> min-io larger than the page size. To fix this, use the block device
> min-io, which is the minimal optimal IO size the device prefers.
>
> With this we now get:
>
> lsblk -o MIN-IO /dev/nvme0n1
> MIN-IO
> 16384
>
> And so userspace gets the appropriate information it needs for optimal
> performance. This is verified with blkalgn against mkfs against a
> device with LBA format of 4k but an NPWG of 16k (min io size)
>
> mkfs.xfs -f -b size=16k /dev/nvme3n1
> blkalgn -d nvme3n1 --ops Write
>
> Block size : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 0 | |
> 1024 -> 2047 : 0 | |
> 2048 -> 4095 : 0 | |
> 4096 -> 8191 : 0 | |
> 8192 -> 16383 : 0 | |
> 16384 -> 32767 : 66 |****************************************|
> 32768 -> 65535 : 0 | |
> 65536 -> 131071 : 0 | |
> 131072 -> 262143 : 2 |* |
> Block size: 14 - 66
> Block size: 17 - 2
>
> Algn size : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 0 | |
> 1024 -> 2047 : 0 | |
> 2048 -> 4095 : 0 | |
> 4096 -> 8191 : 0 | |
> 8192 -> 16383 : 0 | |
> 16384 -> 32767 : 66 |****************************************|
> 32768 -> 65535 : 0 | |
> 65536 -> 131071 : 0 | |
> 131072 -> 262143 : 2 |* |
> Algn size: 14 - 66
> Algn size: 17 - 2
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
> block/bdev.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-02-05 16:18 ` Hannes Reinecke
@ 2025-02-05 22:03 ` Matthew Wilcox
2025-02-06 7:17 ` Hannes Reinecke
0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-05 22:03 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Luis Chamberlain, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Wed, Feb 05, 2025 at 05:18:20PM +0100, Hannes Reinecke wrote:
> One wonders: shouldn't we use plugging here to make I/O more efficient?
Should we plug at a higher level?
Opposite question: What if getblk() needs to do a read (ie ext2 indirect
block)?
* Re: [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-02-05 22:03 ` Matthew Wilcox
@ 2025-02-06 7:17 ` Hannes Reinecke
2025-02-06 17:30 ` Luis Chamberlain
0 siblings, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-06 7:17 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Luis Chamberlain, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On 2/5/25 23:03, Matthew Wilcox wrote:
> On Wed, Feb 05, 2025 at 05:18:20PM +0100, Hannes Reinecke wrote:
>> One wonders: shouldn't we use plugging here to make I/O more efficient?
>
> Should we plug at a higher level?
>
> Opposite question: What if getblk() needs to do a read (ie ext2 indirect
> block)?
Ah, that. Yes, plugging at a higher level would be a good idea.
(And can we check for nested plugs? _Should_ we check for nested plugs?)
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-02-06 7:17 ` Hannes Reinecke
@ 2025-02-06 17:30 ` Luis Chamberlain
2025-02-07 7:06 ` Hannes Reinecke
0 siblings, 1 reply; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-06 17:30 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Matthew Wilcox, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Thu, Feb 06, 2025 at 08:17:32AM +0100, Hannes Reinecke wrote:
> On 2/5/25 23:03, Matthew Wilcox wrote:
> > On Wed, Feb 05, 2025 at 05:18:20PM +0100, Hannes Reinecke wrote:
> > > One wonders: shouldn't we use plugging here to make I/O more efficient?
> >
> > Should we plug at a higher level?
> >
> > Opposite question: What if getblk() needs to do a read (ie ext2 indirect
> > block)?
>
> Ah, that. Yes, plugging on higher level would be a good idea.
> (And can we check for nested plugs? _Should_ we check for nested plugs?)
I think given the discussion less is more for now, and if we really want
this we can add it later. Thoughts?
Luis
* Re: [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset()
2025-02-06 17:30 ` Luis Chamberlain
@ 2025-02-07 7:06 ` Hannes Reinecke
0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-07 7:06 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Matthew Wilcox, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On 2/6/25 18:30, Luis Chamberlain wrote:
> On Thu, Feb 06, 2025 at 08:17:32AM +0100, Hannes Reinecke wrote:
>> On 2/5/25 23:03, Matthew Wilcox wrote:
>>> On Wed, Feb 05, 2025 at 05:18:20PM +0100, Hannes Reinecke wrote:
>>>> One wonders: shouldn't we use plugging here to make I/O more efficient?
>>>
>>> Should we plug at a higher level?
>>>
>>> Opposite question: What if getblk() needs to do a read (ie ext2 indirect
>>> block)?
>>
>> Ah, that. Yes, plugging on higher level would be a good idea.
>> (And can we check for nested plugs? _Should_ we check for nested plugs?)
>
> I think given the discussion less is more for now, and if we really want
> this we can add it later. Thoughts?
>
Yeah, go for it.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 2/8] fs/buffer: remove batching from async read
2025-02-05 16:21 ` Hannes Reinecke
@ 2025-02-07 7:08 ` Hannes Reinecke
0 siblings, 0 replies; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-07 7:08 UTC (permalink / raw)
To: Luis Chamberlain, willy, dave, david, djwong, kbusch
Cc: john.g.garry, hch, ritesh.list, linux-fsdevel, linux-xfs,
linux-mm, linux-block, gost.dev, p.raghav, da.gomez, kernel
On 2/5/25 17:21, Hannes Reinecke wrote:
> On 2/5/25 00:12, Luis Chamberlain wrote:
>> From: Matthew Wilcox <willy@infradead.org>
>>
>> The current implementation of an async folio read in
>> block_read_full_folio()
>> first batches all buffer-heads which need IO issued by putting them
>> on an
>> array of max size MAX_BUF_PER_PAGE. After collection it locks the batched
>> buffer-heads and finally submits the pending reads. On systems with CPUs
>> where the system page size is quite large, like Hexagon with 256 KiB, this
>> batching can lead to stack growth warnings, so we want to avoid that.
>>
>> Note the use of folio_end_read() throughout block_read_full_folio(): it's
>> used either when the folio is determined to be fully uptodate and no
>> pending read is needed, when an IO error happened in get_block(), or when
>> an out-of-bounds read raced against batch collection to make our required
>> reads uptodate.
>>
>> We can simplify this logic considerably and remove the stack growth
>> issues of MAX_BUF_PER_PAGE by replacing the batched logic with one
>> which only issues IO for the previous buffer-head. Keeping in mind that
>> we'll always have one buffer-head (the current one) on the folio with
>> the async flag set, this prevents any premature calls to folio_end_read().
>>
>> So we accomplish two things with this:
>>
>> o Avoid large stack arrays with MAX_BUF_PER_PAGE
>> o Make the need for folio_end_read() explicit and easier to read
>>
>> Suggested-by: Matthew Wilcox <willy@infradead.org>
>> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
>> ---
>> fs/buffer.c | 51 +++++++++++++++++++++------------------------------
>> 1 file changed, 21 insertions(+), 30 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index b99560e8a142..167fa3e33566 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -2361,9 +2361,8 @@ int block_read_full_folio(struct folio *folio,
>> get_block_t *get_block)
>> {
>> struct inode *inode = folio->mapping->host;
>> sector_t iblock, lblock;
>> - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
>> + struct buffer_head *bh, *head, *prev = NULL;
>> size_t blocksize;
>> - int nr, i;
>> int fully_mapped = 1;
>> bool page_error = false;
>> loff_t limit = i_size_read(inode);
>> @@ -2380,7 +2379,6 @@ int block_read_full_folio(struct folio *folio,
>> get_block_t *get_block)
>> iblock = div_u64(folio_pos(folio), blocksize);
>> lblock = div_u64(limit + blocksize - 1, blocksize);
>> bh = head;
>> - nr = 0;
>> do {
>> if (buffer_uptodate(bh))
>> @@ -2410,40 +2408,33 @@ int block_read_full_folio(struct folio *folio,
>> get_block_t *get_block)
>> if (buffer_uptodate(bh))
>> continue;
>> }
>> - arr[nr++] = bh;
>> +
>> + lock_buffer(bh);
>> + if (buffer_uptodate(bh)) {
>> + unlock_buffer(bh);
>> + continue;
>> + }
>> +
>> + mark_buffer_async_read(bh);
>> + if (prev)
>> + submit_bh(REQ_OP_READ, prev);
>> + prev = bh;
>> } while (iblock++, (bh = bh->b_this_page) != head);
>> if (fully_mapped)
>> folio_set_mappedtodisk(folio);
>> - if (!nr) {
>> - /*
>> - * All buffers are uptodate or get_block() returned an
>> - * error when trying to map them - we can finish the read.
>> - */
>> - folio_end_read(folio, !page_error);
>> - return 0;
>> - }
>> -
>> - /* Stage two: lock the buffers */
>> - for (i = 0; i < nr; i++) {
>> - bh = arr[i];
>> - lock_buffer(bh);
>> - mark_buffer_async_read(bh);
>> - }
>> -
>> /*
>> - * Stage 3: start the IO. Check for uptodateness
>> - * inside the buffer lock in case another process reading
>> - * the underlying blockdev brought it uptodate (the sct fix).
>> + * All buffers are uptodate or get_block() returned an error
>> + * when trying to map them - we must finish the read because
>> + * end_buffer_async_read() will never be called on any buffer
>> + * in this folio.
>> */
>> - for (i = 0; i < nr; i++) {
>> - bh = arr[i];
>> - if (buffer_uptodate(bh))
>> - end_buffer_async_read(bh, 1);
>> - else
>> - submit_bh(REQ_OP_READ, bh);
>> - }
>> + if (prev)
>> + submit_bh(REQ_OP_READ, prev);
>> + else
>> + folio_end_read(folio, !page_error);
>> +
>> return 0;
>> }
>> EXPORT_SYMBOL(block_read_full_folio);
>
> Similar here; as we've now removed batching (which technically could result
> in I/O being completed while executing the various stages) there really
> is nothing preventing us from using plugging here, no?
>
In light of the discussion on the previous patch we should move that
to a later point. So:
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 2/8] fs/buffer: remove batching from async read
2025-02-04 23:12 ` [PATCH v2 2/8] fs/buffer: remove batching from async read Luis Chamberlain
2025-02-05 16:21 ` Hannes Reinecke
@ 2025-02-17 21:40 ` Matthew Wilcox
1 sibling, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-17 21:40 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Tue, Feb 04, 2025 at 03:12:03PM -0800, Luis Chamberlain wrote:
> From: Matthew Wilcox <willy@infradead.org>
From: Matthew Wilcox (Oracle) <willy@infradead.org>
block_read_full_folio() currently puts all !uptodate buffers into
an array allocated on the stack, then iterates over it twice, first
locking the buffers and then submitting them for read. We want to
remove this array because it occupies too much stack space on
configurations with a larger PAGE_SIZE (eg a 512-byte block size with
8-byte pointers and a 64KiB PAGE_SIZE yields a 1KiB array).
We cannot simply submit buffer heads as we find them as the completion
handler needs to be able to tell when all reads are finished, so it can
end the folio read. So we keep one buffer in reserve (using the 'prev'
variable) until the end of the function.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b99560e8a142..167fa3e33566 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,9 +2361,8 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> {
> struct inode *inode = folio->mapping->host;
> sector_t iblock, lblock;
> - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> + struct buffer_head *bh, *head, *prev = NULL;
> size_t blocksize;
> - int nr, i;
> int fully_mapped = 1;
> bool page_error = false;
> loff_t limit = i_size_read(inode);
> @@ -2380,7 +2379,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> iblock = div_u64(folio_pos(folio), blocksize);
> lblock = div_u64(limit + blocksize - 1, blocksize);
> bh = head;
> - nr = 0;
>
> do {
> if (buffer_uptodate(bh))
> @@ -2410,40 +2408,33 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> if (buffer_uptodate(bh))
> continue;
> }
> - arr[nr++] = bh;
> +
> + lock_buffer(bh);
> + if (buffer_uptodate(bh)) {
> + unlock_buffer(bh);
> + continue;
> + }
> +
> + mark_buffer_async_read(bh);
> + if (prev)
> + submit_bh(REQ_OP_READ, prev);
> + prev = bh;
> } while (iblock++, (bh = bh->b_this_page) != head);
>
> if (fully_mapped)
> folio_set_mappedtodisk(folio);
>
> - if (!nr) {
> - /*
> - * All buffers are uptodate or get_block() returned an
> - * error when trying to map them - we can finish the read.
> - */
> - folio_end_read(folio, !page_error);
> - return 0;
> - }
> -
> - /* Stage two: lock the buffers */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - lock_buffer(bh);
> - mark_buffer_async_read(bh);
> - }
> -
> /*
> - * Stage 3: start the IO. Check for uptodateness
> - * inside the buffer lock in case another process reading
> - * the underlying blockdev brought it uptodate (the sct fix).
> + * All buffers are uptodate or get_block() returned an error
> + * when trying to map them - we must finish the read because
> + * end_buffer_async_read() will never be called on any buffer
> + * in this folio.
> */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - if (buffer_uptodate(bh))
> - end_buffer_async_read(bh, 1);
> - else
> - submit_bh(REQ_OP_READ, bh);
> - }
> + if (prev)
> + submit_bh(REQ_OP_READ, prev);
> + else
> + folio_end_read(folio, !page_error);
> +
> return 0;
> }
> EXPORT_SYMBOL(block_read_full_folio);
> --
> 2.45.2
>
* Re: [PATCH v2 3/8] fs/mpage: avoid negative shift for large blocksize
2025-02-04 23:12 ` [PATCH v2 3/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
@ 2025-02-17 21:48 ` Matthew Wilcox
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-17 21:48 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel, Hannes Reinecke
On Tue, Feb 04, 2025 at 03:12:04PM -0800, Luis Chamberlain wrote:
> From: Hannes Reinecke <hare@kernel.org>
>
> For large blocksizes the number of block bits is larger than PAGE_SHIFT,
> so instead use folio_pos(folio) >> blkbits to calculate the sector
"so calculate the sector number from the byte offset instead"
> number; shifting by the now-negative (PAGE_SHIFT - blkbits) would be
> undefined behaviour in C. This is required to enable large folios with
> buffer-heads.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
* Re: [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-04 23:12 ` [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
@ 2025-02-17 21:58 ` Matthew Wilcox
2025-02-18 15:02 ` Hannes Reinecke
0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-17 21:58 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Tue, Feb 04, 2025 at 03:12:05PM -0800, Luis Chamberlain wrote:
> @@ -182,7 +182,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
> goto confused;
>
> block_in_file = folio_pos(folio) >> blkbits;
> - last_block = block_in_file + args->nr_pages * blocks_per_page;
> + last_block = block_in_file + args->nr_pages * blocks_per_folio;
In mpage_readahead(), we set args->nr_pages to the number of pages (not
folios) being requested. In mpage_read_folio() we currently set it to
1. So this is going to read too far ahead for readahead if using large
folios.
I think we need to make nr_pages continue to mean nr_pages. Or we pass
in nr_bytes or nr_blocks.
* Re: [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction
2025-02-04 23:12 ` [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
2025-02-05 16:21 ` Hannes Reinecke
@ 2025-02-17 21:59 ` Matthew Wilcox
1 sibling, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-17 21:59 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Tue, Feb 04, 2025 at 03:12:06PM -0800, Luis Chamberlain wrote:
> Now that buffer-heads have been converted over to support large folios
> we can remove the built-in VM_BUG_ON_FOLIO() checks which prevented
> their use.
>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 6/8] block/bdev: enable large folio support for large logical block sizes
2025-02-04 23:12 ` [PATCH v2 6/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
@ 2025-02-17 21:59 ` Matthew Wilcox
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-17 21:59 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Tue, Feb 04, 2025 at 03:12:07PM -0800, Luis Chamberlain wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Call mapping_set_folio_min_order() when modifying the logical block
> size to ensure folios are allocated with the correct size.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
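For readers following along, a hedged sketch of the kind of call this
patch adds when the logical block size changes; the wrapper name here
is hypothetical, while mapping_set_folio_min_order() and get_order()
are real kernel APIs (usual kernel includes assumed):

#include <linux/pagemap.h>

/*
 * Sketch: make the page cache allocate folios at least as large as
 * the logical block size; get_order() converts a byte count to a
 * page allocation order.
 */
static void bdev_update_min_folio_order(struct address_space *mapping,
                                        unsigned int block_size)
{
        mapping_set_folio_min_order(mapping, get_order(block_size));
}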
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 7/8] block/bdev: lift block size restrictions to 64k
2025-02-04 23:12 ` [PATCH v2 7/8] block/bdev: lift block size restrictions to 64k Luis Chamberlain
@ 2025-02-17 22:01 ` Matthew Wilcox
0 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-17 22:01 UTC (permalink / raw)
To: Luis Chamberlain
Cc: hare, dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On Tue, Feb 04, 2025 at 03:12:08PM -0800, Luis Chamberlain wrote:
> We can now support block sizes larger than PAGE_SIZE, so in theory
> we should be able to lift the restriction up to the max supported page
> cache order. However, we bound ourselves to what we can currently
> validate and test. Through blktests and fstests we can validate up to
> 64k today.
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 248416ecd01c..a89513302977 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -26,6 +26,7 @@
> #include <linux/xarray.h>
> #include <linux/file.h>
> #include <linux/lockdep.h>
> +#include <linux/pagemap.h>
We can drop this until we actually use
> + * We should strive for 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER)
right? I don't see anything else that needs this include.
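For context, the lifted bound amounts to replacing the old PAGE_SIZE
cap in block size validation with a 64k constant; a sketch under that
assumption (constant name and placement per the actual patch):

#include <linux/log2.h>

/*
 * Sketch: cap at 64k, what blktests and fstests can validate today,
 * rather than 1 << (PAGE_SHIFT + MAX_PAGECACHE_ORDER).
 */
#define BLK_MAX_BLOCK_SIZE      (64 * 1024)

static inline int blk_validate_block_size(unsigned long bsize)
{
        if (bsize < 512 || bsize > BLK_MAX_BLOCK_SIZE ||
            !is_power_of_2(bsize))
                return -EINVAL;
        return 0;
}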
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-17 21:58 ` Matthew Wilcox
@ 2025-02-18 15:02 ` Hannes Reinecke
2025-02-21 18:58 ` Luis Chamberlain
0 siblings, 1 reply; 30+ messages in thread
From: Hannes Reinecke @ 2025-02-18 15:02 UTC (permalink / raw)
To: Matthew Wilcox, Luis Chamberlain
Cc: dave, david, djwong, kbusch, john.g.garry, hch, ritesh.list,
linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
p.raghav, da.gomez, kernel
On 2/17/25 22:58, Matthew Wilcox wrote:
> On Tue, Feb 04, 2025 at 03:12:05PM -0800, Luis Chamberlain wrote:
>> @@ -182,7 +182,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>> goto confused;
>>
>> block_in_file = folio_pos(folio) >> blkbits;
>> - last_block = block_in_file + args->nr_pages * blocks_per_page;
>> + last_block = block_in_file + args->nr_pages * blocks_per_folio;
>
> In mpage_readahead(), we set args->nr_pages to the number of pages (not
> folios) being requested. In mpage_read_folio() we currently set it to
> 1. So this is going to read too far ahead for readahead if using large
> folios.
>
> I think we need to make nr_pages continue to mean nr_pages. Or we pass
> in nr_bytes or nr_blocks.
>
I had been pondering this, too, while developing the patch.
The idea I had here was to change counting by pages over to counting by
folios, as then the logic is essentially unchanged.
Not a big fan of 'nr_pages', as then the question really is how much
data we should read at the end of the day. So I'd rather go with
'nr_blocks' to avoid any confusion.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
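A hypothetical sketch of the nr_blocks variant Hannes floats above;
this is not what the series does, and the field layout is assumed from
fs/mpage.c:

/*
 * Hypothetical: size the request in blocks so the unit is unambiguous
 * regardless of page or folio size.
 */
struct mpage_readpage_args {
        struct bio *bio;
        struct folio *folio;
        unsigned long nr_blocks;        /* was: unsigned nr_pages */
        bool is_readahead;
        sector_t last_block_in_bio;
        struct buffer_head map_bh;
        unsigned long first_logical_block;
        get_block_t *get_block;
};

/* last_block would then need no page/folio unit conversion:
 *      last_block = block_in_file + args->nr_blocks;
 */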
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-18 15:02 ` Hannes Reinecke
@ 2025-02-21 18:58 ` Luis Chamberlain
2025-02-21 20:25 ` Matthew Wilcox
2025-02-21 20:27 ` Matthew Wilcox
0 siblings, 2 replies; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-21 18:58 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Matthew Wilcox, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Tue, Feb 18, 2025 at 04:02:43PM +0100, Hannes Reinecke wrote:
> On 2/17/25 22:58, Matthew Wilcox wrote:
> > On Tue, Feb 04, 2025 at 03:12:05PM -0800, Luis Chamberlain wrote:
> > > @@ -182,7 +182,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
> > > goto confused;
> > > block_in_file = folio_pos(folio) >> blkbits;
> > > - last_block = block_in_file + args->nr_pages * blocks_per_page;
> > > + last_block = block_in_file + args->nr_pages * blocks_per_folio;
> >
> > In mpage_readahead(), we set args->nr_pages to the number of pages (not
> > folios) being requested. In mpage_read_folio() we currently set it to
> > 1. So this is going to read too far ahead for readahead if using large
> > folios.
> >
> > I think we need to make nr_pages continue to mean nr_pages. Or we pass
> > in nr_bytes or nr_blocks.
> >
> I had been pondering this, too, while developing the patch.
> The idea I had here was to change counting by pages over to counting by
> folios, as then the logic is essentially unchanged.
>
> Not a big fan of 'nr_pages', as then the question really is how much
> data we should read at the end of the day. So I'd rather go with 'nr_blocks'
> to avoid any confusion.
I think the easier answer is to adjust nr_pages in terms of min-order
requirements and fix the last_block computation so we don't lie for
large folios, as follows. While at it, I noticed a folio_zero_segment()
call was missing folio_size().
diff --git a/fs/mpage.c b/fs/mpage.c
index c17d7a724e4b..624bf30f0b2e 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -152,6 +152,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
{
struct folio *folio = args->folio;
struct inode *inode = folio->mapping->host;
+ const unsigned min_nrpages = mapping_min_folio_nrpages(folio->mapping);
const unsigned blkbits = inode->i_blkbits;
const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
const unsigned blocksize = 1 << blkbits;
@@ -172,6 +173,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
/* MAX_BUF_PER_PAGE, for example */
VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
+ VM_BUG_ON_FOLIO(args->nr_pages < min_nrpages, folio);
+ VM_BUG_ON_FOLIO(!IS_ALIGNED(args->nr_pages, min_nrpages), folio);
if (args->is_readahead) {
opf |= REQ_RAHEAD;
@@ -182,7 +185,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
block_in_file = folio_pos(folio) >> blkbits;
- last_block = block_in_file + args->nr_pages * blocks_per_folio;
+ last_block = block_in_file + ((args->nr_pages * PAGE_SIZE) >> blkbits);
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
last_block = last_block_in_file;
@@ -269,7 +272,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
}
if (first_hole != blocks_per_folio) {
- folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
+ folio_zero_segment(folio, first_hole << blkbits, folio_size(folio));
if (first_hole == 0) {
folio_mark_uptodate(folio);
folio_unlock(folio);
@@ -385,7 +388,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
{
struct mpage_readpage_args args = {
.folio = folio,
- .nr_pages = 1,
+ .nr_pages = mapping_min_folio_nrpages(folio->mapping),
.get_block = get_block,
};
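A quick userspace check that the fixed expression sizes the window in
bytes, assuming 4 KiB pages and 16 KiB logical blocks:

#include <stdio.h>

int main(void)
{
        const unsigned blkbits = 14;            /* 16 KiB blocks */
        const unsigned long nr_pages = 32;      /* 128 KiB window */

        /* (nr_pages * PAGE_SIZE) >> blkbits: 131072 >> 14 == 8 */
        printf("%lu blocks\n", (nr_pages * 4096) >> blkbits);
        return 0;
}

Eight 16 KiB blocks is exactly the 128 KiB the readahead window asked
for, independent of how that range is split into folios.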
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-21 18:58 ` Luis Chamberlain
@ 2025-02-21 20:25 ` Matthew Wilcox
2025-02-21 20:38 ` Luis Chamberlain
2025-02-21 20:27 ` Matthew Wilcox
1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-21 20:25 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Hannes Reinecke, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Fri, Feb 21, 2025 at 10:58:58AM -0800, Luis Chamberlain wrote:
> @@ -385,7 +388,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
> {
> struct mpage_readpage_args args = {
> .folio = folio,
> - .nr_pages = 1,
> + .nr_pages = mapping_min_folio_nrpages(folio->mapping),
.nr_pages = folio_nr_pages(folio),
since the folio is not necessarily the minimum size.
> .get_block = get_block,
> };
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-21 18:58 ` Luis Chamberlain
2025-02-21 20:25 ` Matthew Wilcox
@ 2025-02-21 20:27 ` Matthew Wilcox
2025-02-21 20:39 ` Luis Chamberlain
1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2025-02-21 20:27 UTC (permalink / raw)
To: Luis Chamberlain
Cc: Hannes Reinecke, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Fri, Feb 21, 2025 at 10:58:58AM -0800, Luis Chamberlain wrote:
> +++ b/fs/mpage.c
> @@ -152,6 +152,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
> {
> struct folio *folio = args->folio;
> struct inode *inode = folio->mapping->host;
> + const unsigned min_nrpages = mapping_min_folio_nrpages(folio->mapping);
> const unsigned blkbits = inode->i_blkbits;
> const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
> const unsigned blocksize = 1 << blkbits;
> @@ -172,6 +173,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>
> /* MAX_BUF_PER_PAGE, for example */
> VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> + VM_BUG_ON_FOLIO(args->nr_pages < min_nrpages, folio);
> + VM_BUG_ON_FOLIO(!IS_ALIGNED(args->nr_pages, min_nrpages), folio);
>
> if (args->is_readahead) {
> opf |= REQ_RAHEAD;
Also, I don't think these assertions add any value; we already assert
these things are true in other places.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-21 20:25 ` Matthew Wilcox
@ 2025-02-21 20:38 ` Luis Chamberlain
0 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-21 20:38 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Hannes Reinecke, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Fri, Feb 21, 2025 at 08:25:11PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 21, 2025 at 10:58:58AM -0800, Luis Chamberlain wrote:
> > @@ -385,7 +388,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
> > {
> > struct mpage_readpage_args args = {
> > .folio = folio,
> > - .nr_pages = 1,
> > + .nr_pages = mapping_min_folio_nrpages(folio->mapping),
>
> .nr_pages = folio_nr_pages(folio),
>
> since the folio is not necessarily the minimum size.
Will roll this in for tests before a new v3.
Luis
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page
2025-02-21 20:27 ` Matthew Wilcox
@ 2025-02-21 20:39 ` Luis Chamberlain
0 siblings, 0 replies; 30+ messages in thread
From: Luis Chamberlain @ 2025-02-21 20:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Hannes Reinecke, dave, david, djwong, kbusch, john.g.garry, hch,
ritesh.list, linux-fsdevel, linux-xfs, linux-mm, linux-block,
gost.dev, p.raghav, da.gomez, kernel
On Fri, Feb 21, 2025 at 08:27:17PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 21, 2025 at 10:58:58AM -0800, Luis Chamberlain wrote:
> > +++ b/fs/mpage.c
> > @@ -152,6 +152,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
> > {
> > struct folio *folio = args->folio;
> > struct inode *inode = folio->mapping->host;
> > + const unsigned min_nrpages = mapping_min_folio_nrpages(folio->mapping);
> > const unsigned blkbits = inode->i_blkbits;
> > const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
> > const unsigned blocksize = 1 << blkbits;
> > @@ -172,6 +173,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
> >
> > /* MAX_BUF_PER_PAGE, for example */
> > VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> > + VM_BUG_ON_FOLIO(args->nr_pages < min_nrpages, folio);
> > + VM_BUG_ON_FOLIO(!IS_ALIGNED(args->nr_pages, min_nrpages), folio);
> >
> > if (args->is_readahead) {
> > opf |= REQ_RAHEAD;
>
> Also, I don't think these assertions add any value; we already assert
> these things are true in other places.
Sure, it may not have been clear to others, but that doesn't mean we
need to be explicit about it in code; the commit log alone can justify
this. Will remove.
Luis
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread (newest message: 2025-02-21 20:39 UTC)
Thread overview: 30+ messages
2025-02-04 23:12 [PATCH v2 0/8] enable bs > ps for block devices Luis Chamberlain
2025-02-04 23:12 ` [PATCH v2 1/8] fs/buffer: simplify block_read_full_folio() with bh_offset() Luis Chamberlain
2025-02-05 16:18 ` Hannes Reinecke
2025-02-05 22:03 ` Matthew Wilcox
2025-02-06 7:17 ` Hannes Reinecke
2025-02-06 17:30 ` Luis Chamberlain
2025-02-07 7:06 ` Hannes Reinecke
2025-02-04 23:12 ` [PATCH v2 2/8] fs/buffer: remove batching from async read Luis Chamberlain
2025-02-05 16:21 ` Hannes Reinecke
2025-02-07 7:08 ` Hannes Reinecke
2025-02-17 21:40 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 3/8] fs/mpage: avoid negative shift for large blocksize Luis Chamberlain
2025-02-17 21:48 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 4/8] fs/mpage: use blocks_per_folio instead of blocks_per_page Luis Chamberlain
2025-02-17 21:58 ` Matthew Wilcox
2025-02-18 15:02 ` Hannes Reinecke
2025-02-21 18:58 ` Luis Chamberlain
2025-02-21 20:25 ` Matthew Wilcox
2025-02-21 20:38 ` Luis Chamberlain
2025-02-21 20:27 ` Matthew Wilcox
2025-02-21 20:39 ` Luis Chamberlain
2025-02-04 23:12 ` [PATCH v2 5/8] fs/buffer fs/mpage: remove large folio restriction Luis Chamberlain
2025-02-05 16:21 ` Hannes Reinecke
2025-02-17 21:59 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 6/8] block/bdev: enable large folio support for large logical block sizes Luis Chamberlain
2025-02-17 21:59 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 7/8] block/bdev: lift block size restrictions to 64k Luis Chamberlain
2025-02-17 22:01 ` Matthew Wilcox
2025-02-04 23:12 ` [PATCH v2 8/8] bdev: use bdev_io_min() for statx block size Luis Chamberlain
2025-02-05 16:22 ` Hannes Reinecke