* [PATCHv2 0/6] enable bs > ps for block devices
@ 2024-05-14 17:38 Hannes Reinecke
2024-05-14 17:38 ` [PATCH 1/6] fs/mpage: avoid negative shift for large blocksize Hannes Reinecke
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-14 17:38 UTC
To: Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme, Hannes Reinecke
Hi all,
Based on the patch series from Pankaj '[PATCH v5 00/11] enable bs > ps in XFS',
it is now quite simple to enable support for block devices with block sizes
larger than the page size, even without having to disable CONFIG_BUFFER_HEAD.
The patchset consists of two rather trivial patches to fs/mpage, a fix to bio
splitting in blk-merge, a patch setting the minimum folio order in block/bdev,
and two patches to remove hardcoded restrictions on the block size.
As usual, comments and reviews are welcome.
Changes to the original submission:
- Include reviews from Matthew
- Include reviews from Luis
- Fixup crash in bio_split()
Hannes Reinecke (5):
fs/mpage: avoid negative shift for large blocksize
fs/mpage: use blocks_per_folio instead of blocks_per_page
blk-merge: split bio by max_segment_size, not PAGE_SIZE
block/bdev: enable large folio support for large logical block sizes
block/bdev: lift restrictions on supported blocksize
Pankaj Raghav (1):
nvme: enable logical block size > PAGE_SIZE
block/bdev.c | 11 ++++++---
block/blk-merge.c | 3 ++-
drivers/nvme/host/core.c | 8 +++----
fs/mpage.c | 49 +++++++++++++++++++---------------------
4 files changed, 37 insertions(+), 34 deletions(-)
--
2.35.3
* [PATCH 1/6] fs/mpage: avoid negative shift for large blocksize
2024-05-14 17:38 [PATCHv2 0/6] enable bs > ps for block devices Hannes Reinecke
@ 2024-05-14 17:38 ` Hannes Reinecke
2024-05-14 17:38 ` [PATCH 2/6] fs/mpage: use blocks_per_folio instead of blocks_per_page Hannes Reinecke
` (4 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-14 17:38 UTC
To: Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme, Hannes Reinecke
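For block sizes larger than the page size, the number of block bits
(blkbits) is larger than PAGE_SHIFT, so the shift count in
'folio->index << (PAGE_SHIFT - blkbits)' would be negative. Use
folio_pos() to calculate the block number from the folio instead.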
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
fs/mpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index fa8b99a199fa..558b627d382c 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -188,7 +188,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (folio_buffers(folio))
goto confused;
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
last_block = block_in_file + args->nr_pages * blocks_per_page;
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
@@ -534,7 +534,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
* The page has no buffers: map it to disk
*/
BUG_ON(!folio_test_uptodate(folio));
- block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+ block_in_file = folio_pos(folio) >> blkbits;
/*
* Whole page beyond EOF? Skip allocating blocks to avoid leaking
* space.
--
2.35.3
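A minimal user-space sketch of the arithmetic in this patch, assuming 4 KiB pages (PAGE_SHIFT of 12). The helper names are hypothetical, and folio_pos() is modelled as the folio index shifted left by PAGE_SHIFT. The old formula is only defined while blkbits <= PAGE_SHIFT; the new one shifts the byte position down by blkbits and works for any block size.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

typedef uint64_t sector_t;

/* Old calculation: the shift count goes negative (undefined behaviour
 * in C) once blkbits exceeds the page shift, so guard that here. */
static sector_t block_in_file_old(uint64_t index, unsigned blkbits)
{
	assert(blkbits <= SKETCH_PAGE_SHIFT);
	return (sector_t)index << (SKETCH_PAGE_SHIFT - blkbits);
}

/* Models folio_pos(): byte offset of the folio within the file. */
static uint64_t folio_pos_sketch(uint64_t index)
{
	return index << SKETCH_PAGE_SHIFT;
}

/* New calculation: byte position shifted down by the block bits. */
static sector_t block_in_file_new(uint64_t index, unsigned blkbits)
{
	return folio_pos_sketch(index) >> blkbits;
}
```

For 512-byte blocks both formulas map page index 3 to block 24; for 16 KiB blocks (blkbits of 14) only the new formula is defined, and page index 4 maps to block 1.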
* [PATCH 2/6] fs/mpage: use blocks_per_folio instead of blocks_per_page
2024-05-14 17:38 [PATCHv2 0/6] enable bs > ps for block devices Hannes Reinecke
2024-05-14 17:38 ` [PATCH 1/6] fs/mpage: avoid negative shift for large blocksize Hannes Reinecke
@ 2024-05-14 17:38 ` Hannes Reinecke
2024-05-14 17:38 ` [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE Hannes Reinecke
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-14 17:38 UTC
To: Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme, Hannes Reinecke
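Convert mpage to folios and count blocks per folio rather than per
page, so that folios larger than a page are handled. This also drops
the VM_BUG_ON_FOLIO() that rejected large folios in do_mpage_readpage().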
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
fs/mpage.c | 45 +++++++++++++++++++++------------------------
1 file changed, 21 insertions(+), 24 deletions(-)
diff --git a/fs/mpage.c b/fs/mpage.c
index 558b627d382c..7cb9d9efdba8 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -114,7 +114,7 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
* don't make any buffers if there is only one buffer on
* the folio and the folio just needs to be set up to date
*/
- if (inode->i_blkbits == PAGE_SHIFT &&
+ if (inode->i_blkbits == folio_shift(folio) &&
buffer_uptodate(bh)) {
folio_mark_uptodate(folio);
return;
@@ -160,7 +160,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
struct folio *folio = args->folio;
struct inode *inode = folio->mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
const unsigned blocksize = 1 << blkbits;
struct buffer_head *map_bh = &args->map_bh;
sector_t block_in_file;
@@ -168,7 +168,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
sector_t last_block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_hole = blocks_per_page;
+ unsigned first_hole = blocks_per_folio;
struct block_device *bdev = NULL;
int length;
int fully_mapped = 1;
@@ -177,9 +177,6 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
unsigned relative_block;
gfp_t gfp = mapping_gfp_constraint(folio->mapping, GFP_KERNEL);
- /* MAX_BUF_PER_PAGE, for example */
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
if (args->is_readahead) {
opf |= REQ_RAHEAD;
gfp |= __GFP_NORETRY | __GFP_NOWARN;
@@ -189,7 +186,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
block_in_file = folio_pos(folio) >> blkbits;
- last_block = block_in_file + args->nr_pages * blocks_per_page;
+ last_block = block_in_file + ((args->nr_pages * PAGE_SIZE) >> blkbits);
last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
if (last_block > last_block_in_file)
last_block = last_block_in_file;
@@ -211,7 +208,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
clear_buffer_mapped(map_bh);
break;
}
- if (page_block == blocks_per_page)
+ if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -223,7 +220,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
* Then do more get_blocks calls until we are done with this folio.
*/
map_bh->b_folio = folio;
- while (page_block < blocks_per_page) {
+ while (page_block < blocks_per_folio) {
map_bh->b_state = 0;
map_bh->b_size = 0;
@@ -236,7 +233,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (!buffer_mapped(map_bh)) {
fully_mapped = 0;
- if (first_hole == blocks_per_page)
+ if (first_hole == blocks_per_folio)
first_hole = page_block;
page_block++;
block_in_file++;
@@ -254,7 +251,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
goto confused;
}
- if (first_hole != blocks_per_page)
+ if (first_hole != blocks_per_folio)
goto confused; /* hole -> non-hole */
/* Contiguous blocks? */
@@ -267,7 +264,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
if (relative_block == nblocks) {
clear_buffer_mapped(map_bh);
break;
- } else if (page_block == blocks_per_page)
+ } else if (page_block == blocks_per_folio)
break;
page_block++;
block_in_file++;
@@ -275,8 +272,8 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
bdev = map_bh->b_bdev;
}
- if (first_hole != blocks_per_page) {
- folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
+ if (first_hole != blocks_per_folio) {
+ folio_zero_segment(folio, first_hole << blkbits, folio_size(folio));
if (first_hole == 0) {
folio_mark_uptodate(folio);
folio_unlock(folio);
@@ -310,10 +307,10 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
relative_block = block_in_file - args->first_logical_block;
nblocks = map_bh->b_size >> blkbits;
if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
- (first_hole != blocks_per_page))
+ (first_hole != blocks_per_folio))
args->bio = mpage_bio_submit_read(args->bio);
else
- args->last_block_in_bio = first_block + blocks_per_page - 1;
+ args->last_block_in_bio = first_block + blocks_per_folio - 1;
out:
return args->bio;
@@ -392,7 +389,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
{
struct mpage_readpage_args args = {
.folio = folio,
- .nr_pages = 1,
+ .nr_pages = folio_nr_pages(folio),
.get_block = get_block,
};
@@ -463,12 +460,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
struct address_space *mapping = folio->mapping;
struct inode *inode = mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
sector_t last_block;
sector_t block_in_file;
sector_t first_block;
unsigned page_block;
- unsigned first_unmapped = blocks_per_page;
+ unsigned first_unmapped = blocks_per_folio;
struct block_device *bdev = NULL;
int boundary = 0;
sector_t boundary_block = 0;
@@ -493,12 +490,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
*/
if (buffer_dirty(bh))
goto confused;
- if (first_unmapped == blocks_per_page)
+ if (first_unmapped == blocks_per_folio)
first_unmapped = page_block;
continue;
}
- if (first_unmapped != blocks_per_page)
+ if (first_unmapped != blocks_per_folio)
goto confused; /* hole -> non-hole */
if (!buffer_dirty(bh) || !buffer_uptodate(bh))
@@ -543,7 +540,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
goto page_is_mapped;
last_block = (i_size - 1) >> blkbits;
map_bh.b_folio = folio;
- for (page_block = 0; page_block < blocks_per_page; ) {
+ for (page_block = 0; page_block < blocks_per_folio; ) {
map_bh.b_state = 0;
map_bh.b_size = 1 << blkbits;
@@ -625,14 +622,14 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
BUG_ON(folio_test_writeback(folio));
folio_start_writeback(folio);
folio_unlock(folio);
- if (boundary || (first_unmapped != blocks_per_page)) {
+ if (boundary || (first_unmapped != blocks_per_folio)) {
bio = mpage_bio_submit_write(bio);
if (boundary_block) {
write_boundary_block(boundary_bdev,
boundary_block, 1 << blkbits);
}
} else {
- mpd->last_block_in_bio = first_block + blocks_per_page - 1;
+ mpd->last_block_in_bio = first_block + blocks_per_folio - 1;
}
goto out;
--
2.35.3
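A user-space sketch of why the per-page count breaks down, assuming 4 KiB pages; the helpers are hypothetical stand-ins for the expressions in do_mpage_readpage() and __mpage_writepage(). With 16 KiB blocks, PAGE_SIZE >> blkbits truncates to zero, whereas folio_size(folio) >> blkbits yields one block per 16 KiB folio. The read-ahead end is now derived from the byte length of the window instead of nr_pages * blocks_per_page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

typedef uint64_t sector_t;

/* Old per-page count: truncates to 0 when a block is larger than a page. */
static unsigned blocks_per_page(unsigned blkbits)
{
	return SKETCH_PAGE_SIZE >> blkbits;
}

/* New per-folio count, as in the patch: folio_size(folio) >> blkbits. */
static unsigned blocks_per_folio(size_t folio_size, unsigned blkbits)
{
	return folio_size >> blkbits;
}

/* End of a read-ahead window of nr_pages pages starting at block_in_file,
 * mirroring the new last_block calculation. */
static sector_t last_block(sector_t block_in_file, unsigned nr_pages,
			   unsigned blkbits)
{
	return block_in_file +
	       (((uint64_t)nr_pages * SKETCH_PAGE_SIZE) >> blkbits);
}
```

For example, an eight-page window starting at block 10 with 16 KiB blocks covers 32 KiB, i.e. two blocks, so last_block() returns 12.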
* [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE
2024-05-14 17:38 [PATCHv2 0/6] enable bs > ps for block devices Hannes Reinecke
2024-05-14 17:38 ` [PATCH 1/6] fs/mpage: avoid negative shift for large blocksize Hannes Reinecke
2024-05-14 17:38 ` [PATCH 2/6] fs/mpage: use blocks_per_folio instead of blocks_per_page Hannes Reinecke
@ 2024-05-14 17:38 ` Hannes Reinecke
2024-05-15 0:20 ` John Garry
2024-05-14 17:38 ` [PATCH 4/6] block/bdev: enable large folio support for large logical block sizes Hannes Reinecke
` (2 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-14 17:38 UTC
To: Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme, Hannes Reinecke
Bvecs can be larger than a page, and the block layer handles
this just fine. So do not split by PAGE_SIZE but rather by
the max_segment_size if that happens to be larger.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
block/blk-merge.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4e3483a16b75..570573d7a34f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -278,6 +278,7 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
struct bio_vec bv, bvprv, *bvprvp = NULL;
struct bvec_iter iter;
unsigned nsegs = 0, bytes = 0;
+ unsigned bv_seg_lim = max(PAGE_SIZE, lim->max_segment_size);
bio_for_each_bvec(bv, bio, iter) {
/*
@@ -289,7 +290,7 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
if (nsegs < lim->max_segments &&
bytes + bv.bv_len <= max_bytes &&
- bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+ bv.bv_offset + bv.bv_len <= bv_seg_lim) {
nsegs++;
bytes += bv.bv_len;
} else {
--
2.35.3
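A sketch of the new per-bvec limit, assuming 4 KiB pages and modelling only the offset/length condition (not the max_segments or max_bytes checks). The helper names are hypothetical. It also illustrates the case John Garry raises in his reply: a max_segment_size of -1, as he reports scsi_debug uses, becomes UINT_MAX as an unsigned value, so this check alone never forces a split.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096U /* assumption: 4 KiB pages */

/* bv_seg_lim from the patch: max(PAGE_SIZE, lim->max_segment_size). */
static unsigned bv_seg_lim(unsigned max_segment_size)
{
	return max_segment_size > SKETCH_PAGE_SIZE ? max_segment_size
						   : SKETCH_PAGE_SIZE;
}

/* Offset/length part of the "no split needed" test in bio_split_rw();
 * the sum is widened so a limit of UINT_MAX cannot overflow here. */
static bool bvec_within_limit(unsigned bv_offset, unsigned bv_len,
			      unsigned seg_lim)
{
	return (uint64_t)bv_offset + bv_len <= seg_lim;
}
```

A 16 KiB bvec therefore passes with a 64 KiB max_segment_size but would still be split against the PAGE_SIZE floor when max_segment_size is 4 KiB or less.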
* [PATCH 4/6] block/bdev: enable large folio support for large logical block sizes
2024-05-14 17:38 [PATCHv2 0/6] enable bs > ps for block devices Hannes Reinecke
` (2 preceding siblings ...)
2024-05-14 17:38 ` [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE Hannes Reinecke
@ 2024-05-14 17:38 ` Hannes Reinecke
2024-05-15 4:21 ` kernel test robot
2024-05-14 17:38 ` [PATCH 5/6] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
2024-05-14 17:39 ` [PATCH 6/6] nvme: enable logical block size > PAGE_SIZE Hannes Reinecke
5 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-14 17:38 UTC
To: Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme, Hannes Reinecke
Call mapping_set_folio_min_order() when modifying the logical block
size to ensure folios are allocated with the correct size.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
block/bdev.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/block/bdev.c b/block/bdev.c
index b8e32d933a63..bd2efcad4f32 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -142,6 +142,8 @@ static void set_init_blocksize(struct block_device *bdev)
bsize <<= 1;
}
bdev->bd_inode->i_blkbits = blksize_bits(bsize);
+ mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
+ get_order(bsize));
}
int set_blocksize(struct block_device *bdev, int size)
@@ -158,6 +160,8 @@ int set_blocksize(struct block_device *bdev, int size)
if (bdev->bd_inode->i_blkbits != blksize_bits(size)) {
sync_blockdev(bdev);
bdev->bd_inode->i_blkbits = blksize_bits(size);
+ mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
+ get_order(size));
kill_bdev(bdev);
}
return 0;
--
2.35.3
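With a minimum folio order set, every page-cache folio of the block device is at least as large as one logical block. The sketch below models get_order() in user space, assuming 4 KiB pages; it is not the kernel implementation, only the same rounding rule (smallest order whose folio covers the given size). Block sizes up to PAGE_SIZE give order 0, so small block sizes keep single-page folios.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static int get_order_sketch(unsigned long size)
{
	int order = 0;

	assert(size >= 1);
	size = (size - 1) >> SKETCH_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* Minimum folio size in bytes implied by a minimum folio order. */
static unsigned long min_folio_bytes(int order)
{
	return (1UL << SKETCH_PAGE_SHIFT) << order;
}
```

For a 16 KiB logical block size this gives order 2, i.e. folios of at least four pages.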
* [PATCH 5/6] block/bdev: lift restrictions on supported blocksize
2024-05-14 17:38 [PATCHv2 0/6] enable bs > ps for block devices Hannes Reinecke
` (3 preceding siblings ...)
2024-05-14 17:38 ` [PATCH 4/6] block/bdev: enable large folio support for large logical block sizes Hannes Reinecke
@ 2024-05-14 17:38 ` Hannes Reinecke
2024-05-15 1:03 ` kernel test robot
2024-05-15 4:00 ` kernel test robot
2024-05-14 17:39 ` [PATCH 6/6] nvme: enable logical block size > PAGE_SIZE Hannes Reinecke
5 siblings, 2 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-14 17:38 UTC
To: Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme, Hannes Reinecke
We can now support block sizes larger than PAGE_SIZE, so lift
the restriction.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
block/bdev.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index bd2efcad4f32..f092a1b04629 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -148,8 +148,9 @@ static void set_init_blocksize(struct block_device *bdev)
int set_blocksize(struct block_device *bdev, int size)
{
- /* Size must be a power of two, and between 512 and PAGE_SIZE */
- if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+ /* Size must be a power of two, and between 512 and MAX_PAGECACHE_ORDER*/
+ if (get_order(bs) > MAX_PAGECACHE_ORDER || size < 512 ||
+ !is_power_of_2(size))
return -EINVAL;
/* Size cannot be smaller than the size supported by the device */
@@ -174,7 +175,7 @@ int sb_set_blocksize(struct super_block *sb, int size)
if (set_blocksize(sb->s_bdev, size))
return 0;
/* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ * and it's value is larger than 512 */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
--
2.35.3
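A user-space sketch of the intended validation in set_blocksize(), using `size` where the posted patch uses the undeclared `bs` (the build error the kernel test robot reports in its replies). MAX_PAGECACHE_ORDER is configuration-dependent in the kernel; the value 9 (2 MiB with 4 KiB pages) is only an assumption for illustration, and all helper names are hypothetical. Note that the upper bound is effectively PAGE_SIZE << MAX_PAGECACHE_ORDER bytes, not MAX_PAGECACHE_ORDER itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12		/* assumption: 4 KiB pages */
#define SKETCH_MAX_PAGECACHE_ORDER 9	/* assumption: config-dependent */

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static int get_order_sketch(unsigned long size)
{
	int order = 0;

	assert(size >= 1);
	size = (size - 1) >> SKETCH_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

static bool is_power_of_2_sketch(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Validation part of set_blocksize(), checking 'size' throughout. */
static int check_blocksize(int size, int logical_block_size)
{
	/* Power of two, at least 512, and within the page-cache order limit. */
	if (size < 512 || !is_power_of_2_sketch(size) ||
	    get_order_sketch(size) > SKETCH_MAX_PAGECACHE_ORDER)
		return -EINVAL;
	/* Cannot be smaller than the device's logical block size. */
	if (size < logical_block_size)
		return -EINVAL;
	return 0;
}
```

Checking `size < 512` first also keeps non-positive values away from get_order_sketch().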
* [PATCH 6/6] nvme: enable logical block size > PAGE_SIZE
2024-05-14 17:38 [PATCHv2 0/6] enable bs > ps for block devices Hannes Reinecke
` (4 preceding siblings ...)
2024-05-14 17:38 ` [PATCH 5/6] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
@ 2024-05-14 17:39 ` Hannes Reinecke
5 siblings, 0 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-14 17:39 UTC
To: Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme, Hannes Reinecke
From: Pankaj Raghav <p.raghav@samsung.com>
Don't set the capacity to zero when the logical block size is larger
than PAGE_SIZE, as block devices with iomap aops support allocating
the block cache with a minimum folio order.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/core.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 828c77fa13b7..111bf4197052 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1963,11 +1963,11 @@ static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
bool valid = true;
/*
- * The block layer can't support LBA sizes larger than the page size
- * or smaller than a sector size yet, so catch this early and don't
- * allow block I/O.
+ * The block layer can't support LBA sizes larger than
+ * MAX_PAGECACHE_ORDER or smaller than a sector size, so catch this
+ * early and don't allow block I/O.
*/
- if (head->lba_shift > PAGE_SHIFT || head->lba_shift < SECTOR_SHIFT) {
+ if (get_order(bs) > MAX_PAGECACHE_ORDER || head->lba_shift < SECTOR_SHIFT) {
bs = (1 << 9);
valid = false;
}
--
2.35.3
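For a power-of-two LBA size, get_order(1 << lba_shift) reduces to lba_shift - PAGE_SHIFT (or 0 for sizes up to a page), so the updated condition can be sketched as below. The 4 KiB page size and the MAX_PAGECACHE_ORDER value of 9 are assumptions, and the function names are hypothetical. LBA sizes that fail the check get the 512-byte fallback and valid = false, as in the existing code.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12		/* assumption: 4 KiB pages */
#define SKETCH_MAX_PAGECACHE_ORDER 9	/* assumption: config-dependent */
#define SKETCH_SECTOR_SHIFT 9		/* 512-byte sectors */

/* Folio order needed for a block of (1 << lba_shift) bytes. */
static int lba_order(unsigned lba_shift)
{
	return lba_shift > SKETCH_PAGE_SHIFT ?
	       (int)(lba_shift - SKETCH_PAGE_SHIFT) : 0;
}

/* Block I/O is allowed only if the LBA is at least a sector and its
 * folio order does not exceed the page-cache limit. */
static bool nvme_lba_usable(unsigned lba_shift)
{
	return lba_shift >= SKETCH_SECTOR_SHIFT &&
	       lba_order(lba_shift) <= SKETCH_MAX_PAGECACHE_ORDER;
}
```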
* Re: [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE
2024-05-14 17:38 ` [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE Hannes Reinecke
@ 2024-05-15 0:20 ` John Garry
2024-05-15 12:29 ` Hannes Reinecke
0 siblings, 1 reply; 14+ messages in thread
From: John Garry @ 2024-05-15 0:20 UTC
To: Hannes Reinecke, Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme
On 14/05/2024 11:38, Hannes Reinecke wrote:
> Bvecs can be larger than a page, and the block layer handles
> this just fine. So do not split by PAGE_SIZE but rather by
> the max_segment_size if that happens to be larger.
Can you check scsi_debug for this series? I took this series only up to
this change, and got:
Starting Load Kernel Module fuse...
[ 1.736470] ------------[ cut here ]------------
[ 1.737777] WARNING: CPU: 0 PID: 52 at block/blk-merge.c:581 __blk_rq_map_sg+0x46a/0x480
[ 1.738862] Modules linked in:
[ 1.739370] CPU: 0 PID: 52 Comm: kworker/0:1H Not tainted 6.9.0-00002-g4eaa50af9312-dirty #2416
[ 1.740474] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[ 1.741809] Workqueue: kblockd blk_mq_run_work_fn
[ 1.742379] RIP: 0010:__blk_rq_map_sg+0x46a/0x480
[ 1.742939] Code: 17 fe ff ff 44 89 58 0c 48 8b 01 e9 ec fc ff ff 43 8d 3c 06 48 8b 14 24 81 ff 00 10 00 00 0f 86 af fc ff ff e9 02 f0
[ 1.743015] systemd[1]: File System Check on Root Device was skipped because of a failed condition check (ConditionPathIsReadWrite=!/.
[ 1.745122] RSP: 0018:ff37636e4032bb90 EFLAGS: 00010212
[ 1.746419] systemd[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup fi.
[ 1.746891] RAX: 000000000000001c RBX: 00000000000001b0 RCX: ff28e6d8b0950a00
[ 1.747903] systemd[1]: (This warning is only shown for the first unit using IP firewalling.)
[ 1.748549] RDX: ff7662becb4ac482 RSI: 0000000000001000 RDI: 00000000fffffffd
[ 1.749688] systemd[1]: Starting Journal Service...
[ 1.749895] RBP: ff7662becb4abf80 R08: 0000000000000000 R09: ff28e6d880fadd40
[ 1.750965] R10: ff7662becb4ac480 R11: 0000000000000000 R12: 0000000000000000
[ 1.750966] R13: 0000000000000002 R14: 0000000000001000 R15: ff7662becb4ac480
[ 1.750970] FS: 0000000000000000(0000) GS:ff28e6da75c00000(0000) knlGS:0000000000000000
[ 1.750972] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.750973] CR2: 00007f7407f19000 CR3: 0000000100f24002 CR4: 0000000000771ef0
[ 1.750974] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.750975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.750976] PKRU: 55555554
[ 1.750977] Call Trace:
[ 1.750984] <TASK>
[ 1.750986] ? __warn+0x7e/0x130
[ 1.750992] ? __blk_rq_map_sg+0x46a/0x480
[ 1.750994] ? report_bug+0x18e/0x1a0
[ 1.750999] ? handle_bug+0x3d/0x70
[ 1.751003] ? exc_invalid_op+0x18/0x70
[ 1.751006] ? asm_exc_invalid_op+0x1a/0x20
[ 1.751009] ? __blk_rq_map_sg+0x46a/0x480
[ 1.751012] scsi_alloc_sgtables+0xb7/0x3f0
[ 1.751019] sd_init_command+0x177/0x9d0
[ 1.751023] scsi_queue_rq+0x7c1/0xae0
[ 1.751027] blk_mq_dispatch_rq_list+0x2bc/0x7c0
[ 1.751031] __blk_mq_sched_dispatch_requests+0x409/0x5c0
[ 1.751035] blk_mq_sched_dispatch_requests+0x2c/0x60
[ 1.751037] blk_mq_run_work_fn+0x5f/0x70
[ 1.751039] process_one_work+0x149/0x360
I suspect that you would also need to change the PAGE_SIZE check in
__blk_bios_map_sg(). However, I am not confident that the change
below is ok to begin with...
BTW, scsi_debug does use an insane max_segment_size of -1.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> block/blk-merge.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 4e3483a16b75..570573d7a34f 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -278,6 +278,7 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
> struct bio_vec bv, bvprv, *bvprvp = NULL;
> struct bvec_iter iter;
> unsigned nsegs = 0, bytes = 0;
> + unsigned bv_seg_lim = max(PAGE_SIZE, lim->max_segment_size);
>
> bio_for_each_bvec(bv, bio, iter) {
> /*
> @@ -289,7 +290,7 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
>
> if (nsegs < lim->max_segments &&
> bytes + bv.bv_len <= max_bytes &&
> - bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> + bv.bv_offset + bv.bv_len <= bv_seg_lim) {
> nsegs++;
> bytes += bv.bv_len;
> } else {
* Re: [PATCH 5/6] block/bdev: lift restrictions on supported blocksize
2024-05-14 17:38 ` [PATCH 5/6] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
@ 2024-05-15 1:03 ` kernel test robot
2024-05-15 4:00 ` kernel test robot
1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-05-15 1:03 UTC
To: Hannes Reinecke, Jens Axboe
Cc: llvm, oe-kbuild-all, Matthew Wilcox, Luis Chamberlain,
Pankaj Raghav, linux-block, linux-nvme, Hannes Reinecke
Hi Hannes,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on linus/master v6.9]
[cannot apply to next-20240514]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/fs-mpage-avoid-negative-shift-for-large-blocksize/20240515-014146
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20240514173900.62207-6-hare%40kernel.org
patch subject: [PATCH 5/6] block/bdev: lift restrictions on supported blocksize
config: s390-allnoconfig (https://download.01.org/0day-ci/archive/20240515/202405150852.LoiNtqk4-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project b910bebc300dafb30569cecc3017b446ea8eafa0)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240515/202405150852.LoiNtqk4-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405150852.LoiNtqk4-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from block/bdev.c:9:
In file included from include/linux/mm.h:2210:
include/linux/vmstat.h:522:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
522 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
In file included from block/bdev.c:15:
In file included from include/linux/blk-integrity.h:5:
In file included from include/linux/blk-mq.h:8:
In file included from include/linux/scatterlist.h:9:
In file included from arch/s390/include/asm/io.h:78:
include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
547 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
560 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:37:59: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
| ^
include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
102 | #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
| ^
In file included from block/bdev.c:15:
In file included from include/linux/blk-integrity.h:5:
In file included from include/linux/blk-mq.h:8:
In file included from include/linux/scatterlist.h:9:
In file included from arch/s390/include/asm/io.h:78:
include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
573 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/big_endian.h:35:59: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
| ^
include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
115 | #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
| ^
In file included from block/bdev.c:15:
In file included from include/linux/blk-integrity.h:5:
In file included from include/linux/blk-mq.h:8:
In file included from include/linux/scatterlist.h:9:
In file included from arch/s390/include/asm/io.h:78:
include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
584 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
594 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
604 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
692 | readsb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
700 | readsw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
708 | readsl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
717 | writesb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
726 | writesw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
735 | writesl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
block/bdev.c:145:2: error: call to undeclared function 'mapping_set_folio_min_order'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
145 | mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
| ^
>> block/bdev.c:152:16: error: use of undeclared identifier 'bs'
152 | if (get_order(bs) > MAX_PAGECACHE_ORDER || size < 512 ||
| ^
block/bdev.c:164:3: error: call to undeclared function 'mapping_set_folio_min_order'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
164 | mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
| ^
13 warnings and 3 errors generated.
vim +/bs +152 block/bdev.c
148
149 int set_blocksize(struct block_device *bdev, int size)
150 {
151 /* Size must be a power of two, and between 512 and MAX_PAGECACHE_ORDER*/
> 152 if (get_order(bs) > MAX_PAGECACHE_ORDER || size < 512 ||
153 !is_power_of_2(size))
154 return -EINVAL;
155
156 /* Size cannot be smaller than the size supported by the device */
157 if (size < bdev_logical_block_size(bdev))
158 return -EINVAL;
159
160 /* Don't change the size if it is same as current */
161 if (bdev->bd_inode->i_blkbits != blksize_bits(size)) {
162 sync_blockdev(bdev);
163 bdev->bd_inode->i_blkbits = blksize_bits(size);
164 mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
165 get_order(size));
166 kill_bdev(bdev);
167 }
168 return 0;
169 }
170
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 5/6] block/bdev: lift restrictions on supported blocksize
2024-05-14 17:38 ` [PATCH 5/6] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
2024-05-15 1:03 ` kernel test robot
@ 2024-05-15 4:00 ` kernel test robot
1 sibling, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-05-15 4:00 UTC
To: Hannes Reinecke, Jens Axboe
Cc: oe-kbuild-all, Matthew Wilcox, Luis Chamberlain, Pankaj Raghav,
linux-block, linux-nvme, Hannes Reinecke
Hi Hannes,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on linus/master v6.9]
[cannot apply to next-20240514]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/fs-mpage-avoid-negative-shift-for-large-blocksize/20240515-014146
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20240514173900.62207-6-hare%40kernel.org
patch subject: [PATCH 5/6] block/bdev: lift restrictions on supported blocksize
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240515/202405151142.8COQSJsa-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240515/202405151142.8COQSJsa-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405151142.8COQSJsa-lkp@intel.com/
All errors (new ones prefixed by >>):
block/bdev.c: In function 'set_init_blocksize':
block/bdev.c:145:9: error: implicit declaration of function 'mapping_set_folio_min_order' [-Werror=implicit-function-declaration]
145 | mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
block/bdev.c: In function 'set_blocksize':
>> block/bdev.c:152:23: error: 'bs' undeclared (first use in this function); did you mean 'abs'?
152 | if (get_order(bs) > MAX_PAGECACHE_ORDER || size < 512 ||
| ^~
| abs
block/bdev.c:152:23: note: each undeclared identifier is reported only once for each function it appears in
cc1: some warnings being treated as errors
vim +152 block/bdev.c
148
149 int set_blocksize(struct block_device *bdev, int size)
150 {
151 /* Size must be a power of two, and between 512 and MAX_PAGECACHE_ORDER*/
> 152 if (get_order(bs) > MAX_PAGECACHE_ORDER || size < 512 ||
153 !is_power_of_2(size))
154 return -EINVAL;
155
156 /* Size cannot be smaller than the size supported by the device */
157 if (size < bdev_logical_block_size(bdev))
158 return -EINVAL;
159
160 /* Don't change the size if it is same as current */
161 if (bdev->bd_inode->i_blkbits != blksize_bits(size)) {
162 sync_blockdev(bdev);
163 bdev->bd_inode->i_blkbits = blksize_bits(size);
164 mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
165 get_order(size));
166 kill_bdev(bdev);
167 }
168 return 0;
169 }
170
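For illustration, here is a userspace sketch of the check that the comment at line 151 describes. It assumes the intended operand of get_order() is `size`, since `bs` is not declared in set_blocksize(). The thread does not show the corrected patch. SKETCH_PAGE_SHIFT = 12 and SKETCH_MAX_PAGECACHE_ORDER = 8 are assumed values, and sketch_get_order() and sketch_check_blocksize() are hypothetical stand-ins for get_order() and the set_blocksize() checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed values for this sketch only; the kernel's depend on arch/config. */
#define SKETCH_PAGE_SHIFT          12
#define SKETCH_MAX_PAGECACHE_ORDER 8

/* True if x is a non-zero power of two (same idea as is_power_of_2()). */
static bool sketch_is_power_of_2(unsigned long x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical stand-in for get_order(): the smallest order such that
 * (page size << order) >= size. Anything up to one page is order 0.
 */
static int sketch_get_order(unsigned long size)
{
	int order = 0;

	assert(size > 0);
	size = (size - 1) >> SKETCH_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/*
 * Hypothetical userspace version of the set_blocksize() checks, with
 * get_order() applied to 'size' (assumed intent; 'bs' is undeclared).
 */
static int sketch_check_blocksize(unsigned long size,
				  unsigned long logical_block_size)
{
	/* Power of two, at least 512 bytes, order <= MAX_PAGECACHE_ORDER. */
	if (size < 512 || !sketch_is_power_of_2(size) ||
	    sketch_get_order(size) > SKETCH_MAX_PAGECACHE_ORDER)
		return -EINVAL;

	/* Cannot be smaller than the size supported by the device. */
	if (size < logical_block_size)
		return -EINVAL;

	return 0;
}
```

With these assumed constants, 16 KiB on a device with a 4 KiB logical block size passes, and 1 MiB sits exactly at the order limit. 2 MiB (order 9) is rejected.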
* Re: [PATCH 4/6] block/bdev: enable large folio support for large logical block sizes
2024-05-14 17:38 ` [PATCH 4/6] block/bdev: enable large folio support for large logical block sizes Hannes Reinecke
@ 2024-05-15 4:21 ` kernel test robot
0 siblings, 0 replies; 14+ messages in thread
From: kernel test robot @ 2024-05-15 4:21 UTC (permalink / raw)
To: Hannes Reinecke, Jens Axboe
Cc: llvm, oe-kbuild-all, Matthew Wilcox, Luis Chamberlain,
Pankaj Raghav, linux-block, linux-nvme, Hannes Reinecke
Hi Hannes,
kernel test robot noticed the following build errors:
[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on linus/master v6.9]
[cannot apply to next-20240514]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/fs-mpage-avoid-negative-shift-for-large-blocksize/20240515-014146
base: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/r/20240514173900.62207-5-hare%40kernel.org
patch subject: [PATCH 4/6] block/bdev: enable large folio support for large logical block sizes
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20240515/202405151219.H2vlwtc0-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240515/202405151219.H2vlwtc0-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405151219.H2vlwtc0-lkp@intel.com/
All errors (new ones prefixed by >>):
>> block/bdev.c:145:2: error: call to undeclared function 'mapping_set_folio_min_order'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
145 | mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
| ^
block/bdev.c:163:3: error: call to undeclared function 'mapping_set_folio_min_order'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
163 | mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
| ^
2 errors generated.
vim +/mapping_set_folio_min_order +145 block/bdev.c
133
134 static void set_init_blocksize(struct block_device *bdev)
135 {
136 unsigned int bsize = bdev_logical_block_size(bdev);
137 loff_t size = i_size_read(bdev->bd_inode);
138
139 while (bsize < PAGE_SIZE) {
140 if (size & bsize)
141 break;
142 bsize <<= 1;
143 }
144 bdev->bd_inode->i_blkbits = blksize_bits(bsize);
> 145 mapping_set_folio_min_order(bdev->bd_inode->i_mapping,
146 get_order(bsize));
147 }
148
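The loop in set_init_blocksize() starts from the logical block size and keeps doubling while the device size stays a multiple of the doubled value, stopping at PAGE_SIZE. get_order() of the result then becomes the minimum folio order. Below is a userspace sketch of that loop. It assumes a 4 KiB page and a device size that is a whole number of logical blocks, and the function name is hypothetical.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/*
 * Mirrors the loop in set_init_blocksize(): starting from the logical
 * block size, double bsize while the device size is still a multiple of
 * the doubled value, stopping at SKETCH_PAGE_SIZE. If the logical block
 * size is already >= the page size, it is returned unchanged.
 */
static unsigned int sketch_init_blocksize(unsigned int lbs,
					  unsigned long long dev_size)
{
	unsigned int bsize = lbs;

	/* Assumption: the device size is a whole number of logical blocks. */
	assert(lbs != 0 && dev_size % lbs == 0);
	while (bsize < SKETCH_PAGE_SIZE) {
		if (dev_size & bsize)	/* not a multiple of 2 * bsize */
			break;
		bsize <<= 1;
	}
	return bsize;
}
```

If the logical block size already exceeds the page size, which is the case this series enables, the loop body never runs and the logical block size is used unchanged.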
* Re: [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE
2024-05-15 0:20 ` John Garry
@ 2024-05-15 12:29 ` Hannes Reinecke
2024-05-15 12:32 ` Hannes Reinecke
2024-05-15 15:21 ` John Garry
0 siblings, 2 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-15 12:29 UTC (permalink / raw)
To: John Garry, Hannes Reinecke, Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme
On 5/15/24 02:20, John Garry wrote:
> On 14/05/2024 11:38, Hannes Reinecke wrote:
>> Bvecs can be larger than a page, and the block layer handles
>> this just fine. So do not split by PAGE_SIZE but rather by
>> the max_segment_size if that happens to be larger.
> Can you check scsi_debug for this series? I took this series only up to
> this change, and got:
>
> Starting Load Kernel Module fuse...
> [ 1.736470] ------------[ cut here ]------------
> [ 1.737777] WARNING: CPU: 0 PID: 52 at block/blk-merge.c:581 __blk_rq_map_sg+0x46a/0x480
> [ 1.738862] Modules linked in:
> [ 1.739370] CPU: 0 PID: 52 Comm: kworker/0:1H Not tainted 6.9.0-00002-g4eaa50af9312-dirty #2416
> [ 1.740474] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [ 1.741809] Workqueue: kblockd blk_mq_run_work_fn
> [ 1.742379] RIP: 0010:__blk_rq_map_sg+0x46a/0x480
> [ 1.742939] Code: 17 fe ff ff 44 89 58 0c 48 8b 01 e9 ec fc ff ff 43 8d 3c 06 48 8b 14 24 81 ff 00 10 00 00 0f 86 af fc ff ff e9 02 f0
> [ 1.743015] systemd[1]: File System Check on Root Device was skipped because of a failed condition check (ConditionPathIsReadWrite=!/.
> [ 1.745122] RSP: 0018:ff37636e4032bb90 EFLAGS: 00010212
> [ 1.746419] systemd[1]: systemd-journald.service: unit configures an IP firewall, but the local system does not support BPF/cgroup fi.
> [ 1.746891] RAX: 000000000000001c RBX: 00000000000001b0 RCX: ff28e6d8b0950a00
> [ 1.747903] systemd[1]: (This warning is only shown for the first unit using IP firewalling.)
> [ 1.748549] RDX: ff7662becb4ac482 RSI: 0000000000001000 RDI: 00000000fffffffd
> [ 1.749688] systemd[1]: Starting Journal Service...
> [ 1.749895] RBP: ff7662becb4abf80 R08: 0000000000000000 R09: ff28e6d880fadd40
> [ 1.750965] R10: ff7662becb4ac480 R11: 0000000000000000 R12: 0000000000000000
> [ 1.750966] R13: 0000000000000002 R14: 0000000000001000 R15: ff7662becb4ac480
> [ 1.750970] FS: 0000000000000000(0000) GS:ff28e6da75c00000(0000) knlGS:0000000000000000
> [ 1.750972] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.750973] CR2: 00007f7407f19000 CR3: 0000000100f24002 CR4: 0000000000771ef0
> [ 1.750974] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1.750975] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1.750976] PKRU: 55555554
> [ 1.750977] Call Trace:
> [ 1.750984] <TASK>
> [ 1.750986] ? __warn+0x7e/0x130
> [ 1.750992] ? __blk_rq_map_sg+0x46a/0x480
> [ 1.750994] ? report_bug+0x18e/0x1a0
> [ 1.750999] ? handle_bug+0x3d/0x70
> [ 1.751003] ? exc_invalid_op+0x18/0x70
> [ 1.751006] ? asm_exc_invalid_op+0x1a/0x20
> [ 1.751009] ? __blk_rq_map_sg+0x46a/0x480
> [ 1.751012] scsi_alloc_sgtables+0xb7/0x3f0
> [ 1.751019] sd_init_command+0x177/0x9d0
> [ 1.751023] scsi_queue_rq+0x7c1/0xae0
> [ 1.751027] blk_mq_dispatch_rq_list+0x2bc/0x7c0
> [ 1.751031] __blk_mq_sched_dispatch_requests+0x409/0x5c0
> [ 1.751035] blk_mq_sched_dispatch_requests+0x2c/0x60
> [ 1.751037] blk_mq_run_work_fn+0x5f/0x70
> [ 1.751039] process_one_work+0x149/0x360
>
> I suspect that you would also need to change the PAGE_SIZE check in
> __blk_bios_map_sg(). However, I am not confident that the change
> below is ok to begin with...
>
> BTW, scsi_debug does use an insane max_segment_size of -1
>
Can you try with this patch?
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 570573d7a34f..5da63180069e 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -278,7 +278,10 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
 	unsigned nsegs = 0, bytes = 0;
-	unsigned bv_seg_lim = max(PAGE_SIZE, lim->max_segment_size);
+	unsigned bv_seg_lim = PAGE_SIZE;
+
+	if (lim->max_segment_size < UINT_MAX)
+		bv_seg_lim = lim->max_segment_size;
 
 	bio_for_each_bvec(bv, bio, iter) {
 		/*
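The following userspace sketch makes the two hunks concrete. It evaluates the removed and the proposed bv_seg_lim expressions for a few max_segment_size values, assuming a 4 KiB PAGE_SIZE. The function names are hypothetical. Note that scsi_debug's max_segment_size of -1 becomes UINT_MAX when stored as an unsigned int.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* The removed line: max(PAGE_SIZE, lim->max_segment_size). */
static unsigned int seg_lim_removed(unsigned int max_segment_size)
{
	return max_segment_size > SKETCH_PAGE_SIZE ? max_segment_size
						   : SKETCH_PAGE_SIZE;
}

/* The proposed lines: PAGE_SIZE unless max_segment_size is below UINT_MAX. */
static unsigned int seg_lim_proposed(unsigned int max_segment_size)
{
	unsigned int lim = SKETCH_PAGE_SIZE;

	if (max_segment_size < UINT_MAX)
		lim = max_segment_size;
	return lim;
}
```

With UINT_MAX, the removed line places no practical limit on a bvec, while the proposal falls back to PAGE_SIZE. With a max_segment_size below PAGE_SIZE, the proposal yields a sub-page limit, which the removed line never did.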
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE
2024-05-15 12:29 ` Hannes Reinecke
@ 2024-05-15 12:32 ` Hannes Reinecke
2024-05-15 15:21 ` John Garry
1 sibling, 0 replies; 14+ messages in thread
From: Hannes Reinecke @ 2024-05-15 12:32 UTC (permalink / raw)
To: John Garry, Hannes Reinecke, Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme
On 5/15/24 14:29, Hannes Reinecke wrote:
> On 5/15/24 02:20, John Garry wrote:
>> On 14/05/2024 11:38, Hannes Reinecke wrote:
>>> Bvecs can be larger than a page, and the block layer handles
>>> this just fine. So do not split by PAGE_SIZE but rather by
>>> the max_segment_size if that happens to be larger.
>> Can you check scsi_debug for this series? I took this series only up
>> to this change, and got:
>>
>> [...]
>>
>> I suspect that you would also need to change the PAGE_SIZE check in
>> __blk_bios_map_sg(). However, I am not confident that the change
>> below is ok to begin with...
>>
>> BTW, scsi_debug does use an insane max_segment_size of -1
>>
> Can you try with this patch?
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 570573d7a34f..5da63180069e 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -278,7 +278,10 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
>  	struct bio_vec bv, bvprv, *bvprvp = NULL;
>  	struct bvec_iter iter;
>  	unsigned nsegs = 0, bytes = 0;
> -	unsigned bv_seg_lim = max(PAGE_SIZE, lim->max_segment_size);
> +	unsigned bv_seg_lim = PAGE_SIZE;
> +
> +	if (lim->max_segment_size < UINT_MAX)
> +		bv_seg_lim = lim->max_segment_size;
>  
>  	bio_for_each_bvec(bv, bio, iter) {
>  		/*
>
Hmm. No, forget it. Working on another fix.
Cheers,
Hannes
* Re: [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE
2024-05-15 12:29 ` Hannes Reinecke
2024-05-15 12:32 ` Hannes Reinecke
@ 2024-05-15 15:21 ` John Garry
1 sibling, 0 replies; 14+ messages in thread
From: John Garry @ 2024-05-15 15:21 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Jens Axboe
Cc: Matthew Wilcox, Luis Chamberlain, Pankaj Raghav, linux-block,
linux-nvme
On 15/05/2024 06:29, Hannes Reinecke wrote:
>>
>> I suspect that you would also need to change the PAGE_SIZE check in
>> __blk_bios_map_sg(). However, I am not confident that the change
>> below is ok to begin with...
>>
>> BTW, scsi_debug does use an insane max_segment_size of -1
>>
> Can you try with this patch?
It's scsi_debug; anyone can try it.
As for Luis' original issue, I did not see a proper explanation of why the
crash occurred. The splitting code should already consider the max segment
size, AFAICS. We seem to be slicing off less than the LBS, which means
bytes = 0 after the rounddown, and that crashes. Why?
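The following userspace sketch shows the arithmetic behind the bytes = 0 case. When the split point lands inside the first logical block, rounding it down to the logical block size leaves nothing to split off. The 16 KiB LBS and 12 KiB split point used in the checks are illustrative values. SKETCH_ALIGN_DOWN() and split_bytes() are hypothetical names modeled on the kernel's ALIGN_DOWN().

```c
#include <assert.h>

/* Round x down to a multiple of a; a must be a power of two (like ALIGN_DOWN()). */
#define SKETCH_ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

/*
 * Bytes a split would keep after rounding down to the logical block size.
 * A result of 0 means the split would produce an empty bio.
 */
static unsigned int split_bytes(unsigned int bytes, unsigned int lbs)
{
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);	/* power-of-two LBS */
	return SKETCH_ALIGN_DOWN(bytes, lbs);
}
```

With an assumed 16 KiB LBS, a split point at 12 KiB rounds down to 0, while one at 20 KiB keeps a single 16 KiB block.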
I think that all request_queue limits should really be double-checked
for this LBS on NVMe. The virtual_boundary_mask is still 4K, which
should be ok.
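The following sketch illustrates why a 4K virtual boundary should be compatible with larger logical blocks. The helper is a hypothetical userspace approximation of the block layer's bvec gap check. It assumes the boundary is expressed as a mask (4095 for 4K): two adjacent segments may be combined only if the previous one ends, and the next one starts, on a boundary. Segments covering whole, page-aligned folios satisfy that for any folio size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical virtual-boundary gap check: returns true if the previous
 * segment does not end, or the next one does not start, on a
 * (mask + 1)-aligned offset. A mask of 0 means no boundary constraint.
 */
static bool gap_to_prev(unsigned long mask, unsigned int prev_offset,
			unsigned int prev_len, unsigned int next_offset)
{
	if (!mask)
		return false;
	return (next_offset & mask) || ((prev_offset + prev_len) & mask);
}
```

A 16 KiB segment at offset 0 followed by a segment at offset 0 reports no gap under a 4095 mask. A segment ending at 2 KiB does report one.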
Thread overview: 14+ messages
2024-05-14 17:38 [PATCHv2 0/6] enable bs > ps for block devices Hannes Reinecke
2024-05-14 17:38 ` [PATCH 1/6] fs/mpage: avoid negative shift for large blocksize Hannes Reinecke
2024-05-14 17:38 ` [PATCH 2/6] fs/mpage: use blocks_per_folio instead of blocks_per_page Hannes Reinecke
2024-05-14 17:38 ` [PATCH 3/6] blk-merge: split bio by max_segment_size, not PAGE_SIZE Hannes Reinecke
2024-05-15 0:20 ` John Garry
2024-05-15 12:29 ` Hannes Reinecke
2024-05-15 12:32 ` Hannes Reinecke
2024-05-15 15:21 ` John Garry
2024-05-14 17:38 ` [PATCH 4/6] block/bdev: enable large folio support for large logical block sizes Hannes Reinecke
2024-05-15 4:21 ` kernel test robot
2024-05-14 17:38 ` [PATCH 5/6] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
2024-05-15 1:03 ` kernel test robot
2024-05-15 4:00 ` kernel test robot
2024-05-14 17:39 ` [PATCH 6/6] nvme: enable logical block size > PAGE_SIZE Hannes Reinecke