* [PATCH V3] block: make segment size limit workable for > 4K PAGE_SIZE
@ 2025-02-13 12:00 Ming Lei
2025-02-14 17:37 ` Bart Van Assche
2025-02-18 10:42 ` Daniel Gomez
0 siblings, 2 replies; 4+ messages in thread
From: Ming Lei @ 2025-02-13 12:00 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Ming Lei, Paul Bunyan, Yi Zhang, Luis Chamberlain, John Garry,
Bart Van Assche, Keith Busch
PAGE_SIZE is applied in validating block device queue limits, this way is
very fragile and is wrong:
- queue limits are read from hardware, which is often one readonly hardware
property
- PAGE_SIZE is one config option which can be changed during build time.
In the RH lab, it has been found that the max segment size of some mmc cards
is less than 64K, so this kind of card can't be probed successfully when the
same kernel is rebuilt with a 64K PAGE_SIZE.
Fix this issue by using BLK_MIN_SEGMENT_SIZE in the code that validates queue
limits and in the hint check for whether a bio needs splitting. Define
BLK_MIN_SEGMENT_SIZE as 4K (the smallest supported PAGE_SIZE).
The following commits are dependencies for backporting:
commit 6aeb4f836480 ("block: remove bio_add_pc_page")
commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep")
commit b7175e24d6ac ("block: add a dma mapping iterator")
Cc: Paul Bunyan <pbunyan@redhat.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V3:
- rephrase commit log & fix patch style (Christoph)
- more comment log (Christoph)
V2:
- cover bio_split_rw_at()
- add BLK_MIN_SEGMENT_SIZE
block/blk-merge.c | 2 +-
block/blk-settings.c | 6 +++---
block/blk.h | 8 ++++++--
include/linux/blkdev.h | 2 ++
4 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 15cd231d560c..b55c52a42303 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
if (nsegs < lim->max_segments &&
bytes + bv.bv_len <= max_bytes &&
- bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+ bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
nsegs++;
bytes += bv.bv_len;
} else {
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c44dadc35e1e..539a64ad7989 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -303,7 +303,7 @@ int blk_validate_limits(struct queue_limits *lim)
max_hw_sectors = min_not_zero(lim->max_hw_sectors,
lim->max_dev_sectors);
if (lim->max_user_sectors) {
- if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
+ if (lim->max_user_sectors < BLK_MIN_SEGMENT_SIZE / SECTOR_SIZE)
return -EINVAL;
lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
@@ -341,7 +341,7 @@ int blk_validate_limits(struct queue_limits *lim)
*/
if (!lim->seg_boundary_mask)
lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
- if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
+ if (WARN_ON_ONCE(lim->seg_boundary_mask < BLK_MIN_SEGMENT_SIZE - 1))
return -EINVAL;
/*
@@ -362,7 +362,7 @@ int blk_validate_limits(struct queue_limits *lim)
*/
if (!lim->max_segment_size)
lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
- if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
+ if (WARN_ON_ONCE(lim->max_segment_size < BLK_MIN_SEGMENT_SIZE))
return -EINVAL;
}
diff --git a/block/blk.h b/block/blk.h
index 90fa5f28ccab..0eca1687bec4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -358,8 +358,12 @@ struct bio *bio_split_zone_append(struct bio *bio,
static inline bool bio_may_need_split(struct bio *bio,
const struct queue_limits *lim)
{
- return lim->chunk_sectors || bio->bi_vcnt != 1 ||
- bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
+ if (lim->chunk_sectors)
+ return true;
+ if (bio->bi_vcnt != 1)
+ return true;
+ return bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset >
+ BLK_MIN_SEGMENT_SIZE;
}
/**
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..2021b2174268 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1163,6 +1163,8 @@ static inline bool bdev_is_partition(struct block_device *bdev)
enum blk_default_limits {
BLK_MAX_SEGMENTS = 128,
BLK_SAFE_MAX_SECTORS = 255,
+ /* use minimized PAGE_SIZE as min segment size hint */
+ BLK_MIN_SEGMENT_SIZE = 4096,
BLK_MAX_SEGMENT_SIZE = 65536,
BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL,
};
--
2.47.1
* Re: [PATCH V3] block: make segment size limit workable for > 4K PAGE_SIZE
2025-02-13 12:00 [PATCH V3] block: make segment size limit workable for > 4K PAGE_SIZE Ming Lei
@ 2025-02-14 17:37 ` Bart Van Assche
2025-02-18 10:42 ` Daniel Gomez
1 sibling, 0 replies; 4+ messages in thread
From: Bart Van Assche @ 2025-02-14 17:37 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, linux-block
Cc: Paul Bunyan, Yi Zhang, Luis Chamberlain, John Garry, Keith Busch
On 2/13/25 4:00 AM, Ming Lei wrote:
> PAGE_SIZE is applied in validating block device queue limits, this way is
^^^^^^^
used?
> very fragile and is wrong:
>
> - queue limits are read from hardware, which is often one readonly hardware
^^^^^^^^
read-only?
> @@ -1163,6 +1163,8 @@ static inline bool bdev_is_partition(struct block_device *bdev)
> enum blk_default_limits {
> BLK_MAX_SEGMENTS = 128,
> BLK_SAFE_MAX_SECTORS = 255,
> + /* use minimized PAGE_SIZE as min segment size hint */
> + BLK_MIN_SEGMENT_SIZE = 4096,
> BLK_MAX_SEGMENT_SIZE = 65536,
> BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL,
> };
The above comment could be made more clear, e.g. "Use 4 KiB as the
smallest supported DMA segment size limit instead of PAGE_SIZE. This is
important if the page size is larger than 4 KiB since the maximum DMA
segment size for some storage controllers (e.g. eMMC) is 4 KiB."
Anyway:
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
* Re: [PATCH V3] block: make segment size limit workable for > 4K PAGE_SIZE
2025-02-13 12:00 [PATCH V3] block: make segment size limit workable for > 4K PAGE_SIZE Ming Lei
2025-02-14 17:37 ` Bart Van Assche
@ 2025-02-18 10:42 ` Daniel Gomez
2025-02-18 11:42 ` Ming Lei
1 sibling, 1 reply; 4+ messages in thread
From: Daniel Gomez @ 2025-02-18 10:42 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Paul Bunyan, Yi Zhang, Luis Chamberlain,
John Garry, Bart Van Assche, Keith Busch
On Thu, Feb 13, 2025 at 08:00:40PM +0100, Ming Lei wrote:
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 15cd231d560c..b55c52a42303 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>
> if (nsegs < lim->max_segments &&
> bytes + bv.bv_len <= max_bytes &&
> - bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> + bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
In which cases is this "new" condition (for systems where PAGE_SIZE >
BLK_MIN_SEGMENT_SIZE) going to be true? In my test case below, it is always
false, so it defaults to the else path. And I think that is going to be the
"normal" case on these systems, is that right?
Doing a 'quick' test using next-20250213 on a 16k PAGE_SIZE system with the NVMe
driver and a 4k lbs disk + ~1h fio sequential writes, I get results indicating
a write performance degradation of ~0.8%. This is due to the new loop condition
doing 4k steps rather than PAGE_SIZE steps. I guess it's going to be slightly
worse the larger the system's PAGE_SIZE (and the bio)? So, why not decrease the
minimum segment size only in the cases where it's actually needed, rather than
making it the default?
I've measured bio_split_rw_at latency in the above test with the following
results:
next-20250213:
This is a power-of-2 time (nanoseconds) histogram for all nr segments cases for
my DUT nvme drive:
@ns[fio]:
[128, 256) 11 | |
[256, 512) 242268 | |
[512, 1K) 16153055 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K) 172281 | |
[2K, 4K) 3762 | |
[4K, 8K) 1184 | |
[8K, 16K) 7581 | |
[16K, 32K) 1516 | |
[32K, 64K) 124 | |
[64K, 128K) 9 | |
[128K, 256K) 4 | |
[256K, 512K) 2 | |
[512K, 1M) 1 | |
Same power-of-2 histogram for 64 nr segments case only. Note: I couldn't trigger
a much larger nr segment case.
@latency[64]:
[256, 512) 379481 |@@ |
[512, 1K) 8096428 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K) 1592 | |
[2K, 4K) 1181 | |
[4K, 8K) 332 | |
[8K, 16K) 3027 | |
[16K, 32K) 669 | |
[32K, 64K) 63 | |
[64K, 128K) 5 | |
[128K, 256K) 2 | |
[256K, 512K) 1 | |
next-20250213 + BLK_MIN_SEGMENT_SIZE patch:
@ns[fio]:
[128, 256) 3 | |
[256, 512) 76345 | |
[512, 1K) 15514581 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K) 755336 |@@ |
[2K, 4K) 7063 | |
[4K, 8K) 3301 | |
[8K, 16K) 4574 | |
[16K, 32K) 1816 | |
[32K, 64K) 155 | |
[64K, 128K) 22 | |
[128K, 256K) 6 | |
[256K, 512K) 4 | |
[512K, 1M) 2 | |
[1M, 2M) 2 | |
[2M, 4M) 1 | |
[4M, 8M) 0 | |
[8M, 16M) 1 | |
@latency[64]:
[256, 512) 6800 | |
[512, 1K) 7540158 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K) 5747 | |
[2K, 4K) 2578 | |
[4K, 8K) 996 | |
[8K, 16K) 1960 | |
[16K, 32K) 773 | |
[32K, 64K) 54 | |
[64K, 128K) 5 | |
[128K, 256K) 3 | |
[256K, 512K) 3 | |
[512K, 1M) 2 | |
So comparing the above, I notice that the new routine is slightly slower,
resulting in a minor degradation. Do you think it is worth exploring a fix that
returns PAGE_SIZE as BLK_MIN_SEGMENT_SIZE in the cases where we don't need to
go further?
* Re: [PATCH V3] block: make segment size limit workable for > 4K PAGE_SIZE
2025-02-18 10:42 ` Daniel Gomez
@ 2025-02-18 11:42 ` Ming Lei
0 siblings, 0 replies; 4+ messages in thread
From: Ming Lei @ 2025-02-18 11:42 UTC (permalink / raw)
To: Daniel Gomez
Cc: Jens Axboe, linux-block, Paul Bunyan, Yi Zhang, Luis Chamberlain,
John Garry, Bart Van Assche, Keith Busch
On Tue, Feb 18, 2025 at 11:42:43AM +0100, Daniel Gomez wrote:
> On Thu, Feb 13, 2025 at 08:00:40PM +0100, Ming Lei wrote:
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 15cd231d560c..b55c52a42303 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> >
> > if (nsegs < lim->max_segments &&
> > bytes + bv.bv_len <= max_bytes &&
> > - bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> > + bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
>
> In which cases this "new" condition (for systems where PAGE_SIZE >
> BLK_MIN_SEGMENT_SIZE) is going to be true? In my test case below, is always
Yes.
> false, so it defaults to the else path. And I think that is going to be the
> "normal" case in these systems, is that right?
It depends on the block size in your workload.
>
> Doing a 'quick' test using next-20250213 on a 16k PAGE_SIZE system with the NVMe
> driver and a 4k lbs disk + ~1h fio sequential writes, I get results indicating
> a write performance degradation of ~0.8%. This is due to the new loop condition
> doing 4k steps rather than PS. I guess it's going to be slighly worse the larger
> the PAGE_SIZE the system is, and bio? So, why not decreasing the minimum segment
No, bvec_split_segs() is just called one extra time for any >4K page size.

Probably the opposite: the effect on a 64K page size could be smaller, since a
bio may have fewer bvecs for the same workload.
> size for the cases it's actually needed rather than making it now the default?
>
> I've measured bio_split_rw_at latency in the above test with the following
> results:
Fine, I will add a minimum segment size limit to cover this case, since you
care about the small perf drop on >4K PAGE_SIZE.
Thanks,
Ming