public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* bio segment constraints
@ 2025-04-06 19:40 Sean Anderson
  2025-04-07  7:07 ` Christoph Hellwig
  2025-04-07  7:10 ` Hannes Reinecke
  0 siblings, 2 replies; 12+ messages in thread
From: Sean Anderson @ 2025-04-06 19:40 UTC (permalink / raw)
  To: Jens Axboe, linux-block
  Cc: Miquel Raynal, Richard Weinberger, Vignesh Raghavendra, linux-mtd,
	Zhihao Cheng

Hi all,

I'm not really sure what guarantees the block layer makes regarding the
segments in a bio as part of a request submitted to a block driver. As
far as I can tell this is not documented anywhere. In particular,

- Is bv_len aligned to SECTOR_SIZE?
- To logical_sector_size?
- What if logical_sector_size > PAGE_SIZE?
- What about bv_offset?
- Is it possible to have a bio where the total length is a multiple of
   logical_sector_size, but the data is split across several segments
   where each segment is a multiple of SECTOR_SIZE?
- Is is possible to have segments not even aligned to SECTOR_SIZE?
- Can I somehow request to only get segments with bv_len aligned to
   logical_sector_size? Or do I need to do my own coalescing and bounce
   buffering for that?

I've been reading some drivers (as well as stuff in block/) to try and
figure things out, but it's hard to figure out all the places where
constraints are enforced. In particular, I've read several drivers that
make some big assumptions (which might be bugs?) For example, in
drivers/mtd/mtd_blkdevs.c, do_blktrans_request looks like:

	block = blk_rq_pos(req) << 9 >> tr->blkshift;
	nsect = blk_rq_cur_bytes(req) >> tr->blkshift;

	switch (req_op(req)) {
	/* ... snip ... */
	case REQ_OP_READ:
		buf = kmap(bio_page(req->bio)) + bio_offset(req->bio);
		for (; nsect > 0; nsect--, block++, buf += tr->blksize) {
			if (tr->readsect(dev, block, buf)) {
				kunmap(bio_page(req->bio));
				return BLK_STS_IOERR;
			}
		}
		kunmap(bio_page(req->bio));

		rq_for_each_segment(bvec, req, iter)
			flush_dcache_page(bvec.bv_page);
		return BLK_STS_OK;

For context, tr->blkshift is either 512 or 4096, depending on the
backend. From what I can tell, this code assumes the following:

- There is only one bio in a request. This one is a bit of a soft
   assumption since we should only flush the pages in the bio and not the
   whole request otherwise.
- There is only one segment in a bio. This one could be reasonable if
   max_segments was set to 1, but it's not as far as I can tell. So I
   guess we just go off the end of the bio if there's a second segment?
- The data is in lowmem OR bv_offset + bv_len <= PAGE_SIZE. kmap() only
   maps a single page, so if we go past one page we end up in adjacent
   kmapped pages.

Am I missing something here? Handling highmem seems like a persistent
issue. E.g. drivers/mtd/ubi/block.c doesn't even bother doing a kmap.
Should both of these have BLK_FEAT_BOUNCE_HIGH?

--Sean

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-06 19:40 bio segment constraints Sean Anderson
@ 2025-04-07  7:07 ` Christoph Hellwig
  2025-04-07 13:46   ` Keith Busch
  2025-04-07 13:59   ` Sean Anderson
  2025-04-07  7:10 ` Hannes Reinecke
  1 sibling, 2 replies; 12+ messages in thread
From: Christoph Hellwig @ 2025-04-07  7:07 UTC (permalink / raw)
  To: Sean Anderson
  Cc: Jens Axboe, linux-block, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, linux-mtd, Zhihao Cheng

On Sun, Apr 06, 2025 at 03:40:04PM -0400, Sean Anderson wrote:
> Hi all,
> 
> I'm not really sure what guarantees the block layer makes regarding the
> segments in a bio as part of a request submitted to a block driver. As
> far as I can tell this is not documented anywhere. In particular,

First you need to define what segment you mean.  We have at least two and
a half historical uses of the name.  One is for each bio_vec attached to
the bio, either directly as submitted into ->submit_bio for bio based
drivers (case 1a), or generated by bio_split_to_limits (case 1b), which
is called for every blk-mq driver before calling into ->queue_rq(s) or
explicitly called by a few bio based driver.

The other is the bio-vec synthesized by bio_for_each_segment (case 2).

> - Is bv_len aligned to SECTOR_SIZE?

Yes.

> - To logical_sector_size?

Yes.

> - What if logical_sector_size > PAGE_SIZE?

Still always aligned to logical_sector_size.

> - What about bv_offset?

bv_offset is a memory offset and must only be aligned to the
dma_alignment limit.

> - Is it possible to have a bio where the total length is a multiple of
>   logical_sector_size, but the data is split across several segments
>   where each segment is a multiple of SECTOR_SIZE?

Yes.

> - Is is possible to have segments not even aligned to SECTOR_SIZE?

No.

> - Can I somehow request to only get segments with bv_len aligned to
>   logical_sector_size?

For drivers that use bio_split_to_limits implicitly or explicitly you can
do that by setting the right seg_boundary_mask.

> make some big assumptions (which might be bugs?) For example, in
> drivers/mtd/mtd_blkdevs.c, do_blktrans_request looks like:

> - There is only one bio in a request. This one is a bit of a soft
>   assumption since we should only flush the pages in the bio and not the
>   whole request otherwise.

It always operates on the first bio in the request and then uses
blk_update_request to move the context past that.  It is an old
and somewhat arkane way to write drivers, but should work.  The
rq_for_each_segment looks do call flush_dcache_page look horribly
wrong for this model, though.

> - The data is in lowmem OR bv_offset + bv_len <= PAGE_SIZE. kmap() only
>   maps a single page, so if we go past one page we end up in adjacent
>   kmapped pages.

Yes, this looks broken.

> Am I missing something here? Handling highmem seems like a persistent
> issue. E.g. drivers/mtd/ubi/block.c doesn't even bother doing a kmap.
> Should both of these have BLK_FEAT_BOUNCE_HIGH?

BLK_FEAT_BOUNCE_HIGH needs to go away rather sooner than later.

in the short run the best fix would be to synthesized a
bio_for_each_segment like bio_vec that stays inside a single page
using bio_iter_iovec) at the top of do_blktrans_request and use
that for all references to the data. 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-06 19:40 bio segment constraints Sean Anderson
  2025-04-07  7:07 ` Christoph Hellwig
@ 2025-04-07  7:10 ` Hannes Reinecke
  2025-04-07 14:14   ` Sean Anderson
  1 sibling, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2025-04-07  7:10 UTC (permalink / raw)
  To: Sean Anderson, Jens Axboe, linux-block
  Cc: Miquel Raynal, Richard Weinberger, Vignesh Raghavendra, linux-mtd,
	Zhihao Cheng

On 4/6/25 21:40, Sean Anderson wrote:
> Hi all,
> 
> I'm not really sure what guarantees the block layer makes regarding the
> segments in a bio as part of a request submitted to a block driver. As
> far as I can tell this is not documented anywhere. In particular,
> 
> - Is bv_len aligned to SECTOR_SIZE?

The block layer always uses a 512 byte sector size, so yes.

> - To logical_sector_size?

Not necessarily. Bvecs are a consecutive list of byte ranges which
make up the data portion of a bio.
The logical sector size is a property of the request queue, which is
applied when a request is formed from one or several bios.
For the request the overall length need to be a multiple of the logical
sector size, but not necessarily the individual bios.

> - What if logical_sector_size > PAGE_SIZE?

See above.

> - What about bv_offset?

Same story. The eventual request needs to observe that the offset
and the length is aligned to the logical block size, but the individual
bios might not.

> - Is it possible to have a bio where the total length is a multiple of
>    logical_sector_size, but the data is split across several segments
>    where each segment is a multiple of SECTOR_SIZE?

Sure.

> - Is is possible to have segments not even aligned to SECTOR_SIZE?

Nope.

> - Can I somehow request to only get segments with bv_len aligned to
>    logical_sector_size? Or do I need to do my own coalescing and bounce
>    buffering for that?
> 

The driver surely can. You should be able to set 'max_segment_size' to
the logical block size, and that should give you what you want.

> I've been reading some drivers (as well as stuff in block/) to try and
> figure things out, but it's hard to figure out all the places where
> constraints are enforced. In particular, I've read several drivers that
> make some big assumptions (which might be bugs?) For example, in
> drivers/mtd/mtd_blkdevs.c, do_blktrans_request looks like:
> 
In general, the block layer has two major data items, bios and requests.
'struct bio' is the central structure for any 'upper' layers to submit
data (via the 'submit_bio()' function), and 'struct request' is the
central structure for drivers to fetch data for submission to the
hardware (via the 'queue_rq()' request_queue callback).
And the task of the block layer is to convert 'struct bio' into
'struct request'.

[ .. ]

> For context, tr->blkshift is either 512 or 4096, depending on the
> backend. From what I can tell, this code assumes the following:
> 
mtd is probably not a good examples, as MTD has it's own set of 
limitations which might result in certain shortcuts to be taken.

> - There is only one bio in a request. This one is a bit of a soft
>    assumption since we should only flush the pages in the bio and not the
>    whole request otherwise.
> - There is only one segment in a bio. This one could be reasonable if
>    max_segments was set to 1, but it's not as far as I can tell. So I
>    guess we just go off the end of the bio if there's a second segment?
> - The data is in lowmem OR bv_offset + bv_len <= PAGE_SIZE. kmap() only
>    maps a single page, so if we go past one page we end up in adjacent
>    kmapped pages.
> 
Well, that code _does_ look suspicious. It really should be converted
to using the iov iterators.
But then again, it _might_ be okay if there are underlying MTD
restrictions which would devolve into MTD only having a single bvec.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07  7:07 ` Christoph Hellwig
@ 2025-04-07 13:46   ` Keith Busch
  2025-04-07 13:59     ` Christoph Hellwig
  2025-04-07 13:59   ` Sean Anderson
  1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-04-07 13:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sean Anderson, Jens Axboe, linux-block, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, linux-mtd, Zhihao Cheng

On Mon, Apr 07, 2025 at 12:07:11AM -0700, Christoph Hellwig wrote:
> On Sun, Apr 06, 2025 at 03:40:04PM -0400, Sean Anderson wrote:
> > - What about bv_offset?
> 
> bv_offset is a memory offset and must only be aligned to the
> dma_alignment limit.
> 
> > - Is it possible to have a bio where the total length is a multiple of
> >   logical_sector_size, but the data is split across several segments
> >   where each segment is a multiple of SECTOR_SIZE?
> 
> Yes.
> 
> > - Is is possible to have segments not even aligned to SECTOR_SIZE?
> 
> No.

O_DIRECT only requires each user iovec be a multiple of the logical
block size with the address aligned to the dma_alignment. If the
dma_alignment is smaller than the logical block size, then this could
create bvec segments that are smaller. For nvme where we have 4-byte
dma alignment, you could have the first segment be the last 4 bytes of a
page, then the remaing 508 bytes from a different page in the next
segment.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07  7:07 ` Christoph Hellwig
  2025-04-07 13:46   ` Keith Busch
@ 2025-04-07 13:59   ` Sean Anderson
  2025-04-07 14:12     ` Christoph Hellwig
  1 sibling, 1 reply; 12+ messages in thread
From: Sean Anderson @ 2025-04-07 13:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, linux-mtd, Zhihao Cheng

On 4/7/25 03:07, Christoph Hellwig wrote:
> On Sun, Apr 06, 2025 at 03:40:04PM -0400, Sean Anderson wrote:
>> Hi all,
>>
>> I'm not really sure what guarantees the block layer makes regarding the
>> segments in a bio as part of a request submitted to a block driver. As
>> far as I can tell this is not documented anywhere. In particular,
> 
> First you need to define what segment you mean.  We have at least two and
> a half historical uses of the name.  One is for each bio_vec attached to
> the bio, either directly as submitted into ->submit_bio for bio based
> drivers (case 1a), or generated by bio_split_to_limits (case 1b), which
> is called for every blk-mq driver before calling into ->queue_rq(s) or
> explicitly called by a few bio based driver.
> 
> The other is the bio-vec synthesized by bio_for_each_segment (case 2).

I'm referring to the bio_vecs you get from queue_mq. Which I think is the
latter.

>> - Is bv_len aligned to SECTOR_SIZE?
> 
> Yes.
> 
>> - To logical_sector_size?
> 
> Yes.

OK, but...

>> - What if logical_sector_size > PAGE_SIZE?
> 
> Still always aligned to logical_sector_size.
> 
>> - What about bv_offset?
> 
> bv_offset is a memory offset and must only be aligned to the
> dma_alignment limit.
> 
>> - Is it possible to have a bio where the total length is a multiple of
>>    logical_sector_size, but the data is split across several segments
>>    where each segment is a multiple of SECTOR_SIZE?
> 
> Yes.

...if this is the case, then for some of those segments wouldn't bv_len
not be a multiple of logical_sector_size?

>> - Is is possible to have segments not even aligned to SECTOR_SIZE?
> 
> No.
> 
>> - Can I somehow request to only get segments with bv_len aligned to
>>    logical_sector_size?
> 
> For drivers that use bio_split_to_limits implicitly or explicitly you can
> do that by setting the right seg_boundary_mask.

Is that the right knob? It operates on the physical address, so it looked
more like something for broken DMA engines. For example (if I recall correctly)
MMC SDMA can't cross a page boundary, so you could use seg_boundary_mask to
enforce that.

>> make some big assumptions (which might be bugs?) For example, in
>> drivers/mtd/mtd_blkdevs.c, do_blktrans_request looks like:
> 
>> - There is only one bio in a request. This one is a bit of a soft
>>    assumption since we should only flush the pages in the bio and not the
>>    whole request otherwise.
> 
> It always operates on the first bio in the request and then uses
> blk_update_request to move the context past that.  It is an old
> and somewhat arkane way to write drivers, but should work.  The
> rq_for_each_segment looks do call flush_dcache_page look horribly
> wrong for this model, though.
> 
>> - The data is in lowmem OR bv_offset + bv_len <= PAGE_SIZE. kmap() only
>>    maps a single page, so if we go past one page we end up in adjacent
>>    kmapped pages.
> 
> Yes, this looks broken.
> 
>> Am I missing something here? Handling highmem seems like a persistent
>> issue. E.g. drivers/mtd/ubi/block.c doesn't even bother doing a kmap.
>> Should both of these have BLK_FEAT_BOUNCE_HIGH?
> 
> BLK_FEAT_BOUNCE_HIGH needs to go away rather sooner than later.
> 
> in the short run the best fix would be to synthesized a
> bio_for_each_segment like bio_vec that stays inside a single page
> using bio_iter_iovec) at the top of do_blktrans_request and use
> that for all references to the data.
> 

OK, but if you have to stay inside a single page couldn't you end up
with a sector spanning a page boundary due to only being aligned to
dma_alignment? Or maybe we set seg_boundary_mask to PAGE_MASK to enforce that?

--Sean

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07 13:46   ` Keith Busch
@ 2025-04-07 13:59     ` Christoph Hellwig
  2025-04-07 15:52       ` Bart Van Assche
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2025-04-07 13:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Sean Anderson, Jens Axboe, linux-block,
	Miquel Raynal, Richard Weinberger, Vignesh Raghavendra, linux-mtd,
	Zhihao Cheng

On Mon, Apr 07, 2025 at 02:46:32PM +0100, Keith Busch wrote:
> > > - Is is possible to have segments not even aligned to SECTOR_SIZE?
> > 
> > No.
> 
> O_DIRECT only requires each user iovec be a multiple of the logical
> block size with the address aligned to the dma_alignment. If the
> dma_alignment is smaller than the logical block size, then this could
> create bvec segments that are smaller. For nvme where we have 4-byte
> dma alignment, you could have the first segment be the last 4 bytes of a
> page, then the remaing 508 bytes from a different page in the next
> segment.

Oh, right - with a smaller dma alignment this can actually happen.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07 13:59   ` Sean Anderson
@ 2025-04-07 14:12     ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2025-04-07 14:12 UTC (permalink / raw)
  To: Sean Anderson
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, linux-mtd, Zhihao Cheng

On Mon, Apr 07, 2025 at 09:59:16AM -0400, Sean Anderson wrote:
> > drivers (case 1a), or generated by bio_split_to_limits (case 1b), which
> > is called for every blk-mq driver before calling into ->queue_rq(s) or
> > explicitly called by a few bio based driver.
> > 
> > The other is the bio-vec synthesized by bio_for_each_segment (case 2).
> 
> I'm referring to the bio_vecs you get from queue_mq. Which I think is the
> latter.

The bios hanging off the request have bio_vecs that are case 1b.

If you use bio_for_each_segment or an open coded variant of that on the
bio you get on-stack bio_vecs for case 2.

> > > - Is it possible to have a bio where the total length is a multiple of
> > >    logical_sector_size, but the data is split across several segments
> > >    where each segment is a multiple of SECTOR_SIZE?
> > 
> > Yes.
> 
> ...if this is the case, then for some of those segments wouldn't bv_len
> not be a multiple of logical_sector_size?

Maybe I'm misunderstanding the question.  But if you have e.g. a
transfer that is 2MiB, it could be perfectly fine that each bio_vec is
say 1kiB.

> > > - Can I somehow request to only get segments with bv_len aligned to
> > >    logical_sector_size?
> > 
> > For drivers that use bio_split_to_limits implicitly or explicitly you can
> > do that by setting the right seg_boundary_mask.
> 
> Is that the right knob? It operates on the physical address, so it looked
> more like something for broken DMA engines. For example (if I recall correctly)
> MMC SDMA can't cross a page boundary, so you could use seg_boundary_mask to
> enforce that.

seg_boundary_mask is indeed primarily about physical addresses.  That
being said if your seg_boundary_mask is logical_sector_size all bio_vecs
in cases 1a/b will be aligned to it for their base address, and thus
automatically their length too.

> > in the short run the best fix would be to synthesized a
> > bio_for_each_segment like bio_vec that stays inside a single page
> > using bio_iter_iovec) at the top of do_blktrans_request and use
> > that for all references to the data.
> > 
> 
> OK, but if you have to stay inside a single page couldn't you end up
> with a sector spanning a page boundary due to only being aligned to
> dma_alignment? Or maybe we set seg_boundary_mask to PAGE_MASK to enforce that?

bio_iter_iovec ensures that the synthetic request never crosses a
page boundary.  But as Keith pointed out that means if your
dma_alignment or seg_boundary_mask are smaller than the logical block
size you might actually get a non-aligned length.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07  7:10 ` Hannes Reinecke
@ 2025-04-07 14:14   ` Sean Anderson
  2025-04-08  6:10     ` Hannes Reinecke
  2025-04-08 14:33     ` Keith Busch
  0 siblings, 2 replies; 12+ messages in thread
From: Sean Anderson @ 2025-04-07 14:14 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe, linux-block
  Cc: Miquel Raynal, Richard Weinberger, Vignesh Raghavendra, linux-mtd,
	Zhihao Cheng

On 4/7/25 03:10, Hannes Reinecke wrote:
> On 4/6/25 21:40, Sean Anderson wrote:
>> Hi all,
>>
>> I'm not really sure what guarantees the block layer makes regarding the
>> segments in a bio as part of a request submitted to a block driver. As
>> far as I can tell this is not documented anywhere. In particular,
>>
>> - Is bv_len aligned to SECTOR_SIZE?
> 
> The block layer always uses a 512 byte sector size, so yes.
> 
>> - To logical_sector_size?
> 
> Not necessarily. Bvecs are a consecutive list of byte ranges which
> make up the data portion of a bio.
> The logical sector size is a property of the request queue, which is
> applied when a request is formed from one or several bios.
> For the request the overall length need to be a multiple of the logical
> sector size, but not necessarily the individual bios.

Oh, so this is worse than I thought. So if you care about e.g. only submitting
I/O in units of logical_block_size, you have to combine segments across the
entire request.

>> - What if logical_sector_size > PAGE_SIZE?
> 
> See above.
> 
>> - What about bv_offset?
> 
> Same story. The eventual request needs to observe that the offset
> and the length is aligned to the logical block size, but the individual
> bios might not.
> 
>> - Is it possible to have a bio where the total length is a multiple of
>>    logical_sector_size, but the data is split across several segments
>>    where each segment is a multiple of SECTOR_SIZE?
> 
> Sure.
> 
>> - Is is possible to have segments not even aligned to SECTOR_SIZE?
> 
> Nope.
> 
>> - Can I somehow request to only get segments with bv_len aligned to
>>    logical_sector_size? Or do I need to do my own coalescing and bounce
>>    buffering for that?
>>
> 
> The driver surely can. You should be able to set 'max_segment_size' to
> the logical block size, and that should give you what you want.

But couldn't I get segments smaller than that? max_segment_size seems like
it would only restrict the maximum size, leaving the possibility open for
smaller segments.

>> I've been reading some drivers (as well as stuff in block/) to try and
>> figure things out, but it's hard to figure out all the places where
>> constraints are enforced. In particular, I've read several drivers that
>> make some big assumptions (which might be bugs?) For example, in
>> drivers/mtd/mtd_blkdevs.c, do_blktrans_request looks like:
>>
> In general, the block layer has two major data items, bios and requests.
> 'struct bio' is the central structure for any 'upper' layers to submit
> data (via the 'submit_bio()' function), and 'struct request' is the
> central structure for drivers to fetch data for submission to the
> hardware (via the 'queue_rq()' request_queue callback).
> And the task of the block layer is to convert 'struct bio' into
> 'struct request'.
> 
> [ .. ]
> 
>> For context, tr->blkshift is either 512 or 4096, depending on the
>> backend. From what I can tell, this code assumes the following:
>>
> mtd is probably not a good examples, as MTD has it's own set of limitations which might result in certain shortcuts to be taken.

Well, I want to write a block driver on top of MTD, so it's a pretty good
example for my purposes :P

>> - There is only one bio in a request. This one is a bit of a soft
>>    assumption since we should only flush the pages in the bio and not the
>>    whole request otherwise.
>> - There is only one segment in a bio. This one could be reasonable if
>>    max_segments was set to 1, but it's not as far as I can tell. So I
>>    guess we just go off the end of the bio if there's a second segment?
>> - The data is in lowmem OR bv_offset + bv_len <= PAGE_SIZE. kmap() only
>>    maps a single page, so if we go past one page we end up in adjacent
>>    kmapped pages.
>>
> Well, that code _does_ look suspicious. It really should be converted
> to using the iov iterators.

I had a look at this, but the API isn't documented so I wasn't sure what
I would get out of it. I'll have a closer look.

> But then again, it _might_ be okay if there are underlying MTD
> restrictions which would devolve into MTD only having a single bvec.

The underlying restriction is that the MTD API expects a buffer that has
contiguous kernel virtual addresses. The driver will do bounce-buffering
if wants to do DMA and virt_addr_valid is false. The mtd_blkdevs driver
promises to submit buffers of size tr->blksize to the underlying bltrans
driver. This whole thing is not very efficient if the MTD driver can do
scatter-gather DMA, but that's not the API...

Maybe I should just vmap the entire request?

--Sean

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07 13:59     ` Christoph Hellwig
@ 2025-04-07 15:52       ` Bart Van Assche
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2025-04-07 15:52 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Sean Anderson, Jens Axboe, linux-block, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, linux-mtd, Zhihao Cheng

On 4/7/25 6:59 AM, Christoph Hellwig wrote:
> On Mon, Apr 07, 2025 at 02:46:32PM +0100, Keith Busch wrote:
>> O_DIRECT only requires each user iovec be a multiple of the logical
>> block size with the address aligned to the dma_alignment. If the
>> dma_alignment is smaller than the logical block size, then this could
>> create bvec segments that are smaller. For nvme where we have 4-byte
>> dma alignment, you could have the first segment be the last 4 bytes of a
>> page, then the remaing 508 bytes from a different page in the next
>> segment.
> 
> Oh, right - with a smaller dma alignment this can actually happen.

Some time ago I added src/discontiguous-io.cpp to the blktests project
to trigger this scenario. This test program submits an SG_IO request
with multiple 4 byte segments. Maybe this test program should be
modified such that it uses O_DIRECT instead of SG_IO.

Bart.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07 14:14   ` Sean Anderson
@ 2025-04-08  6:10     ` Hannes Reinecke
  2025-04-08 13:57       ` Sean Anderson
  2025-04-08 14:33     ` Keith Busch
  1 sibling, 1 reply; 12+ messages in thread
From: Hannes Reinecke @ 2025-04-08  6:10 UTC (permalink / raw)
  To: Sean Anderson, Jens Axboe, linux-block
  Cc: Miquel Raynal, Richard Weinberger, Vignesh Raghavendra, linux-mtd,
	Zhihao Cheng

On 4/7/25 16:14, Sean Anderson wrote:
> On 4/7/25 03:10, Hannes Reinecke wrote:
>> On 4/6/25 21:40, Sean Anderson wrote:
>>> Hi all,
>>>
>>> I'm not really sure what guarantees the block layer makes regarding the
>>> segments in a bio as part of a request submitted to a block driver. As
>>> far as I can tell this is not documented anywhere. In particular,
>>>
>>> - Is bv_len aligned to SECTOR_SIZE?
>>
>> The block layer always uses a 512 byte sector size, so yes.
>>
>>> - To logical_sector_size?
>>
>> Not necessarily. Bvecs are a consecutive list of byte ranges which
>> make up the data portion of a bio.
>> The logical sector size is a property of the request queue, which is
>> applied when a request is formed from one or several bios.
>> For the request the overall length need to be a multiple of the logical
>> sector size, but not necessarily the individual bios.
> 
> Oh, so this is worse than I thought. So if you care about e.g. only 
> submitting I/O in units of logical_block_size, you have to combine
 > segments across the entire request.>
Well, yes, and no.
You might be seeing a request with several bios, each having small
bvecs. But in the driver you will want to use an iov iterator or map
it into a sg list via blk_rq_map_sg(), and then iterate over that
for submission.

[ .. ]
>>> - Can I somehow request to only get segments with bv_len aligned to
>>>    logical_sector_size? Or do I need to do my own coalescing and bounce
>>>    buffering for that?
>>>
>>
>> The driver surely can. You should be able to set 'max_segment_size' to
>> the logical block size, and that should give you what you want.
> 
> But couldn't I get segments smaller than that? max_segment_size seems like
> it would only restrict the maximum size, leaving the possibility open for
> smaller segments.
> 
As mentioned: individual segments might. The overall request still will 
adhere to the logical block size setting (ie it will never be smaller 
than the logical block size).

Maybe have a look at drivers/mtd/ubi/block.c. There the driver will
map the request onto a scatterlist, and then iterate over the sg entries
to read in the data.

Note: mapping onto a scatterlist will coalesce adjacent bvecs, so on the
scatterlist you will find only contiguous segments which adhere to the
logical block size linmitations.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-08  6:10     ` Hannes Reinecke
@ 2025-04-08 13:57       ` Sean Anderson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Anderson @ 2025-04-08 13:57 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe, linux-block
  Cc: Miquel Raynal, Richard Weinberger, Vignesh Raghavendra, linux-mtd,
	Zhihao Cheng

On 4/8/25 02:10, Hannes Reinecke wrote:
> On 4/7/25 16:14, Sean Anderson wrote:
>> On 4/7/25 03:10, Hannes Reinecke wrote:
>>> On 4/6/25 21:40, Sean Anderson wrote:
>>>> Hi all,
>>>>
>>>> I'm not really sure what guarantees the block layer makes regarding the
>>>> segments in a bio as part of a request submitted to a block driver. As
>>>> far as I can tell this is not documented anywhere. In particular,
>>>>
>>>> - Is bv_len aligned to SECTOR_SIZE?
>>>
>>> The block layer always uses a 512 byte sector size, so yes.
>>>
>>>> - To logical_sector_size?
>>>
>>> Not necessarily. Bvecs are a consecutive list of byte ranges which
>>> make up the data portion of a bio.
>>> The logical sector size is a property of the request queue, which is
>>> applied when a request is formed from one or several bios.
>>> For the request the overall length need to be a multiple of the logical
>>> sector size, but not necessarily the individual bios.
>>
>> Oh, so this is worse than I thought. So if you care about e.g. only submitting I/O in units of logical_block_size, you have to combine
>  > segments across the entire request.>
> Well, yes, and no.
> You might be seeing a request with several bios, each having small
> bvecs. But in the driver you will want to use an iov iterator or map
> it into a sg list via blk_rq_map_sg(), and then iterate over that
> for submission.
> 
> [ .. ]
>>>> - Can I somehow request to only get segments with bv_len aligned to
>>>>    logical_sector_size? Or do I need to do my own coalescing and bounce
>>>>    buffering for that?
>>>>
>>>
>>> The driver surely can. You should be able to set 'max_segment_size' to
>>> the logical block size, and that should give you what you want.
>>
>> But couldn't I get segments smaller than that? max_segment_size seems like
>> it would only restrict the maximum size, leaving the possibility open for
>> smaller segments.
>>
> As mentioned: individual segments might. The overall request still will adhere to the logical block size setting (ie it will never be smaller than the logical block size).
> 
> Maybe have a look at drivers/mtd/ubi/block.c. There the driver will
> map the request onto a scatterlist, and then iterate over the sg entries
> to read in the data.
> 
> Note: mapping onto a scatterlist will coalesce adjacent bvecs, so on the
> scatterlist you will find only contiguous segments which adhere to the
> logical block size linmitations.

How can this happen? From above, you could have a bvec where it contains only
one sector. So it will always be discontiguous in terms of the logical block size.

I've looked at ubiblock, and that only works for that driver because it's read-only.
The MTD API allows you to read unaligned buffers (although it's inefficient since
you may re-read the same page multiple times), but for writes the data must be
aligned to the page size.

I think the only way to solve this is to create a (mtd-)page-sized buffer and bounce
the incoming data in if the bvec is shorter (or if it's in high memory).

--Sean

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: bio segment constraints
  2025-04-07 14:14   ` Sean Anderson
  2025-04-08  6:10     ` Hannes Reinecke
@ 2025-04-08 14:33     ` Keith Busch
  1 sibling, 0 replies; 12+ messages in thread
From: Keith Busch @ 2025-04-08 14:33 UTC (permalink / raw)
  To: Sean Anderson
  Cc: Hannes Reinecke, Jens Axboe, linux-block, Miquel Raynal,
	Richard Weinberger, Vignesh Raghavendra, linux-mtd, Zhihao Cheng

On Mon, Apr 07, 2025 at 10:14:28AM -0400, Sean Anderson wrote:
> On 4/7/25 03:10, Hannes Reinecke wrote:
> > 
> > The driver surely can. You should be able to set 'max_segment_size' to
> > the logical block size, and that should give you what you want.
> 
> But couldn't I get segments smaller than that? max_segment_size seems like
> it would only restrict the maximum size, leaving the possibility open for
> smaller segments.

If your driver never wants to see segments smaller than the logical
block, you could update your queue_limits dma_alignment to be
logical_block_size - 1. It is 511 by default so works for you only if
logical block size == SECTOR_SIZE.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-04-08 14:33 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-06 19:40 bio segment constraints Sean Anderson
2025-04-07  7:07 ` Christoph Hellwig
2025-04-07 13:46   ` Keith Busch
2025-04-07 13:59     ` Christoph Hellwig
2025-04-07 15:52       ` Bart Van Assche
2025-04-07 13:59   ` Sean Anderson
2025-04-07 14:12     ` Christoph Hellwig
2025-04-07  7:10 ` Hannes Reinecke
2025-04-07 14:14   ` Sean Anderson
2025-04-08  6:10     ` Hannes Reinecke
2025-04-08 13:57       ` Sean Anderson
2025-04-08 14:33     ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox