bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
       [not found] <20240620132157.888559-1-lihongbo22@huawei.com>
@ 2024-06-20 13:36 ` Kent Overstreet
  2024-06-20 13:54   ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2024-06-20 13:36 UTC (permalink / raw)
  To: Hongbo Li; +Cc: linux-bcachefs, linux-fsdevel, linux-block, axboe, hch

On Thu, Jun 20, 2024 at 09:21:57PM +0800, Hongbo Li wrote:
> Support fallback to buffered I/O if the operation being performed on
> unaligned length or offset. This may change the behavior for direct
> I/O in some cases.
> 
> [Before]
> For length which aligned with 256 bytes (not SECTOR aligned) will
> read failed under direct I/O.
> 
> [After]
> For length which aligned with 256 bytes (not SECTOR aligned) will
> read the data successfully under direct I/O because it will fallback
> to buffer I/O.
> 
> Signed-off-by: Hongbo Li <lihongbo22@huawei.com>

I don't think we want to do this in bcachefs - we can't efficiently mix
buffered and O_DIRECT IO on the same file. This is true on any
filesystem, but even moreso on bcachefs as we guarantee strict page
cache consistency (O_DIRECT IO blocks pagecache fills, and vice versa).

Better way to do that would be to make the bcachefs read path natively
handle unaligned IO, which conceptually wouldn't be difficult since our
read path already has support for bouncing when necessary (i.e. reading
only part of a checksummed or compressed extent).

The catch is that struct bio - bvec_iter - represents addresses with a
sector_t, and we'd want that to be a loff_t.

That's something we should do anyways; everything else in struct bio can
represent a byte-aligned io, bvec_iter.bi_sector is the only exception
and fixing that would help in consolidating our various scatter-gather
list data structures - but we'd need buy-in from Jens and Christoph
before doing that.

As an intermediate solution (i.e., I would want it clearly labelled as a
hack to be removed when the proper solution is merged and an explanation
of what we actually want), we could add a loff_t to bch_read_bio, and
only initialize bi_iter.bi_sector right before submit_bio() when we know
our loff_t is properly aligned. But that'd be pretty ugly, we'd have to
audit and probably add new helpers for everywhere we use a bvec_iter in
the read path.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-20 13:36 ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Kent Overstreet
@ 2024-06-20 13:54   ` Matthew Wilcox
  2024-06-20 14:16     ` Kent Overstreet
  2024-06-20 15:30     ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Christoph Hellwig
  0 siblings, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2024-06-20 13:54 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Hongbo Li, linux-bcachefs, linux-fsdevel, linux-block, axboe, hch

On Thu, Jun 20, 2024 at 09:36:42AM -0400, Kent Overstreet wrote:
> On Thu, Jun 20, 2024 at 09:21:57PM +0800, Hongbo Li wrote:
> > Support fallback to buffered I/O if the operation being performed on
> > unaligned length or offset. This may change the behavior for direct
> > I/O in some cases.
> > 
> > [Before]
> > For length which aligned with 256 bytes (not SECTOR aligned) will
> > read failed under direct I/O.
> > 
> > [After]
> > For length which aligned with 256 bytes (not SECTOR aligned) will
> > read the data successfully under direct I/O because it will fallback
> > to buffer I/O.

This is against the O_DIRECT requirements.

   O_DIRECT
       The O_DIRECT flag may impose alignment restrictions on  the  length  and
       address  of  user-space  buffers  and the file offset of I/Os.  In Linux
       alignment restrictions vary by filesystem and kernel version  and  might
       be  absent  entirely.   The  handling  of  misaligned O_DIRECT I/Os also
       varies; they can either fail with EINVAL or fall back to buffered I/O.

       Since Linux 6.1, O_DIRECT support and alignment restrictions for a  file
       can  be  queried using statx(2), using the STATX_DIOALIGN flag.  Support
       for STATX_DIOALIGN varies by filesystem; see statx(2).

       Some filesystems provide their  own  interfaces  for  querying  O_DIRECT
       alignment restrictions, for example the XFS_IOC_DIOINFO operation in xf‐
       sctl(3).  STATX_DIOALIGN should be used instead when it is available.

       If none of the above is available, then direct I/O support and alignment
       restrictions  can  only  be  assumed  from  known characteristics of the
       filesystem, the individual file, the underlying storage  device(s),  and
       the  kernel  version.  In Linux 2.4, most filesystems based on block de‐
       vices require that the file offset and the length and memory address  of
       all  I/O  segments  be multiples of the filesystem block size (typically
       4096 bytes).  In Linux 2.6.0, this was relaxed to the logical block size
       of the block device (typically 512 bytes).   A  block  device's  logical
       block  size  can be determined using the ioctl(2) BLKSSZGET operation or
       from the shell using the command:

           blockdev --getss

> The catch is that struct bio - bvec_iter - represents addresses with a
> sector_t, and we'd want that to be a loff_t.
> 
> That's something we should do anyways; everything else in struct bio can
> represent a byte-aligned io, bvec_iter.bi_sector is the only exception
> and fixing that would help in consolidating our various scatter-gather
> list data structures - but we'd need buy-in from Jens and Christoph
> before doing that.

I'm against it.  Block devices only do sector-aligned IO and we should
not pretend otherwise.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-20 13:54   ` Matthew Wilcox
@ 2024-06-20 14:16     ` Kent Overstreet
  2024-06-20 14:49       ` Matthew Wilcox
  2024-06-20 15:30     ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2024-06-20 14:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hongbo Li, linux-bcachefs, linux-fsdevel, linux-block, axboe, hch

On Thu, Jun 20, 2024 at 02:54:09PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 20, 2024 at 09:36:42AM -0400, Kent Overstreet wrote:
> > On Thu, Jun 20, 2024 at 09:21:57PM +0800, Hongbo Li wrote:
> > > Support fallback to buffered I/O if the operation being performed on
> > > unaligned length or offset. This may change the behavior for direct
> > > I/O in some cases.
> > > 
> > > [Before]
> > > For length which aligned with 256 bytes (not SECTOR aligned) will
> > > read failed under direct I/O.
> > > 
> > > [After]
> > > For length which aligned with 256 bytes (not SECTOR aligned) will
> > > read the data successfully under direct I/O because it will fallback
> > > to buffer I/O.
> 
> This is against the O_DIRECT requirements.
> 
>    O_DIRECT
>        The O_DIRECT flag may impose alignment restrictions on  the  length  and
>        address  of  user-space  buffers  and the file offset of I/Os.  In Linux
>        alignment restrictions vary by filesystem and kernel version  and  might
>        be  absent  entirely.   The  handling  of  misaligned O_DIRECT I/Os also
>        varies; they can either fail with EINVAL or fall back to buffered I/O.
> 
>        Since Linux 6.1, O_DIRECT support and alignment restrictions for a  file
>        can  be  queried using statx(2), using the STATX_DIOALIGN flag.  Support
>        for STATX_DIOALIGN varies by filesystem; see statx(2).
> 
>        Some filesystems provide their  own  interfaces  for  querying  O_DIRECT
>        alignment restrictions, for example the XFS_IOC_DIOINFO operation in xf‐
>        sctl(3).  STATX_DIOALIGN should be used instead when it is available.
> 
>        If none of the above is available, then direct I/O support and alignment
>        restrictions  can  only  be  assumed  from  known characteristics of the
>        filesystem, the individual file, the underlying storage  device(s),  and
>        the  kernel  version.  In Linux 2.4, most filesystems based on block de‐
>        vices require that the file offset and the length and memory address  of
>        all  I/O  segments  be multiples of the filesystem block size (typically
>        4096 bytes).  In Linux 2.6.0, this was relaxed to the logical block size
>        of the block device (typically 512 bytes).   A  block  device's  logical
>        block  size  can be determined using the ioctl(2) BLKSSZGET operation or
>        from the shell using the command:

That's really just descriptive, not prescriptive.

The intent of O_DIRECT is "bypass the page cache", the alignment
restrictions are just a side effect of that. Applications just care
about is having predictable performance characteristics.

> > The catch is that struct bio - bvec_iter - represents addresses with a
> > sector_t, and we'd want that to be a loff_t.
> > 
> > That's something we should do anyways; everything else in struct bio can
> > represent a byte-aligned io, bvec_iter.bi_sector is the only exception
> > and fixing that would help in consolidating our various scatter-gather
> > list data structures - but we'd need buy-in from Jens and Christoph
> > before doing that.
> 
> I'm against it.  Block devices only do sector-aligned IO and we should
> not pretend otherwise.

Eh?

bio isn't really specific to the block layer anyways, given that an
iov_iter can be a bio underneath. We _really_ should be trying for
better commonality of data structures.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-20 14:16     ` Kent Overstreet
@ 2024-06-20 14:49       ` Matthew Wilcox
  2024-06-20 14:56         ` bvec_iter.bi_sector -> loff_t? Jens Axboe
                           ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Matthew Wilcox @ 2024-06-20 14:49 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Hongbo Li, linux-bcachefs, linux-fsdevel, linux-block, axboe, hch

On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
> That's really just descriptive, not prescriptive.
> 
> The intent of O_DIRECT is "bypass the page cache", the alignment
> restrictions are just a side effect of that. Applications just care
> about is having predictable performance characteristics.

But any application that has been written to use O_DIRECT already has the
alignment & size guarantees in place.  What this patch is attempting to do
is make it "more friendly" to use, and I'm not sure that's a great idea.
Not without buy-in from a large cross-section of filesystem people.

I'm more sympathetic to "lets relax the alignment requirements", since
most IO devices actually can do IO to arbitrary boundaries (or at least
reasonable boundaries, eg cacheline alignment or 4-byte alignment).
The 512 byte alignment doesn't seem particularly rooted in any hardware
restrictions.

But size?  Fundamentally, we're asking the device to do IO directly to
this userspace address.  That means you get to do the entire IO, not
just the part of it that you want.  I know some devices have bitbucket
descriptors, but many don't.

> > I'm against it.  Block devices only do sector-aligned IO and we should
> > not pretend otherwise.
> 
> Eh?
> 
> bio isn't really specific to the block layer anyways, given that an
> iov_iter can be a bio underneath. We _really_ should be trying for
> better commonality of data structures.

bio is absolutely specific to the block layer.  Look at it:

/*
 * main unit of I/O for the block layer and lower layers (ie drivers and
 * stacking drivers)
 */

        struct block_device     *bi_bdev;
        unsigned short          bi_flags;       /* BIO_* below */
        unsigned short          bi_ioprio;
        blk_status_t            bi_status;

Filesystems get to use it to interact with the block layer.  The iov_iter
isn't an abstraction over the bio, it's an abstraction over the bio_vec.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 14:49       ` Matthew Wilcox
@ 2024-06-20 14:56         ` Jens Axboe
  2024-06-20 15:15           ` Matthew Wilcox
  2024-06-21  2:37           ` Hongbo Li
  2024-06-20 15:35         ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Kent Overstreet
  2024-06-21  3:13         ` bvec_iter.bi_sector -> loff_t? Hongbo Li
  2 siblings, 2 replies; 20+ messages in thread
From: Jens Axboe @ 2024-06-20 14:56 UTC (permalink / raw)
  To: Matthew Wilcox, Kent Overstreet
  Cc: Hongbo Li, linux-bcachefs, linux-fsdevel, linux-block, hch

On 6/20/24 8:49 AM, Matthew Wilcox wrote:
> On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
> I'm more sympathetic to "lets relax the alignment requirements", since
> most IO devices actually can do IO to arbitrary boundaries (or at least
> reasonable boundaries, eg cacheline alignment or 4-byte alignment).
> The 512 byte alignment doesn't seem particularly rooted in any hardware
> restrictions.

We already did, based on real world use cases to avoid copies just
because the memory wasn't aligned on a sector size boundary. It's
perfectly valid now to do:

struct queue_limits lim {
	.dma_alignment = 3,
};

disk = blk_mq_alloc_disk(&tag_set, &lim, NULL);

and have O_DIRECT with a 32-bit memory alignment work just fine, where
before it would EINVAL. The sector size memory alignment thing has
always been odd and never rooted in anything other than "oh let's just
require the whole combination of size/disk offset/alignment to be sector
based".

> But size?  Fundamentally, we're asking the device to do IO directly to
> this userspace address.  That means you get to do the entire IO, not
> just the part of it that you want.  I know some devices have bitbucket
> descriptors, but many don't.

We did poke at that a bit for nvme with bitbuckets, but I don't even
know how prevalent that support is in hardware. Definitely way iffier
and spotty than the alignment, where that limit was never based on
anything remotely resembling a hardware restraint.

>>> I'm against it.  Block devices only do sector-aligned IO and we should
>>> not pretend otherwise.
>>
>> Eh?
>>
>> bio isn't really specific to the block layer anyways, given that an
>> iov_iter can be a bio underneath. We _really_ should be trying for
>> better commonality of data structures.
> 
> bio is absolutely specific to the block layer.  Look at it:

It's literally "block IO", so would have to concur with that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 14:56         ` bvec_iter.bi_sector -> loff_t? Jens Axboe
@ 2024-06-20 15:15           ` Matthew Wilcox
  2024-06-20 15:18             ` Jens Axboe
  2024-06-20 15:20             ` Christoph Hellwig
  2024-06-21  2:37           ` Hongbo Li
  1 sibling, 2 replies; 20+ messages in thread
From: Matthew Wilcox @ 2024-06-20 15:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, Hongbo Li, linux-bcachefs, linux-fsdevel,
	linux-block, hch

On Thu, Jun 20, 2024 at 08:56:39AM -0600, Jens Axboe wrote:
> On 6/20/24 8:49 AM, Matthew Wilcox wrote:
> > On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
> > I'm more sympathetic to "lets relax the alignment requirements", since
> > most IO devices actually can do IO to arbitrary boundaries (or at least
> > reasonable boundaries, eg cacheline alignment or 4-byte alignment).
> > The 512 byte alignment doesn't seem particularly rooted in any hardware
> > restrictions.
> 
> We already did, based on real world use cases to avoid copies just
> because the memory wasn't aligned on a sector size boundary. It's
> perfectly valid now to do:
> 
> struct queue_limits lim {
> 	.dma_alignment = 3,
> };
> 
> disk = blk_mq_alloc_disk(&tag_set, &lim, NULL);
> 
> and have O_DIRECT with a 32-bit memory alignment work just fine, where
> before it would EINVAL. The sector size memory alignment thing has
> always been odd and never rooted in anything other than "oh let's just
> require the whole combination of size/disk offset/alignment to be sector
> based".

Oh, cool!  https://man7.org/linux/man-pages/man2/open.2.html
doesn't know about this yet; is anyone working on updating it?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 15:15           ` Matthew Wilcox
@ 2024-06-20 15:18             ` Jens Axboe
  2024-06-20 16:26               ` Keith Busch
  2024-06-20 15:20             ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2024-06-20 15:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, Hongbo Li, linux-bcachefs, linux-fsdevel,
	linux-block, hch, Keith Busch

On 6/20/24 9:15 AM, Matthew Wilcox wrote:
> On Thu, Jun 20, 2024 at 08:56:39AM -0600, Jens Axboe wrote:
>> On 6/20/24 8:49 AM, Matthew Wilcox wrote:
>>> On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
>>> I'm more sympathetic to "lets relax the alignment requirements", since
>>> most IO devices actually can do IO to arbitrary boundaries (or at least
>>> reasonable boundaries, eg cacheline alignment or 4-byte alignment).
>>> The 512 byte alignment doesn't seem particularly rooted in any hardware
>>> restrictions.
>>
>> We already did, based on real world use cases to avoid copies just
>> because the memory wasn't aligned on a sector size boundary. It's
>> perfectly valid now to do:
>>
>> struct queue_limits lim {
>> 	.dma_alignment = 3,
>> };
>>
>> disk = blk_mq_alloc_disk(&tag_set, &lim, NULL);
>>
>> and have O_DIRECT with a 32-bit memory alignment work just fine, where
>> before it would EINVAL. The sector size memory alignment thing has
>> always been odd and never rooted in anything other than "oh let's just
>> require the whole combination of size/disk offset/alignment to be sector
>> based".
> 
> Oh, cool!  https://man7.org/linux/man-pages/man2/open.2.html
> doesn't know about this yet; is anyone working on updating it?

Probably not... At least we do have STATX_DIOALIGN which can be used to
figure out what the alignment is, but I don't recall if any man date
updates got done. Keith may remember, CC'ed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 15:15           ` Matthew Wilcox
  2024-06-20 15:18             ` Jens Axboe
@ 2024-06-20 15:20             ` Christoph Hellwig
  2024-06-20 15:21               ` Jens Axboe
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-06-20 15:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Kent Overstreet, Hongbo Li, linux-bcachefs,
	linux-fsdevel, linux-block, hch

On Thu, Jun 20, 2024 at 04:15:23PM +0100, Matthew Wilcox wrote:
> > 
> > and have O_DIRECT with a 32-bit memory alignment work just fine, where
> > before it would EINVAL. The sector size memory alignment thing has
> > always been odd and never rooted in anything other than "oh let's just
> > require the whole combination of size/disk offset/alignment to be sector
> > based".
> 
> Oh, cool!  https://man7.org/linux/man-pages/man2/open.2.html
> doesn't know about this yet; is anyone working on updating it?

Just remember that there are two kinds of alignments:

 - the memory alignment, which Jens is talking about
 - the offset/size alignment, which is set by the LBA size

statx (optionally) exposes both in the stx_dio_mem_align and
stx_dio_offset_align fields, which are documented in the statx(2)
man page.  For network file systems like nfs there might be no
alignment requirements at all.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 15:20             ` Christoph Hellwig
@ 2024-06-20 15:21               ` Jens Axboe
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2024-06-20 15:21 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: Kent Overstreet, Hongbo Li, linux-bcachefs, linux-fsdevel,
	linux-block

On 6/20/24 9:20 AM, Christoph Hellwig wrote:
> On Thu, Jun 20, 2024 at 04:15:23PM +0100, Matthew Wilcox wrote:
>>>
>>> and have O_DIRECT with a 32-bit memory alignment work just fine, where
>>> before it would EINVAL. The sector size memory alignment thing has
>>> always been odd and never rooted in anything other than "oh let's just
>>> require the whole combination of size/disk offset/alignment to be sector
>>> based".
>>
>> Oh, cool!  https://man7.org/linux/man-pages/man2/open.2.html
>> doesn't know about this yet; is anyone working on updating it?
> 
> Just remember that there are two kinds of alignments:
> 
>  - the memory alignment, which Jens is talking about
>  - the offset/size alignment, which is set by the LBA size

Right, that's why I made the distinction above in terms of size, disk
offset, and alignment - with the latter being what we're talking about,
the memory alignment.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-20 13:54   ` Matthew Wilcox
  2024-06-20 14:16     ` Kent Overstreet
@ 2024-06-20 15:30     ` Christoph Hellwig
  2024-06-20 15:43       ` Kent Overstreet
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-06-20 15:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kent Overstreet, Hongbo Li, linux-bcachefs, linux-fsdevel,
	linux-block, axboe, hch

On Thu, Jun 20, 2024 at 02:54:09PM +0100, Matthew Wilcox wrote:
> I'm against it.  Block devices only do sector-aligned IO and we should
> not pretend otherwise.

While I agree with that, the bvec_iter is actually used in a few other
places and could be used in more, and the 512-byte sector unit bi_sector
is the only weird thing that's not useful elsewhere.  So turning that
into a

	u64 bi_addr;

that is byte based where the meaning is specific to the user would
actually be kinda nice.  For traditional block users we'd need a
bio_sector() helpers similar to the existing bio_sectors() one,
but a lot of non-trivial drivers actually need to translated to
a variable LBA-based addressing, which would be (a tiny little bit)
simpler with the byte address.   As bi_size is already in bytes
it would also fit in pretty naturally with that.

The only thing that is really off putting is the amount of churn that
this would cause.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-20 14:49       ` Matthew Wilcox
  2024-06-20 14:56         ` bvec_iter.bi_sector -> loff_t? Jens Axboe
@ 2024-06-20 15:35         ` Kent Overstreet
  2024-06-21  3:13         ` bvec_iter.bi_sector -> loff_t? Hongbo Li
  2 siblings, 0 replies; 20+ messages in thread
From: Kent Overstreet @ 2024-06-20 15:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hongbo Li, linux-bcachefs, linux-fsdevel, linux-block, axboe, hch

On Thu, Jun 20, 2024 at 03:49:54PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
> > That's really just descriptive, not prescriptive.
> > 
> > The intent of O_DIRECT is "bypass the page cache", the alignment
> > restrictions are just a side effect of that. Applications just care
> > about is having predictable performance characteristics.
> 
> But any application that has been written to use O_DIRECT already has the
> alignment & size guarantees in place.  What this patch is attempting to do
> is make it "more friendly" to use, and I'm not sure that's a great idea.
> Not without buy-in from a large cross-section of filesystem people.

The thing is, we're currently conflating "don't use the page cache" and
"don't bounce", and we really shouldn't be.

Sometimes performance is your main criteria and you do want the error if
you misaligned the IO, but not always. The the main reason being that
mixing buffered and O_DIRECT is _really_ expensive, and you will quite
often have codepaths that don't care about performance, but do need to
use O_DIRECT so they don't interfere with your paths that do care about
performance.

So they're forced to deal with bouncing and alignment restrictions at
the applicatoin level, even though they really don't want to be, and we
(at least bcachefs) already have the code to do that bouncing when
required - there's really no reason for them to be hand rolling that,
and if writes are involved then it gets really nontrivial.

This is one of the reasons I'll likely be implementing sub-blocksize
writes through the low level write path in bcachefs. O_DIRECT gives you
things that buffered IO doesn't - it's much cheaper, and simpler to
guarantee write ordering of O_DIRECT writes (c.f. untorn writes,
O_DIRECT is much easier than buffered there as well, and it's even
easier in a COW filesystem) - and it's going to be worthwhile to be able
to provide that behaviour without the alignment restrictions that
currently come with O_DIRECT.

To sum all that up: "don't bounce" should really another rwf flag we can
pass to preadv2/pwritev2 - RWF_NOBOUNCE.

And for bcachefs I would implement bouncing-when-required even without
that flag being available, simply because that's current behaviour (due
to checksums being extent granularity). That flag really only makes
sense for bcachefs once we get opt-in block granular checksums (which is
coming, as well).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-20 15:30     ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Christoph Hellwig
@ 2024-06-20 15:43       ` Kent Overstreet
  2024-06-21  1:48         ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2024-06-20 15:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Hongbo Li, linux-bcachefs, linux-fsdevel,
	linux-block, axboe

On Thu, Jun 20, 2024 at 05:30:50PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 20, 2024 at 02:54:09PM +0100, Matthew Wilcox wrote:
> > I'm against it.  Block devices only do sector-aligned IO and we should
> > not pretend otherwise.
> 
> While I agree with that, the bvec_iter is actually used in a few other
> places and could be used in more, and the 512-byte sector unit bi_sector
> is the only weird thing that's not useful elsewhere.  So turning that
> into a
> 
> 	u64 bi_addr;
> 
> that is byte based where the meaning is specific to the user would
> actually be kinda nice.  For traditional block users we'd need a
> bio_sector() helpers similar to the existing bio_sectors() one,
> but a lot of non-trivial drivers actually need to translated to
> a variable LBA-based addressing, which would be (a tiny little bit)
> simpler with the byte address.   As bi_size is already in bytes
> it would also fit in pretty naturally with that.
> 
> The only thing that is really off putting is the amount of churn that
> this would cause.

I'm being imprecise when I just say 'struct bio'; there's things in
there that are block layer specific but there are also things in there
you want that aren't block layer specific (completion callback, write
flags, s/bi_bdev/bi_inode and that as well, perhaps). It's not at all
clear to me we'd want to deal with the churn to split that up or make
bio itself less block layer specific (although, but when I say 'aiming
for commality with struct bio' that sort of thing is what I have in
mind.

But more immediately, yes - bi_addr as all we need for this, and like
you said I think it'd be a worthwhile change.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 15:18             ` Jens Axboe
@ 2024-06-20 16:26               ` Keith Busch
  0 siblings, 0 replies; 20+ messages in thread
From: Keith Busch @ 2024-06-20 16:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox, Kent Overstreet, Hongbo Li, linux-bcachefs,
	linux-fsdevel, linux-block, hch

On Thu, Jun 20, 2024 at 09:18:58AM -0600, Jens Axboe wrote:
> On 6/20/24 9:15 AM, Matthew Wilcox wrote:
> > On Thu, Jun 20, 2024 at 08:56:39AM -0600, Jens Axboe wrote:
> >> On 6/20/24 8:49 AM, Matthew Wilcox wrote:
> >>> On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
> >>> I'm more sympathetic to "lets relax the alignment requirements", since
> >>> most IO devices actually can do IO to arbitrary boundaries (or at least
> >>> reasonable boundaries, eg cacheline alignment or 4-byte alignment).
> >>> The 512 byte alignment doesn't seem particularly rooted in any hardware
> >>> restrictions.
> >>
> >> We already did, based on real world use cases to avoid copies just
> >> because the memory wasn't aligned on a sector size boundary. It's
> >> perfectly valid now to do:
> >>
> >> struct queue_limits lim {
> >> 	.dma_alignment = 3,
> >> };
> >>
> >> disk = blk_mq_alloc_disk(&tag_set, &lim, NULL);
> >>
> >> and have O_DIRECT with a 32-bit memory alignment work just fine, where
> >> before it would EINVAL. The sector size memory alignment thing has
> >> always been odd and never rooted in anything other than "oh let's just
> >> require the whole combination of size/disk offset/alignment to be sector
> >> based".
> > 
> > Oh, cool!  https://man7.org/linux/man-pages/man2/open.2.html
> > doesn't know about this yet; is anyone working on updating it?
> 
> Probably not... At least we do have STATX_DIOALIGN which can be used to
> figure out what the alignment is, but I don't recall if any man date
> updates got done. Keith may remember, CC'ed.

The man page already recommends statx if available, which tells you
everything you need to know about your device's direct io alignment
requirements. The man only suggests block size alignment for older
kernels, so I think it's fine as-is, no?

You can also query the queue's "dma_alignment" sysfs attribute.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-20 15:43       ` Kent Overstreet
@ 2024-06-21  1:48         ` Ming Lei
  2024-06-21  3:07           ` Kent Overstreet
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2024-06-21  1:48 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Matthew Wilcox, Hongbo Li, linux-bcachefs,
	linux-fsdevel, linux-block, axboe, ming.lei

On Thu, Jun 20, 2024 at 11:43:44AM -0400, Kent Overstreet wrote:
> On Thu, Jun 20, 2024 at 05:30:50PM +0200, Christoph Hellwig wrote:
> > On Thu, Jun 20, 2024 at 02:54:09PM +0100, Matthew Wilcox wrote:
> > > I'm against it.  Block devices only do sector-aligned IO and we should
> > > not pretend otherwise.
> > 
> > While I agree with that, the bvec_iter is actually used in a few other
> > places and could be used in more, and the 512-byte sector unit bi_sector
> > is the only weird thing that's not useful elsewhere.  So turning that
> > into a
> > 
> > 	u64 bi_addr;
> > 
> > that is byte based where the meaning is specific to the user would
> > actually be kinda nice.  For traditional block users we'd need a
> > bio_sector() helpers similar to the existing bio_sectors() one,
> > but a lot of non-trivial drivers actually need to translated to
> > a variable LBA-based addressing, which would be (a tiny little bit)
> > simpler with the byte address.   As bi_size is already in bytes
> > it would also fit in pretty naturally with that.
> > 
> > The only thing that is really off putting is the amount of churn that
> > this would cause.
> 
> I'm being imprecise when I just say 'struct bio'; there's things in
> there that are block layer specific but there are also things in there
> you want that aren't block layer specific (completion callback, write
> flags, s/bi_bdev/bi_inode and that as well, perhaps). It's not at all
> clear to me we'd want to deal with the churn to split that up or make
> bio itself less block layer specific (although, but when I say 'aiming
> for commality with struct bio' that sort of thing is what I have in
> mind.
> 
> But more immediately, yes - bi_addr as all we need for this, and like
> you said I think it'd be a worthwhile change.

Still not clear why you need unaligned bi_addr for bio, if this bio needs
to call submit_bio(), it has to be aligned. Otherwise, you could invent any
structure for this purpose, and the structure can be payload of bio for
avoiding extra allocation, even it can be FS generic structure.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 14:56         ` bvec_iter.bi_sector -> loff_t? Jens Axboe
  2024-06-20 15:15           ` Matthew Wilcox
@ 2024-06-21  2:37           ` Hongbo Li
  2024-06-21  3:05             ` Kent Overstreet
  1 sibling, 1 reply; 20+ messages in thread
From: Hongbo Li @ 2024-06-21  2:37 UTC (permalink / raw)
  To: Jens Axboe, Matthew Wilcox, Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-block, hch



On 2024/6/20 22:56, Jens Axboe wrote:
> On 6/20/24 8:49 AM, Matthew Wilcox wrote:
>> On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
>> I'm more sympathetic to "lets relax the alignment requirements", since
>> most IO devices actually can do IO to arbitrary boundaries (or at least
>> reasonable boundaries, eg cacheline alignment or 4-byte alignment).
>> The 512 byte alignment doesn't seem particularly rooted in any hardware
>> restrictions.
> 
> We already did, based on real world use cases to avoid copies just
> because the memory wasn't aligned on a sector size boundary. It's
> perfectly valid now to do:
> 
> struct queue_limits lim {
> 	.dma_alignment = 3,
> };
> 
> disk = blk_mq_alloc_disk(&tag_set, &lim, NULL);
> 
> and have O_DIRECT with a 32-bit memory alignment work just fine, where
Does this mean that file system can relax its alignment restrictions on 
offset or memory (not 512 or 4096)? Is it necessary to add alignment 
restrictions in the super block of file system? Because there are 
different alignment restrictions for different storage hardware driver.

Thanks,
Hongbo
> before it would EINVAL. The sector size memory alignment thing has
> always been odd and never rooted in anything other than "oh let's just
> require the whole combination of size/disk offset/alignment to be sector
> based".
> 
>> But size?  Fundamentally, we're asking the device to do IO directly to
>> this userspace address.  That means you get to do the entire IO, not
>> just the part of it that you want.  I know some devices have bitbucket
>> descriptors, but many don't.
> 
> We did poke at that a bit for nvme with bitbuckets, but I don't even
> know how prevalent that support is in hardware. Definitely way iffier
> and spotty than the alignment, where that limit was never based on
> anything remotely resembling a hardware restraint.
> 
>>>> I'm against it.  Block devices only do sector-aligned IO and we should
>>>> not pretend otherwise.
>>>
>>> Eh?
>>>
>>> bio isn't really specific to the block layer anyways, given that an
>>> iov_iter can be a bio underneath. We _really_ should be trying for
>>> better commonality of data structures.
>>
>> bio is absolutely specific to the block layer.  Look at it:
> 
> It's literally "block IO", so would have to concur with that.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-21  2:37           ` Hongbo Li
@ 2024-06-21  3:05             ` Kent Overstreet
  0 siblings, 0 replies; 20+ messages in thread
From: Kent Overstreet @ 2024-06-21  3:05 UTC (permalink / raw)
  To: Hongbo Li
  Cc: Jens Axboe, Matthew Wilcox, linux-bcachefs, linux-fsdevel,
	linux-block, hch

On Fri, Jun 21, 2024 at 10:37:29AM +0800, Hongbo Li wrote:
> 
> 
> On 2024/6/20 22:56, Jens Axboe wrote:
> > On 6/20/24 8:49 AM, Matthew Wilcox wrote:
> > > On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
> > > I'm more sympathetic to "lets relax the alignment requirements", since
> > > most IO devices actually can do IO to arbitrary boundaries (or at least
> > > reasonable boundaries, eg cacheline alignment or 4-byte alignment).
> > > The 512 byte alignment doesn't seem particularly rooted in any hardware
> > > restrictions.
> > 
> > We already did, based on real world use cases to avoid copies just
> > because the memory wasn't aligned on a sector size boundary. It's
> > perfectly valid now to do:
> > 
> > struct queue_limits lim {
> > 	.dma_alignment = 3,
> > };
> > 
> > disk = blk_mq_alloc_disk(&tag_set, &lim, NULL);
> > 
> > and have O_DIRECT with a 32-bit memory alignment work just fine, where
> Does this mean that file system can relax its alignment restrictions on
> offset or memory (not 512 or 4096)? Is it necessary to add alignment
> restrictions in the super block of file system? Because there are different
> alignment restrictions for different storage hardware driver.

A multidevice filesystem can't get rid of those checks becaause if
different devices have inconsistent restrictions that'll be quite the
source of confusion later.

And since devices can be hot-added, it does effectively need to be in
the superblock.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-21  1:48         ` Ming Lei
@ 2024-06-21  3:07           ` Kent Overstreet
  2024-06-21  3:36             ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Kent Overstreet @ 2024-06-21  3:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Matthew Wilcox, Hongbo Li, linux-bcachefs,
	linux-fsdevel, linux-block, axboe

On Fri, Jun 21, 2024 at 09:48:11AM +0800, Ming Lei wrote:
> On Thu, Jun 20, 2024 at 11:43:44AM -0400, Kent Overstreet wrote:
> > On Thu, Jun 20, 2024 at 05:30:50PM +0200, Christoph Hellwig wrote:
> > > On Thu, Jun 20, 2024 at 02:54:09PM +0100, Matthew Wilcox wrote:
> > > > I'm against it.  Block devices only do sector-aligned IO and we should
> > > > not pretend otherwise.
> > > 
> > > While I agree with that, the bvec_iter is actually used in a few other
> > > places and could be used in more, and the 512-byte sector unit bi_sector
> > > is the only weird thing that's not useful elsewhere.  So turning that
> > > into a
> > > 
> > > 	u64 bi_addr;
> > > 
> > > that is byte based where the meaning is specific to the user would
> > > actually be kinda nice.  For traditional block users we'd need a
> > > bio_sector() helpers similar to the existing bio_sectors() one,
> > > but a lot of non-trivial drivers actually need to translated to
> > > a variable LBA-based addressing, which would be (a tiny little bit)
> > > simpler with the byte address.   As bi_size is already in bytes
> > > it would also fit in pretty naturally with that.
> > > 
> > > The only thing that is really off putting is the amount of churn that
> > > this would cause.
> > 
> > I'm being imprecise when I just say 'struct bio'; there's things in
> > there that are block layer specific but there are also things in there
> > you want that aren't block layer specific (completion callback, write
> > flags, s/bi_bdev/bi_inode and that as well, perhaps). It's not at all
> > clear to me we'd want to deal with the churn to split that up or make
> > bio itself less block layer specific (although, but when I say 'aiming
> > for commality with struct bio' that sort of thing is what I have in
> > mind.
> > 
> > But more immediately, yes - bi_addr as all we need for this, and like
> > you said I think it'd be a worthwhile change.
> 
> Still not clear why you need unaligned bi_addr for bio, if this bio needs
> to call submit_bio(), it has to be aligned. Otherwise, you could invent any
> structure for this purpose, and the structure can be payload of bio for
> avoiding extra allocation, even it can be FS generic structure.

We want to have fewer scatter/gather list data structures, not more.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t?
  2024-06-20 14:49       ` Matthew Wilcox
  2024-06-20 14:56         ` bvec_iter.bi_sector -> loff_t? Jens Axboe
  2024-06-20 15:35         ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Kent Overstreet
@ 2024-06-21  3:13         ` Hongbo Li
  2 siblings, 0 replies; 20+ messages in thread
From: Hongbo Li @ 2024-06-21  3:13 UTC (permalink / raw)
  To: Matthew Wilcox, Kent Overstreet
  Cc: linux-bcachefs, linux-fsdevel, linux-block, axboe, hch



On 2024/6/20 22:49, Matthew Wilcox wrote:
> On Thu, Jun 20, 2024 at 10:16:02AM -0400, Kent Overstreet wrote:
>> That's really just descriptive, not prescriptive.
>>
>> The intent of O_DIRECT is "bypass the page cache", the alignment
>> restrictions are just a side effect of that. Applications just care
>> about is having predictable performance characteristics.
> 
> But any application that has been written to use O_DIRECT already has the
> alignment & size guarantees in place.  What this patch is attempting to do
> is make it "more friendly" to use, and I'm not sure that's a great idea.
> Not without buy-in from a large cross-section of filesystem people.

Indeed, the purpose of O_DIRECT is to bypass the page cache. Either the
file system can handle the unaligned offset or memory, or it should
directly return EINVAL (or ENOTSUP for file cannot be read by direct
I/O?). But I have observed that some file systems have
this fallback logic (in ext4: if `ext4_should_use_dio` not true, it will 
fallback to buffer I/O. in f2fs: if `f2fs_should_use_dio` not true, it 
will fallback to buffer I/O.). Does O_DIRECT flag need prescriptive 
definition to standardize I/O behavior?

Thanks,
Hongbo
> 
> I'm more sympathetic to "lets relax the alignment requirements", since
> most IO devices actually can do IO to arbitrary boundaries (or at least
> reasonable boundaries, eg cacheline alignment or 4-byte alignment).
> The 512 byte alignment doesn't seem particularly rooted in any hardware
> restrictions.
> 
> But size?  Fundamentally, we're asking the device to do IO directly to
> this userspace address.  That means you get to do the entire IO, not
> just the part of it that you want.  I know some devices have bitbucket
> descriptors, but many don't.
> 
>>> I'm against it.  Block devices only do sector-aligned IO and we should
>>> not pretend otherwise.
>>
>> Eh?
>>
>> bio isn't really specific to the block layer anyways, given that an
>> iov_iter can be a bio underneath. We _really_ should be trying for
>> better commonality of data structures.
> 
> bio is absolutely specific to the block layer.  Look at it:
> 
> /*
>   * main unit of I/O for the block layer and lower layers (ie drivers and
>   * stacking drivers)
>   */
> 
>          struct block_device     *bi_bdev;
>          unsigned short          bi_flags;       /* BIO_* below */
>          unsigned short          bi_ioprio;
>          blk_status_t            bi_status;
> 
> Filesystems get to use it to interact with the block layer.  The iov_iter
> isn't an abstraction over the bio, it's an abstraction over the bio_vec.
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-21  3:07           ` Kent Overstreet
@ 2024-06-21  3:36             ` Ming Lei
  2024-06-21  3:52               ` Kent Overstreet
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2024-06-21  3:36 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, Matthew Wilcox, Hongbo Li, linux-bcachefs,
	linux-fsdevel, linux-block, axboe

On Thu, Jun 20, 2024 at 11:07:19PM -0400, Kent Overstreet wrote:
> On Fri, Jun 21, 2024 at 09:48:11AM +0800, Ming Lei wrote:
> > On Thu, Jun 20, 2024 at 11:43:44AM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 20, 2024 at 05:30:50PM +0200, Christoph Hellwig wrote:
> > > > On Thu, Jun 20, 2024 at 02:54:09PM +0100, Matthew Wilcox wrote:
> > > > > I'm against it.  Block devices only do sector-aligned IO and we should
> > > > > not pretend otherwise.
> > > > 
> > > > While I agree with that, the bvec_iter is actually used in a few other
> > > > places and could be used in more, and the 512-byte sector unit bi_sector
> > > > is the only weird thing that's not useful elsewhere.  So turning that
> > > > into a
> > > > 
> > > > 	u64 bi_addr;
> > > > 
> > > > that is byte based where the meaning is specific to the user would
> > > > actually be kinda nice.  For traditional block users we'd need a
> > > > bio_sector() helpers similar to the existing bio_sectors() one,
> > > > but a lot of non-trivial drivers actually need to translated to
> > > > a variable LBA-based addressing, which would be (a tiny little bit)
> > > > simpler with the byte address.   As bi_size is already in bytes
> > > > it would also fit in pretty naturally with that.
> > > > 
> > > > The only thing that is really off putting is the amount of churn that
> > > > this would cause.
> > > 
> > > I'm being imprecise when I just say 'struct bio'; there's things in
> > > there that are block layer specific but there are also things in there
> > > you want that aren't block layer specific (completion callback, write
> > > flags, s/bi_bdev/bi_inode and that as well, perhaps). It's not at all
> > > clear to me we'd want to deal with the churn to split that up or make
> > > bio itself less block layer specific (although, but when I say 'aiming
> > > for commality with struct bio' that sort of thing is what I have in
> > > mind.
> > > 
> > > But more immediately, yes - bi_addr as all we need for this, and like
> > > you said I think it'd be a worthwhile change.
> > 
> > Still not clear why you need unaligned bi_addr for bio, if this bio needs
> > to call submit_bio(), it has to be aligned. Otherwise, you could invent any
> > structure for this purpose, and the structure can be payload of bio for
> > avoiding extra allocation, even it can be FS generic structure.
> 
> We want to have fewer scatter/gather list data structures, not more.

OK, that look fine to change to bi_addr since bvec_iter is widely used now,
maybe .bi_sector can be moved into bio, cause bvec iterator needn't it.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset
  2024-06-21  3:36             ` Ming Lei
@ 2024-06-21  3:52               ` Kent Overstreet
  0 siblings, 0 replies; 20+ messages in thread
From: Kent Overstreet @ 2024-06-21  3:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Matthew Wilcox, Hongbo Li, linux-bcachefs,
	linux-fsdevel, linux-block, axboe

On Fri, Jun 21, 2024 at 11:36:20AM +0800, Ming Lei wrote:
> On Thu, Jun 20, 2024 at 11:07:19PM -0400, Kent Overstreet wrote:
> > On Fri, Jun 21, 2024 at 09:48:11AM +0800, Ming Lei wrote:
> > > On Thu, Jun 20, 2024 at 11:43:44AM -0400, Kent Overstreet wrote:
> > > > On Thu, Jun 20, 2024 at 05:30:50PM +0200, Christoph Hellwig wrote:
> > > > > On Thu, Jun 20, 2024 at 02:54:09PM +0100, Matthew Wilcox wrote:
> > > > > > I'm against it.  Block devices only do sector-aligned IO and we should
> > > > > > not pretend otherwise.
> > > > > 
> > > > > While I agree with that, the bvec_iter is actually used in a few other
> > > > > places and could be used in more, and the 512-byte sector unit bi_sector
> > > > > is the only weird thing that's not useful elsewhere.  So turning that
> > > > > into a
> > > > > 
> > > > > 	u64 bi_addr;
> > > > > 
> > > > > that is byte based where the meaning is specific to the user would
> > > > > actually be kinda nice.  For traditional block users we'd need a
> > > > > bio_sector() helpers similar to the existing bio_sectors() one,
> > > > > but a lot of non-trivial drivers actually need to translated to
> > > > > a variable LBA-based addressing, which would be (a tiny little bit)
> > > > > simpler with the byte address.   As bi_size is already in bytes
> > > > > it would also fit in pretty naturally with that.
> > > > > 
> > > > > The only thing that is really off putting is the amount of churn that
> > > > > this would cause.
> > > > 
> > > > I'm being imprecise when I just say 'struct bio'; there's things in
> > > > there that are block layer specific but there are also things in there
> > > > you want that aren't block layer specific (completion callback, write
> > > > flags, s/bi_bdev/bi_inode and that as well, perhaps). It's not at all
> > > > clear to me we'd want to deal with the churn to split that up or make
> > > > bio itself less block layer specific (although, but when I say 'aiming
> > > > for commality with struct bio' that sort of thing is what I have in
> > > > mind.
> > > > 
> > > > But more immediately, yes - bi_addr as all we need for this, and like
> > > > you said I think it'd be a worthwhile change.
> > > 
> > > Still not clear why you need unaligned bi_addr for bio, if this bio needs
> > > to call submit_bio(), it has to be aligned. Otherwise, you could invent any
> > > structure for this purpose, and the structure can be payload of bio for
> > > avoiding extra allocation, even it can be FS generic structure.
> > 
> > We want to have fewer scatter/gather list data structures, not more.
> 
> OK, that look fine to change to bi_addr since bvec_iter is widely used now,
> maybe .bi_sector can be moved into bio, cause bvec iterator needn't it.

No, we'd just add a bio_sector() helper that does the shift.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-06-21  3:52 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20240620132157.888559-1-lihongbo22@huawei.com>
2024-06-20 13:36 ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Kent Overstreet
2024-06-20 13:54   ` Matthew Wilcox
2024-06-20 14:16     ` Kent Overstreet
2024-06-20 14:49       ` Matthew Wilcox
2024-06-20 14:56         ` bvec_iter.bi_sector -> loff_t? Jens Axboe
2024-06-20 15:15           ` Matthew Wilcox
2024-06-20 15:18             ` Jens Axboe
2024-06-20 16:26               ` Keith Busch
2024-06-20 15:20             ` Christoph Hellwig
2024-06-20 15:21               ` Jens Axboe
2024-06-21  2:37           ` Hongbo Li
2024-06-21  3:05             ` Kent Overstreet
2024-06-20 15:35         ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Kent Overstreet
2024-06-21  3:13         ` bvec_iter.bi_sector -> loff_t? Hongbo Li
2024-06-20 15:30     ` bvec_iter.bi_sector -> loff_t? (was: Re: [PATCH] bcachefs: allow direct io fallback to buffer io for) unaligned length or offset Christoph Hellwig
2024-06-20 15:43       ` Kent Overstreet
2024-06-21  1:48         ` Ming Lei
2024-06-21  3:07           ` Kent Overstreet
2024-06-21  3:36             ` Ming Lei
2024-06-21  3:52               ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).