* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-08-27 15:20 ` Jan Kara
@ 2025-08-27 16:09 ` Mike Snitzer
2025-09-01 7:55 ` Jan Kara
2025-08-27 17:52 ` Brian Foster
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Mike Snitzer @ 2025-08-27 16:09 UTC (permalink / raw)
To: Jan Kara
Cc: Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
linux-fsdevel, linux-kernel, linux-ext4, axboe, dw, brauner, hch,
martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster
Hi Jan,
On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > Keith Busch <kbusch@kernel.org> writes:
> >
> > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > >> > Keith Busch <kbusch@meta.com> writes:
> > >> > >
> > >> > > - EXT4 falls back to buffered io for writes but not for reads.
> > >> >
> > >> > ++linux-ext4 to get any historical context behind why the difference of
> > >> > behaviour in reads v/s writes for EXT4 DIO.
> > >>
> > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > >> falling back to buffered IO only if the underlying file itself does not
> > >> support any kind of direct IO.
> > >
> > > Simple test case (dio-offset-test.c) below.
> > >
> > > I also ran this on vanilla kernel and got these results:
> > >
> > > # mkfs.ext4 /dev/vda
> > > # mount /dev/vda /mnt/ext4/
> > > # make dio-offset-test
> > > # ./dio-offset-test /mnt/ext4/foobar
> > > write: Success
> > > read: Invalid argument
> > >
> > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > >
> >
> > Right. Ext4 has fallback only for dio writes but not for DIO reads...
> >
> > buffered
> > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > {
> > /* must be a directio to fall back to buffered */
> > if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > (IOMAP_WRITE | IOMAP_DIRECT))
> > return false;
> >
> > ...
> > }
> >
> > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> > -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> >
> >
> > if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > return -EINVAL;
> >
> > EXT4 then fallsback to buffered-io only for writes, but not for reads.
>
> Right. And the fallback for writes was actually inadvertedly "added" by
> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> changed the error handling logic. Previously if iomap_dio_bio_iter()
> returned EINVAL, it got propagated to userspace regardless of what
> ->iomap_end() returned. After this commit if ->iomap_end() returns error
> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> the error returned by iomap_dio_bio_iter().
>
> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...
>
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 27 Aug 2025 14:55:19 +0200
> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
>
> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> changed the error handling logic in iomap_iter(). Previously any error
> from iomap_dio_bio_iter() got propagated to userspace, after this commit
> if ->iomap_end returns error, it gets propagated to userspace instead of
> an error from iomap_dio_bio_iter(). This results in unaligned writes to
> ext4 to silently fallback to buffered IO instead of erroring out.
>
> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> unnecessary these days. It is enough to return ENOTBLK from
> ext4_iomap_begin() when we don't support DIO write for that particular
> file offset (due to hole).
Any particular reason for ext4 still returning -ENOTBLK for unaligned
DIO?
In my experience XFS returns -EINVAL when failing unaligned DIO (but
maybe there are edge cases where that isn't always the case?)
Would be nice to have consistency across filesystems for what is
returned when failing unaligned DIO.
The iomap code returns -ENOTBLK as "the magic error code to fall back
to buffered I/O". But that seems only for page cache invalidation
failure, _not_ for unaligned DIO.
(Anyway, __iomap_dio_rw's WRITE handling can return -ENOTBLK if page
cache invalidation fails during DIO write. So it seems higher-level
code, like I've added to NFS/NFSD to check for unaligned DIO failure,
should check for both -EINVAL and -ENOTBLK).
Thanks,
Mike
ps. ENOTBLK is actually much less easily confused with other random
uses of EINVAL (EINVAL use is generally way too overloaded, rendering
it a pretty unhelpful error). But switching XFS to use ENOTBLK
instead of EINVAL seems like disruptive interface breakage (I suppose
same could be said for ext4 if it were to now return EINVAL for
unaligned DIO, but ext4 flip-flopping on how it handles unaligned DIO
prompted me to ask these questions now)
> ---
> fs/ext4/file.c | 2 --
> fs/ext4/inode.c | 35 -----------------------------------
> 2 files changed, 37 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 93240e35ee36..cf39f57d21e9 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> iomap_ops = &ext4_iomap_overwrite_ops;
> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> dio_flags, NULL, 0);
> - if (ret == -ENOTBLK)
> - ret = 0;
> if (extend) {
> /*
> * We always perform extending DIO write synchronously so by
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5b7a15db4953..c3b23c90fd11 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
> return ret;
> }
>
> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> -{
> - /* must be a directio to fall back to buffered */
> - if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> - (IOMAP_WRITE | IOMAP_DIRECT))
> - return false;
> -
> - /* atomic writes are all-or-nothing */
> - if (flags & IOMAP_ATOMIC)
> - return false;
> -
> - /* can only try again if we wrote nothing */
> - return written == 0;
> -}
> -
> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> - ssize_t written, unsigned flags, struct iomap *iomap)
> -{
> - /*
> - * Check to see whether an error occurred while writing out the data to
> - * the allocated blocks. If so, return the magic error code for
> - * non-atomic write so that we fallback to buffered I/O and attempt to
> - * complete the remainder of the I/O.
> - * For non-atomic writes, any blocks that may have been
> - * allocated in preparation for the direct I/O will be reused during
> - * buffered I/O. For atomic write, we never fallback to buffered-io.
> - */
> - if (ext4_want_directio_fallback(flags, written))
> - return -ENOTBLK;
> -
> - return 0;
> -}
> -
> const struct iomap_ops ext4_iomap_ops = {
> .iomap_begin = ext4_iomap_begin,
> - .iomap_end = ext4_iomap_end,
> };
>
> const struct iomap_ops ext4_iomap_overwrite_ops = {
> .iomap_begin = ext4_iomap_overwrite_begin,
> - .iomap_end = ext4_iomap_end,
> };
>
> static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-08-27 16:09 ` Mike Snitzer
@ 2025-09-01 7:55 ` Jan Kara
2025-09-02 14:39 ` Mike Snitzer
0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-01 7:55 UTC (permalink / raw)
To: Mike Snitzer
Cc: Jan Kara, Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
linux-fsdevel, linux-kernel, linux-ext4, axboe, dw, brauner, hch,
martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster
Hi Mike!
On Wed 27-08-25 12:09:29, Mike Snitzer wrote:
> On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> > On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > > Keith Busch <kbusch@kernel.org> writes:
> > >
> > > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > > >> > Keith Busch <kbusch@meta.com> writes:
> > > >> > >
> > > >> > > - EXT4 falls back to buffered io for writes but not for reads.
> > > >> >
> > > >> > ++linux-ext4 to get any historical context behind why the difference of
> > > >> > behaviour in reads v/s writes for EXT4 DIO.
> > > >>
> > > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > > >> falling back to buffered IO only if the underlying file itself does not
> > > >> support any kind of direct IO.
> > > >
> > > > Simple test case (dio-offset-test.c) below.
> > > >
> > > > I also ran this on vanilla kernel and got these results:
> > > >
> > > > # mkfs.ext4 /dev/vda
> > > > # mount /dev/vda /mnt/ext4/
> > > > # make dio-offset-test
> > > > # ./dio-offset-test /mnt/ext4/foobar
> > > > write: Success
> > > > read: Invalid argument
> > > >
> > > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > > >
> > >
> > > Right. Ext4 has fallback only for dio writes but not for DIO reads...
> > >
> > > buffered
> > > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > > {
> > > /* must be a directio to fall back to buffered */
> > > if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > > (IOMAP_WRITE | IOMAP_DIRECT))
> > > return false;
> > >
> > > ...
> > > }
> > >
> > > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> > > -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> > >
> > >
> > > if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > > !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > > return -EINVAL;
> > >
> > > EXT4 then fallsback to buffered-io only for writes, but not for reads.
> >
> > Right. And the fallback for writes was actually inadvertedly "added" by
> > commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> > changed the error handling logic. Previously if iomap_dio_bio_iter()
> > returned EINVAL, it got propagated to userspace regardless of what
> > ->iomap_end() returned. After this commit if ->iomap_end() returns error
> > (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> > the error returned by iomap_dio_bio_iter().
> >
> > Now both the old and new behavior make some sense so I won't argue that the
> > new iomap_iter() behavior is wrong. But I think we should change ext4 back
> > to the old behavior of failing unaligned dio writes instead of them falling
> > back to buffered IO. I think something like the attached patch should do
> > the trick - it makes unaligned dio writes fail again while writes to holes
> > of indirect-block mapped files still correctly fall back to buffered IO.
> > Once fstests run completes, I'll do a proper submission...
> >
> >
> > Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
>
> > From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@suse.cz>
> > Date: Wed, 27 Aug 2025 14:55:19 +0200
> > Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
> >
> > Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> > changed the error handling logic in iomap_iter(). Previously any error
> > from iomap_dio_bio_iter() got propagated to userspace, after this commit
> > if ->iomap_end returns error, it gets propagated to userspace instead of
> > an error from iomap_dio_bio_iter(). This results in unaligned writes to
> > ext4 to silently fallback to buffered IO instead of erroring out.
> >
> > Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> > unnecessary these days. It is enough to return ENOTBLK from
> > ext4_iomap_begin() when we don't support DIO write for that particular
> > file offset (due to hole).
>
> Any particular reason for ext4 still returning -ENOTBLK for unaligned
> DIO?
No, that is actually the bug I'm speaking about - ext4 should be returning
EINVAL for unaligned DIO as other filesystems do but after recent iomap
changes it started to return ENOTBLK.
> In my experience XFS returns -EINVAL when failing unaligned DIO (but
> maybe there are edge cases where that isn't always the case?)
>
> Would be nice to have consistency across filesystems for what is
> returned when failing unaligned DIO.
Agreed although there are various corner cases like files which never
support direct IO - e.g. with data journalling - and thus fallback to
buffered IO happens before any alignment checks.
> The iomap code returns -ENOTBLK as "the magic error code to fall back
> to buffered I/O". But that seems only for page cache invalidation
> failure, _not_ for unaligned DIO.
>
> (Anyway, __iomap_dio_rw's WRITE handling can return -ENOTBLK if page
> cache invalidation fails during DIO write. So it seems higher-level
> code, like I've added to NFS/NFSD to check for unaligned DIO failure,
> should check for both -EINVAL and -ENOTBLK).
I think the idea here is that if page cache invalidation fails we want to
fallback to buffered IO so that we don't cause cache coherency issues and
that's why ENOTBLK is returned.
> ps. ENOTBLK is actually much less easily confused with other random
> uses of EINVAL (EINVAL use is generally way too overloaded, rendering
> it a pretty unhelpful error). But switching XFS to use ENOTBLK
> instead of EINVAL seems like disruptive interface breakage (I suppose
> same could be said for ext4 if it were to now return EINVAL for
> unaligned DIO, but ext4 flip-flopping on how it handles unaligned DIO
> prompted me to ask these questions now)
Definitely. In this particular case EINVAL for unaligned DIO is there for
ages and there likely is some userspace program somewhere that depends on
it.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-09-01 7:55 ` Jan Kara
@ 2025-09-02 14:39 ` Mike Snitzer
0 siblings, 0 replies; 14+ messages in thread
From: Mike Snitzer @ 2025-09-02 14:39 UTC (permalink / raw)
To: Jan Kara
Cc: Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
linux-fsdevel, linux-kernel, linux-ext4, axboe, dw, brauner, hch,
martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster
On Mon, Sep 01, 2025 at 09:55:20AM +0200, Jan Kara wrote:
> Hi Mike!
>
> On Wed 27-08-25 12:09:29, Mike Snitzer wrote:
> > On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> > > On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > > > Keith Busch <kbusch@kernel.org> writes:
> > > >
> > > > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > > > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > > > >> > Keith Busch <kbusch@meta.com> writes:
> > > > >> > >
> > > > >> > > - EXT4 falls back to buffered io for writes but not for reads.
> > > > >> >
> > > > >> > ++linux-ext4 to get any historical context behind why the difference of
> > > > >> > behaviour in reads v/s writes for EXT4 DIO.
> > > > >>
> > > > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > > > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > > > >> falling back to buffered IO only if the underlying file itself does not
> > > > >> support any kind of direct IO.
> > > > >
> > > > > Simple test case (dio-offset-test.c) below.
> > > > >
> > > > > I also ran this on vanilla kernel and got these results:
> > > > >
> > > > > # mkfs.ext4 /dev/vda
> > > > > # mount /dev/vda /mnt/ext4/
> > > > > # make dio-offset-test
> > > > > # ./dio-offset-test /mnt/ext4/foobar
> > > > > write: Success
> > > > > read: Invalid argument
> > > > >
> > > > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > > > >
> > > >
> > > > Right. Ext4 has fallback only for dio writes but not for DIO reads...
> > > >
> > > > buffered
> > > > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > > > {
> > > > /* must be a directio to fall back to buffered */
> > > > if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > > > (IOMAP_WRITE | IOMAP_DIRECT))
> > > > return false;
> > > >
> > > > ...
> > > > }
> > > >
> > > > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> > > > -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> > > >
> > > >
> > > > if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > > > !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > > > return -EINVAL;
> > > >
> > > > EXT4 then fallsback to buffered-io only for writes, but not for reads.
> > >
> > > Right. And the fallback for writes was actually inadvertedly "added" by
> > > commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> > > changed the error handling logic. Previously if iomap_dio_bio_iter()
> > > returned EINVAL, it got propagated to userspace regardless of what
> > > ->iomap_end() returned. After this commit if ->iomap_end() returns error
> > > (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> > > the error returned by iomap_dio_bio_iter().
> > >
> > > Now both the old and new behavior make some sense so I won't argue that the
> > > new iomap_iter() behavior is wrong. But I think we should change ext4 back
> > > to the old behavior of failing unaligned dio writes instead of them falling
> > > back to buffered IO. I think something like the attached patch should do
> > > the trick - it makes unaligned dio writes fail again while writes to holes
> > > of indirect-block mapped files still correctly fall back to buffered IO.
> > > Once fstests run completes, I'll do a proper submission...
> > >
> > >
> > > Honza
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> >
> > > From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> > > From: Jan Kara <jack@suse.cz>
> > > Date: Wed, 27 Aug 2025 14:55:19 +0200
> > > Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
> > >
> > > Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> > > changed the error handling logic in iomap_iter(). Previously any error
> > > from iomap_dio_bio_iter() got propagated to userspace, after this commit
> > > if ->iomap_end returns error, it gets propagated to userspace instead of
> > > an error from iomap_dio_bio_iter(). This results in unaligned writes to
> > > ext4 to silently fallback to buffered IO instead of erroring out.
> > >
> > > Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> > > unnecessary these days. It is enough to return ENOTBLK from
> > > ext4_iomap_begin() when we don't support DIO write for that particular
> > > file offset (due to hole).
> >
> > Any particular reason for ext4 still returning -ENOTBLK for unaligned
> > DIO?
>
> No, that is actually the bug I'm speaking about - ext4 should be returning
> EINVAL for unaligned DIO as other filesystems do but after recent iomap
> changes it started to return ENOTBLK.
>
> > In my experience XFS returns -EINVAL when failing unaligned DIO (but
> > maybe there are edge cases where that isn't always the case?)
> >
> > Would be nice to have consistency across filesystems for what is
> > returned when failing unaligned DIO.
>
> Agreed although there are various corner cases like files which never
> support direct IO - e.g. with data journalling - and thus fallback to
> buffered IO happens before any alignment checks.
>
> > The iomap code returns -ENOTBLK as "the magic error code to fall back
> > to buffered I/O". But that seems only for page cache invalidation
> > failure, _not_ for unaligned DIO.
> >
> > (Anyway, __iomap_dio_rw's WRITE handling can return -ENOTBLK if page
> > cache invalidation fails during DIO write. So it seems higher-level
> > code, like I've added to NFS/NFSD to check for unaligned DIO failure,
> > should check for both -EINVAL and -ENOTBLK).
>
> I think the idea here is that if page cache invalidation fails we want to
> fallback to buffered IO so that we don't cause cache coherency issues and
> that's why ENOTBLK is returned.
>
> > ps. ENOTBLK is actually much less easily confused with other random
> > uses of EINVAL (EINVAL use is generally way too overloaded, rendering
> > it a pretty unhelpful error). But switching XFS to use ENOTBLK
> > instead of EINVAL seems like disruptive interface breakage (I suppose
> > same could be said for ext4 if it were to now return EINVAL for
> > unaligned DIO, but ext4 flip-flopping on how it handles unaligned DIO
> > prompted me to ask these questions now)
>
> Definitely. In this particular case EINVAL for unaligned DIO is there for
> ages and there likely is some userspace program somewhere that depends on
> it.
Thanks for your reply, that all makes sense.
Mike
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-08-27 15:20 ` Jan Kara
2025-08-27 16:09 ` Mike Snitzer
@ 2025-08-27 17:52 ` Brian Foster
2025-08-27 19:20 ` Keith Busch
2025-08-29 2:11 ` Ritesh Harjani
3 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2025-08-27 17:52 UTC (permalink / raw)
To: Jan Kara
Cc: Ritesh Harjani, Keith Busch, Keith Busch, linux-block,
linux-fsdevel, linux-kernel, linux-ext4, snitzer, axboe, dw,
brauner, hch, martin.petersen, djwong, linux-xfs, viro, Jan Kara
On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > Keith Busch <kbusch@kernel.org> writes:
> >
> > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > >> > Keith Busch <kbusch@meta.com> writes:
> > >> > >
> > >> > > - EXT4 falls back to buffered io for writes but not for reads.
> > >> >
> > >> > ++linux-ext4 to get any historical context behind why the difference of
> > >> > behaviour in reads v/s writes for EXT4 DIO.
> > >>
> > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > >> falling back to buffered IO only if the underlying file itself does not
> > >> support any kind of direct IO.
> > >
> > > Simple test case (dio-offset-test.c) below.
> > >
> > > I also ran this on vanilla kernel and got these results:
> > >
> > > # mkfs.ext4 /dev/vda
> > > # mount /dev/vda /mnt/ext4/
> > > # make dio-offset-test
> > > # ./dio-offset-test /mnt/ext4/foobar
> > > write: Success
> > > read: Invalid argument
> > >
> > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > >
> >
> > Right. Ext4 has fallback only for dio writes but not for DIO reads...
> >
> > buffered
> > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > {
> > /* must be a directio to fall back to buffered */
> > if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > (IOMAP_WRITE | IOMAP_DIRECT))
> > return false;
> >
> > ...
> > }
> >
> > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> > -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> >
> >
> > if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > return -EINVAL;
> >
> > EXT4 then fallsback to buffered-io only for writes, but not for reads.
>
> Right. And the fallback for writes was actually inadvertedly "added" by
> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> changed the error handling logic. Previously if iomap_dio_bio_iter()
> returned EINVAL, it got propagated to userspace regardless of what
> ->iomap_end() returned. After this commit if ->iomap_end() returns error
> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> the error returned by iomap_dio_bio_iter().
>
Ah, so IIUC you're referring to the change in iomap_iter() where the
iomap_end() error code was returned "if (ret < 0 && !iter->processed)",
where iter->processed held a potential error code from the iterator.
That was changed to !advanced, which filters out the processed < 0 case
and allows the error to return from iomap_end here rather than from
iter->processed a few lines down.
There were further changes to eliminate the advance from iomap_iter()
case (and rename processed -> status), so I suppose we could consider
changing that to something like:
if (ret < 0 && !advanced && !iter->status)
return ret;
... which I think would restore original error behavior. But I agree
it's not totally clear which is preferable. Certainly the change in
behavior was not intentional so thanks for the analysis. I'd have to
stare at the code and think (and test) some more to form an opinion on
whether it's worth changing. Meanwhile it looks like you have a
reasonable enough workaround..
Brian
> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...
>
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 27 Aug 2025 14:55:19 +0200
> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
>
> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> changed the error handling logic in iomap_iter(). Previously any error
> from iomap_dio_bio_iter() got propagated to userspace, after this commit
> if ->iomap_end returns error, it gets propagated to userspace instead of
> an error from iomap_dio_bio_iter(). This results in unaligned writes to
> ext4 to silently fallback to buffered IO instead of erroring out.
>
> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> unnecessary these days. It is enough to return ENOTBLK from
> ext4_iomap_begin() when we don't support DIO write for that particular
> file offset (due to hole).
>
> Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/ext4/file.c | 2 --
> fs/ext4/inode.c | 35 -----------------------------------
> 2 files changed, 37 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 93240e35ee36..cf39f57d21e9 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> iomap_ops = &ext4_iomap_overwrite_ops;
> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> dio_flags, NULL, 0);
> - if (ret == -ENOTBLK)
> - ret = 0;
> if (extend) {
> /*
> * We always perform extending DIO write synchronously so by
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5b7a15db4953..c3b23c90fd11 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
> return ret;
> }
>
> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> -{
> - /* must be a directio to fall back to buffered */
> - if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> - (IOMAP_WRITE | IOMAP_DIRECT))
> - return false;
> -
> - /* atomic writes are all-or-nothing */
> - if (flags & IOMAP_ATOMIC)
> - return false;
> -
> - /* can only try again if we wrote nothing */
> - return written == 0;
> -}
> -
> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> - ssize_t written, unsigned flags, struct iomap *iomap)
> -{
> - /*
> - * Check to see whether an error occurred while writing out the data to
> - * the allocated blocks. If so, return the magic error code for
> - * non-atomic write so that we fallback to buffered I/O and attempt to
> - * complete the remainder of the I/O.
> - * For non-atomic writes, any blocks that may have been
> - * allocated in preparation for the direct I/O will be reused during
> - * buffered I/O. For atomic write, we never fallback to buffered-io.
> - */
> - if (ext4_want_directio_fallback(flags, written))
> - return -ENOTBLK;
> -
> - return 0;
> -}
> -
> const struct iomap_ops ext4_iomap_ops = {
> .iomap_begin = ext4_iomap_begin,
> - .iomap_end = ext4_iomap_end,
> };
>
> const struct iomap_ops ext4_iomap_overwrite_ops = {
> .iomap_begin = ext4_iomap_overwrite_begin,
> - .iomap_end = ext4_iomap_end,
> };
>
> static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-08-27 15:20 ` Jan Kara
2025-08-27 16:09 ` Mike Snitzer
2025-08-27 17:52 ` Brian Foster
@ 2025-08-27 19:20 ` Keith Busch
2025-09-01 8:22 ` Jan Kara
2025-08-29 2:11 ` Ritesh Harjani
3 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2025-08-27 19:20 UTC (permalink / raw)
To: Jan Kara
Cc: Ritesh Harjani, Keith Busch, linux-block, linux-fsdevel,
linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster
On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...
Your suggestion looks all well and good, but I have a general question
about fstests. I've written up some to test this series, and I have
filesystem specific expectations for what should error or succeed. If
you modify ext4 to fail direct-io as described, my test will have to be
kernel version specific too. Is there a best practice in fstests for
handling such scenarios?
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-08-27 19:20 ` Keith Busch
@ 2025-09-01 8:22 ` Jan Kara
0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2025-09-01 8:22 UTC (permalink / raw)
To: Keith Busch
Cc: Jan Kara, Ritesh Harjani, Keith Busch, linux-block, linux-fsdevel,
linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster
On Wed 27-08-25 13:20:56, Keith Busch wrote:
> On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> > Now both the old and new behavior make some sense so I won't argue that the
> > new iomap_iter() behavior is wrong. But I think we should change ext4 back
> > to the old behavior of failing unaligned dio writes instead of them falling
> > back to buffered IO. I think something like the attached patch should do
> > the trick - it makes unaligned dio writes fail again while writes to holes
> > of indirect-block mapped files still correctly fall back to buffered IO.
> > Once fstests run completes, I'll do a proper submission...
>
> Your suggestion looks all well and good, but I have a general question
> about fstests. I've written up some to test this series, and I have
> filesystem specific expectations for what should error or succeed. If
> you modify ext4 to fail direct-io as described, my test will have to be
> kernel version specific too. Is there a best practice in fstests for
> handling such scenarios?
Well, I'd just expect EINVAL for ext4 in the test. Certain kernel versions
(since February or so) will fail but that's just an indication you should
backport the fix if you care...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-08-27 15:20 ` Jan Kara
` (2 preceding siblings ...)
2025-08-27 19:20 ` Keith Busch
@ 2025-08-29 2:11 ` Ritesh Harjani
2025-08-29 3:19 ` Ritesh Harjani
3 siblings, 1 reply; 14+ messages in thread
From: Ritesh Harjani @ 2025-08-29 2:11 UTC (permalink / raw)
To: Jan Kara
Cc: Keith Busch, Jan Kara, Keith Busch, linux-block, linux-fsdevel,
linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster
Jan Kara <jack@suse.cz> writes:
> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
>> Keith Busch <kbusch@kernel.org> writes:
>>
>> > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
>> >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
>> >> > Keith Busch <kbusch@meta.com> writes:
>> >> > >
>> >> > > - EXT4 falls back to buffered io for writes but not for reads.
>> >> >
>> >> > ++linux-ext4 to get any historical context behind why the difference of
>> >> > behaviour in reads v/s writes for EXT4 DIO.
>> >>
>> >> Hum, how did you test? Because in the basic testing I did (with vanilla
>> >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
>> >> falling back to buffered IO only if the underlying file itself does not
>> >> support any kind of direct IO.
>> >
>> > Simple test case (dio-offset-test.c) below.
>> >
>> > I also ran this on vanilla kernel and got these results:
>> >
>> > # mkfs.ext4 /dev/vda
>> > # mount /dev/vda /mnt/ext4/
>> > # make dio-offset-test
>> > # ./dio-offset-test /mnt/ext4/foobar
>> > write: Success
>> > read: Invalid argument
>> >
>> > I tracked the "write: Success" down to ext4's handling for the "special"
>> > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
>> >
>>
>> Right. Ext4 has fallback only for dio writes but not for DIO reads...
>>
>> buffered
>> static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>> {
>> /* must be a directio to fall back to buffered */
>> if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>> (IOMAP_WRITE | IOMAP_DIRECT))
>> return false;
>>
>> ...
>> }
>>
>> So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
>> -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
>>
>>
>> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>> return -EINVAL;
>>
>> EXT4 then fallsback to buffered-io only for writes, but not for reads.
>
> Right. And the fallback for writes was actually inadvertedly "added" by
> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> changed the error handling logic. Previously if iomap_dio_bio_iter()
> returned EINVAL, it got propagated to userspace regardless of what
> ->iomap_end() returned. After this commit if ->iomap_end() returns error
> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> the error returned by iomap_dio_bio_iter().
>
> Now both the old and new behavior make some sense so I won't argue that the
> new iomap_iter() behavior is wrong. But I think we should change ext4 back
> to the old behavior of failing unaligned dio writes instead of them falling
> back to buffered IO. I think something like the attached patch should do
> the trick - it makes unaligned dio writes fail again while writes to holes
> of indirect-block mapped files still correctly fall back to buffered IO.
> Once fstests run completes, I'll do a proper submission...
>
Aah, right. So it wasn't EXT4 which had this behaviour of falling back
to buffered I/O for unaligned writes. Earlier EXT4 was assuming an error
code will be detected by iomap and will be passed to it as "written" in
ext4_iomap_end() for such unaligned writes. But I guess that logic
silently got changed with that commit. Thanks for analyzing that.
I missed looking underneath iomap behaviour change :).
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Wed, 27 Aug 2025 14:55:19 +0200
> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
>
> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> changed the error handling logic in iomap_iter(). Previously any error
> from iomap_dio_bio_iter() got propagated to userspace, after this commit
> if ->iomap_end returns error, it gets propagated to userspace instead of
> an error from iomap_dio_bio_iter(). This results in unaligned writes to
> ext4 to silently fallback to buffered IO instead of erroring out.
>
> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> unnecessary these days. It is enough to return ENOTBLK from
> ext4_iomap_begin() when we don't support DIO write for that particular
> file offset (due to hole).
Right. This mainly only happens if we have holes in non-extent (indirect
blocks) case.
Also, as I see ext4 always just fallsback to buffered-io for no or
partial writes (unless iomap returned any error code). So, I was just
wondering if that could ever happen for DIO atomic write case. It's good
that we have a WARN_ON_ONCE() check in there to catch it. But I was
wondering if this needs an explicit handling in ext4_dio_write_iter() to
not fallback to buffered-writes for atomic DIO requests?
-ritesh
>
> Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/ext4/file.c | 2 --
> fs/ext4/inode.c | 35 -----------------------------------
> 2 files changed, 37 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 93240e35ee36..cf39f57d21e9 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> iomap_ops = &ext4_iomap_overwrite_ops;
> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> dio_flags, NULL, 0);
> - if (ret == -ENOTBLK)
> - ret = 0;
> if (extend) {
> /*
> * We always perform extending DIO write synchronously so by
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5b7a15db4953..c3b23c90fd11 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
> return ret;
> }
>
> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> -{
> - /* must be a directio to fall back to buffered */
> - if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> - (IOMAP_WRITE | IOMAP_DIRECT))
> - return false;
> -
> - /* atomic writes are all-or-nothing */
> - if (flags & IOMAP_ATOMIC)
> - return false;
> -
> - /* can only try again if we wrote nothing */
> - return written == 0;
> -}
> -
> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> - ssize_t written, unsigned flags, struct iomap *iomap)
> -{
> - /*
> - * Check to see whether an error occurred while writing out the data to
> - * the allocated blocks. If so, return the magic error code for
> - * non-atomic write so that we fallback to buffered I/O and attempt to
> - * complete the remainder of the I/O.
> - * For non-atomic writes, any blocks that may have been
> - * allocated in preparation for the direct I/O will be reused during
> - * buffered I/O. For atomic write, we never fallback to buffered-io.
> - */
> - if (ext4_want_directio_fallback(flags, written))
> - return -ENOTBLK;
> -
> - return 0;
> -}
> -
> const struct iomap_ops ext4_iomap_ops = {
> .iomap_begin = ext4_iomap_begin,
> - .iomap_end = ext4_iomap_end,
> };
>
> const struct iomap_ops ext4_iomap_overwrite_ops = {
> .iomap_begin = ext4_iomap_overwrite_begin,
> - .iomap_end = ext4_iomap_end,
> };
>
> static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> --
> 2.43.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCHv3 0/8] direct-io: even more flexible io vectors
2025-08-29 2:11 ` Ritesh Harjani
@ 2025-08-29 3:19 ` Ritesh Harjani
0 siblings, 0 replies; 14+ messages in thread
From: Ritesh Harjani @ 2025-08-29 3:19 UTC (permalink / raw)
To: Jan Kara
Cc: Keith Busch, Jan Kara, Keith Busch, linux-block, linux-fsdevel,
linux-kernel, linux-ext4, snitzer, axboe, dw, brauner, hch,
martin.petersen, djwong, linux-xfs, viro, Jan Kara, Brian Foster
Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> Jan Kara <jack@suse.cz> writes:
>
>> On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
>>> Keith Busch <kbusch@kernel.org> writes:
>>>
>>> > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
>>> >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
>>> >> > Keith Busch <kbusch@meta.com> writes:
>>> >> > >
>>> >> > > - EXT4 falls back to buffered io for writes but not for reads.
>>> >> >
>>> >> > ++linux-ext4 to get any historical context behind why the difference of
>>> >> > behaviour in reads v/s writes for EXT4 DIO.
>>> >>
>>> >> Hum, how did you test? Because in the basic testing I did (with vanilla
>>> >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
>>> >> falling back to buffered IO only if the underlying file itself does not
>>> >> support any kind of direct IO.
>>> >
>>> > Simple test case (dio-offset-test.c) below.
>>> >
>>> > I also ran this on vanilla kernel and got these results:
>>> >
>>> > # mkfs.ext4 /dev/vda
>>> > # mount /dev/vda /mnt/ext4/
>>> > # make dio-offset-test
>>> > # ./dio-offset-test /mnt/ext4/foobar
>>> > write: Success
>>> > read: Invalid argument
>>> >
>>> > I tracked the "write: Success" down to ext4's handling for the "special"
>>> > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
>>> >
>>>
>>> Right. Ext4 has fallback only for dio writes but not for DIO reads...
>>>
>>> buffered
>>> static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>>> {
>>> /* must be a directio to fall back to buffered */
>>> if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>>> (IOMAP_WRITE | IOMAP_DIRECT))
>>> return false;
>>>
>>> ...
>>> }
>>>
>>> So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
>>> -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
>>>
>>>
>>> if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>>> !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>>> return -EINVAL;
>>>
>>> EXT4 then fallsback to buffered-io only for writes, but not for reads.
>>
>> Right. And the fallback for writes was actually inadvertedly "added" by
>> commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
>> changed the error handling logic. Previously if iomap_dio_bio_iter()
>> returned EINVAL, it got propagated to userspace regardless of what
>> ->iomap_end() returned. After this commit if ->iomap_end() returns error
>> (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
>> the error returned by iomap_dio_bio_iter().
>>
>> Now both the old and new behavior make some sense so I won't argue that the
>> new iomap_iter() behavior is wrong. But I think we should change ext4 back
>> to the old behavior of failing unaligned dio writes instead of them falling
>> back to buffered IO. I think something like the attached patch should do
>> the trick - it makes unaligned dio writes fail again while writes to holes
>> of indirect-block mapped files still correctly fall back to buffered IO.
>> Once fstests run completes, I'll do a proper submission...
>>
>
> Aah, right. So it wasn't EXT4 which had this behaviour of falling back
> to buffered I/O for unaligned writes. Earlier EXT4 was assuming an error
> code will be detected by iomap and will be passed to it as "written" in
> ext4_iomap_end() for such unaligned writes. But I guess that logic
> silently got changed with that commit. Thanks for analyzing that.
> I missed looking underneath iomap behaviour change :).
>
>
>>
>> Honza
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>> From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
>> From: Jan Kara <jack@suse.cz>
>> Date: Wed, 27 Aug 2025 14:55:19 +0200
>> Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
>>
>> Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
>> changed the error handling logic in iomap_iter(). Previously any error
>> from iomap_dio_bio_iter() got propagated to userspace, after this commit
>> if ->iomap_end returns error, it gets propagated to userspace instead of
>> an error from iomap_dio_bio_iter(). This results in unaligned writes to
>> ext4 to silently fallback to buffered IO instead of erroring out.
>>
>> Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
>> unnecessary these days. It is enough to return ENOTBLK from
>> ext4_iomap_begin() when we don't support DIO write for that particular
>> file offset (due to hole).
>
> Right. This mainly only happens if we have holes in non-extent (indirect
> blocks) case.
>
Thinking more on this case. Do we really want a fallback to buffered-io
for unaligned writes in this case (indirect block case)?
I don't think we care much here, right? And anyways the unaligned writes
should have the same behaviour for extents v/s non-extents case right?
I guess the problem is, iomap alignment check happens in
iomap_dio_bio_iter() where it has a valid bdev (populated by filesystem
during ->iomap_begin() call) to check the alignment against. But in this
indirect block case we return -ENOTBLK much earlier from ->iomap_begin()
call itself.
-ritesh
> Also, as I see ext4 always just fallsback to buffered-io for no or
> partial writes (unless iomap returned any error code). So, I was just
> wondering if that could ever happen for DIO atomic write case. It's good
> that we have a WARN_ON_ONCE() check in there to catch it. But I was
> wondering if this needs an explicit handling in ext4_dio_write_iter() to
> not fallback to buffered-writes for atomic DIO requests?
>
> -ritesh
>
>
>
>>
>> Fixes: bc264fea0f6f ("iomap: support incremental iomap_iter advances")
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>> fs/ext4/file.c | 2 --
>> fs/ext4/inode.c | 35 -----------------------------------
>> 2 files changed, 37 deletions(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 93240e35ee36..cf39f57d21e9 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -579,8 +579,6 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> iomap_ops = &ext4_iomap_overwrite_ops;
>> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>> dio_flags, NULL, 0);
>> - if (ret == -ENOTBLK)
>> - ret = 0;
>> if (extend) {
>> /*
>> * We always perform extending DIO write synchronously so by
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 5b7a15db4953..c3b23c90fd11 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3872,47 +3872,12 @@ static int ext4_iomap_overwrite_begin(struct inode *inode, loff_t offset,
>> return ret;
>> }
>>
>> -static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
>> -{
>> - /* must be a directio to fall back to buffered */
>> - if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
>> - (IOMAP_WRITE | IOMAP_DIRECT))
>> - return false;
>> -
>> - /* atomic writes are all-or-nothing */
>> - if (flags & IOMAP_ATOMIC)
>> - return false;
>> -
>> - /* can only try again if we wrote nothing */
>> - return written == 0;
>> -}
>> -
>> -static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>> - ssize_t written, unsigned flags, struct iomap *iomap)
>> -{
>> - /*
>> - * Check to see whether an error occurred while writing out the data to
>> - * the allocated blocks. If so, return the magic error code for
>> - * non-atomic write so that we fallback to buffered I/O and attempt to
>> - * complete the remainder of the I/O.
>> - * For non-atomic writes, any blocks that may have been
>> - * allocated in preparation for the direct I/O will be reused during
>> - * buffered I/O. For atomic write, we never fallback to buffered-io.
>> - */
>> - if (ext4_want_directio_fallback(flags, written))
>> - return -ENOTBLK;
>> -
>> - return 0;
>> -}
>> -
>> const struct iomap_ops ext4_iomap_ops = {
>> .iomap_begin = ext4_iomap_begin,
>> - .iomap_end = ext4_iomap_end,
>> };
>>
>> const struct iomap_ops ext4_iomap_overwrite_ops = {
>> .iomap_begin = ext4_iomap_overwrite_begin,
>> - .iomap_end = ext4_iomap_end,
>> };
>>
>> static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
>> --
>> 2.43.0
^ permalink raw reply [flat|nested] 14+ messages in thread