[PATCH] fs: remove power of 2 and length boundary atomic write restrictions

Linux filesystem development

* [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
@ 2025-12-24 11:53 Vitaliy Filippov
  2025-12-29  7:15 ` kernel test robot
  2025-12-30  7:54 ` John Garry
  0 siblings, 2 replies; 19+ messages in thread
From: Vitaliy Filippov @ 2025-12-24 11:53 UTC (permalink / raw)
  To: linux-block, linux-nvme, linux-fsdevel; +Cc: Vitaliy Filippov

generic_atomic_write_valid() returns -EINVAL for writes whose length is
not a power of 2 and for writes whose offset is not aligned to their
length. This check is used for block devices, ext4 and xfs, but neither
ext4 nor xfs relies on the power-of-2 restriction.

For block devices, neither the NVMe nor the SCSI specification requires
length alignment or a 2^N length. Both specifications only require that
writes respect the atomic write boundary if one is set (NABSPF/NABO for
NVMe and ATOMIC BOUNDARY for SCSI). The NVMe subsystem already checks
writes against this boundary; SCSI uses an explicit atomic write
command, so the write is checked by the drive itself.

Signed-off-by: Vitaliy Filippov <vitalifster@gmail.com>
---
 fs/read_write.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 833bae068770..5467d710108d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1802,17 +1802,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)

 int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
 {
-	size_t len = iov_iter_count(iter);
-
 	if (!iter_is_ubuf(iter))
 		return -EINVAL;

-	if (!is_power_of_2(len))
-		return -EINVAL;
-
-	if (!IS_ALIGNED(iocb->ki_pos, len))
-		return -EINVAL;
-
 	if (!(iocb->ki_flags & IOCB_DIRECT))
 		return -EOPNOTSUPP;

-- 
2.51.0

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2025-12-24 11:53 [PATCH] fs: remove power of 2 and length boundary atomic write restrictions Vitaliy Filippov
@ 2025-12-29  7:15 ` kernel test robot
  2025-12-30  7:54 ` John Garry
  1 sibling, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-12-29  7:15 UTC (permalink / raw)
  To: Vitaliy Filippov
  Cc: oe-lkp, lkp, linux-fsdevel, linux-block, linux-nvme,
	Vitaliy Filippov, oliver.sang



Hello,

kernel test robot noticed "xfstests.generic.767.fail" on:

commit: b493223bbb16c8b00af5ba371d8bb2ca56506527 ("[PATCH] fs: remove power of 2 and length boundary atomic write restrictions")
url: https://github.com/intel-lab-lkp/linux/commits/Vitaliy-Filippov/fs-remove-power-of-2-and-length-boundary-atomic-write-restrictions/20251224-195553
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20251224115312.27036-1-vitalifster@gmail.com/
patch subject: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions

in testcase: xfstests
version: xfstests-x86_64-a668057f-1_20251209
with following parameters:

	disk: 4HDD
	fs: xfs
	test: generic-767


config: x86_64-rhel-9.4-func
compiler: gcc-14
test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202512291503.11709b09-lkp@intel.com

2025-12-26 12:13:54 cd /lkp/benchmarks/xfstests
2025-12-26 12:13:54 export TEST_DIR=/fs/sda1
2025-12-26 12:13:54 export TEST_DEV=/dev/sda1
2025-12-26 12:13:54 export FSTYP=xfs
2025-12-26 12:13:54 export SCRATCH_MNT=/fs/scratch
2025-12-26 12:13:54 mkdir /fs/scratch -p
2025-12-26 12:13:54 export SCRATCH_DEV=/dev/sda4
2025-12-26 12:13:55 export SCRATCH_LOGDEV=/dev/sda2
meta-data=/dev/sda1              isize=512    agcount=4, agsize=13107200 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=52428800, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=25600, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
2025-12-26 12:13:55 export MKFS_OPTIONS=-mreflink=1
2025-12-26 12:13:55 echo generic/767
2025-12-26 12:13:55 ./check generic/767
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 lkp-skl-d03 6.19.0-rc1-00037-gb493223bbb16 #1 SMP PREEMPT_DYNAMIC Fri Dec 26 19:51:19 CST 2025
MKFS_OPTIONS  -- -f -mreflink=1 /dev/sda4
MOUNT_OPTIONS -- /dev/sda4 /fs/scratch

generic/767        - output mismatch (see /lkp/benchmarks/xfstests/results//generic/767.out.bad)
    --- tests/generic/767.out	2025-12-09 15:20:52.000000000 +0000
    +++ /lkp/benchmarks/xfstests/results//generic/767.out.bad	2025-12-26 12:14:08.927883699 +0000
    @@ -6,5 +6,4 @@
     one EOPNOTSUPP for buffered atomic
     pwrite: Operation not supported
     one EINVAL for unaligned directio
    -pwrite: Invalid argument
     Silence is golden
    ...
    (Run 'diff -u /lkp/benchmarks/xfstests/tests/generic/767.out /lkp/benchmarks/xfstests/results//generic/767.out.bad'  to see the entire diff)
Ran: generic/767
Failures: generic/767
Failed 1 of 1 tests




The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251229/202512291503.11709b09-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2025-12-24 11:53 [PATCH] fs: remove power of 2 and length boundary atomic write restrictions Vitaliy Filippov
  2025-12-29  7:15 ` kernel test robot
@ 2025-12-30  7:54 ` John Garry
  2025-12-30  9:01   ` Vitaliy Filippov
  1 sibling, 1 reply; 19+ messages in thread
From: John Garry @ 2025-12-30  7:54 UTC (permalink / raw)
  To: Vitaliy Filippov, linux-block, linux-nvme, linux-fsdevel

On 24/12/2025 11:53, Vitaliy Filippov wrote:
> generic_atomic_write_valid() returns EINVAL for non-power-of-2 and for
> non-length-aligned writes. This check is used for block devices, ext4
> and xfs, but neither ext4 nor xfs rely on power of 2 restrictions.
> 
> For block devices, neither NVMe nor SCSI specification doesn't require
> length alignment and 2^N length. Both specifications only require to
> respect the atomic write boundary if it's set (NABSPF/NABO for NVMe and
> ATOMIC BOUNDARY for SCSI).


> NVMe subsystem already checks writes against
> this boundary; SCSI uses an explicit atomic write command so the write
> is checked by the drive itself.
> 

Yes, they do check it - this is a safeguard against being sent something 
which cannot be atomically written. But we should not be sending 
something to the driver or disk which cannot be atomically written. So 
we are providing protection against kernel bugs.

The user should not be concerned about atomic boundaries. They should 
not encounter a scenario where they try a write which crosses a boundary 
(and cannot be atomically written). Hence the power-of-2 and alignment 
rule to avoid this.


> Signed-off-by: Vitaliy Filippov <vitalifster@gmail.com>
> ---
>   fs/read_write.c | 8 --------
>   1 file changed, 8 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 833bae068770..5467d710108d 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1802,17 +1802,9 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
>   
>   int generic_atomic_write_valid(struct kiocb *iocb, struct iov_iter *iter)
>   {
> -	size_t len = iov_iter_count(iter);
> -
>   	if (!iter_is_ubuf(iter))
>   		return -EINVAL;
>   
> -	if (!is_power_of_2(len))
> -		return -EINVAL;
> -
> -	if (!IS_ALIGNED(iocb->ki_pos, len))
> -		return -EINVAL;
> -
>   	if (!(iocb->ki_flags & IOCB_DIRECT))
>   		return -EOPNOTSUPP;
>   


* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2025-12-30  7:54 ` John Garry
@ 2025-12-30  9:01   ` Vitaliy Filippov
  2026-01-02 17:41     ` John Garry
  0 siblings, 1 reply; 19+ messages in thread
From: Vitaliy Filippov @ 2025-12-30  9:01 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

I think that even with the 2^N requirement the user still has to look
for boundaries.
1) NVMe disks may have NABO != 0 (atomic boundary offset). In this
case 2^N aligned writes won't work at all.
2) NABSPF is expressed in blocks in the NVMe spec and it's not
restricted to 2^N, it can be for example 3 (3*4096 = 12 KB). The spec
allows it. 2^N breaks this case too.
The user also has to look up the maximum atomic write size anyway; he
can't just assume that all writes are atomic out of the box,
regardless of the 2^N requirement.
So my idea is that the kernel's task is just to guarantee the
correctness of atomic writes. It can't provide the user with atomic
writes in all cases anyway.
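
To make point 2 above concrete, here is a minimal userspace sketch (not
kernel code) of the boundary arithmetic. The helper names
naturally_aligned() and crosses_boundary() and the byte-valued boundary
parameter are assumptions made for illustration; how a device's NABSPF
field maps to bytes is not modeled here. The sketch suggests that a write
obeying the 2^N length and natural alignment rule cannot cross a
power-of-2 boundary at least as large as the write, but can cross a
12 KB boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if len is a power of 2 and pos is a multiple of len. */
static bool naturally_aligned(uint64_t pos, uint64_t len)
{
	return len != 0 && (len & (len - 1)) == 0 && (pos % len) == 0;
}

/*
 * True if the byte range [pos, pos + len) spans more than one
 * boundary-sized window. boundary is in bytes and need not be a
 * power of 2; 0 is treated as "no boundary".
 */
static bool crosses_boundary(uint64_t pos, uint64_t len, uint64_t boundary)
{
	if (boundary == 0 || len == 0)
		return false;
	return pos / boundary != (pos + len - 1) / boundary;
}
```

For example, an 8 KB write at offset 8 KB is naturally aligned and stays
inside a 16 KB boundary window, yet it crosses a 12 KB boundary.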

I see that xfstests also check 2^N and the check obviously failed with
my patch, should I submit a patch for xfstests first?

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2025-12-30  9:01   ` Vitaliy Filippov
@ 2026-01-02 17:41     ` John Garry
  2026-01-05 18:58       ` Vitaliy Filippov
                         ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: John Garry @ 2026-01-02 17:41 UTC (permalink / raw)
  To: Vitaliy Filippov; +Cc: linux-block, linux-nvme, linux-fsdevel

On 30/12/2025 09:01, Vitaliy Filippov wrote:
> I think that even with the 2^N requirement the user still has to look
> for boundaries.
> 1) NVMe disks may have NABO != 0 (atomic boundary offset). In this
> case 2^N aligned writes won't work at all.

We don't support NABO != 0

> 2) NABSPF is expressed in blocks in the NVMe spec and it's not
> restricted to 2^N, it can be for example 3 (3*4096 = 12 KB). The spec
> allows it. 2^N breaks this case too.

We could support NABSPF which is not a power-of-2, but we don't today.

If you can find some real HW which has NABSPF which is not a power-of-2, 
then it can be considered.

> And the user also has to look for the maximum atomic write size
> anyway, he can't just assume all writes are atomic out of the box,
> regardless of the 2^N requirement.
> So my idea is that the kernel's task is just to guarantee correctness
> of atomic writes. It anyway can't provide the user with atomic writes
> in all cases.

What good is that to a user?

Consider the user wants to atomic write a range of a file which is 
backed by disk blocks which straddle a boundary - in this case, the 
write would fail. What is the user supposed to do then? That API could 
have arbitrary failures, which effectively makes it a useless API.

As I said before, just don't use RWF_ATOMIC if you don't want to deal 
with these restrictions.

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-02 17:41     ` John Garry
@ 2026-01-05 18:58       ` Vitaliy Filippov
  2026-01-06  9:06         ` John Garry
  2026-01-05 19:29       ` Vitaliy Filippov
  2026-01-05 19:44       ` Vitaliy Filippov
  2 siblings, 1 reply; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-05 18:58 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

>What good is that to a user?

It will allow him to use the feature which he currently can't use.

I don't understand your point about "arbitrary" failures.

Imagine that a user just sends a 256 KB write with RWF_ATOMIC while
the device has NAWUPF=128 KB.

He gets EINVAL even though the write is 2^N and length-aligned. Is it
any different from an 'arbitrary failure' which you describe?

Now imagine that he sends a write but it spans multiple extents in the
FS. And he gets EINVAL once again.

Is it any different from what I propose?

Obviously in all of these cases the app has to make sure that it
satisfies all atomic write requirements before actually using them. I
think it's absolutely fine.

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-02 17:41     ` John Garry
  2026-01-05 18:58       ` Vitaliy Filippov
@ 2026-01-05 19:29       ` Vitaliy Filippov
  2026-01-05 19:44       ` Vitaliy Filippov
  2 siblings, 0 replies; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-05 19:29 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

>As I said before, just don't use RWF_ATOMIC if you don't want to deal with these restrictions.

But how do I get torn-write protection then? What if the kernel
decides to fragment my 'once atomic' write?

I'll add some details:

The real NVMe disks with atomic write support which I know are:
1) Micron 7450 / 7500 and probably later
2) Kioxia CD6-R / CD7-R / CD8-R and similar

Both use AWUPF=256 KB and NABO=0. That means any write up to 256 KB
size is atomic regardless of the offset.

In practice atomic_write_max_bytes ends up at 128 KB when the IOMMU is
turned on: max_hw_sectors_kb is limited by
iommu_dma_opt_mapping_size(), which is hard-coded to return
128 KB = PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1) = 4096 << 5. But
that's not the main point.
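
The arithmetic above can be checked with a small userspace sketch. The
constants mirror the values quoted in this message (a 4096-byte page and
a shift of 5, i.e. IOVA_RANGE_CACHE_MAX_SIZE - 1); they are written out
here as assumptions rather than taken from kernel headers, and
opt_mapping_size() is a hypothetical helper name.

```c
#include <stddef.h>

/* Values as quoted in the message; assumptions, not read from kernel headers. */
#define SKETCH_PAGE_SIZE			4096UL
#define SKETCH_IOVA_RANGE_CACHE_MAX_SIZE	6	/* so that (6 - 1) == 5 */

/* Hypothetical helper: PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1). */
static size_t opt_mapping_size(void)
{
	return SKETCH_PAGE_SIZE << (SKETCH_IOVA_RANGE_CACHE_MAX_SIZE - 1);
}
```

Under these assumptions opt_mapping_size() evaluates to 131072 bytes,
which is the 128 KB figure mentioned above.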

My use case is: I use raw NVMe devices in my project and I want to use
atomic writes to avoid journaling. But for me it means that I want to
do atomic writes at arbitrary 4 KB aligned offsets. And I want to use
atomic writes **safely**. That's why I want to use RWF_ATOMIC - it
allows the kernel to guarantee that it doesn't fragment the write.

With the current restrictions, as a user, I can't do that - I get
EINVAL for some of my writes when I enable RWF_ATOMIC. So I'm asking:
what's the reason behind these restrictions? Could they be removed?

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-02 17:41     ` John Garry
  2026-01-05 18:58       ` Vitaliy Filippov
  2026-01-05 19:29       ` Vitaliy Filippov
@ 2026-01-05 19:44       ` Vitaliy Filippov
  2026-01-06 10:55         ` Vitaliy Filippov
  2 siblings, 1 reply; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-05 19:44 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

> We don't support NABO != 0

Also, what do you mean by that? I look here and I see that the
boundary is checked by the NVMe driver:
https://github.com/torvalds/linux/blob/3609fa95fb0f2c1b099e69e56634edb8fc03f87c/drivers/nvme/host/core.c#L974
- doesn't that mean boundaries are actually supported?

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-05 18:58       ` Vitaliy Filippov
@ 2026-01-06  9:06         ` John Garry
  2026-01-06 10:50           ` Vitaliy Filippov
  0 siblings, 1 reply; 19+ messages in thread
From: John Garry @ 2026-01-06  9:06 UTC (permalink / raw)
  To: Vitaliy Filippov; +Cc: linux-block, linux-nvme, linux-fsdevel

On 05/01/2026 18:58, Vitaliy Filippov wrote:
> Now imagine that he sends a write but it spans multiple extents in the
> FS. And he gets EINVAL once again.
> 
> Is it any different from what I propose?

If a user follows the current rules, they will not get a write which 
spans multiple extents and hence no -EINVAL. That is how it works for 
ext4, anyway.

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-06  9:06         ` John Garry
@ 2026-01-06 10:50           ` Vitaliy Filippov
  2026-01-06 11:26             ` John Garry
  0 siblings, 1 reply; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-06 10:50 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

> If a user follows the current rules, they will not get a write which
> spans multiple extents and hence no -EINVAL. That is how it works for
> ext4, anyway.

What if he makes a sparse file by writing at random 64 KB-aligned
offsets and then tries to overwrite 128 KB atomically? As I understand
it, he'll still get EINVAL.

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-05 19:44       ` Vitaliy Filippov
@ 2026-01-06 10:55         ` Vitaliy Filippov
  0 siblings, 0 replies; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-06 10:55 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

>> We don't support NABO != 0
> Also, what do you mean by that? I look here and I see that the boundary is checked by the NVMe driver

Sorry, never mind about this message, I confused NABO and NABSPF of
course. And boundaries don't matter a lot anyway, I'm not aware of
real-life devices with boundaries.

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-06 10:50           ` Vitaliy Filippov
@ 2026-01-06 11:26             ` John Garry
  2026-01-06 13:08               ` Vitaliy Filippov
  0 siblings, 1 reply; 19+ messages in thread
From: John Garry @ 2026-01-06 11:26 UTC (permalink / raw)
  To: Vitaliy Filippov; +Cc: linux-block, linux-nvme, linux-fsdevel

On 06/01/2026 10:50, Vitaliy Filippov wrote:
>> If a user follows the current rules, they will not get a write which
>> spans multiple extents and hence no -EINVAL. That is how it works for
>> ext4, anyway.
> 
> What if he makes a sparse file by writing at random 64 kb aligned
> offsets and then trying to overwrite 128 KB atomically? He'll still
> get EINVAL as I understand.

For ext4, the maximum atomic write size is limited to the bigalloc 
cluster size. Disk blocks are allocated to this cluster size granularity 
and alignment. As such, a properly aligned atomic write <= cluster size 
can never span discontiguous disk blocks.

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-06 11:26             ` John Garry
@ 2026-01-06 13:08               ` Vitaliy Filippov
  2026-01-07 10:51                 ` John Garry
  0 siblings, 1 reply; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-06 13:08 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

> For ext4, the maximum atomic write size is limited to the bigalloc
> cluster size. Disk blocks are allocated to this cluster size granularity
> and alignment. As such, a properly aligned atomic write <= cluster size
> can never span discontiguous disk blocks.

Ok, thank you for the explanation.

But it seems that it's an internal implementation detail of ext4,
right? So this check should be done inside the ext4 code. In fact, I
suspect it's already done there, because the generic checks which I
suggest removing can't take the ext4 cluster size into account, so at
least some atomic write validation must already be done inside ext4.
The only thing that's left is to move the write alignment check there
too.

Another thing that suggests that it's an internal implementation
detail is that a CoW filesystem like ZFS or btrfs can probably provide
atomic write guarantees for unaligned writes too, and probably even
without hardware atomic write support.

Can my change be limited to raw block devices then? Thanks to your
explanation now I understand the motivation for these checks with
ext4, but they still make no sense for the raw NVMe disk.

I mean, can you approve my change if I rework it to only lift 2^N and
alignment checks for raw block devices and not for file systems? For
example if I move these checks directly to the related ext4 and xfs
code? I think it's the right place to do them.

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-06 13:08               ` Vitaliy Filippov
@ 2026-01-07 10:51                 ` John Garry
  2026-01-07 13:05                   ` Vitaliy Filippov
  0 siblings, 1 reply; 19+ messages in thread
From: John Garry @ 2026-01-07 10:51 UTC (permalink / raw)
  To: Vitaliy Filippov; +Cc: linux-block, linux-nvme, linux-fsdevel

On 06/01/2026 13:08, Vitaliy Filippov wrote:
>> For ext4, the maximum atomic write size is limited to the bigalloc
>> cluster size. Disk blocks are allocated to this cluster size granularity
>> and alignment. As such, a properly aligned atomic write <= cluster size
>> can never span discontiguous disk blocks.
> 
> Ok, thank you for the explanation.
> 
> But it seems that it's an internal implementation detail of ext4,
> right?

I think that it is fair to say that alignment constraints of atomic 
write HW should mean specific alignment and granularity of FS disk blocks.

> So this check should be done inside ext4 code. And in fact I
> suspect it's actually already done there because generic checks which
> I suggest to remove can't take ext4 cluster size into account, so at
> least some atomic write validation is already done inside ext4. The
> only thing that's left is to move the write alignment check there too.
> 
> Another thing that suggests that it's an internal implementation
> detail is that a CoW filesystem like ZFS or btrfs can probably provide
> atomic write guarantees for unaligned writes too, and probably even
> without hardware atomic write support.

Yes, xfs already does this.

> 
> Can my change be limited to raw block devices then? 

The atomic write API is based on:
a. doing statx to find atomic write min and max limits.
b. issuing a write with RWF_ATOMIC means that the write should be 
naturally aligned and fit within the size limits.

That is the same for both raw block devices and regular FS files. And 
any atomic write boundary is not part of the API.
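
The two rules above can be sketched as a userspace pre-check. This is an
illustration only: rwf_atomic_request_ok() is a hypothetical helper, and
unit_min and unit_max are assumed to hold the values that statx(2)
reports in stx_atomic_write_unit_min and stx_atomic_write_unit_max when
STATX_WRITE_ATOMIC is requested. The power-of-2 test reflects the check
in generic_atomic_write_valid() discussed in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pre-check for a write that will be issued with RWF_ATOMIC. */
static bool rwf_atomic_request_ok(uint64_t pos, uint64_t len,
				  uint32_t unit_min, uint32_t unit_max)
{
	/* a. the length must fit within the advertised limits */
	if (len < unit_min || len > unit_max)
		return false;
	/* b. the length must be a power of 2 ... */
	if (len == 0 || (len & (len - 1)) != 0)
		return false;
	/* ... and the offset must be a multiple of the length */
	return (pos % len) == 0;
}
```

With limits of 4 KB and 128 KB, a 16 KB write at offset 16 KB passes,
while a 12 KB write, a 256 KB write, or a 16 KB write at offset 4 KB
does not.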

> Thanks to your
> explanation now I understand the motivation for these checks with
> ext4, but they still make no sense for the raw NVMe disk.
> 
> I mean, can you approve my change if I rework it to only lift 2^N and
> alignment checks for raw block devices and not for file systems? For
> example if I move these checks directly to the related ext4 and xfs
> code? I think it's the right place to do them.

What is the actual usecase you are trying to solve? You mentioned "avoid 
journaling", which does not explain what you want to achieve.

You could arrange your data so that it suits the rules.



* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-07 10:51                 ` John Garry
@ 2026-01-07 13:05                   ` Vitaliy Filippov
  2026-01-07 15:42                     ` John Garry
  0 siblings, 1 reply; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-07 13:05 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

> What is the actual usecase you are trying to solve? You mentioned "avoid
> journaling", which does not explain what you want to achieve.
>
> You could arrange your data so that it suits the rules.

I can't. My usecase is a distributed ceph-like SDS based on atomic
writes. Writes on a virtual block device have arbitrary length &
offset of course, nothing like 2^N, like on a regular block device.
Atomicity is implemented through journaling (double-write) on disks
without hardware atomic write support.

Then I found the new atomic write feature and SSDs that support it,
and implemented a new storage layer which can take advantage of it. My
new storage layer has a write amplification of about 1.0 with atomic
writes (i.e. almost zero overhead). It's a huge improvement for me -
the old storage layer has a WA of 3 to 4.

And everything was fine until I finally deployed it with RWF_ATOMIC
enabled (production setups should use safety features) and stumbled
upon the 2^N restriction... It was a big surprise, I never thought
that such a limitation could exist. It's absolutely irrational - the
device doesn't have that limitation and I'm just using the raw device.

It's normal and expected in the context of simple file systems like
ext4 and xfs. But for the raw device... I only discovered it after
several days of investigation with bpftrace and after reading the
kernel code. It's really unexpected. I think anyone would expect a raw
NVMe disk to have the same requirements as those described in the NVMe
spec.

> The atomic write API is based on:
> a. doing statx to find atomic write min and max limits.
> b. issuing a write with RWF_ATOMIC means that the write should be
> naturally aligned and fit within the size limits.
>
> That is the same for both raw block devices and regular FS files. And
> any atomic write boundary is not part of the API.

For raw block devices, you also have sysfs. You can look there and
determine actual restrictions. In fact I didn't even know about the
statx API when I was implementing atomic writes, and I don't use it.

And speaking of that API, why does it have to be like this? Currently
it looks like an API designed around the existing internal restrictions
of the implementation - more precisely, of two implementations: ext4
and xfs, both of which are classic non-CoW file systems. I suspect that
if it had been designed primarily around zfs & btrfs, the restriction
wouldn't exist.

Ok, it's already designed like this, but anyway, if the user is fine
with statx and with the 2^N restriction, then removing the restriction
for block devices also doesn't break anything for him. He'll send his
2^N aligned writes just like before. It's fine for databases like
mysql & postgresql because they always overwrite a whole fixed-size
page. But even speaking of databases, it's not guaranteed that **all**
databases will always have the same layout and that arbitrary atomic
write offsets will never be useful for them.

So again, can we please remove the restriction for raw block devices?
I can re-submit the patch :-)

* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-07 13:05                   ` Vitaliy Filippov
@ 2026-01-07 15:42                     ` John Garry
  2026-01-07 16:21                       ` Vitaliy Filippov
                                         ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: John Garry @ 2026-01-07 15:42 UTC (permalink / raw)
  To: Vitaliy Filippov; +Cc: linux-block, linux-nvme, linux-fsdevel

On 07/01/2026 13:05, Vitaliy Filippov wrote:
>> What is the actual usecase you are trying to solve? You mentioned "avoid
>> journaling", which does not explain what you want to achieve.
>>
>> You could arrange your data so that it suits the rules.
> 
> I can't. My usecase is a distributed ceph-like SDS based on atomic
> writes. Writes on a virtual block device have arbitrary length &
> offset of course,

Note that the alignment rule is not just for atomic HW boundaries. We 
also support atomic writes on stacked devices, where this is relevant - 
specifically striped devices, like raid0. Doing an unaligned atomic 
write on a striped device may result in trying to issue an atomic write 
which straddles 2x separate devices, which would obviously be broken.
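
A rough model of the striping problem described above, assuming a simple
raid0-style layout in which consecutive chunks rotate across member
devices. stripe_member() and spans_members() are hypothetical helpers
and the layout is a simplification, not the md or dm implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Index of the member device holding byte offset pos (chunk_bytes != 0). */
static unsigned int stripe_member(uint64_t pos, uint64_t chunk_bytes,
				  unsigned int nr_members)
{
	return (unsigned int)((pos / chunk_bytes) % nr_members);
}

/*
 * True if [pos, pos + len) touches more than one chunk. With at least
 * two members, adjacent chunks live on different devices, so such a
 * write cannot be sent to a single device as one atomic command.
 */
static bool spans_members(uint64_t pos, uint64_t len, uint64_t chunk_bytes,
			  unsigned int nr_members)
{
	if (len == 0 || chunk_bytes == 0 || nr_members < 2)
		return false;
	return pos / chunk_bytes != (pos + len - 1) / chunk_bytes;
}
```

With a 64 KB chunk and two members, an 8 KB write at offset 60 KB would
have to be split between member 0 and member 1, whereas a naturally
aligned power-of-2 write no larger than the chunk stays on one member.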

> nothing like 2^N, like on a regular block device.
> Atomicity is implemented through journaling (double-write) on disks
> without hardware atomic write support.
> 
> Then I found the new atomic write feature and SSDs with support for it
> and implemented a new storage layer which can take advantage of it. My
> new storage layer has write amplification about ~1.0 with atomic
> writes (i.e. almost zero overhead). It's a huge improvement for me -
> the old storage layer has WA from 3 to 4.
> 
> And everything was fine until I finally deployed it with enabled
> RWF_ATOMIC (production setups should use safety features) and stumbled
> upon the 2^N restriction... It was a big surprise, I never thought
> that such a limitation could exist. It's absolutely irrational - the
> device doesn't have that limitation and I'm just using the raw device.

This is all described in the man pages.

> 
> It's normal and expected in the context of simple file systems like
> ext4 and xfs. But for the raw device... I only discovered it after
> several days of investigation with bpftrace and after reading the
> kernel code. It's really unexpected. I think anyone expects the raw
> NVMe disk to have the same requirements as it's described in the NVMe
> spec.

It seems that you just want to take advantage of the block layer code to 
handle submission of an atomic write bio, i.e. reject anything which 
cannot be atomically written. In essence, that would be to just set 
REQ_ATOMIC. Maybe that could be done as a passthrough command, I'm not sure.


* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-07 15:42                     ` John Garry
@ 2026-01-07 16:21                       ` Vitaliy Filippov
  2026-01-08 18:18                       ` Vitaliy Filippov
  2026-01-13 14:25                       ` Vitaliy Filippov
  2 siblings, 0 replies; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-07 16:21 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

> Note that the alignment rule is not just for atomic HW boundaries. We
> also support atomic writes on stacked devices, where this is relevant -
> specifically striped devices, like raid0. Doing an unaligned atomic
> write on a striped device may result in trying to issue an atomic write
> which straddles 2x separate devices, which would obviously be broken.

Ok, then I'd also add atomic boundary checks and
atomic_write_boundary_bytes = stripe size to /sys/block/**/queue for
md devices. Not 2^N and length-alignment checks, just the boundary.
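
A minimal sketch of the difference between the two rules, in plain userspace C. It assumes that "boundary check" means a write may not cross any multiple of the boundary, and that a boundary of 0 means no boundary is set. The function names are hypothetical. `pow2_len_aligned` mirrors the `is_power_of_2(len)` and `IS_ALIGNED(iocb->ki_pos, len)` checks that the patch removes, and `within_boundary` is only an illustration of the proposal, not existing kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Rule removed by the patch: the length must be a power of 2 and the
 * position must be aligned to the length.
 */
static bool pow2_len_aligned(uint64_t pos, uint64_t len)
{
	return len != 0 &&
	       (len & (len - 1)) == 0 &&	/* power of 2 */
	       (pos & (len - 1)) == 0;		/* pos aligned to len */
}

/*
 * Boundary-only rule (assumed semantics): the range [pos, pos + len)
 * must not cross a multiple of `boundary`; 0 means "no boundary".
 */
static bool within_boundary(uint64_t pos, uint64_t len, uint64_t boundary)
{
	if (len == 0)
		return false;
	if (boundary == 0)
		return true;
	return pos / boundary == (pos + len - 1) / boundary;
}
```

For example, with a 64 KiB boundary, a 12 KiB write at offset 4 KiB passes the boundary-only rule but fails the power-of-2 rule. An 8 KiB write at offset 60 KiB fails both.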

> It seems that you just want to take advantage of the block layer code to
> handle submission of an atomic write bio, i.e. reject anything which
> cannot be atomically written. In essence, that would be to just set
> REQ_ATOMIC. Maybe that could be done as a passthrough command, I'm not sure.

Of course. I thought it was the whole point of RWF_ATOMIC - REQ_ATOMIC
for userspace.

I don't want passthrough commands, I want to use normal kernel I/O.


* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-07 15:42                     ` John Garry
  2026-01-07 16:21                       ` Vitaliy Filippov
@ 2026-01-08 18:18                       ` Vitaliy Filippov
  2026-01-13 14:25                       ` Vitaliy Filippov
  2 siblings, 0 replies; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-08 18:18 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

Soo, let's remove the 2^N restriction for raw block devices?


* Re: [PATCH] fs: remove power of 2 and length boundary atomic write restrictions
  2026-01-07 15:42                     ` John Garry
  2026-01-07 16:21                       ` Vitaliy Filippov
  2026-01-08 18:18                       ` Vitaliy Filippov
@ 2026-01-13 14:25                       ` Vitaliy Filippov
  2 siblings, 0 replies; 19+ messages in thread
From: Vitaliy Filippov @ 2026-01-13 14:25 UTC (permalink / raw)
  To: John Garry; +Cc: linux-block, linux-nvme, linux-fsdevel

Hi again, so what do you think? Can the restriction be removed for raw devices?

Maybe we could consult with someone else about my question?

