From: John Garry <john.g.garry@oracle.com>
To: Vitaliy Filippov <vitalifster@gmail.com>, linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-fsdevel+subscribe@vger.kernel.org,
Keith Busch <kbusch@kernel.org>
Subject: Re: [PATCH v2] Do not require atomic writes to be power of 2 sized and aligned on length boundary
Date: Tue, 23 Dec 2025 09:26:20 +0000
Message-ID: <9304d77a-7439-4772-a549-5ebcf8bf371d@oracle.com>
In-Reply-To: <CAPqjcqqi8uR=RWEpLEC+JiwOg0fzvWvwEOscj-XYHKLuPcnDBA@mail.gmail.com>
On 22/12/2025 13:28, Vitaliy Filippov wrote:
> Hi linux-fsdevel,
> I recently discovered that Linux incorrectly requires all atomic
> writes to have 2^N length and to be aligned on the length boundary.
> This requirement contradicts NVMe specification which doesn't require
> such alignment and length and thus highly restricts usage of atomic
> writes with NVMe disks which support it (Micron and Kioxia).
All these alignment and size rules are specific to using RWF_ATOMIC. You
don't have to use RWF_ATOMIC if you don't want to; as you probably know,
atomic writes are implicit on NVMe.
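
For readers following along, the two RWF_ATOMIC rules discussed in this
thread (length is a power of two, offset is aligned to that length) can be
sketched in userspace as below. This is only an illustration: the helper
name atomic_write_shape_ok() is hypothetical, it is not the kernel's
generic_atomic_write_valid(), and it covers only the two rules quoted
above, not any device or filesystem limits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace helper (an assumption for illustration, not
 * kernel code). Returns true when a write of `len` bytes at byte offset
 * `pos` satisfies the two rules quoted in this thread:
 *   1. len is a non-zero power of two
 *   2. pos is a multiple of len ("aligned on the length boundary")
 */
static bool atomic_write_shape_ok(uint64_t pos, uint64_t len)
{
    /* A power of two has exactly one bit set. */
    if (len == 0 || (len & (len - 1)) != 0)
        return false;

    /* With len a power of two, alignment is a simple mask test. */
    return (pos & (len - 1)) == 0;
}
```

Under this sketch, an 8 KiB write at offset 8 KiB passes, an 8 KiB write
at offset 4 KiB fails on alignment, and a 12 KiB write at offset 0 fails
on length.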
> NVMe specification has its own atomic write restrictions - AWUPF and
> NABSPF/NABO, but both are already checked by the nvme subsystem.
> The 2^N restriction comes from generic_atomic_write_valid().
> I submitted a patch which removes this restriction to linux-block and
> linux-nvme. Sorry if these mailing lists weren't the right place to send
> it to, it's my first patch :).
> But the function is currently used in 3 places: block/fops.c,
> fs/ext4/file.c and fs/xfs/xfs_file.c.
> Can you tell me if ext4 and xfs really want atomic writes to be 2^N
> sized and length-aligned?
As above, these are just the kernel's atomic write rules, which exist to
support different storage technologies.
> From looking at the code I'd say they don't really require it?
> Can you approve my patch if I'm right? Please :-)
>
> On Mon, Dec 22, 2025 at 12:54 PM Vitaliy Filippov <vitalifster@gmail.com> wrote:
>>
>> Hi! Thanks a lot for your reply! This is actually my first patch ever
>> so please don't blame me for not following some standards, I'll try to
>> resubmit it correctly.
>>
>> Regarding the rest:
>>
>> 1) NVMe atomic boundaries seem to already be checked in
>> nvme_valid_atomic_write().
>>
>> 2) What's atomic_write_hw_unit_max? As I understand, Linux also
>> already checks it, at least
>> /sys/block/nvme**/queue/atomic_write_max_bytes is already limited by
>> max_hw_sectors_kb.
>>
>> 3) Yes, I've of course seen that this function is also used by ext4
>> and xfs, but I don't understand the motivation behind the 2^n
>> requirement. I suppose file systems may fragment the write according
>> to currently allocated extents for example, but I don't see how issues
>> coming from this can be fixed by requiring writes to be 2^n.
>>
>> But I understand that just removing the check may break something if
>> somebody relies on them. What do you think about removing the
>> requirement only for NVMe or only for block devices then? I see 3 ways
>> to do it:
>> a) split generic_atomic_write_valid() into two functions - first for
>> all types of inodes and second only for file systems.
>> b) remove generic_atomic_write_valid() from block device checks at all.
>> c) change generic_atomic_write_valid() just like in my original patch
>> but copy original checks into other places where it's used (ext4 and
>> xfs).
>>
>> Which way do you think would be the best?
>>
>> On Mon, Dec 22, 2025 at 2:17 AM Keith Busch <kbusch@kernel.org> wrote:
>>>
>>> On Sun, Dec 21, 2025 at 04:24:02PM +0300, Vitaliy Filippov wrote:
>>>> It contradicts NVMe specification where alignment is only required when atomic
>>>> write boundary (NABSPF/NABO) is set and highly limits usage of NVMe atomic writes
>>>
>>> Commit header is missing the "fs:" prefix, and the commit log should
>>> wrap at 72 characters.
>>>
>>> On the technical side, this is a generic function used by multiple
>>> protocols, so you can't just appeal to NVMe to justify removing the
>>> checks.
>>>
>>> NVMe still has atomic boundaries, and a write that straddles one is
>>> not atomic. Instead of removing the checks, you'd have to replace
>>> them with a more costly operation if you really want to support more
>>> arbitrary write lengths and offsets. And if you do manage to remove the
>>> power of two requirement, then the queue limit for nvme's
>>> atomic_write_hw_unit_max isn't correct anymore.
>
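
To illustrate the boundary point quoted above, here is a minimal sketch of
what a more general check might look like for arbitrary lengths and
offsets. The name crosses_boundary() is hypothetical, the boundary is
taken in bytes, and any boundary offset (such as NVMe's NABO) is assumed
to be zero; a real implementation would need to account for it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration, not kernel code).
 * Returns true when the byte range [pos, pos + len) spans more than one
 * boundary-sized window, i.e. the write straddles an atomic boundary.
 * boundary == 0 is treated as "no boundary configured".
 */
static bool crosses_boundary(uint64_t pos, uint64_t len, uint64_t boundary)
{
    if (boundary == 0 || len == 0)
        return false;

    /* Compare the window index of the first and the last byte written. */
    return pos / boundary != (pos + len - 1) / boundary;
}
```

A write whose length is a power of two and whose offset is aligned to
that length can never straddle a power-of-two boundary at least as large
as the write, which is why the existing rule is cheap to enforce. A
12 KiB write at offset 0 also stays inside a 16 KiB boundary, but the
same write at offset 12 KiB does not, so dropping the power-of-two rule
means checking each write against the boundary explicitly.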