Re: [LSF/MM/BPF TOPIC] Buffered atomic writes

From: Pankaj Raghav <pankaj.raghav@linux.dev>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Cc: linux-xfs@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, lsf-pc@lists.linux-foundation.org,
	Andres Freund <andres@anarazel.de>,
	djwong@kernel.org, john.g.garry@oracle.com, willy@infradead.org,
	hch@lst.de, ritesh.list@gmail.com, jack@suse.cz,
	Luis Chamberlain <mcgrof@kernel.org>,
	dchinner@redhat.com, Javier Gonzalez <javier.gonz@samsung.com>,
	gost.dev@samsung.com, tytso@mit.edu, p.raghav@samsung.com,
	vi.shah@samsung.com
Subject: Re: [LSF/MM/BPF TOPIC] Buffered atomic writes
Date: Mon, 16 Feb 2026 10:52:35 +0100
Message-ID: <7cf3f249-453d-423a-91d1-dfb45c474b78@linux.dev>
In-Reply-To: <aY8n97G_hXzA5MMn@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 2/13/26 14:32, Ojaswin Mujoo wrote:
> On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote:
>> Hi all,
>>
>> Atomic (untorn) writes for Direct I/O have successfully landed in the
>> kernel for ext4 and XFS[1][2]. However, extending this support to
>> Buffered I/O remains a contentious topic, with previous discussions often
>> stalling due to concerns about complexity versus utility.
>>
>> I would like to propose a session to discuss the concrete use cases for
>> buffered atomic writes and, if possible, talk about the outstanding
>> architectural blockers holding up the current RFCs[3][4].
> 
> Hi Pankaj,
> 
> Thanks for the proposal and glad to hear there is a wider interest in
> this topic. We have also been actively working on this and I am in the
> middle of testing and ironing out bugs in my RFC v2 for buffered atomic
> writes, which is largely based on Dave's suggestions to maintain atomic
> write mappings in the FS layer (aka XFS COW fork). In fact, I was going
> to propose a discussion on this myself :)
> 

Perfect.

>>
>> ## Use Case:
>>
>> A recurring objection to buffered atomics is the lack of a convincing use
>> case, with the argument that databases should simply migrate to direct I/O.
>> We have been working with PostgreSQL developer Andres Freund, who has
>> highlighted a specific architectural requirement where buffered I/O remains
>> preferable in certain scenarios.
> 
> Looks like you have some nice insights to cover from the postgres side,
> which the filesystem community has been asking for. As I've also been
> working on the kernel implementation side of it, do you think we could do
> a joint session on this topic?
>
As one of the main pushbacks against this feature has been the lack of a valid
use case, the main outcome I would like to get out of this session is a
community consensus on the use case for this feature.

It looks like you have already made quite a bit of progress with the CoW
implementation, so it would be great if it can be a joint session.


>> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
>> was a previous LSFMM proposal about untorn buffered writes from Ted Tso.
>> Based on the conversation/blockers we had before, the discussion at LSFMM
>> should focus on the following blocking issues:
>>
>> - Handling Short Writes under Memory Pressure[6]: A buffered atomic
>>   write might span page boundaries. If memory pressure causes a page
>>   fault or reclaim mid-copy, the write could be torn inside the page
>>   cache before it even reaches the filesystem.
>>     - The current RFC uses a "pinning" approach: pinning user pages and
>>       creating a BVEC to ensure the full copy can proceed atomically.
>>       This adds complexity to the write path.
>>     - Discussion: Is this acceptable? Should we consider alternatives,
>>       such as requiring userspace to mlock the I/O buffers before
>>       issuing the write to guarantee atomic copy in the page cache?
> 
> Right, I chose this approach because we only get to know about the short
> copy after it has actually happened in copy_folio_from_iter_atomic()
> and it seemed simpler to just not let the short copy happen. This is
> inspired by how dio pins the pages for DMA, just that we do it
> for a shorter time.
> 
> It does add slight complexity to the path but I'm not sure if it's complex
> enough to justify adding a hard requirement of having pages mlock'd.
> 

As databases like postgres have a buffer cache that they manage in userspace,
which is eventually used to do I/O, I am wondering if they already use mlock
or some other way to guarantee the buffer cache does not get reclaimed. That is
why I was wondering whether we could make it a requirement. Of course, that
also requires checking whether the range is mlocked in the iomap_write_iter path.
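
A rough userspace sketch of what the mlock variant could look like from the
application side. `locked_pwrite()` is a hypothetical helper, not an existing
API. RWF_ATOMIC is only accepted for direct I/O today, so passing it on a
buffered fd is expected to fail on current kernels; the fallback `#define`
only mirrors the uapi value for older headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040 /* value from include/uapi/linux/fs.h */
#endif

/*
 * Hypothetical helper: fault in and lock the source buffer, issue a single
 * pwritev2() from it, then unlock it again.
 * Returns the number of bytes written, or -errno on failure.
 */
static ssize_t locked_pwrite(int fd, void *buf, size_t len, off_t off, int flags)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	/* mlock() populates the pages and keeps them resident until munlock(). */
	if (mlock(buf, len) != 0)
		return -errno;

	ret = pwritev2(fd, &iov, 1, off, flags);
	if (ret < 0)
		ret = -errno;

	munlock(buf, len);
	return ret;
}
```

With flags set to 0 this is a plain buffered write from a locked buffer; a
caller would pass RWF_ATOMIC once a kernel accepts it for buffered fds. A
database would presumably lock its whole buffer pool once rather than per
write, which is an assumption and not something stated in this thread.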

>>
>> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a
>>   PG_atomic page flag to track dirty pages requiring atomic writeback.
>>   This faced pushback due to page flags being a scarce resource[7].
>>   Furthermore, it was argued that the atomic model does not fit the
>>   buffered I/O model because data sitting in the page cache is vulnerable
>>   to modification before writeback occurs, and writeback does not
>>   preserve application ordering[8].
>>     -  Dave Chinner has proposed leveraging the filesystem's CoW path
>>        where we always allocate new blocks for the atomic write (forced
>>        CoW). If the hardware supports it (e.g., NVMe atomic limits), the
>>        filesystem can optimize the writeback to use REQ_ATOMIC in place,
>>        avoiding the CoW overhead while maintaining the architectural
>>        separation.
> 
> Right, this is what I'm doing in the new RFC where we maintain the
> mappings for atomic writes in the COW fork. This way we are able to
> utilize a lot of existing infrastructure; however, it does add some
> complexity to the ->iomap_begin() and ->writeback_range() callbacks of
> the FS. I believe it is a tradeoff since the general consensus was
> mostly to avoid adding too much complexity to the iomap layer.
> 
> Another thing that came up is to consider using write-through semantics
> for buffered atomic writes, where we are able to transition the page to
> writeback state immediately after the write and prevent any other users
> from modifying the data till writeback completes. This might affect
> performance since we won't be able to batch similar atomic IOs, but maybe
> applications like postgres would not mind this too much. If we go with
> this approach, we will be able to avoid worrying too much about other
> users changing atomic data underneath us.
> 

Hmm, IIUC, postgres will write its dirty buffer cache by combining multiple DB
pages based on `io_combine_limit` (typically 128kB). So immediately writing them
might be OK as long as we don't remove those pages from the page cache like we do
with RWF_DONTCACHE (formerly RWF_UNCACHED).
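
As a rough userspace approximation of "write now, but keep the pages cached",
the sketch below does a buffered pwrite() and then starts writeback of just
that range with sync_file_range(). `write_and_kick()` is a hypothetical helper
name. Unlike the in-kernel write-through idea, this neither waits for
completion nor stops other writers from touching the pages in the meantime,
and it gives no durability guarantee.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: buffered write followed by an immediate writeback
 * request for the same range.
 * Returns bytes written or -errno; on success *kick is set to 0 if the
 * writeback request was accepted, or to -errno if it was not.
 */
static ssize_t write_and_kick(int fd, const void *buf, size_t len, off_t off,
			      int *kick)
{
	ssize_t ret = pwrite(fd, buf, len, off);

	if (ret < 0)
		return -errno;

	/*
	 * SYNC_FILE_RANGE_WRITE initiates write-out of the dirty pages in
	 * the range. It does not wait for the I/O and does not drop the
	 * pages from the page cache.
	 */
	*kick = sync_file_range(fd, off, ret, SYNC_FILE_RANGE_WRITE) ? -errno : 0;
	return ret;
}
```

A combined write of `io_combine_limit` bytes would go through one such call, so
the pages stay available for later reads while writeback is already underway.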


> An argument against this however is that it is the user's responsibility
> to not do non-atomic IO over an atomic range and this shall be considered
> a userspace usage error. This is similar to how there are ways users can
> tear a dio if they perform overlapping writes [1].
> 
> That being said, I think these points are worth discussing and it would
> be helpful to have people from postgres around while discussing these
> semantics with the FS community members.
> 
> As for ordering of writes, I'm not sure if that is something that
> we should guarantee via the RWF_ATOMIC API. Ensuring ordering has mostly
> been the task of userspace via fsync() and friends.
> 

Agreed.

> 
> [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/
> 
>>     - Discussion: While the CoW approach fits XFS and other CoW
>>       filesystems well, it presents challenges for filesystems like ext4
>>       which lack CoW capabilities for data. Should this be a filesystem
>>       specific feature?
> 
> I believe your question is if we should have a hard dependency on COW
> mappings for atomic writes. Currently, COW in the atomic write context in
> XFS is used for these two things:
> 
> 1. COW fork holds atomic write ranges.
> 
> This is not strictly a COW feature, just that we are repurposing the COW
> fork to hold our atomic ranges. Basically a way for writeback path to
> know that atomic write was done here.
> 
> COW fork is one way to do this but I believe every FS has a version of
> in-memory extent trees where such ephemeral atomic write mappings can be
> held. The extent status cache is ext4's version of this, and can be used
> to manage the atomic write ranges.
> 
> There is an alternate suggestion that came up from discussions with Ted
> and Darrick that we can instead use a generic side-car structure which
> holds atomic write ranges. FSes can populate these during atomic writes
> and query these in their writeback paths. 
> 
> This means for any FS operation (think truncate, falloc, mwrite, write
> ...) we would need to keep this structure in sync, which can become pretty
> complex pretty fast. I'm yet to implement this so not sure how it would
> look in practice though.
> 
> 2. COW feature as a whole enables software based atomic writes.
> 
> This is something that ext4 won't be able to support (right now), just
> like how we don't support software atomic writes for dio.
> 
> I believe Baokun and Yi are working on a feature that can eventually
> enable COW writes in ext4 [2]. Till we have something like that, we
> would have to rely on hardware support.
> 
> Regardless, I think the ability to support or not support software
> atomic writes largely depends on the filesystem, so I'm not sure how we
> can lift this up to a generic layer anyway.
> 
> [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/
> 

Thanks for the explanation. I am also planning to take a shot at the CoW approach. I would
be more than happy to review and test if you send an RFC in the meantime.
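
To make the side-car idea from Ted and Darrick mentioned above a bit more
concrete, here is a toy userspace model of such a structure. All names
(`atomic_ranges`, `ar_add`, `ar_covers`, `ar_truncate`) are hypothetical, the
fixed-size array stands in for whatever tree a filesystem would really use,
and dropping a range that truncate cuts into is an assumed policy, not
something settled in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 64

/* One byte range [start, start + len) that was written atomically. */
struct atomic_range {
	uint64_t start;
	uint64_t len;
};

struct atomic_ranges {
	struct atomic_range r[MAX_RANGES];
	size_t nr;
};

/* Write path: remember that [start, start + len) must go out untorn. */
static bool ar_add(struct atomic_ranges *t, uint64_t start, uint64_t len)
{
	if (t->nr == MAX_RANGES || len == 0)
		return false;
	t->r[t->nr++] = (struct atomic_range){ .start = start, .len = len };
	return true;
}

/* Writeback path: does one recorded range fully contain the query range? */
static bool ar_covers(const struct atomic_ranges *t, uint64_t start,
		      uint64_t len)
{
	for (size_t i = 0; i < t->nr; i++) {
		const struct atomic_range *a = &t->r[i];

		if (start >= a->start && start + len <= a->start + a->len)
			return true;
	}
	return false;
}

/* Truncate path: forget every range that extends past the new EOF. */
static void ar_truncate(struct atomic_ranges *t, uint64_t new_size)
{
	size_t keep = 0;

	for (size_t i = 0; i < t->nr; i++)
		if (t->r[i].start + t->r[i].len <= new_size)
			t->r[keep++] = t->r[i];
	t->nr = keep;
}
```

Even in this toy form, truncate already needs its own hook; falloc, mwrite
and plain overlapping writes would each need one too, which is the
keep-in-sync cost Ojaswin points out.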

--
Pankaj

