* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
@ 2026-03-08 9:19 Ritesh Harjani
2026-03-08 15:33 ` Andres Freund
0 siblings, 1 reply; 18+ messages in thread
From: Ritesh Harjani @ 2026-03-08 9:19 UTC (permalink / raw)
To: Andres Freund, Amir Goldstein
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, jack, ojaswin,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Andres Freund <andres@anarazel.de> writes:
Hi,
> Hi,
>
> On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
>> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
>> >
>> > I think a better session would be how we can help postgres to move
>> > off buffered I/O instead of adding more special cases for them.
>
> FWIW, we are adding support for DIO (it's been added, but performance isn't
> competitive for most workloads in the released versions yet, work to address
> those issues is in progress).
>
Is postgres also planning to evaluate the performance gains from using the
DIO atomic writes already available in the upstream Linux kernel? What
would be interesting to see is the relative %delta of DIO atomic writes
versus DIO non-atomic writes.
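For anyone who wants to try that comparison, below is a rough sketch of the
kind of micro-benchmark I have in mind (untested; it assumes a kernel and
filesystem with DIO untorn-write support, a preallocated test file, and that
the write size/offset respect the atomic write unit limits reported by
statx() with STATX_WRITE_ATOMIC):

/*
 * Rough sketch only: time N direct IO writes with and without RWF_ATOMIC.
 * Build with: gcc -O2 -o atomic_cmp atomic_cmp.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* include/uapi/linux/fs.h, kernels >= 6.11 */
#endif

static double time_writes(int fd, void *buf, size_t len, int iters, int flags)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
                /* offsets stay aligned to len, as RWF_ATOMIC requires */
                if (pwritev2(fd, &iov, 1, (off_t)i * len, flags) != (ssize_t)len) {
                        perror("pwritev2");
                        exit(1);
                }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
        size_t len = 8192;      /* e.g. one 8k postgres page */
        int iters = 10000;
        void *buf;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_WRONLY | O_DIRECT);
        if (fd < 0 || posix_memalign(&buf, 4096, len))
                return 1;
        memset(buf, 0xab, len);

        printf("non-atomic: %.3fs\n", time_writes(fd, buf, len, iters, 0));
        printf("atomic:     %.3fs\n", time_writes(fd, buf, len, iters, RWF_ATOMIC));
        return 0;
}

Obviously the more interesting comparison is at the database level (e.g.
postgres with full_page_writes on vs. off), but something like the above
gives a quick baseline for the raw write path.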
That being said, I understand the discussion in this wider thread is
also about supporting write-through in Linux and then adding support for
atomic writes on top of that. We have an early prototype of that
design ready, and Ojaswin will be posting it soon.
-ritesh
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-03-08 9:19 [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes Ritesh Harjani @ 2026-03-08 15:33 ` Andres Freund 0 siblings, 0 replies; 18+ messages in thread From: Andres Freund @ 2026-03-08 15:33 UTC (permalink / raw) To: Ritesh Harjani Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-03-08 14:49:21 +0530, Ritesh Harjani wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote: > >> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > >> > > >> > I think a better session would be how we can help postgres to move > >> > off buffered I/O instead of adding more special cases for them. > > > > FWIW, we are adding support for DIO (it's been added, but performance isn't > > competitive for most workloads in the released versions yet, work to address > > those issues is in progress). > > > > Is postgres also planning to evaluate the performance gains by using DIO > atomic writes available in upstream linux kernel? What would be > interesting to see is the relative %delta with DIO atomic-writes v/s > DIO non atomic writes. For some limited workloads that comparison is possible today with minimal work (albeit with some safety compromises, due to postgres not yet verifying that the atomic boundaries are correct, but it's good enough for experiments), as you can just disable the torn-page avoidance with a configuration parameter. The gains from not needing full page writes (postgres' mechanism to protect against torn pages) can be rather significant, as full page writes have substantial overhead due to the higher journalling volume. The worst part of the cost is that the cost decreases between checkpoints (because we don't need to repeatedly log a full page images for the same page), just to then increase again when the next checkpoint starts. It's not uncommon that in the phase just after the start of a checkpoint, WAL is over 90% of full page writes (when not having full page write compression enabled), while later the same workload only has a very small percentage of the overhead. The biggest gain from atomic writes will be the more even performance (important for real world users), rather than the absolute increase in throughput. Normal gains during the full page intensive phase are probably on the order of 20-35% for workload with many small transactions, bigger for workloads with larger transactions. But if the increase in WAL volume pushes you above the disk write throughput, the gains can be almost arbitrarily larger. E.g. on a cloud disk with 100MB/s of write bandwidth, the difference between WAL throughput of 50MB/s without full page writes and the same workload with full page images generating ~300MB/s of WAL will obviously mean that you'll get about < 1/3 of the transaction throughput while also not having any spare IO capacity for anything other than WAL writes. The reason I say limited workloads above is that upstream postgres does not yet do smart enough write combining with DIO for data writes, I'd expect that to be addressed later this year (but it's community open source, as you presumably know from experience, that's not always easy to predict / control). 
If the workload has a large fraction of data writes, the overhead of that makes the DIO numbers too unrealistic. Unfortunately all this means that the gains from atomic writes, be it for buffered or direct IO, will very very heavily depend on the chosen workload and by tweaking the workload / hardware you can inflate the gains to an almost arbitrarily large degree. This is also about more than throughput / latency, as the volume of WAL also impacts the cost of retaining the WAL - often that's done for a while to allow point-in-time-recovery (i.e. recovering an older base backup up to a precise point in time, to recover from application bugs or operator errors). Greetings, Andres Freund ^ permalink raw reply [flat|nested] 18+ messages in thread
* [LSF/MM/BPF TOPIC] Buffered atomic writes
@ 2026-02-13 10:20 Pankaj Raghav
2026-02-13 13:32 ` Ojaswin Mujoo
2026-02-17 5:51 ` Christoph Hellwig
0 siblings, 2 replies; 18+ messages in thread
From: Pankaj Raghav @ 2026-02-13 10:20 UTC (permalink / raw)
To: linux-xfs, linux-mm, linux-fsdevel, lsf-pc
Cc: Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
Hi all,
Atomic (untorn) writes for Direct I/O have successfully landed in the
kernel for ext4 and XFS[1][2]. However, extending this support to Buffered
I/O remains a contentious topic, with previous discussions often stalling
due to concerns about complexity versus utility.
I would like to propose a session to discuss the concrete use cases for
buffered atomic writes and, if possible, talk about the outstanding
architectural issues blocking the current RFCs[3][4].
## Use Case:
A recurring objection to buffered atomics is the lack of a convincing
use case, with the argument that databases should simply migrate to
direct I/O. We have been working with PostgreSQL developer Andres
Freund, who has highlighted a specific architectural requirement where
buffered I/O remains preferable in certain scenarios.
While Postgres recently started to support direct I/O, optimal
performance requires a large, statically configured user-space buffer
pool. This becomes problematic when running many Postgres instances on
the same hardware, a common deployment scenario. Statically partitioning
RAM for direct I/O caches across many instances is inefficient compared
to allowing the kernel page cache to dynamically balance memory pressure
between instances.
The other use case is running Postgres as part of a larger workload on one
host. Dedicating enough memory to Postgres' buffer pool to make DIO
viable is often not realistic, because some deployments need a lot of
memory to cache database IO, while others need a lot of memory for
non-database caching.
Enabling atomic writes for this buffered workload would allow Postgres
to disable full-page writes [5]. For direct I/O, this has been shown to
reduce transaction variability; for buffered I/O, we expect similar
gains, alongside decreased WAL bandwidth and storage costs for WAL
archival. As a side note, for most workloads full-page writes occupy a
significant portion of WAL volume.
Andres has agreed to attend LSFMM this year to discuss these requirements.
## Discussion:
We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there
was a previous LSFMM proposal about untorn buffered writes from Ted Ts'o.
Based on the earlier conversations and the blockers raised there, the
discussion at LSFMM should focus on the following issues:
- Handling Short Writes under Memory Pressure[6]: A buffered atomic
write might span page boundaries. If memory pressure causes a page
fault or reclaim mid-copy, the write could be torn inside the page
cache before it even reaches the filesystem.
- The current RFC uses a "pinning" approach: pinning user pages and
creating a BVEC to ensure the full copy can proceed atomically.
This adds complexity to the write path.
- Discussion: Is this acceptable? Should we consider alternatives,
such as requiring userspace to mlock the I/O buffers before
issuing the write to guarantee an atomic copy into the page cache?
(A rough userspace sketch of that alternative follows after this list.)
- Page Cache Model vs. Filesystem CoW: The current RFC introduces a
PG_atomic page flag to track dirty pages requiring atomic writeback.
This faced pushback due to page flags being a scarce resource[7].
Furthermore, it was argued that the atomic write model does not fit the
buffered I/O model, because data sitting in the page cache is vulnerable
to modification before writeback occurs, and writeback does not preserve
application ordering[8].
- Dave Chinner has proposed leveraging the filesystem's CoW path
where we always allocate new blocks for the atomic write (forced
CoW). If the hardware supports it (e.g., NVMe atomic limits), the
filesystem can optimize the writeback to use REQ_ATOMIC in place,
avoiding the CoW overhead while maintaining the architectural
separation.
- Discussion: While the CoW approach fits XFS and other CoW
filesystems well, it presents challenges for filesystems like ext4,
which lack CoW capabilities for data. Should this be a
filesystem-specific feature?
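To make the mlock alternative in the first point above concrete, here is a
rough userspace sketch of what the proposed semantics could look like from
an application's point of view. This is purely illustrative: today the
kernel only accepts RWF_ATOMIC for direct I/O, and whether buffered
RWF_ATOMIC would additionally require the source buffer to be mlock'd is
exactly the open question:

/* Illustration only: issue a buffered atomic write from an mlock'd buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040   /* include/uapi/linux/fs.h */
#endif

int write_page_atomically(int fd, void *page, size_t len, off_t off)
{
        struct iovec iov = { .iov_base = page, .iov_len = len };
        ssize_t ret;

        /*
         * Keep the source buffer resident so the copy into the page cache
         * cannot fault mid-way; this is the userspace-side alternative to
         * the kernel pinning the pages itself.
         */
        if (mlock(page, len))
                return -1;

        /* len and off must respect the FS-advertised atomic write unit limits. */
        ret = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

        munlock(page, len);
        return ret == (ssize_t)len ? 0 : -1;
}

In practice a database would presumably mlock its whole buffer pool once at
startup rather than per write; the per-write mlock/munlock here is only to
keep the example self-contained.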
Comments or Curses, all are welcome.
--
Pankaj
[1] https://lwn.net/Articles/1009298/
[2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html
[3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/
[4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com
[5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES
[6] https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/
[7] https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/
[8] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 10:20 Pankaj Raghav @ 2026-02-13 13:32 ` Ojaswin Mujoo 2026-02-16 9:52 ` Pankaj Raghav 2026-02-16 11:38 ` Jan Kara 2026-02-17 5:51 ` Christoph Hellwig 1 sibling, 2 replies; 18+ messages in thread From: Ojaswin Mujoo @ 2026-02-13 13:32 UTC (permalink / raw) To: Pankaj Raghav Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > Hi all, > > Atomic (untorn) writes for Direct I/O have successfully landed in kernel > for ext4 and XFS[1][2]. However, extending this support to Buffered I/O > remains a contentious topic, with previous discussions often stalling due to > concerns about complexity versus utility. > > I would like to propose a session to discuss the concrete use cases for > buffered atomic writes and if possible, talk about the outstanding > architectural blockers blocking the current RFCs[3][4]. Hi Pankaj, Thanks for the proposal and glad to hear there is a wider interest in this topic. We have also been actively working on this and I in middle of testing and ironing out bugs in my RFC v2 for buffered atomic writes, which is largely based on Dave's suggestions to maintain atomic write mappings in FS layer (aka XFS COW fork). Infact I was going to propose a discussion on this myself :) > > ## Use Case: > > A recurring objection to buffered atomics is the lack of a convincing use > case, with the argument that databases should simply migrate to direct I/O. > We have been working with PostgreSQL developer Andres Freund, who has > highlighted a specific architectural requirement where buffered I/O remains > preferable in certain scenarios. Looks like you have some nice insights to cover from postgres side which filesystem community has been asking for. As I've also been working on the kernel implementation side of it, do you think we could do a joint session on this topic? > > While Postgres recently started to support direct I/O, optimal performance > requires a large, statically configured user-space buffer pool. This becomes > problematic when running many Postgres instances on the same hardware, a > common deployment scenario. Statically partitioning RAM for direct I/O > caches across many instances is inefficient compared to allowing the kernel > page cache to dynamically balance memory pressure between instances. > > The other use case is using postgres as part of a larger workload on one > instance. Using up enough memory for postgres' buffer pool to make DIO use > viable is often not realistic, because some deployments require a lot of > memory to cache database IO, while others need a lot of memory for > non-database caching. > > Enabling atomic writes for this buffered workload would allow Postgres to > disable full-page writes [5]. For direct I/O, this has shown to reduce > transaction variability; for buffered I/O, we expect similar gains, > alongside decreased WAL bandwidth and storage costs for WAL archival. As a > side note, for most workloads full page writes occupy a significant portion > of WAL volume. > > Andres has agreed to attend LSFMM this year to discuss these requirements. Glad to hear people from postgres would also be joining! 
> > ## Discussion: > > We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there > was a previous LSFMM proposal about untorn buffered writes from Ted Tso. > Based on the conversation/blockers we had before, the discussion at LSFMM > should focus on the following blocking issues: > > - Handling Short Writes under Memory Pressure[6]: A buffered atomic > write might span page boundaries. If memory pressure causes a page > fault or reclaim mid-copy, the write could be torn inside the page > cache before it even reaches the filesystem. > - The current RFC uses a "pinning" approach: pinning user pages and > creating a BVEC to ensure the full copy can proceed atomically. > This adds complexity to the write path. > - Discussion: Is this acceptable? Should we consider alternatives, > such as requiring userspace to mlock the I/O buffers before > issuing the write to guarantee atomic copy in the page cache? Right, I chose this approach because we only get to know about the short copy after it has actually happened in copy_folio_from_iter_atomic() and it seemed simpler to just not let the short copy happen. This is inspired from how dio pins the pages for DMA, just that we do it for a shorter time. It does add slight complexity to the path but I'm not sure if it's complex enough to justify adding a hard requirement of having pages mlock'd. > > - Page Cache Model vs. Filesystem CoW: The current RFC introduces a > PG_atomic page flag to track dirty pages requiring atomic writeback. > This faced pushback due to page flags being a scarce resource[7]. > Furthermore, it was argued that atomic model does not fit the buffered > I/O model because data sitting in the page cache is vulnerable to > modification before writeback occurs, and writeback does not preserve > application ordering[8]. > - Dave Chinner has proposed leveraging the filesystem's CoW path > where we always allocate new blocks for the atomic write (forced > CoW). If the hardware supports it (e.g., NVMe atomic limits), the > filesystem can optimize the writeback to use REQ_ATOMIC in place, > avoiding the CoW overhead while maintaining the architectural > separation. Right, this is what I'm doing in the new RFC where we maintain the mappings for atomic write in COW fork. This way we are able to utilize a lot of existing infrastructure, however it does add some complexity to ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe it is a tradeoff since the general consesus was mostly to avoid adding too much complexity to iomap layer. Another thing that came up is to consider using write through semantics for buffered atomic writes, where we are able to transition page to writeback state immediately after the write and avoid any other users to modify the data till writeback completes. This might affect performance since we won't be able to batch similar atomic IOs but maybe applications like postgres would not mind this too much. If we go with this approach, we will be able to avoid worrying too much about other users changing atomic data underneath us. An argument against this however is that it is user's responsibility to not do non atomic IO over an atomic range and this shall be considered a userspace usage error. This is similar to how there are ways users can tear a dio if they perform overlapping writes. [1]. That being said, I think these points are worth discussing and it would be helpful to have people from postgres around while discussing these semantics with the FS community members. 
As for ordering of writes, I'm not sure if that is something that we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly been the task of userspace via fsync() and friends. [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > - Discussion: While the CoW approach fits XFS and other CoW > filesystems well, it presents challenges for filesystems like ext4 > which lack CoW capabilities for data. Should this be a filesystem > specific feature? I believe your question is if we should have a hard dependency on COW mappings for atomic writes. Currently, COW in atomic write context in XFS, is used for these 2 things: 1. COW fork holds atomic write ranges. This is not strictly a COW feature, just that we are repurposing the COW fork to hold our atomic ranges. Basically a way for writeback path to know that atomic write was done here. COW fork is one way to do this but I believe every FS has a version of in memory extent trees where such ephemeral atomic write mappings can be held. The extent status cache is ext4's version of this, and can be used to manage the atomic write ranges. There is an alternate suggestion that came up from discussions with Ted and Darrick that we can instead use a generic side-car structure which holds atomic write ranges. FSes can populate these during atomic writes and query these in their writeback paths. This means for any FS operation (think truncate, falloc, mwrite, write ...) we would need to keep this structure in sync, which can become pretty complex pretty fast. I'm yet to implement this so not sure how it would look in practice though. 2. COW feature as a whole enables software based atomic writes. This is something that ext4 won't be able to support (right now), just like how we don't support software writes for dio. I believe Baokun and Yi and working on a feature that can eventually enable COW writes in ext4 [2]. Till we have something like that, we would have to rely on hardware support. Regardless, I don't think the ability to support or not support software atomic writes largely depends on the filesystem so I'm not sure how we can lift this up to a generic layer anyways. [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/ Thanks, Ojaswin > > Comments or Curses, all are welcome. > > -- > Pankaj > > [1] https://lwn.net/Articles/1009298/ > [2] https://docs.kernel.org/6.17/filesystems/ext4/atomic_writes.html > [3] https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/ > [4] https://lore.kernel.org/all/cover.1762945505.git.ojaswin@linux.ibm.com > [5] https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-FULL-PAGE-WRITES > [6] > https://lore.kernel.org/linux-fsdevel/ZiZ8XGZz46D3PRKr@casper.infradead.org/ > [7] > https://lore.kernel.org/linux-fsdevel/aRSuH82gM-8BzPCU@casper.infradead.org/ > [8] > https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/ > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 13:32 ` Ojaswin Mujoo @ 2026-02-16 9:52 ` Pankaj Raghav 2026-02-17 17:20 ` Ojaswin Mujoo 2026-02-16 11:38 ` Jan Kara 1 sibling, 1 reply; 18+ messages in thread From: Pankaj Raghav @ 2026-02-16 9:52 UTC (permalink / raw) To: Ojaswin Mujoo Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On 2/13/26 14:32, Ojaswin Mujoo wrote: > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: >> Hi all, >> >> Atomic (untorn) writes for Direct I/O have successfully landed in kernel >> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O >> remains a contentious topic, with previous discussions often stalling due to >> concerns about complexity versus utility. >> >> I would like to propose a session to discuss the concrete use cases for >> buffered atomic writes and if possible, talk about the outstanding >> architectural blockers blocking the current RFCs[3][4]. > > Hi Pankaj, > > Thanks for the proposal and glad to hear there is a wider interest in > this topic. We have also been actively working on this and I in middle > of testing and ironing out bugs in my RFC v2 for buffered atomic > writes, which is largely based on Dave's suggestions to maintain atomic > write mappings in FS layer (aka XFS COW fork). Infact I was going to > propose a discussion on this myself :) > Perfect. >> >> ## Use Case: >> >> A recurring objection to buffered atomics is the lack of a convincing use >> case, with the argument that databases should simply migrate to direct I/O. >> We have been working with PostgreSQL developer Andres Freund, who has >> highlighted a specific architectural requirement where buffered I/O remains >> preferable in certain scenarios. > > Looks like you have some nice insights to cover from postgres side which > filesystem community has been asking for. As I've also been working on > the kernel implementation side of it, do you think we could do a joint > session on this topic? > As one of the main pushback for this feature has been a valid usecase, the main outcome I would like to get out of this session is a community consensus on the use case for this feature. It looks like you already made quite a bit of progress with the CoW impl, so it would be great to if it can be a joint session. >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso. >> Based on the conversation/blockers we had before, the discussion at LSFMM >> should focus on the following blocking issues: >> >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic >> write might span page boundaries. If memory pressure causes a page >> fault or reclaim mid-copy, the write could be torn inside the page >> cache before it even reaches the filesystem. >> - The current RFC uses a "pinning" approach: pinning user pages and >> creating a BVEC to ensure the full copy can proceed atomically. >> This adds complexity to the write path. >> - Discussion: Is this acceptable? Should we consider alternatives, >> such as requiring userspace to mlock the I/O buffers before >> issuing the write to guarantee atomic copy in the page cache? 
> > Right, I chose this approach because we only get to know about the short > copy after it has actually happened in copy_folio_from_iter_atomic() > and it seemed simpler to just not let the short copy happen. This is > inspired from how dio pins the pages for DMA, just that we do it > for a shorter time. > > It does add slight complexity to the path but I'm not sure if it's complex > enough to justify adding a hard requirement of having pages mlock'd. > As databases like postgres have a buffer cache that they manage in userspace, which is eventually used to do IO, I am wondering if they already do a mlock or some other way to guarantee the buffer cache does not get reclaimed. That is why I was thinking if we could make it a requirement. Of course, that also requires checking if the range is mlocked in the iomap_write_iter path. >> >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a >> PG_atomic page flag to track dirty pages requiring atomic writeback. >> This faced pushback due to page flags being a scarce resource[7]. >> Furthermore, it was argued that atomic model does not fit the buffered >> I/O model because data sitting in the page cache is vulnerable to >> modification before writeback occurs, and writeback does not preserve >> application ordering[8]. >> - Dave Chinner has proposed leveraging the filesystem's CoW path >> where we always allocate new blocks for the atomic write (forced >> CoW). If the hardware supports it (e.g., NVMe atomic limits), the >> filesystem can optimize the writeback to use REQ_ATOMIC in place, >> avoiding the CoW overhead while maintaining the architectural >> separation. > > Right, this is what I'm doing in the new RFC where we maintain the > mappings for atomic write in COW fork. This way we are able to utilize a > lot of existing infrastructure, however it does add some complexity to > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe > it is a tradeoff since the general consesus was mostly to avoid adding > too much complexity to iomap layer. > > Another thing that came up is to consider using write through semantics > for buffered atomic writes, where we are able to transition page to > writeback state immediately after the write and avoid any other users to > modify the data till writeback completes. This might affect performance > since we won't be able to batch similar atomic IOs but maybe > applications like postgres would not mind this too much. If we go with > this approach, we will be able to avoid worrying too much about other > users changing atomic data underneath us. > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB pages based on `io_combine_limit` (typically 128kb). So immediately writing them might be ok as long as we don't remove those pages from the page cache like we do in RWF_UNCACHED. > An argument against this however is that it is user's responsibility to > not do non atomic IO over an atomic range and this shall be considered a > userspace usage error. This is similar to how there are ways users can > tear a dio if they perform overlapping writes. [1]. > > That being said, I think these points are worth discussing and it would > be helpful to have people from postgres around while discussing these > semantics with the FS community members. > > As for ordering of writes, I'm not sure if that is something that > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly > been the task of userspace via fsync() and friends. > Agreed. 
> > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > >> - Discussion: While the CoW approach fits XFS and other CoW >> filesystems well, it presents challenges for filesystems like ext4 >> which lack CoW capabilities for data. Should this be a filesystem >> specific feature? > > I believe your question is if we should have a hard dependency on COW > mappings for atomic writes. Currently, COW in atomic write context in > XFS, is used for these 2 things: > > 1. COW fork holds atomic write ranges. > > This is not strictly a COW feature, just that we are repurposing the COW > fork to hold our atomic ranges. Basically a way for writeback path to > know that atomic write was done here. > > COW fork is one way to do this but I believe every FS has a version of > in memory extent trees where such ephemeral atomic write mappings can be > held. The extent status cache is ext4's version of this, and can be used > to manage the atomic write ranges. > > There is an alternate suggestion that came up from discussions with Ted > and Darrick that we can instead use a generic side-car structure which > holds atomic write ranges. FSes can populate these during atomic writes > and query these in their writeback paths. > > This means for any FS operation (think truncate, falloc, mwrite, write > ...) we would need to keep this structure in sync, which can become pretty > complex pretty fast. I'm yet to implement this so not sure how it would > look in practice though. > > 2. COW feature as a whole enables software based atomic writes. > > This is something that ext4 won't be able to support (right now), just > like how we don't support software writes for dio. > > I believe Baokun and Yi and working on a feature that can eventually > enable COW writes in ext4 [2]. Till we have something like that, we > would have to rely on hardware support. > > Regardless, I don't think the ability to support or not support > software atomic writes largely depends on the filesystem so I'm not > sure how we can lift this up to a generic layer anyways. > > [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/ > Thanks for the explanation. I am also planning to take a shot at the CoW approach. I would be more than happy to review and test if you send a RFC in the meantime. -- Pankaj ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 9:52 ` Pankaj Raghav @ 2026-02-17 17:20 ` Ojaswin Mujoo 2026-02-18 17:42 ` [Lsf-pc] " Jan Kara 0 siblings, 1 reply; 18+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 17:20 UTC (permalink / raw) To: Pankaj Raghav Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote: > On 2/13/26 14:32, Ojaswin Mujoo wrote: > > On Fri, Feb 13, 2026 at 11:20:36AM +0100, Pankaj Raghav wrote: > >> Hi all, > >> > >> Atomic (untorn) writes for Direct I/O have successfully landed in kernel > >> for ext4 and XFS[1][2]. However, extending this support to Buffered I/O > >> remains a contentious topic, with previous discussions often stalling due to > >> concerns about complexity versus utility. > >> > >> I would like to propose a session to discuss the concrete use cases for > >> buffered atomic writes and if possible, talk about the outstanding > >> architectural blockers blocking the current RFCs[3][4]. > > > > Hi Pankaj, > > > > Thanks for the proposal and glad to hear there is a wider interest in > > this topic. We have also been actively working on this and I in middle > > of testing and ironing out bugs in my RFC v2 for buffered atomic > > writes, which is largely based on Dave's suggestions to maintain atomic > > write mappings in FS layer (aka XFS COW fork). Infact I was going to > > propose a discussion on this myself :) > > > > Perfect. > > >> > >> ## Use Case: > >> > >> A recurring objection to buffered atomics is the lack of a convincing use > >> case, with the argument that databases should simply migrate to direct I/O. > >> We have been working with PostgreSQL developer Andres Freund, who has > >> highlighted a specific architectural requirement where buffered I/O remains > >> preferable in certain scenarios. > > > > Looks like you have some nice insights to cover from postgres side which > > filesystem community has been asking for. As I've also been working on > > the kernel implementation side of it, do you think we could do a joint > > session on this topic? > > > As one of the main pushback for this feature has been a valid usecase, the main > outcome I would like to get out of this session is a community consensus on the use case > for this feature. > > It looks like you already made quite a bit of progress with the CoW impl, so it > would be great to if it can be a joint session. Awesome! > > > >> We currently have RFCs posted by John Garry and Ojaswin Mujoo, and there > >> was a previous LSFMM proposal about untorn buffered writes from Ted Tso. > >> Based on the conversation/blockers we had before, the discussion at LSFMM > >> should focus on the following blocking issues: > >> > >> - Handling Short Writes under Memory Pressure[6]: A buffered atomic > >> write might span page boundaries. If memory pressure causes a page > >> fault or reclaim mid-copy, the write could be torn inside the page > >> cache before it even reaches the filesystem. > >> - The current RFC uses a "pinning" approach: pinning user pages and > >> creating a BVEC to ensure the full copy can proceed atomically. > >> This adds complexity to the write path. > >> - Discussion: Is this acceptable? Should we consider alternatives, > >> such as requiring userspace to mlock the I/O buffers before > >> issuing the write to guarantee atomic copy in the page cache? 
> > > > Right, I chose this approach because we only get to know about the short > > copy after it has actually happened in copy_folio_from_iter_atomic() > > and it seemed simpler to just not let the short copy happen. This is > > inspired from how dio pins the pages for DMA, just that we do it > > for a shorter time. > > > > It does add slight complexity to the path but I'm not sure if it's complex > > enough to justify adding a hard requirement of having pages mlock'd. > > > > As databases like postgres have a buffer cache that they manage in userspace, > which is eventually used to do IO, I am wondering if they already do a mlock > or some other way to guarantee the buffer cache does not get reclaimed. That is > why I was thinking if we could make it a requirement. Of course, that also requires > checking if the range is mlocked in the iomap_write_iter path. Hmm got it,I still feel it might be an overkill for something we already have a mechanism for and can achieve easily, but I'm open to discussion on this :) > > >> > >> - Page Cache Model vs. Filesystem CoW: The current RFC introduces a > >> PG_atomic page flag to track dirty pages requiring atomic writeback. > >> This faced pushback due to page flags being a scarce resource[7]. > >> Furthermore, it was argued that atomic model does not fit the buffered > >> I/O model because data sitting in the page cache is vulnerable to > >> modification before writeback occurs, and writeback does not preserve > >> application ordering[8]. > >> - Dave Chinner has proposed leveraging the filesystem's CoW path > >> where we always allocate new blocks for the atomic write (forced > >> CoW). If the hardware supports it (e.g., NVMe atomic limits), the > >> filesystem can optimize the writeback to use REQ_ATOMIC in place, > >> avoiding the CoW overhead while maintaining the architectural > >> separation. > > > > Right, this is what I'm doing in the new RFC where we maintain the > > mappings for atomic write in COW fork. This way we are able to utilize a > > lot of existing infrastructure, however it does add some complexity to > > ->iomap_begin() and ->writeback_range() callbacks of the FS. I believe > > it is a tradeoff since the general consesus was mostly to avoid adding > > too much complexity to iomap layer. > > > > Another thing that came up is to consider using write through semantics > > for buffered atomic writes, where we are able to transition page to > > writeback state immediately after the write and avoid any other users to > > modify the data till writeback completes. This might affect performance > > since we won't be able to batch similar atomic IOs but maybe > > applications like postgres would not mind this too much. If we go with > > this approach, we will be able to avoid worrying too much about other > > users changing atomic data underneath us. > > > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB > pages based on `io_combine_limit` (typically 128kb). So immediately writing them > might be ok as long as we don't remove those pages from the page cache like we do in > RWF_UNCACHED. Yep, and Ive not looked at the code path much but I think if we really care about the user not changing the data b/w write and writeback then we will probably need to start the writeback while holding the folio lock, which is currently not done in RWF_UNCACHED. 
> > > > An argument against this however is that it is user's responsibility to > > not do non atomic IO over an atomic range and this shall be considered a > > userspace usage error. This is similar to how there are ways users can > > tear a dio if they perform overlapping writes. [1]. > > > > That being said, I think these points are worth discussing and it would > > be helpful to have people from postgres around while discussing these > > semantics with the FS community members. > > > > As for ordering of writes, I'm not sure if that is something that > > we should guarantee via the RWF_ATOMIC api. Ensuring ordering has mostly > > been the task of userspace via fsync() and friends. > > > > Agreed. > > > > > [1] https://lore.kernel.org/fstests/0af205d9-6093-4931-abe9-f236acae8d44@oracle.com/ > > > >> - Discussion: While the CoW approach fits XFS and other CoW > >> filesystems well, it presents challenges for filesystems like ext4 > >> which lack CoW capabilities for data. Should this be a filesystem > >> specific feature? > > > > I believe your question is if we should have a hard dependency on COW > > mappings for atomic writes. Currently, COW in atomic write context in > > XFS, is used for these 2 things: > > > > 1. COW fork holds atomic write ranges. > > > > This is not strictly a COW feature, just that we are repurposing the COW > > fork to hold our atomic ranges. Basically a way for writeback path to > > know that atomic write was done here. > > > > COW fork is one way to do this but I believe every FS has a version of > > in memory extent trees where such ephemeral atomic write mappings can be > > held. The extent status cache is ext4's version of this, and can be used > > to manage the atomic write ranges. > > > > There is an alternate suggestion that came up from discussions with Ted > > and Darrick that we can instead use a generic side-car structure which > > holds atomic write ranges. FSes can populate these during atomic writes > > and query these in their writeback paths. > > > > This means for any FS operation (think truncate, falloc, mwrite, write > > ...) we would need to keep this structure in sync, which can become pretty > > complex pretty fast. I'm yet to implement this so not sure how it would > > look in practice though. > > > > 2. COW feature as a whole enables software based atomic writes. > > > > This is something that ext4 won't be able to support (right now), just > > like how we don't support software writes for dio. > > > > I believe Baokun and Yi and working on a feature that can eventually > > enable COW writes in ext4 [2]. Till we have something like that, we > > would have to rely on hardware support. > > > > Regardless, I don't think the ability to support or not support > > software atomic writes largely depends on the filesystem so I'm not > > sure how we can lift this up to a generic layer anyways. > > > > [2] https://lore.kernel.org/linux-ext4/9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com/ > > > > Thanks for the explanation. I am also planning to take a shot at the CoW approach. I would > be more than happy to review and test if you send a RFC in the meantime. Thanks Pankaj, I'm testing the current RFC internally. I think I'll have something in coming weeks and we can go over the design and how it looks etc. Regards, ojaswin > > -- > Pankaj > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 17:20 ` Ojaswin Mujoo @ 2026-02-18 17:42 ` Jan Kara 2026-02-18 20:22 ` Ojaswin Mujoo 0 siblings, 1 reply; 18+ messages in thread From: Jan Kara @ 2026-02-18 17:42 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote: > On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote: > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB > > pages based on `io_combine_limit` (typically 128kb). So immediately writing them > > might be ok as long as we don't remove those pages from the page cache like we do in > > RWF_UNCACHED. > > Yep, and Ive not looked at the code path much but I think if we really > care about the user not changing the data b/w write and writeback then > we will probably need to start the writeback while holding the folio > lock, which is currently not done in RWF_UNCACHED. That isn't enough. submit_bio() returning isn't enough to guaranteed DMA to the device has happened. And until it happens, modifying the pagecache page means modifying the data the disk will get. The best is probably to transition pages to writeback state and deal with it as with any other requirement for stable pages. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 17:42 ` [Lsf-pc] " Jan Kara @ 2026-02-18 20:22 ` Ojaswin Mujoo 0 siblings, 0 replies; 18+ messages in thread From: Ojaswin Mujoo @ 2026-02-18 20:22 UTC (permalink / raw) To: Jan Kara Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 06:42:05PM +0100, Jan Kara wrote: > On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote: > > On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote: > > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB > > > pages based on `io_combine_limit` (typically 128kb). So immediately writing them > > > might be ok as long as we don't remove those pages from the page cache like we do in > > > RWF_UNCACHED. > > > > Yep, and Ive not looked at the code path much but I think if we really > > care about the user not changing the data b/w write and writeback then > > we will probably need to start the writeback while holding the folio > > lock, which is currently not done in RWF_UNCACHED. > > That isn't enough. submit_bio() returning isn't enough to guaranteed DMA > to the device has happened. And until it happens, modifying the pagecache > page means modifying the data the disk will get. The best is probably to > transition pages to writeback state and deal with it as with any other > requirement for stable pages. Yes true, looking at the code, it does seem like we would also need to depend on the stable page mechanism to ensure nobody changes the buffers till the IO has actually finished. I think the right way to go would be to first start with an implementation of RWF_WRITETHOUGH and then utilize that and stable pages to enable RWF_ATOMIC for buffered IO. Regards, ojaswin > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 13:32 ` Ojaswin Mujoo 2026-02-16 9:52 ` Pankaj Raghav @ 2026-02-16 11:38 ` Jan Kara 2026-02-16 13:18 ` Pankaj Raghav ` (2 more replies) 1 sibling, 3 replies; 18+ messages in thread From: Jan Kara @ 2026-02-16 11:38 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi! On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > Another thing that came up is to consider using write through semantics > for buffered atomic writes, where we are able to transition page to > writeback state immediately after the write and avoid any other users to > modify the data till writeback completes. This might affect performance > since we won't be able to batch similar atomic IOs but maybe > applications like postgres would not mind this too much. If we go with > this approach, we will be able to avoid worrying too much about other > users changing atomic data underneath us. > > An argument against this however is that it is user's responsibility to > not do non atomic IO over an atomic range and this shall be considered a > userspace usage error. This is similar to how there are ways users can > tear a dio if they perform overlapping writes. [1]. Yes, I was wondering whether the write-through semantics would make sense as well. Intuitively it should make things simpler because you could practially reuse the atomic DIO write path. Only that you'd first copy data into the page cache and issue dio write from those folios. No need for special tracking of which folios actually belong together in atomic write, no need for cluttering standard folio writeback path, in case atomic write cannot happen (e.g. because you cannot allocate appropriately aligned blocks) you get the error back rightaway, ... Of course this all depends on whether such semantics would be actually useful for users such as PostgreSQL. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 11:38 ` Jan Kara @ 2026-02-16 13:18 ` Pankaj Raghav 2026-02-17 18:36 ` Ojaswin Mujoo 2026-02-16 15:57 ` Andres Freund 2026-02-17 18:39 ` Ojaswin Mujoo 2 siblings, 1 reply; 18+ messages in thread From: Pankaj Raghav @ 2026-02-16 13:18 UTC (permalink / raw) To: Jan Kara, Ojaswin Mujoo Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On 2/16/2026 12:38 PM, Jan Kara wrote: > Hi! > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: >> Another thing that came up is to consider using write through semantics >> for buffered atomic writes, where we are able to transition page to >> writeback state immediately after the write and avoid any other users to >> modify the data till writeback completes. This might affect performance >> since we won't be able to batch similar atomic IOs but maybe >> applications like postgres would not mind this too much. If we go with >> this approach, we will be able to avoid worrying too much about other >> users changing atomic data underneath us. >> >> An argument against this however is that it is user's responsibility to >> not do non atomic IO over an atomic range and this shall be considered a >> userspace usage error. This is similar to how there are ways users can >> tear a dio if they perform overlapping writes. [1]. > > Yes, I was wondering whether the write-through semantics would make sense > as well. Intuitively it should make things simpler because you could > practially reuse the atomic DIO write path. Only that you'd first copy > data into the page cache and issue dio write from those folios. No need for > special tracking of which folios actually belong together in atomic write, > no need for cluttering standard folio writeback path, in case atomic write > cannot happen (e.g. because you cannot allocate appropriately aligned > blocks) you get the error back rightaway, ... > > Of course this all depends on whether such semantics would be actually > useful for users such as PostgreSQL. One issue might be the performance, especially if the atomic max unit is in the smaller end such as 16k or 32k (which is fairly common). But it will avoid the overlapping writes issue and can easily leverage the direct IO path. But one thing that postgres really cares about is the integrity of a database block. So if there is an IO that is a multiple of an atomic write unit (one atomic unit encapsulates the whole DB page), it is not a problem if tearing happens on the atomic boundaries. This fits very well with what NVMe calls Multiple Atomicity Mode (MAM) [1]. We don't have any semantics for MaM at the moment but that could increase the performance as we can do larger IOs but still get the atomic guarantees certain applications care about. [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 13:18 ` Pankaj Raghav @ 2026-02-17 18:36 ` Ojaswin Mujoo 0 siblings, 0 replies; 18+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 18:36 UTC (permalink / raw) To: Pankaj Raghav Cc: Jan Kara, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon, Feb 16, 2026 at 02:18:10PM +0100, Pankaj Raghav wrote: > > > On 2/16/2026 12:38 PM, Jan Kara wrote: > > Hi! > > > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > > Another thing that came up is to consider using write through semantics > > > for buffered atomic writes, where we are able to transition page to > > > writeback state immediately after the write and avoid any other users to > > > modify the data till writeback completes. This might affect performance > > > since we won't be able to batch similar atomic IOs but maybe > > > applications like postgres would not mind this too much. If we go with > > > this approach, we will be able to avoid worrying too much about other > > > users changing atomic data underneath us. > > > > > > An argument against this however is that it is user's responsibility to > > > not do non atomic IO over an atomic range and this shall be considered a > > > userspace usage error. This is similar to how there are ways users can > > > tear a dio if they perform overlapping writes. [1]. > > > > Yes, I was wondering whether the write-through semantics would make sense > > as well. Intuitively it should make things simpler because you could > > practially reuse the atomic DIO write path. Only that you'd first copy > > data into the page cache and issue dio write from those folios. No need for > > special tracking of which folios actually belong together in atomic write, > > no need for cluttering standard folio writeback path, in case atomic write > > cannot happen (e.g. because you cannot allocate appropriately aligned > > blocks) you get the error back rightaway, ... > > > > Of course this all depends on whether such semantics would be actually > > useful for users such as PostgreSQL. > > One issue might be the performance, especially if the atomic max unit is in > the smaller end such as 16k or 32k (which is fairly common). But it will > avoid the overlapping writes issue and can easily leverage the direct IO > path. > > But one thing that postgres really cares about is the integrity of a > database block. So if there is an IO that is a multiple of an atomic write > unit (one atomic unit encapsulates the whole DB page), it is not a problem > if tearing happens on the atomic boundaries. This fits very well with what > NVMe calls Multiple Atomicity Mode (MAM) [1]. > > We don't have any semantics for MaM at the moment but that could increase > the performance as we can do larger IOs but still get the atomic guarantees > certain applications care about. Interesting, I think very very early dio implementations did use something of this sort where (awu_max = 4k) an atomic write of 16k would result in 4 x 4k atomic writes. I don't remember why it was shot down though :D Regards, ojaswin > > > [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 11:38 ` Jan Kara 2026-02-16 13:18 ` Pankaj Raghav @ 2026-02-16 15:57 ` Andres Freund 2026-02-17 18:39 ` Ojaswin Mujoo 2 siblings, 0 replies; 18+ messages in thread From: Andres Freund @ 2026-02-16 15:57 UTC (permalink / raw) To: Jan Kara Cc: Ojaswin Mujoo, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-16 12:38:59 +0100, Jan Kara wrote: > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > Another thing that came up is to consider using write through semantics > > for buffered atomic writes, where we are able to transition page to > > writeback state immediately after the write and avoid any other users to > > modify the data till writeback completes. This might affect performance > > since we won't be able to batch similar atomic IOs but maybe > > applications like postgres would not mind this too much. If we go with > > this approach, we will be able to avoid worrying too much about other > > users changing atomic data underneath us. > > > > An argument against this however is that it is user's responsibility to > > not do non atomic IO over an atomic range and this shall be considered a > > userspace usage error. This is similar to how there are ways users can > > tear a dio if they perform overlapping writes. [1]. > > Yes, I was wondering whether the write-through semantics would make sense > as well. As outlined in https://lore.kernel.org/all/zzvybbfy6bcxnkt4cfzruhdyy6jsvnuvtjkebdeqwkm6nfpgij@dlps7ucza22s/ that is something that would be useful for postgres even orthogonally to atomic writes. If this were the path to go with, I'd suggest adding an RWF_WRITETHROUGH and requiring it to be set when using RWF_ATOMIC on an buffered write. That way, if the kernel were to eventually support buffered atomic writes without immediate writeback, the semantics to userspace wouldn't suddenly change. > Intuitively it should make things simpler because you could > practially reuse the atomic DIO write path. Only that you'd first copy > data into the page cache and issue dio write from those folios. No need for > special tracking of which folios actually belong together in atomic write, > no need for cluttering standard folio writeback path, in case atomic write > cannot happen (e.g. because you cannot allocate appropriately aligned > blocks) you get the error back rightaway, ... > > Of course this all depends on whether such semantics would be actually > useful for users such as PostgreSQL. I think it would be useful for many workloads. As noted in the linked message, there are some workloads where I am not sure how the gains/costs would balance out (with a small PG buffer pool in a write heavy workload, we'd loose the ability to have the kernel avoid redundant writes). It's possible that we could develop some heuristics to fall back to doing our own torn-page avoidance in such cases, although it's not immediately obvious to me what that heuristic would be. It's also not that common a workload, it's *much* more common to have a read heavy workload that has to overflow in the kernel page cache, due to not being able to dedicate sufficient memory to postgres. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-16 11:38 ` Jan Kara 2026-02-16 13:18 ` Pankaj Raghav 2026-02-16 15:57 ` Andres Freund @ 2026-02-17 18:39 ` Ojaswin Mujoo 2026-02-18 0:26 ` Dave Chinner 2 siblings, 1 reply; 18+ messages in thread From: Ojaswin Mujoo @ 2026-02-17 18:39 UTC (permalink / raw) To: Jan Kara Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote: > Hi! > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > Another thing that came up is to consider using write through semantics > > for buffered atomic writes, where we are able to transition page to > > writeback state immediately after the write and avoid any other users to > > modify the data till writeback completes. This might affect performance > > since we won't be able to batch similar atomic IOs but maybe > > applications like postgres would not mind this too much. If we go with > > this approach, we will be able to avoid worrying too much about other > > users changing atomic data underneath us. > > > > An argument against this however is that it is user's responsibility to > > not do non atomic IO over an atomic range and this shall be considered a > > userspace usage error. This is similar to how there are ways users can > > tear a dio if they perform overlapping writes. [1]. > > Yes, I was wondering whether the write-through semantics would make sense > as well. Intuitively it should make things simpler because you could > practially reuse the atomic DIO write path. Only that you'd first copy > data into the page cache and issue dio write from those folios. No need for > special tracking of which folios actually belong together in atomic write, > no need for cluttering standard folio writeback path, in case atomic write > cannot happen (e.g. because you cannot allocate appropriately aligned > blocks) you get the error back rightaway, ... This is an interesting idea Jan and also saves a lot of tracking of atomic extents etc. I'm unsure how much of a performance impact it'd have though but I'll look into this Regards, ojaswin > > Of course this all depends on whether such semantics would be actually > useful for users such as PostgreSQL. > > Honza > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 18:39 ` Ojaswin Mujoo @ 2026-02-18 0:26 ` Dave Chinner 2026-02-18 6:49 ` Christoph Hellwig 2026-02-18 12:54 ` Ojaswin Mujoo 0 siblings, 2 replies; 18+ messages in thread From: Dave Chinner @ 2026-02-18 0:26 UTC (permalink / raw) To: Ojaswin Mujoo Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote: > On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote: > > Hi! > > > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > > Another thing that came up is to consider using write through semantics > > > for buffered atomic writes, where we are able to transition page to > > > writeback state immediately after the write and avoid any other users to > > > modify the data till writeback completes. This might affect performance > > > since we won't be able to batch similar atomic IOs but maybe > > > applications like postgres would not mind this too much. If we go with > > > this approach, we will be able to avoid worrying too much about other > > > users changing atomic data underneath us. > > > > > > An argument against this however is that it is user's responsibility to > > > not do non atomic IO over an atomic range and this shall be considered a > > > userspace usage error. This is similar to how there are ways users can > > > tear a dio if they perform overlapping writes. [1]. > > > > Yes, I was wondering whether the write-through semantics would make sense > > as well. Intuitively it should make things simpler because you could > > practially reuse the atomic DIO write path. Only that you'd first copy > > data into the page cache and issue dio write from those folios. No need for > > special tracking of which folios actually belong together in atomic write, > > no need for cluttering standard folio writeback path, in case atomic write > > cannot happen (e.g. because you cannot allocate appropriately aligned > > blocks) you get the error back rightaway, ... > > This is an interesting idea Jan and also saves a lot of tracking of > atomic extents etc. ISTR mentioning that we should be doing exactly this (grab page cache pages, fill them and submit them through the DIO path) for O_DSYNC buffered writethrough IO a long time ago. The context was optimising buffered O_DSYNC to use the FUA optimisations in the iomap DIO write path. I suggested it again when discussing how RWF_DONTCACHE should be implemented, because the async DIO write completion path invalidates the page cache over the IO range. i.e. it would avoid the need to use folio flags to track pages that needed invalidation at IO completion... I have a vague recollection of mentioning this early in the buffered RWF_ATOMIC discussions, too, though that may have just been the voices in my head. Regardless, we are here again with proposals for RWF_ATOMIC and RWF_WRITETHROUGH and a suggestion that maybe we should vector buffered writethrough via the DIO path..... Perhaps it's time to do this? FWIW, the other thing that write-through via the DIO path enables is true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes block waiting on IO completion through generic_write_sync() -> vfs_fsync_range(), even when issued through AIO paths.
Vectoring it through the DIO path avoids the blocking fsync path in IO submission as it runs in the async DIO completion path if it is needed.... -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 18+ messages in thread
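To make the submission-side blocking concrete, here is a sketch (using liburing; the file name and sizes are illustrative) of the pattern in question: an O_DSYNC buffered write issued asynchronously. With the current buffered path the data-integrity flush is performed synchronously around submission; routing buffered write-through via the DIO path would let the same request complete through the async DIO completion path instead.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* Buffered (no O_DIRECT) file opened with O_DSYNC. */
	int fd = open("journal", O_WRONLY | O_DSYNC);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	static char buf[8192];
	memset(buf, 0, sizeof(buf));

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
	io_uring_submit(&ring);

	struct io_uring_cqe *cqe;
	io_uring_wait_cqe(&ring, &cqe);	/* data is durable once this completes */
	printf("write returned %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}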
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 0:26 ` Dave Chinner @ 2026-02-18 6:49 ` Christoph Hellwig 2026-02-18 12:54 ` Ojaswin Mujoo 0 siblings, 0 replies; 18+ messages in thread From: Christoph Hellwig @ 2026-02-18 6:49 UTC (permalink / raw) To: Dave Chinner Cc: Ojaswin Mujoo, Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote: > ISTR mentioning that we should be doing exactly this (grab page > cache pages, fill them and submit them through the DIO path) for > O_DSYNC buffered writethrough IO a long time again. Yes, multiple times. And I have done so a few more times since then. > Regardless, we are here again with proposals for RWF_ATOMIC and > RWF_WRITETHROUGH and a suggestion that maybe we should vector > buffered writethrough via the DIO path..... > > Perhaps it's time to do this? Yes. > FWIW, the other thing that write-through via the DIO path enables is > true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes > block waiting on IO completion through generic_sync_write() -> > vfs_fsync_range(), even when issued through AIO paths. Vectoring it > through the DIO path avoids the blocking fsync path in IO submission > as it runs in the async DIO completion path if it is needed.... It's only true if we can do the page cache updates non-blocking, but in many cases that should indeed be possible. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-18 0:26 ` Dave Chinner 2026-02-18 6:49 ` Christoph Hellwig @ 2026-02-18 12:54 ` Ojaswin Mujoo 1 sibling, 0 replies; 18+ messages in thread From: Ojaswin Mujoo @ 2026-02-18 12:54 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote: > On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote: > > On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote: > > > Hi! > > > > > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote: > > > > Another thing that came up is to consider using write through semantics > > > > for buffered atomic writes, where we are able to transition page to > > > > writeback state immediately after the write and avoid any other users to > > > > modify the data till writeback completes. This might affect performance > > > > since we won't be able to batch similar atomic IOs but maybe > > > > applications like postgres would not mind this too much. If we go with > > > > this approach, we will be able to avoid worrying too much about other > > > > users changing atomic data underneath us. > > > > > > > > An argument against this however is that it is user's responsibility to > > > > not do non atomic IO over an atomic range and this shall be considered a > > > > userspace usage error. This is similar to how there are ways users can > > > > tear a dio if they perform overlapping writes. [1]. > > > > > > Yes, I was wondering whether the write-through semantics would make sense > > > as well. Intuitively it should make things simpler because you could > > > practially reuse the atomic DIO write path. Only that you'd first copy > > > data into the page cache and issue dio write from those folios. No need for > > > special tracking of which folios actually belong together in atomic write, > > > no need for cluttering standard folio writeback path, in case atomic write > > > cannot happen (e.g. because you cannot allocate appropriately aligned > > > blocks) you get the error back rightaway, ... > > > > This is an interesting idea Jan and also saves a lot of tracking of > > atomic extents etc. > > ISTR mentioning that we should be doing exactly this (grab page > cache pages, fill them and submit them through the DIO path) for > O_DSYNC buffered writethrough IO a long time again. The context was > optimising buffered O_DSYNC to use the FUA optimisations in the > iomap DIO write path. > > I suggested it again when discussing how RWF_DONTCACHE should be > implemented, because the async DIO write completion path invalidates > the page cache over the IO range. i.e. it would avoid the need to > use folio flags to track pages that needed invalidation at IO > completion... > > I have a vague recollection of mentioning this early in the buffered > RWF_ATOMIC discussions, too, though that may have just been the > voices in my head. Hi Dave, Yes we did discuss this [1] :) We also discussed the alternative of using the COW fork path for atomic writes [2]. Since at that point I was not completely sure if the writethrough would become too restrictive of an approach, I was working on a COW fork implementation. However, from the discussion here as well as Andres' comments, it seems like write through might not be too bad for postgres. 
> > Regardless, we are here again with proposals for RWF_ATOMIC and > RWF_WRITETHROUGH and a suggestion that maybe we should vector > buffered writethrough via the DIO path..... > > Perhaps it's time to do this? I agree that it makes more sense to do writethrough if we want to have the strict old-or-new semantics (as opposed to just untorn IO semantics). I'll work on a POC for this approach of doing atomic writes; I'll mostly try to base it off your suggestions in [1]. FWIW, I do have a somewhat working (although untested and possibly broken in some places) POC for performing atomic writes via XFS COW fork based on suggestions from Dave [2]. Even though we want to explore the writethrough approach, I'd just share it here in case anyone is interested in what the design looks like: https://github.com/OjaswinM/linux/commits/iomap-buffered-atomic-rfc2.3/ (If anyone prefers for me to send this as a patchset on the mailing list, let me know) Regards, ojaswin [1] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/ [2] https://lore.kernel.org/linux-fsdevel/aRuKz4F3xATf8IUp@dread.disaster.area/ > > FWIW, the other thing that write-through via the DIO path enables is > true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes > block waiting on IO completion through generic_sync_write() -> > vfs_fsync_range(), even when issued through AIO paths. Vectoring it > through the DIO path avoids the blocking fsync path in IO submission > as it runs in the async DIO completion path if it is needed.... > > -Dave. > -- > Dave Chinner > dgc@kernel.org ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-13 10:20 Pankaj Raghav 2026-02-13 13:32 ` Ojaswin Mujoo @ 2026-02-17 5:51 ` Christoph Hellwig 2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein 1 sibling, 1 reply; 18+ messages in thread From: Christoph Hellwig @ 2026-02-17 5:51 UTC (permalink / raw) To: Pankaj Raghav Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah I think a better session would be how we can help postgres to move off buffered I/O instead of adding more special cases for them. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 5:51 ` Christoph Hellwig @ 2026-02-17 9:23 ` Amir Goldstein 2026-02-17 15:47 ` Andres Freund 2026-02-18 6:51 ` Christoph Hellwig 0 siblings, 2 replies; 18+ messages in thread From: Amir Goldstein @ 2026-02-17 9:23 UTC (permalink / raw) To: Christoph Hellwig Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > I think a better session would be how we can help postgres to move > off buffered I/O instead of adding more special cases for them. Respectfully, I disagree that DIO is the only possible solution. Direct I/O is a legit solution for databases and so is buffered I/O, each with its own caveats. Specifically, when two subsystems (kernel vfs and db) each require a huge amount of cache memory for best performance, setting them up to play nicely together to utilize system memory in an optimal way is a huge pain. Thanks, Amir. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein @ 2026-02-17 15:47 ` Andres Freund 2026-02-17 22:45 ` Dave Chinner 2026-02-18 6:53 ` Christoph Hellwig 1 sibling, 2 replies; 18+ messages in thread From: Andres Freund @ 2026-02-17 15:47 UTC (permalink / raw) To: Amir Goldstein Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote: > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > I think a better session would be how we can help postgres to move > > off buffered I/O instead of adding more special cases for them. FWIW, we are adding support for DIO (it's been added, but performance isn't competitive for most workloads in the released versions yet, work to address those issues is in progress). But it's only really viable for larger setups, not for e.g.: - smaller, unattended setups - uses of postgres as part of a larger application on one server with hard to predict memory usage of different components - intentionally overcommitted shared hosting type scenarios Even once a well configured postgres using DIO beats postgres not using DIO, I'll bet that well over 50% of users won't be able to use DIO. There are some kernel issues that make it harder than necessary to use DIO, btw: Most prominently: With DIO, concurrently extending multiple files leads to quite terrible fragmentation, at least with XFS. Forcing us to over-aggressively use fallocate(), truncating later if it turns out we need less space. The fallocate in turn triggers slowness in the write paths, as writing to uninitialized extents is a metadata operation. It'd be great if the allocation behaviour with concurrent file extension could be improved and if we could have a fallocate mode that forces extents to be initialized. A secondary issue is that with the buffer pool sizes necessary for DIO use on bigger systems, creating the anonymous memory mapping becomes painfully slow if we use MAP_POPULATE - which we kinda need to do, as otherwise performance is very inconsistent initially (often iomap -> gup -> handle_mm_fault -> folio_zero_user uses the majority of the CPU). We've been experimenting with not using MAP_POPULATE and using multiple threads to populate the mapping in parallel, but that doesn't feel like something that userspace ought to have to do. It's easier for us to work around than the uninitialized extent conversion issue, but it still is something we IMO shouldn't have to do. > Respectfully, I disagree that DIO is the only possible solution. > Direct I/O is a legit solution for databases and so is buffered I/O > each with their own caveats. > Specifically, when two subsystems (kernel vfs and db) each require a huge > amount of cache memory for best performance, setting them up to play nicely > together to utilize system memory in an optimal way is a huge pain. Yep. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 18+ messages in thread
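A minimal sketch of the extension pattern described above (reserve space up front with fallocate(), trim back with ftruncate() if less turns out to be needed); the file name, sizes and amounts are illustrative, not postgres' actual logic.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	int fd = open("base/16384/16385", O_WRONLY);	/* an existing relation file */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	off_t cur_size = lseek(fd, 0, SEEK_END);
	off_t reserve  = 16 * 1024 * 1024;		/* over-allocate 16MB */

	/* Mode 0: allocate (unwritten) extents and extend i_size. */
	if (fallocate(fd, 0, cur_size, reserve) < 0)
		perror("fallocate");

	/* ... concurrent 8kB extending writes land inside the reserved range ... */

	/* If less was needed, give the rest back. */
	off_t actually_used = cur_size + 4 * 1024 * 1024;
	if (ftruncate(fd, actually_used) < 0)
		perror("ftruncate");

	close(fd);
	return 0;
}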
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 15:47 ` Andres Freund @ 2026-02-17 22:45 ` Dave Chinner 2026-02-18 4:10 ` Andres Freund 2026-02-18 6:53 ` Christoph Hellwig 1 sibling, 1 reply; 18+ messages in thread From: Dave Chinner @ 2026-02-17 22:45 UTC (permalink / raw) To: Andres Freund Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote: > Hi, > > On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote: > > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > > > I think a better session would be how we can help postgres to move > > > off buffered I/O instead of adding more special cases for them. > > FWIW, we are adding support for DIO (it's been added, but performance isn't > competitive for most workloads in the released versions yet, work to address > those issues is in progress). > > But it's only really be viable for larger setups, not for e.g.: > - smaller, unattended setups > - uses of postgres as part of a larger application on one server with hard to > predict memory usage of different components > - intentionally overcommitted shared hosting type scenarios > > Even once a well configured postgres using DIO beats postgres not using DIO, > I'll bet that well over 50% of users won't be able to use DIO. > > > There are some kernel issues that make it harder than necessary to use DIO, > btw: > > Most prominently: With DIO concurrently extending multiple files leads to > quite terrible fragmentation, at least with XFS. Forcing us to > over-aggressively use fallocate(), truncating later if it turns out we need > less space. <ahem> seriously, fallocate() is considered harmful for exactly these sorts of reasons. XFS has vastly better mechanisms built into it that mitigate worst case fragmentation without needing to change applications or increase runtime overhead. So, let's go way back - 32 years ago to 1994: commit 32766d4d387bc6779e0c432fb56a0cc4e6b96398 Author: Doug Doucette <doucette@engr.sgi.com> Date: Thu Mar 3 22:17:15 1994 +0000 Add fcntl implementation (F_FSGETXATTR, F_FSSETXATTR, and F_DIOINFO). Fix xfs_setattr new xfs fields' implementation to split out error checking to the front of the routine, like the other attributes. Don't set new fields in xfs_getattr unless one of the fields is requested. ..... + case F_FSSETXATTR: { + struct fsxattr fa; + vattr_t va; + + if (copyin(arg, &fa, sizeof(fa))) { + error = EFAULT; + break; + } + va.va_xflags = fa.fsx_xflags; + va.va_extsize = fa.fsx_extsize; ^^^^^^^^^^^^^^^ + error = xfs_setattr(vp, &va, AT_XFLAGS|AT_EXTSIZE, credp); + break; + } This was the commit that added user controlled extent size hints to XFS. These already existed in EFS, so applications using this functionality go back even earlier in the 1990s. So, let's set the extent size hint on a file to 1MB. Now whenever a data extent allocation on that file is attempted, the extent size that is allocated will be rounded up to the nearest 1MB. i.e. XFS will try to allocate unwritten extents in aligned multiples of the extent size hint regardless of the actual IO size being performed. Hence if you are doing concurrent extending 8kB writes, instead of allocating 8kB at a time, the extent size hint will force a 1MB unwritten extent to be allocated out beyond EOF.
The subsequent extending 8kB writes to that file now hit that unwritten extent, and only need to convert it to written. The same will happen for all other concurrent extending writes - they will allocate in 1MB chunks, not 8KB. The result will be that the files will interleave 1MB sized extents across files instead of 8kB sized extents. i.e. we've just reduced the worst case fragmentation behaviour by a factor of 128. We've also reduced allocation overhead by a factor of 128, so the use of extent size hints results in the filesystem behaving in a far more efficient way and hence this results in higher performance. IOWs, the extent size hint effectively sets a minimum extent size that the filesystem will create for a given file, thereby mitigating the worst case fragmentation that can occur. However, the use of fallocate() in the application explicitly prevents the filesystem from doing this smart, transparent IO path thing to mitigate fragmentation. One of the most important properties of extent size hints is that they can be dynamically tuned *without changing the application.* The extent size hint is a property of the inode, and it can be set by the admin through various XFS tools (e.g. mkfs.xfs for a filesystem wide default, xfs_io to set it on a directory so all new files/dirs created in that directory inherit the value, set it on individual files, etc). It can be changed even whilst the file is in active use by the application. Hence the extent size hint can be changed at any time, and you can apply it immediately to existing installations as an active mitigation. Doing this won't fix existing fragmentation (that's what xfs_fsr is for), but it will instantly mitigate/prevent new fragmentation from occurring. It's much more difficult to do this with applications that use fallocate()... Indeed, the case for using fallocate() instead of extent size hints gets worse the more you look at how extent size hints work. Extent size hints don't impact IO concurrency at all. Extent size hints are only applied during extent allocation, so the optimisation is applied naturally as part of the existing concurrent IO path. Hence using extent size hints won't block/stall/prevent concurrent async IO in any way. fallocate(), OTOH, causes a full IO pipeline stall (blocks submission of both reads and writes, then waits for all IO in flight to drain) on that file for the duration of the syscall. You can't do any sort of IO (async or otherwise) and run fallocate() at the same time, so fallocate() really sucks from the POV of a high performance IO app. fallocate() also marks the files as having persistent preallocation, which means that when you close the file the filesystem does not remove excessive extents allocated beyond EOF. Hence the reported problems with excessive space usage and needing to truncate files manually (which also cause a complete IO stall on that file) are brought on specifically because fallocate() is being used by the application to manage worst case fragmentation. This problem does not exist with extent size hints - unused blocks beyond EOF will be trimmed on last close or when the inode is cycled out of cache, just like we do for excess speculative prealloc beyond EOF for buffered writes (the buffered IO fragmentation mitigation mechanism for interleaving concurrent extending writes). The administrator can easily optimise extent size hints to match the optimal characteristics of the underlying storage (e.g. set them to be RAID stripe aligned), etc.
Fallocate() requires the application to provide tunables to modify its behaviour for optimal storage layout, and depending on how the application uses fallocate(), this level of flexibility may not even be possible. And let's not forget that a fallocate() based mitigation that helps one filesystem type can actively hurt another type (e.g. ext4) by introducing an application level extent allocation boundary vector where there was none before. Hence, IMO, micromanaging filesystem extent allocation with fallocate() is -almost always- the wrong thing for applications to be doing. There is no one "right way" to use fallocate() - what is optimal for one filesystem will be pessimal for another, and it is impossible to code optimal behaviour in the application for all filesystem types the app might run on. > The fallocate in turn triggers slowness in the write paths, as > writing to uninitialized extents is a metadata operation. That is not the problem you think it is. XFS is using unwritten extents for all buffered IO writes that use delayed allocation, too, and I don't see you complaining about that.... Yes, the overhead of unwritten extent conversion is more visible with direct IO, but that's only because DIO has much lower overhead and a much, much higher performance ceiling than buffered IO. That doesn't mean unwritten extents are a performance limiting factor... > It'd be great if > the allocation behaviour with concurrent file extension could be improved and > if we could have a fallocate mode that forces extents to be initialized. <sigh> You mean like FALLOC_FL_WRITE_ZEROES? That won't fix your fragmentation problem, and it has all the same pipeline stall problems as allocating unwritten extents in fallocate(). Only much worse now, because the IO pipeline is stalled for the entire time it takes to write the zeroes to persistent storage. i.e. long tail file access latencies will increase massively if you do this regularly to extend files. -Dave. -- Dave Chinner dgc@kernel.org ^ permalink raw reply [flat|nested] 18+ messages in thread
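For reference, a sketch of setting the extent size hint Dave describes from a program, using the generic FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls; xfs_io -c "extsize 1m" <dir> does the same from the command line. The directory name is illustrative, and the 1MB value is the example used in the message above.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	/* Set the hint on a directory so newly created files inherit it. */
	int fd = open("/srv/pgdata/base", O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fsxattr fa;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fa) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}

	fa.fsx_xflags |= FS_XFLAG_EXTSZINHERIT;	/* children inherit the hint */
	fa.fsx_extsize = 1024 * 1024;		/* 1MB hint, in bytes */
	/* On a regular file, FS_XFLAG_EXTSIZE sets the hint for that file only. */

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fa) < 0)
		perror("FS_IOC_FSSETXATTR");

	close(fd);
	return 0;
}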
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 22:45 ` Dave Chinner @ 2026-02-18 4:10 ` Andres Freund 0 siblings, 0 replies; 18+ messages in thread From: Andres Freund @ 2026-02-18 4:10 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah Hi, On 2026-02-18 09:45:46 +1100, Dave Chinner wrote: > On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote: > > There are some kernel issues that make it harder than necessary to use DIO, > > btw: > > > > Most prominently: With DIO concurrently extending multiple files leads to > > quite terrible fragmentation, at least with XFS. Forcing us to > > over-aggressively use fallocate(), truncating later if it turns out we need > > less space. > > <ahem> > > seriously, fallocate() is considered harmful for exactly these sorts > of reasons. XFS has vastly better mechanisms built into it that > mitigate worst case fragmentation without needing to change > applications or increase runtime overhead. There's probably a misunderstanding here: We don't do fallocate to avoid fragmentation. We want to guarantee that there's space for data that is in our buffer pool, as otherwise it's very easy to get into a pickle: If there is dirty data in the buffer pool that can't be written out due to ENOSPC, the subsequent checkpoint can't complete. So the system may be stuck because you're not able to create more space for WAL / journaling, you can't free up old WAL due to the checkpoint not being able to complete, and if you react to that with a crash-recovery cycle you're likely to be unable to complete crash recovery because you'll just hit ENOSPC again. And yes, CoW filesystems make that less reliable, but it turns out to still save people often enough that I doubt we can get rid of it. To ensure there's space for the write out of our buffer pool we have two choices: 1) write out zeroes 2) use fallocate Writing out zeroes that we will just overwrite later is obviously not a particularly good use of IO bandwidth, particularly on metered cloud "storage". But using fallocate() has fragmentation and unwritten-extent issues. Our compromise is that we use fallocate iff we enlarge the relation by a decent number of pages at once and write zeroes otherwise. Is that perfect? Hell no. But it's also not obvious what a better answer is with today's interfaces. If there were a "guarantee that N additional blocks are reserved, but not concretely allocated" interface, we'd gladly use it. > So, let's set the extent size hint on a file to 1MB. Now whenever a > data extent allocation on that file is attempted, the extent size > that is allocated will be rounded up to the nearest 1MB. i.e. XFS > will try to allocate unwritten extents in aligned multiples of the > extent size hint regardless of the actual IO size being performed. > > Hence if you are doing concurrent extending 8kB writes, instead of > allocating 8kB at a time, the extent size hint will force a 1MB > unwritten extent to be allocated out beyond EOF. The subsequent > extending 8kB writes to that file now hit that unwritten extent, and > only need to convert it to written. The same will happen for all > other concurrent extending writes - they will allocate in 1MB > chunks, not 8KB. We could probably benefit from that.
> One of the most important properties of extent size hints is that > they can be dynamically tuned *without changing the application.* > The extent size hint is a property of the inode, and it can be set > by the admin through various XFS tools (e.g. mkfs.xfs for a > filesystem wide default, xfs_io to set it on a directory so all new > files/dirs created in that directory inherit the value, set it on > individual files, etc). It can be changed even whilst the file is in > active use by the application. IME our users run enough postgres instances, across a lot of differing workloads, that manual tuning like that will rarely if ever happen :(. I miss well educated DBAs :(. A large portion of users doesn't even have direct access to the server, only via the postgres protocol... If we were to use these hints, it'd have to happen automatically from within postgres. That does seem viable, but it's certainly also not exactly filesystem independent... > > The fallocate in turn triggers slowness in the write paths, as > > writing to uninitialized extents is a metadata operation. > > That is not the problem you think it is. XFS is using unwritten > extents for all buffered IO writes that use delayed allocation, too, > and I don't see you complaining about that.... It's a problem for buffered IO as well, just a bit harder to hit on many drives, because buffered O_DSYNC writes don't use FUA. If you need any durable writes into a file with unwritten extents, things get painful very fast. See a few paragraphs below for the most crucial case where we need to make sure writes are durable.
testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do echo buffered: $buffered overwrite: $overwrite; rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=psync --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 --rw=write --size=64MB --sync=dsync --name pg-extend --overwrite=$overwrite |grep IOPS;done;done
buffered: 0 overwrite: 0
  write: IOPS=1427, BW=5709KiB/s (5846kB/s)(64.0MiB/11479msec); 0 zone resets
buffered: 0 overwrite: 1
  write: IOPS=4025, BW=15.7MiB/s (16.5MB/s)(64.0MiB/4070msec); 0 zone resets
buffered: 1 overwrite: 0
  write: IOPS=1638, BW=6554KiB/s (6712kB/s)(64.0MiB/9999msec); 0 zone resets
buffered: 1 overwrite: 1
  write: IOPS=3663, BW=14.3MiB/s (15.0MB/s)(64.0MiB/4472msec); 0 zone resets
That's a > 2x throughput difference. And the results would be similar with --fdatasync=1. If you add AIO to the mix, the difference gets way bigger, particularly on drives with FUA support and DIO:
testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do echo buffered: $buffered overwrite: $overwrite; rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=io_uring --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 --rw=write --size=64MB --sync=dsync --name pg-extend --overwrite=$overwrite --iodepth 32 |grep IOPS;done;done
buffered: 0 overwrite: 0
  write: IOPS=6143, BW=24.0MiB/s (25.2MB/s)(64.0MiB/2667msec); 0 zone resets
buffered: 0 overwrite: 1
  write: IOPS=76.6k, BW=299MiB/s (314MB/s)(64.0MiB/214msec); 0 zone resets
buffered: 1 overwrite: 0
  write: IOPS=1835, BW=7341KiB/s (7517kB/s)(64.0MiB/8928msec); 0 zone resets
buffered: 1 overwrite: 1
  write: IOPS=4096, BW=16.0MiB/s (16.8MB/s)(64.0MiB/4000msec); 0 zone resets
It's less bad, but still quite a noticeable difference, on drives without volatile caches. And it's often worse on networked storage, whether it has a volatile cache or not.
> > It'd be great if > the allocation behaviour with concurrent file extension could be improved and > if we could have a fallocate mode that forces extents to be initialized. > > <sigh> > > You mean like FALLOC_FL_WRITE_ZEROES? I hadn't seen that it was merged, that's great! It doesn't yet seem to be documented in the fallocate(2) man page, which I had checked... Hm, also doesn't seem to work on xfs yet :(, EOPNOTSUPP. > That won't fix your fragmentation problem, and it has all the same pipeline > stall problems as allocating unwritten extents in fallocate(). The primary case where FALLOC_FL_WRITE_ZEROES would be useful is for WAL file creation; WAL files are always of the same fixed size (therefore no fragmentation risk). To avoid having a metadata operation during our commit path, we today default to forcing them to be allocated by overwriting them with zeros and fsyncing them. To avoid having to do that all the time, we reuse them once they're not needed anymore. Not ensuring that the extents are already written would have a very large perf penalty (as in ~2-3x for OLTP workloads, on XFS). That's true both when using DIO and when not. To avoid having to do that over and over, we recycle WAL files. Unfortunately this means that when all those WAL files are not yet preallocated (or when we release them during low activity), the performance is rather noticeably worsened by the additional IO for pre-zeroing the WAL files. In theory FALLOC_FL_WRITE_ZEROES should be faster than issuing writes for the whole range. > Only much worse now, because the IO pipeline is stalled for the > entire time it takes to write the zeroes to persistent storage. i.e. > long tail file access latencies will increase massively if you do > this regularly to extend files. In the WAL path we fsync at the point we could use FALLOC_FL_WRITE_ZEROES, as otherwise the WAL segment might not exist after a crash, which would be ... bad. Greetings, Andres Freund ^ permalink raw reply [flat|nested] 18+ messages in thread
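A hedged sketch of the WAL-segment preallocation described above: try FALLOC_FL_WRITE_ZEROES (which allocates written, zeroed extents) and fall back to writing zeroes when the filesystem or kernel doesn't support it; per the message above, XFS currently returns EOPNOTSUPP. The segment size, path and fallback chunk size are illustrative, and the fallback-to-define value for the flag is an assumption to be checked against linux/falloc.h.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
#include <linux/falloc.h>

#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80	/* assumed value; check linux/falloc.h */
#endif

static int prealloc_wal_segment(const char *path, off_t seg_size)
{
	int fd = open(path, O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		return -1;

	if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, seg_size) != 0) {
		if (errno != EOPNOTSUPP && errno != EINVAL) {
			close(fd);
			return -1;
		}
		/* Fallback: force written extents by writing zeroes ourselves. */
		static char zeroes[128 * 1024];
		for (off_t off = 0; off < seg_size; off += sizeof(zeroes)) {
			if (pwrite(fd, zeroes, sizeof(zeroes), off) < 0) {
				close(fd);
				return -1;
			}
		}
	}

	/* The segment must exist (and be zero-filled) even after a crash. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

int main(void)
{
	int fd = prealloc_wal_segment("pg_wal/000000010000000000000001",
				      16 * 1024 * 1024);
	if (fd < 0) {
		perror("prealloc");
		return 1;
	}
	close(fd);
	return 0;
}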
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 15:47 ` Andres Freund 2026-02-17 22:45 ` Dave Chinner @ 2026-02-18 6:53 ` Christoph Hellwig 1 sibling, 0 replies; 18+ messages in thread From: Christoph Hellwig @ 2026-02-18 6:53 UTC (permalink / raw) To: Andres Freund Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote: > Most prominently: With DIO concurrently extending multiple files leads to > quite terrible fragmentation, at least with XFS. Forcing us to > over-aggressively use fallocate(), truncating later if it turns out we need > less space. The fallocate in turn triggers slowness in the write paths, as > writing to uninitialized extents is a metadata operation. It'd be great if > the allocation behaviour with concurrent file extension could be improved and > if we could have a fallocate mode that forces extents to be initialized. As Dave already mentioned, if you do concurrent allocations (extension or hole filling), setting an extent size hint is probably a good idea. We could try to look into heuristics, but chances are that they would degrade other use cases. Details would be useful as a report on the XFS list. > > A secondary issue is that with the buffer pool sizes necessary for DIO use on > bigger systems, creating the anonymous memory mapping becomes painfully slow > if we use MAP_POPULATE - which we kinda need to do, as otherwise performance > is very inconsistent initially (often iomap -> gup -> handle_mm_fault -> > folio_zero_user uses the majority of the CPU). We've been experimenting with > not using MAP_POPULATE and using multiple threads to populate the mapping in > parallel, but that feels not like something that userspace ought to have to > do. It's easier to work around for us that the uninitialized extent > conversion issue, but it still is something we IMO shouldn't have to do. Please report this to linux-mm. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes 2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein 2026-02-17 15:47 ` Andres Freund @ 2026-02-18 6:51 ` Christoph Hellwig 1 sibling, 0 replies; 18+ messages in thread From: Christoph Hellwig @ 2026-02-18 6:51 UTC (permalink / raw) To: Amir Goldstein Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy, ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah On Tue, Feb 17, 2026 at 10:23:36AM +0100, Amir Goldstein wrote: > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote: > > > > I think a better session would be how we can help postgres to move > > off buffered I/O instead of adding more special cases for them. > > Respectfully, I disagree that DIO is the only possible solution. > Direct I/O is a legit solution for databases and so is buffered I/O > each with their own caveats. Maybe. Classic buffered I/O is not a legit solution for doing atomic I/Os, and if Postgres is desperate to use that, something like direct I/O (including the proposed write-through semantics) is the only sensible choice. ^ permalink raw reply [flat|nested] 18+ messages in thread