* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-13 13:32 ` Ojaswin Mujoo
2026-02-16 9:52 ` Pankaj Raghav
@ 2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
` (2 more replies)
1 sibling, 3 replies; 18+ messages in thread
From: Jan Kara @ 2026-02-16 11:38 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev,
tytso, p.raghav, vi.shah
Hi!
On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> Another thing that came up is to consider using write-through semantics
> for buffered atomic writes, where we transition the page to the writeback
> state immediately after the write and prevent any other users from
> modifying the data till writeback completes. This might affect performance
> since we won't be able to batch similar atomic IOs, but maybe
> applications like postgres would not mind this too much. If we go with
> this approach, we will be able to avoid worrying too much about other
> users changing atomic data underneath us.
>
> An argument against this, however, is that it is the user's responsibility
> to not do non-atomic IO over an atomic range, and this shall be considered
> a userspace usage error. This is similar to how there are ways users can
> tear a dio if they perform overlapping writes. [1]
Yes, I was wondering whether the write-through semantics would make sense
as well. Intuitively it should make things simpler because you could
practically reuse the atomic DIO write path. Only that you'd first copy the
data into the page cache and issue the dio write from those folios. No need
for special tracking of which folios actually belong together in an atomic
write, no need for cluttering the standard folio writeback path, and in case
the atomic write cannot happen (e.g. because you cannot allocate
appropriately aligned blocks) you get the error back right away, ...
Of course this all depends on whether such semantics would actually be
useful for users such as PostgreSQL.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 11:38 ` Jan Kara
@ 2026-02-16 13:18 ` Pankaj Raghav
2026-02-17 18:36 ` Ojaswin Mujoo
2026-02-16 15:57 ` Andres Freund
2026-02-17 18:39 ` Ojaswin Mujoo
2 siblings, 1 reply; 18+ messages in thread
From: Pankaj Raghav @ 2026-02-16 13:18 UTC (permalink / raw)
To: Jan Kara, Ojaswin Mujoo
Cc: linux-xfs, linux-mm, linux-fsdevel, lsf-pc, Andres Freund, djwong,
john.g.garry, willy, hch, ritesh.list, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On 2/16/2026 12:38 PM, Jan Kara wrote:
> Hi!
>
> On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
>> Another thing that came up is to consider using write-through semantics
>> for buffered atomic writes, where we transition the page to the writeback
>> state immediately after the write and prevent any other users from
>> modifying the data till writeback completes. This might affect performance
>> since we won't be able to batch similar atomic IOs, but maybe
>> applications like postgres would not mind this too much. If we go with
>> this approach, we will be able to avoid worrying too much about other
>> users changing atomic data underneath us.
>>
>> An argument against this, however, is that it is the user's responsibility
>> to not do non-atomic IO over an atomic range, and this shall be considered
>> a userspace usage error. This is similar to how there are ways users can
>> tear a dio if they perform overlapping writes. [1]
>
> Yes, I was wondering whether the write-through semantics would make sense
> as well. Intuitively it should make things simpler because you could
> practically reuse the atomic DIO write path. Only that you'd first copy the
> data into the page cache and issue the dio write from those folios. No need
> for special tracking of which folios actually belong together in an atomic
> write, no need for cluttering the standard folio writeback path, and in case
> the atomic write cannot happen (e.g. because you cannot allocate
> appropriately aligned blocks) you get the error back right away, ...
>
> Of course this all depends on whether such semantics would actually be
> useful for users such as PostgreSQL.
One issue might be performance, especially if the atomic max unit is at the
smaller end, such as 16k or 32k (which is fairly common). But it would avoid
the overlapping writes issue and can easily leverage the direct IO path.
One thing that postgres really cares about, though, is the integrity of a
database block. So if there is an IO that is a multiple of an atomic write
unit (one atomic unit encapsulates the whole DB page), it is not a problem if
tearing happens on the atomic boundaries. This fits very well with what NVMe
calls Multiple Atomicity Mode (MAM) [1].
We don't have any semantics for MAM at the moment, but that could increase
performance as we can do larger IOs but still get the atomic guarantees
certain applications care about.
[1]
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
@ 2026-02-16 15:57 ` Andres Freund
2026-02-17 18:39 ` Ojaswin Mujoo
2 siblings, 0 replies; 18+ messages in thread
From: Andres Freund @ 2026-02-16 15:57 UTC (permalink / raw)
To: Jan Kara
Cc: Ojaswin Mujoo, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Hi,
On 2026-02-16 12:38:59 +0100, Jan Kara wrote:
> On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > Another thing that came up is to consider using write-through semantics
> > for buffered atomic writes, where we transition the page to the writeback
> > state immediately after the write and prevent any other users from
> > modifying the data till writeback completes. This might affect performance
> > since we won't be able to batch similar atomic IOs, but maybe
> > applications like postgres would not mind this too much. If we go with
> > this approach, we will be able to avoid worrying too much about other
> > users changing atomic data underneath us.
> >
> > An argument against this, however, is that it is the user's responsibility
> > to not do non-atomic IO over an atomic range, and this shall be considered
> > a userspace usage error. This is similar to how there are ways users can
> > tear a dio if they perform overlapping writes. [1]
>
> Yes, I was wondering whether the write-through semantics would make sense
> as well.
As outlined in
https://lore.kernel.org/all/zzvybbfy6bcxnkt4cfzruhdyy6jsvnuvtjkebdeqwkm6nfpgij@dlps7ucza22s/
that is something that would be useful for postgres even orthogonally to
atomic writes.
If this were the path to go with, I'd suggest adding an RWF_WRITETHROUGH and
requiring it to be set when using RWF_ATOMIC on a buffered write. That way,
if the kernel were to eventually support buffered atomic writes without
immediate writeback, the semantics to userspace wouldn't suddenly change.
> Intuitively it should make things simpler because you could
> practically reuse the atomic DIO write path. Only that you'd first copy the
> data into the page cache and issue the dio write from those folios. No need
> for special tracking of which folios actually belong together in an atomic
> write, no need for cluttering the standard folio writeback path, and in case
> the atomic write cannot happen (e.g. because you cannot allocate
> appropriately aligned blocks) you get the error back right away, ...
>
> Of course this all depends on whether such semantics would actually be
> useful for users such as PostgreSQL.
I think it would be useful for many workloads.
As noted in the linked message, there are some workloads where I am not sure
how the gains/costs would balance out (with a small PG buffer pool in a
write-heavy workload, we'd lose the ability to have the kernel avoid
redundant writes). It's possible that we could develop some heuristics to
fall back to doing our own torn-page avoidance in such cases, although it's
not immediately obvious to me what that heuristic would be. It's also not
that common a workload; it's *much* more common to have a read-heavy
workload that has to overflow into the kernel page cache, due to not being
able to dedicate sufficient memory to postgres.
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 5:51 ` Christoph Hellwig
@ 2026-02-17 9:23 ` Amir Goldstein
2026-02-17 15:47 ` Andres Freund
2026-02-18 6:51 ` Christoph Hellwig
0 siblings, 2 replies; 18+ messages in thread
From: Amir Goldstein @ 2026-02-17 9:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, ritesh.list, jack,
ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev,
tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
>
> I think a better session would be how we can help postgres to move
> off buffered I/O instead of adding more special cases for them.
Respectfully, I disagree that DIO is the only possible solution.
Direct I/O is a legit solution for databases, and so is buffered I/O,
each with its own caveats.
Specifically, when two subsystems (kernel vfs and db) each require a huge
amount of cache memory for best performance, setting them up to play nicely
together to utilize system memory in an optimal way is a huge pain.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein
@ 2026-02-17 15:47 ` Andres Freund
2026-02-17 22:45 ` Dave Chinner
2026-02-18 6:53 ` Christoph Hellwig
2026-02-18 6:51 ` Christoph Hellwig
1 sibling, 2 replies; 18+ messages in thread
From: Andres Freund @ 2026-02-17 15:47 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, ritesh.list,
jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
Hi,
On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > I think a better session would be how we can help postgres to move
> > off buffered I/O instead of adding more special cases for them.
FWIW, we are adding support for DIO (it's been added, but performance isn't
competitive for most workloads in the released versions yet; work to address
those issues is in progress).
But it's only really viable for larger setups, not for e.g.:
- smaller, unattended setups
- uses of postgres as part of a larger application on one server with
hard-to-predict memory usage of different components
- intentionally overcommitted shared hosting type scenarios
Even once a well-configured postgres using DIO beats postgres not using DIO,
I'll bet that well over 50% of users won't be able to use DIO.
There are some kernel issues that make it harder than necessary to use DIO,
btw:
Most prominently: with DIO, concurrently extending multiple files leads to
quite terrible fragmentation, at least with XFS, forcing us to
over-aggressively use fallocate() and truncate later if it turns out we need
less space. The fallocate() in turn triggers slowness in the write paths, as
writing to uninitialized extents is a metadata operation. It'd be great if
the allocation behaviour with concurrent file extension could be improved and
if we could have a fallocate mode that forces extents to be initialized.
A secondary issue is that with the buffer pool sizes necessary for DIO use on
bigger systems, creating the anonymous memory mapping becomes painfully slow
if we use MAP_POPULATE - which we kinda need to do, as otherwise performance
is very inconsistent initially (often iomap -> gup -> handle_mm_fault ->
folio_zero_user uses the majority of the CPU). We've been experimenting with
not using MAP_POPULATE and using multiple threads to populate the mapping in
parallel, but that doesn't feel like something that userspace ought to have
to do. It's easier for us to work around than the uninitialized extent
conversion issue, but it still is something we IMO shouldn't have to do.
> Respectfully, I disagree that DIO is the only possible solution.
> Direct I/O is a legit solution for databases and so is buffered I/O
> each with their own caveats.
> Specifically, when two subsystems (kernel vfs and db) each require a huge
> amount of cache memory for best performance, setting them up to play nicely
> together to utilize system memory in an optimal way is a huge pain.
Yep.
Greetings,
Andres Freund
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 13:18 ` Pankaj Raghav
@ 2026-02-17 18:36 ` Ojaswin Mujoo
0 siblings, 0 replies; 18+ messages in thread
From: Ojaswin Mujoo @ 2026-02-17 18:36 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Jan Kara, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Mon, Feb 16, 2026 at 02:18:10PM +0100, Pankaj Raghav wrote:
>
>
> On 2/16/2026 12:38 PM, Jan Kara wrote:
> > Hi!
> >
> > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > > Another thing that came up is to consider using write-through semantics
> > > for buffered atomic writes, where we transition the page to the writeback
> > > state immediately after the write and prevent any other users from
> > > modifying the data till writeback completes. This might affect performance
> > > since we won't be able to batch similar atomic IOs, but maybe
> > > applications like postgres would not mind this too much. If we go with
> > > this approach, we will be able to avoid worrying too much about other
> > > users changing atomic data underneath us.
> > >
> > > An argument against this, however, is that it is the user's responsibility
> > > to not do non-atomic IO over an atomic range, and this shall be considered
> > > a userspace usage error. This is similar to how there are ways users can
> > > tear a dio if they perform overlapping writes. [1]
> >
> > Yes, I was wondering whether the write-through semantics would make sense
> > as well. Intuitively it should make things simpler because you could
> > practically reuse the atomic DIO write path. Only that you'd first copy the
> > data into the page cache and issue the dio write from those folios. No need
> > for special tracking of which folios actually belong together in an atomic
> > write, no need for cluttering the standard folio writeback path, and in case
> > the atomic write cannot happen (e.g. because you cannot allocate
> > appropriately aligned blocks) you get the error back right away, ...
> >
> > Of course this all depends on whether such semantics would actually be
> > useful for users such as PostgreSQL.
>
> One issue might be performance, especially if the atomic max unit is at
> the smaller end, such as 16k or 32k (which is fairly common). But it would
> avoid the overlapping writes issue and can easily leverage the direct IO
> path.
>
> But one thing that postgres really cares about is the integrity of a
> database block. So if there is an IO that is a multiple of an atomic write
> unit (one atomic unit encapsulates the whole DB page), it is not a problem
> if tearing happens on the atomic boundaries. This fits very well with what
> NVMe calls Multiple Atomicity Mode (MAM) [1].
>
> We don't have any semantics for MAM at the moment, but that could increase
> performance as we can do larger IOs but still get the atomic guarantees
> certain applications care about.
Interesting. I think very early dio implementations did use
something of this sort, where (with awu_max = 4k) an atomic write of 16k
would result in 4 x 4k atomic writes.
I don't remember why it was shot down though :D
Regards,
ojaswin
>
>
> [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-Revision-1.1-2024.08.05-Ratified.pdf
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
2026-02-16 15:57 ` Andres Freund
@ 2026-02-17 18:39 ` Ojaswin Mujoo
2026-02-18 0:26 ` Dave Chinner
2 siblings, 1 reply; 18+ messages in thread
From: Ojaswin Mujoo @ 2026-02-17 18:39 UTC (permalink / raw)
To: Jan Kara
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote:
> Hi!
>
> On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > > Another thing that came up is to consider using write-through semantics
> > > for buffered atomic writes, where we transition the page to the writeback
> > > state immediately after the write and prevent any other users from
> > > modifying the data till writeback completes. This might affect performance
> > > since we won't be able to batch similar atomic IOs, but maybe
> > > applications like postgres would not mind this too much. If we go with
> > > this approach, we will be able to avoid worrying too much about other
> > > users changing atomic data underneath us.
> > >
> > > An argument against this, however, is that it is the user's responsibility
> > > to not do non-atomic IO over an atomic range, and this shall be considered
> > > a userspace usage error. This is similar to how there are ways users can
> > > tear a dio if they perform overlapping writes. [1]
>
> Yes, I was wondering whether the write-through semantics would make sense
> as well. Intuitively it should make things simpler because you could
> practically reuse the atomic DIO write path. Only that you'd first copy the
> data into the page cache and issue the dio write from those folios. No need
> for special tracking of which folios actually belong together in an atomic
> write, no need for cluttering the standard folio writeback path, and in case
> the atomic write cannot happen (e.g. because you cannot allocate
> appropriately aligned blocks) you get the error back right away, ...
This is an interesting idea, Jan, and it also saves a lot of tracking of
atomic extents etc.
I'm unsure how much of a performance impact it'd have, but I'll
look into it.
Regards,
ojaswin
>
> Of course this all depends on whether such semantics would actually be
> useful for users such as PostgreSQL.
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 15:47 ` Andres Freund
@ 2026-02-17 22:45 ` Dave Chinner
2026-02-18 4:10 ` Andres Freund
2026-02-18 6:53 ` Christoph Hellwig
1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2026-02-17 22:45 UTC (permalink / raw)
To: Andres Freund
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> Hi,
>
> On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> > On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > I think a better session would be how we can help postgres to move
> > > off buffered I/O instead of adding more special cases for them.
>
> FWIW, we are adding support for DIO (it's been added, but performance isn't
> competitive for most workloads in the released versions yet, work to address
> those issues is in progress).
>
> But it's only really viable for larger setups, not for e.g.:
> - smaller, unattended setups
> - uses of postgres as part of a larger application on one server with hard to
> predict memory usage of different components
> - intentionally overcommitted shared hosting type scenarios
>
> Even once a well configured postgres using DIO beats postgres not using DIO,
> I'll bet that well over 50% of users won't be able to use DIO.
>
>
> There are some kernel issues that make it harder than necessary to use DIO,
> btw:
>
> Most prominently: with DIO, concurrently extending multiple files leads to
> quite terrible fragmentation, at least with XFS, forcing us to
> over-aggressively use fallocate() and truncate later if it turns out we need
> less space.
<ahem>
Seriously, fallocate() is considered harmful for exactly these sorts
of reasons. XFS has vastly better mechanisms built into it that
mitigate worst case fragmentation without needing to change
applications or increase runtime overhead.
So, let's go way back - 32 years ago, to 1994:
commit 32766d4d387bc6779e0c432fb56a0cc4e6b96398
Author: Doug Doucette <doucette@engr.sgi.com>
Date: Thu Mar 3 22:17:15 1994 +0000
Add fcntl implementation (F_FSGETXATTR, F_FSSETXATTR, and F_DIOINFO).
Fix xfs_setattr new xfs fields' implementation to split out error checking
to the front of the routine, like the other attributes. Don't set new
fields in xfs_getattr unless one of the fields is requested.
.....
+ case F_FSSETXATTR: {
+ struct fsxattr fa;
+ vattr_t va;
+
+ if (copyin(arg, &fa, sizeof(fa))) {
+ error = EFAULT;
+ break;
+ }
+ va.va_xflags = fa.fsx_xflags;
+ va.va_extsize = fa.fsx_extsize;
^^^^^^^^^^^^^^^
+ error = xfs_setattr(vp, &va, AT_XFLAGS|AT_EXTSIZE, credp);
+ break;
+ }
This was the commit that added user controlled extent size hints to
XFS. These already existed in EFS, so applications using this
functionality go back to the even earlier in the 1990s.
So, let's set the extent size hint on a file to 1MB. Now whenever a
data extent allocation on that file is attempted, the extent size
that is allocated will be rounded up to the nearest 1MB. i.e. XFS
will try to allocate unwritten extents in aligned multiples of the
extent size hint regardless of the actual IO size being performed.
Hence if you are doing concurrent extending 8kB writes, instead of
allocating 8kB at a time, the extent size hint will force a 1MB
unwritten extent to be allocated out beyond EOF. The subsequent
extending 8kB writes to that file now hit that unwritten extent, and
only need to convert it to written. The same will happen for all
other concurrent extending writes - they will allocate in 1MB
chunks, not 8KB.
The result will be that 1MB sized extents will be interleaved across the
files instead of 8kB sized extents. i.e. we've just reduced
the worst case fragmentation behaviour by a factor of 128. We've
also reduced allocation overhead by a factor of 128, so the use of
extent size hints results in the filesystem behaving in a far more
efficient way and hence this results in higher performance.
IOWs, the extent size hint effectively sets a minimum extent size
that the filesystem will create for a given file, thereby mitigating
the worst case fragmentation that can occur. However, the use of
fallocate() in the application explicitly prevents the filesystem
from doing this smart, transparent IO path thing to mitigate
fragmentation.
One of the most important properties of extent size hints is that
they can be dynamically tuned *without changing the application.*
The extent size hint is a property of the inode, and it can be set
by the admin through various XFS tools (e.g. mkfs.xfs for a
filesystem wide default, xfs_io to set it on a directory so all new
files/dirs created in that directory inherit the value, set it on
individual files, etc). It can be changed even whilst the file is in
active use by the application.
Hence the extent size hint can be changed at any time, and you
can apply it immediately to existing installations as an active
mitigation. Doing this won't fix existing fragmentation (that's what
xfs_fsr is for), but it will instantly mitigate/prevent new
fragmentation from occurring. It's much more difficult to do this
with applications that use fallocate()...
Indeed, the case for using fallocate() instead of extent size hints
gets worse the more you look at how extent size hints work.
Extent size hints don't impact IO concurrency at all. Extent size
hints are only applied during extent allocation, so the optimisation
is applied naturally as part of the existing concurrent IO path.
Hence using extent size hints won't block/stall/prevent concurrent
async IO in any way.
fallocate(), OTOH, causes a full IO pipeline stall (blocks submission
of both reads and writes, then waits for all IO in flight to drain)
on that file for the duration of the syscall. You can't do any sort
of IO (async or otherwise) and run fallocate() at the same time, so
fallocate() really sucks from the POV of a high performance IO app.
fallocate() also marks the files as having persistent preallocation,
which means that when you close the file the filesystem does not
remove excessive extents allocated beyond EOF. Hence the reported
problems with excessive space usage and needing to truncate files
manually (which also cause a complete IO stall on that file) are
brought on specifically because fallocate() is being used by the
application to manage worst case fragmentation.
This problem does not exist with extent size hints - unused blocks
beyond EOF will be trimmed on last close or when the inode is cycled
out of cache, just like we do for excess speculative prealloc beyond
EOF for buffered writes (the buffered IO fragmentation mitigation
mechanism for interleaving concurrent extending writes).
The administrator can easily optimise extent size hints to match the
optimal characteristics of the underlying storage (e.g. set them to
be RAID stripe aligned), etc. Fallocate() requires the application
to provide tunables to modify its behaviour for optimal storage
layout, and depending on how the application uses fallocate(), this
level of flexibility may not even be possible.
And let's not forget that an fallocate() based mitigation that helps
one filesystem type can actively hurt another type (e.g. ext4) by
introducing an application level extent allocation boundary vector
where there was none before.
Hence, IMO, micromanaging filesystem extent allocation with
fallocate() is -almost always- the wrong thing for applications to
be doing. There is no one "right way" to use fallocate() - what is
optimal for one filesystem will be pessimal for another, and it is
impossible to code optimal behaviour in the application for all
filesystem types the app might run on.
> The fallocate in turn triggers slowness in the write paths, as
> writing to uninitialized extents is a metadata operation.
That is not the problem you think it is. XFS is using unwritten
extents for all buffered IO writes that use delayed allocation, too,
and I don't see you complaining about that....
Yes, the overhead of unwritten extent conversion is more visible
with direct IO, but that's only because DIO has much lower overhead
and much, much higher performance ceiling than buffered IO. That
doesn't mean unwritten extents are a performance limiting factor...
> It'd be great if
> the allocation behaviour with concurrent file extension could be improved and
> if we could have a fallocate mode that forces extents to be initialized.
<sigh>
You mean like FALLOC_FL_WRITE_ZEROES?
That won't fix your fragmentation problem, and it has all the same
pipeline stall problems as allocating unwritten extents in
fallocate().
Only much worse now, because the IO pipeline is stalled for the
entire time it takes to write the zeroes to persistent storage. i.e.
long tail file access latencies will increase massively if you do
this regularly to extend files.
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 18:39 ` Ojaswin Mujoo
@ 2026-02-18 0:26 ` Dave Chinner
2026-02-18 6:49 ` Christoph Hellwig
2026-02-18 12:54 ` Ojaswin Mujoo
0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2026-02-18 0:26 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote:
> On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote:
> > Hi!
> >
> > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > > Another thing that came up is to consider using write-through semantics
> > > for buffered atomic writes, where we transition the page to the writeback
> > > state immediately after the write and prevent any other users from
> > > modifying the data till writeback completes. This might affect performance
> > > since we won't be able to batch similar atomic IOs, but maybe
> > > applications like postgres would not mind this too much. If we go with
> > > this approach, we will be able to avoid worrying too much about other
> > > users changing atomic data underneath us.
> > >
> > > An argument against this, however, is that it is the user's responsibility
> > > to not do non-atomic IO over an atomic range, and this shall be considered
> > > a userspace usage error. This is similar to how there are ways users can
> > > tear a dio if they perform overlapping writes. [1]
> >
> > Yes, I was wondering whether the write-through semantics would make sense
> > as well. Intuitively it should make things simpler because you could
> > practically reuse the atomic DIO write path. Only that you'd first copy the
> > data into the page cache and issue the dio write from those folios. No need
> > for special tracking of which folios actually belong together in an atomic
> > write, no need for cluttering the standard folio writeback path, and in case
> > the atomic write cannot happen (e.g. because you cannot allocate
> > appropriately aligned blocks) you get the error back right away, ...
>
> This is an interesting idea Jan and also saves a lot of tracking of
> atomic extents etc.
ISTR mentioning that we should be doing exactly this (grab page
cache pages, fill them and submit them through the DIO path) for
O_DSYNC buffered writethrough IO a long time ago. The context was
optimising buffered O_DSYNC to use the FUA optimisations in the
iomap DIO write path.
I suggested it again when discussing how RWF_DONTCACHE should be
implemented, because the async DIO write completion path invalidates
the page cache over the IO range. i.e. it would avoid the need to
use folio flags to track pages that needed invalidation at IO
completion...
I have a vague recollection of mentioning this early in the buffered
RWF_ATOMIC discussions, too, though that may have just been the
voices in my head.
Regardless, we are here again with proposals for RWF_ATOMIC and
RWF_WRITETHROUGH and a suggestion that maybe we should vector
buffered writethrough via the DIO path.....
Perhaps it's time to do this?
FWIW, the other thing that write-through via the DIO path enables is
true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes
block waiting on IO completion through generic_write_sync() ->
vfs_fsync_range(), even when issued through AIO paths. Vectoring it
through the DIO path avoids the blocking fsync path in IO submission
as it runs in the async DIO completion path if it is needed....
-Dave.
--
Dave Chinner
dgc@kernel.org
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 22:45 ` Dave Chinner
@ 2026-02-18 4:10 ` Andres Freund
0 siblings, 0 replies; 18+ messages in thread
From: Andres Freund @ 2026-02-18 4:10 UTC (permalink / raw)
To: Dave Chinner
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
Hi,
On 2026-02-18 09:45:46 +1100, Dave Chinner wrote:
> On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> > There are some kernel issues that make it harder than necessary to use DIO,
> > btw:
> >
> > Most prominently: With DIO concurrently extending multiple files leads to
> > quite terrible fragmentation, at least with XFS. Forcing us to
> > over-aggressively use fallocate(), truncating later if it turns out we need
> > less space.
>
> <ahem>
>
> seriously, fallocate() is considered harmful for exactly these sorts
> of reasons. XFS has vastly better mechanisms built into it that
> mitigate worst case fragmentation without needing to change
> applications or increase runtime overhead.
There's probably a misunderstanding here: We don't do fallocate to avoid
fragmentation.
We want to guarantee that there's space for data that is in our buffer pool,
as otherwise it's very easy to get into a pickle:
If there is dirty data in the buffer pool that can't be written out due to
ENOSPC, the subsequent checkpoint can't complete. So the system may be stuck
because you're not able to create more space for WAL / journaling, you
can't free up old WAL due to the checkpoint not being able to complete, and if
you react to that with a crash-recovery cycle you're likely to be unable to
complete crash recovery because you'll just hit ENOSPC again.
And yes, CoW filesystems make that less reliable, but it turns out to still
save people often enough that I doubt we can get rid of it.
To ensure there's space for the write out of our buffer pool we have two
choices:
1) write out zeroes
2) use fallocate
Writing out zeroes that we will just overwrite later is obviously not a
particularly good use of IO bandwidth, particularly on metered cloud
"storage". But using fallocate() has fragmentation and unwritten-extent
issues. Our compromise is that we use fallocate iff we enlarge the relation
by a decent number of pages at once and write zeroes otherwise.
Is that perfect? Hell no. But it's also not obvious what a better answer is
with today's interfaces.
If there were a "guarantee that N additional blocks are reserved, but not
concretely allocated" interface, we'd gladly use it.
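A minimal sketch of the compromise described above (illustrative only, not
PostgreSQL's actual code; the 8kB block size matches postgres pages, but the
threshold value and function name are made up):

```c
/* Sketch: reserve space with fallocate() when extending by many blocks
 * at once, fall back to writing zeroes for small extensions. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define BLKSZ 8192
#define FALLOCATE_THRESHOLD 8   /* blocks; illustrative value only */

static int extend_file(int fd, off_t cur_size, int nblocks)
{
    off_t len = (off_t)nblocks * BLKSZ;

    if (nblocks >= FALLOCATE_THRESHOLD) {
        /* Large extension: reserve space without writing data. */
        if (fallocate(fd, 0, cur_size, len) == 0)
            return 0;
        /* Fall through to zero-fill if unsupported (e.g. EOPNOTSUPP). */
    }
    /* Small extension: physically write zeroes so space truly exists. */
    char *zeroes = calloc(1, BLKSZ);
    if (!zeroes)
        return -1;
    for (int i = 0; i < nblocks; i++) {
        if (pwrite(fd, zeroes, BLKSZ, cur_size + (off_t)i * BLKSZ) != BLKSZ) {
            free(zeroes);
            return -1;
        }
    }
    free(zeroes);
    return 0;
}
```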
> So, let's set the extent size hint on a file to 1MB. Now whenever a
> data extent allocation on that file is attempted, the extent size
> that is allocated will be rounded up to the nearest 1MB. i.e. XFS
> will try to allocate unwritten extents in aligned multiples of the
> extent size hint regardless of the actual IO size being performed.
>
> Hence if you are doing concurrent extending 8kB writes, instead of
> allocating 8kB at a time, the extent size hint will force a 1MB
> unwritten extent to be allocated out beyond EOF. The subsequent
> extending 8kB writes to that file now hit that unwritten extent, and
> only need to convert it to written. The same will happen for all
> other concurrent extending writes - they will allocate in 1MB
> chunks, not 8KB.
We could probably benefit from that.
> One of the most important properties of extent size hints is that
> they can be dynamically tuned *without changing the application.*
> The extent size hint is a property of the inode, and it can be set
> by the admin through various XFS tools (e.g. mkfs.xfs for a
> filesystem wide default, xfs_io to set it on a directory so all new
> files/dirs created in that directory inherit the value, set it on
> individual files, etc). It can be changed even whilst the file is in
> active use by the application.
IME our users run enough postgres instances, across a lot of differing
workloads, that manual tuning like that will rarely if ever happen :(. I miss
well-educated DBAs :(. A large portion of users don't even have direct
access to the server, only via the postgres protocol...
If we were to use these hints, it'd have to happen automatically from within
postgres. That does seem viable, but it is certainly not exactly filesystem
independent...
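Done from within the application on Linux, that would presumably look
something like this (a sketch; XFS honours FS_XFLAG_EXTSIZE, other
filesystems may ignore or reject the ioctl):

```c
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FSSETXATTR */

/* Set a per-file extent size hint, e.g. 1 MiB as in Dave's example.
 * Returns 0 on success, -1 with errno set otherwise (e.g. on a
 * filesystem that doesn't support extent size hints). */
static int set_extent_size_hint(int fd, unsigned int extsize_bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;  /* enable the per-file hint */
    fsx.fsx_extsize = extsize_bytes;
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```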
> > The fallocate in turn triggers slowness in the write paths, as
> > writing to uninitialized extents is a metadata operation.
>
> That is not the problem you think it is. XFS is using unwritten
> extents for all buffered IO writes that use delayed allocation, too,
> and I don't see you complaining about that....
It's a problem for buffered IO as well, just a bit harder to hit on many
drives, because buffered O_DSYNC writes don't use FUA.
If you need any durable writes into a file with unwritten extents, things get
painful very fast.
See a few paragraphs below for the most crucial case where we need to make
sure writes are durable.
testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do
  echo buffered: $buffered overwrite: $overwrite
  rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=psync \
    --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 --rw=write \
    --size=64MB --sync=dsync --name pg-extend --overwrite=$overwrite | grep IOPS
done; done
buffered: 0 overwrite: 0
write: IOPS=1427, BW=5709KiB/s (5846kB/s)(64.0MiB/11479msec); 0 zone resets
buffered: 0 overwrite: 1
write: IOPS=4025, BW=15.7MiB/s (16.5MB/s)(64.0MiB/4070msec); 0 zone resets
buffered: 1 overwrite: 0
write: IOPS=1638, BW=6554KiB/s (6712kB/s)(64.0MiB/9999msec); 0 zone resets
buffered: 1 overwrite: 1
write: IOPS=3663, BW=14.3MiB/s (15.0MB/s)(64.0MiB/4472msec); 0 zone resets
That's a > 2x throughput difference. And the results would be similar with
--fdatasync=1.
If you add AIO to the mix, the difference gets way bigger, particularly on
drives with FUA support and DIO:
testdir=/srv/fio && for buffered in 0 1; do for overwrite in 0 1; do
  echo buffered: $buffered overwrite: $overwrite
  rm -f $testdir/pg-extend* && fio --directory=$testdir --ioengine=io_uring \
    --buffered=$buffered --bs=4kB --fallocate=none --overwrite=0 --rw=write \
    --size=64MB --sync=dsync --name pg-extend --overwrite=$overwrite \
    --iodepth 32 | grep IOPS
done; done
buffered: 0 overwrite: 0
write: IOPS=6143, BW=24.0MiB/s (25.2MB/s)(64.0MiB/2667msec); 0 zone resets
buffered: 0 overwrite: 1
write: IOPS=76.6k, BW=299MiB/s (314MB/s)(64.0MiB/214msec); 0 zone resets
buffered: 1 overwrite: 0
write: IOPS=1835, BW=7341KiB/s (7517kB/s)(64.0MiB/8928msec); 0 zone resets
buffered: 1 overwrite: 1
write: IOPS=4096, BW=16.0MiB/s (16.8MB/s)(64.0MiB/4000msec); 0 zone resets
It's less bad, but still quite a noticeable difference, on drives without
volatile caches. And it's often worse on networked storage, whether it has a
volatile cache or not.
> > It'd be great if
> > the allocation behaviour with concurrent file extension could be improved and
> > if we could have a fallocate mode that forces extents to be initialized.
>
> <sigh>
>
> You mean like FALLOC_FL_WRITE_ZEROES?
I hadn't seen that it was merged, that's great! It doesn't yet seem to be
documented in the fallocate(2) man page, which I had checked...
Hm, it also doesn't seem to work on XFS yet :(, EOPNOTSUPP.
> That won't fix your fragmentation problem, and it has all the same pipeline
> stall problems as allocating unwritten extents in fallocate().
The primary case where FALLOC_FL_WRITE_ZEROES would be useful is for WAL file
creation, which are always of the same fixed size (therefore no fragmentation
risk).
To avoid having metadata operations during our commit path, we today default
to forcing them to be allocated by overwriting them with zeros and fsyncing
them.
Not ensuring that the extents are already written would incur a very large
perf penalty (as in ~2-3x for OLTP workloads, on XFS). That's true both
when using DIO and when not.
To avoid having to do that over and over, we recycle WAL files.
Unfortunately this means that when all those WAL files are not yet
preallocated (or when we release them during low activity), the performance is
rather noticeably worsened by the additional IO for pre-zeroing the WAL files.
In theory FALLOC_FL_WRITE_ZEROES should be faster than issuing writes for the
whole range.
> Only much worse now, because the IO pipeline is stalled for the
> entire time it takes to write the zeroes to persistent storage. i.e.
> long tail file access latencies will increase massively if you do
> this regularly to extend files.
In the WAL path we fsync at the point we could use FALLOC_FL_WRITE_ZEROES, as
otherwise the WAL segment might not exist after a crash, which would be
... bad.
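The WAL segment preallocation described above could be sketched as follows
(illustrative only, not PostgreSQL's actual code; FALLOC_FL_WRITE_ZEROES is
only present in recent kernel headers, hence the fallback):

```c
/* Sketch: try FALLOC_FL_WRITE_ZEROES (if the headers provide it),
 * otherwise fall back to physically writing zeroes, then fdatasync()
 * so the segment survives a crash. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <linux/falloc.h>

static int preallocate_wal_segment(int fd, off_t segsize)
{
#ifdef FALLOC_FL_WRITE_ZEROES
    /* Allocates *written* (not unwritten) zeroed extents, avoiding the
     * unwritten-extent conversion cost on later O_DSYNC writes. */
    if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, segsize) == 0)
        return fdatasync(fd);
    /* Unsupported (e.g. EOPNOTSUPP): fall through to the slow path. */
#endif
    char buf[8192];
    memset(buf, 0, sizeof(buf));
    for (off_t off = 0; off < segsize; off += (off_t)sizeof(buf)) {
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
            return -1;
    }
    return fdatasync(fd);
}
```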
Greetings,
Andres Freund
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 0:26 ` Dave Chinner
@ 2026-02-18 6:49 ` Christoph Hellwig
2026-02-18 12:54 ` Ojaswin Mujoo
1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-02-18 6:49 UTC (permalink / raw)
To: Dave Chinner
Cc: Ojaswin Mujoo, Jan Kara, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy,
hch, ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote:
> ISTR mentioning that we should be doing exactly this (grab page
> cache pages, fill them and submit them through the DIO path) for
> O_DSYNC buffered writethrough IO a long time again.
Yes, multiple times. And I did a few more times since then.
> Regardless, we are here again with proposals for RWF_ATOMIC and
> RWF_WRITETHROUGH and a suggestion that maybe we should vector
> buffered writethrough via the DIO path.....
>
> Perhaps it's time to do this?
Yes.
> FWIW, the other thing that write-through via the DIO path enables is
> true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes
> block waiting on IO completion through generic_sync_write() ->
> vfs_fsync_range(), even when issued through AIO paths. Vectoring it
> through the DIO path avoids the blocking fsync path in IO submission
> as it runs in the async DIO completion path if it is needed....
It's only true if we can do the page cache updates non-blocking, but
in many cases that should indeed be possible.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein
2026-02-17 15:47 ` Andres Freund
@ 2026-02-18 6:51 ` Christoph Hellwig
1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-02-18 6:51 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, Andres Freund, djwong, john.g.garry, willy,
ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 10:23:36AM +0100, Amir Goldstein wrote:
> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > I think a better session would be how we can help postgres to move
> > off buffered I/O instead of adding more special cases for them.
>
> Respectfully, I disagree that DIO is the only possible solution.
> Direct I/O is a legit solution for databases and so is buffered I/O
> each with their own caveats.
Maybe. Classic buffered I/O is not a legit solution for doing atomic
I/Os, and if Postgres is desperate to use that, something like direct
I/O (including the proposed write-through semantics) is the only sensible
choice.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 15:47 ` Andres Freund
2026-02-17 22:45 ` Dave Chinner
@ 2026-02-18 6:53 ` Christoph Hellwig
1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2026-02-18 6:53 UTC (permalink / raw)
To: Andres Freund
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
ritesh.list, jack, ojaswin, Luis Chamberlain, dchinner,
Javier Gonzalez, gost.dev, tytso, p.raghav, vi.shah
On Tue, Feb 17, 2026 at 10:47:07AM -0500, Andres Freund wrote:
> Most prominently: With DIO concurrently extending multiple files leads to
> quite terrible fragmentation, at least with XFS. Forcing us to
> over-aggressively use fallocate(), truncating later if it turns out we need
> less space. The fallocate in turn triggers slowness in the write paths, as
> writing to uninitialized extents is a metadata operation. It'd be great if
> the allocation behaviour with concurrent file extension could be improved and
> if we could have a fallocate mode that forces extents to be initialized.
As Dave already mentioned, if you do concurrent allocations (extension
or hole filling), setting an extent size hint is probably a good idea.
We could try to look into heuristics, but chances are that they would
degrade other use cases. Details would be useful as a report on the
XFS list.
>
> A secondary issue is that with the buffer pool sizes necessary for DIO use on
> bigger systems, creating the anonymous memory mapping becomes painfully slow
> if we use MAP_POPULATE - which we kinda need to do, as otherwise performance
> is very inconsistent initially (often iomap -> gup -> handle_mm_fault ->
> folio_zero_user uses the majority of the CPU). We've been experimenting with
> not using MAP_POPULATE and using multiple threads to populate the mapping in
> parallel, but that feels not like something that userspace ought to have to
> do. It's easier to work around for us than the uninitialized extent
> conversion issue, but it still is something we IMO shouldn't have to do.
Please report this to linux-mm.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 0:26 ` Dave Chinner
2026-02-18 6:49 ` Christoph Hellwig
@ 2026-02-18 12:54 ` Ojaswin Mujoo
1 sibling, 0 replies; 18+ messages in thread
From: Ojaswin Mujoo @ 2026-02-18 12:54 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel,
lsf-pc, Andres Freund, djwong, john.g.garry, willy, hch,
ritesh.list, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
On Wed, Feb 18, 2026 at 11:26:06AM +1100, Dave Chinner wrote:
> On Wed, Feb 18, 2026 at 12:09:46AM +0530, Ojaswin Mujoo wrote:
> > On Mon, Feb 16, 2026 at 12:38:59PM +0100, Jan Kara wrote:
> > > Hi!
> > >
> > > On Fri 13-02-26 19:02:39, Ojaswin Mujoo wrote:
> > > > Another thing that came up is to consider using write through semantics
> > > > for buffered atomic writes, where we are able to transition page to
> > > > writeback state immediately after the write and avoid any other users to
> > > > modify the data till writeback completes. This might affect performance
> > > > since we won't be able to batch similar atomic IOs but maybe
> > > > applications like postgres would not mind this too much. If we go with
> > > > this approach, we will be able to avoid worrying too much about other
> > > > users changing atomic data underneath us.
> > > >
> > > > An argument against this however is that it is user's responsibility to
> > > > not do non atomic IO over an atomic range and this shall be considered a
> > > > userspace usage error. This is similar to how there are ways users can
> > > > tear a dio if they perform overlapping writes. [1].
> > >
> > > Yes, I was wondering whether the write-through semantics would make sense
> > > as well. Intuitively it should make things simpler because you could
> > > practically reuse the atomic DIO write path. Only that you'd first copy
> > > data into the page cache and issue dio write from those folios. No need for
> > > special tracking of which folios actually belong together in atomic write,
> > > no need for cluttering standard folio writeback path, in case atomic write
> > > cannot happen (e.g. because you cannot allocate appropriately aligned
> > > blocks) you get the error back right away, ...
> >
> > This is an interesting idea Jan and also saves a lot of tracking of
> > atomic extents etc.
>
> ISTR mentioning that we should be doing exactly this (grab page
> cache pages, fill them and submit them through the DIO path) for
> O_DSYNC buffered writethrough IO a long time ago. The context was
> optimising buffered O_DSYNC to use the FUA optimisations in the
> iomap DIO write path.
>
> I suggested it again when discussing how RWF_DONTCACHE should be
> implemented, because the async DIO write completion path invalidates
> the page cache over the IO range. i.e. it would avoid the need to
> use folio flags to track pages that needed invalidation at IO
> completion...
>
> I have a vague recollection of mentioning this early in the buffered
> RWF_ATOMIC discussions, too, though that may have just been the
> voices in my head.
Hi Dave,
Yes we did discuss this [1] :)
We also discussed the alternative of using the COW fork path for atomic
writes [2]. Since at that point I was not completely sure if the
writethrough would be too restrictive an approach, I was working
on a COW fork implementation.
However, from the discussion here as well as Andres' comments, it seems
like write through might not be too bad for postgres.
>
> Regardless, we are here again with proposals for RWF_ATOMIC and
> RWF_WRITETHROUGH and a suggestion that maybe we should vector
> buffered writethrough via the DIO path.....
>
> Perhaps it's time to do this?
I agree that it makes more sense to do writethrough if we want to have
the strict old-or-new semantics (as opposed to just untorn IO
semantics). I'll work on a POC for this approach to atomic writes,
mostly basing it on your suggestions in [1].
FWIW, I do have a somewhat working (although untested and possibly
broken in some places) POC for performing atomic writes via the XFS COW
fork, based on suggestions from Dave [2]. Even though we want to explore
the writethrough approach, I'll share it here in case anyone is
interested in what the design looks like:
https://github.com/OjaswinM/linux/commits/iomap-buffered-atomic-rfc2.3/
(If anyone prefers for me to send this as a patchset on mailing list, let
me know)
Regards,
ojaswin
[1] https://lore.kernel.org/linux-fsdevel/aRmHRk7FGD4nCT0s@dread.disaster.area/
[2] https://lore.kernel.org/linux-fsdevel/aRuKz4F3xATf8IUp@dread.disaster.area/
>
> FWIW, the other thing that write-through via the DIO path enables is
> true async O_DSYNC buffered IO. Right now O_DSYNC buffered writes
> block waiting on IO completion through generic_sync_write() ->
> vfs_fsync_range(), even when issued through AIO paths. Vectoring it
> through the DIO path avoids the blocking fsync path in IO submission
> as it runs in the async DIO completion path if it is needed....
>
> -Dave.
> --
> Dave Chinner
> dgc@kernel.org
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-17 17:20 ` Ojaswin Mujoo
@ 2026-02-18 17:42 ` Jan Kara
2026-02-18 20:22 ` Ojaswin Mujoo
0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2026-02-18 17:42 UTC (permalink / raw)
To: Ojaswin Mujoo
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
jack, Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev,
tytso, p.raghav, vi.shah
On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote:
> On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote:
> > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB
> > pages based on `io_combine_limit` (typically 128kb). So immediately writing them
> > might be ok as long as we don't remove those pages from the page cache like we do in
> > RWF_UNCACHED.
>
> Yep, and Ive not looked at the code path much but I think if we really
> care about the user not changing the data b/w write and writeback then
> we will probably need to start the writeback while holding the folio
> lock, which is currently not done in RWF_UNCACHED.
That isn't enough: submit_bio() returning doesn't guarantee that DMA
to the device has happened. And until it has, modifying the page cache
page means modifying the data the disk will get. The best approach is
probably to transition pages to the writeback state and handle them
like any other case requiring stable pages.
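For reference, such a wait already exists for the current write paths;
roughly (simplified pseudocode, modelled on folio_wait_stable() in
mm/page-writeback.c):

```
/* Simplified: callers about to modify a folio first wait for any
 * writeback in flight when the backing device requires stable pages. */
void folio_wait_stable(struct folio *folio)
{
        if (mapping_stable_writes(folio_mapping(folio)))
                folio_wait_writeback(folio);
}
```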
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-02-18 17:42 ` [Lsf-pc] " Jan Kara
@ 2026-02-18 20:22 ` Ojaswin Mujoo
0 siblings, 0 replies; 18+ messages in thread
From: Ojaswin Mujoo @ 2026-02-18 20:22 UTC (permalink / raw)
To: Jan Kara
Cc: Pankaj Raghav, linux-xfs, linux-mm, linux-fsdevel, lsf-pc,
Andres Freund, djwong, john.g.garry, willy, hch, ritesh.list,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
On Wed, Feb 18, 2026 at 06:42:05PM +0100, Jan Kara wrote:
> On Tue 17-02-26 22:50:17, Ojaswin Mujoo wrote:
> > On Mon, Feb 16, 2026 at 10:52:35AM +0100, Pankaj Raghav wrote:
> > > Hmm, IIUC, postgres will write their dirty buffer cache by combining multiple DB
> > > pages based on `io_combine_limit` (typically 128kb). So immediately writing them
> > > might be ok as long as we don't remove those pages from the page cache like we do in
> > > RWF_UNCACHED.
> >
> > Yep, and Ive not looked at the code path much but I think if we really
> > care about the user not changing the data b/w write and writeback then
> > we will probably need to start the writeback while holding the folio
> > lock, which is currently not done in RWF_UNCACHED.
>
> That isn't enough: submit_bio() returning doesn't guarantee that DMA
> to the device has happened. And until it has, modifying the page cache
> page means modifying the data the disk will get. The best approach is
> probably to transition pages to the writeback state and handle them
> like any other case requiring stable pages.
Yes, true. Looking at the code, it does seem like we would also need to
rely on the stable pages mechanism to ensure nobody changes the buffers
till the IO has actually finished.
I think the right way to go would be to first start with an
implementation of RWF_WRITETHROUGH and then utilize that, plus stable
pages, to enable RWF_ATOMIC for buffered IO.
Regards,
ojaswin
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
@ 2026-03-08 9:19 Ritesh Harjani
2026-03-08 15:33 ` Andres Freund
0 siblings, 1 reply; 18+ messages in thread
From: Ritesh Harjani @ 2026-03-08 9:19 UTC (permalink / raw)
To: Andres Freund, Amir Goldstein
Cc: Christoph Hellwig, Pankaj Raghav, linux-xfs, linux-mm,
linux-fsdevel, lsf-pc, djwong, john.g.garry, willy, jack, ojaswin,
Luis Chamberlain, dchinner, Javier Gonzalez, gost.dev, tytso,
p.raghav, vi.shah
Andres Freund <andres@anarazel.de> writes:
Hi,
> Hi,
>
> On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
>> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
>> >
>> > I think a better session would be how we can help postgres to move
>> > off buffered I/O instead of adding more special cases for them.
>
> FWIW, we are adding support for DIO (it's been added, but performance isn't
> competitive for most workloads in the released versions yet, work to address
> those issues is in progress).
>
Is postgres also planning to evaluate the performance gains from using
DIO atomic writes, already available in the upstream Linux kernel? What
would be interesting to see is the relative %delta of DIO atomic writes
v/s DIO non-atomic writes.
That being said, I understand the discussion in this wider thread is
also around supporting write-through in Linux and then adding support
for atomic writes on top of that. We have an early prototype of that
design ready and Ojaswin will soon be posting it.
-ritesh
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes
2026-03-08 9:19 [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes Ritesh Harjani
@ 2026-03-08 15:33 ` Andres Freund
0 siblings, 0 replies; 18+ messages in thread
From: Andres Freund @ 2026-03-08 15:33 UTC (permalink / raw)
To: Ritesh Harjani
Cc: Amir Goldstein, Christoph Hellwig, Pankaj Raghav, linux-xfs,
linux-mm, linux-fsdevel, lsf-pc, djwong, john.g.garry, willy,
jack, ojaswin, Luis Chamberlain, dchinner, Javier Gonzalez,
gost.dev, tytso, p.raghav, vi.shah
Hi,
On 2026-03-08 14:49:21 +0530, Ritesh Harjani wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2026-02-17 10:23:36 +0100, Amir Goldstein wrote:
> >> On Tue, Feb 17, 2026 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >> >
> >> > I think a better session would be how we can help postgres to move
> >> > off buffered I/O instead of adding more special cases for them.
> >
> > FWIW, we are adding support for DIO (it's been added, but performance isn't
> > competitive for most workloads in the released versions yet, work to address
> > those issues is in progress).
> >
>
> Is postgres also planning to evaluate the performance gains by using DIO
> atomic writes available in upstream linux kernel? What would be
> interesting to see is the relative %delta with DIO atomic-writes v/s
> DIO non atomic writes.
For some limited workloads that comparison is possible today with minimal work
(albeit with some safety compromises, due to postgres not yet verifying that
the atomic boundaries are correct, but it's good enough for experiments), as
you can just disable the torn-page avoidance with a configuration parameter.
The gains from not needing full page writes (postgres' mechanism to protect
against torn pages) can be rather significant, as full page writes incur
substantial overhead due to the higher journalling volume. The worst part is
that the cost decreases between checkpoints (because we don't need to
repeatedly log a full page image for the same page), only to increase again
when the next checkpoint starts. It's not uncommon that in the phase just
after the start of a checkpoint, WAL is over 90% full page writes (when full
page write compression is not enabled), while later the same workload has
only a very small percentage of that overhead. The biggest gain from atomic
writes will be the more even performance (important for real world users),
rather than the absolute increase in throughput.
Normal gains during the full page intensive phase are probably on the order of
20-35% for workload with many small transactions, bigger for workloads with
larger transactions. But if the increase in WAL volume pushes you above the
disk write throughput, the gains can be almost arbitrarily larger. E.g. on a
cloud disk with 100MB/s of write bandwidth, the difference between WAL
throughput of 50MB/s without full page writes and the same workload with full
page images generating ~300MB/s of WAL will obviously mean that you get
less than a third of the transaction throughput, while also not having any spare IO
capacity for anything other than WAL writes.
The reason I say limited workloads above is that upstream postgres does not
yet do smart enough write combining with DIO for data writes, I'd expect that
to be addressed later this year (but it's community open source, as you
presumably know from experience, that's not always easy to predict /
control). If the workload has a large fraction of data writes, that
overhead makes the DIO numbers unrealistic.
Unfortunately all this means that the gains from atomic writes, be it for
buffered or direct IO, will very heavily depend on the chosen workload
and by tweaking the workload / hardware you can inflate the gains to an almost
arbitrarily large degree.
This is also about more than throughput / latency, as the volume of WAL also
impacts the cost of retaining the WAL - often that's done for a while to allow
point-in-time-recovery (i.e. recovering an older base backup up to a precise
point in time, to recover from application bugs or operator errors).
Greetings,
Andres Freund
end of thread, other threads:[~2026-03-08 15:33 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-08 9:19 [Lsf-pc] [LSF/MM/BPF TOPIC] Buffered atomic writes Ritesh Harjani
2026-03-08 15:33 ` Andres Freund
-- strict thread matches above, loose matches on Subject: below --
2026-02-13 10:20 Pankaj Raghav
2026-02-13 13:32 ` Ojaswin Mujoo
2026-02-16 9:52 ` Pankaj Raghav
2026-02-17 17:20 ` Ojaswin Mujoo
2026-02-18 17:42 ` [Lsf-pc] " Jan Kara
2026-02-18 20:22 ` Ojaswin Mujoo
2026-02-16 11:38 ` Jan Kara
2026-02-16 13:18 ` Pankaj Raghav
2026-02-17 18:36 ` Ojaswin Mujoo
2026-02-16 15:57 ` Andres Freund
2026-02-17 18:39 ` Ojaswin Mujoo
2026-02-18 0:26 ` Dave Chinner
2026-02-18 6:49 ` Christoph Hellwig
2026-02-18 12:54 ` Ojaswin Mujoo
2026-02-17 5:51 ` Christoph Hellwig
2026-02-17 9:23 ` [Lsf-pc] " Amir Goldstein
2026-02-17 15:47 ` Andres Freund
2026-02-17 22:45 ` Dave Chinner
2026-02-18 4:10 ` Andres Freund
2026-02-18 6:53 ` Christoph Hellwig
2026-02-18 6:51 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox