Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O

linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
       [not found] <20251003093213.52624-1-xemul@scylladb.com>
@ 2025-10-04  4:26 ` Christoph Hellwig
  2025-10-04 16:08   ` Andy Lutomirski
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-04  4:26 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs

On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> add a FMODE flag to make XFS invisible I/O less hacky. Back then it
> was suggested that this flag is propagated to a O_NOCMTIME one.

skipping c/mtime is dangerous.  The XFS handle code allows it to
support HSM where data is migrated out to tape, and requires
CAP_SYS_ADMIN.  Allowing it for any file owner would expand the scope
for too much as now everyone could skip timestamp updates.

> It can be used by workloads that want to write a file but don't care
> much about the preciese timestamp on it and can update it later with
> utimens() call.

The workload might not care, the rest of the system does.  ctime can't
bet set to arbitrary values, so it is important for backups and as
an audit trail.

> There's another reason for having this patch. When performing AIO write,
> the file_modified_flags() function checks whether or not to update inode
> times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> the check return EINTR error that quickly propagates into cb completion
> without doing any IO. This restriction effectively prevents doing AIO
> writes with nowait flag, as file modifications really imply time update.

Well, we'll need to look into that, including maybe non-blockin
timestamp updates.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-04  4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig
@ 2025-10-04 16:08   ` Andy Lutomirski
  2025-10-07  5:08     ` Christoph Hellwig
  2025-10-05 22:06   ` Dave Chinner
  2025-10-05 23:38   ` Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-04 16:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
	linux-xfs

On Fri, Oct 3, 2025 at 9:26 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> > add a FMODE flag to make XFS invisible I/O less hacky. Back then it
> > was suggested that this flag is propagated to a O_NOCMTIME one.
>
> skipping c/mtime is dangerous.  The XFS handle code allows it to
> support HSM where data is migrated out to tape, and requires
> CAP_SYS_ADMIN.  Allowing it for any file owner would expand the scope
> for too much as now everyone could skip timestamp updates.
>
> > It can be used by workloads that want to write a file but don't care
> > much about the preciese timestamp on it and can update it later with
> > utimens() call.
>
> The workload might not care, the rest of the system does.  ctime can't
> bet set to arbitrary values, so it is important for backups and as
> an audit trail.
>
> > There's another reason for having this patch. When performing AIO write,
> > the file_modified_flags() function checks whether or not to update inode
> > times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> > the check return EINTR error that quickly propagates into cb completion
> > without doing any IO. This restriction effectively prevents doing AIO
> > writes with nowait flag, as file modifications really imply time update.
>
> Well, we'll need to look into that, including maybe non-blockin
> timestamp updates.
>

It's been 12 years (!), but maybe it's time to reconsider this:

https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/

Nothing has fundamentally changed since then, but I bet enough little
things (folios!) have changed around this series that it won't apply
without considerably massaging.  I stopped working on it personally
because I moved the workload in question onto fast, fancy SSDs
resulting in my having bigger fish to fry.  I don't think I'll have
the bandwidth to pick it up any time soon, but maybe one of you folks
is interested :)  I never looked into the AIO path (I was interested
in the page_mkwrite path), but my series made it at least conceptually
possible to unconditionally mark the file as needing a cmtime update
when presently dirty data is written back, and I imagine that AIO
could use that too to avoid ever needing to bail out because an mtime
update would block.

To the extent that ctime is "important for backups", it's been *wrong*
for backups approximately forever -- one can read ctime, then read the
contents of a file, and get a new ctime and an old copy of the data
that preceeds the modification that logically triggered the ctime
value that was read.

--Andy
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-04 16:08   ` Andy Lutomirski
@ 2025-10-07  5:08     ` Christoph Hellwig
  2025-10-08 15:22       ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-07  5:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > Well, we'll need to look into that, including maybe non-blockin
> > timestamp updates.
> >
> 
> It's been 12 years (!), but maybe it's time to reconsider this:
> 
> https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/

I don't see how that is relevant here.  Also writes through shared
mmaps are problematic for so many reasons that I'm not sure we want
to encourage people to use that more.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-07  5:08     ` Christoph Hellwig
@ 2025-10-08 15:22       ` Andy Lutomirski
  2025-10-08 21:27         ` Dave Chinner
  2025-10-10  5:27         ` Christoph Hellwig
  0 siblings, 2 replies; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-08 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
	linux-xfs

On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > Well, we'll need to look into that, including maybe non-blockin
> > > timestamp updates.
> > >
> >
> > It's been 12 years (!), but maybe it's time to reconsider this:
> >
> > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
>
> I don't see how that is relevant here.  Also writes through shared
> mmaps are problematic for so many reasons that I'm not sure we want
> to encourage people to use that more.
>

Because the same exact issue exists in the normal non-mmap write path,
and I can even quote you upthread :)

> Well, we'll need to look into that, including maybe non-blockin
timestamp updates.

I assume the code path that inspired this thread in the first place is:

ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        struct file *file = iocb->ki_filp;
        struct address_space *mapping = file->f_mapping;
        struct inode *inode = mapping->host;
        ssize_t ret;

        ret = file_remove_privs(file);
        if (ret)
                return ret;

        ret = file_update_time(file);

and this has *exactly* the same problem as the shared-mmap write path:
it synchronously updates the time (well, synchronously enough that it
sometimes blocks), and it does so before updating the file contents
(although the window during which the timestamp is updated and the
contents are not is not as absurdly long as it is in the mmap case).

Now my series does not change any of this, but I'm thinking more of
the concept: instead of doing file/inode_update_time when a file is
logically written (in write_iter, page_mkwrite, etc), set a flag so
that the writeback code knows that the timestamp needs updating.
Thinking out loud, to handle both write_iter and mmap, there might
need to be two bits: one saying "the timestamp needs to be updated"
and another saying "the timestamp has been updated in the in-memory
inode, but the inode hasn't been dirtied yet".  And maybe the latter
is doable entirely within fs-specific code without any help from the
generic code, but it might still be nice to keep generic_update_time
usable for filesystems that want to do this.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 15:22       ` Andy Lutomirski
@ 2025-10-08 21:27         ` Dave Chinner
  2025-10-08 21:51           ` Andy Lutomirski
  2025-10-10  5:27         ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-10-08 21:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > Well, we'll need to look into that, including maybe non-blockin
> > > > timestamp updates.
> > > >
> > >
> > > It's been 12 years (!), but maybe it's time to reconsider this:
> > >
> > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> >
> > I don't see how that is relevant here.  Also writes through shared
> > mmaps are problematic for so many reasons that I'm not sure we want
> > to encourage people to use that more.
> >
> 
> Because the same exact issue exists in the normal non-mmap write path,
> and I can even quote you upthread :)
> 
> > Well, we'll need to look into that, including maybe non-blockin
> timestamp updates.
> 
> I assume the code path that inspired this thread in the first place is:
> 
> ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
>         struct file *file = iocb->ki_filp;
>         struct address_space *mapping = file->f_mapping;
>         struct inode *inode = mapping->host;
>         ssize_t ret;
> 
>         ret = file_remove_privs(file);
>         if (ret)
>                 return ret;
> 
>         ret = file_update_time(file);
> 
> and this has *exactly* the same problem as the shared-mmap write path:
> it synchronously updates the time (well, synchronously enough that it
> sometimes blocks),

You are conflating "synchronous update" with "blocking".

Avoiding the need for synchronous timestamp updates is exactly what
the lazytime mount option provides. i.e. lazytime degrades immediate
consistency requirements to eventual consistency similar to how the
default relatime behaviour defers atime updates for eventual
writeback.

IOWs, we've already largely addressed the synchronous c/mtime update
problem but what we haven't done is made timestamp updates
fully support non-blocking caller semantics. That's a separate
problem...

> and it does so before updating the file contents
> (although the window during which the timestamp is updated and the
> contents are not is not as absurdly long as it is in the mmap case).
> 
> Now my series does not change any of this, but I'm thinking more of
> the concept: instead of doing file/inode_update_time when a file is
> logically written (in write_iter, page_mkwrite, etc), set a flag so
> that the writeback code knows that the timestamp needs updating.

This is exactly what lazytime implements with the I_DIRTY_FLAG.

During writeback, if the filesystem has to modify other metadata in
the inode (e.g. block allocation), the filesystem will piggyback the
persistent update of the dirty timestamps on that modification and
clear the I_DIRTY_TIME flag.

However, if the writeback operation is a pure overwrite, then there
is no metadata modifiction occuring and so we leave the inode
I_DIRTY_TIME dirty for a future metadata persistence operation to
clean them.

IOWs, with lazytime, writeback already persists timestamp updates
when appropriate for best performance.

> Thinking out loud, to handle both write_iter and mmap, there might
> need to be two bits: one saying "the timestamp needs to be updated"
> and another saying "the timestamp has been updated in the in-memory
> inode, but the inode hasn't been dirtied yet".

The flag that implements the latter is called I_DIRTY_TIME. We have
not implemented the former as that's a userspace visible change of
behaviour.

> And maybe the latter
> is doable entirely within fs-specific code without any help from the
> generic code, but it might still be nice to keep generic_update_time
> usable for filesystems that want to do this.

generic_update_time() already supports I_DIRTY_TIME semantics.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 21:27         ` Dave Chinner
@ 2025-10-08 21:51           ` Andy Lutomirski
  2025-10-11  1:35             ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-08 21:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:

>
> You are conflating "synchronous update" with "blocking".
>
> Avoiding the need for synchronous timestamp updates is exactly what
> the lazytime mount option provides. i.e. lazytime degrades immediate
> consistency requirements to eventual consistency similar to how the
> default relatime behaviour defers atime updates for eventual
> writeback.
>
> IOWs, we've already largely addressed the synchronous c/mtime update
> problem but what we haven't done is made timestamp updates
> fully support non-blocking caller semantics. That's a separate
> problem...

I'm probably missing something, but is this really different?  Either
the mtime update can block or it can't block.  I haven't dug all the
way into exactly what happens in __mark_inode_dirty(), but there is a
lot going on in there even in the I_DIRTY_TIME path.  And Pavel is
saying that AIO and mtime updates don't play along well.

>
> > and it does so before updating the file contents
> > (although the window during which the timestamp is updated and the
> > contents are not is not as absurdly long as it is in the mmap case).
> >
> > Now my series does not change any of this, but I'm thinking more of
> > the concept: instead of doing file/inode_update_time when a file is
> > logically written (in write_iter, page_mkwrite, etc), set a flag so
> > that the writeback code knows that the timestamp needs updating.
>
> This is exactly what lazytime implements with the I_DIRTY_FLAG.
>
> During writeback, if the filesystem has to modify other metadata in
> the inode (e.g. block allocation), the filesystem will piggyback the
> persistent update of the dirty timestamps on that modification and
> clear the I_DIRTY_TIME flag.
>
> However, if the writeback operation is a pure overwrite, then there
> is no metadata modifiction occuring and so we leave the inode
> I_DIRTY_TIME dirty for a future metadata persistence operation to
> clean them.
>
> IOWs, with lazytime, writeback already persists timestamp updates
> when appropriate for best performance.

I'm probably doing a bad job explaining myself.

In my series, I move (for page_mkwrite only) the mtime update,
*including dirtying the inode* to the writeback path, which makes it
fully non-blocking / asynchronous / whatever you want to call it at
the time that page_mkwrite is called.  More concretely, my suggestion
is to be a bit lazier than current lazytime and not dirty the inode
*at all* in write_iter, or at least not dirty it for the purpose of
timestamp updates.  Instead set a flag somewhere that it cannot be
forgotten about -- in my series, it's this patch:

https://lore.kernel.org/all/f2ac22142b4634b55ff6858d159b45dac96f81b6.1377193658.git.luto@amacapital.net/

and it's a single atomic bit in struct address_space.  The idea is
that there is approximately no additional overhead at the time that
the page cache is dirtied for cmtime-related inode dirtying and that
all such overhead is deferred to the writeback path when it's as
asynchronous as possible from the perspective of whatever user code
dirtied the page cache.  My page_set_cmtime() is completely lockless.

My series is far from perfect, but I did test it with real workloads
12-ish years ago, on overworked HDDs, with latencytop, and it worked.
Performance was vastly improved (using mmap, not write(), obviously).

>
> > Thinking out loud, to handle both write_iter and mmap, there might
> > need to be two bits: one saying "the timestamp needs to be updated"
> > and another saying "the timestamp has been updated in the in-memory
> > inode, but the inode hasn't been dirtied yet".
>
> The flag that implements the latter is called I_DIRTY_TIME. We have
> not implemented the former as that's a userspace visible change of
> behaviour.

Maybe that change should be done?  Or not -- it wouldn't be terribly
hard to have a pair of atomic timestamps in struct inode indicating
what timestamps we want to write the next time we get around to it.
(Concretely, page_set_cmtime() would get some new parameters to
specify actual times, and atomic compare exchange would be used to
update the underlying data structure, so it would remain lock-free but
not be wait-free.)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 21:51           ` Andy Lutomirski
@ 2025-10-11  1:35             ` Dave Chinner
  2025-10-11  4:04               ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-10-11  1:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> 
> >
> > You are conflating "synchronous update" with "blocking".
> >
> > Avoiding the need for synchronous timestamp updates is exactly what
> > the lazytime mount option provides. i.e. lazytime degrades immediate
> > consistency requirements to eventual consistency similar to how the
> > default relatime behaviour defers atime updates for eventual
> > writeback.
> >
> > IOWs, we've already largely addressed the synchronous c/mtime update
> > problem but what we haven't done is made timestamp updates
> > fully support non-blocking caller semantics. That's a separate
> > problem...
> 
> I'm probably missing something, but is this really different?

Yes, and yes.

> Either the mtime update can block or it can't block.

Sure, but that's not the issue we have to deal with.

In many filesystems and fs operations, we have to know if an
operation is going to block -before- we start the operation. e.g.
transactional changes cannot be rolled back once we've started the
modification if they need to block to make progress (e.g. read in
on-disk metadata).

This foresight, in many cases, is -unknowable-. Even though the
operation /likely/ won't block, we cannot *guarantee* ahead of time
that any given instance of the operation will /not/ block.  Hence
the reliable non-blocking operation that users are asking for is not
possible with unknowable implementation characteristics like this.

IOWs, a timestamp update implementation can be synchronous and
reliably non-blocking if it always knows when blocking will occur
and can return -EAGAIN instead of blocking to complete the
operation.

If it can't know when/if blocking will occur, then lazytime allows
us to defer the (potentially) blocking update operation to another
context that can block. Queuing for async processing can easily be
made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
for us.

So, yeah, it should be pretty obvious at this point that non-blocking
implementation is completely independent of whether the operation is
performed synchronously or asynchronously. It's easier to make async
operations non-blocking, but that doesn't mean "non_blocking" and
"asynchronous execution" are interchangable terms or behaviours.

> I haven't dug all the
> way into exactly what happens in __mark_inode_dirty(), but there is a
> lot going on in there even in the I_DIRTY_TIME path.

It's pretty simple, really.  __mark_inode_dirty(I_DIRTY_TIME) is
non-blocking and queues the inode on the wb->i_dirty_time queue
for later processing.

> And Pavel is
> saying that AIO and mtime updates don't play along well.

Again: this is exactly why lazytime was added to XFS *ten years
ago*. From 2015 (issue #3):

https://lore.kernel.org/linux-xfs/CAD-J=zZh1dtJsfrW_Gwxjg+qvkZMu7ED-QOXrMMO6B-G0HY2-A@mail.gmail.com/

(Oh, look, a discussion that starts from a user suggestion of
exposing FMODE_NOCMTIME to userspace apps! Sound familiar?)

> > IOWs, with lazytime, writeback already persists timestamp updates
> > when appropriate for best performance.
> 
> I'm probably doing a bad job explaining myself.

No, I think both Christoph and I both understand exactly what you
are trying to describe.

It seems to me that haven't yet understood that lazytime already
does exactly what you are asking for. Hence you think we don't
understand the "lazytime" concept you are proposing and keep trying
to reinvent lazytime to convince us that we need "lazytime"
functionalitying in the kernel...

> > > Thinking out loud, to handle both write_iter and mmap, there might
> > > need to be two bits: one saying "the timestamp needs to be updated"
> > > and another saying "the timestamp has been updated in the in-memory
> > > inode, but the inode hasn't been dirtied yet".
> >
> > The flag that implements the latter is called I_DIRTY_TIME. We have
> > not implemented the former as that's a userspace visible change of
> > behaviour.
> 
> Maybe that change should be done?  Or not -- it wouldn't be terribly
> hard to have a pair of atomic timestamps in struct inode indicating
> what timestamps we want to write the next time we get around to it.

See, you just reinvented the lazytime mechanism. Again. :/

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-11  1:35             ` Dave Chinner
@ 2025-10-11  4:04               ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-11  4:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Fri, Oct 10, 2025 at 6:35 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> > On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> >
> > >
> > > You are conflating "synchronous update" with "blocking".
> > >
> > > Avoiding the need for synchronous timestamp updates is exactly what
> > > the lazytime mount option provides. i.e. lazytime degrades immediate
> > > consistency requirements to eventual consistency similar to how the
> > > default relatime behaviour defers atime updates for eventual
> > > writeback.
> > >
> > > IOWs, we've already largely addressed the synchronous c/mtime update
> > > problem but what we haven't done is made timestamp updates
> > > fully support non-blocking caller semantics. That's a separate
> > > problem...
> >
> > I'm probably missing something, but is this really different?
>
> Yes, and yes.
>
> > Either the mtime update can block or it can't block.
>
> Sure, but that's not the issue we have to deal with.
>
> In many filesystems and fs operations, we have to know if an
> operation is going to block -before- we start the operation. e.g.
> transactional changes cannot be rolled back once we've started the
> modification if they need to block to make progress (e.g. read in
> on-disk metadata).
>
> This foresight, in many cases, is -unknowable-. Even though the
> operation /likely/ won't block, we cannot *guarantee* ahead of time
> that any given instance of the operation will /not/ block.  Hence
> the reliable non-blocking operation that users are asking for is not
> possible with unknowable implementation characteristics like this.
>
> IOWs, a timestamp update implementation can be synchronous and
> reliably non-blocking if it always knows when blocking will occur
> and can return -EAGAIN instead of blocking to complete the
> operation.
>
> If it can't know when/if blocking will occur, then lazytime allows
> us to defer the (potentially) blocking update operation to another
> context that can block. Queuing for async processing can easily be
> made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
> for us.
>
> So, yeah, it should be pretty obvious at this point that non-blocking
> implementation is completely independent of whether the operation is
> performed synchronously or asynchronously. It's easier to make async
> operations non-blocking, but that doesn't mean "non_blocking" and
> "asynchronous execution" are interchangable terms or behaviours.
>
> > I haven't dug all the
> > way into exactly what happens in __mark_inode_dirty(), but there is a
> > lot going on in there even in the I_DIRTY_TIME path.
>
> It's pretty simple, really.  __mark_inode_dirty(I_DIRTY_TIME) is
> non-blocking and queues the inode on the wb->i_dirty_time queue
> for later processing.
>

First, I apologize if I'm off base here.

Second, I don't think I'm entirely nuts, and I'm moderately confident
that, ten-ish years ago, I tested lazytime in the hopes that it would
solve my old problem, and IIRC it didn't help.  I was running a
production workload on ext4 on regrettably slow spinning rust backed
by a truly atrocious HPE controller.  And I was running latencytop to
generate little traces when my task got blocked, and there was no form
of AIO involved.  (And I don't really understand how AIO is wired up
internally...  And yes, in retrospect I should not have been using
shared-writable mmaps or even file-backed things at all for what I was
doing, but I had unrealistic expectations of how mmap worked when I
wrote that code more like 20 years ago, and I wasn't even using Linux
at the time I wrote it.)

I'm looking at the code now, and I see what you're talking about, and
__mark_inode_dirty(inode, I_DIRTY_TIME) looks fairly polite and like
it won't block.  But the relevant code seems to be:

int generic_update_time(struct inode *inode, int flags)
{
        int updated = inode_update_timestamps(inode, flags);
        int dirty_flags = 0;

        if (updated & (S_ATIME|S_MTIME|S_CTIME))
                dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ?
I_DIRTY_TIME : I_DIRTY_SYNC;
        if (updated & S_VERSION)
                dirty_flags |= I_DIRTY_SYNC;
        __mark_inode_dirty(inode, dirty_flags);
        ...

inode_update_timestamps does this, where updated != 0 if the timestamp
actually changed (which is subject to some complex coarse-graining
logic so it may only happen some of the time):

                if (IS_I_VERSION(inode) &&
inode_maybe_inc_iversion(inode, updated))
                        updated |= S_VERSION;

IS_I_VERSION seems to be unconditionally true on ext4.
inode_maybe_inc_iversion always returns true if updated is set, so
generic_update_time has a decent chance of doing
__mark_inode_dirty(inode, I_DIRTY_SYNC), which calls
s_op->dirty_inode, which calls ext4_journal_start, which, from my
recollection a decade ago, could easily block for a good second or so
on my delightful, now retired, HP/HPE system.

In my case, I think this is the path that was blocking for me in lots
of do_wp_page calls that would otherwise not have blocked.  I also
don't see any kiocb passed around or any mechanism by which this code
could know that it's supposed to be nonblocking, although I have
approximately no understanding of Linux AIO and I don't really know
what I should be looking for.

I could try to instrument the code a bit and test to see if I've
analyzed it right in a few days.

--Andy
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 15:22       ` Andy Lutomirski
  2025-10-08 21:27         ` Dave Chinner
@ 2025-10-10  5:27         ` Christoph Hellwig
  2025-10-10 17:35           ` Andy Lutomirski
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-10  5:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > Well, we'll need to look into that, including maybe non-blockin
> > > > timestamp updates.
> > > >
> > >
> > > It's been 12 years (!), but maybe it's time to reconsider this:
> > >
> > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> >
> > I don't see how that is relevant here.  Also writes through shared
> > mmaps are problematic for so many reasons that I'm not sure we want
> > to encourage people to use that more.
> >
> 
> Because the same exact issue exists in the normal non-mmap write path,
> and I can even quote you upthread :)

The thread that started this is about io_uring nonblock writes, aka
O_DIRECT.  So there isn't any writeback to defer to. 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-10  5:27         ` Christoph Hellwig
@ 2025-10-10 17:35           ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-10 17:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
	linux-xfs

On Thu, Oct 9, 2025 at 10:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > > Well, we'll need to look into that, including maybe non-blockin
> > > > > timestamp updates.
> > > > >
> > > >
> > > > It's been 12 years (!), but maybe it's time to reconsider this:
> > > >
> > > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> > >
> > > I don't see how that is relevant here.  Also writes through shared
> > > mmaps are problematic for so many reasons that I'm not sure we want
> > > to encourage people to use that more.
> > >
> >
> > Because the same exact issue exists in the normal non-mmap write path,
> > and I can even quote you upthread :)
>
> The thread that started this is about io_uring nonblock writes, aka
> O_DIRECT.  So there isn't any writeback to defer to.

I haven't followed all the internal details, but RWF_DONTCACHE is
looking pretty good these days, and it does go through the writeback
path.  I wonder if it's getting good enough that most or all O_DIRECT
users could switch to using it.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-04  4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig
  2025-10-04 16:08   ` Andy Lutomirski
@ 2025-10-05 22:06   ` Dave Chinner
  2025-10-07  5:10     ` Christoph Hellwig
  2025-10-05 23:38   ` Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-10-05 22:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
	linux-xfs

On Fri, Oct 03, 2025 at 09:26:50PM -0700, Christoph Hellwig wrote:
> On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> > add a FMODE flag to make XFS invisible I/O less hacky. Back then it
> > was suggested that this flag is propagated to a O_NOCMTIME one.
> 
> skipping c/mtime is dangerous.  The XFS handle code allows it to
> support HSM where data is migrated out to tape, and requires
> CAP_SYS_ADMIN.  Allowing it for any file owner would expand the scope
> for too much as now everyone could skip timestamp updates.
> 
> > It can be used by workloads that want to write a file but don't care
> > much about the preciese timestamp on it and can update it later with
> > utimens() call.

If you don't care about accurate c/mtime, then mount the filesystem
with '-o lazytime' to degrade c/mtime updates to "eventual
consistency" behaviour for IO operations. If inode metadata is
otherwise modified (e.g. block allocation during IO) or the
application then calls utimens(), it will update the recorded
in-memory timestamps in a persistent manner immediately.

> The workload might not care, the rest of the system does.  ctime can't
> bet set to arbitrary values, so it is important for backups and as
> an audit trail.

But we can (and do) delay the persistence of IO-based timestamp
updates with the lazytime option.

> > There's another reason for having this patch. When performing AIO write,
> > the file_modified_flags() function checks whether or not to update inode
> > times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> > the check return EINTR error that quickly propagates into cb completion
> > without doing any IO. This restriction effectively prevents doing AIO
> > writes with nowait flag, as file modifications really imply time update.
> 
> Well, we'll need to look into that, including maybe non-blockin
> timestamp updates.

Lazytime updates can generally be done in a non-blocking manner
right now (someone raised that in the context of io-uring on #xfs
about a month ago), but the NOWAIT behaviour for timestamp updates
is done at a higher level in the VFS and does not take into account
filesystem specific non-blocking lazytime updates at all.  If we
push the NOWAIT checking behaviour down to the filesystem, we can do
this.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-05 22:06   ` Dave Chinner
@ 2025-10-07  5:10     ` Christoph Hellwig
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-07  5:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Mon, Oct 06, 2025 at 09:06:40AM +1100, Dave Chinner wrote:
> If you don't care about accurate c/mtime, then mount the filesystem
> with '-o lazytime' to degrade c/mtime updates to "eventual
> consistency" behaviour for IO operations.

Exactly.

> Lazytime updates can generally be done in a non-blocking manner
> right now (someone raised that in the context of io-uring on #xfs
> about a month ago), but the NOWAIT behaviour for timestamp updates
> is done at a higher level in the VFS and does not take into account
> filesystem specific non-blocking lazytime updates at all.  If we
> push the NOWAIT checking behaviour down to the filesystem, we can do
> this.

We might not even have to push it out, but just make the VFS/rw helper
check aware of lazytime.  Either way currently even a lazytime
timestampt update will cause a write to block, which renders the
nowait writes pretty useless on anything but block devices.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-04  4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig
  2025-10-04 16:08   ` Andy Lutomirski
  2025-10-05 22:06   ` Dave Chinner
@ 2025-10-05 23:38   ` Dave Chinner
  2025-10-06  2:16     ` Theodore Ts'o
  2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-10-05 23:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api,
	linux-xfs

On Fri, Oct 03, 2025 at 09:26:50PM -0700, Christoph Hellwig wrote:
> On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> > add a FMODE flag to make XFS invisible I/O less hacky. Back then it
> > was suggested that this flag is propagated to a O_NOCMTIME one.
> 
> skipping c/mtime is dangerous.  The XFS handle code allows it to
> support HSM where data is migrated out to tape, and requires
> CAP_SYS_ADMIN.  Allowing it for any file owner would expand the scope
> for too much as now everyone could skip timestamp updates.

We have already provided a safe method for minimising the overhead
of c/mtime updates in the IO path - it's called lazytime.  The
lazytime mount option provides eventual consistency for c/mtime
updates for IO operations instead of immediate consistency.

Timestamps are still updated to have the correct values, but the
latency/performance of the timestamp updates is greatly improved by
holding them purely in memory until some other trigger forces them
to be persisted to disk.

> > It can be used by workloads that want to write a file but don't care
> > much about the preciese timestamp on it and can update it later with
> > utimens() call.
> 
> The workload might not care, the rest of the system does.  ctime can't
> bet set to arbitrary values, so it is important for backups and as
> an audit trail.

Lazytime works for this use case; a call to utimens() will cause a
persistent update of the timestamps. As will any other inode
modification that has persistence requirements (e.g.  block
allocation during IO or other syscalls that modify inode metadata).

> > There's another reason for having this patch. When performing AIO write,
> > the file_modified_flags() function checks whether or not to update inode
> > times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> > the check return EINTR error that quickly propagates into cb completion
> > without doing any IO. This restriction effectively prevents doing AIO
> > writes with nowait flag, as file modifications really imply time update.
> 
> Well, we'll need to look into that, including maybe non-blockin
> timestamp updates.

This came up recently on #xfs w.r.t. lazytime behaviour - we need to
pass the NOWAIT decision semnatics down to the filesystem to allow
lazytime to be truly non-blocking.  At the moment the high level VFS
NOWAIT checks (via inode_needs_update_time()) have no visibility of
this filesystem specific functionality, so even if we can do the
lazy timestamp update without blocking we still give an -EAGAIN if
IOCB_NOWAIT is set.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-05 23:38   ` Dave Chinner
@ 2025-10-06  2:16     ` Theodore Ts'o
  0 siblings, 0 replies; 14+ messages in thread
From: Theodore Ts'o @ 2025-10-06  2:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Mon, Oct 06, 2025 at 10:38:20AM +1100, Dave Chinner wrote:
> We have already provided a safe method for minimising the overhead
> of c/mtime updates in the IO path - it's called lazytime.  The
> lazytime mount option provides eventual consistency for c/mtime
> updates for IO operations instead of immediate consistency.
> 
> Timestamps are still updated to have the correct values, but the
> latency/performance of the timestamp updates is greatly improved by
> holding them purely in memory until some other trigger forces them
> to be persisted to disk.

Specifically, the timestamps are persisted to stable store when (a)
the file system is unmounted, (b) when the inode needs to be pushed
out to memory due to memory pressure, (c) when the inode is forcibly
persisted using fsync(), (d) when some other inode field is updated,
and the inode gets written out, or (e) after 24 hours.

As a result, the on-disk timestamps will be at most 24 hours stale.
But this is POSIX compliant, because if you read the timestamps using
stat(1), you will get the updated values, and what happens after a
crash in the absense of an fsync(2) is not defined.

The reason why we implemented this at $WORK is you are constantly
updating a database using fdatasync(2), and you care about 99.9
percentage I/O latency, the 4k writes to the inode table will
eventually triger a hard drive's Adjacent Track Interference (ATI)
mitigation, which involves rewriting set of disk tracks to avoid the
analog signal for adjacent tracks getting weakened by the hot-spot
writes, and this is measurable if you are looking at long-tail I/O
latencies.  (And yes, we had to talk to our HDD vendors to figure out
this is what was going on, since performance is out of scop[e of
SCSI/SATA specifications.  Hence, random long-tail ATI latencies to
preserve data integrity is allowed, and in fact, actually a good
thing.  :-)

					- Ted

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-10-11  4:04 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20251003093213.52624-1-xemul@scylladb.com>
2025-10-04  4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig
2025-10-04 16:08   ` Andy Lutomirski
2025-10-07  5:08     ` Christoph Hellwig
2025-10-08 15:22       ` Andy Lutomirski
2025-10-08 21:27         ` Dave Chinner
2025-10-08 21:51           ` Andy Lutomirski
2025-10-11  1:35             ` Dave Chinner
2025-10-11  4:04               ` Andy Lutomirski
2025-10-10  5:27         ` Christoph Hellwig
2025-10-10 17:35           ` Andy Lutomirski
2025-10-05 22:06   ` Dave Chinner
2025-10-07  5:10     ` Christoph Hellwig
2025-10-05 23:38   ` Dave Chinner
2025-10-06  2:16     ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).