* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
       [not found] <20251003093213.52624-1-xemul@scylladb.com>
@ 2025-10-04  4:26 ` Christoph Hellwig
  2025-10-04 16:08   ` Andy Lutomirski
                     ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-04  4:26 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs

On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> add a FMODE flag to make XFS invisible I/O less hacky). Back then it
> was suggested that this flag is propagated to a O_NOCMTIME one.

Skipping c/mtime updates is dangerous. The XFS handle code allows it to
support HSM, where data is migrated out to tape, and requires
CAP_SYS_ADMIN. Allowing it for any file owner would expand the scope
far too much, as now everyone could skip timestamp updates.

> It can be used by workloads that want to write a file but don't care
> much about the precise timestamp on it and can update it later with
> utimens() call.

The workload might not care, the rest of the system does. ctime can't
be set to arbitrary values, so it is important for backups and as
an audit trail.

> There's another reason for having this patch. When performing AIO write,
> the file_modified_flags() function checks whether or not to update inode
> times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> the check returns EINTR error that quickly propagates into cb completion
> without doing any IO. This restriction effectively prevents doing AIO
> writes with nowait flag, as file modifications really imply time update.

Well, we'll need to look into that, including maybe non-blocking
timestamp updates.

^ permalink raw reply	[flat|nested] 14+ messages in thread
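The point that ctime cannot be set to arbitrary values can be checked from userspace. The following is a minimal sketch, assuming a POSIX system with futimens(); the helper name backdate_and_stat() and the scratch path in the usage note are made up for illustration and are not part of the patch under discussion. It backdates atime and mtime on a file and reports what fstat() then shows for mtime and ctime.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) `path`, set its atime and mtime
 * to `when` (seconds since the epoch), and report the mtime and ctime
 * that fstat() shows afterwards.
 * Returns 0 on success, -1 on failure (errno set by the failing call).
 */
int backdate_and_stat(const char *path, time_t when,
                      time_t *mtime_out, time_t *ctime_out)
{
    struct timespec times[2] = {
        { .tv_sec = when, .tv_nsec = 0 },   /* atime */
        { .tv_sec = when, .tv_nsec = 0 },   /* mtime */
    };
    struct stat st;
    int fd, ret = -1;

    assert(path && mtime_out && ctime_out);

    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* futimens() takes explicit atime/mtime; it has no ctime argument. */
    if (futimens(fd, times) == 0 && fstat(fd, &st) == 0) {
        *mtime_out = st.st_mtime;
        *ctime_out = st.st_ctime;
        ret = 0;
    }
    close(fd);
    return ret;
}
```

Calling backdate_and_stat("/tmp/some-scratch-file", 1000000000, &m, &c) should leave m at the requested value, while c reflects the time of the futimens() call itself, which is the property backups and audit trails rely on.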
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-04  4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig
@ 2025-10-04 16:08   ` Andy Lutomirski
  2025-10-07  5:08     ` Christoph Hellwig
  2025-10-05 22:06   ` Dave Chinner
  2025-10-05 23:38   ` Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-04 16:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs

On Fri, Oct 3, 2025 at 9:26 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote:
> > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not
> > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS]
> > add a FMODE flag to make XFS invisible I/O less hacky). Back then it
> > was suggested that this flag is propagated to a O_NOCMTIME one.
>
> Skipping c/mtime updates is dangerous. The XFS handle code allows it to
> support HSM, where data is migrated out to tape, and requires
> CAP_SYS_ADMIN. Allowing it for any file owner would expand the scope
> far too much, as now everyone could skip timestamp updates.
>
> > It can be used by workloads that want to write a file but don't care
> > much about the precise timestamp on it and can update it later with
> > utimens() call.
>
> The workload might not care, the rest of the system does. ctime can't
> be set to arbitrary values, so it is important for backups and as
> an audit trail.
>
> > There's another reason for having this patch. When performing AIO write,
> > the file_modified_flags() function checks whether or not to update inode
> > times. In case update is needed and iocb carries the RWF_NOWAIT flag,
> > the check returns EINTR error that quickly propagates into cb completion
> > without doing any IO. This restriction effectively prevents doing AIO
> > writes with nowait flag, as file modifications really imply time update.
>
> Well, we'll need to look into that, including maybe non-blocking
> timestamp updates.
>

It's been 12 years (!), but maybe it's time to reconsider this:

https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/

Nothing has fundamentally changed since then, but I bet enough little
things (folios!) have changed around this series that it won't apply
without considerable massaging.

I stopped working on it personally because I moved the workload in
question onto fast, fancy SSDs, resulting in my having bigger fish to
fry. I don't think I'll have the bandwidth to pick it up any time soon,
but maybe one of you folks is interested :)

I never looked into the AIO path (I was interested in the page_mkwrite
path), but my series made it at least conceptually possible to
unconditionally mark the file as needing a cmtime update when presently
dirty data is written back, and I imagine that AIO could use that too
to avoid ever needing to bail out because an mtime update would block.

To the extent that ctime is "important for backups", it's been *wrong*
for backups approximately forever -- one can read ctime, then read the
contents of a file, and get a new ctime and an old copy of the data
that precedes the modification that logically triggered the ctime value
that was read.

--Andy

Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-04 16:08   ` Andy Lutomirski
@ 2025-10-07  5:08     ` Christoph Hellwig
  2025-10-08 15:22       ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-07  5:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > Well, we'll need to look into that, including maybe non-blocking
> > timestamp updates.
> >
>
> It's been 12 years (!), but maybe it's time to reconsider this:
>
> https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/

I don't see how that is relevant here. Also, writes through shared
mmaps are problematic for so many reasons that I'm not sure we want
to encourage people to use that more.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-07  5:08     ` Christoph Hellwig
@ 2025-10-08 15:22       ` Andy Lutomirski
  2025-10-08 21:27         ` Dave Chinner
  2025-10-10  5:27         ` Christoph Hellwig
  0 siblings, 2 replies; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-08 15:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs

On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > Well, we'll need to look into that, including maybe non-blocking
> > > timestamp updates.
> > >
> >
> > It's been 12 years (!), but maybe it's time to reconsider this:
> >
> > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
>
> I don't see how that is relevant here. Also, writes through shared
> mmaps are problematic for so many reasons that I'm not sure we want
> to encourage people to use that more.
>

Because the same exact issue exists in the normal non-mmap write path,
and I can even quote you upthread :)

> Well, we'll need to look into that, including maybe non-blocking
> timestamp updates.

I assume the code path that inspired this thread in the first place is:

ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
        struct file *file = iocb->ki_filp;
        struct address_space *mapping = file->f_mapping;
        struct inode *inode = mapping->host;
        ssize_t ret;

        ret = file_remove_privs(file);
        if (ret)
                return ret;

        ret = file_update_time(file);

and this has *exactly* the same problem as the shared-mmap write path:
it synchronously updates the time (well, synchronously enough that it
sometimes blocks), and it does so before updating the file contents
(although the window during which the timestamp is updated and the
contents are not is not as absurdly long as it is in the mmap case).

Now my series does not change any of this, but I'm thinking more of
the concept: instead of doing file/inode_update_time when a file is
logically written (in write_iter, page_mkwrite, etc), set a flag so
that the writeback code knows that the timestamp needs updating.
Thinking out loud, to handle both write_iter and mmap, there might
need to be two bits: one saying "the timestamp needs to be updated"
and another saying "the timestamp has been updated in the in-memory
inode, but the inode hasn't been dirtied yet". And maybe the latter
is doable entirely within fs-specific code without any help from the
generic code, but it might still be nice to keep generic_update_time
usable for filesystems that want to do this.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 15:22       ` Andy Lutomirski
@ 2025-10-08 21:27         ` Dave Chinner
  2025-10-08 21:51           ` Andy Lutomirski
  2025-10-10  5:27         ` Christoph Hellwig
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-10-08 21:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > Well, we'll need to look into that, including maybe non-blocking
> > > > timestamp updates.
> > > >
> > >
> > > It's been 12 years (!), but maybe it's time to reconsider this:
> > >
> > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> >
> > I don't see how that is relevant here. Also, writes through shared
> > mmaps are problematic for so many reasons that I'm not sure we want
> > to encourage people to use that more.
> >
>
> Because the same exact issue exists in the normal non-mmap write path,
> and I can even quote you upthread :)
>
> > Well, we'll need to look into that, including maybe non-blocking
> > timestamp updates.
>
> I assume the code path that inspired this thread in the first place is:
>
> ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
>         struct file *file = iocb->ki_filp;
>         struct address_space *mapping = file->f_mapping;
>         struct inode *inode = mapping->host;
>         ssize_t ret;
>
>         ret = file_remove_privs(file);
>         if (ret)
>                 return ret;
>
>         ret = file_update_time(file);
>
> and this has *exactly* the same problem as the shared-mmap write path:
> it synchronously updates the time (well, synchronously enough that it
> sometimes blocks),

You are conflating "synchronous update" with "blocking".

Avoiding the need for synchronous timestamp updates is exactly what
the lazytime mount option provides. i.e. lazytime degrades immediate
consistency requirements to eventual consistency, similar to how the
default relatime behaviour defers atime updates for eventual
writeback.

IOWs, we've already largely addressed the synchronous c/mtime update
problem, but what we haven't done is make timestamp updates
fully support non-blocking caller semantics. That's a separate
problem...

> and it does so before updating the file contents
> (although the window during which the timestamp is updated and the
> contents are not is not as absurdly long as it is in the mmap case).
>
> Now my series does not change any of this, but I'm thinking more of
> the concept: instead of doing file/inode_update_time when a file is
> logically written (in write_iter, page_mkwrite, etc), set a flag so
> that the writeback code knows that the timestamp needs updating.

This is exactly what lazytime implements with the I_DIRTY_TIME flag.

During writeback, if the filesystem has to modify other metadata in
the inode (e.g. block allocation), the filesystem will piggyback the
persistent update of the dirty timestamps on that modification and
clear the I_DIRTY_TIME flag.

However, if the writeback operation is a pure overwrite, then there
is no metadata modification occurring, and so we leave the inode
I_DIRTY_TIME dirty for a future metadata persistence operation to
clean it.

IOWs, with lazytime, writeback already persists timestamp updates
when appropriate for best performance.

> Thinking out loud, to handle both write_iter and mmap, there might
> need to be two bits: one saying "the timestamp needs to be updated"
> and another saying "the timestamp has been updated in the in-memory
> inode, but the inode hasn't been dirtied yet".

The flag that implements the latter is called I_DIRTY_TIME. We have
not implemented the former as that's a userspace visible change of
behaviour.

> And maybe the latter
> is doable entirely within fs-specific code without any help from the
> generic code, but it might still be nice to keep generic_update_time
> usable for filesystems that want to do this.

generic_update_time() already supports I_DIRTY_TIME semantics.

-Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread
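The lazytime behaviour described above (update in memory, persist only when other metadata has to be written anyway) can be sketched as a toy userspace state machine. This is an illustrative model only, not kernel code: struct model_inode, model_write() and model_writeback() are hypothetical names, and the dirty_time field merely stands in for the I_DIRTY_TIME state.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an inode; field names are hypothetical, not kernel fields. */
struct model_inode {
    long mem_mtime;    /* timestamp held in the in-memory inode */
    long disk_mtime;   /* timestamp last persisted to stable storage */
    bool dirty_time;   /* stands in for the I_DIRTY_TIME state */
};

/* A write updates the in-memory timestamp and only marks it dirty. */
void model_write(struct model_inode *ino, long now)
{
    assert(ino);
    ino->mem_mtime = now;
    ino->dirty_time = true;
}

/*
 * Data writeback. If other inode metadata has to be persisted anyway
 * (e.g. block allocation), the dirty timestamp piggybacks on it and the
 * dirty state is cleared. A pure overwrite leaves the timestamp dirty
 * for a later metadata persistence operation.
 */
void model_writeback(struct model_inode *ino, bool metadata_changed)
{
    assert(ino);
    if (metadata_changed && ino->dirty_time) {
        ino->disk_mtime = ino->mem_mtime;
        ino->dirty_time = false;
    }
}
```

In this model, a model_write() followed by model_writeback(ino, false) leaves disk_mtime stale and dirty_time set, while a later model_writeback(ino, true) persists the timestamp, which is the eventual-consistency trade-off lazytime makes.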
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 21:27         ` Dave Chinner
@ 2025-10-08 21:51           ` Andy Lutomirski
  2025-10-11  1:35             ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-08 21:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:

> You are conflating "synchronous update" with "blocking".
>
> Avoiding the need for synchronous timestamp updates is exactly what
> the lazytime mount option provides. i.e. lazytime degrades immediate
> consistency requirements to eventual consistency, similar to how the
> default relatime behaviour defers atime updates for eventual
> writeback.
>
> IOWs, we've already largely addressed the synchronous c/mtime update
> problem, but what we haven't done is make timestamp updates
> fully support non-blocking caller semantics. That's a separate
> problem...

I'm probably missing something, but is this really different? Either
the mtime update can block or it can't block. I haven't dug all the
way into exactly what happens in __mark_inode_dirty(), but there is a
lot going on in there even in the I_DIRTY_TIME path. And Pavel is
saying that AIO and mtime updates don't play along well.

>
> > and it does so before updating the file contents
> > (although the window during which the timestamp is updated and the
> > contents are not is not as absurdly long as it is in the mmap case).
> >
> > Now my series does not change any of this, but I'm thinking more of
> > the concept: instead of doing file/inode_update_time when a file is
> > logically written (in write_iter, page_mkwrite, etc), set a flag so
> > that the writeback code knows that the timestamp needs updating.
>
> This is exactly what lazytime implements with the I_DIRTY_TIME flag.
>
> During writeback, if the filesystem has to modify other metadata in
> the inode (e.g. block allocation), the filesystem will piggyback the
> persistent update of the dirty timestamps on that modification and
> clear the I_DIRTY_TIME flag.
>
> However, if the writeback operation is a pure overwrite, then there
> is no metadata modification occurring, and so we leave the inode
> I_DIRTY_TIME dirty for a future metadata persistence operation to
> clean it.
>
> IOWs, with lazytime, writeback already persists timestamp updates
> when appropriate for best performance.

I'm probably doing a bad job explaining myself. In my series, I move
(for page_mkwrite only) the mtime update, *including dirtying the
inode*, to the writeback path, which makes it fully non-blocking /
asynchronous / whatever you want to call it at the time that
page_mkwrite is called.

More concretely, my suggestion is to be a bit lazier than current
lazytime and not dirty the inode *at all* in write_iter, or at least
not dirty it for the purpose of timestamp updates. Instead set a flag
somewhere that it cannot be forgotten about -- in my series, it's this
patch:

https://lore.kernel.org/all/f2ac22142b4634b55ff6858d159b45dac96f81b6.1377193658.git.luto@amacapital.net/

and it's a single atomic bit in struct address_space.

The idea is that there is approximately no additional overhead at the
time that the page cache is dirtied for cmtime-related inode dirtying
and that all such overhead is deferred to the writeback path when it's
as asynchronous as possible from the perspective of whatever user code
dirtied the page cache. My page_set_cmtime() is completely lockless.

My series is far from perfect, but I did test it with real workloads
12-ish years ago, on overworked HDDs, with latencytop, and it worked.
Performance was vastly improved (using mmap, not write(), obviously).

>
> > Thinking out loud, to handle both write_iter and mmap, there might
> > need to be two bits: one saying "the timestamp needs to be updated"
> > and another saying "the timestamp has been updated in the in-memory
> > inode, but the inode hasn't been dirtied yet".
>
> The flag that implements the latter is called I_DIRTY_TIME. We have
> not implemented the former as that's a userspace visible change of
> behaviour.

Maybe that change should be done? Or not -- it wouldn't be terribly
hard to have a pair of atomic timestamps in struct inode indicating
what timestamps we want to write the next time we get around to it.
(Concretely, page_set_cmtime() would get some new parameters to
specify actual times, and atomic compare exchange would be used to
update the underlying data structure, so it would remain lock-free but
not be wait-free.)

^ permalink raw reply	[flat|nested] 14+ messages in thread
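The compare-exchange idea in the last paragraph can be sketched in userspace with C11 atomics. This is a hypothetical illustration, not code from the 2013 series: pending_time_t, pending_time_record() and pending_time_take() are invented names, and a single 64-bit nanosecond value stands in for the "pair of atomic timestamps".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical pending timestamp in nanoseconds; 0 means nothing pending. */
typedef _Atomic int64_t pending_time_t;

/*
 * Record `t` unless an equal or later time is already pending.
 * Lock-free but not wait-free: a failed compare-exchange reloads `cur`
 * with the current value and the loop retries.
 */
void pending_time_record(pending_time_t *p, int64_t t)
{
    int64_t cur;

    assert(p);
    cur = atomic_load(p);
    while (cur < t && !atomic_compare_exchange_weak(p, &cur, t))
        ;   /* `cur` now holds the latest value; re-check and retry */
}

/* Writeback side: fetch the pending time and reset it to "none". */
int64_t pending_time_take(pending_time_t *p)
{
    assert(p);
    return atomic_exchange(p, 0);
}
```

The writer never takes a lock and never dirties anything else; all heavier work would happen on the side that calls pending_time_take(), which in the scheme described above is the writeback path.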
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 21:51           ` Andy Lutomirski
@ 2025-10-11  1:35             ` Dave Chinner
  2025-10-11  4:04               ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-10-11  1:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
>
> > You are conflating "synchronous update" with "blocking".
> >
> > Avoiding the need for synchronous timestamp updates is exactly what
> > the lazytime mount option provides. i.e. lazytime degrades immediate
> > consistency requirements to eventual consistency, similar to how the
> > default relatime behaviour defers atime updates for eventual
> > writeback.
> >
> > IOWs, we've already largely addressed the synchronous c/mtime update
> > problem, but what we haven't done is make timestamp updates
> > fully support non-blocking caller semantics. That's a separate
> > problem...
>
> I'm probably missing something, but is this really different?

Yes, and yes.

> Either the mtime update can block or it can't block.

Sure, but that's not the issue we have to deal with.

In many filesystems and fs operations, we have to know if an
operation is going to block -before- we start the operation. e.g.
transactional changes cannot be rolled back once we've started the
modification if they need to block to make progress (e.g. read in
on-disk metadata).

This foresight, in many cases, is -unknowable-. Even though the
operation /likely/ won't block, we cannot *guarantee* ahead of time
that any given instance of the operation will /not/ block. Hence
the reliable non-blocking operation that users are asking for is not
possible with unknowable implementation characteristics like this.

IOWs, a timestamp update implementation can be synchronous and
reliably non-blocking if it always knows when blocking will occur
and can return -EAGAIN instead of blocking to complete the
operation.

If it can't know when/if blocking will occur, then lazytime allows
us to defer the (potentially) blocking update operation to another
context that can block. Queuing for async processing can easily be
made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
for us.

So, yeah, it should be pretty obvious at this point that non-blocking
implementation is completely independent of whether the operation is
performed synchronously or asynchronously. It's easier to make async
operations non-blocking, but that doesn't mean "non_blocking" and
"asynchronous execution" are interchangeable terms or behaviours.

> I haven't dug all the
> way into exactly what happens in __mark_inode_dirty(), but there is a
> lot going on in there even in the I_DIRTY_TIME path.

It's pretty simple, really. __mark_inode_dirty(I_DIRTY_TIME) is
non-blocking and queues the inode on the wb->i_dirty_time queue
for later processing.

> And Pavel is
> saying that AIO and mtime updates don't play along well.

Again: this is exactly why lazytime was added to XFS *ten years ago*.

From 2015 (issue #3):

https://lore.kernel.org/linux-xfs/CAD-J=zZh1dtJsfrW_Gwxjg+qvkZMu7ED-QOXrMMO6B-G0HY2-A@mail.gmail.com/

(Oh, look, a discussion that starts from a user suggestion of
exposing FMODE_NOCMTIME to userspace apps! Sound familiar?)

> > IOWs, with lazytime, writeback already persists timestamp updates
> > when appropriate for best performance.
>
> I'm probably doing a bad job explaining myself.

No, I think Christoph and I both understand exactly what you are
trying to describe. It seems to me that you haven't yet understood
that lazytime already does exactly what you are asking for. Hence you
think we don't understand the "lazytime" concept you are proposing
and keep trying to reinvent lazytime to convince us that we need
"lazytime" functionality in the kernel...

> > > Thinking out loud, to handle both write_iter and mmap, there might
> > > need to be two bits: one saying "the timestamp needs to be updated"
> > > and another saying "the timestamp has been updated in the in-memory
> > > inode, but the inode hasn't been dirtied yet".
> >
> > The flag that implements the latter is called I_DIRTY_TIME. We have
> > not implemented the former as that's a userspace visible change of
> > behaviour.
>
> Maybe that change should be done? Or not -- it wouldn't be terribly
> hard to have a pair of atomic timestamps in struct inode indicating
> what timestamps we want to write the next time we get around to it.

See, you just reinvented the lazytime mechanism. Again. :/

-Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread
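The distinction drawn above between "synchronous" and "non-blocking" can be made concrete with a small userspace decision model. It is only a sketch of the reasoning in this message, with hypothetical names throughout (struct model_ts, model_update_time(), the UPDATE_* results); it does not mirror any actual kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Possible non-error outcomes of the hypothetical update. */
enum { UPDATE_DONE = 0, UPDATE_DEFERRED = 1 };

struct model_ts {
    long value;        /* timestamp applied synchronously */
    long queued;       /* timestamp handed to a context that may block */
    bool has_queued;
};

/*
 * nowait:      caller must not block.
 * known:       implementation can tell up front whether it would block.
 * would_block: the answer, only meaningful when `known` is true.
 *
 * Returns UPDATE_DONE, UPDATE_DEFERRED, or -EAGAIN.
 */
int model_update_time(struct model_ts *ts, long now,
                      bool nowait, bool known, bool would_block)
{
    assert(ts);

    if (!nowait || (known && !would_block)) {
        ts->value = now;            /* synchronous update */
        return UPDATE_DONE;
    }
    if (known)                      /* known to block: refuse up front */
        return -EAGAIN;

    /* Unknowable: queue it for a context that is allowed to block. */
    ts->queued = now;
    ts->has_queued = true;
    return UPDATE_DEFERRED;
}
```

The point of the model is that a synchronous update can still honour a nowait caller when the blocking behaviour is knowable, while deferral is the fallback when it is not; the two properties vary independently.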
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-11  1:35             ` Dave Chinner
@ 2025-10-11  4:04               ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-11  4:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Fri, Oct 10, 2025 at 6:35 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Oct 08, 2025 at 02:51:14PM -0700, Andy Lutomirski wrote:
> > On Wed, Oct 8, 2025 at 2:27 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > > > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> >
> > > You are conflating "synchronous update" with "blocking".
> > >
> > > Avoiding the need for synchronous timestamp updates is exactly what
> > > the lazytime mount option provides. i.e. lazytime degrades immediate
> > > consistency requirements to eventual consistency, similar to how the
> > > default relatime behaviour defers atime updates for eventual
> > > writeback.
> > >
> > > IOWs, we've already largely addressed the synchronous c/mtime update
> > > problem, but what we haven't done is make timestamp updates
> > > fully support non-blocking caller semantics. That's a separate
> > > problem...
> >
> > I'm probably missing something, but is this really different?
>
> Yes, and yes.
>
> > Either the mtime update can block or it can't block.
>
> Sure, but that's not the issue we have to deal with.
>
> In many filesystems and fs operations, we have to know if an
> operation is going to block -before- we start the operation. e.g.
> transactional changes cannot be rolled back once we've started the
> modification if they need to block to make progress (e.g. read in
> on-disk metadata).
>
> This foresight, in many cases, is -unknowable-. Even though the
> operation /likely/ won't block, we cannot *guarantee* ahead of time
> that any given instance of the operation will /not/ block. Hence
> the reliable non-blocking operation that users are asking for is not
> possible with unknowable implementation characteristics like this.
>
> IOWs, a timestamp update implementation can be synchronous and
> reliably non-blocking if it always knows when blocking will occur
> and can return -EAGAIN instead of blocking to complete the
> operation.
>
> If it can't know when/if blocking will occur, then lazytime allows
> us to defer the (potentially) blocking update operation to another
> context that can block. Queuing for async processing can easily be
> made non-blocking, and __mark_inode_dirty(I_DIRTY_TIME) does this
> for us.
>
> So, yeah, it should be pretty obvious at this point that non-blocking
> implementation is completely independent of whether the operation is
> performed synchronously or asynchronously. It's easier to make async
> operations non-blocking, but that doesn't mean "non_blocking" and
> "asynchronous execution" are interchangeable terms or behaviours.
>
> > I haven't dug all the
> > way into exactly what happens in __mark_inode_dirty(), but there is a
> > lot going on in there even in the I_DIRTY_TIME path.
>
> It's pretty simple, really. __mark_inode_dirty(I_DIRTY_TIME) is
> non-blocking and queues the inode on the wb->i_dirty_time queue
> for later processing.
>

First, I apologize if I'm off base here. Second, I don't think I'm
entirely nuts, and I'm moderately confident that, ten-ish years ago, I
tested lazytime in the hopes that it would solve my old problem, and
IIRC it didn't help. I was running a production workload on ext4 on
regrettably slow spinning rust backed by a truly atrocious HPE
controller. And I was running latencytop to generate little traces
when my task got blocked, and there was no form of AIO involved. (And
I don't really understand how AIO is wired up internally... And yes,
in retrospect I should not have been using shared-writable mmaps or
even file-backed things at all for what I was doing, but I had
unrealistic expectations of how mmap worked when I wrote that code
more like 20 years ago, and I wasn't even using Linux at the time I
wrote it.)

I'm looking at the code now, and I see what you're talking about, and
__mark_inode_dirty(inode, I_DIRTY_TIME) looks fairly polite and like
it won't block. But the relevant code seems to be:

int generic_update_time(struct inode *inode, int flags)
{
        int updated = inode_update_timestamps(inode, flags);
        int dirty_flags = 0;

        if (updated & (S_ATIME|S_MTIME|S_CTIME))
                dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ? I_DIRTY_TIME : I_DIRTY_SYNC;
        if (updated & S_VERSION)
                dirty_flags |= I_DIRTY_SYNC;
        __mark_inode_dirty(inode, dirty_flags);
        ...

inode_update_timestamps does this, where updated != 0 if the timestamp
actually changed (which is subject to some complex coarse-graining
logic so it may only happen some of the time):

        if (IS_I_VERSION(inode) && inode_maybe_inc_iversion(inode, updated))
                updated |= S_VERSION;

IS_I_VERSION seems to be unconditionally true on ext4.
inode_maybe_inc_iversion always returns true if updated is set, so
generic_update_time has a decent chance of doing
__mark_inode_dirty(inode, I_DIRTY_SYNC), which calls
s_op->dirty_inode, which calls ext4_journal_start, which, from my
recollection a decade ago, could easily block for a good second or so
on my delightful, now retired, HP/HPE system.

In my case, I think this is the path that was blocking for me in lots
of do_wp_page calls that would otherwise not have blocked. I also
don't see any kiocb passed around or any mechanism by which this code
could know that it's supposed to be nonblocking, although I have
approximately no understanding of Linux AIO and I don't really know
what I should be looking for.

I could try to instrument the code a bit and test to see if I've
analyzed it right in a few days.

--Andy

Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 14+ messages in thread
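The flag selection in the generic_update_time() excerpt quoted above can be reproduced as a standalone userspace model, which makes the S_VERSION interaction easy to see. The MODEL_* constants below use illustrative bit values that are assumptions, not the kernel's definitions, and model_dirty_flags() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's constants may differ. */
#define MODEL_S_ATIME      (1u << 0)
#define MODEL_S_MTIME      (1u << 1)
#define MODEL_S_CTIME      (1u << 2)
#define MODEL_S_VERSION    (1u << 3)
#define MODEL_S_ALL        (MODEL_S_ATIME | MODEL_S_MTIME | \
                            MODEL_S_CTIME | MODEL_S_VERSION)

#define MODEL_I_DIRTY_SYNC (1u << 0)
#define MODEL_I_DIRTY_TIME (1u << 1)

/*
 * Mirror of the dirty-flag selection in the quoted excerpt:
 * timestamp-only changes become DIRTY_TIME under lazytime, but any
 * version bump adds DIRTY_SYNC regardless of lazytime.
 */
unsigned model_dirty_flags(unsigned updated, bool lazytime)
{
    unsigned dirty_flags = 0;

    assert((updated & ~MODEL_S_ALL) == 0);

    if (updated & (MODEL_S_ATIME | MODEL_S_MTIME | MODEL_S_CTIME))
        dirty_flags = lazytime ? MODEL_I_DIRTY_TIME : MODEL_I_DIRTY_SYNC;
    if (updated & MODEL_S_VERSION)
        dirty_flags |= MODEL_I_DIRTY_SYNC;
    return dirty_flags;
}
```

With lazytime enabled, an mtime-only update yields just MODEL_I_DIRTY_TIME, whereas an update that also bumps the version yields both flags, which is the case suspected above of reaching the journal-start path.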
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-08 15:22       ` Andy Lutomirski
  2025-10-08 21:27         ` Dave Chinner
@ 2025-10-10  5:27         ` Christoph Hellwig
  2025-10-10 17:35           ` Andy Lutomirski
  1 sibling, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-10-10  5:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs

On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > Well, we'll need to look into that, including maybe non-blocking
> > > > timestamp updates.
> > > >
> > >
> > > It's been 12 years (!), but maybe it's time to reconsider this:
> > >
> > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> >
> > I don't see how that is relevant here. Also, writes through shared
> > mmaps are problematic for so many reasons that I'm not sure we want
> > to encourage people to use that more.
> >
>
> Because the same exact issue exists in the normal non-mmap write path,
> and I can even quote you upthread :)

The thread that started this is about io_uring nonblock writes, aka
O_DIRECT. So there isn't any writeback to defer to.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
  2025-10-10  5:27         ` Christoph Hellwig
@ 2025-10-10 17:35           ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2025-10-10 17:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs

On Thu, Oct 9, 2025 at 10:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> > On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > > Well, we'll need to look into that, including maybe non-blocking
> > > > > timestamp updates.
> > > > >
> > > >
> > > > It's been 12 years (!), but maybe it's time to reconsider this:
> > > >
> > > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> > >
> > > I don't see how that is relevant here. Also, writes through shared
> > > mmaps are problematic for so many reasons that I'm not sure we want
> > > to encourage people to use that more.
> > >
> >
> > Because the same exact issue exists in the normal non-mmap write path,
> > and I can even quote you upthread :)
>
> The thread that started this is about io_uring nonblock writes, aka
> O_DIRECT. So there isn't any writeback to defer to.

I haven't followed all the internal details, but RWF_DONTCACHE is
looking pretty good these days, and it does go through the writeback
path. I wonder if it's getting good enough that most or all O_DIRECT
users could switch to using it.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME 2025-10-04 4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig 2025-10-04 16:08 ` Andy Lutomirski @ 2025-10-05 22:06 ` Dave Chinner 2025-10-07 5:10 ` Christoph Hellwig 2025-10-05 23:38 ` Dave Chinner 2 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2025-10-05 22:06 UTC (permalink / raw) To: Christoph Hellwig Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs On Fri, Oct 03, 2025 at 09:26:50PM -0700, Christoph Hellwig wrote: > On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote: > > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not > > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS] > > add a FMODE flag to make XFS invisible I/O less hacky. Back then it > > was suggested that this flag is propagated to a O_NOCMTIME one. > > skipping c/mtime is dangerous. The XFS handle code allows it to > support HSM where data is migrated out to tape, and requires > CAP_SYS_ADMIN. Allowing it for any file owner would expand the scope > for too much as now everyone could skip timestamp updates. > > > It can be used by workloads that want to write a file but don't care > > much about the preciese timestamp on it and can update it later with > > utimens() call. If you don't care about accurate c/mtime, then mount the filesystem with '-o lazytime' to degrade c/mtime updates to "eventual consistency" behaviour for IO operations. If inode metadata is otherwise modified (e.g. block allocation during IO) or the application then calls utimens(), it will update the recorded in-memory timestamps in a persistent manner immediately. > The workload might not care, the rest of the system does. ctime can't > bet set to arbitrary values, so it is important for backups and as > an audit trail. But we can (and do) delay the persistence of IO-based timestamp updates with the lazytime option. 
> > There's another reason for having this patch. When performing AIO write, > > the file_modified_flags() function checks whether or not to update inode > > times. In case update is needed and iocb carries the RWF_NOWAIT flag, > > the check return EINTR error that quickly propagates into cb completion > > without doing any IO. This restriction effectively prevents doing AIO > > writes with nowait flag, as file modifications really imply time update. > > Well, we'll need to look into that, including maybe non-blockin > timestamp updates. Lazytime updates can generally be done in a non-blocking manner right now (someone raised that in the context of io-uring on #xfs about a month ago), but the NOWAIT behaviour for timestamp updates is done at a higher level in the VFS and does not take into account filesystem specific non-blocking lazytime updates at all. If we push the NOWAIT checking behaviour down to the filesystem, we can do this. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME 2025-10-05 22:06 ` Dave Chinner @ 2025-10-07 5:10 ` Christoph Hellwig 0 siblings, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2025-10-07 5:10 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs On Mon, Oct 06, 2025 at 09:06:40AM +1100, Dave Chinner wrote: > If you don't care about accurate c/mtime, then mount the filesystem > with '-o lazytime' to degrade c/mtime updates to "eventual > consistency" behaviour for IO operations. Exactly. > Lazytime updates can generally be done in a non-blocking manner > right now (someone raised that in the context of io-uring on #xfs > about a month ago), but the NOWAIT behaviour for timestamp updates > is done at a higher level in the VFS and does not take into account > filesystem specific non-blocking lazytime updates at all. If we > push the NOWAIT checking behaviour down to the filesystem, we can do > this. We might not even have to push it down, but just make the VFS/rw helper check aware of lazytime. Either way, currently even a lazytime timestamp update will cause a write to block, which renders nowait writes pretty useless on anything but block devices. ^ permalink raw reply [flat|nested] 14+ messages in thread
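To make the userspace side of this concrete, the following is a rough sketch of what an application does today when a nowait write is refused: it retries the same write as a blocking one (an event-loop application would typically hand the retry to a worker thread instead). It assumes glibc's pwritev2() and the UAPI value 0x8 for RWF_NOWAIT. The helper name write_nowait_or_block and its was_retried parameter are invented for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 /* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: try a non-blocking write first and fall back to
 * an ordinary blocking pwritev2() if the kernel refuses it.
 * Returns the number of bytes written, or -1 with errno set.
 * If was_retried is non-NULL it is set to 1 when the fallback was taken.
 * Short writes are not retried here.
 */
ssize_t write_nowait_or_block(int fd, const void *buf, size_t len,
                              off_t off, int *was_retried)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    ssize_t n;

    if (was_retried)
        *was_retried = 0;

    n = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0)
        return n;

    /*
     * EAGAIN: the write would have blocked (for example on a pending
     * timestamp update). EOPNOTSUPP/EINVAL: the file or kernel does not
     * accept RWF_NOWAIT at all. Anything else is a real error.
     */
    if (errno != EAGAIN && errno != EOPNOTSUPP && errno != EINVAL)
        return -1;

    if (was_retried)
        *was_retried = 1;
    return pwritev2(fd, &iov, 1, off, 0);
}
```

If the fallback path is taken on nearly every write, the nowait attempt is pure overhead, which is the practical cost of the behaviour described above.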
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME 2025-10-04 4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig 2025-10-04 16:08 ` Andy Lutomirski 2025-10-05 22:06 ` Dave Chinner @ 2025-10-05 23:38 ` Dave Chinner 2025-10-06 2:16 ` Theodore Ts'o 2 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2025-10-05 23:38 UTC (permalink / raw) To: Christoph Hellwig Cc: Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs On Fri, Oct 03, 2025 at 09:26:50PM -0700, Christoph Hellwig wrote: > On Fri, Oct 03, 2025 at 12:32:13PM +0300, Pavel Emelyanov wrote: > > The FMODE_NOCMTIME flag tells that ctime and mtime stamps are not > > updated on IO. The flag was introduced long ago by 4d4be482a4 ([XFS] > > add a FMODE flag to make XFS invisible I/O less hacky. Back then it > > was suggested that this flag is propagated to a O_NOCMTIME one. > > skipping c/mtime is dangerous. The XFS handle code allows it to > support HSM where data is migrated out to tape, and requires > CAP_SYS_ADMIN. Allowing it for any file owner would expand the scope > for too much as now everyone could skip timestamp updates. We have already provided a safe method for minimising the overhead of c/mtime updates in the IO path - it's called lazytime. The lazytime mount option provides eventual consistency for c/mtime updates for IO operations instead of immediate consistency. Timestamps are still updated to have the correct values, but the latency/performance of the timestamp updates is greatly improved by holding them purely in memory until some other trigger forces them to be persisted to disk. > > It can be used by workloads that want to write a file but don't care > > much about the preciese timestamp on it and can update it later with > > utimens() call. > > The workload might not care, the rest of the system does. 
ctime can't > bet set to arbitrary values, so it is important for backups and as > an audit trail. Lazytime works for this use case; a call to utimens() will cause a persistent update of the timestamps, as will any other inode modification that has persistence requirements (e.g. block allocation during IO or other syscalls that modify inode metadata). > > There's another reason for having this patch. When performing AIO write, > > the file_modified_flags() function checks whether or not to update inode > > times. In case update is needed and iocb carries the RWF_NOWAIT flag, > > the check return EINTR error that quickly propagates into cb completion > > without doing any IO. This restriction effectively prevents doing AIO > > writes with nowait flag, as file modifications really imply time update. > > Well, we'll need to look into that, including maybe non-blockin > timestamp updates. This came up recently on #xfs w.r.t. lazytime behaviour - we need to pass the NOWAIT decision semantics down to the filesystem to allow lazytime to be truly non-blocking. At the moment the high level VFS NOWAIT checks (via inode_needs_update_time()) have no visibility of this filesystem specific functionality, so even if we can do the lazy timestamp update without blocking we still give an -EAGAIN if IOCB_NOWAIT is set. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
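The decision described above can be pictured with a small userspace toy model. This is not kernel code and does not mirror the real signatures of file_modified_flags() or inode_needs_update_time(); the struct, its fields, and both function names are hypothetical and exist only to contrast the current behaviour with the direction being discussed.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model only. Every name below is hypothetical. */
struct write_ctx {
    bool needs_time_update;    /* this write would change c/mtime */
    bool nowait;               /* caller asked for IOCB_NOWAIT / RWF_NOWAIT */
    bool lazytime_nonblocking; /* fs could record the update in memory
                                  without blocking (lazytime) */
};

/*
 * Behaviour as described in the thread: the high-level check has no
 * visibility of what the filesystem can do, so any pending timestamp
 * update on a nowait write yields -EAGAIN.
 */
int modified_check_today(const struct write_ctx *c)
{
    if (!c->needs_time_update)
        return 0;
    if (c->nowait)
        return -EAGAIN;
    return 0; /* blocking caller: update timestamps and carry on */
}

/*
 * Direction under discussion: only refuse the nowait write when the
 * timestamp update really could block.
 */
int modified_check_lazytime_aware(const struct write_ctx *c)
{
    if (!c->needs_time_update)
        return 0;
    if (c->nowait && !c->lazytime_nonblocking)
        return -EAGAIN;
    return 0;
}
```

The two functions differ only for the case of a nowait write on a filesystem that can do the lazy update without blocking, which is the case the thread is about.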
* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME 2025-10-05 23:38 ` Dave Chinner @ 2025-10-06 2:16 ` Theodore Ts'o 0 siblings, 0 replies; 14+ messages in thread From: Theodore Ts'o @ 2025-10-06 2:16 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel, Raphael S . Carvalho, linux-api, linux-xfs On Mon, Oct 06, 2025 at 10:38:20AM +1100, Dave Chinner wrote: > We have already provided a safe method for minimising the overhead > of c/mtime updates in the IO path - it's called lazytime. The > lazytime mount option provides eventual consistency for c/mtime > updates for IO operations instead of immediate consistency. > > Timestamps are still updated to have the correct values, but the > latency/performance of the timestamp updates is greatly improved by > holding them purely in memory until some other trigger forces them > to be persisted to disk. Specifically, the timestamps are persisted to stable store when (a) the file system is unmounted, (b) when the inode needs to be pushed out of memory due to memory pressure, (c) when the inode is forcibly persisted using fsync(), (d) when some other inode field is updated, and the inode gets written out, or (e) after 24 hours. As a result, the on-disk timestamps will be at most 24 hours stale. But this is POSIX compliant, because if you read the timestamps using stat(1), you will get the updated values, and what happens after a crash in the absence of an fsync(2) is not defined. The reason why we implemented this at $WORK is that if you are constantly updating a database using fdatasync(2), and you care about 99.9th percentile I/O latency, the 4k writes to the inode table will eventually trigger a hard drive's Adjacent Track Interference (ATI) mitigation, which involves rewriting a set of disk tracks to avoid the analog signal for adjacent tracks getting weakened by the hot-spot writes, and this is measurable if you are looking at long-tail I/O latencies. 
(And yes, we had to talk to our HDD vendors to figure out this is what was going on, since performance is out of scope of the SCSI/SATA specifications. Hence, random long-tail ATI latencies to preserve data integrity are allowed, and are in fact actually a good thing. :-) - Ted ^ permalink raw reply [flat|nested] 14+ messages in thread
Thread overview: 14+ messages
     [not found] <20251003093213.52624-1-xemul@scylladb.com>
2025-10-04  4:26 ` [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME Christoph Hellwig
2025-10-04 16:08   ` Andy Lutomirski
2025-10-07  5:08     ` Christoph Hellwig
2025-10-08 15:22       ` Andy Lutomirski
2025-10-08 21:27         ` Dave Chinner
2025-10-08 21:51           ` Andy Lutomirski
2025-10-11  1:35             ` Dave Chinner
2025-10-11  4:04               ` Andy Lutomirski
2025-10-10  5:27         ` Christoph Hellwig
2025-10-10 17:35           ` Andy Lutomirski
2025-10-05 22:06   ` Dave Chinner
2025-10-07  5:10     ` Christoph Hellwig
2025-10-05 23:38   ` Dave Chinner
2025-10-06  2:16     ` Theodore Ts'o