[PATCH 00/10] fs: multigrain timestamp redux

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 00/10] fs: multigrain timestamp redux
@ 2024-06-27  1:00 Jeff Layton
  2024-06-27  1:00 ` [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t Jeff Layton
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

At LSF/MM this year, we had a discussion about the inode change
attribute. At the time I mentioned that I thought I could salvage the
multigrain timestamp work that had to be reverted last year [1].  That
version had to be reverted because it was possible for a file to get a
coarse grained timestamp that appeared to be earlier than another file
that had recently gotten a fine-grained stamp.

This version corrects the problem by establishing a global ctime_floor
value that should prevent this from occurring. In the above situation
that was problematic before, the two files might end up with the same
timestamp value, but they won't appear to have been modified in the
wrong order.

That problem was discovered by the test-stat-time gnulib test. Note that
that test still fails on multigrain timestamps, but that's because its
method of determining the minimum delay that will show a timestamp
change will no longer work with multigrain timestamps. I have a patch to
change the testcase to use a different method that I will post soon.

The big question with this set is whether the performance will be
suitable. The testing I've done seems to show performance parity with
multigrain timestamps enabled, but it's hard to rule this out regressing
some workload.

This set is based on top of Christian's vfs.misc branch (which has the
earlier change to track inode timestamps as discrete integers). If there
are no major objections, I'd like to let this soak in linux-next for a
bit to see if any problems shake out.

[1]: https://lore.kernel.org/linux-fsdevel/20230807-mgctime-v7-0-d1dec143a704@kernel.org/

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (10):
      fs: turn inode ctime fields into a single ktime_t
      fs: uninline inode_get_ctime and inode_set_ctime_to_ts
      fs: tracepoints for inode_needs_update_time and inode_set_ctime_to_ts
      fs: add infrastructure for multigrain timestamps
      fs: add percpu counters to count fine vs. coarse timestamps
      fs: have setattr_copy handle multigrain timestamps appropriately
      xfs: switch to multigrain timestamps
      ext4: switch to multigrain timestamps
      btrfs: convert to multigrain timestamps
      tmpfs: add support for multigrain timestamps

 fs/attr.c                        |  52 +++++++--
 fs/btrfs/file.c                  |  25 +----
 fs/btrfs/super.c                 |   3 +-
 fs/ext4/super.c                  |   2 +-
 fs/inode.c                       | 222 +++++++++++++++++++++++++++++++++++----
 fs/stat.c                        |  39 ++++++-
 fs/xfs/libxfs/xfs_trans_inode.c  |   6 +-
 fs/xfs/xfs_iops.c                |   6 +-
 fs/xfs/xfs_super.c               |   2 +-
 include/linux/fs.h               |  61 +++++++----
 include/trace/events/timestamp.h | 173 ++++++++++++++++++++++++++++++
 mm/shmem.c                       |   2 +-
 12 files changed, 514 insertions(+), 79 deletions(-)
---
base-commit: 33b321ac3a51e590225585f41c7412b86e987a0d
change-id: 20240626-mgtime-5cd80b18d810

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-07-01 22:49   ` Darrick J. Wong
  2024-06-27  1:00 ` [PATCH 02/10] fs: uninline inode_get_ctime and inode_set_ctime_to_ts Jeff Layton
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

The ctime is not settable to arbitrary values. It always comes from the
system clock, so we'll never stamp an inode with a value that can't be
represented there. If we disregard people setting their system clock
past the year 2262, there is no reason we can't replace the ctime fields
with a ktime_t.

Switch the ctime fields to a single ktime_t. Move the i_generation down
above i_fsnotify_mask and then move the i_version into the resulting 8
byte hole. This shrinks struct inode by 8 bytes total, and should
improve the cache footprint as the i_version and ctime are usually
updated together.

The one downside I can see to switching to a ktime_t is that if someone
has a filesystem with files on it that has ctimes outside the ktime_t
range (before ~1678 AD or after ~2262 AD), we won't be able to display
them properly in stat() without some special treatment in the
filesystem. The operating assumption here is that that is not a
practical problem.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/fs.h | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5ff362277834..5139dec085f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -662,11 +662,10 @@ struct inode {
 	loff_t			i_size;
 	time64_t		i_atime_sec;
 	time64_t		i_mtime_sec;
-	time64_t		i_ctime_sec;
 	u32			i_atime_nsec;
 	u32			i_mtime_nsec;
-	u32			i_ctime_nsec;
-	u32			i_generation;
+	ktime_t			__i_ctime;
+	atomic64_t		i_version;
 	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
 	unsigned short          i_bytes;
 	u8			i_blkbits;
@@ -701,7 +700,6 @@ struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	atomic64_t		i_version;
 	atomic64_t		i_sequence; /* see futex */
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
@@ -724,6 +722,8 @@ struct inode {
 	};
 
 
+	u32			i_generation;
+
 #ifdef CONFIG_FSNOTIFY
 	__u32			i_fsnotify_mask; /* all events this inode cares about */
 	/* 32-bit hole reserved for expanding i_fsnotify_mask */
@@ -1608,29 +1608,25 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
 	return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline time64_t inode_get_ctime_sec(const struct inode *inode)
+static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 {
-	return inode->i_ctime_sec;
+	return ktime_to_timespec64(inode->__i_ctime);
 }
 
-static inline long inode_get_ctime_nsec(const struct inode *inode)
+static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
-	return inode->i_ctime_nsec;
+	return inode_get_ctime(inode).tv_sec;
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
+static inline long inode_get_ctime_nsec(const struct inode *inode)
 {
-	struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
-				 .tv_nsec = inode_get_ctime_nsec(inode) };
-
-	return ts;
+	return inode_get_ctime(inode).tv_nsec;
 }
 
 static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
 						      struct timespec64 ts)
 {
-	inode->i_ctime_sec = ts.tv_sec;
-	inode->i_ctime_nsec = ts.tv_nsec;
+	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
 	return ts;
 }
 

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 02/10] fs: uninline inode_get_ctime and inode_set_ctime_to_ts
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
  2024-06-27  1:00 ` [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27  1:00 ` [PATCH 03/10] fs: tracepoints for inode_needs_update_time " Jeff Layton
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

Move both functions to fs/inode.c as they have grown a little large for
inlining.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/inode.c         | 25 +++++++++++++++++++++++++
 include/linux/fs.h | 13 ++-----------
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e0815acc5abb..7b0a73ed499d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2501,6 +2501,31 @@ struct timespec64 current_time(struct inode *inode)
 }
 EXPORT_SYMBOL(current_time);
 
+/**
+ * inode_get_ctime - fetch the current ctime from the inode
+ * @inode: inode from which to fetch ctime
+ *
+ * Grab the current ctime tv_nsec field from the inode, mask off the
+ * I_CTIME_QUERIED flag and return it. This is mostly intended for use by
+ * internal consumers of the ctime that aren't concerned with ensuring a
+ * fine-grained update on the next change (e.g. when preparing to store
+ * the value in the backing store for later retrieval).
+ */
+struct timespec64 inode_get_ctime(const struct inode *inode)
+{
+	ktime_t ctime = inode->__i_ctime;
+
+	return ktime_to_timespec64(ctime);
+}
+EXPORT_SYMBOL(inode_get_ctime);
+
+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
+{
+	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
+	return ts;
+}
+EXPORT_SYMBOL(inode_set_ctime_to_ts);
+
 /**
  * inode_set_ctime_current - set the ctime to current_time
  * @inode: inode
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5139dec085f2..4b10db12725d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1608,10 +1608,8 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
 	return inode_set_mtime_to_ts(inode, ts);
 }
 
-static inline struct timespec64 inode_get_ctime(const struct inode *inode)
-{
-	return ktime_to_timespec64(inode->__i_ctime);
-}
+struct timespec64 inode_get_ctime(const struct inode *inode);
+struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);
 
 static inline time64_t inode_get_ctime_sec(const struct inode *inode)
 {
@@ -1623,13 +1621,6 @@ static inline long inode_get_ctime_nsec(const struct inode *inode)
 	return inode_get_ctime(inode).tv_nsec;
 }
 
-static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
-						      struct timespec64 ts)
-{
-	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
-	return ts;
-}
-
 /**
  * inode_set_ctime - set the ctime in the inode
  * @inode: inode in which to set the ctime

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 03/10] fs: tracepoints for inode_needs_update_time and inode_set_ctime_to_ts
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
  2024-06-27  1:00 ` [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t Jeff Layton
  2024-06-27  1:00 ` [PATCH 02/10] fs: uninline inode_get_ctime and inode_set_ctime_to_ts Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27  1:00 ` [PATCH 04/10] fs: add infrastructure for multigrain timestamps Jeff Layton
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

Add a new tracepoint for when we're testing whether the timestamps need
updating, and around the update itself.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/inode.c                       |  4 +++
 include/trace/events/timestamp.h | 76 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 7b0a73ed499d..5d2b0dfe48c3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,8 @@
 #include <linux/iversion.h>
 #include <linux/rw_hint.h>
 #include <trace/events/writeback.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/timestamp.h>
 #include "internal.h"
 
 /*
@@ -2096,6 +2098,7 @@ static int inode_needs_update_time(struct inode *inode)
 	if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
 		sync_it |= S_VERSION;
 
+	trace_inode_needs_update_time(inode, &now, &ts, sync_it);
 	return sync_it;
 }
 
@@ -2522,6 +2525,7 @@ EXPORT_SYMBOL(inode_get_ctime);
 struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
 {
 	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
+	trace_inode_set_ctime_to_ts(inode, &ts);
 	return ts;
 }
 EXPORT_SYMBOL(inode_set_ctime_to_ts);
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
new file mode 100644
index 000000000000..35ff875d3800
--- /dev/null
+++ b/include/trace/events/timestamp.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timestamp
+
+#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMESTAMP_H
+
+#include <linux/tracepoint.h>
+#include <linux/fs.h>
+
+TRACE_EVENT(inode_needs_update_time,
+	TP_PROTO(struct inode *inode,
+		 struct timespec64 *now,
+		 struct timespec64 *ctime,
+		 int sync_it),
+
+	TP_ARGS(inode, now, ctime, sync_it),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			dev)
+		__field(ino_t,			ino)
+		__field(time64_t,		now_sec)
+		__field(time64_t,		ctime_sec)
+		__field(long,			now_nsec)
+		__field(long,			ctime_nsec)
+		__field(int,			sync_it)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->sync_it	= sync_it;
+		__entry->now_sec	= now->tv_sec;
+		__entry->ctime_sec	= ctime->tv_sec;
+		__entry->now_nsec	= now->tv_nsec;
+		__entry->ctime_nsec	= ctime->tv_nsec;
+		__entry->sync_it	= sync_it;
+	),
+
+	TP_printk("ino=%d:%d:%ld sync_it=%d now=%llu.%ld ctime=%llu.%lu",
+		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+		__entry->sync_it,
+		__entry->now_sec, __entry->now_nsec,
+		__entry->ctime_sec, __entry->ctime_nsec
+	)
+);
+
+TRACE_EVENT(inode_set_ctime_to_ts,
+	TP_PROTO(struct inode *inode,
+		 struct timespec64 *ts),
+
+	TP_ARGS(inode, ts),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			dev)
+		__field(ino_t,			ino)
+		__field(time64_t,		ts_sec)
+		__field(long,			ts_nsec)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->ts_sec		= ts->tv_sec;
+		__entry->ts_nsec	= ts->tv_nsec;
+	),
+
+	TP_printk("ino=%d:%d:%ld ts=%llu.%lu",
+		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+		__entry->ts_sec, __entry->ts_nsec
+	)
+);
+#endif /* _TRACE_TIMESTAMP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 04/10] fs: add infrastructure for multigrain timestamps
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
                   ` (2 preceding siblings ...)
  2024-06-27  1:00 ` [PATCH 03/10] fs: tracepoints for inode_needs_update_time " Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27 15:02   ` Chuck Lever
  2024-06-27  1:00 ` [PATCH 05/10] fs: add percpu counters to count fine vs. coarse timestamps Jeff Layton
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot metadata updates, down to around 1 per jiffy,
even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. Now that the ctime is stored as a ktime_t, we
can sacrifice the lowest bit in the word to act as a flag marking
whether the current timestamp has been queried via stat() or the like.

This solves the problem of being able to distinguish the timestamp
between updates, but introduces a new problem: it's now possible for a
file being changed to get a fine-grained timestamp and then a file that
was altered later to get a coarse-grained one that appears older than
the earlier fine-grained time. To remedy this, keep a global ktime_t
value that acts as a timestamp floor.

When we go to stamp a file, we first get the latter of the current floor
value and the current coarse-grained time (call this "now"). If the
current inode ctime hasn't been queried then we just attempt to stamp it
with that value using a cmpxchg() operation.

If it has been queried, then first see whether the current coarse time
appears later than what we have. If it does, then we accept that value.
If it doesn't, then we get a fine-grained time and try to swap that into
the global floor. Whether that succeeds or fails, we take the resulting
floor time and try to swap that into the ctime.

There is still one remaining problem:

All of this works as long as the realtime clock is monotonically
increasing. If the clock ever jumps backwards, then we could end up in a
situation where the floor value is "stuck" far in advance of the clock.

To remedy this, sanity check the floor value and if it's more than 6ms
(~2 jiffies) ahead of the current coarse-grained clock, disregard the
floor value, and just accept the current coarse-grained clock.

Filesystems opt into this by setting the FS_MGTIME fstype flag.  One
caveat: those that do will always present ctimes that have the lowest
bit unset, even when the on-disk ctime has it set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/inode.c                       | 168 +++++++++++++++++++++++++++++++++------
 fs/stat.c                        |  39 ++++++++-
 include/linux/fs.h               |  30 +++++++
 include/trace/events/timestamp.h |  97 ++++++++++++++++++++++
 4 files changed, 306 insertions(+), 28 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 5d2b0dfe48c3..12790a26102c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init;
 static struct hlist_head *inode_hashtable __ro_after_init;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+/* Don't send out a ctime lower than this (modulo backward clock jumps). */
+static __cacheline_aligned_in_smp ktime_t ctime_floor;
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2077,19 +2079,86 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/*
+ * The coarse-grained clock ticks once per jiffy (every 2ms or so). If the
+ * current floor is >6ms in the future, assume that the clock has jumped
+ * backward.
+ */
+#define CTIME_FLOOR_MAX_NS	6000000
+
+/**
+ * coarse_ctime - return the current coarse-grained time
+ * @floor: current ctime_floor value
+ *
+ * Get the coarse-grained time, and then determine whether to
+ * return it or the current floor value. Returns the later of the
+ * floor and coarse grained time, unless the floor value is too
+ * far into the future. If that happens, assume the clock has jumped
+ * backward, and that the floor should be ignored.
+ */
+static ktime_t coarse_ctime(ktime_t floor)
+{
+	ktime_t now = ktime_get_coarse_real() & ~I_CTIME_QUERIED;
+
+	/* If coarse time is already newer, return that */
+	if (ktime_before(floor, now))
+		return now;
+
+	/* Ensure floor is not _too_ far in the future */
+	if (ktime_after(floor, now + CTIME_FLOOR_MAX_NS))
+		return now;
+
+	return floor;
+}
+
+/**
+ * current_time - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+struct timespec64 current_time(struct inode *inode)
+{
+	ktime_t ctime, floor = smp_load_acquire(&ctime_floor);
+	ktime_t now = coarse_ctime(floor);
+	struct timespec64 now_ts = ktime_to_timespec64(now);
+
+	if (!is_mgtime(inode))
+		goto out;
+
+	/* If nothing has queried it, then coarse time is fine */
+	ctime = smp_load_acquire(&inode->__i_ctime);
+	if (ctime & I_CTIME_QUERIED) {
+		/*
+		 * If there is no apparent change, then
+		 * get a fine-grained timestamp.
+		 */
+		if ((now | I_CTIME_QUERIED) == ctime) {
+			ktime_get_real_ts64(&now_ts);
+			now_ts.tv_nsec &= ~I_CTIME_QUERIED;
+		}
+	}
+out:
+	return timestamp_truncate(now_ts, inode);
+}
+EXPORT_SYMBOL(current_time);
+
 static int inode_needs_update_time(struct inode *inode)
 {
+	struct timespec64 now, ts;
 	int sync_it = 0;
-	struct timespec64 now = current_time(inode);
-	struct timespec64 ts;
 
 	/* First try to exhaust all avenues to not sync */
 	if (IS_NOCMTIME(inode))
 		return 0;
 
+	now = current_time(inode);
+
 	ts = inode_get_mtime(inode);
 	if (!timespec64_equal(&ts, &now))
-		sync_it = S_MTIME;
+		sync_it |= S_MTIME;
 
 	ts = inode_get_ctime(inode);
 	if (!timespec64_equal(&ts, &now))
@@ -2485,25 +2554,6 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
 }
 EXPORT_SYMBOL(timestamp_truncate);
 
-/**
- * current_time - Return FS time
- * @inode: inode.
- *
- * Return the current time truncated to the time granularity supported by
- * the fs.
- *
- * Note that inode and inode->sb cannot be NULL.
- * Otherwise, the function warns and returns time without truncation.
- */
-struct timespec64 current_time(struct inode *inode)
-{
-	struct timespec64 now;
-
-	ktime_get_coarse_real_ts64(&now);
-	return timestamp_truncate(now, inode);
-}
-EXPORT_SYMBOL(current_time);
-
 /**
  * inode_get_ctime - fetch the current ctime from the inode
  * @inode: inode from which to fetch ctime
@@ -2518,12 +2568,18 @@ struct timespec64 inode_get_ctime(const struct inode *inode)
 {
 	ktime_t ctime = inode->__i_ctime;
 
+	if (is_mgtime(inode))
+		ctime &= ~I_CTIME_QUERIED;
 	return ktime_to_timespec64(ctime);
 }
 EXPORT_SYMBOL(inode_get_ctime);
 
 struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
 {
+	trace_inode_set_ctime_to_ts(inode, &ts);
+
+	if (is_mgtime(inode))
+		ts.tv_nsec &= ~I_CTIME_QUERIED;
 	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
 	trace_inode_set_ctime_to_ts(inode, &ts);
 	return ts;
@@ -2535,14 +2591,74 @@ EXPORT_SYMBOL(inode_set_ctime_to_ts);
  * @inode: inode
  *
  * Set the inode->i_ctime to the current value for the inode. Returns
- * the current value that was assigned to i_ctime.
+ * the current value that was assigned to i_ctime. If this is a not
+ * multigrain inode, then we just set it to whatever the coarse time is.
+ *
+ * If it is multigrain, then we first see if the coarse-grained
+ * timestamp is distinct from what we have. If so, then we'll just use
+ * that. If we have to get a fine-grained timestamp, then do so, and
+ * try to swap it into the floor. We accept the new floor value
+ * regardless of the outcome of the cmpxchg. After that, we try to
+ * swap the new value into __i_ctime. Again, we take the resulting
+ * ctime, regardless of the outcome of the swap.
  */
 struct timespec64 inode_set_ctime_current(struct inode *inode)
 {
-	struct timespec64 now = current_time(inode);
+	ktime_t ctime, now, cur, floor = smp_load_acquire(&ctime_floor);
+
+	now = coarse_ctime(floor);
 
-	inode_set_ctime_to_ts(inode, now);
-	return now;
+	/* Just return that if this is not a multigrain fs */
+	if (!is_mgtime(inode)) {
+		inode->__i_ctime = now;
+		goto out;
+	}
+
+	/*
+	 * We only need a fine-grained time if someone has queried it,
+	 * and the current coarse grained time isn't later than what's
+	 * already there.
+	 */
+	ctime = smp_load_acquire(&inode->__i_ctime);
+	if ((ctime & I_CTIME_QUERIED) && !ktime_after(now, ctime & ~I_CTIME_QUERIED)) {
+		ktime_t old;
+
+		/* Get a fine-grained time */
+		now = ktime_get_real() & ~I_CTIME_QUERIED;
+
+		/*
+		 * If the cmpxchg works, we take the new floor value. If
+		 * not, then that means that someone else changed it after we
+		 * fetched it but before we got here. That value is just
+		 * as good, so keep it.
+		 */
+		old = cmpxchg(&ctime_floor, floor, now);
+		trace_ctime_floor_update(inode, floor, now, old);
+		if (old != floor)
+			now = old;
+	}
+retry:
+	/* Try to swap the ctime into place. */
+	cur = cmpxchg(&inode->__i_ctime, ctime, now);
+	trace_ctime_inode_update(inode, ctime, now, cur);
+
+	/* If swap occurred, then we're done */
+	if (cur != ctime) {
+		/*
+		 * Was the change due to someone marking the old ctime QUERIED?
+		 * If so then retry the swap. This can only happen once since
+		 * the only way to clear I_CTIME_QUERIED is to stamp the inode
+		 * with a new ctime.
+		 */
+		if (!(ctime & I_CTIME_QUERIED) && (ctime | I_CTIME_QUERIED) == cur) {
+			ctime = cur;
+			goto retry;
+		}
+		/* Otherwise, take the new ctime */
+		now = cur & ~I_CTIME_QUERIED;
+	}
+out:
+	return timestamp_truncate(ktime_to_timespec64(now), inode);
 }
 EXPORT_SYMBOL(inode_set_ctime_current);
 
diff --git a/fs/stat.c b/fs/stat.c
index 6f65b3456cad..7e9bd16b553b 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -22,10 +22,39 @@
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
+#include <trace/events/timestamp.h>
 
 #include "internal.h"
 #include "mount.h"
 
+/**
+ * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
+ * @stat: where to store the resulting values
+ * @request_mask: STATX_* values requested
+ * @inode: inode from which to grab the c/mtime
+ *
+ * Given @inode, grab the ctime and mtime out if it and store the result
+ * in @stat. When fetching the value, flag it as queried so the next write
+ * will ensure a distinct timestamp.
+ */
+void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
+{
+	atomic_long_t *pc = (atomic_long_t *)&inode->__i_ctime;
+
+	/* If neither time was requested, then don't report them */
+	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
+		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
+		return;
+	}
+
+	stat->mtime.tv_sec = inode->i_mtime_sec;
+	stat->mtime.tv_nsec = inode->i_mtime_nsec;
+	stat->ctime = ktime_to_timespec64(atomic_long_fetch_or(I_CTIME_QUERIED, pc) &
+						~I_CTIME_QUERIED);
+	trace_fill_mg_cmtime(inode, atomic_long_read(pc));
+}
+EXPORT_SYMBOL(fill_mg_cmtime);
+
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
  * @idmap:		idmap of the mount the inode was found from
@@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
 	stat->rdev = inode->i_rdev;
 	stat->size = i_size_read(inode);
 	stat->atime = inode_get_atime(inode);
-	stat->mtime = inode_get_mtime(inode);
-	stat->ctime = inode_get_ctime(inode);
+
+	if (is_mgtime(inode)) {
+		fill_mg_cmtime(stat, request_mask, inode);
+	} else {
+		stat->ctime = inode_get_ctime(inode);
+		stat->mtime = inode_get_mtime(inode);
+	}
+
 	stat->blksize = i_blocksize(inode);
 	stat->blocks = inode->i_blocks;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4b10db12725d..5694cb6c4dc2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1608,6 +1608,23 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
 	return inode_set_mtime_to_ts(inode, ts);
 }
 
+/*
+ * Multigrain timestamps
+ *
+ * Conditionally use fine-grained ctime and mtime timestamps when there
+ * are users actively observing them via getattr. The primary use-case
+ * for this is NFS clients that use the ctime to distinguish between
+ * different states of the file, and that are often fooled by multiple
+ * operations that occur in the same coarse-grained timer tick.
+ *
+ * We use the least significant bit of the ktime_t to track the QUERIED
+ * flag. This means that filesystems with multigrain timestamps effectively
+ * have 2ns resolution for the ctime, even if they advertise 1ns s_time_gran.
+ */
+#define I_CTIME_QUERIED		(1LL)
+
+static inline bool is_mgtime(const struct inode *inode);
+
 struct timespec64 inode_get_ctime(const struct inode *inode);
 struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);
 
@@ -2477,6 +2494,7 @@ struct file_system_type {
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
 #define FS_ALLOW_IDMAP         32      /* FS has been updated to handle vfs idmappings. */
+#define FS_MGTIME		64	/* FS uses multigrain timestamps */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	int (*init_fs_context)(struct fs_context *);
 	const struct fs_parameter_spec *parameters;
@@ -2500,6 +2518,17 @@ struct file_system_type {
 
 #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
 
+/**
+ * is_mgtime: is this inode using multigrain timestamps
+ * @inode: inode to test for multigrain timestamps
+ *
+ * Return true if the inode uses multigrain timestamps, false otherwise.
+ */
+static inline bool is_mgtime(const struct inode *inode)
+{
+	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
+}
+
 extern struct dentry *mount_bdev(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data,
 	int (*fill_super)(struct super_block *, void *, int));
@@ -3234,6 +3263,7 @@ extern void page_put_link(void *);
 extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern void kfree_link(void *);
+void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode);
 void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
index 35ff875d3800..1f71738aa38c 100644
--- a/include/trace/events/timestamp.h
+++ b/include/trace/events/timestamp.h
@@ -8,6 +8,78 @@
 #include <linux/tracepoint.h>
 #include <linux/fs.h>
 
+TRACE_EVENT(ctime_floor_update,
+	TP_PROTO(struct inode *inode,
+		 ktime_t old,
+		 ktime_t new,
+		 ktime_t cur),
+
+	TP_ARGS(inode, old, new, cur),
+
+	TP_STRUCT__entry(
+		__field(dev_t,				dev)
+		__field(ino_t,				ino)
+		__field(ktime_t,			old)
+		__field(ktime_t,			new)
+		__field(ktime_t,			cur)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->old		= old;
+		__entry->new		= new;
+		__entry->cur		= cur;
+	),
+
+	TP_printk("ino=%d:%d:%lu old=%llu.%lu new=%llu.%lu cur=%llu.%lu swp=%c",
+		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+		ktime_to_timespec64(__entry->old).tv_sec,
+		ktime_to_timespec64(__entry->old).tv_nsec,
+		ktime_to_timespec64(__entry->new).tv_sec,
+		ktime_to_timespec64(__entry->new).tv_nsec,
+		ktime_to_timespec64(__entry->cur).tv_sec,
+		ktime_to_timespec64(__entry->cur).tv_nsec,
+		(__entry->old == __entry->cur) ? 'Y' : 'N'
+	)
+);
+
+TRACE_EVENT(ctime_inode_update,
+	TP_PROTO(struct inode *inode,
+		 ktime_t old,
+		 ktime_t new,
+		 ktime_t cur),
+
+	TP_ARGS(inode, old, new, cur),
+
+	TP_STRUCT__entry(
+		__field(dev_t,				dev)
+		__field(ino_t,				ino)
+		__field(ktime_t,			old)
+		__field(ktime_t,			new)
+		__field(ktime_t,			cur)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->old		= old;
+		__entry->new		= new;
+		__entry->cur		= cur;
+	),
+
+	TP_printk("ino=%d:%d:%ld old=%llu.%ld new=%llu.%ld cur=%llu.%ld swp=%c",
+		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+		ktime_to_timespec64(__entry->old).tv_sec,
+		ktime_to_timespec64(__entry->old).tv_nsec,
+		ktime_to_timespec64(__entry->new).tv_sec,
+		ktime_to_timespec64(__entry->new).tv_nsec,
+		ktime_to_timespec64(__entry->cur).tv_sec,
+		ktime_to_timespec64(__entry->cur).tv_nsec,
+		(__entry->old == __entry->cur ? 'Y' : 'N')
+	)
+);
+
 TRACE_EVENT(inode_needs_update_time,
 	TP_PROTO(struct inode *inode,
 		 struct timespec64 *now,
@@ -70,6 +142,31 @@ TRACE_EVENT(inode_set_ctime_to_ts,
 		__entry->ts_sec, __entry->ts_nsec
 	)
 );
+
+TRACE_EVENT(fill_mg_cmtime,
+	TP_PROTO(struct inode *inode,
+		 ktime_t ctime),
+
+	TP_ARGS(inode, ctime),
+
+	TP_STRUCT__entry(
+		__field(dev_t,			dev)
+		__field(ino_t,			ino)
+		__field(ktime_t,		ctime)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= inode->i_sb->s_dev;
+		__entry->ino		= inode->i_ino;
+		__entry->ctime		= ctime;
+	),
+
+	TP_printk("ino=%d:%d:%ld ctime=%llu.%lu",
+		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
+		ktime_to_timespec64(__entry->ctime).tv_sec,
+		ktime_to_timespec64(__entry->ctime).tv_nsec
+	)
+);
 #endif /* _TRACE_TIMESTAMP_H */
 
 /* This part must be outside protection */

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 05/10] fs: add percpu counters to count fine vs. coarse timestamps
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
                   ` (3 preceding siblings ...)
  2024-06-27  1:00 ` [PATCH 04/10] fs: add infrastructure for multigrain timestamps Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27  1:00 ` [PATCH 06/10] fs: have setattr_copy handle multigrain timestamps appropriately Jeff Layton
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

Keep a pair of percpu counters so we can track what proportion of
timestamps is fine-grained.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/inode.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 12790a26102c..18a9d1398773 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,8 @@
 #include <linux/list_lru.h>
 #include <linux/iversion.h>
 #include <linux/rw_hint.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
 #include <trace/events/writeback.h>
 #define CREATE_TRACE_POINTS
 #include <trace/events/timestamp.h>
@@ -64,6 +66,11 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
 /* Don't send out a ctime lower than this (modulo backward clock jumps). */
 static __cacheline_aligned_in_smp ktime_t ctime_floor;
+
+/* Keep track of the number of fine vs. coarse timestamp fetches */
+static struct percpu_counter mg_fine_ts;
+static struct percpu_counter mg_coarse_ts;
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -2636,6 +2643,9 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
 		trace_ctime_floor_update(inode, floor, now, old);
 		if (old != floor)
 			now = old;
+		percpu_counter_inc(&mg_fine_ts);
+	} else {
+		percpu_counter_inc(&mg_coarse_ts);
 	}
 retry:
 	/* Try to swap the ctime into place. */
@@ -2711,3 +2721,32 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
 	return mode & ~S_ISGID;
 }
 EXPORT_SYMBOL(mode_strip_sgid);
+
+static int mgts_show(struct seq_file *s, void *p)
+{
+	u64 fine = percpu_counter_sum(&mg_fine_ts);
+	u64 coarse = percpu_counter_sum(&mg_coarse_ts);
+
+	seq_printf(s, "%llu %llu\n", fine, coarse);
+	return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(mgts);
+
+static int __init mg_debugfs_init(void)
+{
+	int ret = percpu_counter_init(&mg_fine_ts, 0, GFP_KERNEL);
+
+	if (ret)
+		return ret;
+
+	ret = percpu_counter_init(&mg_coarse_ts, 0, GFP_KERNEL);
+	if (ret) {
+		percpu_counter_destroy(&mg_fine_ts);
+		return ret;
+	}
+
+	debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, NULL, &mgts_fops);
+	return 0;
+}
+late_initcall(mg_debugfs_init);

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 06/10] fs: have setattr_copy handle multigrain timestamps appropriately
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
                   ` (4 preceding siblings ...)
  2024-06-27  1:00 ` [PATCH 05/10] fs: add percpu counters to count fine vs. coarse timestamps Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27  1:00 ` [PATCH 07/10] xfs: switch to multigrain timestamps Jeff Layton
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

The setattr codepath is still using coarse-grained timestamps, even on
multigrain filesystems. To fix this, we need to fetch the timestamp for
ctime updates later, at the point where the assignment occurs in
setattr_copy.

On a multigrain inode, ignore the ia_ctime in the attrs, and always
update the ctime to the current clock value. Update the atime and mtime
with the same value (if needed) unless they are being set to other
specific values, a'la utimes().

Note that we don't want to do this universally however, as some
filesystems (e.g. most networked fs) want to do an explicit update
elsewhere before updating the local inode.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/attr.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 825007d5cda4..e03ea6951864 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
 }
 EXPORT_SYMBOL(inode_newsize_ok);
 
+/**
+ * setattr_copy_mgtime - update timestamps for mgtime inodes
+ * @inode: inode timestamps to be updated
+ * @attr: attrs for the update
+ *
+ * With multigrain timestamps, we need to take more care to prevent races
+ * when updating the ctime. Always update the ctime to the very latest
+ * using the standard mechanism, and use that to populate the atime and
+ * mtime appropriately (unless we're setting those to specific values).
+ */
+static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
+{
+	unsigned int ia_valid = attr->ia_valid;
+	struct timespec64 now;
+
+	/*
+	 * If the ctime isn't being updated then nothing else should be
+	 * either.
+	 */
+	if (!(ia_valid & ATTR_CTIME)) {
+		WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
+		return;
+	}
+
+	now = inode_set_ctime_current(inode);
+	if (ia_valid & ATTR_ATIME_SET)
+		inode_set_atime_to_ts(inode, attr->ia_atime);
+	else if (ia_valid & ATTR_ATIME)
+		inode_set_atime_to_ts(inode, now);
+
+	if (ia_valid & ATTR_MTIME_SET)
+		inode_set_mtime_to_ts(inode, attr->ia_mtime);
+	else if (ia_valid & ATTR_MTIME)
+		inode_set_mtime_to_ts(inode, now);
+}
+
 /**
  * setattr_copy - copy simple metadata updates into the generic inode
  * @idmap:	idmap of the mount the inode was found from
@@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
 
 	i_uid_update(idmap, attr, inode);
 	i_gid_update(idmap, attr, inode);
-	if (ia_valid & ATTR_ATIME)
-		inode_set_atime_to_ts(inode, attr->ia_atime);
-	if (ia_valid & ATTR_MTIME)
-		inode_set_mtime_to_ts(inode, attr->ia_mtime);
-	if (ia_valid & ATTR_CTIME)
-		inode_set_ctime_to_ts(inode, attr->ia_ctime);
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
 		if (!in_group_or_capable(idmap, inode,
@@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
 	}
+
+	if (is_mgtime(inode))
+		return setattr_copy_mgtime(inode, attr);
+
+	if (ia_valid & ATTR_ATIME)
+		inode_set_atime_to_ts(inode, attr->ia_atime);
+	if (ia_valid & ATTR_MTIME)
+		inode_set_mtime_to_ts(inode, attr->ia_mtime);
+	if (ia_valid & ATTR_CTIME)
+		inode_set_ctime_to_ts(inode, attr->ia_ctime);
 }
 EXPORT_SYMBOL(setattr_copy);
 

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 07/10] xfs: switch to multigrain timestamps
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
                   ` (5 preceding siblings ...)
  2024-06-27  1:00 ` [PATCH 06/10] fs: have setattr_copy handle multigrain timestamps appropriately Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27  1:00 ` [PATCH 08/10] ext4: " Jeff Layton
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
 fs/xfs/xfs_iops.c               | 6 ++++--
 fs/xfs/xfs_super.c              | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 69fc5b981352..1f3639bbf5f0 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
 	ASSERT(tp);
 	xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
 
-	tv = current_time(inode);
+	/* If the mtime changes, then ctime must also change */
+	ASSERT(flags & XFS_ICHGTIME_CHG);
 
+	tv = inode_set_ctime_current(inode);
 	if (flags & XFS_ICHGTIME_MOD)
 		inode_set_mtime_to_ts(inode, tv);
-	if (flags & XFS_ICHGTIME_CHG)
-		inode_set_ctime_to_ts(inode, tv);
 	if (flags & XFS_ICHGTIME_CREATE)
 		ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..ed6e6d9507df 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -590,10 +590,12 @@ xfs_vn_getattr(
 	stat->gid = vfsgid_into_kgid(vfsgid);
 	stat->ino = ip->i_ino;
 	stat->atime = inode_get_atime(inode);
-	stat->mtime = inode_get_mtime(inode);
-	stat->ctime = inode_get_ctime(inode);
+
+	fill_mg_cmtime(stat, request_mask, inode);
+
 	stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
+
 	if (xfs_has_v3inodes(mp)) {
 		if (request_mask & STATX_BTIME) {
 			stat->result_mask |= STATX_BTIME;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..210481b03fdb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
 	.init_fs_context	= xfs_init_fs_context,
 	.parameters		= xfs_fs_parameters,
 	.kill_sb		= xfs_kill_sb,
-	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 08/10] ext4: switch to multigrain timestamps
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
                   ` (6 preceding siblings ...)
  2024-06-27  1:00 ` [PATCH 07/10] xfs: switch to multigrain timestamps Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27  1:00 ` [PATCH 09/10] btrfs: convert " Jeff Layton
  2024-06-27  1:00 ` [PATCH 10/10] tmpfs: add support for " Jeff Layton
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c682fb927b64..9ae48763f81f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7310,7 +7310,7 @@ static struct file_system_type ext4_fs_type = {
 	.init_fs_context	= ext4_init_fs_context,
 	.parameters		= ext4_param_specs,
 	.kill_sb		= ext4_kill_sb,
-	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+	.fs_flags		= FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 09/10] btrfs: convert to multigrain timestamps
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
                   ` (7 preceding siblings ...)
  2024-06-27  1:00 ` [PATCH 08/10] ext4: " Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  2024-06-27  1:00 ` [PATCH 10/10] tmpfs: add support for " Jeff Layton
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/btrfs/file.c  | 25 ++++---------------------
 fs/btrfs/super.c |  3 ++-
 2 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e764ac3f22e2..89b3c200c374 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
 	btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-	struct timespec64 now, ts;
-
-	if (IS_NOCMTIME(inode))
-		return;
-
-	now = current_time(inode);
-	ts = inode_get_mtime(inode);
-	if (!timespec64_equal(&ts, &now))
-		inode_set_mtime_to_ts(inode, now);
-
-	ts = inode_get_ctime(inode);
-	if (!timespec64_equal(&ts, &now))
-		inode_set_ctime_to_ts(inode, now);
-
-	if (IS_I_VERSION(inode))
-		inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 			     size_t count)
 {
@@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 	 * need to start yet another transaction to update the inode as we will
 	 * update the inode when we finish writing whatever data we write.
 	 */
-	update_time_for_write(inode);
+	if (!IS_NOCMTIME(inode)) {
+		inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
+		inode_inc_iversion(inode);
+	}
 
 	start_pos = round_down(pos, fs_info->sectorsize);
 	oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..1cd50293b98d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2173,7 +2173,8 @@ static struct file_system_type btrfs_fs_type = {
 	.init_fs_context	= btrfs_init_fs_context,
 	.parameters		= btrfs_fs_parameters,
 	.kill_sb		= btrfs_kill_super,
-	.fs_flags		= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+	.fs_flags		= FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+				  FS_ALLOW_IDMAP | FS_MGTIME,
  };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 10/10] tmpfs: add support for multigrain timestamps
  2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
                   ` (8 preceding siblings ...)
  2024-06-27  1:00 ` [PATCH 09/10] btrfs: convert " Jeff Layton
@ 2024-06-27  1:00 ` Jeff Layton
  9 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27  1:00 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton
  Cc: kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs,
	Jeff Layton

Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ff7c756a7d02..d650f48444e0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4653,7 +4653,7 @@ static struct file_system_type shmem_fs_type = {
 	.parameters	= shmem_fs_parameters,
 #endif
 	.kill_sb	= kill_litter_super,
-	.fs_flags	= FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+	.fs_flags	= FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 void __init shmem_init(void)

-- 
2.45.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 04/10] fs: add infrastructure for multigrain timestamps
  2024-06-27  1:00 ` [PATCH 04/10] fs: add infrastructure for multigrain timestamps Jeff Layton
@ 2024-06-27 15:02   ` Chuck Lever
  2024-06-27 15:35     ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-06-27 15:02 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Wed, Jun 26, 2024 at 09:00:24PM -0400, Jeff Layton wrote:
> The VFS always uses coarse-grained timestamps when updating the ctime
> and mtime after a change. This has the benefit of allowing filesystems
> to optimize away a lot metadata updates, down to around 1 per jiffy,
> even when a file is under heavy writes.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g backup
> applications).
> 
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
> 
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried. Now that the ctime is stored as a ktime_t, we
> can sacrifice the lowest bit in the word to act as a flag marking
> whether the current timestamp has been queried via stat() or the like.
> 
> This solves the problem of being able to distinguish the timestamp
> between updates, but introduces a new problem: it's now possible for a
> file being changed to get a fine-grained timestamp and then a file that
> was altered later to get a coarse-grained one that appears older than
> the earlier fine-grained time. To remedy this, keep a global ktime_t
> value that acts as a timestamp floor.
> 
> When we go to stamp a file, we first get the latter of the current floor
> value and the current coarse-grained time (call this "now"). If the
> current inode ctime hasn't been queried then we just attempt to stamp it
> with that value using a cmpxchg() operation.
> 
> If it has been queried, then first see whether the current coarse time
> appears later than what we have. If it does, then we accept that value.
> If it doesn't, then we get a fine-grained time and try to swap that into
> the global floor. Whether that succeeds or fails, we take the resulting
> floor time and try to swap that into the ctime.
> 
> There is still one remaining problem:
> 
> All of this works as long as the realtime clock is monotonically
> increasing. If the clock ever jumps backwards, then we could end up in a
> situation where the floor value is "stuck" far in advance of the clock.
> 
> To remedy this, sanity check the floor value and if it's more than 6ms
> (~2 jiffies) ahead of the current coarse-grained clock, disregard the
> floor value, and just accept the current coarse-grained clock.
> 
> Filesystems opt into this by setting the FS_MGTIME fstype flag.  One
> caveat: those that do will always present ctimes that have the lowest
> bit unset, even when the on-disk ctime has it set.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/inode.c                       | 168 +++++++++++++++++++++++++++++++++------
>  fs/stat.c                        |  39 ++++++++-
>  include/linux/fs.h               |  30 +++++++
>  include/trace/events/timestamp.h |  97 ++++++++++++++++++++++
>  4 files changed, 306 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 5d2b0dfe48c3..12790a26102c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init;
>  static struct hlist_head *inode_hashtable __ro_after_init;
>  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
>  
> +/* Don't send out a ctime lower than this (modulo backward clock jumps). */
> +static __cacheline_aligned_in_smp ktime_t ctime_floor;

This is piece of memory that will be hit pretty hard (and you
obviously recognize that because of the alignment attribute).

Would it be of any benefit to keep a distinct ctime_floor in each
super block instead?


>  /*
>   * Empty aops. Can be used for the cases where the user does not
>   * define any of the address_space operations.
> @@ -2077,19 +2079,86 @@ int file_remove_privs(struct file *file)
>  }
>  EXPORT_SYMBOL(file_remove_privs);
>  
> +/*
> + * The coarse-grained clock ticks once per jiffy (every 2ms or so). If the
> + * current floor is >6ms in the future, assume that the clock has jumped
> + * backward.
> + */
> +#define CTIME_FLOOR_MAX_NS	6000000
> +
> +/**
> + * coarse_ctime - return the current coarse-grained time
> + * @floor: current ctime_floor value
> + *
> + * Get the coarse-grained time, and then determine whether to
> + * return it or the current floor value. Returns the later of the
> + * floor and coarse grained time, unless the floor value is too
> + * far into the future. If that happens, assume the clock has jumped
> + * backward, and that the floor should be ignored.
> + */
> +static ktime_t coarse_ctime(ktime_t floor)
> +{
> +	ktime_t now = ktime_get_coarse_real() & ~I_CTIME_QUERIED;
> +
> +	/* If coarse time is already newer, return that */
> +	if (ktime_before(floor, now))
> +		return now;
> +
> +	/* Ensure floor is not _too_ far in the future */
> +	if (ktime_after(floor, now + CTIME_FLOOR_MAX_NS))
> +		return now;
> +
> +	return floor;
> +}
> +
> +/**
> + * current_time - Return FS time (possibly fine-grained)
> + * @inode: inode.
> + *
> + * Return the current time truncated to the time granularity supported by
> + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> + * as having been QUERIED, get a fine-grained timestamp.
> + */
> +struct timespec64 current_time(struct inode *inode)
> +{
> +	ktime_t ctime, floor = smp_load_acquire(&ctime_floor);
> +	ktime_t now = coarse_ctime(floor);
> +	struct timespec64 now_ts = ktime_to_timespec64(now);
> +
> +	if (!is_mgtime(inode))
> +		goto out;
> +
> +	/* If nothing has queried it, then coarse time is fine */
> +	ctime = smp_load_acquire(&inode->__i_ctime);
> +	if (ctime & I_CTIME_QUERIED) {
> +		/*
> +		 * If there is no apparent change, then
> +		 * get a fine-grained timestamp.
> +		 */
> +		if ((now | I_CTIME_QUERIED) == ctime) {
> +			ktime_get_real_ts64(&now_ts);
> +			now_ts.tv_nsec &= ~I_CTIME_QUERIED;
> +		}
> +	}
> +out:
> +	return timestamp_truncate(now_ts, inode);
> +}
> +EXPORT_SYMBOL(current_time);
> +
>  static int inode_needs_update_time(struct inode *inode)
>  {
> +	struct timespec64 now, ts;
>  	int sync_it = 0;
> -	struct timespec64 now = current_time(inode);
> -	struct timespec64 ts;
>  
>  	/* First try to exhaust all avenues to not sync */
>  	if (IS_NOCMTIME(inode))
>  		return 0;
>  
> +	now = current_time(inode);
> +
>  	ts = inode_get_mtime(inode);
>  	if (!timespec64_equal(&ts, &now))
> -		sync_it = S_MTIME;
> +		sync_it |= S_MTIME;
>  
>  	ts = inode_get_ctime(inode);
>  	if (!timespec64_equal(&ts, &now))
> @@ -2485,25 +2554,6 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
>  }
>  EXPORT_SYMBOL(timestamp_truncate);
>  
> -/**
> - * current_time - Return FS time
> - * @inode: inode.
> - *
> - * Return the current time truncated to the time granularity supported by
> - * the fs.
> - *
> - * Note that inode and inode->sb cannot be NULL.
> - * Otherwise, the function warns and returns time without truncation.
> - */
> -struct timespec64 current_time(struct inode *inode)
> -{
> -	struct timespec64 now;
> -
> -	ktime_get_coarse_real_ts64(&now);
> -	return timestamp_truncate(now, inode);
> -}
> -EXPORT_SYMBOL(current_time);
> -
>  /**
>   * inode_get_ctime - fetch the current ctime from the inode
>   * @inode: inode from which to fetch ctime
> @@ -2518,12 +2568,18 @@ struct timespec64 inode_get_ctime(const struct inode *inode)
>  {
>  	ktime_t ctime = inode->__i_ctime;
>  
> +	if (is_mgtime(inode))
> +		ctime &= ~I_CTIME_QUERIED;
>  	return ktime_to_timespec64(ctime);
>  }
>  EXPORT_SYMBOL(inode_get_ctime);
>  
>  struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
>  {
> +	trace_inode_set_ctime_to_ts(inode, &ts);
> +
> +	if (is_mgtime(inode))
> +		ts.tv_nsec &= ~I_CTIME_QUERIED;
>  	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
>  	trace_inode_set_ctime_to_ts(inode, &ts);
>  	return ts;
> @@ -2535,14 +2591,74 @@ EXPORT_SYMBOL(inode_set_ctime_to_ts);
>   * @inode: inode
>   *
>   * Set the inode->i_ctime to the current value for the inode. Returns
> - * the current value that was assigned to i_ctime.
> + * the current value that was assigned to i_ctime. If this is a not
> + * multigrain inode, then we just set it to whatever the coarse time is.
> + *
> + * If it is multigrain, then we first see if the coarse-grained
> + * timestamp is distinct from what we have. If so, then we'll just use
> + * that. If we have to get a fine-grained timestamp, then do so, and
> + * try to swap it into the floor. We accept the new floor value
> + * regardless of the outcome of the cmpxchg. After that, we try to
> + * swap the new value into __i_ctime. Again, we take the resulting
> + * ctime, regardless of the outcome of the swap.
>   */
>  struct timespec64 inode_set_ctime_current(struct inode *inode)
>  {
> -	struct timespec64 now = current_time(inode);
> +	ktime_t ctime, now, cur, floor = smp_load_acquire(&ctime_floor);
> +
> +	now = coarse_ctime(floor);
>  
> -	inode_set_ctime_to_ts(inode, now);
> -	return now;
> +	/* Just return that if this is not a multigrain fs */
> +	if (!is_mgtime(inode)) {
> +		inode->__i_ctime = now;
> +		goto out;
> +	}
> +
> +	/*
> +	 * We only need a fine-grained time if someone has queried it,
> +	 * and the current coarse grained time isn't later than what's
> +	 * already there.
> +	 */
> +	ctime = smp_load_acquire(&inode->__i_ctime);
> +	if ((ctime & I_CTIME_QUERIED) && !ktime_after(now, ctime & ~I_CTIME_QUERIED)) {
> +		ktime_t old;
> +
> +		/* Get a fine-grained time */
> +		now = ktime_get_real() & ~I_CTIME_QUERIED;
> +
> +		/*
> +		 * If the cmpxchg works, we take the new floor value. If
> +		 * not, then that means that someone else changed it after we
> +		 * fetched it but before we got here. That value is just
> +		 * as good, so keep it.
> +		 */
> +		old = cmpxchg(&ctime_floor, floor, now);
> +		trace_ctime_floor_update(inode, floor, now, old);
> +		if (old != floor)
> +			now = old;
> +	}
> +retry:
> +	/* Try to swap the ctime into place. */
> +	cur = cmpxchg(&inode->__i_ctime, ctime, now);
> +	trace_ctime_inode_update(inode, ctime, now, cur);
> +
> +	/* If swap occurred, then we're done */
> +	if (cur != ctime) {
> +		/*
> +		 * Was the change due to someone marking the old ctime QUERIED?
> +		 * If so then retry the swap. This can only happen once since
> +		 * the only way to clear I_CTIME_QUERIED is to stamp the inode
> +		 * with a new ctime.
> +		 */
> +		if (!(ctime & I_CTIME_QUERIED) && (ctime | I_CTIME_QUERIED) == cur) {
> +			ctime = cur;
> +			goto retry;
> +		}
> +		/* Otherwise, take the new ctime */
> +		now = cur & ~I_CTIME_QUERIED;
> +	}
> +out:
> +	return timestamp_truncate(ktime_to_timespec64(now), inode);
>  }
>  EXPORT_SYMBOL(inode_set_ctime_current);
>  
> diff --git a/fs/stat.c b/fs/stat.c
> index 6f65b3456cad..7e9bd16b553b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -22,10 +22,39 @@
>  
>  #include <linux/uaccess.h>
>  #include <asm/unistd.h>
> +#include <trace/events/timestamp.h>
>  
>  #include "internal.h"
>  #include "mount.h"
>  
> +/**
> + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> + * @stat: where to store the resulting values
> + * @request_mask: STATX_* values requested
> + * @inode: inode from which to grab the c/mtime
> + *
> + * Given @inode, grab the ctime and mtime out if it and store the result
> + * in @stat. When fetching the value, flag it as queried so the next write
> + * will ensure a distinct timestamp.
> + */
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> +{
> +	atomic_long_t *pc = (atomic_long_t *)&inode->__i_ctime;
> +
> +	/* If neither time was requested, then don't report them */
> +	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> +		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> +		return;
> +	}
> +
> +	stat->mtime.tv_sec = inode->i_mtime_sec;
> +	stat->mtime.tv_nsec = inode->i_mtime_nsec;
> +	stat->ctime = ktime_to_timespec64(atomic_long_fetch_or(I_CTIME_QUERIED, pc) &
> +						~I_CTIME_QUERIED);
> +	trace_fill_mg_cmtime(inode, atomic_long_read(pc));
> +}
> +EXPORT_SYMBOL(fill_mg_cmtime);
> +
>  /**
>   * generic_fillattr - Fill in the basic attributes from the inode struct
>   * @idmap:		idmap of the mount the inode was found from
> @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
>  	stat->rdev = inode->i_rdev;
>  	stat->size = i_size_read(inode);
>  	stat->atime = inode_get_atime(inode);
> -	stat->mtime = inode_get_mtime(inode);
> -	stat->ctime = inode_get_ctime(inode);
> +
> +	if (is_mgtime(inode)) {
> +		fill_mg_cmtime(stat, request_mask, inode);
> +	} else {
> +		stat->ctime = inode_get_ctime(inode);
> +		stat->mtime = inode_get_mtime(inode);
> +	}
> +
>  	stat->blksize = i_blocksize(inode);
>  	stat->blocks = inode->i_blocks;
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4b10db12725d..5694cb6c4dc2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1608,6 +1608,23 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
>  	return inode_set_mtime_to_ts(inode, ts);
>  }
>  
> +/*
> + * Multigrain timestamps
> + *
> + * Conditionally use fine-grained ctime and mtime timestamps when there
> + * are users actively observing them via getattr. The primary use-case
> + * for this is NFS clients that use the ctime to distinguish between
> + * different states of the file, and that are often fooled by multiple
> + * operations that occur in the same coarse-grained timer tick.
> + *
> + * We use the least significant bit of the ktime_t to track the QUERIED
> + * flag. This means that filesystems with multigrain timestamps effectively
> + * have 2ns resolution for the ctime, even if they advertise 1ns s_time_gran.
> + */
> +#define I_CTIME_QUERIED		(1LL)
> +
> +static inline bool is_mgtime(const struct inode *inode);
> +
>  struct timespec64 inode_get_ctime(const struct inode *inode);
>  struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);
>  
> @@ -2477,6 +2494,7 @@ struct file_system_type {
>  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
>  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
>  #define FS_ALLOW_IDMAP         32      /* FS has been updated to handle vfs idmappings. */
> +#define FS_MGTIME		64	/* FS uses multigrain timestamps */
>  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
>  	int (*init_fs_context)(struct fs_context *);
>  	const struct fs_parameter_spec *parameters;
> @@ -2500,6 +2518,17 @@ struct file_system_type {
>  
>  #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
>  
> +/**
> + * is_mgtime: is this inode using multigrain timestamps
> + * @inode: inode to test for multigrain timestamps
> + *
> + * Return true if the inode uses multigrain timestamps, false otherwise.
> + */
> +static inline bool is_mgtime(const struct inode *inode)
> +{
> +	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> +}
> +
>  extern struct dentry *mount_bdev(struct file_system_type *fs_type,
>  	int flags, const char *dev_name, void *data,
>  	int (*fill_super)(struct super_block *, void *, int));
> @@ -3234,6 +3263,7 @@ extern void page_put_link(void *);
>  extern int page_symlink(struct inode *inode, const char *symname, int len);
>  extern const struct inode_operations page_symlink_inode_operations;
>  extern void kfree_link(void *);
> +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode);
>  void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
>  void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
>  extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
> diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
> index 35ff875d3800..1f71738aa38c 100644
> --- a/include/trace/events/timestamp.h
> +++ b/include/trace/events/timestamp.h
> @@ -8,6 +8,78 @@
>  #include <linux/tracepoint.h>
>  #include <linux/fs.h>
>  
> +TRACE_EVENT(ctime_floor_update,
> +	TP_PROTO(struct inode *inode,
> +		 ktime_t old,
> +		 ktime_t new,
> +		 ktime_t cur),
> +
> +	TP_ARGS(inode, old, new, cur),
> +
> +	TP_STRUCT__entry(
> +		__field(dev_t,				dev)
> +		__field(ino_t,				ino)
> +		__field(ktime_t,			old)
> +		__field(ktime_t,			new)
> +		__field(ktime_t,			cur)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= inode->i_sb->s_dev;
> +		__entry->ino		= inode->i_ino;
> +		__entry->old		= old;
> +		__entry->new		= new;
> +		__entry->cur		= cur;
> +	),
> +
> +	TP_printk("ino=%d:%d:%lu old=%llu.%lu new=%llu.%lu cur=%llu.%lu swp=%c",
> +		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> +		ktime_to_timespec64(__entry->old).tv_sec,
> +		ktime_to_timespec64(__entry->old).tv_nsec,
> +		ktime_to_timespec64(__entry->new).tv_sec,
> +		ktime_to_timespec64(__entry->new).tv_nsec,
> +		ktime_to_timespec64(__entry->cur).tv_sec,
> +		ktime_to_timespec64(__entry->cur).tv_nsec,
> +		(__entry->old == __entry->cur) ? 'Y' : 'N'
> +	)
> +);
> +
> +TRACE_EVENT(ctime_inode_update,
> +	TP_PROTO(struct inode *inode,
> +		 ktime_t old,
> +		 ktime_t new,
> +		 ktime_t cur),
> +
> +	TP_ARGS(inode, old, new, cur),
> +
> +	TP_STRUCT__entry(
> +		__field(dev_t,				dev)
> +		__field(ino_t,				ino)
> +		__field(ktime_t,			old)
> +		__field(ktime_t,			new)
> +		__field(ktime_t,			cur)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= inode->i_sb->s_dev;
> +		__entry->ino		= inode->i_ino;
> +		__entry->old		= old;
> +		__entry->new		= new;
> +		__entry->cur		= cur;
> +	),
> +
> +	TP_printk("ino=%d:%d:%ld old=%llu.%ld new=%llu.%ld cur=%llu.%ld swp=%c",
> +		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> +		ktime_to_timespec64(__entry->old).tv_sec,
> +		ktime_to_timespec64(__entry->old).tv_nsec,
> +		ktime_to_timespec64(__entry->new).tv_sec,
> +		ktime_to_timespec64(__entry->new).tv_nsec,
> +		ktime_to_timespec64(__entry->cur).tv_sec,
> +		ktime_to_timespec64(__entry->cur).tv_nsec,
> +		(__entry->old == __entry->cur ? 'Y' : 'N')
> +	)
> +);
> +
>  TRACE_EVENT(inode_needs_update_time,
>  	TP_PROTO(struct inode *inode,
>  		 struct timespec64 *now,
> @@ -70,6 +142,31 @@ TRACE_EVENT(inode_set_ctime_to_ts,
>  		__entry->ts_sec, __entry->ts_nsec
>  	)
>  );
> +
> +TRACE_EVENT(fill_mg_cmtime,
> +	TP_PROTO(struct inode *inode,
> +		 ktime_t ctime),
> +
> +	TP_ARGS(inode, ctime),
> +
> +	TP_STRUCT__entry(
> +		__field(dev_t,			dev)
> +		__field(ino_t,			ino)
> +		__field(ktime_t,		ctime)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev		= inode->i_sb->s_dev;
> +		__entry->ino		= inode->i_ino;
> +		__entry->ctime		= ctime;
> +	),
> +
> +	TP_printk("ino=%d:%d:%ld ctime=%llu.%lu",
> +		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> +		ktime_to_timespec64(__entry->ctime).tv_sec,
> +		ktime_to_timespec64(__entry->ctime).tv_nsec
> +	)
> +);
>  #endif /* _TRACE_TIMESTAMP_H */
>  
>  /* This part must be outside protection */
> 
> -- 
> 2.45.2
> 
> 

-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 04/10] fs: add infrastructure for multigrain timestamps
  2024-06-27 15:02   ` Chuck Lever
@ 2024-06-27 15:35     ` Jeff Layton
  0 siblings, 0 replies; 29+ messages in thread
From: Jeff Layton @ 2024-06-27 15:35 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Darrick J. Wong, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Thu, 2024-06-27 at 11:02 -0400, Chuck Lever wrote:
> On Wed, Jun 26, 2024 at 09:00:24PM -0400, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamps when updating the ctime
> > and mtime after a change. This has the benefit of allowing filesystems
> > to optimize away a lot metadata updates, down to around 1 per jiffy,
> > even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. A lot of changes
> > can happen in a jiffy, so timestamps aren't sufficient to help the
> > client decide to invalidate the cache. Even with NFSv4, a lot of
> > exported filesystems don't properly support a change attribute and are
> > subject to the same problems with timestamp granularity. Other
> > applications have similar issues with timestamps (e.g backup
> > applications).
> > 
> > If we were to always use fine-grained timestamps, that would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem would have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried. Now that the ctime is stored as a ktime_t, we
> > can sacrifice the lowest bit in the word to act as a flag marking
> > whether the current timestamp has been queried via stat() or the like.
> > 
> > This solves the problem of being able to distinguish the timestamp
> > between updates, but introduces a new problem: it's now possible for a
> > file being changed to get a fine-grained timestamp and then a file that
> > was altered later to get a coarse-grained one that appears older than
> > the earlier fine-grained time. To remedy this, keep a global ktime_t
> > value that acts as a timestamp floor.
> > 
> > When we go to stamp a file, we first get the latter of the current floor
> > value and the current coarse-grained time (call this "now"). If the
> > current inode ctime hasn't been queried then we just attempt to stamp it
> > with that value using a cmpxchg() operation.
> > 
> > If it has been queried, then first see whether the current coarse time
> > appears later than what we have. If it does, then we accept that value.
> > If it doesn't, then we get a fine-grained time and try to swap that into
> > the global floor. Whether that succeeds or fails, we take the resulting
> > floor time and try to swap that into the ctime.
> > 
> > There is still one remaining problem:
> > 
> > All of this works as long as the realtime clock is monotonically
> > increasing. If the clock ever jumps backwards, then we could end up in a
> > situation where the floor value is "stuck" far in advance of the clock.
> > 
> > To remedy this, sanity check the floor value and if it's more than 6ms
> > (~2 jiffies) ahead of the current coarse-grained clock, disregard the
> > floor value, and just accept the current coarse-grained clock.
> > 
> > Filesystems opt into this by setting the FS_MGTIME fstype flag.  One
> > caveat: those that do will always present ctimes that have the lowest
> > bit unset, even when the on-disk ctime has it set.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/inode.c                       | 168 +++++++++++++++++++++++++++++++++------
> >  fs/stat.c                        |  39 ++++++++-
> >  include/linux/fs.h               |  30 +++++++
> >  include/trace/events/timestamp.h |  97 ++++++++++++++++++++++
> >  4 files changed, 306 insertions(+), 28 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 5d2b0dfe48c3..12790a26102c 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -62,6 +62,8 @@ static unsigned int i_hash_shift __ro_after_init;
> >  static struct hlist_head *inode_hashtable __ro_after_init;
> >  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
> >  
> > +/* Don't send out a ctime lower than this (modulo backward clock jumps). */
> > +static __cacheline_aligned_in_smp ktime_t ctime_floor;
> 
> This is piece of memory that will be hit pretty hard (and you
> obviously recognize that because of the alignment attribute).
> 
> Would it be of any benefit to keep a distinct ctime_floor in each
> super block instead?
> 

Good question. Dave Chinner suggested the same thing, but I think it's
a potential problem:

The first series had to be reverted because inodes that had been
modified in order could appear to be modified in reverse order with the
right combination of fine and coarse grained timestamps. With the new
floor value, that's no longer possible, but if we were to make it per-
sb then it becomes possible again with files in different filesystems.

This sort of timestamp comparison is done by tools like "make", and
it's rather common to keep built objects in one location and generate
or copy source files to another. My worry is that managing the floor as
anything but a global value could cause regressions in those sorts of
workloads.

> 
> >  /*
> >   * Empty aops. Can be used for the cases where the user does not
> >   * define any of the address_space operations.
> > @@ -2077,19 +2079,86 @@ int file_remove_privs(struct file *file)
> >  }
> >  EXPORT_SYMBOL(file_remove_privs);
> >  
> > +/*
> > + * The coarse-grained clock ticks once per jiffy (every 2ms or so). If the
> > + * current floor is >6ms in the future, assume that the clock has jumped
> > + * backward.
> > + */
> > +#define CTIME_FLOOR_MAX_NS	6000000
> > +
> > +/**
> > + * coarse_ctime - return the current coarse-grained time
> > + * @floor: current ctime_floor value
> > + *
> > + * Get the coarse-grained time, and then determine whether to
> > + * return it or the current floor value. Returns the later of the
> > + * floor and coarse grained time, unless the floor value is too
> > + * far into the future. If that happens, assume the clock has jumped
> > + * backward, and that the floor should be ignored.
> > + */
> > +static ktime_t coarse_ctime(ktime_t floor)
> > +{
> > +	ktime_t now = ktime_get_coarse_real() & ~I_CTIME_QUERIED;
> > +
> > +	/* If coarse time is already newer, return that */
> > +	if (ktime_before(floor, now))
> > +		return now;
> > +
> > +	/* Ensure floor is not _too_ far in the future */
> > +	if (ktime_after(floor, now + CTIME_FLOOR_MAX_NS))
> > +		return now;
> > +
> > +	return floor;
> > +}
> > +
> > +/**
> > + * current_time - Return FS time (possibly fine-grained)
> > + * @inode: inode.
> > + *
> > + * Return the current time truncated to the time granularity supported by
> > + * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
> > + * as having been QUERIED, get a fine-grained timestamp.
> > + */
> > +struct timespec64 current_time(struct inode *inode)
> > +{
> > +	ktime_t ctime, floor = smp_load_acquire(&ctime_floor);
> > +	ktime_t now = coarse_ctime(floor);
> > +	struct timespec64 now_ts = ktime_to_timespec64(now);
> > +
> > +	if (!is_mgtime(inode))
> > +		goto out;
> > +
> > +	/* If nothing has queried it, then coarse time is fine */
> > +	ctime = smp_load_acquire(&inode->__i_ctime);
> > +	if (ctime & I_CTIME_QUERIED) {
> > +		/*
> > +		 * If there is no apparent change, then
> > +		 * get a fine-grained timestamp.
> > +		 */
> > +		if ((now | I_CTIME_QUERIED) == ctime) {
> > +			ktime_get_real_ts64(&now_ts);
> > +			now_ts.tv_nsec &= ~I_CTIME_QUERIED;
> > +		}
> > +	}
> > +out:
> > +	return timestamp_truncate(now_ts, inode);
> > +}
> > +EXPORT_SYMBOL(current_time);
> > +
> >  static int inode_needs_update_time(struct inode *inode)
> >  {
> > +	struct timespec64 now, ts;
> >  	int sync_it = 0;
> > -	struct timespec64 now = current_time(inode);
> > -	struct timespec64 ts;
> >  
> >  	/* First try to exhaust all avenues to not sync */
> >  	if (IS_NOCMTIME(inode))
> >  		return 0;
> >  
> > +	now = current_time(inode);
> > +
> >  	ts = inode_get_mtime(inode);
> >  	if (!timespec64_equal(&ts, &now))
> > -		sync_it = S_MTIME;
> > +		sync_it |= S_MTIME;
> >  
> >  	ts = inode_get_ctime(inode);
> >  	if (!timespec64_equal(&ts, &now))
> > @@ -2485,25 +2554,6 @@ struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode)
> >  }
> >  EXPORT_SYMBOL(timestamp_truncate);
> >  
> > -/**
> > - * current_time - Return FS time
> > - * @inode: inode.
> > - *
> > - * Return the current time truncated to the time granularity supported by
> > - * the fs.
> > - *
> > - * Note that inode and inode->sb cannot be NULL.
> > - * Otherwise, the function warns and returns time without truncation.
> > - */
> > -struct timespec64 current_time(struct inode *inode)
> > -{
> > -	struct timespec64 now;
> > -
> > -	ktime_get_coarse_real_ts64(&now);
> > -	return timestamp_truncate(now, inode);
> > -}
> > -EXPORT_SYMBOL(current_time);
> > -
> >  /**
> >   * inode_get_ctime - fetch the current ctime from the inode
> >   * @inode: inode from which to fetch ctime
> > @@ -2518,12 +2568,18 @@ struct timespec64 inode_get_ctime(const struct inode *inode)
> >  {
> >  	ktime_t ctime = inode->__i_ctime;
> >  
> > +	if (is_mgtime(inode))
> > +		ctime &= ~I_CTIME_QUERIED;
> >  	return ktime_to_timespec64(ctime);
> >  }
> >  EXPORT_SYMBOL(inode_get_ctime);
> >  
> >  struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
> >  {
> > +	trace_inode_set_ctime_to_ts(inode, &ts);
> > +
> > +	if (is_mgtime(inode))
> > +		ts.tv_nsec &= ~I_CTIME_QUERIED;
> >  	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
> >  	trace_inode_set_ctime_to_ts(inode, &ts);
> >  	return ts;
> > @@ -2535,14 +2591,74 @@ EXPORT_SYMBOL(inode_set_ctime_to_ts);
> >   * @inode: inode
> >   *
> >   * Set the inode->i_ctime to the current value for the inode. Returns
> > - * the current value that was assigned to i_ctime.
> > + * the current value that was assigned to i_ctime. If this is a not
> > + * multigrain inode, then we just set it to whatever the coarse time is.
> > + *
> > + * If it is multigrain, then we first see if the coarse-grained
> > + * timestamp is distinct from what we have. If so, then we'll just use
> > + * that. If we have to get a fine-grained timestamp, then do so, and
> > + * try to swap it into the floor. We accept the new floor value
> > + * regardless of the outcome of the cmpxchg. After that, we try to
> > + * swap the new value into __i_ctime. Again, we take the resulting
> > + * ctime, regardless of the outcome of the swap.
> >   */
> >  struct timespec64 inode_set_ctime_current(struct inode *inode)
> >  {
> > -	struct timespec64 now = current_time(inode);
> > +	ktime_t ctime, now, cur, floor = smp_load_acquire(&ctime_floor);
> > +
> > +	now = coarse_ctime(floor);
> >  
> > -	inode_set_ctime_to_ts(inode, now);
> > -	return now;
> > +	/* Just return that if this is not a multigrain fs */
> > +	if (!is_mgtime(inode)) {
> > +		inode->__i_ctime = now;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * We only need a fine-grained time if someone has queried it,
> > +	 * and the current coarse grained time isn't later than what's
> > +	 * already there.
> > +	 */
> > +	ctime = smp_load_acquire(&inode->__i_ctime);
> > +	if ((ctime & I_CTIME_QUERIED) && !ktime_after(now, ctime & ~I_CTIME_QUERIED)) {
> > +		ktime_t old;
> > +
> > +		/* Get a fine-grained time */
> > +		now = ktime_get_real() & ~I_CTIME_QUERIED;
> > +
> > +		/*
> > +		 * If the cmpxchg works, we take the new floor value. If
> > +		 * not, then that means that someone else changed it after we
> > +		 * fetched it but before we got here. That value is just
> > +		 * as good, so keep it.
> > +		 */
> > +		old = cmpxchg(&ctime_floor, floor, now);
> > +		trace_ctime_floor_update(inode, floor, now, old);
> > +		if (old != floor)
> > +			now = old;
> > +	}
> > +retry:
> > +	/* Try to swap the ctime into place. */
> > +	cur = cmpxchg(&inode->__i_ctime, ctime, now);
> > +	trace_ctime_inode_update(inode, ctime, now, cur);
> > +
> > +	/* If swap occurred, then we're done */
> > +	if (cur != ctime) {
> > +		/*
> > +		 * Was the change due to someone marking the old ctime QUERIED?
> > +		 * If so then retry the swap. This can only happen once since
> > +		 * the only way to clear I_CTIME_QUERIED is to stamp the inode
> > +		 * with a new ctime.
> > +		 */
> > +		if (!(ctime & I_CTIME_QUERIED) && (ctime | I_CTIME_QUERIED) == cur) {
> > +			ctime = cur;
> > +			goto retry;
> > +		}
> > +		/* Otherwise, take the new ctime */
> > +		now = cur & ~I_CTIME_QUERIED;
> > +	}
> > +out:
> > +	return timestamp_truncate(ktime_to_timespec64(now), inode);
> >  }
> >  EXPORT_SYMBOL(inode_set_ctime_current);
> >  
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 6f65b3456cad..7e9bd16b553b 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -22,10 +22,39 @@
> >  
> >  #include <linux/uaccess.h>
> >  #include <asm/unistd.h>
> > +#include <trace/events/timestamp.h>
> >  
> >  #include "internal.h"
> >  #include "mount.h"
> >  
> > +/**
> > + * fill_mg_cmtime - Fill in the mtime and ctime and flag ctime as QUERIED
> > + * @stat: where to store the resulting values
> > + * @request_mask: STATX_* values requested
> > + * @inode: inode from which to grab the c/mtime
> > + *
> > + * Given @inode, grab the ctime and mtime out if it and store the result
> > + * in @stat. When fetching the value, flag it as queried so the next write
> > + * will ensure a distinct timestamp.
> > + */
> > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
> > +{
> > +	atomic_long_t *pc = (atomic_long_t *)&inode->__i_ctime;
> > +
> > +	/* If neither time was requested, then don't report them */
> > +	if (!(request_mask & (STATX_CTIME|STATX_MTIME))) {
> > +		stat->result_mask &= ~(STATX_CTIME|STATX_MTIME);
> > +		return;
> > +	}
> > +
> > +	stat->mtime.tv_sec = inode->i_mtime_sec;
> > +	stat->mtime.tv_nsec = inode->i_mtime_nsec;
> > +	stat->ctime = ktime_to_timespec64(atomic_long_fetch_or(I_CTIME_QUERIED, pc) &
> > +						~I_CTIME_QUERIED);
> > +	trace_fill_mg_cmtime(inode, atomic_long_read(pc));
> > +}
> > +EXPORT_SYMBOL(fill_mg_cmtime);
> > +
> >  /**
> >   * generic_fillattr - Fill in the basic attributes from the inode struct
> >   * @idmap:		idmap of the mount the inode was found from
> > @@ -58,8 +87,14 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 request_mask,
> >  	stat->rdev = inode->i_rdev;
> >  	stat->size = i_size_read(inode);
> >  	stat->atime = inode_get_atime(inode);
> > -	stat->mtime = inode_get_mtime(inode);
> > -	stat->ctime = inode_get_ctime(inode);
> > +
> > +	if (is_mgtime(inode)) {
> > +		fill_mg_cmtime(stat, request_mask, inode);
> > +	} else {
> > +		stat->ctime = inode_get_ctime(inode);
> > +		stat->mtime = inode_get_mtime(inode);
> > +	}
> > +
> >  	stat->blksize = i_blocksize(inode);
> >  	stat->blocks = inode->i_blocks;
> >  
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 4b10db12725d..5694cb6c4dc2 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1608,6 +1608,23 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
> >  	return inode_set_mtime_to_ts(inode, ts);
> >  }
> >  
> > +/*
> > + * Multigrain timestamps
> > + *
> > + * Conditionally use fine-grained ctime and mtime timestamps when there
> > + * are users actively observing them via getattr. The primary use-case
> > + * for this is NFS clients that use the ctime to distinguish between
> > + * different states of the file, and that are often fooled by multiple
> > + * operations that occur in the same coarse-grained timer tick.
> > + *
> > + * We use the least significant bit of the ktime_t to track the QUERIED
> > + * flag. This means that filesystems with multigrain timestamps effectively
> > + * have 2ns resolution for the ctime, even if they advertise 1ns s_time_gran.
> > + */
> > +#define I_CTIME_QUERIED		(1LL)
> > +
> > +static inline bool is_mgtime(const struct inode *inode);
> > +
> >  struct timespec64 inode_get_ctime(const struct inode *inode);
> >  struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts);
> >  
> > @@ -2477,6 +2494,7 @@ struct file_system_type {
> >  #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
> >  #define FS_DISALLOW_NOTIFY_PERM	16	/* Disable fanotify permission events */
> >  #define FS_ALLOW_IDMAP         32      /* FS has been updated to handle vfs idmappings. */
> > +#define FS_MGTIME		64	/* FS uses multigrain timestamps */
> >  #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
> >  	int (*init_fs_context)(struct fs_context *);
> >  	const struct fs_parameter_spec *parameters;
> > @@ -2500,6 +2518,17 @@ struct file_system_type {
> >  
> >  #define MODULE_ALIAS_FS(NAME) MODULE_ALIAS("fs-" NAME)
> >  
> > +/**
> > + * is_mgtime: is this inode using multigrain timestamps
> > + * @inode: inode to test for multigrain timestamps
> > + *
> > + * Return true if the inode uses multigrain timestamps, false otherwise.
> > + */
> > +static inline bool is_mgtime(const struct inode *inode)
> > +{
> > +	return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> > +}
> > +
> >  extern struct dentry *mount_bdev(struct file_system_type *fs_type,
> >  	int flags, const char *dev_name, void *data,
> >  	int (*fill_super)(struct super_block *, void *, int));
> > @@ -3234,6 +3263,7 @@ extern void page_put_link(void *);
> >  extern int page_symlink(struct inode *inode, const char *symname, int len);
> >  extern const struct inode_operations page_symlink_inode_operations;
> >  extern void kfree_link(void *);
> > +void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode);
> >  void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
> >  void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
> >  extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
> > diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
> > index 35ff875d3800..1f71738aa38c 100644
> > --- a/include/trace/events/timestamp.h
> > +++ b/include/trace/events/timestamp.h
> > @@ -8,6 +8,78 @@
> >  #include <linux/tracepoint.h>
> >  #include <linux/fs.h>
> >  
> > +TRACE_EVENT(ctime_floor_update,
> > +	TP_PROTO(struct inode *inode,
> > +		 ktime_t old,
> > +		 ktime_t new,
> > +		 ktime_t cur),
> > +
> > +	TP_ARGS(inode, old, new, cur),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(dev_t,				dev)
> > +		__field(ino_t,				ino)
> > +		__field(ktime_t,			old)
> > +		__field(ktime_t,			new)
> > +		__field(ktime_t,			cur)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->dev		= inode->i_sb->s_dev;
> > +		__entry->ino		= inode->i_ino;
> > +		__entry->old		= old;
> > +		__entry->new		= new;
> > +		__entry->cur		= cur;
> > +	),
> > +
> > +	TP_printk("ino=%d:%d:%lu old=%llu.%lu new=%llu.%lu cur=%llu.%lu swp=%c",
> > +		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> > +		ktime_to_timespec64(__entry->old).tv_sec,
> > +		ktime_to_timespec64(__entry->old).tv_nsec,
> > +		ktime_to_timespec64(__entry->new).tv_sec,
> > +		ktime_to_timespec64(__entry->new).tv_nsec,
> > +		ktime_to_timespec64(__entry->cur).tv_sec,
> > +		ktime_to_timespec64(__entry->cur).tv_nsec,
> > +		(__entry->old == __entry->cur) ? 'Y' : 'N'
> > +	)
> > +);
> > +
> > +TRACE_EVENT(ctime_inode_update,
> > +	TP_PROTO(struct inode *inode,
> > +		 ktime_t old,
> > +		 ktime_t new,
> > +		 ktime_t cur),
> > +
> > +	TP_ARGS(inode, old, new, cur),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(dev_t,				dev)
> > +		__field(ino_t,				ino)
> > +		__field(ktime_t,			old)
> > +		__field(ktime_t,			new)
> > +		__field(ktime_t,			cur)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->dev		= inode->i_sb->s_dev;
> > +		__entry->ino		= inode->i_ino;
> > +		__entry->old		= old;
> > +		__entry->new		= new;
> > +		__entry->cur		= cur;
> > +	),
> > +
> > +	TP_printk("ino=%d:%d:%ld old=%llu.%ld new=%llu.%ld cur=%llu.%ld swp=%c",
> > +		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> > +		ktime_to_timespec64(__entry->old).tv_sec,
> > +		ktime_to_timespec64(__entry->old).tv_nsec,
> > +		ktime_to_timespec64(__entry->new).tv_sec,
> > +		ktime_to_timespec64(__entry->new).tv_nsec,
> > +		ktime_to_timespec64(__entry->cur).tv_sec,
> > +		ktime_to_timespec64(__entry->cur).tv_nsec,
> > +		(__entry->old == __entry->cur ? 'Y' : 'N')
> > +	)
> > +);
> > +
> >  TRACE_EVENT(inode_needs_update_time,
> >  	TP_PROTO(struct inode *inode,
> >  		 struct timespec64 *now,
> > @@ -70,6 +142,31 @@ TRACE_EVENT(inode_set_ctime_to_ts,
> >  		__entry->ts_sec, __entry->ts_nsec
> >  	)
> >  );
> > +
> > +TRACE_EVENT(fill_mg_cmtime,
> > +	TP_PROTO(struct inode *inode,
> > +		 ktime_t ctime),
> > +
> > +	TP_ARGS(inode, ctime),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(dev_t,			dev)
> > +		__field(ino_t,			ino)
> > +		__field(ktime_t,		ctime)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->dev		= inode->i_sb->s_dev;
> > +		__entry->ino		= inode->i_ino;
> > +		__entry->ctime		= ctime;
> > +	),
> > +
> > +	TP_printk("ino=%d:%d:%ld ctime=%llu.%lu",
> > +		MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino,
> > +		ktime_to_timespec64(__entry->ctime).tv_sec,
> > +		ktime_to_timespec64(__entry->ctime).tv_nsec
> > +	)
> > +);
> >  #endif /* _TRACE_TIMESTAMP_H */
> >  
> >  /* This part must be outside protection */
> > 
> > -- 
> > 2.45.2
> > 
> > 
> 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-06-27  1:00 ` [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t Jeff Layton
@ 2024-07-01 22:49   ` Darrick J. Wong
  2024-07-02  0:22     ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2024-07-01 22:49 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Theodore Ts'o, Andreas Dilger, Chris Mason, Josef Bacik,
	David Sterba, Hugh Dickins, Andrew Morton, kernel-team,
	linux-fsdevel, linux-kernel, linux-trace-kernel, linux-xfs,
	linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Wed, Jun 26, 2024 at 09:00:21PM -0400, Jeff Layton wrote:
> The ctime is not settable to arbitrary values. It always comes from the
> system clock, so we'll never stamp an inode with a value that can't be
> represented there. If we disregard people setting their system clock
> past the year 2262, there is no reason we can't replace the ctime fields
> with a ktime_t.
> 
> Switch the ctime fields to a single ktime_t. Move the i_generation down
> above i_fsnotify_mask and then move the i_version into the resulting 8
> byte hole. This shrinks struct inode by 8 bytes total, and should
> improve the cache footprint as the i_version and ctime are usually
> updated together.
> 
> The one downside I can see to switching to a ktime_t is that if someone
> has a filesystem with files on it that has ctimes outside the ktime_t
> range (before ~1678 AD or after ~2262 AD), we won't be able to display
> them properly in stat() without some special treatment in the
> filesystem. The operating assumption here is that that is not a
> practical problem.

What happens if a filesystem with the ability to store ctimes beyond
whatever ktime_t supports (AFAICT 2^63-1 nanonseconds on either side of
the Unix epoch)?  I think the behavior with your patch is that ktime_set
clamps the ctime on iget because the kernel can't handle it?

It's a little surprising that the ctime will suddenly jump back in time
to 2262, but maybe you're right that nobody will notice or care? ;)

--D

> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  include/linux/fs.h | 26 +++++++++++---------------
>  1 file changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5ff362277834..5139dec085f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -662,11 +662,10 @@ struct inode {
>  	loff_t			i_size;
>  	time64_t		i_atime_sec;
>  	time64_t		i_mtime_sec;
> -	time64_t		i_ctime_sec;
>  	u32			i_atime_nsec;
>  	u32			i_mtime_nsec;
> -	u32			i_ctime_nsec;
> -	u32			i_generation;
> +	ktime_t			__i_ctime;
> +	atomic64_t		i_version;
>  	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
>  	unsigned short          i_bytes;
>  	u8			i_blkbits;
> @@ -701,7 +700,6 @@ struct inode {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
>  	};
> -	atomic64_t		i_version;
>  	atomic64_t		i_sequence; /* see futex */
>  	atomic_t		i_count;
>  	atomic_t		i_dio_count;
> @@ -724,6 +722,8 @@ struct inode {
>  	};
>  
>  
> +	u32			i_generation;
> +
>  #ifdef CONFIG_FSNOTIFY
>  	__u32			i_fsnotify_mask; /* all events this inode cares about */
>  	/* 32-bit hole reserved for expanding i_fsnotify_mask */
> @@ -1608,29 +1608,25 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
>  	return inode_set_mtime_to_ts(inode, ts);
>  }
>  
> -static inline time64_t inode_get_ctime_sec(const struct inode *inode)
> +static inline struct timespec64 inode_get_ctime(const struct inode *inode)
>  {
> -	return inode->i_ctime_sec;
> +	return ktime_to_timespec64(inode->__i_ctime);
>  }
>  
> -static inline long inode_get_ctime_nsec(const struct inode *inode)
> +static inline time64_t inode_get_ctime_sec(const struct inode *inode)
>  {
> -	return inode->i_ctime_nsec;
> +	return inode_get_ctime(inode).tv_sec;
>  }
>  
> -static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> +static inline long inode_get_ctime_nsec(const struct inode *inode)
>  {
> -	struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
> -				 .tv_nsec = inode_get_ctime_nsec(inode) };
> -
> -	return ts;
> +	return inode_get_ctime(inode).tv_nsec;
>  }
>  
>  static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
>  						      struct timespec64 ts)
>  {
> -	inode->i_ctime_sec = ts.tv_sec;
> -	inode->i_ctime_nsec = ts.tv_nsec;
> +	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
>  	return ts;
>  }
>  
> 
> -- 
> 2.45.2
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-01 22:49   ` Darrick J. Wong
@ 2024-07-02  0:22     ` Jeff Layton
  2024-07-02  7:37       ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2024-07-02  0:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Chandan Babu R,
	Theodore Ts'o, Andreas Dilger, Chris Mason, Josef Bacik,
	David Sterba, Hugh Dickins, Andrew Morton, kernel-team,
	linux-fsdevel, linux-kernel, linux-trace-kernel, linux-xfs,
	linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Mon, 2024-07-01 at 15:49 -0700, Darrick J. Wong wrote:
> On Wed, Jun 26, 2024 at 09:00:21PM -0400, Jeff Layton wrote:
> > The ctime is not settable to arbitrary values. It always comes from the
> > system clock, so we'll never stamp an inode with a value that can't be
> > represented there. If we disregard people setting their system clock
> > past the year 2262, there is no reason we can't replace the ctime fields
> > with a ktime_t.
> > 
> > Switch the ctime fields to a single ktime_t. Move the i_generation down
> > above i_fsnotify_mask and then move the i_version into the resulting 8
> > byte hole. This shrinks struct inode by 8 bytes total, and should
> > improve the cache footprint as the i_version and ctime are usually
> > updated together.
> > 
> > The one downside I can see to switching to a ktime_t is that if someone
> > has a filesystem with files on it that has ctimes outside the ktime_t
> > range (before ~1678 AD or after ~2262 AD), we won't be able to display
> > them properly in stat() without some special treatment in the
> > filesystem. The operating assumption here is that that is not a
> > practical problem.
> 
> What happens if a filesystem with the ability to store ctimes beyond
> whatever ktime_t supports (AFAICT 2^63-1 nanonseconds on either side of
> the Unix epoch)?  I think the behavior with your patch is that ktime_set
> clamps the ctime on iget because the kernel can't handle it?
> 
> It's a little surprising that the ctime will suddenly jump back in time
> to 2262, but maybe you're right that nobody will notice or care? ;)
> 
> 

Yeah, it'd be clamped at KTIME_MAX when we pull in the inode from disk,
a'la ktime_set.

I think it's important to note that the ctime is not settable from
userland, so if an on-disk ctime is outside of the ktime_t range, there
are only two possibilities:

1) the system clock was set to some time (far) in the future when the
file's metadata was last altered (bad clock? time traveling fs?).

...or...

2) the filesystem has been altered (fuzzing? deliberate doctoring?).

None of these seem like legitimate use cases so I'm arguing that we
shouldn't worry about them.

(...ok maybe the time travel one could be legit, but someone needs to
step up and make a case for it, if so.)

> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  include/linux/fs.h | 26 +++++++++++---------------
> >  1 file changed, 11 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 5ff362277834..5139dec085f2 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -662,11 +662,10 @@ struct inode {
> >  	loff_t			i_size;
> >  	time64_t		i_atime_sec;
> >  	time64_t		i_mtime_sec;
> > -	time64_t		i_ctime_sec;
> >  	u32			i_atime_nsec;
> >  	u32			i_mtime_nsec;
> > -	u32			i_ctime_nsec;
> > -	u32			i_generation;
> > +	ktime_t			__i_ctime;
> > +	atomic64_t		i_version;
> >  	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
> >  	unsigned short          i_bytes;
> >  	u8			i_blkbits;
> > @@ -701,7 +700,6 @@ struct inode {
> >  		struct hlist_head	i_dentry;
> >  		struct rcu_head		i_rcu;
> >  	};
> > -	atomic64_t		i_version;
> >  	atomic64_t		i_sequence; /* see futex */
> >  	atomic_t		i_count;
> >  	atomic_t		i_dio_count;
> > @@ -724,6 +722,8 @@ struct inode {
> >  	};
> >  
> >  
> > +	u32			i_generation;
> > +
> >  #ifdef CONFIG_FSNOTIFY
> >  	__u32			i_fsnotify_mask; /* all events this inode cares about */
> >  	/* 32-bit hole reserved for expanding i_fsnotify_mask */
> > @@ -1608,29 +1608,25 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
> >  	return inode_set_mtime_to_ts(inode, ts);
> >  }
> >  
> > -static inline time64_t inode_get_ctime_sec(const struct inode *inode)
> > +static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> >  {
> > -	return inode->i_ctime_sec;
> > +	return ktime_to_timespec64(inode->__i_ctime);
> >  }
> >  
> > -static inline long inode_get_ctime_nsec(const struct inode *inode)
> > +static inline time64_t inode_get_ctime_sec(const struct inode *inode)
> >  {
> > -	return inode->i_ctime_nsec;
> > +	return inode_get_ctime(inode).tv_sec;
> >  }
> >  
> > -static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> > +static inline long inode_get_ctime_nsec(const struct inode *inode)
> >  {
> > -	struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
> > -				 .tv_nsec = inode_get_ctime_nsec(inode) };
> > -
> > -	return ts;
> > +	return inode_get_ctime(inode).tv_nsec;
> >  }
> >  
> >  static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
> >  						      struct timespec64 ts)
> >  {
> > -	inode->i_ctime_sec = ts.tv_sec;
> > -	inode->i_ctime_nsec = ts.tv_nsec;
> > +	inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
> >  	return ts;
> >  }
> >  
> > 
> > -- 
> > 2.45.2
> > 
> > 

-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02  0:22     ` Jeff Layton
@ 2024-07-02  7:37       ` Christoph Hellwig
  2024-07-02  9:56         ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-07-02  7:37 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Darrick J. Wong, Alexander Viro, Christian Brauner, Jan Kara,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Chandan Babu R, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Mon, Jul 01, 2024 at 08:22:07PM -0400, Jeff Layton wrote:
> 2) the filesystem has been altered (fuzzing? deliberate doctoring?).
> 
> None of these seem like legitimate use cases so I'm arguing that we
> shouldn't worry about them.

Not worry seems like the wrong answer here.  Either we decide they
are legitimate enough and we preserve them, or we decide they are
bogus and refuse reading the inode.  But we'll need to consciously
deal with the case.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02  7:37       ` Christoph Hellwig
@ 2024-07-02  9:56         ` Jeff Layton
  2024-07-02 10:19           ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2024-07-02  9:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Alexander Viro, Christian Brauner, Jan Kara,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Chandan Babu R, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Tue, 2024-07-02 at 00:37 -0700, Christoph Hellwig wrote:
> On Mon, Jul 01, 2024 at 08:22:07PM -0400, Jeff Layton wrote:
> > 2) the filesystem has been altered (fuzzing? deliberate doctoring?).
> > 
> > None of these seem like legitimate use cases so I'm arguing that we
> > shouldn't worry about them.
> 
> Not worry seems like the wrong answer here.  Either we decide they
> are legitimate enough and we preserve them, or we decide they are
> bogus and refuse reading the inode.  But we'll need to consciously
> deal with the case.
> 

Is there a problem with consciously dealing with it by clamping the
time at KTIME_MAX? If I had a fs with corrupt timestamps, the last
thing I'd want is the filesystem refusing to let me at my data because
of them.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02  9:56         ` Jeff Layton
@ 2024-07-02 10:19           ` Jan Kara
  2024-07-02 11:42             ` Christian Brauner
  2024-07-02 11:44             ` Jeff Layton
  0 siblings, 2 replies; 29+ messages in thread
From: Jan Kara @ 2024-07-02 10:19 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Darrick J. Wong, Alexander Viro,
	Christian Brauner, Jan Kara, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chandan Babu R, Theodore Ts'o,
	Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
	Hugh Dickins, Andrew Morton, kernel-team, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-xfs, linux-ext4,
	linux-btrfs, linux-mm, linux-nfs

On Tue 02-07-24 05:56:37, Jeff Layton wrote:
> On Tue, 2024-07-02 at 00:37 -0700, Christoph Hellwig wrote:
> > On Mon, Jul 01, 2024 at 08:22:07PM -0400, Jeff Layton wrote:
> > > 2) the filesystem has been altered (fuzzing? deliberate doctoring?).
> > > 
> > > None of these seem like legitimate use cases so I'm arguing that we
> > > shouldn't worry about them.
> > 
> > Not worry seems like the wrong answer here.  Either we decide they
> > are legitimate enough and we preserve them, or we decide they are
> > bogus and refuse reading the inode.  But we'll need to consciously
> > deal with the case.
> > 
> 
> Is there a problem with consciously dealing with it by clamping the
> time at KTIME_MAX? If I had a fs with corrupt timestamps, the last
> thing I'd want is the filesystem refusing to let me at my data because
> of them.

Well, you could also view it differently: If I have a fs that corrupts time
stamps, the last thing I'd like is that the kernel silently accepts it
without telling me about it :)

But more seriously, my filesystem development experience shows that if the
kernel silently tries to accept and fixup the breakage, it is nice in the
short term (no complaining users) but it tends to get ugly in the long term
(where tend people come up with nasty cases where it was wrong to fix it
up). So I think Christoph's idea of refusing to load inodes with ctimes out
of range makes sense. Or at least complain about it if nothing else (which
has some precedens in the year 2038 problem).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 10:19           ` Jan Kara
@ 2024-07-02 11:42             ` Christian Brauner
  2024-07-02 11:44             ` Jeff Layton
  1 sibling, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2024-07-02 11:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Christoph Hellwig, Darrick J. Wong, Alexander Viro,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Chandan Babu R, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

> if the
> kernel silently tries to accept and fixup the breakage, it is nice in the
> short term (no complaining users) but it tends to get ugly in the long term
> (where tend people come up with nasty cases where it was wrong to fix it
> up).

Yeah, very much agree. It works for simple APIs sometimes but it's
certainly not something we should just do for filesystems with actual
on-disk format.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 10:19           ` Jan Kara
  2024-07-02 11:42             ` Christian Brauner
@ 2024-07-02 11:44             ` Jeff Layton
  2024-07-02 12:04               ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2024-07-02 11:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Alexander Viro,
	Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chandan Babu R, Theodore Ts'o,
	Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
	Hugh Dickins, Andrew Morton, kernel-team, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-xfs, linux-ext4,
	linux-btrfs, linux-mm, linux-nfs

On Tue, 2024-07-02 at 12:19 +0200, Jan Kara wrote:
> On Tue 02-07-24 05:56:37, Jeff Layton wrote:
> > On Tue, 2024-07-02 at 00:37 -0700, Christoph Hellwig wrote:
> > > On Mon, Jul 01, 2024 at 08:22:07PM -0400, Jeff Layton wrote:
> > > > 2) the filesystem has been altered (fuzzing? deliberate doctoring?).
> > > > 
> > > > None of these seem like legitimate use cases so I'm arguing that we
> > > > shouldn't worry about them.
> > > 
> > > Not worry seems like the wrong answer here.  Either we decide they
> > > are legitimate enough and we preserve them, or we decide they are
> > > bogus and refuse reading the inode.  But we'll need to consciously
> > > deal with the case.
> > > 
> > 
> > Is there a problem with consciously dealing with it by clamping the
> > time at KTIME_MAX? If I had a fs with corrupt timestamps, the last
> > thing I'd want is the filesystem refusing to let me at my data because
> > of them.
> 
> Well, you could also view it differently: If I have a fs that corrupts time
> stamps, the last thing I'd like is that the kernel silently accepts it
> without telling me about it :)
> 

Fair enough.

> But more seriously, my filesystem development experience shows that if the
> kernel silently tries to accept and fixup the breakage, it is nice in the
> short term (no complaining users) but it tends to get ugly in the long term
> (where tend people come up with nasty cases where it was wrong to fix it
> up). So I think Christoph's idea of refusing to load inodes with ctimes out
> of range makes sense. Or at least complain about it if nothing else (which
> has some precedens in the year 2038 problem).

Complaining about it is fairly simple. We could just throw a pr_warn in
inode_set_ctime_to_ts when the time comes back as KTIME_MAX. This might
also be what we need to do for filesystems like NFS, where a future
ctime on the server is not necessarily a problem for the client.

Refusing to load the inode on disk-based filesystems is harder, but is
probably possible. There are ~90 calls to inode_set_ctime_to_ts in the
kernel, so we'd need to vet the places that are loading times from disk
images or the like and fix them to return errors in this situation.

Is warning acceptable, or do we really need to reject inodes that have
corrupt timestamps like this?
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 11:44             ` Jeff Layton
@ 2024-07-02 12:04               ` Christoph Hellwig
  2024-07-02 12:09                 ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-07-02 12:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Jan Kara, Christoph Hellwig, Darrick J. Wong, Alexander Viro,
	Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chandan Babu R, Theodore Ts'o,
	Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
	Hugh Dickins, Andrew Morton, kernel-team, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-xfs, linux-ext4,
	linux-btrfs, linux-mm, linux-nfs

On Tue, Jul 02, 2024 at 07:44:19AM -0400, Jeff Layton wrote:
> Complaining about it is fairly simple. We could just throw a pr_warn in
> inode_set_ctime_to_ts when the time comes back as KTIME_MAX. This might
> also be what we need to do for filesystems like NFS, where a future
> ctime on the server is not necessarily a problem for the client.
> 
> Refusing to load the inode on disk-based filesystems is harder, but is
> probably possible. There are ~90 calls to inode_set_ctime_to_ts in the
> kernel, so we'd need to vet the places that are loading times from disk
> images or the like and fix them to return errors in this situation.
> 
> Is warning acceptable, or do we really need to reject inodes that have
> corrupt timestamps like this?

inode_set_ctime_to_ts should return an error if things are out of range.
How do we currently catch this when it comes from userland?



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 12:04               ` Christoph Hellwig
@ 2024-07-02 12:09                 ` Jeff Layton
  2024-07-02 12:15                   ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2024-07-02 12:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Darrick J. Wong, Alexander Viro, Christian Brauner,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Chandan Babu R, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Tue, 2024-07-02 at 05:04 -0700, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 07:44:19AM -0400, Jeff Layton wrote:
> > Complaining about it is fairly simple. We could just throw a pr_warn in
> > inode_set_ctime_to_ts when the time comes back as KTIME_MAX. This might
> > also be what we need to do for filesystems like NFS, where a future
> > ctime on the server is not necessarily a problem for the client.
> > 
> > Refusing to load the inode on disk-based filesystems is harder, but is
> > probably possible. There are ~90 calls to inode_set_ctime_to_ts in the
> > kernel, so we'd need to vet the places that are loading times from disk
> > images or the like and fix them to return errors in this situation.
> > 
> > Is warning acceptable, or do we really need to reject inodes that have
> > corrupt timestamps like this?
> 
> inode_set_ctime_to_ts should return an error if things are out of range.

Currently it just returns the timespec64 we're setting it to (which
makes it easy to do several assignments), so we'd need to change its
prototype to handle this case, and fix up the callers to recognize the
error.

Alternately it may be easier to just add in a test for when
__i_ctime == KTIME_MAX in the appropriate callers and have them error
out. I'll have a look and see what makes sense.

> How do we currently catch this when it comes from userland?
> 

Not sure I understand this question. ctime values should never come
from userland. They should only ever come from the system clock.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 12:09                 ` Jeff Layton
@ 2024-07-02 12:15                   ` Christoph Hellwig
  2024-07-02 12:21                     ` Jeff Layton
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-07-02 12:15 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Jan Kara, Darrick J. Wong, Alexander Viro,
	Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chandan Babu R, Theodore Ts'o,
	Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
	Hugh Dickins, Andrew Morton, kernel-team, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-xfs, linux-ext4,
	linux-btrfs, linux-mm, linux-nfs

On Tue, Jul 02, 2024 at 08:09:46AM -0400, Jeff Layton wrote:
> > > corrupt timestamps like this?
> > 
> > inode_set_ctime_to_ts should return an error if things are out of range.
> 
> Currently it just returns the timespec64 we're setting it to (which
> makes it easy to do several assignments), so we'd need to change its
> prototype to handle this case, and fix up the callers to recognize the
> error.
> 
> Alternately it may be easier to just add in a test for when
> __i_ctime == KTIME_MAX in the appropriate callers and have them error
> out. I'll have a look and see what makes sense.

The seems like a more awkward interface vs one that explicitly checks.

> 
> > How do we currently catch this when it comes from userland?
> > 
> 
> Not sure I understand this question. ctime values should never come
> from userland. They should only ever come from the system clock.

Ah, yes, utimes only changes mtime.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 12:15                   ` Christoph Hellwig
@ 2024-07-02 12:21                     ` Jeff Layton
  2024-07-02 15:12                       ` Christoph Hellwig
  2024-07-02 16:18                       ` Christian Brauner
  0 siblings, 2 replies; 29+ messages in thread
From: Jeff Layton @ 2024-07-02 12:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Darrick J. Wong, Alexander Viro, Christian Brauner,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Chandan Babu R, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Tue, 2024-07-02 at 05:15 -0700, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 08:09:46AM -0400, Jeff Layton wrote:
> > > > corrupt timestamps like this?
> > > 
> > > inode_set_ctime_to_ts should return an error if things are out of
> > > range.
> > 
> > Currently it just returns the timespec64 we're setting it to (which
> > makes it easy to do several assignments), so we'd need to change
> > its
> > prototype to handle this case, and fix up the callers to recognize
> > the
> > error.
> > 
> > Alternately it may be easier to just add in a test for when
> > __i_ctime == KTIME_MAX in the appropriate callers and have them
> > error
> > out. I'll have a look and see what makes sense.
> 
> The seems like a more awkward interface vs one that explicitly
> checks.
> 

Many of the existing callers of inode_ctime_to_ts are in void return
functions. They're just copying data from an internal representation to
struct inode and assume it always succeeds. For those we'll probably
have to catch bad ctime values earlier.

So, I think I'll probably have to roll bespoke error handling in all of
the relevant filesystems if we go this route. There are also
differences between filesystems -- does it make sense to refuse to load
an inode with a bogus ctime on NFS or AFS? Probably not.

Hell, it may be simpler to just ditch this patch and reimplement
mgtimes using the nanosecond fields like the earlier versions did.

I'll need to study this a bit and figure out what's best.

> > 
> > > How do we currently catch this when it comes from userland?
> > > 
> > 
> > Not sure I understand this question. ctime values should never come
> > from userland. They should only ever come from the system clock.
> 
> Ah, yes, utimes only changes mtime.

Yep. That's the main reason I see the ctime as very different from the
atime or mtime.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 12:21                     ` Jeff Layton
@ 2024-07-02 15:12                       ` Christoph Hellwig
  2024-07-02 15:58                         ` Jeff Layton
  2024-07-02 16:18                       ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-07-02 15:12 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Jan Kara, Darrick J. Wong, Alexander Viro,
	Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chandan Babu R, Theodore Ts'o,
	Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
	Hugh Dickins, Andrew Morton, kernel-team, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-xfs, linux-ext4,
	linux-btrfs, linux-mm, linux-nfs

On Tue, Jul 02, 2024 at 08:21:42AM -0400, Jeff Layton wrote:
> Many of the existing callers of inode_ctime_to_ts are in void return
> functions. They're just copying data from an internal representation to
> struct inode and assume it always succeeds. For those we'll probably
> have to catch bad ctime values earlier.
> 
> So, I think I'll probably have to roll bespoke error handling in all of
> the relevant filesystems if we go this route. There are also
> differences between filesystems -- does it make sense to refuse to load
> an inode with a bogus ctime on NFS or AFS? Probably not.
> 
> Hell, it may be simpler to just ditch this patch and reimplement
> mgtimes using the nanosecond fields like the earlier versions did.

Thatdoes for sure sound simpler.  What is the big advantage of the
ktime_t?  Smaller size?



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 15:12                       ` Christoph Hellwig
@ 2024-07-02 15:58                         ` Jeff Layton
  2024-07-03  5:26                           ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Jeff Layton @ 2024-07-02 15:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Darrick J. Wong, Alexander Viro, Christian Brauner,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Chandan Babu R, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Tue, 2024-07-02 at 08:12 -0700, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 08:21:42AM -0400, Jeff Layton wrote:
> > Many of the existing callers of inode_ctime_to_ts are in void
> > return
> > functions. They're just copying data from an internal
> > representation to
> > struct inode and assume it always succeeds. For those we'll
> > probably
> > have to catch bad ctime values earlier.
> > 
> > So, I think I'll probably have to roll bespoke error handling in
> > all of
> > the relevant filesystems if we go this route. There are also
> > differences between filesystems -- does it make sense to refuse to
> > load
> > an inode with a bogus ctime on NFS or AFS? Probably not.
> > 
> > Hell, it may be simpler to just ditch this patch and reimplement
> > mgtimes using the nanosecond fields like the earlier versions did.
> 
> Thatdoes for sure sound simpler.  What is the big advantage of the
> ktime_t?  Smaller size?
> 

Yeah, mostly. We shrink struct inode by 8 bytes with that patch, and we
(probably) get a better cache footprint, since i_version ends up in the
same cacheline as the ctime. That's really a separate issue though, so
I'm not too worked up about dropping that patch.

As a bonus, leaving it split across separate fields means that we can
use unused bits in the nsec field for the flag, so we don't need to
sacrifice any timestamp granularity either.

I've got a draft rework that does this that I'm testing now. Assuming
it works OK, I'll resend in a few days.

Thanks for the feedback!
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 12:21                     ` Jeff Layton
  2024-07-02 15:12                       ` Christoph Hellwig
@ 2024-07-02 16:18                       ` Christian Brauner
  1 sibling, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2024-07-02 16:18 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Jan Kara, Darrick J. Wong, Alexander Viro,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Chandan Babu R, Theodore Ts'o, Andreas Dilger, Chris Mason,
	Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
	kernel-team, linux-fsdevel, linux-kernel, linux-trace-kernel,
	linux-xfs, linux-ext4, linux-btrfs, linux-mm, linux-nfs

On Tue, Jul 02, 2024 at 08:21:42AM GMT, Jeff Layton wrote:
> On Tue, 2024-07-02 at 05:15 -0700, Christoph Hellwig wrote:
> > On Tue, Jul 02, 2024 at 08:09:46AM -0400, Jeff Layton wrote:
> > > > > corrupt timestamps like this?
> > > > 
> > > > inode_set_ctime_to_ts should return an error if things are out of
> > > > range.
> > > 
> > > Currently it just returns the timespec64 we're setting it to (which
> > > makes it easy to do several assignments), so we'd need to change
> > > its
> > > prototype to handle this case, and fix up the callers to recognize
> > > the
> > > error.
> > > 
> > > Alternately it may be easier to just add in a test for when
> > > __i_ctime == KTIME_MAX in the appropriate callers and have them
> > > error
> > > out. I'll have a look and see what makes sense.
> > 
> > The seems like a more awkward interface vs one that explicitly
> > checks.
> > 
> 
> Many of the existing callers of inode_ctime_to_ts are in void return
> functions. They're just copying data from an internal representation to
> struct inode and assume it always succeeds. For those we'll probably
> have to catch bad ctime values earlier.
> 
> So, I think I'll probably have to roll bespoke error handling in all of
> the relevant filesystems if we go this route. There are also

Shudder, let's try and avoid that.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-02 15:58                         ` Jeff Layton
@ 2024-07-03  5:26                           ` Christoph Hellwig
  2024-07-03  5:27                             ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-07-03  5:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Jan Kara, Darrick J. Wong, Alexander Viro,
	Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chandan Babu R, Theodore Ts'o,
	Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
	Hugh Dickins, Andrew Morton, kernel-team, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-xfs, linux-ext4,
	linux-btrfs, linux-mm, linux-nfs

On Tue, Jul 02, 2024 at 11:58:02AM -0400, Jeff Layton wrote:
> Yeah, mostly. We shrink struct inode by 8 bytes with that patch, and we
> (probably) get a better cache footprint, since i_version ends up in the
> same cacheline as the ctime. That's really a separate issue though, so
> I'm not too worked up about dropping that patch.
> 
> As a bonus, leaving it split across separate fields means that we can
> use unused bits in the nsec field for the flag, so we don't need to
> sacrifice any timestamp granularity either.
> 
> I've got a draft rework that does this that I'm testing now. Assuming
> it works OK, I'll resend in a few days.

So while shrinking the inodes sounds nice, the tradeoff to have to
check all timestamps from disk / the server for validity doesn't
sound as positive.  So I'm glade we're avoiding this at least for.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t
  2024-07-03  5:26                           ` Christoph Hellwig
@ 2024-07-03  5:27                             ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-07-03  5:27 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christoph Hellwig, Jan Kara, Darrick J. Wong, Alexander Viro,
	Christian Brauner, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Chandan Babu R, Theodore Ts'o,
	Andreas Dilger, Chris Mason, Josef Bacik, David Sterba,
	Hugh Dickins, Andrew Morton, kernel-team, linux-fsdevel,
	linux-kernel, linux-trace-kernel, linux-xfs, linux-ext4,
	linux-btrfs, linux-mm, linux-nfs

On Tue, Jul 02, 2024 at 10:26:55PM -0700, Christoph Hellwig wrote:
> So while shrinking the inodes sounds nice, the tradeoff to have to
> check all timestamps from disk / the server for validity doesn't
> sound as positive.  So I'm glade we're avoiding this at least for.

... now.


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-07-03  5:27 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-27  1:00 [PATCH 00/10] fs: multigrain timestamp redux Jeff Layton
2024-06-27  1:00 ` [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t Jeff Layton
2024-07-01 22:49   ` Darrick J. Wong
2024-07-02  0:22     ` Jeff Layton
2024-07-02  7:37       ` Christoph Hellwig
2024-07-02  9:56         ` Jeff Layton
2024-07-02 10:19           ` Jan Kara
2024-07-02 11:42             ` Christian Brauner
2024-07-02 11:44             ` Jeff Layton
2024-07-02 12:04               ` Christoph Hellwig
2024-07-02 12:09                 ` Jeff Layton
2024-07-02 12:15                   ` Christoph Hellwig
2024-07-02 12:21                     ` Jeff Layton
2024-07-02 15:12                       ` Christoph Hellwig
2024-07-02 15:58                         ` Jeff Layton
2024-07-03  5:26                           ` Christoph Hellwig
2024-07-03  5:27                             ` Christoph Hellwig
2024-07-02 16:18                       ` Christian Brauner
2024-06-27  1:00 ` [PATCH 02/10] fs: uninline inode_get_ctime and inode_set_ctime_to_ts Jeff Layton
2024-06-27  1:00 ` [PATCH 03/10] fs: tracepoints for inode_needs_update_time " Jeff Layton
2024-06-27  1:00 ` [PATCH 04/10] fs: add infrastructure for multigrain timestamps Jeff Layton
2024-06-27 15:02   ` Chuck Lever
2024-06-27 15:35     ` Jeff Layton
2024-06-27  1:00 ` [PATCH 05/10] fs: add percpu counters to count fine vs. coarse timestamps Jeff Layton
2024-06-27  1:00 ` [PATCH 06/10] fs: have setattr_copy handle multigrain timestamps appropriately Jeff Layton
2024-06-27  1:00 ` [PATCH 07/10] xfs: switch to multigrain timestamps Jeff Layton
2024-06-27  1:00 ` [PATCH 08/10] ext4: " Jeff Layton
2024-06-27  1:00 ` [PATCH 09/10] btrfs: convert " Jeff Layton
2024-06-27  1:00 ` [PATCH 10/10] tmpfs: add support for " Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).