From: Jeff Layton <jlayton@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Amir Goldstein <amir73il@gmail.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Kent Overstreet <kent.overstreet@linux.dev>,
Christian Brauner <brauner@kernel.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
John Stultz <jstultz@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Stephen Boyd <sboyd@kernel.org>,
Chandan Babu R <chandan.babu@oracle.com>,
"Darrick J. Wong" <djwong@kernel.org>,
Theodore Ts'o <tytso@mit.edu>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
David Sterba <dsterba@suse.com>, Hugh Dickins <hughd@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Jan Kara <jack@suse.de>, David Howells <dhowells@redhat.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-btrfs@vger.kernel.org, linux-mm@kvack.org,
linux-nfs@vger.kernel.org
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing
Date: Tue, 31 Oct 2023 07:04:53 -0400
Message-ID: <d5965ba7ed012433a9914ba38a6046f2ddb015ac.camel@kernel.org>
In-Reply-To: <ZUAwFkAizH1PrIZp@dread.disaster.area>
On Tue, 2023-10-31 at 09:37 +1100, Dave Chinner wrote:
> On Fri, Oct 27, 2023 at 06:35:58AM -0400, Jeff Layton wrote:
> > On Thu, 2023-10-26 at 13:20 +1100, Dave Chinner wrote:
> > > On Wed, Oct 25, 2023 at 08:25:35AM -0400, Jeff Layton wrote:
> > > > On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > > > In earlier discussions you alluded to some repair and/or analysis tools
> > > > that depended on this counter.
> > >
> > > Yes, and one of those "tools" is *me*.
> > >
> > > I frequently look at the di_changecount when doing forensic and/or
> > > failure analysis on filesystem corpses. SOE analysis, relative
> > > modification activity, etc all give insight into what happened to
> > > the filesystem to get it into the state it is currently in, and
> > > di_changecount provides information no other metadata in the inode
> > > contains.
> > >
> > > > I took a quick look in xfsprogs, but I
> > > > didn't see anything there. Is there a library or something that these
> > > > tools use to get at this value?
> > >
> > > xfs_db is the tool I use for this, such as:
> > >
> > > $ sudo xfs_db -c "sb 0" -c "a rootino" -c "p v3.change_count" /dev/mapper/fast
> > > v3.change_count = 35
> > > $
> > >
> > > The root inode in this filesystem has a change count of 35. The root
> > > inode has 32 dirents in it, which means that no entries have ever
> > > been removed or renamed. This sort of insight into the past history
> > > of inode metadata is largely impossible to get any other way, and
> > > it's been the difference between understanding failure and having no
> > > clue more than once.
> > >
> > > Most block device parsing applications simply write their own
> > > decoder that walks the on-disk format. That's pretty trivial to do,
> > > developers can get all the information needed to do this from the
> > > on-disk format specification documentation we keep on kernel.org...
> > >
> >
> > Fair enough. I'm not here to tell you guys that you need to
> > change how di_changecount works. If it's too valuable to keep it
> > counting atime-only updates, then so be it.
> >
> > If that's the case however, and given that the multigrain timestamp work
> > is effectively dead, then I don't see an alternative to growing the on-
> > disk inode. Do you?
>
> Yes, I do see alternatives. That's what I've been trying
> (unsuccessfully) to describe and get consensus on. I feel like I'm
> being ignored and rail-roaded here, because nobody is even
> acknowledging that I'm proposing alternatives and keeps insisting
> that the only solution is a change of on-disk format.
>
> So, I'll summarise the situation *yet again* in the hope that this
> time I won't get people arguing about atime vs i-version and what
> constitutes an on-disk format change because that goes nowhere and
> does nothing to determine which solution might be acceptable.
>
> The basic situation is this:
>
> If XFS can ignore relatime or lazytime persistent updates for given
> situations, then *we don't need to make periodic on-disk updates of
> atime*. This makes the whole problem of "persistent atime update bumps
> i_version" go away because then we *aren't making persistent atime
> updates* except when some other persistent modification that bumps
> [cm]time occurs.
>
> But I don't want to do this unconditionally - for systems not
> running anything that samples i_version we want relatime/lazytime
> to behave as they are supposed to and do periodic persistent updates
> as per normal. Principle of least surprise and all that jazz.
>
> So we really need an indication for inodes that we should enable this
> mode for the inode. I have asked if we can have a per-operation
> context flag to trigger this given the needs for io_uring to have
> context flags for timestamp updates to be added.
>
> I have asked if we can have an inode flag set by the VFS or
> application code for this. e.g. a flag set by nfsd whenever it accesses a
> given inode.
>
> I have asked if this inode flag can just be triggered if we ever see
> I_VERSION_QUERIED set or statx is used to retrieve a change cookie,
> and whether this is a reliable mechanism for setting such a flag.
>

Ok, so to make sure I understand what you're proposing:

This would be a new inode flag that would be set in conjunction with
I_VERSION_QUERIED (but presumably is never cleared)? When XFS sees this
flag set, it would skip sending the atime to disk.

Given that you want to avoid on-disk changes, I assume this flag will
not be stored on disk. What happens after the NFS server reboots?

Consider:

1/ NFS server queries for the i_version and we set the
I_NO_ATIME_UPDATES_ON_DISK flag (or whatever) in conjunction with
I_VERSION_QUERIED. Some atime updates occur and the i_version isn't
bumped (as you'd expect).

2/ The server then reboots.

3/ Server comes back up, and some local task issues a read against the
inode. I_NO_ATIME_UPDATES_ON_DISK never had a chance to be set after the
reboot, so that atime update ends up incrementing the i_version counter.

4/ client cache invalidation occurs even though there was no write to
the file

This might reduce some of the spurious i_version bumps, but I don't see
how it can eliminate them entirely.
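
To make the sequence above concrete, here is a toy model in Python. It
is not kernel or XFS code: the ToyInode class, its methods, and the
no_atime_on_disk attribute (standing in for the placeholder
I_NO_ATIME_UPDATES_ON_DISK name) are all made up for illustration, and
it assumes every read shown triggers a persistent atime update.

```python
class ToyInode:
    """Toy model of an inode; illustrative only, not real kernel code."""

    def __init__(self):
        self.i_version = 0             # persistent: survives a reboot
        self.version_queried = False   # in-memory only, like I_VERSION_QUERIED
        self.no_atime_on_disk = False  # in-memory only, hypothetical flag

    def query_i_version(self):
        # Querying the change attribute sets both in-memory flags.
        self.version_queried = True
        self.no_atime_on_disk = True
        return self.i_version

    def read(self):
        # Assumption: this read would cause a persistent atime update,
        # which bumps i_version unless the hypothetical flag is set.
        if not self.no_atime_on_disk:
            self.i_version += 1

    def reboot(self):
        # In-memory state is lost; on-disk i_version is retained.
        self.version_queried = False
        self.no_atime_on_disk = False


def run_scenario():
    inode = ToyInode()
    seen_by_client = inode.query_i_version()  # step 1: flag gets set
    inode.read()                              # atime update, no bump
    assert inode.i_version == seen_by_client
    inode.reboot()                            # step 2: flag is lost
    inode.read()                              # step 3: local read bumps
    return seen_by_client, inode.query_i_version()  # step 4: mismatch


if __name__ == "__main__":
    before, after = run_scenario()
    print("client saw", before, "server now reports", after)
```

The two values returned by run_scenario() differ even though the file
was never written, which is the spurious invalidation in step 4.
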
> I have suggested mechanisms for using masked off bits of timestamps
> to encode sub-timestamp granularity change counts and keep them
> invisible to userspace and then not using i_version at all for XFS.
> This avoids all the problems that the multi-grain timestamp
> infrastructure exposed due to variable granularity of user visible
> timestamps and ordering across inodes with different granularity.
> This is potentially a general solution, too.
>

I don't really understand this at all, but trying to do anything with
fine-grained timestamps will just run into a lot of the same problems we
hit with the multigrain work. If you still see this as a path forward,
maybe you can describe it in more detail?

> So, yeah, there are *lots* of ways we can solve this problem without
> needing to change on-disk formats.
>
--
Jeff Layton <jlayton@kernel.org>