Linux NFS development
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@hammerspace.com>
To: "torvalds@linux-foundation.org" <torvalds@linux-foundation.org>,
	"jack@suse.cz" <jack@suse.cz>
Cc: "clm@fb.com" <clm@fb.com>,
	"josef@toxicpanda.com" <josef@toxicpanda.com>,
	"jstultz@google.com" <jstultz@google.com>,
	"djwong@kernel.org" <djwong@kernel.org>,
	"brauner@kernel.org" <brauner@kernel.org>,
	"chandan.babu@oracle.com" <chandan.babu@oracle.com>,
	"hughd@google.com" <hughd@google.com>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"dsterba@suse.com" <dsterba@suse.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"jlayton@kernel.org" <jlayton@kernel.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"tytso@mit.edu" <tytso@mit.edu>,
	"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"amir73il@gmail.com" <amir73il@gmail.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"adilger.kernel@dilger.ca" <adilger.kernel@dilger.ca>,
	"kent.overstreet@linux.dev" <kent.overstreet@linux.dev>,
	"sboyd@kernel.org" <sboyd@kernel.org>,
	"dhowells@redhat.com" <dhowells@redhat.com>,
	"jack@suse.de" <jack@suse.de>
Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing
Date: Wed, 1 Nov 2023 21:34:57 +0000	[thread overview]
Message-ID: <3ae88800184f03b152aba6e4a95ebf26e854dd63.camel@hammerspace.com> (raw)
In-Reply-To: <CAHk-=wj6wy6tNUQm6EtgxfE_J229y1DthpCguqQfTej71yiJXw@mail.gmail.com>

On Wed, 2023-11-01 at 10:10 -1000, Linus Torvalds wrote:
> On Wed, 1 Nov 2023 at 00:16, Jan Kara <jack@suse.cz> wrote:
> > 
> > OK, but is this compatible with the current XFS behavior? AFAICS
> > currently
> > XFS sets sb->s_time_gran to 1 so timestamps currently stored on
> > disk will
> > have some mostly random garbage in low bits of the ctime.
> 
> I really *really* don't think we can use ctime as a "i_version"
> replacement. The whole fine-granularity patches were well-
> intentioned,
> but I do think they were broken.
> 
> Note that we can't use ctime as a "i_version" replacement for other
> reasons too - you have filesystems like FAT - which people do want to
> export - that have a single-second (or is it 2s?) granularity in
> reality, even though they report a 1ns value in s_time_gran.
> 
> But here's a suggestion that people may hate, but that might just
> work
> in practice:
> 
>  - get rid of i_version entirely
> 
>  - use the "known good" part of ctime as the upper bits of the change
> counter (and by "known good" I mean tv_sec - or possibly even "tv_sec
> / 2" if that dim FAT memory of mine is right)
> 
>  - make the rule be that ctime is *never* updated for atime updates
> (maybe that's already true, I didn't check - maybe it needs a new
> mount flag for nfsd)
> 
>  - have a per-inode in-memory and vfs-internal (entirely invisible to
> filesystems) "ctime modification counter" that is *NOT* a timestamp,
> and is *NOT* i_version
> 
>  - make the rule be that the "ctime modification counter" is always
> zero, *EXCEPT* if
>     (a) I_VERSION_QUERIED is set
>    AND
>     (b) the ctime modification doesn't modify the "known good" part
> of ctime
> 
> so how the "statx change cookie" ends up being "high bits tv_sec of
> ctime, low bits ctime modification cookie", and the end result of
> that
> is:
> 
>  - if all the reads happen after the last write (common case), then
> the low bits will be zero, because I_VERSION_QUERIED wasn't set when
> ctime was modified
> 
>  - if you do a write *after* a modification, the ctime cookie is
> guaranteed to change, because either the known good (sec/2sec) part
> of
> ctime is new, *or* the counter gets updated
> 
>  - if the nfs server reboots, the in-memory counter will be cleared
> again, and so the change cookie will cause client cache
> invalidations,
> but *only* for those "ctime changed in the same second _after_
> somebody did a read".
> 
>  - any long-time caches of files that don't get modified are all
> fine,
> because they will have those low bits zero and depend on just the
> stable part of ctime that works across filesystems. So there should
> be
> no nasty thundering herd issues on long-lived caches on lots of
> clients if the server reboots, or atime updates every 24 hours or
> anything like that.
> 
> and note that *NONE* of this requires any filesystem involvement
> (except for the rule of "no atime changes ever impact ctime", which
> may or may not already be true).
> 
> The filesystem does *not* know about that modification counter,
> there's no new on-disk stable information.
> 
> It's entirely possible that I'm missing something obvious, but the
> above sounds to me like the only time you'd have stale invalidations
> is really the (unusual) case of having writes after cached reads, and
> then a reboot.
> 
> We'd get rid of "inode_maybe_inc_iversion()" entirely, and instead
> replace it with logic in inode_set_ctime_current() that basically
> does
> 
>  - if the stable part of ctime changes, clear the new 32-bit counter
> 
>  - if I_VERSION_QUERIED isn't set, clear the new 32-bit counter
> 
>  - otherwise, increment the new 32-bit counter
> 
> and then the STATX_CHANGE_COOKIE code basically just returns
> 
>    (stable part of ctime << 32) + new 32-bit counter
> 
> (and again, the "stable part of ctime" is either just tv_sec, or it's
> "tv_sec >> 1" or whatever).
> 
> The above does not expose *any* changes to timestamps to users, and
> should work across a wide variety of filesystems, without requiring
> any special code from the filesystem itself.
> 
> And now please all jump on me and say "No, Linus, that won't work,
> because XYZ".
> 
> Because it is *entirely* possible that I missed something truly
> fundamental, and the above is completely broken for some obvious
> reason that I just didn't think of.
> 

My client writes to the file and immediately reads the ctime. A 3rd
party client then writes immediately after my ctime read.
A reboot occurs (maybe minutes later), then I re-read the ctime, and
get the same value as before the 3rd party write.

Yes, most of the time that is better than the naked ctime, but not
across a reboot.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



  reply	other threads:[~2023-11-01 21:35 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-18 17:41 [PATCH RFC 0/9] fs: multigrain timestamps (redux) Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 1/9] fs: switch timespec64 fields in inode to discrete integers Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing Jeff Layton
2023-10-18 19:18   ` Linus Torvalds
2023-10-18 20:47     ` Jeff Layton
2023-10-18 21:31       ` Linus Torvalds
2023-10-18 21:52         ` Jeff Layton
2023-10-19  9:29           ` Christian Brauner
2023-10-19 11:28             ` Jeff Layton
2023-10-19 22:02               ` Dave Chinner
2023-10-20 12:12                 ` Jeff Layton
2023-10-20 20:06                   ` Linus Torvalds
2023-10-20 20:20                     ` Linus Torvalds
2023-10-20 21:05                     ` Jeff Layton
2023-10-22 22:17                   ` Dave Chinner
2023-10-23 14:45                     ` Jeff Layton
2023-10-23 23:26                       ` Dave Chinner
2023-10-24  0:18                         ` Linus Torvalds
2023-10-24  3:40                           ` Dave Chinner
2023-10-24  4:10                             ` Linus Torvalds
2023-10-24  7:08                             ` Amir Goldstein
2023-10-24 18:40                               ` Jeff Layton
2023-10-25  8:05                                 ` Dave Chinner
2023-10-25 10:41                                   ` Amir Goldstein
2023-10-25 12:25                                   ` Jeff Layton
2023-10-26  2:20                                     ` Dave Chinner
2023-10-26  5:42                                       ` Amir Goldstein
2023-10-27 10:35                                       ` Jeff Layton
2023-10-30 22:37                                         ` Dave Chinner
2023-10-30 23:11                                           ` Linus Torvalds
2023-10-31  1:42                                             ` Dave Chinner
2023-10-31  7:03                                               ` Amir Goldstein
2023-10-31 10:30                                                 ` Christian Brauner
2023-10-31 11:29                                                 ` Jeff Layton
2023-10-31 21:57                                                   ` Dave Chinner
2023-10-31 23:02                                                     ` Darrick J. Wong
2023-10-31 23:47                                                       ` Dave Chinner
2023-11-01 10:16                                                     ` Jan Kara
2023-11-01 11:38                                                       ` Amir Goldstein
2023-11-02 10:17                                                         ` Jeff Layton
2023-11-01 20:10                                                       ` Linus Torvalds
2023-11-01 21:34                                                         ` Trond Myklebust [this message]
2023-11-01 22:23                                                           ` Linus Torvalds
2023-11-01 22:45                                                             ` Trond Myklebust
2023-11-01 23:29                                                           ` Dave Chinner
2023-11-02 10:29                                                             ` Jeff Layton
2023-11-02 10:15                                                         ` Jeff Layton
2023-10-31 23:12                                                 ` Darrick J. Wong
2023-11-01  8:08                                                   ` Amir Goldstein
2023-10-31 11:26                                               ` Jeff Layton
2023-10-31 19:43                                                 ` John Stoffel
2023-10-31 11:04                                           ` Jeff Layton
2023-10-31 12:22                                             ` Jan Kara
2023-10-31 12:55                                               ` Jeff Layton
2023-10-30 23:34                                         ` ronnie sahlberg
2023-10-24 14:24                             ` Jeff Layton
2023-10-24 19:06                           ` Jeff Layton
2023-10-24 19:40                             ` Linus Torvalds
2023-10-24 20:19                               ` Jeff Layton
2023-10-31 10:26               ` Christian Brauner
2023-10-31 13:55                 ` Jeff Layton
2023-10-19 22:00   ` Thomas Gleixner
2023-10-19 22:41     ` Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 3/9] timekeeping: add new debugfs file to count multigrain timestamps Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 4/9] fs: add infrastructure for " Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 5/9] fs: have setattr_copy handle multigrain timestamps appropriately Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 6/9] xfs: switch to multigrain timestamps Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 7/9] ext4: " Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 8/9] btrfs: convert " Jeff Layton
2023-10-18 17:41 ` [PATCH RFC 9/9] tmpfs: add support for " Jeff Layton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3ae88800184f03b152aba6e4a95ebf26e854dd63.camel@hammerspace.com \
    --to=trondmy@hammerspace.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=akpm@linux-foundation.org \
    --cc=amir73il@gmail.com \
    --cc=brauner@kernel.org \
    --cc=chandan.babu@oracle.com \
    --cc=clm@fb.com \
    --cc=david@fromorbit.com \
    --cc=dhowells@redhat.com \
    --cc=djwong@kernel.org \
    --cc=dsterba@suse.com \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jack@suse.de \
    --cc=jlayton@kernel.org \
    --cc=josef@toxicpanda.com \
    --cc=jstultz@google.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=sboyd@kernel.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox