From: Jeff Layton <jlayton@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>,
tytso@mit.edu, adilger.kernel@dilger.ca, djwong@kernel.org,
trondmy@hammerspace.com, neilb@suse.de, viro@zeniv.linux.org.uk,
zohar@linux.ibm.com, xiubli@redhat.com, chuck.lever@oracle.com,
lczerner@redhat.com, bfields@fieldses.org, brauner@kernel.org,
fweimer@redhat.com, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
ceph-devel@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-nfs@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland
Date: Tue, 18 Oct 2022 13:04:34 -0400 [thread overview]
Message-ID: <fcd35c14353ae859d778a23f80249047819bc4b0.camel@kernel.org> (raw)
In-Reply-To: <20221018151721.cl6dbupqjkkivxyf@quack3>
On Tue, 2022-10-18 at 17:17 +0200, Jan Kara wrote:
> On Tue 18-10-22 10:21:08, Jeff Layton wrote:
> > On Tue, 2022-10-18 at 15:49 +0200, Jan Kara wrote:
> > > On Tue 18-10-22 06:35:14, Jeff Layton wrote:
> > > > On Tue, 2022-10-18 at 09:14 +1100, Dave Chinner wrote:
> > > > > On Mon, Oct 17, 2022 at 06:57:09AM -0400, Jeff Layton wrote:
> > > > > > Trond is of the opinion that monotonicity is a hard requirement, and
> > > > > > that we should not allow filesystems that can't provide that quality to
> > > > > > report STATX_VERSION at all. His rationale is that one of the main uses
> > > > > > for this is for backup applications, and for those a counter that could
> > > > > > go backward is worse than useless.
> > > > >
> > > > > From the perspective of a backup program doing incremental backups,
> > > > > an inode with a change counter that has a different value to the
> > > > > current backup inventory means the file contains different
> > > > > information than what the current backup inventory holds. Again,
> > > > > snapshots, rollbacks, etc.
> > > > >
> > > > > Therefore, regardless of whether the change counter has gone
> > > > > forwards or backwards, the backup program needs to back up this
> > > > > current version of the file in this backup because it is different
> > > > > to the inventory copy. Hence if the backup program fails to back it
> > > > > up, it will not be creating an exact backup of the user's data at
> > > > > the point in time the backup is run...
> > > > >
> > > > > Hence I don't see that MONOTONIC is a requirement for backup
> > > > > programs - they really do have to be able to handle filesystems that
> > > > > have modifications that move backwards in time as well as forwards...
> > > >
> > > > Rolling backward is not a problem in and of itself. The big issue is
> > > > that after a crash, we can end up with a change attr seen before the
> > > > crash that is now associated with a completely different inode state.
> > > >
> > > > The scenario is something like:
> > > >
> > > > - Change attr for an empty file starts at 1
> > > >
> > > > - Write "A" to file, change attr goes to 2
> > > >
> > > > - Read and statx happens (client sees "A" with change attr 2)
> > > >
> > > > - Crash (before last change is logged to disk)
> > > >
> > > > - Machine reboots, inode is empty, change attr back to 1
> > > >
> > > > - Write "B" to file, change attr goes to 2
> > > >
> > > > - Client stat's file, sees change attr 2 and assumes its cache is
> > > > correct when it isn't (should be "B" not "A" now).
> > > >
> > > > The real danger comes not from the thing going backward, but the fact
> > > > that it can march forward again after going backward, and then the
> > > > client can see two different inode states associated with the same
> > > > change attr value. Jumping all the change attributes forward by a
> > > > significant amount after a crash should avoid this issue.
> > >
> > > As Dave pointed out, the problem with change attr having the same value for
> > > a different inode state (after going backwards) holds not only for the
> > > crashes but also for restore from backups, fs snapshots, device snapshots
> > > etc. So relying on change attr only looks a bit fragile. It works for the
> > > common case but the edge cases are awkward and there's no easy way to
> > > detect you are in the edge case.
> > >
> >
> > This is true. In fact in the snapshot case you can't even rely on doing
> > anything at reboot since you won't necessarily need to reboot to make it
> > roll backward.
> >
> > Whether that obviates the use of this value altogether, I'm not sure.
> >
> > > So I think any implementation caring about data integrity would have to
> > > include something like ctime into the picture anyway. Or we could just
> > > completely give up any idea of monotonicity and on each mount select random
> > > prime P < 2^64 and instead of doing inc when advancing the change
> > > attribute, we'd advance it by P. That makes collisions after restore /
> > > crash fairly unlikely.
> >
> > Part of the goal (at least for NFS) is to avoid unnecessary cache
> > invalidations.
> >
> > If we just increment it by a particular offset on every reboot, then
> > every time the server reboots, the clients will invalidate all of their
> > cached inodes, and proceed to hammer the server with READ calls just as
> > it's having to populate its own caches from disk.
>
> Note that I didn't propose to increment by offset on every reboot or mount.
> I have proposed that inode_maybe_inc_iversion() would not increment
> i_version by 1 (in fact 1 << I_VERSION_QUERIED_SHIFT) but rather by P (or P
> << I_VERSION_QUERIED_SHIFT) where P is a suitable number randomly selected
> on filesystem mount.
>
> This will not cause cache invalidation after a clean unmount + remount. It
> will cause cache invalidation after a crash, snapshot rollback etc., only for
> inodes where i_version changed. If P is suitably selected (e.g. as being a
> prime), then the chances of collisions (even after a snapshot rollback) are
> very low (on the order of 2^(-50) if my piece of envelope calculations are
> right).
>
> So this should nicely deal with all the problems we've spotted so far. But
> I may be missing something...
Got it! That makes a lot more sense. Thinking about this some more...
What sort of range for P would be suitable?
Every increment would need to be by (shifted) P, so we can't choose too
large a number. Queries are pretty rare vs. writes though, so that
mitigates the issue somewhat.
There are 31 primes between 1 and 127. Worst case, we'd still have 2^48
increments before the counter wraps.
Let me think about this some more, but maybe that's good enough to
ensure uniqueness.
--
Jeff Layton <jlayton@kernel.org>
next prev parent reply other threads:[~2022-10-18 17:04 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-17 10:57 [PATCH v7 0/9] fs: clean up handling of i_version counter Jeff Layton
2022-10-17 10:57 ` [PATCH v7 1/9] fs: uninline inode_query_iversion Jeff Layton
2022-10-17 10:57 ` [PATCH v7 2/9] fs: clarify when the i_version counter must be updated Jeff Layton
2022-10-17 10:57 ` [PATCH v7 3/9] vfs: plumb i_version handling into struct kstat Jeff Layton
2022-10-17 10:57 ` [PATCH v7 4/9] nfs: report the inode version in getattr if requested Jeff Layton
2022-10-17 10:57 ` [PATCH v7 5/9] ceph: " Jeff Layton
2022-10-17 10:57 ` [PATCH v7 6/9] nfsd: move nfsd4_change_attribute to nfsfh.c Jeff Layton
2022-11-03 15:42 ` Chuck Lever III
2022-10-17 10:57 ` [PATCH v7 7/9] nfsd: use the getattr operation to fetch i_version Jeff Layton
2022-11-03 15:42 ` Chuck Lever III
2022-10-17 10:57 ` [PATCH v7 8/9] nfsd: remove fetch_iversion export operation Jeff Layton
2022-11-03 15:43 ` Chuck Lever III
2022-10-17 10:57 ` [RFC PATCH v7 9/9] vfs: expose STATX_VERSION to userland Jeff Layton
2022-10-17 22:14 ` Dave Chinner
2022-10-18 10:35 ` Jeff Layton
2022-10-18 13:49 ` Jan Kara
2022-10-18 14:21 ` Jeff Layton
2022-10-18 15:17 ` Jan Kara
2022-10-18 17:04 ` Jeff Layton [this message]
2022-10-19 17:23 ` Jan Kara
2022-10-19 18:47 ` Jeff Layton
2022-10-20 10:39 ` Jan Kara
2022-10-21 10:08 ` Jeff Layton
2022-10-18 14:56 ` Jeff Layton
2022-10-19 11:13 ` [PATCH v7 0/9] fs: clean up handling of i_version counter Christian Brauner
2022-10-19 12:18 ` Jeff Layton
2022-10-19 15:45 ` Darrick J. Wong
2022-10-19 20:36 ` Jeff Layton
2022-10-20 6:58 ` Christian Brauner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fcd35c14353ae859d778a23f80249047819bc4b0.camel@kernel.org \
--to=jlayton@kernel.org \
--cc=adilger.kernel@dilger.ca \
--cc=bfields@fieldses.org \
--cc=brauner@kernel.org \
--cc=ceph-devel@vger.kernel.org \
--cc=chuck.lever@oracle.com \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=fweimer@redhat.com \
--cc=jack@suse.cz \
--cc=lczerner@redhat.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=neilb@suse.de \
--cc=trondmy@hammerspace.com \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
--cc=xiubli@redhat.com \
--cc=zohar@linux.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).