Subject: Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing
From: Jeff Layton
To: Dave Chinner
Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Christian Brauner,
    Alexander Viro, John Stultz, Thomas Gleixner, Stephen Boyd,
    Chandan Babu R, "Darrick J. Wong", Theodore Ts'o, Andreas Dilger,
    Chris Mason, Josef Bacik, David Sterba, Hugh Dickins, Andrew Morton,
    Jan Kara, David Howells, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
    linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org,
    linux-mm@kvack.org, linux-nfs@vger.kernel.org
Date: Tue, 31 Oct 2023 07:04:53 -0400

On Tue, 2023-10-31 at 09:37 +1100, Dave Chinner wrote:
> On Fri, Oct 27, 2023 at 06:35:58AM -0400, Jeff Layton wrote:
> > On Thu, 2023-10-26 at 13:20 +1100, Dave Chinner wrote:
> > > On Wed, Oct 25, 2023 at 08:25:35AM -0400, Jeff Layton wrote:
> > > > On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > > > In earlier discussions you alluded to some repair and/or analysis tools
> > > > that depended on this counter.
> > >
> > > Yes, and one of those "tools" is *me*.
> > >
> > > I frequently look at the di_changecount when doing forensic and/or
> > > failure analysis on filesystem corpses. SOE analysis, relative
> > > modification activity, etc all give insight into what happened to
> > > the filesystem to get it into the state it is currently in, and
> > > di_changecount provides information no other metadata in the inode
> > > contains.
> > >
> > > > I took a quick look in xfsprogs, but I
> > > > didn't see anything there. Is there a library or something that these
> > > > tools use to get at this value?
> > >
> > > xfs_db is the tool I use for this, such as:
> > >
> > > $ sudo xfs_db -c "sb 0" -c "a rootino" -c "p v3.change_count" /dev/mapper/fast
> > > v3.change_count = 35
> > > $
> > >
> > > The root inode in this filesystem has a change count of 35. The root
> > > inode has 32 dirents in it, which means that no entries have ever
> > > been removed or renamed.
> > > This sort of insight into the past history
> > > of inode metadata is largely impossible to get any other way, and
> > > it's been the difference between understanding failure and having no
> > > clue more than once.
> > >
> > > Most block device parsing applications simply write their own
> > > decoder that walks the on-disk format. That's pretty trivial to do,
> > > developers can get all the information needed to do this from the
> > > on-disk format specification documentation we keep on kernel.org...
> > >
> >
> > Fair enough. I'm not here to tell you guys that you need to
> > change how di_changecount works. If it's too valuable to keep it
> > counting atime-only updates, then so be it.
> >
> > If that's the case however, and given that the multigrain timestamp work
> > is effectively dead, then I don't see an alternative to growing the
> > on-disk inode. Do you?
>
> Yes, I do see alternatives. That's what I've been trying
> (unsuccessfully) to describe and get consensus on. I feel like I'm
> being ignored and rail-roaded here, because nobody is even
> acknowledging that I'm proposing alternatives and keeps insisting
> that the only solution is a change of on-disk format.
>
> So, I'll summarise the situation *yet again* in the hope that this
> time I won't get people arguing about atime vs i-version and what
> constitutes an on-disk format change because that goes nowhere and
> does nothing to determine which solution might be acceptable.
>
> The basic situation is this:
>
> If XFS can ignore relatime or lazytime persistent updates for given
> situations, then *we don't need to make periodic on-disk updates of
> atime*. This makes the whole problem of "persistent atime update bumps
> i_version" go away because then we *aren't making persistent atime
> updates* except when some other persistent modification that bumps
> [cm]time occurs.
>
> But I don't want to do this unconditionally - for systems not
> running anything that samples i_version we want relatime/lazytime
> to behave as they are supposed to and do periodic persistent updates
> as per normal. Principle of least surprise and all that jazz.
>
> So we really need an indication for inodes that we should enable this
> mode for the inode. I have asked if we can have per-operation
> context flag to trigger this given the needs for io_uring to have
> context flags for timestamp updates to be added.
>
> I have asked if we can have an inode flag set by the VFS or
> application code for this. e.g. a flag set by nfsd whenever it accesses a
> given inode.
>
> I have asked if this inode flag can just be triggered if we ever see
> I_VERSION_QUERIED set or statx is used to retrieve a change cookie,
> and whether this is a reliable mechanism for setting such a flag.
>

Ok, so to make sure I understand what you're proposing:

This would be a new inode flag that would be set in conjunction with
I_VERSION_QUERIED (but presumably is never cleared)? When XFS sees this
flag set, it would skip sending the atime to disk.

Given that you want to avoid on-disk changes, I assume this flag will
not be stored on disk. What happens after the NFS server reboots?

Consider:

1/ NFS server queries for the i_version and we set the
I_NO_ATIME_UPDATES_ON_DISK flag (or whatever) in conjunction with
I_VERSION_QUERIED. Some atime updates occur and the i_version isn't
bumped (as you'd expect).

2/ The server then reboots.

3/ Server comes back up, and some local task issues a read against the
inode. I_NO_ATIME_UPDATES_ON_DISK never had a chance to be set after the
reboot, so that atime update ends up incrementing the i_version counter.

4/ client cache invalidation occurs even though there was no write to
the file

This might reduce some of the spurious i_version bumps, but I don't see
how it can eliminate them entirely.

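To make the sequence in steps 1/ to 4/ concrete, here is a rough toy model
of it. This is only a sketch under stated assumptions, not kernel code:
the class names, the flag values, and the simplification that every read
without the flag persists an atime update (ignoring relatime/lazytime
periodicity) are all made up for illustration.
I_NO_ATIME_UPDATES_ON_DISK is the placeholder name used above, not an
existing flag.

```python
from dataclasses import dataclass

# Hypothetical flag values for illustration only; not real kernel definitions.
I_VERSION_QUERIED = 0x1
I_NO_ATIME_UPDATES_ON_DISK = 0x2  # placeholder name from the discussion


@dataclass
class DiskInode:
    """State that survives a reboot."""
    change_count: int = 0


@dataclass
class MemInode:
    """In-core inode; its flags are lost when the server reboots."""
    disk: DiskInode
    flags: int = 0

    def query_i_version(self) -> int:
        # Sampling the change cookie sets both in-memory flags.
        self.flags |= I_VERSION_QUERIED | I_NO_ATIME_UPDATES_ON_DISK
        return self.disk.change_count

    def read(self) -> None:
        # Simplifying assumption: without the flag, every read persists
        # an atime update, and that update bumps the on-disk counter.
        if not (self.flags & I_NO_ATIME_UPDATES_ON_DISK):
            self.disk.change_count += 1


def reboot(inode: MemInode) -> MemInode:
    # Only the on-disk part survives; the flags start out clear.
    return MemInode(disk=inode.disk)


def simulate() -> tuple[int, int, int]:
    inode = MemInode(DiskInode())
    before = inode.query_i_version()        # 1/ nfsd samples i_version
    inode.read()                            #    atime update is skipped
    after_read = inode.disk.change_count
    inode = reboot(inode)                   # 2/ server reboots
    inode.read()                            # 3/ local read, flag not set yet
    after_reboot = inode.query_i_version()  # 4/ client sees a changed value
    return before, after_read, after_reboot
```

In this model the value the client sees after the reboot differs from the
one it sampled before, even though nothing ever wrote to the file.
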
> I have suggested mechanisms for using masked off bits of timestamps
> to encode sub-timestamp granularity change counts and keep them
> invisible to userspace and then not using i_version at all for XFS.
> This avoids all the problems that the multi-grain timestamp
> infrastructure exposed due to variable granularity of user visible
> timestamps and ordering across inodes with different granularity.
> This is potentially a general solution, too.
>

I don't really understand this at all, but trying to do anything with
fine-grained timestamps will just run into a lot of the same problems we
hit with the multigrain work. If you still see this as a path forward,
maybe you can describe it in more detail?

> So, yeah, there are *lots* of ways we can solve this problem without
> needing to change on-disk formats.
>

-- 
Jeff Layton