Date: Mon, 30 Oct 2023 13:29:47 +1100
From: Dave Chinner
To: Wedson Almeida Filho
Cc: Boqun Feng, Matthew Wilcox, Benno Lossin, Alexander Viro,
	Christian Brauner, Kent Overstreet, Greg Kroah-Hartman,
	linux-fsdevel@vger.kernel.org, rust-for-linux@vger.kernel.org,
	Wedson Almeida Filho, Marco Elver
Subject: Re: [RFC PATCH 06/19] rust: fs: introduce `FileSystem::init_root`
References: <20231018122518.128049-1-wedsonaf@gmail.com>
	<20231018122518.128049-7-wedsonaf@gmail.com>

On Mon, Oct 23, 2023 at 09:55:08AM -0300, Wedson Almeida Filho wrote:
> On Mon, 23 Oct 2023 at 02:29, Dave Chinner wrote:
> >
> > On Sat, Oct 21, 2023 at 12:33:57PM -0700, Boqun Feng wrote:
> > > On Sat, Oct 21, 2023 at 06:01:02PM +0100, Matthew Wilcox wrote:
> > > > I'm only an expert on the page cache, not the rest of the VFS. So
> > > > what are the rules around modifying i_state for the VFS?
> > >
> > > Agreed, same question here.
> >
> > inode->i_state should only be modified under inode->i_lock.
> >
> > And in most situations, you have to hold the inode->i_lock to read
> > state flags as well so that reads are serialised against
> > modifications which are typically non-atomic RMW operations.
> >
> > There is, I think, one main exception to read side locking and this
> > is find_inode_rcu() which does an unlocked check for I_WILL_FREE |
> > I_FREEING. In this case, the inode->i_state updates in iput_final()
> > use WRITE_ONCE under the inode->i_lock to provide the necessary
> > semantics for the unlocked READ_ONCE() done under rcu_read_lock().
> >
> > IOWs, if you follow the general rule that any inode->i_state access
> > (read or write) needs to hold inode->i_lock, you probably won't
> > screw up.
>
> I don't see filesystems doing this though. In particular, see
> iget_locked() -- if a new inode is returned, then it is locked, but if
> a cached one is found, it's not locked.

I did say "if you follow the general rule". And where there is a
"general rule" there is the implication that there are special cases
where the "general rule" doesn't get applied, yes? :)

I_NEW is the exception to the general rule, and very few people
writing filesystems actually know about it, let alone care about it...

> So we're in this situation where a returned inode may or may not be
> locked. And the way to determine if it's locked or not is to read
> i_state.
>
> Here are examples of kernfs, ext2, ext4 and squashfs doing it:
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/kernfs/inode.c#L252
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/ext2/inode.c#L1392
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/ext4/inode.c#L4707
> https://elixir.bootlin.com/linux/v6.6-rc7/source/fs/squashfs/inode.c#L82
>
> They all call iget_locked(), and if I_NEW is set, they initialise the
> inode and unlock it with unlock_new_inode(); otherwise they just
> return the unlocked inode.

All of them are perfectly fine.
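
For concreteness, the pattern all four of those filesystems follow
looks roughly like the sketch below. It is illustrative only -
foofs_iget() and foofs_read_inode() are made-up names - but
iget_locked(), unlock_new_inode() and iget_failed() are the real VFS
calls involved:

	struct inode *foofs_iget(struct super_block *sb, unsigned long ino)
	{
		struct inode *inode;
		int err;

		inode = iget_locked(sb, ino);
		if (!inode)
			return ERR_PTR(-ENOMEM);

		/* Cache hit: I_NEW is guaranteed clear, no i_lock needed. */
		if (!(inode->i_state & I_NEW))
			return inode;

		/*
		 * Cache miss: we have exclusive access to the inode until
		 * we publish it, so initialise it (this may do IO) and
		 * then clear I_NEW via unlock_new_inode().
		 */
		err = foofs_read_inode(inode);	/* made-up fill routine */
		if (err) {
			iget_failed(inode);
			return ERR_PTR(err);
		}
		unlock_new_inode(inode);
		return inode;
	}

The unlocked I_NEW check on the cache-hit path is safe for the
reasons that follow.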
I_NEW is the bit we use to synchronise inode initialisation - we have
to ensure there is only a single initialisation running while there
are concurrent lookups that can find the inode whilst it is being
initialised. We cannot hold a spin lock over inode initialisation (it
may have to do IO!), so we set the I_NEW flag under the i_lock and
the inode_hash_lock during hash insertion so that they are set
atomically from the hash lookup POV. If the inode is then found in
cache, wait_on_inode() does the serialisation against the running
initialisation indicated by the __I_NEW bit in the i_state word.

Hence if the caller of iget_locked() ever sees I_NEW, it is
guaranteed to have exclusive access to the inode and -must- first
initialise the inode and then call unlock_new_inode() when it has
completed. It doesn't need to hold inode->i_lock in this case because
there's nothing it needs to serialise against as iget_locked() has
already done all that work.

If the inode is found in cache by iget_locked(), then the
wait_on_inode() call is guaranteed to ensure that I_NEW is not set
when it returns. The atomic bit operations on __I_NEW and the memory
barriers in unlock_new_inode() play an important part in this dance,
and they guarantee that I_NEW has been cleared before iget_locked()
returns. No need for inode->i_lock to be held in this case, either,
because iget_locked() did all the serialisation for us.

This special dance is an optimisation that avoids the need to take
inode->i_lock in the inode lookup fast path just to check I_NEW. It
is an exception to the general rule, but internally it uses
inode->i_lock in the places it is needed to ensure anything using the
general rule about accessing i_state still behaves correctly.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com