Date: Fri, 29 Dec 2023 06:48:23 +0900
From: asmadeus@codewreck.org
To: Eric Van Hensbergen
Cc: Christian Schoenebeck, v9fs@lists.linux.dev, ericvh@gmail.com
Subject: Re: cache fixes (redux)
References: <3936280.6WX1Z9CqS8@silver>

Eric Van Hensbergen wrote on Thu, Dec 28, 2023 at 09:07:46AM -0600:
> > > qemu mixes in some bits from st_dev to avoid this:
> > > https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881
> >
> > Inode numbers are always just unique within the same file system. And that
> > makes sense. If a unique file ID is required system wide, then inode number
> > must be combined with the file system's device ID.
>
> Ah, okay, good to know. Christian do you remember anything about this
> parallel unlink case? Is there any special handling to guard against
> reused inode numbers? From the discussion this seems likely in the
> tmpfs case.

Just to clarify: I've quoted tmpfs as a way of manually generating
duplicates, but it still needs multiple mount points (a single tmpfs will
still ensure uniqueness), e.g.
---
# mkdir dup
# cd dup
# mkdir 1 2
# mount -t tmpfs tmpfs 1
# mount -t tmpfs tmpfs 2
# echo one > 1/one
# echo second > 2/second
# ls -li 1 2
1:
total 4
2 -rw-r--r--. 1 root root 4 Dec 29 06:10 one

2:
total 4
2 -rw-r--r--. 1 root root 7 Dec 29 06:10 second
---

Then export that 'dup' directory without remapping to see how it behaves;
you'll find some weird behaviours with cached modes.
I don't think we can do anything if the server sends us identical inodes,
that's the server's job.

(iiuc from a networked filesystem's point of view (e.g. nfs), we ought to
check i_ino + i_generation as an inode number can be reused after the
previous owner of the number was deleted, but we're guaranteed the couple
changes... At any given point though i_ino is unique, so we've been
relying on just that for two reasons: there's no room to send
i_generation, and the field isn't easily obtainable from userspace (you
need to dig it out of the opaque bytes from name_to_handle_at, which are
file system dependent), so we only rely on the inode number -- if the
field gets bigger it should be variable length and we should send the
whole mnt_id + name_to_handle_at content for a stable, truly unique
handle. That's what userspace NFS servers do; rough sketch further down.)

> Yeah, there's the files and then there's the inode. So if its just a rename,
> its likely it kept the inode, right? I guess I should just gaze at some
> VFS traces to understand the behavior and whether or not we need to do
> anything from an inode/qid.path perspective.

Yes, renames won't change the inode at all (ls -i will keep the same
number, and all props will be identical).

In Linux VFS speak, there's:
 - dentries (what you called files?), pointers to an inode from
   directories; that's what gets updated on renames; there can be multiple
   dentries for a single inode (hardlinks for regular files). We keep some
   fids in there for directories, to walk from as needed (one per uid)
 - inode numbers, that identify a given ""file"" (common-speech file,
   actual data); they must not change for the lifetime of the file and
   there should be no collision
 - the inode struct, i.e. "file" information (size, various times,
   i_version...) that gets updated when the file is modified (rename
   shouldn't touch these either, but writes will); I guess with what you
   said there can be multiple inode structs alive for a given "file"? But
   that sounds like a bug to me, we should always reuse it as it's also
   what governs the vfs cache data: at best you'll be fetching stats
   twice, at worst you'll end up with two mmaps in parallel that don't get
   updated the same...
 - the file struct, that'll back an fd somewhere (there can be multiple
   fds per file with dup and friends); we keep the open 'fid' information
   around in the file's private data.

(I'm not 100% sure how process cwd is handled -- probably just a ref on
the dentry? When opening a file we need to figure it out from the dentry,
so get the parent's dentry and fids from there and use that parent
dentry's cached fid, or build it up -- I guess if there's no cached fid
here we could do a multi-level walk, but my understanding was that there
should always be something cached at that point... Also, for something
like ls you'll stat all files in a directory, so if there wasn't a cached
fid for the parent's dentry you'd do a lot of multi-level walks, where
caching the parent's dentry sounds more efficient?)
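(Since I mentioned name_to_handle_at above, here's the rough sketch I was
referring to -- untested, purely illustrative, but this is roughly the
mnt_id + opaque handle pair a userspace server could key on:)

---
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	struct file_handle *fh;
	int mount_id, i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path>\n", argv[0]);
		return 1;
	}

	/* MAX_HANDLE_SZ is the upper bound glibc exposes in <fcntl.h> */
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return 1;
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}

	/* mnt_id + (handle_type, f_handle) is the stable, fs-defined identity */
	printf("mnt_id=%d type=%d handle=", mount_id, fh->handle_type);
	for (i = 0; i < (int)fh->handle_bytes; i++)
		printf("%02x", fh->f_handle[i]);
	printf("\n");
	free(fh);
	return 0;
}
---

(Running it on the two files in the 'dup' example above should show the
same inode number inside the handle but different mnt_ids, which is the
whole point.)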
So to recap there should only be one inode, then as many dentries as there
are paths to it, and as many files as there are "open handles" to it.

In cacheless modes we should drop dentries as soon as no file/process uses
them, and drop inodes as soon as there are no dentries left.
In cached modes we should have an extra ref to dentries (droppable on
memory pressure somehow -- we're not doing that correctly right now
afaik), that'll keep the inode alive, so we should try very hard to always
reuse the same inode/dentries to avoid needless lookups.

> I have publically stated I hate the temporary file hack. It just seems
> awful, but I've punted this to be a server responsibility so I don't have
> to think about it ;) I get where it may cause trouble on a reconnect,
> but that could be handled different ways as well, just with a lot more
> invasiveness on the server-side (like holding on to dangling fids on
> unexpected disconnects) -- maybe something else to queue up for
> future protocol revisions.

Well there's reconnect (connection troubles), and there's reconnect
(server restart, or migration) -- in nfsv4 the server is expected to have
some persistent state, so e.g. file leases will persist; I guess it would
be possible to store information for tmpfiles there too, but at the very
least the server also needs to be able to re-open the file, so it needs to
be kept around...
(speaking of which, I guess qemu live migration doesn't work with 9p
mounts? or maybe just not tmpfiles... Can't say I tried...)

Anyway, we don't need to work on reconnect here, that's distinct enough
from the problems you've seen; let's not worry about it at this point, and
we can bring it back up later if/when someone has energy for it :)

--
Dominique Martinet | Asmadeus