From: Christian Schoenebeck <linux_oss@crudebyte.com>
To: Eric Van Hensbergen <ericvh@kernel.org>, asmadeus@codewreck.org
Cc: v9fs@lists.linux.dev, ericvh@gmail.com
Subject: Re: cache fixes (redux)
Date: Fri, 29 Dec 2023 13:22:45 +0100
Message-ID: <2850709.f7ZBceQk0b@silver>
In-Reply-To: <ZY3tJ0eD73Gmy_PQ@codewreck.org>
On Thursday, December 28, 2023 10:48:23 PM CET asmadeus@codewreck.org wrote:
> Eric Van Hensbergen wrote on Thu, Dec 28, 2023 at 09:07:46AM -0600:
> > > > qemu mixes in some bits from st_dev to avoid this:
> > > > https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881
> > >
> > > Inode numbers are only guaranteed to be unique within the same file
> > > system. And that makes sense. If a system-wide unique file ID is required,
> > > then the inode number must be combined with the file system's device ID.
> >
> > Ah, okay, good to know. Christian do you remember anything about this
> > parallel unlink case? Is there any special handling to guard against
> > reused inode numbers? From the discussion this seems likely in the
> > tmpfs case.
>
> Just to clarify: I quoted tmpfs as a way of manually generating
> duplicates, but it still needs multiple mount points (a single tmpfs
> will still ensure uniqueness)
> e.g.
>
> ---
> # mkdir dup
> # cd dup
> # mkdir 1 2
> # mount -t tmpfs tmpfs 1
> # mount -t tmpfs tmpfs 2
> # echo one > 1/one
> # echo second > 2/second
> # ls -li 1 2
> 1:
> total 4
> 2 -rw-r--r--. 1 root root 4 Dec 29 06:10 one
>
> 2:
> total 4
> 2 -rw-r--r--. 1 root root 7 Dec 29 06:10 second
> ---
Which is the expected, correct behaviour, as you can still distinguish the two
by their device IDs:
# stat 1/one
File: 1/one
Size: 4 Blocks: 8 IO Block: 4096 regular file
Device: 2dh/45d Inode: 2 Links: 1
...
# stat 2/second
File: 2/second
Size: 7 Blocks: 8 IO Block: 4096 regular file
Device: 2eh/46d Inode: 2 Links: 1
...
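To make that concrete, here is a small sketch (Python; the file names are made up for the example) of building a system-wide unique file key out of st_dev and st_ino, which stat(2) exposes to userspace:

```python
import os
import tempfile

def file_key(path):
    """System-wide unique key for a file: the inode number alone is only
    unique within one file system, so pair it with the device ID."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

# Demo: two distinct files on the same fs share st_dev but differ in
# st_ino; a hard link refers to the same inode and yields the same key.
d = tempfile.mkdtemp()
a = os.path.join(d, "one")
b = os.path.join(d, "second")
for p in (a, b):
    with open(p, "w") as f:
        f.write("x\n")
link = os.path.join(d, "one-link")
os.link(a, link)
print(file_key(a) != file_key(b))    # distinct files, distinct keys
print(file_key(a) == file_key(link))  # same inode, same key
```

In the mount example above the situation is the inverse: equal st_ino, different st_dev, and the combined key still tells the two files apart.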
> Then export that 'dup' directory without remapping to see how it
> behaves; you'll find some weird behaviours with cached modes.
>
>
> I don't think we can do anything if the server sends us identical
> inodes, that's the server's job.
Right.
BTW, even if the qid.path length is increased as part of a 9p protocol change,
it would still be tricky on the client implementation side, as that number
would be too big to simply be exposed as an inode number by the client. IIRC
virtiofsd handles this by automatically creating separate devices when needed.
I haven't checked exactly how, e.g. whether there is some easy way to create
"subdevices" on Linux.
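For illustration, a toy sketch of the kind of remapping qemu's multidevs option does (the real code, linked earlier in the thread, is more involved; the fixed 16-bit split below is an assumption made for the example): assign each host device a small prefix, put it in the top bits of the exported inode number, and fail loudly when the remaining bits would not suffice:

```python
# Toy remapping: each host device ID gets a small prefix placed in the
# top bits of the exported inode number. If the inode needs more than
# the remaining bits, lossless remapping fails and a server would have
# to fall back to another scheme.
PREFIX_BITS = 16  # assumption for the sketch; qemu sizes this dynamically

_prefixes = {}  # st_dev -> prefix

def export_ino(st_dev, st_ino):
    prefix = _prefixes.setdefault(st_dev, len(_prefixes))
    if prefix >= 1 << PREFIX_BITS:
        raise OverflowError("too many devices for the prefix space")
    if st_ino >> (64 - PREFIX_BITS):
        raise OverflowError("inode too large to remap losslessly")
    return (prefix << (64 - PREFIX_BITS)) | st_ino

# Two files with the same inode number on different devices now get
# distinct exported inode numbers:
print(export_ino(0x2D, 2) != export_ino(0x2E, 2))
```

The overflow cases are exactly why a longer qid.path would not map cleanly onto a 64-bit inode number on the client side.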
> (iiuc from a networked filesystem's point of view (e.g. nfs), we ought
> to check i_ino + i_generation, as an inode number can be reused after the
> previous owner of the number was deleted, but the pair is guaranteed to
> change... At any given point i_ino is unique, though, so we've been
> relying on just that for two reasons: no room to send i_generation, and
> the field isn't easily obtainable from userspace (you need to dig it out
> of the bytes returned by name_to_handle_at, which is file system
> dependent), so we only rely on the inode number -- if the field gets
> bigger it should be variable length, and we should write the whole
> mnt_id + name_to_handle_at content for a stable, truly unique handle.
> That's what userspace NFS servers do.)
Why is that? On tmpfs at least, the inode number is a serial counter that is
consecutively incremented:
# ls -li one
2 -rw-r--r-- 1 me me 4 Dec 29 12:42 one
# rm one
# echo one > one
# ls -li one
3 -rw-r--r-- 1 me me 4 Dec 29 12:55 one
So I would not expect an inode number to be reused before that counter wraps
at 2^64. Besides, this discussion is about files that are still open, so the
host's file system can't recycle the inode number as long as the 9p client
keeps the corresponding old FID open.
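That last point can be demonstrated from userspace (a sketch; sequential inode allocation is fs-specific, so the example only checks what should hold on any POSIX fs): an unlinked-but-open file keeps its inode pinned, so a new file created under the same name must get a different inode number:

```python
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "one")
with open(path, "w") as f:
    f.write("one\n")

fd = os.open(path, os.O_RDONLY)  # keep the inode pinned, like an open FID
old_ino = os.fstat(fd).st_ino
os.unlink(path)                  # last name gone, but the inode stays alive

# The open descriptor still works; the link count has dropped to zero.
st = os.fstat(fd)
print(st.st_nlink)               # 0

# A new file under the same name cannot reuse the pinned inode.
with open(path, "w") as f:
    f.write("one\n")
new_ino = os.stat(path).st_ino
print(new_ino != old_ino)

os.close(fd)                     # only now may the fs recycle old_ino
```

The open fd here plays the role of the 9p client's old FID: until it is closed, the host cannot hand the old inode number to anyone else.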
> >
> > Yeah, there's the files and then there's the inode. So if it's just a rename,
> > it's likely it kept the inode, right? I guess I should just gaze at some
> > VFS traces to understand the behavior and whether or not we need to do
> > anything from an inode/qid.path perspective.
>
> Yes, renames won't change the inode at all (ls -i will keep the same
> number, and all props will be identical)
>
>
> In linux VFS speak, there's:
> - dentries (what you called files?), pointers to an inode from
> directories; that's what gets updated on renames; there can be multiple
> dentries for a single inode (hardlinks for regular files).
> We keep some fids in there for directories, to walk from there as needed
> (one per uid)
>
> - inode numbers, which identify a given ""file"" (a file in the common-speech
> sense, the actual data); they must not change for the lifetime of the file
> and there should be no collisions
>
> - the inode struct, e.g. "file" information (size, various times,
> i_version...) that gets updated when the file is modified (rename shouldn't
> touch these either, but writes will);
> I guess with what you said there can be multiple inode structs alive for
> a given "file"? But that sounds like a bug to me, we should always reuse
> them, as that's also what governs the vfs cache data: at best you'll be
> fetching stats twice, at worst you'll end up with two mmaps in parallel
> that don't get updated the same way...
>
> - the file struct, which backs an fd somewhere (there can be multiple fds
> per file with dup and friends); we keep the open 'fid' information around
> in the file's private data.
>
> (I'm not 100% sure how a process's cwd is handled -- probably just a ref on
> the dentry? When opening a file we need to figure it out from the
> dentry, so get the parent's dentry and fids from there and use
> that parent dentry's cached fid, or build it up -- I guess if there's no
> cached fid here we could do a multi-level walk, but my understanding was
> that there should always be something cached at that point... Also, for
> something like ls you'll stat all files in a directory, so if there
> wasn't a cached fid for the parent's dentry you'd do a lot of
> multi-level walks, where caching the parent's dentry sounds more
> efficient?)
>
>
> So to recap there should only be one inode, then as many dentries as
> there are paths to it, and as many files as there are "open handles" to
> it
>
> In cacheless modes we should drop dentries as soon as no file/process
> uses them, and drop inodes as soon as there are no dentries left.
> In cached mode we should hold an extra ref on dentries (droppable on
> memory pressure somehow -- we're not doing that correctly right now
> afaik), which keeps the inode alive, so we should try very hard to
> always reuse the same inodes/dentries to avoid needless lookups.
Yes, the cache currently just grows indefinitely. The question is what kind of
metric should be used to decide when to start dropping old entries from the
cache.
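One common answer is recency. A minimal sketch of an LRU bound (the fixed capacity and evict-on-insert policy are illustrative assumptions, not what the 9p client does today; in-kernel caches are usually driven by a memory-pressure shrinker instead of a fixed count, but the eviction order is the same idea):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU: touch entries on access, evict the least recently
    used one once a fixed capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._d = OrderedDict()

    def get(self, key):
        if key not in self._d:
            return None
        self._d.move_to_end(key)  # mark as most recently used
        return self._d[key]

    def put(self, key, value):
        self._d[key] = value
        self._d.move_to_end(key)
        if len(self._d) > self.capacity:
            self._d.popitem(last=False)  # drop least recently used

c = LRUCache(2)
c.put("a", 1); c.put("b", 2)
c.get("a")           # "a" becomes most recently used
c.put("c", 3)        # evicts "b", the least recently used
print(sorted(c._d))  # ['a', 'c']
```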
> > I have publicly stated I hate the temporary file hack. It just seems
> > awful, but I've punted this to be a server responsibility so I don't have
> > to think about it ;) I get where it may cause trouble on a reconnect,
> > but that could be handled in different ways as well, just with a lot more
> > invasiveness on the server side (like holding on to dangling fids on
> > unexpected disconnects) -- maybe something else to queue up for
> > future protocol revisions.
>
> Well there's reconnect (connection trouble), and there's reconnect
> (server restart, or migration) -- in nfsv4 the server is expected to
> keep some persistent state so e.g. file leases will persist, so I guess
> it would be possible to store information for tmpfiles there too,
> but at the very least the server needs to be able to re-open the
> file, so it needs to be kept around...
> (speaking of which, I guess qemu live migration doesn't work with 9p
> mounts? or maybe just not tmpfiles... Can't say I tried...)
Right, QEMU live migration doesn't work with 9p ATM, and probably won't in the
near future.
> Anyway, we don't need to work on reconnect here, that's distinct enough
> from the problems you've seen;
> let's not worry about it at this point, and we can bring it back up
> later if/when someone has energy for it :)