From: asmadeus@codewreck.org
To: Eric Van Hensbergen <ericvh@kernel.org>
Cc: Christian Schoenebeck <linux_oss@crudebyte.com>,
v9fs@lists.linux.dev, ericvh@gmail.com
Subject: Re: cache fixes (redux)
Date: Fri, 29 Dec 2023 06:48:23 +0900 [thread overview]
Message-ID: <ZY3tJ0eD73Gmy_PQ@codewreck.org> (raw)
In-Reply-To: <ZY2PQkJ6jq4dx9p_@FV7GG9FTHL>
Eric Van Hensbergen wrote on Thu, Dec 28, 2023 at 09:07:46AM -0600:
> > > qemu mixes in some bits from st_dev to avoid this:
> > > https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881
> >
> > Inode numbers are always just unique within the same file system. And that
> > makes sense. If a unique file ID is required system wide, then inode number
> > must be combined with the file system's device ID.
>
> Ah, okay, good to know. Christian do you remember anything about this
> parallel unlink case? Is there any special handling to guard against
> reused inode numbers? From the discussion this seems likely in the
> tmpfs case.
Just to clarify: I quoted tmpfs as a way of manually generating
duplicate inode numbers, but it still needs multiple mount points (a
single tmpfs instance will ensure uniqueness within itself), e.g.:
---
# mkdir dup
# cd dup
# mkdir 1 2
# mount -t tmpfs tmpfs 1
# mount -t tmpfs tmpfs 2
# echo one > 1/one
# echo second > 2/second
# ls -li 1 2
1:
total 4
2 -rw-r--r--. 1 root root 4 Dec 29 06:10 one
2:
total 4
2 -rw-r--r--. 1 root root 7 Dec 29 06:10 second
---
Then export that 'dup' directory without inode remapping to see how it
behaves; you'll find some weird behaviours in cached modes.
I don't think we can do anything if the server sends us identical
inode numbers; deduplicating them is the server's job.
(iiuc, from a networked filesystem's point of view (e.g. nfs), we ought
to check i_ino + i_generation, since an inode number can be reused after
the previous owner of the number is deleted, but we're guaranteed the
pair changes... At any given point in time i_ino alone is unique though,
so we've been relying on just that, for two reasons: there's no room in
the protocol to send i_generation, and the field isn't easily obtainable
from userspace (you need to dig it out of the opaque bytes returned by
name_to_handle_at, which are filesystem-dependent). If the field ever
gets bigger it should be variable length, and we should send the whole
mnt_id + name_to_handle_at content for a stable, truly unique handle;
that's what userspace NFS servers do.)
>
> Yeah, there's the files and then there's the inode. So if it's just a rename,
> it's likely it kept the inode, right? I guess I should just gaze at some
> VFS traces to understand the behavior and whether or not we need to do
> anything from an inode/qid.path perspective.
Yes, a rename won't change the inode at all (ls -i will keep showing the
same number, and all properties will be identical).
In Linux VFS speak, there's:
- dentries (what you called files?): directory entries pointing to an
inode; that's what gets updated on renames, and there can be multiple
dentries for a single inode (hardlinks for regular files).
We keep some fids in there for directories, to walk from as needed
(one per uid).
- inode numbers, which identify a given ""file"" (common-speech file,
the actual data); the number must not change for the lifetime of the
file and there should be no collisions.
- the inode struct, i.e. the "file" information (size, various times,
i_version...) that gets updated when the file is modified (renames
shouldn't touch these either, but writes will).
I guess with what you said there can be multiple inode structs alive for
a given "file"? But that sounds like a bug to me; we should always reuse
the same one, as it's also what governs the VFS data cache: at best
you'll be fetching stats twice, at worst you'll end up with two parallel
mmaps that don't get updated the same way...
- the file struct, which backs a fd somewhere (there can be multiple fds
per file with dup and friends); we keep the open 'fid' information
around in the file's private data.
(I'm not 100% sure how a process's cwd is handled -- probably just a ref
on the dentry? When opening a file we need to figure its path out from
the dentry, so we get the parent's dentry and its fids, and either use
the parent dentry's cached fid or build one up -- I guess if there's no
cached fid there we could do a multi-level walk, but my understanding
was that there should always be something cached at that point... Also,
for something like ls you'll stat all files in a directory, so if there
were no cached fid for the parent's dentry you'd do a lot of multi-level
walks, where caching the parent's dentry sounds more efficient?)
So to recap: there should only be one inode, then as many dentries as
there are paths to it, and as many file structs as there are "open
handles" to it.
In cacheless modes we should drop dentries as soon as no file/process
uses them, and drop inodes as soon as there are no dentries left.
In cached modes we should hold an extra ref on dentries (droppable on
memory pressure somehow -- we're not doing that correctly right now
afaik); that keeps the inode alive, so we should try very hard to always
reuse the same inodes/dentries to avoid needless lookups.
> I have publicly stated I hate the temporary file hack. It just seems
> awful, but I've punted this to be a server responsibility so I don't have
> to think about it ;) I get where it may cause trouble on a reconnect,
> but that could be handled in different ways as well, just with a lot more
> invasiveness on the server side (like holding on to dangling fids on
> unexpected disconnects) -- maybe something else to queue up for
> future protocol revisions.
Well, there's reconnect (connection trouble), and there's reconnect
(server restart, or migration) -- in nfsv4 the server is expected to
keep some persistent state so that e.g. file leases survive, so I guess
it would be possible to store information about tmpfiles there too, but
at the very least the server needs to be able to re-open the file, so it
needs to be kept around...
(speaking of which, I guess qemu live migration doesn't work with 9p
mounts? or maybe just not tmpfiles... Can't say I tried...)
Anyway, we don't need to work on reconnect here, that's distinct enough
from the problems you've seen;
let's not worry about it at this point, and we can bring it back up
later if/when someone has energy for it :)
--
Dominique Martinet | Asmadeus