* cache fixes (redux)
@ 2023-12-26 16:13 Eric Van Hensbergen
From: Eric Van Hensbergen @ 2023-12-26 16:13 UTC (permalink / raw)
To: v9fs
It's the holidays, so I have some time to spend brain cycles on 9p --
I've been looking over the metadata caching problems and found several
low-hanging fixes that might help without having to do the big rewrite
I tried last year.

The two things I'm going after first:

- The fact that we never seem to use multi-walk (this isn't a
cache-specific problem; it's actually somewhat worse in non-cache
mode). This seems to be because fid_lookup goes recursive before
entering multiwalk. This should be a fairly simple fix - I'm just
going to add an argument to limit recursion (rough sketch after this
list), but I'm noodling on whether we need that path a bit first.
- figure out why even in loose mode we seem to be sending a lot of
stat/getattr -- this seems to be related to some code that came in
with the cache code. It seems we almost always stat/getattr because
we are constantly creating new inodes. There are some comments in
vfs_inode.c:755 that talk about needing to do this for parallel lookup
-- but this all seems like a hack. Anyone have better insight into
why we need this? We always do the stat/getattr to retrieve the stat
to compare against because we don't trust qid.path -- which I think is
wrong. If there is an issue with unlinked files, I think we can do
something smarter to differentiate them, maybe using a qid.type
- QUESTION: it seems like P9_QTLINK and P9_QTSYMLINK are no longer
used; can these be recovered? I was thinking of reusing one of these
to mark unlinked files, which we could do a more comprehensive compare
for if it's actually necessary.
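
For concreteness, a rough sketch of the recursion cap plus multi-walk
batching I have in mind -- none of this is existing fs/9p code:
v9fs_fid_find_cached() stands in for the real dentry->fid lookup
helper (which also takes a uid), error handling is elided, and
P9_MAXWELEM is just the protocol's per-Twalk component limit:

---
#include <linux/dcache.h>
#include <linux/err.h>
#include <net/9p/9p.h>
#include <net/9p/client.h>

/* stand-in for the real fs/9p helper (which also takes a uid) */
static struct p9_fid *v9fs_fid_find_cached(struct dentry *dentry);

static struct p9_fid *v9fs_fid_lookup_sketch(struct dentry *dentry)
{
	const unsigned char *wnames[P9_MAXWELEM];
	struct dentry *d = dentry;
	struct p9_fid *fid = NULL;
	int n = 0;

	/* climb until we hit a dentry that already holds a fid,
	 * instead of recursing one component at a time */
	while (n < P9_MAXWELEM && !(fid = v9fs_fid_find_cached(d))) {
		wnames[n++] = d->d_name.name;	/* collected leaf-first */
		d = d->d_parent;
	}
	if (!fid)
		return ERR_PTR(-ENOENT);	/* fall back to old path */

	/* reverse so the components walk root-to-leaf */
	for (int i = 0; i < n / 2; i++) {
		const unsigned char *tmp = wnames[i];

		wnames[i] = wnames[n - 1 - i];
		wnames[n - 1 - i] = tmp;
	}

	/* one Twalk with n components instead of n single walks */
	return p9_client_walk(fid, n, wnames, 1);
}
---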
FWIW - I do plan to go back and try to do the broader rewrite on the
metadata cache, because I think we'll need it for temporal caches to
work properly, but I figure if I knock these two out it'll make life
better for many people's common use cases.
-eric

* Re: cache fixes (redux)
From: Dominique Martinet @ 2023-12-27 8:10 UTC (permalink / raw)
To: Eric Van Hensbergen; +Cc: v9fs, Christian Schoenebeck

Hi Eric!

Eric Van Hensbergen wrote on Tue, Dec 26, 2023 at 10:13:20AM -0600:
> - The fact that we never seem to use multi-walk (this isn't a
> cache-specific problem; it's actually somewhat worse in non-cache
> mode). This seems to be because fid_lookup goes recursive before
> entering multiwalk. This should be a fairly simple fix - I'm just
> going to add an argument to limit recursion, but I'm noodling on
> whether we need that path a bit first.

I'm not sure multi-walk is possible in practice: even opening a file
directly through e.g. cat /mnt/long/path/to/file will have the VFS
check permissions along the path, so the VFS will need to stat all
elements along the way.

If you can get it to work, though, it's certainly worth implementing
the possibility; let's give it a try.

> - figure out why even in loose mode we seem to be sending a lot of
> stat/getattr -- this seems to be related to some code that came in
> with the cache code. It seems we almost always stat/getattr because
> we are constantly creating new inodes. There are some comments in
> vfs_inode.c:755 that talk about needing to do this for parallel
> lookup -- but this all seems like a hack. Anyone have better insight
> into why we need this? We always do the stat/getattr to retrieve the
> stat to compare against because we don't trust qid.path -- which I
> think is wrong. If there is an issue with unlinked files, I think we
> can do something smarter to differentiate them, maybe using a
> qid.type

Hmm, in loose mode we definitely should be reusing the inodes...
No idea why we had this in the first place; I'd say rip it out and
compare?

> - QUESTION: it seems like P9_QTLINK and P9_QTSYMLINK are no longer
> used; can these be recovered? I was thinking of reusing one of these
> to mark unlinked files, which we could do a more comprehensive
> compare for if it's actually necessary.

Since we don't support reconnects, unlinked files have been working
just fine with servers that keep the underlying files open (e.g.
qemu).

What's your plan exactly?

We don't have any way of driving protocol changes forward (and there's
no compatibility negotiation on client connect...), but I recall
Christian also was looking forward to some other changes, so we might
want to try to move this forward again...

-- 
Dominique Martinet | Asmadeus

* Re: cache fixes (redux)
From: Eric Van Hensbergen @ 2023-12-27 18:07 UTC (permalink / raw)
To: Dominique Martinet; +Cc: v9fs, Christian Schoenebeck

On Wed, Dec 27, 2023 at 2:11 AM Dominique Martinet
<asmadeus@codewreck.org> wrote:
> I'm not sure multi-walk is possible in practice: even opening a file
> directly through e.g. cat /mnt/long/path/to/file will have the VFS
> check permissions along the path, so the VFS will need to stat all
> elements along the way.

Well, the server will do perm checks (ideally on walks), so do we
really have to? Feels like overkill to check in both places; in
principle we expect/assume server perm checks.

> Hmm, in loose mode we definitely should be reusing the inodes...
> No idea why we had this in the first place; I'd say rip it out and
> compare?

Yeah, I was thinking of using one of the unused qid bits to mark
negative inodes so we don't collide on lookups after unlink (although
I don't know how often this ever happened, and some of this will be
more straightforward with all fids tracked in inodes versus dentries).

It did strike me that inodes are only unique per filesystem and I
doubt the servers guarantee uniqueness across underlying mounts - but
in principle we shouldn't have to rely on looking up inodes by
qid.path anyway... However, nothing I'm planning should make this
worse; just something to keep an eye out for.

> > - QUESTION: it seems like P9_QTLINK and P9_QTSYMLINK are no longer
> > used; can these be recovered? I was thinking of reusing one of
> > these to mark unlinked files, which we could do a more
> > comprehensive compare for if it's actually necessary.
>
> Since we don't support reconnects, unlinked files have been working
> just fine with servers that keep the underlying files open (e.g.
> qemu).
>
> What's your plan exactly?
>
> We don't have any way of driving protocol changes forward (and
> there's no compatibility negotiation on client connect...), but I
> recall Christian also was looking forward to some other changes, so
> we might want to try to move this forward again...

Well, those bits are unused (I think) so this shouldn't be a protocol
change. However, we can bump to a new protocol when necessary (9p2024?
it's been 23 years, why not?). Big on my list would be qid.path -> 128
bit and qid.version -> 128 bit, maybe including a filesystem uid in
the qid to differentiate underlying mounts, and maybe i_generation.
This would help with cache maintenance and uniqueness concerns. I'm
also thinking of how we might handle security differently (use kernel
TLS, public key auth integration, etc., although those might not
require protocol changes). I'm also in favor of dirread semantics for
cache modes so we retrieve attributes with the file list. What other
things should we be considering? Maybe time to start putting together
an RFC.

It's probably worth driving things like qid changes together with the
Plan 9 community versus making 9p2024 a Linux-only version, since
there are still folks using it and a larger qid space is likely a
reality on newer systems.

-eric
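
To make the "mark unlinked files in qid.type" idea concrete, a sketch
of what it could look like. P9_QTUNLINKED is invented here (aliasing
P9_QTSYMLINK's bit, one of the candidates above) and would be strictly
client-local, never sent on the wire:

---
#include <linux/types.h>
#include <net/9p/9p.h>

/* assumed: reclaim an apparently-unused qid.type bit, client-side only */
#define P9_QTUNLINKED	P9_QTSYMLINK

static inline void v9fs_qid_mark_unlinked(struct p9_qid *qid)
{
	qid->type |= P9_QTUNLINKED;
}

/*
 * The inode-cache test callback could then demand the expensive
 * stat/getattr comparison only for unlinked files, instead of
 * distrusting qid.path on every lookup.
 */
static inline bool v9fs_qid_needs_full_compare(const struct p9_qid *qid)
{
	return qid->type & P9_QTUNLINKED;
}
---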
[parent not found: <CAFkjPT=8CZHATaraBrqAZGDKjQLOp=U1gdgteJ5jpXRGJyBojQ@mail.gmail.com>]
* Re: cache fixes (redux)
From: Dominique Martinet @ 2023-12-28 0:32 UTC (permalink / raw)
To: Eric Van Hensbergen; +Cc: Christian Schoenebeck, v9fs

Eric Van Hensbergen wrote on Wed, Dec 27, 2023 at 11:39:19AM -0600:
> On Wed, Dec 27, 2023 at 2:11 AM Dominique Martinet
> <asmadeus@codewreck.org> wrote:
> > I'm not sure multi-walk is possible in practice: even opening a
> > file directly through e.g. cat /mnt/long/path/to/file will have the
> > VFS check permissions along the path, so the VFS will need to stat
> > all elements along the way.
>
> Well, the server will do perm checks (ideally on walks), so do we
> really have to? Feels like overkill to check in both places; in
> principle we expect/assume server perm checks.

The server should definitely check as well, yes.
I'm not sure it's possible to tell the linux VFS that the server
already checked though; ultimately the plan back when I was working on
9p for my previous job was to plug MPI-IO into a user-space client
library (e.g. https://github.com/martinetd/space9 -- now unmaintained)
so we wouldn't have to deal with the vfs...

But once again, if you can manage to tell it that, I'm all for it.
Some servers might not check properly, but that can/should be fixed
when it comes up.

> it did strike me that inodes are only unique per filesystem and I
> doubt the servers guarantee uniqueness across underlying mounts - but
> in principle we shouldn't have to rely on looking up inodes by
> qid.path anyway... However, nothing I'm planning should make this
> worse; just something to keep an eye out for.

We've had collisions a few times -- most "real" file systems I'm aware
of pseudo-randomize inode generation, but in particular tmpfs inode
numbering is linear per mount, so it's very easy to trigger
collisions.

qemu mixes in some bits from st_dev to avoid this:
https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881

> > > - QUESTION: it seems like P9_QTLINK and P9_QTSYMLINK are no
> > > longer used; can these be recovered? I was thinking of reusing
> > > one of these to mark unlinked files, which we could do a more
> > > comprehensive compare for if it's actually necessary.
> >
> > Since we don't support reconnects, unlinked files have been working
> > just fine with servers that keep the underlying files open (e.g.
> > qemu).
> >
> > What's your plan exactly?
>
> See above, mark fid/qids that are unlinked with a different type I
> think. I'm gonna look at how other filesystems handle this case
> before doing anything.

local filesystems don't do anything - the inode is just kept around,
all the way down to disk afaik (disk-mapped tmpfiles flush data to
disk so there must be some way to address them...)
(The last iput with nlink==0 calls evict_inode which does required
cleanup)

NFS doesn't delete the backing file but renames it - silly renames -
you've probably seen some .nfs1234 files around if you used a mount
point; the client won't let you delete them while the file is open:
rm: cannot remove '.nfs000000000000f42300000001': Device or resource busy

And it's unlinked when the file is closed (DCACHE_NFSFS_RENAMED check
in nfs_dentry_iput). That's necessary because the connection to the
server can drop at any time, and reconnecting needs to be able to open
the file again.
Servers will also scrub these files after they've been unused for a
while.

For 9p, since we don't handle reconnect, I think we're fine just not
doing anything -- servers that don't close backing files will de facto
keep the file existing through their fd. There's a process with the
file open, so we have an open fid somewhere and the inode won't get
killed either. When the last fd is closed the inode is evicted, and
we'll have all fids closed, so the server will release its fd as well
and the backing filesystem will clean up appropriately.

If we want to handle reconnect at some point (CEA/Bull did that work
but never tried to publish it...), then we ought to do silly renames
as well, but servers really need background cleanup too or it'll end
up full of .nfsxxx files...

> > We don't have any way of driving protocol changes forward (and
> > there's no compatibility negotiation on client connect...), but I
> > recall Christian also was looking forward to some other changes, so
> > we might want to try to move this forward again...
>
> Well, those bits are unused (I think) so this shouldn't be a protocol
> change.

Right, I didn't realize it was just local use -- we can do whatever we
want in memory, yes.
I'd avoid reusing the bits and define some new ones at the end of the
permissible range to keep the "can come from server" and "client mess"
separation, but I'm leaving this up to you -- ultimately, as long as
there's no conflict with what's on the wire we should be fine.

> However we can bump to a new protocol when necessary (9p2023? it's
> been 23 years, why not?). Big on my list would be qid.path -> 128 bit
> and qid.version -> 128 bit, maybe including a filesystem uid in the
> qid to differentiate underlying mounts, and maybe i_generation. This
> would help with cache maintenance and uniqueness concerns. I'm also
> thinking of how we might handle security differently (use kernel TLS,
> public key auth integration, etc., although those might not require
> protocol changes). I'm also in favor of dirread semantics for cache
> modes so we retrieve attributes with the file list. What other things
> should we be considering? Maybe time to start putting together an
> RFC.

Right; there's plenty to improve if we start...
My free time is basically rock bottom, as it has been for a few years,
so I probably won't help much, but I wouldn't push back.

-- 
Dominique Martinet | Asmadeus
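
To illustrate the silly-rename dance described above, a rough sketch
of what an NFS-style variant could look like in a 9p unlink path. 9p
does not do this today; the .9p_ name prefix and the
v9fs_rename_on_server() helper are made up for the sketch:

---
#include <linux/dcache.h>
#include <linux/fs.h>
#include <linux/namei.h>
#include <linux/random.h>

/* assumed helper: issues Trename/Trenameat to move the file aside */
static int v9fs_rename_on_server(struct inode *dir, struct dentry *old,
				 struct dentry *new);

static int v9fs_sillyrename(struct inode *dir, struct dentry *dentry)
{
	struct dentry *sdentry;
	char name[32];
	int err;

	/* pick a recognizable name, a la .nfsXXXX, that the client
	 * would then refuse to let anyone unlink while it's open */
	snprintf(name, sizeof(name), ".9p_%08x%08x",
		 (u32)d_inode(dentry)->i_ino, get_random_u32());

	/* the unlink path already holds i_rwsem on dir, as lookup needs */
	sdentry = lookup_one_len(name, dentry->d_parent, strlen(name));
	if (IS_ERR(sdentry))
		return PTR_ERR(sdentry);

	err = v9fs_rename_on_server(dir, dentry, sdentry);
	if (!err)
		d_move(dentry, sdentry);	/* keep dcache coherent */

	dput(sdentry);
	return err;
}
---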
* Re: cache fixes (redux)
From: Christian Schoenebeck @ 2023-12-28 10:44 UTC (permalink / raw)
To: Eric Van Hensbergen, Dominique Martinet; +Cc: v9fs, Greg Kurz

On Thursday, December 28, 2023 1:32:17 AM CET Dominique Martinet wrote:
> Eric Van Hensbergen wrote on Wed, Dec 27, 2023 at 11:39:19AM -0600:
> > Well, the server will do perm checks (ideally on walks), so do we
> > really have to? Feels like overkill to check in both places; in
> > principle we expect/assume server perm checks.
>
> The server should definitely check as well, yes.
> I'm not sure it's possible to tell the linux VFS that the server
> already checked though; ultimately the plan back when I was working
> on 9p for my previous job was to plug MPI-IO into a user-space client
> library (e.g. https://github.com/martinetd/space9 -- now
> unmaintained) so we wouldn't have to deal with the vfs...
>
> But once again, if you can manage to tell it that, I'm all for it.
> Some servers might not check properly, but that can/should be fixed
> when it comes up.

I would rather see this as a brave kernel option for people who really
know what they are doing, not as a potential default behaviour.

With QEMU there are two distinct behaviours: 'passthrough' permissions
vs. 'mapped(-*)' permissions:

https://wiki.qemu.org/Documentation/9psetup#Starting_the_Guest_directly

Where 'passthrough' means the guest sees the exact same permissions
for files as the host (9p server) does. In this mode you could
probably leave it to the server to do the permission checks. However,
'passthrough' mode is discouraged for security reasons, and one of the
'mapped' permissions modes is recommended instead, because we had
several security issues in the past which showed that it's not a good
idea to use 'passthrough' (whereas 'mapped' was unaffected).

With 'mapped(-*)', however, the guest should do permission checks,
because in this mode file permissions on the host are different and
separately handled between host and guest. So the host sees something
else than the guest does.

> > it did strike me that inodes are only unique per filesystem and I
> > doubt the servers guarantee uniqueness across underlying mounts -
> > but in principle we shouldn't have to rely on looking up inodes by
> > qid.path anyway... However, nothing I'm planning should make this
> > worse; just something to keep an eye out for.
>
> We've had collisions a few times -- most "real" file systems I'm
> aware of pseudo-randomize inode generation, but in particular tmpfs
> inode numbering is linear per mount, so it's very easy to trigger
> collisions.
>
> qemu mixes in some bits from st_dev to avoid this:
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881

Inode numbers are always just unique within the same file system, and
that makes sense. If a unique file ID is required system wide, then
the inode number must be combined with the file system's device ID.

And yes, that's what QEMU does (combining inode # with device ID) when
the option `multidevs=remap` is supplied: it mixes the device ID into
qid.path to avoid file ID collisions on the guest. Unfortunately the
qid.path field is currently too small to just use the device ID
directly, so a little trick is used: the so-called 'Exponential
Golomb' algorithm is used to generate a much shorter device ID that
can be put into qid.path instead.

Note though, this behaviour is actually off by default ATM,
controllable by: multidevs=remap|forbid|warn

https://wiki.qemu.org/Documentation/9psetup#Starting_the_Guest_directly

By default (multidevs=warn) QEMU would currently only print/log a
warning:

    warn_report_once(
        "9p: Multiple devices detected in same VirtFS export, "
        "which might lead to file ID collisions and severe "
        "misbehaviours on guest! You should either use a "
        "separate export for each device shared from host or "
        "use virtfs option 'multidevs=remap'!"
    );

> > > > - QUESTION: it seems like P9_QTLINK and P9_QTSYMLINK are no
> > > > longer used; can these be recovered? [...]
>
> [...]
>
> If we want to handle reconnect at some point (CEA/Bull did that work
> but never tried to publish it...), then we ought to do silly renames
> as well, but servers really need background cleanup too or it'll end
> up full of .nfsxxx files...
>
> [...]
>
> Right; there's plenty to improve if we start...
> My free time is basically rock bottom, as it has been for a few
> years, so I probably won't help much, but I wouldn't push back.

I've been gathering ideas for 9p protocol changes here:
https://wiki.qemu.org/Documentation/9p#Protocol_Plans

One thing that comes to my mind that's not yet on that list: server
side push notifications, e.g. to inform the guest that something
changed on a file/dir the guest currently keeps open. That might
improve consistency when changes are made both on host and guest and
caching is being used by the guest.

/Christian
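
As a rough illustration of what multidevs=remap does -- QEMU's real
code assigns variable-length device prefixes via the Exponential
Golomb coding mentioned above; the fixed 16-bit split below is a
simplification, not the actual algorithm:

---
#include <stdint.h>

#define DEV_PREFIX_BITS 16

/* fold a small per-device prefix into the top bits of qid.path so
 * inodes from different host filesystems can't collide */
static uint64_t remap_qid_path(uint16_t dev_prefix, uint64_t ino,
                               int *overflow)
{
	uint64_t ino_mask = UINT64_MAX >> DEV_PREFIX_BITS;

	/* inode numbers that don't fit under the prefix can't be
	 * remapped this way (a real implementation must handle this) */
	*overflow = (ino & ~ino_mask) != 0;

	return ((uint64_t)dev_prefix << (64 - DEV_PREFIX_BITS)) |
	       (ino & ino_mask);
}
---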
* Re: cache fixes (redux)
From: Eric Van Hensbergen @ 2023-12-28 15:07 UTC (permalink / raw)
To: Christian Schoenebeck; +Cc: v9fs, asmadeus, ericvh

On Thu, Dec 28, 2023 at 11:44:32AM +0100, Christian Schoenebeck wrote:
> On Thursday, December 28, 2023 1:32:17 AM CET Dominique Martinet wrote:
> > Eric Van Hensbergen wrote on Wed, Dec 27, 2023 at 11:39:19AM -0600:
> > > Well, the server will do perm checks (ideally on walks), so do we
> > > really have to? Feels like overkill to check in both places; in
> > > principle we expect/assume server perm checks.
> >
> > The server should definitely check as well, yes.
>
> I would rather see this as a brave kernel option for people who
> really know what they are doing, not as a potential default
> behaviour.

Likely exposed first as one of the many cache options, but yes - I was
planning on potentially guarding this initially -- although, as
Dominique says, the VFS may be doing checking regardless of what we do
in v9fs, and based on the protocol traces I was looking at before, it
seems we were likely doing duplicate checks (or at the very least
duplicate getattrs), and getting rid of those is a bug fix, not a
feature change. So I'm going to proceed carefully and will post diffs
on protocol traces.

> With QEMU there are two distinct behaviours: 'passthrough'
> permissions vs. 'mapped(-*)' permissions:
>
> https://wiki.qemu.org/Documentation/9psetup#Starting_the_Guest_directly
>
> Where 'passthrough' means the guest sees the exact same permissions
> for files as the host (9p server) does. [...]
>
> With 'mapped(-*)', however, the guest should do permission checks,
> because in this mode file permissions on the host are different and
> separately handled between host and guest. So the host sees something
> else than the guest does.

I think we've had this discussion before, but relying on the client to
do permission checks for the server sounds like a very dangerous
behavior from a security perspective. If the server is configured for
some form of mapped mode, then the server should be responsible for
checking those semantics to protect from a compromised client. This
is probably even more the case in a hosted guest environment.

In any case, the main point kicking off this thread was vfs_lookup
cases where we are doing multi-walk -- I expect VFS is still going to
do double checks on any sort of transactions on the files themselves,
so while I suppose it opens the window up to a traversal bypassing a
directory which doesn't have x permissions, that seems like a corner
case. Besides which, as per the previous paragraph, I think VFS is
going to do the checks anyway and we were just doing duplicate checks,
so it shouldn't be a problem.

> > qemu mixes in some bits from st_dev to avoid this:
> > https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881
>
> Inode numbers are always just unique within the same file system, and
> that makes sense. If a unique file ID is required system wide, then
> the inode number must be combined with the file system's device ID.

Ah, okay, good to know. Christian, do you remember anything about this
parallel unlink case? Is there any special handling to guard against
reused inode numbers? From the discussion this seems likely in the
tmpfs case.

> > > > > - QUESTION: it seems like P9_QTLINK and P9_QTSYMLINK are no
> > > > > longer used; can these be recovered? [...]
> > >
> > > See above, mark fid/qids that are unlinked with a different type
> > > I think. I'm gonna look at how other filesystems handle this case
> > > before doing anything.
> >
> > local filesystems don't do anything - the inode is just kept
> > around, all the way down to disk afaik (disk-mapped tmpfiles flush
> > data to disk so there must be some way to address them...)
> > (The last iput with nlink==0 calls evict_inode which does required
> > cleanup)
> >
> > NFS doesn't delete the backing file but renames it - silly renames
> > - you've probably seen some .nfs1234 files around if you used a
> > mount point [...]
> > And it's unlinked when the file is closed (DCACHE_NFSFS_RENAMED
> > check in nfs_dentry_iput). That's necessary because the connection
> > to the server can drop at any time, and reconnecting needs to be
> > able to open the file again.
> > Servers will also scrub these files after they've been unused for a
> > while.

Yeah, there's the files and then there's the inode. So if it's just a
rename, it likely kept the inode, right? I guess I should just gaze at
some VFS traces to understand the behavior and whether or not we need
to do anything from an inode/qid.path perspective.

I have publicly stated I hate the temporary file hack. It just seems
awful, but I've punted this to be a server responsibility so I don't
have to think about it ;) I get where it may cause trouble on a
reconnect, but that could be handled in different ways as well, just
with a lot more invasiveness on the server side (like holding on to
dangling fids on unexpected disconnects) -- maybe something else to
queue up for future protocol revisions.

> I've been gathering ideas for 9p protocol changes here:
> https://wiki.qemu.org/Documentation/9p#Protocol_Plans
>
> One thing that comes to my mind that's not yet on that list: server
> side push notifications, e.g. to inform the guest that something
> changed on a file/dir the guest currently keeps open. That might
> improve consistency when changes are made both on host and guest and
> caching is being used by the guest.

Yes, that list largely seems in line with mine, except for adding the
error attribute on Rread/Rwrite -- that one maybe needs some
discussion. I do agree that looking into server-push or some other
notification method would be good to add to the protocol.

-eric
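
For the "brave kernel option" guard being discussed, a sketch of the
shape it could take. The V9FS_SERVER_PERMS flag is hypothetical; on
recent kernels ->permission takes a struct mnt_idmap as its first
argument (older kernels differ):

---
#include <linux/fs.h>
#include "v9fs.h"	/* fs/9p internal: v9fs_session_info et al. */

static int v9fs_permission_sketch(struct mnt_idmap *idmap,
				  struct inode *inode, int mask)
{
	struct v9fs_session_info *v9ses = v9fs_inode2v9ses(inode);

	/* hypothetical opt-in flag: defer entirely to the server's
	 * checks on Twalk/Topen instead of re-checking locally */
	if (v9ses->flags & V9FS_SERVER_PERMS)
		return 0;

	return generic_permission(idmap, inode, mask);
}
---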
* Re: cache fixes (redux)
From: asmadeus @ 2023-12-28 21:48 UTC (permalink / raw)
To: Eric Van Hensbergen; +Cc: Christian Schoenebeck, v9fs, ericvh

Eric Van Hensbergen wrote on Thu, Dec 28, 2023 at 09:07:46AM -0600:
> > > qemu mixes in some bits from st_dev to avoid this:
> > > https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881
> >
> > Inode numbers are always just unique within the same file system,
> > and that makes sense. If a unique file ID is required system wide,
> > then the inode number must be combined with the file system's
> > device ID.
>
> Ah, okay, good to know. Christian, do you remember anything about
> this parallel unlink case? Is there any special handling to guard
> against reused inode numbers? From the discussion this seems likely
> in the tmpfs case.

Just to clarify: I've quoted tmpfs as a way of manually generating
duplicates, but it still needs multiple mount points (a single tmpfs
will still ensure uniqueness), e.g.

---
# mkdir dup
# cd dup
# mkdir 1 2
# mount -t tmpfs tmpfs 1
# mount -t tmpfs tmpfs 2
# echo one > 1/one
# echo second > 2/second
# ls -li 1 2
1:
total 4
2 -rw-r--r--. 1 root root 4 Dec 29 06:10 one

2:
total 4
2 -rw-r--r--. 1 root root 7 Dec 29 06:10 second
---

Then export that 'dup' directory without remapping to see how it
behaves; you'll find some weird behaviours with cached modes.

I don't think we can do anything if the server sends us identical
inodes; that's the server's job.

(iiuc, from a networked filesystem's point of view (e.g. nfs), we
ought to check i_ino + i_generation, as an inode number can be reused
after the previous owner of the number was deleted, but we're
guaranteed the couple changes... At any given point though i_ino is
unique, so we've been relying on just that, for two reasons: no room
to send i_generation, and the field isn't easily obtainable from
userspace (you need to dig through the bytes of name_to_handle_at,
which is file system dependent). So we only rely on the inode number
-- if the field gets bigger it should be variable length, and we
should write the whole mnt_id + name_to_handle_at content for a
stable, truly unique handle. That's what userspace NFS servers do.)

> Yeah, there's the files and then there's the inode. So if it's just
> a rename, it likely kept the inode, right? I guess I should just
> gaze at some VFS traces to understand the behavior and whether or not
> we need to do anything from an inode/qid.path perspective.

Yes, renames won't change the inode at all (ls -i will keep the same
number, and all props will be identical).

In linux VFS speak, there's:

- dentries, what you called files?, pointers to an inode from
directories; that's what gets updated on renames; there can be
multiple dentries for a single inode (hardlinks for regular files).
We keep some fids in there for directories, to walk from there as
needed (one per uid)

- inode numbers, that identify a given ""file"" (common speech file,
actual data); it must not change for the lifetime of the file and
there should be no collision

- the inode struct, e.g. "file" information (size, various times,
i_version..) that gets updated when the file is modified (rename
shouldn't touch these either, but writes will);
I guess with what you said there can be multiple inode structs alive
for a given "file"? But that sounds like a bug to me; we should always
reuse, as it's also what governs the vfs cache data: at best you'll be
fetching stats twice, at worst you'll end up with two mmaps in
parallel that don't get updated the same...

- the file struct, that'll back an fd somewhere (there can be multiple
fds per file with dup and friends); we keep the open 'fid' information
around in the file's private data.

(I'm not 100% sure how process cwd is handled -- probably just a ref
on the dentry? When opening a file we need to figure it out from the
dentry, so get the parent's dentry and fids from there and use that
parent dentry's cached fid, or build it up -- I guess if there's no
cached fid here we could do a multi-level walk, but my understanding
was that there should always be something cached at that point...
Also, for something like ls you'll stat all files in a directory, so
if there wasn't a cached fid for the parent's dentry you'd do a lot of
multi-level walks, where caching the parent's dentry sounds more
efficient?)

So to recap: there should be only one inode, then as many dentries as
there are paths to it, and as many files as there are "open handles"
to it.

In cacheless modes we should drop dentries as soon as no file/process
uses them, and drop inodes as soon as there are no dentries left.
In cached mode we should have an extra ref to dentries (droppable on
memory pressure somehow -- we're not doing that correctly right now
afaik), that'll keep the inode alive, so we should try very hard to
always reuse the same inodes/dentries to avoid needless lookups.

> I have publicly stated I hate the temporary file hack. It just seems
> awful, but I've punted this to be a server responsibility so I don't
> have to think about it ;) I get where it may cause trouble on a
> reconnect, but that could be handled in different ways as well, just
> with a lot more invasiveness on the server side (like holding on to
> dangling fids on unexpected disconnects) -- maybe something else to
> queue up for future protocol revisions.

Well, there's reconnect (connection troubles), and there's reconnect
(server restart, or migration) -- in nfsv4 the server is expected to
have some persistent state, so e.g. file leases will persist, so I
guess it would be possible to also store information for tmpfiles
there too, but at the very least the server needs to be able to
re-open the file, so it needs to be kept around...
(speaking of which, I guess qemu live migration doesn't work with 9p
mounts? or maybe just not tmpfiles... Can't say I tried...)

Anyway, we don't need to work on reconnect here; that's distinct
enough from the problems you've seen. Let's not worry about it at this
point, and we can bring it back up later if/when someone has energy
for it :)

-- 
Dominique Martinet | Asmadeus
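
As an aside, the VFS mechanism that enforces "only one inode" is the
icache lookup; roughly the iget5_locked() shape 9p needs, with
test/set callbacks comparing the full qid rather than just i_ino.
Names here are illustrative -- the real callbacks live in
fs/9p/vfs_inode.c:

---
#include <linux/fs.h>
#include <net/9p/9p.h>

/* illustrative fs-private inode wrapper stashing the server qid */
struct v9fs_inode_sketch {
	struct p9_qid qid;
	struct inode vfs_inode;
};

static int v9fs_test_inode_sketch(struct inode *inode, void *data)
{
	struct p9_qid *qid = data;
	struct v9fs_inode_sketch *v =
		container_of(inode, struct v9fs_inode_sketch, vfs_inode);

	/* full qid compare, so a colliding i_ino from another export
	 * or a recycled path with a stale version isn't wrongly reused */
	return v->qid.path == qid->path &&
	       v->qid.version == qid->version &&
	       v->qid.type == qid->type;
}

static int v9fs_set_inode_sketch(struct inode *inode, void *data)
{
	struct v9fs_inode_sketch *v =
		container_of(inode, struct v9fs_inode_sketch, vfs_inode);

	v->qid = *(struct p9_qid *)data;
	return 0;
}

static struct inode *v9fs_iget_sketch(struct super_block *sb,
				      struct p9_qid *qid)
{
	/* hash on qid.path; test() disambiguates genuine collisions */
	return iget5_locked(sb, (unsigned long)qid->path,
			    v9fs_test_inode_sketch,
			    v9fs_set_inode_sketch, qid);
}
---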
* Re: cache fixes (redux)
From: Christian Schoenebeck @ 2023-12-29 12:22 UTC (permalink / raw)
To: Eric Van Hensbergen, asmadeus; +Cc: v9fs, ericvh

On Thursday, December 28, 2023 10:48:23 PM CET asmadeus@codewreck.org wrote:
> Eric Van Hensbergen wrote on Thu, Dec 28, 2023 at 09:07:46AM -0600:
> > Ah, okay, good to know. Christian, do you remember anything about
> > this parallel unlink case? Is there any special handling to guard
> > against reused inode numbers? From the discussion this seems likely
> > in the tmpfs case.
>
> Just to clarify: I've quoted tmpfs as a way of manually generating
> duplicates, but it still needs multiple mount points (a single tmpfs
> will still ensure uniqueness), e.g.
>
> ---
> # mkdir dup
> # cd dup
> # mkdir 1 2
> # mount -t tmpfs tmpfs 1
> # mount -t tmpfs tmpfs 2
> # echo one > 1/one
> # echo second > 2/second
> # ls -li 1 2
> 1:
> total 4
> 2 -rw-r--r--. 1 root root 4 Dec 29 06:10 one
>
> 2:
> total 4
> 2 -rw-r--r--. 1 root root 7 Dec 29 06:10 second
> ---

Which is the expected, correct behaviour, as you can still distinguish
the two by their device IDs:

# stat 1/one
  File: 1/one
  Size: 4         Blocks: 8          IO Block: 4096   regular file
Device: 2dh/45d   Inode: 2           Links: 1
...
# stat 2/second
  File: 2/second
  Size: 7         Blocks: 8          IO Block: 4096   regular file
Device: 2eh/46d   Inode: 2           Links: 1
...

> Then export that 'dup' directory without remapping to see how it
> behaves; you'll find some weird behaviours with cached modes.
>
> I don't think we can do anything if the server sends us identical
> inodes; that's the server's job.

Right. BTW, even if the qid.path length is increased as part of a 9p
protocol change, it would still be tricky on the client implementation
side, as that number would be too big to simply be exposed as the
inode number by the client. IIRC virtiofsd is handling this by
automatically creating separate devices when needed. I haven't checked
how exactly, e.g. whether there is some easy way to create
"subdevices" with Linux.

> (iiuc, from a networked filesystem's point of view (e.g. nfs), we
> ought to check i_ino + i_generation, as an inode number can be reused
> after the previous owner of the number was deleted, but we're
> guaranteed the couple changes... At any given point though i_ino is
> unique, so we've been relying on just that, for two reasons: no room
> to send i_generation, and the field isn't easily obtainable from
> userspace (you need to dig through the bytes of name_to_handle_at,
> which is file system dependent). So we only rely on the inode number
> -- if the field gets bigger it should be variable length, and we
> should write the whole mnt_id + name_to_handle_at content for a
> stable, truly unique handle. That's what userspace NFS servers do.)

Why is that? The inode number is a serial number which is
consecutively incremented:

# ls -li one
2 -rw-r--r-- 1 me me 4 Dec 29 12:42 one
# rm one
# echo one > one
# ls -li one
3 -rw-r--r-- 1 me me 4 Dec 29 12:55 one

So I would not expect an inode number to be reused before the
consecutive counter wraps at 2^64. And then, this discussion is about
files still being open, so the host's file system can't recycle the
inode number unless the 9p client closes the corresponding old FID.

> In linux VFS speak, there's:
> [...]
> In cacheless modes we should drop dentries as soon as no file/process
> uses them, and drop inodes as soon as there are no dentries left.
> In cached mode we should have an extra ref to dentries (droppable on
> memory pressure somehow -- we're not doing that correctly right now
> afaik), that'll keep the inode alive, so we should try very hard to
> always reuse the same inodes/dentries to avoid needless lookups.

Yes, the cache currently just grows indefinitely. The question is what
kind of metric shall be used to decide when to start dropping old
things from the cache.

> Well, there's reconnect (connection troubles), and there's reconnect
> (server restart, or migration) [...]
> (speaking of which, I guess qemu live migration doesn't work with 9p
> mounts? or maybe just not tmpfiles... Can't say I tried...)

Right, QEMU live-migration doesn't work with 9p ATM, and probably
won't in the near future.

> Anyway, we don't need to work on reconnect here; that's distinct
> enough from the problems you've seen. Let's not worry about it at
> this point, and we can bring it back up later if/when someone has
> energy for it :)
* Re: cache fixes (redux)
From: asmadeus @ 2023-12-29 12:50 UTC (permalink / raw)
To: Christian Schoenebeck; +Cc: Eric Van Hensbergen, v9fs, ericvh

Christian Schoenebeck wrote on Fri, Dec 29, 2023 at 01:22:45PM +0100:
> Which is the expected, correct behaviour, as you can still
> distinguish the two by their device IDs:

(Yes, just meant to say one can produce collisions easily if they
want)

> BTW, even if the qid.path length is increased as part of a 9p
> protocol change, it would still be tricky on the client
> implementation side, as that number would be too big to simply be
> exposed as the inode number by the client. IIRC virtiofsd is handling
> this by automatically creating separate devices when needed. I
> haven't checked how exactly, e.g. whether there is some easy way to
> create "subdevices" with Linux.

NFS also automatically creates a new mountpoint at junctions, so it's
definitely possible (fs/nfs/namespace.c):

/*
 * nfs_d_automount - Handle crossing a mountpoint on the server
 * @path - The mountpoint
 *
 * When we encounter a mountpoint on the server, we want to set up
 * a mountpoint on the client too, to prevent inode numbers from
 * colliding, and to allow "df" to work properly.
 * On NFSv4, we also want to allow for the fact that different
 * filesystems may be migrated to different servers in a failover
 * situation, and that different filesystems may want to use
 * different security flavours.
 */

It's not strictly needed though; in practice the vfs can handle
duplicate i_ino as long as appropriate functions are used for iget
(iget5_locked in fs/9p/vfs_inode.c); although now I'm thinking about
it, that won't work for programs like tar that'll notice the inode
number is identical and consider the second file to be a hard link of
the first...

I guess we could sweep it under the rug by making the inode number a
hash of everything in qid.path, and pretend collisions never happen™?

> > (iiuc, from a networked filesystem's point of view (e.g. nfs), we
> > ought to check i_ino + i_generation, as an inode number can be
> > reused after the previous owner of the number was deleted, but
> > we're guaranteed the couple changes... [...])
>
> Why is that? The inode number is a serial number which is
> consecutively incremented:

It's serial for tmpfs, but that's an implementation detail. ext4 and
xfs will both allocate through some larger stride.

... Actually, I didn't think it'd be this easy to reproduce, but ext4
reliably recycles the same inodes after unlink:

# touch a b c
# ls -li a b c
357 -rw-r--r--. 1 root root 0 Dec 29 21:39 a
365 -rw-r--r--. 1 root root 0 Dec 29 21:39 b
366 -rw-r--r--. 1 root root 0 Dec 29 21:39 c
# rm -f a b c
# touch new
# ls -li new
357 -rw-r--r--. 1 root root 0 Dec 29 21:40 new
# rm -f new
# touch again
# ls -li again
357 -rw-r--r--. 1 root root 0 Dec 29 21:40 again

(works even if I actually write something into the files and sync
them)

> So I would not expect an inode number to be reused before the
> consecutive counter wraps at 2^64. And then, this discussion is about
> files still being open, so the host's file system can't recycle the
> inode number unless the 9p client closes the corresponding old FID.

If the file's still open, it's not removed from the fs -- we're
guaranteed the inode won't come back yet.
For a networked file system, though, I was thinking of cache more than
open files; e.g. in the previous example a 9p client might think 'new'
and 'again' are the same file.

I guess it doesn't really matter since we have no cache invalidation /
refresh logic...

> > In cacheless modes we should drop dentries as soon as no
> > file/process uses them, and drop inodes as soon as there are no
> > dentries left.
> > In cached mode we should have an extra ref to dentries (droppable
> > on memory pressure somehow -- we're not doing that correctly right
> > now afaik), that'll keep the inode alive, so we should try very
> > hard to always reuse the same inodes/dentries to avoid needless
> > lookups.
>
> Yes, the cache currently just grows indefinitely. The question is
> what kind of metric shall be used to decide when to start dropping
> old things from the cache.

My understanding is that the vfs already can keep track of this for
us. Look at e.g. prune_icache_sb in fs/inode.c
It's called when there's memory pressure, and it'll call the inode
op's evict_inode when actually freeing things, so we could close any
still-remembered fid there, and I think it'd work out in general...
Except that we keep an explicit ref, so it doesn't have anything to
free iirc. Well, it's more stuff that'd need experimenting with...

-- 
Dominique Martinet | Asmadeus
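
A minimal sketch of what that evict hook could look like -- the
truncate/clear calls are the standard VFS eviction steps; the fid
stash in i_private and its release are the hypothetical part:

---
#include <linux/fs.h>
#include <linux/mm.h>
#include <net/9p/client.h>

/* assumed stash: a still-remembered fid kept in inode->i_private */
static inline struct p9_fid *v9fs_inode_cached_fid(struct inode *inode)
{
	return inode->i_private;
}

static void v9fs_evict_inode_sketch(struct inode *inode)
{
	struct p9_fid *fid = v9fs_inode_cached_fid(inode);

	truncate_inode_pages_final(&inode->i_data);
	clear_inode(inode);

	/* clunks the fid once its refcount drops to zero (older
	 * kernels would call p9_client_clunk directly), letting the
	 * server release its backing fd */
	if (fid)
		p9_fid_put(fid);
}
---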
* Re: cache fixes (redux)
From: Eric Van Hensbergen @ 2023-12-29 16:16 UTC (permalink / raw)
To: asmadeus; +Cc: Christian Schoenebeck, v9fs, ericvh

On Fri, Dec 29, 2023 at 09:50:39PM +0900, asmadeus@codewreck.org wrote:
> It's not strictly needed though; in practice the vfs can handle
> duplicate i_ino as long as appropriate functions are used for iget
> (iget5_locked in fs/9p/vfs_inode.c); although now I'm thinking about
> it, that won't work for programs like tar that'll notice the inode
> number is identical and consider the second file to be a hard link of
> the first...

This is the problem, though: the iget functions use lookup functions
which right now compare a bunch of state, requiring a getattr to
validate. This is one of the things I was trying to eliminate. The
other option would be putting some of these additional bits (device,
i_generation, etc.) in the qid.

> I guess we could sweep it under the rug by making the inode number a
> hash of everything in qid.path, and pretend collisions never happen™?

Feels wrong; hashes always have collisions and it makes me
uncomfortable. I know some of the other on-disk file systems have
moved to 128-bit inodes; does that help, or just push out the problem
(or make it less likely, I suppose)? The combo of qid.path and
qid.version is probably a pretty good mix; if qid.version is some hash
on time then the likelihood of collision goes way down (but notably
isn't zero..), and servers don't always populate qid.version, but
that's something to fix server side.

> ... Actually, I didn't think it'd be this easy to reproduce, but ext4
> reliably recycles the same inodes after unlink:
>
> # touch a b c
> # ls -li a b c
> 357 -rw-r--r--. 1 root root 0 Dec 29 21:39 a
> 365 -rw-r--r--. 1 root root 0 Dec 29 21:39 b
> 366 -rw-r--r--. 1 root root 0 Dec 29 21:39 c
> # rm -f a b c
> # touch new
> # ls -li new
> 357 -rw-r--r--. 1 root root 0 Dec 29 21:40 new
> # rm -f new
> # touch again
> # ls -li again
> 357 -rw-r--r--. 1 root root 0 Dec 29 21:40 again
>
> (works even if I actually write something into the files and sync
> them)
> [...]
> If the file's still open, it's not removed from the fs -- we're
> guaranteed the inode won't come back yet.
> For a networked file system, though, I was thinking of cache more
> than open files; e.g. in the previous example a 9p client might think
> 'new' and 'again' are the same file.

So this is just a cache race problem, potentially. The real concern
being that the cache is out of date with the server and then we get an
inode collision. But that's probably likely for other reasons with a
loose cache -- I guess the question is how to handle it for a tight
cache. qid.path+qid.vers may be enough - but races are still possible.

> I guess it doesn't really matter since we have no cache invalidation
> / refresh logic...

Well, we have logic - it's just always valid or always invalid. The
real question is how we want it to work in tight mode: is the qid
valid, or do we always need to do a stat/getattr?

-eric
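
For reference, the qid.path+qid.version folding being debated might
look something like this -- a sketch, not existing code, and as noted
above any such hash only makes collisions less likely, not impossible:

---
#include <linux/hash.h>
#include <net/9p/9p.h>

static inline unsigned long v9fs_qid_to_ino_sketch(const struct p9_qid *qid)
{
	/* mix version into the high bits so a recycled path with a
	 * fresh version no longer maps to the same inode number */
	u64 mixed = qid->path ^ ((u64)qid->version << 32);

	return (unsigned long)hash_64(mixed, 8 * sizeof(unsigned long));
}
---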
* Re: cache fixes (redux)
2023-12-29 12:22 ` Christian Schoenebeck
2023-12-29 12:50 ` asmadeus
@ 2023-12-29 16:06 ` Eric Van Hensbergen
1 sibling, 0 replies; 18+ messages in thread
From: Eric Van Hensbergen @ 2023-12-29 16:06 UTC (permalink / raw)
To: Christian Schoenebeck; +Cc: asmadeus, v9fs, ericvh

On Fri, Dec 29, 2023 at 01:22:45PM +0100, Christian Schoenebeck wrote:
> On Thursday, December 28, 2023 10:48:23 PM CET asmadeus@codewreck.org wrote:
> > Eric Van Hensbergen wrote on Thu, Dec 28, 2023 at 09:07:46AM -0600:
>
> BTW even if qid.path length is increased as part of a 9p protocol change, it
> would still be tricky for client implementation side, as that number is too
> big as simply being exposed as inode number by client. IIRC virtiofsd is
> handling this by automatically creating separate devices when needed. I
> haven't checked how exactly, whether there is e.g. some easy way to create
> "subdevices" with Linux.

I'll have a look at this and see if there is something we can do that
is similar with different mount points.

> > (iiuc from a networked filesystem's point of view (e.g. nfs), we ought
> > to check i_ino + i_generation as inode number can be reused after the
> > previous owner of the number was deleted, but we're guaranteed the
> > couple change...

Current iget validation logic checks i_generation from getattr.  This
was one of the things I wasn't sure we needed, and it would definitely
mess with qid lookup on our side.  This was why I was thinking that if
we marked unlinked inodes in some way on the client side we'd be less
likely to hit a collision; but if the inode isn't really released on
the server side (because it's open), maybe this doesn't happen and we
only have to worry about stale inodes in the client cache that haven't
been properly released.  But is that ever a valid case?  (It may be
happening because the cache code is broken, but that seems like a
different case -- maybe if we see an inode lose its last link and it's
not currently open we just garbage collect it?  Don't we do that
already?  I can see where this might screw up if we have multiple
instances of inode structs around, which the current code seems to
gesture towards, but I think it's a bug.  I may write up some auditing
checks to look for dup inode structs and other things as part of
debug.)

> > In cacheless modes we should drop dentries as soon as no file/process
> > use it, and drop inodes as soon as there's no dentries left.
> > In cached mode we should have an extra ref to dentries (droppable on
> > memory pressure somehow -- we're not doing that correctly right now
> > afaik), that'll keep the inode alive, so we should try very hard to
> > always reuse the same inodes/dentries to avoid needless lookups.
>
> Yes, the cache currently just grows indefinitely. The question is what kind of
> metric shall be used to decide when to start dropping old things from the
> cache.

Some of this is to do with the fid/dentry handling, though.  Fids stay
in loose mode, which means dentries stay, which means inodes stay.  I
think we can be a bit more aggressive with transient fid recovery and
decouple them from dentries by attaching them to inodes, so we can
aggressively clean up the dcache if we want to without impacting
things -- this will mean fewer interstitial fids; transient fids for
cwd-like references will still exist, but we'll have one instead of
one per path element.

I'll probably keep aggressive dentry recovery as a secondary change
set since it'll break a bunch of existing assumptions, and I think
that was what was giving me trouble in my rewrite branch.

> > Well there's reconnect (connection troubles), and there's reconnect
> > (server restart, or migration) -- in nfsv4 the server is expected to ...
>
> Right, QEMU live-migration doesn't work with 9p ATM, and probably won't in
> near future.

Yeah, I think there's a whole kettle of fish here.  I'm interested in
tilting at this windmill to support simulation checkpoint/restart
cases, but I agree it's a hard problem, so I will hold off until this
other stuff is sorted.

-eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
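
One possible shape for the fid-to-inode attachment Eric describes
above -- a sketch only: the cached_fid field is hypothetical, while
p9_fid_put() is the real refcount helper that clunks the fid once the
last reference drops.

---
/* Hedged sketch: hang one cached fid off the inode (cached_fid is a
 * hypothetical field) and clunk it at eviction time instead of on
 * dentry release. */
static void v9fs_evict_inode_sketch(struct inode *inode)
{
	struct v9fs_inode *v9inode = V9FS_I(inode);
	struct p9_fid *fid = xchg(&v9inode->cached_fid, NULL);

	if (fid)
		p9_fid_put(fid);	/* clunks once the refcount hits zero */
	clear_inode(inode);
}
---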
* Re: cache fixes (redux)
2023-12-28 21:48 ` asmadeus
2023-12-29 12:22 ` Christian Schoenebeck
@ 2023-12-29 15:50 ` Eric Van Hensbergen
2024-01-03 18:09 ` cache fixes (redux) (a tale of many inode numbers...) Eric Van Hensbergen
2024-01-03 20:24 ` cache fixes (redux) asmadeus
1 sibling, 2 replies; 18+ messages in thread
From: Eric Van Hensbergen @ 2023-12-29 15:50 UTC (permalink / raw)
To: asmadeus; +Cc: Christian Schoenebeck, v9fs, ericvh

On Fri, Dec 29, 2023 at 06:48:23AM +0900, asmadeus@codewreck.org wrote:
> Eric Van Hensbergen wrote on Thu, Dec 28, 2023 at 09:07:46AM -0600:
>
> - the file struct, that'll back a fd somewhere (can be multiple fd per
> file with dup and friends); we keep the open 'fid' information around in
> the file's private data.
>
> (I'm not 100% sure how process cwd is handled -- probably just a ref on
> the dentry?  When opening a file we need to figure it out from the
> dentry, so get the parent's dentry and fids from there

This was why I originally hooked transient fids up to dentries: it
seemed like that was what was holding the open reference.  Of course,
every dentry points to an inode, so either works for these purposes,
but I wanted to clunk transient fids on dentry release versus waiting
for inode release (which can take a while, particularly when caching).
In the cache rewrite I was looking at moving all fid management to the
inode, but clunking transient fids when inodes lose all dentry
references.  I need to go back over the patches, but I think I still
had a clever way of linking transient fids to particular dentries so
that the tracking worked as expected.

> that parent's dentry cached fid, or build it up -- I guess if there's no
> cache file here we could do multi-level walk, but my understanding was
> that there should always be something cached at that point...

The concern I have in the back of my head is a detached hierarchy,
where we have a reference to the directory at the top of that
hierarchy but not all the way back to the mount point.  In that
situation we can't multiwalk all the way from root, which was the
logic behind the step-by-step d_parent walk until we find a fid.  But
I think that's the exception case that has become the only case
because of the structure of the code.  I think if we can't find a
usable fid off of the d_parent, we try multi-walk from root, and if
that fails we do the step-by-step.  To be clear, I'm not even sure how
often this would ever happen (when we don't have a valid d_parent with
a fid) -- I don't think it happens at all in practice in cache mode,
but we are hanging on to way too many fids in cache mode, so maybe it
should occur more than it does....

> something like ls you'll stat all files in a directory, so if there
> wasn't a cached fid for the parent's dentry you'd do a lot of
> multi-level walks, where caching the parent's dentry sounds more
> efficient?)

Yes, and I think this is what Ron was seeing recently -- although that
was without any cache mode.

> So to recap there should only be one inode, then as many dentries as
> there are paths to it, and as many files as there are "open handles" to
> it
>
> In cacheless modes we should drop dentries as soon as no file/process
> use it, and drop inodes as soon as there's no dentries left.
> In cached mode we should have an extra ref to dentries (droppable on
> memory pressure somehow -- we're not doing that correctly right now
> afaik), that'll keep the inode alive, so we should try very hard to
> always reuse the same inodes/dentries to avoid needless lookups.

I'm not convinced of the second part; I think we don't want to hold
the extra ref to dentries because that causes fid explosion.  Inodes
can be kept alive without a dentry (but we do need to be able to look
up an inode by i_ino / qid.path -- which we are able to do).

There's another thing which has bugged me looking through the code:
there's lots of code (lookup, create, open, link, etc.) which passes
inode *parent that we only use for session information, but really
that should be fine as a start point for finding the parent fid versus
doing a d_parent traversal.

Then for refresh logic we have inode_revalidate, which can do
different things depending on cache policy:
- cacheless: release immediately, clunk all fids (this assumes the
  mountpoint inode will have a reference as long as we are mounted,
  but I think that is correct)
- data/dir cache: same semantics as now, inode holds data cache, all good
- policy differences (which will trigger on inode_revalidate/refresh):
  - none: always call walk/getattr to revalidate the inode from the
    server (the combo of walk/getattr may be a good new compound
    proto element)
  - loose: same as current semantics, keep around forever if we can,
    always assume the client-side cache is correct.
  - tight: use qid.version from walk/re-walk to validate, and getattr
    if stale.
  - temporal: cache times out after a mount-configured timeo and we
    call getattr
  - push (future): use a subscription method to get updates on cache
    invalidations from the server.

-eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
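
The policy list above maps naturally onto a dispatch in the revalidate
path.  A sketch under invented names: the V9FS_CACHE_* constants and
the three v9fs_* helpers below are assumptions, not current fs/9p code.

---
static int v9fs_inode_revalidate_sketch(struct inode *inode, int cache_mode)
{
	switch (cache_mode) {
	case V9FS_CACHE_NONE:
		/* always refresh from the server (walk + getattr) */
		return v9fs_refresh_inode_from_server(inode);
	case V9FS_CACHE_LOOSE:
		/* trust the client-side cache unconditionally */
		return 0;
	case V9FS_CACHE_TIGHT:
		/* re-walk, compare qid.version, getattr only if stale */
		return v9fs_check_qid_version(inode);
	case V9FS_CACHE_TEMPORAL:
		/* refresh only once the mount-configured timeo has expired */
		return v9fs_cache_expired(inode) ?
			v9fs_refresh_inode_from_server(inode) : 0;
	default:
		return -EINVAL;
	}
}
---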
* Re: cache fixes (redux) (a tale of many inode numbers...)
2023-12-29 15:50 ` Eric Van Hensbergen
@ 2024-01-03 18:09 ` Eric Van Hensbergen
0 siblings, 0 replies; 18+ messages in thread
From: Eric Van Hensbergen @ 2024-01-03 18:09 UTC (permalink / raw)
To: asmadeus; +Cc: Christian Schoenebeck, v9fs, ericvh

Okay, been delving through this code.  Most (all?) was added by Aneesh
when he did the cache stuff, and I clearly wasn't paying close
attention a decade ago (mostly because I didn't care about the cache
then).  In any case, I think there is a ton of code I am going to be
able to delete (patches coming soon), but I wanted to sanity check one
thing with you folks.

Looking at the inode get code (which also doubles as inode creation
code): the inodes are hashed with qid2ino(qid), which I get to take
the blame for because it's been there for 18 years.  If the size of
ino_t is 64-bit, it just assigns qid->path+2 (+2?) to the i_ino.  If
not (presumably for 32-bit), it sets the inode number to
(qid->path+2) ^ ((qid->path+2) >> 32).  If ino_t goes to 64-bit this
means we reduce precision for no reason, so that has to be fixed --
but I'm trying to remember the relevance of +2, which dates back all
the way to the original submission.  I don't remember why we did this,
but I think I'm gonna kill that in one patch and in a second patch fix
the length check.

Moving up the stack a bit: we use the modified qid.path to hash the
inode, but then we also assign it (overriding the allocated inode
number on the client).  I'm gonna say this is wrong (definitely wrong
in the case where we are allocating inodes willy-nilly), and we should
probably continue to use qid.path to hash but keep the client-side
inode number, which is guaranteed unique on allocation.  Or do you
folks think we should align the client-side inode with the server-side
inode to help with debug?  We should still have consistent qid.paths
regardless.  Thoughts?

-eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
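
For reference, the folding described in the message above reconstructs
to roughly the following; this is a sketch from that description, not
a verbatim copy of the in-tree qid2ino().

---
static unsigned long qid2ino_sketch(const struct p9_qid *qid)
{
	u64 path = qid->path + 2;	/* the '+2' offset being questioned */

	if (sizeof(ino_t) == sizeof(u64))
		return (unsigned long)path;	/* 64-bit ino_t: use as-is */
	/* 32-bit ino_t: fold the high half into the low half */
	return (unsigned long)(path ^ (path >> 32));
}
---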
* Re: cache fixes (redux)
2023-12-29 15:50 ` Eric Van Hensbergen
2024-01-03 18:09 ` cache fixes (redux) (a tale of many inode numbers...) Eric Van Hensbergen
@ 2024-01-03 20:24 ` asmadeus
[not found] ` <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>
2024-01-03 22:48 ` Eric Van Hensbergen
1 sibling, 2 replies; 18+ messages in thread
From: asmadeus @ 2024-01-03 20:24 UTC (permalink / raw)
To: Eric Van Hensbergen; +Cc: Christian Schoenebeck, v9fs, ericvh

Eric Van Hensbergen wrote on Fri, Dec 29, 2023 at 09:50:07AM -0600:
> > So to recap there should only be one inode, then as many dentries as
> > there are paths to it, and as many files as there are "open handles" to
> > it
> >
> > In cacheless modes we should drop dentries as soon as no file/process
> > use it, and drop inodes as soon as there's no dentries left.
> > In cached mode we should have an extra ref to dentries (droppable on
> > memory pressure somehow -- we're not doing that correctly right now
> > afaik), that'll keep the inode alive, so we should try very hard to
> > always reuse the same inodes/dentries to avoid needless lookups.
>
> I'm not convinced of the second part; I think we don't want to hold
> the extra ref to dentries because that causes fid explosion.

(Note: no longer relevant if we get rid of fid lookup by dentry.)

I might have used the wrong word -- we don't want to prevent the vfs
from cleaning up if there's cache pressure, but we need to
differentiate cacheless (drop the inode as soon as it's no longer
used) and cached (keep it around while it's here).

It definitely causes the total number of open fids to rise (which can
be a problem depending on the server); I guess we could count how many
there are and have a watermark to start reclaiming if too many are
around, as an option, but I've sort of given up on that at this
point...

> There's another thing which has bugged me looking through the code:
> there's lots of code (lookup, create, open, link, etc.) which passes
> inode *parent that we only use for session information, but really
> that should be fine as a start point for finding the parent fid versus
> doing a d_parent traversal.

That makes sense; I sort of assumed we didn't have the parent's inode,
but looking back we seem to always have an inode *dir..
Would be great to get rid of this dentry fid cache.
(Note the dentry will still be kept around by the vfs anyway, but we
don't care about that anymore if it doesn't hold fids.)

> Then for refresh logic we have inode_revalidate, which can do
> different things depending on cache policy:
> - cacheless: release immediately, clunk all fids (this assumes the
>   mountpoint inode will have a reference as long as we are mounted,
>   but I think that is correct)
> - data/dir cache: same semantics as now, inode holds data cache, all good
> - policy differences (which will trigger on inode_revalidate/refresh):
>   - none: always call walk/getattr to revalidate the inode from the
>     server (the combo of walk/getattr may be a good new compound
>     proto element)
>   - loose: same as current semantics, keep around forever if we can,
>     always assume the client-side cache is correct.
>   - tight: use qid.version from walk/re-walk to validate, and getattr
>     if stale.
>   - temporal: cache times out after a mount-configured timeo and we
>     call getattr
>   - push (future): use a subscription method to get updates on cache
>     invalidations from the server.

Sounds good.

Also, I need to correct one point I made earlier in this thread: I had
assumed creat+unlink worked, but something as simple as fstat on the
unlinked file doesn't work in cacheless mode:

---
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int fd;
	struct stat statbuf;

	fd = open("test_unlink_stat", O_CREAT|O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (unlink("test_unlink_stat") < 0) {
		perror("unlink");
		return 2;
	}
	write(fd, "data\n", 5);
	if (fstat(fd, &statbuf) < 0) {
		perror("fstat");
		return 3;
	}
	printf("stat ok, size %zd\n", (ssize_t)statbuf.st_size);
	return 0;
}
---

Perhaps that'll just start working with your fixes, but either way
it's already not working cacheless, so I don't think you need to worry
about it at this point...
(I noticed because I finally took the time to boot a debian on 9p
rootfs, and apt breaks without this -- it works fine with caches.)

Eric Van Hensbergen wrote on Wed, Jan 03, 2024 at 12:09:45PM -0600:
> Okay, been delving through this code.  Most (all?) was added by Aneesh
> when he did the cache stuff, and I clearly wasn't paying close
> attention a decade ago (mostly because I didn't care about the cache
> then).  In any case, I think there is a ton of code I am going to be
> able to delete (patches coming soon), but I wanted to sanity check one
> thing with you folks.

Good work!

> Looking at the inode get code (which also doubles as inode creation
> code): the inodes are hashed with qid2ino(qid), which I get to take
> the blame for because it's been there for 18 years.
> If the size of ino_t is 64-bit, it just assigns qid->path+2 (+2?)

Hm, inode 1 is definitely usable (a tmpfs' root directory uses it),
but perhaps 0 would cause problems?  In which case +1 makes sense if
qid->path can be 0, or +2 if it could be -1 (I wouldn't put it past
some servers); but I can only guess.

> to the i_ino.  If not (presumably for 32-bit)

I had assumed the conversion to inode64 was done in practice, but
struct inode's i_ino is still an unsigned long...  Although kstat's
ino field is u64, so if userspace is willing to handle it we could
just bump the size.

I remember having trouble with 64-bit inodes on 32-bit binaries quite
a few years ago, so I guess we still need to do that dance -- although
not necessarily internally; we could emulate xfs here:
- they use xfs_ino_t internally, which is unsigned long long (u64);
  there's even a confusingly named i_ino in xfs_inode, so the struct
  inode's i_ino is actually never used as far as I can see
- there are inode32/inode64 mount options to force allocating inodes
  in the 32-bit range if requested (but they don't try to remap
  existing inodes; this is just for new files)

In our case we could add a 64-bit inode field in v9fs_inode so we
don't need to worry about it, and remap for kstat (v9fs_vfs_getattr*)
if a new inode32 mount option is requested?  I wouldn't do it by
default on 32-bit systems as that'll blow up, and hopefully by now any
userspace that'll upgrade their kernel will have long converted to
_FILE_OFFSET_BITS=64 (which also controls the 64-bit inode for stat's
st_ino).

> it sets the inode number to (qid->path+2) ^ ((qid->path+2) >> 32).

(Seems a bit simplistic, but probably good enough in practice...)

> If ino_t goes to 64-bit this means we reduce precision for no reason,
> so that has to be fixed -- but I'm trying to remember the relevance
> of +2, which dates back all the way to the original submission.  I
> don't remember why we did this, but I think I'm gonna kill that in
> one patch and in a second patch fix the length check.

Unfortunately I cannot help, as my involvement started much later..

> Moving up the stack a bit: we use the modified qid.path to hash the
> inode, but then we also assign it (overriding the allocated inode
> number on the client).  I'm gonna say this is wrong (definitely wrong
> in the case where we are allocating inodes willy-nilly), and we should
> probably continue to use qid.path to hash but keep the client-side
> inode number, which is guaranteed unique on allocation.  Or do you
> folks think we should align the client-side inode with the server-side
> inode to help with debug?  We should still have consistent qid.paths
> regardless.  Thoughts?

I ended up suggesting something similar above without reading this
properly; if I read this correctly I think we agree -- use as many
bits as we can internally and keep it coherent, then just downsize for
getattr if required.

--
Dominique Martinet | Asmadeus

^ permalink raw reply	[flat|nested] 18+ messages in thread
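
A sketch of the xfs-style split Dominique suggests above: keep the
full 64-bit number internally and squash it only for stat.  The ino64
field and the inode32 mount option are the hypothetical additions
under discussion, not existing fs/9p code.

---
static u64 v9fs_export_ino_sketch(const struct v9fs_inode *v9inode,
				  bool inode32)
{
	u64 ino = v9inode->ino64;	/* full server-derived number */

	if (!inode32)
		return ino;	/* userspace copes with 64-bit st_ino */
	/* lossy fold into 32 bits; collisions possible, as noted above */
	return (ino ^ (ino >> 32)) & 0xffffffffULL;
}
---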
[parent not found: <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>]
* Fwd: cache fixes (redux)
[not found] ` <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>
@ 2024-01-03 22:42 ` Eric Van Hensbergen
0 siblings, 0 replies; 18+ messages in thread
From: Eric Van Hensbergen @ 2024-01-03 22:42 UTC (permalink / raw)
To: v9fs; +Cc: Christian Schoenebeck

ooops, didn't hit reply all.

---------- Forwarded message ---------
From: Eric Van Hensbergen <ericvh@kernel.org>
Date: Wed, Jan 3, 2024 at 2:52 PM
Subject: Re: cache fixes (redux)
To: Dominique Martinet <asmadeus@codewreck.org>

On Wed, Jan 3, 2024 at 2:25 PM <asmadeus@codewreck.org> wrote:
>
> > Moving up the stack a bit: we use the modified qid.path to hash the
> > inode, but then we also assign it (overriding the allocated inode
> > number on the client). ...
>
> I ended up suggesting something similar above without reading this
> properly; if I read this correctly I think we agree -- use as many
> bits as we can internally and keep it coherent, then just downsize for
> getattr if required.

One gotcha here, and I'm inclined to maybe ignore it: readdir uses
qid2ino to populate the inode number for directory listings.  So I'd
need to convert that code to look up the inode just to extract the
client-side inode number.  But this is all client side, so maybe not a
big deal, and we aren't even using the readdir code right now in .L,
so it's not really on the critical path.

The other problem, of course, is that the inode hash is also an
unsigned long in the kernel, so really the same problem applies...
But I'm inclined to switch to using qid.path to hash anyway, as I
think it makes the code more understandable and we can get rid of
weirdness like +2 and get guaranteed-unique inode numbers.

The other thing that needs to change, though, is that v9fs_mount needs
to use iunique to grab a new inode number versus using qid.path from
the mount, which means the logic there needs to change to using iget
so we can insert the mount point into the hash table.  I'm assuming
that just works, but I'm gonna reorder my changes to try doing that
first and see if anything blows up.

-eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
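
The mount-time change Eric outlines might look roughly like this:
allocate the root's inode number with the real iunique() helper
instead of deriving it from qid.path, while still hashing by qid so
later lookups find it.  v9fs_root_inode_sketch() and the
v9fs_iget_by_qid() helper from the earlier sketch are assumed names.

---
static struct inode *v9fs_root_inode_sketch(struct super_block *sb,
					    struct p9_qid *root_qid)
{
	struct inode *inode = v9fs_iget_by_qid(sb, root_qid);

	if (inode && (inode->i_state & I_NEW)) {
		/* client-side unique number, decoupled from qid.path */
		inode->i_ino = iunique(sb, 2);
		unlock_new_inode(inode);
	}
	return inode;
}
---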
* Re: cache fixes (redux)
2024-01-03 20:24 ` cache fixes (redux) asmadeus
[not found] ` <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>
@ 2024-01-03 22:48 ` Eric Van Hensbergen
2024-01-04 12:37 ` Christian Schoenebeck
1 sibling, 1 reply; 18+ messages in thread
From: Eric Van Hensbergen @ 2024-01-03 22:48 UTC (permalink / raw)
To: asmadeus; +Cc: Christian Schoenebeck, v9fs

One other thing I noticed when looking through the code (related to
xattrs and ACLs): it seems on create we are pre-populating client-side
ACLs -- this requires some extra protocol and steps it feels like we
don't have to take.  So I guess two thoughts:
- shouldn't the server side be populating ACLs appropriately on create
  if the underlying file system supports them?
- if this isn't the case, it feels like we might be able to get away
  with a slightly reduced approach of walking to the new fid and
  sending the appropriate setacl proto messages, and then rely on the
  inode refresh to fill the cache (if the cache is enabled).

A bunch of this has come up unwinding complexity in where we are
"creating inodes" versus using iget with optional create.  I think
everything needs to use the latter so things show up in the hash
appropriately.

-eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: cache fixes (redux)
2024-01-03 22:48 ` Eric Van Hensbergen
@ 2024-01-04 12:37 ` Christian Schoenebeck
2024-01-04 15:59 ` Eric Van Hensbergen
0 siblings, 1 reply; 18+ messages in thread
From: Christian Schoenebeck @ 2024-01-04 12:37 UTC (permalink / raw)
To: asmadeus, Eric Van Hensbergen; +Cc: v9fs, Greg Kurz

On Wednesday, January 3, 2024 11:48:13 PM CET Eric Van Hensbergen wrote:
> One other thing I noticed when looking through the code (related to
> xattrs and ACLs): it seems on create we are pre-populating client-side
> ACLs -- this requires some extra protocol and steps it feels like we
> don't have to take.  So I guess two thoughts:
> - shouldn't the server side be populating ACLs appropriately on create
>   if the underlying file system supports them?

To resume a discussion we recently had on this topic, i.e. relying on
the server handling all permissions: ACLs are actually the line where
it becomes difficult for the server to handle things solely on its
own.  Because AFAICS there is nothing defined in the protocol to
handle ACLs in the first place, the client instead sends xattr queries
to the 9p server, mapping ACLs as xattrs on the server side:

$ getfattr foo
# file: foo
user.virtfs.gid
user.virtfs.mode
user.virtfs.system.posix_acl_access   # <-- client encoded & managed
user.virtfs.system.posix_acl_default  # <-- client encoded & managed
user.virtfs.uid
...

Now even if the protocol got extended to handle ACLs, it would still
be difficult for the server to handle this task reliably; I am
thinking of setups where e.g. the guest runs in an Active Directory
domain, LDAP, PAM modules, and what not (all on the guest side, that
is).  There is a huge bunch of information the server would simply not
have, and a user would need to duplicate it as a setup on the host
side (not a good idea for security reasons).

So in short: simple Unix permissions, sure, but anything beyond that,
I don't think so.

> - if this isn't the case, it feels like we might be able to get away
>   with a slightly reduced approach of walking to the new fid and
>   sending the appropriate setacl proto messages, and then rely on the
>   inode refresh to fill the cache (if the cache is enabled).
>
> A bunch of this has come up unwinding complexity in where we are
> "creating inodes" versus using iget with optional create.  I think
> everything needs to use the latter so things show up in the hash
> appropriately.
>
> -eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: cache fixes (redux)
2024-01-04 12:37 ` Christian Schoenebeck
@ 2024-01-04 15:59 ` Eric Van Hensbergen
0 siblings, 0 replies; 18+ messages in thread
From: Eric Van Hensbergen @ 2024-01-04 15:59 UTC (permalink / raw)
To: Christian Schoenebeck; +Cc: asmadeus, v9fs, Greg Kurz

Makes sense I guess - I'll leave ACLs alone for now.

-eric

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-01-04 15:59 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-26 16:13 cache fixes (redux) Eric Van Hensbergen
2023-12-27 8:10 ` Dominique Martinet
2023-12-27 18:07 ` Eric Van Hensbergen
[not found] ` <CAFkjPT=8CZHATaraBrqAZGDKjQLOp=U1gdgteJ5jpXRGJyBojQ@mail.gmail.com>
2023-12-28 0:32 ` Dominique Martinet
2023-12-28 10:44 ` Christian Schoenebeck
2023-12-28 15:07 ` Eric Van Hensbergen
2023-12-28 21:48 ` asmadeus
2023-12-29 12:22 ` Christian Schoenebeck
2023-12-29 12:50 ` asmadeus
2023-12-29 16:16 ` Eric Van Hensbergen
2023-12-29 16:06 ` Eric Van Hensbergen
2023-12-29 15:50 ` Eric Van Hensbergen
2024-01-03 18:09 ` cache fixes (redux) (a tale of many inode numbers...) Eric Van Hensbergen
2024-01-03 20:24 ` cache fixes (redux) asmadeus
[not found] ` <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>
2024-01-03 22:42 ` Fwd: " Eric Van Hensbergen
2024-01-03 22:48 ` Eric Van Hensbergen
2024-01-04 12:37 ` Christian Schoenebeck
2024-01-04 15:59 ` Eric Van Hensbergen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox