Re: cache fixes (redux) - Christian Schoenebeck

Linux 9p file system development
 help / color / mirror / Atom feed

From: Christian Schoenebeck <linux_oss@crudebyte.com>
To: Eric Van Hensbergen <ericvh@gmail.com>,
	Dominique Martinet <asmadeus@codewreck.org>
Cc: v9fs@lists.linux.dev, Greg Kurz <groug@kaod.org>
Subject: Re: cache fixes (redux)
Date: Thu, 28 Dec 2023 11:44:32 +0100	[thread overview]
Message-ID: <3936280.6WX1Z9CqS8@silver> (raw)
In-Reply-To: <ZYzCERk_dg1-CEjb@codewreck.org>

On Thursday, December 28, 2023 1:32:17 AM CET Dominique Martinet wrote:
> Eric Van Hensbergen wrote on Wed, Dec 27, 2023 at 11:39:19AM -0600:
> > On Wed, Dec 27, 2023 at 2:11 AM Dominique Martinet <asmadeus@codewreck.org>
> > wrote:
> > > I'm not sure multi-walk is possible in practice: even opening a file
> > > directly through e.g. cat /mnt/long/path/to/file will have the VFS check
> > > permissions along the path, so the VFS will need to stat all elements
> > > along the way.
> > 
> > Well, the server will do perm checks (ideally on walks), so do we really
> > have to?  Feels like overkill to check in both places, in principal we
> > expect/assume server perm checks.
> 
> The server should definitely check as well, yes.
> I'm not sure it's possible to tell the linux VFS that the server already
> checked though; ultimately the plan back when I was working on 9p for my
> previous job was to plug in MPI-IO into a user-space client library
> (e.g. https://github.com/martinetd/space9 -- now unmaintained) so we
> wouldn't have to deal with the vfs...
> 
> But once again, if you can manage to tell it that, I'm all for it. Some
> servers might not check properly but that can/should be fixed when it
> comes up.

I would rather see this as a brave kernel option for people who really know
what they are doing, not as a potential default behaviour.

With QEMU there are two distinct behaviours: 'passthrough' permissions vs.
'mapped(-*)' permissions:

  https://wiki.qemu.org/Documentation/9psetup#Starting_the_Guest_directly

Where 'passthrough' means guest sees the exact same permissions for files as
host (9p server) does. In this mode you could probably leave it to server
doing the permission checks. However 'passthrough' mode is discouraged for
security reasons, and one of the 'mapped' permissions modes is recommended
instead, because we had several security issues in the past which showed that
it's not a good idea using 'passthrough' (whereas 'mapped' was unaffected).

With 'mapped-(*)' however guest should do permissions checks. Because in this
mode file permissions on host are different and separately handled between
host and guest. So host sees something else than guest does.

> > it did strike me that inodes are only unique per filesystem and I doubt the
> > servers guarantee uniqueness across underlying mounts - but in principal we
> > shouldnt have to rely on looking up inodes by qid.path anyways….however
> > nothing I’m planning on should make this worse, but something to keep an
> > eye out for.
> 
> We've had collisions a few times -- most "real" file systems I'm aware
> of pseudo-randomize inode generation but in particular tmpfs inode
> numbering is linear per mount so it's very easy to trigger collisions.
> 
> qemu mixes in some bits from st_dev to avoid this:
> https://gitlab.com/qemu-project/qemu/-/blob/master/hw/9pfs/9p.c?ref_type=heads#L881

Inode numbers are always just unique within the same file system. And that
makes sense. If a unique file ID is required system wide, then inode number
must be combined with the file system's device ID.

And yes, that's what QEMU does (combining inode # with device ID) when option
`multidevs=remap` was supplied when generating qid.path to avoid file ID
collisions on guest. Unfortunately qid.path field is currently too small to
just use the device ID directly, so a little trick is used: the so called
'Exponential Golomb' algorithm is used to generate a much shorter device ID
that can be put into qid.path instead.

Note though, this behaviour is actually off by default ATM, controllable by:

 multidevs=remap|forbid|warn

 https://wiki.qemu.org/Documentation/9psetup#Starting_the_Guest_directly

By default (multidevs=warn) QEMU would currently only print/log a warning:

  warn_report_once(
    "9p: Multiple devices detected in same VirtFS export, "
    "which might lead to file ID collisions and severe "
    "misbehaviours on guest! You should either use a "
    "separate export for each device shared from host or "
    "use virtfs option 'multidevs=remap'!"
  );

> > > >      - QUESTION: it seems like P9_QTLINK and P9_QTSYMLINK are no
> > > > longer used, can these be recovered?  I was thinking of reusing one of
> > > > these to mark unlinked files which we could do a more comprehensive
> > > > compare for if its actually necessary.
> > >
> > > Since we don't support reconnects, unlinked files have been working just
> > > fine with servers that keep the underlying files open (e.g. qemu)
> > >
> > > What's your plan exactly?
> > 
> > 
> > See above, mark fid/qids that are unlinked with a different type I think.
> > I’m gonna look at how other filesystems handle this case before doing
> > anything.
> 
> local filesystems don't do anything - the inode is just kept around, all
> the way down to disk afaik (disk-mapped tmpfiles flush data to disk so
> there must be some way to address them...)
> (The last iput with nlink==0 calls evict_inode which does required
> cleanup)
> 
> 
> NFS doesn't delete the backing file but renames it - silly renames -
> you've probably seen some .nfs1234 files around if you used a mount
> point, the client won't let you delete them while it's open:
> rm: cannot remove '.nfs000000000000f42300000001': Device or resource busy
> 
> And it's unlinked when file is closed (DCACHE_NFSFS_RENAMED check in
> nfs_dentry_iput). That's necessary because the connexion to the server
> can drop at any time, and reconnecting needs to be able to open the
> file again.
> Servers will also scrub these files after they're been unused for a
> while.
> 
> 
> For 9p since we don't handle reconnect I think we're fine just not doing
> anything -- servers that don't close backing files will de facto keep
> the file existing through their fd. There's a process with the file open
> so we have an open fid somewhere and the inode won't get killed either.
> When the last fd is closed the inode is evicted and we'll have all fids
> closed so server will release its fd as well and backing filesystem will
> cleanup appropriately.
> 
> If we want to handle reconnect at some point (CEA/Bull did that work but
> never tried to publish it...), then we ought to do silly renames as
> well, but servers really need background cleanup as well or it'll end
> up full of .nfsxxx files...
> 
> 
> > > We don't have any way of driving protocol changes forward (and there's
> > > no compatibility negotiation on client connect...), but I recall
> > > Christian also was looking forward to some other changes so we might
> > > want to try to move this forward again...
> > >
> > 
> > Well, those bits unused (I think) so shouldnt be protocol change.
> 
> Right, I didn't realize it was just local use -- we can do whatever we
> want in memory yes.
> I'd avoid reusing the bits and define some new ones at the end of the
> permissible range to make the "can come from server" and "client mess"
> separation but leaving this up to you -- ultimately as long as there's
> no conflict with what's on wire we should be fine.
> 
> > However we can bump to
> > new protocol when necessary (9p2023?  its been 23 years, why not?). big on
> > my list would be qid.path -> 128 bit and qid.version -> 128 bit and maybe
> > include filesystem uid in qid to differentiate underlying mounts and maybe
> > i_generation.  This would help with cache maintenance and uniqueness
> > concerns.  I’m also thinking of how we might handle security differently
> > (use kernel TLS, public key auth integration, etc. although those might not
> > require protocol changes).  i’m also in favor of dirread senantics for
> > cache modes so we retrieve attributes with the file list.  What other
> > things should we be considering?  maybe time to start putting together an
> > RFC.
> 
> Right; there's plenty to improve if we start...
> My free time is basically rock bottom as it has been for a few years so
> I probably won't help much, but I wouldn't push back.

I use to gather ideas for 9p protocol changes here:
https://wiki.qemu.org/Documentation/9p#Protocol_Plans

One thing that comes to my mind that's not yet on that list: server side push
notifications, e.g. to inform guest that something changed on a file/dir guest
currently keeps open. That might improve consistency when both on host and
guest changes are made and caching being used by guest.

/Christian

next prev parent reply	other threads:[~2023-12-28 11:29 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-26 16:13 cache fixes (redux) Eric Van Hensbergen
2023-12-27  8:10 ` Dominique Martinet
2023-12-27 18:07   ` Eric Van Hensbergen
     [not found]   ` <CAFkjPT=8CZHATaraBrqAZGDKjQLOp=U1gdgteJ5jpXRGJyBojQ@mail.gmail.com>
2023-12-28  0:32     ` Dominique Martinet
2023-12-28 10:44       ` Christian Schoenebeck [this message]
2023-12-28 15:07         ` Eric Van Hensbergen
2023-12-28 21:48           ` asmadeus
2023-12-29 12:22             ` Christian Schoenebeck
2023-12-29 12:50               ` asmadeus
2023-12-29 16:16                 ` Eric Van Hensbergen
2023-12-29 16:06               ` Eric Van Hensbergen
2023-12-29 15:50             ` Eric Van Hensbergen
2024-01-03 18:09               ` cache fixes (redux) (a tale of many inode numbers...) Eric Van Hensbergen
2024-01-03 20:24               ` cache fixes (redux) asmadeus
     [not found]                 ` <CAFkjPT=61yygntbhSJMMWeK4JBrm85EZpz387aPHD_xmHVBbog@mail.gmail.com>
2024-01-03 22:42                   ` Fwd: " Eric Van Hensbergen
2024-01-03 22:48                 ` Eric Van Hensbergen
2024-01-04 12:37                   ` Christian Schoenebeck
2024-01-04 15:59                     ` Eric Van Hensbergen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3936280.6WX1Z9CqS8@silver \
    --to=linux_oss@crudebyte.com \
    --cc=asmadeus@codewreck.org \
    --cc=ericvh@gmail.com \
    --cc=groug@kaod.org \
    --cc=v9fs@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox