All of lore.kernel.org
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: virtio-fs@redhat.com
Subject: Re: [Virtio-fs] Ways to uniquely and persistently identify nodes
Date: Wed, 15 Jan 2020 13:58:06 +0100	[thread overview]
Message-ID: <2f099bc9-d2ef-0f1c-8fe0-dd2e64f061ca@redhat.com> (raw)
In-Reply-To: <20200115101222.GB3811@work-vm>


[-- Attachment #1.1: Type: text/plain, Size: 7645 bytes --]

On 15.01.20 11:12, Dr. David Alan Gilbert wrote:
> * Max Reitz (mreitz@redhat.com) wrote:
>> Hi,
>>
>> As discussed in today’s meeting, there is a problem with uniquely and
>> persistently identifying nodes in the guest.
>>
>> Actually, there are multiple problems:
>>
>> (1) stat’s st_ino in the guest is not necessarily unique.  Currently, it
>> just the st_ino from the host file, so if you have mounted multiple
>> different filesystems in the exported directory tree, you may get
>> collisions.
>>
>> (2) The FUSE 64-bit fuse_ino_t (which identifies an open file,
>> basically) is not persistent.  It is just an index into a vector that
>> contains all open inodes, and whenever virtiofsd is restarted, the
>> vector is renewed.  That means that whenever this happens, all
>> fuse_ino_t values the guest holds will become invalid.  (And when
>> virtiofsd starts handing out new fuse_ino_t values, those will probably
>> not point to the same inodes as before.)
>>
>> (3) name_to_handle_at()/open_by_handle_at() are implemented by FUSE just
>> by handing out the fuse_ino_t value as the handle.  This is not only a
>> problem as long as fuse_ino_t is not persistent (see (2)), but also in
>> general, because the fuse_ino_t value is only valid (per FUSE protocol)
>> as long as the inode is referenced by the guest.
>>
>>
>> The first question that I think needs to be asked is whether we care
>> about each of this points at all.
>>
>> (1) Maybe it just doesn’t matter whether the st_ino values are unique.
>>
>> (2) Maybe we don’t care about virtiofsd being restarted while the guest
>> is running or only paused.  (“Restarting” includes migration of the
>> daemon to a different host.)
> 
> I prefer to split that answer because I care about migration but not
> about restarting the daemon in place.
> 
>> (3) I suppose we do care about this.
>>
>>
>> Assuming we do care about the points, here are some ways I have
>> considered of addressing them:
>>
>> (1)
>>
>> (a)
>>
>> If we could make the 64-bit fuse_ino_t unique and persistent (see (2)),
>> we could use that for st_ino (also a 64-bit field).
>>
>> (This is the case if we keep the current schema for fuse_ino_t, be it
>> because we don’t care about (2) or because we want (2a).)
>>
>> (b)
>>
>> Otherwise, we probably want to continue passing through st_ino and then
>> ensure that stat’s st_dev is unique for each submount in the exported
>> tree.  We can achieve that by extending the FUSE protocol for virtiofsd
>> to announce submounts and then the FUSE kernel driver to automount them.
> 
> This feels nice to me, since the st_dev map should be small.
> 
>>  (This means that these submounts in the guest are then technically
>> unrelated filesystems.  It also means that the FUSE driver would need to
>> automount them with the “virtiofs” fs type, which is kind of weird, and
>> restricts this solution to virtiofs.)
> 
> Well it has to automount them witht he same type as it's mounted with;
> so it's not virtiofs specific.

Right.  But we need some way to tell the filesystem the context for the
submount.  I’m not sure whether that’s possible in an fs-agnostic way,
because the only information we can reasonably pass goes to
vfs_submount()’s @name parameter.  I think?  Or maybe we can make @data
point to a FUSE-common structure that every FUSE fs with submounts would
then need to be able to parse?  But I’m not sure how that would even
translate to userspace.

>> (2)
>>
>> (a)
>>
>> We can keep the current way if we just store the in-memory mapping while
>> virtiofsd is suspended (and migrate it it if we want to migrate the
>> virtiofsd instance).  The table may grow to be very large, though, and
>> it contains for example file descriptors that we would need to migrate,
>> too (perhaps as file handles?).
> 
> The 'when suspended' worries me - if it's important data to persist then
> we probably need to be more careful with it; i.e. keep it sync'd to
> disk.  If it's not important long term then do we need to keep it that
> long? 
> Be wary of migating a large, rapidly changing table.
> 
>> (b)
>>
>> We could extend the fuse_ino_t type to an arbitrary size basically, to
>> be negotiated between FUSE server and client.  This would require
>> extensive modification of the FUSE protocol and kernel driver (and would
>> ask for respective modification of libfuse, too), though.  Such a larger
>> value could then capture both a submount ID and a unique identifier for
>> inodes on the respective host filesystems, such as st_ino.  This would
>> ensure that every virtiofsd instance would generate the same fuse_ino_t
>> values for the same nodes on the same exported tree.
>>
>> However, note that this doesn’t auto-populate the fuse_ino_t mappings:
>> When after restarting virtiofsd the server wants to access an existing
>> inode, it can’t, because there is no good way to translate even larger
>> fuse_ino_t values to a file descriptor.  (We could do that if the
>> fuse_ino_t value encapsulated a handle.  (As in open_by_handle_at().)
>> The problem is that we can’t trust the guest to keep a handle, so we
>> must ensure that the handle returned points to a file the guest is
>> allowed to access.  Doing that cryptographically (e.g. with a MAC) is
>> probably out of the question, because that would make fuse_ino_t really
>> big.  Another idea would be to set a flag on the host FS for files that
>> the guest has a handle to.  But this flag would need to be
>> guest-specific...  So we’d probably again end up with a large database
>> just as in (2a).  (It doesn’t need to be a flag on the FS, it could also
>> be a database, I suppose.))
> 
> I'm no crypto person, but I don't know how to show that's safe.
> Some inode numbers are well-known (e.g. on xfs / always seems to be 128
> for me).  I'm just worrying that makes it easier for the guest to figure
> out the crypto.

Well, with MACs the adversary sees the data and the MAC together anyway,
so they’re designed in such a way that you can’t infer the key from that
information.

And of course they’re designed such that without a key, you can’t
generate a valid MAC for given data.

IOW, that’s the precise reason we’d use a MAC: So the guest can’t just
return a well-known handle to get access to files it shouldn’t be able
to access.

>> We could also re-enumerate the exported tree after reopening (perhaps
>> lazily for each exported filesystem) and thus recreate the mapping.  But
>> this would take as much time as a “find” over the whole exported tree.
>>
>> (c)
>>
>> We could complement the fuse_ino_t value by a context ID, that in our
>> case would probably be derived from the submount ID (e.g. the relative
>> mount point).  This would only require minor modification of the FUSE
>> protocol: Detecting mount points in lookups; a new command to retrieve a
>> mount point’s context ID; and some way to set this context ID.
>>
>> We could set the context ID either explicitly with a new command; or as
>> part of every FUSE request (a field in the request header); or just as
>> part of virtio-fs (be it with one virtqueue per context (which may or
>> may not be feasible), or just by prefixing every FUSE request on the
>> line with a context ID).
> 
> One-virtqueue per context doesn't seem feasible to me; we don't know
> how many submounts there will be and there could be lots of them.

OK, good, so I don’t have to worry about it. :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2020-01-15 12:58 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-13 17:46 [Virtio-fs] Ways to uniquely and persistently identify nodes Max Reitz
2020-01-14 16:13 ` Miklos Szeredi
2020-01-14 16:49   ` Max Reitz
2020-01-14 19:24     ` Miklos Szeredi
2020-01-15  9:37       ` Max Reitz
2020-01-15  9:39         ` Max Reitz
2020-01-15 10:10         ` Miklos Szeredi
2020-01-15 11:51           ` Miklos Szeredi
2020-01-15 12:51             ` Max Reitz
2020-01-15 14:58               ` Miklos Szeredi
2020-01-15 15:05                 ` Miklos Szeredi
2020-01-15 15:19                   ` Max Reitz
2020-01-15 16:01                     ` Miklos Szeredi
2020-01-15 16:12                       ` Max Reitz
2020-01-15 15:12                 ` Max Reitz
2020-01-15 15:46                   ` Miklos Szeredi
2020-01-15 15:52                     ` Max Reitz
2020-01-15 12:01       ` Stefan Hajnoczi
2020-01-15 13:00         ` Max Reitz
2020-01-16 20:32           ` Vivek Goyal
2020-01-17 18:13             ` Max Reitz
2020-01-15 10:12 ` Dr. David Alan Gilbert
2020-01-15 12:58   ` Max Reitz [this message]
2020-01-15 12:50 ` Stefan Hajnoczi
2020-01-15 14:21   ` Max Reitz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2f099bc9-d2ef-0f1c-8fe0-dd2e64f061ca@redhat.com \
    --to=mreitz@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=virtio-fs@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.