From: Eduard Shishkin <eduard.shishkin@huawei.com>
To: Antonios Motakis <antonios.motakis@huawei.com>,
Greg Kurz <groug@kaod.org>, "Emilio G. Cota" <cota@braap.org>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
Veaceslav Falico <veaceslav.falico@huawei.com>,
Jani Kokkonen <Jani.Kokkonen@huawei.com>,
"vfalico@gmail.com" <vfalico@gmail.com>,
"Wangguoli (Andy)" <andy.wangguoli@huawei.com>,
Jiangyiwen <jiangyiwen@huawei.com>,
"zhangwei (CR)" <zhangwei555@huawei.com>
Subject: Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs
Date: Wed, 24 Jan 2018 19:05:07 +0100 [thread overview]
Message-ID: <07c8b4ee-15f0-409e-274b-01bba80c76d2@huawei.com> (raw)
In-Reply-To: <0286586c-7061-6f3f-20ae-b5241d12685b@huawei.com>
On 1/24/2018 5:40 PM, Antonios Motakis wrote:
>
>
> On 01/24/2018 02:30 PM, Greg Kurz wrote:
>> Thanks Emilio for providing these valuable suggestions ! :)
>>
>> On Sat, 20 Jan 2018 17:03:49 -0500
>> "Emilio G. Cota" <cota@braap.org> wrote:
>>
>>> On Fri, Jan 19, 2018 at 19:05:06 -0500, Emilio G. Cota wrote:
>>>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>>>> Antonios Motakis <antonios.motakis@huawei.com> wrote:
>>>>>> Since inodes are not completely random, and we usually have a
>>>>>> handful of device IDs,
>>>>>> we get a much smaller number of entries to track in the hash table.
>>>>>>
>>>>>> So what this would give:
>>>>>> (1) Would be faster and take less memory than mapping the full
>>>>>> inode_nr,devi_id
>>>>>> tuple to unique QID paths
>>>>>> (2) Guaranteed not to run out of bits when inode numbers stay
>>>>>> below the lowest
>>>>>> 54 bits and we have less than 1024 devices.
>>>>>> (3) When we get beyond this this limit, there is a chance we
>>>>>> run out of bits to
>>>>>> allocate new QID paths, but we can detect this and refuse to serve
>>>>>> the offending
>>>>>> files instead of allowing a collision.
>>>>>>
>>>>>> We could tweak the prefix size to match the scenarios that we
>>>>>> consider more likely,
>>>>>> but I think close to 10-16 bits sounds reasonable enough. What do
>>>>>> you think?
>>>> Assuming assumption (2) is very likely to be true, I'd suggest
>>>> dropping the intermediate hash table altogether, and simply refuse
>>>> to work with any files that do not meet (2).
>>>>
>>>> That said, the naive solution of having a large hash table with all
>>>> entries
>>>> in it might be worth a shot.
>>> hmm but that would still take a lot of memory.
>>>
>>> Given assumption (2), a good compromise would be the following,
>>> taking into account that the number of total gids is unlikely to
>>> reach even close to 2**64:
>>> - bit 63: 0/1 determines "fast" or "slow" encoding
>>> - 62-0:
>>> - fast (trivial) encoding: when assumption (2) is met
>>> - 62-53: device id (it fits because of (2))
>>> - 52-0: inode (it fits because of (2))
>> And as pointed by Eduard, we may have to take the mount id into account
>> as well if we want to support the case where we have bind mounts in the
>> exported directory... My understanding is that mount ids are incremental
>> and reused when the associated fs gets unmounted: if we assume that the
>> host doesn't have more than 1024 mounts, we would need 10 bits to encode
>> it.
>>
>> The fast encoding could be something like:
>>
>> 62-53: mount id
>> 52-43: device id
>> 42-0: inode
>
> I don't agree that we should take the mount id into account though.
> The TL; DR: I think the issue about bind mounts is distinct from the QID
> path issue, and just happens to be worked around when we (falsely)
> advertise to the guest that 2 files are not the same (even though they
> are). Making unique 2 files that shouldn't be, will cause other issues.
>
> The kernel's 9p client documentation states that with fscache enabled,
> there is no support for coherency when multiple users (i.e. guest and
> host) are reading and writing to the share. If this limitation is not
> taken into account, there are multiple issues with stale caches in the
> guest.
>
> Disambiguating files using mount id might work around fscache
> limitations in this case, but will introduce a host of other bugs. For
> example:
> (1) The user starts two containers sharing a directory (via host bind
> mounts) with data
> (2) Container 1 writes something to a file in the data dir
> (3) Container 2 reads from the file
> (4) The guest kernel doesn't know the the file is one and the same, so
> it is twice in the cache. Container 2 might get stale data
It is only problem of the guest that he deceives himself.
>
> The user, wrote the code running in containers 1 and 2, assuming they
> can share a file when running on the same system. For example, one
> container generating the configuration file for another. It doesn't
> matter if the user wrote the applications correctly, syncing data when
> needed. It only breaks because we lied to the guest 9p client, telling
> it that they are distinct files.
Nope, we didn't lie. We passed objective information (st_ino, st_dev,
st_mountid, etc).
Thanks,
Eduard.
9p is supposed to support this.
>
> This is why I think including the mount id in the QID path would be
> another bug, this time in the opposite direction.
>
> In contrast the QID path issues:
> (1) do not require touching files on the host, after the guest has
> already mounted the share, to trigger it.
> (2) can be explained by the guest assuming that two or more distinct
> files are actually the same.
>
> The bind mount issue:
> (1) bind mounts have to be changed on the host after the guest has
> mounted the share. Already a no-no for fscache, and can be explained by
> stale caches in the guest.
> (2) The guest is correctly identifying that they refer to the same file.
> There is no collision here.
>
>>
>>> - slow path: assumption (2) isn't met. Then, assign incremental
>>> IDs in the [0,2**63-1] range and track them in a hash table.
>>>
>>> Choosing 10 or whatever else bits for the device id is of course TBD,
>>> as Antonios you pointed out.
>>>
>> This is a best effort to have a fallback in QEMU. The right way to
>> address the issue would really be to extend the protocol to have
>> bigger qids (eg, 64 for inode, 32 for device and 32 for mount).
>
> Does this mean we don't need the slow path for the fallback case? I have
> tested a glib hash table implementation of the "fast path", I will look
> into porting it to the QEMU hash table and will send it to this list.
>
> Keep in mind, we still need a hash table for the device id, since it is
> 32 bits, but we will try to reserve only 10-16 bits for it.
>
> Cheers,
> Tony
prev parent reply other threads:[~2018-01-24 18:06 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-01-12 11:32 [Qemu-devel] [RFC] qid path collision issues in 9pfs Antonios Motakis
2018-01-12 14:27 ` Daniel P. Berrange
2018-01-12 15:05 ` Veaceslav Falico
2018-01-12 17:00 ` Greg Kurz
2018-01-12 16:25 ` Greg Kurz
2018-01-12 16:14 ` Greg Kurz
2018-01-15 3:49 ` Antonios Motakis
2018-01-19 10:27 ` Greg Kurz
2018-01-19 15:52 ` Eduard Shishkin
2018-01-19 16:36 ` Greg Kurz
2018-01-19 16:37 ` Veaceslav Falico
2018-01-19 18:05 ` Greg Kurz
2018-01-19 18:51 ` Eduard Shishkin
2018-01-25 14:46 ` Veaceslav Falico
2018-01-25 16:08 ` Veaceslav Falico
2018-01-29 17:05 ` Greg Kurz
2018-01-22 12:40 ` Eduard Shishkin
2018-01-24 15:09 ` Greg Kurz
2018-01-20 0:05 ` Emilio G. Cota
2018-01-20 22:03 ` Emilio G. Cota
2018-01-24 13:30 ` Greg Kurz
2018-01-24 16:40 ` Antonios Motakis
2018-01-24 18:05 ` Eduard Shishkin [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=07c8b4ee-15f0-409e-274b-01bba80c76d2@huawei.com \
--to=eduard.shishkin@huawei.com \
--cc=Jani.Kokkonen@huawei.com \
--cc=andy.wangguoli@huawei.com \
--cc=antonios.motakis@huawei.com \
--cc=cota@braap.org \
--cc=groug@kaod.org \
--cc=jiangyiwen@huawei.com \
--cc=qemu-devel@nongnu.org \
--cc=veaceslav.falico@huawei.com \
--cc=vfalico@gmail.com \
--cc=zhangwei555@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).