From: Veaceslav Falico
Date: Thu, 25 Jan 2018 17:08:40 +0100
Subject: Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs
To: Greg Kurz
Cc: Eduard Shishkin, Antonios Motakis, "qemu-devel@nongnu.org", Jani Kokkonen, "vfalico@gmail.com", "Wangguoli (Andy)", Jiangyiwen, "zhangwei (CR)", "Emilio G. Cota"
References: <081955e1-84ec-4877-72d4-f4e8b46be350@huawei.com> <20180112171416.6048ae9e@bahia.lan> <20180119112733.4a9dd43f@bahia.lan> <05660a29-b22d-7bd6-91e2-37c54b5f2194@huawei.com> <899644b1-e401-8ed5-8a49-1f0e2dbf9b2b@huawei.com> <20180119190555.35127373@bahia.lan>

On 1/25/2018 3:46 PM, Veaceslav Falico wrote:
> Hi,
>
> sorry for the late reply, we're actually working on it internally...
>
> On 1/19/2018 7:05 PM, Greg Kurz wrote:
>> On Fri, 19 Jan 2018 17:37:58 +0100
>> Veaceslav Falico wrote:
>>
>>> On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
>>>>
>>>>
>>>> On 1/19/2018 11:27 AM, Greg Kurz wrote:
>>>>> On Mon, 15 Jan 2018 11:49:31 +0800
>>>>> Antonios Motakis wrote:
>>>>>
>>>>>> On 13-Jan-18 00:14, Greg Kurz wrote:
>>>>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>>>>> Antonios Motakis wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>
>>>>>>> Hi Antonios,
>>>>>>>
>>>>>>> I see you have attached a patch to this email... this really isn't the preferred
>>>>>>> way to do things, since it prevents commenting on the patch (at least with my mail
>>>>>>> client). The appropriate way would have been to send the patch with a cover
>>>>>>> letter, using git-send-email for example.
>>>>>>
>>>>>> I apologize for attaching the patch, I should have known better!
>>>>>>
>>>>>
>>>>> np :)
>>>>>
>>>>>>>
>>>>>>>> We have found an issue in the 9p implementation of QEMU, with how qid paths are generated, which can cause qid path collisions and several issues caused by them. In our use case (running containers under VMs) these have proven to be critical.
>>>>>>>>
>>>>>>>
>>>>>>> Ouch...
>>>>>>>
>>>>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the inode number of the file as input. According to the 9p spec, the path should be able to uniquely identify a file: distinct files should not share a path value.
>>>>>>>>
>>>>>>>> The current implementation that defines qid.path = inode nr works fine as long as there are no files from multiple partitions visible under the 9p share. Otherwise, distinct files from different devices are allowed to have the same inode number. So with multiple partitions, we have a very high probability of qid path collisions.
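For reference, the derivation Antonios describes boils down to something like the following (a simplified sketch of the idea only, not a verbatim copy of stat_to_qid() from hw/9pfs/9p.c, which also fills in qid.type and qid.version):

#include <stdint.h>
#include <sys/stat.h>

/* Sketch: today's qid path is taken from the inode number alone.
 * st_dev is never consulted, so two files on different devices that
 * happen to share an inode number get identical qid paths. */
static void qid_path_from_stat(const struct stat *stbuf, uint64_t *qid_path)
{
    *qid_path = (uint64_t)stbuf->st_ino;
}

That is exactly why the share/p1 vs share/p2 scenario below collides.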
>>>>>>>>
>>>>>>>> How to demonstrate the issue:
>>>>>>>> 1) Prepare a problematic share:
>>>>>>>>  - mount one partition under share/p1/ with some files inside
>>>>>>>>  - mount another one *with identical contents* under share/p2/
>>>>>>>>  - confirm that both partitions have files with the same inode nr, size, etc.
>>>>>>>> 2) Demonstrate breakage:
>>>>>>>>  - start a VM with a virtio-9p device pointing to the share
>>>>>>>>  - mount the 9p share with FSCACHE on
>>>>>>>>  - keep share/p1/file open
>>>>>>>>  - open and write to share/p2/file
>>>>>>>>
>>>>>>>> What happens is that the guest considers share/p1/file and share/p2/file to be the same file, and since we are using the cache it will not reopen it. We intended to write to partition 2, but we just wrote to partition 1. This is just one example of how the guest may rely on qid paths being unique.
>>>>>>>>
>>>>>>>> In the use case of containers, where we commonly have a few containers per VM, all based on similar images, these kinds of qid path collisions are very common and they seem to cause all kinds of funny behavior (sometimes very subtle).
>>>>>>>>
>>>>>>>> To avoid this situation, the device id of a file needs to also be taken as input for generating a qid path. Unfortunately, inode nr + device id together would be 96 bits, while we have only 64 bits for the qid path, so we can't just append them and call it a day :(
>>>>>>>>
>>>>>>>> We have thought of a few approaches, but we would definitely like to hear what the upstream maintainers and community think:
>>>>>>>>
>>>>>>>> * Full fix: Change the 9p protocol
>>>>>>>>
>>>>>>>> We would need to support a longer qid path, based on a virtio feature flag. This would require reworking the host and guest parts of virtio-9p, so both QEMU and Linux for most users.
>>>>>>>>
>>>>>>>
>>>>>>> I agree on a longer qid path, but we shouldn't tie it to a virtio flag since
>>>>>>> 9p is transport agnostic. And it happens to be used with a variety of transports.
>>>>>>> QEMU has both virtio-9p and a Xen backend, for example.
>>>>>>>
>>>>>>>> * Fallback and/or interim solutions
>>>>>>>>
>>>>>>>> A virtio feature flag may be refused by the guest, so we think we still need to make collisions less likely even with 64-bit paths. E.g.:
>>>>>>>
>>>>>>> In all cases, we would need a fallback solution to support current
>>>>>>> guest setups. Also, there are several 9p server implementations out
>>>>>>> there (ganesha, diod, kvmtool) that are currently used with Linux
>>>>>>> clients... it will take some time to get everyone in sync :-\
>>>>>>>
>>>>>>>> 1. XOR the device id with the inode nr to produce the qid path (we attach a proof-of-concept patch)
>>>>>>>
>>>>>>> Hmm... this would still allow collisions. Not good for a fallback.
>>>>>>>
>>>>>>>> 2. 64-bit hash of device id and inode nr
>>>>>>>
>>>>>>> Same here.
>>>>>>>
>>>>>>>> 3. Other ideas, such as allocating new qid paths and keeping a lookup table of some sort (possibly too expensive)
>>>>>>>>
>>>>>>>
>>>>>>> This would be acceptable for a fallback.
>>>>>>
>>>>>> Maybe we can use the QEMU hash table (https://github.com/qemu/qemu/blob/master/util/qht.c), but I wonder if it scales to millions of entries. Do you think it is appropriate in this case?
>>>>>>
>>>>>
>>>>> I don't know QHT, hence Cc'ing Emilio for insights.
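To make the trade-off behind options 1 and 2 concrete, the mixing would look roughly like this (an illustration of the general idea only -- the actual proof-of-concept patch was attached to the original mail and is not reproduced here, and the shift width is arbitrary):

#include <stdint.h>
#include <sys/stat.h>

/* Fold the device id into the 64-bit qid path.  A 64-bit hash of
 * (st_dev, st_ino) would have the same property: collisions become
 * rarer than with st_ino alone, but are not ruled out, which is the
 * objection raised above. */
static uint64_t qid_path_mixed(const struct stat *stbuf)
{
    return (uint64_t)stbuf->st_ino ^ ((uint64_t)stbuf->st_dev << 48);
}

Only option 3, which allocates qid paths and remembers the mapping, can detect an impending collision instead of silently producing one.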
>>>>>
>>>>>> I was thinking about how to implement something like this without having to maintain millions of entries... One option we could consider is to split the bits into a directly-mapped part and a lookup part. For example:
>>>>>>
>>>>>> Inode =
>>>>>> [ 10 first bits ] + [ 54 lowest bits ]
>>>>>>
>>>>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit prefix ]
>>>>>> The prefix is uniquely allocated for each input.
>>>>>>
>>>>>> Qid path =
>>>>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>>>>>
>>>>>> Since inodes are not completely random, and we usually have a handful of device IDs, we get a much smaller number of entries to track in the hash table.
>>>>>>
>>>>>> So what this would give:
>>>>>> (1) Would be faster and take less memory than mapping the full inode_nr, dev_id tuple to unique QID paths
>>>>>> (2) Guaranteed not to run out of bits when inode numbers stay below the lowest 54 bits and we have fewer than 1024 devices.
>>>>>> (3) When we get beyond this limit, there is a chance we run out of bits to allocate new QID paths, but we can detect this and refuse to serve the offending files instead of allowing a collision.
>>>>>>
>>>>>> We could tweak the prefix size to match the scenarios that we consider more likely, but I think something in the range of 10-16 bits sounds reasonable enough. What do you think?
>>>>>>
>>>>>
>>>>> I think that should work. Please send a patchset :)
>>>>
>>>> Hi Greg,
>>>>
>>>> This is based on the assumption that the high bits of inode numbers are
>>>> always zero, which is unacceptable from my standpoint. The inode number
>>>> allocator is a per-volume file system feature, so there may appear
>>>> allocators which don't use a simple increment and assign inode numbers
>>>> with non-zero high bits even for the first file of a volume.
>>>>
>>>> As I understand it, the problem is that the guest doesn't have enough
>>>> information for proper file identification (in our case share/p1/file
>>>> and share/p2/file are wrongly identified as the same file). It seems
>>>> that we need to transmit the device ID to the guest, and not in the high bits of the inode number.
>>>>
>>>> AFAIK Slava has patches for such transmission, I hope that he will send
>>>> them eventually.
>>>
>>> They're pretty trivial, however it'll require great effort and coordination
>>> among all parties involved, as they break backwards compatibility (the QID becomes
>>> bigger and, thus, breaks "Q" transmission).
>>>
>>> I'd like to raise another issue - it seems that st_dev and i_no aren't
>>> enough:
>>>
>>> mkdir -p /tmp/mounted
>>> touch /tmp/mounted/file
>>> mount -o bind /tmp/mounted t1
>>> mount -o bind /tmp/mounted t2
>>> stat t{1,2}/file | grep Inode
>>> Device: 805h/2053d Inode: 42729487 Links: 3
>>> Device: 805h/2053d Inode: 42729487 Links: 3
>>>
>>> In that case, even with the patch applied, we'll still have the issue of
>>> colliding QIDs guest-side - so, for example, after umount t1, t2/file
>>> becomes inaccessible, as the inode points to t1...
>>>
>>
>> t1/file and t2/file really point to the same file on the server, so
>> it is expected that they have the same QIDs.
>>
>>> I'm not sure how to fix this one. Ideas?
>>
>> fscache bug ?
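Coming back to the prefix scheme Antonios describes above, a minimal sketch of what the lookup could look like (made-up names, glib's GHashTable rather than QHT, no locking, simplified key packing -- it only shows the shape of the "allocate a prefix, refuse instead of colliding" behaviour):

#include <glib.h>
#include <stdint.h>
#include <sys/stat.h>

#define QPATH_INO_BITS   54
#define QPATH_PREFIX_MAX (1u << (64 - QPATH_INO_BITS))   /* 1024 prefixes */

static GHashTable *prefix_map;   /* (dev, high inode bits) -> prefix + 1 */
static uint32_t next_prefix;

/* Returns 0 and fills *qid_path, or -1 when no prefix can be allocated
 * (the "refuse to serve the offending file" case). */
static int qid_path_prefixed(const struct stat *stbuf, uint64_t *qid_path)
{
    uint64_t ino_lo = stbuf->st_ino & ((1ULL << QPATH_INO_BITS) - 1);
    uint64_t ino_hi = stbuf->st_ino >> QPATH_INO_BITS;
    /* Simplification: assumes the device id fits next to ino_hi in a
     * 64-bit key; a real patch would hash a (st_dev, ino_hi) struct. */
    gint64 key = (gint64)(((uint64_t)stbuf->st_dev << 10) | ino_hi);
    gpointer val;
    uint32_t prefix;

    if (!prefix_map) {
        prefix_map = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                           g_free, NULL);
    }

    val = g_hash_table_lookup(prefix_map, &key);
    if (val) {
        prefix = GPOINTER_TO_UINT(val) - 1;
    } else {
        if (next_prefix == QPATH_PREFIX_MAX) {
            return -1;                          /* out of prefixes */
        }
        prefix = next_prefix++;
        gint64 *stored = g_new(gint64, 1);
        *stored = key;
        g_hash_table_insert(prefix_map, stored, GUINT_TO_POINTER(prefix + 1));
    }

    *qid_path = ((uint64_t)prefix << QPATH_INO_BITS) | ino_lo;
    return 0;
}

The table only needs one entry per distinct (device, high-inode-bits) combination, which is what keeps it far smaller than one entry per file. The bind-mount case right above is untouched by any of this, since both paths really are the same file on the host.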
>
> I've reproduced it today without fscache:
>
> host:
> mount -o bind /tmp/mounted t1
> mount -o bind /tmp/mounted t2
>
> guest:
> / # tail -f t1/file &
> / # 321
>
> / # cat t2/file
> 321
> / #
>
> host:
> mv t1/file t1/file_moved
>
> guest:
> / # cat t2/file
> cat: can't open 't2/file': No such file or directory
> / # mount | grep fscache
> / #

Sorry, disregard this test case, it's operating normally - when we
move t1/file, we also move t2/file, as both bind mounts point to the
same directory...

Sorry, it's a brainfart after a long discussion about the issue :).

So, it's still not reproducible without (fs)cache guest-side.

Anyway, the question below still stands - is the guest allowed to
re-use FIDs for files with the same QIDs?

>
> Also, per internal discussions, we're not sure if the guest side
> is allowed to reuse FIDs opened previously for the same QID.paths.
>
> QEMU holds FID.path for each FID, which contains the actual FS path,
> i.e. "/path/to/file".
>
> So, if we (the guest) have two "identical" (per QID.path and the RFC) files
> (f1 and f2), though in different directories (host and guest), and
> we've accessed f1 once (FID 10, for example) - are we allowed to
> re-use FID 10 for f2, if f1's QID.path == f2's QID.path?
>
>>
>>>
>>>>
>>>> Thanks,
>>>> Eduard.
>>>>
>>>>>
>>>>>>>
>>>>>>>> With our proof-of-concept patch, the issues caused by qid path collisions go away, so it can be seen as an interim solution of sorts. However, the chance of collisions is not eliminated; we are just replacing the current strategy, which is almost guaranteed to cause collisions in certain use cases, with one that makes them rarer. We think that a virtio feature flag for longer qid paths is the only way to eliminate these issues completely.
>>>>>>>>
>>>>>>>> This is the extent to which we were able to analyze the issue from our side; we would like to hear if there are better ideas, or which approach would be more appropriate for upstream.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> --
>>>>>>> Greg
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>
>> .
>>
>
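As a footnote for anyone skimming the thread: the qid everything above revolves around is the 13-byte structure the 9P protocol puts on the wire, which is why widening the path field breaks the "Q" encoding and requires client and server to agree. A declaration of that layout (illustrative struct name only, not taken from QEMU or the kernel):

#include <stdint.h>

/* 9P wire format of a qid: type[1] version[4] path[8], i.e. 13 bytes
 * when packed on the wire; this struct just names the fields. */
struct p9_qid_wire {
    uint8_t  type;      /* file type bits (directory, symlink, ...) */
    uint32_t version;   /* version, used by clients for cache validation */
    uint64_t path;      /* the 64-bit identifier this whole thread is about */
};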