Message-ID: <4CB43799.2050106@redhat.com>
Date: Tue, 12 Oct 2010 12:25:29 +0200
From: Avi Kivity
Subject: Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
References: <1286552914-27014-1-git-send-email-stefanha@linux.vnet.ibm.com>
 <1286552914-27014-4-git-send-email-stefanha@linux.vnet.ibm.com>
 <4CB18549.3020206@redhat.com>
 <20101011100954.GA4078@stefan-thinkpad.transitives.com>
 <4CB30B43.2040706@redhat.com>
 <4CB32530.2070504@codemonkey.ws>
 <4CB32615.6030008@redhat.com>
 <4CB3321F.2060803@codemonkey.ws>
 <4CB33519.40302@redhat.com>
 <4CB33711.8030808@codemonkey.ws>
In-Reply-To: <4CB33711.8030808@codemonkey.ws>
List-Id: qemu-devel.nongnu.org
To: Anthony Liguori
Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel@nongnu.org

On 10/11/2010 06:10 PM, Anthony Liguori wrote:
> On 10/11/2010 11:02 AM, Avi Kivity wrote:
>> On 10/11/2010 05:49 PM, Anthony Liguori wrote:
>>> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>>>> A leak is unacceptable. It means an image can grow to an
>>>>> unbounded size.
>>>>> If you are a server provider offering
>>>>> multitenancy, then a malicious guest can potentially grow the
>>>>> image beyond its allotted size causing a Denial of Service attack
>>>>> against another tenant.
>>>>
>>>>
>>>> This particular leak cannot grow, and is not controlled by the guest.
>>>
>>> As the image gets moved from hypervisor to hypervisor, it can keep
>>> growing if given a chance to fill up the disk, then trim it all away.
>>>
>>> In a mixed hypervisor environment, it just becomes a numbers game.
>>
>> I don't see how it can grow. Both the freelist and the clusters it
>> points to consume space, which becomes a leak once you move it to a
>> hypervisor that doesn't understand the freelist. The older
>> hypervisor then allocates new blocks. As soon as it performs a
>> metadata scan (if ever), the freelist is reclaimed.
>
> Assume you don't ever do a metadata scan (which is really our design
> point).

What about crashes?

>
> If you move to a hypervisor that doesn't support it, then move to a
> hypervisor that does, you create a brand new freelist and start
> leaking more space. This isn't a contrived scenario if you have a
> cloud environment with a mix of hosts.

It's only a leak if you don't do a metadata scan.

>
> You might not be able to get a ping-pong every time you provision, but
> with enough effort, you could create serious problems.
>
> It's really an issue of correctness. Making correctness trade-offs
> for the purpose of compatibility is a policy decision and not
> something we should bake into an image format. If a tool feels
> strongly that it's a reasonable trade-off to make, it can always fudge
> the feature bits itself.

I think the effort here is reasonable; clearing a bit on startup is not
that complicated.

>>>
>>> A potential solution here is to treat TRIM a little differently than
>>> we've been discussing.
>>>
>>> When TRIM happens, don't immediately write an unallocated cluster
>>> entry for the L2. Leave the L2 entry intact.
>>> Don't actually write
>>> a UCE to the L2 until you actually allocate the block.
>>>
>>> This implies a cost because you'll need to do metadata syncs to make
>>> this work. However, that eliminates leakage.
>>
>> The information is lost on shutdown; and you can have a large number
>> of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a
>> user expecting a visit from RIAA).
>>
>> A slight twist on your proposal is to have an allocated-but-may-drop
>> bit in a L2 entry. TRIM or zero detection sets the bit (leaving the
>> cluster number intact). A following write to the cluster needs to
>> clear the bit; if we reallocate the cluster we need to replace it
>> with a ZCE.
>
> Yeah, this is sort of what I was thinking. You would still want a
> free list but it becomes totally optional because if it's lost, no
> data is leaked (assuming that the older version understands the bit).
>
> I was suggesting that we store that bit in the free list though
> because that lets us support having older QEMUs with absolutely no
> knowledge still work.

It doesn't - on rewrite an old qemu won't clear the bit, so a newer qemu
would think it's still free. The autoclear bit solves it nicely - the
old qemu automatically drops the allocated-but-may-drop bits, undoing
any TRIMs (which is unfortunate) but preserving consistency.

-- 
error compiling committee.c: too many arguments to function
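
For readers following the thread, here is a minimal sketch of how the autoclear idea could interact with an allocated-but-may-drop bit. It is an illustration only, not the QED specification or QEMU code: the names `Image`, `open_image`, `trim`, `AUTOCLEAR_MAY_DROP` and `L2_MAY_DROP`, and the bit positions, are all assumptions made up for this example. The model is that an implementation clears any autoclear feature bit it does not recognize when it opens the image, and that a newer implementation trusts the per-entry may-drop flags only while the corresponding feature bit is still set.

```python
from dataclasses import dataclass, field

# Hypothetical bit assignments, chosen only for this illustration.
AUTOCLEAR_MAY_DROP = 1 << 0  # header bit: "may-drop flags in the L2 are valid"
L2_MAY_DROP = 1 << 0         # flag kept in the low bit of an L2 entry


@dataclass
class Image:
    autoclear_features: int = 0
    l2: list = field(default_factory=list)  # one entry per guest cluster


def open_image(img, known_autoclear):
    """Open for writing with an implementation that knows `known_autoclear`."""
    # Any autoclear bit the implementation does not know is dropped.
    img.autoclear_features &= known_autoclear
    knows_flag = bool(known_autoclear & AUTOCLEAR_MAY_DROP)
    flag_valid = bool(img.autoclear_features & AUTOCLEAR_MAY_DROP)
    if knows_flag and not flag_valid:
        # An older writer may have rewritten clusters without clearing the
        # per-entry flags, so none of them can be trusted: strip them all.
        img.l2 = [entry & ~L2_MAY_DROP for entry in img.l2]


def trim(img, index):
    """Newer implementation: mark a cluster may-drop, keep its cluster number."""
    img.autoclear_features |= AUTOCLEAR_MAY_DROP
    img.l2[index] |= L2_MAY_DROP


def is_may_drop(img, index):
    return bool(img.l2[index] & L2_MAY_DROP)


# Scenario from the thread: new qemu TRIMs, old qemu opens and rewrites the
# cluster without touching the flag, then new qemu opens the image again.
NEW, OLD = AUTOCLEAR_MAY_DROP, 0
img = Image(l2=[0x10000, 0x20000])
open_image(img, NEW)
trim(img, 0)
open_image(img, OLD)   # old qemu clears the unknown autoclear bit
open_image(img, NEW)   # new qemu sees the bit is gone and discards the flags
assert not is_may_drop(img, 0)
```

The TRIM is undone in this sketch, as noted above, but the newer implementation never treats a rewritten cluster as free.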