From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=35951 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1P5Kxz-0001ex-Mu
	for qemu-devel@nongnu.org; Mon, 11 Oct 2010 12:11:17 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	id 1P5Kxo-00035U-M4
	for qemu-devel@nongnu.org; Mon, 11 Oct 2010 12:11:11 -0400
Received: from mail-yx0-f173.google.com ([209.85.213.173]:38123)
	by eggs.gnu.org with esmtp (Exim 4.71) id 1P5Kxo-00035O-Hh
	for qemu-devel@nongnu.org; Mon, 11 Oct 2010 12:11:00 -0400
Received: by yxn35 with SMTP id 35so768758yxn.4;
	Mon, 11 Oct 2010 09:11:00 -0700 (PDT)
Message-ID: <4CB33711.8030808@codemonkey.ws>
Date: Mon, 11 Oct 2010 11:10:57 -0500
From: Anthony Liguori
MIME-Version: 1.0
Subject: Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
References: <1286552914-27014-1-git-send-email-stefanha@linux.vnet.ibm.com>
	<1286552914-27014-4-git-send-email-stefanha@linux.vnet.ibm.com>
	<4CB18549.3020206@redhat.com>
	<20101011100954.GA4078@stefan-thinkpad.transitives.com>
	<4CB30B43.2040706@redhat.com>
	<4CB32530.2070504@codemonkey.ws>
	<4CB32615.6030008@redhat.com>
	<4CB3321F.2060803@codemonkey.ws>
	<4CB33519.40302@redhat.com>
In-Reply-To: <4CB33519.40302@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Avi Kivity
Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel@nongnu.org

On 10/11/2010 11:02 AM, Avi Kivity wrote:
> On 10/11/2010 05:49 PM, Anthony Liguori wrote:
>> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>>> A leak is unacceptable.  It means an image can grow to an unbounded
>>>> size.
>>>> If you are a server provider offering multitenancy, then a
>>>> malicious guest can potentially grow the image beyond its allotted
>>>> size, causing a Denial of Service attack against another tenant.
>>>
>>>
>>> This particular leak cannot grow, and is not controlled by the guest.
>>
>> As the image gets moved from hypervisor to hypervisor, it can keep
>> growing if given a chance to fill up the disk, then trim it all away.
>>
>> In a mixed hypervisor environment, it just becomes a numbers game.
>
> I don't see how it can grow.  Both the freelist and the clusters it
> points to consume space, which becomes a leak once you move it to a
> hypervisor that doesn't understand the freelist.  The older hypervisor
> then allocates new blocks.  As soon as it performs a metadata scan (if
> ever), the freelist is reclaimed.

Assume you don't ever do a metadata scan (which is really our design
point).

If you move to a hypervisor that doesn't support the freelist, then move
to a hypervisor that does, you create a brand new freelist and start
leaking more space.  This isn't a contrived scenario if you have a cloud
environment with a mix of hosts.

You might not be able to get a ping-pong every time you provision, but
with enough effort, you could create serious problems.

It's really an issue of correctness.  Making correctness trade-offs for
the purpose of compatibility is a policy decision and not something we
should bake into an image format.  If a tool feels strongly that it's a
reasonable trade-off to make, it can always fudge the feature bits
itself.

>>
>>>> A freelist has to be a non-optional feature.  When the freelist bit
>>>> is set, an older QEMU cannot read the image.  If the freelist is
>>>> completely used, the freelist bit can be cleared and the image is
>>>> then usable by older QEMUs.
>>>
>>> Once we support TRIM (or detect zeros) we'll never have a clean
>>> freelist.
>>
>> Zero detection doesn't add to the free list.
>
> Why not?
> If a cluster is zero filled, you may drop it (assuming no
> backing image).

Sorry, I was thinking about the case of copy-on-read.  When you
transition from UCE -> ZCE, nothing gets added to the free list.  But if
you go from allocated -> ZCE, then you would add to the free list.

>>
>> A potential solution here is to treat TRIM a little differently than
>> we've been discussing.
>>
>> When TRIM happens, don't immediately write an unallocated cluster
>> entry for the L2.  Leave the L2 entry intact.  Don't actually write
>> a UCE to the L2 until you actually allocate the block.
>>
>> This implies a cost because you'll need to do metadata syncs to make
>> this work.  However, that eliminates leakage.
>
> The information is lost on shutdown; and you can have a large number
> of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a
> user expecting a visit from RIAA).
>
> A slight twist on your proposal is to have an allocated-but-may-drop
> bit in an L2 entry.  TRIM or zero detection sets the bit (leaving the
> cluster number intact).  A following write to the cluster needs to
> clear the bit; if we reallocate the cluster we need to replace it with
> a ZCE.

Yeah, this is sort of what I was thinking.  You would still want a free
list, but it becomes totally optional because if it's lost, no data is
leaked (assuming that the older version understands the bit).

I was suggesting that we store that bit in the free list, though, because
that lets us support having older QEMUs with absolutely no knowledge
still work.

> This makes the freelist all L2 entries with the bit set; it may be
> less efficient than a custom data structure though.

We still want the freelist to avoid recreating it.  We also want to
store the allocated-but-may-drop bit in the free list.

Regards,

Anthony Liguori
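
For illustration only (not taken from any QED patch or from the spec), here is a minimal Python sketch of the allocated-but-may-drop idea discussed above, in the variant where the bit lives in the L2 entry.  The cluster size, the UCE/ZCE encodings, the bit position, and every function name are assumptions made up for this example.

```python
# All constants below are illustrative assumptions, not QED spec values.
CLUSTER_SIZE = 64 * 1024            # assumed; cluster offsets are multiples of it
UCE = 0                             # assumed encoding: unallocated cluster entry
ZCE = 1                             # assumed encoding: zero cluster entry
MAY_DROP = 1 << 1                   # assumed flag bit: allocated-but-may-drop
OFFSET_MASK = ~(CLUSTER_SIZE - 1)   # keeps only the cluster offset bits


def is_allocated(entry):
    """True if the L2 entry carries a real cluster offset."""
    return (entry & OFFSET_MASK) != 0


def trim(l2, index):
    """TRIM or zero detection: set the bit, leave the cluster number intact."""
    if is_allocated(l2[index]):
        l2[index] |= MAY_DROP


def write(l2, index, allocate):
    """Guest write: clear the bit if the entry is allocated, else allocate."""
    if is_allocated(l2[index]):
        l2[index] &= ~MAY_DROP      # the existing cluster is live again
    else:
        l2[index] = allocate()      # UCE or ZCE needs a fresh cluster
    return l2[index] & OFFSET_MASK


def droppable(l2):
    """The implicit free list: indexes of all L2 entries with the bit set."""
    return [i for i, e in enumerate(l2) if is_allocated(e) and e & MAY_DROP]


def reclaim(l2, index):
    """Hand a may-drop cluster to another user; its old entry becomes a ZCE."""
    entry = l2[index]
    if not (is_allocated(entry) and entry & MAY_DROP):
        raise ValueError("entry is not marked may-drop")
    l2[index] = ZCE
    return entry & OFFSET_MASK


# Usage: trim one cluster, write to it again, then trim and reclaim another.
l2 = [UCE, 1 * CLUSTER_SIZE, 2 * CLUSTER_SIZE]
trim(l2, 1)
write(l2, 1, allocate=lambda: 99 * CLUSTER_SIZE)   # clears the bit, no new cluster
trim(l2, 2)
reused_offset = reclaim(l2, 2)                      # l2[2] is now a ZCE
```

In this sketch nothing is leaked if a separate free list is lost, because `droppable()` can rebuild it from the L2 entries alone; the cost is the scan, which is the efficiency concern raised in the quoted text.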