Message-ID: <4CB33519.40302@redhat.com>
Date: Mon, 11 Oct 2010 18:02:33 +0200
From: Avi Kivity
Subject: Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
In-Reply-To: <4CB3321F.2060803@codemonkey.ws>
To: Anthony Liguori
Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel@nongnu.org

On 10/11/2010 05:49 PM, Anthony Liguori wrote:
> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>> A leak is unacceptable.  It means an image can grow to an unbounded
>>> size.  If you are a server provider offering multitenancy, then a
>>> malicious guest can potentially grow the image beyond its allotted
>>> size, causing a Denial of Service attack against another tenant.
>>
>>
>> This particular leak cannot grow, and is not controlled by the guest.
>
> As the image gets moved from hypervisor to hypervisor, it can keep
> growing if given a chance to fill up the disk, then trim it all away.
>
> In a mixed hypervisor environment, it just becomes a numbers game.

I don't see how it can grow.  Both the freelist and the clusters it
points to consume space, which becomes a leak once you move it to a
hypervisor that doesn't understand the freelist.  The older hypervisor
then allocates new blocks.  As soon as it performs a metadata scan (if
ever), the freelist is reclaimed.

You could only get a growing leak if you moved it to a hypervisor that
doesn't perform metadata scans, but then that is independent of the
freelist.

>
>>> A freelist has to be a non-optional feature.  When the freelist bit
>>> is set, an older QEMU cannot read the image.  If the freelist is
>>> completely used, the freelist bit can be cleared and the image is
>>> then usable by older QEMUs.
>>
>> Once we support TRIM (or detect zeros) we'll never have a clean
>> freelist.
>
> Zero detection doesn't add to the free list.

Why not?  If a cluster is zero filled, you may drop it (assuming no
backing image).

>
> A potential solution here is to treat TRIM a little differently than
> we've been discussing.
>
> When TRIM happens, don't immediately write an unallocated cluster
> entry for the L2.  Leave the L2 entry intact.  Don't actually write a
> UCE to the L2 until you actually allocate the block.
>
> This implies a cost because you'll need to do metadata syncs to make
> this work.  However, that eliminates leakage.

The information is lost on shutdown; and you can have a large number of
unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user
expecting a visit from RIAA).

A slight twist on your proposal is to have an allocated-but-may-drop
bit in an L2 entry.  TRIM or zero detection sets the bit (leaving the
cluster number intact).  A following write to the cluster needs to
clear the bit; if we reallocate the cluster we need to replace it with
a ZCE.
This makes the freelist all L2 entries with the bit set; it may be less
efficient than a custom data structure though.

-- 
error compiling committee.c: too many arguments to function
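
The allocated-but-may-drop bit described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the QED on-disk format: the flag position, the ZCE stand-in value, the 4 KiB offset alignment and all function names are hypothetical and do not come from the thread or the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Everything below is assumed for illustration; the thread defines none of it. */
#define L2_MAY_DROP    UINT64_C(0x1)       /* allocated-but-may-drop flag      */
#define L2_ZCE         UINT64_C(0x2)       /* stand-in for a zero cluster entry */
#define L2_OFFSET_MASK (~UINT64_C(0xfff))  /* offsets assumed 4 KiB aligned    */

/* TRIM or zero detection: set the flag, leave the cluster number intact.
 * Entries without an allocated cluster are left alone. */
static void l2_mark_may_drop(uint64_t *entry)
{
    if (*entry & L2_OFFSET_MASK) {
        *entry |= L2_MAY_DROP;
    }
}

/* A later guest write to the same cluster keeps it and clears the flag.
 * (Allocating a cluster for an unallocated entry is out of scope here.) */
static void l2_note_write(uint64_t *entry)
{
    *entry &= ~L2_MAY_DROP;
}

/* The "freelist" is simply every entry with the flag set, so finding a
 * free cluster is a linear scan. Returns the index, or -1 if none. */
static ptrdiff_t l2_find_may_drop(const uint64_t *table, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i] & L2_MAY_DROP) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}

/* Reallocation: hand the cluster to a new owner and leave a ZCE behind
 * so the old location reads back as zeros. Returns the cluster offset. */
static uint64_t l2_reclaim(uint64_t *entry)
{
    assert(*entry & L2_MAY_DROP);
    uint64_t offset = *entry & L2_OFFSET_MASK;
    *entry = L2_ZCE;
    return offset;
}
```

The linear scan in l2_find_may_drop() is where the "less efficient than a custom data structure" cost shows up: nothing indexes the flagged entries, so every L2 table may have to be walked to find a reusable cluster.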