From: Stefan Berger
Date: Thu, 15 Sep 2011 08:34:55 -0400
Subject: Re: [Qemu-devel] Design of the blobstore [API of the NVRAM]
To: Stefan Hajnoczi
Cc: Kevin Wolf, Markus Armbruster, Anthony Liguori, QEMU Developers,
 "Michael S. Tsirkin"
Message-ID: <4E71F0EF.6070803@linux.vnet.ibm.com>
References: <4E70DEE8.8090908@linux.vnet.ibm.com>

On 09/15/2011 07:17 AM, Stefan Hajnoczi wrote:
> On Wed, Sep 14, 2011 at 6:05 PM, Stefan Berger wrote:
>> One property of the blobstore is that it has a certain required size for
>> accommodating all blobs of the devices that want to store their blobs in
>> it. The assumption is that the size of these blobs is known a priori to
>> the writer of the device code, and all devices can register their space
>> requirements with the blobstore during device initialization. Gathering
>> all the registered blobs' sizes, plus knowing the overhead of the layout
>> of the data on the disk, then lets QEMU calculate the total required
>> (minimum) size that the image has to have to accommodate all blobs in a
>> particular blobstore.
> Libraries like tdb or gdbm come to mind. We should be careful not to
> reinvent cpio/tar or FAT :).

Sure. As long as these dbs allow us to override open(), close(), read(),
write() and seek() with bdrv ops, we could recycle any of them. Maybe we
can build something smaller than those...
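Just to make the size calculation quoted above concrete, here is a minimal
sketch; it is not the actual code, and the structure name and the two
overhead constants are made-up assumptions:

    /* Illustrative sketch only: the structure and the overhead constants
     * are assumptions, not the actual QEMU blobstore layout. */
    #include <stddef.h>

    #define BLOB_HEADER_OVERHEAD   64   /* assumed per-blob on-disk header */
    #define STORE_HEADER_OVERHEAD 512   /* assumed store-wide header/directory */

    struct blob_reg {
        const char  *name;      /* e.g. "permstate", "savestate" */
        unsigned int maxsize;   /* registered a priori by the device code */
    };

    /* Summing all registered blob sizes plus the layout overhead yields
     * the minimum image size the blobstore drive must provide. */
    static unsigned int nvram_required_size(const struct blob_reg *blobs,
                                            size_t n)
    {
        unsigned int total = STORE_HEADER_OVERHEAD;
        size_t i;

        for (i = 0; i < n; i++) {
            total += BLOB_HEADER_OVERHEAD + blobs[i].maxsize;
        }
        return total;
    }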
> What about live migration? If each VM has a LUN assigned on a SAN
> then these qcow2 files add a new requirement for a shared file system.

Well, one can still block-migrate these. The user of course has to know
whether shared storage is set up or not and pass the appropriate flags to
libvirt for the migration. I know it works (modulo some problems when using
encrypted QCOW2) since I've been testing with it.

> Perhaps it makes sense to include the blobstore in the VM state data
> instead? If you take that approach then the blobstore will get
> snapshotted *into* the existing qcow2 images. Then you don't need a
> shared file system for migration to work.

It could be an option. However, if the user has a raw image for the VM, we
still need the NVRAM emulation for the TPM, for example. So we need to store
the persistent data somewhere, but raw is not prepared for that. Even if
snapshotting doesn't work at all, we need to be able to persist the devices'
data.

> Can you share your design for the actual QEMU API that the TPM code
> will use to manipulate the blobstore? Is it designed to work in the
> event loop while QEMU is running, or is it for rare I/O on
> startup/shutdown?

Everything is kind of changing now, but here's what I have right now:

    tb->s.tpm_ltpms->nvram = nvram_setup(tpm_ltpms->drive_id, &errcode);
    if (!tb->s.tpm_ltpms->nvram) {
        fprintf(stderr, "Could not find nvram.\n");
        return errcode;
    }

    nvram_register_blob(tb->s.tpm_ltpms->nvram,
                        NVRAM_ENTRY_PERMSTATE,
                        tpmlib_get_prop(TPMPROP_TPM_MAX_NV_SPACE));
    nvram_register_blob(tb->s.tpm_ltpms->nvram,
                        NVRAM_ENTRY_SAVESTATE,
                        tpmlib_get_prop(TPMPROP_TPM_MAX_SAVESTATE_SPACE));
    nvram_register_blob(tb->s.tpm_ltpms->nvram,
                        NVRAM_ENTRY_VOLASTATE,
                        tpmlib_get_prop(TPMPROP_TPM_MAX_VOLATILESTATE_SPACE));

    rc = nvram_start(tpm_ltpms->nvram, fail_on_encrypted_drive);

The above first sets up the NVRAM using the drive's id, i.e., the
'-tpmdev ...,nvram=my-bs,' parameter. This establishes the NVRAM.
Subsequently the blobs to be written into the NVRAM are registered.
nvram_start() then reconciles the registered NVRAM blobs with those found
on disk, and if everything fits together the result is 'rc = 0' and the
NVRAM is ready to go. Other devices can then do the same, either with the
same NVRAM or with another one. (It is called NVRAM now, after renaming it
from blobstore.)

Reading from the NVRAM is, in the case of the TPM, a rare event. It happens
in the context of QEMU's main thread:

    if (nvram_read_data(tpm_ltpms->nvram, NVRAM_ENTRY_PERMSTATE,
                        &tpm_ltpms->permanent_state.buffer,
                        &tpm_ltpms->permanent_state.size,
                        0, NULL, NULL) ||
        nvram_read_data(tpm_ltpms->nvram, NVRAM_ENTRY_SAVESTATE,
                        &tpm_ltpms->save_state.buffer,
                        &tpm_ltpms->save_state.size,
                        0, NULL, NULL)) {
        tpm_ltpms->had_fatal_error = true;
        return;
    }

The above reads the data of two blobs synchronously. This happens during
startup.

Writes depend on what the user does with the TPM. He can trigger lots of
updates to persistent state if he performs certain operations, e.g.,
persisting keys inside the TPM.

    rc = nvram_write_data(tpm_ltpms->nvram,
                          what, tsb->buffer, tsb->size,
                          VNVRAM_ASYNC_F | VNVRAM_WAIT_COMPLETION_F,
                          NULL, NULL);

The above writes a TPM blob into the NVRAM. This is triggered by the TPM
thread, which notifies the QEMU main thread to write the blob into the
NVRAM. I do this synchronously at the moment, not using the last two
parameters for a callback after completion but the two flags: the first is
to notify the main thread, the second is to wait for the completion of the
request (using a condition internally).

Here are the protos:

    VNVRAM *nvram_setup(const char *drive_id, int *errcode);
    int nvram_start(VNVRAM *, bool fail_on_encrypted_drive);
    int nvram_register_blob(VNVRAM *bs, enum NVRAMEntryType type,
                            unsigned int maxsize);
    unsigned int nvram_get_totalsize(VNVRAM *bs);
    unsigned int nvram_get_totalsize_kb(VNVRAM *bs);

    typedef void NVRAMRWFinishCB(void *opaque, int errcode, bool is_write,
                                 unsigned char **data, unsigned int len);

    int nvram_write_data(VNVRAM *bs, enum NVRAMEntryType type,
                         const unsigned char *data, unsigned int len,
                         int flags, NVRAMRWFinishCB cb, void *opaque);

As said, things are changing right now, so this is to give an impression...

   Stefan

> Stefan
>
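Purely as an illustration of how the callback parameters of
nvram_write_data() might eventually be used instead of
VNVRAM_WAIT_COMPLETION_F, here is a sketch; the callback name, the opaque
type, and the exact flag usage are assumptions, not the actual code:

    /* Hypothetical completion callback matching the NVRAMRWFinishCB
     * typedef above; the opaque type and error handling are assumptions. */
    static void tpm_nvram_write_done(void *opaque, int errcode, bool is_write,
                                     unsigned char **data, unsigned int len)
    {
        TPMLTPMsState *tpm_ltpms = opaque;   /* hypothetical opaque type */

        if (errcode) {
            tpm_ltpms->had_fatal_error = true;
        }
    }

    /* Caller side: request the write and return without blocking; the
     * callback runs once the main thread has completed the request. */
    rc = nvram_write_data(tpm_ltpms->nvram, what, tsb->buffer, tsb->size,
                          VNVRAM_ASYNC_F /* notify main thread, don't wait */,
                          tpm_nvram_write_done, tpm_ltpms);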