From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:33311)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanb@linux.vnet.ibm.com>) id 1R5gc1-0007lh-7C
	for qemu-devel@nongnu.org; Mon, 19 Sep 2011 12:22:30 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <stefanb@linux.vnet.ibm.com>) id 1R5gbw-0007dn-6A
	for qemu-devel@nongnu.org; Mon, 19 Sep 2011 12:22:29 -0400
Received: from e8.ny.us.ibm.com ([32.97.182.138]:50605)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanb@linux.vnet.ibm.com>) id 1R5gbw-0007dZ-2b
	for qemu-devel@nongnu.org; Mon, 19 Sep 2011 12:22:24 -0400
Received: from d01relay06.pok.ibm.com (d01relay06.pok.ibm.com [9.56.227.116])
	by e8.ny.us.ibm.com (8.14.4/8.13.1) with ESMTP id p8JG7ixB029130
	for <qemu-devel@nongnu.org>; Mon, 19 Sep 2011 12:07:44 -0400
Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64])
	by d01relay06.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
	p8JGMDUi1097824
	for <qemu-devel@nongnu.org>; Mon, 19 Sep 2011 12:22:15 -0400
Received: from d01av04.pok.ibm.com (loopback [127.0.0.1])
	by d01av04.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id
	p8JGMC2u027417
	for <qemu-devel@nongnu.org>; Mon, 19 Sep 2011 12:22:12 -0400
Message-ID: <4E776C2A.5020006@linux.vnet.ibm.com>
Date: Mon, 19 Sep 2011 12:22:02 -0400
From: Stefan Berger <stefanb@linux.vnet.ibm.com>
MIME-Version: 1.0
References: <20110915122842.GA6302@redhat.com>
	<4E720CA9.9050208@linux.vnet.ibm.com>	<20110916144443.GB20933@redhat.com>
	<4E737D70.8030801@linux.vnet.ibm.com>
	<20110917192857.GB6127@redhat.com>
In-Reply-To: <20110917192857.GB6127@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] blobstore disk format (was Re: Design of the
	blobstore)
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>, qemu-devel@nongnu.org, Markus Armbruster <armbru@redhat.com>

On 09/17/2011 03:28 PM, Michael S. Tsirkin wrote:
> On Fri, Sep 16, 2011 at 12:46:40PM -0400, Stefan Berger wrote:
>> On 09/16/2011 10:44 AM, Michael S. Tsirkin wrote:
>>> On Thu, Sep 15, 2011 at 10:33:13AM -0400, Stefan Berger wrote:
>>>> On 09/15/2011 08:28 AM, Michael S. Tsirkin wrote:
>>>>> So the below is a proposal for a directory scheme
>>>>> for storing (optionally multiple) nvram images,
>>>>> along with any metadata.
>>>>> Data is encoded using BER:
>>>>> http://en.wikipedia.org/wiki/Basic_Encoding_Rules
>>>>> Specifically, we mostly use the CER and DER subsets.
>>>>>
>>>> Would it change anything if we were to think of the NVRAM image as
>>>> another piece of metadata?
>>> Yes, we can do that, sure. I had the feeling that it will help to lay
>>> out the image at the end, to make directory listing
>>> more efficient - the rest of metadata is usually small,
>>> image might be somewhat large.
>>>
>> Why not let a convenience library handle the metadata on the device
>> level, having it create the blob that the NVRAM layer ends up
>> writing and parsing before the device uses it? Otherwise I should
>> maybe rename the nvram to metadata_store :-/
> Maybe we are talking about different things. All I argue for
> is using a common standard format for storing metadata,
> instead of having each device roll its own.
That's fine. The TPM code inside libtpms serializes all internal data
structures for later resumption. It doesn't use ASN.1 but in effect
endianness-normalizes them and stores them in a packed format, with a
version tag prepended where needed, so that they can be resumed later.
Are you suggesting to change that? I hope not...
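
As an illustration of that kind of format (this is not the actual libtpms
layout), here is a minimal Python sketch of a packed, big-endian record with
a version tag prepended. The field names, the tag width and the v1/v2 split
are made up for the example.

```python
import struct

# Hypothetical state record; NOT the real libtpms layout.
# ">" selects big-endian byte order with no padding, so the bytes on
# disk are the same regardless of host endianness.
VERSION_TAG = 2
HEADER = struct.Struct(">H")     # 16-bit version tag prepended to the blob
BODY_V1 = struct.Struct(">IQ")   # counter (u32), timestamp (u64)
BODY_V2 = struct.Struct(">IQB")  # v2 appends a flags byte


def serialize(counter, timestamp, flags):
    """Write the current (v2) layout with its version tag in front."""
    return HEADER.pack(VERSION_TAG) + BODY_V2.pack(counter, timestamp, flags)


def deserialize(blob):
    """Read any supported version; each one needs its own parsing branch."""
    (version,) = HEADER.unpack_from(blob, 0)
    if version == 1:
        counter, timestamp = BODY_V1.unpack_from(blob, HEADER.size)
        flags = 0  # default for a field that older blobs do not carry
    elif version == 2:
        counter, timestamp, flags = BODY_V2.unpack_from(blob, HEADER.size)
    else:
        raise ValueError("unsupported version %d" % version)
    return counter, timestamp, flags
```

The per-version branch in deserialize() is the custom code that a revision
tag implies whenever a structure changes.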


>>>> I am also wondering whether each device shouldn't just handle the
>>>> metadata itself,
>>> It could, but that just means we will have custom code with
>>> different bugs in each device.
>>> Note that from experience with formats, the problem with
>>> time becomes less trivial than it seems as we
>>> need to provide forward and backward compatibility
>>> guarantees.
>>>
>> Is that guaranteed just by using ASN.1 ?
> At least for BER, yes. We can always skip an optional field
> that we don't recognize without knowing anything about
> its internal format.
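
To make the skipping argument concrete, here is a rough Python sketch of a
reader that walks definite-length BER elements and steps over any tag it does
not know, using only the length octets. It handles single-byte tags only, and
read_tlv and find_known are names invented for this example, not an existing
API.

```python
def read_tlv(buf, pos):
    """Return (tag, value, next_pos) for the definite-length TLV at pos.

    Only single-byte tags (tag numbers below 31) are handled.
    """
    tag = buf[pos]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-byte tags not handled in this sketch")
    length = buf[pos + 1]
    pos += 2
    if length == 0x80:
        raise ValueError("indefinite length not handled in this sketch")
    if length & 0x80:
        # Long form: the low 7 bits give the number of length octets.
        n = length & 0x7F
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated element")
    return tag, buf[pos:end], end


def find_known(buf, known_tags):
    """Collect values of known tags; everything else is skipped unparsed."""
    found = {}
    pos = 0
    while pos < len(buf):
        tag, value, pos = read_tlv(buf, pos)
        if tag in known_tags:
            found[tag] = value
    return found
```

The reader never looks inside an element it does not recognize; the length
alone is enough to find the next one.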

>> Do we need to add a
>> revision to the metadata?
> IMO, no. Instead we add optional attributes as long as we can
> preserve backwards compatibility, and mandatory attributes
> if we can't.
>
Are devices doing this right now or are these future changes to devices' 
code?

>>>> encryption, integrity value (crc32 or sha1) and so on. What
>>>> metadata should there be that really need to be handled on the NVRAM
>>>> API and below level rather than on the device-specific code level?
>>> So checksum (checksum value and type) 'and so on' are what I call
>>> metadata :) Doing it at device level seems wrong.
>>>
>> You mean doing it at the NVRAM level seems wrong. Of course, again
>> something a device could write into a header prepended to the actual
>> blob. Maybe every device that needs it should do that so that if we
>> were to support encryption of blobs and the key for decryption was
>> wrong one could detect it early without feeding badly decrypted /
>> corrupted state into the device and see what happens.
> Do what? Checksum the data? Well, error detection is nice,
> but it could be that people actually care about not losing
> all of the data on nvram if qemu is killed.  I also wonder whether
> invalidating all data because of a single bit corruption is a bug or a
> feature.
>
I think the checksumming makes sense if encryption is being added, so
that decryption and testing for proper key material remain an NVRAM
operation rather than a device operation.
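
A minimal Python sketch of such a check, assuming a small header of payload
length plus CRC-32 in front of the decrypted blob. The header layout and the
wrap/unwrap names are assumptions for illustration, and decryption itself is
left out. A blob decrypted with the wrong key would fail this check with very
high probability before it ever reaches the device.

```python
import struct
import zlib

# Hypothetical header: payload length and CRC-32 of the payload,
# both stored big-endian.
HDR = struct.Struct(">II")


def wrap(payload):
    """Prepend the integrity header before the blob is encrypted/written."""
    return HDR.pack(len(payload), zlib.crc32(payload)) + payload


def unwrap(blob):
    """Verify the header after reading/decrypting; raise on mismatch."""
    length, crc = HDR.unpack_from(blob, 0)
    payload = blob[HDR.size:HDR.size + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        raise ValueError("blob failed integrity check")
    return payload
```
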
>>>>> We use a directory as a SET in a CER format.
>>>>> This allows generating directory online without scanning
>>>>> the entries beforehand.
>>>>>
>>>> I guess it is the 'unknown' for me... but what is the advantage of
>>>> using ASN1 for this rather than just writing out packed and
>>>> endianness-normalized data structures (with revision value),
>>> If you want an example of where this 'custom formats are easy
>>> so let us write one' leads to in the end,
>>> look no further than live migration code.
>>> It's a mess of hacks that does not even work across
>>> upstream qemu versions, let alone across
>>> downstreams (different linux distros).
>>>
>> So is ASN1 the answer or does one still need to add a revision tag
>> to each blob putting in custom code for parsing the different
>> revisions of data structures (I guess) that may be extended/changed
>> over time?
>>
>>     Stefan
> We don't need revisions. We can always parse a new structure
> skipping optional attributes we don't recognize. In case we want to
> break old qemu versions intentionally, we can add
> a mandatory attribute.
So you said you had some code for the handling of ASN.1. Can you sketch
how the interaction of devices would work with mandatory and optional
attributes along with an API? I'd prefer to NOT have the attributes and
values be a part of the NVRAM API itself but let a (mandatory) library
handle the serialization and deserialization of this metadata when a
device wants to write or read state respectively. But maybe I just want
to keep the NVRAM API 'too simple'.

    Stefan
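
One possible reading of the mandatory/optional rule, sketched in Python on
already-decoded attributes, just to have something concrete to discuss. The
function and exception names are hypothetical and nothing here is an existing
QEMU interface; only 'realsize' comes from the proposal.

```python
class UnknownMandatoryAttribute(Exception):
    """Raised when the writer required something this reader lacks."""


def load_attributes(mandatory, optional, understood):
    """Apply the reader-side rule to decoded attribute sets.

    mandatory/optional map attribute name -> value (None if name only);
    understood is the set of attribute names this reader implements.
    """
    unknown = sorted(set(mandatory) - set(understood))
    if unknown:
        # An attribute the writer marked mandatory must not be ignored.
        raise UnknownMandatoryAttribute(", ".join(unknown))
    accepted = dict(mandatory)
    # Unrecognized optional attributes are dropped without error.
    accepted.update((k, v) for k, v in optional.items() if k in understood)
    return accepted
```

A device would call something like this from the library layer, so the NVRAM
API itself only ever sees an opaque blob.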

>>>> having
>>>> them crc32-protected to have some sanity checking in place?
>>>>
>>>>      Stefan
>>> I'm not sure why we want crc specifically in TPM.
>>> If it is 'just because we can' then it probably
>>> applies to other non-volatile storage?
>>> Storage generally?
>>>
>>>>> The rest of the encoding uses a DER format.
>>>>> This makes for fast parsing as entries are easy to skip.
>>>>>
>>>>> Each entry is encoded in DER format.
>>>>> Each entry is a SEQUENCE with two objects:
>>>>> 1. nvram
>>>>> 2. optional name - a UTF8String
>>>>>
>>>>> Binary data is stored as OCTET-STRING values on disk.
>>>>> Any RW metadata is stored as OCTET-STRING value as well.
>>>>> Any RO metadata is stored in appropriate universal encoding,
>>>>> by type.
>>>>>
>>>>> In the context below, an attribute is either an IA5String or a SEQUENCE.
>>>>> If IA5String, this is the attribute name, and it has no value.
>>>>> If SEQUENCE, the first entry in the sequence is an
>>>>> IA5String, it is the attribute name. The rest of the entries
>>>>> represent the attribute value.
>>>>>
>>>>> Mandatory/optional attributes: depends on type.
>>>>> tpm will have realsize as RW mandatory attribute.
>>>>>
>>>>> Each nvram is built as a SEQUENCE including 4 objects
>>>>> 1. type - an IA5String. downstreams can use other types such as
>>>>>                       UUIDs instead to ensure no conflicts with upstream
>>>>> 2. SET of mandatory attributes
>>>>> 3. SET of optional attributes
>>>>> 4. data - a RW OCTET-STRING
>>>>>
>>>>> It is envisioned that attributes won't be too large,
>>>>> so they can easily be kept in memory.
>>>>>
>>>>>
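
For reference, the nvram layout proposed above could be encoded along these
lines. This is a hedged Python sketch using hand-rolled DER with single-byte
universal tags. The helper names are invented, and wrapping an attribute
value in an OCTET STRING is an assumption based on the 'RW metadata is stored
as OCTET-STRING' rule in the proposal.

```python
# Universal tags (constructed bit already set for SEQUENCE and SET).
IA5STRING, OCTET_STRING, SEQUENCE, SET = 0x16, 0x04, 0x30, 0x31


def der_len(n):
    """Definite length: short form below 128, minimal long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def tlv(tag, content):
    return bytes([tag]) + der_len(len(content)) + content


def attribute(name, value=None):
    """A name-only attribute is a bare IA5String; otherwise a SEQUENCE."""
    name_tlv = tlv(IA5STRING, name.encode("ascii"))
    if value is None:
        return name_tlv
    return tlv(SEQUENCE, name_tlv + tlv(OCTET_STRING, value))


def der_set(members):
    # DER expects the members of a SET OF in sorted encoding order.
    return tlv(SET, b"".join(sorted(members)))


def nvram(type_name, mandatory, optional, data):
    """SEQUENCE of: type, mandatory attrs, optional attrs, data."""
    return tlv(SEQUENCE,
               tlv(IA5STRING, type_name.encode("ascii"))
               + der_set(mandatory)
               + der_set(optional)
               + tlv(OCTET_STRING, data))
```

Usage would be something like
nvram("tpm", [attribute("realsize", size_bytes)], [], image), where
size_bytes and image are whatever the device supplies.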