qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Ekaterina Tumanova <tumanova@linux.vnet.ibm.com>
To: Markus Armbruster <armbru@redhat.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Stefan Hajnoczi <stefanha@gmail.com>,
	qemu-devel@nongnu.org,
	Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com>,
	dahi@linux.vnet.ibm.com, cornelia.huck@de.ibm.com,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] Geometry and blocksize support for backing devices
Date: Fri, 07 Nov 2014 16:39:34 +0300	[thread overview]
Message-ID: <545CCB96.5020400@linux.vnet.ibm.com> (raw)
In-Reply-To: <8761ere5dx.fsf@blackfin.pond.sub.org>

On 11/07/2014 12:17 PM, Markus Armbruster wrote:
> Christian Borntraeger <borntraeger@de.ibm.com> writes:
>
>> Markus, Kevin, Stefan,
>>
>> here is a (somewhat late) followup of some KVM forum discussions regarding
>> block size and geometry of pass-through block devices. Let's just do a quick
>> wrap-up (as of my understanding) and a proposal at the end of the mail
>>
>>
>>
>>
>>
>> - DASD/ECKD disk devices have several special properties, e.g. the geometry
>>    still has an influence on the partitions and each track can have a different
>>    format and the format of the disk actually follows what z/OS has as basic
>>    data structures. Linux does a low-level format of the disk to look like a
>>    block device (most common case is formatted with 4K blocks). There are
>>    still some smalls warts for z/OS compatibility (block0 has 28 bytes of data,
>>    block1 has 148 bytes and block2 has 84 bytes of data. everything else has
>>    4k. the dasd device driver will fake 4k blocks for all blocks by ignoring
>>    writes beyond the data for blocks 0-2 and filling the gaps with 0xe5).
>>
>> - Since Linux uses DASDs as normal block device, we actually want to use
>>    virtio-blk to pass those to KVM guests. Due to the warts mentioned above,
>>    we have to have the proper block size and geometry in the guest. Otherwise
>>    things will fail in certain cases (linux partition detection for example)
>>
>> - we have libvirt support to provide this information in the XML. This is far
>>    from user friendly, though, as the admin has to manually look up the
>>    properties of the host disks and then modify the guest definition accordingly.
>>
>> - Kate came up with patches (based on initial patches from Einar Lueck) for
>>    auto-detection of geometry and block size for host block devices
>>
>> - Stefan and Paolo had some concerns:
>> 1. if the geometry etc is important, then make it part of the guest definition
>> 2. what about migration and the target disk differs
>> 3. is that issue system z specific or generic?
>>
>> Regarding 1: this does work as of today, but it is pretty complicated for an
>> admin to do so
>>
>> Regarding 2: System z system do not have built-in disks, they are always
>> accessed via fibre channel (either FICON protocol for DASDs or FCP protocol
>> for scsi). So its quite common to have shared access to the same disk from
>> different System z boxes. system z admins should be able to setup this
>> properly. Question is, is it ok to assume that and fail if not?
>>
>> Regarding 3: No idea.
>>
>> At KVM forum I talked to several people regarding a solution:
>>
>> a) Stefan suggested to make the auto detection explicit, e.g.: provide a
>> "autodetect" tag for the secs, cylc, heads and logical_sector_size properties
>> This would require changes in qemu, but also in libvirt and its domain
>> configuration
>> b) Markus suggested that there are already some cases in QEMU, where we rely
>> on the admin to provide a proper setup on the target, e.g. an .iso
>> file as image.
>> If the target has a different content in the .iso file things will break.
>>
>> Markus said, there are two classes of this.
>> a) problems that can be detected by QEMU. Here QEMU will abort migration
>
> Example: device missing on target, target rejects migration when it
> receives the device's state.
>
>> b) problems that cannot be detected by QEMU (e.g. different iso content). this
>> will trigger failures later on
>>
>> a is preferred
>
> Note that the geometry is currently in class (b).  Configuration
> generally is.
>
> A perpetual long-term goal of migration is embedding configuration in
> the migration stream, to move it from (b) to (a).  Just frontend
> configuration, because backend configuration is generally host-specific.
>
> Geometry is a property of the device, thus frontend configuration.
>
> For historical reasons, device geometry properties default to backend
> values set with -drive cyls=... or -hdachs.  This is deprecated.
>
> If the user doesn't specify geometry, QEMU makes one up, usually based
> on device size.  In certain circumstances, it bases on a DOS partition
> table instead.  Misfeature, in my opinion.  Partitioning a disk can give
> you a different geometry on the next restart, including migration,
> unless you specify the geometry explicitly.  Fortunately, most guests
> don't care for geometry at all.  This is entirely undocumented, as far
> as I can tell.
>
>> Now here comes my proposal:
>> Markus statement brought up an idea of special casing DASDs support. We can
>> call an ioctl BIODASDINFO on the block device that will only succeed if the host
>> disk is really a dasd. We could enable the auto detection for that case.
>
> If BIODASDINFO succeeds, QEMU uses that instead of making up device
> geometry as described above.  Correct?
>
> Let's spell out when exactly BIODASDINFO succeeds, to avoid
> misunderstandings.  It does when the backend is a DASD (/dev/dasd*).
> What about a partition on a DASD?  A file in a filesystem on a DASD?
>
> Auto-detecting geometry on DASDs adds an irregularity to the user
> interface.  I guess DASDs are special enough to tolerate that.
>
>> In addition, QEMU will check if geometry and block size match during
>> migration, if
>> not, migration will fail. That would work with the following cases
>>
>> (manual override == secs, cyls, headers, blocksize given by admin)
>>
>> HOST A                          HOST B
>> dasd (auto)            ----->   dasd (auto)
>> dasd (auto)            ----->   image file (manual override)
>> image file (manual)    ----->   dasd (auto)
>> image file (manual)    ----->   image file (manual)
>> dasd (auto)            ----->   other host block device with manual override
>>
>> if there are different dasds or different value migration will fail (and that is
>> what we want)
>
> I guess I helped plant this idea.  Thinking about it again, I fear doing
> it just for geometry could be shortsighted.  When we do it for all
> device configuration later on, a geometry special case could get in the
> way.
>
> While you're certainly welcome to take on one of migration's long-term
> projects, I don't want to make it a prerequisite for getting DASDs more
> usable ;)
>
> Without this sanity check, we gain another way to mess up geometry by
> changing the default geometry (see "DOS partition table" above for the
> existing way).  We reuse the same answer: don't do that then.
>
> Good enough for Paolo, if I understand him correctly.  I guess it's good
> enough for me too, but please document it.  Covering the whole default
> geometry machinery would be nice.
>
>> In addition we could also implement Stefans proposal to add a
>> "autodetect" statement
>> for secs, cyls, head.... but I am not sure about libvirt support, though
>>
>> So 3 parts:
>> 1. autodetect on real DASDs
>> 2. geometry and sector size checking in generic code
>> 3. maybe an autodetect flag
>>
>> makes sense? Any guidance how to proceed?
>
> Try posting a patch just for 1., and see if anyone screams :)
>

Thanks a lot for the advise!
A small follow-up question...

Since the whole thing becomes arch-specific again (or even 
DASD-specific), I suppose I no longer need to change the driver code.

In my most recent patchset geometry was detected by HDIO_GETGEO inside
hw/block/hd-geometry.c (via arch-specific hook)
But... for the blocksize the control was passed to the driver functions 
(probe_logical_blocksize, probe_physical_blocksize) which called 
appropriate ioctls (BLKPBSZGET, BLKSSZGET) for "raw" and "host devices"

I suppose, that now there's no need for that anymore I can move the 
block ioctl calls back to hw/block/block.c and remove the introduced 
driver functions.

What do you think?

Kate.

      parent reply	other threads:[~2014-11-07 13:40 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-07-29 12:27 [Qemu-devel] [PATCH 0/4] Geometry and blocksize support for backing devices Ekaterina Tumanova
2014-07-29 12:27 ` [Qemu-devel] [PATCH 1/4] hd-geometry.c: Integrate HDIO_GETGEO in guessing for target-s390x Ekaterina Tumanova
2014-08-20  8:19   ` Paolo Bonzini
2014-08-20  9:35     ` Christian Borntraeger
2014-08-20 13:10       ` Paolo Bonzini
2014-08-25  8:21         ` Christian Borntraeger
2014-07-29 12:27 ` [Qemu-devel] [PATCH 2/4] blocksize: support auto-sensing of blocksizes Ekaterina Tumanova
2014-09-03 15:37   ` Stefan Hajnoczi
2014-07-29 12:27 ` [Qemu-devel] [PATCH 3/4] blocksize: Add driver method to get the blocksizes Ekaterina Tumanova
2014-07-29 12:27 ` [Qemu-devel] [PATCH 4/4] blocksize: add blkconf_blocksize call to all block devices Ekaterina Tumanova
2014-09-03 15:46   ` Stefan Hajnoczi
2014-09-04 10:28     ` Ekaterina Tumanova
2014-09-17 14:00       ` Stefan Hajnoczi
2014-09-04 14:15     ` Christian Borntraeger
2014-09-17 13:56       ` Stefan Hajnoczi
2014-07-29 12:37 ` [Qemu-devel] [PATCH 0/4] Geometry and blocksize support for backing devices Christian Borntraeger
2014-11-06 15:58 ` [Qemu-devel] " Christian Borntraeger
2014-11-06 17:24   ` Paolo Bonzini
2014-11-06 19:05     ` Christian Borntraeger
2014-11-07  9:17   ` Markus Armbruster
2014-11-07  9:48     ` Christian Borntraeger
2014-11-07 15:58       ` Markus Armbruster
2014-11-07 13:39     ` Ekaterina Tumanova [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=545CCB96.5020400@linux.vnet.ibm.com \
    --to=tumanova@linux.vnet.ibm.com \
    --cc=armbru@redhat.com \
    --cc=borntraeger@de.ibm.com \
    --cc=cornelia.huck@de.ibm.com \
    --cc=dahi@linux.vnet.ibm.com \
    --cc=kwolf@redhat.com \
    --cc=mihajlov@linux.vnet.ibm.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).