qemu-devel.nongnu.org archive mirror
* Re: [Qemu-devel] Converting qcow2 image to raw thin lv
       [not found] ` <CAMr-obv38E7jmVN4r-UQLoBgk5bBj9ZWWrYts6=12Pkiz62KvA@mail.gmail.com>
@ 2017-02-13 10:04   ` Kevin Wolf
  2017-02-13 12:11     ` [Qemu-devel] [Qemu-discuss] " Wolfgang Bumiller
  0 siblings, 1 reply; 2+ messages in thread
From: Kevin Wolf @ 2017-02-13 10:04 UTC (permalink / raw)
  To: Nir Soffer; +Cc: qemu-discuss, Max Reitz, qemu-block, qemu-devel

Am 12.02.2017 um 01:58 hat Nir Soffer geschrieben:
> On Sat, Feb 11, 2017 at 12:23 AM, Nir Soffer <nirsof@gmail.com> wrote:
> > Hi all,
> >
> > I'm trying to convert images (mostly qcow2) to raw format on a thin lv,
> > hoping to write only the allocated blocks to the thin lv, but
> > it seems that qemu-img cannot write a sparse image to a block
> > device.
> >
> > Here is an example:
> >
> > Create a new thin lv:
> >
> > # lvcreate --name raw-test --virtualsize 20g --thinpool pool0 ovirt-local
> >   Using default stripesize 64.00 KiB.
> >   Logical volume "raw-test" created.
> >
> > [root@voodoo6 ~]# lvs ovirt-local
> >   LV                                   VG          Attr       LSize  Pool  Origin Data%  Meta%  Move Log Cpy%Sync Convert
> >   029060ab-41ef-4dfd-9a3e-4c716c01db06 ovirt-local Vwi-a-tz-- 20.00g pool0         6.74
> >   4f207ee8-bb47-465a-9b68-cb778e070861 ovirt-local Vwi-a-tz-- 20.00g pool0         0.00
> >   7aed605e-c74c-40d8-b449-8a1bf7228b8b ovirt-local Vwi-a-tz-- 20.00g pool0         6.98
> >   ce6d08d3-350f-4afa-a0e7-7b492a1a7744 ovirt-local Vwi-a-tz-- 20.00g pool0         6.87
> >   pool0                                ovirt-local twi-aotz-- 40.00g               10.30  5.49
> >   raw-test                             ovirt-local Vwi-a-tz-- 20.00g pool0         0.00
> >
> > I want to convert this image (fresh fedora 25 installation):
> >
> > # qemu-img info fedora.qcow2
> > image: fedora.qcow2
> > file format: qcow2
> > virtual size: 20G (21474836480 bytes)
> > disk size: 1.3G
> > cluster_size: 65536
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> >     refcount bits: 16
> >     corrupt: false
> >
> > Convert the image to raw, into the new thin lv:
> >
> > # qemu-img convert -p -f qcow2 -O raw -t none -T none fedora.qcow2 /dev/ovirt-local/raw-test
> >     (100.00/100%)
> >
> > The image size was 1.3G, but now the thin lv is fully allocated:
> >
> > # lvs ovirt-local
> >   LV                                   VG          Attr       LSize  Pool  Origin Data%  Meta%  Move Log Cpy%Sync Convert
> >   029060ab-41ef-4dfd-9a3e-4c716c01db06 ovirt-local Vwi-a-tz-- 20.00g pool0         6.74
> >   4f207ee8-bb47-465a-9b68-cb778e070861 ovirt-local Vwi-a-tz-- 20.00g pool0         0.00
> >   7aed605e-c74c-40d8-b449-8a1bf7228b8b ovirt-local Vwi-a-tz-- 20.00g pool0         6.98
> >   ce6d08d3-350f-4afa-a0e7-7b492a1a7744 ovirt-local Vwi-a-tz-- 20.00g pool0         6.87
> >   pool0                                ovirt-local twi-aotz-- 40.00g               60.30  29.72
> >   raw-test                             ovirt-local Vwi-a-tz-- 20.00g pool0         100.00
> >
> > Recreate the lv:
> >
> > # lvremove -f ovirt-local/raw-test
> >   Logical volume "raw-test" successfully removed
> >
> > # lvcreate --name raw-test --virtualsize 20g --thinpool pool0 ovirt-local
> >   Using default stripesize 64.00 KiB.
> >   Logical volume "raw-test" created.
> >
> > Convert the qcow2 image to a raw sparse file:
> >
> > # qemu-img convert -p -f qcow2 -O raw -t none -T none fedora.qcow2 fedora.raw
> >     (100.00/100%)
> >
> > # qemu-img info fedora.raw
> > image: fedora.raw
> > file format: raw
> > virtual size: 20G (21474836480 bytes)
> > disk size: 1.3G
> >
> > Write the sparse file to the thin lv:
> >
> > # dd if=fedora.raw of=/dev/ovirt-local/raw-test bs=8M conv=sparse
> > 2560+0 records in
> > 2560+0 records out
> > 21474836480 bytes (21 GB) copied, 39.0065 s, 551 MB/s
> >
> > Now we are using only 7.19% of the lv:
> >
> > # lvs ovirt-local
> >   LV                                   VG          Attr       LSize  Pool  Origin Data%  Meta%  Move Log Cpy%Sync Convert
> >   029060ab-41ef-4dfd-9a3e-4c716c01db06 ovirt-local Vwi-a-tz-- 20.00g pool0         6.74
> >   4f207ee8-bb47-465a-9b68-cb778e070861 ovirt-local Vwi-a-tz-- 20.00g pool0         0.00
> >   7aed605e-c74c-40d8-b449-8a1bf7228b8b ovirt-local Vwi-a-tz-- 20.00g pool0         6.98
> >   ce6d08d3-350f-4afa-a0e7-7b492a1a7744 ovirt-local Vwi-a-tz-- 20.00g pool0         6.87
> >   pool0                                ovirt-local twi-aotz-- 40.00g               13.89  7.17
> >   raw-test                             ovirt-local Vwi-a-tz-- 20.00g pool0         7.19
> >
> > This works, but it would be nicer to have a way to convert
> > to sparse raw on a block device in one pass.
> 
> So it seems that qemu-img is trying to write a sparse image.
> 
> I tested again with an empty file:
> 
>     truncate -s 20m empty
> 
> Using strace, I see that qemu-img checks the device's discard_zeroes_data:
> 
>     ioctl(11, BLKDISCARDZEROES, 0)          = 0
> 
> Then it finds that the source is empty:
> 
>     lseek(10, 0, SEEK_DATA)                 = -1 ENXIO (No such device or address)
> 
> Then it issues one call:
> 
>     [pid 10041] ioctl(11, BLKZEROOUT, 0x7f6049c82ba0) = 0
> 
> Then it fsyncs and closes the destination.
> 
> # grep -s "" /sys/block/dm-57/queue/discard_*
> /sys/block/dm-57/queue/discard_granularity:65536
> /sys/block/dm-57/queue/discard_max_bytes:17179869184
> /sys/block/dm-57/queue/discard_zeroes_data:0
> 
> I wonder why discard_zeroes_data is 0, while discarding
> blocks seems to zero them.
> 
> It seems that this is this bug:
> https://bugzilla.redhat.com/835622
> 
> A thin lv does promise (by default) to zero newly allocated blocks,
> and it does return zeros when reading unallocated data, like
> a sparse file.
> 
> Since qemu does not know that the thin lv is not allocated, it cannot
> skip empty blocks safely.
> 
> It would be useful if it had a flag to force sparseness when the
> user knows that this operation is safe, or maybe we need a thin lvm
> driver?

Yes, I think your analysis is correct, I seem to remember that I've seen
this happen before.

The Right Thing (TM) to do, however, seems to be fixing the kernel so
that BLKDISCARDZEROES correctly returns that discard does in fact zero
out blocks on this device. As soon as this ioctl works correctly,
qemu-img should just automatically do what you want.

Now if it turns out it is important to support older kernels without the
fix, we can think about a driver-specific option for the 'file' driver
that overrides the kernel's value. But I really want to make sure that
we use such workarounds only in addition, not instead of doing the
proper root cause fix in the kernel.

So can you please bring it up with the LVM people?
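
(For reference, what the running kernel currently reports for the device
can also be queried with a reasonably recent util-linux, in addition to
the sysfs file you already looked at; on your system this should still
show 0:)

  # blockdev --getdiscardzeroes /dev/ovirt-local/raw-test
  0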

Kevin


* Re: [Qemu-devel] [Qemu-discuss] Converting qcow2 image to raw thin lv
  2017-02-13 10:04   ` [Qemu-devel] Converting qcow2 image to raw thin lv Kevin Wolf
@ 2017-02-13 12:11     ` Wolfgang Bumiller
  0 siblings, 0 replies; 2+ messages in thread
From: Wolfgang Bumiller @ 2017-02-13 12:11 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Nir Soffer, Max Reitz, qemu-devel, qemu-block, qemu-discuss

On Mon, Feb 13, 2017 at 11:04:30AM +0100, Kevin Wolf wrote:
> Am 12.02.2017 um 01:58 hat Nir Soffer geschrieben:
> > On Sat, Feb 11, 2017 at 12:23 AM, Nir Soffer <nirsof@gmail.com> wrote:
> > > Hi all,
> > >
> > > I'm trying to convert images (mostly qcow2) to raw format on a thin lv,
> > > hoping to write only the allocated blocks to the thin lv, but
> > > it seems that qemu-img cannot write a sparse image to a block
> > > device.
> > >
> > > (...)
> > 
> > So it seems that qemu-img is trying to write a sparse image.
> > 
> > I tested again with an empty file:
> > 
> >     truncate -s 20m empty
> > 
> > Using strace, I see that qemu-img checks the device's discard_zeroes_data:
> > 
> >     ioctl(11, BLKDISCARDZEROES, 0)          = 0
> > 
> > Then it finds that the source is empty:
> > 
> >     lseek(10, 0, SEEK_DATA)                 = -1 ENXIO (No such device or address)
> > 
> > Then it issues one call:
> > 
> >     [pid 10041] ioctl(11, BLKZEROOUT, 0x7f6049c82ba0) = 0
> > 
> > Then it fsyncs and closes the destination.
> > 
> > # grep -s "" /sys/block/dm-57/queue/discard_*
> > /sys/block/dm-57/queue/discard_granularity:65536
> > /sys/block/dm-57/queue/discard_max_bytes:17179869184
> > /sys/block/dm-57/queue/discard_zeroes_data:0
> > 
> > I wonder why discard_zeroes_data is 0, while discarding
> > blocks seems to zero them.
> > 
> > It seems that this is this bug:
> > https://bugzilla.redhat.com/835622
> > 
> > A thin lv does promise (by default) to zero newly allocated blocks,
> > and it does return zeros when reading unallocated data, like
> > a sparse file.
> > 
> > Since qemu does not know that the thin lv is not allocated, it cannot
> > skip empty blocks safely.
> > 
> > It would be useful if it had a flag to force sparseness when the
> > user knows that this operation is safe, or maybe we need a thin lvm
> > driver?
> 
> Yes, I think your analysis is correct, I seem to remember that I've seen
> this happen before.
> 
> The Right Thing (TM) to do, however, seems to be fixing the kernel so
> that BLKDISCARDZEROES correctly returns that discard does in fact zero
> out blocks on this device. As soon as this ioctl works correctly,
> qemu-img should just automatically do what you want.
> 
> Now if it turns out it is important to support older kernels without the
> fix, we can think about a driver-specific option for the 'file' driver
> that overrides the kernel's value. But I really want to make sure that
> we use such workarounds only in addition, not instead of doing the
> proper root cause fix in the kernel.
> 
> So can you please bring it up with the LVM people?

I'm not sure it's that easy. The discard granularity of LVM thin
volumes is not equal to their reported block/sector sizes, but to the
size of the chunks they allocate.

  # blockdev --getss /dev/dm-9
  512
  # blockdev --getbsz /dev/dm-9
  4096
  # blockdev --getpbsz /dev/dm-9
  4096
  # cat /sys/block/dm-9/queue/discard_granularity
  131072
  #

I currently don't see qemu using the discard_granularity property for
this purpose. IIRC the write_zeroes() code path, for example, simply
checks the discard_zeroes flag but not the size of the range it is
trying to zero out or discard.

We have an experimental, semi-complete "can-do-footshooting" 'zeroinit'
filter for this purpose: it explicitly sets the "has_zero_init" flag and
drops "write_zeroes()" calls for blocks at an address greater than the
highest one written so far.
It should use a dirty bitmap instead and is somewhat dangerous as it is,
which is why it hasn't been sent to the qemu-devel list. But if this
approach is at all acceptable (despite being a hack) I could improve it
and send it to the list?
https://github.com/Blub/qemu/commit/6f6f38d2ef8f22a12f72e4d60f8a1fa978ac569a
(you'd just prefix the destination with `zeroinit:` in the qemu-img
command)
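
For example (a hypothetical invocation, assuming a qemu-img built with
that filter; the `zeroinit:` prefix comes from the linked commit, not
from upstream qemu):

  # qemu-img convert -p -f qcow2 -O raw -t none -T none fedora.qcow2 \
      zeroinit:/dev/ovirt-local/raw-test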

Additionally, I'm still playing with the details and quirks of various
storage backends (lvm/dm thin, rbd, zvols) in an attempt to create a
tool to convert between them. (I did some successful tests converting
disk images between these backends and qcow2 together with their
snapshots in a COW-aware way...) I'm planning on releasing some
experimental code soon-ish (there's still some polishing to do on the
documentation, the library's API and the format - and the qcow2
support is a patch for qemu-img to use the library.)

My adventures into dm-thin metadata allow me to answer this one, though:

> > or maybe we need a thin lvm driver?

Probably not. It does not support SEEK_DATA/SEEK_HOLE and to my
knowledge also has no other sane way to query its metadata. You'd have
to read the metadata device instead. To do this properly you have to
reserve a metadata snapshot, and there can only ever be one of those per
pool, which means you could only have one such disk running on a system
in total, and no other dm-thin metadata aware tool could be used during
that time (otherwise the reserve operation will fail with an error and
qemu would have to wait & retry a lot...).
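
Roughly, reading the mappings outside of qemu with the
thin-provisioning-tools looks like this (a sketch; the device-mapper
names for the pool and its metadata volume are guesses based on how lvm
typically exposes a pool named "pool0" in the "ovirt-local" VG):

  # dmsetup message /dev/mapper/ovirt--local-pool0-tpool 0 reserve_metadata_snap
  # thin_dump --metadata-snap /dev/mapper/ovirt--local-pool0_tmeta > mappings.xml
  # dmsetup message /dev/mapper/ovirt--local-pool0-tpool 0 release_metadata_snap

Between the reserve and the release, no other metadata-aware tool can
take its own snapshot, which is exactly the limitation described above.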

