* [Qemu-devel] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM @ 2017-09-13 11:48 Darren Kenny [not found] ` <59B91DCA.5080405@oracle.com> 0 siblings, 1 reply; 5+ messages in thread From: Darren Kenny @ 2017-09-13 11:48 UTC (permalink / raw) To: qemu-devel Hi, It was observed during some testing of Qemu 2.9 that it appeared that if you resized a qcow2 block device while the VM is running, that an qemu-img check would report that there were leaked clusters. The steps to reproduce are: - First create the test image: # /usr/bin/qemu-img create -f qcow2 test.qcow2 10G Formatting 'test.qcow2', fmt=qcow2 size=10737418240 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 # qemu-img check test.qcow2 No errors were found on the image. - Now run a VM based here on Oracle Linux 7, but the disto really isn't important here, since the test disk is not even mounted in the VM at this point in time: # /usr/bin/qemu-kvm \ -name 'test-vm' \ -monitor stdio \ -drive id=drive_image1,if=none,snapshot=on,format=qcow2,file=./ol73-64.qcow2 \ -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,addr=0x4 \ -drive id=drive_test,if=none,format=qcow2,file=./stg.qcow2 \ -device virtio-blk-pci,id=stg,drive=drive_test,bootindex=1,serial=TARGET_DISK0,bus=pci.0,addr=0x5 \ -net bridge,br=br1 -net nic,model=virtio,macaddr=52:54:00:90:91:92 \ -m 4096 \ -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \ -vnc :0 - Resize the img size to 15g from qemu monitor (on stdio after above command) QEMU 2.5.0 monitor - type 'help' for more information (qemu) block_resize drive_test 15360 - Now, in a separate terminal, while leaving the VM running, check the img again from host side: # qemu-img check ./test.qcow2 Leaked cluster 3 refcount=1 reference=0 1 leaked clusters were found on the image. This means waste of disk space, but no harm to data. Image end offset: 327680 As it suggests above, this is not really corruption, but it is a bit misleading, and could make people think there is an issue here (hence the reason I've been asked to find a fix). What I observed, then was that if I powered down the VM, or even just quit the VM, that the subsequent check of the disk would say that everything was just fine, and there no longer were leaked clusters. In looking at the code in qcow2_truncate() it would appear that in the case where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the metadata to disk - which seems to be the case here. If I ignore the if test, and always execute the block in block/qcow2.c, lines 3250 to 3258: if (prealloc != PREALLOC_MODE_OFF) { /* Flush metadata before actually changing the image size */ ret = bdrv_flush(bs); if (ret < 0) { error_setg_errno(errp, -ret, "Failed to flush the preallocated area to disk"); return ret; } } causing the flush to always be done, then the check will succeed when the VM is still running. While I know that this resolves the issue, I can only imagine that there was some reason that this check for !PREALLOC_MODE_OFF was being done in the first place. So, I'm hoping that someone here might be able to explain to me why that check is needed, but also why it might be wrong to do the flush regardless of the value of prealloc here. If it is wrong to do that flush here, then would anyone have suggestions as to an alternative solution to this issue? Thanks, Darren. ^ permalink raw reply [flat|nested] 5+ messages in thread
[parent not found: <59B91DCA.5080405@oracle.com>]
* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM [not found] ` <59B91DCA.5080405@oracle.com> @ 2017-09-13 12:20 ` Kevin Wolf 2017-09-13 13:32 ` Darren Kenny 0 siblings, 1 reply; 5+ messages in thread From: Kevin Wolf @ 2017-09-13 12:20 UTC (permalink / raw) To: Darren Kenny; +Cc: qemu-block, qemu-devel Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben: > [Cross-posted from qemu-devel, meant to send here first] Just keep both lists in the CC for the same email. > Hi, > > It was observed during some testing of Qemu 2.9 that it appeared that if you > resized a qcow2 block device while the VM is running, that an qemu-img check > would report that there were leaked clusters. > > The steps to reproduce are: > > - First create the test image: > [...] > > - Now run a VM based here on Oracle Linux 7, but the disto really isn't > important here, since the test disk is not even mounted in the VM at this > point in time: > [...] > > - Resize the img size to 15g from qemu monitor (on stdio after above > command) > > QEMU 2.5.0 monitor - type 'help' for more information > (qemu) block_resize drive_test 15360 > > - Now, in a separate terminal, while leaving the VM running, check the img > again from host side: > > # qemu-img check ./test.qcow2 > Leaked cluster 3 refcount=1 reference=0 > > 1 leaked clusters were found on the image. > This means waste of disk space, but no harm to data. > Image end offset: 327680 > > As it suggests above, this is not really corruption, but it is a bit > misleading, and could make people think there is an issue here > (hence the reason I've been asked to find a fix). There is an issue here, which is that you are accessing the image at the same time from two separate processes. qemu is using writeback caches in order to improve performance, so only after the guest has issued a flush command to its disk or after you shut down or at least pause qemu, the changes are fully written to the image file. In qemu 2.10, you would probably see this instead: $ qemu-img check ./test.qcow2 qemu-img: Could not open './test.qcow2': Failed to get shared "write" lock Is another process using the image? This lock can be overridden, but at least it shows clearly that you are doing something that you probably shouldn't be doing. > What I observed, then was that if I powered down the VM, or even just > quit the VM, that the subsequent check of the disk would say that > everything was just fine, and there no longer were leaked clusters. Yes, quitting the VM writes out all of the cached data, so the on-disk state and the in-memory state become fully consistent again. > In looking at the code in qcow2_truncate() it would appear that in the case > where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the > metadata to disk - which seems to be the case here. > > If I ignore the if test, and always execute the block in block/qcow2.c, > lines 3250 to 3258: > > if (prealloc != PREALLOC_MODE_OFF) { > /* Flush metadata before actually changing the image size */ > ret = bdrv_flush(bs); > if (ret < 0) { > error_setg_errno(errp, -ret, > "Failed to flush the preallocated area to disk"); > return ret; > } > } > > causing the flush to always be done, then the check will succeed when > the VM is still running. > > While I know that this resolves the issue, I can only imagine that > there was some reason that this check for !PREALLOC_MODE_OFF was being > done in the first place. > > So, I'm hoping that someone here might be able to explain to me why > that check is needed, but also why it might be wrong to do the flush > regardless of the value of prealloc here. Doing a flush here wouldn't be wrong, but it's also unnecessary and would slow down the operation a bit. For the preallocation case, the goal that is achieved with the flush is that the header update with the new image size is only written if the preallocation succeeded, so that in error cases you don't end up with an image that seems to be resized, but isn't actually preallocated as requested. > If it is wrong to do that flush here, then would anyone have > suggestions as to an alternative solution to this issue? Don't call 'qemu-img check' (or basically any other operation on the image) while a VM is using the image. Kevin ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM 2017-09-13 12:20 ` [Qemu-devel] [Qemu-block] " Kevin Wolf @ 2017-09-13 13:32 ` Darren Kenny 2017-09-13 14:07 ` Kevin Wolf 0 siblings, 1 reply; 5+ messages in thread From: Darren Kenny @ 2017-09-13 13:32 UTC (permalink / raw) To: Kevin Wolf; +Cc: qemu-devel, qemu-block Hi Kevin, Thanks for getting back to me so quickly. Kevin Wolf wrote: > Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben: >> [Cross-posted from qemu-devel, meant to send here first] > > Just keep both lists in the CC for the same email. Will do. > There is an issue here, which is that you are accessing the image at > the same time from two separate processes. qemu is using writeback > caches in order to improve performance, so only after the guest has > issued a flush command to its disk or after you shut down or at least > pause qemu, the changes are fully written to the image file. In qemu > 2.10, you would probably see this instead: $ qemu-img check > ./test.qcow2 qemu-img: Could not open './test.qcow2': Failed to get > shared "write" lock Is another process using the image? This lock can > be overridden, but at least it shows clearly that you are doing > something that you probably shouldn't be doing. Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this locking behaviour, it still did the same thing as before. Does the disk need to be formatted/mounted before it's seen as locked? Or even a configure option? The version that have is: $ qemu-img --version qemu-img version 2.10.50 (v2.10.0-476-g04ef330) Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers $ qemu-system-x86_64 --version QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330) Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers The last commit I have is (as in the version string): 04ef330 tcg/tci: do not use ldst label (never implemented) >> What I observed, then was that if I powered down the VM, or even just >> quit the VM, that the subsequent check of the disk would say that >> everything was just fine, and there no longer were leaked clusters. > > Yes, quitting the VM writes out all of the cached data, so the on-disk > state and the in-memory state become fully consistent again. OK >> In looking at the code in qcow2_truncate() it would appear that in the case >> where prealloc has the value PREALLOC_MODE_OFF, that we don't flush the >> metadata to disk - which seems to be the case here. >> >> If I ignore the if test, and always execute the block in block/qcow2.c, >> lines 3250 to 3258: >> >> if (prealloc != PREALLOC_MODE_OFF) { >> /* Flush metadata before actually changing the image size */ >> ret = bdrv_flush(bs); >> if (ret< 0) { >> error_setg_errno(errp, -ret, >> "Failed to flush the preallocated area to disk"); >> return ret; >> } >> } >> >> causing the flush to always be done, then the check will succeed when >> the VM is still running. >> >> While I know that this resolves the issue, I can only imagine that >> there was some reason that this check for !PREALLOC_MODE_OFF was being >> done in the first place. >> >> So, I'm hoping that someone here might be able to explain to me why >> that check is needed, but also why it might be wrong to do the flush >> regardless of the value of prealloc here. > > Doing a flush here wouldn't be wrong, but it's also unnecessary and > would slow down the operation a bit. Sure, but how often does a resize/truncate get done? Would seem like a small impact to do it - but I agree w.r.t. the single-process access as a better solution. > For the preallocation case, the goal that is achieved with the flush is > that the header update with the new image size is only written if the > preallocation succeeded, so that in error cases you don't end up with an > image that seems to be resized, but isn't actually preallocated as > requested. Understood, thanks. >> If it is wrong to do that flush here, then would anyone have >> suggestions as to an alternative solution to this issue? > > Don't call 'qemu-img check' (or basically any other operation on the > image) while a VM is using the image. Heh, sure - if the locking was working as you describe then I think I would be fine with that :) Thanks, Darren. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM 2017-09-13 13:32 ` Darren Kenny @ 2017-09-13 14:07 ` Kevin Wolf 2017-09-13 15:33 ` Darren Kenny 0 siblings, 1 reply; 5+ messages in thread From: Kevin Wolf @ 2017-09-13 14:07 UTC (permalink / raw) To: Darren Kenny; +Cc: qemu-devel, qemu-block Am 13.09.2017 um 15:32 hat Darren Kenny geschrieben: > Hi Kevin, > > Thanks for getting back to me so quickly. > > Kevin Wolf wrote: > > Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben: > > > [Cross-posted from qemu-devel, meant to send here first] > > > > Just keep both lists in the CC for the same email. > > Will do. > > There is an issue here, which is that you are accessing the image at the > > same time from two separate processes. qemu is using writeback caches in > > order to improve performance, so only after the guest has issued a flush > > command to its disk or after you shut down or at least pause qemu, the > > changes are fully written to the image file. In qemu 2.10, you would > > probably see this instead: $ qemu-img check ./test.qcow2 qemu-img: Could > > not open './test.qcow2': Failed to get shared "write" lock Is another > > process using the image? This lock can be overridden, but at least it > > shows clearly that you are doing something that you probably shouldn't > > be doing. > > Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this > locking behaviour, it still did the same thing as before. > > Does the disk need to be formatted/mounted before it's seen as locked? > Or even a configure option? > > The version that have is: > > $ qemu-img --version > qemu-img version 2.10.50 (v2.10.0-476-g04ef330) > Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers > > $ qemu-system-x86_64 --version > QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330) > Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers > > The last commit I have is (as in the version string): > > 04ef330 tcg/tci: do not use ldst label (never implemented) This should have the locking code. It only works with relatively new Linux kernels, though (it needs F_OFD_SETLK support). If you don't have that, no locking is used even in qemu 2.10. You could try enforcing some locking by adding file.locking=on to your -drive option. If you're running an old kernel, this should print a warning message (and use some less safe locking variant). > > Doing a flush here wouldn't be wrong, but it's also unnecessary and > > would slow down the operation a bit. > Sure, but how often does a resize/truncate get done? Would seem like a > small impact to do it - but I agree w.r.t. the single-process access > as a better solution. The thing is, truncate isn't the only operation that will lead to qemu-img check reporting failure. Any cluster allocation in the image can cause the same symptom, and there it is actually very important for performance that we use the cache and do a batched write only later. So changing truncate so that this specific operation looks as if accessing the image from a second process were okay wouldn't actually make a big difference for the overall state. Maybe it's in fact better to have such attempts fail consistently. Kevin ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Qemu-devel] [Qemu-block] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM 2017-09-13 14:07 ` Kevin Wolf @ 2017-09-13 15:33 ` Darren Kenny 0 siblings, 0 replies; 5+ messages in thread From: Darren Kenny @ 2017-09-13 15:33 UTC (permalink / raw) To: Kevin Wolf; +Cc: qemu-devel, qemu-block Kevin Wolf wrote: > Am 13.09.2017 um 15:32 hat Darren Kenny geschrieben: >> Hi Kevin, >> >> Thanks for getting back to me so quickly. >> >> Kevin Wolf wrote: >>> Am 13.09.2017 um 14:00 hat Darren Kenny geschrieben: >>>> [Cross-posted from qemu-devel, meant to send here first] >>> Just keep both lists in the CC for the same email. >> Will do. >>> There is an issue here, which is that you are accessing the image at the >>> same time from two separate processes. qemu is using writeback caches in >>> order to improve performance, so only after the guest has issued a flush >>> command to its disk or after you shut down or at least pause qemu, the >>> changes are fully written to the image file. In qemu 2.10, you would >>> probably see this instead: $ qemu-img check ./test.qcow2 qemu-img: Could >>> not open './test.qcow2': Failed to get shared "write" lock Is another >>> process using the image? This lock can be overridden, but at least it >>> shows clearly that you are doing something that you probably shouldn't >>> be doing. >> Hmm, I've just updated to the HEAD of the Git repo, and I didn't see this >> locking behaviour, it still did the same thing as before. >> >> Does the disk need to be formatted/mounted before it's seen as locked? >> Or even a configure option? >> >> The version that have is: >> >> $ qemu-img --version >> qemu-img version 2.10.50 (v2.10.0-476-g04ef330) >> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers >> >> $ qemu-system-x86_64 --version >> QEMU emulator version 2.10.50 (v2.10.0-476-g04ef330) >> Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers >> >> The last commit I have is (as in the version string): >> >> 04ef330 tcg/tci: do not use ldst label (never implemented) > > This should have the locking code. It only works with relatively new > Linux kernels, though (it needs F_OFD_SETLK support). If you don't have > that, no locking is used even in qemu 2.10. > > You could try enforcing some locking by adding file.locking=on to your > -drive option. If you're running an old kernel, this should print a > warning message (and use some less safe locking variant). Ah, OK - I will need to look into that. >>> Doing a flush here wouldn't be wrong, but it's also unnecessary and >>> would slow down the operation a bit. >> Sure, but how often does a resize/truncate get done? Would seem like a >> small impact to do it - but I agree w.r.t. the single-process access >> as a better solution. > > The thing is, truncate isn't the only operation that will lead to > qemu-img check reporting failure. Any cluster allocation in the image > can cause the same symptom, and there it is actually very important for > performance that we use the cache and do a batched write only later. > > So changing truncate so that this specific operation looks as if > accessing the image from a second process were okay wouldn't actually > make a big difference for the overall state. Maybe it's in fact better > to have such attempts fail consistently. > Thanks for the explanation, and I agree that consistency is usually best. Thanks, Darren. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2017-09-13 15:33 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-09-13 11:48 [Qemu-devel] Q: Report of leaked clusters with qcow2 when disk is resized with a live VM Darren Kenny [not found] ` <59B91DCA.5080405@oracle.com> 2017-09-13 12:20 ` [Qemu-devel] [Qemu-block] " Kevin Wolf 2017-09-13 13:32 ` Darren Kenny 2017-09-13 14:07 ` Kevin Wolf 2017-09-13 15:33 ` Darren Kenny
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).