From: Max Reitz
To: Brian Taber, qemu-devel@nongnu.org
Date: Mon, 23 Nov 2015 18:57:07 +0100
Subject: Re: [Qemu-devel] qcow2 corruption repair can not proceed due to bad snapshot
Message-ID: <56535373.9070305@redhat.com>
In-Reply-To: <869ffbdbbf856eecebe5c330e166bd93@oberoncompany.com>

On 20.11.2015 20:33, Brian Taber wrote:
> I recently ran across an issue (completely my own fault) that others
> have encountered with varying details/success in fixing. I had a VM
> stuck in shutdown (windoze asking/waiting to kill a program) that I
> thought was already down when I created a snapshot on the 3 disks
> attached to the VM. After running the snapshot command I went back to
> the machine and instead of just turning off (which would have been
> better), I let the shutdown complete.
>
> Needless to say all 3 images had corruption to varying degrees. The
> first disk, system disk, was the worst.
> The other 2 had databases and
> were repairable via the "qemu-img check -r all image.img" command (with
> a bunch of messages/warnings). I suspect the limited activity on
> shutdown helped save them. The system disk would not perform a check,
> it encountered:
>
> qemu-img: Could not open 'image.img': Could not read snapshots: File too large
>
> Searching online for this returns different repair methods. I did not
> want to use an older version of qemu-img as suggested in posts, so I
> pulled the latest source and compiled it, but I got the same error
> trying to check or convert the image.
> I dug into the qcow2 code, silenced that particular error, and was able
> to get the check to actually run (I worked around the problem by
> modifying block/qcow2.c at about line 1136 to ignore the return result
> if it is 27 (EFBIG) and set res to 0; probably really bad to do, I just
> did this to get to the checks). The repair run repaired the image to
> the point the checks came back OK. Unfortunately the image was still
> broken: trying to list snapshots or use the image returned the
> "File too large" error again.
>
> Ultimately I was able to repair the system disk by converting the image
> to raw as suggested in other posts now that it was repaired and was able
> to start the machine again right where it left off (or at least it
> appears so). Disk checks within the machine return OK. One thing I am
> unsure of is how safe the qemu images are in regards to snapshots, and I
> dare not try to do anything with them as they are, and will convert to
> raw then all of them back into qemu images.

They are safe, but you may only have one program writing to an image
file at a time. Therefore, if you want to do snapshots of a live VM, you
have to do that through the respective qemu instance (e.g. using the QMP
command blockdev-snapshot-internal-sync).

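As a rough illustration of going through the running qemu instance, here
is a minimal Python sketch that sends blockdev-snapshot-internal-sync over
a QMP monitor. It assumes the VM exposes QMP on a Unix socket; the socket
path, the device name and the snapshot name in the usage note are
hypothetical placeholders, and the helper names are made up for this
example.

```python
import json
import socket


def build_snapshot_command(device, name):
    """Build the QMP message for an internal snapshot of one block device."""
    return {
        "execute": "blockdev-snapshot-internal-sync",
        "arguments": {"device": device, "name": name},
    }


def read_reply(stream):
    """Read QMP messages until one is not an asynchronous event."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        message = json.loads(line)
        if "event" not in message:
            return message


def take_internal_snapshot(socket_path, device, name):
    """Ask the running qemu instance to snapshot `device`; return its reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        stream = sock.makefile("rw", encoding="utf-8")
        read_reply(stream)  # consume the server greeting
        reply = None
        # QMP requires capability negotiation before any other command.
        for command in ({"execute": "qmp_capabilities"},
                        build_snapshot_command(device, name)):
            stream.write(json.dumps(command) + "\n")
            stream.flush()
            reply = read_reply(stream)
            if "error" in reply:
                raise RuntimeError(reply["error"])
        return reply
```

With a hypothetical socket at /tmp/qmp.sock and a drive called drive0, the
call would be take_internal_snapshot("/tmp/qmp.sock", "drive0", "snap1").
The point is only that qemu itself, as the single writer, creates the
snapshot.
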
> Even though this is entirely due to creating a snapshot while the disk
> is in use, some thoughts:
>
> - if a user is trying to run a repair it should not error about
> snapshots and proceed with checks/repairs and allow convert if possible.

I don't think this should be done silently. If qcow2 encounters errors
during the repair process, those are errors which generally mean “Trying
to repair this image may or will damage it further”. Therefore, at
least there should be a flag the user has to set to tell the qcow2
driver to ignore errors as far as possible.

Another way to do it would be a runtime option for qcow2 to ignore the
snapshot table (because apparently most of the people who ran a qemu-img
snapshot operation on an in-use image noticed that something went wrong
because loading the snapshot table fails). qemu-img convert could set
that option automatically so you can convert a qcow2 file with an
invalid snapshot table to raw (ignoring the snapshots).

> - if possible, before actually doing a snapshot, check if the file is in
> use to avoid this situation altogether

Yes, this has been proposed a couple of times and is something we will
have to do sooner or later, since so many people make the mistake of
using qemu-img on a qcow2 file that is in use by a VM (knowingly or by
mistake).

I don't know the current status of this. Some people proposed a
qcow2-specific flag in the file, but the obvious problem is that this
flag will be a nuisance if some process accessing the qcow2 file
crashes. This would be solvable by either abusing qemu-img amend for
removing that flag, or by adding a new option to qemu-img check which
allows you to override that flag if you are sure that no process is
accessing the file anymore.

Other people suggested using flock(), but that would be a Unix-specific
solution.

I'm personally leaning towards the qcow2 flag.
Having a way to reset it using some qemu-img subcommand should suffice
for the rare and not-to-be-expected (;-)) case of qemu crashing.

> I would submit a patch, but I do not know enough about the possible
> repercussions of ignoring an error and repairing/converting.

Nobody knows, it depends on what in the image is broken exactly.

Note that repairing a qcow2 image basically only means repairing the
refcount information. The snapshot table will still be broken, even if
you get qemu-img check -r all to run.

I think it would be better to focus on allowing even terminally-broken
qcow2 images to be converted to raw (or a fresh qcow2 image), saving as
much data as possible.

Max

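For reference, the flock() idea mentioned above can be sketched in a few
lines of Python. This only illustrates advisory locking on Unix, it is not
what qemu or qemu-img do; the function names and the temporary file
standing in for an image are assumptions made for the example.

```python
import fcntl
import tempfile


def try_lock_image(path):
    """Return an open file holding an exclusive flock, or None if it is taken."""
    handle = open(path, "rb+")
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def demo():
    """Take the lock once, then show that a second attempt is refused."""
    with tempfile.NamedTemporaryFile(suffix=".qcow2") as image:
        first = try_lock_image(image.name)    # plays the role of the VM
        second = try_lock_image(image.name)   # plays the role of a second writer
        result = (first is not None, second is None)
        if first is not None:
            first.close()  # closing the file releases the lock
        return result
```

Because the kernel drops the lock when the holder exits, a crashed process
would not leave a stale marker behind, unlike a flag stored in the file.
The lock is advisory, though, so it only helps if every tool checks it.
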