qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] Signal handling and qcow2 image corruption
@ 2008-03-05 21:18 David Barrett
  2008-03-05 21:55 ` Anthony Liguori
  0 siblings, 1 reply; 26+ messages in thread
From: David Barrett @ 2008-03-05 21:18 UTC (permalink / raw)
  To: qemu-devel

I'm tracking down an image corruption issue and I'm curious if you can 
answer the following:

1) Is there any difference between sending a "TERM" signal to the QEMU 
process and typing "quit" at the monitor?

2) Will sending TERM corrupt the 'qcow2' image (in ways other than 
normal guest OS dirty shutdown)?

3) Assuming I always start QEMU using "-loadvm", is there any risk in 
using 'kill' to send SIGTERM to the QEMU process when done?


I notice the entire implementation of "do_quit()" is simply:

	static void do_quit(void)
	{
		exit(0);
	}

So I don't see any special shutdown sequence being invoked, and I can't 
find any atexit() handler that's used in the general case.  Thus it 
would seem to me that just killing the process should be the same as 
calling "quit" via the monitor.

(I also can't find a signal handler for SIGTERM, but I might have missed 
it.)

Furthermore, if I understand "-loadvm" correctly, then any change made 
by the guest OS (including any corruption of the image caused by a dirty 
shutdown) should be blown away on the next restart.

Thus it seems that I should be able to safely start QEMU with -loadvm, 
do my thing inside the guest OS, and then just "kill" the host process, 
again and again and again, without any risk of accumulated corruption.

Is this correct, or am I misunderstanding this?

I ask because I seem to be getting accumulated corruption in my guest 
OS, but it's not totally reproducible.  Just trying to confirm my 
understanding of how QEMU works so I can narrow down the debugging options.

Thanks!

-david

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Signal handling and qcow2 image corruption
  2008-03-05 21:18 [Qemu-devel] Signal handling and qcow2 image corruption David Barrett
@ 2008-03-05 21:55 ` Anthony Liguori
  2008-03-05 23:48   ` David Barrett
                     ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-03-05 21:55 UTC (permalink / raw)
  To: qemu-devel

David Barrett wrote:
> I'm tracking down an image corruption issue and I'm curious if you can 
> answer the following:
>
> 1) Is there any difference between sending a "TERM" signal to the QEMU 
> process and typing "quit" at the monitor?

Yes.  Since QEMU is single-threaded, when you issue a quit, you know you 
aren't in the middle of writing qcow2 metadata to disk.

> 2) Will sending TERM corrupt the 'qcow2' image (in ways other than 
> normal guest OS dirty shutdown)?

Possibly, yes.

> 3) Assuming I always start QEMU using "-loadvm", is there any risk in 
> using 'kill' to send SIGTERM to the QEMU process when done?

Yes.  If you want to SIGTERM QEMU, the safest thing to do is use -snapshot.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Signal handling and qcow2 image corruption
  2008-03-05 21:55 ` Anthony Liguori
@ 2008-03-05 23:48   ` David Barrett
  2008-03-06  6:57   ` Avi Kivity
  2008-07-21 18:10   ` [Qemu-devel] qcow2 - safe on kill? safe on power fail? Jamie Lokier
  2 siblings, 0 replies; 26+ messages in thread
From: David Barrett @ 2008-03-05 23:48 UTC (permalink / raw)
  To: qemu-devel

Ah thanks, that makes a lot of sense.  Unfortunately, -snapshot doesn't 
appear to work with -loadvm: any VM snapshots created outside of 
snapshot mode are suppressed, and any VM snapshots created inside 
snapshot mode disappear on close (even if you try to commit):

===== I have a snapshot VM named 'boot' that -snapshot can't find ====
dbarrett@LappyReborn:~/rs/qa$ qemu -smb qemu -kernel-kqemu -localtime -m 
512 -monitor stdio -loadvm boot -snapshot winxp.qcow2
QEMU 0.9.0 monitor - type 'help' for more information
(qemu) Could not find snapshot 'boot' on device 'hda'
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID        TAG                 VM SIZE                DATE       VM CLOCK

============ When I try to save one it appears to work.... ===========
(qemu) savevm boot
(qemu) commit all
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         boot                    27M 2008-03-05 15:37:35   00:00:23.114
(qemu) quit

==== ...but when I start up again with -snapshot, it can't find it ====
dbarrett@LappyReborn:~/rs/qa$ qemu -smb qemu -kernel-kqemu -localtime -m 
512 -monitor stdio -loadvm boot -snapshot winxp.qcow2
QEMU 0.9.0 monitor - type 'help' for more information
(qemu) Could not find snapshot 'boot' on device 'hda'
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID        TAG                 VM SIZE                DATE       VM CLOCK
(qemu)


=== But when -snapshot is disabled, it finds my snapshot VM again ====
dbarrett@LappyReborn:~/rs/qa$ qemu -smb qemu -kernel-kqemu -localtime -m 
512 -monitor stdio -loadvm boot winxp.qcow2
QEMU 0.9.0 monitor - type 'help' for more information
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID        TAG                 VM SIZE                DATE       VM CLOCK
1         boot                    53M 2008-03-03 17:30:58   01:40:10.163
(qemu)


I think the solution is to skip -snapshot, use -loadvm, and just use a 
named pipe to send a "quit" command to the monitor in order to shut it 
down rather than SIGTERM.
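
For what it's worth, the named-pipe approach can be sketched in C.  This 
is a hedged sketch: the FIFO path is hypothetical, and it assumes QEMU 
was started with something like "-monitor pipe:/tmp/qemu-mon" so that 
/tmp/qemu-mon.in is the command pipe.

```c
#include <stdio.h>

/* Write a monitor command, e.g. "quit", to the monitor's input FIFO.
 * Assumes QEMU was started with something like
 *     qemu -monitor pipe:/tmp/qemu-mon ...
 * so that /tmp/qemu-mon.in is the command pipe (path is hypothetical).
 * Returns 0 on success, -1 on failure. */
int send_monitor_command(const char *fifo_path, const char *cmd)
{
    FILE *f = fopen(fifo_path, "w");
    if (!f)
        return -1;

    int rc = 0;
    if (fprintf(f, "%s\n", cmd) < 0)   /* monitor commands end in \n */
        rc = -1;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

Calling send_monitor_command("/tmp/qemu-mon.in", "quit") when the guest 
work is done should shut QEMU down through the same code path as typing 
"quit" at the monitor.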

Thanks for your help!

-david

Anthony Liguori wrote:
> David Barrett wrote:
>> I'm tracking down an image corruption issue and I'm curious if you can 
>> answer the following:
>>
>> 1) Is there any difference between sending a "TERM" signal to the QEMU 
>> process and typing "quit" at the monitor?
> 
> Yes.  Since QEMU is single-threaded, when you issue a quit, you know you 
> aren't in the middle of writing qcow2 metadata to disk.
> 
>> 2) Will sending TERM corrupt the 'qcow2' image (in ways other than 
>> normal guest OS dirty shutdown)?
> 
> Possibly, yes.
> 
>> 3) Assuming I always start QEMU using "-loadvm", is there any risk in 
>> using 'kill' to send SIGTERM to the QEMU process when done?
> 
> Yes.  If you want to SIGTERM QEMU, the safest thing to do is use -snapshot.
> 
> Regards,
> 
> Anthony Liguori
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] Signal handling and qcow2 image corruption
  2008-03-05 21:55 ` Anthony Liguori
  2008-03-05 23:48   ` David Barrett
@ 2008-03-06  6:57   ` Avi Kivity
  2008-07-21 18:10   ` [Qemu-devel] qcow2 - safe on kill? safe on power fail? Jamie Lokier
  2 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2008-03-06  6:57 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> David Barrett wrote:
>> I'm tracking down an image corruption issue and I'm curious if you can 
>> answer the following:
>>
>> 1) Is there any difference between sending a "TERM" signal to the 
>> QEMU process and typing "quit" at the monitor?
>
> Yes.  Since QEMU is single-threaded, when you issue a quit, you know 
> you aren't in the middle of writing qcow2 metadata to disk.
>

That's not enough.  If you write a metadata pointer before allocating 
and writing the block, and you terminate between these two operations, 
the next write allocation will leave two pointers pointing to the same 
block.

I don't know if qemu is susceptible to this.
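
As a toy model of the hazard (illustrative only, not QEMU's actual qcow2 
code): if the pointer is persisted but the allocator state isn't, the 
next allocation hands out the same block twice.

```c
/* Toy model of the ordering hazard described above (illustrative only,
 * not QEMU's qcow2 code).  `map` is the guest-cluster -> file-block
 * table; `free_block` is the allocator state. */
enum { CLUSTERS = 8 };

static long map[CLUSTERS];
static long free_block = 1;      /* block 0 means "unallocated" */

/* Allocation that writes the metadata pointer first and "crashes"
 * before the allocator state is persisted. */
static long alloc_then_crash(int cluster)
{
    map[cluster] = free_block;   /* pointer written... */
    /* ...process terminated here: free_block update never hits disk */
    return map[cluster];
}

/* The next allocation after restart reuses the same block: two guest
 * clusters now reference one file block, and a write to either one
 * silently corrupts the other. */
static long alloc_after_restart(int cluster)
{
    map[cluster] = free_block++;
    return map[cluster];
}
```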



-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-03-05 21:55 ` Anthony Liguori
  2008-03-05 23:48   ` David Barrett
  2008-03-06  6:57   ` Avi Kivity
@ 2008-07-21 18:10   ` Jamie Lokier
  2008-07-21 19:43     ` Anthony Liguori
  2 siblings, 1 reply; 26+ messages in thread
From: Jamie Lokier @ 2008-07-21 18:10 UTC (permalink / raw)
  To: qemu-devel

Quite a while ago, Anthony Liguori wrote:
> David Barrett wrote:
> >I'm tracking down an image corruption issue and I'm curious if you can 
> >answer the following:
> >
> >1) Is there any difference between sending a "TERM" signal to the QEMU 
> >process and typing "quit" at the monitor?
> 
> Yes.  Since QEMU is single-threaded, when you issue a quit, you know you 
> aren't in the middle of writing qcow2 metadata to disk.
> 
> >2) Will sending TERM corrupt the 'qcow2' image (in ways other than 
> >normal guest OS dirty shutdown)?
> 
> Possibly, yes.
> 
> >3) Assuming I always start QEMU using "-loadvm", is there any risk in 
> >using 'kill' to send SIGTERM to the QEMU process when done?
> 
> Yes.  If you want to SIGTERM QEMU, the safest thing to do is use -snapshot.

Just today, I had a running KVM instance for an important server (our
busy mail server) lock up.  It was accepting VNC
connections, but sending keystrokes, mouse movements and so on didn't
do anything.  It had been running for several weeks without any problem.
I don't have a report on whether there was a picture from VNC.

Our system manager decided there was nothing else to do, and killed
that process (SIGTERM), then restarted it.

(Unfortunately, he didn't know about the monitor and "quit".)

So far, it's looking ok, but I'm concerned about the possibility of
qcow2 corruption which the above mail says is possible.

Even if we could have used the monitor *this* time, QEMU is quite a
complex piece of software which we can't assume to be bug-free.  What
happens if KVM/QEMU locks up or crashes, in the following ways:

    - Some emulated driver crashes.  I *have* seen this happen.
      (Try '-net user -net user' on the command line.  Ok, now we know not
      to do it...).  The process dies.

    - Some emulated driver gets stuck in a loop.  You know, a bug.
      No choice but to kill the process.

    - The host machine loses power.  Host's journalled filesystem is
      fine, but what about the qcow2 images of guests?

I'm imagining that qcow2 is like a very simple filesystem format.
Real filesystems have "fsck" and/or use journalling or similar to be
robust.  Is there a "fsck" equivalent for qcow2?  (Maybe running
qemu-img convert is that?)  Does it use journalling or other
techniques internally to make sure it is difficult to corrupt, even if
the host dies unexpectedly?

If qcow2 is not resistant to sudden failures, would it be difficult to
make it more robust?

(One method which comes to mind is to use a daemon process just to
handle the disk image, communicating with QEMU.  QEMU is complex and
may occasionally have problems, but the daemon would do just one
thing, so quite likely to survive.  It won't be robust against power
failure, though, and it sounds like performance might suck.)

Or should we avoid using qcow2, for important guest servers that would
be expensive or impossible to reconstruct?

If not qcow2, are any of the other supported incremental formats
robust in these ways, e.g. the VMware one?

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 18:10   ` [Qemu-devel] qcow2 - safe on kill? safe on power fail? Jamie Lokier
@ 2008-07-21 19:43     ` Anthony Liguori
  2008-07-21 21:26       ` Jamie Lokier
  2008-07-21 22:00       ` Andreas Schwab
  0 siblings, 2 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-07-21 19:43 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
> Quite a while ago, Anthony Liguori wrote:
>   
>> David Barrett wrote:
>>     
>>> I'm tracking down an image corruption issue and I'm curious if you can 
>>> answer the following:
>>>
>>> 1) Is there any difference between sending a "TERM" signal to the QEMU 
>>> process and typing "quit" at the monitor?
>>>       
>> Yes.  Since QEMU is single-threaded, when you issue a quit, you know you 
>> aren't in the middle of writing qcow2 metadata to disk.
>>
>>     
>>> 2) Will sending TERM corrupt the 'qcow2' image (in ways other than 
>>> normal guest OS dirty shutdown)?
>>>       
>> Possibly, yes.
>>
>>     
>>> 3) Assuming I always start QEMU using "-loadvm", is there any risk in 
>>> using 'kill' to send SIGTERM to the QEMU process when done?
>>>       
>> Yes.  If you want to SIGTERM QEMU, the safest thing to do is use -snapshot.
>>     
>
> Just today, I had a running KVM instance for an important server (our
> busy mail server) lock up.  It was accepting VNC
> connections, but sending keystrokes, mouse movements and so on didn't
> do anything.  It had been running for several weeks without any problem.
> I don't have a report on whether there was a picture from VNC.
>
> Our system manager decided there was nothing else to do, and killed
> that process (SIGTERM), then restarted it.
>   

SIGTERM is about the worst thing you could do, but you're probably okay.

QCOW2 files have no journal, so they are not safe against unexpected 
power outages or hard crashes.  If you need a great deal of reliability, 
you should use a raw image.

With that said, let me explain exactly the circumstances under which 
corruption can occur, as it turns out that, in practice, the corruption 
window isn't that big.

Obviously there are no issues on the read path, so we'll stick strictly 
to the write path.

QEMU is single-threaded and QCOW2 supports asynchronous write 
operations.  There are two parts to a write operation.  The first 
discovers what offset within the QCOW2 file to write to.  If the sector 
has been previously allocated, this will consist only of read operations.  
QEMU will then issue an asynchronous write operation to the allocated sector.

Since your guest is probably using a journalled file system, you will be 
okay if something happens before that data gets written to disk[1].

If the sector hasn't been previously allocated, then a new sector in the 
file needs to be allocated.  This is going to change metadata within the 
QCOW2 file and this is where it is possible to corrupt a disk image.  
The operation of allocating a new disk sector is completely synchronous 
so no other code runs until this completes.  Once the disk sector is 
allocated, you're safe again[1].
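
As a rough sketch of those two parts (a model only, not the actual QCOW2 
code, and with made-up names like l2_table and write_lookup):

```c
/* Rough model of the write path described above (a sketch, not QEMU's
 * actual qcow2 implementation).  Part one looks up the file offset for
 * a guest sector; if the sector is unallocated, the metadata update
 * happens synchronously, and only then would the asynchronous data
 * write be issued. */
enum { GUEST_SECTORS = 16 };

static long l2_table[GUEST_SECTORS]; /* guest sector -> file offset; 0 = none */
static long next_free_offset = 1;

/* The synchronous part: no other code runs until this returns.  A
 * SIGTERM/SIGKILL landing in here is the corruption window. */
static long allocate_sector(int guest_sector)
{
    long offset = next_free_offset++;
    l2_table[guest_sector] = offset;     /* metadata write */
    return offset;
}

/* Part one of a write: find (or create) the file offset.  The data
 * write to `offset` would then be issued asynchronously. */
static long write_lookup(int guest_sector)
{
    if (l2_table[guest_sector] == 0)
        return allocate_sector(guest_sector);  /* sync metadata update */
    return l2_table[guest_sector];             /* read-only lookup path */
}
```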

Since no other code runs during this period, bugs in the device 
emulation, a user closing the SDL window, or issuing quit in the 
monitor will not corrupt the disk image.  Your guest may require an 
fsck, but the QCOW2 image will be fine.

The only ways that you can cause corruption are if the QCOW2 sector 
allocation code is faulty (and you would be screwed no matter what here) 
or if you issue a SIGTERM/SIGKILL that interrupts the code while it's 
allocating a new sector.  If your guest is hung, chances are it's not 
actively writing to disk, but this is why SIGTERM/SIGKILL is really a 
terrible thing to do.  It's really the only practical way to corrupt a 
disk image (short of a hard power outage).

If someone was sufficiently concerned, it's probably relatively 
straightforward to implement an fsck or journal for QCOW2.  This would 
allow the image to be recovered if the metadata somehow got corrupted.

With all this said, I've definitely seen corruption in QCOW2 images 
caused by crashing my host kernel.  I beat up on QEMU pretty badly 
though.  I think under normal circumstances, it's unlikely a user would 
see this in practice.

[1] It's not quite that simple.  Your host doesn't necessarily guarantee 
integrity unless: 1) you've got battery-backed cache on your disks 
(commodity disks typically aren't battery-backed) or you've disabled 
write-back; 2) you have a file system that supports barriers and barriers 
are enabled by default (they aren't enabled by default with ext2/3); 3) 
you are running QEMU with cache=off to disable host write caching.  
Basically, chances are your data is not as safe as you assume it is, and 
QEMU adds very little additional uncertainty to that unless you do 
something nasty like SIGKILL/SIGTERM while doing heavy disk IO.

Regards,

Anthony Liguori

> (Unfortunately, he didn't know about the monitor and "quit".)
>
> So far, it's looking ok, but I'm concerned about the possibility of
> qcow2 corruption which the above mail says is possible.
>
> Even if we could have used the monitor *this* time, QEMU is quite a
> complex piece of software which we can't assume to be bug-free.  What
> happens if KVM/QEMU locks up or crashes, in the following ways:
>
>     - Some emulated driver crashes.  I *have* seen this happen.
>       (Try '-net user -net user' on the command line.  Ok, now we know not
>       to do it...).  The process dies.
>
>     - Some emulated driver gets stuck in a loop.  You know, a bug.
>       No choice but to kill the process.
>
>     - The host machine loses power.  Host's journalled filesystem is
>       fine, but what about the qcow2 images of guests?
>
> I'm imagining that qcow2 is like a very simple filesystem format.
> Real filesystems have "fsck" and/or use journalling or similar to be
> robust.  Is there a "fsck" equivalent for qcow2?  (Maybe running
> qemu-img convert is that?)  Does it use journalling or other
> techniques internally to make sure it is difficult to corrupt, even if
> the host dies unexpectedly?
>
> If qcow2 is not resistant to sudden failures, would it be difficult to
> make it more robust?
>
> (One method which comes to mind is to use a daemon process just to
> handle the disk image, communicating with QEMU.  QEMU is complex and
> may occasionally have problems, but the daemon would do just one
> thing, so quite likely to survive.  It won't be robust against power
> failure, though, and it sounds like performance might suck.)
>
> Or should we avoid using qcow2, for important guest servers that would
> be expensive or impossible to reconstruct?
>
> If not qcow2, are any of the other supported incremental formats
> robust in these ways, e.g. the VMware one?
>
> Thanks,
> -- Jamie
>
>
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 19:43     ` Anthony Liguori
@ 2008-07-21 21:26       ` Jamie Lokier
  2008-07-21 22:14         ` Anthony Liguori
  2008-07-21 22:00       ` Andreas Schwab
  1 sibling, 1 reply; 26+ messages in thread
From: Jamie Lokier @ 2008-07-21 21:26 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> Since your guest is probably using a journalled file system, you will be 
> okay if something happens before that data gets written to disk[1].

Thanks Anthony, that's helpful.

> If the sector hasn't been previously allocated, then a new sector in the 
> file needs to be allocated.  This is going to change metadata within the 
> QCOW2 file and this is where it is possible to corrupt a disk image.  
> The operation of allocating a new disk sector is completely synchronous 
> so no other code runs until this completes.  Once the disk sector is 
> allocated, you're safe again[1].

My main concern is corruption of the QCOW2 sector allocation map, and
subsequently QEMU/KVM breaking or going wildly haywire with that file.

With a normal filesystem, sure, there are lots of ways to get
corruption when certain events happen.  But you don't lose the _whole_
filesystem.

My concern is that if the QCOW2 sector allocation map is corrupted by
these events, you may lose the _whole_ virtual machine, which can be a
pretty big loss.

Is the format robust enough to prevent that from being a problem?

(Backups help, but they're not good enough for things like a mail or
database server.  And how do you safely back up the image of a VM that is
running 24x7?  LVM snapshots are the only way I've thought of, and
they have a barrier problem, see below.)

> you have a file system that supports barriers and barriers 
> are enabled by default (they aren't enabled by default with ext2/3)

There was recent talk of enabling them by default for ext3.

> you are running QEMU with cache=off to disable host write caching.  

Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
fsync() does not deterministically issue a disk barrier if there's no
metadata change, so O_DIRECT writes are _less_ safe with disks which
have write-cache enabled than using normal writes.

What about using a partition, such as an LVM volume (so it can be
snapshotted without having to take down the VM)?  I'm under the
impression there is no way to issue disk barrier flushes to a
partition, so that's screwed too.  (Besides, LVM doesn't propagate
barrier requests from filesystems either...)

The last two paragraphs apply when using _any_ file format and break
the integrity of guest journalling filesystems, not just qcow2.

> Since no other code runs during this period, bugs in the device 
> emulation, a user closing the SDL window, and issuing quit in the 
> monitor, will not corrupt the disk image.  Your guest may require an 
> fsck but the QCOW2 image will be fine.

Does this apply to KVM as well?  I thought KVM had separate threads
for I/O, so problems in another subsystem might crash an I/O thread
in mid-action.  Is that work in progress?

Thanks again,
-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 19:43     ` Anthony Liguori
  2008-07-21 21:26       ` Jamie Lokier
@ 2008-07-21 22:00       ` Andreas Schwab
  2008-07-21 22:15         ` Anthony Liguori
  1 sibling, 1 reply; 26+ messages in thread
From: Andreas Schwab @ 2008-07-21 22:00 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori <anthony@codemonkey.ws> writes:

> The only ways that you can cause corruption are if the QCOW2 sector
> allocation code is faulty (and you would be screwed no matter what here)
> or if you issue a SIGTERM/SIGKILL that interrupts the code while it's
> allocating a new sector.

Blocking SIGTERM until the allocation is finished could close that hole.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 21:26       ` Jamie Lokier
@ 2008-07-21 22:14         ` Anthony Liguori
  2008-07-21 23:47           ` Jamie Lokier
  2008-07-22  6:06           ` Avi Kivity
  0 siblings, 2 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-07-21 22:14 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
>> If the sector hasn't been previously allocated, then a new sector in the 
>> file needs to be allocated.  This is going to change metadata within the 
>> QCOW2 file and this is where it is possible to corrupt a disk image.  
>> The operation of allocating a new disk sector is completely synchronous 
>> so no other code runs until this completes.  Once the disk sector is 
>> allocated, you're safe again[1].
>>     
>
> My main concern is corruption of the QCOW2 sector allocation map, and
> subsequently QEMU/KVM breaking or going wildly haywire with that file.
>
> With a normal filesystem, sure, there are lots of ways to get
> corruption when certain events happen.  But you don't lose the _whole_
> filesystem.
>   

Sure you can.  If you don't have a battery backed disk cache and are 
using write-back (which is usually the default), you can definitely get 
corruption of the journal.  Likewise, under the right scenarios, you 
will get journal corruption with the default mount options of ext3 
because it doesn't use barriers.

This is very hard to see happen in practice though because these windows 
are very small--just like with QEMU.

> My concern is that if the QCOW2 sector allocation map is corrupted by
> these events, you may lose the _whole_ virtual machine, which can be a
> pretty big loss.
>
> Is the format robust enough to prevent that from being a problem?
>   

It could be extended to contain a journal.  But that doesn't guarantee 
that you won't lose data because of your file system failing, that's the 
point I'm making.

> (Backups help, but they're not good enough for things like a mail or
> database server.  And how do you safely back up the image of a VM that is
> running 24x7?  LVM snapshots are the only way I've thought of, and
> they have a barrier problem, see below.)
>
>   
>> you have a file system that supports barriers and barriers 
>> are enabled by default (they aren't enabled by default with ext2/3)
>>     
>
> There was recent talk of enabling them by default for ext3.
>   

It's not going to happen.

>> you are running QEMU with cache=off to disable host write caching.  
>>     
>
> Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
> fsync() does not deterministically issue a disk barrier if there's no
> metadata change, so O_DIRECT writes are _less_ safe with disks which
> have write-cache enabled than using normal writes.
>   

It depends on the filesystem.  ext3 never issues any barriers by default 
:-)

I would think a good filesystem would issue a barrier after an O_DIRECT 
write.

> What about using a partition, such as an LVM volume (so it can be
> snapshotted without having to take down the VM)?  I'm under the
> impression there is no way to issue disk barrier flushes to a
> partition, so that's screwed too.  (Besides, LVM doesn't propagate
> barrier requests from filesystems either...)
>   

Unfortunately there is no userspace API to inject barriers into a disk.  
fdatasync(), maybe, but that's not the same behavior as a barrier.  I 
don't think IDE supports barriers at all, FWIW.  It only has a write-back 
and write-through mode, so if you care about data, you would have to 
enable write-through in your guest.

> The last two paragraphs apply when using _any_ file format and break
> the integrity of guest journalling filesystems, not just qcow2.
>
>   
>> Since no other code runs during this period, bugs in the device 
>> emulation, a user closing the SDL window, and issuing quit in the 
>> monitor, will not corrupt the disk image.  Your guest may require an 
>> fsck but the QCOW2 image will be fine.
>>     
>
> Does this apply to KVM as well?  I thought KVM had separate threads
> for I/O, so problems in another subsystem might crash an I/O thread in
> mid-action.  Is that work in progress?
>   

Not really.  There is a big lock that prevents two threads from ever 
running at the same time within QEMU.

Regards,

Anthony Liguori

> Thanks again,
> -- Jamie
>
>
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 22:00       ` Andreas Schwab
@ 2008-07-21 22:15         ` Anthony Liguori
  2008-07-21 22:22           ` David Barrett
  2008-07-22  6:07           ` Avi Kivity
  0 siblings, 2 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-07-21 22:15 UTC (permalink / raw)
  To: qemu-devel

Andreas Schwab wrote:
> Anthony Liguori <anthony@codemonkey.ws> writes:
>
>   
>> The only ways that you can cause corruption are if the QCOW2 sector
>> allocation code is faulty (and you would be screwed no matter what here)
>> or if you issue a SIGTERM/SIGKILL that interrupts the code while it's
>> allocating a new sector.
>>     
>
> Blocking SIGTERM until the allocation is finished could close that hole.
>   

Seems like a band-aid to me as SIGKILL is still an issue.  Plus it would 
involve modifying all disk formats, not just QCOW2.  I'd rather see 
proper journal support added to QCOW2 myself.

Regards,

Anthony Liguori

> Andreas.
>
>   

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 22:15         ` Anthony Liguori
@ 2008-07-21 22:22           ` David Barrett
  2008-07-21 22:50             ` Anthony Liguori
  2008-07-22  6:07           ` Avi Kivity
  1 sibling, 1 reply; 26+ messages in thread
From: David Barrett @ 2008-07-21 22:22 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> Andreas Schwab wrote:
>> Anthony Liguori <anthony@codemonkey.ws> writes:
>>  
>>> The only ways that you can cause corruption are if the QCOW2 sector
>>> allocation code is faulty (and you would be screwed no matter what here)
>>> or if you issue a SIGTERM/SIGKILL that interrupts the code while it's
>>> allocating a new sector.
>>
>> Blocking SIGTERM until the allocation is finished could close that hole.
> 
> Seems like a band-aid to me as SIGKILL is still an issue.  Plus it would 
> involve modifying all disk formats, not just QCOW2.  I'd rather see 
> proper journal support added to QCOW2 myself.

Well, SIGKILL is a bit more of an extreme case.  SIGTERM seems like a 
reasonable way to trigger a graceful shutdown (at least, I know I 
assumed it did for a long time, whereas I'd never assume SIGKILL was 
graceful).

-david

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 22:22           ` David Barrett
@ 2008-07-21 22:50             ` Anthony Liguori
  0 siblings, 0 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-07-21 22:50 UTC (permalink / raw)
  To: qemu-devel

David Barrett wrote:
> Anthony Liguori wrote:
>> Andreas Schwab wrote:
>>> Anthony Liguori <anthony@codemonkey.ws> writes:
>>>  
>>>> The only ways that you can cause corruption are if the QCOW2 sector
>>>> allocation code is faulty (and you would be screwed no matter what 
>>>> here)
>>>> or if you issue a SIGTERM/SIGKILL that interrupts the code while it's
>>>> allocating a new sector.
>>>
>>> Blocking SIGTERM until the allocation is finished could close that 
>>> hole.
>>
>> Seems like a band-aid to me as SIGKILL is still an issue.  Plus it 
>> would involve modifying all disk formats, not just QCOW2.  I'd rather 
>> see proper journal support added to QCOW2 myself.
>
> Well, SIGKILL is a bit more of an extreme case.  SIGTERM seems like a 
> reasonable way to trigger a graceful shutdown (at least, I know I 
> assumed it did for a long time, whereas I'd never assume SIGKILL was 
> graceful).

It would probably be reasonable to trap SIGTERM and to have it trigger 
the equivalent of the "quit" command in the monitor.  Right now, SIGTERM 
will not result in a graceful shutdown of QEMU.

Regards,

Anthony Liguori

> -david
>
>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 22:14         ` Anthony Liguori
@ 2008-07-21 23:47           ` Jamie Lokier
  2008-07-22  6:06           ` Avi Kivity
  1 sibling, 0 replies; 26+ messages in thread
From: Jamie Lokier @ 2008-07-21 23:47 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> >My main concern is corruption of the QCOW2 sector allocation map, and
> >subsequently QEMU/KVM breaking or going wildly haywire with that file.
> >
> >With a normal filesystem, sure, there are lots of ways to get
> >corruption when certain events happen.  But you don't lose the _whole_
> >filesystem.
> 
> Sure you can.  If you don't have a battery backed disk cache and are 
> using write-back (which is usually the default), you can definitely get 
> corruption of the journal.  Likewise, under the right scenarios, you 
> will get journal corruption with the default mount options of ext3 
> because it doesn't use barriers.

Well, no, when you get filesystem corruption, you don't lose the whole
filesystem.  If you have 10,000,000 files in 100GB, you might lose a
fraction of it.

Also, you're unlikely to lose anything which you're not writing to at
all at the time.  E.g. the OS installation.

My worry is if I have the same amount of data in a QCOW2, I might lose
all of it including the OS, which seems much harsher.  And there isn't
a way to recover it.

But I don't know if QCOW2 is that sensitive, that's why I'm asking.

> This is very hard to see happen in practice though because these windows 
> are very small--just like with QEMU.

The software-caused windows are non-existent on a modern filesystem
with good practice: barriers, decent disks.

> >My concern is that if the QCOW2 sector allocation map is corrupted by
> >these events, you may lose the _whole_ virtual machine, which can be a
> >pretty big loss.
> >
> >Is the format robust enough to prevent that from being a problem?
> 
> It could be extended to contain a journal.  But that doesn't guarantee 
> that you won't lose data because of your file system failing, that's the 
> point I'm making.

Erm.  I think you're answering a different question to the one I'm asking :-)

If my host filesystem is using ext3 with barriers enabled, and the
block device underlying it supports barriers, then I *never* expect to
see host filesystem corruption even on power failure, unless there is
a blatant hardware fault.

The "operation windows" for corruption are non-existent; they are
eliminated in principle.  There is *no* sequence of software events
with a time window in which a sudden power failure or crash results
in corruption.

I regard that as very robust.

However, I do expect to see corruption in QCOW2 images from killing
the process, or shutting down the host without remembering to shut
down all guests first, or losing power on the host.

It only happens on sector allocation.  But doesn't that happen quite
often - i.e. whenever the image grows?  I find my images grow very
often, in normal usage, until they are approaching the size of the
flat format.

And, apart from worrying about software corruption windows in QCOW2,
I'm worried that I'll lose the whole installed operating system,
applications, etc., if QEMU can't then read the QCOW2 image properly,
rather than just a few files.

> >>you have a file system that supports barriers and barriers 
> >>are enabled by default (they aren't enabled by default with ext2/3)
> >
> >There was recent talk of enabling them by default for ext3.
> 
> It's not going to happen.

You may be right.  This is one more reason why I'm asking myself if
ext3 on my VM hosts is such a smart idea...

ext4, however, has barriers enabled by default since 2.6.26 :-)

> >>you are running QEMU with cache=off to disable host write caching.  
> >
> >Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
> >fsync() does not deterministically issue a disk barrier if there's no
> >metadata change, so O_DIRECT writes are _less_ safe with disks which
> >have write-cache enabled than using normal writes.
>
> It depends on the filesystem.  ext3 never issues any barriers by default 
> :-)

But even with barrier=1, it doesn't issue them with O_DIRECT writes.

> I would think a good filesystem would issue a barrier after an O_DIRECT 
> write.

For O_SYNC, maybe.  But O_DIRECT: that would be more barriers than
most applications want.  Unnecessary barriers are not cheap,
especially on IDE (see below).

It would be better if fdatasync() issued the barrier if there have
been any O_DIRECT writes since the last barrier, even if there's no
cached data to write.  That gives the app a chance to decide where and
when to have barriers.

Otherwise you can't use O_DIRECT to simulate "filesystem in a file"
with similar performance characteristics as a real filesystem.

This is getting off-topic for qemu-devel though.

> >What about using a partition, such as an LVM volume (so it can be
> >snapshotted without having to take down the VM)?  I'm under the
> >impression there is no way to issue disk barrier flushes to a
> >partition, so that's screwed too.  (Besides, LVM doesn't propagate
> >barrier requests from filesystems either...)
> 
> Unfortunately there is no userspace API to inject barriers in a disk.  
> fdatasync() maybe but that's not the same behavior as a barrier?

It's not the same behaviour in theory, but in Linux they are muddled
together into the same thing.  I tried clarifying the difference on
linux-fsdevel, and just got filesystem developers telling me that
barriers on Linux block devices always imply a flush, not just
ordering, so there's no reason to have different requests.

At the application level, the best you can do with normal files is
wait until your AIO writes return, then issue and wait for fdatasync,
then start more writes.  It seems the same AIO methods could work
equally on a block device, with fdatasync sending a barrier+flush
request to the disk if there have been any preceding writes.  It would
be rather convenient too.

> I don't think IDE supports barriers at all FWIW.  It only has a
> write-back and write-through mode so if you care about data, you
> would have to enable write-through in your guest.

Not quite true.  IDE supports barriers on the host by the host kernel
waiting for writes to complete, issuing an IDE FLUSH CACHE command,
then allowing later writes to start.  It also uses the FUA
("Force Unit Access") bit to do uncached single-sector writes.  So
it does support barriers in a roundabout way, on nearly all IDE disks.

It makes a difference.  I have some devices with ext3 IDE disks that
get corrupt from time to time in normal usage (which includes pulling
the plug regularly :-), unless I enable barriers or turn off
write-cache.  But turning off write-cache slows them down a lot, and
barriers slows them down only a little, so IDE barriers are good.

> >Does this apply to KVM as well?  I thought KVM had a separate threads
> >for I/O, so problems in another subsystem might crash an I/O thread in
> >mid action.  Is that work in progress?
> 
Not really.  There is a big lock that prevents two threads from ever 
running at the same time within QEMU.

Oh.  What a curious form of threading :-)

-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 22:14         ` Anthony Liguori
  2008-07-21 23:47           ` Jamie Lokier
@ 2008-07-22  6:06           ` Avi Kivity
  2008-07-22 14:08             ` Anthony Liguori
  2008-07-22 14:32             ` Jamie Lokier
  1 sibling, 2 replies; 26+ messages in thread
From: Avi Kivity @ 2008-07-22  6:06 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> Jamie Lokier wrote:
>>> If the sector hasn't been previously allocated, then a new sector in 
>>> the file needs to be allocated.  This is going to change metadata 
>>> within the QCOW2 file and this is where it is possible to corrupt a 
>>> disk image.  The operation of allocating a new disk sector is 
>>> completely synchronous so no other code runs until this completes.  
>>> Once the disk sector is allocated, you're safe again[1].
>>>     
>>
>> My main concern is corruption of the QCOW2 sector allocation map, and
>> subsequently QEMU/KVM breaking or going wildly haywire with that file.
>>
>> With a normal filesystem, sure, there are lots of ways to get
>> corruption when certain events happen.  But you don't lose the _whole_
>> filesystem.
>>   
>
> Sure you can.  If you don't have a battery backed disk cache and are 
> using write-back (which is usually the default), you can definitely 
> get corruption of the journal.  Likewise, under the right scenarios, 
> you will get journal corruption with the default mount options of ext3 
> because it doesn't use barriers.
>

What about SCSI or SATA NCQ?  On these, barriers don't impact 
performance greatly.

> This is very hard to see happen in practice though because these 
> windows are very small--just like with QEMU.
>

The exposure window with qemu is not small.  It's as large as the page 
cache of the host.

>
>
>>> you are running QEMU with cache=off to disable host write caching.      
>>
>> Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
>> fsync() does not deterministically issue a disk barrier if there's no
>> metadata change, so O_DIRECT writes are _less_ safe with disks which
>> have write-cache enabled than using normal writes.
>>   
>
> It depends on the filesystem.  ext3 never issues any barriers by 
> default :-)
>
> I would think a good filesystem would issue a barrier after an 
> O_DIRECT write.
>

Using a disk controller that supports queueing means that you can (in 
theory at least) leave writeback turned on and yet have the disk not lie 
to you about completions.



-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-21 22:15         ` Anthony Liguori
  2008-07-21 22:22           ` David Barrett
@ 2008-07-22  6:07           ` Avi Kivity
  2008-07-22 14:11             ` Anthony Liguori
  2008-07-22 14:22             ` Jamie Lokier
  1 sibling, 2 replies; 26+ messages in thread
From: Avi Kivity @ 2008-07-22  6:07 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> Andreas Schwab wrote:
>> Anthony Liguori <anthony@codemonkey.ws> writes:
>>
>>  
>>> The only ways that you can cause corruption is if the QCOW2 sector
>>> allocation code is faulty (and you would be screwed no matter what 
>>> here)
>>> or if you issue a SIGTERM/SIGKILL that interrupts the code while it's
>>> allocating a new sector.
>>>     
>>
>> Blocking SIGTERM until the allocation is finished could close that hole.
>>   
>
> Seems like a band-aid to me as SIGKILL is still an issue.  Plus it 
> would involve modifying all disk formats, not just QCOW2.  I'd rather 
> see proper journal support added to QCOW2 myself.

Journalling is so out of fashion.  It's better to sequence the 
operations so that failure results in a leak instead of corruption.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22  6:06           ` Avi Kivity
@ 2008-07-22 14:08             ` Anthony Liguori
  2008-07-22 14:46               ` Jamie Lokier
  2008-07-22 19:11               ` Avi Kivity
  2008-07-22 14:32             ` Jamie Lokier
  1 sibling, 2 replies; 26+ messages in thread
From: Anthony Liguori @ 2008-07-22 14:08 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> Sure you can.  If you don't have a battery backed disk cache and are 
>> using write-back (which is usually the default), you can definitely 
>> get corruption of the journal.  Likewise, under the right scenarios, 
>> you will get journal corruption with the default mount options of 
>> ext3 because it doesn't use barriers.
>>
>
> What about SCSI or SATA NCQ?  On these, barriers don't impact 
> performance greatly.

Good question, I don't know the answer.  But ext3 doesn't autodetect 
SCSI/NCQ or anything.  It disables barriers by default.  Some distros 
have historically changed this behavior (SLES, I believe).

>> This is very hard to see happen in practice though because these 
>> windows are very small--just like with QEMU.
>>
>
> The exposure window with qemu is not small.  It's as large as the page 
> cache of the host.

Note I was careful to qualify my statements that cache=off was required.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22  6:07           ` Avi Kivity
@ 2008-07-22 14:11             ` Anthony Liguori
  2008-07-22 14:36               ` Avi Kivity
  2008-07-22 14:22             ` Jamie Lokier
  1 sibling, 1 reply; 26+ messages in thread
From: Anthony Liguori @ 2008-07-22 14:11 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> Journalling is so out of fashion.  It's better to sequence the 
> operations so that failure results in a leak instead of corruption.

Since the metadata is being updated synchronously, you could probably 
get away with a pretty simple journal.  Maybe even a single field that 
contains the offset you are allocating, which then gets reset once the 
allocation has completed.

When QEMU starts up again, it can look at that field, and if it's not 0, 
check for anomalies in the allocation, prune that portion of the tree, 
and then start the guest.  That's a few more writes but it's already a 
slow path so it should be okay.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22  6:07           ` Avi Kivity
  2008-07-22 14:11             ` Anthony Liguori
@ 2008-07-22 14:22             ` Jamie Lokier
  1 sibling, 0 replies; 26+ messages in thread
From: Jamie Lokier @ 2008-07-22 14:22 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> >Seems like a band-aid to me as SIGKILL is still an issue.  Plus it 
> >would involve modifying all disk formats, not just QCOW2.  I'd rather 
> >see proper journal support added to QCOW2 myself.
> 
> Journalling is so out of fashion.  It's better to sequence the 
> operations so that failure results in a leak instead of corruption.

That would be fine.  If there's too much leakage after a time, it
would be easy enough to run "qemu-img convert" to recreate the image
without the leakage.

Or trees - trees are the new journals... :-)  Still, there's always
the possibility of errors to recover from, no matter how careful you
are.

-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22  6:06           ` Avi Kivity
  2008-07-22 14:08             ` Anthony Liguori
@ 2008-07-22 14:32             ` Jamie Lokier
  1 sibling, 0 replies; 26+ messages in thread
From: Jamie Lokier @ 2008-07-22 14:32 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> The exposure window with qemu is not small.  It's as large as the page 
> cache of the host.

Ouch, that's a good point.  I hadn't thought of that.

With cache=off, the exposure window is as large as the I/O scheduling
and disk seek time between the multiple blocks written during sector
allocation.  Given how often sector allocation occurs, it's not small.

I think I'm going to just stop using QCOW2, bite the bullet, and use
large disks and flat images for all production VMs.

The small but plausible-looking possibility of losing a whole valuable
machine due to niggling things like a rare QEMU crash or host crash is
very uncool.

-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22 14:11             ` Anthony Liguori
@ 2008-07-22 14:36               ` Avi Kivity
  2008-07-22 16:16                 ` Jamie Lokier
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2008-07-22 14:36 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Journalling is so out of fashion.  It's better to sequence the 
>> operations so that failure results in a leak instead of corruption.
>
> Since the metadata is being updated synchronously, you could probably 
> get away with a pretty simple journal.  Maybe even a single field that 
> contains the offset you are allocating which then gets reset once the 
> allocation was completed.
>
>

Why would you want to get away with a simple journal when you can get 
away without one?

It's a simple matter of allocating, making sure the allocation is on 
disk, and recording that allocation in the tables.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22 14:08             ` Anthony Liguori
@ 2008-07-22 14:46               ` Jamie Lokier
  2008-07-22 19:11               ` Avi Kivity
  1 sibling, 0 replies; 26+ messages in thread
From: Jamie Lokier @ 2008-07-22 14:46 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> >What about SCSI or SATA NCQ?  On these, barriers don't impact 
> >performance greatly.
> 
> Good question, I don't know the answer.  But ext3 doesn't autodetect 
> SCSI/NCQ or anything.  It disabled barriers by default.  Some distros 
> have changed this behavior historically (SLES I believe).

Also don't forget XFS, Reiserfs.  I think they both use barriers by
default and have a correct fsync too.

SCSI/NCQ are detected by the block layer, as long as the filesystem
uses barriers.  Oh, and as long as you're not using LVM, which doesn't
pass barriers on :/

> >>This is very hard to see happen in practice though because these 
> >>windows are very small--just like with QEMU.
>
> >The exposure window with qemu is not small.  It's as large as the page 
> >cache of the host.
> 
> Note I was careful to qualify my statements that cache=off was required.

Fair point.  Unfortunately cache=off introduces other exposure windows.

With cache=on, the multiple block writes to allocate a qcow2 sector
are in fast succession, so a QEMU crash (or signal) has to happen
during this short interval.

With cache=off, those writes will take as long as the disk seeks
between them, so there's a longer time window for a QEMU crash to
corrupt the file.

Also with cache=off, there's no disk barriers on any filesystem and
any filesystem options, so there's the additional time window of disk
cache inconsistency with the platters.

Databases face the same problem on Linux, but it's often ignored.
Does anyone know what Oracle on Linux does to keep its structures
robust?

-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22 14:36               ` Avi Kivity
@ 2008-07-22 16:16                 ` Jamie Lokier
  2008-07-22 19:13                   ` Avi Kivity
  0 siblings, 1 reply; 26+ messages in thread
From: Jamie Lokier @ 2008-07-22 16:16 UTC (permalink / raw)
  To: qemu-devel

> It's a simple matter of allocating, making sure the allocation is on 
> disk, and recording that allocation in the tables.

The simple implementations are only safe if sector writes are atomic.

Opinions from Google seem divided about when you can assume that,
especially when the underlying file or device is not directly mapped
to disk sectors.

-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22 14:08             ` Anthony Liguori
  2008-07-22 14:46               ` Jamie Lokier
@ 2008-07-22 19:11               ` Avi Kivity
  1 sibling, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2008-07-22 19:11 UTC (permalink / raw)
  To: qemu-devel

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>>
>>> Sure you can.  If you don't have a battery backed disk cache and are 
>>> using write-back (which is usually the default), you can definitely 
>>> get corruption of the journal.  Likewise, under the right scenarios, 
>>> you will get journal corruption with the default mount options of 
>>> ext3 because it doesn't use barriers.
>>>
>>
>> What about SCSI or SATA NCQ?  On these, barriers don't impact 
>> performance greatly.
>
> Good question, I don't know the answer.  But ext3 doesn't autodetect 
> SCSI/NCQ or anything.  It disabled barriers by default.  Some distros 
> have changed this behavior historically (SLES I believe).
>

This ought to be on the driver level.  SCSI and NCQ disks should report 
barrier support; old IDE should report no barriers unless the user sets 
dont_care_about_performance_and_have_unlimited_warranty=1.  ext* should 
use barriers if available.

Of course this is linux-kernel material, not really on topic for this list.

>>> This is very hard to see happen in practice though because these 
>>> windows are very small--just like with QEMU.
>>>
>>
>> The exposure window with qemu is not small.  It's as large as the 
>> page cache of the host.
>
> Note I was careful to qualify my statements that cache=off was required.

Ah, okay then.

Qemu should be written assuming the underlying layers are sane; trying 
to work around Linux bugs is madness.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22 16:16                 ` Jamie Lokier
@ 2008-07-22 19:13                   ` Avi Kivity
  2008-07-22 20:04                     ` Jamie Lokier
  0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2008-07-22 19:13 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
>> It's a simple matter of allocating, making sure the allocation is on 
>> disk, and recording that allocation in the tables.
>>     
>
> The simple implementations are only safe if sector writes are atomic.
>
> Opinions from Google seem divided about when you can assume that,
> especially when the underlying file or device is not directly mapped
> to disk sectors.
>   

That's worrying.  I guess always-allocate-on-write solves that (with 
versioned roots in well-known places), but that's not qcow2 any more -- 
it's btrfs.  And given that btrfs ought to allow file-level snapshots, 
perhaps the direction should be raw files on top of btrfs (which could 
be extended to do block sharing, yay!)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22 19:13                   ` Avi Kivity
@ 2008-07-22 20:04                     ` Jamie Lokier
  2008-07-22 21:25                       ` Avi Kivity
  0 siblings, 1 reply; 26+ messages in thread
From: Jamie Lokier @ 2008-07-22 20:04 UTC (permalink / raw)
  To: qemu-devel

Avi Kivity wrote:
> >>It's a simple matter of allocating, making sure the allocation is on 
> >>disk, and recording that allocation in the tables.
> >
> >The simple implementations are only safe if sector writes are atomic.
> >
> >Opinions from Google seem divided about when you can assume that,
> >especially when the underlying file or device is not directly mapped
> >to disk sectors.
> 
> That's worrying.  I guess always-allocate-on-write solves that (with 
> versioned roots in well-known places), but that's not qcow2 any more -- 
> it's btrfs.

Fair.  Simple journalling with checksummed log records also solves the
problem without being half as clever - and it is probably easy to
retrofit to qcow2 without breaking backward compatibility.  (Old qemus
would ignore the journal.)

> And given that btrfs ought to allow file-level snapshots, perhaps
> the direction should be raw files on top of btrfs (which could be
> extended to do block sharing, yay!)

Block/extent sharing would be a nice bonus :-)

Does btrfs work on other platforms than Linux?

Also, is btrfs as good as the hype, in respect of things like fsync,
barriers, cache=off consistency etc. which we've talked about?  Maybe,
but I wouldn't assume it.

Userspace btrfs-in-a-file library would be ideal, for cross-platform
support, but I don't see it happening.

You can do raw, sparse files on ext3 or any other unix filesystem.
They are about as compact as qcow2, if you ignore compression.

The really big problem I found with sparse files is that copying them
locally, or copying them to another machine (e.g. with rsync), is
*incredibly* slow, because scanning the sparse regions takes so long;
it gets really bad if you have, say, a 100GB virtual disk (5GB used,
the rest to grow into).  "rsync --sparse" even bizarrely transmits a
lot of zero data over the network, or spends an age compressing it.

btrfs flat files will have the same problem.

The FIEMAP interface may solve it, generically on all Linux
filesystems, if copying tools are updated to use it.

-- Jamie

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] qcow2 - safe on kill?  safe on power fail?
  2008-07-22 20:04                     ` Jamie Lokier
@ 2008-07-22 21:25                       ` Avi Kivity
  0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2008-07-22 21:25 UTC (permalink / raw)
  To: qemu-devel

Jamie Lokier wrote:
>   
>> And given that btrfs ought to allow file-level snapshots, perhaps
>> the direction should be raw files on top of btrfs (which could be
>> extended to do block sharing, yay!)
>>     
>
> Block/extent sharing would be a nice bonus :-)
>
> Does btrfs work on other platforms than Linux?
>
>   

There's a Solaris port called zfs, and a bsd port called WAFL.

> Also, is btrfs as good as the hype, in respect of things like fsync,
> barriers, cache=off consistency etc. which we've talked about?  Maybe,
> but I wouldn't assume it.
>   

It had better be, as it's coming from Oracle.  O_DIRECT and barriers 
are their bread and butter (fs-wise).

>
> You can do raw, sparse files on ext3 or any other unix filesystem.
> They are about as compact as qcow2, if you ignore compression.
>
>   

Except that you lose snapshot support, etc.

> The real big problem I found with sparse files is that copying them
> locally, or copying them to another machine (e.g. with rsync) is
> *incredibly* slow because it's so slow to scan the sparse regions, and
> this gets really slow if you have, say, a 100GB virtual disk (5GB
> used, rest to grow into).  "rsync --sparse" even bizarrely transmits a
> lot of zero data over the network, or spends an age compressing it.
>
> btrfs flat files will have the same problem.
>   

There was some talk about an API to discover unallocated regions.

> The FIEMAP interface may solve it, generically on all Linux
> filesystem, if copying tools are updated to use it.
>   

Like that, but better.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2008-07-22 21:25 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-03-05 21:18 [Qemu-devel] Signal handling and qcow2 image corruption David Barrett
2008-03-05 21:55 ` Anthony Liguori
2008-03-05 23:48   ` David Barrett
2008-03-06  6:57   ` Avi Kivity
2008-07-21 18:10   ` [Qemu-devel] qcow2 - safe on kill? safe on power fail? Jamie Lokier
2008-07-21 19:43     ` Anthony Liguori
2008-07-21 21:26       ` Jamie Lokier
2008-07-21 22:14         ` Anthony Liguori
2008-07-21 23:47           ` Jamie Lokier
2008-07-22  6:06           ` Avi Kivity
2008-07-22 14:08             ` Anthony Liguori
2008-07-22 14:46               ` Jamie Lokier
2008-07-22 19:11               ` Avi Kivity
2008-07-22 14:32             ` Jamie Lokier
2008-07-21 22:00       ` Andreas Schwab
2008-07-21 22:15         ` Anthony Liguori
2008-07-21 22:22           ` David Barrett
2008-07-21 22:50             ` Anthony Liguori
2008-07-22  6:07           ` Avi Kivity
2008-07-22 14:11             ` Anthony Liguori
2008-07-22 14:36               ` Avi Kivity
2008-07-22 16:16                 ` Jamie Lokier
2008-07-22 19:13                   ` Avi Kivity
2008-07-22 20:04                     ` Jamie Lokier
2008-07-22 21:25                       ` Avi Kivity
2008-07-22 14:22             ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).