* [Qemu-devel] Signal handling and qcow2 image corruption
From: David Barrett @ 2008-03-05 21:18 UTC
To: qemu-devel

I'm tracking down an image corruption issue and I'm curious if you can
answer the following:

1) Is there any difference between sending a "TERM" signal to the QEMU
   process and typing "quit" at the monitor?

2) Will sending TERM corrupt the 'qcow2' image (in ways other than a
   normal guest OS dirty shutdown)?

3) Assuming I always start QEMU using "-loadvm", is there any risk in
   using 'kill' to send SIGTERM to the QEMU process when done?

I notice the entire implementation of "do_quit()" is simply:

    static void do_quit(void)
    {
        exit(0);
    }

So I don't see any special shutdown sequence being invoked, and I can't
find any atexit() handler that's used in the general case.  Thus it would
seem to me that just killing the process should be the same as calling
"quit" via the monitor.  (I also can't find a signal handler for SIGTERM,
but I might have missed it.)

Furthermore, if I understand "-loadvm" correctly, any change made by the
guest OS (including any corruption of the image caused by a dirty
shutdown) should be blown away on the next restart.

Thus it seems that I should be able to safely start QEMU with -loadvm, do
my thing inside the guest OS, and then just "kill" the host process, again
and again and again, without any risk of accumulated corruption.

Is this correct, or am I misunderstanding this?  I ask because I seem to
be getting accumulated corruption in my guest OS, but it's not totally
reproducible.  I'm just trying to confirm how QEMU actually behaves so I
can narrow down the debugging options.

Thanks!

-david

* Re: [Qemu-devel] Signal handling and qcow2 image corruption
From: Anthony Liguori @ 2008-03-05 21:55 UTC
To: qemu-devel

David Barrett wrote:
> I'm tracking down an image corruption issue and I'm curious if you can
> answer the following:
>
> 1) Is there any difference between sending a "TERM" signal to the QEMU
> process and typing "quit" at the monitor?

Yes.  Since QEMU is single threaded, when you issue a quit, you know you
aren't in the middle of writing qcow2 metadata to disk.

> 2) Will sending TERM corrupt the 'qcow2' image (in ways other than a
> normal guest OS dirty shutdown)?

Possibly, yes.

> 3) Assuming I always start QEMU using "-loadvm", is there any risk in
> using 'kill' to send SIGTERM to the QEMU process when done?

Yes.  If you want to SIGTERM QEMU, the safest thing to do is use
-snapshot.

Regards,

Anthony Liguori

* Re: [Qemu-devel] Signal handling and qcow2 image corruption
From: David Barrett @ 2008-03-05 23:48 UTC
To: qemu-devel

Ah thanks, that makes a lot of sense.

Unfortunately, -snapshot doesn't appear to work with -loadvm: any VM
snapshots created outside of snapshot mode are suppressed, and any VM
snapshots created inside snapshot mode disappear on close (even if you
try to commit):

===== I have a snapshot VM named 'boot' that -snapshot can't find =====
dbarrett@LappyReborn:~/rs/qa$ qemu -smb qemu -kernel-kqemu -localtime -m 512 -monitor stdio -loadvm boot -snapshot winxp.qcow2
QEMU 0.9.0 monitor - type 'help' for more information
(qemu) Could not find snapshot 'boot' on device 'hda'
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID   TAG     VM SIZE   DATE                 VM CLOCK

===== When I try to save one it appears to work... =====
(qemu) savevm boot
(qemu) commit all
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID   TAG     VM SIZE   DATE                 VM CLOCK
1    boot    27M       2008-03-05 15:37:35  00:00:23.114
(qemu) quit

===== ...but when I start up again with -snapshot, it can't find it =====
dbarrett@LappyReborn:~/rs/qa$ qemu -smb qemu -kernel-kqemu -localtime -m 512 -monitor stdio -loadvm boot -snapshot winxp.qcow2
QEMU 0.9.0 monitor - type 'help' for more information
(qemu) Could not find snapshot 'boot' on device 'hda'
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID   TAG     VM SIZE   DATE                 VM CLOCK
(qemu)

===== But when -snapshot is disabled, it finds my snapshot VM again =====
dbarrett@LappyReborn:~/rs/qa$ qemu -smb qemu -kernel-kqemu -localtime -m 512 -monitor stdio -loadvm boot winxp.qcow2
QEMU 0.9.0 monitor - type 'help' for more information
(qemu) info snapshots
Snapshot devices: hda
Snapshot list (from hda):
ID   TAG     VM SIZE   DATE                 VM CLOCK
1    boot    53M       2008-03-03 17:30:58  01:40:10.163
(qemu)

I think the solution is to skip -snapshot, use -loadvm, and just use a
named pipe to send a "quit" command to the monitor in order to shut it
down rather than SIGTERM.

Thanks for your help!

-david

Anthony Liguori wrote:
> David Barrett wrote:
>> I'm tracking down an image corruption issue and I'm curious if you can
>> answer the following:
>>
>> 1) Is there any difference between sending a "TERM" signal to the QEMU
>> process and typing "quit" at the monitor?
>
> Yes.  Since QEMU is single threaded, when you issue a quit, you know you
> aren't in the middle of writing qcow2 metadata to disk.
>
>> 2) Will sending TERM corrupt the 'qcow2' image (in ways other than a
>> normal guest OS dirty shutdown)?
>
> Possibly, yes.
>
>> 3) Assuming I always start QEMU using "-loadvm", is there any risk in
>> using 'kill' to send SIGTERM to the QEMU process when done?
>
> Yes.  If you want to SIGTERM QEMU, the safest thing to do is use
> -snapshot.
>
> Regards,
>
> Anthony Liguori

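As a minimal sketch of the named-pipe approach David describes: if QEMU's
monitor is wired to read commands from a FIFO (for example, -monitor stdio
with stdin redirected from the pipe), a small helper like the one below
can ask it to shut down cleanly instead of sending SIGTERM.  The pipe path
and the exact way the monitor is attached to it are assumptions for the
sake of illustration, not something this thread specifies.

    /* Hypothetical helper: send "quit" to a QEMU monitor that reads
     * commands from a named pipe.  The path below is only an example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char cmd[] = "quit\n";
        int fd = open("/tmp/qemu-monitor.in", O_WRONLY);

        if (fd < 0) {
            perror("open monitor pipe");
            return 1;
        }
        if (write(fd, cmd, sizeof(cmd) - 1) != (ssize_t)(sizeof(cmd) - 1)) {
            perror("write");
            close(fd);
            return 1;
        }
        close(fd);
        return 0;
    }
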
* Re: [Qemu-devel] Signal handling and qcow2 image corruption
From: Avi Kivity @ 2008-03-06 6:57 UTC
To: qemu-devel

Anthony Liguori wrote:
> David Barrett wrote:
>> I'm tracking down an image corruption issue and I'm curious if you can
>> answer the following:
>>
>> 1) Is there any difference between sending a "TERM" signal to the
>> QEMU process and typing "quit" at the monitor?
>
> Yes.  Since QEMU is single threaded, when you issue a quit, you know
> you aren't in the middle of writing qcow2 metadata to disk.

That's not enough.  If you write a metadata pointer before allocating and
writing the block, and you terminate between these two operations, the
next write allocation will leave two pointers pointing to the same block.

I don't know if qemu is susceptible to this.

--
Do not meddle in the internals of kernels, for they are subtle and quick
to panic.

* [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-21 18:10 UTC
To: qemu-devel

Quite a while ago, Anthony Liguori wrote:
> David Barrett wrote:
> > I'm tracking down an image corruption issue and I'm curious if you
> > can answer the following:
> >
> > 1) Is there any difference between sending a "TERM" signal to the
> > QEMU process and typing "quit" at the monitor?
>
> Yes.  Since QEMU is single threaded, when you issue a quit, you know
> you aren't in the middle of writing qcow2 metadata to disk.
>
> > 2) Will sending TERM corrupt the 'qcow2' image (in ways other than a
> > normal guest OS dirty shutdown)?
>
> Possibly, yes.
>
> > 3) Assuming I always start QEMU using "-loadvm", is there any risk
> > in using 'kill' to send SIGTERM to the QEMU process when done?
>
> Yes.  If you want to SIGTERM QEMU, the safest thing to do is use
> -snapshot.

Just today, I had a running KVM instance for an important server (our
busy mail server) lock up.  It was accepting VNC connections, but sending
keystrokes, mouse movements and so on didn't do anything.  It had been
running for several weeks without any problem.  I don't have a report on
whether there was a picture from VNC.

Our system manager decided there was nothing else to do, and killed that
process (SIGTERM), then restarted it.  (Unfortunately, he didn't know
about the monitor and "quit".)

So far, it's looking ok, but I'm concerned about the possibility of qcow2
corruption which the above mail says is possible.

Even if we could have used the monitor *this* time, QEMU is quite a
complex piece of software which we can't assume to be bug free.  What
happens if KVM/QEMU locks up or crashes, in the following ways:

 - Some emulated driver crashes.  I *have* seen this happen.
   (Try '-net user -net user' on the command line.  Ok, now we know not
   to do it...).  The process dies.

 - Some emulated driver gets stuck in a loop.  You know, a bug.
   No choice but to kill the process.

 - The host machine loses power.  The host's journalled filesystem is
   fine, but what about the qcow2 images of guests?

I'm imagining that qcow2 is like a very simple filesystem format.  Real
filesystems have "fsck" and/or use journalling or similar to be robust.
Is there an "fsck" equivalent for qcow2?  (Maybe running qemu-img convert
is that?)  Does it use journalling or other techniques internally to make
sure it is difficult to corrupt, even if the host dies unexpectedly?

If qcow2 is not resistant to sudden failures, would it be difficult to
make it more robust?

(One method which comes to mind is to use a daemon process just to handle
the disk image, communicating with QEMU.  QEMU is complex and may
occasionally have problems, but the daemon would do just one thing, so it
is quite likely to survive.  It won't be robust against power failure,
though, and it sounds like performance might suck.)

Or should we avoid using qcow2 for important guest servers that would be
expensive or impossible to reconstruct?

If not qcow2, are any of the other supported incremental formats robust
in these ways, e.g. the VMware one?

Thanks,
-- Jamie

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Anthony Liguori @ 2008-07-21 19:43 UTC
To: qemu-devel

Jamie Lokier wrote:
> Quite a while ago, Anthony Liguori wrote:
>> David Barrett wrote:
>>> I'm tracking down an image corruption issue and I'm curious if you
>>> can answer the following:
>>>
>>> 1) Is there any difference between sending a "TERM" signal to the
>>> QEMU process and typing "quit" at the monitor?
>>
>> Yes.  Since QEMU is single threaded, when you issue a quit, you know
>> you aren't in the middle of writing qcow2 metadata to disk.
>>
>>> 2) Will sending TERM corrupt the 'qcow2' image (in ways other than a
>>> normal guest OS dirty shutdown)?
>>
>> Possibly, yes.
>>
>>> 3) Assuming I always start QEMU using "-loadvm", is there any risk
>>> in using 'kill' to send SIGTERM to the QEMU process when done?
>>
>> Yes.  If you want to SIGTERM QEMU, the safest thing to do is use
>> -snapshot.
>
> Just today, I had a running KVM instance for an important server (our
> busy mail server) lock up.  It was accepting VNC connections, but
> sending keystrokes, mouse movements and so on didn't do anything.  It
> had been running for several weeks without any problem.  I don't have
> a report on whether there was a picture from VNC.
>
> Our system manager decided there was nothing else to do, and killed
> that process (SIGTERM), then restarted it.

SIGTERM is about the worst thing you could do, but you're probably okay.

QCOW2 files have no journal, so they are not safe against unexpected
power outages or hard crashes.  If you need a great deal of reliability,
you should use a raw image.

With that said, let me explain exactly what circumstances corruption can
occur in, as it turns out that, in practice, the corruption window isn't
that big.

Obviously there are no issues on the read path, so we'll stick strictly
to the write path.  QEMU is single-threaded and QCOW2 supports
asynchronous write operations.  There are two parts to such an operation.
The first discovers what offset within the QCOW2 file to write to.  If
the sector has been previously allocated, this will consist only of read
operations.  It will then issue an asynchronous write operation to the
allocated sector.  Since your guest is probably using a journalled file
system, you will be okay if something happens before that data gets
written to disk[1].

If the sector hasn't been previously allocated, then a new sector in the
file needs to be allocated.  This is going to change metadata within the
QCOW2 file and this is where it is possible to corrupt a disk image.  The
operation of allocating a new disk sector is completely synchronous, so
no other code runs until this completes.  Once the disk sector is
allocated, you're safe again[1].

Since no other code runs during this period, bugs in the device
emulation, a user closing the SDL window, and issuing quit in the monitor
will not corrupt the disk image.  Your guest may require an fsck but the
QCOW2 image will be fine.

The only ways that you can cause corruption are if the QCOW2 sector
allocation code is faulty (and you would be screwed no matter what here)
or if you issue a SIGTERM/SIGKILL that interrupts the code while it's
allocating a new sector.  If your guest is hung, chances are it's not
actively writing to disk, but this is why SIGTERM/SIGKILL is really a
terrible thing to do.  It's really the only practical way to corrupt a
disk image (short of a hard power outage).

If someone was sufficiently concerned, it's probably relatively
straightforward to implement an fsck or journal for QCOW2.  This would
allow the image to be recovered if the metadata somehow got corrupted.

With all this said, I've definitely seen corruption in QCOW2 images that
was caused by crashing my host kernel.  I beat up on QEMU pretty badly
though.  I think under normal circumstances it's unlikely a user would
see this in practice.

[1] It's not quite that simple.  Your host doesn't necessarily guarantee
integrity unless:

 1) you've got battery-backed cache on your disks (commodity disks
    typically aren't battery backed) or you've disabled write-back;

 2) you have a file system that supports barriers and barriers are
    enabled by default (they aren't enabled by default with ext2/3);

 3) you are running QEMU with cache=off to disable host write caching.

Basically, chances are your data is not as safe as you assume it is, and
QEMU adds very little additional uncertainty to that unless you do
something nasty like SIGKILL/SIGTERM while doing heavy disk IO.

Regards,

Anthony Liguori

> (Unfortunately, he didn't know about the monitor and "quit".)
>
> So far, it's looking ok, but I'm concerned about the possibility of
> qcow2 corruption which the above mail says is possible.
>
> Even if we could have used the monitor *this* time, QEMU is quite a
> complex piece of software which we can't assume to be bug free.  What
> happens if KVM/QEMU locks up or crashes, in the following ways:
>
>  - Some emulated driver crashes.  I *have* seen this happen.
>    (Try '-net user -net user' on the command line.  Ok, now we know
>    not to do it...).  The process dies.
>
>  - Some emulated driver gets stuck in a loop.  You know, a bug.
>    No choice but to kill the process.
>
>  - The host machine loses power.  The host's journalled filesystem is
>    fine, but what about the qcow2 images of guests?
>
> I'm imagining that qcow2 is like a very simple filesystem format.
> Real filesystems have "fsck" and/or use journalling or similar to be
> robust.  Is there an "fsck" equivalent for qcow2?  (Maybe running
> qemu-img convert is that?)  Does it use journalling or other
> techniques internally to make sure it is difficult to corrupt, even if
> the host dies unexpectedly?
>
> If qcow2 is not resistant to sudden failures, would it be difficult to
> make it more robust?
>
> (One method which comes to mind is to use a daemon process just to
> handle the disk image, communicating with QEMU.  QEMU is complex and
> may occasionally have problems, but the daemon would do just one
> thing, so it is quite likely to survive.  It won't be robust against
> power failure, though, and it sounds like performance might suck.)
>
> Or should we avoid using qcow2 for important guest servers that would
> be expensive or impossible to reconstruct?
>
> If not qcow2, are any of the other supported incremental formats
> robust in these ways, e.g. the VMware one?
>
> Thanks,
> -- Jamie

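To make the two-phase write path described above easier to follow, here
is a deliberately tiny model of it.  This is not the real QEMU/qcow2
code: the struct, the in-memory table, and the function name are all
invented for illustration.

    /* Toy model of the two-phase qcow2 write path -- NOT the actual QEMU
     * code.  Phase 1 (mapping/allocation) is the only step that touches
     * image metadata; phase 2 only writes guest data. */
    #include <stdint.h>

    struct toy_image {
        uint64_t map[1024];  /* stand-in for the qcow2 L1/L2 lookup tables */
        uint64_t next_free;  /* next unused offset in the image file */
    };

    /* Phase 1: find, or synchronously allocate, the file offset backing a
     * guest sector.  In QEMU the allocation updates on-disk metadata with
     * no other code interleaved; interrupting it (SIGKILL, host crash) is
     * the window in which the image itself can be corrupted. */
    uint64_t toy_map_sector(struct toy_image *img, unsigned sector)
    {
        if (img->map[sector] == 0) {
            img->map[sector] = img->next_free;   /* metadata update */
            img->next_free += 512;
        }
        return img->map[sector];
    }

    /* Phase 2 (not shown): the guest data is written asynchronously to
     * the offset returned above.  Interrupting that write loses guest
     * data, much like a power cut inside the guest, but leaves the
     * image's own structures intact. */
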
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-21 21:26 UTC
To: qemu-devel

Anthony Liguori wrote:
> Since your guest is probably using a journalled file system, you will
> be okay if something happens before that data gets written to disk[1].

Thanks Anthony, that's helpful.

> If the sector hasn't been previously allocated, then a new sector in
> the file needs to be allocated.  This is going to change metadata
> within the QCOW2 file and this is where it is possible to corrupt a
> disk image.  The operation of allocating a new disk sector is
> completely synchronous, so no other code runs until this completes.
> Once the disk sector is allocated, you're safe again[1].

My main concern is corruption of the QCOW2 sector allocation map, and
subsequently QEMU/KVM breaking or going wildly haywire with that file.

With a normal filesystem, sure, there are lots of ways to get corruption
when certain events happen.  But you don't lose the _whole_ filesystem.

My concern is that if the QCOW2 sector allocation map is corrupted by
these events, you may lose the _whole_ virtual machine, which can be a
pretty big loss.

Is the format robust enough to prevent that from being a problem?

(Backups help, but they're not good enough for things like a mail or
database server.  And how do you safely back up the image of a VM that is
running 24x7?  LVM snapshots are the only way I've thought of, and they
have a barrier problem, see below.)

> you have a file system that supports barriers and barriers
> are enabled by default (they aren't enabled by default with ext2/3)

There was recent talk of enabling them by default for ext3.

> you are running QEMU with cache=off to disable host write caching.

Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
fsync() does not deterministically issue a disk barrier if there's no
metadata change, so O_DIRECT writes are _less_ safe with disks which have
write-cache enabled than using normal writes.

What about using a partition, such as an LVM volume (so it can be
snapshotted without having to take down the VM)?  I'm under the
impression there is no way to issue disk barrier flushes to a partition,
so that's screwed too.  (Besides, LVM doesn't propagate barrier requests
from filesystems either...)

The last two paragraphs apply when using _any_ file format and break the
integrity of guest journalling filesystems, not just qcow2.

> Since no other code runs during this period, bugs in the device
> emulation, a user closing the SDL window, and issuing quit in the
> monitor will not corrupt the disk image.  Your guest may require an
> fsck but the QCOW2 image will be fine.

Does this apply to KVM as well?  I thought KVM had separate threads for
I/O, so problems in another subsystem might crash an I/O thread in
mid-action.  Is that work in progress?

Thanks again,
-- Jamie

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Anthony Liguori @ 2008-07-21 22:14 UTC
To: qemu-devel

Jamie Lokier wrote:
>> If the sector hasn't been previously allocated, then a new sector in
>> the file needs to be allocated.  This is going to change metadata
>> within the QCOW2 file and this is where it is possible to corrupt a
>> disk image.  The operation of allocating a new disk sector is
>> completely synchronous, so no other code runs until this completes.
>> Once the disk sector is allocated, you're safe again[1].
>
> My main concern is corruption of the QCOW2 sector allocation map, and
> subsequently QEMU/KVM breaking or going wildly haywire with that file.
>
> With a normal filesystem, sure, there are lots of ways to get
> corruption when certain events happen.  But you don't lose the _whole_
> filesystem.

Sure you can.  If you don't have a battery-backed disk cache and are
using write-back (which is usually the default), you can definitely get
corruption of the journal.  Likewise, under the right scenarios, you will
get journal corruption with the default mount options of ext3 because it
doesn't use barriers.

This is very hard to see happen in practice though, because these windows
are very small--just like with QEMU.

> My concern is that if the QCOW2 sector allocation map is corrupted by
> these events, you may lose the _whole_ virtual machine, which can be a
> pretty big loss.
>
> Is the format robust enough to prevent that from being a problem?

It could be extended to contain a journal.  But that doesn't guarantee
that you won't lose data because of your file system failing; that's the
point I'm making.

> (Backups help, but they're not good enough for things like a mail or
> database server.  And how do you safely back up the image of a VM that
> is running 24x7?  LVM snapshots are the only way I've thought of, and
> they have a barrier problem, see below.)
>
>> you have a file system that supports barriers and barriers
>> are enabled by default (they aren't enabled by default with ext2/3)
>
> There was recent talk of enabling them by default for ext3.

It's not going to happen.

>> you are running QEMU with cache=off to disable host write caching.
>
> Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
> fsync() does not deterministically issue a disk barrier if there's no
> metadata change, so O_DIRECT writes are _less_ safe with disks which
> have write-cache enabled than using normal writes.

It depends on the filesystem.  ext3 never issues any barriers by default
:-)

I would think a good filesystem would issue a barrier after an O_DIRECT
write.

> What about using a partition, such as an LVM volume (so it can be
> snapshotted without having to take down the VM)?  I'm under the
> impression there is no way to issue disk barrier flushes to a
> partition, so that's screwed too.  (Besides, LVM doesn't propagate
> barrier requests from filesystems either...)

Unfortunately there is no userspace API to inject barriers in a disk.
fdatasync() maybe, but that's not the same behavior as a barrier?

I don't think IDE supports barriers at all, FWIW.  It only has a
write-back and a write-through mode, so if you care about data, you would
have to enable write-through in your guest.

> The last two paragraphs apply when using _any_ file format and break
> the integrity of guest journalling filesystems, not just qcow2.
>
>> Since no other code runs during this period, bugs in the device
>> emulation, a user closing the SDL window, and issuing quit in the
>> monitor will not corrupt the disk image.  Your guest may require an
>> fsck but the QCOW2 image will be fine.
>
> Does this apply to KVM as well?  I thought KVM had separate threads
> for I/O, so problems in another subsystem might crash an I/O thread in
> mid-action.  Is that work in progress?

Not really.  There is a big lock that prevents two threads from ever
running at the same time within QEMU.

Regards,

Anthony Liguori

> Thanks again,
> -- Jamie

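For readers following the cache=off discussion, here is a small,
self-contained illustration (not QEMU code) of the distinction being
debated: an O_DIRECT write bypasses the host page cache, but it does not
by itself flush or order the disk's own write cache, which is why the
follow-up fdatasync() -- and whether the filesystem turns it into a real
flush, as discussed above -- matters.  The filename is arbitrary.

    /* Linux-only sketch: O_DIRECT write followed by fdatasync(). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("scratch.img", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0)
            return 1;

        /* O_DIRECT requires aligned buffers, offsets and lengths. */
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return 1;
        memset(buf, 0, 4096);

        if (pwrite(fd, buf, 4096, 0) != 4096)
            return 1;

        /* Ask for the data to reach stable storage; whether this becomes
         * a real disk-cache flush depends on filesystem and disk
         * settings, as the thread discusses. */
        if (fdatasync(fd) != 0)
            return 1;

        close(fd);
        free(buf);
        return 0;
    }
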
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-21 23:47 UTC
To: qemu-devel

Anthony Liguori wrote:
> > My main concern is corruption of the QCOW2 sector allocation map, and
> > subsequently QEMU/KVM breaking or going wildly haywire with that
> > file.
> >
> > With a normal filesystem, sure, there are lots of ways to get
> > corruption when certain events happen.  But you don't lose the
> > _whole_ filesystem.
>
> Sure you can.  If you don't have a battery-backed disk cache and are
> using write-back (which is usually the default), you can definitely
> get corruption of the journal.  Likewise, under the right scenarios,
> you will get journal corruption with the default mount options of ext3
> because it doesn't use barriers.

Well, no: when you get filesystem corruption, you don't lose the whole
filesystem.  If you have 10,000,000 files in 100GB, you might lose a
fraction of it.  Also, you're unlikely to lose anything which you're not
writing to at all at the time, e.g. the OS installation.

My worry is that if I have the same amount of data in a QCOW2, I might
lose all of it, including the OS, which seems much harsher.  And there
isn't a way to recover it.  But I don't know if QCOW2 is that sensitive;
that's why I'm asking.

> This is very hard to see happen in practice though, because these
> windows are very small--just like with QEMU.

The software-caused windows are non-existent on a modern filesystem with
good practice: barriers, decent disks.

> > My concern is that if the QCOW2 sector allocation map is corrupted
> > by these events, you may lose the _whole_ virtual machine, which can
> > be a pretty big loss.
> >
> > Is the format robust enough to prevent that from being a problem?
>
> It could be extended to contain a journal.  But that doesn't guarantee
> that you won't lose data because of your file system failing; that's
> the point I'm making.

Erm.  I think you're answering a different question to the one I'm
asking :-)

If my host filesystem is using ext3 with barriers enabled, and the block
device underlying it supports barriers, then I *never* expect to see host
filesystem corruption even on power failure, unless there is a blatant
hardware fault.  The "operation windows" for corruption are non-existent;
they are eliminated in principle.  There is *no* sequence of software
events with a time window for a sudden power failure or crash which
results in corruption.  I regard that as very robust.

However, I do expect to see corruption in QCOW2 images from killing the
process, or shutting down the host without remembering to shut down all
guests first, or losing power on the host.

It only happens on sector allocation.  But isn't that quite often --
i.e. whenever the image grows?  I find my images grow very often, in
normal usage, until they are approaching the size of the flat format.

And, apart from worrying about software corruption windows in QCOW2, I'm
worried that I'll lose the whole installed operating system,
applications, etc., if QEMU can't then read the QCOW2 image properly.
Rather than just a few files.

> >> you have a file system that supports barriers and barriers
> >> are enabled by default (they aren't enabled by default with ext2/3)
> >
> > There was recent talk of enabling them by default for ext3.
>
> It's not going to happen.

You may be right.  This is one more reason why I'm asking myself if ext3
on my VM hosts is such a smart idea...  ext4, however, has barriers
enabled by default since 2.6.26 :-)

> >> you are running QEMU with cache=off to disable host write caching.
> >
> > Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
> > fsync() does not deterministically issue a disk barrier if there's
> > no metadata change, so O_DIRECT writes are _less_ safe with disks
> > which have write-cache enabled than using normal writes.
>
> It depends on the filesystem.  ext3 never issues any barriers by
> default :-)

But even with barrier=1, it doesn't issue them with O_DIRECT writes.

> I would think a good filesystem would issue a barrier after an
> O_DIRECT write.

For O_SYNC, maybe.  But O_DIRECT: that would be more barriers than most
applications want.  Unnecessary barriers are not cheap, especially on IDE
(see below).

It would be better if fdatasync() issued the barrier if there have been
any O_DIRECT writes since the last barrier, even if there's no cached
data to write.  That gives the app a chance to decide where and when to
have barriers.  Otherwise you can't use O_DIRECT to simulate "filesystem
in a file" with similar performance characteristics as a real filesystem.

This is getting off-topic for qemu-devel though.

> > What about using a partition, such as an LVM volume (so it can be
> > snapshotted without having to take down the VM)?  I'm under the
> > impression there is no way to issue disk barrier flushes to a
> > partition, so that's screwed too.  (Besides, LVM doesn't propagate
> > barrier requests from filesystems either...)
>
> Unfortunately there is no userspace API to inject barriers in a disk.
> fdatasync() maybe, but that's not the same behavior as a barrier?

It's not the same behaviour in theory, but in Linux they are muddled
together to be the same thing.  I tried clarifying the difference on
linux-fsdevel, and just got filesystem developers telling me that
barriers on Linux block devices always imply a flush, not just ordering,
so there's no reason to have different requests.

At the application level, the best you can do with normal files is wait
until your AIO writes return, then issue and wait for fdatasync, then
start more writes.  It seems the same AIO methods could work equally on a
block device, with fdatasync sending a barrier+flush request to the disk
if there have been any preceding writes.  It would be rather convenient
too.

> I don't think IDE supports barriers at all, FWIW.  It only has a
> write-back and a write-through mode, so if you care about data, you
> would have to enable write-through in your guest.

Not quite true.  IDE supports barriers on the host by the host kernel
waiting for writes to complete, issuing an IDE FLUSH WRITE CACHE command,
then allowing later writes to start.  It also uses the FUA ("Force Unit
Access") bit to do uncached single-sector writes.  So it does support
barriers in a roundabout way, on nearly all IDE disks.

It makes a difference.  I have some devices with ext3 IDE disks that get
corrupt from time to time in normal usage (which includes pulling the
plug regularly :-), unless I enable barriers or turn off write-cache.
But turning off write-cache slows them down a lot, and barriers slow them
down only a little, so IDE barriers are good.

> > Does this apply to KVM as well?  I thought KVM had separate threads
> > for I/O, so problems in another subsystem might crash an I/O thread
> > in mid-action.  Is that work in progress?
>
> Not really.  There is a big lock that prevents two threads from ever
> running at the same time within QEMU.

Oh.  What a curious form of threading :-)

-- Jamie

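A sketch of the application-level ordering Jamie describes -- issue AIO
writes, wait for them, then use fdatasync() as the closest available
stand-in for a barrier before the next batch -- using POSIX AIO for
concreteness.  This is illustrative only; it is not how QEMU's own AIO
code is structured, and the helper name is invented.

    /* Write one buffer asynchronously, wait for it, then "barrier" with
     * fdatasync() before the caller issues any later writes. */
    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    int write_batch_then_barrier(int fd, char *data, size_t len, off_t offset)
    {
        struct aiocb cb;
        const struct aiocb *list[1] = { &cb };

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = data;
        cb.aio_nbytes = len;
        cb.aio_offset = offset;

        if (aio_write(&cb) != 0)
            return -1;

        /* Wait for the write to complete... */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        if (aio_return(&cb) != (ssize_t)len)
            return -1;

        /* ...then flush before any ordering-dependent write is issued. */
        return fdatasync(fd);
    }
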
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Avi Kivity @ 2008-07-22 6:06 UTC
To: qemu-devel

Anthony Liguori wrote:
> Jamie Lokier wrote:
>>> If the sector hasn't been previously allocated, then a new sector in
>>> the file needs to be allocated.  This is going to change metadata
>>> within the QCOW2 file and this is where it is possible to corrupt a
>>> disk image.  The operation of allocating a new disk sector is
>>> completely synchronous, so no other code runs until this completes.
>>> Once the disk sector is allocated, you're safe again[1].
>>
>> My main concern is corruption of the QCOW2 sector allocation map, and
>> subsequently QEMU/KVM breaking or going wildly haywire with that file.
>>
>> With a normal filesystem, sure, there are lots of ways to get
>> corruption when certain events happen.  But you don't lose the
>> _whole_ filesystem.
>
> Sure you can.  If you don't have a battery-backed disk cache and are
> using write-back (which is usually the default), you can definitely
> get corruption of the journal.  Likewise, under the right scenarios,
> you will get journal corruption with the default mount options of ext3
> because it doesn't use barriers.

What about SCSI or SATA NCQ?  On these, barriers don't impact performance
greatly.

> This is very hard to see happen in practice though, because these
> windows are very small--just like with QEMU.

The exposure window with qemu is not small.  It's as large as the page
cache of the host.

>>> you are running QEMU with cache=off to disable host write caching.
>>
>> Doesn't that use O_DIRECT?  O_DIRECT writes don't use barriers, and
>> fsync() does not deterministically issue a disk barrier if there's no
>> metadata change, so O_DIRECT writes are _less_ safe with disks which
>> have write-cache enabled than using normal writes.
>
> It depends on the filesystem.  ext3 never issues any barriers by
> default :-)
>
> I would think a good filesystem would issue a barrier after an
> O_DIRECT write.

Using a disk controller that supports queueing means that you can (in
theory at least) leave writeback turned on and yet have the disk not lie
to you about completions.

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Anthony Liguori @ 2008-07-22 14:08 UTC
To: qemu-devel

Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> Sure you can.  If you don't have a battery-backed disk cache and are
>> using write-back (which is usually the default), you can definitely
>> get corruption of the journal.  Likewise, under the right scenarios,
>> you will get journal corruption with the default mount options of
>> ext3 because it doesn't use barriers.
>
> What about SCSI or SATA NCQ?  On these, barriers don't impact
> performance greatly.

Good question, I don't know the answer.  But ext3 doesn't autodetect
SCSI/NCQ or anything.  It disables barriers by default.  Some distros
have changed this behavior historically (SLES, I believe).

>> This is very hard to see happen in practice though, because these
>> windows are very small--just like with QEMU.
>
> The exposure window with qemu is not small.  It's as large as the page
> cache of the host.

Note I was careful to qualify my statements that cache=off was required.

Regards,

Anthony Liguori

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-22 14:46 UTC
To: qemu-devel

Anthony Liguori wrote:
> > What about SCSI or SATA NCQ?  On these, barriers don't impact
> > performance greatly.
>
> Good question, I don't know the answer.  But ext3 doesn't autodetect
> SCSI/NCQ or anything.  It disables barriers by default.  Some distros
> have changed this behavior historically (SLES, I believe).

Also don't forget XFS and Reiserfs.  I think they both use barriers by
default and have a correct fsync too.

SCSI/NCQ are detected by the block layer, as long as the filesystem uses
barriers.  Oh, and as long as you're not using LVM, which doesn't pass on
barriers :/

> >> This is very hard to see happen in practice though, because these
> >> windows are very small--just like with QEMU.
> >
> > The exposure window with qemu is not small.  It's as large as the
> > page cache of the host.
>
> Note I was careful to qualify my statements that cache=off was
> required.

Fair point.  Unfortunately cache=off introduces other exposure windows.

With cache=on, the multiple block writes to allocate a qcow2 sector
happen in fast succession, so a QEMU crash (or signal) has to happen
during this short interval.  With cache=off, those writes will take as
long as the disk seeks between them, so there's a longer time window for
a QEMU crash to corrupt the file.

Also with cache=off, there are no disk barriers on any filesystem and any
filesystem options, so there's the additional time window of disk cache
inconsistency with the platters.

Databases face the same problem on Linux, but it's often ignored.  Does
anyone know what Oracle on Linux does to keep its structures robust?

-- Jamie

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Avi Kivity @ 2008-07-22 19:11 UTC
To: qemu-devel

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>>
>>> Sure you can.  If you don't have a battery-backed disk cache and are
>>> using write-back (which is usually the default), you can definitely
>>> get corruption of the journal.  Likewise, under the right scenarios,
>>> you will get journal corruption with the default mount options of
>>> ext3 because it doesn't use barriers.
>>
>> What about SCSI or SATA NCQ?  On these, barriers don't impact
>> performance greatly.
>
> Good question, I don't know the answer.  But ext3 doesn't autodetect
> SCSI/NCQ or anything.  It disables barriers by default.  Some distros
> have changed this behavior historically (SLES, I believe).

This ought to be on the driver level.  SCSI and NCQ disks should report
barrier support; old IDE should report no barriers unless the user sets
dont_care_about_performance_and_have_unlimited_warranty=1.  ext* should
use barriers if available.

Of course this is linux-kernel material, not really on topic for this
list.

>>> This is very hard to see happen in practice though, because these
>>> windows are very small--just like with QEMU.
>>
>> The exposure window with qemu is not small.  It's as large as the
>> page cache of the host.
>
> Note I was careful to qualify my statements that cache=off was
> required.

Ah, okay then.  Qemu should be written assuming the underlying layers are
sane; trying to work around Linux bugs is madness.

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-22 14:32 UTC
To: qemu-devel

Avi Kivity wrote:
> The exposure window with qemu is not small.  It's as large as the page
> cache of the host.

Ouch, that's a good point.  I hadn't thought of that.

With cache=off, the exposure window is as large as the I/O scheduling and
disk seek time between the multiple blocks written during sector
allocation.  Given how often sector allocation occurs, it's not small.

I think I'm going to just stop using QCOW2, bite the bullet, and use
large disks and flat images for all production VMs.  The small but
plausible-looking possibility of losing a whole valuable machine due to
niggling things like a rare QEMU crash or host crash is very uncool.

-- Jamie

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Andreas Schwab @ 2008-07-21 22:00 UTC
To: qemu-devel

Anthony Liguori <anthony@codemonkey.ws> writes:

> The only ways that you can cause corruption are if the QCOW2 sector
> allocation code is faulty (and you would be screwed no matter what
> here) or if you issue a SIGTERM/SIGKILL that interrupts the code while
> it's allocating a new sector.

Blocking SIGTERM until the allocation is finished could close that hole.

Andreas.

--
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

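A minimal sketch of what Andreas suggests, in plain POSIX terms: mask
SIGTERM (and SIGINT) for the duration of the metadata update so the
signal is only delivered once the image is consistent again.  The
surrounding function and where it would hook into the block driver are
assumptions, not existing QEMU code.

    #include <signal.h>

    void allocate_cluster_protected(void)
    {
        sigset_t block, old;

        sigemptyset(&block);
        sigaddset(&block, SIGTERM);
        sigaddset(&block, SIGINT);

        /* Delay delivery of SIGTERM/SIGINT while metadata is rewritten. */
        sigprocmask(SIG_BLOCK, &block, &old);

        /* ... the synchronous qcow2 metadata update would go here ... */

        /* Restore the old mask; a pending SIGTERM is delivered now,
         * when the image is consistent again. */
        sigprocmask(SIG_SETMASK, &old, NULL);
    }
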
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Anthony Liguori @ 2008-07-21 22:15 UTC
To: qemu-devel

Andreas Schwab wrote:
> Anthony Liguori <anthony@codemonkey.ws> writes:
>
>> The only ways that you can cause corruption are if the QCOW2 sector
>> allocation code is faulty (and you would be screwed no matter what
>> here) or if you issue a SIGTERM/SIGKILL that interrupts the code
>> while it's allocating a new sector.
>
> Blocking SIGTERM until the allocation is finished could close that
> hole.

Seems like a band-aid to me as SIGKILL is still an issue.  Plus it would
involve modifying all disk formats, not just QCOW2.  I'd rather see
proper journal support added to QCOW2 myself.

Regards,

Anthony Liguori

> Andreas.

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: David Barrett @ 2008-07-21 22:22 UTC
To: qemu-devel

Anthony Liguori wrote:
> Andreas Schwab wrote:
>> Anthony Liguori <anthony@codemonkey.ws> writes:
>>
>>> The only ways that you can cause corruption are if the QCOW2 sector
>>> allocation code is faulty (and you would be screwed no matter what
>>> here) or if you issue a SIGTERM/SIGKILL that interrupts the code
>>> while it's allocating a new sector.
>>
>> Blocking SIGTERM until the allocation is finished could close that
>> hole.
>
> Seems like a band-aid to me as SIGKILL is still an issue.  Plus it
> would involve modifying all disk formats, not just QCOW2.  I'd rather
> see proper journal support added to QCOW2 myself.

Well, SIGKILL is a bit more of an extreme case.  SIGTERM seems like a
reasonable way to trigger a graceful shutdown (at least, I know I assumed
it did for a long time, whereas I'd never assume SIGKILL was graceful).

-david

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Anthony Liguori @ 2008-07-21 22:50 UTC
To: qemu-devel

David Barrett wrote:
> Anthony Liguori wrote:
>> Andreas Schwab wrote:
>>> Anthony Liguori <anthony@codemonkey.ws> writes:
>>>
>>>> The only ways that you can cause corruption are if the QCOW2 sector
>>>> allocation code is faulty (and you would be screwed no matter what
>>>> here) or if you issue a SIGTERM/SIGKILL that interrupts the code
>>>> while it's allocating a new sector.
>>>
>>> Blocking SIGTERM until the allocation is finished could close that
>>> hole.
>>
>> Seems like a band-aid to me as SIGKILL is still an issue.  Plus it
>> would involve modifying all disk formats, not just QCOW2.  I'd rather
>> see proper journal support added to QCOW2 myself.
>
> Well, SIGKILL is a bit more of an extreme case.  SIGTERM seems like a
> reasonable way to trigger a graceful shutdown (at least, I know I
> assumed it did for a long time, whereas I'd never assume SIGKILL was
> graceful).

It would probably be reasonable to trap SIGTERM and to have it trigger
the equivalent of the "quit" command in the monitor.  Right now, SIGTERM
will not result in a graceful shutdown of QEMU.

Regards,

Anthony Liguori

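A sketch of the behaviour Anthony proposes, written against a generic
single-threaded main loop: catch SIGTERM, set a flag, and let the main
loop run the same shutdown path as the monitor's "quit" at a safe point.
The flag name and the main-loop hook are invented for illustration; as
noted above, QEMU at the time did not install such a handler.

    #include <signal.h>
    #include <string.h>

    static volatile sig_atomic_t shutdown_requested;

    static void term_handler(int sig)
    {
        (void)sig;
        shutdown_requested = 1;    /* checked between I/O operations */
    }

    int install_term_handler(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = term_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;  /* don't break interrupted syscalls */
        return sigaction(SIGTERM, &sa, NULL);
    }

    /* In the main loop, between device emulation and I/O steps:
     *
     *     if (shutdown_requested)
     *         do_quit();   // same orderly path as the monitor command
     */
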
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Avi Kivity @ 2008-07-22 6:07 UTC
To: qemu-devel

Anthony Liguori wrote:
> Andreas Schwab wrote:
>> Anthony Liguori <anthony@codemonkey.ws> writes:
>>
>>> The only ways that you can cause corruption are if the QCOW2 sector
>>> allocation code is faulty (and you would be screwed no matter what
>>> here) or if you issue a SIGTERM/SIGKILL that interrupts the code
>>> while it's allocating a new sector.
>>
>> Blocking SIGTERM until the allocation is finished could close that
>> hole.
>
> Seems like a band-aid to me as SIGKILL is still an issue.  Plus it
> would involve modifying all disk formats, not just QCOW2.  I'd rather
> see proper journal support added to QCOW2 myself.

Journalling is so out of fashion.  It's better to sequence the operations
so that failure results in a leak instead of corruption.

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Anthony Liguori @ 2008-07-22 14:11 UTC
To: qemu-devel

Avi Kivity wrote:
> Journalling is so out of fashion.  It's better to sequence the
> operations so that failure results in a leak instead of corruption.

Since the metadata is being updated synchronously, you could probably get
away with a pretty simple journal.  Maybe even a single field that
contains the offset you are allocating, which then gets reset once the
allocation is completed.

When QEMU starts up again, it can look at that field, and if it's not 0,
check for anomalies in the allocation, prune that portion of the tree,
and then start the guest.

That's a few more writes, but it's already a slow path so it should be
okay.

Regards,

Anthony Liguori

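A sketch of the single-field "intent record" Anthony describes, with an
invented header layout and helper: write the in-flight allocation offset
to a fixed slot and sync it before touching the tables, clear it
afterwards, and check the slot at startup.  Endianness handling is
omitted for brevity.

    #include <stdint.h>
    #include <unistd.h>

    #define INFLIGHT_SLOT_OFFSET 40   /* imaginary header field */

    static int write_u64_at(int fd, off_t off, uint64_t v)
    {
        return pwrite(fd, &v, sizeof(v), off) == sizeof(v) ? 0 : -1;
    }

    int allocate_with_intent(int fd, uint64_t new_cluster_offset)
    {
        /* 1. Record the intent and make sure it is on disk. */
        if (write_u64_at(fd, INFLIGHT_SLOT_OFFSET, new_cluster_offset) < 0 ||
            fdatasync(fd) < 0)
            return -1;

        /* 2. ... perform the actual metadata updates for the allocation ... */

        /* 3. Clear the intent record once the allocation is complete.
         *    On startup, a non-zero slot means the last allocation may be
         *    incomplete and should be checked and pruned. */
        if (write_u64_at(fd, INFLIGHT_SLOT_OFFSET, 0) < 0 || fdatasync(fd) < 0)
            return -1;

        return 0;
    }
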
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Avi Kivity @ 2008-07-22 14:36 UTC
To: qemu-devel

Anthony Liguori wrote:
> Avi Kivity wrote:
>> Journalling is so out of fashion.  It's better to sequence the
>> operations so that failure results in a leak instead of corruption.
>
> Since the metadata is being updated synchronously, you could probably
> get away with a pretty simple journal.  Maybe even a single field that
> contains the offset you are allocating, which then gets reset once the
> allocation is completed.

Why would you want to get away with a simple journal when you can get
away without one?

It's a simple matter of allocating, making sure the allocation is on
disk, and recording that allocation in the tables.

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.

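A sketch of the journal-free ordering Avi describes, again with invented
helpers: make the newly allocated cluster and its data stable before
publishing a pointer to it, so that a crash between the two steps leaks
space rather than corrupting the tables.

    #include <stdint.h>
    #include <unistd.h>

    int allocate_then_publish(int fd, off_t new_cluster, const void *data,
                              size_t len, off_t table_entry_pos)
    {
        uint64_t ptr = (uint64_t)new_cluster;

        /* 1. Write the data into space that no table entry references yet,
         *    and make it stable while it is still unreachable. */
        if (pwrite(fd, data, len, new_cluster) != (ssize_t)len)
            return -1;
        if (fdatasync(fd) < 0)
            return -1;

        /* 2. Only now point the table at it.  A crash before this point
         *    merely leaks the cluster; a crash after it is harmless, and
         *    at no point do two entries share a block. */
        if (pwrite(fd, &ptr, sizeof(ptr), table_entry_pos) != sizeof(ptr))
            return -1;
        return fdatasync(fd);
    }
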
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-22 16:16 UTC
To: qemu-devel

> It's a simple matter of allocating, making sure the allocation is on
> disk, and recording that allocation in the tables.

The simple implementations are only safe if sector writes are atomic.

Opinions from Google seem divided about when you can assume that,
especially when the underlying file or device is not directly mapped to
disk sectors.

-- Jamie

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Avi Kivity @ 2008-07-22 19:13 UTC
To: qemu-devel

Jamie Lokier wrote:
>> It's a simple matter of allocating, making sure the allocation is on
>> disk, and recording that allocation in the tables.
>
> The simple implementations are only safe if sector writes are atomic.
>
> Opinions from Google seem divided about when you can assume that,
> especially when the underlying file or device is not directly mapped
> to disk sectors.

That's worrying.  I guess always-allocate-on-write solves that (with
versioned roots in well-known places), but that's not qcow2 any more --
it's btrfs.

And given that btrfs ought to allow file-level snapshots, perhaps the
direction should be raw files on top of btrfs (which could be extended to
do block sharing, yay!)

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-22 20:04 UTC
To: qemu-devel

Avi Kivity wrote:
> >> It's a simple matter of allocating, making sure the allocation is
> >> on disk, and recording that allocation in the tables.
> >
> > The simple implementations are only safe if sector writes are atomic.
> >
> > Opinions from Google seem divided about when you can assume that,
> > especially when the underlying file or device is not directly mapped
> > to disk sectors.
>
> That's worrying.  I guess always-allocate-on-write solves that (with
> versioned roots in well-known places), but that's not qcow2 any more
> -- it's btrfs.

Fair.  Simple journalling with checksummed log records also solves the
problem without being half as clever -- and it would probably be easy to
retrofit to qcow2 without breaking backward compatibility.  (Old qemus
would ignore the journal.)

> And given that btrfs ought to allow file-level snapshots, perhaps the
> direction should be raw files on top of btrfs (which could be extended
> to do block sharing, yay!)

Block/extent sharing would be a nice bonus :-)

Does btrfs work on other platforms than Linux?

Also, is btrfs as good as the hype, in respect of things like fsync,
barriers, cache=off consistency etc. which we've talked about?  Maybe,
but I wouldn't assume it.

A userspace btrfs-in-a-file library would be ideal, for cross-platform
support, but I don't see it happening.

You can do raw, sparse files on ext3 or any other Unix filesystem.  They
are about as compact as qcow2, if you ignore compression.

The real big problem I found with sparse files is that copying them
locally, or copying them to another machine (e.g. with rsync), is
*incredibly* slow, because it's so slow to scan the sparse regions, and
this gets really slow if you have, say, a 100GB virtual disk (5GB used,
rest to grow into).  "rsync --sparse" even bizarrely transmits a lot of
zero data over the network, or spends an age compressing it.

btrfs flat files will have the same problem.

The FIEMAP interface may solve it, generically on all Linux filesystems,
if copying tools are updated to use it.

-- Jamie

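For reference, a small Linux-only example of the FIEMAP ioctl Jamie
mentions, which lets a copy tool ask the filesystem where the allocated
extents of a sparse image actually are instead of scanning for holes.
Error handling is trimmed and the fixed extent count is an arbitrary
choice; very fragmented files would need to loop.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>        /* FS_IOC_FIEMAP */
    #include <linux/fiemap.h>

    int main(int argc, char **argv)
    {
        unsigned int n = 128, i;
        struct fiemap *fm;
        int fd;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <image-file>\n", argv[0]);
            return 1;
        }

        fd = open(argv[1], O_RDONLY);
        fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
        if (fd < 0 || !fm)
            return 1;

        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;   /* whole file */
        fm->fm_extent_count = n;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
            perror("fiemap");
            return 1;
        }

        /* Only allocated extents are reported; holes are simply absent. */
        for (i = 0; i < fm->fm_mapped_extents; i++)
            printf("data at %llu, length %llu\n",
                   (unsigned long long)fm->fm_extents[i].fe_logical,
                   (unsigned long long)fm->fm_extents[i].fe_length);

        close(fd);
        free(fm);
        return 0;
    }
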
* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Avi Kivity @ 2008-07-22 21:25 UTC
To: qemu-devel

Jamie Lokier wrote:
>> And given that btrfs ought to allow file-level snapshots, perhaps the
>> direction should be raw files on top of btrfs (which could be
>> extended to do block sharing, yay!)
>
> Block/extent sharing would be a nice bonus :-)
>
> Does btrfs work on other platforms than Linux?

There's a Solaris port called zfs, and a bsd port called WAFL.

> Also, is btrfs as good as the hype, in respect of things like fsync,
> barriers, cache=off consistency etc. which we've talked about?  Maybe,
> but I wouldn't assume it.

It better be, as it's coming from oracle.  O_DIRECT and barriers are
their bread and butter (fs).

> You can do raw, sparse files on ext3 or any other Unix filesystem.
> They are about as compact as qcow2, if you ignore compression.

Except that you lose snapshot support, etc.

> The real big problem I found with sparse files is that copying them
> locally, or copying them to another machine (e.g. with rsync), is
> *incredibly* slow, because it's so slow to scan the sparse regions,
> and this gets really slow if you have, say, a 100GB virtual disk (5GB
> used, rest to grow into).  "rsync --sparse" even bizarrely transmits a
> lot of zero data over the network, or spends an age compressing it.
>
> btrfs flat files will have the same problem.

There was some talk about an API to discover unallocated regions.

> The FIEMAP interface may solve it, generically on all Linux
> filesystems, if copying tools are updated to use it.

Like that, but better.

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.

* Re: [Qemu-devel] qcow2 - safe on kill? safe on power fail?
From: Jamie Lokier @ 2008-07-22 14:22 UTC
To: qemu-devel

Avi Kivity wrote:
> > Seems like a band-aid to me as SIGKILL is still an issue.  Plus it
> > would involve modifying all disk formats, not just QCOW2.  I'd
> > rather see proper journal support added to QCOW2 myself.
>
> Journalling is so out of fashion.  It's better to sequence the
> operations so that failure results in a leak instead of corruption.

That would be fine.  If there's too much leakage after a time, it would
be easy enough to "qemu-img convert" to recreate the image without
leakage.

Or trees, trees are the new journals... :-)

Still, there's always the possibility of errors to recover from, no
matter how careful.

-- Jamie
