* Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-03-26 20:47 UTC
  To: linux-fsdevel; +Cc: pbonzini

virt-sparsify is a tool for trimming free space in virtual disk
images. The new implementation uses vfs/kernel/qemu discard support.
Essentially it does:

  for each filesystem:
    mount -o discard $fs /mnt
    sync
    fstrim /mnt
    umount /mnt
  sync
  # qemu is killed after sync returns

Although typing these commands by hand works fine, when you run them
from a program the fstrim doesn't happen all the way down the stack
reliably. Mostly it works, but sometimes it only trims some space
from the host file.

It appears that when the host is slow / under load, the problem
happens more frequently. Also it may happen more frequently on i686
than on x86-64 (possibly also due to speed of host).

The question is: What can I do to make sure the trim happens reliably,
all the way down the stack, before qemu is killed?

I am testing this using the latest upstream kernel & qemu.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

* Re: Making discard/fstrim reliable
From: Jeff Moyer @ 2014-04-02 18:18 UTC
  To: Richard W.M. Jones; +Cc: linux-fsdevel, pbonzini

"Richard W.M. Jones" <rjones@redhat.com> writes:

Hi, Richard,

> virt-sparsify is a tool for trimming free space in virtual disk
> images. The new implementation uses vfs/kernel/qemu discard support.
> Essentially it does:
>

Presumably there's a "start guest" step here that's missing?

>   for each filesystem:
>     mount -o discard $fs /mnt

What is $fs? Do you pass in a list of devices?

Also, you don't need to mount with -o discard in order to use fstrim.
In fact, I'd recommend against doing that.

>     sync

Interesting. Have you seen mount dirty inodes or something?

>     fstrim /mnt
>     umount /mnt
>   sync
>   # qemu is killed after sync returns
>
> Although typing these commands by hand works fine, when you run them
> from a program the fstrim doesn't happen all the way down the stack
> reliably. Mostly it works, but sometimes it only trims some space
> from the host file.

What is in the stack? Are you using qcow2 images, plain files, device
mapper, anything else? Which file systems are you testing, and are they
used in the host, the guest or both? How are you checking for success?
Do you have a golden image you start with so that your test case is
repeatable?

> It appears that when the host is slow / under load, the problem
> happens more frequently. Also it may happen more frequently on i686
> than on x86-64 (possibly also due to speed of host).

I don't know of any reason that any of the variables you listed would
affect the reliability at all. As far as I can tell, fstrim is a
synchronous ioctl. I believe the only reason space wouldn't be freed is
if the fs is fragmented in such a way as to not meet the minimum trim
granularity of the underlying device.

Of course, there may be file system specific reasons too, I guess.
Hopefully others can comment on that.

Cheers,
Jeff

> The question is: What can I do to make sure the trim happens reliably,
> all the way down the stack, before qemu is killed?
>
> I am testing this using the latest upstream kernel & qemu.
>
> Rich.

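Jeff's point that fstrim does not need a mount with -o discard could be
expressed as the loop below. This is only a minimal sketch and not
libguestfs code: the names run, trim_all and DRY_RUN are made up for
illustration, and the real mount/fstrim/umount commands need root inside
the guest. With DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# run CMD [ARGS...]: execute a command, or only print it when DRY_RUN=1.
# (Hypothetical helper, used here so the loop can be tried without root.)
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# trim_all MOUNTPOINT FS...: mount each filesystem WITHOUT -o discard,
# issue one batched trim with fstrim, unmount, and sync once at the end.
trim_all() {
    mnt=$1
    shift
    for fs in "$@"; do
        run mount "$fs" "$mnt" || return 1
        run fstrim "$mnt" || echo "fstrim failed on $fs" >&2
        run umount "$mnt" || return 1
    done
    run sync
}
```

For example, "DRY_RUN=1; trim_all /mnt /dev/sda1" prints the four
commands that would be run for a single filesystem.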
* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-02 18:59 UTC
  To: Jeff Moyer; +Cc: linux-fsdevel, pbonzini

On Wed, Apr 02, 2014 at 02:18:40PM -0400, Jeff Moyer wrote:
> "Richard W.M. Jones" <rjones@redhat.com> writes:
>
> Hi, Richard,
>
> > virt-sparsify is a tool for trimming free space in virtual disk
> > images. The new implementation uses vfs/kernel/qemu discard support.
> > Essentially it does:
> >
>
> Presumably there's a "start guest" step here that's missing?

Yup, it starts up a small appliance to do these operations.

> >   for each filesystem:
> >     mount -o discard $fs /mnt
>
> What is $fs? Do you pass in a list of devices?

Yes and no. We examine the partitions, logical volumes and so on in
order to get a list of mountable filesystems, and then the list is
iterated over in this loop. The precise code for finding the
filesystems is here:

https://github.com/libguestfs/libguestfs/blob/master/src/listfs.c#L45

^ That code is running on the host side. It issues various calls to
the appliance side which are executed by code in multiple files here:

https://github.com/libguestfs/libguestfs/tree/master/daemon

> Also, you don't need to mount with -o discard in order to use fstrim.
> In fact, I'd recommend against doing that.
>
> >     sync
>
> Interesting. Have you seen mount dirty inodes or something?

The sync is actually not material here. However I included it for
completeness because it is an effective workaround for another
unreliability case where you delete some files before doing the
fstrim, and ext4 is slow enough that the files you remove don't return
space to the host. The relevant code is:

https://github.com/libguestfs/libguestfs/blob/master/daemon/fstrim.c#L53

> >     fstrim /mnt
> >     umount /mnt
> >   sync
> >   # qemu is killed after sync returns
> >
> > Although typing these commands by hand works fine, when you run them
> > from a program the fstrim doesn't happen all the way down the stack
> > reliably. Mostly it works, but sometimes it only trims some space
> > from the host file.
>
> What is in the stack? Are you using qcow2 images, plain files, device
> mapper, anything else?

In the test case it is recent kernel -> virtio-scsi -> qemu -> raw
format local file stored on host filesystem (ext4 on the test
machine).

> Which file systems are you testing, and are they
> used in the host, the guest or both?

ext4 guest and host in this case.

> How are you checking for success?

We measure the file size (stat.st_blocks) on the host during the test.
There are various thresholds which count as success (see test script
linked below). In the case where it is failing it's hardly discarding
any blocks, although it does discard some.

> Do you have a golden image you start with so that your test case is
> repeatable?

We create images on the fly, but yes I'm confident that the test is
repeatable (although that doesn't mean it is failing on every run --
it's a race condition of some sort). The test code is here:

https://github.com/libguestfs/libguestfs/blob/master/tests/discard/test-fstrim.pl

> > It appears that when the host is slow / under load, the problem
> > happens more frequently. Also it may happen more frequently on i686
> > than on x86-64 (possibly also due to speed of host).
>
> I don't know of any reason that any of the variables you listed would
> affect the reliability at all. As far as I can tell, fstrim is a
> synchronous ioctl. I believe the only reason space wouldn't be freed is
> if the fs is fragmented in such a way as to not meet the minimum trim
> granularity of the underlying device.

It's a freshly created filesystem so I guess it's not likely to be
fragmented.

I suspect it's something to do with how we kill qemu. Requests are in
flight somewhere. Just not sure how we sync "enough" to make sure
everything is on the host.

FWIW here is the elaborate sync dance we currently do to work around
bugs present and past:

https://github.com/libguestfs/libguestfs/blob/master/daemon/sync.c#L54

Thanks,

Rich.

* Re: Making discard/fstrim reliable
From: Jeff Moyer @ 2014-04-02 20:02 UTC
  To: Richard W.M. Jones; +Cc: linux-fsdevel, pbonzini

"Richard W.M. Jones" <rjones@redhat.com> writes:

>> >   for each filesystem:
>> >     mount -o discard $fs /mnt
>>
>> What is $fs? Do you pass in a list of devices?
>
> Yes and no. We examine the partitions, logical volumes and so on in
> order to get a list of mountable filesystems, and then the list is
> iterated over in this loop. The precise code for finding the
> filesystems is here:
>
> https://github.com/libguestfs/libguestfs/blob/master/src/listfs.c#L45
>
> ^ That code is running on the host side. It issues various calls to
> the appliance side which are executed by code in multiple files here:
>
> https://github.com/libguestfs/libguestfs/tree/master/daemon

Sorry, that's a lot to take in. Can you distill this down to exactly
the parts involved in the problem you're seeing? Pretend I don't know
anything about libguestfs (I don't).

>> Do you have a golden image you start with so that your test case is
>> repeatable?
>
> We create images on the fly, but yes I'm confident that the test is
> repeatable (although that doesn't mean it is failing on every run --
> it's a race condition of some sort). The test code is here:
>
> https://github.com/libguestfs/libguestfs/blob/master/tests/discard/test-fstrim.pl

I suggest you create a golden image that you copy for each test that
already has the data committed to it. The test would then just issue
the rm and discard.

  $g->fill (33, 10000000, "/data");

What's that do?

>> > It appears that when the host is slow / under load, the problem
>> > happens more frequently. Also it may happen more frequently on i686
>> > than on x86-64 (possibly also due to speed of host).
>>
>> I don't know of any reason that any of the variables you listed would
>> affect the reliability at all. As far as I can tell, fstrim is a
>> synchronous ioctl. I believe the only reason space wouldn't be freed is
>> if the fs is fragmented in such a way as to not meet the minimum trim
>> granularity of the underlying device.
>
> It's a freshly created filesystem so I guess it's not likely to be
> fragmented.

It's only 64MB in size, right? That could certainly affect things!
Again, we'd need some input from file system folks, though. I'm hoping
we're getting enough information for them to make progress.

Cheers,
Jeff

* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-02 20:26 UTC
  To: Jeff Moyer; +Cc: linux-fsdevel, pbonzini

On Wed, Apr 02, 2014 at 04:02:57PM -0400, Jeff Moyer wrote:
> "Richard W.M. Jones" <rjones@redhat.com> writes:
> > Yes and no. We examine the partitions, logical volumes and so on in
> > order to get a list of mountable filesystems, and then the list is
> > iterated over in this loop. The precise code for finding the
> > filesystems is here:
> >
> > https://github.com/libguestfs/libguestfs/blob/master/src/listfs.c#L45
> >
> > ^ That code is running on the host side. It issues various calls to
> > the appliance side which are executed by code in multiple files here:
> >
> > https://github.com/libguestfs/libguestfs/tree/master/daemon
>
> Sorry, that's a lot to take in. Can you distill this down to exactly
> the parts involved in the problem you're seeing? Pretend I don't know
> anything about libguestfs (I don't).

Libguestfs runs up a small appliance, running the daemon above. The
library side sends the daemon commands which the daemon runs. The
commands that are run map nearly 1-1 to the libguestfs API.

The test program is this:

  # Add the disk to qemu.
  $g->add_drive ($disk, format => $format, readonly => 0, discard => "enable");

  # Run qemu.
  $g->launch ();

  # Run mkfs -t ext4 /dev/sda in the appliance.
  $g->mkfs ("ext4", "/dev/sda");

  # Run mount -o nodiscard /dev/sda /
  # [This is different from my earlier description, but it is done
  # deliberately so rm doesn't trim the space, because the test is
  # trying to test fstrim, not rm. Discard of rm is tested in another test.]
  $g->mount_options ("nodiscard", "/dev/sda", "/");

  # Create a 10000000 byte file /data containing '!' characters.
  $g->fill (33, 10000000, "/data");

  # Sync the (guest) disk.
  $g->sync ();

  # The host file is examined here.
  my $full_size = (stat ($disk))[12];

  # Remove the large data file. Because of -o nodiscard this won't trim.
  $g->rm ("/data");

  # Run fstrim in the appliance.
  $g->fstrim ("/");

  # This issues a sync for the guest disk, and kills qemu.
  $g->shutdown ();

  # The host file is examined again.
  my $trimmed_size = (stat ($disk))[12];

  if ($full_size - $trimmed_size < 1000) {
    # then print an error and die
  }

> I suggest you create a golden image that you copy for each test that
> already has the data committed to it. The test would then just issue
> the rm and discard.

Sure I'll try this. Will take a little time to implement it and run
the tests in various environments.

> It's only 64MB in size, right? That could certainly affect things!

So I'm guessing this is too small?

Rich.

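The host-side check in the test above (element 12 of Perl's stat list is
the allocated block count) can be approximated from a shell. This is a
rough sketch under stated assumptions: allocated_blocks and
trim_freed_enough are hypothetical names, "stat -c %b" is the GNU
coreutils spelling, and the default of 1000 blocks simply mirrors the
threshold in the test program.

```shell
#!/bin/sh
# allocated_blocks FILE: print the number of blocks allocated to FILE
# (the st_blocks value), using GNU stat's %b format.
allocated_blocks() {
    stat -c %b "$1"
}

# trim_freed_enough FULL TRIMMED [MIN]: succeed if at least MIN blocks
# (default 1000, as in the test program) were released between the
# "full" and "trimmed" measurements.
trim_freed_enough() {
    full=$1
    trimmed=$2
    min=${3:-1000}
    [ $((full - trimmed)) -ge "$min" ]
}
```

A caller would record allocated_blocks on the disk image after the guest
sync, record it again after qemu has exited, and pass both numbers to
trim_freed_enough.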
* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-10 15:05 UTC
  To: Jeff Moyer; +Cc: linux-fsdevel, pbonzini, Lukáš Czerner

Just to finish off this subject. It was (kind of) a bug in the test
suite: The test suite assumed that fstrim would trim the journal as
well as the file content. I adjusted the test so it only measures the
trimming of the file content, not any other metadata on the disk, and
now the test works reliably.

Rich.

* Re: Making discard/fstrim reliable
From: Lukáš Czerner @ 2014-04-03 17:08 UTC
  To: Richard W.M. Jones; +Cc: linux-fsdevel, pbonzini

On Wed, 26 Mar 2014, Richard W.M. Jones wrote:

> Date: Wed, 26 Mar 2014 20:47:08 +0000
> From: Richard W.M. Jones <rjones@redhat.com>
> To: linux-fsdevel@vger.kernel.org
> Cc: pbonzini@redhat.com
> Subject: Making discard/fstrim reliable
>
>
> virt-sparsify is a tool for trimming free space in virtual disk
> images. The new implementation uses vfs/kernel/qemu discard support.
> Essentially it does:
>
>   for each filesystem:
>     mount -o discard $fs /mnt
>     sync
>     fstrim /mnt
>     umount /mnt
>   sync
>   # qemu is killed after sync returns
>
> Although typing these commands by hand works fine, when you run them
> from a program the fstrim doesn't happen all the way down the stack
> reliably. Mostly it works, but sometimes it only trims some space
> from the host file.
>
> It appears that when the host is slow / under load, the problem
> happens more frequently. Also it may happen more frequently on i686
> than on x86-64 (possibly also due to speed of host).
>
> The question is: What can I do to make sure the trim happens reliably,
> all the way down the stack, before qemu is killed?
>
> I am testing this using the latest upstream kernel & qemu.
>
> Rich.

There is really no reliability to be had with discard. It's an
advisory interface; not every file system implements it, and when one
does, the implementation and hence the results vary wildly. I'd
suggest not doing things this way.

However let's take a look at your case. In order to determine why you
think it's unreliable I'd need some data to back it up: what the file
system looks like (an image would be great), when and how it was
created, what its size is, what the image size is and what size
difference you expect. Also what file system type this is.

However if we're talking about raw file system images in files in
the host, then a much better solution would be to use fsck. Ext4
already has option -E discard which will send a discard down for
every free range (similarly to what fstrim would do on a mounted file
system). I suspect that other fs utilities might have similar
functionality.

Of course in order for it to work you need a layer to translate
discard requests into hole punching on the underlying file system
(such as a loop device, for example). But I think that if there is
enough interest we might do this directly from e2fsck when we notice
that we're running on a file rather than a block device.

Also please note that mke2fs will issue the initial discard by
default, so if you create the file system and then run fstrim on it
with the expectation that the size of the backing file will go down,
you would be wrong. It was already trimmed at file system creation
time.

All that said, while discard is an interesting functionality and can
be abused in many _many_ ways, it looks like what you really need is
something that is currently available in fallocate(1) from the
util-linux package. The option to look for is --dig-holes:

       -d, --dig-holes
              Detect and dig holes. Makes the file sparse in-place,
              without using extra disk space. The minimal size of the
              hole depends on filesystem I/O block size (usually 4096
              bytes). Also, when using this option, --keep-size is
              implied.

              You can think of this as doing a "cp --sparse" and
              renaming the dest file as the original, without the need
              for extra disk space.

I am not sure whether a util-linux version with this functionality
has been released yet. But you can always check out the git
repository:

https://github.com/karelzak/util-linux.git

I hope it helps.

Thanks!
-Lukas

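The --dig-holes suggestion above might be tried on a host file roughly
as follows. This is a hedged sketch, not a recommendation: dig_holes and
allocated_kib are illustrative names, and it assumes a util-linux
fallocate(1) new enough to have --dig-holes plus a host filesystem that
supports hole punching.

```shell
#!/bin/sh
# dig_holes FILE: deallocate zero-filled ranges of FILE in place.
# The file's length and contents stay the same; only allocation shrinks.
dig_holes() {
    fallocate --dig-holes "$1"
}

# allocated_kib FILE: disk space actually used by FILE, in KiB
# (du reports allocated space, not the apparent file length).
allocated_kib() {
    du -k "$1" | cut -f1
}
```

Running allocated_kib before and after dig_holes on an image shows how
much space was given back. Only ranges that already read as zeros are
freed; blocks that a guest filesystem considers free but that still hold
old data are left allocated.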
* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-03 17:23 UTC
  To: Lukáš Czerner; +Cc: linux-fsdevel, pbonzini

On Thu, Apr 03, 2014 at 07:08:34PM +0200, Lukáš Czerner wrote:
> All that said, while discard is an interesting functionality and can
> be abused in many _many_ ways, it looks like what you really need is
> something that is currently available in fallocate(1) from the
> util-linux package. The option to look for is --dig-holes:
>
>        -d, --dig-holes
>               Detect and dig holes. Makes the file sparse in-place,
>               without using extra disk space. The minimal size of the
>               hole depends on filesystem I/O block size (usually 4096
>               bytes). Also, when using this option, --keep-size is
>               implied.
>
>               You can think of this as doing a "cp --sparse" and
>               renaming the dest file as the original, without the need
>               for extra disk space.

This applies to single host files right?

virt-sparsify takes a disk image and sparsifies it, by examining the
internals of the disk image to find unused space (not *zero* space,
but space that's unused -- it could contain deleted files or other
garbage).

Rich.

* Re: Making discard/fstrim reliable
From: Paolo Bonzini @ 2014-04-03 17:57 UTC
  To: Lukáš Czerner, Richard W.M. Jones; +Cc: linux-fsdevel

On 03/04/2014 19:08, Lukáš Czerner wrote:
> However if we're talking about raw file system images in files in
> the host, then a much better solution would be to use fsck. Ext4
> already has option -E discard which will send a discard down for
> every free range (similarly to what fstrim would do on a mounted file
> system). I suspect that other fs utilities might have similar
> functionality.
>
> Of course in order for it to work you need a layer to translate
> discard requests into hole punching on the underlying file system
> (such as a loop device, for example). But I think that if there is
> enough interest we might do this directly from e2fsck when we notice
> that we're running on a file rather than a block device.

The e2fsck can also be done from within libguestfs easily. libguestfs
runs within a VM so QEMU would handle the translation to
hole-punching. But fstrim is much faster than e2fsck.

From Richard's description, what seems to happen is that ext4 FITRIM
scans the filesystem and prepares the discard requests; but then it
sends them down to the filesystem after the ioctl has finished. Does
that make any sense? And would that be considered a bug?

Paolo

* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-03 18:08 UTC
  To: Paolo Bonzini; +Cc: Lukáš Czerner, linux-fsdevel

On Thu, Apr 03, 2014 at 07:57:59PM +0200, Paolo Bonzini wrote:
> From Richard's description, what seems to happen is that ext4 FITRIM
> scans the filesystem and prepares the discard requests; but then it
> sends them down to the filesystem after the ioctl has finished.
> Does that make any sense? And would that be considered a bug?

That would account for it if it happened. I'm still working on a good
reproducer. It's remarkably hard to reproduce on fast / 64 bit
hardware ...

Rich.