Making discard/fstrim reliable

From: Richard W.M. Jones @ 2014-03-26 20:47 UTC
  To: linux-fsdevel; +Cc: pbonzini

virt-sparsify is a tool for trimming free space in virtual disk
images.  The new implementation uses vfs/kernel/qemu discard support.
Essentially it does:

  for each filesystem:
    mount -o discard $fs /mnt
    sync
    fstrim /mnt
    umount /mnt
  sync
  # qemu is killed after sync returns

Although typing these commands by hand works fine, when you run them
from a program the fstrim doesn't happen all the way down the stack
reliably.  Mostly it works, but sometimes it only trims some space
from the host file.
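
The sequence above, driven from a program rather than typed by hand, could be sketched roughly as follows. This is only an illustration, not virt-sparsify's code: `trim_plan`, `run_plan` and the device names are hypothetical, and the default is a dry run because the real commands need root inside the appliance.

```python
import subprocess
from typing import List, Sequence


def trim_plan(filesystems: Sequence[str], mountpoint: str = "/mnt") -> List[List[str]]:
    """Return the commands from the loop above, in order, as argv lists."""
    plan: List[List[str]] = []
    for fs in filesystems:
        plan.append(["mount", "-o", "discard", fs, mountpoint])
        plan.append(["sync"])
        plan.append(["fstrim", mountpoint])
        plan.append(["umount", mountpoint])
    plan.append(["sync"])  # final sync, after which qemu is killed
    return plan


def run_plan(plan: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for argv in plan:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)  # requires root


if __name__ == "__main__":
    # Hypothetical device names, for illustration only.
    run_plan(trim_plan(["/dev/sda1", "/dev/sda2"]))
```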

It appears that when the host is slow / under load, the problem
happens more frequently.  Also it may happen more frequently on i686
than on x86-64 (possibly also due to speed of host).

The question is: What can I do to make sure the trim happens reliably,
all the way down the stack, before qemu is killed?

I am testing this using the latest upstream kernel & qemu.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

* Re: Making discard/fstrim reliable
From: Jeff Moyer @ 2014-04-02 18:18 UTC
  To: Richard W.M. Jones; +Cc: linux-fsdevel, pbonzini

"Richard W.M. Jones" <rjones@redhat.com> writes:

Hi, Richard,

> virt-sparsify is a tool for trimming free space in virtual disk
> images.  The new implementation uses vfs/kernel/qemu discard support.
> Essentially it does:
>

Presumably there's a "start guest" step here that's missing?

>   for each filesystem:
>     mount -o discard $fs /mnt

What is $fs?  Do you pass in a list of devices?

Also, you don't need to mount with -o discard in order to use fstrim.
In fact, I'd recommend against doing that.

>     sync

Interesting.  Have you seen mount dirty inodes or something?

>     fstrim /mnt
>     umount /mnt
>   sync
>   # qemu is killed after sync returns
>
> Although typing these commands by hand works fine, when you run them
> from a program the fstrim doesn't happen all the way down the stack
> reliably.  Mostly it works, but sometimes it only trims some space
> from the host file.

What is in the stack?  Are you using qcow2 images, plain files, device
mapper, anything else?  Which file systems are you testing, and are they
used in the host, the guest or both?  How are you checking for success?
Do you have a golden image you start with so that your test case is
repeatable?

> It appears that when the host is slow / under load, the problem
> happens more frequently.  Also it may happen more frequently on i686
> than on x86-64 (possibly also due to speed of host).

I don't know of any reason that any of the variables you listed would
affect the reliability at all.  As far as I can tell, fstrim is a
synchronous ioctl.  I believe the only reason space wouldn't be freed is
if the fs is fragmented in such a way as to not meet the minimum trim
granularity of the underlying device.  Of course, there may be file
system specific reasons too, I guess.  Hopefully others can comment on
that.
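
For reference, the ioctl that fstrim issues can be sketched as below. This is a minimal sketch under assumptions: the `FITRIM` value is derived from `_IOWR('X', 121, struct fstrim_range)` using the common ioctl encoding (as on x86), the helper names are made up for illustration, and calling `fitrim` needs root and a mounted file system that supports it. The call returns only when the ioctl has completed in the guest kernel; it says nothing about layers further down the stack.

```python
import fcntl
import os
import struct

# _IOWR('X', 121, struct fstrim_range) with the common ioctl encoding;
# a few architectures encode the direction bits differently.
FITRIM = (3 << 30) | (24 << 16) | (ord("X") << 8) | 121

U64_MAX = 2**64 - 1


def pack_range(start: int = 0, length: int = U64_MAX, minlen: int = 0) -> bytes:
    """struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }"""
    return struct.pack("=QQQ", start, length, minlen)


def fitrim(mountpoint: str, minlen: int = 0) -> int:
    """Trim the whole file system; return the byte count the kernel reports."""
    buf = bytearray(pack_range(minlen=minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FITRIM, buf)  # the kernel updates buf in place
    finally:
        os.close(fd)
    _start, trimmed, _minlen = struct.unpack("=QQQ", buf)
    return trimmed
```

The `minlen` field is where a minimum extent size enters: free extents smaller than it are skipped, which is one way fragmentation can reduce what gets trimmed.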

Cheers,
Jeff

> The question is: What can I do to make sure the trim happens reliably,
> all the way down the stack, before qemu is killed?
>
> I am testing this using the latest upstream kernel & qemu.
>
> Rich.

* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-02 18:59 UTC
  To: Jeff Moyer; +Cc: linux-fsdevel, pbonzini

On Wed, Apr 02, 2014 at 02:18:40PM -0400, Jeff Moyer wrote:
> "Richard W.M. Jones" <rjones@redhat.com> writes:
> 
> Hi, Richard,
> 
> > virt-sparsify is a tool for trimming free space in virtual disk
> > images.  The new implementation uses vfs/kernel/qemu discard support.
> > Essentially it does:
> >
> 
> Presumably there's a "start guest" step here that's missing?

Yup, it starts up a small appliance to do these operations.

> >   for each filesystem:
> >     mount -o discard $fs /mnt
> 
> What is $fs?  Do you pass in a list of devices?

Yes and no.  We examine the partitions, logical volumes and so on in
order to get a list of mountable filesystems, and then the list is
iterated over in this loop.  The precise code for finding the
filesystems is here:

https://github.com/libguestfs/libguestfs/blob/master/src/listfs.c#L45

^ That code is running on the host side.  It issues various calls to
the appliance side which are executed by code in multiple files here:

https://github.com/libguestfs/libguestfs/tree/master/daemon

> Also, you don't need to mount with -o discard in order to use fstrim.
> In fact, I'd recommend against doing that.
> 
> >     sync
> 
> Interesting.  Have you seen mount dirty inodes or something?

The sync is actually not material here.  However I included it for
completeness because it is an effective workaround for another
unreliability case where you delete some files before doing the
fstrim, and ext4 is slow enough that the files you remove don't return
space to the host.  The relevant code is:

https://github.com/libguestfs/libguestfs/blob/master/daemon/fstrim.c#L53

> >     fstrim /mnt
> >     umount /mnt
> >   sync
> >   # qemu is killed after sync returns
> >
> > Although typing these commands by hand works fine, when you run them
> > from a program the fstrim doesn't happen all the way down the stack
> > reliably.  Mostly it works, but sometimes it only trims some space
> > from the host file.
> 
> What is in the stack?  Are you using qcow2 images, plain files, device
> mapper, anything else?

In the test case it is recent kernel -> virtio-scsi -> qemu -> raw
format local file stored on host filesystem (ext4 on the test machine).

> Which file systems are you testing, and are they
> used in the host, the guest or both?

ext4 guest and host in this case.

> How are you checking for success?

We measure the allocated size of the file (stat.st_blocks) on the host during the test.

There are various thresholds which count as success (see test script
linked below).  In the case where it is failing it's hardly discarding
any blocks, although it does discard some.
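
A minimal sketch of that kind of measurement, assuming Linux, where st_blocks is counted in 512-byte units. The helper names are illustrative and not taken from the test script.

```python
import os
import tempfile


def allocated_bytes(path: str) -> int:
    """Bytes actually allocated to the file (st_blocks is in 512-byte units on Linux)."""
    return os.stat(path).st_blocks * 512


def apparent_bytes(path: str) -> int:
    """The file's nominal size, which trimming does not change."""
    return os.stat(path).st_size


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.truncate(10_000_000)  # size is set, but nothing has been written
        f.flush()
        print("apparent: ", apparent_bytes(f.name))
        print("allocated:", allocated_bytes(f.name))
```

Comparing `allocated_bytes` before and after the trim is what shows whether space was returned to the host; `apparent_bytes` stays the same either way.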

> Do you have a golden image you start with so that your test case is
> repeatable?

We create images on the fly, but yes I'm confident that the test is
repeatable (although that doesn't mean it is failing on every run --
it's a race condition of some sort).  The test code is here:

https://github.com/libguestfs/libguestfs/blob/master/tests/discard/test-fstrim.pl

> > It appears that when the host is slow / under load, the problem
> > happens more frequently.  Also it may happen more frequently on i686
> > than on x86-64 (possibly also due to speed of host).
> 
> I don't know of any reason that any of the variables you listed would
> affect the reliability at all.  As far as I can tell, fstrim is a
> synchronous ioctl.  I believe the only reason space wouldn't be freed is
> if the fs is fragmented in such a way as to not meet the minimum trim
> granularity of the underlying device.

It's a freshly created filesystem so I guess it's not likely to be
fragmented.

I suspect it's something to do with how we kill qemu.  Requests are in
flight somewhere.  Just not sure how we sync "enough" to make sure
everything is on the host.  FWIW here is the elaborate sync dance we
currently do to work around bugs present and past:

https://github.com/libguestfs/libguestfs/blob/master/daemon/sync.c#L54

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming blog: http://rwmj.wordpress.com
Fedora now supports 80 OCaml packages (the OPEN alternative to F#)

* Re: Making discard/fstrim reliable
From: Jeff Moyer @ 2014-04-02 20:02 UTC
  To: Richard W.M. Jones; +Cc: linux-fsdevel, pbonzini

"Richard W.M. Jones" <rjones@redhat.com> writes:

>> >   for each filesystem:
>> >     mount -o discard $fs /mnt
>> 
>> What is $fs?  Do you pass in a list of devices?
>
> Yes and no.  We examine the partitions, logical volumes and so on in
> order to get a list of mountable filesystems, and then the list is
> iterated over in this loop.  The precise code for finding the
> filesystems is here:
>
> https://github.com/libguestfs/libguestfs/blob/master/src/listfs.c#L45
>
> ^ That code is running on the host side.  It issues various calls to
> the appliance side which are executed by code in multiple files here:
>
> https://github.com/libguestfs/libguestfs/tree/master/daemon

Sorry, that's a lot to take in.  Can you distill this down to exactly
the parts involved in the problem you're seeing?  Pretend I don't know
anything about libguestfs (I don't).

>> Do you have a golden image you start with so that your test case is
>> repeatable?
>
> We create images on the fly, but yes I'm confident that the test is
> repeatable (although that doesn't mean it is failing on every run --
> it's a race condition of some sort).  The test code is here:
>
> https://github.com/libguestfs/libguestfs/blob/master/tests/discard/test-fstrim.pl

I suggest you create a golden image that you copy for each test that
already has the data committed to it.  The test would then just issue
the rm and discard.

$g->fill (33, 10000000, "/data");

What's that do?

>> > It appears that when the host is slow / under load, the problem
>> > happens more frequently.  Also it may happen more frequently on i686
>> > than on x86-64 (possibly also due to speed of host).
>> 
>> I don't know of any reason that any of the variables you listed would
>> affect the reliability at all.  As far as I can tell, fstrim is a
>> synchronous ioctl.  I believe the only reason space wouldn't be freed is
>> if the fs is fragmented in such a way as to not meet the minimum trim
>> granularity of the underlying device.
>
> It's a freshly created filesystem so I guess it's not likely to be
> fragmented.

It's only 64MB in size, right?  That could certainly affect things!
Again, we'd need some input from file system folks, though.  I'm hoping
we're getting enough information for them to make progress.

Cheers,
Jeff

* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-02 20:26 UTC
  To: Jeff Moyer; +Cc: linux-fsdevel, pbonzini

On Wed, Apr 02, 2014 at 04:02:57PM -0400, Jeff Moyer wrote:
> "Richard W.M. Jones" <rjones@redhat.com> writes:
> > Yes and no.  We examine the partitions, logical volumes and so on in
> > order to get a list of mountable filesystems, and then the list is
> > iterated over in this loop.  The precise code for finding the
> > filesystems is here:
> >
> > https://github.com/libguestfs/libguestfs/blob/master/src/listfs.c#L45
> >
> > ^ That code is running on the host side.  It issues various calls to
> > the appliance side which are executed by code in multiple files here:
> >
> > https://github.com/libguestfs/libguestfs/tree/master/daemon
> 
> Sorry, that's a lot to take in.  Can you distill this down to exactly
> the parts involved in the problem you're seeing?  Pretend I don't know
> anything about libguestfs (I don't).

Libguestfs runs up a small appliance, running the daemon above.  The
library side sends the daemon commands which the daemon runs.  The
commands that are run map nearly 1-1 to the libguestfs API.

The test program is this:

  # Add the disk to qemu.
  $g->add_drive ($disk, format => $format, readonly => 0, discard => "enable");
  # Run qemu.
  $g->launch ();

  # Run mkfs -t ext4 /dev/sda in the appliance.
  $g->mkfs ("ext4", "/dev/sda");
  # Run mount -o nodiscard /dev/sda /
  # [This is different from my earlier description, but it is done
  # deliberately so rm doesn't trim the space, because the test is
  # trying to test fstrim, not rm.  Discard of rm is tested in another test.]
  $g->mount_options ("nodiscard", "/dev/sda", "/");

  # Create a 10000000 byte file /data containing '!' characters.
  $g->fill (33, 10000000, "/data");
  # Sync the (guest) disk.
  $g->sync ();

  # The host file is examined here.
  my $full_size = (stat ($disk))[12];

  # Remove the large data file.  Because of -o nodiscard this won't trim.
  $g->rm ("/data");
  # Run fstrim in the appliance.
  $g->fstrim ("/");

  # This issues a sync for the guest disk, and kills qemu.
  $g->shutdown ();

  # The host file is examined again.
  my $trimmed_size = (stat ($disk))[12];
  if ($full_size - $trimmed_size < 1000) {
    # then print an error and die
  }

> I suggest you create a golden image that you copy for each test that
> already has the data committed to it.  The test would then just issue
> the rm and discard.

Sure I'll try this.  Will take a little time to implement it and run
the tests in various environments.

> It's only 64MB in size, right?  That could certainly affect things!

So I'm guessing this is too small?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-10 15:05 UTC
  To: Jeff Moyer; +Cc: linux-fsdevel, pbonzini, Lukáš Czerner

Just to finish off this subject.

It was (kind of) a bug in the test suite:

The test suite assumed that fstrim would trim the journal as well as
the file content.  I adjusted the test so it only measures the
trimming of the file content, not any other metadata on the disk, and
now the test works reliably.
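
A check along those lines might look like the sketch below. It is an assumption about the shape of such a test, not the actual libguestfs change: `content_was_trimmed` and its `slack` tolerance are hypothetical, and the block counts are st_blocks values in 512-byte units.

```python
def content_was_trimmed(full_blocks: int, trimmed_blocks: int,
                        data_bytes: int, slack: float = 0.1) -> bool:
    """True if the host file shrank by roughly the size of the deleted data.

    full_blocks and trimmed_blocks are st_blocks samples taken before the
    rm and after shutdown.  slack is a hypothetical tolerance, so that
    metadata such as the journal is not expected to be trimmed as well.
    """
    freed = (full_blocks - trimmed_blocks) * 512
    return freed >= data_bytes * (1 - slack)


if __name__ == "__main__":
    # Illustrative numbers only: a 10000000 byte data file, as in the test.
    print(content_was_trimmed(40000, 20000, 10_000_000))
```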

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

* Re: Making discard/fstrim reliable
From: Lukáš Czerner @ 2014-04-03 17:08 UTC
  To: Richard W.M. Jones; +Cc: linux-fsdevel, pbonzini

On Wed, 26 Mar 2014, Richard W.M. Jones wrote:

> Date: Wed, 26 Mar 2014 20:47:08 +0000
> From: Richard W.M. Jones <rjones@redhat.com>
> To: linux-fsdevel@vger.kernel.org
> Cc: pbonzini@redhat.com
> Subject: Making discard/fstrim reliable
> 
> 
> virt-sparsify is a tool for trimming free space in virtual disk
> images.  The new implementation uses vfs/kernel/qemu discard support.
> Essentially it does:
> 
>   for each filesystem:
>     mount -o discard $fs /mnt
>     sync
>     fstrim /mnt
>     umount /mnt
>   sync
>   # qemu is killed after sync returns
> 
> Although typing these commands by hand works fine, when you run them
> from a program the fstrim doesn't happen all the way down the stack
> reliably.  Mostly it works, but sometimes it only trims some space
> from the host file.
> 
> It appears that when the host is slow / under load, the problem
> happens more frequently.  Also it may happen more frequently on i686
> than on x86-64 (possibly also due to speed of host).
> 
> The question is: What can I do to make sure the trim happens reliably,
> all the way down the stack, before qemu is killed?
> 
> I am testing this using the latest upstream kernel & qemu.
> 
> Rich.

There is really no reliability to be had with discard. It's an
advisory interface: not every file system implements it, and when one
does, the implementation and hence the results vary wildly.

I'd suggest not doing things this way.

However, let's take a look at your case. To determine why you
think it's unreliable, I'd need some data to back it up: what the file
system looks like (an image would be great), when and how it was
created, what its size is, what the image size is, and what size
difference you expect. Also, what file system type it is.

However, if we're talking about raw file system images in files on
the host, then a much better solution would be to use fsck. For ext4,
e2fsck already has the option -E discard, which will send a discard
down for every free range (similar to what fstrim would do on a
mounted file system). I suspect that other fs utilities might have
similar functionality.

Of course, for that to work you need a layer that translates
discard requests into hole punching in the underlying file system (a
loop device, for example). But I think that if there is enough
interest we might do this directly from e2fsck when we notice that
we're running on a file rather than a block device.
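
As a rough sketch of the loop-device route described above: attach the image, run e2fsck with -E discard on the loop device, then detach. The function names are illustrative, the commands need root, and e2fsck may ask questions interactively if it finds problems.

```python
import subprocess
from typing import List


def e2fsck_discard_argv(device: str) -> List[str]:
    """argv for a forced check that also discards free ranges."""
    return ["e2fsck", "-f", "-E", "discard", device]


def discard_image_via_loop(image_path: str) -> int:
    """Attach image_path to a loop device, run e2fsck on it, then detach."""
    loopdev = subprocess.run(
        ["losetup", "--find", "--show", image_path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    try:
        # e2fsck exits non-zero when it corrects errors, so the exit
        # status is returned to the caller instead of using check=True.
        return subprocess.run(e2fsck_discard_argv(loopdev)).returncode
    finally:
        subprocess.run(["losetup", "--detach", loopdev], check=True)
```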

Also please note that mke2fs issues an initial discard by
default, so if you create the file system and then run fstrim on it
expecting the size of the backing file to go down, you would be
wrong. It was already trimmed at file system creation time.

All that said, discard is an interesting piece of functionality that
can be abused in many, _many_ ways. It looks like what you really need
is something that is currently available in fallocate(1) from the
util-linux package. The option to look for is --dig-holes:

-d, --dig-holes
      Detect and dig holes. Makes the file sparse in-place, without using extra disk space.
      The minimal size of the hole depends on filesystem I/O block size (usually 4096 bytes).
      Also, when using this option, --keep-size is implied.

      You can think of this as doing a "cp --sparse" and renaming the dest file as the
      original, without the need for extra disk space.

I am not sure whether a util-linux version with this functionality has
been released yet, but you can always check out the git repository:

https://github.com/karelzak/util-linux.git
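
To illustrate the "detect" half of --dig-holes, here is a small sketch that finds runs of all-zero blocks in a buffer. It is not util-linux's code; `zero_runs` is a hypothetical helper, and the actual punching (fallocate(2) with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) is left out.

```python
from typing import Iterator, Tuple


def zero_runs(data: bytes, block_size: int = 4096) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) for each maximal run of all-zero blocks."""
    zero_block = bytes(block_size)
    run_start = None
    offset = 0
    while offset < len(data):
        block = data[offset:offset + block_size]
        is_zero = block == zero_block[:len(block)]
        if is_zero and run_start is None:
            run_start = offset  # a new run of zero blocks begins here
        elif not is_zero and run_start is not None:
            yield (run_start, offset - run_start)
            run_start = None
        offset += block_size
    if run_start is not None:
        yield (run_start, len(data) - run_start)  # run reaches end of data
```

Only blocks that read back as zeroes are candidates; blocks that are unused by the file system but still hold old data are not found this way.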

I hope it helps.

Thanks!
-Lukas

* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-03 17:23 UTC
  To: Lukáš Czerner; +Cc: linux-fsdevel, pbonzini

On Thu, Apr 03, 2014 at 07:08:34PM +0200, Lukáš Czerner wrote:
> All that said, discard is an interesting piece of functionality that
> can be abused in many, _many_ ways. It looks like what you really need
> is something that is currently available in fallocate(1) from the
> util-linux package. The option to look for is --dig-holes:
> 
> -d, --dig-holes
>       Detect and dig holes. Makes the file sparse in-place, without using extra disk space.
>       The minimal size of the hole depends on filesystem I/O block size (usually 4096 bytes).
>       Also, when using this option, --keep-size is implied.
> 
>       You can think of this as doing a "cp --sparse" and renaming the dest file as the
>       original, without the need for extra disk space.

This applies to single host files right?

virt-sparsify takes a disk image and sparsifies it, by examining the
internals of the disk image to find unused space (not *zero* space,
but space that's unused -- it could contain deleted files or other
garbage).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

* Re: Making discard/fstrim reliable
From: Paolo Bonzini @ 2014-04-03 17:57 UTC
  To: Lukáš Czerner, Richard W.M. Jones; +Cc: linux-fsdevel

On 03/04/2014 19:08, Lukáš Czerner wrote:
> However, if we're talking about raw file system images in files on
> the host, then a much better solution would be to use fsck. For ext4,
> e2fsck already has the option -E discard, which will send a discard
> down for every free range (similar to what fstrim would do on a
> mounted file system). I suspect that other fs utilities might have
> similar functionality.
>
> Of course, for that to work you need a layer that translates
> discard requests into hole punching in the underlying file system (a
> loop device, for example). But I think that if there is enough
> interest we might do this directly from e2fsck when we notice that
> we're running on a file rather than a block device.

e2fsck can also be run from within libguestfs easily.  libguestfs
runs within a VM, so QEMU would handle the translation to hole-punching.

But fstrim is much faster than e2fsck.

From Richard's description, what seems to happen is that ext4 FITRIM
scans the filesystem and prepares the discard requests; but then it 
sends them down to the filesystem after the ioctl has finished.  Does 
that make any sense?  And would that be considered a bug?

Paolo

* Re: Making discard/fstrim reliable
From: Richard W.M. Jones @ 2014-04-03 18:08 UTC
  To: Paolo Bonzini; +Cc: Lukáš Czerner, linux-fsdevel

On Thu, Apr 03, 2014 at 07:57:59PM +0200, Paolo Bonzini wrote:
> From Richard's description, what seems to happen is that ext4 FITRIM
> scans the filesystem and prepares the discard requests; but then it
> sends them down to the filesystem after the ioctl has finished.
> Does that make any sense?  And would that be considered a bug?

That would account for it if it happened.

I'm still working on a good reproducer.  It's remarkably hard to
reproduce on fast / 64 bit hardware ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
