From: Richard Laager <rlaager@wiktel.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming
Date: Fri, 09 Mar 2012 14:36:50 -0600 [thread overview]
Message-ID: <1331325410.3715.77.camel@watermelon.coderich.net> (raw)
In-Reply-To: <1331226917-6658-15-git-send-email-pbonzini@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 4320 bytes --]
I was just working on this as well, though your implementation is *far*
more complete than mine. (I was only looking at making changes to the
discard implementation in block/raw-posix.c.)
I've got several comments, which I've separated by logical topic...
-------- BLKDISCARD --------
fallocate() only supports files. In my patch, I was fstat()ing the fd
just after opening it and caching an is_device boolean. Then, when doing
discards, if !is_device, call fallocate(), else (i.e. for devices) do
the following (note: untested code):
__uint64_t range[2];
range[0] = sector_num << 9;
range[1] = (__uint64_t)nb_sectors << 9;
if (ioctl(s->fd, BLKDISCARD, &range) && errno != EOPNOTSUPP) {
return -errno;
}
return 0;
Note that for BLKDISCARD, you probably do NOT want to clear has_discard
if you get EOPNOTSUPP. For example, if you have LVM across multiple
disks where some support discard and some do not, you'll get EOPNOTSUPP
for some discard requests, but not all. ext4 had a behavior where it
would stop issuing discards after one failed, but the LVM guys heavily
criticized that and asked them to get rid of that behavior. (I haven't
checked to make sure it actually happened.)
I'd imagine it's still fine to stop trying fallocate() hole punches
after the first failure; I'm not aware of any filesystem where discards
would be supported in only some ranges of a single file, nor would that
make much sense.
-------- sync requirements --------
I'm not sure if fallocate() and/or BLKDISCARD always guarantee that the
discard has made it to stable storage. If they don't, does O_DIRECT or
O_DSYNC on open() cause them to make any such guarantee? If not, should
you be calling fdatasync() or fsync() when the user has specified
cache=direct or cache=writethrough?
Note that the Illumos implementation (see below) has a flag to ask for
either behavior.
-------- non-Linux Implementations --------
FreeBSD and Illumos have ioctl()s similar to BLKDISCARD and I have
similar untested code for those as well:
/* FreeBSD */
#ifdef DIOCGDELETE
if (s->is_device) {
off_t range[2];
range[0] = sector_num << 9;
range[1] = (off_t)nb_sectors << 9;
if (ioctl(s->fd, DIOCGDELETE, &range) != 0 && errno != ENOIOCTL)
{
return -errno;
}
return 0;
}
#endif
/* Illumos */
#ifdef DKIOCFREE
if (s->is_device) {
dkioc_free_t df;
/* TODO: Is this correct? Are the other discard ioctl()s sync or async? */
df.df_flags = (s->open_flags & (O_DIRECT | O_DSYNC)) ?
DF_WAIT_SYNC : 0;
df.df_start = sector_num << 9;
df.df_length = (diskaddr_t)nb_sectors << 9;
if (ioctl(s->fd, DKIOCFREE, &range)) != 0 && errno != ENOTTY)
{
return -errno;
}
return 0;
}
#endif
-------- Thin vs. Thick Provisioning --------
On Fri, 2012-03-09 at 00:35 -0800, Chris Wedgwood wrote:
> Simplest still compare the blocks allocated by the file
> to it's length (ie. stat.st_blocks != stat.st_size>>9).
I thought of this as well. It covers "99%" of the cases, but there's one
case where it breaks down. Imagine I have a sparse file backing my
virtual disk. In the guest, I fill the virtual disk 100%. Then, I
restart QEMU. Now it thinks that sparse file is non-sparse and stops
issuing hole punches.
To be completely correct, I suggest the following behavior:
1. Add a discard boolean option to the disk layer.
2. If discard is not specified:
* For files, detect a true/false value by comparing
stat.st_blocks != stat.st_size>>9.
* For devices, assume a fixed value (true?).
3. If discard is true, issue discards.
4. If discard is false, do not issue discards.
-------- CONFIG_XFS --------
XFS has implemented the FALLOC_FL_PUNCH_HOLE interface since it was
created. So unless you're concerned about people compiling QEMU on newer
systems with FALLOC_FL_PUNCH_HOLE and then running that binary on older
kernels, there's no case where one needs both the CONFIG_XFS code and
FALLOC_FL_PUNCH_HOLE code.
--
Richard
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
next prev parent reply other threads:[~2012-03-09 20:37 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-03-08 17:15 [Qemu-devel] [RFC PATCH 00/17] Improvements around discard and write zeroes Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 01/17] qemu-iotests: add a simple test for write_zeroes Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 02/17] qed: make write-zeroes bounce buffer smaller than a single cluster Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 03/17] block: add discard properties to BlockDriverInfo Paolo Bonzini
2012-03-09 16:47 ` Kevin Wolf
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 04/17] qed: implement bdrv_aio_discard Paolo Bonzini
2012-03-09 16:31 ` Kevin Wolf
2012-03-09 17:53 ` Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 05/17] block: pass around qiov for write_zeroes operation Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations Paolo Bonzini
2012-03-09 16:37 ` Kevin Wolf
2012-03-09 18:06 ` Paolo Bonzini
2012-03-10 18:02 ` Richard Laager
2012-03-12 12:27 ` Paolo Bonzini
2012-03-12 13:04 ` Kevin Wolf
2012-03-13 19:13 ` Richard Laager
2012-03-14 7:41 ` Paolo Bonzini
2012-03-14 12:01 ` Kevin Wolf
2012-03-14 12:14 ` Paolo Bonzini
2012-03-14 12:37 ` Kevin Wolf
2012-03-14 12:49 ` Paolo Bonzini
2012-03-14 13:04 ` Kevin Wolf
2012-03-24 15:33 ` Christoph Hellwig
2012-03-24 15:30 ` Christoph Hellwig
2012-03-26 19:40 ` Richard Laager
2012-03-27 10:20 ` Kevin Wolf
2012-03-24 15:29 ` Christoph Hellwig
2012-03-26 9:44 ` Daniel P. Berrange
2012-03-26 9:56 ` Christoph Hellwig
2012-03-15 0:42 ` Richard Laager
2012-03-15 9:36 ` Paolo Bonzini
2012-03-16 0:47 ` Richard Laager
2012-03-16 9:34 ` Paolo Bonzini
2012-03-24 15:27 ` Christoph Hellwig
2012-03-26 19:40 ` Richard Laager
2012-03-27 9:08 ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 07/17] block: make high level discard operation always zero Paolo Bonzini
2012-03-08 17:55 ` Avi Kivity
2012-03-09 16:42 ` Kevin Wolf
2012-03-12 10:42 ` Avi Kivity
2012-03-12 11:04 ` Kevin Wolf
2012-03-12 12:03 ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 08/17] block: kill the write zeroes operation Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 09/17] ide: issue discard asynchronously but serialize the pieces Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 10/17] ide/scsi: add discard_zeroes_data property Paolo Bonzini
2012-03-08 18:13 ` Avi Kivity
2012-03-08 18:14 ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 11/17] ide/scsi: prepare for flipping the discard defaults Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 12/17] ide/scsi: turn on discard Paolo Bonzini
2012-03-08 18:17 ` Avi Kivity
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 13/17] block: fallback from discard to writes Paolo Bonzini
2012-03-24 15:35 ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 14/17] block: support FALLOC_FL_PUNCH_HOLE trimming Paolo Bonzini
2012-03-09 8:20 ` Chris Wedgwood
2012-03-09 8:31 ` Paolo Bonzini
2012-03-09 8:35 ` Chris Wedgwood
2012-03-09 8:40 ` Paolo Bonzini
2012-03-09 10:31 ` Stefan Hajnoczi
2012-03-09 10:43 ` Paolo Bonzini
2012-03-09 10:53 ` Stefan Hajnoczi
2012-03-09 10:57 ` Paolo Bonzini
2012-03-09 20:36 ` Richard Laager [this message]
2012-03-12 9:34 ` Paolo Bonzini
2012-03-24 15:40 ` Christoph Hellwig
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 15/17] raw: add get_info Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 16/17] qemu-io: fix the alloc command Paolo Bonzini
2012-03-08 17:15 ` [Qemu-devel] [RFC PATCH 17/17] raw: implement is_allocated Paolo Bonzini
2012-03-24 15:42 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1331325410.3715.77.camel@watermelon.coderich.net \
--to=rlaager@wiktel.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).