qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
@ 2015-01-27 13:51 ` Denis V. Lunev
  2015-01-27 17:57   ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-27 13:51 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev

There is a possibility that we are extending our image and thus writing
zeroes beyond the end of the file. In this case we do not need to care
about the hole to make sure that there is no data in the file under
this offset (pre-condition to fallocate(0) to work). We could simply call
fallocate(0).

This improves the performance of writing zeroes even on really old
platforms which do not have even FALLOC_FL_PUNCH_HOLE.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index c039bef..fa05239 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL                     0x00800000 /* Do not cow file */
 #endif
 #endif
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 #include <linux/falloc.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -902,7 +902,7 @@ static int translate_err(int err)
     return err;
 }
 
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
     do {
@@ -981,6 +981,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
     }
 #endif
 
+#ifdef CONFIG_FALLOCATE
+    if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) {
+        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
     s->has_write_zeroes = false;
     return ret;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-27 13:51 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
@ 2015-01-27 17:57   ` Max Reitz
  2015-01-27 18:19     ` Denis V. Lunev
  0 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2015-01-27 17:57 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 2015-01-27 at 08:51, Denis V. Lunev wrote:
> There is a possibility that we are extending our image and thus writing
> zeroes beyond the end of the file. In this case we do not need to care
> about the hole to make sure that there is no data in the file under
> this offset (pre-condition to fallocate(0) to work). We could simply call
> fallocate(0).
>
> This improves the performance of writing zeroes even on really old
> platforms which do not have even FALLOC_FL_PUNCH_HOLE.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> CC: Fam Zheng <famz@redhat.com>
> ---
>   block/raw-posix.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index c039bef..fa05239 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -60,7 +60,7 @@
>   #define FS_NOCOW_FL                     0x00800000 /* Do not cow file */
>   #endif
>   #endif
> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
> +#ifdef CONFIG_FALLOCATE

This change doesn't seem right; CONFIG_FALLOCATE is set if 
posix_fallocate() is available, not for the Linux-specific fallocate() 
from linux/falloc.h.

>   #include <linux/falloc.h>
>   #endif
>   #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
> @@ -902,7 +902,7 @@ static int translate_err(int err)
>       return err;
>   }
>   
> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
> +#ifdef CONFIG_FALLOCATE

Same here.

>   static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>   {
>       do {
> @@ -981,6 +981,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>       }
>   #endif
>   
> +#ifdef CONFIG_FALLOCATE
> +    if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) {
> +        return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
> +    }
> +#endif
> +

This seems fine though, but as I've asked in patch 5: Do we want to have 
a "has_fallocate"?

Other than that, this is the first usage of bs->total_sectors in this 
file; raw_co_get_block_status() does a similar check, but it uses 
bdrv_getlength() instead. If bs->total_sectors is correct, 
bdrv_getlength() will actually do nothing but return bs->total_sectors * 
BDRV_SECTOR_SIZE; it will only do more (that is, update 
bs->total_sectors) if it is not correct to use bs->total_sectors (and I 
feel like it may not be correct because BlockDriver.has_variable_length 
is true).

Max

>       s->has_write_zeroes = false;
>       return ret;
>   }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-27 17:57   ` Max Reitz
@ 2015-01-27 18:19     ` Denis V. Lunev
  2015-01-27 18:24       ` Max Reitz
  0 siblings, 1 reply; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-27 18:19 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 27/01/15 20:57, Max Reitz wrote:
> On 2015-01-27 at 08:51, Denis V. Lunev wrote:
>> There is a possibility that we are extending our image and thus writing
>> zeroes beyond the end of the file. In this case we do not need to care
>> about the hole to make sure that there is no data in the file under
>> this offset (pre-condition to fallocate(0) to work). We could simply 
>> call
>> fallocate(0).
>>
>> This improves the performance of writing zeroes even on really old
>> platforms which do not have even FALLOC_FL_PUNCH_HOLE.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> CC: Fam Zheng <famz@redhat.com>
>> ---
>>   block/raw-posix.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index c039bef..fa05239 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -60,7 +60,7 @@
>>   #define FS_NOCOW_FL                     0x00800000 /* Do not cow 
>> file */
>>   #endif
>>   #endif
>> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
>> defined(CONFIG_FALLOCATE_ZERO_RANGE)
>> +#ifdef CONFIG_FALLOCATE
>
> This change doesn't seem right; CONFIG_FALLOCATE is set if 
> posix_fallocate() is available, not for the Linux-specific fallocate() 
> from linux/falloc.h.
>

here is a check for fallocate and posix_fallocate in configure script

# check for fallocate
fallocate=no
cat > $TMPC << EOF
#include <fcntl.h>

int main(void)
{
     fallocate(0, 0, 0, 0);
     return 0;
}
EOF
if compile_prog "" "" ; then
   fallocate=yes
fi
...
# check for posix_fallocate
posix_fallocate=no
cat > $TMPC << EOF
#include <fcntl.h>

int main(void)
{
     posix_fallocate(0, 0, 0);
     return 0;
}
EOF
if compile_prog "" "" ; then
     posix_fallocate=yes
fi
...
if test "$fallocate" = "yes" ; then
   echo "CONFIG_FALLOCATE=y" >> $config_host_mak
fi
...
if test "$posix_fallocate" = "yes" ; then
   echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
fi

Thus my check looks correct to me.

>>   #include <linux/falloc.h>
>>   #endif
>>   #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
>> @@ -902,7 +902,7 @@ static int translate_err(int err)
>>       return err;
>>   }
>>   -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
>> defined(CONFIG_FALLOCATE_ZERO_RANGE)
>> +#ifdef CONFIG_FALLOCATE
>
> Same here.
>
>>   static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>>   {
>>       do {
>> @@ -981,6 +981,12 @@ static ssize_t 
>> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>       }
>>   #endif
>>   +#ifdef CONFIG_FALLOCATE
>> +    if (aiocb->aio_offset >= aiocb->bs->total_sectors << 
>> BDRV_SECTOR_BITS) {
>> +        return do_fallocate(s->fd, 0, aiocb->aio_offset, 
>> aiocb->aio_nbytes);
>> +    }
>> +#endif
>> +
>
> This seems fine though, but as I've asked in patch 5: Do we want to 
> have a "has_fallocate"?
>
> Other than that, this is the first usage of bs->total_sectors in this 
> file; raw_co_get_block_status() does a similar check, but it uses 
> bdrv_getlength() instead. If bs->total_sectors is correct, 
> bdrv_getlength() will actually do nothing but return bs->total_sectors 
> * BDRV_SECTOR_SIZE; it will only do more (that is, update 
> bs->total_sectors) if it is not correct to use bs->total_sectors (and 
> I feel like it may not be correct because 
> BlockDriver.has_variable_length is true).
>
> Max
>
ok, will do

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-27 18:19     ` Denis V. Lunev
@ 2015-01-27 18:24       ` Max Reitz
  2015-01-27 18:33         ` Denis V. Lunev
  0 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2015-01-27 18:24 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 2015-01-27 at 13:19, Denis V. Lunev wrote:
> On 27/01/15 20:57, Max Reitz wrote:
>> On 2015-01-27 at 08:51, Denis V. Lunev wrote:
>>> There is a possibility that we are extending our image and thus writing
>>> zeroes beyond the end of the file. In this case we do not need to care
>>> about the hole to make sure that there is no data in the file under
>>> this offset (pre-condition to fallocate(0) to work). We could simply 
>>> call
>>> fallocate(0).
>>>
>>> This improves the performance of writing zeroes even on really old
>>> platforms which do not have even FALLOC_FL_PUNCH_HOLE.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> CC: Peter Lieven <pl@kamp.de>
>>> CC: Fam Zheng <famz@redhat.com>
>>> ---
>>>   block/raw-posix.c | 10 ++++++++--
>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>>> index c039bef..fa05239 100644
>>> --- a/block/raw-posix.c
>>> +++ b/block/raw-posix.c
>>> @@ -60,7 +60,7 @@
>>>   #define FS_NOCOW_FL                     0x00800000 /* Do not cow 
>>> file */
>>>   #endif
>>>   #endif
>>> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
>>> defined(CONFIG_FALLOCATE_ZERO_RANGE)
>>> +#ifdef CONFIG_FALLOCATE
>>
>> This change doesn't seem right; CONFIG_FALLOCATE is set if 
>> posix_fallocate() is available, not for the Linux-specific 
>> fallocate() from linux/falloc.h.
>>
>
> here is a check for fallocate and posix_fallocate in configure script
>
> # check for fallocate
> fallocate=no
> cat > $TMPC << EOF
> #include <fcntl.h>
>
> int main(void)
> {
>     fallocate(0, 0, 0, 0);
>     return 0;
> }
> EOF
> if compile_prog "" "" ; then
>   fallocate=yes
> fi
> ...
> # check for posix_fallocate
> posix_fallocate=no
> cat > $TMPC << EOF
> #include <fcntl.h>
>
> int main(void)
> {
>     posix_fallocate(0, 0, 0);
>     return 0;
> }
> EOF
> if compile_prog "" "" ; then
>     posix_fallocate=yes
> fi
> ...
> if test "$fallocate" = "yes" ; then
>   echo "CONFIG_FALLOCATE=y" >> $config_host_mak
> fi
> ...
> if test "$posix_fallocate" = "yes" ; then
>   echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
> fi
>
> Thus my check looks correct to me.

Oh, sorry, I somehow mixed those checks. You're right.

Very well then; maybe you want to mention this change in the commit 
message, though?

Max

>
>>>   #include <linux/falloc.h>
>>>   #endif
>>>   #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
>>> @@ -902,7 +902,7 @@ static int translate_err(int err)
>>>       return err;
>>>   }
>>>   -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
>>> defined(CONFIG_FALLOCATE_ZERO_RANGE)
>>> +#ifdef CONFIG_FALLOCATE
>>
>> Same here.
>>
>>>   static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>>>   {
>>>       do {
>>> @@ -981,6 +981,12 @@ static ssize_t 
>>> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>>       }
>>>   #endif
>>>   +#ifdef CONFIG_FALLOCATE
>>> +    if (aiocb->aio_offset >= aiocb->bs->total_sectors << 
>>> BDRV_SECTOR_BITS) {
>>> +        return do_fallocate(s->fd, 0, aiocb->aio_offset, 
>>> aiocb->aio_nbytes);
>>> +    }
>>> +#endif
>>> +
>>
>> This seems fine though, but as I've asked in patch 5: Do we want to 
>> have a "has_fallocate"?
>>
>> Other than that, this is the first usage of bs->total_sectors in this 
>> file; raw_co_get_block_status() does a similar check, but it uses 
>> bdrv_getlength() instead. If bs->total_sectors is correct, 
>> bdrv_getlength() will actually do nothing but return 
>> bs->total_sectors * BDRV_SECTOR_SIZE; it will only do more (that is, 
>> update bs->total_sectors) if it is not correct to use 
>> bs->total_sectors (and I feel like it may not be correct because 
>> BlockDriver.has_variable_length is true).
>>
>> Max
>>
> ok, will do

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-27 18:24       ` Max Reitz
@ 2015-01-27 18:33         ` Denis V. Lunev
  0 siblings, 0 replies; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-27 18:33 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Peter Lieven, Fam Zheng, qemu-devel, Stefan Hajnoczi

On 27/01/15 21:24, Max Reitz wrote:
> On 2015-01-27 at 13:19, Denis V. Lunev wrote:
>> On 27/01/15 20:57, Max Reitz wrote:
>>> On 2015-01-27 at 08:51, Denis V. Lunev wrote:
>>>> There is a possibility that we are extending our image and thus 
>>>> writing
>>>> zeroes beyond the end of the file. In this case we do not need to care
>>>> about the hole to make sure that there is no data in the file under
>>>> this offset (pre-condition to fallocate(0) to work). We could 
>>>> simply call
>>>> fallocate(0).
>>>>
>>>> This improves the performance of writing zeroes even on really old
>>>> platforms which do not have even FALLOC_FL_PUNCH_HOLE.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>> CC: Peter Lieven <pl@kamp.de>
>>>> CC: Fam Zheng <famz@redhat.com>
>>>> ---
>>>>   block/raw-posix.c | 10 ++++++++--
>>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>>>> index c039bef..fa05239 100644
>>>> --- a/block/raw-posix.c
>>>> +++ b/block/raw-posix.c
>>>> @@ -60,7 +60,7 @@
>>>>   #define FS_NOCOW_FL                     0x00800000 /* Do not cow 
>>>> file */
>>>>   #endif
>>>>   #endif
>>>> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
>>>> defined(CONFIG_FALLOCATE_ZERO_RANGE)
>>>> +#ifdef CONFIG_FALLOCATE
>>>
>>> This change doesn't seem right; CONFIG_FALLOCATE is set if 
>>> posix_fallocate() is available, not for the Linux-specific 
>>> fallocate() from linux/falloc.h.
>>>
>>
>> here is a check for fallocate and posix_fallocate in configure script
>>
>> # check for fallocate
>> fallocate=no
>> cat > $TMPC << EOF
>> #include <fcntl.h>
>>
>> int main(void)
>> {
>>     fallocate(0, 0, 0, 0);
>>     return 0;
>> }
>> EOF
>> if compile_prog "" "" ; then
>>   fallocate=yes
>> fi
>> ...
>> # check for posix_fallocate
>> posix_fallocate=no
>> cat > $TMPC << EOF
>> #include <fcntl.h>
>>
>> int main(void)
>> {
>>     posix_fallocate(0, 0, 0);
>>     return 0;
>> }
>> EOF
>> if compile_prog "" "" ; then
>>     posix_fallocate=yes
>> fi
>> ...
>> if test "$fallocate" = "yes" ; then
>>   echo "CONFIG_FALLOCATE=y" >> $config_host_mak
>> fi
>> ...
>> if test "$posix_fallocate" = "yes" ; then
>>   echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
>> fi
>>
>> Thus my check looks correct to me.
>
> Oh, sorry, I somehow mixed those checks. You're right.
>
> Very well then; maybe you want to mention this change in the commit 
> message, though?
>
> Max
>
no prob

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
@ 2015-01-28 18:38 Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz,
	Stefan Hajnoczi, Denis V. Lunev

These patches eliminate data writes completely on Linux if fallocate
FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are  supported on
underlying filesystem.

I have performed several tests with non-aligned fallocate calls and
in all cases (with non-aligned fallocates) Linux performs fine, i.e.
areas are zeroed correctly. Checks were made on
Linux 3.16.0-28-generic #38-Ubuntu SMP

This should seriously increase performance of bdrv_write_zeroes

Changes from v4:
- comments to patches are improved by Max Reitz suggestions
- replaced 'return ret' with 'return -ENOTSUP' handle_aiocb_write_zeroes
  The idea is that we can reach that point only if original ret was
  equal to -ENOTSUP
- added has_fallocate flag to indicate that fallocate is working for
  given BDS
- checked file length with bdrv_getlength

Changes from v3:
- dropped original patch 1, equivalent stuff was merged already
- reordered patches as suggested by Fam
- fixes spelling errors as suggested by Fam
- fixed not initialized value in handle_aiocb_write_zeroes
- fixed wrong error processing from do_fallocate(FALLOC_FL_ZERO_RANGE)

Changes from v2:
- added Peter Lieven to CC
- added CONFIG_FALLOCATE check to call do_fallocate in patch 7
- dropped patch 1 as NACK-ed
- added processing of very large data areas in bdrv_co_write_zeroes (new
  patch 1)
- set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files
  (new patch 8)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values
  2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
@ 2015-01-28 18:38 ` Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper Denis V. Lunev
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev

actually the code
    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
        ret == -ENOTTY) {
        ret = -ENOTSUP;
    }
is present twice and will be added a couple more times. Create helper
for this.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index e51293a..24300d0 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -893,6 +893,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 }
 #endif
 
+static int translate_err(int err)
+{
+    if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
+        err == -ENOTTY) {
+        err = -ENOTSUP;
+    }
+    return err;
+}
+
 static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
@@ -921,10 +930,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 #endif
     }
 
-    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-        ret == -ENOTTY) {
+    ret = translate_err(ret);
+    if (ret == -ENOTSUP) {
         s->has_write_zeroes = false;
-        ret = -ENOTSUP;
     }
     return ret;
 }
@@ -968,10 +976,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
     }
 
-    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
-        ret == -ENOTTY) {
+    ret = translate_err(ret);
+    if (ret == -ENOTSUP) {
         s->has_discard = false;
-        ret = -ENOTSUP;
     }
     return ret;
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper
  2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
@ 2015-01-28 18:38 ` Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev

The pattern
    do {
        if (fallocate(s->fd, mode, offset, len) == 0) {
            return 0;
        }
    } while (errno == EINTR);
    ret = translate_err(-errno);
will be commonly useful in next patches. Create helper for it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 24300d0..2aa268a 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -902,6 +902,18 @@ static int translate_err(int err)
     return err;
 }
 
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE)
+static int do_fallocate(int fd, int mode, off_t offset, off_t len)
+{
+    do {
+        if (fallocate(fd, mode, offset, len) == 0) {
+            return 0;
+        }
+    } while (errno == EINTR);
+    return translate_err(-errno);
+}
+#endif
+
 static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
@@ -965,14 +977,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 #endif
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
-        do {
-            if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-                          aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
+        ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                           aiocb->aio_offset, aiocb->aio_nbytes);
 #endif
     }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
  2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper Denis V. Lunev
@ 2015-01-28 18:38 ` Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi,
	Denis V. Lunev

move code dealing with a block device to a separate function. This will
allow to implement additional processing for ordinary files.

Please note, that xfs_code has been moved before checking for
s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
This makes code a bit more consistent.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 2aa268a..d3910fd 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -914,41 +914,49 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 }
 #endif
 
-static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
 {
-    int ret = -EOPNOTSUPP;
+    int ret = -ENOTSUP;
     BDRVRawState *s = aiocb->bs->opaque;
 
-    if (s->has_write_zeroes == 0) {
+    if (!s->has_write_zeroes) {
         return -ENOTSUP;
     }
 
-    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
 #ifdef BLKZEROOUT
-        do {
-            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
-            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
-                return 0;
-            }
-        } while (errno == EINTR);
-
-        ret = -errno;
-#endif
-    } else {
-#ifdef CONFIG_XFS
-        if (s->is_xfs) {
-            return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+    do {
+        uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+        if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+            return 0;
         }
+    } while (errno == EINTR);
+
+    ret = translate_err(-errno);
 #endif
-    }
 
-    ret = translate_err(ret);
     if (ret == -ENOTSUP) {
         s->has_write_zeroes = false;
     }
     return ret;
 }
 
+static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+{
+    BDRVRawState *s = aiocb->bs->opaque;
+
+    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
+        return handle_aiocb_write_zeroes_block(aiocb);
+    }
+
+#ifdef CONFIG_XFS
+    if (s->is_xfs) {
+        return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
+    return -ENOTSUP;
+}
+
 static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
  2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (2 preceding siblings ...)
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
@ 2015-01-28 18:38 ` Denis V. Lunev
  2015-01-29 22:40   ` Max Reitz
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz,
	Stefan Hajnoczi, Denis V. Lunev

This efficiently writes zeroes on Linux if the kernel is capable enough.
FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not
including file expansion.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 15 +++++++++++++--
 configure         | 19 +++++++++++++++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index d3910fd..5a777e7 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
 #define FS_NOCOW_FL                     0x00800000 /* Do not cow file */
 #endif
 #endif
-#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
 #include <linux/falloc.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -902,7 +902,7 @@ static int translate_err(int err)
     return err;
 }
 
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE)
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
     do {
@@ -954,6 +954,17 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
     }
 #endif
 
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+    if (s->has_write_zeroes) {
+        int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+                               aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret == 0 || ret != -ENOTSUP) {
+            return ret;
+        }
+        s->has_write_zeroes = false;
+    }
+#endif
+
     return -ENOTSUP;
 }
 
diff --git a/configure b/configure
index f185dd0..e00e03a 100755
--- a/configure
+++ b/configure
@@ -3335,6 +3335,22 @@ if compile_prog "" "" ; then
   fallocate_punch_hole=yes
 fi
 
+# check that fallocate supports range zeroing inside the file
+fallocate_zero_range=no
+cat > $TMPC << EOF
+#include <fcntl.h>
+#include <linux/falloc.h>
+
+int main(void)
+{
+    fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0);
+    return 0;
+}
+EOF
+if compile_prog "" "" ; then
+  fallocate_zero_range=yes
+fi
+
 # check for posix_fallocate
 posix_fallocate=no
 cat > $TMPC << EOF
@@ -4567,6 +4583,9 @@ fi
 if test "$fallocate_punch_hole" = "yes" ; then
   echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak
 fi
+if test "$fallocate_zero_range" = "yes" ; then
+  echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak
+fi
 if test "$posix_fallocate" = "yes" ; then
   echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
 fi
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (3 preceding siblings ...)
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
@ 2015-01-28 18:38 ` Denis V. Lunev
  2015-01-29 22:40   ` Max Reitz
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
  6 siblings, 1 reply; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz,
	Stefan Hajnoczi, Denis V. Lunev

This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems
and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more
mature.

The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due
to the following reasons:
- FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes
  sparse. In order to retain original functionality we must allocate
  disk space afterwards. This is done using fallocate(0) call
- fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing
  if called above already allocated areas of the file, i.e. the content
  will not be zeroed

This should increase the performance a bit for not-so-modern kernels.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 5a777e7..2e24829 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -965,6 +965,21 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
     }
 #endif
 
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    if (s->has_discard) {
+        int ret = do_fallocate(s->fd,
+                               FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                               aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret == 0) {
+            ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+        }
+        if (ret != -ENOTSUP) {
+            return ret;
+        }
+        s->has_discard = false;
+    }
+#endif
+
     return -ENOTSUP;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (4 preceding siblings ...)
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2015-01-28 18:38 ` Denis V. Lunev
  2015-01-29 22:50   ` Max Reitz
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
  6 siblings, 1 reply; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz,
	Stefan Hajnoczi, Denis V. Lunev

There is a possibility that we are extending our image and thus writing
zeroes beyond the end of the file. In this case we do not need to care
about the hole to make sure that there is no data in the file under
this offset (pre-condition to fallocate(0) to work). We could simply call
fallocate(0).

This improves the performance of writing zeroes even on really old
platforms which do not have even FALLOC_FL_PUNCH_HOLE.

Before the patch do_fallocate was used when either
CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined.
Now the story is different. CONFIG_FALLOCATE is defined when Linux
fallocate is defined, posix_fallocate is completely different story
(CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite
for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE
thus we are on the safe side.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 2e24829..3db911a 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -147,6 +147,7 @@ typedef struct BDRVRawState {
     bool has_discard:1;
     bool has_write_zeroes:1;
     bool discard_zeroes:1;
+    bool has_fallocate;
     bool needs_alignment;
 } BDRVRawState;
 
@@ -452,6 +453,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
     if (S_ISREG(st.st_mode)) {
         s->discard_zeroes = true;
+        s->has_fallocate = true;
     }
     if (S_ISBLK(st.st_mode)) {
 #ifdef BLKDISCARDZEROES
@@ -902,7 +904,7 @@ static int translate_err(int err)
     return err;
 }
 
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
     do {
@@ -980,6 +982,16 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
     }
 #endif
 
+#ifdef CONFIG_FALLOCATE
+    if (s->has_fallocate && aiocb->aio_offset >= bdrv_getlength(aiocb->bs)) {
+        int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+        if (ret == 0 || ret != -ENOTSUP) {
+            return ret;
+        }
+        s->has_fallocate = false;
+    }
+#endif
+
     return -ENOTSUP;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
  2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
                   ` (5 preceding siblings ...)
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
@ 2015-01-28 18:38 ` Denis V. Lunev
  2015-01-29 22:51   ` Max Reitz
  6 siblings, 1 reply; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-28 18:38 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Max Reitz,
	Stefan Hajnoczi, Denis V. Lunev

fallocate() works fine and could handle properly with arbitrary size
requests. There is no sense to reduce the amount of space to fallocate.
The bigger is the size, the better is the performance as the amount of
journal updates is reduced.

The patch changes behavior for both generic filesystem and XFS codepaths,
which are different in handle_aiocb_write_zeroes. The implementation
of falocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same
thus the change is fine for both ways.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Max Reitz <mreitz@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 3db911a..ec38fee 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     }
 }
 
+static void raw_probe_max_write_zeroes(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    struct stat st;
+
+    if (fstat(s->fd, &st) < 0) {
+        return; /* no problem, keep default value */
+    }
+    if (!S_ISREG(st.st_mode) || !s->discard_zeroes) {
+        return;
+    }
+    bs->bl.max_write_zeroes = INT_MAX;
+}
+
 static void raw_parse_flags(int bdrv_flags, int *open_flags)
 {
     assert(open_flags != NULL);
@@ -600,6 +614,7 @@ static int raw_reopen_prepare(BDRVReopenState *state,
     /* Fail already reopen_prepare() if we can't get a working O_DIRECT
      * alignment with the new fd. */
     if (raw_s->fd != -1) {
+        raw_probe_max_write_zeroes(state->bs);
         raw_probe_alignment(state->bs, raw_s->fd, &local_err);
         if (local_err) {
             qemu_close(raw_s->fd);
@@ -653,6 +668,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 
     raw_probe_alignment(bs, s->fd, errp);
     bs->bl.opt_mem_alignment = s->buf_align;
+
+    raw_probe_max_write_zeroes(bs);
 }
 
 static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2015-01-29 22:40   ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2015-01-29 22:40 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 2015-01-28 at 13:38, Denis V. Lunev wrote:
> This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
> Unfortunately, FALLOC_FL_ZERO_RANGE is supported on really modern systems
> and only for a couple of filesystems. FALLOC_FL_PUNCH_HOLE is much more
> mature.
>
> The sequence of 2 operations FALLOC_FL_PUNCH_HOLE and 0 is necessary due
> to the following reasons:
> - FALLOC_FL_PUNCH_HOLE creates a hole in the file, the file becomes
>    sparse. In order to retain original functionality we must allocate
>    disk space afterwards. This is done using fallocate(0) call
> - fallocate(0) without preceeding FALLOC_FL_PUNCH_HOLE will do nothing
>    if called above already allocated areas of the file, i.e. the content
>    will not be zeroed
>
> This should increase the performance a bit for not-so-modern kernels.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> CC: Fam Zheng <famz@redhat.com>
> ---
>   block/raw-posix.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 5a777e7..2e24829 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -965,6 +965,21 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>       }
>   #endif
>   
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    if (s->has_discard) {
> +        int ret = do_fallocate(s->fd,
> +                               FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                               aiocb->aio_offset, aiocb->aio_nbytes);
> +        if (ret == 0) {
> +            ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
> +        }
> +        if (ret != -ENOTSUP) {
> +            return ret;
> +        }
> +        s->has_discard = false;
> +    }

The problem with putting do_fallocate() there is that if do_fallocate() 
with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE works but do_fallocate() 
without any flags does not, has_discard will be set to false which will 
render handle_aiocb_discard() useless.

I don't think that that's possible, though (the first do_fallocate() 
working, but the second returning -ENOTSUP), so here's one rather reluctant:

Reviewed-by: Max Reitz <mreitz@redhat.com>

> +#endif
> +
>       return -ENOTSUP;
>   }
>   

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
@ 2015-01-29 22:40   ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2015-01-29 22:40 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 2015-01-28 at 13:38, Denis V. Lunev wrote:
> This efficiently writes zeroes on Linux if the kernel is capable enough.
> FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not
> including file expansion.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> CC: Fam Zheng <famz@redhat.com>
> ---
>   block/raw-posix.c | 15 +++++++++++++--
>   configure         | 19 +++++++++++++++++++
>   2 files changed, 32 insertions(+), 2 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
@ 2015-01-29 22:50   ` Max Reitz
  2015-01-30  5:38     ` Denis V. Lunev
  0 siblings, 1 reply; 18+ messages in thread
From: Max Reitz @ 2015-01-29 22:50 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 2015-01-28 at 13:38, Denis V. Lunev wrote:
> There is a possibility that we are extending our image and thus writing
> zeroes beyond the end of the file. In this case we do not need to care
> about the hole to make sure that there is no data in the file under
> this offset (pre-condition to fallocate(0) to work). We could simply call
> fallocate(0).
>
> This improves the performance of writing zeroes even on really old
> platforms which do not have even FALLOC_FL_PUNCH_HOLE.
>
> Before the patch do_fallocate was used when either
> CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined.
> Now the story is different. CONFIG_FALLOCATE is defined when Linux
> fallocate is defined, posix_fallocate is completely different story
> (CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite
> for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE
> thus we are on the safe side.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> CC: Fam Zheng <famz@redhat.com>
> ---
>   block/raw-posix.c | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 2e24829..3db911a 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -147,6 +147,7 @@ typedef struct BDRVRawState {
>       bool has_discard:1;
>       bool has_write_zeroes:1;
>       bool discard_zeroes:1;
> +    bool has_fallocate;
>       bool needs_alignment;
>   } BDRVRawState;
>   
> @@ -452,6 +453,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>       }
>       if (S_ISREG(st.st_mode)) {
>           s->discard_zeroes = true;
> +        s->has_fallocate = true;
>       }
>       if (S_ISBLK(st.st_mode)) {
>   #ifdef BLKDISCARDZEROES
> @@ -902,7 +904,7 @@ static int translate_err(int err)
>       return err;
>   }
>   
> -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
> +#ifdef CONFIG_FALLOCATE
>   static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>   {
>       do {
> @@ -980,6 +982,16 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>       }
>   #endif
>   
> +#ifdef CONFIG_FALLOCATE
> +    if (s->has_fallocate && aiocb->aio_offset >= bdrv_getlength(aiocb->bs)) {
> +        int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
> +        if (ret == 0 || ret != -ENOTSUP) {
> +            return ret;
> +        }
> +        s->has_fallocate = false;
> +    }
> +#endif
> +
>       return -ENOTSUP;
>   }

Now that you do have has_fallocate, I think you should be using it in 
patch 5 as well. So I think you should either you make this patch add it 
in the area touched by patch 5, or you introduce has_fallocate in patch 
5 already and use it there.

Max

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
  2015-01-28 18:38 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
@ 2015-01-29 22:51   ` Max Reitz
  0 siblings, 0 replies; 18+ messages in thread
From: Max Reitz @ 2015-01-29 22:51 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 2015-01-28 at 13:38, Denis V. Lunev wrote:
> fallocate() works fine and could handle properly with arbitrary size
> requests. There is no sense to reduce the amount of space to fallocate.
> The bigger is the size, the better is the performance as the amount of
> journal updates is reduced.
>
> The patch changes behavior for both generic filesystem and XFS codepaths,
> which are different in handle_aiocb_write_zeroes. The implementation
> of falocate and xfsctl(XFS_IOC_ZERO_RANGE) for XFS are exactly the same

*fallocate

With that fixed:

Reviewed-by: Max Reitz <mreitz@redhat.com>

> thus the change is fine for both ways.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> CC: Fam Zheng <famz@redhat.com>
> ---
>   block/raw-posix.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 3db911a..ec38fee 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -293,6 +293,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
>       }
>   }
>   
> +static void raw_probe_max_write_zeroes(BlockDriverState *bs)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    struct stat st;
> +
> +    if (fstat(s->fd, &st) < 0) {
> +        return; /* no problem, keep default value */
> +    }
> +    if (!S_ISREG(st.st_mode) || !s->discard_zeroes) {
> +        return;
> +    }
> +    bs->bl.max_write_zeroes = INT_MAX;
> +}
> +
>   static void raw_parse_flags(int bdrv_flags, int *open_flags)
>   {
>       assert(open_flags != NULL);
> @@ -600,6 +614,7 @@ static int raw_reopen_prepare(BDRVReopenState *state,
>       /* Fail already reopen_prepare() if we can't get a working O_DIRECT
>        * alignment with the new fd. */
>       if (raw_s->fd != -1) {
> +        raw_probe_max_write_zeroes(state->bs);
>           raw_probe_alignment(state->bs, raw_s->fd, &local_err);
>           if (local_err) {
>               qemu_close(raw_s->fd);
> @@ -653,6 +668,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>   
>       raw_probe_alignment(bs, s->fd, errp);
>       bs->bl.opt_mem_alignment = s->buf_align;
> +
> +    raw_probe_max_write_zeroes(bs);
>   }
>   
>   static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
  2015-01-29 22:50   ` Max Reitz
@ 2015-01-30  5:38     ` Denis V. Lunev
  0 siblings, 0 replies; 18+ messages in thread
From: Denis V. Lunev @ 2015-01-30  5:38 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Fam Zheng, Peter Lieven, qemu-devel, Stefan Hajnoczi

On 30/01/15 01:50, Max Reitz wrote:
> On 2015-01-28 at 13:38, Denis V. Lunev wrote:
>> There is a possibility that we are extending our image and thus writing
>> zeroes beyond the end of the file. In this case we do not need to care
>> about the hole to make sure that there is no data in the file under
>> this offset (pre-condition to fallocate(0) to work). We could simply 
>> call
>> fallocate(0).
>>
>> This improves the performance of writing zeroes even on really old
>> platforms which do not have even FALLOC_FL_PUNCH_HOLE.
>>
>> Before the patch do_fallocate was used when either
>> CONFIG_FALLOCATE_PUNCH_HOLE or CONFIG_FALLOCATE_ZERO_RANGE are defined.
>> Now the story is different. CONFIG_FALLOCATE is defined when Linux
>> fallocate is defined, posix_fallocate is completely different story
>> (CONFIG_POSIX_FALLOCATE). CONFIG_FALLOCATE is mandatory prerequite
>> for both CONFIG_FALLOCATE_PUNCH_HOLE and CONFIG_FALLOCATE_ZERO_RANGE
>> thus we are on the safe side.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> CC: Fam Zheng <famz@redhat.com>
>> ---
>>   block/raw-posix.c | 14 +++++++++++++-
>>   1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 2e24829..3db911a 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -147,6 +147,7 @@ typedef struct BDRVRawState {
>>       bool has_discard:1;
>>       bool has_write_zeroes:1;
>>       bool discard_zeroes:1;
>> +    bool has_fallocate;
>>       bool needs_alignment;
>>   } BDRVRawState;
>>   @@ -452,6 +453,7 @@ static int raw_open_common(BlockDriverState 
>> *bs, QDict *options,
>>       }
>>       if (S_ISREG(st.st_mode)) {
>>           s->discard_zeroes = true;
>> +        s->has_fallocate = true;
>>       }
>>       if (S_ISBLK(st.st_mode)) {
>>   #ifdef BLKDISCARDZEROES
>> @@ -902,7 +904,7 @@ static int translate_err(int err)
>>       return err;
>>   }
>>   -#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || 
>> defined(CONFIG_FALLOCATE_ZERO_RANGE)
>> +#ifdef CONFIG_FALLOCATE
>>   static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>>   {
>>       do {
>> @@ -980,6 +982,16 @@ static ssize_t 
>> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>       }
>>   #endif
>>   +#ifdef CONFIG_FALLOCATE
>> +    if (s->has_fallocate && aiocb->aio_offset >= 
>> bdrv_getlength(aiocb->bs)) {
>> +        int ret = do_fallocate(s->fd, 0, aiocb->aio_offset, 
>> aiocb->aio_nbytes);
>> +        if (ret == 0 || ret != -ENOTSUP) {
>> +            return ret;
>> +        }
>> +        s->has_fallocate = false;
>> +    }
>> +#endif
>> +
>>       return -ENOTSUP;
>>   }
>
> Now that you do have has_fallocate, I think you should be using it in 
> patch 5 as well. So I think you should either you make this patch add 
> it in the area touched by patch 5, or you introduce has_fallocate in 
> patch 5 already and use it there.
>
> Max
OK. No problem. I do not think that it is ever possible, but
why not?

I will reorder these patches in the patchset to minimize changes.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-01-30  5:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-01-28 18:38 [Qemu-devel] [PATCH v5 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
2015-01-28 18:38 ` [Qemu-devel] [PATCH 1/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
2015-01-28 18:38 ` [Qemu-devel] [PATCH 2/7] block/raw-posix: create do_fallocate helper Denis V. Lunev
2015-01-28 18:38 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
2015-01-28 18:38 ` [Qemu-devel] [PATCH 4/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
2015-01-29 22:40   ` Max Reitz
2015-01-28 18:38 ` [Qemu-devel] [PATCH 5/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
2015-01-29 22:40   ` Max Reitz
2015-01-28 18:38 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
2015-01-29 22:50   ` Max Reitz
2015-01-30  5:38     ` Denis V. Lunev
2015-01-28 18:38 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
2015-01-29 22:51   ` Max Reitz
  -- strict thread matches above, loose matches on Subject: below --
2015-01-27 13:51 [Qemu-devel] [PATCH v4 0/7] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
2015-01-27 13:51 ` [Qemu-devel] [PATCH 6/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
2015-01-27 17:57   ` Max Reitz
2015-01-27 18:19     ` Denis V. Lunev
2015-01-27 18:24       ` Max Reitz
2015-01-27 18:33         ` Denis V. Lunev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).