* [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2015-01-05 7:34 ` Peter Lieven
2014-12-30 9:20 ` [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
` (7 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
16 MiB as a chunk size. This is implemented in this way to tolerate
buggy block backends which do not accept too big requests.
Though if the bdrv_co_write_zeroes callback is not good enough, we
fallback to write data explicitely using bdrv_co_writev and we
create buffer to accomodate zeroes inside. The size of this buffer
is the size of the chunk. Thus if the underlying layer will have
bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
Actually, there is no need to allocate such a big amount of memory.
We could simply allocate 1 MiB buffer and create iovec, which will
point to the same memory.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)
diff --git a/block.c b/block.c
index 4165d42..d69c121 100644
--- a/block.c
+++ b/block.c
@@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
* of 32768 512-byte sectors (16 MiB) per request.
*/
#define MAX_WRITE_ZEROES_DEFAULT 32768
+/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
+#define MAX_ZEROES_CHUNK (1024 * 1024)
static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
{
BlockDriver *drv = bs->drv;
QEMUIOVector qiov;
- struct iovec iov = {0};
int ret = 0;
+ void *chunk = NULL;
+
+ qemu_iovec_init(&qiov, 0);
int max_write_zeroes = bs->bl.max_write_zeroes ?
bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
@@ -3217,27 +3221,35 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
}
if (ret == -ENOTSUP) {
+ int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
+ int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
+
/* Fall back to bounce buffer if write zeroes is unsupported */
- iov.iov_len = num * BDRV_SECTOR_SIZE;
- if (iov.iov_base == NULL) {
- iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
- if (iov.iov_base == NULL) {
+ if (chunk == NULL) {
+ chunk = qemu_try_blockalign(bs, chunk_size);
+ if (chunk == NULL) {
ret = -ENOMEM;
goto fail;
}
- memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
+ memset(chunk, 0, chunk_size);
+ }
+
+ while (num_bytes > 0) {
+ int to_add = MIN(chunk_size, num_bytes);
+ qemu_iovec_add(&qiov, chunk, to_add);
+ num_bytes -= to_add;
}
- qemu_iovec_init_external(&qiov, &iov, 1);
ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
/* Keep bounce buffer around if it is big enough for all
* all future requests.
*/
- if (num < max_write_zeroes) {
- qemu_vfree(iov.iov_base);
- iov.iov_base = NULL;
+ if (chunk_size != MAX_ZEROES_CHUNK) {
+ qemu_vfree(chunk);
+ chunk = NULL;
}
+ qemu_iovec_reset(&qiov);
}
sector_num += num;
@@ -3245,7 +3257,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
}
fail:
- qemu_vfree(iov.iov_base);
+ qemu_iovec_destroy(&qiov);
+ qemu_vfree(chunk);
return ret;
}
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
2014-12-30 9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
@ 2015-01-05 7:34 ` Peter Lieven
2015-01-05 11:06 ` Denis V. Lunev
0 siblings, 1 reply; 24+ messages in thread
From: Peter Lieven @ 2015-01-05 7:34 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
On 30.12.2014 10:20, Denis V. Lunev wrote:
> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
> 16 MiB as a chunk size. This is implemented in this way to tolerate
> buggy block backends which do not accept too big requests.
>
> Though if the bdrv_co_write_zeroes callback is not good enough, we
> fallback to write data explicitely using bdrv_co_writev and we
> create buffer to accomodate zeroes inside. The size of this buffer
> is the size of the chunk. Thus if the underlying layer will have
> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>
> Actually, there is no need to allocate such a big amount of memory.
> We could simply allocate 1 MiB buffer and create iovec, which will
> point to the same memory.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
> block.c | 35 ++++++++++++++++++++++++-----------
> 1 file changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/block.c b/block.c
> index 4165d42..d69c121 100644
> --- a/block.c
> +++ b/block.c
> @@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
> * of 32768 512-byte sectors (16 MiB) per request.
> */
> #define MAX_WRITE_ZEROES_DEFAULT 32768
> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>
> static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
> {
> BlockDriver *drv = bs->drv;
> QEMUIOVector qiov;
> - struct iovec iov = {0};
> int ret = 0;
> + void *chunk = NULL;
> +
> + qemu_iovec_init(&qiov, 0);
>
> int max_write_zeroes = bs->bl.max_write_zeroes ?
> bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
> @@ -3217,27 +3221,35 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
> }
>
> if (ret == -ENOTSUP) {
> + int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
> + int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
> +
> /* Fall back to bounce buffer if write zeroes is unsupported */
> - iov.iov_len = num * BDRV_SECTOR_SIZE;
> - if (iov.iov_base == NULL) {
> - iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
> - if (iov.iov_base == NULL) {
> + if (chunk == NULL) {
> + chunk = qemu_try_blockalign(bs, chunk_size);
> + if (chunk == NULL) {
> ret = -ENOMEM;
> goto fail;
> }
> - memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
> + memset(chunk, 0, chunk_size);
> + }
> +
> + while (num_bytes > 0) {
> + int to_add = MIN(chunk_size, num_bytes);
> + qemu_iovec_add(&qiov, chunk, to_add);
This can and likely will fail for big num_bytes if you exceed IOV_MAX vectors.
I would stick to the old method and limit the num to a reasonable value e.g. MAX_WRITE_ZEROES_DEFAULT.
This becomes necessary as you set INT_MAX for max_write_zeroes. That hasn't been considered before in
the original patch.
Peter
> + num_bytes -= to_add;
> }
> - qemu_iovec_init_external(&qiov, &iov, 1);
>
> ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
>
> /* Keep bounce buffer around if it is big enough for all
> * all future requests.
> */
> - if (num < max_write_zeroes) {
> - qemu_vfree(iov.iov_base);
> - iov.iov_base = NULL;
> + if (chunk_size != MAX_ZEROES_CHUNK) {
> + qemu_vfree(chunk);
> + chunk = NULL;
> }
> + qemu_iovec_reset(&qiov);
> }
>
> sector_num += num;
> @@ -3245,7 +3257,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
> }
>
> fail:
> - qemu_vfree(iov.iov_base);
> + qemu_iovec_destroy(&qiov);
> + qemu_vfree(chunk);
> return ret;
> }
>
--
Mit freundlichen Grüßen
Peter Lieven
...........................................................
KAMP Netzwerkdienste GmbH
Vestische Str. 89-91 | 46117 Oberhausen
Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
pl@kamp.de | http://www.kamp.de
Geschäftsführer: Heiner Lante | Michael Lante
Amtsgericht Duisburg | HRB Nr. 12154
USt-Id-Nr.: DE 120607556
...........................................................
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
2015-01-05 7:34 ` Peter Lieven
@ 2015-01-05 11:06 ` Denis V. Lunev
2015-01-05 11:23 ` Peter Lieven
0 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:06 UTC (permalink / raw)
To: Peter Lieven; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
On 05/01/15 10:34, Peter Lieven wrote:
> On 30.12.2014 10:20, Denis V. Lunev wrote:
>> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
>> 16 MiB as a chunk size. This is implemented in this way to tolerate
>> buggy block backends which do not accept too big requests.
>>
>> Though if the bdrv_co_write_zeroes callback is not good enough, we
>> fallback to write data explicitely using bdrv_co_writev and we
>> create buffer to accomodate zeroes inside. The size of this buffer
>> is the size of the chunk. Thus if the underlying layer will have
>> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>>
>> Actually, there is no need to allocate such a big amount of memory.
>> We could simply allocate 1 MiB buffer and create iovec, which will
>> point to the same memory.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>> block.c | 35 ++++++++++++++++++++++++-----------
>> 1 file changed, 24 insertions(+), 11 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 4165d42..d69c121 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3173,14 +3173,18 @@ int coroutine_fn
>> bdrv_co_copy_on_readv(BlockDriverState *bs,
>> * of 32768 512-byte sectors (16 MiB) per request.
>> */
>> #define MAX_WRITE_ZEROES_DEFAULT 32768
>> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big
>> allocations */
>> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>> static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState
>> *bs,
>> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>> {
>> BlockDriver *drv = bs->drv;
>> QEMUIOVector qiov;
>> - struct iovec iov = {0};
>> int ret = 0;
>> + void *chunk = NULL;
>> +
>> + qemu_iovec_init(&qiov, 0);
>> int max_write_zeroes = bs->bl.max_write_zeroes ?
>> bs->bl.max_write_zeroes :
>> MAX_WRITE_ZEROES_DEFAULT;
>> @@ -3217,27 +3221,35 @@ static int coroutine_fn
>> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>> }
>> if (ret == -ENOTSUP) {
>> + int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
>> + int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
>> +
>> /* Fall back to bounce buffer if write zeroes is
>> unsupported */
>> - iov.iov_len = num * BDRV_SECTOR_SIZE;
>> - if (iov.iov_base == NULL) {
>> - iov.iov_base = qemu_try_blockalign(bs, num *
>> BDRV_SECTOR_SIZE);
>> - if (iov.iov_base == NULL) {
>> + if (chunk == NULL) {
>> + chunk = qemu_try_blockalign(bs, chunk_size);
>> + if (chunk == NULL) {
>> ret = -ENOMEM;
>> goto fail;
>> }
>> - memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>> + memset(chunk, 0, chunk_size);
>> + }
>> +
>> + while (num_bytes > 0) {
>> + int to_add = MIN(chunk_size, num_bytes);
>> + qemu_iovec_add(&qiov, chunk, to_add);
>
> This can and likely will fail for big num_bytes if you exceed IOV_MAX
> vectors.
>
> I would stick to the old method and limit the num to a reasonable
> value e.g. MAX_WRITE_ZEROES_DEFAULT.
> This becomes necessary as you set INT_MAX for max_write_zeroes. That
> hasn't been considered before in
> the original patch.
>
> Peter
>
hmm. You are right, but I think that it would be better to limit iovec size
to 32 and this will solve the problem. Allocation of 32 Mb could be a
real problem
on loaded system could be a problem.
What do you think on this? May be we could consider 16 as a limit...
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
2015-01-05 11:06 ` Denis V. Lunev
@ 2015-01-05 11:23 ` Peter Lieven
2015-01-05 11:48 ` Denis V. Lunev
2015-01-05 12:26 ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
0 siblings, 2 replies; 24+ messages in thread
From: Peter Lieven @ 2015-01-05 11:23 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
On 05.01.2015 12:06, Denis V. Lunev wrote:
> On 05/01/15 10:34, Peter Lieven wrote:
>> On 30.12.2014 10:20, Denis V. Lunev wrote:
>>> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
>>> 16 MiB as a chunk size. This is implemented in this way to tolerate
>>> buggy block backends which do not accept too big requests.
>>>
>>> Though if the bdrv_co_write_zeroes callback is not good enough, we
>>> fallback to write data explicitely using bdrv_co_writev and we
>>> create buffer to accomodate zeroes inside. The size of this buffer
>>> is the size of the chunk. Thus if the underlying layer will have
>>> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>>>
>>> Actually, there is no need to allocate such a big amount of memory.
>>> We could simply allocate 1 MiB buffer and create iovec, which will
>>> point to the same memory.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> CC: Peter Lieven <pl@kamp.de>
>>> ---
>>> block.c | 35 ++++++++++++++++++++++++-----------
>>> 1 file changed, 24 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 4165d42..d69c121 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
>>> * of 32768 512-byte sectors (16 MiB) per request.
>>> */
>>> #define MAX_WRITE_ZEROES_DEFAULT 32768
>>> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
>>> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>>> static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>>> {
>>> BlockDriver *drv = bs->drv;
>>> QEMUIOVector qiov;
>>> - struct iovec iov = {0};
>>> int ret = 0;
>>> + void *chunk = NULL;
>>> +
>>> + qemu_iovec_init(&qiov, 0);
>>> int max_write_zeroes = bs->bl.max_write_zeroes ?
>>> bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
>>> @@ -3217,27 +3221,35 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>> }
>>> if (ret == -ENOTSUP) {
>>> + int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
>>> + int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
>>> +
>>> /* Fall back to bounce buffer if write zeroes is unsupported */
>>> - iov.iov_len = num * BDRV_SECTOR_SIZE;
>>> - if (iov.iov_base == NULL) {
>>> - iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
>>> - if (iov.iov_base == NULL) {
>>> + if (chunk == NULL) {
>>> + chunk = qemu_try_blockalign(bs, chunk_size);
>>> + if (chunk == NULL) {
>>> ret = -ENOMEM;
>>> goto fail;
>>> }
>>> - memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>>> + memset(chunk, 0, chunk_size);
>>> + }
>>> +
>>> + while (num_bytes > 0) {
>>> + int to_add = MIN(chunk_size, num_bytes);
>>> + qemu_iovec_add(&qiov, chunk, to_add);
>>
>> This can and likely will fail for big num_bytes if you exceed IOV_MAX vectors.
>>
>> I would stick to the old method and limit the num to a reasonable value e.g. MAX_WRITE_ZEROES_DEFAULT.
>> This becomes necessary as you set INT_MAX for max_write_zeroes. That hasn't been considered before in
>> the original patch.
>>
>> Peter
>>
>
> hmm. You are right, but I think that it would be better to limit iovec size
> to 32 and this will solve the problem. Allocation of 32 Mb could be a real problem
> on loaded system could be a problem.
>
> What do you think on this? May be we could consider 16 as a limit...
I would do the following:
---8<---
From 8c2a08baddbcd9e89bbb11fa83a42350bd7cc095 Mon Sep 17 00:00:00 2001
From: Peter Lieven <pl@kamp.de>
Date: Mon, 5 Jan 2015 12:14:52 +0100
Subject: [PATCH] block: limited request size in write zeroes unsupported path
If bs->bl.max_write_zeroes is large and we end up in the unsupported
path we might allocate a lot of memory for the iovector and/or even
generate an oversized requests.
Fix this by limiting the request by the minimum of the reported
maximum transfer size or 16MB (32768 sectors).
Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Lieven <pl@kamp.de>
---
block.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block.c b/block.c
index a612594..8009478 100644
--- a/block.c
+++ b/block.c
@@ -3203,6 +3203,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
if (ret == -ENOTSUP) {
/* Fall back to bounce buffer if write zeroes is unsupported */
+ int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length,
+ MAX_WRITE_ZEROES_DEFAULT);
+ num = MIN(num, max_xfer_len);
iov.iov_len = num * BDRV_SECTOR_SIZE;
if (iov.iov_base == NULL) {
iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
@@ -3219,7 +3222,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
/* Keep bounce buffer around if it is big enough for all
* all future requests.
*/
- if (num < max_write_zeroes) {
+ if (num < max_xfer_len) {
qemu_vfree(iov.iov_base);
iov.iov_base = NULL;
}
--
1.7.9.5
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
2015-01-05 11:23 ` Peter Lieven
@ 2015-01-05 11:48 ` Denis V. Lunev
2015-01-05 12:26 ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
1 sibling, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:48 UTC (permalink / raw)
To: Peter Lieven; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
On 05/01/15 14:23, Peter Lieven wrote:
> On 05.01.2015 12:06, Denis V. Lunev wrote:
>> On 05/01/15 10:34, Peter Lieven wrote:
>>> On 30.12.2014 10:20, Denis V. Lunev wrote:
>>>> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
>>>> 16 MiB as a chunk size. This is implemented in this way to tolerate
>>>> buggy block backends which do not accept too big requests.
>>>>
>>>> Though if the bdrv_co_write_zeroes callback is not good enough, we
>>>> fallback to write data explicitely using bdrv_co_writev and we
>>>> create buffer to accomodate zeroes inside. The size of this buffer
>>>> is the size of the chunk. Thus if the underlying layer will have
>>>> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>>>>
>>>> Actually, there is no need to allocate such a big amount of memory.
>>>> We could simply allocate 1 MiB buffer and create iovec, which will
>>>> point to the same memory.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>> CC: Peter Lieven <pl@kamp.de>
>>>> ---
>>>> block.c | 35 ++++++++++++++++++++++++-----------
>>>> 1 file changed, 24 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 4165d42..d69c121 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -3173,14 +3173,18 @@ int coroutine_fn
>>>> bdrv_co_copy_on_readv(BlockDriverState *bs,
>>>> * of 32768 512-byte sectors (16 MiB) per request.
>>>> */
>>>> #define MAX_WRITE_ZEROES_DEFAULT 32768
>>>> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big
>>>> allocations */
>>>> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>>>> static int coroutine_fn
>>>> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>>> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>>>> {
>>>> BlockDriver *drv = bs->drv;
>>>> QEMUIOVector qiov;
>>>> - struct iovec iov = {0};
>>>> int ret = 0;
>>>> + void *chunk = NULL;
>>>> +
>>>> + qemu_iovec_init(&qiov, 0);
>>>> int max_write_zeroes = bs->bl.max_write_zeroes ?
>>>> bs->bl.max_write_zeroes :
>>>> MAX_WRITE_ZEROES_DEFAULT;
>>>> @@ -3217,27 +3221,35 @@ static int coroutine_fn
>>>> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>>> }
>>>> if (ret == -ENOTSUP) {
>>>> + int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
>>>> + int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
>>>> +
>>>> /* Fall back to bounce buffer if write zeroes is
>>>> unsupported */
>>>> - iov.iov_len = num * BDRV_SECTOR_SIZE;
>>>> - if (iov.iov_base == NULL) {
>>>> - iov.iov_base = qemu_try_blockalign(bs, num *
>>>> BDRV_SECTOR_SIZE);
>>>> - if (iov.iov_base == NULL) {
>>>> + if (chunk == NULL) {
>>>> + chunk = qemu_try_blockalign(bs, chunk_size);
>>>> + if (chunk == NULL) {
>>>> ret = -ENOMEM;
>>>> goto fail;
>>>> }
>>>> - memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>>>> + memset(chunk, 0, chunk_size);
>>>> + }
>>>> +
>>>> + while (num_bytes > 0) {
>>>> + int to_add = MIN(chunk_size, num_bytes);
>>>> + qemu_iovec_add(&qiov, chunk, to_add);
>>>
>>> This can and likely will fail for big num_bytes if you exceed
>>> IOV_MAX vectors.
>>>
>>> I would stick to the old method and limit the num to a reasonable
>>> value e.g. MAX_WRITE_ZEROES_DEFAULT.
>>> This becomes necessary as you set INT_MAX for max_write_zeroes. That
>>> hasn't been considered before in
>>> the original patch.
>>>
>>> Peter
>>>
>>
>> hmm. You are right, but I think that it would be better to limit
>> iovec size
>> to 32 and this will solve the problem. Allocation of 32 Mb could be a
>> real problem
>> on loaded system could be a problem.
>>
>> What do you think on this? May be we could consider 16 as a limit...
>
> I would do the following:
>
> ---8<---
>
> From 8c2a08baddbcd9e89bbb11fa83a42350bd7cc095 Mon Sep 17 00:00:00 2001
> From: Peter Lieven <pl@kamp.de>
> Date: Mon, 5 Jan 2015 12:14:52 +0100
> Subject: [PATCH] block: limited request size in write zeroes
> unsupported path
>
> If bs->bl.max_write_zeroes is large and we end up in the unsupported
> path we might allocate a lot of memory for the iovector and/or even
> generate an oversized requests.
>
> Fix this by limiting the request by the minimum of the reported
> maximum transfer size or 16MB (32768 sectors).
>
> Reported-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> block.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block.c b/block.c
> index a612594..8009478 100644
> --- a/block.c
> +++ b/block.c
> @@ -3203,6 +3203,9 @@ static int coroutine_fn
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
>
> if (ret == -ENOTSUP) {
> /* Fall back to bounce buffer if write zeroes is
> unsupported */
> + int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length,
> + MAX_WRITE_ZEROES_DEFAULT);
> + num = MIN(num, max_xfer_len);
this is not going to work IMHO. num is the number in sectors.
max_xfer_len is in bytes.
I will send my updated version using your approach in a
couple of minutes. Would like to test it a bit.
> iov.iov_len = num * BDRV_SECTOR_SIZE;
> if (iov.iov_base == NULL) {
> iov.iov_base = qemu_try_blockalign(bs, num *
> BDRV_SECTOR_SIZE);
> @@ -3219,7 +3222,7 @@ static int coroutine_fn
> bdrv_co_do_write_zeroes(BlockDriverState *bs,
> /* Keep bounce buffer around if it is big enough for all
> * all future requests.
> */
> - if (num < max_write_zeroes) {
> + if (num < max_xfer_len) {
> qemu_vfree(iov.iov_base);
> iov.iov_base = NULL;
> }
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH v2 1/1] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
2015-01-05 11:23 ` Peter Lieven
2015-01-05 11:48 ` Denis V. Lunev
@ 2015-01-05 12:26 ` Denis V. Lunev
2015-01-05 12:32 ` [Qemu-devel] [PATCH v3 " Denis V. Lunev
1 sibling, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 12:26 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
16 MiB as a chunk size. This is implemented in this way to tolerate
buggy block backends which do not accept too big requests.
Though if the bdrv_co_write_zeroes callback is not good enough, we
fallback to write data explicitely using bdrv_co_writev and we
create buffer to accomodate zeroes inside. The size of this buffer
is the size of the chunk. Thus if the underlying layer will have
bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
Actually, there is no need to allocate such a big amount of memory.
We could simply allocate 1 MiB buffer and create iovec, which will
point to the same memory.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
Changes from v1:
- using MAX_WRITE_ZEROES_DEFAULT as a limit for real write
block.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)
diff --git a/block.c b/block.c
index 4165d42..49fd4e3 100644
--- a/block.c
+++ b/block.c
@@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
* of 32768 512-byte sectors (16 MiB) per request.
*/
#define MAX_WRITE_ZEROES_DEFAULT 32768
+/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
+#define MAX_ZEROES_CHUNK (1024 * 1024)
static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
{
BlockDriver *drv = bs->drv;
QEMUIOVector qiov;
- struct iovec iov = {0};
int ret = 0;
+ void *chunk = NULL;
+
+ qemu_iovec_init(&qiov, 0);
int max_write_zeroes = bs->bl.max_write_zeroes ?
bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
@@ -3217,27 +3221,40 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
}
if (ret == -ENOTSUP) {
+ int64_t num_bytes;
+ int chunk_size;
+
/* Fall back to bounce buffer if write zeroes is unsupported */
- iov.iov_len = num * BDRV_SECTOR_SIZE;
- if (iov.iov_base == NULL) {
- iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
- if (iov.iov_base == NULL) {
+ num = MIN(num, MAX_WRITE_ZEROES_DEFAULT >> BDRV_SECTOR_BITS);
+
+ num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
+ chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
+
+ if (chunk == NULL) {
+ chunk = qemu_try_blockalign(bs, chunk_size);
+ if (chunk == NULL) {
ret = -ENOMEM;
goto fail;
}
- memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
+ memset(chunk, 0, chunk_size);
+ }
+
+ while (num_bytes > 0) {
+ int to_add = MIN(chunk_size, num_bytes);
+ qemu_iovec_add(&qiov, chunk, to_add);
+ num_bytes -= to_add;
}
- qemu_iovec_init_external(&qiov, &iov, 1);
ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
/* Keep bounce buffer around if it is big enough for all
* all future requests.
*/
- if (num < max_write_zeroes) {
- qemu_vfree(iov.iov_base);
- iov.iov_base = NULL;
+ if (chunk_size != MAX_ZEROES_CHUNK) {
+ qemu_vfree(chunk);
+ chunk = NULL;
}
+ qemu_iovec_reset(&qiov);
}
sector_num += num;
@@ -3245,7 +3262,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
}
fail:
- qemu_vfree(iov.iov_base);
+ qemu_iovec_destroy(&qiov);
+ qemu_vfree(chunk);
return ret;
}
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH v3 1/1] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
2015-01-05 12:26 ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
@ 2015-01-05 12:32 ` Denis V. Lunev
0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 12:32 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
16 MiB as a chunk size. This is implemented in this way to tolerate
buggy block backends which do not accept too big requests.
Though if the bdrv_co_write_zeroes callback is not good enough, we
fallback to write data explicitely using bdrv_co_writev and we
create buffer to accomodate zeroes inside. The size of this buffer
is the size of the chunk. Thus if the underlying layer will have
bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
Actually, there is no need to allocate such a big amount of memory.
We could simply allocate 1 MiB buffer and create iovec, which will
point to the same memory.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
Changes from v2:
- fixed num assignment as MAX_WRITE_ZEROES_DEFAULT is already in sectors
Changes from v1:
- using MAX_WRITE_ZEROES_DEFAULT as a limit for real write
block.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)
diff --git a/block.c b/block.c
index 4165d42..e09220f 100644
--- a/block.c
+++ b/block.c
@@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
* of 32768 512-byte sectors (16 MiB) per request.
*/
#define MAX_WRITE_ZEROES_DEFAULT 32768
+/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
+#define MAX_ZEROES_CHUNK (1024 * 1024)
static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
{
BlockDriver *drv = bs->drv;
QEMUIOVector qiov;
- struct iovec iov = {0};
int ret = 0;
+ void *chunk = NULL;
+
+ qemu_iovec_init(&qiov, 0);
int max_write_zeroes = bs->bl.max_write_zeroes ?
bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
@@ -3217,27 +3221,40 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
}
if (ret == -ENOTSUP) {
+ int64_t num_bytes;
+ int chunk_size;
+
/* Fall back to bounce buffer if write zeroes is unsupported */
- iov.iov_len = num * BDRV_SECTOR_SIZE;
- if (iov.iov_base == NULL) {
- iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
- if (iov.iov_base == NULL) {
+ num = MIN(num, MAX_WRITE_ZEROES_DEFAULT);
+
+ num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
+ chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
+
+ if (chunk == NULL) {
+ chunk = qemu_try_blockalign(bs, chunk_size);
+ if (chunk == NULL) {
ret = -ENOMEM;
goto fail;
}
- memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
+ memset(chunk, 0, chunk_size);
+ }
+
+ while (num_bytes > 0) {
+ int to_add = MIN(chunk_size, num_bytes);
+ qemu_iovec_add(&qiov, chunk, to_add);
+ num_bytes -= to_add;
}
- qemu_iovec_init_external(&qiov, &iov, 1);
ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
/* Keep bounce buffer around if it is big enough for all
* all future requests.
*/
- if (num < max_write_zeroes) {
- qemu_vfree(iov.iov_base);
- iov.iov_base = NULL;
+ if (chunk_size != MAX_ZEROES_CHUNK) {
+ qemu_vfree(chunk);
+ chunk = NULL;
}
+ qemu_iovec_reset(&qiov);
}
sector_num += num;
@@ -3245,7 +3262,8 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
}
fail:
- qemu_vfree(iov.iov_base);
+ qemu_iovec_destroy(&qiov);
+ qemu_vfree(chunk);
return ret;
}
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper Denis V. Lunev
` (6 subsequent siblings)
8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
This efficiently writes zeroes on Linux if the kernel is capable enough.
FALLOC_FL_ZERO_RANGE correctly handles all cases, including and not
including file expansion.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block/raw-posix.c | 13 ++++++++++++-
configure | 19 +++++++++++++++++++
2 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index e51293a..66ebaab 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
#define FS_NOCOW_FL 0x00800000 /* Do not cow file */
#endif
#endif
-#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
#include <linux/falloc.h>
#endif
#if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -919,6 +919,17 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
}
#endif
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+ do {
+ if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+ aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
+ return 0;
+ }
+ } while (errno == EINTR);
+
+ ret = -errno;
+#endif
}
if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
diff --git a/configure b/configure
index cae588c..dfcf7b3 100755
--- a/configure
+++ b/configure
@@ -3309,6 +3309,22 @@ if compile_prog "" "" ; then
fallocate_punch_hole=yes
fi
+# check that fallocate supports range zeroing inside the file
+fallocate_zero_range=no
+cat > $TMPC << EOF
+#include <fcntl.h>
+#include <linux/falloc.h>
+
+int main(void)
+{
+ fallocate(0, FALLOC_FL_ZERO_RANGE, 0, 0);
+ return 0;
+}
+EOF
+if compile_prog "" "" ; then
+ fallocate_zero_range=yes
+fi
+
# check for posix_fallocate
posix_fallocate=no
cat > $TMPC << EOF
@@ -4538,6 +4554,9 @@ fi
if test "$fallocate_punch_hole" = "yes" ; then
echo "CONFIG_FALLOCATE_PUNCH_HOLE=y" >> $config_host_mak
fi
+if test "$fallocate_zero_range" = "yes" ; then
+ echo "CONFIG_FALLOCATE_ZERO_RANGE=y" >> $config_host_mak
+fi
if test "$posix_fallocate" = "yes" ; then
echo "CONFIG_POSIX_FALLOCATE=y" >> $config_host_mak
fi
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
` (5 subsequent siblings)
8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
The pattern
do {
if (fallocate(s->fd, mode, offset, len) == 0) {
return 0;
}
} while (errno == EINTR);
is used twice at the moment and I am going to add more usages. Move it
to the helper function.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block/raw-posix.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 66ebaab..a7c8816 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -893,6 +893,19 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
}
#endif
+
+#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+static int do_fallocate(int fd, int mode, off_t offset, off_t len)
+{
+ do {
+ if (fallocate(fd, mode, offset, len) == 0) {
+ return 0;
+ }
+ } while (errno == EINTR);
+ return -errno;
+}
+#endif
+
static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
{
int ret = -EOPNOTSUPP;
@@ -921,14 +934,8 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
#endif
#ifdef CONFIG_FALLOCATE_ZERO_RANGE
- do {
- if (fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
- aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
- return 0;
- }
- } while (errno == EINTR);
-
- ret = -errno;
+ ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+ aiocb->aio_offset, aiocb->aio_nbytes);
#endif
}
@@ -968,14 +975,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
#endif
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
- do {
- if (fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
- aiocb->aio_offset, aiocb->aio_nbytes) == 0) {
- return 0;
- }
- } while (errno == EINTR);
-
- ret = -errno;
+ ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ aiocb->aio_offset, aiocb->aio_nbytes);
#endif
}
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
` (2 preceding siblings ...)
2014-12-30 9:20 ` [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2015-01-05 6:46 ` Fam Zheng
2014-12-30 9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
` (4 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
actually the code
if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
ret == -ENOTTY) {
ret = -ENOTSUP;
}
is present twice and will be added a couple more times. Create helper
for this. Place it into do_fallocate() for further convinience.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block/raw-posix.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index a7c8816..25a6947 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
#endif
+static int translate_err(int err)
+{
+ if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
+ err == -ENOTTY) {
+ err = -ENOTSUP;
+ }
+ return err;
+}
+
#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
static int do_fallocate(int fd, int mode, off_t offset, off_t len)
{
@@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
return 0;
}
} while (errno == EINTR);
- return -errno;
+ return translate_err(-errno);
}
#endif
@@ -939,10 +948,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
#endif
}
- if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
- ret == -ENOTTY) {
+ ret = translate_err(ret);
+ if (ret == -ENOTSUP) {
s->has_write_zeroes = false;
- ret = -ENOTSUP;
}
return ret;
}
@@ -980,10 +988,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
#endif
}
- if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
- ret == -ENOTTY) {
+ ret = translate_err(ret);
+ if (ret == -ENOTSUP) {
s->has_discard = false;
- ret = -ENOTSUP;
}
return ret;
}
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values
2014-12-30 9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
@ 2015-01-05 6:46 ` Fam Zheng
2015-01-05 11:17 ` Denis V. Lunev
0 siblings, 1 reply; 24+ messages in thread
From: Fam Zheng @ 2015-01-05 6:46 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi
On Tue, 12/30 12:20, Denis V. Lunev wrote:
> actually the code
> if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
> ret == -ENOTTY) {
> ret = -ENOTSUP;
> }
> is present twice and will be added a couple more times. Create helper
> for this. Place it into do_fallocate() for further convinience.
s/convinience/convenience/
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
> block/raw-posix.c | 21 ++++++++++++++-------
> 1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index a7c8816..25a6947 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
> #endif
>
>
> +static int translate_err(int err)
> +{
> + if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
> + err == -ENOTTY) {
> + err = -ENOTSUP;
> + }
> + return err;
> +}
> +
> #if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
> static int do_fallocate(int fd, int mode, off_t offset, off_t len)
> {
> @@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
> return 0;
> }
> } while (errno == EINTR);
> - return -errno;
> + return translate_err(-errno);
This could be a separate patch. Maybe reorder the previous patches as:
1) Introduce translate_err.
2) Refactor out do_fallocate.
3) Introduce FALLOC_FL_ZERO_RANGE calling code.
Fam
> }
> #endif
>
> @@ -939,10 +948,9 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> #endif
> }
>
> - if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
> - ret == -ENOTTY) {
> + ret = translate_err(ret);
> + if (ret == -ENOTSUP) {
> s->has_write_zeroes = false;
> - ret = -ENOTSUP;
> }
> return ret;
> }
> @@ -980,10 +988,9 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
> #endif
> }
>
> - if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
> - ret == -ENOTTY) {
> + ret = translate_err(ret);
> + if (ret == -ENOTSUP) {
> s->has_discard = false;
> - ret = -ENOTSUP;
> }
> return ret;
> }
> --
> 1.9.1
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values
2015-01-05 6:46 ` Fam Zheng
@ 2015-01-05 11:17 ` Denis V. Lunev
0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:17 UTC (permalink / raw)
To: Fam Zheng; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi
On 05/01/15 09:46, Fam Zheng wrote:
> On Tue, 12/30 12:20, Denis V. Lunev wrote:
>> actually the code
>> if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
>> ret == -ENOTTY) {
>> ret = -ENOTSUP;
>> }
>> is present twice and will be added a couple more times. Create helper
>> for this. Place it into do_fallocate() for further convinience.
> s/convinience/convenience/
>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>> block/raw-posix.c | 21 ++++++++++++++-------
>> 1 file changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index a7c8816..25a6947 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -894,6 +894,15 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
>> #endif
>>
>>
>> +static int translate_err(int err)
>> +{
>> + if (err == -ENODEV || err == -ENOSYS || err == -EOPNOTSUPP ||
>> + err == -ENOTTY) {
>> + err = -ENOTSUP;
>> + }
>> + return err;
>> +}
>> +
>> #if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
>> static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>> {
>> @@ -902,7 +911,7 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>> return 0;
>> }
>> } while (errno == EINTR);
>> - return -errno;
>> + return translate_err(-errno);
> This could be a separate patch. Maybe reorder the previous patches as:
>
> 1) Introduce translate_err.
> 2) Refactor out do_fallocate.
> 3) Introduce FALLOC_FL_ZERO_RANGE calling code.
>
> Fam
ok, will try
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
` (3 preceding siblings ...)
2014-12-30 9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2015-01-05 6:57 ` Fam Zheng
2014-12-30 9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
` (3 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
move code dealing with a block device to a separate function. This will
allow to implement additional processing for an ordinary files.
Pls note, that xfs_code has been moved before checking for
s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
This makes code a bit more consistent.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block/raw-posix.c | 60 +++++++++++++++++++++++++++++++++++--------------------
1 file changed, 38 insertions(+), 22 deletions(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 25a6947..7866d31 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
}
#endif
-static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
{
int ret = -EOPNOTSUPP;
BDRVRawState *s = aiocb->bs->opaque;
- if (s->has_write_zeroes == 0) {
+ if (!s->has_write_zeroes) {
return -ENOTSUP;
}
- if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
#ifdef BLKZEROOUT
- do {
- uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
- if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
- return 0;
- }
- } while (errno == EINTR);
-
- ret = -errno;
-#endif
- } else {
-#ifdef CONFIG_XFS
- if (s->is_xfs) {
- return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+ do {
+ uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+ if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+ return 0;
}
-#endif
+ } while (errno == EINTR);
-#ifdef CONFIG_FALLOCATE_ZERO_RANGE
- ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
- aiocb->aio_offset, aiocb->aio_nbytes);
+ ret = translate_err(-errno);
#endif
- }
- ret = translate_err(ret);
if (ret == -ENOTSUP) {
s->has_write_zeroes = false;
}
return ret;
}
+static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+{
+ BDRVRawState *s;
+
+ if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
+ return handle_aiocb_write_zeroes_block(aiocb);
+ }
+
+#ifdef CONFIG_XFS
+ if (s->is_xfs) {
+ return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+ }
+#endif
+
+ s = aiocb->bs->opaque;
+
+#ifdef CONFIG_FALLOCATE_ZERO_RANGE
+ if (s->has_write_zeroes) {
+ int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
+ aiocb->aio_offset, aiocb->aio_nbytes);
+ if (ret == 0 && ret != -ENOTSUP) {
+ return ret;
+ }
+ }
+#endif
+
+ s->has_write_zeroes = false;
+ return -ENOTSUP;
+}
+
static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
{
int ret = -EOPNOTSUPP;
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
2014-12-30 9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
@ 2015-01-05 6:57 ` Fam Zheng
2015-01-05 11:07 ` Denis V. Lunev
0 siblings, 1 reply; 24+ messages in thread
From: Fam Zheng @ 2015-01-05 6:57 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi
On Tue, 12/30 12:20, Denis V. Lunev wrote:
> move code dealing with a block device to a separate function. This will
> allow to implement additional processing for an ordinary files.
s/an//
>
> Pls note, that xfs_code has been moved before checking for
> s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
> This makes code a bit more consistent.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
> block/raw-posix.c | 60 +++++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 38 insertions(+), 22 deletions(-)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 25a6947..7866d31 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
> }
> #endif
>
> -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
> {
> int ret = -EOPNOTSUPP;
> BDRVRawState *s = aiocb->bs->opaque;
>
> - if (s->has_write_zeroes == 0) {
> + if (!s->has_write_zeroes) {
> return -ENOTSUP;
> }
>
> - if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> #ifdef BLKZEROOUT
> - do {
> - uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> - if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> - return 0;
> - }
> - } while (errno == EINTR);
> -
> - ret = -errno;
> -#endif
> - } else {
> -#ifdef CONFIG_XFS
> - if (s->is_xfs) {
> - return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
> + do {
> + uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
> + if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
> + return 0;
> }
> -#endif
> + } while (errno == EINTR);
>
> -#ifdef CONFIG_FALLOCATE_ZERO_RANGE
> - ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
> - aiocb->aio_offset, aiocb->aio_nbytes);
> + ret = translate_err(-errno);
> #endif
> - }
>
> - ret = translate_err(ret);
> if (ret == -ENOTSUP) {
> s->has_write_zeroes = false;
> }
> return ret;
> }
>
> +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> +{
> + BDRVRawState *s;
> +
> + if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
> + return handle_aiocb_write_zeroes_block(aiocb);
> + }
> +
> +#ifdef CONFIG_XFS
> + if (s->is_xfs) {
s is not initialized here.
Also, could you please do these refactoring before adding new code in the
series? That way the new code needn't to be moved, which makes the patches
easier to review.
Fam
> + return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
> + }
> +#endif
> +
> + s = aiocb->bs->opaque;
> +
> +#ifdef CONFIG_FALLOCATE_ZERO_RANGE
> + if (s->has_write_zeroes) {
> + int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
> + aiocb->aio_offset, aiocb->aio_nbytes);
> + if (ret == 0 && ret != -ENOTSUP) {
> + return ret;
> + }
> + }
> +#endif
> +
> + s->has_write_zeroes = false;
> + return -ENOTSUP;
> +}
> +
> static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
> {
> int ret = -EOPNOTSUPP;
> --
> 1.9.1
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit
2015-01-05 6:57 ` Fam Zheng
@ 2015-01-05 11:07 ` Denis V. Lunev
0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:07 UTC (permalink / raw)
To: Fam Zheng; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi
On 05/01/15 09:57, Fam Zheng wrote:
> On Tue, 12/30 12:20, Denis V. Lunev wrote:
>> move code dealing with a block device to a separate function. This will
>> allow to implement additional processing for an ordinary files.
> s/an//
>
>> Pls note, that xfs_code has been moved before checking for
>> s->has_write_zeroes as xfs_write_zeroes does not touch this flag inside.
>> This makes code a bit more consistent.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>> block/raw-posix.c | 60 +++++++++++++++++++++++++++++++++++--------------------
>> 1 file changed, 38 insertions(+), 22 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 25a6947..7866d31 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -915,46 +915,62 @@ static int do_fallocate(int fd, int mode, off_t offset, off_t len)
>> }
>> #endif
>>
>> -static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>> +static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
>> {
>> int ret = -EOPNOTSUPP;
>> BDRVRawState *s = aiocb->bs->opaque;
>>
>> - if (s->has_write_zeroes == 0) {
>> + if (!s->has_write_zeroes) {
>> return -ENOTSUP;
>> }
>>
>> - if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>> #ifdef BLKZEROOUT
>> - do {
>> - uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
>> - if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
>> - return 0;
>> - }
>> - } while (errno == EINTR);
>> -
>> - ret = -errno;
>> -#endif
>> - } else {
>> -#ifdef CONFIG_XFS
>> - if (s->is_xfs) {
>> - return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
>> + do {
>> + uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
>> + if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
>> + return 0;
>> }
>> -#endif
>> + } while (errno == EINTR);
>>
>> -#ifdef CONFIG_FALLOCATE_ZERO_RANGE
>> - ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
>> - aiocb->aio_offset, aiocb->aio_nbytes);
>> + ret = translate_err(-errno);
>> #endif
>> - }
>>
>> - ret = translate_err(ret);
>> if (ret == -ENOTSUP) {
>> s->has_write_zeroes = false;
>> }
>> return ret;
>> }
>>
>> +static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>> +{
>> + BDRVRawState *s;
>> +
>> + if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
>> + return handle_aiocb_write_zeroes_block(aiocb);
>> + }
>> +
>> +#ifdef CONFIG_XFS
>> + if (s->is_xfs) {
> s is not initialized here.
>
> Also, could you please do these refactoring before adding new code in the
> series? That way the new code needn't to be moved, which makes the patches
> easier to review.
>
> Fam
ok
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
` (4 preceding siblings ...)
2014-12-30 9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2015-01-05 7:02 ` Fam Zheng
2014-12-30 9:20 ` [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
` (2 subsequent siblings)
8 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
Simple fallocate(0) will extend file with zeroes when appropriate in the
middle of the file if there is a hole there and at the end of the file.
Unfortunately fallocate(0) does not drop the content of the file if
there is a data on this offset. Therefore to make the situation consistent
we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE
This should increase the performance a bit for not-so-modern kernels or for
filesystems which do not support FALLOC_FL_ZERO_RANGE.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block/raw-posix.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7866d31..96a8678 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
#endif
s->has_write_zeroes = false;
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+ if (s->has_discard) {
+ int ret;
+ ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ aiocb->aio_offset, aiocb->aio_nbytes);
+ if (ret < 0) {
+ if (ret == -ENOTSUP) {
+ s->has_discard = false;
+ }
+ return ret;
+ }
+ return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+ }
+#endif
+
+ s->has_discard = false;
return -ENOTSUP;
}
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
2014-12-30 9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2015-01-05 7:02 ` Fam Zheng
2015-01-05 11:14 ` Denis V. Lunev
0 siblings, 1 reply; 24+ messages in thread
From: Fam Zheng @ 2015-01-05 7:02 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi
On Tue, 12/30 12:20, Denis V. Lunev wrote:
> This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
>
> Simple fallocate(0) will extend file with zeroes when appropriate in the
> middle of the file if there is a hole there and at the end of the file.
> Unfortunately fallocate(0) does not drop the content of the file if
> there is a data on this offset. Therefore to make the situation consistent
> we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE
>
> This should increase the performance a bit for not-so-modern kernels or for
> filesystems which do not support FALLOC_FL_ZERO_RANGE.
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
> ---
> block/raw-posix.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7866d31..96a8678 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> #endif
>
> s->has_write_zeroes = false;
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> + if (s->has_discard) {
> + int ret;
> + ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> + aiocb->aio_offset, aiocb->aio_nbytes);
> + if (ret < 0) {
> + if (ret == -ENOTSUP) {
> + s->has_discard = false;
> + }
> + return ret;
> + }
> + return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
Why is fallocate(0) necessary here? The manpage says:
Deallocating file space
Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
in mode deallocates space (i.e., creates a hole) in the byte range
starting at offset and continuing for len bytes. Within the specified
range, partial file system blocks are zeroed, and whole file system
blocks are removed from the file. After a successful call, subsequent
reads from this range will return zeroes.
So the data are already zeroes after FALLOC_FL_PUNCH_HOLE.
Fam
> + }
> +#endif
> +
> + s->has_discard = false;
> return -ENOTSUP;
> }
>
> --
> 1.9.1
>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes
2015-01-05 7:02 ` Fam Zheng
@ 2015-01-05 11:14 ` Denis V. Lunev
0 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2015-01-05 11:14 UTC (permalink / raw)
To: Fam Zheng; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi
On 05/01/15 10:02, Fam Zheng wrote:
> On Tue, 12/30 12:20, Denis V. Lunev wrote:
>> This sequence works efficiently if FALLOC_FL_ZERO_RANGE is not supported.
>>
>> Simple fallocate(0) will extend file with zeroes when appropriate in the
>> middle of the file if there is a hole there and at the end of the file.
>> Unfortunately fallocate(0) does not drop the content of the file if
>> there is a data on this offset. Therefore to make the situation consistent
>> we should drop the data beforehand. This is done using FALLOC_FL_PUNCH_HOLE
>>
>> This should increase the performance a bit for not-so-modern kernels or for
>> filesystems which do not support FALLOC_FL_ZERO_RANGE.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Peter Lieven <pl@kamp.de>
>> ---
>> block/raw-posix.c | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 7866d31..96a8678 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -968,6 +968,23 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>> #endif
>>
>> s->has_write_zeroes = false;
>> +
>> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>> + if (s->has_discard) {
>> + int ret;
>> + ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>> + aiocb->aio_offset, aiocb->aio_nbytes);
>> + if (ret < 0) {
>> + if (ret == -ENOTSUP) {
>> + s->has_discard = false;
>> + }
>> + return ret;
>> + }
>> + return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
> Why is fallocate(0) necessary here? The manpage says:
>
> Deallocating file space
> Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux 2.6.38)
> in mode deallocates space (i.e., creates a hole) in the byte range
> starting at offset and continuing for len bytes. Within the specified
> range, partial file system blocks are zeroed, and whole file system
> blocks are removed from the file. After a successful call, subsequent
> reads from this range will return zeroes.
>
> So the data are already zeroes after FALLOC_FL_PUNCH_HOLE.
>
> Fam
These zeroes will have different properties. FALLOC_FL_PUNCH_HOLE
deallocates disk space on that range. Thus this call work work in a
different way in respect to the method of zero writing. This does not
look good for me.
The function should keep the file in the same state using all
possible internal implementations. If the caller wants to use
FALLOC_FL_PUNCH_HOLE
alone, it should call handle_aiocb_discard method.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
` (5 preceding siblings ...)
2014-12-30 9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
There is a possibility that we are extending our image and thus writing
zeroes beyond end of the file. In this case we do not need to care
about the hole to make sure that there is no data in the file under
this offset (pre-condition to fallocate(0) to work). We could simply call
fallocate(0).
This improves the performance of writing zeroes even on really old
platforms which do not have even FALLOC_FL_PUNCH_HOLE.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block/raw-posix.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 96a8678..57b94ad 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -60,7 +60,7 @@
#define FS_NOCOW_FL 0x00800000 /* Do not cow file */
#endif
#endif
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
#include <linux/falloc.h>
#endif
#if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
@@ -903,7 +903,7 @@ static int translate_err(int err)
return err;
}
-#if defined(CONFIG_FALLOCATE_PUNCH_HOLE) || defined(CONFIG_FALLOCATE_ZERO_RANGE)
+#ifdef CONFIG_FALLOCATE
static int do_fallocate(int fd, int mode, off_t offset, off_t len)
{
do {
@@ -957,6 +957,12 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
s = aiocb->bs->opaque;
+#ifdef CONFIG_FALLOCATE
+ if (aiocb->aio_offset >= aiocb->bs->total_sectors << BDRV_SECTOR_BITS) {
+ return do_fallocate(s->fd, 0, aiocb->aio_offset, aiocb->aio_nbytes);
+ }
+#endif
+
#ifdef CONFIG_FALLOCATE_ZERO_RANGE
if (s->has_write_zeroes) {
int ret = do_fallocate(s->fd, FALLOC_FL_ZERO_RANGE,
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
` (6 preceding siblings ...)
2014-12-30 9:20 ` [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
@ 2014-12-30 9:20 ` Denis V. Lunev
2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
8 siblings, 0 replies; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 9:20 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Peter Lieven, qemu-devel,
Stefan Hajnoczi
fallocate() works fine and could handle properly with arbitrary size
requests. There is no sense to reduce the amount of space to fallocate.
The bigger the size, the better is performance as the amount of journal
updates is reduced.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Peter Lieven <pl@kamp.de>
---
block/raw-posix.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 57b94ad..3e156c4 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -292,6 +292,20 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
}
}
+static void raw_probe_max_write_zeroes(BlockDriverState *bs)
+{
+ BDRVRawState *s = bs->opaque;
+ struct stat st;
+
+ if (fstat(s->fd, &st) < 0) {
+ return; /* no problem, keep default value */
+ }
+ if (!S_ISREG(st.st_mode) || !s->discard_zeroes) {
+ return;
+ }
+ bs->bl.max_write_zeroes = INT_MAX;
+}
+
static void raw_parse_flags(int bdrv_flags, int *open_flags)
{
assert(open_flags != NULL);
@@ -598,6 +612,7 @@ static int raw_reopen_prepare(BDRVReopenState *state,
/* Fail already reopen_prepare() if we can't get a working O_DIRECT
* alignment with the new fd. */
if (raw_s->fd != -1) {
+ raw_probe_max_write_zeroes(state->bs);
raw_probe_alignment(state->bs, raw_s->fd, &local_err);
if (local_err) {
qemu_close(raw_s->fd);
@@ -651,6 +666,8 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
raw_probe_alignment(bs, s->fd, errp);
bs->bl.opt_mem_alignment = s->buf_align;
+
+ raw_probe_max_write_zeroes(bs);
}
static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
--
1.9.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
` (7 preceding siblings ...)
2014-12-30 9:20 ` [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
@ 2014-12-30 10:55 ` Peter Lieven
2014-12-30 11:07 ` Denis V. Lunev
8 siblings, 1 reply; 24+ messages in thread
From: Peter Lieven @ 2014-12-30 10:55 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
Am 30.12.2014 um 10:20 schrieb Denis V. Lunev:
> These patches eliminate data writes completely on Linux if fallocate
> FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are supported on
> underlying filesystem.
>
> I have performed several tests with non-aligned fallocate calls and
> in all cases (with non-aligned fallocates) Linux performs fine, i.e.
> areas are zeroed correctly. Checks were made on
> Linux 3.16.0-28-generic #38-Ubuntu SMP
>
> This should seriously increase performance in some special cases.
Could you give a hint what that special cases are? It would help
to evaluate and test the performance difference.
Thanks,
Peter
>
> Changes from v2:
> - added Peter Lieven to CC
> - added CONFIG_FALLOCATE check to call do_fallocate in patch 7
> - dropped patch 1 as NACK-ed
> - added processing of very large data areas in bdrv_co_write_zeroes (new
> patch 1)
> - set bl.max_write_zeroes to INT_MAX in raw-posix.c for regular files
> (new patch 8)
>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Peter Lieven <pl@kamp.de>
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
@ 2014-12-30 11:07 ` Denis V. Lunev
2015-01-05 6:55 ` Peter Lieven
0 siblings, 1 reply; 24+ messages in thread
From: Denis V. Lunev @ 2014-12-30 11:07 UTC (permalink / raw)
To: Peter Lieven; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
On 30/12/14 13:55, Peter Lieven wrote:
> Am 30.12.2014 um 10:20 schrieb Denis V. Lunev:
>> These patches eliminate data writes completely on Linux if fallocate
>> FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are supported on
>> underlying filesystem.
>>
>> I have performed several tests with non-aligned fallocate calls and
>> in all cases (with non-aligned fallocates) Linux performs fine, i.e.
>> areas are zeroed correctly. Checks were made on
>> Linux 3.16.0-28-generic #38-Ubuntu SMP
>>
>> This should seriously increase performance in some special cases.
> Could you give a hint what that special cases are? It would help
> to evaluate and test the performance difference.
>
> Thanks,
> Peter
- 15% in Parallels Image expansion, see my side patchset
- writing zeroes to raw image with BLOCKDEV_DETECT_ZEROES_OPTIONS_ON set
(actually I have kludged raw-posix.c to have this flag always set
to perform independent testing)
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c
2014-12-30 11:07 ` Denis V. Lunev
@ 2015-01-05 6:55 ` Peter Lieven
0 siblings, 0 replies; 24+ messages in thread
From: Peter Lieven @ 2015-01-05 6:55 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
On 30.12.2014 12:07, Denis V. Lunev wrote:
> On 30/12/14 13:55, Peter Lieven wrote:
>> Am 30.12.2014 um 10:20 schrieb Denis V. Lunev:
>>> These patches eliminate data writes completely on Linux if fallocate
>>> FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE are supported on
>>> underlying filesystem.
>>>
>>> I have performed several tests with non-aligned fallocate calls and
>>> in all cases (with non-aligned fallocates) Linux performs fine, i.e.
>>> areas are zeroed correctly. Checks were made on
>>> Linux 3.16.0-28-generic #38-Ubuntu SMP
>>>
>>> This should seriously increase performance in some special cases.
>> Could you give a hint what that special cases are? It would help
>> to evaluate and test the performance difference.
>>
>> Thanks,
>> Peter
>
> - 15% in Parallels Image expansion, see my side patchset
I will have a look.
> - writing zeroes to raw image with BLOCKDEV_DETECT_ZEROES_OPTIONS_ON set
> (actually I have kludged raw-posix.c to have this flag always set
> to perform independent testing)
Is there a valid reason to write chunks of >16MB blocksize from a guest? Or is this
just users performing pseudo benchmarks with dd?
Peter
^ permalink raw reply [flat|nested] 24+ messages in thread