From: Peter Lieven <pl@kamp.de>
To: "Denis V. Lunev" <den@openvz.org>
Cc: Kevin Wolf <kwolf@redhat.com>,
qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes
Date: Mon, 05 Jan 2015 12:23:05 +0100 [thread overview]
Message-ID: <54AA7419.8050304@kamp.de> (raw)
In-Reply-To: <54AA7022.2030304@openvz.org>
On 05.01.2015 12:06, Denis V. Lunev wrote:
> On 05/01/15 10:34, Peter Lieven wrote:
>> On 30.12.2014 10:20, Denis V. Lunev wrote:
>>> bdrv_co_do_write_zeroes split writes using bl.max_write_zeroes or
>>> 16 MiB as a chunk size. This is implemented in this way to tolerate
>>> buggy block backends which do not accept too big requests.
>>>
>>> Though if the bdrv_co_write_zeroes callback is not good enough, we
>>> fallback to write data explicitely using bdrv_co_writev and we
>>> create buffer to accomodate zeroes inside. The size of this buffer
>>> is the size of the chunk. Thus if the underlying layer will have
>>> bl.max_write_zeroes high enough, f.e. 4 GiB, the allocation can fail.
>>>
>>> Actually, there is no need to allocate such a big amount of memory.
>>> We could simply allocate 1 MiB buffer and create iovec, which will
>>> point to the same memory.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> CC: Peter Lieven <pl@kamp.de>
>>> ---
>>> block.c | 35 ++++++++++++++++++++++++-----------
>>> 1 file changed, 24 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 4165d42..d69c121 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3173,14 +3173,18 @@ int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
>>> * of 32768 512-byte sectors (16 MiB) per request.
>>> */
>>> #define MAX_WRITE_ZEROES_DEFAULT 32768
>>> +/* allocate iovec with zeroes using 1 MiB chunks to avoid to big allocations */
>>> +#define MAX_ZEROES_CHUNK (1024 * 1024)
>>> static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>> int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
>>> {
>>> BlockDriver *drv = bs->drv;
>>> QEMUIOVector qiov;
>>> - struct iovec iov = {0};
>>> int ret = 0;
>>> + void *chunk = NULL;
>>> +
>>> + qemu_iovec_init(&qiov, 0);
>>> int max_write_zeroes = bs->bl.max_write_zeroes ?
>>> bs->bl.max_write_zeroes : MAX_WRITE_ZEROES_DEFAULT;
>>> @@ -3217,27 +3221,35 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>>> }
>>> if (ret == -ENOTSUP) {
>>> + int64_t num_bytes = (int64_t)num << BDRV_SECTOR_BITS;
>>> + int chunk_size = MIN(MAX_ZEROES_CHUNK, num_bytes);
>>> +
>>> /* Fall back to bounce buffer if write zeroes is unsupported */
>>> - iov.iov_len = num * BDRV_SECTOR_SIZE;
>>> - if (iov.iov_base == NULL) {
>>> - iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
>>> - if (iov.iov_base == NULL) {
>>> + if (chunk == NULL) {
>>> + chunk = qemu_try_blockalign(bs, chunk_size);
>>> + if (chunk == NULL) {
>>> ret = -ENOMEM;
>>> goto fail;
>>> }
>>> - memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>>> + memset(chunk, 0, chunk_size);
>>> + }
>>> +
>>> + while (num_bytes > 0) {
>>> + int to_add = MIN(chunk_size, num_bytes);
>>> + qemu_iovec_add(&qiov, chunk, to_add);
>>
>> This can and likely will fail for big num_bytes if you exceed IOV_MAX vectors.
>>
>> I would stick to the old method and limit the num to a reasonable value e.g. MAX_WRITE_ZEROES_DEFAULT.
>> This becomes necessary as you set INT_MAX for max_write_zeroes. That hasn't been considered before in
>> the original patch.
>>
>> Peter
>>
>
> hmm. You are right, but I think that it would be better to limit iovec size
> to 32 and this will solve the problem. Allocation of 32 Mb could be a real problem
> on loaded system could be a problem.
>
> What do you think on this? May be we could consider 16 as a limit...
I would do the following:
---8<---
From 8c2a08baddbcd9e89bbb11fa83a42350bd7cc095 Mon Sep 17 00:00:00 2001
From: Peter Lieven <pl@kamp.de>
Date: Mon, 5 Jan 2015 12:14:52 +0100
Subject: [PATCH] block: limited request size in write zeroes unsupported path
If bs->bl.max_write_zeroes is large and we end up in the unsupported
path we might allocate a lot of memory for the iovector and/or even
generate an oversized requests.
Fix this by limiting the request by the minimum of the reported
maximum transfer size or 16MB (32768 sectors).
Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Lieven <pl@kamp.de>
---
block.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/block.c b/block.c
index a612594..8009478 100644
--- a/block.c
+++ b/block.c
@@ -3203,6 +3203,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
if (ret == -ENOTSUP) {
/* Fall back to bounce buffer if write zeroes is unsupported */
+ int max_xfer_len = MIN_NON_ZERO(bs->bl.max_transfer_length,
+ MAX_WRITE_ZEROES_DEFAULT);
+ num = MIN(num, max_xfer_len);
iov.iov_len = num * BDRV_SECTOR_SIZE;
if (iov.iov_base == NULL) {
iov.iov_base = qemu_try_blockalign(bs, num * BDRV_SECTOR_SIZE);
@@ -3219,7 +3222,7 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
/* Keep bounce buffer around if it is big enough for all
* all future requests.
*/
- if (num < max_write_zeroes) {
+ if (num < max_xfer_len) {
qemu_vfree(iov.iov_base);
iov.iov_base = NULL;
}
--
1.7.9.5
next prev parent reply other threads:[~2015-01-05 11:23 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-12-30 9:20 [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 1/8] block: prepare bdrv_co_do_write_zeroes to deal with large bl.max_write_zeroes Denis V. Lunev
2015-01-05 7:34 ` Peter Lieven
2015-01-05 11:06 ` Denis V. Lunev
2015-01-05 11:23 ` Peter Lieven [this message]
2015-01-05 11:48 ` Denis V. Lunev
2015-01-05 12:26 ` [Qemu-devel] [PATCH v2 1/1] " Denis V. Lunev
2015-01-05 12:32 ` [Qemu-devel] [PATCH v3 " Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 2/8] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 3/8] block/raw-posix: create do_fallocate helper Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 4/8] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
2015-01-05 6:46 ` Fam Zheng
2015-01-05 11:17 ` Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 5/8] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
2015-01-05 6:57 ` Fam Zheng
2015-01-05 11:07 ` Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 6/8] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
2015-01-05 7:02 ` Fam Zheng
2015-01-05 11:14 ` Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 7/8] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev
2014-12-30 9:20 ` [Qemu-devel] [PATCH 8/8] block/raw-posix: set max_write_zeroes to INT_MAX for regular files Denis V. Lunev
2014-12-30 10:55 ` [Qemu-devel] [PATCH v3 0/8] eliminate data write in bdrv_write_zeroes on Linux in raw-posix.c Peter Lieven
2014-12-30 11:07 ` Denis V. Lunev
2015-01-05 6:55 ` Peter Lieven
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=54AA7419.8050304@kamp.de \
--to=pl@kamp.de \
--cc=den@openvz.org \
--cc=kwolf@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.