Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: "Denis V. Lunev" <den-lists@parallels.com>
To: Peter Lieven <pl@kamp.de>, "Denis V. Lunev" <den@openvz.org>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
Date: Sat, 27 Dec 2014 20:42:52 +0300	[thread overview]
Message-ID: <549EEF9C.8060002@parallels.com> (raw)
In-Reply-To: <549EC7A6.8020007@kamp.de>

On 27/12/14 17:52, Peter Lieven wrote:
> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
>> On 26/12/14 16:32, Denis V. Lunev wrote:
>>> On 26/12/14 16:13, Peter Lieven wrote:
>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>>> The check for maximum length was added by
>>>>>     commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>>     Author: Peter Lieven <pl@kamp.de>
>>>>>     Date:   Thu Oct 24 12:06:58 2013 +0200
>>>>>       block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>>
>>>>> but actually if driver provides .bdrv_co_write_zeroes callback, there is
>>>>> no need to limit the size to 32 MB. Callback should provide effective
>>>>> implementation which normally should not write any zeroes in comparable
>>>>> amount.
>>>>
>>>> NACK.
>>>>
>>>> First there is no guarantee that bdrv_co_do_write_zeroes is a fast operation.
>>>> This heaviliy depends on several circumstances that the block layer is not aware of.
>>>> If a specific protocol knows it is very fast in writing zeroes under any circumstance
>>>> it should provide INT_MAX in bs->bl.max_write_zeroes. It is then still allowed to
>>>> return -ENOTSUP if the request size or alignment doesn't fit.
>>>
>>> the idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>>> specified, the cost is almost the same for any amount of zeroes
>>> written. This is true for fallocate from my point of view. The amount
>>> of actually written data will be in several orders less than specified
>>> except slow path, which honors 32 MB limit.
>>>
>>> If the operation is complex in realization, then it will be rate-limited
>>> below, in actual implementation.
>>>
>>>> There are known backends e.g. anything that deals with SCSI that have a known
>>>> limitation of the maximum number of zeroes they can write fast in a single request.
>>>> This number MUST NOT be exceeded. The below patch would break all those backends.
>>>
>>> could you pls point me this backends. Actually, from my point of
>>> view, they should properly setup max_write_zeroes for themselves.
>>> This is done at least in block/iscsi.c and it would be consistent
>>> way of doing so.
>>>
>>>>
>>>> What issue are you trying to fix with this patch? Maybe there is a better way to fix
>>>> it at another point in the code.
>>>>
>>>
>>> I am trying to minimize amount of metadata updates for a file.
>>> This provides some speedup even on ext4 and this will provide
>>> even more speedup with a distributed filesystem like CEPH
>>> where size updates of the files and block allocation are
>>> costly.
>>>
>>> Regards,
>>>      Den
>> First of all, the patch is really wrong :) It was written using
>> wrong assumptions.
>>
>> OK. I have spent some time reading your original patchset and
>> and did not found any useful justification for default limit
>> for both discard and write zero.
>
> 32768 is the largest power of two fitting into a uint16.
> And uint16 is quite common for nb_sectors in backends.
>

ok. This could be reasonable.


>>
>> Yes, there are drivers which requires block level to call
>> .bdrv_co_do_write_zeroes with alignment and with upper limit.
>> But in this case driver setups max_write_zeroes. All buggy
>> drivers should do that not to affect not buggy ones from
>> my opinion.
>>
>> This is the only purpose of the original patches for limits.
>> I have wrongly interpret BlockLimits as something connected
>> with time of the operation. Sorry for that.
>>
>> Therefore there is no good reason for limiting the amount of
>> data sent to drv->bdrv_co_writev with any data size. The only
>> thing is that it would be good not to allocate too many memory
>> at once. We could do something like
>>
>> base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
>> added = 0;
>> for (added = 0; added < num; add += MIN(2048, num)) {
>>      qemu_iovec_add(qiov, base, MIN(2048, num));
>> }
>>
>> to avoid really big allocations here even if .max_write_zeroes is
>> very high. Do you think that this might be useful?
>>
>> As for .bdrv_co_do_write_zeroes itself, can we still drop
>> default 32 Mb limit? If there are some buggy drivers, they
>> should have .max_write_zeroes specified.
>>
>> The same applies to .max_discard
>
> Its always risky to change default behaviour. In the original discussion we
> agreed that there should be a limit for each request. I think the 2^15 was
> Paolos suggestion.
>
> You where talking of metadata updates for a file. So the operation that is too slow
> for you is bdrv_write_zeroes inside a container file? What is the underlaying filesystem?
> What is the exact operation that you try to optimize?
>
> I am wondering because as far as I can see write zeroes is only supported for
> XFS and block devices which support BLKZEROOUT. The latter only works for
> cache=none. So its not that easy to end up in an optimized (fast) path anyway.
>
> Peter
>
>
you have missed 6 patches below ;) f.e. patch 2/7

OK. I'll redo changes and fix on raw-posix level.

next prev parent reply	other threads:[~2014-12-27 17:43 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-26 12:35 [Qemu-devel] [PATCH v2 0/7] eliminate data write in bdrv_write_zeroes on Linux Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs Denis V. Lunev
2014-12-26 13:13   ` Peter Lieven
2014-12-26 13:32     ` Denis V. Lunev
2014-12-26 19:15       ` Denis V. Lunev
2014-12-27 14:52         ` Peter Lieven
2014-12-27 17:42           ` Denis V. Lunev [this message]
2014-12-27 20:01             ` Peter Lieven
2014-12-27 20:14               ` Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 2/7] block: use fallocate(FALLOC_FL_ZERO_RANGE) in handle_aiocb_write_zeroes Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 3/7] block/raw-posix: create do_fallocate helper Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 4/7] block/raw-posix: create translate_err helper to merge errno values Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 5/7] block/raw-posix: refactor handle_aiocb_write_zeroes a bit Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 6/7] block: use fallocate(FALLOC_FL_PUNCH_HOLE) & fallocate(0) to write zeroes Denis V. Lunev
2014-12-26 12:35 ` [Qemu-devel] [PATCH 7/7] block/raw-posix: call plain fallocate in handle_aiocb_write_zeroes Denis V. Lunev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=549EEF9C.8060002@parallels.com \
    --to=den-lists@parallels.com \
    --cc=den@openvz.org \
    --cc=kwolf@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=pl@kamp.de \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).