Message-ID: <549F132B.2080808@parallels.com>
Date: Sat, 27 Dec 2014 23:14:35 +0300
From: "Denis V. Lunev"
In-Reply-To: <549F1016.1010705@kamp.de>
Subject: Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
To: Peter Lieven , "Denis V. Lunev"
Cc: Kevin Wolf , Paolo Bonzini , qemu-devel@nongnu.org, Stefan Hajnoczi

On 27/12/14 23:01, Peter Lieven wrote:
> Am 27.12.2014 um 18:42 schrieb Denis V. Lunev:
>> On 27/12/14 17:52, Peter Lieven wrote:
>>> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
>>>> On 26/12/14 16:32, Denis V. Lunev wrote:
>>>>> On 26/12/14 16:13, Peter Lieven wrote:
>>>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>>>>> The check for the maximum length was added by
>>>>>>> commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>>>> Author: Peter Lieven
>>>>>>> Date: Thu Oct 24 12:06:58 2013 +0200
>>>>>>>     block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>>>>
>>>>>>> but if the driver provides a .bdrv_co_write_zeroes callback, there is
>>>>>>> actually no need to limit the size to 32 MB. The callback should provide
>>>>>>> an efficient implementation which normally does not write any comparable
>>>>>>> amount of zeroes.
>>>>>>
>>>>>> NACK.
>>>>>>
>>>>>> First, there is no guarantee that bdrv_co_do_write_zeroes is a fast
>>>>>> operation. This heavily depends on circumstances that the block layer is
>>>>>> not aware of. If a specific protocol knows it is very fast at writing
>>>>>> zeroes under any circumstance, it should provide INT_MAX in
>>>>>> bs->bl.max_write_zeroes. It is then still allowed to return -ENOTSUP if
>>>>>> the request size or alignment doesn't fit.
>>>>>
>>>>> The idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>>>>> specified, the cost is almost the same for any amount of zeroes written.
>>>>> This is true for fallocate, in my view. The amount of data actually
>>>>> written will be several orders of magnitude less than specified, except
>>>>> on the slow path, which honours the 32 MB limit.
>>>>>
>>>>> If the operation is complex to carry out, then it will be rate-limited
>>>>> below, in the actual implementation.
>>>>>
>>>>>> There are known backends, e.g. anything that deals with SCSI, that have
>>>>>> a known limit on the maximum number of zeroes they can write fast in a
>>>>>> single request. This number MUST NOT be exceeded. The patch below would
>>>>>> break all those backends.
>>>>>
>>>>> Could you please point me to these backends.
>>>>> Actually, from my point of view, they should properly set up
>>>>> max_write_zeroes for themselves. This is done at least in block/iscsi.c,
>>>>> and it would be a consistent way of doing so.
>>>>>
>>>>>>
>>>>>> What issue are you trying to fix with this patch? Maybe there is a
>>>>>> better way to fix it at another point in the code.
>>>>>>
>>>>>
>>>>> I am trying to minimize the number of metadata updates for a file.
>>>>> This provides some speedup even on ext4, and it will provide even more
>>>>> speedup with a distributed filesystem like Ceph, where file size updates
>>>>> and block allocation are costly.
>>>>>
>>>>> Regards,
>>>>>     Den
>>>> First of all, the patch is really wrong :) It was written under wrong
>>>> assumptions.
>>>>
>>>> OK. I have spent some time reading your original patch set and did not
>>>> find any real justification for the default limit for either discard or
>>>> write zeroes.
>>>
>>> 32768 is the largest power of two fitting into a uint16,
>>> and uint16 is quite common for nb_sectors in backends.
>>>
>>
>> OK. That could be reasonable.
>>
>>
>>>>
>>>> Yes, there are drivers which require the block layer to call
>>>> .bdrv_co_do_write_zeroes with alignment and an upper limit.
>>>> But in that case the driver sets max_write_zeroes. All buggy drivers
>>>> should do that, so as not to affect the non-buggy ones, in my opinion.
>>>>
>>>> This is the only purpose of the original limits patches. I had wrongly
>>>> interpreted BlockLimits as something connected with the duration of the
>>>> operation. Sorry for that.
>>>>
>>>> Therefore there is no good reason to limit the amount of data sent to
>>>> drv->bdrv_co_writev to any particular size. The only thing is that it
>>>> would be good not to allocate too much memory at once. We could do
>>>> something like
>>>>
>>>>     base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
>>>>     added = 0;
>>>>     for (added = 0; added < num; added += MIN(2048, num)) {
>>>>         qemu_iovec_add(qiov, base, MIN(2048, num));
>>>>     }
>>>>
>>>> to avoid really big allocations here even if .max_write_zeroes is very
>>>> high. Do you think that this might be useful?
>>>>
>>>> As for .bdrv_co_do_write_zeroes itself, can we still drop the default
>>>> 32 MB limit? If there are buggy drivers, they should have
>>>> .max_write_zeroes specified.
>>>>
>>>> The same applies to .max_discard.
>>>
>>> It's always risky to change default behaviour. In the original discussion
>>> we agreed that there should be a limit for each request. I think the 2^15
>>> was Paolo's suggestion.
>>>
>>> You were talking about metadata updates for a file. So the operation that
>>> is too slow for you is bdrv_write_zeroes inside a container file? What is
>>> the underlying filesystem? What is the exact operation that you are trying
>>> to optimize?
>>>
>>> I am wondering because, as far as I can see, write zeroes is only
>>> supported for XFS and block devices which support BLKZEROOUT. The latter
>>> only works with cache=none. So it is not that easy to end up on an
>>> optimized (fast) path anyway.
>>>
>>> Peter
>>>
>>>
>> You have missed the 6 patches below ;) e.g. patch 2/7.
>>
>> OK. I'll redo the changes and fix this at the raw-posix level.
>
> I was not in CC on the series. Can you please include me in CC for all
> patches when you respin?
>
no prob :)
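
For reference, here is a minimal, self-contained sketch of the bounded-allocation
idea proposed in the quoted code above: one small zeroed buffer is added to the
QEMUIOVector several times, so an arbitrarily large write-zeroes request never
allocates more than a fixed amount of memory even when bs->bl.max_write_zeroes
allows very large requests. The function name, the 2048-sector cap, and the
error handling are illustrative assumptions, not code from the patch series;
only qemu_try_blockalign(), qemu_iovec_add() and bdrv_co_writev() are the QEMU
helpers referenced in the thread, and the headers are the usual QEMU 2.x ones.

#include "qemu-common.h"
#include "block/block_int.h"

/*
 * Hypothetical helper (not from the series): cover 'num' sectors with zeroes
 * in a single writev while allocating at most 'chunk' sectors of memory.
 * Must run in coroutine context, like other bdrv_co_* callers.
 */
static int coroutine_fn bounded_write_zeroes(BlockDriverState *bs,
                                             int64_t sector_num, int num)
{
    QEMUIOVector qiov;
    int chunk = MIN(2048, num);          /* illustrative 1 MiB cap */
    int done, ret;
    void *base = qemu_try_blockalign(bs, chunk * BDRV_SECTOR_SIZE);

    if (base == NULL) {
        return -ENOMEM;
    }
    memset(base, 0, chunk * BDRV_SECTOR_SIZE);

    qemu_iovec_init(&qiov, DIV_ROUND_UP(num, chunk));
    /* Point every iovec entry at the same zeroed buffer; the last entry is
     * shortened so the vector covers exactly 'num' sectors. */
    for (done = 0; done < num; done += chunk) {
        int len = MIN(chunk, num - done);
        qemu_iovec_add(&qiov, base, (size_t)len * BDRV_SECTOR_SIZE);
    }

    ret = bdrv_co_writev(bs, sector_num, num, &qiov);

    qemu_iovec_destroy(&qiov);
    qemu_vfree(base);
    return ret;
}

The point, as argued in the thread, is that the memory footprint stays bounded
by the chunk size regardless of the request size, because every iovec entry
references the same physical buffer.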