Message-ID: <549F132B.2080808@parallels.com>
Date: Sat, 27 Dec 2014 23:14:35 +0300
From: "Denis V. Lunev"
In-Reply-To: <549F1016.1010705@kamp.de>
Subject: Re: [Qemu-devel] [PATCH 1/7] block: fix maximum length sent to bdrv_co_do_write_zeroes callback in bs
To: Peter Lieven , "Denis V. Lunev"
Cc: Kevin Wolf , Paolo Bonzini , qemu-devel@nongnu.org, Stefan Hajnoczi

On 27/12/14 23:01, Peter Lieven wrote:
> Am 27.12.2014 um 18:42 schrieb Denis V. Lunev:
>> On 27/12/14 17:52, Peter Lieven wrote:
>>> Am 26.12.2014 um 20:15 schrieb Denis V. Lunev:
>>>> On 26/12/14 16:32, Denis V. Lunev wrote:
>>>>> On 26/12/14 16:13, Peter Lieven wrote:
>>>>>> Am 26.12.2014 um 13:35 schrieb Denis V. Lunev:
>>>>>>> The check for the maximum length was added by
>>>>>>> commit c31cb70728d2c0c8900b35a66784baa446fd5147
>>>>>>> Author: Peter Lieven
>>>>>>> Date: Thu Oct 24 12:06:58 2013 +0200
>>>>>>>     block: honour BlockLimits in bdrv_co_do_write_zeroes
>>>>>>>
>>>>>>> but if the driver provides a .bdrv_co_write_zeroes callback, there is
>>>>>>> actually no need to limit the size to 32 MB. The callback should provide
>>>>>>> an efficient implementation which normally does not write any comparable
>>>>>>> amount of zeroes.
>>>>>>
>>>>>> NACK.
>>>>>>
>>>>>> First, there is no guarantee that bdrv_co_do_write_zeroes is a fast
>>>>>> operation. This heavily depends on circumstances that the block layer is
>>>>>> not aware of. If a specific protocol knows it is very fast at writing
>>>>>> zeroes under any circumstance, it should provide INT_MAX in
>>>>>> bs->bl.max_write_zeroes. It is then still allowed to return -ENOTSUP if
>>>>>> the request size or alignment doesn't fit.
>>>>>
>>>>> The idea is that (from my point of view) if .bdrv_co_do_write_zeroes is
>>>>> specified, the cost is almost the same for any amount of zeroes written.
>>>>> This is true for fallocate, in my view. The amount of data actually
>>>>> written will be several orders of magnitude less than specified, except
>>>>> on the slow path, which honours the 32 MB limit.
>>>>>
>>>>> If the operation is complex to carry out, then it will be rate-limited
>>>>> below, in the actual implementation.
>>>>>
>>>>>> There are known backends, e.g. anything that deals with SCSI, that have
>>>>>> a known limit on the maximum number of zeroes they can write fast in a
>>>>>> single request. This number MUST NOT be exceeded. The patch below would
>>>>>> break all those backends.
>>>>>
>>>>> Could you please point me to these backends.
>>>>> Actually, from my point of view, they should properly set up
>>>>> max_write_zeroes for themselves. This is done at least in block/iscsi.c,
>>>>> and it would be a consistent way of doing so.
>>>>>
>>>>>>
>>>>>> What issue are you trying to fix with this patch? Maybe there is a
>>>>>> better way to fix it at another point in the code.
>>>>>>
>>>>>
>>>>> I am trying to minimize the number of metadata updates for a file.
>>>>> This provides some speedup even on ext4, and it will provide even more
>>>>> speedup with a distributed filesystem like Ceph, where file size updates
>>>>> and block allocation are costly.
>>>>>
>>>>> Regards,
>>>>>     Den
>>>> First of all, the patch is really wrong :) It was written under wrong
>>>> assumptions.
>>>>
>>>> OK. I have spent some time reading your original patch set and did not
>>>> find any real justification for the default limit for either discard or
>>>> write zeroes.
>>>
>>> 32768 is the largest power of two fitting into a uint16,
>>> and uint16 is quite common for nb_sectors in backends.
>>>
>>
>> OK. That could be reasonable.
>>
>>
>>>>
>>>> Yes, there are drivers which require the block layer to call
>>>> .bdrv_co_do_write_zeroes with alignment and an upper limit.
>>>> But in that case the driver sets max_write_zeroes. All buggy drivers
>>>> should do that, so as not to affect the non-buggy ones, in my opinion.
>>>>
>>>> This is the only purpose of the original limits patches. I had wrongly
>>>> interpreted BlockLimits as something connected with the duration of the
>>>> operation. Sorry for that.
>>>>
>>>> Therefore there is no good reason to limit the amount of data sent to
>>>> drv->bdrv_co_writev to any particular size. The only thing is that it
>>>> would be good not to allocate too much memory at once. We could do
>>>> something like
>>>>
>>>>     base = qemu_try_blockalign(bs, MIN(2048, num) * BDRV_SECTOR_SIZE);
>>>>     added = 0;
>>>>     for (added = 0; added < num; added += MIN(2048, num)) {
>>>>         qemu_iovec_add(qiov, base, MIN(2048, num));
>>>>     }
>>>>
>>>> to avoid really big allocations here even if .max_write_zeroes is very
>>>> high. Do you think that this might be useful?
>>>>
>>>> As for .bdrv_co_do_write_zeroes itself, can we still drop the default
>>>> 32 MB limit? If there are buggy drivers, they should have
>>>> .max_write_zeroes specified.
>>>>
>>>> The same applies to .max_discard.
>>>
>>> It's always risky to change default behaviour. In the original discussion
>>> we agreed that there should be a limit for each request. I think the 2^15
>>> was Paolo's suggestion.
>>>
>>> You were talking about metadata updates for a file. So the operation that
>>> is too slow for you is bdrv_write_zeroes inside a container file? What is
>>> the underlying filesystem? What is the exact operation that you are trying
>>> to optimize?
>>>
>>> I am wondering because, as far as I can see, write zeroes is only
>>> supported for XFS and block devices which support BLKZEROOUT. The latter
>>> only works with cache=none. So it is not that easy to end up on an
>>> optimized (fast) path anyway.
>>>
>>> Peter
>>>
>>>
>> You have missed the 6 patches below ;) e.g. patch 2/7.
>>
>> OK. I'll redo the changes and fix this at the raw-posix level.
>
> I was not in CC on the series. Can you please include me in CC for all
> patches when you respin?
>
no prob :)
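
For reference, here is a minimal, self-contained sketch of the bounded-allocation
idea proposed in the quoted code above: one small zeroed buffer is added to the
QEMUIOVector several times, so an arbitrarily large write-zeroes request never
allocates more than a fixed amount of memory even when bs->bl.max_write_zeroes
allows very large requests. The function name, the 2048-sector cap, and the
error handling are illustrative assumptions, not code from the patch series;
only qemu_try_blockalign(), qemu_iovec_add() and bdrv_co_writev() are the QEMU
helpers referenced in the thread, and the headers are the usual QEMU 2.x ones.

#include "qemu-common.h"
#include "block/block_int.h"

/*
 * Hypothetical helper (not from the series): cover 'num' sectors with zeroes
 * in a single writev while allocating at most 'chunk' sectors of memory.
 * Must run in coroutine context, like other bdrv_co_* callers.
 */
static int coroutine_fn bounded_write_zeroes(BlockDriverState *bs,
                                             int64_t sector_num, int num)
{
    QEMUIOVector qiov;
    int chunk = MIN(2048, num);          /* illustrative 1 MiB cap */
    int done, ret;
    void *base = qemu_try_blockalign(bs, chunk * BDRV_SECTOR_SIZE);

    if (base == NULL) {
        return -ENOMEM;
    }
    memset(base, 0, chunk * BDRV_SECTOR_SIZE);

    qemu_iovec_init(&qiov, DIV_ROUND_UP(num, chunk));
    /* Point every iovec entry at the same zeroed buffer; the last entry is
     * shortened so the vector covers exactly 'num' sectors. */
    for (done = 0; done < num; done += chunk) {
        int len = MIN(chunk, num - done);
        qemu_iovec_add(&qiov, base, (size_t)len * BDRV_SECTOR_SIZE);
    }

    ret = bdrv_co_writev(bs, sector_num, num, &qiov);

    qemu_iovec_destroy(&qiov);
    qemu_vfree(base);
    return ret;
}

The point, as argued in the thread, is that the memory footprint stays bounded
by the chunk size regardless of the request size, because every iovec entry
references the same physical buffer.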