From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38524) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WsSkO-0003aP-2o for qemu-devel@nongnu.org; Thu, 05 Jun 2014 04:10:13 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1WsSkF-000366-1h for qemu-devel@nongnu.org; Thu, 05 Jun 2014 04:10:04 -0400 Received: from mx-v6.kamp.de ([2a02:248:0:51::16]:39049 helo=mx01.kamp.de) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1WsSkE-00033O-O7 for qemu-devel@nongnu.org; Thu, 05 Jun 2014 04:09:54 -0400 Message-ID: <539025CD.40900@kamp.de> Date: Thu, 05 Jun 2014 10:09:49 +0200 From: Peter Lieven MIME-Version: 1.0 References: <1401486037-25609-1-git-send-email-pl@kamp.de> <20140604151228.GL11073@stefanha-thinkpad.redhat.com> <538F3BE4.9020602@kamp.de> <20140605075355.GA27366@stefanha-thinkpad.redhat.com> In-Reply-To: <20140605075355.GA27366@stefanha-thinkpad.redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCHv3 RESEND] block: introduce BDRV_O_SEQUENTIAL List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: kwolf@redhat.com, famz@redhat.com, Stefan Hajnoczi , qemu-devel@nongnu.org, shadowsor@gmail.com, pbonzini@redhat.com On 05.06.2014 09:53, Stefan Hajnoczi wrote: > On Wed, Jun 04, 2014 at 05:31:48PM +0200, Peter Lieven wrote: >> Am 04.06.2014 17:12, schrieb Stefan Hajnoczi: >>> On Fri, May 30, 2014 at 11:40:37PM +0200, Peter Lieven wrote: >>>> this patch introduces a new flag to indicate that we are going to sequentially >>>> read from a file and do not plan to reread/reuse the data after it has been read. >>>> >>>> The current use of this flag is to open the source(s) of a qemu-img convert >>>> process. If a protocol from block/raw-posix.c is used posix_fadvise is utilized >>>> to advise to the kernel that we are going to read sequentially from the >>>> file and a POSIX_FADV_DONTNEED advise is issued after each write to indicate >>>> that there is no advantage keeping the blocks in the buffers. >>>> >>>> Consider the following test case that was created to confirm the behaviour of >>>> the new flag: >>>> >>>> A 10G logical volume was created and filled with random data. >>>> Then the logical volume was exported via qemu-img convert to an iscsi target. >>>> Before the export was started all caches of the linux kernel where dropped. >>>> >>>> Old behavior: >>>> - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close >>>> to the end of the conversion. After qemu-img terminated all the buffers were >>>> freed by the kernel. >>>> >>>> New behavior with the -N switch: >>>> - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close >>>> to the end with some small peaks up to 30 MB during the conversion. >>> FADVISE_SEQUENTIAL can be good since it doubles read-ahead on Linux. >>> >>> I'm skeptical of the effort to avoid buffer cache usage using >>> FADVISE_DONTNEED. The performance results tell me that less buffer >>> cache was used but that number doesn't have a direct effect on >>> application performance. >>> >>> Let's check GNU coreutils: >>> >>> $ cd coreutils >>> $ git grep FADVISE_DONTNEED >>> gl/lib/fadvise.h: FADVISE_DONTNEED = POSIX_FADV_DONTNEED, >>> gl/lib/fadvise.h: FADVISE_DONTNEED, >>> $ >>> >>> GNU cp(1) does not care about minimizing impact on buffer cache using >>> FADVISE_DONTNEED. It just sets FADVISE_SEQUENTIAL on the source file >>> and calls read() (plus uses FIEMAP to check extents for sparseness). >>> >>> I want to avoid adding code just for the heck of it. We need a deeper >>> understanding: >>> >>> Please drop FADVISE_DONTNEED and compare again to see if it changes the >>> benchmark. >>> >>> By the way, did you perform several runs to check the variance of the >>> running time? I don't know if the 2 seconds difference were noise or >>> because FADVISE_SEQUENTIAL or because FADVISE_DONTNEED or because both. >> There was no effect on the runtime as far as I remember. I ran >> some tests, but not a number large enough to filter out the noise. >> >> I created this one because we saw it helps under memory pressure. >> Maybe its too specific to add it into mainline qemu, but I wanted to >> avoid to have too much individual changes we need to maintain. > I'm open to merging it if the improvement can be quantified. Right now > this might be a workaround for Linux memory management heuristics or it > might not have any effect, I don't know. I understand that you are critical about it. I can just say it solved the problem with the specific setup, kernel version etc. I found that FADVISE_DONTNEED solves problems also in other applications. Offtopic: i have an raspberry pi running as tvheadend and observed desync of the DVBS2 signal at some times. Since I FADV_DONTNEED all written frames away it runs smothly. I this case the feeing of the page cache was CPU intensive for the small device and caused the desync. > >>>> diff --git a/block/raw-posix.c b/block/raw-posix.c >>>> index 6586a0c..9768cc4 100644 >>>> --- a/block/raw-posix.c >>>> +++ b/block/raw-posix.c >>>> @@ -447,6 +447,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, >>>> } >>>> #endif >>>> >>>> +#ifdef POSIX_FADV_SEQUENTIAL >>>> + if (bs->open_flags & BDRV_O_SEQUENTIAL && >>>> + !(bs->open_flags & BDRV_O_NOCACHE)) { >>>> + posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL); >>>> + } >>>> +#endif >>> This is only true if the image format is raw. If the image format on >>> top of this raw-posix BDS is non-raw then the read pattern may not be >>> sequential. >> You are right, but will the other formats set BDRV_O_SEQUENTIAL? > If the user specifies qemu-img convert -N then it will be set for any > image format. Of course, but when e.g. qcow2 opens its underlying file, then BDRV_O_SEQUENTIAL is not passed on, or is it? > > Maybe qemu-img convert can always set BDRV_O_SEQUENTIAL and the have the > raw_bsd.c format propagate it to bs->file while other formats do not. > Then the user doesn't have to specify a command-line option and we don't > set it for non-raw image formats. This would be an option. Peter