From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2C3CFC04AA7 for ; Mon, 13 May 2019 12:09:58 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id A75E520879 for ; Mon, 13 May 2019 12:09:57 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A75E520879 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([127.0.0.1]:56286 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hQ9mG-0003sZ-Lz for qemu-devel@archiver.kernel.org; Mon, 13 May 2019 08:09:56 -0400 Received: from eggs.gnu.org ([209.51.188.92]:41888) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hQ9hd-0001B3-MV for qemu-devel@nongnu.org; Mon, 13 May 2019 08:05:11 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hQ9hb-0005fN-N8 for qemu-devel@nongnu.org; Mon, 13 May 2019 08:05:09 -0400 Received: from mx1.redhat.com ([209.132.183.28]:47448) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1hQ9cf-0002jt-Pk; Mon, 13 May 2019 08:00:02 -0400 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id C534B3082E24; Mon, 13 May 2019 12:00:00 +0000 (UTC) Received: from blackfin.pond.sub.org (ovpn-116-28.ams2.redhat.com [10.36.116.28]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 2ACD05D71E; Mon, 13 May 2019 11:59:58 +0000 (UTC) Received: by blackfin.pond.sub.org (Postfix, from userid 1000) id 9E95D11385E4; Mon, 13 May 2019 13:59:56 +0200 (CEST) From: Markus Armbruster To: Xiang Zheng References: <20190505070059.4664-1-zhengxiang9@huawei.com> <87a7fyb0v7.fsf@dusky.pond.sub.org> <72adbed8-f650-42df-98d5-e98154baec08@redhat.com> <87h8a513sl.fsf@dusky.pond.sub.org> <87ef574z4e.fsf@dusky.pond.sub.org> <87ef56mjbi.fsf@dusky.pond.sub.org> Date: Mon, 13 May 2019 13:59:56 +0200 In-Reply-To: (Xiang Zheng's message of "Sat, 11 May 2019 16:36:53 +0800") Message-ID: <87bm06a7k3.fsf@dusky.pond.sub.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.46]); Mon, 13 May 2019 12:00:00 +0000 (UTC) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 209.132.183.28 Subject: Re: [Qemu-devel] [PATCH] pflash: Only read non-zero parts of backend image X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: kwolf@redhat.com, peter.maydell@linaro.org, qemu-block@nongnu.org, ard.biesheuvel@linaro.org, qemu-devel@nongnu.org, mreitz@redhat.com, stefanha@redhat.com, guoheyi@huawei.com, wanghaibin.wang@huawei.com, Laszlo Ersek Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Xiang Zheng writes: > On 2019/5/10 23:16, Markus Armbruster wrote: >> Xiang Zheng writes: >> >>> On 2019/5/9 19:59, Markus Armbruster wrote: >>>> Xiang Zheng writes: >>>> >>>>> On 2019/5/8 21:20, Markus Armbruster wrote: >>>>>> Laszlo Ersek writes: >>>>>> >>>>>>> Hi Markus, >>>>>>> >>>>>>> On 05/07/19 20:01, Markus Armbruster wrote: >>>>>>>> The subject is slightly misleading. Holes read as zero. So do >>>>>>>> non-holes full of zeroes. The patch avoids reading the former, but >>>>>>>> still reads the latter. >>>>>>>> >>>>>>>> Xiang Zheng writes: >>>>>>>> >>>>>>>>> Currently we fill the memory space with two 64MB NOR images when >>>>>>>>> using persistent UEFI variables on virt board. Actually we only use >>>>>>>>> a very small(non-zero) part of the memory while the rest significant >>>>>>>>> large(zero) part of memory is wasted. >>>>>>>> >>>>>>>> Neglects to mention that the "virt board" is ARM. >>>>>>>> >>>>>>>>> So this patch checks the block status and only writes the non-zero part >>>>>>>>> into memory. This requires pflash devices to use sparse files for >>>>>>>>> backends. >>>>>>>> >>>>>>>> I started to draft an improved commit message, but then I realized this >>>>>>>> patch can't work. >>>>>>>> >>>>>>>> The pflash_cfi01 device allocates its device memory like this: >>>>>>>> >>>>>>>> memory_region_init_rom_device( >>>>>>>> &pfl->mem, OBJECT(dev), >>>>>>>> &pflash_cfi01_ops, >>>>>>>> pfl, >>>>>>>> pfl->name, total_len, &local_err); >>>>>>>> >>>>>>>> pflash_cfi02 is similar. >>>>>>>> >>>>>>>> memory_region_init_rom_device() calls >>>>>>>> memory_region_init_rom_device_nomigrate() calls qemu_ram_alloc() calls >>>>>>>> qemu_ram_alloc_internal() calls g_malloc0(). Thus, all the device >>>>>>>> memory gets written to even with this patch. >>>>>>> >>>>>>> As far as I can see, qemu_ram_alloc_internal() calls g_malloc0() only to >>>>>>> allocate the the new RAMBlock object called "new_block". The actual >>>>>>> guest RAM allocation occurs inside ram_block_add(), which is also called >>>>>>> by qemu_ram_alloc_internal(). >>>>>> >>>>>> You're right. I should've read more attentively. >>>>>> >>>>>>> One frame outwards the stack, qemu_ram_alloc() passes NULL to >>>>>>> qemu_ram_alloc_internal(), for the 4th ("host") parameter. Therefore, in >>>>>>> qemu_ram_alloc_internal(), we set "new_block->host" to NULL as well. >>>>>>> >>>>>>> Then in ram_block_add(), we take the (!new_block->host) branch, and call >>>>>>> phys_mem_alloc(). >>>>>>> >>>>>>> Unfortunately, "phys_mem_alloc" is a function pointer, set with >>>>>>> phys_mem_set_alloc(). The phys_mem_set_alloc() function is called from >>>>>>> "target/s390x/kvm.c" (setting the function pointer to >>>>>>> legacy_s390_alloc()), so it doesn't apply in this case. Therefore we end >>>>>>> up calling the default qemu_anon_ram_alloc() function, through the >>>>>>> funcptr. (I think anyway.) >>>>>>> >>>>>>> And qemu_anon_ram_alloc() boils down to mmap() + MAP_ANONYMOUS, in >>>>>>> qemu_ram_mmap(). (Even on PPC64 hosts, because qemu_anon_ram_alloc() >>>>>>> passes (-1) for "fd".) >>>>>>> >>>>>>> I may have missed something, of course -- I obviously didn't test it, >>>>>>> just speculated from the source. >>>>>> >>>>>> Thanks for your sleuthing! >>>>>> >>>>>>>> I'm afraid you neglected to test. >>>>>> >>>>>> Accusation actually unsupported. I apologize, and replace it by a >>>>>> question: have you observed the improvement you're trying to achieve, >>>>>> and if yes, how? >>>>>> >>>>> >>>>> Yes, we need to create sparse files as the backing images for pflash device. >>>>> To create sparse files like: >>>>> >>>>> dd of="QEMU_EFI-pflash.raw" if="/dev/zero" bs=1M seek=64 count=0 >>>>> dd of="QEMU_EFI-pflash.raw" if="QEMU_EFI.fd" conv=notrunc >>>> >>>> This creates a copy of firmware binary QEMU_EFI.fd padded with a hole to >>>> 64MiB. >>>> >>>>> dd of="empty_VARS.fd" if="/dev/zero" bs=1M seek=64 count=0 >>>> >>>> This creates the varstore as a 64MiB hole. As far as I know (very >>>> little), you should use the varstore template that comes with the >>>> firmware binary. >>>> >>>> I use >>>> >>>> cp --sparse=always bld/pc-bios/edk2-arm-vars.fd . >>>> cp --sparse=always bld/pc-bios/edk2-aarch64-code.fd . >>>> >>>> These guys are already zero-padded, and I use cp to sparsify. >>>> >>>>> Start a VM with below commandline: >>>>> >>>>> -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,unit=0,readonly=on\ >>>>> -drive file=/usr/share/edk2/aarch64/empty_VARS.fd,if=pflash,format=raw,unit=1 \ >>>>> >>>>> Then observe the memory usage of the qemu process (THP is on). >>>>> >>>>> 1) Without this patch: >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>>>> AnonHugePages: 706560 kB >>>>> AnonHugePages: 2048 kB >>>>> AnonHugePages: 65536 kB // pflash memory device >>>>> AnonHugePages: 65536 kB // pflash memory device >>>>> AnonHugePages: 2048 kB >>>>> >>>>> # ps aux | grep qemu-system-aarch64 >>>>> RSS: 879684 >>>>> >>>>> 2) After applying this patch: >>>>> # cat /proc/`pidof qemu-system-aarch64`/smaps | grep AnonHugePages: | grep -v ' 0 kB' >>>>> AnonHugePages: 700416 kB >>>>> AnonHugePages: 2048 kB >>>>> AnonHugePages: 2048 kB // pflash memory device >>>>> AnonHugePages: 2048 kB // pflash memory device >>>>> AnonHugePages: 2048 kB >>>>> >>>>> # ps aux | grep qemu-system-aarch64 >>>>> RSS: 744380 >>>> >>>> Okay, this demonstrates the patch succeeds at mapping parts of the >>>> pflash memory as holes. >>>> >>>> Do the guests in these QEMU processes run? >>> >>> Yes. >> >> Good to know, thanks. >> >>>>> Obviously, there are at least 100MiB memory saved for each guest. >>>> >>>> For a definition of "memory". >>>> >>>> Next question: what impact on system performance do you observe? >>>> >>>> Let me explain. >>>> >>>> Virtual memory holes get filled in by demand paging on access. In other >>>> words, they remain holes only as long as nothing accesses the memory. >>>> >>>> Without your patch, we allocate pages at image read time and fill them >>>> with zeroes. If we don't access them again, the kernel will eventually >>>> page them out (assuming you're running with swap). So the steady state >>>> is "we waste some swap space", not "we waste some physical RAM". >>>> >>> >>> Not everybody wants to run with swap because it may cause low performance. >> >> Someone running without swap because he heard someone say someone said >> swap may be slow is probably throwing away performance. >> >> But I assume you mean people running without swap because they measured >> their workload and found it more performant without swap. Legitimate. > > Yes, and I had ever suffered from the high IO waits with swap.:) > >> >>>> Your patch lets us map pflash memory pages containing only zeros as >>>> holes. >>>> >>>> For pages that never get accessed, your patch avoids page allocation, >>>> filling with zeroes, writing to swap (all one-time costs), and saves >>>> some swap space (not commonly an issue). >>>> >>>> For pflash memory that gets accessed, your patch merely delays page >>>> allocation from image read time to first access. >>>> >>>> I wonder how these savings and delays affect actual system performance. >>>> Without an observable change in system performance, all we'd accomplish >>>> is changing a bunch of numers in /proc/$pid/. >>>> >>>> What improvement(s) can you observe? >>> >>> We only use pflash device for UEFI, and we hardly care about the performance. >>> I think the bottleneck of the performance is the MMIO emulation, even this >>> patch would delay page allocation at the first access. >> >> I wasn't inquiring about the performance of the pflash device. I was >> inquiring about *system* performance. But let me rephrase my question. >> >> Doing work to save resources is only worthwhile if something valuable >> gets better in a measurable way. I'm asking you >> >> (1) to explain what exactly you value, and >> >> (2) to provide measurements that show improvement. >> > > What we exactly value is the cost of memory resources and it is the only > thing that this patch aims to resolve. Then measure this cost! > I am confused that why you think it will impact the system performance? Did I > neglect something? If the patch does not impact how the system as a whole performs, then it's useless. Since you find it useful, it must have some valuable[*] observable effect for you. Tell us about it! I keep asking not to torment you, but to guide you towards building a compelling justification for your patch. However, I can only show you the path; the walking you'll have to do yourself. >>>> I guess the best case for your patch is many guests with relatively >>>> small RAM sizes. >> >> . >> [*] Changing a bunch of numbers in /proc is not valuable.