Date: Mon, 23 Jan 2023 13:47:48 +0000
From: Daniel P. Berrangé
To: Daniil Tatianin
Cc: David Hildenbrand, Paolo Bonzini, qemu-devel@nongnu.org, Stefan Weil,
    Igor Mammedov, yc-core@yandex-team.ru
Subject: Re: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
References: <20230120134749.550639-1-d-tatianin@yandex-team.ru>
    <338cbc9a-4eea-a76c-8042-98372fb70854@redhat.com>

On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
> On 1/23/23 11:57 AM, David Hildenbrand wrote:
> > On 20.01.23 14:47, Daniil Tatianin wrote:
> > > This series introduces new qemu_prealloc_mem_with_timeout() api,
> > > which allows limiting the maximum amount of time to be spent on memory
> > > preallocation. It also adds prealloc statistics collection that is
> > > exposed via an optional timeout handler.
> > >
> > > This new api is then utilized by hostmem for guest RAM preallocation
> > > controlled via new object properties called 'prealloc-timeout' and
> > > 'prealloc-timeout-fatal'.
> > >
> > > This is useful for limiting VM startup time on systems with
> > > unpredictable page allocation delays due to memory fragmentation or the
> > > backing storage. The timeout can be configured to either simply emit a
> > > warning and continue VM startup without having preallocated the entire
> > > guest RAM or just abort startup entirely if that is not acceptable for
> > > a specific use case.
> >
> > The major use case for preallocation is memory resources that cannot be
> > overcommitted (hugetlb, file blocks, ...), to avoid running out of such
> > resources later, while the guest is already running, and crashing it.
>
> Wouldn't you say that preallocating memory for the sake of speeding up guest
> kernel startup & runtime is a valid use case of prealloc? This way we can
> avoid expensive (for a multitude of reasons) page faults that will otherwise
> slow down the guest significantly at runtime and affect the user experience.
>
> > Allocating only a fraction "because it takes too long" looks quite
> > useless in that (main use-case) context. We shouldn't encourage QEMU
> > users to play with fire in such a way. IOW, there should be no way
> > around "prealloc-timeout-fatal". Either preallocation succeeded and the
> > guest can run, or it failed, and the guest can't run.
>
> Here we basically accept the fact that e.g with fragmented memory the kernel
> might take a while in a page fault handler especially for hugetlb because of
> page compaction that has to run for every fault.
>
> This way we can prefault at least some number of pages and let the guest
> fault the rest on demand later on during runtime even if it's slow and would
> cause a noticeable lag.

Rather than treat this as a problem that needs a timeout, can we
restate it as situations that need synchronous vs asynchronous
preallocation?

For the case where we need synchronous prealloc, current QEMU
deals with that. If it doesn't work quickly enough, mgmt can
just kill QEMU already today.

For the case where you would like some prealloc, but don't mind
if it runs without full prealloc, then why not just treat it as
an entirely asynchronous task? Instead of calling qemu_prealloc_mem
and waiting for it to complete, just spawn a thread to run
qemu_prealloc_mem, so it doesn't block QEMU startup. This will
have minimal maintenance burden on the existing code, and will
avoid the need for mgmt apps to think about what timeout value to
give, which is good because timeouts are hard to get right.

Most of the time that async background prealloc will still finish
before the guest even gets out of the firmware phase, but if it
takes longer it is no big deal. You don't need to quit the prealloc
job early, you just need it to not delay the guest OS boot IIUC.

This impl could be done with the 'prealloc' property turning from
a boolean on/off, to an enum on/async/off, where 'on' == sync
prealloc. Or add a separate 'prealloc-async' bool property.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
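To make the on/async/off idea above concrete, here is a minimal sketch of how the three modes might be dispatched. It is an illustration under stated assumptions, not QEMU code: PreallocMode, PreallocJob, touch_pages(), prealloc_start() and prealloc_join() are hypothetical names, plain POSIX threads stand in for QEMU's own thread helpers, and the page touch relies on a GCC/Clang atomic builtin as a placeholder for whatever the real qemu_prealloc_mem does internally.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical mirror of a 'prealloc' property with values on/async/off. */
typedef enum { PREALLOC_OFF, PREALLOC_ON, PREALLOC_ASYNC } PreallocMode;

/* Hypothetical bookkeeping for one preallocation request. */
typedef struct {
    unsigned char *area;   /* start of the region to prefault */
    size_t size;           /* region length in bytes */
    size_t page_size;      /* stride between touches */
    pthread_t thread;      /* valid only while thread_started is set */
    int thread_started;
} PreallocJob;

/*
 * Stand-in for the real preallocation routine: touch one byte per page.
 * An atomic add of zero leaves the contents unchanged even if another
 * thread writes the same byte concurrently. Whether it forces a write
 * fault depends on compiler and architecture, so a real implementation
 * might prefer something like madvise(MADV_POPULATE_WRITE) on Linux.
 */
static void touch_pages(unsigned char *area, size_t size, size_t page_size)
{
    assert(page_size > 0);
    for (size_t off = 0; off < size; off += page_size) {
        __atomic_fetch_add(&area[off], 0, __ATOMIC_RELAXED);
    }
}

static void *prealloc_worker(void *opaque)
{
    PreallocJob *job = opaque;

    touch_pages(job->area, job->size, job->page_size);
    return NULL;
}

/*
 * 'on' blocks the caller until every page is touched, 'async' hands the
 * same work to a background thread and returns at once, 'off' does nothing.
 * Returns 0 on success or the pthread_create() error number.
 */
static int prealloc_start(PreallocJob *job, PreallocMode mode)
{
    int ret = 0;

    job->thread_started = 0;
    switch (mode) {
    case PREALLOC_OFF:
        break;
    case PREALLOC_ON:
        touch_pages(job->area, job->size, job->page_size);
        break;
    case PREALLOC_ASYNC:
        ret = pthread_create(&job->thread, NULL, prealloc_worker, job);
        if (ret == 0) {
            job->thread_started = 1;
        }
        break;
    }
    return ret;
}

/* Wait for a background job, if one was started. */
static void prealloc_join(PreallocJob *job)
{
    if (job->thread_started) {
        pthread_join(job->thread, NULL);
        job->thread_started = 0;
    }
}
```

In this sketch the caller never passes a timeout: startup continues immediately in the async case, and prealloc_join() only needs to run before the region is unmapped or freed.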