From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.1 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 219D1C6377D for ; Thu, 22 Jul 2021 13:49:34 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 80DFC6109D for ; Thu, 22 Jul 2021 13:49:33 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 80DFC6109D Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:60792 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1m6Z4u-0002Im-CV for qemu-devel@archiver.kernel.org; Thu, 22 Jul 2021 09:49:32 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:40224) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1m6Z39-0000yL-Hh for qemu-devel@nongnu.org; Thu, 22 Jul 2021 09:47:44 -0400 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:32827) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1m6Z37-0005V2-LB for qemu-devel@nongnu.org; Thu, 22 Jul 2021 09:47:43 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1626961660; h=from:from:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=q15Mz0gRQ45Vm8MfUlAWVgMVAdhoIsIrt4AVMIh9dU8=; b=H5oHq2wifP/R2QEYOKyQZ0ty7n/54Jzjs+pVfcegGtUqoRjU2D6Uvugrzl3nwxdOf8jr+9 j1hRwqjH7hEu16gKJDnfHO3uSS/JNGk+qBBR63gLQFEOKfH1qIJHo5FyzkbZKO45ueorxS 7viGjvNSEMztQO/ipxwjfKPROvScHPw= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-467-1dL8PVmIMnSmaXpBZ6fWkA-1; Thu, 22 Jul 2021 09:47:36 -0400 X-MC-Unique: 1dL8PVmIMnSmaXpBZ6fWkA-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id E77C9100B722; Thu, 22 Jul 2021 13:47:24 +0000 (UTC) Received: from redhat.com (ovpn-114-245.ams2.redhat.com [10.36.114.245]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 10CC919711; Thu, 22 Jul 2021 13:47:21 +0000 (UTC) Date: Thu, 22 Jul 2021 14:47:19 +0100 From: Daniel =?utf-8?B?UC4gQmVycmFuZ8Op?= To: David Hildenbrand Subject: Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() Message-ID: References: <20210722123635.60608-1-david@redhat.com> <20210722123635.60608-2-david@redhat.com> <8ec55578-37b8-079a-5e04-d8160ab19109@redhat.com> MIME-Version: 1.0 In-Reply-To: <8ec55578-37b8-079a-5e04-d8160ab19109@redhat.com> User-Agent: Mutt/2.0.7 (2021-05-04) X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=berrange@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=216.205.24.124; envelope-from=berrange@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -42 X-Spam_score: -4.3 X-Spam_bar: ---- X-Spam_report: (-4.3 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-1.472, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: Daniel =?utf-8?B?UC4gQmVycmFuZ8Op?= Cc: Pankaj Gupta , Eduardo Habkost , "Michael S. Tsirkin" , "Dr . David Alan Gilbert" , qemu-devel@nongnu.org, Pankaj Gupta , Igor Mammedov , Paolo Bonzini , Marek Kedzierski Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote: > On 22.07.21 15:31, Daniel P. Berrangé wrote: > > On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote: > > > Let's sense support and use it for preallocation. MADV_POPULATE_WRITE > > > does not require a SIGBUS handler, doesn't actually touch page content, > > > and avoids context switches; it is, therefore, faster and easier to handle > > > than our current approach. > > > > > > While MADV_POPULATE_WRITE is, in general, faster than manual > > > prefaulting, and especially faster with 4k pages, there is still value in > > > prefaulting using multiple threads to speed up preallocation. > > > > > > More details on MADV_POPULATE_WRITE can be found in the Linux commit > > > 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault > > > page tables") and in the man page proposal [1]. > > > > > > [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com > > > > > > This resolves the TODO in do_touch_pages(). > > > > > > In the future, we might want to look into using fallocate(), eventually > > > combined with MADV_POPULATE_READ, when dealing with shared file > > > mappings. > > > > > > Reviewed-by: Pankaj Gupta > > > Signed-off-by: David Hildenbrand > > > --- > > > include/qemu/osdep.h | 7 ++++ > > > util/oslib-posix.c | 88 +++++++++++++++++++++++++++++++++----------- > > > 2 files changed, 74 insertions(+), 21 deletions(-) > > > > > > > @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg) > > > return NULL; > > > } > > > +static void *do_madv_populate_write_pages(void *arg) > > > +{ > > > + MemsetThread *memset_args = (MemsetThread *)arg; > > > + const size_t size = memset_args->numpages * memset_args->hpagesize; > > > + char * const addr = memset_args->addr; > > > + int ret; > > > + > > > + if (!size) { > > > + return NULL; > > > + } > > > + > > > + /* See do_touch_pages(). */ > > > + qemu_mutex_lock(&page_mutex); > > > + while (!threads_created_flag) { > > > + qemu_cond_wait(&page_cond, &page_mutex); > > > + } > > > + qemu_mutex_unlock(&page_mutex); > > > + > > > + ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE); > > > + if (ret) { > > > + memset_thread_failed = true; > > > > I'm wondering if this use of memset_thread_failed is sufficient. > > > > This is pre-existing from the current impl, and ends up being > > used to set the bool result of 'touch_all_pages'. The caller > > of that then does > > > > if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) { > > error_setg(errp, "os_mem_prealloc: Insufficient free host memory " > > "pages available to allocate guest RAM"); > > } > > > > this was reasonable with the old impl, because the only reason > > we ever see 'memset_thread_failed==true' is if we got SIGBUS > > due to ENOMEM. > > > > My concern is that madvise() has a bunch of possible errno > > codes returned on failure, and we're not distinguishing > > them. In the past this kind of thing has burnt us making > > failures hard to debug. > > > > Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno' > > > > Then, we can make 'touch_all_pages' have an 'Error **errp' > > parameter, and it can directly call > > > > error_setg_errno(errp, memset_thead_errno, ....some message...) > > > > when memset_thread_errno is non-zero, and thus we can remove > > the generic message from the caller of touch_all_pages. > > > > If you agree, it'd be best to refactor the existing code to > > use this pattern in an initial patch. > > We could also simply trace the return value, which should be comparatively > easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the > latter is highly unlikely to happen when actually preallocating. > > We made sure that we don't end up with -EINVAL as we're sensing of > MADV_POPULATE_WRITE works on the mapping. Those are in the "normal" usage scenarios. I'm wondering about the abnormal scenarios where QEMU code is mistakenly screwed up or libvirt / mgmt app makes some config mistake. eg we can get things like EPERM if selinux or seccomp block the madvise syscall by mistake (common if EQMU is inside docker for example), or can we get EINVAL if the 'addr' is not page aligned, and so on. > So when it comes to debugging, I'd actually prefer tracing -errno, as the > real error will be of little help to end users. I don't care about the end users interpreting it, rather us as maintainers who get a bug report containing insufficient info to diagnose the root cause. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|