Date: Wed, 16 Aug 2023 11:49:03 +0200
From: David Hildenbrand
Organization: Red Hat
To: Yan Zhao
Cc: John Hubbard, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 kvm@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com,
 mike.kravetz@oracle.com, apopple@nvidia.com, jgg@nvidia.com,
 rppt@kernel.org, akpm@linux-foundation.org, kevin.tian@intel.com,
 Mel Gorman
Subject: Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
References: <20230810085636.25914-1-yan.y.zhao@intel.com>
 <41a893e1-f2e7-23f4-cad2-d5c353a336a3@redhat.com>
 <6b48a161-257b-a02b-c483-87c04b655635@redhat.com>
 <1ad2c33d-95e1-49ec-acd2-ac02b506974e@nvidia.com>
 <846e9117-1f79-a5e0-1b14-3dba91ab8033@redhat.com>
 <4271b91c-90b7-4b48-b761-b4535b2ae9b7@nvidia.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0
Content-Type: text/plain; charset=UTF-8; format=flowed

On 16.08.23 11:06, Yan Zhao wrote:
> On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote:
>> On 15.08.23 04:34, John Hubbard wrote:
>>> On 8/14/23 02:09, Yan Zhao wrote:
>>> ...
>>>>> hmm_range_fault()-based memory management in particular might benefit
>>>>> from having NUMA balancing disabled entirely for the memremap_pages()
>>>>> region, come to think of it. That seems relatively easy and clean at
>>>>> first glance anyway.
>>>>>
>>>>> For other regions (allocated by the device driver), a per-VMA flag
>>>>> seems about right: VM_NO_NUMA_BALANCING ?
>>>>>
>>>> Thanks a lot for those good suggestions!
>>>> For VMs, when could a per-VMA flag be set?
>>>> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until
>>>> after it's mapped into VFIO.
>>>> Then, should VFIO set this flag on after it maps a range?
>>>> Could this flag be unset after device hot-unplug?
>>>>
>>>
>>> I'm hoping someone who thinks about VMs and VFIO often can chime in.
>>
>> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao,
>> using madvise).
>>
>> BUT, I do wonder what value there would be for autonuma to still be active
> Currently MADV_* is up to 25
> #define MADV_COLLAPSE 25,
> while madvise behavior is of type "int". So it's ok.
>
> But vma->vm_flags is of "unsigned long", so it's full at least on 32bit platform.

I remember there were discussions about increasing it for 32bit as well.
If that's required, we might want to go down that path.

But do 32bit architectures even care about NUMA hinting? If not, just
ignore them ...

>
>> for the remainder of the hypervisor. If there is none, a prctl() would be
>> better.
> Add a new field in "struct vma_numab_state" in vma, and use prctl() to
> update this field?

Rather a global toggle per MM, no need to update individual VMAs -- if
we go down that prctl() path.

No need to consume more memory for VMAs.

[...]

>> We already do have a mechanism in QEMU to get notified when longterm-pinning
>> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used):
>> * ram_block_discard_disable()
>> * ram_block_uncoordinated_discard_disable()
> Looks this ram_block_discard allow/disallow state is global rather than per-VMA
> in QEMU.

Yes. Once you transition into "discard of any kind disabled", you can go
over all guest memory VMAs (RAMBlock) and issue an madvise() for them
(or alternatively, do the prctl() once).

We'll also have to handle new guest memory being created afterwards, but
that is easy.

Once we transition to "no discarding disabled", you can go over all
guest memory VMAs (RAMBlock) and issue an madvise() for them again (or
alternatively, do the prctl() once).

> So, do you mean that let kernel provide a per-VMA allow/disallow mechanism, and
> it's up to the user space to choose between per-VMA and complex way or
> global and simpler way?

QEMU could do either way. The question would be whether a per-VMA
setting makes sense for NUMA hinting.

-- 
Cheers,

David / dhildenb
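
For illustration only, the two user-space options discussed above (one madvise() per guest memory VMA versus a single per-MM prctl()) could be sketched roughly as below. This is a minimal sketch under loud assumptions: neither a madvise() advice nor a prctl() option for this purpose exists in this thread, so both are modeled as caller-supplied callbacks, and every name (struct guest_region, apply_numa_hint(), the counting stubs) is hypothetical rather than taken from QEMU or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one guest memory mapping (a RAMBlock in QEMU). */
struct guest_region {
    void *addr;
    size_t len;
};

/* Hypothetical per-VMA hook, e.g. where a future madvise() advice would go. */
typedef int (*region_hint_fn)(void *addr, size_t len, bool disable_balancing);

/* Hypothetical per-MM hook, e.g. where a future prctl() option would go. */
typedef int (*process_hint_fn)(bool disable_balancing);

/*
 * Called when the "discard disabled" state changes.
 * If per_process is non-NULL, a single call covers the whole address space;
 * otherwise every region is visited with per_region.
 * Returns 0 on success or the first nonzero error from a hook.
 */
static int apply_numa_hint(const struct guest_region *regions, size_t count,
                           bool disable_balancing,
                           region_hint_fn per_region,
                           process_hint_fn per_process)
{
    assert(per_region || per_process);

    if (per_process)
        return per_process(disable_balancing);

    for (size_t i = 0; i < count; i++) {
        int ret = per_region(regions[i].addr, regions[i].len,
                             disable_balancing);
        if (ret)
            return ret;
    }
    return 0;
}

/* Counting stubs so the flow can be exercised without any kernel support. */
static size_t region_hint_calls;
static size_t process_hint_calls;

static int count_region_hint(void *addr, size_t len, bool disable_balancing)
{
    (void)addr;
    (void)len;
    (void)disable_balancing;
    region_hint_calls++;
    return 0;
}

static int count_process_hint(bool disable_balancing)
{
    (void)disable_balancing;
    process_hint_calls++;
    return 0;
}
```

With the per-VMA variant, guest memory created after the transition would need the same hook applied at creation time; the per-MM variant would not, which is one reason it looks simpler from the user-space side.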