Date: Sun, 23 Jun 2024 23:39:50 -0600
From: Yu Zhao
To: Nanyong Sun
Cc: David Rientjes, Will Deacon, Catalin Marinas, Matthew Wilcox,
	muchun.song@linux.dev, Andrew Morton, anshuman.khandual@arm.com,
	wangkefeng.wang@huawei.com, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, Yosry Ahmed,
	Sourav Panda
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
	<20240207111252.GA22167@willie-the-truck>
	<44075bc2-ac5f-ffcd-0d2f-4093351a6151@huawei.com>
	<20240208131734.GA23428@willie-the-truck>
	<22c14513-af78-0f1d-5647-384ff9cb5993@huawei.com>
In-Reply-To: <22c14513-af78-0f1d-5647-384ff9cb5993@huawei.com>

On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> On 2024/3/14 7:32, David Rientjes wrote:
>
> > On Thu, 8 Feb 2024, Will Deacon wrote:
> >
> > > > How about take a new lock with irq disabled during BBM, like:
> > > >
> > > > +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> > > > +{
> > > > +    spin_lock_irq(NEW_LOCK);
> > > > +    pte_clear(&init_mm, addr, ptep);
> > > > +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > > > +    set_pte_at(&init_mm, addr, ptep, pte);
> > > > +    spin_unlock_irq(NEW_LOCK);
> > > > +}
> > > I really think the only
> > > maintainable way to achieve this is to avoid the
> > > possibility of a fault altogether.
> > >
> > > Will
> > >
> > >
> > Nanyong, are you still actively working on making HVO possible on arm64?
> >
> > This would yield a substantial memory savings on hosts that are largely
> > configured with hugetlbfs. In our case, the size of this hugetlbfs pool
> > is actually never changed after boot, but it sounds from the thread that
> > there was an idea to make HVO conditional on FEAT_BBM. Is this being
> > pursued?
> >
> > If so, any testing help needed?
> I'm afraid that FEAT_BBM may not solve the problem here

I think so too -- I came across this while working on TAO [1].

[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/

> because from Arm ARM, I see that FEAT_BBM is only used for changing
> block size. Therefore, in this HVO feature, it can work in the split PMD
> stage, that is, BBM can be avoided in vmemmap_split_pmd, but in the
> subsequent vmemmap_remap_pte, the Output address of PTE still needs to be
> changed. I'm afraid FEAT_BBM is not competent for this stage. Perhaps my
> understanding of ARM FEAT_BBM is wrong, and I hope someone can correct me.
> Actually, the solution I first considered was to use the stop_machine
> method, but we have products that rely on
> /proc/sys/vm/nr_overcommit_hugepages to dynamically use hugepages, so I
> have to consider performance issues. If your product does not change the
> amount of huge pages after booting, using stop_machine() may be a
> feasible way.
> So far, I still haven't come up with a good solution.

I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
to pause/resume remote CPUs while the local one is doing BBM.

Note that the problem of updating vmemmap for struct page[], as I see
it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory
hot removal in general [2].
On arm64, we would need to support BBM on vmemmap so that we can fix
the problem with offlining memory (or to be precise, unmapping offlined
struct page[]), by mapping offlined struct page[] to a read-only page
of dummy struct page[], similar to ZERO_PAGE(). (Or we would have to
make extremely invasive changes to the reader side, i.e., all
speculative PFN walkers.)

In case you are interested in testing my approach, you can swap your
patch 2 with the following:

[2] https://lore.kernel.org/20240621213717.1099079-1-yuzhao@google.com/

diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 8ff5f2a2579e..1af1aa34a351 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 
 #define __HAVE_ARCH_PGD_FREE
 #define __HAVE_ARCH_PUD_FREE
@@ -137,4 +138,58 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep)
 	__pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN);
 }
 
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+
+#define vmemmap_update_lock vmemmap_update_lock
+static inline void vmemmap_update_lock(void)
+{
+	cpus_read_lock();
+}
+
+#define vmemmap_update_unlock vmemmap_update_unlock
+static inline void vmemmap_update_unlock(void)
+{
+	cpus_read_unlock();
+}
+
+#define vmemmap_update_pte vmemmap_update_pte
+static inline void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
+{
+	preempt_disable();
+	pause_remote_cpus();
+
+	pte_clear(&init_mm, addr, ptep);
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	set_pte_at(&init_mm, addr, ptep, pte);
+
+	resume_remote_cpus();
+	preempt_enable();
+}
+
+#define vmemmap_update_pmd vmemmap_update_pmd
+static inline void vmemmap_update_pmd(unsigned long addr, pmd_t *pmdp, pte_t *ptep)
+{
+	preempt_disable();
+	pause_remote_cpus();
+
+	pmd_clear(pmdp);
+	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
+	pmd_populate_kernel(&init_mm, pmdp, ptep);
+
+	resume_remote_cpus();
+	preempt_enable();
+}
+
+#define vmemmap_flush_tlb_all vmemmap_flush_tlb_all
+static inline void vmemmap_flush_tlb_all(void)
+{
+}
+
+#define vmemmap_flush_tlb_range vmemmap_flush_tlb_range
+static inline void vmemmap_flush_tlb_range(unsigned long start, unsigned long end)
+{
+}
+
+#endif /* CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP */
+
 #endif
diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h
index efb13112b408..544b15948b64 100644
--- a/arch/arm64/include/asm/smp.h
+++ b/arch/arm64/include/asm/smp.h
@@ -144,6 +144,9 @@ bool cpus_are_stuck_in_kernel(void);
 extern void crash_smp_send_stop(void);
 extern bool smp_crash_stop_failed(void);
 
+void pause_remote_cpus(void);
+void resume_remote_cpus(void);
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* ifndef __ASM_SMP_H */
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 31c8b3094dd7..ae0a178db066 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -71,16 +71,25 @@ enum ipi_msg_type {
 	IPI_RESCHEDULE,
 	IPI_CALL_FUNC,
 	IPI_CPU_STOP,
+	IPI_CPU_PAUSE,
+#ifdef CONFIG_KEXEC_CORE
 	IPI_CPU_CRASH_STOP,
+#endif
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 	IPI_TIMER,
+#endif
+#ifdef CONFIG_IRQ_WORK
 	IPI_IRQ_WORK,
+#endif
 	NR_IPI,
 	/*
 	 * Any enum >= NR_IPI and < MAX_IPI is special and not tracable
 	 * with trace_ipi_*
 	 */
 	IPI_CPU_BACKTRACE = NR_IPI,
+#ifdef CONFIG_KGDB
 	IPI_KGDB_ROUNDUP,
+#endif
 	MAX_IPI
 };
 
@@ -771,9 +780,16 @@ static const char *ipi_types[NR_IPI] __tracepoint_string = {
 	[IPI_RESCHEDULE]	= "Rescheduling interrupts",
 	[IPI_CALL_FUNC]		= "Function call interrupts",
 	[IPI_CPU_STOP]		= "CPU stop interrupts",
+	[IPI_CPU_PAUSE]		= "CPU pause interrupts",
+#ifdef CONFIG_KEXEC_CORE
 	[IPI_CPU_CRASH_STOP]	= "CPU stop (for crash dump) interrupts",
+#endif
+#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 	[IPI_TIMER]		= "Timer broadcast interrupts",
+#endif
+#ifdef CONFIG_IRQ_WORK
 	[IPI_IRQ_WORK]		= "IRQ work interrupts",
+#endif
 };
 
 static void smp_cross_call(const struct cpumask *target, unsigned int ipinr);
@@ -832,6 +848,85 @@ void __noreturn panic_smp_self_stop(void)
 	local_cpu_stop();
 }
 
+static DEFINE_SPINLOCK(cpu_pause_lock);
+static cpumask_t paused_cpus;
+static cpumask_t resumed_cpus;
+
+static void pause_local_cpu(void)
+{
+	int cpu = smp_processor_id();
+
+	cpumask_clear_cpu(cpu, &resumed_cpus);
+	/*
+	 * Paired with pause_remote_cpus() to confirm that this CPU not only
+	 * will be paused but also can be reliably resumed.
+	 */
+	smp_wmb();
+	cpumask_set_cpu(cpu, &paused_cpus);
+	/* A typical example for sleep and wake-up functions. */
+	smp_mb();
+	while (!cpumask_test_cpu(cpu, &resumed_cpus)) {
+		wfe();
+		barrier();
+	}
+	barrier();
+	cpumask_clear_cpu(cpu, &paused_cpus);
+}
+
+void pause_remote_cpus(void)
+{
+	cpumask_t cpus_to_pause;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_preemption_disabled();
+
+	cpumask_copy(&cpus_to_pause, cpu_online_mask);
+	cpumask_clear_cpu(smp_processor_id(), &cpus_to_pause);
+
+	spin_lock(&cpu_pause_lock);
+
+	WARN_ON_ONCE(!cpumask_empty(&paused_cpus));
+
+	smp_cross_call(&cpus_to_pause, IPI_CPU_PAUSE);
+
+	while (!cpumask_equal(&cpus_to_pause, &paused_cpus)) {
+		cpu_relax();
+		barrier();
+	}
+	/*
+	 * Paired pause_local_cpu() to confirm that all CPUs not only will be
+	 * paused but also can be reliably resumed.
+	 */
+	smp_rmb();
+	WARN_ON_ONCE(cpumask_intersects(&cpus_to_pause, &resumed_cpus));
+
+	spin_unlock(&cpu_pause_lock);
+}
+
+void resume_remote_cpus(void)
+{
+	cpumask_t cpus_to_resume;
+
+	lockdep_assert_cpus_held();
+	lockdep_assert_preemption_disabled();
+
+	cpumask_copy(&cpus_to_resume, cpu_online_mask);
+	cpumask_clear_cpu(smp_processor_id(), &cpus_to_resume);
+
+	spin_lock(&cpu_pause_lock);
+
+	cpumask_setall(&resumed_cpus);
+	/* A typical example for sleep and wake-up functions. */
+	smp_mb();
+	while (cpumask_intersects(&cpus_to_resume, &paused_cpus)) {
+		sev();
+		cpu_relax();
+		barrier();
+	}
+
+	spin_unlock(&cpu_pause_lock);
+}
+
 #ifdef CONFIG_KEXEC_CORE
 static atomic_t waiting_for_crash_ipi = ATOMIC_INIT(0);
 #endif
@@ -911,6 +1006,11 @@ static void do_handle_IPI(int ipinr)
 		local_cpu_stop();
 		break;
 
+	case IPI_CPU_PAUSE:
+		pause_local_cpu();
+		break;
+
+#ifdef CONFIG_KEXEC_CORE
 	case IPI_CPU_CRASH_STOP:
 		if (IS_ENABLED(CONFIG_KEXEC_CORE)) {
 			ipi_cpu_crash_stop(cpu, get_irq_regs());
@@ -918,6 +1018,7 @@ static void do_handle_IPI(int ipinr)
 			unreachable();
 		}
 		break;
+#endif
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
 	case IPI_TIMER:
@@ -939,9 +1040,11 @@ static void do_handle_IPI(int ipinr)
 		nmi_cpu_backtrace(get_irq_regs());
 		break;
 
+#ifdef CONFIG_KGDB
 	case IPI_KGDB_ROUNDUP:
 		kgdb_nmicallback(cpu, get_irq_regs());
 		break;
+#endif
 
 	default:
 		pr_crit("CPU%u: Unknown IPI message 0x%x\n", cpu, ipinr);
@@ -971,9 +1074,14 @@ static bool ipi_should_be_nmi(enum ipi_msg_type ipi)
 
 	switch (ipi) {
 	case IPI_CPU_STOP:
+	case IPI_CPU_PAUSE:
+#ifdef CONFIG_KEXEC_CORE
 	case IPI_CPU_CRASH_STOP:
+#endif
 	case IPI_CPU_BACKTRACE:
+#ifdef CONFIG_KGDB
 	case IPI_KGDB_ROUNDUP:
+#endif
 		return true;
 	default:
 		return false;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 5113753f3ac9..da6f2a7d665e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -46,6 +46,18 @@ struct vmemmap_remap_walk {
 	unsigned long		flags;
 };
 
+#ifndef vmemmap_update_lock
+static void vmemmap_update_lock(void)
+{
+}
+#endif
+
+#ifndef vmemmap_update_unlock
+static void vmemmap_update_unlock(void)
+{
+}
+#endif
+
 #ifndef vmemmap_update_pmd
 static inline void vmemmap_update_pmd(unsigned long addr,
 				      pmd_t *pmdp, pte_t *ptep)
@@ -194,10 +206,12 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 
 	VM_BUG_ON(!PAGE_ALIGNED(start | end));
 
+	vmemmap_update_lock();
 	mmap_read_lock(&init_mm);
 	ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
 				    NULL, walk);
 	mmap_read_unlock(&init_mm);
+	vmemmap_update_unlock();
 	if (ret)
 		return ret;
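
For anyone following along who has not looked at break-before-make
closely, the window that pausing the remote CPUs is meant to cover can
be sketched with a toy user-space model. This is only an illustration
under stated assumptions, not kernel code: struct toy_mmu and the
toy_*() helpers are hypothetical names made up here, and a single
cached value stands in for the TLB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical toy model (not a kernel API): one page-table entry plus
 * one cached copy of it standing in for a TLB entry. 0 means "invalid".
 */
struct toy_mmu {
	uint64_t pte;	/* current page-table entry */
	uint64_t tlb;	/* cached translation, 0 if none */
};

/* Returns true and stores the output address if the access translates. */
static bool toy_translate(struct toy_mmu *mmu, uint64_t *out)
{
	if (!mmu->tlb) {
		if (!mmu->pte)
			return false;	/* translation fault */
		mmu->tlb = mmu->pte;	/* fill the cached entry */
	}
	*out = mmu->tlb;
	return true;
}

/* "Break": invalidate the entry in the table. */
static void toy_break(struct toy_mmu *mmu)
{
	mmu->pte = 0;
}

/* Flush: drop the cached translation. */
static void toy_flush(struct toy_mmu *mmu)
{
	mmu->tlb = 0;
}

/* "Make": install the new entry. */
static void toy_make(struct toy_mmu *mmu, uint64_t new_pte)
{
	mmu->pte = new_pte;
}
```

Walking one entry through toy_break(), toy_flush() and toy_make() shows
that a lookup still hits the stale cached value right after the break,
faults once the flush is done, and only sees the new output address
after the make. A remote CPU touching struct page[] during that middle
step is the case that parking it for the duration of the update avoids.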