From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 59E68C5517A for ; Wed, 11 Nov 2020 14:54:06 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id BC70720791 for ; Wed, 11 Nov 2020 14:54:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="FjlD6IMO" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org BC70720791 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 430106B0096; Wed, 11 Nov 2020 09:54:05 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 3DEE26B0098; Wed, 11 Nov 2020 09:54:05 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2CF2C6B0099; Wed, 11 Nov 2020 09:54:05 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0253.hostedemail.com [216.40.44.253]) by kanga.kvack.org (Postfix) with ESMTP id 00A286B0096 for ; Wed, 11 Nov 2020 09:54:04 -0500 (EST) Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 9D6C9181AC9CB for ; Wed, 11 Nov 2020 14:54:04 +0000 (UTC) X-FDA: 77472432408.22.drug59_4e0012e272fe Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin22.hostedemail.com (Postfix) with ESMTP id 7F5D918038E68 for ; Wed, 11 Nov 2020 14:54:04 +0000 (UTC) X-HE-Tag: drug59_4e0012e272fe X-Filterd-Recvd-Size: 14220 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) by imf21.hostedemail.com (Postfix) with ESMTP for ; Wed, 11 Nov 2020 14:54:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1605106443; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MWnBRq9PHvur9WoMxYhfb4ZElv4UhjwblZCapc3dW1k=; b=FjlD6IMOOtBAy1UuqorakBB0OB6l9kT59qUCZhBpWalZjZXS8o4p7PT6IgYZrIBfdh533E 8p15G7dudve1ljRYdX87f5KlrEpDQbFFUx1Xsql7GlUIAiiZqHBlrULpp8ZbEUZh64wQ3g ib5ATTelRLJG0mqX1zhMxVeVHVI0JYA= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-27-pkCbRx5rMSKc3AjbWQCS4A-1; Wed, 11 Nov 2020 09:53:59 -0500 X-MC-Unique: pkCbRx5rMSKc3AjbWQCS4A-1 Received: from smtp.corp.redhat.com (int-mx08.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 4A5F51099F76; Wed, 11 Nov 2020 14:53:57 +0000 (UTC) Received: from t480s.redhat.com (ovpn-114-151.ams2.redhat.com [10.36.114.151]) by smtp.corp.redhat.com (Postfix) with ESMTP id 711F927BB5; Wed, 11 Nov 2020 14:53:54 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org, David Hildenbrand , Michal Hocko , Michael Ellerman , Benjamin Herrenschmidt , Paul Mackerras , Rashmica Gupta , Andrew Morton , Mike Rapoport , Michal Hocko , Oscar Salvador , Wei Yang Subject: [PATCH v2 8/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations Date: Wed, 11 Nov 2020 15:53:22 +0100 Message-Id: <20201111145322.15793-9-david@redhat.com> In-Reply-To: <20201111145322.15793-1-david@redhat.com> References: <20201111145322.15793-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.84 on 10.5.11.23 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Let's use alloc_contig_pages() for allocating memory and remove the linear mapping manually via arch_remove_linear_mapping(). Mark all pages PG_offline, such that they will definitely not get touched - e.g., when hibernating. When freeing memory, try to revert what we did. The original idea was discussed in: https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.co= m This is similar to CONFIG_DEBUG_PAGEALLOC handling on other architectures, whereby only single pages are unmapped from the linear mapping. Let's mimic what memory hot(un)plug would do with the linear mapping. We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies. Add a TODO that we want to use __GFP_ZERO for clearing once alloc_contig_pages() understands that. Tested with in QEMU/TCG with 10 GiB of main memory: [root@localhost ~]# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrac= e/enable [ 105.903043][ T1080] memtrace: Allocated trace memory on node 0 at 0x= 0000000080000000 [root@localhost ~]# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrac= e/enable [ 145.042493][ T1080] radix-mmu: Mapped 0x0000000080000000-0x00000000c= 0000000 with 64.0 KiB pages [ 145.049019][ T1080] memtrace: Freed trace memory back on node 0 [ 145.333960][ T1080] memtrace: Allocated trace memory on node 0 at 0x= 0000000080000000 [root@localhost ~]# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrac= e/enable [ 213.606916][ T1080] radix-mmu: Mapped 0x0000000080000000-0x00000000c= 0000000 with 64.0 KiB pages [ 213.613855][ T1080] memtrace: Freed trace memory back on node 0 [ 214.185094][ T1080] memtrace: Allocated trace memory on node 0 at 0x= 0000000080000000 [root@localhost ~]# echo 0x100000000 > /sys/kernel/debug/powerpc/memtra= ce/enable [ 234.874872][ T1080] radix-mmu: Mapped 0x0000000080000000-0x000000010= 0000000 with 64.0 KiB pages [ 234.886974][ T1080] memtrace: Freed trace memory back on node 0 [ 234.890153][ T1080] memtrace: Failed to allocate trace memory on nod= e 0 [root@localhost ~]# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrac= e/enable [ 259.490196][ T1080] memtrace: Allocated trace memory on node 0 at 0x= 0000000080000000 I also made sure allocated memory is properly zeroed. Note 1: We currently won't be allocating from ZONE_MOVABLE - because our pages are not movable. However, as we don't run with any memory hot(un)plug mechanism around, we could make an exception to increase the chance of allocations succeeding. Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used along PG_reserved in hibernation code will always return "true" on powerpc, resulting in the pages getting touched. It's too generic - e.g., indicates boot allocations. Note 3: For now, we keep using memory_block_size_bytes() as minimum granularity. Suggested-by: Michal Hocko Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Rashmica Gupta Cc: Andrew Morton Cc: Mike Rapoport Cc: Michal Hocko Cc: Oscar Salvador Cc: Wei Yang Signed-off-by: David Hildenbrand --- arch/powerpc/platforms/powernv/Kconfig | 8 +- arch/powerpc/platforms/powernv/memtrace.c | 163 ++++++++-------------- 2 files changed, 62 insertions(+), 109 deletions(-) diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platfo= rms/powernv/Kconfig index 938803eab0ad..619b093a0657 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -27,11 +27,11 @@ config OPAL_PRD recovery diagnostics on OpenPower machines =20 config PPC_MEMTRACE - bool "Enable removal of RAM from kernel mappings for tracing" - depends on PPC_POWERNV && MEMORY_HOTREMOVE + bool "Enable runtime allocation of RAM for tracing" + depends on PPC_POWERNV && MEMORY_HOTPLUG && CONTIG_ALLOC help - Enabling this option allows for the removal of memory (RAM) - from the kernel mappings to be used for hardware tracing. + Enabling this option allows for runtime allocation of memory (RAM) + for hardware tracing. =20 config PPC_VAS bool "IBM Virtual Accelerator Switchboard (VAS)" diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/pla= tforms/powernv/memtrace.c index 0e42fe2d7b6a..5fc9408bb0b3 100644 --- a/arch/powerpc/platforms/powernv/memtrace.c +++ b/arch/powerpc/platforms/powernv/memtrace.c @@ -51,33 +51,12 @@ static const struct file_operations memtrace_fops =3D= { .open =3D simple_open, }; =20 -static int check_memblock_online(struct memory_block *mem, void *arg) -{ - if (mem->state !=3D MEM_ONLINE) - return -1; - - return 0; -} - -static int change_memblock_state(struct memory_block *mem, void *arg) -{ - unsigned long state =3D (unsigned long)arg; - - mem->state =3D state; - - return 0; -} - static void memtrace_clear_range(unsigned long start_pfn, unsigned long nr_pages) { unsigned long pfn; =20 - /* - * As pages are offline, we cannot trust the memmap anymore. As HIGHMEM - * does not apply, avoid passing around "struct page" and use - * clear_page() instead directly. - */ + /* As HIGHMEM does not apply, use clear_page() directly. */ for (pfn =3D start_pfn; pfn < start_pfn + nr_pages; pfn++) { if (IS_ALIGNED(pfn, PAGES_PER_SECTION)) cond_resched(); @@ -85,72 +64,39 @@ static void memtrace_clear_range(unsigned long start_= pfn, } } =20 -/* called with device_hotplug_lock held */ -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages) -{ - const unsigned long start =3D PFN_PHYS(start_pfn); - const unsigned long size =3D PFN_PHYS(nr_pages); - - if (walk_memory_blocks(start, size, NULL, check_memblock_online)) - return false; - - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE, - change_memblock_state); - - if (offline_pages(start_pfn, nr_pages)) { - walk_memory_blocks(start, size, (void *)MEM_ONLINE, - change_memblock_state); - return false; - } - - walk_memory_blocks(start, size, (void *)MEM_OFFLINE, - change_memblock_state); - - - return true; -} - static u64 memtrace_alloc_node(u32 nid, u64 size) { - u64 start_pfn, end_pfn, nr_pages, pfn; - u64 base_pfn; - u64 bytes =3D memory_block_size_bytes(); + const unsigned long nr_pages =3D PHYS_PFN(size); + unsigned long pfn, start_pfn; + struct page *page; =20 - if (!node_spanned_pages(nid)) + /* + * Trace memory needs to be aligned to the size, which is guaranteed + * by alloc_contig_pages(). + */ + page =3D alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE | + __GFP_NOWARN, nid, NULL); + if (!page) return 0; + start_pfn =3D page_to_pfn(page); =20 - start_pfn =3D node_start_pfn(nid); - end_pfn =3D node_end_pfn(nid); - nr_pages =3D size >> PAGE_SHIFT; - - /* Trace memory needs to be aligned to the size */ - end_pfn =3D round_down(end_pfn - nr_pages, nr_pages); - - lock_device_hotplug(); - for (base_pfn =3D end_pfn; base_pfn > start_pfn; base_pfn -=3D nr_pages= ) { - if (memtrace_offline_pages(nid, base_pfn, nr_pages) =3D=3D true) { - /* - * Clear the range while we still have a linear - * mapping. - */ - memtrace_clear_range(base_pfn, nr_pages); - /* - * Remove memory in memory block size chunks so that - * iomem resources are always split to the same size and - * we never try to remove memory that spans two iomem - * resources. - */ - end_pfn =3D base_pfn + nr_pages; - for (pfn =3D base_pfn; pfn < end_pfn; pfn +=3D bytes>> PAGE_SHIFT) { - __remove_memory(nid, pfn << PAGE_SHIFT, bytes); - } - unlock_device_hotplug(); - return base_pfn << PAGE_SHIFT; - } - } - unlock_device_hotplug(); + /* + * Clear the range while we still have a linear mapping. + * + * TODO: use __GFP_ZERO with alloc_contig_pages() once supported. + */ + memtrace_clear_range(start_pfn, nr_pages); =20 - return 0; + /* + * Set pages PageOffline(), to indicate that nobody (e.g., hibernation, + * dumping, ...) should be touching these pages. + */ + for (pfn =3D start_pfn; pfn < start_pfn + nr_pages; pfn++) + __SetPageOffline(pfn_to_page(pfn)); + + arch_remove_linear_mapping(PFN_PHYS(start_pfn), size); + + return PFN_PHYS(start_pfn); } =20 static int memtrace_init_regions_runtime(u64 size) @@ -220,16 +166,30 @@ static int memtrace_init_debugfs(void) return ret; } =20 -static int online_mem_block(struct memory_block *mem, void *arg) +static int memtrace_free(int nid, u64 start, u64 size) { - return device_online(&mem->dev); + struct mhp_params params =3D { .pgprot =3D PAGE_KERNEL }; + const unsigned long nr_pages =3D PHYS_PFN(size); + const unsigned long start_pfn =3D PHYS_PFN(start); + unsigned long pfn; + int ret; + + ret =3D arch_create_linear_mapping(nid, start, size, ¶ms); + if (ret) + return ret; + + for (pfn =3D start_pfn; pfn < start_pfn + nr_pages; pfn++) + __ClearPageOffline(pfn_to_page(pfn)); + + free_contig_range(start_pfn, nr_pages); + return 0; } =20 /* - * Iterate through the chunks of memory we have removed from the kernel - * and attempt to add them back to the kernel. + * Iterate through the chunks of memory we allocated and attempt to expo= se + * them back to the kernel. */ -static int memtrace_online(void) +static int memtrace_free_regions(void) { int i, ret =3D 0; struct memtrace_entry *ent; @@ -237,7 +197,7 @@ static int memtrace_online(void) for (i =3D memtrace_array_nr - 1; i >=3D 0; i--) { ent =3D &memtrace_array[i]; =20 - /* We have onlined this chunk previously */ + /* We have freed this chunk previously */ if (ent->nid =3D=3D NUMA_NO_NODE) continue; =20 @@ -247,30 +207,25 @@ static int memtrace_online(void) ent->mem =3D 0; } =20 - if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) { - pr_err("Failed to add trace memory to node %d\n", + if (memtrace_free(ent->nid, ent->start, ent->size)) { + pr_err("Failed to free trace memory on node %d\n", ent->nid); ret +=3D 1; continue; } =20 - lock_device_hotplug(); - walk_memory_blocks(ent->start, ent->size, NULL, - online_mem_block); - unlock_device_hotplug(); - /* - * Memory was added successfully so clean up references to it - * so on reentry we can tell that this chunk was added. + * Memory was freed successfully so clean up references to it + * so on reentry we can tell that this chunk was freed. */ debugfs_remove_recursive(ent->dir); - pr_info("Added trace memory back to node %d\n", ent->nid); + pr_info("Freed trace memory back on node %d\n", ent->nid); ent->size =3D ent->start =3D ent->nid =3D NUMA_NO_NODE; } if (ret) return ret; =20 - /* If all chunks of memory were added successfully, reset globals */ + /* If all chunks of memory were freed successfully, reset globals */ kfree(memtrace_array); memtrace_array =3D NULL; memtrace_size =3D 0; @@ -295,18 +250,16 @@ static int memtrace_enable_set(void *data, u64 val) =20 mutex_lock(&memtrace_mutex); =20 - /* Re-add/online previously removed/offlined memory */ - if (memtrace_size) { - if (memtrace_online()) - goto out_unlock; - } + /* Free all previously allocated memory. */ + if (memtrace_size && memtrace_free_regions()) + goto out_unlock; =20 if (!val) { rc =3D 0; goto out_unlock; } =20 - /* Offline and remove memory */ + /* Allocate memory. */ if (memtrace_init_regions_runtime(val)) goto out_unlock; =20 --=20 2.26.2