* [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
@ 2020-10-29 16:27 David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 1/4] powerpc/mm: factor out creating/removing linear mapping David Hildenbrand
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, linuxppc-dev, David Hildenbrand, Andrew Morton,
Benjamin Herrenschmidt, Michael Ellerman, Michal Hocko,
Michal Hocko, Mike Rapoport, Oscar Salvador, Paul Mackerras,
Rashmica Gupta, Wei Yang
powernv/memtrace is the only in-kernel user that rips out random memory
it never added (doesn't own) in order to allocate memory without a
linear mapping. Let's stop abusing memory hot(un)plug infrastructure for
that - use alloc_contig_pages() for allocating memory and remove the
linear mapping manually.
The original idea was discussed in:
https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
I only tested allocations briefly via QEMU TCG - see patch #4 for more
details.
David Hildenbrand (4):
powerpc/mm: factor out creating/removing linear mapping
powerpc/mm: print warning in arch_remove_linear_mapping()
powerpc/mm: remove linear mapping if __add_pages() fails in
arch_add_memory()
powernv/memtrace: don't abuse memory hot(un)plug infrastructure for
memory allocations
arch/powerpc/mm/mem.c | 48 +++++---
arch/powerpc/platforms/powernv/Kconfig | 8 +-
arch/powerpc/platforms/powernv/memtrace.c | 134 ++++++++--------------
include/linux/memory_hotplug.h | 3 +
4 files changed, 86 insertions(+), 107 deletions(-)
--
2.26.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v1 1/4] powerpc/mm: factor out creating/removing linear mapping
2020-10-29 16:27 [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
@ 2020-10-29 16:27 ` David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping() David Hildenbrand
` (3 subsequent siblings)
4 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, linuxppc-dev, David Hildenbrand, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Michal Hocko, Oscar Salvador,
Wei Yang
We want to stop abusing memory hotplug infrastructure in memtrace code
to perform allocations and remove the linear mapping. Instead we will use
alloc_contig_pages() and remove the identity mapping manually.
Let's factor out creating/removing the linear mapping into
arch_create_linear_mapping() / arch_remove_linear_mapping() - so in the
future, we might be able to have whole arch_add_memory() /
arch_remove_memory() be implemented in common code.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/powerpc/mm/mem.c | 41 +++++++++++++++++++++++-----------
include/linux/memory_hotplug.h | 3 +++
2 files changed, 31 insertions(+), 13 deletions(-)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 01ec2a252f09..8a86d81f8df0 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -120,34 +120,26 @@ static void flush_dcache_range_chunked(unsigned long start, unsigned long stop,
}
}
-int __ref arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_params *params)
+int __ref arch_create_linear_mapping(int nid, u64 start, u64 size,
+ struct mhp_params *params)
{
- unsigned long start_pfn = start >> PAGE_SHIFT;
- unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;
start = (unsigned long)__va(start);
rc = create_section_mapping(start, start + size, nid,
params->pgprot);
if (rc) {
- pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
+ pr_warn("Unable to create linear mapping for 0x%llx..0x%llx: %d\n",
start, start + size, rc);
return -EFAULT;
}
-
- return __add_pages(nid, start_pfn, nr_pages, params);
+ return 0;
}
-void __ref arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void __ref arch_remove_linear_mapping(u64 start, u64 size)
{
- unsigned long start_pfn = start >> PAGE_SHIFT;
- unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
- __remove_pages(start_pfn, nr_pages, altmap);
-
/* Remove htab bolted mappings for this section of memory */
start = (unsigned long)__va(start);
flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
@@ -160,6 +152,29 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
*/
vm_unmap_aliases();
}
+
+int __ref arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_params *params)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ int rc;
+
+ rc = arch_create_linear_mapping(nid, start, size, params);
+ if (rc)
+ return rc;
+ return __add_pages(nid, start_pfn, nr_pages, params);
+}
+
+void __ref arch_remove_memory(int nid, u64 start, u64 size,
+ struct vmem_altmap *altmap)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+
+ __remove_pages(start_pfn, nr_pages, altmap);
+ arch_remove_linear_mapping(start, size);
+}
#endif
#ifndef CONFIG_NEED_MULTIPLE_NODES
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d65c6fdc5cfc..00b9e9bd3850 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -375,6 +375,9 @@ extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
unsigned long nr_pages);
+extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
+ struct mhp_params *params);
+void arch_remove_linear_mapping(u64 start, u64 size);
#endif /* CONFIG_MEMORY_HOTPLUG */
#endif /* __LINUX_MEMORY_HOTPLUG_H */
--
2.26.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping()
2020-10-29 16:27 [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 1/4] powerpc/mm: factor out creating/removing linear mapping David Hildenbrand
@ 2020-10-29 16:27 ` David Hildenbrand
2020-11-04 9:42 ` osalvador
2020-10-29 16:27 ` [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory() David Hildenbrand
` (2 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, linuxppc-dev, David Hildenbrand, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Michal Hocko, Oscar Salvador,
Wei Yang
Let's print a warning similar to in arch_add_linear_mapping() instead of
WARN_ON_ONCE() and eventually crashing the kernel.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/powerpc/mm/mem.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 8a86d81f8df0..685028451dd2 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -145,7 +145,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
ret = remove_section_mapping(start, start + size);
- WARN_ON_ONCE(ret);
+ if (ret)
+ pr_warn("Unable to remove linear mapping for 0x%llx..0x%llx: %d\n",
+ start, start + size, ret);
/* Ensure all vmalloc mappings are flushed in case they also
* hit that section of memory
--
2.26.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
2020-10-29 16:27 [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 1/4] powerpc/mm: factor out creating/removing linear mapping David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping() David Hildenbrand
@ 2020-10-29 16:27 ` David Hildenbrand
2020-11-04 9:50 ` osalvador
2020-11-04 12:11 ` Oscar Salvador
2020-10-29 16:27 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-11-25 11:57 ` [PATCH v1 0/4] " Michael Ellerman
4 siblings, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, linuxppc-dev, David Hildenbrand, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Michal Hocko, Oscar Salvador,
Wei Yang
Let's revert what we did in case seomthing goes wrong and we return an
error.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/powerpc/mm/mem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 685028451dd2..69b3e8072261 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -165,7 +165,10 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
rc = arch_create_linear_mapping(nid, start, size, params);
if (rc)
return rc;
- return __add_pages(nid, start_pfn, nr_pages, params);
+ rc = __add_pages(nid, start_pfn, nr_pages, params);
+ if (rc)
+ arch_remove_linear_mapping(start, size);
+ return rc;
}
void __ref arch_remove_memory(int nid, u64 start, u64 size,
--
2.26.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-10-29 16:27 [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
` (2 preceding siblings ...)
2020-10-29 16:27 ` [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory() David Hildenbrand
@ 2020-10-29 16:27 ` David Hildenbrand
2020-11-03 9:23 ` Michal Hocko
2020-11-25 11:57 ` [PATCH v1 0/4] " Michael Ellerman
4 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-10-29 16:27 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, linuxppc-dev, David Hildenbrand, Michal Hocko,
Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
Oscar Salvador, Wei Yang
Let's use alloc_contig_pages() for allocating memory and remove the
linear mapping manually via arch_remove_linear_mapping(). Mark all pages
PG_offline, such that they will definitely not get touched - e.g.,
when hibernating. When freeing memory, try to revert what we did.
The original idea was discussed in:
https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
architectures, whereby only single pages are unmapped from the linear
mapping. Let's mimic what memory hot(un)plug would do with the linear
mapping.
We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
Simple test under QEMU TCG (10GB RAM, single NUMA node):
sh-5.0# mount -t debugfs none /sys/kernel/debug/
sh-5.0# cat /sys/devices/system/memory/block_size_bytes
40000000
sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
[ 75.430549][ T356] memtrace: Freed trace memory back on node 0
[ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
[ 80.430493][ T356] memtrace: Freed trace memory back on node 0
[ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
pages are not movable. However, as we don't run with any memory
hot(un)plug mechanism around, we could make an exception to
increase the chance of allocations succeeding.
Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
along PG_reserved in hibernation code will always return "true"
on powerpc, resulting in the pages getting touched. It's too
generic - e.g., indicates boot allocations.
Note 3: For now, we keep using memory_block_size_bytes() as minimum
granularity. I'm not able to come up with a better guess (most
probably, doing it on a section basis could be possible).
Suggested-by: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rashmica Gupta <rashmica.g@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/powerpc/platforms/powernv/Kconfig | 8 +-
arch/powerpc/platforms/powernv/memtrace.c | 134 ++++++++--------------
2 files changed, 49 insertions(+), 93 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 938803eab0ad..619b093a0657 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -27,11 +27,11 @@ config OPAL_PRD
recovery diagnostics on OpenPower machines
config PPC_MEMTRACE
- bool "Enable removal of RAM from kernel mappings for tracing"
- depends on PPC_POWERNV && MEMORY_HOTREMOVE
+ bool "Enable runtime allocation of RAM for tracing"
+ depends on PPC_POWERNV && MEMORY_HOTPLUG && CONTIG_ALLOC
help
- Enabling this option allows for the removal of memory (RAM)
- from the kernel mappings to be used for hardware tracing.
+ Enabling this option allows for runtime allocation of memory (RAM)
+ for hardware tracing.
config PPC_VAS
bool "IBM Virtual Accelerator Switchboard (VAS)"
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 6828108486f8..8f47797a78c2 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -50,83 +50,50 @@ static const struct file_operations memtrace_fops = {
.open = simple_open,
};
-static int check_memblock_online(struct memory_block *mem, void *arg)
-{
- if (mem->state != MEM_ONLINE)
- return -1;
-
- return 0;
-}
-
-static int change_memblock_state(struct memory_block *mem, void *arg)
-{
- unsigned long state = (unsigned long)arg;
-
- mem->state = state;
-
- return 0;
-}
-
-/* called with device_hotplug_lock held */
-static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
+static u64 memtrace_alloc_node(u32 nid, u64 size)
{
- const unsigned long start = PFN_PHYS(start_pfn);
- const unsigned long size = PFN_PHYS(nr_pages);
+ const unsigned long nr_pages = PHYS_PFN(size);
+ unsigned long pfn, start_pfn;
+ struct page *page;
- if (walk_memory_blocks(start, size, NULL, check_memblock_online))
- return false;
-
- walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
- change_memblock_state);
-
- if (offline_pages(start_pfn, nr_pages)) {
- walk_memory_blocks(start, size, (void *)MEM_ONLINE,
- change_memblock_state);
- return false;
- }
+ /*
+ * Trace memory needs to be aligned to the size, which is guaranteed
+ * by alloc_contig_pages().
+ */
+ page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
+ nid, NULL);
+ if (!page)
+ return 0;
- walk_memory_blocks(start, size, (void *)MEM_OFFLINE,
- change_memblock_state);
+ /*
+ * Set pages PageOffline(), to indicate that nobody (e.g., hibernation,
+ * dumping, ...) should be touching these pages.
+ */
+ start_pfn = page_to_pfn(page);
+ for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+ __SetPageOffline(pfn_to_page(pfn));
+ arch_remove_linear_mapping(PFN_PHYS(start_pfn), size);
- return true;
+ return PFN_PHYS(start_pfn);
}
-static u64 memtrace_alloc_node(u32 nid, u64 size)
+static int memtrace_free(int nid, u64 start, u64 size)
{
- u64 start_pfn, end_pfn, nr_pages, pfn;
- u64 base_pfn;
- u64 bytes = memory_block_size_bytes();
+ struct mhp_params params = { .pgprot = PAGE_KERNEL };
+ const unsigned long nr_pages = PHYS_PFN(size);
+ const unsigned long start_pfn = PHYS_PFN(start);
+ unsigned long pfn;
+ int ret;
- if (!node_spanned_pages(nid))
- return 0;
+ ret = arch_create_linear_mapping(nid, start, size, ¶ms);
+ if (ret)
+ return ret;
- start_pfn = node_start_pfn(nid);
- end_pfn = node_end_pfn(nid);
- nr_pages = size >> PAGE_SHIFT;
-
- /* Trace memory needs to be aligned to the size */
- end_pfn = round_down(end_pfn - nr_pages, nr_pages);
-
- lock_device_hotplug();
- for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
- if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) {
- /*
- * Remove memory in memory block size chunks so that
- * iomem resources are always split to the same size and
- * we never try to remove memory that spans two iomem
- * resources.
- */
- end_pfn = base_pfn + nr_pages;
- for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> PAGE_SHIFT) {
- __remove_memory(nid, pfn << PAGE_SHIFT, bytes);
- }
- unlock_device_hotplug();
- return base_pfn << PAGE_SHIFT;
- }
- }
- unlock_device_hotplug();
+ for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+ __ClearPageOffline(pfn_to_page(pfn));
+ free_contig_range(start_pfn, nr_pages);
return 0;
}
@@ -197,16 +164,11 @@ static int memtrace_init_debugfs(void)
return ret;
}
-static int online_mem_block(struct memory_block *mem, void *arg)
-{
- return device_online(&mem->dev);
-}
-
/*
- * Iterate through the chunks of memory we have removed from the kernel
- * and attempt to add them back to the kernel.
+ * Iterate through the chunks of memory we allocated and attempt to expose
+ * them back to the kernel.
*/
-static int memtrace_online(void)
+static int memtrace_free_regions(void)
{
int i, ret = 0;
struct memtrace_entry *ent;
@@ -214,7 +176,7 @@ static int memtrace_online(void)
for (i = memtrace_array_nr - 1; i >= 0; i--) {
ent = &memtrace_array[i];
- /* We have onlined this chunk previously */
+ /* We have freed this chunk previously */
if (ent->nid == NUMA_NO_NODE)
continue;
@@ -224,30 +186,25 @@ static int memtrace_online(void)
ent->mem = 0;
}
- if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
- pr_err("Failed to add trace memory to node %d\n",
+ if (memtrace_free(ent->nid, ent->start, ent->size)) {
+ pr_err("Failed to free trace memory on node %d\n",
ent->nid);
ret += 1;
continue;
}
- lock_device_hotplug();
- walk_memory_blocks(ent->start, ent->size, NULL,
- online_mem_block);
- unlock_device_hotplug();
-
/*
- * Memory was added successfully so clean up references to it
- * so on reentry we can tell that this chunk was added.
+ * Memory was freed successfully so clean up references to it
+ * so on reentry we can tell that this chunk was freed.
*/
debugfs_remove_recursive(ent->dir);
- pr_info("Added trace memory back to node %d\n", ent->nid);
+ pr_info("Freed trace memory back on node %d\n", ent->nid);
ent->size = ent->start = ent->nid = NUMA_NO_NODE;
}
if (ret)
return ret;
- /* If all chunks of memory were added successfully, reset globals */
+ /* If all chunks of memory were freed successfully, reset globals */
kfree(memtrace_array);
memtrace_array = NULL;
memtrace_size = 0;
@@ -269,16 +226,15 @@ static int memtrace_enable_set(void *data, u64 val)
return -EINVAL;
}
- /* Re-add/online previously removed/offlined memory */
+ /* Free all previously allocated memory. */
if (memtrace_size) {
- if (memtrace_online())
+ if (memtrace_free_regions())
return -EAGAIN;
}
if (!val)
return 0;
- /* Offline and remove memory */
if (memtrace_init_regions_runtime(val))
return -EINVAL;
--
2.26.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-10-29 16:27 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
@ 2020-11-03 9:23 ` Michal Hocko
2020-11-03 9:29 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2020-11-03 9:23 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang
On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
> Let's use alloc_contig_pages() for allocating memory and remove the
> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
> PG_offline, such that they will definitely not get touched - e.g.,
> when hibernating. When freeing memory, try to revert what we did.
>
> The original idea was discussed in:
> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>
> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
> architectures, whereby only single pages are unmapped from the linear
> mapping. Let's mimic what memory hot(un)plug would do with the linear
> mapping.
>
> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>
> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>
> sh-5.0# mount -t debugfs none /sys/kernel/debug/
> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
> 40000000
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>
> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
> pages are not movable. However, as we don't run with any memory
> hot(un)plug mechanism around, we could make an exception to
> increase the chance of allocations succeeding.
>
> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
> along PG_reserved in hibernation code will always return "true"
> on powerpc, resulting in the pages getting touched. It's too
> generic - e.g., indicates boot allocations.
>
> Note 3: For now, we keep using memory_block_size_bytes() as minimum
> granularity. I'm not able to come up with a better guess (most
> probably, doing it on a section basis could be possible).
>
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Rashmica Gupta <rashmica.g@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Thanks! This looks like a move into the right direction. I cannot really
judge implementation details because I am not familiar with the code.
I have only one tiny concern:
[...]
> -/* called with device_hotplug_lock held */
> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
> +static u64 memtrace_alloc_node(u32 nid, u64 size)
> {
> - const unsigned long start = PFN_PHYS(start_pfn);
> - const unsigned long size = PFN_PHYS(nr_pages);
> + const unsigned long nr_pages = PHYS_PFN(size);
> + unsigned long pfn, start_pfn;
> + struct page *page;
>
> - if (walk_memory_blocks(start, size, NULL, check_memblock_online))
> - return false;
> -
> - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
> - change_memblock_state);
> -
> - if (offline_pages(start_pfn, nr_pages)) {
> - walk_memory_blocks(start, size, (void *)MEM_ONLINE,
> - change_memblock_state);
> - return false;
> - }
> + /*
> + * Trace memory needs to be aligned to the size, which is guaranteed
> + * by alloc_contig_pages().
> + */
> + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
> + nid, NULL);
__GFP_THISNODE without other modifiers looks suspicious. I suspect you
want to enfore node locality and exclude movable zones by this. While
this works it is an antipattern. I would rather use GFP_KERNEL |
__GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.
If for no other reasons we want to be able to work inside a normal
compaction context (comparing to effectively GFP_NOIO which the above
implies). Also this looks like a sleepable context.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-11-03 9:23 ` Michal Hocko
@ 2020-11-03 9:29 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2020-11-03 9:29 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Oscar Salvador, Wei Yang
On 03.11.20 10:23, Michal Hocko wrote:
> On Thu 29-10-20 17:27:18, David Hildenbrand wrote:
>> Let's use alloc_contig_pages() for allocating memory and remove the
>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>> PG_offline, such that they will definitely not get touched - e.g.,
>> when hibernating. When freeing memory, try to revert what we did.
>>
>> The original idea was discussed in:
>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>>
>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>> architectures, whereby only single pages are unmapped from the linear
>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>> mapping.
>>
>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>>
>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>>
>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>> 40000000
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
>> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
>> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>
>> Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
>> pages are not movable. However, as we don't run with any memory
>> hot(un)plug mechanism around, we could make an exception to
>> increase the chance of allocations succeeding.
>>
>> Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
>> along PG_reserved in hibernation code will always return "true"
>> on powerpc, resulting in the pages getting touched. It's too
>> generic - e.g., indicates boot allocations.
>>
>> Note 3: For now, we keep using memory_block_size_bytes() as minimum
>> granularity. I'm not able to come up with a better guess (most
>> probably, doing it on a section basis could be possible).
>>
>> Suggested-by: Michal Hocko <mhocko@kernel.org>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Paul Mackerras <paulus@samba.org>
>> Cc: Rashmica Gupta <rashmica.g@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Thanks! This looks like a move into the right direction. I cannot really
> judge implementation details because I am not familiar with the code.
> I have only one tiny concern:
> [...]
>> -/* called with device_hotplug_lock held */
>> -static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
>> +static u64 memtrace_alloc_node(u32 nid, u64 size)
>> {
>> - const unsigned long start = PFN_PHYS(start_pfn);
>> - const unsigned long size = PFN_PHYS(nr_pages);
>> + const unsigned long nr_pages = PHYS_PFN(size);
>> + unsigned long pfn, start_pfn;
>> + struct page *page;
>>
>> - if (walk_memory_blocks(start, size, NULL, check_memblock_online))
>> - return false;
>> -
>> - walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
>> - change_memblock_state);
>> -
>> - if (offline_pages(start_pfn, nr_pages)) {
>> - walk_memory_blocks(start, size, (void *)MEM_ONLINE,
>> - change_memblock_state);
>> - return false;
>> - }
>> + /*
>> + * Trace memory needs to be aligned to the size, which is guaranteed
>> + * by alloc_contig_pages().
>> + */
>> + page = alloc_contig_pages(nr_pages, __GFP_THISNODE | __GFP_NOWARN,
>> + nid, NULL);
>
> __GFP_THISNODE without other modifiers looks suspicious. I suspect you
> want to enfore node locality and exclude movable zones by this. While
> this works it is an antipattern. I would rather use GFP_KERNEL |
> __GFP_THISNODE | __GFP_NOWARN to be more in line with other gfp usage.
Agreed GFP_KERNEL should be the right thing to do here.
>
> If for no other reasons we want to be able to work inside a normal
> compaction context (comparing to effectively GFP_NOIO which the above
> implies). Also this looks like a sleepable context.
>
Yes it is. Thanks!
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping()
2020-10-29 16:27 ` [PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping() David Hildenbrand
@ 2020-11-04 9:42 ` osalvador
2020-11-11 12:10 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: osalvador @ 2020-11-04 9:42 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Michal Hocko, Wei Yang
On Thu, Oct 29, 2020 at 05:27:16PM +0100, David Hildenbrand wrote:
> Let's print a warning similar to in arch_add_linear_mapping() instead of
> WARN_ON_ONCE() and eventually crashing the kernel.
>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Rashmica Gupta <rashmica.g@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> arch/powerpc/mm/mem.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 8a86d81f8df0..685028451dd2 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -145,7 +145,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
> flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
>
> ret = remove_section_mapping(start, start + size);
> - WARN_ON_ONCE(ret);
> + if (ret)
> + pr_warn("Unable to remove linear mapping for 0x%llx..0x%llx: %d\n",
> + start, start + size, ret);
I guess the fear is to panic on systems that do have panic_on_warn (not
sure how many productions systems have this out there).
But anyway, being coherent with that, I think you should remove the WARN_ON
in hash__remove_section_mapping as well.
Besides that:
Reviewed-by: Oscar Salvador <osalvador@suse.
Not sure if the functions below that also have any sort of WARN_ON.
native_hpte_removebolted has a VM_WARN_ON, but that is on
CONFIG_DEBUG_VM so does not really matter.
--
Oscar Salvador
SUSE L3
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
2020-10-29 16:27 ` [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory() David Hildenbrand
@ 2020-11-04 9:50 ` osalvador
2020-11-04 12:06 ` Mike Rapoport
2020-11-04 12:11 ` Oscar Salvador
1 sibling, 1 reply; 17+ messages in thread
From: osalvador @ 2020-11-04 9:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Michal Hocko, Wei Yang
On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote:
> Let's revert what we did in case seomthing goes wrong and we return an
> error.
Dumb question, but should not we do this for other arches as well?
--
Oscar Salvador
SUSE L3
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
2020-11-04 9:50 ` osalvador
@ 2020-11-04 12:06 ` Mike Rapoport
2020-11-04 12:11 ` Oscar Salvador
0 siblings, 1 reply; 17+ messages in thread
From: Mike Rapoport @ 2020-11-04 12:06 UTC (permalink / raw)
To: osalvador
Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Michal Hocko, Wei Yang
On Wed, Nov 04, 2020 at 10:50:07AM +0100, osalvador wrote:
> On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote:
> > Let's revert what we did in case seomthing goes wrong and we return an
> > error.
>
> Dumb question, but should not we do this for other arches as well?
It seems arm64 and s390 already do that.
x86 could have its arch_add_memory() improved though :)
> --
> Oscar Salvador
> SUSE L3
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
2020-11-04 12:06 ` Mike Rapoport
@ 2020-11-04 12:11 ` Oscar Salvador
2020-11-11 12:07 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Oscar Salvador @ 2020-11-04 12:11 UTC (permalink / raw)
To: Mike Rapoport
Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Michal Hocko, Wei Yang
On Wed, Nov 04, 2020 at 02:06:51PM +0200, Mike Rapoport wrote:
> On Wed, Nov 04, 2020 at 10:50:07AM +0100, osalvador wrote:
> > On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote:
> > > Let's revert what we did in case seomthing goes wrong and we return an
> > > error.
> >
> > Dumb question, but should not we do this for other arches as well?
>
> It seems arm64 and s390 already do that.
> x86 could have its arch_add_memory() improved though :)
Right, I only stared at x86 and see it did not have it.
I guess we want to have all arches aligned with this.
Thanks
--
Oscar Salvador
SUSE L3
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
2020-10-29 16:27 ` [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory() David Hildenbrand
2020-11-04 9:50 ` osalvador
@ 2020-11-04 12:11 ` Oscar Salvador
1 sibling, 0 replies; 17+ messages in thread
From: Oscar Salvador @ 2020-11-04 12:11 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Michal Hocko, Wei Yang
On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote:
> Let's revert what we did in case seomthing goes wrong and we return an
> error.
>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Rashmica Gupta <rashmica.g@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
> ---
> arch/powerpc/mm/mem.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 685028451dd2..69b3e8072261 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -165,7 +165,10 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> rc = arch_create_linear_mapping(nid, start, size, params);
> if (rc)
> return rc;
> - return __add_pages(nid, start_pfn, nr_pages, params);
> + rc = __add_pages(nid, start_pfn, nr_pages, params);
> + if (rc)
> + arch_remove_linear_mapping(start, size);
> + return rc;
> }
>
> void __ref arch_remove_memory(int nid, u64 start, u64 size,
> --
> 2.26.2
>
--
Oscar Salvador
SUSE L3
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
[not found] <87o8kcttjp.fsf@mpe.ellerman.id.au>
@ 2020-11-05 8:29 ` David Hildenbrand
2020-11-05 10:47 ` Michael Ellerman
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2020-11-05 8:29 UTC (permalink / raw)
To: Michael Ellerman
Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
Oscar Salvador, Wei Yang
> Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>:
>
> David Hildenbrand <david@redhat.com> writes:
>> Let's use alloc_contig_pages() for allocating memory and remove the
>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>> PG_offline, such that they will definitely not get touched - e.g.,
>> when hibernating. When freeing memory, try to revert what we did.
>> The original idea was discussed in:
>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>> architectures, whereby only single pages are unmapped from the linear
>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>> mapping.
>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>> 40000000
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
>> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
>> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>
> I gave this a quick spin on a real machine, seems to work OK.
>
> I don't have the actual memtrace tools setup to do an actual trace, will
> try and get someone to test that also.
>
> One observation is that previously the memory was zeroed when enabling
> the memtrace, whereas now it's not.
>
> eg, before:
>
> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 10000000
>
> whereas after:
>
> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00000080 e0 fd 43 00 00 00 00 00 e0 fd 43 00 00 00 00 00 |..C.......C.....|
> 00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 00000830 98 bf 39 00 00 00 00 00 98 bf 39 00 00 00 00 00 |..9.......9.....|
> 00000840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> *
> 000008a0 b0 c8 47 00 00 00 00 00 b0 c8 47 00 00 00 00 00 |..G.......G.....|
> 000008b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> ...
> 0fffff70 78 53 49 7d 00 00 29 2e 88 00 92 41 01 00 49 39 |xSI}..)....A..I9|
> 0fffff80 b4 07 4a 7d 28 f8 00 7d 00 48 08 7c 0c 00 c2 40 |..J}(..}.H.|...@|
> 0fffff90 2d f9 40 7d f0 ff c2 40 b4 07 0a 7d 00 48 8a 7f |-.@}...@...}.H..|
> 0fffffa0 70 fe 9e 41 cc ff ff 4b 00 00 00 60 00 00 00 60 |p..A...K...`...`|
> 0fffffb0 01 00 00 48 00 00 00 60 00 00 a3 2f 0c fd 9e 40 |...H...`.../...@|
> 0fffffc0 00 00 a2 3c 00 00 a5 e8 00 00 62 3c 00 00 63 e8 |...<......b<..c.|
> 0fffffd0 01 00 20 39 83 02 80 38 00 00 3c 99 01 00 00 48 |.. 9...8..<....H|
> 0fffffe0 00 00 00 60 e4 fc ff 4b 00 00 80 38 78 fb e3 7f |...`...K...8x...|
> 0ffffff0 01 00 00 48 00 00 00 60 2c fe ff 4b 00 00 00 60 |...H...`,..K...`|
> 10000000
>
>
> That's a nice way for root to read kernel memory, so we should probably
> add a __GFP_ZERO or memset in there somewhere.
Thanks for catching that! Will have a look on Monday if alloc_contig_pages() already properly handled __GFP_ZERO so we can use it, otherwise I‘ll fix that.
I don‘t recall that memory hotunplug does any zeroing - that‘s why I didn‘t add any explicit zeroing. Could be you were just lucky in your experiment - I assume we‘ll leak kernel memory already.
Thank!
> cheers
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-11-05 8:29 ` [PATCH v1 4/4] " David Hildenbrand
@ 2020-11-05 10:47 ` Michael Ellerman
0 siblings, 0 replies; 17+ messages in thread
From: Michael Ellerman @ 2020-11-05 10:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: David Hildenbrand, linux-kernel, linux-mm, linuxppc-dev,
Michal Hocko, Benjamin Herrenschmidt, Paul Mackerras,
Rashmica Gupta, Andrew Morton, Mike Rapoport, Michal Hocko,
Oscar Salvador, Wei Yang
David Hildenbrand <david@redhat.com> writes:
>> Am 05.11.2020 um 03:53 schrieb Michael Ellerman <mpe@ellerman.id.au>:
>>
>> David Hildenbrand <david@redhat.com> writes:
>>> Let's use alloc_contig_pages() for allocating memory and remove the
>>> linear mapping manually via arch_remove_linear_mapping(). Mark all pages
>>> PG_offline, such that they will definitely not get touched - e.g.,
>>> when hibernating. When freeing memory, try to revert what we did.
>>> The original idea was discussed in:
>>> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>>> This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
>>> architectures, whereby only single pages are unmapped from the linear
>>> mapping. Let's mimic what memory hot(un)plug would do with the linear
>>> mapping.
>>> We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies.
>>> Simple test under QEMU TCG (10GB RAM, single NUMA node):
>>> sh-5.0# mount -t debugfs none /sys/kernel/debug/
>>> sh-5.0# cat /sys/devices/system/memory/block_size_bytes
>>> 40000000
>>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 71.052836][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>> sh-5.0# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 75.424302][ T356] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
>>> [ 75.430549][ T356] memtrace: Freed trace memory back on node 0
>>> [ 75.604520][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>> sh-5.0# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 80.418835][ T356] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
>>> [ 80.430493][ T356] memtrace: Freed trace memory back on node 0
>>> [ 80.433882][ T356] memtrace: Failed to allocate trace memory on node 0
>>> sh-5.0# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
>>> [ 91.920158][ T356] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
>>
>> I gave this a quick spin on a real machine, seems to work OK.
>>
>> I don't have the actual memtrace tools setup to do an actual trace, will
>> try and get someone to test that also.
>>
>> One observation is that previously the memory was zeroed when enabling
>> the memtrace, whereas now it's not.
>>
>> eg, before:
>>
>> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 10000000
>>
>> whereas after:
>>
>> # hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace
>> 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00000080 e0 fd 43 00 00 00 00 00 e0 fd 43 00 00 00 00 00 |..C.......C.....|
>> 00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 00000830 98 bf 39 00 00 00 00 00 98 bf 39 00 00 00 00 00 |..9.......9.....|
>> 00000840 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> *
>> 000008a0 b0 c8 47 00 00 00 00 00 b0 c8 47 00 00 00 00 00 |..G.......G.....|
>> 000008b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
>> ...
>> 0fffff70 78 53 49 7d 00 00 29 2e 88 00 92 41 01 00 49 39 |xSI}..)....A..I9|
>> 0fffff80 b4 07 4a 7d 28 f8 00 7d 00 48 08 7c 0c 00 c2 40 |..J}(..}.H.|...@|
>> 0fffff90 2d f9 40 7d f0 ff c2 40 b4 07 0a 7d 00 48 8a 7f |-.@}...@...}.H..|
>> 0fffffa0 70 fe 9e 41 cc ff ff 4b 00 00 00 60 00 00 00 60 |p..A...K...`...`|
>> 0fffffb0 01 00 00 48 00 00 00 60 00 00 a3 2f 0c fd 9e 40 |...H...`.../...@|
>> 0fffffc0 00 00 a2 3c 00 00 a5 e8 00 00 62 3c 00 00 63 e8 |...<......b<..c.|
>> 0fffffd0 01 00 20 39 83 02 80 38 00 00 3c 99 01 00 00 48 |.. 9...8..<....H|
>> 0fffffe0 00 00 00 60 e4 fc ff 4b 00 00 80 38 78 fb e3 7f |...`...K...8x...|
>> 0ffffff0 01 00 00 48 00 00 00 60 2c fe ff 4b 00 00 00 60 |...H...`,..K...`|
>> 10000000
>>
>>
>> That's a nice way for root to read kernel memory, so we should probably
>> add a __GFP_ZERO or memset in there somewhere.
>
> Thanks for catching that! Will have a look on Monday if
> alloc_contig_pages() already properly handled __GFP_ZERO so we can use
> it, otherwise I‘ll fix that.
I had a quick look and didn't see it, but maybe it is in there somewhere.
> I don‘t recall that memory hotunplug does any zeroing - that‘s why I
> didn‘t add any explicit zeroing. Could be you were just lucky in your
> experiment - I assume we‘ll leak kernel memory already.
Hmm yeah good point. I did try it multiple times, and I never get
anything non-zero with the existing code.
I guess it's just that the new method is more likely to give us memory
that's already been used for something.
So I guess that's not actually a problem with your patch, it's just
exposing an existing issue.
cheers
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
2020-11-04 12:11 ` Oscar Salvador
@ 2020-11-11 12:07 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2020-11-11 12:07 UTC (permalink / raw)
To: Oscar Salvador, Mike Rapoport
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Michal Hocko, Wei Yang
On 04.11.20 13:11, Oscar Salvador wrote:
> On Wed, Nov 04, 2020 at 02:06:51PM +0200, Mike Rapoport wrote:
>> On Wed, Nov 04, 2020 at 10:50:07AM +0100, osalvador wrote:
>>> On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote:
>>>> Let's revert what we did in case seomthing goes wrong and we return an
>>>> error.
>>>
>>> Dumb question, but should not we do this for other arches as well?
>>
>> It seems arm64 and s390 already do that.
>> x86 could have its arch_add_memory() improved though :)
>
> Right, I only stared at x86 and see it did not have it.
> I guess we want to have all arches aligned with this.
The ultimate goal would be to get rid of arch-specific arch_add_memory()
implementations completely, providing arch_create_linear_mapping() /
arch_remove_linear_mapping() instead (as indicated in patch #1).
The x86 variant certainly needs love, but I'll keep this patch set
powerpc specific, so it can go via the powerpc tree in one piece. I'll
add unifying these implementations onto my todo list.
Thanks!
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping()
2020-11-04 9:42 ` osalvador
@ 2020-11-11 12:10 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2020-11-11 12:10 UTC (permalink / raw)
To: osalvador
Cc: linux-kernel, linux-mm, linuxppc-dev, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Rashmica Gupta,
Andrew Morton, Mike Rapoport, Michal Hocko, Wei Yang
On 04.11.20 10:42, osalvador wrote:
> On Thu, Oct 29, 2020 at 05:27:16PM +0100, David Hildenbrand wrote:
>> Let's print a warning similar to in arch_add_linear_mapping() instead of
>> WARN_ON_ONCE() and eventually crashing the kernel.
>>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Paul Mackerras <paulus@samba.org>
>> Cc: Rashmica Gupta <rashmica.g@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mike Rapoport <rppt@kernel.org>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> arch/powerpc/mm/mem.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index 8a86d81f8df0..685028451dd2 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -145,7 +145,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
>> flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
>>
>> ret = remove_section_mapping(start, start + size);
>> - WARN_ON_ONCE(ret);
>> + if (ret)
>> + pr_warn("Unable to remove linear mapping for 0x%llx..0x%llx: %d\n",
>> + start, start + size, ret);
>
> I guess the fear is to panic on systems that do have panic_on_warn (not
> sure how many productions systems have this out there).
Exactly.
> But anyway, being coherent with that, I think you should remove the WARN_ON
> in hash__remove_section_mapping as well.
Thanks, I'll add a patch doing that.
>
> Besides that:
>
> Reviewed-by: Oscar Salvador <osalvador@suse.
>
> Not sure if the functions below that also have any sort of WARN_ON.
> native_hpte_removebolted has a VM_WARN_ON, but that is on
> CONFIG_DEBUG_VM so does not really matter.
Right. Thanks!
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
2020-10-29 16:27 [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
` (3 preceding siblings ...)
2020-10-29 16:27 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
@ 2020-11-25 11:57 ` Michael Ellerman
4 siblings, 0 replies; 17+ messages in thread
From: Michael Ellerman @ 2020-11-25 11:57 UTC (permalink / raw)
To: linux-kernel, David Hildenbrand
Cc: Rashmica Gupta, Michael Ellerman, Andrew Morton, Paul Mackerras,
Michal Hocko, Mike Rapoport, Michal Hocko, Wei Yang,
Oscar Salvador, linux-mm, Benjamin Herrenschmidt, linuxppc-dev
On Thu, 29 Oct 2020 17:27:14 +0100, David Hildenbrand wrote:
> powernv/memtrace is the only in-kernel user that rips out random memory
> it never added (doesn't own) in order to allocate memory without a
> linear mapping. Let's stop abusing memory hot(un)plug infrastructure for
> that - use alloc_contig_pages() for allocating memory and remove the
> linear mapping manually.
>
> The original idea was discussed in:
> https://lkml.kernel.org/r/48340e96-7e6b-736f-9e23-d3111b915b6e@redhat.com
>
> [...]
Applied to powerpc/next.
[1/4] powerpc/mm: factor out creating/removing linear mapping
https://git.kernel.org/powerpc/c/4abb1e5b63ac3281275315fc6b0cde0b9c2e2e42
[2/4] powerpc/mm: print warning in arch_remove_linear_mapping()
https://git.kernel.org/powerpc/c/1f73ad3e8d755dbec52fcec98618a7ce4de12af2
[3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
https://git.kernel.org/powerpc/c/ca2c36cae9d48b180ea51259e35ab3d95d327df2
[4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
https://git.kernel.org/powerpc/c/0bd4b96d99108b7ea9bac0573957483be7781d70
cheers
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2020-11-25 11:57 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-10-29 16:27 [PATCH v1 0/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 1/4] powerpc/mm: factor out creating/removing linear mapping David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 2/4] powerpc/mm: print warning in arch_remove_linear_mapping() David Hildenbrand
2020-11-04 9:42 ` osalvador
2020-11-11 12:10 ` David Hildenbrand
2020-10-29 16:27 ` [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory() David Hildenbrand
2020-11-04 9:50 ` osalvador
2020-11-04 12:06 ` Mike Rapoport
2020-11-04 12:11 ` Oscar Salvador
2020-11-11 12:07 ` David Hildenbrand
2020-11-04 12:11 ` Oscar Salvador
2020-10-29 16:27 ` [PATCH v1 4/4] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations David Hildenbrand
2020-11-03 9:23 ` Michal Hocko
2020-11-03 9:29 ` David Hildenbrand
2020-11-25 11:57 ` [PATCH v1 0/4] " Michael Ellerman
[not found] <87o8kcttjp.fsf@mpe.ellerman.id.au>
2020-11-05 8:29 ` [PATCH v1 4/4] " David Hildenbrand
2020-11-05 10:47 ` Michael Ellerman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).