[PATCH v5 0/5] kdump: crashkernel reservation from CMA

public inbox for linux-kernel@vger.kernel.org

* [PATCH v5 0/5] kdump: crashkernel reservation from CMA
@ 2025-06-12 10:11 Jiri Bohac
  2025-06-12 10:13 ` [PATCH v5 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-06-12 10:11 UTC (permalink / raw)
  To: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm
  Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

Hi,

this series implements a way to reserve additional crash kernel
memory using CMA.

Link to the v1 discussion:
https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
See below for the changes since v1 and how concerns from the 
discussion have been addressed.

Currently, all the memory for the crash kernel is not usable by
the 1st (production) kernel. It is also unmapped so that it can't
be corrupted by the fault that will eventually trigger the crash.
This makes sense for the memory actually used by the kexec-loaded
crash kernel image and initrd and the data prepared during the
load (vmcoreinfo, ...). However, the reserved space needs to be
much larger than that to provide enough run-time memory for the
crash kernel and the kdump userspace. Estimating the amount of
memory to reserve is difficult. Reserving too little makes kdump
likely to end in OOM; reserving too much takes even more memory
from the production system. Also, the reservation only allows
reserving a single contiguous block (or two with the "low"
suffix). I've seen systems where this fails because the physical
memory is fragmented.

By reserving additional crashkernel memory from CMA, the main
crashkernel reservation can be just large enough to fit the
kernel and initrd image, minimizing the memory taken away from
the production system. Most of the run-time memory for the crash
kernel will be memory previously available to userspace in the
production system. As this memory is no longer wasted, the
reservation can be done with a generous margin, making kdump more
reliable. Kernel memory that we need to preserve for dumping is
normally not allocated from CMA, unless it is explicitly
allocated as movable. Currently this is only the case for memory
ballooning and zswap. Such movable memory will be missing from
the vmcore. User data is typically not dumped by makedumpfile.
When dumping of user data is intended this new CMA reservation
cannot be used.

There are five patches in this series:

The first adds a new ",cma" suffix to the recently introduced generic
crashkernel parsing code. parse_crashkernel() takes one more
argument to store the cma reservation size.

The second patch implements reserve_crashkernel_cma() which
performs the reservation. If the requested size is not available
in a single range, multiple smaller ranges will be reserved.
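
The fallback to smaller ranges can be illustrated outside the kernel.
The following is only a userspace sketch, not the code from patch 2/5:
mock_reserve(), free_blocks[] and reserve_in_chunks() are hypothetical
stand-ins for the CMA allocator and for reserve_crashkernel_cma(), and
RANGES_MAX mirrors the limit of four ranges used in the patch.

```c
#include <stddef.h>

#define PAGE_SIZE	4096ULL
#define RANGES_MAX	4

/* Hypothetical free contiguous blocks in bytes (stand-in for CMA). */
static unsigned long long free_blocks[] = {
	512ULL << 20, 256ULL << 20, 256ULL << 20,
};

static unsigned long long round_up_page(unsigned long long x)
{
	return (x + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
}

/* Mock reservation: carve `size` bytes out of the first block that fits. */
static int mock_reserve(unsigned long long size)
{
	size_t i;

	for (i = 0; i < sizeof(free_blocks) / sizeof(free_blocks[0]); i++) {
		if (free_blocks[i] >= size) {
			free_blocks[i] -= size;
			return 0;
		}
	}
	return -1;
}

/* Returns the number of ranges reserved; the total goes to *reserved. */
static int reserve_in_chunks(unsigned long long want,
			     unsigned long long *reserved)
{
	unsigned long long request = round_up_page(want);
	int cnt = 0;

	*reserved = 0;
	while (want > *reserved && cnt < RANGES_MAX) {
		if (mock_reserve(request)) {
			/* reservation failed, retry with half the size */
			if (request <= PAGE_SIZE)
				break;
			request = round_up_page(request / 2);
			continue;
		}
		cnt++;
		*reserved += request;
	}
	return cnt;
}
```

With the block sizes assumed above, a 1G request is satisfied by three
ranges (512M + 256M + 256M). Once a smaller size has been chosen it is
kept for the remaining iterations, so the number of ranges stays small.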

The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.

The fourth patch adds a short delay before booting the kdump
kernel, allowing pending DMA transfers to finish.

The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
  through /proc/vmcore by excluding them from the vmcoreinfo
  PT_LOAD ranges.

Adding other architectures is easy and I can do that as soon as
this series is merged.
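
Excluding the CMA-reserved ranges from the dumped memory, the third
item in the list above, can split an existing range in two, so each
excluded range may need one extra slot in the range array. A minimal
userspace sketch of that effect follows. It is not the kernel's
crash_exclude_mem_range(); struct range here and exclude_range() are
simplified, hypothetical versions written for illustration.

```c
#include <stddef.h>

struct range {
	unsigned long long start, end;	/* inclusive bounds */
};

/*
 * Remove [ex_start, ex_end] from r[0..*n). `cap` is the capacity of r.
 * Returns 0 on success, -1 if a split needs a slot that is not there.
 */
static int exclude_range(struct range *r, size_t *n, size_t cap,
			 unsigned long long ex_start,
			 unsigned long long ex_end)
{
	size_t i = 0, j;

	while (i < *n) {
		unsigned long long s = r[i].start, e = r[i].end;

		if (ex_end < s || ex_start > e) {	/* no overlap */
			i++;
			continue;
		}
		if (ex_start <= s && ex_end >= e) {	/* fully covered */
			for (j = i; j + 1 < *n; j++)
				r[j] = r[j + 1];
			(*n)--;
			continue;
		}
		if (ex_start <= s) {			/* trim the front */
			r[i].start = ex_end + 1;
		} else if (ex_end >= e) {		/* trim the tail */
			r[i].end = ex_start - 1;
		} else {				/* hole in the middle */
			if (*n == cap)
				return -1;
			for (j = *n; j > i + 1; j--)
				r[j] = r[j - 1];
			r[i].end = ex_start - 1;
			r[i + 1].start = ex_end + 1;
			r[i + 1].end = e;
			(*n)++;
			i++;	/* skip the newly created tail piece */
		}
		i++;
	}
	return 0;
}
```

Starting from a single range and a capacity of three, two exclusions
from the middle succeed and a third fails for lack of space, which is
why the array has to be sized with one spare slot per range that will
be excluded.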

With this series applied, specifying
	crashkernel=100M crashkernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.

An additional 1G will be reserved from CMA, still usable by the
production system. The crash kernel will have 1.1G memory
available. The 100M can be reliably predicted based on the size
of the kernel and initrd.

The new cma suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.
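
To make the command-line syntax concrete, here is a small userspace
sketch that extracts the size from a "crashkernel=size,cma" token. It
is a simplification and not the kernel's parser: parse_size() only
handles the K, M and G suffixes, and find_crashkernel_cma() is a
hypothetical helper that, as an assumption, lets the last matching
token win.

```c
#include <stdlib.h>
#include <string.h>

/* Parse "<number>[K|M|G]"; *end is left just past what was consumed. */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Hypothetical helper: size from the last "crashkernel=<size>,cma"
 * token in cmdline, or 0 if no such token is present.
 */
static unsigned long long find_crashkernel_cma(const char *cmdline)
{
	static const char key[] = "crashkernel=";
	unsigned long long found = 0;
	const char *p = cmdline;

	while ((p = strstr(p, key)) != NULL) {
		int at_token_start = (p == cmdline || p[-1] == ' ');
		unsigned long long size;
		char *end;

		p += sizeof(key) - 1;
		if (!at_token_start)
			continue;

		size = parse_size(p, &end);
		/* accept only "<size>,cma" followed by a space or the end */
		if (end != p && strncmp(end, ",cma", 4) == 0 &&
		    (end[4] == '\0' || end[4] == ' '))
			found = size;
	}
	return found;
}
```

A plain "crashkernel=100M" without the suffix yields 0 from this
helper, matching the statement that nothing changes when no cma
reservation is requested.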

---
Changes since v4:

- v5 is identical to v4 for all patches except patch 4/5,
  where v5 incorporates feedback from David Hildenbrand

---
Changes since v3:

- updated for 6.15
- reworked the delay patch:
  - delay changed to 10 s based on David Hildenbrand's comments
  - constant changed to variable so that the delay can be easily made
    configurable in the future
- made reserve_crashkernel_cma() return early when cma_size == 0
  to avoid printing out the 0-sized cma allocation

---
Changes since v2:

based on feedback from Baoquan He and David Hildenbrand:
- kept original formatting of suffix_tbl[]
- updated documentation to mention movable pages missing from vmcore
- fixed whitespace in documentation
- moved the call to crash_cma_clear_pending_dma() after
  machine_crash_shutdown() so that non-crash CPUs and timers are
  shut down before the delay

---
Changes since v1:

The key concern raised in the v1 discussion was that pages in the
CMA region may be pinned and used for a DMA transfer, potentially
corrupting the new kernel's memory. When the cma suffix is used, kdump
may be less reliable and the corruption hard to debug.

This v2 series addresses this concern in two ways:

1) Clearly stating the potential problem in the updated
Documentation and setting the expectation (patch 3/5)

Documentation now explicitly states that:
- the risk of kdump failure is increased
- the CMA reservation is intended for users who can not or don't
  want to sacrifice enough memory for a standard crashkernel reservation
  and who prefer less reliable kdump to no kdump at all

This is consistent with the documentation of the
crash_kexec_post_notifiers option, which can also increase the
risk of kdump failure, yet may be the only way to use kdump on
some systems. And just like the crash_kexec_post_notifiers
option, the cma crashkernel suffix is completely optional:
the series has zero effect when the suffix is not used.

2) Giving DMA time to finish before booting the kdump kernel
   (patch 4/5)

Pages can be pinned for long term use using the FOLL_LONGTERM
flag. Then they are migrated outside the CMA region. Pinning
without this flag shows that the intent of their user is to only
use them for short-lived DMA transfers. 

Delay the boot of the kdump kernel when the CMA reservation is
used, giving potential pending DMA transfers time to finish.

Other minor changes since v1:
- updated for 6.14-rc2
- moved #ifdefs and #defines to header files
- added __always_unused in parse_crashkernel() to silence a false
  unused variable warning

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v5 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option
  2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
@ 2025-06-12 10:13 ` Jiri Bohac
  2025-06-12 10:16 ` [PATCH v5 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-06-12 10:13 UTC (permalink / raw)
  To: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm
  Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

Add a new cma_size parameter to parse_crashkernel().
When not NULL, call __parse_crashkernel to parse the CMA
reservation size from "crashkernel=size,cma" and store it
in cma_size.

Set cma_size to NULL in all calls to parse_crashkernel().

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
 arch/arm/kernel/setup.c              |  2 +-
 arch/arm64/mm/init.c                 |  2 +-
 arch/loongarch/kernel/setup.c        |  2 +-
 arch/mips/kernel/setup.c             |  2 +-
 arch/powerpc/kernel/fadump.c         |  2 +-
 arch/powerpc/kexec/core.c            |  2 +-
 arch/powerpc/mm/nohash/kaslr_booke.c |  2 +-
 arch/riscv/mm/init.c                 |  2 +-
 arch/s390/kernel/setup.c             |  2 +-
 arch/sh/kernel/machine_kexec.c       |  2 +-
 arch/x86/kernel/setup.c              |  2 +-
 include/linux/crash_reserve.h        |  3 ++-
 kernel/crash_reserve.c               | 16 ++++++++++++++--
 13 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index a41c93988d2c..0bfd66c7ada0 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1004,7 +1004,7 @@ static void __init reserve_crashkernel(void)
 	total_mem = get_total_mem();
 	ret = parse_crashkernel(boot_command_line, total_mem,
 				&crash_size, &crash_base,
-				NULL, NULL);
+				NULL, NULL, NULL);
 	/* invalid value specified or crashkernel=0 */
 	if (ret || !crash_size)
 		return;
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 0c8c35dd645e..ea84a61ed508 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -106,7 +106,7 @@ static void __init arch_reserve_crashkernel(void)
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base,
-				&low_size, &high);
+				&low_size, NULL, &high);
 	if (ret)
 		return;
 
diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
index b99fbb388fe0..22b27cd447a1 100644
--- a/arch/loongarch/kernel/setup.c
+++ b/arch/loongarch/kernel/setup.c
@@ -265,7 +265,7 @@ static void __init arch_reserve_crashkernel(void)
 		return;
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
-				&crash_size, &crash_base, &low_size, &high);
+				&crash_size, &crash_base, &low_size, NULL, &high);
 	if (ret)
 		return;
 
diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
index fbfe0771317e..11b9b6b63e19 100644
--- a/arch/mips/kernel/setup.c
+++ b/arch/mips/kernel/setup.c
@@ -458,7 +458,7 @@ static void __init mips_parse_crashkernel(void)
 	total_mem = memblock_phys_mem_size();
 	ret = parse_crashkernel(boot_command_line, total_mem,
 				&crash_size, &crash_base,
-				NULL, NULL);
+				NULL, NULL, NULL);
 	if (ret != 0 || crash_size <= 0)
 		return;
 
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 8ca49e40c473..28cab25d5b33 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -333,7 +333,7 @@ static __init u64 fadump_calculate_reserve_size(void)
 	 * memory at a predefined offset.
 	 */
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
-				&size, &base, NULL, NULL);
+				&size, &base, NULL, NULL, NULL);
 	if (ret == 0 && size > 0) {
 		unsigned long max_size;
 
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 00e9c267b912..d1a2d755381c 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -110,7 +110,7 @@ void __init arch_reserve_crashkernel(void)
 
 	/* use common parsing */
 	ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size,
-				&crash_base, NULL, NULL);
+				&crash_base, NULL, NULL, NULL);
 
 	if (ret)
 		return;
diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c b/arch/powerpc/mm/nohash/kaslr_booke.c
index 5c8d1bb98b3e..5e4897daaaea 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -178,7 +178,7 @@ static void __init get_crash_kernel(void *fdt, unsigned long size)
 	int ret;
 
 	ret = parse_crashkernel(boot_command_line, size, &crash_size,
-				&crash_base, NULL, NULL);
+				&crash_base, NULL, NULL, NULL);
 	if (ret != 0 || crash_size == 0)
 		return;
 	if (crash_base == 0)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index ab475ec6ca42..3f272aff2cf1 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1402,7 +1402,7 @@ static void __init arch_reserve_crashkernel(void)
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base,
-				&low_size, &high);
+				&low_size, NULL, &high);
 	if (ret)
 		return;
 
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index f244c5560e7f..b99aeb0db2ee 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -605,7 +605,7 @@ static void __init reserve_crashkernel(void)
 	int rc;
 
 	rc = parse_crashkernel(boot_command_line, ident_map_size,
-			       &crash_size, &crash_base, NULL, NULL);
+			       &crash_size, &crash_base, NULL, NULL, NULL);
 
 	crash_base = ALIGN(crash_base, KEXEC_CRASH_MEM_ALIGN);
 	crash_size = ALIGN(crash_size, KEXEC_CRASH_MEM_ALIGN);
diff --git a/arch/sh/kernel/machine_kexec.c b/arch/sh/kernel/machine_kexec.c
index 8321b31d2e19..37073ca1e0ad 100644
--- a/arch/sh/kernel/machine_kexec.c
+++ b/arch/sh/kernel/machine_kexec.c
@@ -146,7 +146,7 @@ void __init reserve_crashkernel(void)
 		return;
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
-			&crash_size, &crash_base, NULL, NULL);
+			&crash_size, &crash_base, NULL, NULL, NULL);
 	if (ret == 0 && crash_size > 0) {
 		crashk_res.start = crash_base;
 		crashk_res.end = crash_base + crash_size - 1;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 7d9ed79a93c0..870b06571b2e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -582,7 +582,7 @@ static void __init arch_reserve_crashkernel(void)
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base,
-				&low_size, &high);
+				&low_size, NULL, &high);
 	if (ret)
 		return;
 
diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index 1fe7e7d1b214..e784aaff2f5a 100644
--- a/include/linux/crash_reserve.h
+++ b/include/linux/crash_reserve.h
@@ -16,7 +16,8 @@ extern struct resource crashk_low_res;
 
 int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base,
-		unsigned long long *low_size, bool *high);
+		unsigned long long *low_size, unsigned long long *cma_size,
+		bool *high);
 
 #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
 #ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index aff7c0fdbefa..a8861f3f64fe 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -172,17 +172,19 @@ static int __init parse_crashkernel_simple(char *cmdline,
 
 #define SUFFIX_HIGH 0
 #define SUFFIX_LOW  1
-#define SUFFIX_NULL 2
+#define SUFFIX_CMA  2
+#define SUFFIX_NULL 3
 static __initdata char *suffix_tbl[] = {
 	[SUFFIX_HIGH] = ",high",
 	[SUFFIX_LOW]  = ",low",
+	[SUFFIX_CMA]  = ",cma",
 	[SUFFIX_NULL] = NULL,
 };
 
 /*
  * That function parses "suffix"  crashkernel command lines like
  *
- *	crashkernel=size,[high|low]
+ *	crashkernel=size,[high|low|cma]
  *
  * It returns 0 on success and -EINVAL on failure.
  */
@@ -298,9 +300,11 @@ int __init parse_crashkernel(char *cmdline,
 			     unsigned long long *crash_size,
 			     unsigned long long *crash_base,
 			     unsigned long long *low_size,
+			     unsigned long long *cma_size,
 			     bool *high)
 {
 	int ret;
+	unsigned long long __always_unused cma_base;
 
 	/* crashkernel=X[@offset] */
 	ret = __parse_crashkernel(cmdline, system_ram, crash_size,
@@ -331,6 +335,14 @@ int __init parse_crashkernel(char *cmdline,
 
 		*high = true;
 	}
+
+	/*
+	 * optional CMA reservation
+	 * cma_base is ignored
+	 */
+	if (cma_size)
+		__parse_crashkernel(cmdline, 0, cma_size,
+			&cma_base, suffix_tbl[SUFFIX_CMA]);
 #endif
 	if (!*crash_size)
 		ret = -EINVAL;

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 2/5] kdump: implement reserve_crashkernel_cma
  2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
  2025-06-12 10:13 ` [PATCH v5 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
@ 2025-06-12 10:16 ` Jiri Bohac
  2025-06-12 10:17 ` [PATCH v5 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-06-12 10:16 UTC (permalink / raw)
  To: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm
  Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

reserve_crashkernel_cma() reserves CMA ranges for the
crash kernel. If allocating the requested size fails,
try to reserve in smaller blocks.

Store the reserved ranges in the crashk_cma_ranges array
and the number of ranges in crashk_cma_cnt.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

---
Changes since v3:
- make reserve_crashkernel_cma() return early when cma_size == 0
  to avoid printing out the 0 cma-allocated size

---
 include/linux/crash_reserve.h | 12 ++++++++
 kernel/crash_reserve.c        | 52 +++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index e784aaff2f5a..7b44b41d0a20 100644
--- a/include/linux/crash_reserve.h
+++ b/include/linux/crash_reserve.h
@@ -13,12 +13,24 @@
  */
 extern struct resource crashk_res;
 extern struct resource crashk_low_res;
+extern struct range crashk_cma_ranges[];
+#if defined(CONFIG_CMA) && defined(CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION)
+#define CRASHKERNEL_CMA
+#define CRASHKERNEL_CMA_RANGES_MAX 4
+extern int crashk_cma_cnt;
+#else
+#define crashk_cma_cnt 0
+#define CRASHKERNEL_CMA_RANGES_MAX 0
+#endif
+
 
 int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base,
 		unsigned long long *low_size, unsigned long long *cma_size,
 		bool *high);
 
+void __init reserve_crashkernel_cma(unsigned long long cma_size);
+
 #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
 #ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
 #define DEFAULT_CRASH_KERNEL_LOW_SIZE	(128UL << 20)
diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index a8861f3f64fe..ae32ea707678 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -14,6 +14,8 @@
 #include <linux/cpuhotplug.h>
 #include <linux/memblock.h>
 #include <linux/kmemleak.h>
+#include <linux/cma.h>
+#include <linux/crash_reserve.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -469,6 +471,56 @@ void __init reserve_crashkernel_generic(unsigned long long crash_size,
 #endif
 }
 
+struct range crashk_cma_ranges[CRASHKERNEL_CMA_RANGES_MAX];
+#ifdef CRASHKERNEL_CMA
+int crashk_cma_cnt;
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+	unsigned long long request_size = roundup(cma_size, PAGE_SIZE);
+	unsigned long long reserved_size = 0;
+
+	if (!cma_size)
+		return;
+
+	while (cma_size > reserved_size &&
+	       crashk_cma_cnt < CRASHKERNEL_CMA_RANGES_MAX) {
+
+		struct cma *res;
+
+		if (cma_declare_contiguous(0, request_size, 0, 0, 0, false,
+				       "crashkernel", &res)) {
+			/* reservation failed, try half-sized blocks */
+			if (request_size <= PAGE_SIZE)
+				break;
+
+			request_size = roundup(request_size / 2, PAGE_SIZE);
+			continue;
+		}
+
+		crashk_cma_ranges[crashk_cma_cnt].start = cma_get_base(res);
+		crashk_cma_ranges[crashk_cma_cnt].end =
+			crashk_cma_ranges[crashk_cma_cnt].start +
+			cma_get_size(res) - 1;
+		++crashk_cma_cnt;
+		reserved_size += request_size;
+	}
+
+	if (cma_size > reserved_size)
+		pr_warn("crashkernel CMA reservation failed: %lld MB requested, %lld MB reserved in %d ranges\n",
+			cma_size >> 20, reserved_size >> 20, crashk_cma_cnt);
+	else
+		pr_info("crashkernel CMA reserved: %lld MB in %d ranges\n",
+			reserved_size >> 20, crashk_cma_cnt);
+}
+
+#else /* CRASHKERNEL_CMA */
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+	if (cma_size)
+		pr_warn("crashkernel CMA reservation not supported\n");
+}
+#endif
+
 #ifndef HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY
 static __init int insert_crashkernel_resources(void)
 {

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 3/5] kdump, documentation: describe craskernel CMA reservation
  2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
  2025-06-12 10:13 ` [PATCH v5 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
  2025-06-12 10:16 ` [PATCH v5 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
@ 2025-06-12 10:17 ` Jiri Bohac
  2025-06-12 10:18 ` [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-06-12 10:17 UTC (permalink / raw)
  To: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm
  Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

Describe the new crashkernel ",cma" suffix in Documentation/

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
 Documentation/admin-guide/kdump/kdump.rst     | 21 ++++++++++++++++++
 .../admin-guide/kernel-parameters.txt         | 22 +++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 1f7f14c6e184..089665731509 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -311,6 +311,27 @@ crashkernel syntax
 
             crashkernel=0,low
 
+4) crashkernel=size,cma
+
+	Reserve additional crash kernel memory from CMA. This reservation is
+	usable by the first system's userspace memory and kernel movable
+	allocations (memory balloon, zswap). Pages allocated from this memory
+	range will not be included in the vmcore so this should not be used if
+	dumping of userspace memory is intended and it has to be expected that
+	some movable kernel pages may be missing from the dump.
+
+	A standard crashkernel reservation, as described above, is still needed
+	to hold the crash kernel and initrd.
+
+	This option increases the risk of a kdump failure: DMA transfers
+	configured by the first kernel may end up corrupting the second
+	kernel's memory.
+
+	This reservation method is intended for systems that can't afford to
+	sacrifice enough memory for standard crashkernel reservation and where
+	less reliable and possibly incomplete kdump is preferable to no kdump at
+	all.
+
 Boot into System Kernel
 -----------------------
 1) Update the boot loader (such as grub, yaboot, or lilo) configuration
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ea81784be981..ee6be52dd8a5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -983,6 +983,28 @@
 			0: to disable low allocation.
 			It will be ignored when crashkernel=X,high is not used
 			or memory reserved is below 4G.
+	crashkernel=size[KMG],cma
+			[KNL, X86] Reserve additional crash kernel memory from
+			CMA. This reservation is usable by the first system's
+			userspace memory and kernel movable allocations (memory
+			balloon, zswap). Pages allocated from this memory range
+			will not be included in the vmcore so this should not
+			be used if dumping of userspace memory is intended and
+			it has to be expected that some movable kernel pages
+			may be missing from the dump.
+
+			A standard crashkernel reservation, as described above,
+			is still needed to hold the crash kernel and initrd.
+
+			This option increases the risk of a kdump failure: DMA
+			transfers configured by the first kernel may end up
+			corrupting the second kernel's memory.
+
+			This reservation method is intended for systems that
+			can't afford to sacrifice enough memory for standard
+			crashkernel reservation and where less reliable and
+			possibly incomplete kdump is preferable to no kdump at
+			all.
 
 	cryptomgr.notests
 			[KNL] Disable crypto self-tests

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA
  2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
                   ` (2 preceding siblings ...)
  2025-06-12 10:17 ` [PATCH v5 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
@ 2025-06-12 10:18 ` Jiri Bohac
  2025-06-12 23:47   ` Andrew Morton
  2025-06-12 10:20 ` [PATCH v5 5/5] x86: implement crashkernel cma reservation Jiri Bohac
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Jiri Bohac @ 2025-06-12 10:18 UTC (permalink / raw)
  To: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm
  Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

When re-using the CMA area for kdump there is a risk of pending DMA
into pinned user pages in the CMA area.

Pages residing in CMA areas can usually not get long-term pinned and
are instead migrated away from the CMA area, so long-term pinning is
typically not a concern. (BUGs in the kernel might still lead to
long-term pinning of such pages if everything goes wrong.)

Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly
be the source or destination of a pending DMA transfer.

Although there is no clear specification how long a page may be pinned
without FOLL_LONGTERM, pinning without the flag shows an intent of the
caller to only use the memory for short-lived DMA transfers, not a transfer
initiated by a device asynchronously at a random time in the future.

Add a delay of CMA_DMA_TIMEOUT_SEC seconds before starting the kdump
kernel, giving such short-lived DMA transfers time to finish before
the CMA memory is re-used by the kdump kernel.

Set CMA_DMA_TIMEOUT_SEC to 10 seconds - chosen arbitrarily as both
a huge margin for a DMA transfer, yet not increasing the kdump time
too significantly.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>

---
Changes since v4:
- reworded the paragraph about long-term pinning
- simplified crash_cma_clear_pending_dma()
- dropped cma_dma_timeout_sec variable

---
Changes since v3:
- renamed CMA_DMA_TIMEOUT_MSEC to CMA_DMA_TIMEOUT_SEC, change delay to 10 seconds
- introduce a cma_dma_timeout_sec initialized to CMA_DMA_TIMEOUT_SEC
  to make the timeout trivially tunable if needed in the future

---
 kernel/crash_core.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 335b8425dd4b..a4ef79591eb2 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -21,6 +21,7 @@
 #include <linux/reboot.h>
 #include <linux/btf.h>
 #include <linux/objtool.h>
+#include <linux/delay.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -33,6 +34,11 @@
 /* Per cpu memory for storing cpu states in case of system crash. */
 note_buf_t __percpu *crash_notes;
 
+/* time to wait for possible DMA to finish before starting the kdump kernel
+ * when a CMA reservation is used
+ */
+#define CMA_DMA_TIMEOUT_SEC 10
+
 #ifdef CONFIG_CRASH_DUMP
 
 int kimage_crash_copy_vmcoreinfo(struct kimage *image)
@@ -97,6 +103,14 @@ int kexec_crash_loaded(void)
 }
 EXPORT_SYMBOL_GPL(kexec_crash_loaded);
 
+static void crash_cma_clear_pending_dma(void)
+{
+	if (!crashk_cma_cnt)
+		return;
+
+	mdelay(CMA_DMA_TIMEOUT_SEC * 1000);
+}
+
 /*
  * No panic_cpu check version of crash_kexec().  This function is called
  * only when panic_cpu holds the current CPU number; this is the only CPU
@@ -119,6 +133,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
 			crash_setup_regs(&fixed_regs, regs);
 			crash_save_vmcoreinfo();
 			machine_crash_shutdown(&fixed_regs);
+			crash_cma_clear_pending_dma();
 			machine_kexec(kexec_crash_image);
 		}
 		kexec_unlock();

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 5/5] x86: implement crashkernel cma reservation
  2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
                   ` (3 preceding siblings ...)
  2025-06-12 10:18 ` [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
@ 2025-06-12 10:20 ` Jiri Bohac
  2025-08-20 15:46 ` [PATCH v5 0/5] kdump: crashkernel reservation from CMA Breno Leitao
  2025-10-03 15:51 ` Breno Leitao
  6 siblings, 0 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-06-12 10:20 UTC (permalink / raw)
  To: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm
  Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

Implement the crashkernel CMA reservation for x86:
- enable parsing of the cma suffix by parse_crashkernel()
- reserve memory with reserve_crashkernel_cma()
- add the CMA-reserved ranges to the e820 map for the crash kernel
- exclude the CMA-reserved ranges from vmcore

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
 arch/x86/kernel/crash.c | 26 ++++++++++++++++++++++----
 arch/x86/kernel/setup.c |  5 +++--
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 0be61c45400c..670aa9b8b0f8 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -163,10 +163,10 @@ static struct crash_mem *fill_up_crash_elf_data(void)
 		return NULL;
 
 	/*
-	 * Exclusion of crash region and/or crashk_low_res may cause
-	 * another range split. So add extra two slots here.
+	 * Exclusion of crash region, crashk_low_res and/or crashk_cma_ranges
+	 * may cause range splits. So add extra slots here.
 	 */
-	nr_ranges += 2;
+	nr_ranges += 2 + crashk_cma_cnt;
 	cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
 	if (!cmem)
 		return NULL;
@@ -184,6 +184,7 @@ static struct crash_mem *fill_up_crash_elf_data(void)
 static int elf_header_exclude_ranges(struct crash_mem *cmem)
 {
 	int ret = 0;
+	int i;
 
 	/* Exclude the low 1M because it is always reserved */
 	ret = crash_exclude_mem_range(cmem, 0, SZ_1M - 1);
@@ -198,8 +199,17 @@ static int elf_header_exclude_ranges(struct crash_mem *cmem)
 	if (crashk_low_res.end)
 		ret = crash_exclude_mem_range(cmem, crashk_low_res.start,
 					      crashk_low_res.end);
+	if (ret)
+		return ret;
 
-	return ret;
+	for (i = 0; i < crashk_cma_cnt; ++i) {
+		ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start,
+					      crashk_cma_ranges[i].end);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
 }
 
 static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
@@ -352,6 +362,14 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
 		add_e820_entry(params, &ei);
 	}
 
+	for (i = 0; i < crashk_cma_cnt; ++i) {
+		ei.addr = crashk_cma_ranges[i].start;
+		ei.size = crashk_cma_ranges[i].end -
+			  crashk_cma_ranges[i].start + 1;
+		ei.type = E820_TYPE_RAM;
+		add_e820_entry(params, &ei);
+	}
+
 out:
 	vfree(cmem);
 	return ret;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 870b06571b2e..dcbeba344825 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -573,7 +573,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
 
 static void __init arch_reserve_crashkernel(void)
 {
-	unsigned long long crash_base, crash_size, low_size = 0;
+	unsigned long long crash_base, crash_size, low_size = 0, cma_size = 0;
 	bool high = false;
 	int ret;
 
@@ -582,7 +582,7 @@ static void __init arch_reserve_crashkernel(void)
 
 	ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
 				&crash_size, &crash_base,
-				&low_size, NULL, &high);
+				&low_size, &cma_size, &high);
 	if (ret)
 		return;
 
@@ -592,6 +592,7 @@ static void __init arch_reserve_crashkernel(void)
 	}
 
 	reserve_crashkernel_generic(crash_size, crash_base, low_size, high);
+	reserve_crashkernel_cma(cma_size);
 }
 
 static struct resource standard_io_resources[] = {


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
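
The crash.c hunks above work on inclusive [start, end] ranges: the CMA
ranges are cut out of the memory list that backs the ELF headers, and each
one is then added to the crash kernel's e820 map with a size of
end - start + 1. A minimal user-space sketch of that arithmetic follows. It
is only an illustration and not the kernel's crash_exclude_mem_range();
struct range, struct mem_map, MAX_RANGES, range_size() and exclude_range()
are hypothetical names introduced here.

```c
#include <assert.h>

/* Hypothetical user-space model; not the kernel's struct crash_mem. */
struct range {
	unsigned long long start;	/* inclusive */
	unsigned long long end;		/* inclusive */
};

#define MAX_RANGES 16

struct mem_map {
	unsigned int nr;
	struct range r[MAX_RANGES];
};

/* Size of an inclusive range, mirroring "end - start + 1" in the e820 hunk. */
static unsigned long long range_size(const struct range *r)
{
	return r->end - r->start + 1;
}

/*
 * Cut [start, end] out of every range in the map. A range that contains
 * the hole is split in two, a partially covered range is trimmed, and a
 * fully covered range is dropped. Returns 0 on success, -1 if the map
 * would need more than MAX_RANGES entries (the map is then unchanged).
 */
static int exclude_range(struct mem_map *m, unsigned long long start,
			 unsigned long long end)
{
	struct mem_map out = { 0 };
	unsigned int i;

	assert(start <= end);

	for (i = 0; i < m->nr; i++) {
		struct range cur = m->r[i];

		if (end < cur.start || start > cur.end) {
			/* No overlap: keep the range unchanged. */
			if (out.nr >= MAX_RANGES)
				return -1;
			out.r[out.nr++] = cur;
			continue;
		}
		if (cur.start < start) {
			/* Keep the part below the hole. */
			if (out.nr >= MAX_RANGES)
				return -1;
			out.r[out.nr].start = cur.start;
			out.r[out.nr].end = start - 1;
			out.nr++;
		}
		if (cur.end > end) {
			/* Keep the part above the hole. */
			if (out.nr >= MAX_RANGES)
				return -1;
			out.r[out.nr].start = end + 1;
			out.r[out.nr].end = cur.end;
			out.nr++;
		}
	}
	*m = out;
	return 0;
}
```

Building the result in a scratch copy keeps the input map intact on failure,
which is a choice made for this sketch only.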




* Re: [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA
  2025-06-12 10:18 ` [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
@ 2025-06-12 23:47   ` Andrew Morton
  2025-06-13  9:19     ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2025-06-12 23:47 UTC (permalink / raw)
  To: Jiri Bohac
  Cc: Baoquan He, Vivek Goyal, Dave Young, kexec, Philipp Rudo,
	Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

On Thu, 12 Jun 2025 12:18:40 +0200 Jiri Bohac <jbohac@suse.cz> wrote:

> When re-using the CMA area for kdump there is a risk of pending DMA
> into pinned user pages in the CMA area.
> 
> Pages residing in CMA areas can usually not get long-term pinned and
> are instead migrated away from the CMA area, so long-term pinning is
> typically not a concern. (BUGs in the kernel might still lead to
> long-term pinning of such pages if everything goes wrong.)
> 
> Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly
> be the source or destination of a pending DMA transfer.
> 
> Although there is no clear specification how long a page may be pinned
> without FOLL_LONGTERM, pinning without the flag shows an intent of the
> caller to only use the memory for short-lived DMA transfers, not a transfer
> initiated by a device asynchronously at a random time in the future.
> 
> Add a delay of CMA_DMA_TIMEOUT_SEC seconds before starting the kdump
> kernel, giving such short-lived DMA transfers time to finish before
> the CMA memory is re-used by the kdump kernel.
> 
> Set CMA_DMA_TIMEOUT_SEC to 10 seconds - chosen arbitrarily as both
> a huge margin for a DMA transfer, yet not increasing the kdump time
> too significantly.

Oh.  10s sounds a lot.  How long does this process typically take?

It's sad to add a 10s delay for something which some systems will never
do.  I wonder if there's some simple hack we can add.  Like having a
global flag which gets set the first time someone pins a CMA page for
DMA and, if that flag is later found to be unset, skip the delay?  Or
something else along these lines?
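
The flag idea suggested above could look roughly like the following
user-space model. It is a sketch under assumptions, not code from this
series: cma_page_was_pinned, note_cma_pin() and crash_cma_delay_sec() are
hypothetical names, and only CMA_DMA_TIMEOUT_SEC and its value of 10 come
from the patch description. In the kernel, the hook would have to be called
wherever a page in the crashkernel CMA area gets pinned.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CMA_DMA_TIMEOUT_SEC 10	/* value taken from the patch description */

static_assert(CMA_DMA_TIMEOUT_SEC > 0, "the delay must be positive");

/* Hypothetical flag: set once a CMA page has been pinned, never cleared. */
static atomic_bool cma_page_was_pinned;

/* Hypothetical hook, to be called when a page in the CMA area is pinned. */
static void note_cma_pin(void)
{
	atomic_store_explicit(&cma_page_was_pinned, true,
			      memory_order_relaxed);
}

/*
 * Number of seconds to wait before starting the kdump kernel.
 * No delay is needed when no crashkernel CMA memory is in use, or when
 * no page in it was ever pinned.
 */
static unsigned int crash_cma_delay_sec(bool cma_in_use)
{
	if (!cma_in_use)
		return 0;
	if (!atomic_load_explicit(&cma_page_was_pinned, memory_order_relaxed))
		return 0;
	return CMA_DMA_TIMEOUT_SEC;
}
```

The flag is deliberately one-way: once any pin has been seen, the full delay
applies, so the sketch can only skip the wait on systems that never pin such
pages at all.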



* Re: [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA
  2025-06-12 23:47   ` Andrew Morton
@ 2025-06-13  9:19     ` David Hildenbrand
  2025-06-14  2:41       ` Baoquan He
  2025-06-19 12:46       ` Jiri Bohac
  0 siblings, 2 replies; 25+ messages in thread
From: David Hildenbrand @ 2025-06-13  9:19 UTC (permalink / raw)
  To: Andrew Morton, Jiri Bohac
  Cc: Baoquan He, Vivek Goyal, Dave Young, kexec, Philipp Rudo,
	Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

On 13.06.25 01:47, Andrew Morton wrote:
> On Thu, 12 Jun 2025 12:18:40 +0200 Jiri Bohac <jbohac@suse.cz> wrote:
> 
>> When re-using the CMA area for kdump there is a risk of pending DMA
>> into pinned user pages in the CMA area.
>>
>> Pages residing in CMA areas can usually not get long-term pinned and
>> are instead migrated away from the CMA area, so long-term pinning is
>> typically not a concern. (BUGs in the kernel might still lead to
>> long-term pinning of such pages if everything goes wrong.)
>>
>> Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly
>> be the source or destination of a pending DMA transfer.
>>
>> Although there is no clear specification how long a page may be pinned
>> without FOLL_LONGTERM, pinning without the flag shows an intent of the
>> caller to only use the memory for short-lived DMA transfers, not a transfer
>> initiated by a device asynchronously at a random time in the future.
>>
>> Add a delay of CMA_DMA_TIMEOUT_SEC seconds before starting the kdump
>> kernel, giving such short-lived DMA transfers time to finish before
>> the CMA memory is re-used by the kdump kernel.
>>
>> Set CMA_DMA_TIMEOUT_SEC to 10 seconds - chosen arbitrarily as both
>> a huge margin for a DMA transfer, yet not increasing the kdump time
>> too significantly.
> 
> Oh.  10s sounds a lot.  How long does this process typically take?
> 
> It's sad to add a 10s delay for something which some systems will never
> do.  I wonder if there's some simple hack we can add.  Like having a
> global flag which gets set the first time someone pins a CMA page

We would likely have to do that for any GUP on such a page (FOLL_GET | 
FOLL_PIN), both from gup-fast and gup-slow.

Should work, but IMHO can be optimized later, on top of this series.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA
  2025-06-13  9:19     ` David Hildenbrand
@ 2025-06-14  2:41       ` Baoquan He
  2025-06-19 12:46       ` Jiri Bohac
  1 sibling, 0 replies; 25+ messages in thread
From: Baoquan He @ 2025-06-14  2:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jiri Bohac, Vivek Goyal, Dave Young, kexec,
	Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

On 06/13/25 at 11:19am, David Hildenbrand wrote:
> On 13.06.25 01:47, Andrew Morton wrote:
> > On Thu, 12 Jun 2025 12:18:40 +0200 Jiri Bohac <jbohac@suse.cz> wrote:
> > 
> > > When re-using the CMA area for kdump there is a risk of pending DMA
> > > into pinned user pages in the CMA area.
> > > 
> > > Pages residing in CMA areas can usually not get long-term pinned and
> > > are instead migrated away from the CMA area, so long-term pinning is
> > > typically not a concern. (BUGs in the kernel might still lead to
> > > long-term pinning of such pages if everything goes wrong.)
> > > 
> > > Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly
> > > be the source or destination of a pending DMA transfer.
> > > 
> > > Although there is no clear specification how long a page may be pinned
> > > without FOLL_LONGTERM, pinning without the flag shows an intent of the
> > > caller to only use the memory for short-lived DMA transfers, not a transfer
> > > initiated by a device asynchronously at a random time in the future.
> > > 
> > > Add a delay of CMA_DMA_TIMEOUT_SEC seconds before starting the kdump
> > > kernel, giving such short-lived DMA transfers time to finish before
> > > the CMA memory is re-used by the kdump kernel.
> > > 
> > > Set CMA_DMA_TIMEOUT_SEC to 10 seconds - chosen arbitrarily as both
> > > a huge margin for a DMA transfer, yet not increasing the kdump time
> > > too significantly.
> > 
> > Oh.  10s sounds a lot.  How long does this process typically take?
> > 
> > It's sad to add a 10s delay for something which some systems will never
> > do.  I wonder if there's some simple hack we can add.  Like having a
> > global flag which gets set the first time someone pins a CMA page

I have the same worry as Andrew. When a system has run off the rails, we
don't slam the brakes right away but wait 10 seconds first. Luckily, we
have warned people about the risk.

> 
> We would likely have to do that for any GUP on such a page (FOLL_GET |
> FOLL_PIN), both from gup-fast and gup-slow.

So such GUP-pinned pages may exist, but not always? This feature is opt-in
for users; could they also decide on or tune the waiting time?

My personal opinion: I will not suggest that people use it in RHEL, while
others should feel free to try it, as the risk has been pointed out.

> 
> Should work, but IMHO can be optimized later, on top of this series.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 



* Re: [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA
  2025-06-13  9:19     ` David Hildenbrand
  2025-06-14  2:41       ` Baoquan He
@ 2025-06-19 12:46       ` Jiri Bohac
  1 sibling, 0 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-06-19 12:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Baoquan He, Vivek Goyal, Dave Young, kexec,
	Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

On Fri, Jun 13, 2025 at 11:19:11AM +0200, David Hildenbrand wrote:
> > It's sad to add a 10s delay for something which some systems will never
> > do.  I wonder if there's some simple hack we can add.  Like having a
> > global flag which gets set the first time someone pins a CMA page
> 
> We would likely have to do that for any GUP on such a page (FOLL_GET |
> FOLL_PIN), both from gup-fast and gup-slow.
> 
> Should work, but IMHO can be optimized later, on top of this series.

The 10 s was David's suggestion during the discussion of v2 of
this patchset. We already had a discussion about both the length
of the delay and whether to make it configurable [1].

We agreed it was best to start with a longer fixed delay to be on the
safe side.

If the CMA reservation becomes popular and anybody complains
about the delay, then we can trivially make this configurable or
think of other improvements.

[1] https://lore.kernel.org/lkml/a1a5af90-bc8a-448a-81fa-485624d592f3@redhat.com/

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
                   ` (4 preceding siblings ...)
  2025-06-12 10:20 ` [PATCH v5 5/5] x86: implement crashkernel cma reservation Jiri Bohac
@ 2025-08-20 15:46 ` Breno Leitao
  2025-08-20 16:20   ` Jiri Bohac
  2025-10-03 15:51 ` Breno Leitao
  6 siblings, 1 reply; 25+ messages in thread
From: Breno Leitao @ 2025-08-20 15:46 UTC (permalink / raw)
  To: Jiri Bohac
  Cc: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm, Philipp Rudo,
	Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

Hello Jiri,

On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> The fifth patch enables the functionality for x86 as a proof of
> concept. There are just three things every arch needs to do:
> - call reserve_crashkernel_cma()
> - include the CMA-reserved ranges in the physical memory map
> - exclude the CMA-reserved ranges from the memory available
>   through /proc/vmcore by excluding them from the vmcoreinfo
>   PT_LOAD ranges.

First, thank you for making this change; it’s very helpful.
I haven’t come across anything regarding arm64 support. Is this on
anyone’s to-do list?

--breno


* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-08-20 15:46 ` [PATCH v5 0/5] kdump: crashkernel reservation from CMA Breno Leitao
@ 2025-08-20 16:20   ` Jiri Bohac
  2025-08-21  8:35     ` Breno Leitao
  0 siblings, 1 reply; 25+ messages in thread
From: Jiri Bohac @ 2025-08-20 16:20 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm, Philipp Rudo,
	Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

Hi Breno,

On Wed, Aug 20, 2025 at 08:46:54AM -0700, Breno Leitao wrote:
> Hello Jiri,
> 
> First, thank you for making this change; it’s very helpful.
> I haven’t come across anything regarding arm64 support. Is this on
> anyone’s to-do list?

Yes, I plan to implement this at least for ppc64, arm64 and s390x,
hopefully in time for 6.18.

Regards,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-08-20 16:20   ` Jiri Bohac
@ 2025-08-21  8:35     ` Breno Leitao
  2025-08-22 19:45       ` Jiri Bohac
  0 siblings, 1 reply; 25+ messages in thread
From: Breno Leitao @ 2025-08-21  8:35 UTC (permalink / raw)
  To: Jiri Bohac
  Cc: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm, Philipp Rudo,
	Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

Hello Jiri,

On Wed, Aug 20, 2025 at 06:20:13PM +0200, Jiri Bohac wrote:
> On Wed, Aug 20, 2025 at 08:46:54AM -0700, Breno Leitao wrote:
> > First, thank you for making this change; it’s very helpful.
> > I haven’t come across anything regarding arm64 support. Is this on
> > anyone’s to-do list?
> 
> Yes, I plan to implement this at least for ppc64, arm64 and s390x,
> hopefully in time for 6.18.

Thanks!

I have another question. I assume it’s not possible to allocate only the
CMA crashkernel area for the kdump kernel, since we need to keep the
loaded kernel in the crashkernel area while the system is running.
Therefore, specifying crashkernel=X (without ',cma') is necessary.

At the same time, since the crashdump environment will use CMA, the
crashkernel area itself doesn’t need to be very large, as the CMA space
will be allocated later.

With that in mind, how do I find what is the recommended size for the
crashkernel area, assuming the CMA area will be more than sufficient at
runtime?

Does it need to be much higher than the size of the kdump kernel and initrd?

Thanks
--breno


* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-08-21  8:35     ` Breno Leitao
@ 2025-08-22 19:45       ` Jiri Bohac
  0 siblings, 0 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-08-22 19:45 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Baoquan He, Vivek Goyal, Dave Young, kexec, akpm, Philipp Rudo,
	Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	David Hildenbrand, Michal Hocko

On Thu, Aug 21, 2025 at 01:35:35AM -0700, Breno Leitao wrote:
> I have another question. I assume it’s not possible to allocate only the
> CMA crashkernel area for the kdump kernel, since we need to keep the
> loaded kernel in the crashkernel area while the system is running.
> Therefore, specifying crashkernel=X (without ',cma') is necessary.

exactly

> At the same time, since the crashdump environment will use CMA, the
> crashkernel area itself doesn’t need to be very large, as the CMA space
> will be allocated later.
> 
> With that in mind, how do I find what is the recommended size for the
> crashkernel area, assuming the CMA area will be more than sufficient at
> runtime?
>
> Does it need to be much higher than the size of the kdump kernel and initrd?

I don't have a good answer now - I have this on my to-do list
and I need to investigate this more to come up with a formula to
calculate the required size to include this in the kdump tool I
maintain for SUSE. 

When testing the kernel part I used 
crashkernel=100M crashkernel=XXX,cma on a machine that would
normally require something like crashkernel=430M.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
                   ` (5 preceding siblings ...)
  2025-08-20 15:46 ` [PATCH v5 0/5] kdump: crashkernel reservation from CMA Breno Leitao
@ 2025-10-03 15:51 ` Breno Leitao
  2025-10-06  8:16   ` David Hildenbrand
  6 siblings, 1 reply; 25+ messages in thread
From: Breno Leitao @ 2025-10-03 15:51 UTC (permalink / raw)
  To: Jiri Bohac
  Cc: riel, vbabka, nphamcs, Baoquan He, Vivek Goyal, Dave Young, kexec,
	akpm, Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu,
	linux-kernel, David Hildenbrand, Michal Hocko

Hello Jiri,

On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:

> Currently this is only the case for memory ballooning and zswap. Such movable
> memory will be missing from the vmcore. User data is typically not dumped by
> makedumpfile.

For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
from the vmcore, or if there's a possibility they might be present but
corrupted—especially since they could reside in the CMA region, which may be
overwritten by the kdump environment.

My main question is: Do we need to explicitly teach makedumpfile to ignore the
CMA area in the vmcore, since it is already being overwritten and thus
unreliable? Or does makedumpfile already have mechanisms in place to
automatically ignore these special zswap/zsmalloc pages that may have been
overwritten if they were located in the CMA region?


* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-03 15:51 ` Breno Leitao
@ 2025-10-06  8:16   ` David Hildenbrand
  2025-10-06 16:25     ` Breno Leitao
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-10-06  8:16 UTC (permalink / raw)
  To: Breno Leitao, Jiri Bohac
  Cc: riel, vbabka, nphamcs, Baoquan He, Vivek Goyal, Dave Young, kexec,
	akpm, Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu,
	linux-kernel, Michal Hocko

On 03.10.25 17:51, Breno Leitao wrote:
> Hello Jiri,
> 
> On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> 
>> Currently this is only the case for memory ballooning and zswap. Such movable
>> memory will be missing from the vmcore. User data is typically not dumped by
>> makedumpfile.
> 
> For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> from the vmcore, or if there's a possibility they might be present but
> corrupted—especially since they could reside in the CMA region, which may be
> overwritten by the kdump environment.

That's not different to ordinary user pages residing on these areas, right?

-- 
Cheers

David / dhildenb



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-06  8:16   ` David Hildenbrand
@ 2025-10-06 16:25     ` Breno Leitao
  2025-10-06 16:45       ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Breno Leitao @ 2025-10-06 16:25 UTC (permalink / raw)
  To: David Hildenbrand, kas
  Cc: Jiri Bohac, riel, vbabka, nphamcs, Baoquan He, Vivek Goyal,
	Dave Young, kexec, akpm, Philipp Rudo, Donald Dutile, Pingfan Liu,
	Tao Liu, linux-kernel, Michal Hocko

On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
> On 03.10.25 17:51, Breno Leitao wrote:
> > Hello Jiri,
> > 
> > On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> > 
> > > Currently this is only the case for memory ballooning and zswap. Such movable
> > > memory will be missing from the vmcore. User data is typically not dumped by
> > > makedumpfile.
> > 
> > For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> > from the vmcore, or if there's a possibility they might be present but
> > corrupted—especially since they could reside in the CMA region, which may be
> > overwritten by the kdump environment.
> 
> That's not different to ordinary user pages residing on these areas, right?

Will zsmalloc on CMA pages be marked as "userpages"?

makedumpfile iterates over the PFNs and checks a few flags before
"copying" them to disk.

In makedumpfile, userpages are basically discarded if they are anonymous
pages:
	#define isAnon(mapping, flags, _mapcount) \
		(((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 && \
		 !isSlab(flags, _mapcount))

	https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164

	called from:
	https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671

For zsmalloc pages in the CMA, the struct page (for that PFN) still holds
the old state (from the first kernel), but the content has changed
(replaced by the kdump environment, the 2nd kernel).

So, whatever decision makedumpfile makes based on the PFN, it will dump
incorrect data, given that the page content no longer matches that
metadata.

If my understanding is valid, we don't want to dump any page at such
a PFN, because it will probably contain garbage.

That said, I see two options:

 1) Ignore the CMA area completely in makedumpfile.
    - I don't think there is any way to find that area today. The kernel
      might need to print the CMA region somewhere (/proc/iomem?)

 2) Given that most of the memory in CMA will be anonymous memory, and
    already discarded by other rules, just add an additional entry for
    zsmalloc pages.

    Talking to Kirill offline, it seems we can piggy back on MovableOps
    page flag.
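
Option 1 amounts to a range check on each PFN. A minimal sketch follows,
assuming the dump tool has somehow obtained the list of physical ranges to
skip (how it would get them is exactly the open question above) and that
those ranges are visible in the vmcore in the first place. struct phys_range
and pfn_is_excluded() are hypothetical names and not part of makedumpfile.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a physical range to skip, bounds inclusive. */
struct phys_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return true if any byte of the page with the given PFN lies inside one
 * of the excluded ranges. The caller is assumed to pass a PFN small enough
 * that "pfn << page_shift" does not overflow 64 bits.
 */
static bool pfn_is_excluded(uint64_t pfn, unsigned int page_shift,
			    const struct phys_range *ex, size_t n)
{
	uint64_t first, last;
	size_t i;

	assert(page_shift < 64);

	first = pfn << page_shift;			/* first byte of the page */
	last = first + (((uint64_t)1 << page_shift) - 1);	/* last byte of the page */

	for (i = 0; i < n; i++)
		if (first <= ex[i].end && last >= ex[i].start)
			return true;

	return false;
}
```

A linear scan is enough here on the assumption that only a handful of
ranges need to be skipped; a dump tool walking every PFN might prefer to
advance past a whole range at once.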


* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-06 16:25     ` Breno Leitao
@ 2025-10-06 16:45       ` David Hildenbrand
  2025-10-06 23:34         ` Tao Liu
  2025-10-07  3:55         ` Baoquan He
  0 siblings, 2 replies; 25+ messages in thread
From: David Hildenbrand @ 2025-10-06 16:45 UTC (permalink / raw)
  To: Breno Leitao, kas
  Cc: Jiri Bohac, riel, vbabka, nphamcs, Baoquan He, Vivek Goyal,
	Dave Young, kexec, akpm, Philipp Rudo, Donald Dutile, Pingfan Liu,
	Tao Liu, linux-kernel, Michal Hocko

On 06.10.25 18:25, Breno Leitao wrote:
> On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
>> On 03.10.25 17:51, Breno Leitao wrote:
>>> Hello Jiri,
>>>
>>> On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
>>>
>>>> Currently this is only the case for memory ballooning and zswap. Such movable
>>>> memory will be missing from the vmcore. User data is typically not dumped by
>>>> makedumpfile.
>>>
>>> For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
>>> from the vmcore, or if there's a possibility they might be present but
>>> corrupted—especially since they could reside in the CMA region, which may be
>>> overwritten by the kdump environment.
>>
>> That's not different to ordinary user pages residing on these areas, right?
> 
> Will zsmalloc on CMA pages be marked as "userpages"?

No, but they should have the zsmalloc page type set.

> 
> makedumpfile iterates over the PFNs and checks a few flags before
> "copying" them to disk.
> 
> In makedumpfile, userpages are basically discarded if they are anonymous
> pages:
> 	#define isAnon(mapping, flags, _mapcount) \
> 		(((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 && !isSlab(flags,
> 		_mapcount))
> 
> 	https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164
> 
> 	called from:
> 	https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671
> 
> For zsmalloc pages in the CMA, the struct page (for that PFN) still holds
> the old state (from the first kernel), but the content has changed
> (replaced by the kdump environment, the 2nd kernel).
> 
> So, whatever decision makedumpfile makes based on the PFN, it will dump
> incorrect data, given that the page content no longer matches that
> metadata.

Right.

> 
> If my understanding is valid, we don't want to dump any page at such
> a PFN, because it will probably contain garbage.

My theory is that barely anybody will go ahead and check compressed page 
content, but I agree. We should filter them out.

> 
> That said, I see two options:
> 
>   1) Ignore the CMA area completely in makedumpfile.
>      - I don't think there is any way to find that area today. The kernel
>        might need to print the CMA region somewhere (/proc/iomem?)

/proc/iomem in the new kernel should indicate the memory region as System
RAM (for the new kernel). That can simply be filtered out:
dumping memory of the new kernel does not make sense in any case.

> 
>   2) Given that most of the memory in CMA will be anonymous memory, and
>      already discarded by other rules, just add an additional entry for
>      zsmalloc pages.
> 
>      Talking to Kirill offline, it seems we can piggy back on MovableOps
>      page flag.

We should likely check the page type instead if we go down that path.

-- 
Cheers

David / dhildenb



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-06 16:45       ` David Hildenbrand
@ 2025-10-06 23:34         ` Tao Liu
  2025-10-07  3:55         ` Baoquan He
  1 sibling, 0 replies; 25+ messages in thread
From: Tao Liu @ 2025-10-06 23:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Breno Leitao, kas, Jiri Bohac, riel, vbabka, nphamcs, Baoquan He,
	Vivek Goyal, Dave Young, kexec, akpm, Philipp Rudo, Donald Dutile,
	Pingfan Liu, linux-kernel, Michal Hocko

Hi David,

On Tue, Oct 7, 2025 at 5:45 AM David Hildenbrand <dhildenb@redhat.com> wrote:
>
> On 06.10.25 18:25, Breno Leitao wrote:
> > On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
> >> On 03.10.25 17:51, Breno Leitao wrote:
> >>> Hello Jiri,
> >>>
> >>> On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> >>>
> >>>> Currently this is only the case for memory ballooning and zswap. Such movable
> >>>> memory will be missing from the vmcore. User data is typically not dumped by
> >>>> makedumpfile.
> >>>
> >>> For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> >>> from the vmcore, or if there's a possibility they might be present but
> >>> corrupted—especially since they could reside in the CMA region, which may be
> >>> overwritten by the kdump environment.
> >>
> >> That's not different to ordinary user pages residing on these areas, right?
> >
> > Will zsmalloc on CMA pages be marked as "userpages"?
>
> No, but they should have the zsmalloc page type set.
>
> >
> > makedumpfile iterates over the PFNs and checks a few flags before
> > "copying" them to disk.
> >
> > In makedumpfile, userpages are basically discarded if they are anonymous
> > pages:
> >       #define isAnon(mapping, flags, _mapcount) \
> >               (((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 && !isSlab(flags,
> >               _mapcount))
> >
> >       https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164
> >
> >       called from:
> >       https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671
> >
> > For zsmalloc pages in the CMA, the struct page (for that PFN) still holds
> > the old state (from the first kernel), but the content has changed
> > (replaced by the kdump environment, the 2nd kernel).
> >
> > So, whatever decision makedumpfile makes based on the PFN, it will dump
> > incorrect data, given that the page content no longer matches that
> > metadata.
>
> Right.
>
> >
> > If my understanding is valid, we don't want to dump any page at such
> > a PFN, because it will probably contain garbage.
>
> My theory is that barely anybody will go ahead and check compressed page
> content, but I agree. We should filter them out.
>
> >
> > That said, I see two options:
> >
> >   1) Ignore the CMA area completely in makedumpfile.
> >      - I don't think there is any way to find that area today. The kernel
> >        might need to print the CMA region somewhere (/proc/iomem?)
>
> /proc/iomem in the new kernel should indicate the memory region as System
> RAM (for the new kernel). That can simply be filtered out:
> dumping memory of the new kernel does not make sense in any case.
>
> >
> >   2) Given that most of the memory in CMA will be anonymous memory, and
> >      already discarded by other rules, just add an additional entry for
> >      zsmalloc pages.
> >
> >      Talking to Kirill offline, it seems we can piggy back on MovableOps
> >      page flag.
>
> We should likely check the page type instead if we go down that path.

If choosing a proper page type/flag is hard, maybe an ongoing new
feature for makedumpfile can help with that. In short, if we can get a
workable page flag to filter CMA pages, then we proceed as usual. If
not, we can use eppic/btf/kallsyms[1] in makedumpfile to
programmatically determine the page type and filter it out. See the
program for determining amdgpu's mm pages[2].

[1]: https://lore.kernel.org/kexec/20250610095743.18073-1-ltao@redhat.com/T/#m901bf1413b844648c86e8a84d75b66d0531b9f92
[2]: https://lore.kernel.org/kexec/20250610095743.18073-1-ltao@redhat.com/T/#m38362d258e3b0bdc14a64e54a6acd5b85810ca26

Cheers,
Tao Liu

>
> --
> Cheers
>
> David / dhildenb
>



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-06 16:45       ` David Hildenbrand
  2025-10-06 23:34         ` Tao Liu
@ 2025-10-07  3:55         ` Baoquan He
  2025-10-07  9:11           ` Jiri Bohac
  2025-10-08 10:42           ` Breno Leitao
  1 sibling, 2 replies; 25+ messages in thread
From: Baoquan He @ 2025-10-07  3:55 UTC (permalink / raw)
  To: David Hildenbrand, Breno Leitao
  Cc: kas, Jiri Bohac, riel, vbabka, nphamcs, Vivek Goyal, Dave Young,
	kexec, akpm, Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu,
	linux-kernel, Michal Hocko

On 10/06/25 at 06:45pm, David Hildenbrand wrote:
> On 06.10.25 18:25, Breno Leitao wrote:
> > On Mon, Oct 06, 2025 at 10:16:26AM +0200, David Hildenbrand wrote:
> > > On 03.10.25 17:51, Breno Leitao wrote:
> > > > Hello Jiri,
> > > > 
> > > > On Thu, Jun 12, 2025 at 12:11:19PM +0200, Jiri Bohac wrote:
> > > > 
> > > > > Currently this is only the case for memory ballooning and zswap. Such movable
> > > > > memory will be missing from the vmcore. User data is typically not dumped by
> > > > > makedumpfile.
> > > > 
> > > > For zswap and zsmalloc pages, I'm wondering whether these pages will be missing
> > > > from the vmcore, or if there's a possibility they might be present but
> > > > corrupted—especially since they could reside in the CMA region, which may be
> > > > overwritten by the kdump environment.
> > > 
> > > That's not different to ordinary user pages residing on these areas, right?
> > 
> > Will zsmalloc on CMA pages be marked as "userpages"?
> 
> No, but they should have the zsmalloc page type set.
> 
> > 
> > makedumpfile iterates over the PFNs and checks a few flags before
> > "copying" them to disk.
> > 
> > In makedumpfile, userpages are basically discarded if they are anonymous
> > pages:
> > 	#define isAnon(mapping, flags, _mapcount) \
> > 		(((unsigned long)mapping & PAGE_MAPPING_ANON) != 0 && !isSlab(flags,
> > 		_mapcount))
> > 
> > 	https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.h#L164
> > 
> > 	called from:
> > 	https://github.com/makedumpfile/makedumpfile/blob/master/makedumpfile.c#L6671
> > 
> > For zsmalloc pages in the CMA, the struct page (for that PFN) still holds
> > the old state (from the first kernel), but the content has changed
> > (replaced by the kdump environment, the 2nd kernel).
> > 
> > So, whatever decision makedumpfile makes based on the PFN, it will dump
> > incorrect data, given that the page content no longer matches that
> > metadata.
> 
> Right.
> 
> > 
> > If my understanding is valid, we don't want to dump any page at such
> > a PFN, because it will probably contain garbage.
> 
> My theory is that barely anybody will go ahead and check compressed page
> content, but I agree. We should filter them out.
> 
> > 
> > That said, I see two options:
> > 
> >   1) Ignore the CMA area completely in makedumpfile.
> >      - I don't think there is any way to find that area today. The kernel
> >        might need to print the CMA region somewhere (/proc/iomem?)
> 
> /proc/iomem in the new kernel should indicate the memory region as System RAM
> (for the new kernel). That can simply be filtered out: dumping
> memory of the new kernel does not make sense in any case.

Agree.

And I saw that Jiri has excluded the crashk_cma_ranges[] from the dumped
content via elf_header_exclude_ranges(). Have you encountered a real
problem with the dump, or are you just worried about it?

> 
> > 
> >   2) Given that most of the memory in CMA will be anonymous memory, and
> >      already discarded by other rules, just add an additional entry for
> >      zsmalloc pages.
> > 
> >      Talking to Kirill offline, it seems we can piggy back on MovableOps
> >      page flag.
> 
> We should likely check the page type instead if we go down that path.

As for the pages in CMA outside of crashk_cma_ranges[], it is true that
zsmalloc/zswap pages hold anon memory and can be discarded. I am wondering
whether any driver or kernel pages reside in CMA and are worth dumping.
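
To make the filtering question concrete, here is a rough user-space sketch
of the kind of per-page decision being discussed: skip anonymous pages (as
the isAnon() macro quoted above does) and, additionally, skip zsmalloc
pages. This is not makedumpfile code. struct page_info, the is_zsmalloc
field and should_skip_page() are hypothetical names for illustration, and
the value of PAGE_MAPPING_ANON is an assumption.

```c
#include <stdbool.h>

/* Assumed value: low bit of page->mapping marks anonymous memory. */
#define PAGE_MAPPING_ANON 0x1UL

/* Hypothetical, simplified view of what a dump filter knows per pfn. */
struct page_info {
	unsigned long mapping;	/* raw page->mapping value */
	bool is_slab;		/* page is a slab page */
	bool is_zsmalloc;	/* hypothetical: page type says zsmalloc */
};

/* Same test as the quoted isAnon() macro, on the simplified struct. */
static bool is_anon(const struct page_info *p)
{
	return (p->mapping & PAGE_MAPPING_ANON) != 0 && !p->is_slab;
}

/* Returns true if the page should be left out of the dump. */
bool should_skip_page(const struct page_info *p)
{
	return is_anon(p) || p->is_zsmalloc;
}
```

How a real filter would detect a zsmalloc page (page type versus a page
flag) is exactly the open point above, so the boolean here only stands in
for that check.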



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-07  3:55         ` Baoquan He
@ 2025-10-07  9:11           ` Jiri Bohac
  2025-10-08 10:42           ` Breno Leitao
  1 sibling, 0 replies; 25+ messages in thread
From: Jiri Bohac @ 2025-10-07  9:11 UTC (permalink / raw)
  To: Baoquan He
  Cc: David Hildenbrand, Breno Leitao, kas, riel, vbabka, nphamcs,
	Vivek Goyal, Dave Young, kexec, akpm, Philipp Rudo, Donald Dutile,
	Pingfan Liu, Tao Liu, linux-kernel, Michal Hocko

On Tue, Oct 07, 2025 at 11:55:36AM +0800, Baoquan He wrote:
> And I saw Jiri has excluded the crashk_cma_ranges[] from the dumped
> content via elf_header_exclude_ranges(). 

Exactly, thanks for pointing this out, while I was away from my e-mail.

The crashkernel CMA reservation ranges will not be seen at all by
makedumpfile.
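
As an illustration of what excluding a range means here, the following is a
minimal user-space sketch of cutting one range out of a list of memory
ranges, trimming or splitting entries as needed. It is only a model of the
idea, not the kernel implementation. struct range and exclude_range() are
hypothetical names, and range ends are assumed to be inclusive.

```c
#include <stddef.h>

/* Hypothetical model of a memory range; 'end' is inclusive. */
struct range {
	unsigned long long start;
	unsigned long long end;
};

/*
 * Remove [ex_start, ex_end] from the sorted, non-overlapping ranges in
 * r[0..*n-1]. 'cap' is the capacity of the array. Returns 0 on success,
 * or -1 if a split would need more room than 'cap' allows.
 */
int exclude_range(struct range *r, size_t *n, size_t cap,
		  unsigned long long ex_start, unsigned long long ex_end)
{
	size_t i = 0;

	while (i < *n) {
		unsigned long long s = r[i].start, e = r[i].end;

		if (ex_end < s || ex_start > e) {
			i++;			/* no overlap */
			continue;
		}
		if (ex_start <= s && ex_end >= e) {
			/* Entry fully covered: delete it. */
			for (size_t j = i; j + 1 < *n; j++)
				r[j] = r[j + 1];
			(*n)--;
			continue;
		}
		if (ex_start > s && ex_end < e) {
			/* Hole in the middle: split into two entries. */
			if (*n == cap)
				return -1;
			for (size_t j = *n; j > i + 1; j--)
				r[j] = r[j - 1];
			r[i].end = ex_start - 1;
			r[i + 1].start = ex_end + 1;
			r[i + 1].end = e;
			(*n)++;
			i += 2;
			continue;
		}
		if (ex_start <= s)
			r[i].start = ex_end + 1;	/* trim the head */
		else
			r[i].end = ex_start - 1;	/* trim the tail */
		i++;
	}
	return 0;
}
```

For example, excluding 0x3000-0x4fff from a single range 0x1000-0x8fff
leaves two ranges, 0x1000-0x2fff and 0x5000-0x8fff, so a dump tool walking
the resulting list never visits the excluded area.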

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia



* Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-07  3:55         ` Baoquan He
  2025-10-07  9:11           ` Jiri Bohac
@ 2025-10-08 10:42           ` Breno Leitao
  2025-10-13  4:03             ` [External] " Zhongkun He
  1 sibling, 1 reply; 25+ messages in thread
From: Breno Leitao @ 2025-10-08 10:42 UTC (permalink / raw)
  To: Baoquan He
  Cc: David Hildenbrand, kas, Jiri Bohac, riel, vbabka, nphamcs,
	Vivek Goyal, Dave Young, kexec, akpm, Philipp Rudo, Donald Dutile,
	Pingfan Liu, Tao Liu, linux-kernel, Michal Hocko

On Tue, Oct 07, 2025 at 11:55:36AM +0800, Baoquan He wrote:
> On 10/06/25 at 06:45pm, David Hildenbrand wrote:
>
> Have you encountered a real problem about the dumping, or you are just
> worried about it?

I haven't encountered any issues so far, and I already have a set of
machines running with this configuration.

I'm planning to roll out this feature to a larger group of servers, so
I'm currently performing due diligence.

Thanks!
--breno


* Re: [External] Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-08 10:42           ` Breno Leitao
@ 2025-10-13  4:03             ` Zhongkun He
  2025-10-13  8:00               ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: Zhongkun He @ 2025-10-13  4:03 UTC (permalink / raw)
  To: jbohac, Baoquan He, David Hildenbrand
  Cc: kas, riel, vbabka, nphamcs, Vivek Goyal, Dave Young, kexec, akpm,
	Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	Michal Hocko, Muchun Song

Hi folks,

We’re encountering the same issue that this patch series aims to address,
and we also planned to leverage CMA to solve it. However, some implementation
details on our side differ, so we’d like to discuss the approach we have
tried in this thread.

1. Register a dedicated CMA area for kexec kernel use
Introduce a dedicated CMA region (e.g., kexec_cma) and allocate the control
code page and crash segments from this area via cma_alloc. Pages for a
normal kexec kernel can also be allocated from this region [1].

2. Keep crashkernel=xx unchanged (register CMA)
We introduced a flag in the kexec syscall to dynamically enable
or disable memory reuse without a system reboot. For example, with
crashkernel=500M (a 500M CMA region), cma_alloc may use 100M for the
kernel, initrd and other data. This region could then be reused by the current
kernel if the reuse flag is set in the syscall, or left unused for dumping user
pages in case of a crash.

3. Keep this CMA region inactive by default
The CMA region should remain inactive until kexec is enabled with the reuse flag
(or fully reused when the kdump service is not enabled). It can then be
activated for use by the current kernel.

4. Introduce a new migratetype KEXEC_CMA
Similar to the existing CMA type, this would be used to:
1) Prevent page allocation from this zone for get_user_pages (GUP).
2) Handle page migration correctly when pages are pinned after allocation.

We would like to discuss the feasibility and potential implications of
this approach with the community.

[1]:https://lore.kernel.org/all/20250610085327.51817-1-graf@amazon.com/

Best regards,
Zhongkun He


* Re: [External] Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-13  4:03             ` [External] " Zhongkun He
@ 2025-10-13  8:00               ` David Hildenbrand
  2025-10-14  7:36                 ` Zhongkun He
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2025-10-13  8:00 UTC (permalink / raw)
  To: Zhongkun He, jbohac, Baoquan He
  Cc: kas, riel, vbabka, nphamcs, Vivek Goyal, Dave Young, kexec, akpm,
	Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
	Michal Hocko, Muchun Song

On 13.10.25 06:03, Zhongkun He wrote:
> Hi folks,
> 
> We’re encountering the same issue that this patch series aims to address,
> and we also planned to leverage CMA to solve it. However, some implementation
> details on our side may differ, so we’d like to discuss our proposed approach we
> have tried in this thread.
> 
> 1. Register a dedicated CMA area for kexec kernel use
> Introduce a dedicated CMA region (e.g., kexec_cma) and allocate the control
> code page and crash segments from this area via cma_alloc. Pages for a
> normal kexec kernel can also be allocated from this region [1].
> 
> 2. Keep crashkernel=xx unchanged (register CMA)
> We introduced a flag in the kexec syscall to dynamically enable
> or disable memory reuse without system reboot. For example, with
> crashkernel=500M (a 500M cma region), cma_alloc may use 100M for the
> kernel,initrd and others data. This region could then be reused for the current
> kernel if the reuse flag is set in the syscall, or left unused for dumping user
> pages in case of a crash.
> 
> 3. Keep this CMA region inactive by default
> The CMA region should remain inactive until kexec is enabled with the reuse flag
> (or fully reused when the kdump service is not enabled). It can then be
> activated for use by the current kernel.
> 
> 4. Introduce a new migratetype KEXEC_CMA
> Similar to the existing CMA type, this would be used to:
> 1)Prevent page allocation from this zone for get_user_pages (GUP).
> 2)Handle page migration correctly when pages are pinned after allocation.

It will be hard to get something like that in for the purpose of kdump. 
Further, I'm afraid it might open up a can of worms of "migration 
temporarily failed" -> GUP failed issues for some workloads.

So I assume this might currently not be the best way to move forward.

One alternative would be using GCMA [1] in the current design. The CMA 
memory would not be exposed to the buddy, but can still be used as a 
cache for clean file pages. Pinning etc. is not a problem in that context.

Of course, the more we limit the usage of that region, the less 
versatile it is.

[1] https://lkml.kernel.org/r/20251010011951.2136980-1-surenb@google.com

-- 
Cheers

David / dhildenb



* Re: [External] Re: [PATCH v5 0/5] kdump: crashkernel reservation from CMA
  2025-10-13  8:00               ` David Hildenbrand
@ 2025-10-14  7:36                 ` Zhongkun He
  0 siblings, 0 replies; 25+ messages in thread
From: Zhongkun He @ 2025-10-14  7:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: jbohac, Baoquan He, kas, riel, vbabka, nphamcs, Vivek Goyal,
	Dave Young, kexec, akpm, Philipp Rudo, Donald Dutile, Pingfan Liu,
	Tao Liu, linux-kernel, Michal Hocko, Muchun Song

On Mon, Oct 13, 2025 at 4:01 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.10.25 06:03, Zhongkun He wrote:
> > Hi folks,
> >
> > We’re encountering the same issue that this patch series aims to address,
> > and we also planned to leverage CMA to solve it. However, some implementation
> > details on our side may differ, so we’d like to discuss our proposed approach we
> > have tried in this thread.
> >
> > 1. Register a dedicated CMA area for kexec kernel use
> > Introduce a dedicated CMA region (e.g., kexec_cma) and allocate the control
> > code page and crash segments from this area via cma_alloc. Pages for a
> > normal kexec kernel can also be allocated from this region [1].
> >
> > 2. Keep crashkernel=xx unchanged (register CMA)
> > We introduced a flag in the kexec syscall to dynamically enable
> > or disable memory reuse without system reboot. For example, with
> > crashkernel=500M (a 500M cma region), cma_alloc may use 100M for the
> > kernel,initrd and others data. This region could then be reused for the current
> > kernel if the reuse flag is set in the syscall, or left unused for dumping user
> > pages in case of a crash.
> >
> > 3. Keep this CMA region inactive by default
> > The CMA region should remain inactive until kexec is enabled with the reuse flag
> > (or fully reused when the kdump service is not enabled). It can then be
> > activated for use by the current kernel.
> >
> > 4. Introduce a new migratetype KEXEC_CMA
> > Similar to the existing CMA type, this would be used to:
> > 1)Prevent page allocation from this zone for get_user_pages (GUP).
> > 2)Handle page migration correctly when pages are pinned after allocation.
>

Hi David,

> It will be hard to get something like that in for the purpose of kdump.
> Further, I'm afraid it might open up a can of worms of "migration
> temporarily failed" -> GUP failed issues for some workloads.
>
> So I assume this might currently not be the best way to move forward.

Got it, thanks.

>
> One alternative would be using GCMA [1] in the current design. The CMA
> memory would not be exposed to the buddy, but can still be used as a
> cache for clean file pages. Pinning etc. is not a problem in that context.
>
> Of course, the more we limit the usage of that region, the less
> versatile it is.

Yes, I agree.  I’ve already noticed GCMA, and I hope it can be merged upstream
in a more lightweight manner.

>
> [1] https://lkml.kernel.org/r/20251010011951.2136980-1-surenb@google.com
>
> --
> Cheers
>
> David / dhildenb
>


Thread overview: 25+ messages
2025-06-12 10:11 [PATCH v5 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
2025-06-12 10:13 ` [PATCH v5 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
2025-06-12 10:16 ` [PATCH v5 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
2025-06-12 10:17 ` [PATCH v5 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
2025-06-12 10:18 ` [PATCH v5 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
2025-06-12 23:47   ` Andrew Morton
2025-06-13  9:19     ` David Hildenbrand
2025-06-14  2:41       ` Baoquan He
2025-06-19 12:46       ` Jiri Bohac
2025-06-12 10:20 ` [PATCH v5 5/5] x86: implement crashkernel cma reservation Jiri Bohac
2025-08-20 15:46 ` [PATCH v5 0/5] kdump: crashkernel reservation from CMA Breno Leitao
2025-08-20 16:20   ` Jiri Bohac
2025-08-21  8:35     ` Breno Leitao
2025-08-22 19:45       ` Jiri Bohac
2025-10-03 15:51 ` Breno Leitao
2025-10-06  8:16   ` David Hildenbrand
2025-10-06 16:25     ` Breno Leitao
2025-10-06 16:45       ` David Hildenbrand
2025-10-06 23:34         ` Tao Liu
2025-10-07  3:55         ` Baoquan He
2025-10-07  9:11           ` Jiri Bohac
2025-10-08 10:42           ` Breno Leitao
2025-10-13  4:03             ` [External] " Zhongkun He
2025-10-13  8:00               ` David Hildenbrand
2025-10-14  7:36                 ` Zhongkun He
