* [PATCH v2 0/5] kdump: crashkernel reservation from CMA
@ 2025-02-20 16:48 Jiri Bohac
2025-02-20 16:51 ` [PATCH v2 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
` (6 more replies)
0 siblings, 7 replies; 30+ messages in thread
From: Jiri Bohac @ 2025-02-20 16:48 UTC (permalink / raw)
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Hi,
this series implements a way to reserve additional crash kernel
memory using CMA.
Link to the v1 discussion:
https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
See below for the changes since v1 and how concerns from the
discussion have been addressed.
Currently, none of the memory reserved for the crash kernel is
usable by the 1st (production) kernel. It is also unmapped so that
it can't be corrupted by the fault that will eventually trigger the crash.
This makes sense for the memory actually used by the kexec-loaded
crash kernel image and initrd, and the data prepared during the
load (vmcoreinfo, ...). However, the reserved space needs to be
much larger than that to provide enough run-time memory for the
crash kernel and the kdump userspace. Estimating the amount of
memory to reserve is difficult. Being too careful makes kdump
likely to end in OOM, being too generous takes even more memory
from the production system. Also, the reservation only allows
reserving a single contiguous block (or two with the "low"
suffix). I've seen systems where this fails because the physical
memory is fragmented.
By reserving additional crashkernel memory from CMA, the main
crashkernel reservation can be just large enough to fit the
kernel and initrd image, minimizing the memory taken away from
the production system. Most of the run-time memory for the crash
kernel will be memory previously available to userspace in the
production system. As this memory is no longer wasted, the
reservation can be done with a generous margin, making kdump more
reliable. Kernel memory that we need to preserve for dumping is
never allocated from CMA. User data is typically not dumped by
makedumpfile. When dumping of user data is intended, this new CMA
reservation cannot be used.
There are five patches in this series:
The first adds a new ",cma" suffix to the recently introduced generic
crashkernel parsing code. parse_crashkernel() takes one more
argument to store the CMA reservation size.
The second patch implements reserve_crashkernel_cma() which
performs the reservation. If the requested size is not available
in a single range, multiple smaller ranges will be reserved.
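The fallback strategy can be sketched in Python (an illustration of the approach in this series, not the kernel code; `try_alloc` stands in for `cma_declare_contiguous()`):

```python
PAGE_SIZE = 4096
MAX_RANGES = 4  # mirrors CRASHKERNEL_CMA_RANGES_MAX


def roundup(x, align):
    # round x up to the next multiple of align
    return -(-x // align) * align


def reserve_cma(total_size, try_alloc):
    """Reserve total_size bytes via try_alloc(size) -> base or None.

    Sketch of the reserve_crashkernel_cma() strategy: request the
    remaining size, and on failure halve the request until a block
    fits or the request drops below one page.
    """
    request = roundup(total_size, PAGE_SIZE)
    reserved = 0
    ranges = []
    while reserved < total_size and len(ranges) < MAX_RANGES:
        base = try_alloc(request)
        if base is None:
            if request <= PAGE_SIZE:
                break  # give up; a partial reservation is reported
            request = roundup(request // 2, PAGE_SIZE)
            continue
        ranges.append((base, base + request - 1))
        reserved += request
    return reserved, ranges
```

For example, with an allocator that cannot find more than 512 MB of contiguous memory, a 1 GB request ends up as two 512 MB ranges.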
The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.
The fourth patch adds a short delay before booting the kdump
kernel, allowing pending DMA transfers to finish.
The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
through /proc/vmcore by excluding them from the vmcoreinfo
PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as
this series is merged.
With this series applied, specifying
crashkernel=100M crashkernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.
An additional 1G will be reserved from CMA, still usable by the
production system. The crash kernel will have 1.1G memory
available. The 100M can be reliably predicted based on the size
of the kernel and initrd.
The new cma suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.
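For illustration, enabling this on a typical grub-based system might look like the following (the config path is an assumption and varies by distribution; the log message comes from the pr_info() added in patch 2/5):

```shell
# /etc/default/grub (path varies by distribution)
GRUB_CMDLINE_LINUX="crashkernel=100M crashkernel=1G,cma"

# after regenerating the grub config and rebooting,
# the CMA reservation should show up in the kernel log:
dmesg | grep -i "crashkernel CMA"
```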
---
Changes since v1:
The key concern raised in the v1 discussion was that pages in the
CMA region may be pinned and used for a DMA transfer, potentially
corrupting the new kernel's memory. When the cma suffix is used, kdump
may be less reliable and the corruption hard to debug.
This v2 series addresses this concern in two ways:
1) Clearly stating the potential problem in the updated
Documentation and setting the expectation (patch 3/5)
Documentation now explicitly states that:
- the risk of kdump failure is increased
- the CMA reservation is intended for users who cannot or do not
want to sacrifice enough memory for a standard crashkernel reservation
and who prefer less reliable kdump to no kdump at all
This is consistent with the documentation of the
crash_kexec_post_notifiers option, which can also increase the
risk of kdump failure, yet may be the only way to use kdump on
some systems. And just like the crash_kexec_post_notifiers
option, the cma crashkernel suffix is completely optional:
the series has zero effect when the suffix is not used.
2) Giving DMA time to finish before booting the kdump kernel
(patch 4/5)
Pages can be pinned for long-term use with the FOLL_LONGTERM
flag; such pages are migrated outside the CMA region. Pinning
without this flag signals that the caller intends to use the pages
only for short-lived DMA transfers.
Delay the boot of the kdump kernel when the CMA reservation is
used, giving potential pending DMA transfers time to finish.
Other minor changes since v1:
- updated for 6.14-rc2
- moved #ifdefs and #defines to header files
- added __always_unused in parse_crashkernel() to silence a false
unused variable warning
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
@ 2025-02-20 16:51 ` Jiri Bohac
2025-03-03 1:51 ` Baoquan He
2025-02-20 16:52 ` [PATCH v2 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
` (5 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: Jiri Bohac @ 2025-02-20 16:51 UTC (permalink / raw)
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Add a new cma_size parameter to parse_crashkernel().
When not NULL, call __parse_crashkernel to parse the CMA
reservation size from "crashkernel=size,cma" and store it
in cma_size.
Set cma_size to NULL in all calls to parse_crashkernel().
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
arch/arm/kernel/setup.c | 2 +-
arch/arm64/mm/init.c | 2 +-
arch/loongarch/kernel/setup.c | 2 +-
arch/mips/kernel/setup.c | 2 +-
arch/powerpc/kernel/fadump.c | 2 +-
arch/powerpc/kexec/core.c | 2 +-
arch/powerpc/mm/nohash/kaslr_booke.c | 2 +-
arch/riscv/mm/init.c | 2 +-
arch/s390/kernel/setup.c | 2 +-
arch/sh/kernel/machine_kexec.c | 2 +-
arch/x86/kernel/setup.c | 2 +-
include/linux/crash_reserve.h | 3 ++-
kernel/crash_reserve.c | 22 +++++++++++++++++-----
13 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index a41c93988d2c..0bfd66c7ada0 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1004,7 +1004,7 @@ static void __init reserve_crashkernel(void)
total_mem = get_total_mem();
ret = parse_crashkernel(boot_command_line, total_mem,
&crash_size, &crash_base,
- NULL, NULL);
+ NULL, NULL, NULL);
/* invalid value specified or crashkernel=0 */
if (ret || !crash_size)
return;
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 9c0b8d9558fc..06bf216a4b0d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -107,7 +107,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;
diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
index edcfdfcad7d2..ffdfb5407043 100644
--- a/arch/loongarch/kernel/setup.c
+++ b/arch/loongarch/kernel/setup.c
@@ -266,7 +266,7 @@ static void __init arch_reserve_crashkernel(void)
return;
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
- &crash_size, &crash_base, &low_size, &high);
+ &crash_size, &crash_base, &low_size, NULL, &high);
if (ret)
return;
diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
index fbfe0771317e..11b9b6b63e19 100644
--- a/arch/mips/kernel/setup.c
+++ b/arch/mips/kernel/setup.c
@@ -458,7 +458,7 @@ static void __init mips_parse_crashkernel(void)
total_mem = memblock_phys_mem_size();
ret = parse_crashkernel(boot_command_line, total_mem,
&crash_size, &crash_base,
- NULL, NULL);
+ NULL, NULL, NULL);
if (ret != 0 || crash_size <= 0)
return;
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 4b371c738213..f90aaa2263aa 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -334,7 +334,7 @@ static __init u64 fadump_calculate_reserve_size(void)
* memory at a predefined offset.
*/
ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
- &size, &base, NULL, NULL);
+ &size, &base, NULL, NULL, NULL);
if (ret == 0 && size > 0) {
unsigned long max_size;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 58a930a47422..35f92427d282 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -66,7 +66,7 @@ void __init reserve_crashkernel(void)
total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
/* use common parsing */
ret = parse_crashkernel(boot_command_line, total_mem_sz,
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c b/arch/powerpc/mm/nohash/kaslr_booke.c
index 5c8d1bb98b3e..5e4897daaaea 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -178,7 +178,7 @@ static void __init get_crash_kernel(void *fdt, unsigned long size)
int ret;
ret = parse_crashkernel(boot_command_line, size, &crash_size,
- &crash_base, NULL, NULL);
+ &crash_base, NULL, NULL, NULL);
if (ret != 0 || crash_size == 0)
return;
if (crash_base == 0)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 15b2eda4c364..9634a800629b 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1405,7 +1405,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index d78bcfe707b5..4d9b5b5d0cb2 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -607,7 +607,7 @@ static void __init reserve_crashkernel(void)
int rc;
rc = parse_crashkernel(boot_command_line, ident_map_size,
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
crash_base = ALIGN(crash_base, KEXEC_CRASH_MEM_ALIGN);
crash_size = ALIGN(crash_size, KEXEC_CRASH_MEM_ALIGN);
diff --git a/arch/sh/kernel/machine_kexec.c b/arch/sh/kernel/machine_kexec.c
index 8321b31d2e19..37073ca1e0ad 100644
--- a/arch/sh/kernel/machine_kexec.c
+++ b/arch/sh/kernel/machine_kexec.c
@@ -146,7 +146,7 @@ void __init reserve_crashkernel(void)
return;
ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index cebee310e200..853afde761a7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -481,7 +481,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;
diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index 5a9df944fb80..a681f265a361 100644
--- a/include/linux/crash_reserve.h
+++ b/include/linux/crash_reserve.h
@@ -16,7 +16,8 @@ extern struct resource crashk_low_res;
int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base,
- unsigned long long *low_size, bool *high);
+ unsigned long long *low_size, unsigned long long *cma_size,
+ bool *high);
#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
#ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index a620fb4b2116..e72a9c897694 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -172,17 +172,19 @@ static int __init parse_crashkernel_simple(char *cmdline,
#define SUFFIX_HIGH 0
#define SUFFIX_LOW 1
-#define SUFFIX_NULL 2
+#define SUFFIX_CMA 2
+#define SUFFIX_NULL 3
static __initdata char *suffix_tbl[] = {
- [SUFFIX_HIGH] = ",high",
- [SUFFIX_LOW] = ",low",
- [SUFFIX_NULL] = NULL,
+ [SUFFIX_HIGH] = ",high",
+ [SUFFIX_LOW] = ",low",
+ [SUFFIX_CMA] = ",cma",
+ [SUFFIX_NULL] = NULL,
};
/*
* That function parses "suffix" crashkernel command lines like
*
- * crashkernel=size,[high|low]
+ * crashkernel=size,[high|low|cma]
*
* It returns 0 on success and -EINVAL on failure.
*/
@@ -298,9 +300,11 @@ int __init parse_crashkernel(char *cmdline,
unsigned long long *crash_size,
unsigned long long *crash_base,
unsigned long long *low_size,
+ unsigned long long *cma_size,
bool *high)
{
int ret;
+ unsigned long long __always_unused cma_base;
/* crashkernel=X[@offset] */
ret = __parse_crashkernel(cmdline, system_ram, crash_size,
@@ -331,6 +335,14 @@ int __init parse_crashkernel(char *cmdline,
*high = true;
}
+
+ /*
+ * optional CMA reservation
+ * cma_base is ignored
+ */
+ if (cma_size)
+ __parse_crashkernel(cmdline, 0, cma_size,
+ &cma_base, suffix_tbl[SUFFIX_CMA]);
#endif
if (!*crash_size)
ret = -EINVAL;
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v2 2/5] kdump: implement reserve_crashkernel_cma
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
2025-02-20 16:51 ` [PATCH v2 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
@ 2025-02-20 16:52 ` Jiri Bohac
2025-02-20 16:54 ` [PATCH v2 3/5] kdump, documentation: describe crashkernel CMA reservation Jiri Bohac
` (4 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: Jiri Bohac @ 2025-02-20 16:52 UTC (permalink / raw)
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
reserve_crashkernel_cma() reserves CMA ranges for the
crash kernel. If allocating the requested size fails,
try to reserve in smaller blocks.
Store the reserved ranges in the crashk_cma_ranges array
and the number of ranges in crashk_cma_cnt.
---
include/linux/crash_reserve.h | 12 +++++++++
kernel/crash_reserve.c | 49 +++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)
diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index a681f265a361..97964f2a583d 100644
--- a/include/linux/crash_reserve.h
+++ b/include/linux/crash_reserve.h
@@ -13,12 +13,24 @@
*/
extern struct resource crashk_res;
extern struct resource crashk_low_res;
+extern struct range crashk_cma_ranges[];
+#if defined(CONFIG_CMA) && defined(CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION)
+#define CRASHKERNEL_CMA
+#define CRASHKERNEL_CMA_RANGES_MAX 4
+extern int crashk_cma_cnt;
+#else
+#define crashk_cma_cnt 0
+#define CRASHKERNEL_CMA_RANGES_MAX 0
+#endif
+
int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base,
unsigned long long *low_size, unsigned long long *cma_size,
bool *high);
+void __init reserve_crashkernel_cma(unsigned long long cma_size);
+
#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
#ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
#define DEFAULT_CRASH_KERNEL_LOW_SIZE (128UL << 20)
diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index e72a9c897694..d71aff19a28d 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -14,6 +14,8 @@
#include <linux/cpuhotplug.h>
#include <linux/memblock.h>
#include <linux/kmemleak.h>
+#include <linux/cma.h>
+#include <linux/crash_reserve.h>
#include <asm/page.h>
#include <asm/sections.h>
@@ -470,6 +472,53 @@ void __init reserve_crashkernel_generic(char *cmdline,
#endif
}
+struct range crashk_cma_ranges[CRASHKERNEL_CMA_RANGES_MAX];
+#ifdef CRASHKERNEL_CMA
+int crashk_cma_cnt = 0;
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+ unsigned long long request_size = roundup(cma_size, PAGE_SIZE);
+ unsigned long long reserved_size = 0;
+
+ while (cma_size > reserved_size &&
+ crashk_cma_cnt < CRASHKERNEL_CMA_RANGES_MAX) {
+
+ struct cma *res;
+
+ if (cma_declare_contiguous(0, request_size, 0, 0, 0, false,
+ "crashkernel", &res)) {
+ /* reservation failed, try half-sized blocks */
+ if (request_size <= PAGE_SIZE)
+ break;
+
+ request_size = roundup(request_size / 2, PAGE_SIZE);
+ continue;
+ }
+
+ crashk_cma_ranges[crashk_cma_cnt].start = cma_get_base(res);
+ crashk_cma_ranges[crashk_cma_cnt].end =
+ crashk_cma_ranges[crashk_cma_cnt].start +
+ cma_get_size(res) - 1;
+ ++crashk_cma_cnt;
+ reserved_size += request_size;
+ }
+
+ if (cma_size > reserved_size)
+ pr_warn("crashkernel CMA reservation failed: %lld MB requested, %lld MB reserved in %d ranges\n",
+ cma_size >> 20, reserved_size >> 20, crashk_cma_cnt);
+ else
+ pr_info("crashkernel CMA reserved: %lld MB in %d ranges\n",
+ reserved_size >> 20, crashk_cma_cnt);
+}
+
+#else /* CRASHKERNEL_CMA */
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+ if (cma_size)
+ pr_warn("crashkernel CMA reservation not supported\n");
+}
+#endif
+
#ifndef HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY
static __init int insert_crashkernel_resources(void)
{
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v2 3/5] kdump, documentation: describe crashkernel CMA reservation
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
2025-02-20 16:51 ` [PATCH v2 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
2025-02-20 16:52 ` [PATCH v2 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
@ 2025-02-20 16:54 ` Jiri Bohac
2025-03-03 1:54 ` Baoquan He
2025-02-20 16:55 ` [PATCH v2 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
` (3 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: Jiri Bohac @ 2025-02-20 16:54 UTC (permalink / raw)
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Describe the new crashkernel ",cma" suffix in Documentation/.
---
Documentation/admin-guide/kdump/kdump.rst | 19 +++++++++++++++++--
.../admin-guide/kernel-parameters.txt | 14 ++++++++++++++
2 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 5376890adbeb..3f7ff301c604 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -213,8 +213,7 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
kernel.
Otherwise it should be the start of memory region reserved for
- second kernel using boot parameter "crashkernel=Y@X". Here X is
- start of memory region reserved for dump-capture kernel.
+ second kernel using boot parameter "crashkernel=Y@X". Here X is start of memory region reserved for dump-capture kernel.
Generally X is 16MB (0x1000000). So you can set
CONFIG_PHYSICAL_START=0x1000000
@@ -315,6 +314,22 @@ crashkernel syntax
crashkernel=0,low
+4) crashkernel=size,cma
+
+ Reserves additional memory from CMA. A standard crashkernel reservation, as
+ described above, is still needed, but can be just big enough to hold the
+ kernel and initrd. All the memory the crash kernel needs for its runtime and
+ for running the kdump userspace processes can be provided by this CMA
+ reservation, re-using memory available to the production system's userspace.
+ Because of this, the CMA reservation should not be used if kdump is configured
+ to dump userspace memory - the re-used memory ranges won't be in the vmcore.
+
+ This option increases the risk of a kdump failure: DMA transfers configured
+ by the first kernel may end up corrupting the second kernel's memory. This
+ reservation method is intended for systems that can't afford to sacrifice
+ enough memory for standard crashkernel reservation and where less reliable
+ kdump is preferable to no kdump at all.
+
Boot into System Kernel
-----------------------
1) Update the boot loader (such as grub, yaboot, or lilo) configuration
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..ea4d53435515 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -987,6 +987,20 @@
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
+ crashkernel=size[KMG],cma
+ [KNL, X86] Reserve additional crash kernel memory from CMA.
+ This reservation is usable by the first system's userspace,
+ so this should not be used if dumping of userspace
+ memory is intended. A standard crashkernel reservation,
+ as described above, is still needed to hold the crash
+ kernel and initrd.
+ This option increases the risk of a kdump failure: DMA
+ transfers configured by the first kernel may end up
+ corrupting the second kernel's memory. This reservation
+ method is intended for systems that can't afford to
+ sacrifice enough memory for standard crashkernel
+ reservation and where less reliable kdump is preferable
+ to no kdump at all.
cryptomgr.notests
[KNL] Disable crypto self-tests
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v2 4/5] kdump: wait for DMA to finish when using CMA
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
` (2 preceding siblings ...)
2025-02-20 16:54 ` [PATCH v2 3/5] kdump, documentation: describe crashkernel CMA reservation Jiri Bohac
@ 2025-02-20 16:55 ` Jiri Bohac
2025-03-03 2:02 ` Baoquan He
2025-02-20 16:57 ` [PATCH v2 5/5] x86: implement crashkernel cma reservation Jiri Bohac
` (2 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: Jiri Bohac @ 2025-02-20 16:55 UTC (permalink / raw)
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
When re-using the CMA area for kdump, there is a risk of pending DMA into
pinned user pages in the CMA area.
Pages that are pinned long-term are migrated away from CMA, so these are not a
concern. Pages pinned without FOLL_LONGTERM remain in the CMA area and may
be the source or destination of a pending DMA transfer.
Although there is no clear specification of how long a page may be pinned
without FOLL_LONGTERM, pinning without the flag shows an intent of the caller
to only use the memory for short-lived DMA transfers, not a transfer initiated
by a device asynchronously at a random time in the future.
Add a delay of CMA_DMA_TIMEOUT_MSEC milliseconds before starting the kdump
kernel, giving such short-lived DMA transfers time to finish before the CMA
memory is re-used by the kdump kernel.
Set CMA_DMA_TIMEOUT_MSEC to 1000 (one second), chosen arbitrarily as a
huge margin for a DMA transfer that still does not increase the kdump time
significantly.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
include/linux/crash_core.h | 5 +++++
kernel/crash_core.c | 10 ++++++++++
2 files changed, 15 insertions(+)
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 44305336314e..543e4a71f13c 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -56,6 +56,11 @@ static inline unsigned int crash_get_elfcorehdr_size(void) { return 0; }
/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
+/* Time to wait for possible DMA to finish before starting the kdump kernel
+ * when a CMA reservation is used
+ */
+#define CMA_DMA_TIMEOUT_MSEC 1000
+
extern int crash_exclude_mem_range(struct crash_mem *mem,
unsigned long long mstart,
unsigned long long mend);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 078fe5bc5a74..543e509b7926 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -21,6 +21,7 @@
#include <linux/reboot.h>
#include <linux/btf.h>
#include <linux/objtool.h>
+#include <linux/delay.h>
#include <asm/page.h>
#include <asm/sections.h>
@@ -97,6 +98,14 @@ int kexec_crash_loaded(void)
}
EXPORT_SYMBOL_GPL(kexec_crash_loaded);
+static void crash_cma_clear_pending_dma(void)
+{
+ if (!crashk_cma_cnt)
+ return;
+
+ mdelay(CMA_DMA_TIMEOUT_MSEC);
+}
+
/*
* No panic_cpu check version of crash_kexec(). This function is called
* only when panic_cpu holds the current CPU number; this is the only CPU
@@ -116,6 +125,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
if (kexec_crash_image) {
struct pt_regs fixed_regs;
+ crash_cma_clear_pending_dma();
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v2 5/5] x86: implement crashkernel cma reservation
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
` (3 preceding siblings ...)
2025-02-20 16:55 ` [PATCH v2 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
@ 2025-02-20 16:57 ` Jiri Bohac
2025-03-03 2:08 ` [PATCH v2 0/5] kdump: crashkernel reservation from CMA Baoquan He
2025-03-03 8:25 ` David Hildenbrand
6 siblings, 0 replies; 30+ messages in thread
From: Jiri Bohac @ 2025-02-20 16:57 UTC (permalink / raw)
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Implement the crashkernel CMA reservation for x86:
- enable parsing of the cma suffix by parse_crashkernel()
- reserve memory with reserve_crashkernel_cma()
- add the CMA-reserved ranges to the e820 map for the crash kernel
- exclude the CMA-reserved ranges from vmcore
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
arch/x86/kernel/crash.c | 26 ++++++++++++++++++++++----
arch/x86/kernel/setup.c | 5 +++--
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 340af8155658..70823fa9abea 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -163,10 +163,10 @@ static struct crash_mem *fill_up_crash_elf_data(void)
return NULL;
/*
- * Exclusion of crash region and/or crashk_low_res may cause
- * another range split. So add extra two slots here.
+ * Exclusion of crash region, crashk_low_res and/or crashk_cma_ranges
+ * may cause range splits. So add extra slots here.
*/
- nr_ranges += 2;
+ nr_ranges += 2 + crashk_cma_cnt;
cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
if (!cmem)
return NULL;
@@ -184,6 +184,7 @@ static struct crash_mem *fill_up_crash_elf_data(void)
static int elf_header_exclude_ranges(struct crash_mem *cmem)
{
int ret = 0;
+ int i;
/* Exclude the low 1M because it is always reserved */
ret = crash_exclude_mem_range(cmem, 0, SZ_1M - 1);
@@ -198,8 +199,17 @@ static int elf_header_exclude_ranges(struct crash_mem *cmem)
if (crashk_low_res.end)
ret = crash_exclude_mem_range(cmem, crashk_low_res.start,
crashk_low_res.end);
+ if (ret)
+ return ret;
- return ret;
+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start,
+ crashk_cma_ranges[i].end);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
}
static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
@@ -352,6 +362,14 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
add_e820_entry(params, &ei);
}
+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ei.addr = crashk_cma_ranges[i].start;
+ ei.size = crashk_cma_ranges[i].end -
+ crashk_cma_ranges[i].start + 1;
+ ei.type = E820_TYPE_RAM;
+ add_e820_entry(params, &ei);
+ }
+
out:
vfree(cmem);
return ret;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 853afde761a7..90e10a18f0c9 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -471,7 +471,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
static void __init arch_reserve_crashkernel(void)
{
- unsigned long long crash_base, crash_size, low_size = 0;
+ unsigned long long crash_base, crash_size, low_size = 0, cma_size = 0;
char *cmdline = boot_command_line;
bool high = false;
int ret;
@@ -481,7 +481,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, NULL, &high);
+ &low_size, &cma_size, &high);
if (ret)
return;
@@ -492,6 +492,7 @@ static void __init arch_reserve_crashkernel(void)
reserve_crashkernel_generic(cmdline, crash_size, crash_base,
low_size, high);
+ reserve_crashkernel_cma(cma_size);
}
static struct resource standard_io_resources[] = {
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* Re: [PATCH v2 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option
2025-02-20 16:51 ` [PATCH v2 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
@ 2025-03-03 1:51 ` Baoquan He
0 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-03-03 1:51 UTC (permalink / raw)
To: Jiri Bohac
Cc: Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, David Hildenbrand,
Michal Hocko
On 02/20/25 at 05:51pm, Jiri Bohac wrote:
......snip...
> diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
> index a620fb4b2116..e72a9c897694 100644
> --- a/kernel/crash_reserve.c
> +++ b/kernel/crash_reserve.c
> @@ -172,17 +172,19 @@ static int __init parse_crashkernel_simple(char *cmdline,
>
> #define SUFFIX_HIGH 0
> #define SUFFIX_LOW 1
> -#define SUFFIX_NULL 2
> +#define SUFFIX_CMA 2
> +#define SUFFIX_NULL 3
> static __initdata char *suffix_tbl[] = {
> - [SUFFIX_HIGH] = ",high",
> - [SUFFIX_LOW] = ",low",
> - [SUFFIX_NULL] = NULL,
> + [SUFFIX_HIGH] = ",high",
> + [SUFFIX_LOW] = ",low",
> + [SUFFIX_CMA] = ",cma",
> + [SUFFIX_NULL] = NULL,
> };
Seems the old style looks a little better.
......
* Re: [PATCH v2 3/5] kdump, documentation: describe crashkernel CMA reservation
2025-02-20 16:54 ` [PATCH v2 3/5] kdump, documentation: describe crashkernel CMA reservation Jiri Bohac
@ 2025-03-03 1:54 ` Baoquan He
0 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-03-03 1:54 UTC (permalink / raw)
To: Jiri Bohac
Cc: Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, David Hildenbrand,
Michal Hocko
On 02/20/25 at 05:54pm, Jiri Bohac wrote:
......
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index fb8752b42ec8..ea4d53435515 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -987,6 +987,20 @@
> 0: to disable low allocation.
> It will be ignored when crashkernel=X,high is not used
> or memory reserved is below 4G.
> + crashkernel=size[KMG],cma
> + [KNL, X86] Reserve additional crash kernel memory from CMA.
> + This reservation is usable by the first system's userspace,
> + so this should not be used if dumping of userspace
> + memory is intended. A standard crashkernel reservation,
> + as described above, is still needed to hold the crash
> + kernel and initrd.
> + This option increases the risk of a kdump failure: DMA
> + transfers configured by the first kernel may end up
> + corrupting the second kernel's memory. This reservation
> + method is intended for systems that can't afford to
> + sacrifice enough memory for standard crashkernel
> + reservation and where less reliable kdump is preferable
There's trailing whitespace at the end of the above line; checkpatch should
be run to catch it.
> + to no kdump at all.
>
> cryptomgr.notests
> [KNL] Disable crypto self-tests
>
> --
> Jiri Bohac <jbohac@suse.cz>
> SUSE Labs, Prague, Czechia
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/5] kdump: wait for DMA to finish when using CMA
2025-02-20 16:55 ` [PATCH v2 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
@ 2025-03-03 2:02 ` Baoquan He
2025-03-11 12:00 ` Jiri Bohac
0 siblings, 1 reply; 30+ messages in thread
From: Baoquan He @ 2025-03-03 2:02 UTC (permalink / raw)
To: Jiri Bohac
Cc: Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, David Hildenbrand,
Michal Hocko
On 02/20/25 at 05:55pm, Jiri Bohac wrote:
> When re-using the CMA area for kdump there is a risk of pending DMA into
> pinned user pages in the CMA area.
>
> Pages that are pinned long-term are migrated away from CMA, so these are not a
> concern. Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly
> be the source or destination of a pending DMA transfer.
>
> Although there is no clear specification of how long a page may be pinned without
> FOLL_LONGTERM, pinning without the flag shows an intent of the caller to
> only use the memory for short-lived DMA transfers, not a transfer initiated
> by a device asynchronously at a random time in the future.
>
> Add a delay of CMA_DMA_TIMEOUT_MSEC milliseconds before starting the kdump
> kernel, giving such short-lived DMA transfers time to finish before the CMA
> memory is re-used by the kdump kernel.
>
> Set CMA_DMA_TIMEOUT_MSEC to 1000 (one second) - chosen arbitrarily as both a
> huge margin for a DMA transfer, yet not increasing the kdump time
> significantly.
>
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>
> ---
> include/linux/crash_core.h | 5 +++++
> kernel/crash_core.c | 10 ++++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index 44305336314e..543e4a71f13c 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -56,6 +56,11 @@ static inline unsigned int crash_get_elfcorehdr_size(void) { return 0; }
> /* Alignment required for elf header segment */
> #define ELF_CORE_HEADER_ALIGN 4096
>
> +/* Time to wait for possible DMA to finish before starting the kdump kernel
> + * when a CMA reservation is used
> + */
> +#define CMA_DMA_TIMEOUT_MSEC 1000
> +
> extern int crash_exclude_mem_range(struct crash_mem *mem,
> unsigned long long mstart,
> unsigned long long mend);
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 078fe5bc5a74..543e509b7926 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -21,6 +21,7 @@
> #include <linux/reboot.h>
> #include <linux/btf.h>
> #include <linux/objtool.h>
> +#include <linux/delay.h>
>
> #include <asm/page.h>
> #include <asm/sections.h>
> @@ -97,6 +98,14 @@ int kexec_crash_loaded(void)
> }
> EXPORT_SYMBOL_GPL(kexec_crash_loaded);
>
> +static void crash_cma_clear_pending_dma(void)
> +{
> + if (!crashk_cma_cnt)
> + return;
> +
> + mdelay(CMA_DMA_TIMEOUT_MSEC);
> +}
> +
> /*
> * No panic_cpu check version of crash_kexec(). This function is called
> * only when panic_cpu holds the current CPU number; this is the only CPU
> @@ -116,6 +125,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
> if (kexec_crash_image) {
> struct pt_regs fixed_regs;
>
> + crash_cma_clear_pending_dma();
This could be too ideal; I am not sure if it's a good way. When a crash
is triggered, we need to do the urgent and necessary things as soon as
possible, then shut down all CPUs to avoid further damage. This one second
of waiting could give the strayed system too much time. My personal
opinion.
> crash_setup_regs(&fixed_regs, regs);
> crash_save_vmcoreinfo();
> machine_crash_shutdown(&fixed_regs);
>
> --
> Jiri Bohac <jbohac@suse.cz>
> SUSE Labs, Prague, Czechia
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
` (4 preceding siblings ...)
2025-02-20 16:57 ` [PATCH v2 5/5] x86: implement crashkernel cma reservation Jiri Bohac
@ 2025-03-03 2:08 ` Baoquan He
2025-03-03 8:25 ` David Hildenbrand
6 siblings, 0 replies; 30+ messages in thread
From: Baoquan He @ 2025-03-03 2:08 UTC (permalink / raw)
To: Jiri Bohac
Cc: Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, David Hildenbrand,
Michal Hocko
On 02/20/25 at 05:48pm, Jiri Bohac wrote:
......snip...
> ---
> Changes since v1:
>
> The key concern raised in the v1 discussion was that pages in the
> CMA region may be pinned and used for a DMA transfer, potentially
> corrupting the new kernel's memory. When the cma suffix is used, kdump
> may be less reliable and the corruption hard to debug
>
> This v2 series addresses this concern in two ways:
>
> 1) Clearly stating the potential problem in the updated
> Documentation and setting the expectation (patch 3/5)
>
> Documentation now explicitly states that:
> - the risk of kdump failure is increased
> - the CMA reservation is intended for users who can not or don't
> want to sacrifice enough memory for a standard crashkernel reservation
> and who prefer less reliable kdump to no kdump at all
>
> This is consistent with the documentation of the
> crash_kexec_post_notifiers option, which can also increase the
> risk of kdump failure, yet may be the only way to use kdump on
> some systems. And just like the crash_kexec_post_notifiers
> option, the cma crashkernel suffix is completely optional:
> the series has zero effect when the suffix is not used.
Thanks for the effort to investigate and add a clear note about the
potential risk in the documentation. Except for the one-second wait for
short-term pinned DMA pages, the whole series looks good to me. I hope
other people can also comment to evaluate the risk of the wait; I will
wait another week before adding my personal ACK.
Thanks
Baoquan
>
> 2) Giving DMA time to finish before booting the kdump kernel
> (patch 4/5)
>
> Pages can be pinned for long term use using the FOLL_LONGTERM
> flag. Then they are migrated outside the CMA region. Pinning
> without this flag shows that the intent of their user is to only
> use them for short-lived DMA transfers.
>
> Delay the boot of the kdump kernel when the CMA reservation is
> used, giving potential pending DMA transfers time to finish.
>
> Other minor changes since v1:
> - updated for 6.14-rc2
> - moved #ifdefs and #defines to header files
> - added __always_unused in parse_crashkernel() to silence a false
> unused variable warning
>
>
> --
> Jiri Bohac <jbohac@suse.cz>
> SUSE Labs, Prague, Czechia
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
` (5 preceding siblings ...)
2025-03-03 2:08 ` [PATCH v2 0/5] kdump: crashkernel reservation from CMA Baoquan He
@ 2025-03-03 8:25 ` David Hildenbrand
2025-03-03 14:17 ` Donald Dutile
2025-03-12 15:36 ` Jiri Bohac
6 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-03-03 8:25 UTC (permalink / raw)
To: Jiri Bohac, Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
On 20.02.25 17:48, Jiri Bohac wrote:
> Hi,
>
> this series implements a way to reserve additional crash kernel
> memory using CMA.
>
> Link to the v1 discussion:
> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
> See below for the changes since v1 and how concerns from the
> discussion have been addressed.
>
> Currently, all the memory for the crash kernel is not usable by
> the 1st (production) kernel. It is also unmapped so that it can't
> be corrupted by the fault that will eventually trigger the crash.
> This makes sense for the memory actually used by the kexec-loaded
> crash kernel image and initrd and the data prepared during the
> load (vmcoreinfo, ...). However, the reserved space needs to be
> much larger than that to provide enough run-time memory for the
> crash kernel and the kdump userspace. Estimating the amount of
> memory to reserve is difficult. Being too careful makes kdump
> likely to end in OOM, being too generous takes even more memory
> from the production system. Also, the reservation only allows
> reserving a single contiguous block (or two with the "low"
> suffix). I've seen systems where this fails because the physical
> memory is fragmented.
>
> By reserving additional crashkernel memory from CMA, the main
> crashkernel reservation can be just large enough to fit the
> kernel and initrd image, minimizing the memory taken away from
> the production system. Most of the run-time memory for the crash
> kernel will be memory previously available to userspace in the
> production system. As this memory is no longer wasted, the
> reservation can be done with a generous margin, making kdump more
> reliable. Kernel memory that we need to preserve for dumping is
> never allocated from CMA. User data is typically not dumped by
> makedumpfile. When dumping of user data is intended this new CMA
> reservation cannot be used.
Hi,
I'll note that your comment about "user space" is currently the case,
but will likely not hold in the long run. The assumption you are making
is that only user-space memory will be allocated from MIGRATE_CMA, which
is not necessarily the case. Any movable allocation will end up in there.
Besides LRU folios (user space memory and the pagecache), we already
support migration of some kernel allocations using the non-lru migration
framework. Such allocations (which use __GFP_MOVABLE, see
__SetPageMovable()) currently only include
* memory balloon: pages we never want to dump either way
* zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
* z3fold (->zpool): only used by zswap (-> compressed LRU pages)
Just imagine if we support migration of other kernel allocations, such
as user page tables. The dump would be missing important information.
Once that happens, it will become a lot harder to judge whether CMA can
be used or not. At least, the kernel could bail out/warn for these
kernel configs.
>
> There are five patches in this series:
>
> The first adds a new ",cma" suffix to the recently introduced generic
> crashkernel parsing code. parse_crashkernel() takes one more
> argument to store the cma reservation size.
>
> The second patch implements reserve_crashkernel_cma() which
> performs the reservation. If the requested size is not available
> in a single range, multiple smaller ranges will be reserved.
>
> The third patch updates Documentation/, explicitly mentioning the
> potential DMA corruption of the CMA-reserved memory.
>
> The fourth patch adds a short delay before booting the kdump
> kernel, allowing pending DMA transfers to finish.
What does "short" mean? At least in theory, long-term pinning is
forbidden for MIGRATE_CMA, so we should not have such pages mapped into
an iommu where DMA can happily keep going on for quite a while.
But that assumes that our old kernel is not buggy, and doesn't end up
mapping these pages into an IOMMU where DMA will just continue. I recall
that DRM might currently be a problem, described here [1].
If kdump starts not working as expected in case our old kernel is buggy,
doesn't that partially destroy the purpose of kdump (-> debug bugs in
the old kernel)?
[1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 30+ messages in thread
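[Editorial sketch] The multi-range fallback mentioned in the quoted cover letter ("If the requested size is not available in a single range, multiple smaller ranges will be reserved") can be illustrated with a toy userspace model. Everything here (`free_chunk[]`, `reserve_cma_ranges()`) is an invented stand-in for the real `reserve_crashkernel_cma()`, which works on memblock/CMA, not on an array:

```c
#include <stddef.h>

/*
 * Toy model of reserving a total size from fragmented "physical"
 * memory: if no single free range is big enough, take several
 * smaller ranges until the request is satisfied or memory runs out.
 */
static size_t free_chunk[] = { 64, 128, 96, 256 };	/* MB, fragmented */
#define NCHUNKS (sizeof(free_chunk) / sizeof(free_chunk[0]))

/* Reserve up to 'want' MB, possibly from several chunks; return MB reserved. */
static size_t reserve_cma_ranges(size_t want)
{
	size_t got = 0;

	for (size_t i = 0; i < NCHUNKS && got < want; i++) {
		size_t take = free_chunk[i];

		if (take > want - got)
			take = want - got;	/* partial take from a big chunk */
		free_chunk[i] -= take;
		got += take;
	}
	return got;
}
```

With the fragmented layout above, a 300 MB request succeeds even though no single chunk is 300 MB, which is exactly the failure mode the cover letter describes for the single-block crashkernel reservation.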
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-03-03 8:25 ` David Hildenbrand
@ 2025-03-03 14:17 ` Donald Dutile
2025-03-04 4:20 ` Baoquan He
2025-03-12 15:36 ` Jiri Bohac
1 sibling, 1 reply; 30+ messages in thread
From: Donald Dutile @ 2025-03-03 14:17 UTC (permalink / raw)
To: David Hildenbrand, Jiri Bohac, Baoquan He, Vivek Goyal,
Dave Young, kexec
Cc: Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
On 3/3/25 3:25 AM, David Hildenbrand wrote:
> On 20.02.25 17:48, Jiri Bohac wrote:
>> Hi,
>>
>> this series implements a way to reserve additional crash kernel
>> memory using CMA.
>>
>> Link to the v1 discussion:
>> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
>> See below for the changes since v1 and how concerns from the
>> discussion have been addressed.
>>
>> Currently, all the memory for the crash kernel is not usable by
>> the 1st (production) kernel. It is also unmapped so that it can't
>> be corrupted by the fault that will eventually trigger the crash.
>> This makes sense for the memory actually used by the kexec-loaded
>> crash kernel image and initrd and the data prepared during the
>> load (vmcoreinfo, ...). However, the reserved space needs to be
>> much larger than that to provide enough run-time memory for the
>> crash kernel and the kdump userspace. Estimating the amount of
>> memory to reserve is difficult. Being too careful makes kdump
>> likely to end in OOM, being too generous takes even more memory
>> from the production system. Also, the reservation only allows
>> reserving a single contiguous block (or two with the "low"
>> suffix). I've seen systems where this fails because the physical
>> memory is fragmented.
>>
>> By reserving additional crashkernel memory from CMA, the main
>> crashkernel reservation can be just large enough to fit the
>> kernel and initrd image, minimizing the memory taken away from
>> the production system. Most of the run-time memory for the crash
>> kernel will be memory previously available to userspace in the
>> production system. As this memory is no longer wasted, the
>> reservation can be done with a generous margin, making kdump more
>> reliable. Kernel memory that we need to preserve for dumping is
>> never allocated from CMA. User data is typically not dumped by
>> makedumpfile. When dumping of user data is intended this new CMA
>> reservation cannot be used.
>
>
> Hi,
>
> I'll note that your comment about "user space" is currently the case, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation will end up in there.
>
> Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
> * memory balloon: pages we never want to dump either way
> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>
> Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
>
IOMMUFD is a near-term candidate for user page tables, with multi-stage IOMMU support going through upstream review at the moment.
Just saying that David's case will be the norm in high-end VMs with performance-enhanced, guest-driven IOMMU support (for GPUs).
> Once that happens, it will become a lot harder to judge whether CMA can be used or not. At least, the kernel could bail out/warn for these kernel configs.
>
I don't think the aforementioned work currently focuses on using CMA, but given its performance benefits, it won't take long before that becomes the next performance-improvement step taken.
>>
>> There are five patches in this series:
>>
>> The first adds a new ",cma" suffix to the recently introduced generic
>> crashkernel parsing code. parse_crashkernel() takes one more
>> argument to store the cma reservation size.
>>
>> The second patch implements reserve_crashkernel_cma() which
>> performs the reservation. If the requested size is not available
>> in a single range, multiple smaller ranges will be reserved.
>>
>> The third patch updates Documentation/, explicitly mentioning the
>> potential DMA corruption of the CMA-reserved memory.
>>
>> The fourth patch adds a short delay before booting the kdump
>> kernel, allowing pending DMA transfers to finish.
>
>
> What does "short" mean? At least in theory, long-term pinning is forbidden for MIGRATE_CMA, so we should not have such pages mapped into an iommu where DMA can happily keep going on for quite a while.
>
Hmmm, in the case I mentioned above, should there be a kexec hook in multi-stage IOMMU support for the hypervisor/VMM to invalidate/shut off stage 2 mappings asap (a multi-microsecond process) so
DMA termination from VMs is stunted? Is that already done today (due to 'simple', single-stage device assignment in a VM)?
> But that assumes that our old kernel is not buggy, and doesn't end up mapping these pages into an IOMMU where DMA will just continue. I recall that DRM might currently be a problem, described here [1].
>
> If kdump starts not working as expected in case our old kernel is buggy, doesn't that partially destroy the purpose of kdump (-> debug bugs in the old kernel)?
>
>
> [1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-03-03 14:17 ` Donald Dutile
@ 2025-03-04 4:20 ` Baoquan He
2025-05-28 21:01 ` David Hildenbrand
0 siblings, 1 reply; 30+ messages in thread
From: Baoquan He @ 2025-03-04 4:20 UTC (permalink / raw)
To: Donald Dutile, David Hildenbrand, Jiri Bohac
Cc: Vivek Goyal, Dave Young, kexec, Philipp Rudo, Pingfan Liu,
Tao Liu, linux-kernel, David Hildenbrand, Michal Hocko
On 03/03/25 at 09:17am, Donald Dutile wrote:
>
>
> On 3/3/25 3:25 AM, David Hildenbrand wrote:
> > On 20.02.25 17:48, Jiri Bohac wrote:
> > > Hi,
> > >
> > > this series implements a way to reserve additional crash kernel
> > > memory using CMA.
> > >
> > > Link to the v1 discussion:
> > > https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
> > > See below for the changes since v1 and how concerns from the
> > > discussion have been addressed.
> > >
> > > Currently, all the memory for the crash kernel is not usable by
> > > the 1st (production) kernel. It is also unmapped so that it can't
> > > be corrupted by the fault that will eventually trigger the crash.
> > > This makes sense for the memory actually used by the kexec-loaded
> > > crash kernel image and initrd and the data prepared during the
> > > load (vmcoreinfo, ...). However, the reserved space needs to be
> > > much larger than that to provide enough run-time memory for the
> > > crash kernel and the kdump userspace. Estimating the amount of
> > > memory to reserve is difficult. Being too careful makes kdump
> > > likely to end in OOM, being too generous takes even more memory
> > > from the production system. Also, the reservation only allows
> > > reserving a single contiguous block (or two with the "low"
> > > suffix). I've seen systems where this fails because the physical
> > > memory is fragmented.
> > >
> > > By reserving additional crashkernel memory from CMA, the main
> > > crashkernel reservation can be just large enough to fit the
> > > kernel and initrd image, minimizing the memory taken away from
> > > the production system. Most of the run-time memory for the crash
> > > kernel will be memory previously available to userspace in the
> > > production system. As this memory is no longer wasted, the
> > > reservation can be done with a generous margin, making kdump more
> > > reliable. Kernel memory that we need to preserve for dumping is
> > > never allocated from CMA. User data is typically not dumped by
> > > makedumpfile. When dumping of user data is intended this new CMA
> > > reservation cannot be used.
> >
> >
> > Hi,
> >
> > I'll note that your comment about "user space" is currently the case, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation will end up in there.
> >
> > Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
> > * memory balloon: pages we never want to dump either way
> > * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> > * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
> >
> > Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
> >
> IOMMUFD is a near-term candidate for user page tables with multi-stage iommu support with going through upstream review atm.
> Just saying, that David's case will be a norm in high-end VMs with performance-enhanced guest-driven iommu support (for GPUs).
Thanks to both of you for the valuable input, David and Don. I agree that
we may argue that not every system has a balloon or enables swap for now,
but future extension of migration to other kernel allocations could become
an obstacle we can't get around.
If we knew for sure this feature would turn out to be bad code, we might
need to stop it in advance.
Thoughts, Jiri?
>
> > Once that happens, it will become a lot harder to judge whether CMA can be used or not. At least, the kernel could bail out/warn for these kernel configs.
> >
> I don't think the aforementioned focus is to use CMA, but given its performance benefits, it won't take long to be the next perf improvement step taken.
>
> > >
> > > There are five patches in this series:
> > >
> > > The first adds a new ",cma" suffix to the recently introduced generic
> > > crashkernel parsing code. parse_crashkernel() takes one more
> > > argument to store the cma reservation size.
> > >
> > > The second patch implements reserve_crashkernel_cma() which
> > > performs the reservation. If the requested size is not available
> > > in a single range, multiple smaller ranges will be reserved.
> > >
> > > The third patch updates Documentation/, explicitly mentioning the
> > > potential DMA corruption of the CMA-reserved memory.
> > >
> > > The fourth patch adds a short delay before booting the kdump
> > > kernel, allowing pending DMA transfers to finish.
> >
> >
> > What does "short" mean? At least in theory, long-term pinning is forbidden for MIGRATE_CMA, so we should not have such pages mapped into an iommu where DMA can happily keep going on for quite a while.
> >
> Hmmm, in the case I mentioned above, should there be a kexec hook in multi-stage IOMMU support for the hypervisor/VMM to invalidate/shut-off stage 2 mappings asap (a multi-microsecond process) so
> DMA termination from VMs is stunted ? is that already done today (due to 'simple', single-stage, device assignment in a VM)?
>
> > But that assumes that our old kernel is not buggy, and doesn't end up mapping these pages into an IOMMU where DMA will just continue. I recall that DRM might currently be a problem, described here [1].
> >
> > If kdump starts not working as expected in case our old kernel is buggy, doesn't that partially destroy the purpose of kdump (-> debug bugs in the old kernel)?
> >
> >
> > [1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
> >
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 4/5] kdump: wait for DMA to finish when using CMA
2025-03-03 2:02 ` Baoquan He
@ 2025-03-11 12:00 ` Jiri Bohac
0 siblings, 0 replies; 30+ messages in thread
From: Jiri Bohac @ 2025-03-11 12:00 UTC (permalink / raw)
To: Baoquan He
Cc: Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, David Hildenbrand,
Michal Hocko
On Mon, Mar 03, 2025 at 10:02:38AM +0800, Baoquan He wrote:
> On 02/20/25 at 05:55pm, Jiri Bohac wrote:
> > +static void crash_cma_clear_pending_dma(void)
> > +{
> > + if (!crashk_cma_cnt)
> > + return;
> > +
> > + mdelay(CMA_DMA_TIMEOUT_MSEC);
> > +}
> > +
> > /*
> > * No panic_cpu check version of crash_kexec(). This function is called
> > * only when panic_cpu holds the current CPU number; this is the only CPU
> > @@ -116,6 +125,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
> > if (kexec_crash_image) {
> > struct pt_regs fixed_regs;
> >
> > + crash_cma_clear_pending_dma();
>
> This could be too ideal, I am not sure if it's a good way. When crash
> triggered, we need do the urgent and necessary thing as soon as
> possible, then shutdown all CPU to avoid further damage. This one second
> of waiting could give the strayed system too much time. My personal
> opinion.
Good point! I think it makes sense to move the call to crash_cma_clear_pending_dma()
past the call to machine_crash_shutdown(), where all the shutdown
happens, like this:
> > crash_setup_regs(&fixed_regs, regs);
> > crash_save_vmcoreinfo();
> > machine_crash_shutdown(&fixed_regs);
+ crash_cma_clear_pending_dma();
I'll post a v3 with this change included.
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
^ permalink raw reply [flat|nested] 30+ messages in thread
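[Editorial sketch] The reordering agreed on above can be checked with a small userspace model of the crash path: the CMA DMA-grace delay runs only after `machine_crash_shutdown()` has quiesced the other CPUs. All functions here are stand-ins for the kernel ones; the real `crash_cma_clear_pending_dma()` does an `mdelay()` rather than recording its position:

```c
#include <string.h>

/* Records the order in which the crash-path steps run. */
static char order[8];
static int crashk_cma_cnt = 1;	/* pretend a ,cma area was reserved */

static void crash_setup_regs(void)       { strcat(order, "R"); }
static void crash_save_vmcoreinfo(void)  { strcat(order, "V"); }
static void machine_crash_shutdown(void) { strcat(order, "S"); }
static void machine_kexec(void)          { strcat(order, "K"); }

static void crash_cma_clear_pending_dma(void)
{
	if (!crashk_cma_cnt)
		return;
	strcat(order, "D");	/* the real code waits CMA_DMA_TIMEOUT_MSEC here */
}

/* Model of __crash_kexec() with the v3 ordering proposed above. */
static void crash_kexec_model(void)
{
	crash_setup_regs();
	crash_save_vmcoreinfo();
	machine_crash_shutdown();
	crash_cma_clear_pending_dma();	/* moved after the shutdown */
	machine_kexec();
}
```

The point of the reordering is visible in the recorded sequence: the delay ("D") now happens with the system already quiesced ("S"), so the one-second wait no longer gives the strayed CPUs extra time to run.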
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-03-03 8:25 ` David Hildenbrand
2025-03-03 14:17 ` Donald Dutile
@ 2025-03-12 15:36 ` Jiri Bohac
1 sibling, 0 replies; 30+ messages in thread
From: Jiri Bohac @ 2025-03-12 15:36 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baoquan He, Vivek Goyal, Dave Young, kexec, Philipp Rudo,
Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
On Mon, Mar 03, 2025 at 09:25:30AM +0100, David Hildenbrand wrote:
> On 20.02.25 17:48, Jiri Bohac wrote:
> >
> > By reserving additional crashkernel memory from CMA, the main
> > crashkernel reservation can be just large enough to fit the
> > kernel and initrd image, minimizing the memory taken away from
> > the production system. Most of the run-time memory for the crash
> > kernel will be memory previously available to userspace in the
> > production system. As this memory is no longer wasted, the
> > reservation can be done with a generous margin, making kdump more
> > reliable. Kernel memory that we need to preserve for dumping is
> > never allocated from CMA. User data is typically not dumped by
> > makedumpfile. When dumping of user data is intended this new CMA
> > reservation cannot be used.
>
> I'll note that your comment about "user space" is currently the case, but
> will likely not hold in the long run. The assumption you are making is that
> only user-space memory will be allocated from MIGRATE_CMA, which is not
> necessarily the case. Any movable allocation will end up in there.
>
> Besides LRU folios (user space memory and the pagecache), we already support
> migration of some kernel allocations using the non-lru migration framework.
> Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently
> only include
> * memory balloon: pages we never want to dump either way
> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>
> Just imagine if we support migration of other kernel allocations, such as
> user page tables. The dump would be missing important information.
>
> Once that happens, it will become a lot harder to judge whether CMA can be
> used or not. At least, the kernel could bail out/warn for these kernel
> configs.
Thanks for pointing this out. I still don't see this as a
roadblock for my primary usecase of the CMA reservation:
get at least some (less reliable and potentially
less useful) kdump where the user is not prepared to sacrifice
the memory needed for the standard reservation and where the only
other option is no kdump at all.
Still a lot can be analyzed with a vmcore that is missing those
__GFP_MOVABLE pages. Even if/when some user page tables are
missing.
I'll send a v3 with the documentation part updated to better
describe this.
> > The fourth patch adds a short delay before booting the kdump
> > kernel, allowing pending DMA transfers to finish.
>
>
> What does "short" mean? At least in theory, long-term pinning is forbidden
> for MIGRATE_CMA, so we should not have such pages mapped into an iommu where
> DMA can happily keep going on for quite a while.
See patch 4/5 in the series:
I propose 1 second, which is a negligible time from the kdump POV
but I assume it should be plenty enough for non-long-term pins in
MIGRATE_CMA.
> But that assumes that our old kernel is not buggy, and doesn't end up
> mapping these pages into an IOMMU where DMA will just continue. I recall
> that DRM might currently be a problem, described here [1].
>
> If kdump starts not working as expected in case our old kernel is buggy,
> doesn't that partially destroy the purpose of kdump (-> debug bugs in the
> old kernel)?
Again, this is meant as a kind of "lightweight best effort
kdump". If there is a bug that causes the crash _and_ a bug in a
driver that hogs MIGRATE_CMA and maps it into an IOMMU, then this
lightweight kdump may break. Then it's time to sacrifice more
memory and use a normal crashkernel reservation.
It's not like any bug in the old kernel will break it. It's a
very specific kind of bug that can potentially break it.
I see this whole thing as particularly useful for VMs. Unlike big
physical machines, where taking away a couple hundred MBs of
memory for kdump does not really hurt, a VM can ideally be given just
enough memory for its particular task. This can often be less
than 1 GB. Proper kdump reservation needs a couple hundred MBs,
so a very large proportion of the VM memory. In case of a
virtualization host running hundreds or thousands of such VMs, this
means a huge waste of memory. And VMs often don't use too many
drivers for real hardware, decreasing the risk of hitting a buggy
driver like this.
Thanks,
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-03-04 4:20 ` Baoquan He
@ 2025-05-28 21:01 ` David Hildenbrand
2025-05-29 7:46 ` Michal Hocko
2025-05-29 16:22 ` Jiri Bohac
0 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-05-28 21:01 UTC (permalink / raw)
To: Baoquan He, Donald Dutile, Jiri Bohac
Cc: Vivek Goyal, Dave Young, kexec, Philipp Rudo, Pingfan Liu,
Tao Liu, linux-kernel, David Hildenbrand, Michal Hocko
On 04.03.25 05:20, Baoquan He wrote:
> On 03/03/25 at 09:17am, Donald Dutile wrote:
>>
>>
>> On 3/3/25 3:25 AM, David Hildenbrand wrote:
>>> On 20.02.25 17:48, Jiri Bohac wrote:
>>>> Hi,
>>>>
>>>> this series implements a way to reserve additional crash kernel
>>>> memory using CMA.
>>>>
>>>> Link to the v1 discussion:
>>>> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
>>>> See below for the changes since v1 and how concerns from the
>>>> discussion have been addressed.
>>>>
>>>> Currently, all the memory for the crash kernel is not usable by
>>>> the 1st (production) kernel. It is also unmapped so that it can't
>>>> be corrupted by the fault that will eventually trigger the crash.
>>>> This makes sense for the memory actually used by the kexec-loaded
>>>> crash kernel image and initrd and the data prepared during the
>>>> load (vmcoreinfo, ...). However, the reserved space needs to be
>>>> much larger than that to provide enough run-time memory for the
>>>> crash kernel and the kdump userspace. Estimating the amount of
>>>> memory to reserve is difficult. Being too careful makes kdump
>>>> likely to end in OOM, being too generous takes even more memory
>>>> from the production system. Also, the reservation only allows
>>>> reserving a single contiguous block (or two with the "low"
>>>> suffix). I've seen systems where this fails because the physical
>>>> memory is fragmented.
>>>>
>>>> By reserving additional crashkernel memory from CMA, the main
>>>> crashkernel reservation can be just large enough to fit the
>>>> kernel and initrd image, minimizing the memory taken away from
>>>> the production system. Most of the run-time memory for the crash
>>>> kernel will be memory previously available to userspace in the
>>>> production system. As this memory is no longer wasted, the
>>>> reservation can be done with a generous margin, making kdump more
>>>> reliable. Kernel memory that we need to preserve for dumping is
>>>> never allocated from CMA. User data is typically not dumped by
>>>> makedumpfile. When dumping of user data is intended this new CMA
>>>> reservation cannot be used.
>>>
>>>
>>> Hi,
>>>
>>> I'll note that your comment about "user space" is currently the case, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation will end up in there.
>>>
>>> Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
>>> * memory balloon: pages we never want to dump either way
>>> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
>>> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>>>
>>> Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
>>>
>> IOMMUFD is a near-term candidate for user page tables with multi-stage iommu support with going through upstream review atm.
>> Just saying, that David's case will be a norm in high-end VMs with performance-enhanced guest-driven iommu support (for GPUs).
>
> Thanks to both of you for the valuable input, David and Don. I agree that
> we may argue not every system has a balloon or enables swap for now, while
> future extension of migration to other kernel allocations could become an
> obstacle we can't work around.
>
> If we knew for sure that this feature would turn out badly, we might need
> to stop it in advance.
Sorry for the late reply.
I think we just have to be careful to document it properly -- especially
the shortcomings and that this feature might become a problem in the
future. Movable user-space page tables getting placed on CMA memory
would probably not be a problem if we don't care about ... user-space
data either way.
The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could
be configurable how long to wait? 10s is certainly "safer".
But maybe, in the target use case (VMs), direct I/O will not be that common.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-28 21:01 ` David Hildenbrand
@ 2025-05-29 7:46 ` Michal Hocko
2025-05-29 9:19 ` Michal Hocko
2025-05-30 8:06 ` David Hildenbrand
2025-05-29 16:22 ` Jiri Bohac
1 sibling, 2 replies; 30+ messages in thread
From: Michal Hocko @ 2025-05-29 7:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Wed 28-05-25 23:01:04, David Hildenbrand wrote:
[...]
> I think we just have to be careful to document it properly -- especially the
> shortcomings and that this feature might become a problem in the future.
> Movable user-space page tables getting placed on CMA memory would probably
> not be a problem if we don't care about ... user-space data either way.
I think makedumpfile could refuse to capture a dump if userspace memory
is requested to enforce this.
> The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could be
> configurable how long to wait? 10s is certainly "safer".
Quite honestly we will never know and rather than making this
configurable I would go with reasonably large. Couple of seconds
certainly do not matter for the kdump situations but I would go as far
as minutes.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-29 7:46 ` Michal Hocko
@ 2025-05-29 9:19 ` Michal Hocko
2025-05-30 8:06 ` David Hildenbrand
1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2025-05-29 9:19 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Thu 29-05-25 09:46:28, Michal Hocko wrote:
> On Wed 28-05-25 23:01:04, David Hildenbrand wrote:
> [...]
> > I think we just have to be careful to document it properly -- especially the
> > shortcomings and that this feature might become a problem in the future.
> > Movable user-space page tables getting placed on CMA memory would probably
> > not be a problem if we don't care about ... user-space data either way.
>
> I think makedumpfile could refuse to capture a dump if userspace memory
> is requested to enforce this.
>
> > The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could be
> > configurable how long to wait? 10s is certainly "safer".
>
> Quite honestly we will never know and rather than making this
> configurable I would go with reasonably large. Couple of seconds
> certainly do not matter for the kdump situations but I would go as far
typo
s@I would go@I would not go@
> as minutes.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-28 21:01 ` David Hildenbrand
2025-05-29 7:46 ` Michal Hocko
@ 2025-05-29 16:22 ` Jiri Bohac
1 sibling, 0 replies; 30+ messages in thread
From: Jiri Bohac @ 2025-05-29 16:22 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baoquan He, Donald Dutile, Vivek Goyal, Dave Young, kexec,
Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
On Wed, May 28, 2025 at 11:01:04PM +0200, David Hildenbrand wrote:
> I think we just have to be careful to document it properly -- especially the
> shortcomings and that this feature might become a problem in the future.
> Movable user-space page tables getting placed on CMA memory would probably
> not be a problem if we don't care about ... user-space data either way.
Agreed; in the v3 series [1] I amended the documentation part [2] to
explicitly mention that kernel movable allocations could be
missing from the vmcore.
The risks associated with pending DMA are also mentioned.
Is there anything you're still missing from the v3 documentation?
> The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could be
> configurable how long to wait? 10s is certainly "safer".
I have nothing against making this configurable, or just setting
the fixed/default delay to 10s. Which would you prefer?
Would you prefer a command-line option, config option or a sysfs
file?
Thanks!
[1] https://lore.kernel.org/lkml/Z9H10pYIFLBHNKpr@dwarf.suse.cz/
[2] https://lore.kernel.org/lkml/Z9H4E82EslkGR7pV@dwarf.suse.cz/
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-29 7:46 ` Michal Hocko
2025-05-29 9:19 ` Michal Hocko
@ 2025-05-30 8:06 ` David Hildenbrand
2025-05-30 8:28 ` Michal Hocko
1 sibling, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-05-30 8:06 UTC (permalink / raw)
To: Michal Hocko
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On 29.05.25 09:46, Michal Hocko wrote:
> On Wed 28-05-25 23:01:04, David Hildenbrand wrote:
> [...]
>> I think we just have to be careful to document it properly -- especially the
>> shortcomings and that this feature might become a problem in the future.
>> Movable user-space page tables getting placed on CMA memory would probably
>> not be a problem if we don't care about ... user-space data either way.
>
> I think makedumpfile could refuse to capture a dump if userspace memory
> is requested to enforce this.
Yeah, it will be tricky once we support placing other memory on CMA
regions. E.g., there was the discussion of making some slab allocations
movable.
But probably, in such a configuration, we would later simply refuse to
activate CMA kdump.
>
>> The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could be
>> configurable how long to wait? 10s is certainly "safer".
>
> Quite honestly we will never know and rather than making this
> configurable I would go with reasonably large. Couple of seconds
> certainly do not matter for the kdump situations but I would go as far
> as minutes.
I recall that somebody raised that kdump downtime might be problematic
(might affect service downtime?).
So I would just add a kconfig option with a default of 10s.
But even better if we can avoid the kconfig and just make it 10s for all
setups.
I would not suggest having a different (runtime/boottime) way of
configuring this.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 8:06 ` David Hildenbrand
@ 2025-05-30 8:28 ` Michal Hocko
2025-05-30 8:39 ` David Hildenbrand
0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2025-05-30 8:28 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Fri 30-05-25 10:06:52, David Hildenbrand wrote:
> On 29.05.25 09:46, Michal Hocko wrote:
> > On Wed 28-05-25 23:01:04, David Hildenbrand wrote:
> > [...]
> > > I think we just have to be careful to document it properly -- especially the
> > > shortcomings and that this feature might become a problem in the future.
> > > Movable user-space page tables getting placed on CMA memory would probably
> > > not be a problem if we don't care about ... user-space data either way.
> >
> > I think makedumpfile could refuse to capture a dump if userspace memory
> > is requested to enforce this.
>
> Yeah, it will be tricky once we support placing other memory on CMA regions.
> E.g., there was the discussion of making some slab allocations movable.
>
> But probably, in such a configuration, we would later simply refuse to
> activate CMA kdump.
Or we can make the kdump CMA region more special and only allow
GFP_HIGHUSER_MOVABLE allocations from that. Anyway I think we should
deal with this once we get there.
> > > The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could be
> > > configurable how long to wait? 10s is certainly "safer".
> >
> > Quite honestly we will never know and rather than making this
> > configurable I would go with reasonably large. Couple of seconds
> > certainly do not matter for the kdump situations but I would go as far
> > as minutes.
>
> I recall that somebody raised that kdump downtime might be problematic
> (might affect service downtime?).
>
> So I would just add a kconfig option with a default of 10s.
kconfig option usually doesn't really work for distro kernels. I am
personally not really keen on having a tuning knob because there is a
risk of cargo cult based tuning we have seen in other areas. That might
make it hard to remove the knob later on. Fundamentally we should have 2
situations though. Either we know that the HW is sane and then we
shouldn't really need any sleep or the HW might misbehave and then we
need to wait _some_ time. If our initial guess is incorrect then we can
always increase it and we would learn about that through bug reports.
All that being said I would go with an additional parameter to the
kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
otherwise. That would make the optimized behavior opt in, we do not need
to support all sorts of timeouts and also learn if this is not
sufficient.
Makes sense?
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 8:28 ` Michal Hocko
@ 2025-05-30 8:39 ` David Hildenbrand
2025-05-30 9:07 ` Michal Hocko
0 siblings, 1 reply; 30+ messages in thread
From: David Hildenbrand @ 2025-05-30 8:39 UTC (permalink / raw)
To: Michal Hocko
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On 30.05.25 10:28, Michal Hocko wrote:
> On Fri 30-05-25 10:06:52, David Hildenbrand wrote:
>> On 29.05.25 09:46, Michal Hocko wrote:
>>> On Wed 28-05-25 23:01:04, David Hildenbrand wrote:
>>> [...]
>>>> I think we just have to be careful to document it properly -- especially the
>>>> shortcomings and that this feature might become a problem in the future.
>>>> Movable user-space page tables getting placed on CMA memory would probably
>>>> not be a problem if we don't care about ... user-space data either way.
>>>
>>> I think makedumpfile could refuse to capture a dump if userspace memory
>>> is requested to enforce this.
>>
>> Yeah, it will be tricky once we support placing other memory on CMA regions.
>> E.g., there was the discussion of making some slab allocations movable.
>>
>> But probably, in such a configuration, we would later simply refuse to
>> activate CMA kdump.
>
> Or we can make the kdump CMA region more special and only allow
> GFP_HIGHUSER_MOVABLE allocations from that. Anyway I think we should
> deal with this once we get there.
Might be doable. When migrating (e.g., compacting) pages we'd have to
make sure to also not migrate these pages into the CMA regions. Might be
a bit more tricky, but likely solvable.
>
>>>> The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could be
>>>> configurable how long to wait? 10s is certainly "safer".
>>>
>>> Quite honestly we will never know and rather than making this
>>> configurable I would go with reasonably large. Couple of seconds
>>> certainly do not matter for the kdump situations but I would go as far
>>> as minutes.
>>
>> I recall that somebody raised that kdump downtime might be problematic
>> (might affect service downtime?).
>>
>> So I would just add a kconfig option with a default of 10s.
>
> kconfig option usually doesn't really work for distro kernels. I am
> personally not really keen on having a tuning knob because there is a
> risk of cargo cult based tuning we have seen in other areas. That might
> make it hard to remove the knob later on. Fundamentally we should have 2
> situations though. Either we know that the HW is sane and then we
> shouldn't really need any sleep or the HW might misbehave and then we
> need to wait _some_ time. If our initial guess is incorrect then we can
> always increase it and we would learn about that through bug reports.
kconfigs are usually much easier to alter/remove than other tunables in
my experience.
But yeah, it would have to go for the setting that works for all
supported hw (iow, conservative timeout).
>
> All that being said I would go with an additional parameter to the
> kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
> otherwise. That would make the optimized behavior opt in, we do not need
> to support all sorts of timeouts and also learn if this is not
> sufficient.
>
> Makes sense?
Just so I understand correctly, you mean extending the "crashkernel="
option with a boolean parameter? If set, e.g., wait 1s, otherwise magic
number 10?
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 8:39 ` David Hildenbrand
@ 2025-05-30 9:07 ` Michal Hocko
2025-05-30 9:11 ` David Hildenbrand
0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2025-05-30 9:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Fri 30-05-25 10:39:39, David Hildenbrand wrote:
> On 30.05.25 10:28, Michal Hocko wrote:
[...]
> > All that being said I would go with an additional parameter to the
> > kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
> > otherwise. That would make the optimized behavior opt in, we do not need
> > to support all sorts of timeouts and also learn if this is not
> > sufficient.
> >
> > Makes sense?
>
> Just so I understand correctly, you mean extending the "crashkernel=" option
> with a boolean parameter? If set, e.g., wait 1s, otherwise magic number 10?
crashkernel=1G,cma,cma_sane_dma # no wait on transition
crashkernel=1G,cma # wait on transition with e.g. 10s timeout
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 9:07 ` Michal Hocko
@ 2025-05-30 9:11 ` David Hildenbrand
2025-05-30 9:26 ` Michal Hocko
2025-05-30 9:34 ` Jiri Bohac
0 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-05-30 9:11 UTC (permalink / raw)
To: Michal Hocko
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On 30.05.25 11:07, Michal Hocko wrote:
> On Fri 30-05-25 10:39:39, David Hildenbrand wrote:
>> On 30.05.25 10:28, Michal Hocko wrote:
> [...]
>>> All that being said I would go with an additional parameter to the
>>> kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
>>> otherwise. That would make the optimized behavior opt in, we do not need
>>> to support all sorts of timeouts and also learn if this is not
>>> sufficient.
>>>
>>> Makes sense?
>>
>> Just so I understand correctly, you mean extending the "crashkernel=" option
>> with a boolean parameter? If set, e.g., wait 1s, otherwise magic number 10?
>
> crashkernel=1G,cma,cma_sane_dma # no wait on transition
But is no wait ok? I mean, any O_DIRECT with any device would at least
take a bit, no?
Of course, there is a short time between the crash and actually
triggering kdump.
> crashkernel=1G,cma # wait on transition with e.g. 10s timeout
In general, would work for me.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 9:11 ` David Hildenbrand
@ 2025-05-30 9:26 ` Michal Hocko
2025-05-30 9:28 ` David Hildenbrand
2025-05-30 9:34 ` Jiri Bohac
1 sibling, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2025-05-30 9:26 UTC (permalink / raw)
To: David Hildenbrand
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Fri 30-05-25 11:11:40, David Hildenbrand wrote:
> On 30.05.25 11:07, Michal Hocko wrote:
> > On Fri 30-05-25 10:39:39, David Hildenbrand wrote:
> > > On 30.05.25 10:28, Michal Hocko wrote:
> > [...]
> > > > All that being said I would go with an additional parameter to the
> > > > kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
> > > > otherwise. That would make the optimized behavior opt in, we do not need
> > > > to support all sorts of timeouts and also learn if this is not
> > > > sufficient.
> > > >
> > > > Makes sense?
> > >
> > > Just so I understand correctly, you mean extending the "crashkernel=" option
> > > with a boolean parameter? If set, e.g., wait 1s, otherwise magic number 10?
> >
> > crashkernel=1G,cma,cma_sane_dma # no wait on transition
>
> But is no wait ok? I mean, any O_DIRECT with any device would at least take
> a bit, no?
>
> Of course, there is a short time between the crash and actually triggering
> kdump.
This is something we can test for and if we need a short timeout in this
case as well then it is just trivial to add it. I am much more
concerned about those potentially unpredictable DMA transfers that could
take too long and it is impossible to test for those and therefore we
need to overshoot.
> > crashkernel=1G,cma # wait on transition with e.g. 10s timeout
>
> In general, would work for me.
>
> --
> Cheers,
>
> David / dhildenb
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 9:26 ` Michal Hocko
@ 2025-05-30 9:28 ` David Hildenbrand
0 siblings, 0 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-05-30 9:28 UTC (permalink / raw)
To: Michal Hocko
Cc: Baoquan He, Donald Dutile, Jiri Bohac, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On 30.05.25 11:26, Michal Hocko wrote:
> On Fri 30-05-25 11:11:40, David Hildenbrand wrote:
>> On 30.05.25 11:07, Michal Hocko wrote:
>>> On Fri 30-05-25 10:39:39, David Hildenbrand wrote:
>>>> On 30.05.25 10:28, Michal Hocko wrote:
>>> [...]
>>>>> All that being said I would go with an additional parameter to the
>>>>> kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
>>>>> otherwise. That would make the optimized behavior opt in, we do not need
>>>>> to support all sorts of timeouts and also learn if this is not
>>>>> sufficient.
>>>>>
>>>>> Makes sense?
>>>>
>>>> Just so I understand correctly, you mean extending the "crashkernel=" option
>>>> with a boolean parameter? If set, e.g., wait 1s, otherwise magic number 10?
>>>
>>> crashkernel=1G,cma,cma_sane_dma # no wait on transition
>>
>> But is no wait ok? I mean, any O_DIRECT with any device would at least take
>> a bit, no?
>>
>> Of course, there is a short time between the crash and actually triggering
>> kdump.
>
> This is something we can test for and if we need a short timeout in this
> case as well then it is just trivial to add it. I am much more
> concerned about those potentially unpredictable DMA transfers that could
> take too long and it is impossible to test for those and therefore we
> need to overshoot.
Agreed.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 9:11 ` David Hildenbrand
2025-05-30 9:26 ` Michal Hocko
@ 2025-05-30 9:34 ` Jiri Bohac
2025-05-30 9:47 ` David Hildenbrand
1 sibling, 1 reply; 30+ messages in thread
From: Jiri Bohac @ 2025-05-30 9:34 UTC (permalink / raw)
To: David Hildenbrand
Cc: Michal Hocko, Baoquan He, Donald Dutile, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Fri, May 30, 2025 at 11:11:40AM +0200, David Hildenbrand wrote:
> On 30.05.25 11:07, Michal Hocko wrote:
> > On Fri 30-05-25 10:39:39, David Hildenbrand wrote:
> > > On 30.05.25 10:28, Michal Hocko wrote:
> > [...]
> > > > All that being said I would go with an additional parameter to the
> > > > kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
> > > > otherwise. That would make the optimized behavior opt in, we do not need
> > > > to support all sorts of timeouts and also learn if this is not
> > > > sufficient.
> > > >
> > > > Makes sense?
> > >
> > > Just so I understand correctly, you mean extending the "crashkernel=" option
> > > with a boolean parameter? If set, e.g., wait 1s, otherwise magic number 10?
> >
> > crashkernel=1G,cma,cma_sane_dma # no wait on transition
>
> But is no wait ok? I mean, any O_DIRECT with any device would at least take
> a bit, no?
>
> > Of course, there is a short time between the crash and actually triggering
> kdump.
>
> > crashkernel=1G,cma # wait on transition with e.g. 10s timeout
>
> In general, would work for me.
I don't like extending the crashkernel= syntax like this.
It would make hooking into the generic parsing code in
parse_crashkernel() really ugly. The syntax is already
convoluted as is and hard enough to explain in the documentation.
Also I don't see how adding a boolean knob is better than adding
one that allows setting any arbitrary timeout. It has less
flexibility and all the drawbacks of having an extra knob.
I am inclined to just setting the fixed delay to 10s for now and
adding a sysfs knob later if someone asks for it.
Would that work for you?
If you don't have other objections to the v3 series,
I'll just update it for v6.15 and post again a v4
with the 10s timeout...
Thanks for your input!
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 9:34 ` Jiri Bohac
@ 2025-05-30 9:47 ` David Hildenbrand
2025-05-30 9:54 ` Michal Hocko
2025-05-30 10:06 ` Jiri Bohac
0 siblings, 2 replies; 30+ messages in thread
From: David Hildenbrand @ 2025-05-30 9:47 UTC (permalink / raw)
To: Jiri Bohac
Cc: Michal Hocko, Baoquan He, Donald Dutile, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On 30.05.25 11:34, Jiri Bohac wrote:
> On Fri, May 30, 2025 at 11:11:40AM +0200, David Hildenbrand wrote:
>> On 30.05.25 11:07, Michal Hocko wrote:
>>> On Fri 30-05-25 10:39:39, David Hildenbrand wrote:
>>>> On 30.05.25 10:28, Michal Hocko wrote:
>>> [...]
>>>>> All that being said I would go with an additional parameter to the
>>>>> kdump cma setup - e.g. cma_sane_dma that would skip waiting and use 10s
>>>>> otherwise. That would make the optimized behavior opt in, we do not need
>>>>> to support all sorts of timeouts and also learn if this is not
>>>>> sufficient.
>>>>>
>>>>> Makes sense?
>>>>
>>>> Just so I understand correctly, you mean extending the "crashkernel=" option
>>>> with a boolean parameter? If set, e.g., wait 1s, otherwise magic number 10?
>>>
>>> crashkernel=1G,cma,cma_sane_dma # no wait on transition
>>
>> But is no wait ok? I mean, any O_DIRECT with any device would at least take
>> a bit, no?
>>
>> Of course, there is a short time between the crash and actually triggerying
>> kdump.
>>
>>> crashkernel=1G,cma # wait on transition with e.g. 10s timeout
>>
>> In general, would work for me.
>
> I don't like extending the crashkernel= syntax like this.
> It would make hooking into the generic parsing code in
> parse_crashkernel() really ugly. The syntax is already
> convoluted as is and hard enough to explain in the documentation.
Would another boolean flag (on top of the other one you are adding)
really make this significantly more ugly?
>
> Also I don't see how adding a boolean knob is better than adding
> one that allows setting any arbitrary timeout. It has less
> flexibility and all the drawbacks of having an extra knob.
I guess Michal's point is that specifying the higher-level problem and
giving less flexibility might actually be less confusing for users.
>
> I am inclined to just setting the fixed delay to 10s for now and
> adding a sysfs knob later if someone asks for it.
>
> Would that work for you?
Sure. We could always add such a flag later if it's really a problem for
someone.
--
Cheers,
David / dhildenb
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 9:47 ` David Hildenbrand
@ 2025-05-30 9:54 ` Michal Hocko
2025-05-30 10:06 ` Jiri Bohac
1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2025-05-30 9:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jiri Bohac, Baoquan He, Donald Dutile, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Fri 30-05-25 11:47:46, David Hildenbrand wrote:
> On 30.05.25 11:34, Jiri Bohac wrote:
[...]
> > I am inclined to just setting the fixed delay to 10s for now and
> > adding a sysfs knob later if someone asks for it.
> >
> > Would that work for you?
>
> Sure. We could always add such a flag later if it's really a problem for
> someone.
Yes, no objection with the most conservative approach first.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
2025-05-30 9:47 ` David Hildenbrand
2025-05-30 9:54 ` Michal Hocko
@ 2025-05-30 10:06 ` Jiri Bohac
1 sibling, 0 replies; 30+ messages in thread
From: Jiri Bohac @ 2025-05-30 10:06 UTC (permalink / raw)
To: David Hildenbrand
Cc: Michal Hocko, Baoquan He, Donald Dutile, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand
On Fri, May 30, 2025 at 11:47:46AM +0200, David Hildenbrand wrote:
> > > > crashkernel=1G,cma,cma_sane_dma # no wait on transition
> > >
> > > But is no wait ok? I mean, any O_DIRECT with any device would at least take
> > > a bit, no?
> > >
> > > Of course, there is a short time between the crash and actually triggering
> > > kdump.
> > >
> > > > crashkernel=1G,cma # wait on transition with e.g. 10s timeout
> > >
> > > In general, would work for me.
> >
> > I don't like extending the crashkernel= syntax like this.
> > It would make hooking into the generic parsing code in
> > parse_crashkernel() really ugly. The syntax is already
> > convoluted as is and hard enough to explain in the documentation.
>
> Would another boolean flag (on top of the other one you are adding) really
> make this significantly more ugly?
The current code does not split the parameter by commas and treat
the parts as boolean flags.
Both ",cma" and ",cma,cma_sane_dma" (and possibly
",cma_sane_dma,cma") would need to be added to suffix_tbl[]
(carefully considering the order, because one is a prefix of the
other); they would then be handled almost identically, except for
setting the flag.
Also I think using the command line is way less flexible than
sysfs. E.g. the userspace tool loading the crash kernel (kdump)
may want to decide if the hardware is sane using its own
whitelist/blacklist...
> > I am inclined to just setting the fixed delay to 10s for now and
> > adding a sysfs knob later if someone asks for it.
> >
> > Would that work for you?
>
> Sure. We could always add such a flag later if it's really a problem for
> someone.
OK, thanks! Will post the v4 shortly.
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
end of thread, other threads: [~2025-05-30 10:06 UTC | newest]
Thread overview: 30+ messages
2025-02-20 16:48 [PATCH v2 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
2025-02-20 16:51 ` [PATCH v2 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
2025-03-03 1:51 ` Baoquan He
2025-02-20 16:52 ` [PATCH v2 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
2025-02-20 16:54 ` [PATCH v2 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
2025-03-03 1:54 ` Baoquan He
2025-02-20 16:55 ` [PATCH v2 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
2025-03-03 2:02 ` Baoquan He
2025-03-11 12:00 ` Jiri Bohac
2025-02-20 16:57 ` [PATCH v2 5/5] x86: implement crashkernel cma reservation Jiri Bohac
2025-03-03 2:08 ` [PATCH v2 0/5] kdump: crashkernel reservation from CMA Baoquan He
2025-03-03 8:25 ` David Hildenbrand
2025-03-03 14:17 ` Donald Dutile
2025-03-04 4:20 ` Baoquan He
2025-05-28 21:01 ` David Hildenbrand
2025-05-29 7:46 ` Michal Hocko
2025-05-29 9:19 ` Michal Hocko
2025-05-30 8:06 ` David Hildenbrand
2025-05-30 8:28 ` Michal Hocko
2025-05-30 8:39 ` David Hildenbrand
2025-05-30 9:07 ` Michal Hocko
2025-05-30 9:11 ` David Hildenbrand
2025-05-30 9:26 ` Michal Hocko
2025-05-30 9:28 ` David Hildenbrand
2025-05-30 9:34 ` Jiri Bohac
2025-05-30 9:47 ` David Hildenbrand
2025-05-30 9:54 ` Michal Hocko
2025-05-30 10:06 ` Jiri Bohac
2025-05-29 16:22 ` Jiri Bohac
2025-03-12 15:36 ` Jiri Bohac