* [PATCH v3 0/5] kdump: crashkernel reservation from CMA
@ 2025-03-12 21:00 Jiri Bohac
2025-03-12 21:03 ` [PATCH v3 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
` (4 more replies)
0 siblings, 5 replies; 10+ messages in thread
From: Jiri Bohac @ 2025-03-12 21:00 UTC
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Hi,
this series implements a way to reserve additional crash kernel
memory using CMA.
Link to the v1 discussion:
https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
See below for the changes since v1 and how concerns from the
discussion have been addressed.
Currently, none of the memory reserved for the crash kernel is
usable by the 1st (production) kernel. It is also unmapped so that
it can't be corrupted by the fault that will eventually trigger the
crash.
This makes sense for the memory actually used by the kexec-loaded
crash kernel image and initrd and the data prepared during the
load (vmcoreinfo, ...). However, the reserved space needs to be
much larger than that to provide enough run-time memory for the
crash kernel and the kdump userspace. Estimating the amount of
memory to reserve is difficult. Reserving too little makes kdump
likely to end in OOM; reserving too much takes even more memory
from the production system. Also, the reservation only allows
reserving a single contiguous block (or two with the "low"
suffix). I've seen systems where this fails because the physical
memory is fragmented.
By reserving additional crashkernel memory from CMA, the main
crashkernel reservation can be just large enough to fit the
kernel and initrd image, minimizing the memory taken away from
the production system. Most of the run-time memory for the crash
kernel will be memory previously available to userspace in the
production system. As this memory is no longer wasted, the
reservation can be done with a generous margin, making kdump more
reliable. Kernel memory that we need to preserve for dumping is
normally not allocated from CMA, unless it is explicitly
allocated as movable. Currently this is only the case for memory
ballooning and zswap. Such movable memory will be missing from
the vmcore. User data is typically not dumped by makedumpfile.
When dumping of user data is intended this new CMA reservation
cannot be used.
There are five patches in this series:
The first adds a new ",cma" suffix to the recently introduced generic
crashkernel parsing code. parse_crashkernel() takes one more
argument to store the cma reservation size.
The second patch implements reserve_crashkernel_cma() which
performs the reservation. If the requested size is not available
in a single range, multiple smaller ranges will be reserved.
The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.
The fourth patch adds a short delay before booting the kdump
kernel, allowing pending DMA transfers to finish.
The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
through /proc/vmcore by excluding them from the vmcoreinfo
PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as
this series is merged.
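To illustrate the third arch requirement above (excluding the CMA-reserved
ranges from what /proc/vmcore exposes), here is a minimal userspace sketch of
removing one inclusive range from a list of ranges. It is not the kernel's
crash_exclude_mem_range(); struct demo_range and demo_exclude_range() are
hypothetical names invented for this example. It only shows why an exclusion
in the middle of a range needs one extra slot:

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive address range; hypothetical stand-in for the kernel's struct range. */
struct demo_range {
	unsigned long long start;
	unsigned long long end;
};

/*
 * Remove [ex_start, ex_end] from the first n entries of r.
 * Returns the new number of entries, or -1 if a split would
 * exceed the array capacity cap.
 */
static int demo_exclude_range(struct demo_range *r, int n, int cap,
			      unsigned long long ex_start,
			      unsigned long long ex_end)
{
	int i = 0;
	int j;

	assert(r != NULL && ex_start <= ex_end);

	while (i < n) {
		struct demo_range *cur = &r[i];

		/* No overlap with this entry: leave it alone. */
		if (ex_end < cur->start || ex_start > cur->end) {
			i++;
			continue;
		}

		/* Entry fully covered: delete it and re-check the same index. */
		if (ex_start <= cur->start && ex_end >= cur->end) {
			for (j = i; j < n - 1; j++)
				r[j] = r[j + 1];
			n--;
			continue;
		}

		/* Hole strictly inside the entry: split it in two. */
		if (ex_start > cur->start && ex_end < cur->end) {
			if (n == cap)
				return -1;
			for (j = n; j > i + 1; j--)
				r[j] = r[j - 1];
			r[i + 1].start = ex_end + 1;
			r[i + 1].end = cur->end;
			cur->end = ex_start - 1;
			n++;
			i += 2;
			continue;
		}

		/* Partial overlap: trim the front or the tail. */
		if (ex_start <= cur->start)
			cur->start = ex_end + 1;
		else
			cur->end = ex_start - 1;
		i++;
	}

	return n;
}
```

Each excluded range can split at most one existing entry, so reserving one
spare slot per excluded range is sufficient in this model.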
With this series applied, specifying
crashkernel=100M crashkernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.
An additional 1G will be reserved from CMA, still usable by the
production system. The crash kernel will have 1.1G memory
available. The 100M can be reliably predicted based on the size
of the kernel and initrd.
The new cma suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.
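As a rough userspace illustration of how the two parameters above are told
apart, the sketch below scans a command line for "crashkernel=" entries and
picks the size that carries a given suffix (or none). It is a simplified
model, not the kernel parser: demo_memparse() and demo_crashkernel_size() are
hypothetical helpers, and the range syntax and other suffixes are ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse "<number>[KMG]" and advance *end past it; hypothetical helper. */
static unsigned long long demo_memparse(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G':
	case 'g':
		v <<= 10;
		/* fall through */
	case 'M':
	case 'm':
		v <<= 10;
		/* fall through */
	case 'K':
	case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Find the last "crashkernel=<size><suffix>" entry in cmdline.
 * suffix == NULL selects plain "crashkernel=<size>[@offset]" entries.
 * Returns 1 and stores the size if an entry was found, 0 otherwise.
 */
static int demo_crashkernel_size(const char *cmdline, const char *suffix,
				 unsigned long long *size)
{
	static const char key[] = "crashkernel=";
	const char *p = cmdline;
	int found = 0;

	assert(cmdline != NULL && size != NULL);

	while ((p = strstr(p, key)) != NULL) {
		int at_param_start = (p == cmdline || p[-1] == ' ');
		unsigned long long v;
		char *end;

		p += sizeof(key) - 1;
		if (!at_param_start)
			continue;

		v = demo_memparse(p, &end);
		if (end == p)
			continue;	/* no number after '=' */

		if (suffix) {
			size_t len = strlen(suffix);

			/* The suffix must match and terminate the parameter. */
			if (strncmp(end, suffix, len) == 0 &&
			    (end[len] == ' ' || end[len] == '\0')) {
				*size = v;
				found = 1;
			}
		} else if (*end == ' ' || *end == '\0' || *end == '@') {
			*size = v;
			found = 1;
		}
	}

	return found;
}
```

With the example command line, a lookup without a suffix yields 100M and a
lookup with ",cma" yields 1G; when no ",cma" entry is present the lookup
simply reports nothing found, which matches the "completely optional"
behaviour described above.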
---
Changes since v2:
based on feedback from Baoquan He and David Hildenbrand:
- kept original formatting of suffix_tbl[]
- updated documentation to mention movable pages missing from vmcore
- fixed whitespace in documentation
- moved the call to crash_cma_clear_pending_dma() after
machine_crash_shutdown() so that non-crash CPUs and timers are
shut down before the delay
---
Changes since v1:
The key concern raised in the v1 discussion was that pages in the
CMA region may be pinned and used for a DMA transfer, potentially
corrupting the new kernel's memory. When the cma suffix is used, kdump
may be less reliable and the corruption hard to debug.
Since v2, the series addresses this concern in two ways:
1) Clearly stating the potential problem in the updated
Documentation and setting the expectation (patch 3/5)
Documentation now explicitly states that:
- the risk of kdump failure is increased
- the CMA reservation is intended for users who cannot or don't
want to sacrifice enough memory for a standard crashkernel reservation
and who prefer less reliable kdump to no kdump at all
This is consistent with the documentation of the
crash_kexec_post_notifiers option, which can also increase the
risk of kdump failure, yet may be the only way to use kdump on
some systems. And just like the crash_kexec_post_notifiers
option, the cma crashkernel suffix is completely optional:
the series has zero effect when the suffix is not used.
2) Giving DMA time to finish before booting the kdump kernel
(patch 4/5)
Pages pinned for long-term use with the FOLL_LONGTERM flag are
migrated out of the CMA region first. Pinning without this flag
shows that the user intends to use the pages only for short-lived
DMA transfers.
Delay the boot of the kdump kernel when the CMA reservation is
used, giving potential pending DMA transfers time to finish.
Other minor changes since v1:
- updated for 6.14-rc2
- moved #ifdefs and #defines to header files
- added __always_unused in parse_crashkernel() to silence a false
unused variable warning
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v3 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option
2025-03-12 21:00 [PATCH v3 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
@ 2025-03-12 21:03 ` Jiri Bohac
2025-03-12 21:05 ` [PATCH v3 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
` (3 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Jiri Bohac @ 2025-03-12 21:03 UTC
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Add a new cma_size parameter to parse_crashkernel().
When not NULL, call __parse_crashkernel to parse the CMA
reservation size from "crashkernel=size,cma" and store it
in cma_size.
Set cma_size to NULL in all calls to parse_crashkernel().
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
arch/arm/kernel/setup.c | 2 +-
arch/arm64/mm/init.c | 2 +-
arch/loongarch/kernel/setup.c | 2 +-
arch/mips/kernel/setup.c | 2 +-
arch/powerpc/kernel/fadump.c | 2 +-
arch/powerpc/kexec/core.c | 2 +-
arch/powerpc/mm/nohash/kaslr_booke.c | 2 +-
arch/riscv/mm/init.c | 2 +-
arch/s390/kernel/setup.c | 2 +-
arch/sh/kernel/machine_kexec.c | 2 +-
arch/x86/kernel/setup.c | 2 +-
include/linux/crash_reserve.h | 3 ++-
kernel/crash_reserve.c | 16 ++++++++++++++--
13 files changed, 27 insertions(+), 14 deletions(-)
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index a41c93988d2c..0bfd66c7ada0 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1004,7 +1004,7 @@ static void __init reserve_crashkernel(void)
total_mem = get_total_mem();
ret = parse_crashkernel(boot_command_line, total_mem,
&crash_size, &crash_base,
- NULL, NULL);
+ NULL, NULL, NULL);
/* invalid value specified or crashkernel=0 */
if (ret || !crash_size)
return;
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 9c0b8d9558fc..06bf216a4b0d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -107,7 +107,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;
diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
index edcfdfcad7d2..ffdfb5407043 100644
--- a/arch/loongarch/kernel/setup.c
+++ b/arch/loongarch/kernel/setup.c
@@ -266,7 +266,7 @@ static void __init arch_reserve_crashkernel(void)
return;
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
- &crash_size, &crash_base, &low_size, &high);
+ &crash_size, &crash_base, &low_size, NULL, &high);
if (ret)
return;
diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
index fbfe0771317e..11b9b6b63e19 100644
--- a/arch/mips/kernel/setup.c
+++ b/arch/mips/kernel/setup.c
@@ -458,7 +458,7 @@ static void __init mips_parse_crashkernel(void)
total_mem = memblock_phys_mem_size();
ret = parse_crashkernel(boot_command_line, total_mem,
&crash_size, &crash_base,
- NULL, NULL);
+ NULL, NULL, NULL);
if (ret != 0 || crash_size <= 0)
return;
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 4b371c738213..f90aaa2263aa 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -334,7 +334,7 @@ static __init u64 fadump_calculate_reserve_size(void)
* memory at a predefined offset.
*/
ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
- &size, &base, NULL, NULL);
+ &size, &base, NULL, NULL, NULL);
if (ret == 0 && size > 0) {
unsigned long max_size;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 58a930a47422..35f92427d282 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -66,7 +66,7 @@ void __init reserve_crashkernel(void)
total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
/* use common parsing */
ret = parse_crashkernel(boot_command_line, total_mem_sz,
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c b/arch/powerpc/mm/nohash/kaslr_booke.c
index 5c8d1bb98b3e..5e4897daaaea 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -178,7 +178,7 @@ static void __init get_crash_kernel(void *fdt, unsigned long size)
int ret;
ret = parse_crashkernel(boot_command_line, size, &crash_size,
- &crash_base, NULL, NULL);
+ &crash_base, NULL, NULL, NULL);
if (ret != 0 || crash_size == 0)
return;
if (crash_base == 0)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 15b2eda4c364..9634a800629b 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1405,7 +1405,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index d78bcfe707b5..4d9b5b5d0cb2 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -607,7 +607,7 @@ static void __init reserve_crashkernel(void)
int rc;
rc = parse_crashkernel(boot_command_line, ident_map_size,
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
crash_base = ALIGN(crash_base, KEXEC_CRASH_MEM_ALIGN);
crash_size = ALIGN(crash_size, KEXEC_CRASH_MEM_ALIGN);
diff --git a/arch/sh/kernel/machine_kexec.c b/arch/sh/kernel/machine_kexec.c
index 8321b31d2e19..37073ca1e0ad 100644
--- a/arch/sh/kernel/machine_kexec.c
+++ b/arch/sh/kernel/machine_kexec.c
@@ -146,7 +146,7 @@ void __init reserve_crashkernel(void)
return;
ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index cebee310e200..853afde761a7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -481,7 +481,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;
diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index 5a9df944fb80..a681f265a361 100644
--- a/include/linux/crash_reserve.h
+++ b/include/linux/crash_reserve.h
@@ -16,7 +16,8 @@ extern struct resource crashk_low_res;
int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base,
- unsigned long long *low_size, bool *high);
+ unsigned long long *low_size, unsigned long long *cma_size,
+ bool *high);
#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
#ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index a620fb4b2116..4969d60c00d6 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -172,17 +172,19 @@ static int __init parse_crashkernel_simple(char *cmdline,
#define SUFFIX_HIGH 0
#define SUFFIX_LOW 1
-#define SUFFIX_NULL 2
+#define SUFFIX_CMA 2
+#define SUFFIX_NULL 3
static __initdata char *suffix_tbl[] = {
[SUFFIX_HIGH] = ",high",
[SUFFIX_LOW] = ",low",
+ [SUFFIX_CMA] = ",cma",
[SUFFIX_NULL] = NULL,
};
/*
* That function parses "suffix" crashkernel command lines like
*
- * crashkernel=size,[high|low]
+ * crashkernel=size,[high|low|cma]
*
* It returns 0 on success and -EINVAL on failure.
*/
@@ -298,9 +300,11 @@ int __init parse_crashkernel(char *cmdline,
unsigned long long *crash_size,
unsigned long long *crash_base,
unsigned long long *low_size,
+ unsigned long long *cma_size,
bool *high)
{
int ret;
+ unsigned long long __always_unused cma_base;
/* crashkernel=X[@offset] */
ret = __parse_crashkernel(cmdline, system_ram, crash_size,
@@ -331,6 +335,14 @@ int __init parse_crashkernel(char *cmdline,
*high = true;
}
+
+ /*
+ * optional CMA reservation
+ * cma_base is ignored
+ */
+ if (cma_size)
+ __parse_crashkernel(cmdline, 0, cma_size,
+ &cma_base, suffix_tbl[SUFFIX_CMA]);
#endif
if (!*crash_size)
ret = -EINVAL;
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v3 2/5] kdump: implement reserve_crashkernel_cma
2025-03-12 21:00 [PATCH v3 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
2025-03-12 21:03 ` [PATCH v3 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
@ 2025-03-12 21:05 ` Jiri Bohac
2025-03-12 21:09 ` [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Jiri Bohac @ 2025-03-12 21:05 UTC
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
reserve_crashkernel_cma() reserves CMA ranges for the
crash kernel. If allocating the requested size fails,
try to reserve in smaller blocks.
Store the reserved ranges in the crashk_cma_ranges array
and the number of ranges in crashk_cma_cnt.
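The fallback to smaller blocks can be tried out in userspace with a toy
allocator. The sketch below mirrors only the control flow of the loop in this
patch; struct demo_pool, demo_alloc() and demo_reserve() are hypothetical and
the pool model (a handful of free contiguous blocks) is an assumption made
for illustration, not how CMA works internally.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE	4096ULL
#define DEMO_RANGES_MAX	4

#define DEMO_ROUNDUP(x, a)	((((x) + (a) - 1) / (a)) * (a))

/* Toy memory model: sizes in bytes of the free contiguous blocks. */
struct demo_pool {
	unsigned long long free_blk[8];
	int nr;
};

/* First-fit allocation from the toy pool; returns 0 on success. */
static int demo_alloc(struct demo_pool *pool, unsigned long long size)
{
	int i;

	for (i = 0; i < pool->nr; i++) {
		if (pool->free_blk[i] >= size) {
			pool->free_blk[i] -= size;
			return 0;
		}
	}
	return -1;
}

/*
 * Try to reserve 'want' bytes in at most DEMO_RANGES_MAX ranges,
 * halving the request size whenever an allocation fails.
 * Returns the number of bytes reserved and stores the range count.
 */
static unsigned long long demo_reserve(struct demo_pool *pool,
				       unsigned long long want,
				       int *nr_ranges)
{
	unsigned long long request = DEMO_ROUNDUP(want, DEMO_PAGE_SIZE);
	unsigned long long reserved = 0;

	assert(pool != NULL && nr_ranges != NULL);
	*nr_ranges = 0;

	while (want > reserved && *nr_ranges < DEMO_RANGES_MAX) {
		if (demo_alloc(pool, request)) {
			/* Allocation failed: retry with half-sized blocks. */
			if (request <= DEMO_PAGE_SIZE)
				break;
			request = DEMO_ROUNDUP(request / 2, DEMO_PAGE_SIZE);
			continue;
		}
		++*nr_ranges;
		reserved += request;
	}

	return reserved;
}
```

In this model a 1G request against free blocks of 600M, 300M and 300M ends
up as one 512M range plus two 256M ranges, while a pool made only of 128M
blocks hits the four-range limit with 512M reserved.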
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
include/linux/crash_reserve.h | 12 +++++++++
kernel/crash_reserve.c | 49 +++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)
diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index a681f265a361..97964f2a583d 100644
--- a/include/linux/crash_reserve.h
+++ b/include/linux/crash_reserve.h
@@ -13,12 +13,24 @@
*/
extern struct resource crashk_res;
extern struct resource crashk_low_res;
+extern struct range crashk_cma_ranges[];
+#if defined(CONFIG_CMA) && defined(CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION)
+#define CRASHKERNEL_CMA
+#define CRASHKERNEL_CMA_RANGES_MAX 4
+extern int crashk_cma_cnt;
+#else
+#define crashk_cma_cnt 0
+#define CRASHKERNEL_CMA_RANGES_MAX 0
+#endif
+
int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base,
unsigned long long *low_size, unsigned long long *cma_size,
bool *high);
+void __init reserve_crashkernel_cma(unsigned long long cma_size);
+
#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
#ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
#define DEFAULT_CRASH_KERNEL_LOW_SIZE (128UL << 20)
diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index 4969d60c00d6..3d35d90dde38 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -14,6 +14,8 @@
#include <linux/cpuhotplug.h>
#include <linux/memblock.h>
#include <linux/kmemleak.h>
+#include <linux/cma.h>
+#include <linux/crash_reserve.h>
#include <asm/page.h>
#include <asm/sections.h>
@@ -470,6 +472,53 @@ void __init reserve_crashkernel_generic(char *cmdline,
#endif
}
+struct range crashk_cma_ranges[CRASHKERNEL_CMA_RANGES_MAX];
+#ifdef CRASHKERNEL_CMA
+int crashk_cma_cnt;
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+ unsigned long long request_size = roundup(cma_size, PAGE_SIZE);
+ unsigned long long reserved_size = 0;
+
+ while (cma_size > reserved_size &&
+ crashk_cma_cnt < CRASHKERNEL_CMA_RANGES_MAX) {
+
+ struct cma *res;
+
+ if (cma_declare_contiguous(0, request_size, 0, 0, 0, false,
+ "crashkernel", &res)) {
+ /* reservation failed, try half-sized blocks */
+ if (request_size <= PAGE_SIZE)
+ break;
+
+ request_size = roundup(request_size / 2, PAGE_SIZE);
+ continue;
+ }
+
+ crashk_cma_ranges[crashk_cma_cnt].start = cma_get_base(res);
+ crashk_cma_ranges[crashk_cma_cnt].end =
+ crashk_cma_ranges[crashk_cma_cnt].start +
+ cma_get_size(res) - 1;
+ ++crashk_cma_cnt;
+ reserved_size += request_size;
+ }
+
+ if (cma_size > reserved_size)
+ pr_warn("crashkernel CMA reservation failed: %lld MB requested, %lld MB reserved in %d ranges\n",
+ cma_size >> 20, reserved_size >> 20, crashk_cma_cnt);
+ else
+ pr_info("crashkernel CMA reserved: %lld MB in %d ranges\n",
+ reserved_size >> 20, crashk_cma_cnt);
+}
+
+#else /* CRASHKERNEL_CMA */
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+ if (cma_size)
+ pr_warn("crashkernel CMA reservation not supported\n");
+}
+#endif
+
#ifndef HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY
static __init int insert_crashkernel_resources(void)
{
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation
2025-03-12 21:00 [PATCH v3 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
2025-03-12 21:03 ` [PATCH v3 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
2025-03-12 21:05 ` [PATCH v3 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
@ 2025-03-12 21:09 ` Jiri Bohac
2025-03-14 3:18 ` Baoquan He
2025-03-12 21:10 ` [PATCH v3 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
2025-03-12 21:11 ` [PATCH v3 5/5] x86: implement crashkernel cma reservation Jiri Bohac
4 siblings, 1 reply; 10+ messages in thread
From: Jiri Bohac @ 2025-03-12 21:09 UTC
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Describe the new crashkernel ",cma" suffix in Documentation/.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
Documentation/admin-guide/kdump/kdump.rst | 24 +++++++++++++++++--
.../admin-guide/kernel-parameters.txt | 22 +++++++++++++++++
2 files changed, 44 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 5376890adbeb..0a7ec98c74fa 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -315,6 +314,27 @@ crashkernel syntax
crashkernel=0,low
+4) crashkernel=size,cma
+
+ Reserve additional crash kernel memory from CMA. This reservation is
+ usable by the first system's userspace memory and kernel movable
+ allocations (memory balloon, zswap). Pages allocated from this memory
+ range will not be included in the vmcore so this should not be used if
+ dumping of userspace memory is intended and it has to be expected that
+ some movable kernel pages may be missing from the dump.
+
+ A standard crashkernel reservation, as described above, is still needed
+ to hold the crash kernel and initrd.
+
+ This option increases the risk of a kdump failure: DMA transfers
+ configured by the first kernel may end up corrupting the second
+ kernel's memory.
+
+ This reservation method is intended for systems that can't afford to
+ sacrifice enough memory for standard crashkernel reservation and where
+ less reliable and possibly incomplete kdump is preferable to no kdump at
+ all.
+
Boot into System Kernel
-----------------------
1) Update the boot loader (such as grub, yaboot, or lilo) configuration
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..895b974dc3bb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -987,6 +987,28 @@
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
+ crashkernel=size[KMG],cma
+ [KNL, X86] Reserve additional crash kernel memory from
+ CMA. This reservation is usable by the first system's
+ userspace memory and kernel movable allocations (memory
+ balloon, zswap). Pages allocated from this memory range
+ will not be included in the vmcore so this should not
+ be used if dumping of userspace memory is intended and
+ it has to be expected that some movable kernel pages
+ may be missing from the dump.
+
+ A standard crashkernel reservation, as described above,
+ is still needed to hold the crash kernel and initrd.
+
+ This option increases the risk of a kdump failure: DMA
+ transfers configured by the first kernel may end up
+ corrupting the second kernel's memory.
+
+ This reservation method is intended for systems that
+ can't afford to sacrifice enough memory for standard
+ crashkernel reservation and where less reliable and
+ possibly incomplete kdump is preferable to no kdump at
+ all.
cryptomgr.notests
[KNL] Disable crypto self-tests
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v3 4/5] kdump: wait for DMA to finish when using CMA
2025-03-12 21:00 [PATCH v3 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
` (2 preceding siblings ...)
2025-03-12 21:09 ` [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
@ 2025-03-12 21:10 ` Jiri Bohac
2025-03-12 21:11 ` [PATCH v3 5/5] x86: implement crashkernel cma reservation Jiri Bohac
4 siblings, 0 replies; 10+ messages in thread
From: Jiri Bohac @ 2025-03-12 21:10 UTC
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
When re-using the CMA area for kdump there is a risk of pending DMA into
pinned user pages in the CMA area.
Pages that are pinned long-term are migrated away from CMA, so these are
not a concern. Pages pinned without FOLL_LONGTERM remain in the CMA and may
possibly be the source or destination of a pending DMA transfer.
Although there is no clear specification of how long a page may be pinned
without FOLL_LONGTERM, pinning without the flag shows an intent of the
caller to only use the memory for short-lived DMA transfers, not a transfer
initiated by a device asynchronously at a random time in the future.
Add a delay of CMA_DMA_TIMEOUT_MSEC milliseconds before starting the kdump
kernel, giving such short-lived DMA transfers time to finish before the CMA
memory is re-used by the kdump kernel.
Set CMA_DMA_TIMEOUT_MSEC to 1000 (one second). The value is chosen
arbitrarily: it is a huge margin for a DMA transfer, yet it does not
increase the kdump time significantly.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
include/linux/crash_core.h | 5 +++++
kernel/crash_core.c | 10 ++++++++++
2 files changed, 15 insertions(+)
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 44305336314e..543e4a71f13c 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -56,6 +56,11 @@ static inline unsigned int crash_get_elfcorehdr_size(void) { return 0; }
/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
+/* Time to wait for possible DMA to finish before starting the kdump kernel
+ * when a CMA reservation is used
+ */
+#define CMA_DMA_TIMEOUT_MSEC 1000
+
extern int crash_exclude_mem_range(struct crash_mem *mem,
unsigned long long mstart,
unsigned long long mend);
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 078fe5bc5a74..ed152af08d1e 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -21,6 +21,7 @@
#include <linux/reboot.h>
#include <linux/btf.h>
#include <linux/objtool.h>
+#include <linux/delay.h>
#include <asm/page.h>
#include <asm/sections.h>
@@ -97,6 +98,14 @@ int kexec_crash_loaded(void)
}
EXPORT_SYMBOL_GPL(kexec_crash_loaded);
+static void crash_cma_clear_pending_dma(void)
+{
+ if (!crashk_cma_cnt)
+ return;
+
+ mdelay(CMA_DMA_TIMEOUT_MSEC);
+}
+
/*
* No panic_cpu check version of crash_kexec(). This function is called
* only when panic_cpu holds the current CPU number; this is the only CPU
@@ -119,6 +128,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
crash_setup_regs(&fixed_regs, regs);
crash_save_vmcoreinfo();
machine_crash_shutdown(&fixed_regs);
+ crash_cma_clear_pending_dma();
machine_kexec(kexec_crash_image);
}
kexec_unlock();
--
2.46.0
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* [PATCH v3 5/5] x86: implement crashkernel cma reservation
2025-03-12 21:00 [PATCH v3 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
` (3 preceding siblings ...)
2025-03-12 21:10 ` [PATCH v3 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
@ 2025-03-12 21:11 ` Jiri Bohac
4 siblings, 0 replies; 10+ messages in thread
From: Jiri Bohac @ 2025-03-12 21:11 UTC
To: Baoquan He, Vivek Goyal, Dave Young, kexec
Cc: Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu, linux-kernel,
David Hildenbrand, Michal Hocko
Implement the crashkernel CMA reservation for x86:
- enable parsing of the cma suffix by parse_crashkernel()
- reserve memory with reserve_crashkernel_cma()
- add the CMA-reserved ranges to the e820 map for the crash kernel
- exclude the CMA-reserved ranges from vmcore
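The e820 step boils down to converting inclusive [start, end] ranges into
(address, size) entries. A minimal userspace sketch of that conversion
follows; struct demo_range, struct demo_e820_entry, DEMO_E820_TYPE_RAM and
demo_fill_memmap() are hypothetical stand-ins for this example, not the
kernel's types.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_E820_TYPE_RAM	1

/* Inclusive address range, as stored for each CMA reservation. */
struct demo_range {
	unsigned long long start;
	unsigned long long end;
};

/* Simplified memory map entry: base address, length and type. */
struct demo_e820_entry {
	unsigned long long addr;
	unsigned long long size;
	unsigned int type;
};

/*
 * Describe each reserved range as usable RAM for the crash kernel.
 * Returns the number of entries written, or -1 if out is too small.
 */
static int demo_fill_memmap(const struct demo_range *ranges, int nr,
			    struct demo_e820_entry *out, int cap)
{
	int i;

	assert(ranges != NULL && out != NULL);

	if (nr > cap)
		return -1;

	for (i = 0; i < nr; i++) {
		out[i].addr = ranges[i].start;
		/* end is inclusive, hence the + 1 */
		out[i].size = ranges[i].end - ranges[i].start + 1;
		out[i].type = DEMO_E820_TYPE_RAM;
	}

	return nr;
}
```

Because the end address is inclusive, forgetting the "+ 1" would make every
entry one byte short.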
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
---
arch/x86/kernel/crash.c | 26 ++++++++++++++++++++++----
arch/x86/kernel/setup.c | 5 +++--
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 340af8155658..70823fa9abea 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -163,10 +163,10 @@ static struct crash_mem *fill_up_crash_elf_data(void)
return NULL;
/*
- * Exclusion of crash region and/or crashk_low_res may cause
- * another range split. So add extra two slots here.
+ * Exclusion of crash region, crashk_low_res and/or crashk_cma_ranges
+ * may cause range splits. So add extra slots here.
*/
- nr_ranges += 2;
+ nr_ranges += 2 + crashk_cma_cnt;
cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
if (!cmem)
return NULL;
@@ -184,6 +184,7 @@ static struct crash_mem *fill_up_crash_elf_data(void)
static int elf_header_exclude_ranges(struct crash_mem *cmem)
{
int ret = 0;
+ int i;
/* Exclude the low 1M because it is always reserved */
ret = crash_exclude_mem_range(cmem, 0, SZ_1M - 1);
@@ -198,8 +199,17 @@ static int elf_header_exclude_ranges(struct crash_mem *cmem)
if (crashk_low_res.end)
ret = crash_exclude_mem_range(cmem, crashk_low_res.start,
crashk_low_res.end);
+ if (ret)
+ return ret;
- return ret;
+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start,
+ crashk_cma_ranges[i].end);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
}
static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
@@ -352,6 +362,14 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
add_e820_entry(params, &ei);
}
+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ei.addr = crashk_cma_ranges[i].start;
+ ei.size = crashk_cma_ranges[i].end -
+ crashk_cma_ranges[i].start + 1;
+ ei.type = E820_TYPE_RAM;
+ add_e820_entry(params, &ei);
+ }
+
out:
vfree(cmem);
return ret;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 853afde761a7..90e10a18f0c9 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -471,7 +471,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)
static void __init arch_reserve_crashkernel(void)
{
- unsigned long long crash_base, crash_size, low_size = 0;
+ unsigned long long crash_base, crash_size, low_size = 0, cma_size = 0;
char *cmdline = boot_command_line;
bool high = false;
int ret;
@@ -481,7 +481,7 @@ static void __init arch_reserve_crashkernel(void)
ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, NULL, &high);
+ &low_size, &cma_size, &high);
if (ret)
return;
@@ -492,6 +492,7 @@ static void __init arch_reserve_crashkernel(void)
reserve_crashkernel_generic(cmdline, crash_size, crash_base,
low_size, high);
+ reserve_crashkernel_cma(cma_size);
}
static struct resource standard_io_resources[] = {
--
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, Prague, Czechia
* Re: [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation
2025-03-12 21:09 ` [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
@ 2025-03-14 3:18 ` Baoquan He
2025-06-27 12:16 ` David Hildenbrand
0 siblings, 1 reply; 10+ messages in thread
From: Baoquan He @ 2025-03-14 3:18 UTC
To: Jiri Bohac, David Hildenbrand
Cc: akpm, Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, Michal Hocko
Hi Jiri,
On 03/12/25 at 10:09pm, Jiri Bohac wrote:
......
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index fb8752b42ec8..895b974dc3bb 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -987,6 +987,28 @@
> 0: to disable low allocation.
> It will be ignored when crashkernel=X,high is not used
> or memory reserved is below 4G.
> + crashkernel=size[KMG],cma
> + [KNL, X86] Reserve additional crash kernel memory from
> + CMA. This reservation is usable by the first system's
> + userspace memory and kernel movable allocations (memory
> + balloon, zswap). Pages allocated from this memory range
> + will not be included in the vmcore so this should not
> + be used if dumping of userspace memory is intended and
> + it has to be expected that some movable kernel pages
> + may be missing from the dump.
Since David and Don expressed concern about the missing kernel pages
allocated from the CMA area in v2, and you argued this is still useful
for VM systems, I would like to invite David to help evaluate whether
the whole series is worthwhile from the VM and MM point of view.
Thanks
Baoquan
> +
> + A standard crashkernel reservation, as described above,
> + is still needed to hold the crash kernel and initrd.
> +
> + This option increases the risk of a kdump failure: DMA
> + transfers configured by the first kernel may end up
> + corrupting the second kernel's memory.
> +
> + This reservation method is intended for systems that
> + can't afford to sacrifice enough memory for standard
> + crashkernel reservation and where less reliable and
> + possibly incomplete kdump is preferable to no kdump at
> + all.
>
> cryptomgr.notests
> [KNL] Disable crypto self-tests
>
> --
> Jiri Bohac <jbohac@suse.cz>
> SUSE Labs, Prague, Czechia
>
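For illustration only: the value syntax documented in the hunk above,
size[KMG],cma, can be sketched as a small user-space parser. This is not
the kernel's parser. parse_crashkernel_cma() is a hypothetical helper
name, and it takes only the part after "crashkernel=". The sketch assumes
decimal sizes and, as a convenience, also accepts lowercase unit
suffixes.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space helper, NOT the kernel's parser: accepts a value
 * of the form "size[KMG],cma" (the part after "crashkernel=") and stores
 * the size in bytes in *bytes. Returns 0 on success, -1 on malformed input.
 */
static int parse_crashkernel_cma(const char *val, uint64_t *bytes)
{
	char *end;
	unsigned long long n;
	unsigned int shift = 0;

	/* strtoull() would silently accept a sign or leading whitespace. */
	if (!isdigit((unsigned char)val[0]))
		return -1;

	errno = 0;
	n = strtoull(val, &end, 10);
	if (errno == ERANGE)
		return -1;

	/* Optional binary unit suffix. */
	switch (*end) {
	case 'K': case 'k': shift = 10; end++; break;
	case 'M': case 'm': shift = 20; end++; break;
	case 'G': case 'g': shift = 30; end++; break;
	}

	/* Reject sizes that would overflow 64 bits once scaled. */
	if (n > (UINT64_MAX >> shift))
		return -1;

	/* Only the ",cma" form is handled by this sketch. */
	if (strcmp(end, ",cma") != 0)
		return -1;

	*bytes = (uint64_t)n << shift;
	return 0;
}
```

With this sketch, "512M,cma" yields 512 MiB in bytes, while "512M,high"
or a bare "512M" is rejected, since those forms belong to the other
crashkernel= variants described earlier in the same document.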
* Re: [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation
2025-03-14 3:18 ` Baoquan He
@ 2025-06-27 12:16 ` David Hildenbrand
2025-06-27 12:18 ` David Hildenbrand
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-06-27 12:16 UTC (permalink / raw)
To: Baoquan He, Jiri Bohac, David Hildenbrand
Cc: akpm, Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, Michal Hocko
On 14.03.25 04:18, Baoquan He wrote:
> Hi Jiri,
>
> On 03/12/25 at 10:09pm, Jiri Bohac wrote:
> ......
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index fb8752b42ec8..895b974dc3bb 100644
>> --- a/Documentation/admin-guide/kernel-parameters.txt
>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>> @@ -987,6 +987,28 @@
>> 0: to disable low allocation.
>> It will be ignored when crashkernel=X,high is not used
>> or memory reserved is below 4G.
>> + crashkernel=size[KMG],cma
>> + [KNL, X86] Reserve additional crash kernel memory from
>> + CMA. This reservation is usable by the first system's
>> + userspace memory and kernel movable allocations (memory
>> + balloon, zswap). Pages allocated from this memory range
>> + will not be included in the vmcore so this should not
>> + be used if dumping of userspace memory is intended and
>> + it has to be expected that some movable kernel pages
>> + may be missing from the dump.
>
> Since David and Don expressed concern about the missing kernel pages
> allocated from CMA area in v2, and you argued this is still useful for
> VM system, I would like to invite David to help evaluate the whole
> series if it's worth from the VM and MM point of view.
Balloon pages will not be dumped either way (PageOffline), so that is
not a concern.
Zsmalloc pages ... are probably fine right now. They should likely only
be storing compressed user data. (Not sure whether they also store some
other data structures; I think not, but I might be wrong.)
My comment was rather forward-looking: it is already not the case that
CMA memory contains only user space memory (though the existing cases
might be okay). In the future, as we support other movable allocations
(as raised, leaf page tables at some point, and there were discussions
about movable slab pages, although that might be challenging), this can
change (unless we find ways of not placing these allocations on CMA
memory).
So as is, this should be fine, but it's certainly something to be aware
of in the future.
--
Cheers,
David / dhildenb
* Re: [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation
2025-06-27 12:16 ` David Hildenbrand
@ 2025-06-27 12:18 ` David Hildenbrand
2025-07-07 4:54 ` Baoquan He
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-06-27 12:18 UTC (permalink / raw)
To: Baoquan He, Jiri Bohac, David Hildenbrand
Cc: akpm, Vivek Goyal, Dave Young, kexec, Philipp Rudo, Donald Dutile,
Pingfan Liu, Tao Liu, linux-kernel, Michal Hocko
On 27.06.25 14:16, David Hildenbrand wrote:
> On 14.03.25 04:18, Baoquan He wrote:
>> Hi Jiri,
>>
>> On 03/12/25 at 10:09pm, Jiri Bohac wrote:
>> ......
>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index fb8752b42ec8..895b974dc3bb 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -987,6 +987,28 @@
>>> 0: to disable low allocation.
>>> It will be ignored when crashkernel=X,high is not used
>>> or memory reserved is below 4G.
>>> + crashkernel=size[KMG],cma
>>> + [KNL, X86] Reserve additional crash kernel memory from
>>> + CMA. This reservation is usable by the first system's
>>> + userspace memory and kernel movable allocations (memory
>>> + balloon, zswap). Pages allocated from this memory range
>>> + will not be included in the vmcore so this should not
>>> + be used if dumping of userspace memory is intended and
>>> + it has to be expected that some movable kernel pages
>>> + may be missing from the dump.
>>
>> Since David and Don expressed concern about the missing kernel pages
>> allocated from CMA area in v2, and you argued this is still useful for
>> VM system, I would like to invite David to help evaluate the whole
>> series if it's worth from the VM and MM point of view.
>
> Balloon pages will not be dumped either way (PageOffline), so that is
> not a concern.
>
> Zsmalloc pages ... are probably fine right now. They should likely only
> be storing compressed user data. (not sure if they also store some other
> datastructures, I think no, but might be wrong)
>
> My comment was rather forward-looking: that CMA memory only contains
> user space memory is already not the case (but the existing cases might
> be okay). In the future, as we support other movable allocations (as
> raised, leaf page tables at some point, and there were discussions about
> movable slab pages, although that might be challenging) this can change
> (unless we find ways of not placing these allocations on CMA memory).
>
> So as is, this should be fine, but it's certainly something to be aware
> of in the future.
>
BTW, I realize this was a late reply, and that the series has already
proceeded. I just stumbled over that unanswered mail and thought I'd
clarify my point here.
--
Cheers,
David / dhildenb
* Re: [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation
2025-06-27 12:18 ` David Hildenbrand
@ 2025-07-07 4:54 ` Baoquan He
0 siblings, 0 replies; 10+ messages in thread
From: Baoquan He @ 2025-07-07 4:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jiri Bohac, David Hildenbrand, akpm, Vivek Goyal, Dave Young,
kexec, Philipp Rudo, Donald Dutile, Pingfan Liu, Tao Liu,
linux-kernel, Michal Hocko
On 06/27/25 at 02:18pm, David Hildenbrand wrote:
> On 27.06.25 14:16, David Hildenbrand wrote:
> > On 14.03.25 04:18, Baoquan He wrote:
> > > Hi Jiri,
> > >
> > > On 03/12/25 at 10:09pm, Jiri Bohac wrote:
> > > ......
> > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > > index fb8752b42ec8..895b974dc3bb 100644
> > > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > > @@ -987,6 +987,28 @@
> > > > 0: to disable low allocation.
> > > > It will be ignored when crashkernel=X,high is not used
> > > > or memory reserved is below 4G.
> > > > + crashkernel=size[KMG],cma
> > > > + [KNL, X86] Reserve additional crash kernel memory from
> > > > + CMA. This reservation is usable by the first system's
> > > > + userspace memory and kernel movable allocations (memory
> > > > + balloon, zswap). Pages allocated from this memory range
> > > > + will not be included in the vmcore so this should not
> > > > + be used if dumping of userspace memory is intended and
> > > > + it has to be expected that some movable kernel pages
> > > > + may be missing from the dump.
> > >
> > > Since David and Don expressed concern about the missing kernel pages
> > > allocated from CMA area in v2, and you argued this is still useful for
> > > VM system, I would like to invite David to help evaluate the whole
> > > series if it's worth from the VM and MM point of view.
> >
> > Balloon pages will not be dumped either way (PageOffline), so that is
> > not a concern.
> >
> > Zsmalloc pages ... are probably fine right now. They should likely only
> > be storing compressed user data. (not sure if they also store some other
> > datastructures, I think no, but might be wrong)
> >
> > My comment was rather forward-looking: that CMA memory only contains
> > user space memory is already not the case (but the existing cases might
> > be okay). In the future, as we support other movable allocations (as
> > raised, leaf page tables at some point, and there were discussions about
> > movable slab pages, although that might be challenging) this can change
> > (unless we find ways of not placing these allocations on CMA memory).
> >
> > So as is, this should be fine, but it's certainly something to be aware
> > of in the future.
> >
>
> BTW, I realize this was a late reply, and that the series already proceeded.
> Just stumbled over that un-replied mail and thought I'd clarify my point
> here.
Thanks a lot for deliberating on this and providing these helpful
details. As you said, this feature is fine for the time being; we can
keep this in mind and consider how to adapt once such movable
allocations can end up in CMA. The risk is also stated clearly in the
documentation.
Thread overview: 10+ messages
2025-03-12 21:00 [PATCH v3 0/5] kdump: crashkernel reservation from CMA Jiri Bohac
2025-03-12 21:03 ` [PATCH v3 1/5] Add a new optional ",cma" suffix to the crashkernel= command line option Jiri Bohac
2025-03-12 21:05 ` [PATCH v3 2/5] kdump: implement reserve_crashkernel_cma Jiri Bohac
2025-03-12 21:09 ` [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation Jiri Bohac
2025-03-14 3:18 ` Baoquan He
2025-06-27 12:16 ` David Hildenbrand
2025-06-27 12:18 ` David Hildenbrand
2025-07-07 4:54 ` Baoquan He
2025-03-12 21:10 ` [PATCH v3 4/5] kdump: wait for DMA to finish when using CMA Jiri Bohac
2025-03-12 21:11 ` [PATCH v3 5/5] x86: implement crashkernel cma reservation Jiri Bohac