* [RFC 1/3] crash_dump: Parse the CMA's mem_map in kdump
2023-12-18 5:23 [RFC 0/3] kdump: Check mem_map of CMA area in kdump Pingfan Liu
@ 2023-12-18 5:23 ` Pingfan Liu
2023-12-18 5:23 ` [RFC 2/3] of: kexec: Set up properties for reusing CMA " Pingfan Liu
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Pingfan Liu @ 2023-12-18 5:23 UTC (permalink / raw)
To: kexec
Cc: Pingfan Liu, Jiri Bohac, Michal Hocko, Philipp Rudo, Baoquan He,
Dave Young
From: Pingfan Liu <piliu@redhat.com>
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
To: kexec@lists.infradead.org
---
include/linux/kexec.h | 1 +
init/main.c | 4 +++
kernel/crash_dump.c | 80 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 8227455192b7..95e19c814c93 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -410,6 +410,7 @@ extern bool kexec_in_progress;
int crash_shrink_memory(unsigned long new_size);
ssize_t crash_get_memory_size(void);
+void kdump_kernel_reuse_mem(void);
#ifndef arch_kexec_protect_crashkres
/*
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..ed90097caaa8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
#include <linux/init_syscalls.h>
#include <linux/stackdepot.h>
#include <linux/randomize_kstack.h>
+#include <linux/kexec.h>
#include <net/net_namespace.h>
#include <asm/io.h>
@@ -1068,6 +1069,9 @@ void start_kernel(void)
arch_post_acpi_subsys_init();
kcsan_init();
+ /* */
+ kdump_kernel_reuse_mem();
+
/* Do the rest non-__init'ed, we're now alive */
arch_call_rest_init();
diff --git a/kernel/crash_dump.c b/kernel/crash_dump.c
index 92da32275af5..4605005f9534 100644
--- a/kernel/crash_dump.c
+++ b/kernel/crash_dump.c
@@ -4,6 +4,7 @@
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/export.h>
+#include <linux/cc_platform.h>
/*
* stores the physical address of elf header of crash image
@@ -39,3 +40,82 @@ static int __init setup_elfcorehdr(char *arg)
return end > arg ? 0 : -EINVAL;
}
early_param("elfcorehdr", setup_elfcorehdr);
+
+u64 kdump_cma_pfn;
+u64 kdump_cma_pg_cnt;
+u64 kdump_cma_pg_paddr;
+
+/* In old kernel, wreck should correspond to pfn */
+static void check_poison_page(struct page *wreck, int cnt, unsigned long pfn)
+{
+ int i = 0;
+ int order;
+ /* Info copied from old kernel */
+ struct page *check = wreck;
+ struct page *p;
+
+ /*
+ * Strictly check, but there is still a rare case, where both _refcnt and
+ * _mapcount are forged by some wrong code
+ */
+ for (; i < cnt; i += 1 << order) {
+ order = folio_order((struct folio *)check);
+ if (PageBuddy(check)) {
+ if (page_count(check) != 0 || total_mapcount(check) != 0)
+ goto fail;
+ if (check->index != 0)
+ goto fail;
+ if (check->lru.next != LIST_POISON1 ||
+ check->lru.prev != LIST_POISON2)
+ goto fail;
+
+ } else if (folio_test_anon((struct folio *)check)) {
+ /* check PAGE_MAPPING_ANON bit in mapping */
+ if (!((unsigned long)check->mapping & PAGE_MAPPING_ANON))
+ goto fail;
+ if (page_count(check) == 0)
+ goto fail;
+ if (total_mapcount(check) != page_count(check))
+ goto fail;
+ } else {
+ goto fail;
+ }
+
+ p = pfn_to_page(pfn);
+ for (int j = 0; j < 1 << order; j++)
+ ClearPageReserved(p + j);
+ __free_pages(p, order);
+
+fail:
+ check += 1 << order;
+ pfn += 1 << order;
+ }
+}
+
+void kdump_kernel_reuse_mem(void)
+{
+ u64 size;
+ struct page *wreck;
+ struct kvec kvec;
+ struct iov_iter iter;
+ char *buf;
+
+ if (!is_kdump_kernel())
+ return;
+
+ if (!kdump_cma_pfn || !kdump_cma_pg_cnt || !kdump_cma_pg_paddr)
+ return;
+
+ size = kdump_cma_pg_cnt * sizeof(struct page);
+ /* copy the wreck's page[] */
+ buf = kmalloc(size, GFP_KERNEL);
+ if (!buf)
+ return;
+ wreck = (struct page *)buf;
+ kvec.iov_base = buf;
+ kvec.iov_len = size;
+ iov_iter_kvec(&iter, ITER_DEST, &kvec, 1, size);
+ read_from_oldmem(&iter, size, &kdump_cma_pg_paddr,
+ cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT));
+ check_poison_page(wreck, kdump_cma_pg_cnt, kdump_cma_pfn);
+}
--
2.31.1
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply related [flat|nested] 6+ messages in thread* [RFC 2/3] of: kexec: Set up properties for reusing CMA in kdump
2023-12-18 5:23 [RFC 0/3] kdump: Check mem_map of CMA area in kdump Pingfan Liu
2023-12-18 5:23 ` [RFC 1/3] crash_dump: Parse the CMA's mem_map " Pingfan Liu
@ 2023-12-18 5:23 ` Pingfan Liu
2023-12-18 5:23 ` [RFC 3/3] of: fdt: Parse properties of " Pingfan Liu
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Pingfan Liu @ 2023-12-18 5:23 UTC (permalink / raw)
To: kexec
Cc: Pingfan Liu, Jiri Bohac, Michal Hocko, Philipp Rudo, Baoquan He,
Dave Young
From: Pingfan Liu <piliu@redhat.com>
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
To: kexec@lists.infradead.org
---
drivers/of/kexec.c | 14 ++++++++++++++
include/linux/kexec.h | 4 ++++
2 files changed, 18 insertions(+)
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 68278340cecf..471b6cdf34ff 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -410,6 +410,20 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
if (ret)
goto out;
}
+
+ ret = fdt_appendprop_u64(fdt, chosen_node,
+ "kdump,cma_pfn", kdump_cma_pfn);
+ if (ret)
+ goto out;
+ ret = fdt_appendprop_u64(fdt, chosen_node,
+ "kdump,cma_pg_cnt", kdump_cma_pg_cnt);
+ if (ret)
+ goto out;
+ ret = fdt_appendprop_u64(fdt, chosen_node,
+ "kdump,cma_pg_paddr", kdump_cma_pg_paddr);
+ if (ret)
+ goto out;
+
}
/* add bootargs */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 95e19c814c93..bf2d8b9d8c07 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -408,6 +408,10 @@ bool kexec_load_permitted(int kexec_image_type);
/* flag to track if kexec reboot is in progress */
extern bool kexec_in_progress;
+extern u64 kdump_cma_pfn;
+extern u64 kdump_cma_pg_cnt;
+extern u64 kdump_cma_pg_paddr;
+
int crash_shrink_memory(unsigned long new_size);
ssize_t crash_get_memory_size(void);
void kdump_kernel_reuse_mem(void);
--
2.31.1
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply related [flat|nested] 6+ messages in thread* [RFC 3/3] of: fdt: Parse properties of reusing CMA in kdump
2023-12-18 5:23 [RFC 0/3] kdump: Check mem_map of CMA area in kdump Pingfan Liu
2023-12-18 5:23 ` [RFC 1/3] crash_dump: Parse the CMA's mem_map " Pingfan Liu
2023-12-18 5:23 ` [RFC 2/3] of: kexec: Set up properties for reusing CMA " Pingfan Liu
@ 2023-12-18 5:23 ` Pingfan Liu
2023-12-18 15:19 ` [RFC 0/3] kdump: Check mem_map of CMA area " Michal Hocko
2023-12-19 15:20 ` Philipp Rudo
4 siblings, 0 replies; 6+ messages in thread
From: Pingfan Liu @ 2023-12-18 5:23 UTC (permalink / raw)
To: kexec
Cc: Pingfan Liu, Jiri Bohac, Michal Hocko, Philipp Rudo, Baoquan He,
Dave Young
From: Pingfan Liu <piliu@redhat.com>
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
To: kexec@lists.infradead.org
---
drivers/of/fdt.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index bf502ba8da95..09673440f876 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -620,6 +620,19 @@ static void __init fdt_reserve_elfcorehdr(void)
elfcorehdr_size >> 10, elfcorehdr_addr);
}
+static void __init fdt_reserve_kdump_cma(void)
+{
+ if (!IS_ENABLED(CONFIG_CRASH_DUMP))
+ return;
+
+ if (!is_kdump_kernel())
+ return;
+ if (!kdump_cma_pfn || !kdump_cma_pg_cnt || !kdump_cma_pg_paddr)
+ return;
+ memblock_reserve(kdump_cma_pfn << PAGE_SHIFT, kdump_cma_pg_cnt << PAGE_SHIFT);
+ pr_info("Reserving %llu KiB of memory at 0x%llx for kdump cma\n",
+ kdump_cma_pg_cnt >> 10, kdump_cma_pfn << PAGE_SHIFT);
+}
/**
* early_init_fdt_scan_reserved_mem() - create reserved memory regions
*
@@ -637,6 +650,7 @@ void __init early_init_fdt_scan_reserved_mem(void)
fdt_scan_reserved_mem();
fdt_reserve_elfcorehdr();
+ fdt_reserve_kdump_cma();
/* Process header /memreserve/ fields */
for (n = 0; ; n++) {
@@ -1150,6 +1164,34 @@ int __init early_init_dt_scan_memory(void)
return found_memory;
}
+static void __init early_init_dt_check_kdump_cma(unsigned long node)
+{
+ const __be32 *prop;
+ int len;
+
+ if (!IS_ENABLED(CONFIG_CRASH_DUMP))
+ return;
+ if (!is_kdump_kernel())
+ return;
+ pr_debug("Looking for kdump cma property... ");
+
+ prop = of_get_flat_dt_prop(node, "kdump,cma_pfn", &len);
+ if (!prop)
+ return;
+ kdump_cma_pfn = *prop;
+
+ prop = of_get_flat_dt_prop(node, "kdump,cma_pg_cnt", &len);
+ if (!prop)
+ return;
+ kdump_cma_pg_cnt = *prop;
+
+ prop = of_get_flat_dt_prop(node, "kdump,cma_pg_paddr", &len);
+ if (!prop)
+ return;
+ kdump_cma_pg_paddr = *prop;
+
+}
+
int __init early_init_dt_scan_chosen(char *cmdline)
{
int l, node;
@@ -1168,6 +1210,7 @@ int __init early_init_dt_scan_chosen(char *cmdline)
early_init_dt_check_for_initrd(node);
early_init_dt_check_for_elfcorehdr(node);
+ early_init_dt_check_kdump_cma(node);
rng_seed = of_get_flat_dt_prop(node, "rng-seed", &l);
if (rng_seed && l > 0) {
--
2.31.1
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply related [flat|nested] 6+ messages in thread* Re: [RFC 0/3] kdump: Check mem_map of CMA area in kdump
2023-12-18 5:23 [RFC 0/3] kdump: Check mem_map of CMA area in kdump Pingfan Liu
` (2 preceding siblings ...)
2023-12-18 5:23 ` [RFC 3/3] of: fdt: Parse properties of " Pingfan Liu
@ 2023-12-18 15:19 ` Michal Hocko
2023-12-19 15:20 ` Philipp Rudo
4 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2023-12-18 15:19 UTC (permalink / raw)
To: Pingfan Liu
Cc: kexec, Pingfan Liu, Jiri Bohac, Philipp Rudo, Baoquan He,
Dave Young
On Mon 18-12-23 13:23:22, Pingfan Liu wrote:
> From: Pingfan Liu <piliu@redhat.com>
>
>
> First of all, this series is only for proof of concept. It only passes compilation.
>
> For years, CMA is proposed to be used as crashkernel reserved memory.
> But DIO prevent us to follow it since DMA may be in-flight and ruin the
> kdump kernel.
>
> This series exports the crash kernel's CMA area information through
> device-tree, and kdump kernel skips any page, which refcnt!=mapcount and
> has a potential DMA activity.
I didn't have time to look deeper into implementation (and I will get
back to it only early Jan) but mapcount based checks are really tricky
and unreliable. folio_maybe_dma_pinned sounds like a better test. You
definitely want to have that checked by more MM people and CC linux-mm.
> The exported information include:
> u64 kdump_cma_pfn;
> u64 kdump_cma_pg_cnt;
> u64 kdump_cma_pg_paddr;
>
> And they should be filled with Jiri's series "[PATCH 0/4] kdump:
> crashkernel reservation from CMA"
>
> After the conjunction of two series, the CMA used for kdump has only the
> following risk, where the following conditions:
> -1.a wrong code forges _refcnt and mapcount to the same value
> -2.the page is also used by DIO
>
>
> Is it acceptable, or any rescue e.g. CRC on page?
We alredy do have vm_debug=P which enables init time poisoning
on all struct pages. The value is then checked when the page is
allocated.
> Please share your thoughts.
Having a sanity check on exported cma pages makes some sense to me. The
exact check might be more involved with false positives but they
shouldn't be a major problem unless there are too many of them.
--
Michal Hocko
SUSE Labs
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: [RFC 0/3] kdump: Check mem_map of CMA area in kdump
2023-12-18 5:23 [RFC 0/3] kdump: Check mem_map of CMA area in kdump Pingfan Liu
` (3 preceding siblings ...)
2023-12-18 15:19 ` [RFC 0/3] kdump: Check mem_map of CMA area " Michal Hocko
@ 2023-12-19 15:20 ` Philipp Rudo
4 siblings, 0 replies; 6+ messages in thread
From: Philipp Rudo @ 2023-12-19 15:20 UTC (permalink / raw)
To: Pingfan Liu
Cc: kexec, Pingfan Liu, Jiri Bohac, Michal Hocko, Baoquan He,
Dave Young
Hi Pingfan,
On Mon, 18 Dec 2023 13:23:22 +0800
Pingfan Liu <kernelfans@gmail.com> wrote:
> From: Pingfan Liu <piliu@redhat.com>
>
>
> First of all, this series is only for proof of concept. It only passes compilation.
>
> For years, CMA is proposed to be used as crashkernel reserved memory.
> But DIO prevent us to follow it since DMA may be in-flight and ruin the
> kdump kernel.
>
> This series exports the crash kernel's CMA area information through
> device-tree, and kdump kernel skips any page, which refcnt!=mapcount and
> has a potential DMA activity.
>
> The exported information include:
> u64 kdump_cma_pfn;
> u64 kdump_cma_pg_cnt;
> u64 kdump_cma_pg_paddr;
>
> And they should be filled with Jiri's series "[PATCH 0/4] kdump:
> crashkernel reservation from CMA"
>
> After the conjunction of two series, the CMA used for kdump has only the
> following risk, where the following conditions:
> -1.a wrong code forges _refcnt and mapcount to the same value
> -2.the page is also used by DIO
>
>
> Is it acceptable, or any rescue e.g. CRC on page?
>
> Please share your thoughts.
I don't think your approach will work as intended. The problem is that
we are dealing with two separate kernels and there is no guarantee that
both kernels are identical. So you cannot rely on the definition of
struct page in the crash kernel to be identical to the one in the
panicked kernel. Meaning check_poison_page from the crash kernel cannot
simply operate on the struct pages from the panicked kernel.
To get this approach to work I see three possible "fixes"
1) enforce in kexec that only the currently running kernel can be
loaded as crash kernel.
2) pass all required "debuginfo" to the crash kernel so it can parse
the required data reliably from the dump. This also requires to have
all the mm helper functions to be reimplemented to work in
check_poison_page.
3) the required information is passed via a new data structure which
is designed in a way that it can easily be passed in between different
kernels. But this would require the mm subsystem to maintain the page
states in the CMA in two separate data structures.
Personally I don't think that any of the three "fixes" is desirable.
Thanks
Philipp
> Thanks,
>
> Pingfan
>
>
> Cc: Jiri Bohac <jbohac@suse.cz>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Philipp Rudo <prudo@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Dave Young <dyoung@redhat.com>
> To: kexec@lists.infradead.org
> ---
> Pingfan Liu (3):
> crash_dump: Parse the CMA's mem_map in kdump
> of: kexec: Set up properties for reusing CMA in kdump
> of: fdt: Parse properties of reusing CMA in kdump
>
> drivers/of/fdt.c | 43 +++++++++++++++++++++++
> drivers/of/kexec.c | 14 ++++++++
> include/linux/kexec.h | 5 +++
> init/main.c | 4 +++
> kernel/crash_dump.c | 80 +++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 146 insertions(+)
>
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply [flat|nested] 6+ messages in thread