* [PATCH 0/8] arm64 kexec kernel patches
@ 2014-05-09 0:48 Geoff Levand
2014-05-09 0:48 ` [PATCH 8/8] arm64: Enable kexec in defconfig Geoff Levand
` (8 more replies)
0 siblings, 9 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel
Hi Maintainers,

This patchset adds support for kexec re-boots on arm64. I have tested with the
VE fast model using various kernel config options with both spin and psci enable
methods. I'll continue to test in the coming weeks.

I tried to re-use the existing hot plug cpu_ops support for the secondary CPU
shutdown as much as possible, but needed to do some things specific to kexec
that I couldn't do with what was already there. A significant change is in
[PATCH 4/8] (arm64: Add smp_spin_table_set_die) where I add the ability to set
up a custom cpu_die handler.

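As a rough illustration of the idea behind [PATCH 4/8], here is a minimal
user-space sketch of a replaceable die handler. It is not the kernel code. All
demo_* names are hypothetical. The only detail taken from the series is the
handler shape, void (*)(unsigned int cpu), which matches how kexec_spin_1 is
passed to smp_spin_table_set_die() in [PATCH 7/8].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: runs on the CPU that is going down. */
typedef void (*demo_die_fn)(unsigned int cpu);

static unsigned int demo_last_cpu = ~0u;	/* last cpu argument seen */
static int demo_custom_calls;			/* calls into the custom handler */

/* Default behaviour when no custom handler is installed. */
static void demo_default_die(unsigned int cpu)
{
	demo_last_cpu = cpu;
}

/* Stand-in for a kexec-specific spin routine. */
static void demo_kexec_spin(unsigned int cpu)
{
	demo_last_cpu = cpu;
	demo_custom_calls++;
}

static demo_die_fn demo_die_handler = demo_default_die;

/* Install a custom handler; passing NULL restores the default. */
static void demo_set_die(demo_die_fn fn)
{
	demo_die_handler = fn ? fn : demo_default_die;
}

/* What a cpu_die hook would call on the dying CPU. */
static void demo_cpu_die(unsigned int cpu)
{
	assert(demo_die_handler != NULL);
	demo_die_handler(cpu);
}
```

In this sketch the kexec path would call demo_set_die(demo_kexec_spin) when an
image is loaded, so that a later demo_cpu_die() parks the CPU in kexec's own
spin code instead of the default one.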
To get the spin-table secondary CPUs into the proper state described in
Documentation/arm64/booting.txt I use a three step spin loop: first in the
kernel's virtual address space, then at the identity mapped address, and
finally in the spin code in the 2nd stage kernel's /memreserve/ area. To
support this three step spin I needed [PATCH 5/8] (arm64: Split soft_restart
into two stages). Please see the patch comments for more info. If we added the
2nd stage kernel's /memreserve/ area to the identity map we could eliminate the
middle step and go from the VA space to the /memreserve/ area directly.

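The three step hand-off can be sketched as a small single-threaded simulation.
This is only a model under stated assumptions: the demo_* names are
hypothetical, plain integers stand in for the atomic counters, and the real
secondaries run concurrently on their own CPUs. Each secondary waits for a
signal, acknowledges it by decrementing the counter, and moves to the next
spin location. The boot CPU sets a counter to the number of secondaries and
waits for it to reach zero.

```c
#include <assert.h>

/* Where a simulated secondary CPU is currently spinning. */
enum demo_stage {
	DEMO_VA_SPIN,		/* step 1: kernel virtual address space */
	DEMO_IDMAP_SPIN,	/* step 2: identity mapped address */
	DEMO_MEMRESERVE_SPIN,	/* step 3: 2nd stage /memreserve/ area */
	DEMO_RELEASED		/* jumped to the release address */
};

struct demo_cpu {
	enum demo_stage stage;
	unsigned long release_addr;	/* zero means "keep spinning" */
};

/* Counters set by the boot CPU; each secondary decrements once per stage. */
static long demo_signal_1;
static long demo_signal_2;

/* Advance one secondary by at most one stage. Returns 1 if it moved. */
static int demo_step(struct demo_cpu *c)
{
	switch (c->stage) {
	case DEMO_VA_SPIN:
		if (!demo_signal_1)
			return 0;
		demo_signal_1--;	/* acknowledge entry to step 2 */
		c->stage = DEMO_IDMAP_SPIN;
		return 1;
	case DEMO_IDMAP_SPIN:
		if (!demo_signal_2)
			return 0;
		demo_signal_2--;	/* acknowledge entry to step 3 */
		c->stage = DEMO_MEMRESERVE_SPIN;
		return 1;
	case DEMO_MEMRESERVE_SPIN:
		if (!c->release_addr)
			return 0;
		c->stage = DEMO_RELEASED;
		return 1;
	default:
		return 0;
	}
}

/* Give every secondary one chance to move. Returns how many moved. */
static int demo_step_all(struct demo_cpu *cpus, int n)
{
	int i, moved = 0;

	for (i = 0; i < n; i++)
		moved += demo_step(&cpus[i]);

	return moved;
}
```

A counter of zero after a round tells the boot CPU that every secondary has
left the previous spin location, so that memory can be reused.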
Please consider all patches for inclusion. Any comments or suggestions on how
to improve would be very welcome.

To load a kexec kernel and execute a kexec re-boot on arm64 my patches to
kexec-tools, which have not yet been merged upstream, are needed:

https://git.linaro.org/people/geoff.levand/kexec-tools.git

-Geoff

The following changes since commit 89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd:
Linux 3.15-rc4 (2014-05-04 18:14:42 -0700)
are available in the git repository at:
git://git.linaro.org/people/geoff.levand/linux-kexec.git for-arm64-kexec
for you to fetch changes up to 32399380e2249697ca549848ef83e5706eb4d83c:
arm64: Enable kexec in defconfig (2014-05-08 17:09:27 -0700)
----------------------------------------------------------------
Geoff Levand (8):
arm64: Use cpu_ops for smp_stop
arm64: Make cpu_read_ops generic
arm64: Add spin-table cpu_die
arm64: Add smp_spin_table_set_die
arm64: Split soft_restart into two stages
arm64/kexec: kexec needs cpu_die
arm64/kexec: Add core kexec support
arm64: Enable kexec in defconfig
MAINTAINERS | 9 +
arch/arm64/Kconfig | 8 +
arch/arm64/configs/defconfig | 1 +
arch/arm64/include/asm/cpu_ops.h | 5 +-
arch/arm64/include/asm/kexec.h | 44 +++
arch/arm64/include/asm/system_misc.h | 1 +
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/cpu_ops.c | 11 +-
arch/arm64/kernel/machine_kexec.c | 623 +++++++++++++++++++++++++++++++++++
arch/arm64/kernel/process.c | 4 +-
arch/arm64/kernel/psci.c | 4 +-
arch/arm64/kernel/relocate_kernel.S | 239 ++++++++++++++
arch/arm64/kernel/smp.c | 10 +-
arch/arm64/kernel/smp_spin_table.c | 21 +-
include/uapi/linux/kexec.h | 1 +
15 files changed, 969 insertions(+), 13 deletions(-)
create mode 100644 arch/arm64/include/asm/kexec.h
create mode 100644 arch/arm64/kernel/machine_kexec.c
create mode 100644 arch/arm64/kernel/relocate_kernel.S
--
1.9.1
* [PATCH 8/8] arm64: Enable kexec in defconfig
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 0:48 ` [PATCH 2/8] arm64: Make cpu_read_ops generic Geoff Levand
` (7 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 arch/arm64/configs/defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 7959dd0..6ab5f63 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -28,6 +28,7 @@ CONFIG_ARCH_XGENE=y
 CONFIG_SMP=y
 CONFIG_PREEMPT=y
 CONFIG_CMA=y
+CONFIG_KEXEC=y
 CONFIG_CMDLINE="console=ttyAMA0"
 # CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
 CONFIG_COMPAT=y
--
1.9.1
* [PATCH 2/8] arm64: Make cpu_read_ops generic
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
2014-05-09 0:48 ` [PATCH 8/8] arm64: Enable kexec in defconfig Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 0:48 ` [PATCH 6/8] arm64/kexec: kexec needs cpu_die Geoff Levand
` (6 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

Remove the __init attribute from the cpu_read_ops() routine and the
__initconst attribute from the supported_cpu_ops[] variable to allow
cpu_read_ops() to be used after kernel initialization.

Change cpu_read_ops() from acting on the static local variable cpu_ops to
acting on a pointer to an array of struct cpu_operations passed as a
cpu_read_ops() parameter. Also change any calls to cpu_read_ops() to pass the
local cpu_ops variable. This change has no functional effect.

The kexec_load syscall handling can re-use the cpu_read_ops() routine in its
parsing of the device tree for the 2nd stage kernel. kexec_load syscall can be
called after kernel init, so cannot use any routines with the __init attribute.

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 arch/arm64/include/asm/cpu_ops.h | 3 ++-
 arch/arm64/kernel/cpu_ops.c | 11 ++++++-----
 arch/arm64/kernel/smp.c | 2 +-
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/cpu_ops.h b/arch/arm64/include/asm/cpu_ops.h
index 1524130..872f61a 100644
--- a/arch/arm64/include/asm/cpu_ops.h
+++ b/arch/arm64/include/asm/cpu_ops.h
@@ -59,7 +59,8 @@ struct cpu_operations {
 };
 
 extern const struct cpu_operations *cpu_ops[NR_CPUS];
-extern int __init cpu_read_ops(struct device_node *dn, int cpu);
+extern int cpu_read_ops(struct device_node *dn, int cpu,
+	const struct cpu_operations **cpu_ops);
 extern void __init cpu_read_bootcpu_ops(void);
 
 #endif /* ifndef __ASM_CPU_OPS_H */
diff --git a/arch/arm64/kernel/cpu_ops.c b/arch/arm64/kernel/cpu_ops.c
index d62d12f..6ccba89 100644
--- a/arch/arm64/kernel/cpu_ops.c
+++ b/arch/arm64/kernel/cpu_ops.c
@@ -27,7 +27,7 @@ extern const struct cpu_operations cpu_psci_ops;
 
 const struct cpu_operations *cpu_ops[NR_CPUS];
 
-static const struct cpu_operations *supported_cpu_ops[] __initconst = {
+static const struct cpu_operations *supported_cpu_ops[] = {
 #ifdef CONFIG_SMP
 	&smp_spin_table_ops,
 	&cpu_psci_ops,
@@ -52,7 +52,8 @@ static const struct cpu_operations * __init cpu_get_ops(const char *name)
 /*
  * Read a cpu's enable method from the device tree and record it in cpu_ops.
  */
-int __init cpu_read_ops(struct device_node *dn, int cpu)
+int cpu_read_ops(struct device_node *dn, int cpu,
+	const struct cpu_operations **cpu_ops)
 {
 	const char *enable_method = of_get_property(dn, "enable-method", NULL);
 	if (!enable_method) {
@@ -66,8 +67,8 @@ int __init cpu_read_ops(struct device_node *dn, int cpu)
 		return -ENOENT;
 	}
 
-	cpu_ops[cpu] = cpu_get_ops(enable_method);
-	if (!cpu_ops[cpu]) {
+	*cpu_ops = cpu_get_ops(enable_method);
+	if (!*cpu_ops) {
 		pr_warn("%s: unsupported enable-method property: %s\n",
 			dn->full_name, enable_method);
 		return -EOPNOTSUPP;
@@ -83,5 +84,5 @@ void __init cpu_read_bootcpu_ops(void)
 		pr_err("Failed to find device node for boot cpu\n");
 		return;
 	}
-	cpu_read_ops(dn, 0);
+	cpu_read_ops(dn, 0, &cpu_ops[0]);
 }
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 020bbd5..f9241c1 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -362,7 +362,7 @@ void __init smp_init_cpus(void)
 		if (cpu >= NR_CPUS)
 			goto next;
 
-		if (cpu_read_ops(dn, cpu) != 0)
+		if (cpu_read_ops(dn, cpu, &cpu_ops[cpu]))
 			goto next;
 
 		if (cpu_ops[cpu]->cpu_init(dn, cpu))
--
1.9.1
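The interface change in this patch, from writing into a fixed global table to
writing through a caller-supplied pointer, can be sketched in plain C. This is
a user-space illustration only. The demo_* names are hypothetical and the
lookup takes the enable-method string directly instead of a device tree node.
The error codes mirror the ones visible in the diff (-ENOENT, -EOPNOTSUPP).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for struct cpu_operations. */
struct demo_ops {
	const char *name;
};

static const struct demo_ops demo_spin_ops = { "spin-table" };
static const struct demo_ops demo_psci_ops = { "psci" };

/* NULL-terminated list of supported enable methods. */
static const struct demo_ops *const demo_supported[] = {
	&demo_spin_ops,
	&demo_psci_ops,
	NULL,
};

/*
 * Look up an enable method by name and store the result through 'ops'.
 * The caller decides where the result lives: a slot in a global table
 * or a private structure.
 */
static int demo_read_ops(const char *enable_method,
	const struct demo_ops **ops)
{
	size_t i;

	if (!enable_method)
		return -ENOENT;

	for (i = 0; demo_supported[i]; i++) {
		if (!strcmp(enable_method, demo_supported[i]->name)) {
			*ops = demo_supported[i];
			return 0;
		}
	}

	*ops = NULL;
	return -EOPNOTSUPP;
}
```

With this shape the boot path can keep passing a slot of its global table,
while a later caller such as the kexec_load handling can pass the address of a
field in its own per-cpu info structure.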
* [PATCH 6/8] arm64/kexec: kexec needs cpu_die
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
2014-05-09 0:48 ` [PATCH 8/8] arm64: Enable kexec in defconfig Geoff Levand
2014-05-09 0:48 ` [PATCH 2/8] arm64: Make cpu_read_ops generic Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 8:24 ` Mark Rutland
2014-05-09 0:48 ` [PATCH 7/8] arm64/kexec: Add core kexec support Geoff Levand
` (5 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

Kexec uses the cpu_die method of struct cpu_operations, so add
defined(CONFIG_KEXEC) to the preprocessor conditional that enables cpu_die.

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 arch/arm64/kernel/psci.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/psci.c b/arch/arm64/kernel/psci.c
index ea4828a..0e5fa69 100644
--- a/arch/arm64/kernel/psci.c
+++ b/arch/arm64/kernel/psci.c
@@ -253,7 +253,7 @@ static int cpu_psci_cpu_boot(unsigned int cpu)
 	return err;
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_KEXEC)
 static int cpu_psci_cpu_disable(unsigned int cpu)
 {
 	/* Fail early if we don't have CPU_OFF support */
@@ -284,7 +284,7 @@ const struct cpu_operations cpu_psci_ops = {
 	.cpu_init = cpu_psci_cpu_init,
 	.cpu_prepare = cpu_psci_cpu_prepare,
 	.cpu_boot = cpu_psci_cpu_boot,
-#ifdef CONFIG_HOTPLUG_CPU
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_KEXEC)
 	.cpu_disable = cpu_psci_cpu_disable,
 	.cpu_die = cpu_psci_cpu_die,
 #endif
--
1.9.1
* [PATCH 6/8] arm64/kexec: kexec needs cpu_die
2014-05-09 0:48 ` [PATCH 6/8] arm64/kexec: kexec needs cpu_die Geoff Levand
@ 2014-05-09 8:24 ` Mark Rutland
2014-05-13 22:27 ` Geoff Levand
0 siblings, 1 reply; 23+ messages in thread
From: Mark Rutland @ 2014-05-09 8:24 UTC (permalink / raw)
To: linux-arm-kernel

Hi Geoff,

On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote:
> Kexec uses the cpu_die method of struct cpu_operations, so add
> defined(CONFIG_KEXEC) to the preprocessor conditional that enables cpu_die.

Why not make kexec depend on !CONFIG_SMP || CONFIG_HOTPLUG_CPU instead?

From the POV of the PSCI code in the kernel, it's hotplugging a CPU. Why
it's performing the hotplug operation shouldn't matter.

>
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/kernel/psci.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kernel/psci.c b/arch/arm64/kernel/psci.c
> index ea4828a..0e5fa69 100644
> --- a/arch/arm64/kernel/psci.c
> +++ b/arch/arm64/kernel/psci.c
> @@ -253,7 +253,7 @@ static int cpu_psci_cpu_boot(unsigned int cpu)
>  	return err;
>  }
>
> -#ifdef CONFIG_HOTPLUG_CPU
> +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_KEXEC)
>  static int cpu_psci_cpu_disable(unsigned int cpu)
>  {
>  	/* Fail early if we don't have CPU_OFF support */
> @@ -284,7 +284,7 @@ const struct cpu_operations cpu_psci_ops = {
>  	.cpu_init = cpu_psci_cpu_init,
>  	.cpu_prepare = cpu_psci_cpu_prepare,
>  	.cpu_boot = cpu_psci_cpu_boot,
> -#ifdef CONFIG_HOTPLUG_CPU
> +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_KEXEC)
>  	.cpu_disable = cpu_psci_cpu_disable,
>  	.cpu_die = cpu_psci_cpu_die,
>  #endif

Doesn't this cause the build to fail when KEXEC && !HOTPLUG_CPU? I didn't
see cpu_ops.h updated similarly.

Thanks,
Mark.
* [PATCH 6/8] arm64/kexec: kexec needs cpu_die
2014-05-09 8:24 ` Mark Rutland
@ 2014-05-13 22:27 ` Geoff Levand
0 siblings, 0 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-13 22:27 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, 2014-05-09 at 09:24 +0100, Mark Rutland wrote:
> Hi Geoff,
>
> On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote:
> > Kexec uses the cpu_die method of struct cpu_operations, so add
> > defined(CONFIG_KEXEC) to the preprocessor conditional that enables cpu_die.
>
> Why not make kexec depend on !CONFIG_SMP || CONFIG_HOTPLUG_CPU instead?
>
> From the POV of the PSCI code in the kernel, it's hotplugging a CPU. Why
> it's performing the hotplug operation shouldn't matter.

Sure.

> > @@ -284,7 +284,7 @@ const struct cpu_operations cpu_psci_ops = {
> >  	.cpu_init = cpu_psci_cpu_init,
> >  	.cpu_prepare = cpu_psci_cpu_prepare,
> >  	.cpu_boot = cpu_psci_cpu_boot,
> > -#ifdef CONFIG_HOTPLUG_CPU
> > +#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_KEXEC)
> >  	.cpu_disable = cpu_psci_cpu_disable,
> >  	.cpu_die = cpu_psci_cpu_die,
> >  #endif
>
> Doesn't this cause the build to fail when KEXEC && !HOTPLUG_CPU? I didn't
> see cpu_ops.h updated similarly.

Sorry, that part got lost when rebasing patches. I added it back in on my
for-arm-kexec-2 and master branches.

-Geoff
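The build problem Mark points out comes from guarding the initializer with a
wider condition than the structure declaration. A minimal sketch of keeping
the two in step follows. It assumes hypothetical DEMO_* macros in place of
CONFIG_HOTPLUG_CPU and CONFIG_KEXEC and a cut-down structure, and it is not
the fix that was applied to the series. Routing both places through one
derived macro means a field can never be initialized without being declared.

```c
#include <assert.h>

/* Hypothetical stand-ins for the Kconfig symbols. */
#define DEMO_KEXEC 1
/* DEMO_HOTPLUG_CPU is deliberately left undefined. */

/* One shared condition for both the declaration and the initializer. */
#if defined(DEMO_HOTPLUG_CPU) || defined(DEMO_KEXEC)
#define DEMO_HAVE_CPU_DIE 1
#endif

struct demo_cpu_operations {
	const char *name;
#ifdef DEMO_HAVE_CPU_DIE
	int (*cpu_disable)(unsigned int cpu);
	void (*cpu_die)(unsigned int cpu);
#endif
};

#ifdef DEMO_HAVE_CPU_DIE
static int demo_die_count;	/* how often the die hook ran */

static int demo_psci_cpu_disable(unsigned int cpu)
{
	(void)cpu;
	return 0;
}

static void demo_psci_cpu_die(unsigned int cpu)
{
	(void)cpu;
	demo_die_count++;
}
#endif

static const struct demo_cpu_operations demo_psci_ops = {
	.name = "psci",
#ifdef DEMO_HAVE_CPU_DIE
	.cpu_disable = demo_psci_cpu_disable,
	.cpu_die = demo_psci_cpu_die,
#endif
};
```

Mark's alternative, making kexec depend on !SMP || HOTPLUG_CPU in Kconfig,
avoids the extra condition altogether because the existing hotplug guard then
always covers the kexec case.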
* [PATCH 7/8] arm64/kexec: Add core kexec support
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
` (2 preceding siblings ...)
2014-05-09 0:48 ` [PATCH 6/8] arm64/kexec: kexec needs cpu_die Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 15:36 ` Mark Rutland
` (2 more replies)
2014-05-09 0:48 ` [PATCH 3/8] arm64: Add spin-table cpu_die Geoff Levand
` (4 subsequent siblings)
8 siblings, 3 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S, to the
arm64 architecture that add support for the kexec re-boot mechanism on arm64
(CONFIG_KEXEC).

This implementation supports re-boots of kernels with either PSCI or spin-table
enable methods, but with some limitations on the match of 1st and 2nd stage
kernels. The known limitations are checked in the kexec_compat_check() routine,
which is called during a kexec_load syscall. If any limitations are reached an
error is returned by the kexec_load syscall.

Many of the limitations can be removed with some enhancement to the CPU
shutdown management code in machine_kexec.c.

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 MAINTAINERS | 9 +
 arch/arm64/Kconfig | 8 +
 arch/arm64/include/asm/kexec.h | 44 +++
 arch/arm64/kernel/Makefile | 1 +
 arch/arm64/kernel/machine_kexec.c | 623 ++++++++++++++++++++++++++++++++++++
 arch/arm64/kernel/relocate_kernel.S | 239 ++++++++++++++
 include/uapi/linux/kexec.h | 1 +
 7 files changed, 925 insertions(+)
 create mode 100644 arch/arm64/include/asm/kexec.h
 create mode 100644 arch/arm64/kernel/machine_kexec.c
 create mode 100644 arch/arm64/kernel/relocate_kernel.S

diff --git a/MAINTAINERS b/MAINTAINERS
index 1066264..bb666bb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5144,6 +5144,15 @@ F:	include/linux/kexec.h
 F:	include/uapi/linux/kexec.h
 F:	kernel/kexec.c
 
+KEXEC FOR ARM64
+M:	Geoff Levand <geoff@infradead.org>
+W:	http://kernel.org/pub/linux/utils/kernel/kexec/
+L:	kexec@lists.infradead.org
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+F:	arch/arm64/kernel/machine_kexec.c
+F:	arch/arm64/kernel/relocate_kernel.S
+
 KEYS/KEYRINGS:
 M:	David Howells <dhowells@redhat.com>
 L:	keyrings@linux-nfs.org
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e759af5..dcd5ebc 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -244,6 +244,14 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 
 source "mm/Kconfig"
 
+config KEXEC
+	bool "kexec system call"
+	---help---
+	  kexec is a system call that implements the ability to shutdown your
+	  current kernel, and to start another kernel. It is like a reboot
+	  but it is independent of the system firmware. And like a reboot
+	  you can start any kernel with it, not just Linux.
+
 config XEN_DOM0
 	def_bool y
 	depends on XEN
diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
new file mode 100644
index 0000000..41a6244
--- /dev/null
+++ b/arch/arm64/include/asm/kexec.h
@@ -0,0 +1,44 @@
+#ifndef _ARM64_KEXEC_H
+#define _ARM64_KEXEC_H
+
+#if defined(CONFIG_KEXEC)
+
+/* Maximum physical address we can use pages from */
+
+#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
+
+/* Maximum address we can reach in physical address mode */
+
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+/* Maximum address we can use for the control code buffer */
+
+#define KEXEC_CONTROL_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_CONTROL_PAGE_SIZE 4096
+
+#define KEXEC_ARCH KEXEC_ARCH_ARM64
+
+#if !defined(__ASSEMBLY__)
+
+/**
+ * crash_setup_regs() - save registers for the panic kernel
+ *
+ * @newregs: registers are saved here
+ * @oldregs: registers to be saved (may be %NULL)
+ */
+
+static inline void crash_setup_regs(struct pt_regs *newregs,
+				    struct pt_regs *oldregs)
+{
+}
+
+/* Function pointer to optional machine-specific reinitialization */
+
+extern void (*kexec_reinit)(void);
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* CONFIG_KEXEC */
+
+#endif /* _ARM64_KEXEC_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 7d811d9..7272510 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -22,6 +22,7 @@ arm64-obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 arm64-obj-$(CONFIG_ARM64_CPU_SUSPEND) += sleep.o suspend.o
 arm64-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 arm64-obj-$(CONFIG_KGDB) += kgdb.o
+arm64-obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
 
 obj-y += $(arm64-obj-y) vdso/
 obj-m += $(arm64-obj-m)
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
new file mode 100644
index 0000000..62779e5
--- /dev/null
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -0,0 +1,623 @@
+/*
+ * kexec for arm64
+ */
+
+#include <linux/delay.h>
+#include <linux/irq.h>
+#include <linux/kexec.h>
+#include <linux/mm.h>
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/cacheflush.h>
+#include <asm/cpu_ops.h>
+#include <asm/system_misc.h>
+
+/* Global variables for the relocate_kernel routine. */
+
+extern const unsigned char relocate_new_kernel[];
+extern const unsigned long relocate_new_kernel_size;
+extern unsigned long kexec_signal_addr;
+extern unsigned long kexec_kimage_head;
+extern unsigned long kexec_dtb_addr;
+extern unsigned long kexec_kimage_start;
+extern unsigned long kexec_spinner_count;
+
+/* Global variables for the kexec_cpu_spin routine. */
+
+extern const unsigned char kexec_cpu_spin[];
+extern const unsigned long kexec_cpu_spin_size;
+
+void (*kexec_reinit)(void);
+
+/**
+ * struct kexec_cpu_info_spin - Info needed for the "spin table" enable method.
+ */
+
+struct kexec_cpu_info_spin {
+	phys_addr_t phy_release_addr; /* cpu order */
+};
+
+/**
+ * struct kexec_cpu_info - Info for a specific cpu in the device tree.
+ */
+
+struct kexec_cpu_info {
+	unsigned int cpu;
+	const struct cpu_operations *cpu_ops;
+	bool spinner;
+	struct kexec_cpu_info_spin spin;
+};
+
+/**
+ * struct kexec_dt_info - Device tree info needed by the local kexec routines.
+ */
+
+struct kexec_dt_info {
+	unsigned int cpu_count;
+	struct kexec_cpu_info *cpu_info;
+	unsigned int spinner_count;
+	phys_addr_t phy_memreserve_addr;
+	unsigned int memreserve_size;
+};
+
+/**
+ * struct kexec_ctx - Kexec runtime context.
+ *
+ * @dt_1st: Device tree info for the 1st stage kernel.
+ * @dt_2nd: Device tree info for the 2nd stage kernel.
+ */
+
+struct kexec_ctx {
+	struct kexec_dt_info dt_1st;
+	struct kexec_dt_info dt_2nd;
+};
+
+static struct kexec_ctx *ctx;
+
+/**
+ * kexec_spin_code_offset - Offset into memreserve area of the spin code.
+ */
+
+static const unsigned int kexec_spin_code_offset = PAGE_SIZE;
+
+/**
+ * kexec_is_dtb - Helper routine to check the device tree header signature.
+ */
+
+static int kexec_is_dtb(__be32 magic)
+{
+	int result = be32_to_cpu(magic) == OF_DT_HEADER;
+
+	return result;
+}
+
+/**
+ * kexec_is_dtb_user - Helper routine to check the device tree header signature.
+ */
+
+static int kexec_is_dtb_user(const void *dtb)
+{
+	__be32 magic;
+
+	get_user(magic, (__be32 *)dtb);
+
+	return kexec_is_dtb(magic);
+}
+
+/**
+ * kexec_find_dtb_seg - Helper routine to find the dtb segment.
+ */
+
+static const struct kexec_segment *kexec_find_dtb_seg(
+	const struct kimage *image)
+{
+	int i;
+
+	for (i = 0; i < image->nr_segments; i++) {
+		if (kexec_is_dtb_user(image->segment[i].buf))
+			return &image->segment[i];
+	}
+
+	return NULL;
+}
+
+/**
+ * kexec_copy_dtb - Helper routine to copy dtb from user space.
+ */
+
+static void *kexec_copy_dtb(const struct kexec_segment *seg)
+{
+	int result;
+	void *dtb;
+
+	BUG_ON(!seg || !seg->bufsz);
+
+	dtb = kmalloc(seg->bufsz, GFP_KERNEL);
+
+	if (!dtb) {
+		pr_debug("%s: out of memory", __func__);
+		return NULL;
+	}
+
+	result = copy_from_user(dtb, seg->buf, seg->bufsz);
+
+	if (result) {
+		kfree(dtb);
+		return NULL;
+	}
+
+	return dtb;
+}
+
+
+/**
+ * kexec_read_memreserve - Initialize memreserve info from a dtb.
+ */
+
+static int kexec_read_memreserve(const void *dtb, struct kexec_dt_info *info)
+{
+	const struct boot_param_header *h = dtb;
+	struct pair {
+		__be64 phy_addr;
+		__be64 size;
+	} const *pair;
+
+	pair = dtb + be32_to_cpu(h->off_mem_rsvmap);
+
+	if ((pair + 1)->size)
+		pr_warn("kexec: Multiple memreserve regions found.");
+
+	info->phy_memreserve_addr = be64_to_cpu(pair->phy_addr);
+	info->memreserve_size = be64_to_cpu(pair->size);
+
+	pr_debug("%s:%d: memreserve_addr: %pa (%p)\n", __func__, __LINE__,
+		&info->phy_memreserve_addr,
+		phys_to_virt(info->phy_memreserve_addr));
+	pr_debug("%s:%d: memreserve_size: %u (%xh)\n", __func__, __LINE__,
+		info->memreserve_size, info->memreserve_size);
+
+	return 0;
+}
+
+/**
+ * kexec_setup_cpu_spin - Initialize cpu spin info from a device tree cpu node.
+ */
+
+static int kexec_setup_cpu_spin(const struct device_node *dn,
+	struct kexec_cpu_info_spin *info)
+{
+	int result;
+	u64 t1;
+
+	memset(info, 0, sizeof(*info));
+
+	result = of_property_read_u64(dn, "cpu-release-addr", &t1);
+
+	if (result) {
+		pr_warn("kexec: Read cpu-release-addr failed.\n");
+		return result;
+	}
+
+	info->phy_release_addr = le64_to_cpu(t1);
+
+	return 0;
+}
+
+/**
+ * kexec_cpu_info_init - Initialize an array of kexec_cpu_info structures.
+ *
+ * Allocates a cpu info array and fills it with info for all cpus found in
+ * the device tree passed. The cpu info array is zero terminated.
+ */
+
+int kexec_cpu_info_init(const struct device_node *dn,
+	struct kexec_dt_info *info)
+{
+	int result;
+	unsigned int cpu;
+	const struct device_node *i;
+
+	info->cpu_info = kmalloc(
+		(1 + info->cpu_count) * sizeof(struct kexec_cpu_info),
+		GFP_KERNEL);
+
+	if (!info->cpu_info) {
+		pr_debug("%s: out of memory", __func__);
+		return -ENOMEM;
+	}
+
+	info->spinner_count = 0;
+
+	for (cpu = 0, i = dn; cpu < info->cpu_count; cpu++) {
+		struct kexec_cpu_info *cpu_info = &info->cpu_info[cpu];
+
+		i = of_find_node_by_type((struct device_node *)i, "cpu");
+
+		BUG_ON(!i);
+
+		cpu_info->cpu = cpu;
+
+		result = cpu_read_ops((struct device_node *)i, cpu,
+			&cpu_info->cpu_ops);
+
+		if (result)
+			goto on_error;
+
+		cpu_info->spinner = !strcmp(cpu_info->cpu_ops->name,
+			"spin-table");
+
+		if (cpu_info->spinner) {
+			info->spinner_count++;
+
+			result = kexec_setup_cpu_spin(i, &cpu_info->spin);
+
+			if (result)
+				goto on_error;
+		}
+
+		if (cpu_info->spinner)
+			pr_devel("%s:%d: cpu-%u: '%s' release_addr: %pa\n",
+				__func__, __LINE__, cpu,
+				cpu_info->cpu_ops->name,
+				&cpu_info->spin.phy_release_addr);
+		else
+			pr_devel("%s:%d: cpu-%u: '%s'\n", __func__, __LINE__,
+				cpu, cpu_info->cpu_ops->name);
+	}
+
+	return 0;
+
+on_error:
+	kfree(info->cpu_info);
+	info->cpu_info = NULL;
+
+	return result;
+}
+
+/**
+ * kexec_dt_info_init - Initialize a kexec_dt_info structure from a dtb.
+ */
+
+int kexec_dt_info_init(void *dtb, struct kexec_dt_info *info)
+{
+	int result;
+	struct device_node *i;
+	struct device_node *dn;
+
+	if (!dtb) {
+		/* 1st stage. */
+		dn = NULL;
+	} else {
+		/* 2nd stage. */
+
+		of_fdt_unflatten_tree(dtb, &dn);
+
+		result = kexec_read_memreserve(dtb, info);
+
+		if (result)
+			return result;
+	}
+
+	/*
+	 * We may need to work with offline cpus to get them into the correct
+	 * state for a given enable method to work, and so need an info_array
+	 * that has info about all the platform cpus.
+	 */
+
+	for (info->cpu_count = 0, i = dn; (i = of_find_node_by_type(i, "cpu"));
+		info->cpu_count++)
+		(void)0;
+
+	pr_devel("%s:%d: cpu_count: %u\n", __func__, __LINE__, info->cpu_count);
+
+	if (!info->cpu_count) {
+		pr_err("kexec: Error: No cpu nodes found in device tree.\n");
+		return -EINVAL;
+	}
+
+	result = kexec_cpu_info_init(dn, info);
+
+	return result;
+}
+
+/**
+* kexec_spin_2 - The identity map spin loop.
+*/
+
+void kexec_spin_2(unsigned int cpu, phys_addr_t signal_1,
+	phys_addr_t phy_release_addr, phys_addr_t signal_2)
+{
+	typedef void (*fn_t)(phys_addr_t, phys_addr_t);
+
+	fn_t spin_3;
+
+	atomic_dec((atomic_t *)signal_1);
+
+	/* Wait for next signal. */
+
+	while (!atomic_read((atomic_t *)signal_2))
+		(void)0;
+
+	/* Enter the memreserve spin code. */
+
+	spin_3 = (fn_t)(ctx->dt_2nd.phy_memreserve_addr
+		+ kexec_spin_code_offset);
+
+	spin_3(phy_release_addr, signal_2);
+
+	BUG();
+}
+
+static atomic_t spin_1_signal = ATOMIC_INIT(0);
+static atomic_t spin_2_signal = ATOMIC_INIT(0);
+
+/**
+* kexec_spin_1 - The virtual address spin loop.
+*/
+
+static void kexec_spin_1(unsigned int cpu)
+{
+	typedef void (*fn_t)(unsigned int, phys_addr_t, phys_addr_t,
+		phys_addr_t);
+	fn_t fn;
+
+	pr_devel("%s:%d: id: %u\n", __func__, __LINE__, smp_processor_id());
+
+	/* Wait for the signal. */
+
+	while (!atomic_read(&spin_1_signal))
+		(void)0;
+
+	/* Enter the identity mapped spin code. */
+
+	setup_mm_for_reboot();
+
+	fn = (fn_t)virt_to_phys(kexec_spin_2);
+
+	fn(cpu, virt_to_phys(&spin_1_signal),
+		ctx->dt_2nd.cpu_info[cpu].spin.phy_release_addr,
+		virt_to_phys(&spin_2_signal));
+
+	BUG();
+}
+
+/**
+* kexec_restart - Called after the identity mapping is enabled.
+*/
+
+static void kexec_restart(void)
+{
+	unsigned long timeout = 1000;
+
+	atomic_set(&spin_1_signal, ctx->dt_1st.spinner_count - 1);
+
+	__flush_dcache_area(&spin_1_signal, sizeof(spin_1_signal));
+
+	while (timeout-- && atomic_read(&spin_1_signal))
+		udelay(10);
+}
+
+/**
+* kexec_compat_check - Helper to check compatibility of 2nd stage kernel.
+*/
+
+static int kexec_compat_check(const struct kexec_dt_info *dt1,
+	const struct kexec_dt_info *dt2)
+{
+	int result = 0;
+
+	/* No checks needed for psci to psci. */
+
+	if (!dt1->spinner_count && !dt2->spinner_count)
+		goto done;
+
+	/* Check for a cpu count mismatch. */
+
+	if (dt1->cpu_count != dt2->cpu_count) {
+		pr_err("kexec: Error: CPU count mismatch %u -> %u.\n",
+			dt1->cpu_count, dt2->cpu_count);
+		result++;
+	}
+
+	/* Check for an enable method mismatch. */
+
+	if (dt1->spinner_count != dt2->spinner_count) {
+		pr_err("kexec: Error: Enable method mismatch %s -> %s.\n",
+			dt1->cpu_info[0].cpu_ops->name,
+			dt2->cpu_info[0].cpu_ops->name);
+		result++;
+	}
+
+	/* Check for mixed enable methods. */
+
+	if (dt1->spinner_count && (dt1->cpu_count != dt1->spinner_count)) {
+		pr_err("kexec: Error: Mixed enable methods in 1st stage.\n");
+		result++;
+	}
+
+	if (dt2->spinner_count && (dt2->cpu_count != dt2->spinner_count)) {
+		pr_err("kexec: Error: Mixed enable methods in 2nd stage.\n");
+		result++;
+	}
+
+	/* Check for cpus still spinning in secondary_holding_pen. */
+
+	if (NR_CPUS < dt1->spinner_count) {
+		pr_err("kexec: Error: NR_CPUS too small for spin enable %u < %u.\n",
+			NR_CPUS, dt1->spinner_count + 1);
+		result++;
+	}
+
+done:
+	return result ? -EINVAL : 0;
+}
+
+void machine_kexec_cleanup(struct kimage *image)
+{
+	if (ctx) {
+		kfree(ctx->dt_1st.cpu_info);
+		ctx->dt_1st.cpu_info = NULL;
+
+		kfree(ctx->dt_2nd.cpu_info);
+		ctx->dt_2nd.cpu_info = NULL;
+	}
+
+	kfree(ctx);
+	ctx = NULL;
+}
+
+void machine_crash_shutdown(struct pt_regs *regs)
+{
+}
+
+/**
+ * machine_kexec_prepare - Prepare for a kexec reboot.
+ *
+ * Called from the core kexec code when a kernel image is loaded.
+ */
+
+int machine_kexec_prepare(struct kimage *image)
+{
+	int result;
+	const struct kexec_segment *seg;
+	void *dtb;
+
+	machine_kexec_cleanup(NULL);
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+
+	if (!ctx) {
+		pr_debug("%s: out of memory", __func__);
+		return -ENOMEM;
+	}
+
+	seg = kexec_find_dtb_seg(image);
+	BUG_ON(!seg);
+
+	dtb = kexec_copy_dtb(seg);
+	BUG_ON(!dtb);
+	BUG_ON(!kexec_is_dtb(*(const __be32 *)dtb));
+
+	result = kexec_dt_info_init(NULL, &ctx->dt_1st);
+
+	if (result)
+		goto on_error;
+
+	result = kexec_dt_info_init(dtb, &ctx->dt_2nd);
+
+	if (result)
+		goto on_error;
+
+	if (ctx->dt_2nd.spinner_count) {
+		BUG_ON(!ctx->dt_2nd.phy_memreserve_addr);
+		BUG_ON(kexec_cpu_spin_size >= ctx->dt_2nd.memreserve_size
+			- kexec_spin_code_offset);
+	}
+
+	result = kexec_compat_check(&ctx->dt_1st, &ctx->dt_2nd);
+
+	if (result)
+		goto on_error;
+
+	kexec_dtb_addr = seg->mem;
+	kexec_kimage_start = image->start;
+	kexec_spinner_count = ctx->dt_1st.spinner_count - 1;
+
+	smp_spin_table_set_die(kexec_spin_1);
+
+	goto on_exit;
+
+on_error:
+	machine_kexec_cleanup(NULL);
+on_exit:
+	kfree(dtb);
+
+	return result;
+}
+
+/**
+ * machine_kexec - Do the kexec reboot.
+ *
+ * Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
+ */
+
+void machine_kexec(struct kimage *image)
+{
+	phys_addr_t reboot_code_buffer_phys;
+	void *reboot_code_buffer;
+	unsigned int cpu;
+
+	BUG_ON(relocate_new_kernel_size > KEXEC_CONTROL_PAGE_SIZE);
+	BUG_ON(num_online_cpus() > 1);
+
+	pr_devel("%s:%d: id: %u\n", __func__, __LINE__, smp_processor_id());
+
+	kexec_kimage_head = image->head;
+	kexec_signal_addr = virt_to_phys(&spin_2_signal);
+
+	reboot_code_buffer_phys = page_to_phys(image->control_code_page);
+	reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
+
+	if (ctx->dt_2nd.spinner_count) {
+		void *va;
+
+		/*
+		 * Copy the spin code to the 2nd stage memreserve area as
+		 * dictated by the arm64 boot specification.
+		 */
+
+		va = phys_to_virt(ctx->dt_2nd.phy_memreserve_addr
+			+ kexec_spin_code_offset);
+
+		memcpy(va, kexec_cpu_spin, kexec_cpu_spin_size);
+
+		flush_icache_range((unsigned long)va,
+			(unsigned long)va + kexec_cpu_spin_size);
+
+		/*
+		 * Zero the release address for all the 2nd stage cpus.
+		 */
+
+		for (cpu = 0; cpu < ctx->dt_2nd.cpu_count; cpu++) {
+			u64 *release_addr;
+
+			if (!ctx->dt_2nd.cpu_info[cpu].spinner)
+				continue;
+
+			release_addr = phys_to_virt(
+				ctx->dt_2nd.cpu_info[cpu].spin.phy_release_addr);
+
+			*release_addr = 0;
+
+			__flush_dcache_area(release_addr, sizeof(u64));
+		}
+	}
+
+	/*
+	 * Copy relocate_new_kernel to the reboot_code_buffer for use
+	 * after the kernel is shut down.
+	 */
+
+	memcpy(reboot_code_buffer, relocate_new_kernel,
+		relocate_new_kernel_size);
+
+	flush_icache_range((unsigned long)reboot_code_buffer,
+		(unsigned long)reboot_code_buffer + KEXEC_CONTROL_PAGE_SIZE);
+
+	/* TODO: Adjust any mismatch in cpu enable methods. */
+
+	pr_info("Bye!\n");
+
+	if (kexec_reinit)
+		kexec_reinit();
+
+	local_irq_disable();
+	local_fiq_disable();
+
+	setup_restart();
+	kexec_restart();
+	soft_restart(reboot_code_buffer_phys);
+}
diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
new file mode 100644
index 0000000..15a49d6
--- /dev/null
+++ b/arch/arm64/kernel/relocate_kernel.S
@@ -0,0 +1,239 @@
+/*
+ * kexec for arm64
+ */
+
+#include <asm/memory.h>
+#include <asm/page.h>
+
+/*
+ * kexec_cpu_spin - Spin the CPU as described in the arm64/booting.txt document.
+ *
+ * Prototype: void kexec_cpu_spin(phys_addr_t release_addr, phys_addr_t signal);
+ *
+ * The caller must initialize release_addr to zero or a valid address
+ * prior to calling kexec_cpu_spin. Note that if the MMU will be turned on
+ * or off while the CPU is spinning here this code must be in an identity
+ * mapped page. The value written to release_addr must be in little endian
+ * order.
+ */
+
+.align 3
+
+.globl kexec_cpu_spin
+kexec_cpu_spin:
+
+	/* Signal that this cpu has entered. */
+1:
+	ldxr	x2, [x1]
+	sub	x2, x2, 1
+	stxr	w3, x2, [x1]
+	cbnz	w3, 1b
+
+
+	/* Spin while release_addr is zero. */
+1:
+	wfe
+	ldr	x4, [x0]
+	cbz	x4, 1b
+
+	/* Convert LE to CPU. */
+
+#if defined(__AARCH64EB__)
+	rev	x4, x4
+#endif
+
+	/* Jump to new kernel. */
+
+	mov	x0, xzr
+	mov	x1, xzr
+	mov	x2, xzr
+	mov	x3, xzr
+
+	br	x4
+
+.align 3
+
+.kexec_cpu_spin_end:
+
+/*
+ * kexec_cpu_spin_size - Byte count for copy operations.
+ */
+
+.globl kexec_cpu_spin_size
+kexec_cpu_spin_size:
+	.quad .kexec_cpu_spin_end - kexec_cpu_spin
+
+
+/*
+ * relocate_new_kernel - Put the 2nd stage kernel image in place and boot it.
+ *
+ * The memory that the old kernel occupies may be overwritten when copying the
+ * new kernel to its final location. To assure that the relocate_new_kernel
+ * routine which does that copy is not overwritten, all code and data needed
+ * by relocate_new_kernel must be between the symbols relocate_new_kernel and
+ * relocate_new_kernel_end. The machine_kexec() routine will copy
+ * relocate_new_kernel to the kexec control_code_page, a special page which
+ * has been set up to be preserved during the kernel copy operation.
+ */
+
+/* These definitions correspond to the kimage_entry flags in linux/kexec.h */
+
+#define IND_DESTINATION_BIT 0
+#define IND_INDIRECTION_BIT 1
+#define IND_DONE_BIT 2
+#define IND_SOURCE_BIT 3
+
+.align 3
+
+.globl relocate_new_kernel
+relocate_new_kernel:
+
+	/* Signal secondary cpus to enter the memreserve spin code. */
+
+	ldr	x1, kexec_signal_addr
+	ldr	x2, kexec_spinner_count
+	str	x2, [x1]
+
+	/* Wait for all secondary cpus to enter. */
+1:
+	ldr	x2, [x1]
+	cbnz	x2, 1b
+
+	/* Copy the new kernel image. */
+
+	ldr	x10, kexec_kimage_head /* x10 = entry */
+
+	/* Check if the new kernel needs relocation. */
+
+	cbz	x10, .done
+	tbnz	x10, IND_DONE_BIT, .done
+
+	/* Setup loop variables. */
+
+	mov	x12, xzr /* x12 = ptr */
+	mov	x13, xzr /* x13 = dest */
+
+.loop:
+	/* addr = entry & PAGE_MASK */
+
+	and	x14, x10, PAGE_MASK /* x14 = addr */
+
+	/* switch (entry & IND_FLAGS) */
+
+.case_source:
+	tbz	x10, IND_SOURCE_BIT, .case_indirection
+
+	/* copy_page(x20 = dest, x21 = addr) */
+
+	mov	x20, x13
+	mov	x21, x14
+
+	prfm	pldl1strm, [x21, #64]
+1:	ldp	x22, x23, [x21]
+	ldp	x24, x25, [x21, #16]
+	ldp	x26, x27, [x21, #32]
+	ldp	x28, x29, [x21, #48]
+	add	x21, x21, #64
+	prfm	pldl1strm, [x21, #64]
+	stnp	x22, x23, [x20]
+	stnp	x24, x25, [x20, #16]
+	stnp	x26, x27, [x20, #32]
+	stnp	x28, x29, [x20, #48]
+	add	x20, x20, #64
+	tst	x21, #(PAGE_SIZE - 1)
+	b.ne	1b
+
+	/* dest += PAGE_SIZE */
+
+	add	x13, x13, PAGE_SIZE
+	b	.next_entry
+
+.case_indirection:
+	tbz	x10, IND_INDIRECTION_BIT, .case_destination
+
+	/* ptr = addr */
+
+	mov	x12, x14
+	b	.next_entry
+
+.case_destination:
+	tbz	x10, IND_DESTINATION_BIT, .next_entry
+
+	/* dest = addr */
+
+	mov	x13, x14
+
+.next_entry:
+	/* entry = *ptr++ */
+
+	ldr	x10, [x12]
+	add	x12, x12, 8
+
+	/* while (!(entry & IND_DONE)) */
+
+	tbz	x10, IND_DONE_BIT, .loop
+
+.done:
+	/* Jump to new kernel. */
+
+	ldr	x0, kexec_dtb_addr
+	mov	x1, xzr
+	mov	x2, xzr
+	mov	x3, xzr
+
+	ldr	x4, kexec_kimage_start
+	br	x4
+
+.align 3
+
+/* The machine_kexec routines set these variables. */
+
+/*
+ * kexec_signal_addr - Physical address of the spin signal variable.
+ */
+
+.globl kexec_signal_addr
+kexec_signal_addr:
+	.quad 0x0
+
+/*
+ * kexec_spinner_count - Count of spinning cpus.
+ */
+
+.globl kexec_spinner_count
+kexec_spinner_count:
+	.quad 0x0
+
+/*
+ * kexec_dtb_addr - The address of the new kernel's device tree.
+ */
+
+.globl kexec_dtb_addr
+kexec_dtb_addr:
+	.quad 0x0
+
+/*
+ * kexec_kimage_head - Copy of image->head, the list of kimage entries.
+ */
+
+.globl kexec_kimage_head
+kexec_kimage_head:
+	.quad 0x0
+
+/*
+ * kexec_kimage_start - Copy of image->start, the entry point of the new kernel.
+ */ + +.globl kexec_kimage_start +kexec_kimage_start: + .quad 0x0 + +.relocate_new_kernel_end: + +/* + * relocate_new_kernel_size - Byte count to copy to kimage control_code_page. + */ + +.globl relocate_new_kernel_size +relocate_new_kernel_size: + .quad .relocate_new_kernel_end - relocate_new_kernel diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h index d6629d4..b0bc56d 100644 --- a/include/uapi/linux/kexec.h +++ b/include/uapi/linux/kexec.h @@ -28,6 +28,7 @@ #define KEXEC_ARCH_SH (42 << 16) #define KEXEC_ARCH_MIPS_LE (10 << 16) #define KEXEC_ARCH_MIPS ( 8 << 16) +#define KEXEC_ARCH_ARM64 (183 << 16) /* The artificial cap on the number of segments passed to kexec_load. */ #define KEXEC_SEGMENT_MAX 16 -- 1.9.1 ^ permalink raw reply related [flat|nested] 23+ messages in thread
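For readers following the assembly in relocate_new_kernel, the entry-list walk can be modelled in plain C. What follows is a minimal userspace sketch, not kernel code: the name sketch_relocate, the SKETCH_PAGE_SIZE constant and the use of ordinary pointers in place of physical addresses are assumptions made for illustration. Like the assembly, it assumes the head entry is an indirection entry, since ptr starts out as zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PAGE_MASK (~(uintptr_t)(SKETCH_PAGE_SIZE - 1))

/* Bit numbers as defined in the patch. */
#define IND_DESTINATION_BIT 0
#define IND_INDIRECTION_BIT 1
#define IND_DONE_BIT        2
#define IND_SOURCE_BIT      3

#define IND_DESTINATION (1u << IND_DESTINATION_BIT)
#define IND_INDIRECTION (1u << IND_INDIRECTION_BIT)
#define IND_DONE        (1u << IND_DONE_BIT)
#define IND_SOURCE      (1u << IND_SOURCE_BIT)

/*
 * Hypothetical C model of the relocate_new_kernel copy loop.
 * Returns the number of pages copied.
 */
static unsigned int sketch_relocate(uintptr_t head)
{
	uintptr_t entry = head;		/* x10 in the assembly */
	const uintptr_t *ptr = NULL;	/* x12 */
	unsigned char *dest = NULL;	/* x13 */
	unsigned int copied = 0;

	/* Nothing to relocate. */
	if (!entry || (entry & IND_DONE))
		return 0;

	do {
		uintptr_t addr = entry & SKETCH_PAGE_MASK;	/* x14 */

		if (entry & IND_SOURCE) {
			/* Copy one page and advance the destination. */
			memcpy(dest, (const void *)addr, SKETCH_PAGE_SIZE);
			dest += SKETCH_PAGE_SIZE;
			copied++;
		} else if (entry & IND_INDIRECTION) {
			/* Switch to a new page of entries. */
			ptr = (const uintptr_t *)addr;
		} else if (entry & IND_DESTINATION) {
			/* Set where the following source pages go. */
			dest = (unsigned char *)addr;
		}

		entry = *ptr++;
	} while (!(entry & IND_DONE));

	return copied;
}
```

A caller would pass an indirection entry that points at a page-aligned array of entries ending in IND_DONE, mirroring how image->head is handed to the assembly through kexec_kimage_head.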
* [PATCH 7/8] arm64/kexec: Add core kexec support 2014-05-09 0:48 ` [PATCH 7/8] arm64/kexec: Add core kexec support Geoff Levand @ 2014-05-09 15:36 ` Mark Rutland 2014-05-13 22:27 ` Geoff Levand 2014-05-14 10:54 ` Catalin Marinas 2014-07-07 7:33 ` Dave Young 2 siblings, 1 reply; 23+ messages in thread From: Mark Rutland @ 2014-05-09 15:36 UTC (permalink / raw) To: linux-arm-kernel Hi Geoff, On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote: > Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S, to the > arm64 architecture that add support for the kexec re-boot mechanism on arm64 > (CONFIG_KEXEC). > > This implementation supports re-boots of kernels with either PSCI or spin-table > enable methods, but with some limitations on the match of 1st and 2nd stage > kernels. The known limitations are checked in the kexec_compat_check() routine, > which is called during a kexec_load syscall. If any limitations are reached an > error is returned by the kexec_load syscall. Many of the limitations can be > removed with some enhancment to the CPU shutdown management code in > machine_kexec.c. I'm very much not happy with special-casing spin-table and working around the issues in the boot protocol. There's a lot of code, it has subtle bugs, and it makes it more difficult to support other boot mechanisms that might be required in future (it certainly sets a bad precedent w.r.t. separation of concerns). I think if we cannot offline a CPU through the generic hotplug infrastructure then we should fail to kexec. If we need a way of doing hotplug without PSCI then we should sort that out [1] rather than working around brokenness with more brokenness I also don't think that kexec should care about the precise hotplug mechanism so long as it works. 
If the DTB passed to the new kernel describes a different mechanism than is currently in use, that's the caller's choice and I don't see any point second-guessing that -- we're already allowing them to boot an arbitrary kernel and so long as we leave the system in a sane state I don't see a good reason to deny the user from immediately destroying that. [1] For hotplug without PSCI I think we can get away with adding an optional property to spin-table describing a (physical) address to branch to which returns CPUs to their original spin code. So long as the kernel zeroes the release address first we should be able to boot a CPU exactly as we managed to the first time. For spin-table we'll also need to jump back up to EL2 when EL2 is present. It should be possible to do that within the spin-table code if we update the initial EL2 vector and get KVM to tear itself down before cpu_die happens. [...] > +/** > + * struct kexec_cpu_info_spin - Info needed for the "spin table" enable method. > + */ > + > +struct kexec_cpu_info_spin { > + phys_addr_t phy_release_addr; /* cpu order */ I assume you mean this is in the endianness of the current kernel? I was initially confused by the comment, and I think it might be better to drop it -- unless told that a variable is in a specific endianness I would assume that it's the current native endianness anyway. As a general point, could you please use "phys" rather than "phy" in variable names? It'll make this more consistent with the rest of the arm64 code, easier to search for, and reads better IMO. [...] > +struct kexec_ctx { > + struct kexec_dt_info dt_1st; > + struct kexec_dt_info dt_2nd; > +}; > + > +static struct kexec_ctx *ctx; Is there any reason this should be dynamically allocated? Do we even need the current DTB if we rely on hotplug? > +static int kexec_is_dtb(__be32 magic) > +{ > + int result = be32_to_cpu(magic) == OF_DT_HEADER; > + > + return result; You can drop the temporary int here. 
> +/** > + * kexec_is_dtb_user - Helper routine to check the device tree header signature. > + */ > + > +static int kexec_is_dtb_user(const void *dtb) > +{ > + __be32 magic; > + > + get_user(magic, (__be32 *)dtb); Return value check? Unless we know this can't fail? If it can fail, surely we should return an appropriate error and forward it to userspace. EFAULT? > + > + return kexec_is_dtb(magic); And EINVAL if we can read this but it's not a DTB? [...] > +/** > + * kexec_read_memreserve - Initialize memreserve info from a dtb. > + */ > + > +static int kexec_read_memreserve(const void *dtb, struct kexec_dt_info *info) > +{ > + const struct boot_param_header *h = dtb; > + struct pair { > + __be64 phy_addr; > + __be64 size; > + } const *pair; > + > + pair = dtb + be32_to_cpu(h->off_mem_rsvmap); > + > + if ((pair + 1)->size) > + pr_warn("kexec: Multiple memreserve regions found."); Huh? Multiple arbitrary memreserves are entirely valid. Why should we warn in that case? > + > + info->phy_memreserve_addr = be64_to_cpu(pair->phy_addr); > + info->memreserve_size = be64_to_cpu(pair->size); So we're assuming that the memory described in an arbitrary memreserve entry (which is intended to describe memory which shouldn't be touched unless we know what we're doing) is for our arbitrary use!? NAK. We shouldn't need to special-case reserved memory handling if we rely on cpu hotplug. If we don't then the only functional option is to add a memreserve, but that will end up leaking a small amount of memory upon every kexec. I believe that the former is the only sane option. [...] 
> +static int kexec_setup_cpu_spin(const struct device_node *dn, > + struct kexec_cpu_info_spin *info) > +{ > + int result; > + u64 t1; > + > + memset(info, 0, sizeof(*info)); > + > + result = of_property_read_u64(dn, "cpu-release-addr", &t1); > + > + if (result) { > + pr_warn("kexec: Read cpu-release-addr failed.\n"); > + return result; > + } > + > + info->phy_release_addr = le64_to_cpu(t1); Why are we calling le64_to_cpu here? of_property_read_u64 reads a be64 value from dt into cpu endianness, so at the very least the annotation is the wrong way around. Have you tested this with a BE kernel? We should ensure that LE->LE, LE->BE, BE->BE, BE->LE all work. [...] > +int kexec_cpu_info_init(const struct device_node *dn, > + struct kexec_dt_info *info) > +{ > + int result; > + unsigned int cpu; > + const struct device_node *i; > + > + info->cpu_info = kmalloc( > + (1 + info->cpu_count) * sizeof(struct kexec_cpu_info), > + GFP_KERNEL); Why one more than the cpu count? I thought cpu_count covered all the CPUs in the dtb? [...] > +int kexec_dt_info_init(void *dtb, struct kexec_dt_info *info) > +{ > + int result; > + struct device_node *i; > + struct device_node *dn; > + > + if (!dtb) { > + /* 1st stage. */ > + dn = NULL; > + } else { > + /* 2nd stage. */ > + > + of_fdt_unflatten_tree(dtb, &dn); This may fail. We should check that dn is not NULL before we try to use it later -- many of_* functions will traverse the current kernel's boot DT if provided with a NULL root. > + > + result = kexec_read_memreserve(dtb, info); > + > + if (result) > + return result; > + } > + > + /* > + * We may need to work with offline cpus to get them into the correct > + * state for a given enable method to work, and so need an info_array > + * that has info about all the platform cpus. > + */ What exactly do we need to do to offline CPUs? 
> + > + for (info->cpu_count = 0, i = dn; (i = of_find_node_by_type(i, "cpu")); > + info->cpu_count++) > + (void)0; If dn is NULL here we'll read of_allnodes, which I don't think you intended. > +void kexec_spin_2(unsigned int cpu, phys_addr_t signal_1, > + phys_addr_t phy_release_addr, phys_addr_t signal_2) > +{ > + typedef void (*fn_t)(phys_addr_t, phys_addr_t); > + > + fn_t spin_3; > + > + atomic_dec((atomic_t *)signal_1); > + > + /* Wait for next signal. */ > + > + while (!atomic_read((atomic_t *)signal_2)) > + (void)0; Why not cpu_relax()? [...] > + /* Check for cpus still spinning in secondary_holding_pen. */ > + > + if (NR_CPUS < dt1->spinner_count) { > + pr_err("kexec: Error: NR_CPUS too small for spin enable %u < %u.\n", > + NR_CPUS, dt1->spinner_count + 1); > + result++; > + } In some cases people might describe fewer CPUs in the DTB than are actually present, which we should give up for also. I think we can alter secondary_holding_pen to be a bit more intelligent for that case and get any unexpected secondaries to write a flag to indicate their presence. We can then decide to reject kexec if that flag is set. > +void machine_crash_shutdown(struct pt_regs *regs) > +{ > +} Missing implementation? If it's fine for this to be empty it would be nice to have a comment to that effect. [...] > +int machine_kexec_prepare(struct kimage *image) > +{ > + int result; > + const struct kexec_segment *seg; > + void *dtb; > + > + machine_kexec_cleanup(NULL); > + > + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); > + > + if (!ctx) { > + pr_debug("%s: out of memory", __func__); > + return -ENOMEM; > + } > + > + seg = kexec_find_dtb_seg(image); > + BUG_ON(!seg); > + > + dtb = kexec_copy_dtb(seg); > + BUG_ON(!dtb); > + BUG_ON(!kexec_is_dtb(*(const __be32 *)dtb)); Why BUG_ON rather than report the failure and return an error? 
> + > + result = kexec_dt_info_init(NULL, &ctx->dt_1st); > + > + if (result) > + goto on_error; > + > + result = kexec_dt_info_init(dtb, &ctx->dt_2nd); > + > + if (result) > + goto on_error; > + > + if (ctx->dt_2nd.spinner_count) { > + BUG_ON(!ctx->dt_2nd.phy_memreserve_addr); > + BUG_ON(kexec_cpu_spin_size >= ctx->dt_2nd.memreserve_size > + - kexec_spin_code_offset); > + } > + > + result = kexec_compat_check(&ctx->dt_1st, &ctx->dt_2nd); > + > + if (result) > + goto on_error; > + > + kexec_dtb_addr = seg->mem; > + kexec_kimage_start = image->start; > + kexec_spinner_count = ctx->dt_1st.spinner_count - 1; > + > + smp_spin_table_set_die(kexec_spin_1); I very much dislike hooking into the spin-table code like this. [...] > + if (ctx->dt_2nd.spinner_count) { > + void *va; > + > + /* > + * Copy the spin code to the 2nd stage memreserve area as > + * dictated by the arm64 boot specification. > + */ The boot documentation says any spin code must be protected with a memreserve. This does not mean that the first memreserve is a special location that the spin table must be placed at. This comment seems to have the implication backwards. > + va = phys_to_virt(ctx->dt_2nd.phy_memreserve_addr > + + kexec_spin_code_offset); > + > + memcpy(va, kexec_cpu_spin, kexec_cpu_spin_size); > + > + flush_icache_range((unsigned long)va, > + (unsigned long)va + kexec_cpu_spin_size); > + > + /* > + * Zero the release address for all the 2nd stage cpus. > + */ > + > + for (cpu = 0; cpu < ctx->dt_2nd.cpu_count; cpu++) { > + u64 *release_addr; > + > + if (!ctx->dt_2nd.cpu_info[cpu].spinner) > + continue; > + > + release_addr = phys_to_virt( > + ctx->dt_2nd.cpu_info[cpu].spin.phy_release_addr); > + > + *release_addr = 0; > + > + __flush_dcache_area(release_addr, sizeof(u64)); > + } > + } > + > + /* > + * Copy relocate_new_kernel to the reboot_code_buffer for use > + * after the kernel is shut down. 
> + */ > + > + memcpy(reboot_code_buffer, relocate_new_kernel, > + relocate_new_kernel_size); > + > + flush_icache_range((unsigned long)reboot_code_buffer, > + (unsigned long)reboot_code_buffer + KEXEC_CONTROL_PAGE_SIZE); > + > + /* TODO: Adjust any mismatch in cpu enable methods. */ ??? > +/* > + * kexec_cpu_spin - Spin the CPU as described in the arm64/booting.txt document. > + * > + * Prototype: void kexec_cpu_spin(phys_addr_t release_addr, phys_addr_t signal); > + * > + * The caller must initialize release_addr to zero or a valid address > + * prior to calling kexec_cpu_spin. Note that if the MMU will be turned on > + * or off while the CPU is spinning here this code must be in an identity > + * mapped page. The value written to release_addr must be in little endian > + * order. The MMU _must_ be off upon entry to the kernel, as is explicitly stated in Documentation/arm64/booting.txt, and I don't see why the spinning code should have the MMU on. It seems like an endless source of subtle bugs. Perhaps I've missed it, but I can't see that the idmap page tables for this are protected with a memreserve. If they aren't, then the new kernel may clobber them and the secondaries might all start taking exceptions unexpectedly. If they are then I don't see how the new kernel identifies them as such so that we don't end up rendering a chunk of memory unusable on each kexec. > +/* > + * relocate_new_kernel - Put the 2nd stage kernel image in place and boot it. > + * > + * The memory that the old kernel occupies may be overwritten when coping the > + * new kernel to its final location. To assure that the relocate_new_kernel > + * routine which does that copy is not overwritten, all code and data needed > + * by relocate_new_kernel must be between the symbols relocate_new_kernel and > + * relocate_new_kernel_end. 
The machine_kexec() routine will copy > + * relocate_new_kernel to the kexec control_code_page, a special page which > + * has been set up to be preserved during the kernel copy operation. > + */ > + > +/* These definitions correspond to the kimage_entry flags in linux/kexec.h */ > + > +#define IND_DESTINATION_BIT 0 > +#define IND_INDIRECTION_BIT 1 > +#define IND_DONE_BIT 2 > +#define IND_SOURCE_BIT 3 These should live in linux/kexec.h -- the existing macros can be generated from these and we should be able to protect everything with #ifndef __ASSEMBLY__ Cheers, Mark. ^ permalink raw reply [flat|nested] 23+ messages in thread
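The last suggestion above, deriving the existing flag macros from the bit numbers so that C and assembly share one definition, could look roughly like this. It is a sketch only: the layout is an assumption about how linux/kexec.h might be arranged, IND_FLAGS and ind_entry_flags() are hypothetical names used for illustration, and the resulting values (0x1, 0x2, 0x4, 0x8) are those of the existing IND_* macros.

```c
/* Bit numbers: usable from both C and assembly (e.g. with tbz/tbnz). */
#define IND_DESTINATION_BIT	0
#define IND_INDIRECTION_BIT	1
#define IND_DONE_BIT		2
#define IND_SOURCE_BIT		3

/* Flag masks generated from the bit numbers. */
#define IND_DESTINATION		(1 << IND_DESTINATION_BIT)
#define IND_INDIRECTION		(1 << IND_INDIRECTION_BIT)
#define IND_DONE		(1 << IND_DONE_BIT)
#define IND_SOURCE		(1 << IND_SOURCE_BIT)

/* Hypothetical convenience mask covering all four flags. */
#define IND_FLAGS \
	(IND_DESTINATION | IND_INDIRECTION | IND_DONE | IND_SOURCE)

#ifndef __ASSEMBLY__
/* C-only declarations go here; this helper is illustrative only. */
static inline unsigned long ind_entry_flags(unsigned long entry)
{
	return entry & IND_FLAGS;
}
#endif /* __ASSEMBLY__ */
```

With this arrangement relocate_kernel.S would include the header and drop its private copies of the four bit definitions.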
* [PATCH 7/8] arm64/kexec: Add core kexec support 2014-05-09 15:36 ` Mark Rutland @ 2014-05-13 22:27 ` Geoff Levand 2014-05-16 10:26 ` Mark Rutland 0 siblings, 1 reply; 23+ messages in thread From: Geoff Levand @ 2014-05-13 22:27 UTC (permalink / raw) To: linux-arm-kernel Hi Mark, On Fri, 2014-05-09 at 16:36 +0100, Mark Rutland wrote: > On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote: > > Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S, to the > > arm64 architecture that add support for the kexec re-boot mechanism on arm64 > > (CONFIG_KEXEC). > > > > This implementation supports re-boots of kernels with either PSCI or spin-table > > enable methods, but with some limitations on the match of 1st and 2nd stage > > kernels. The known limitations are checked in the kexec_compat_check() routine, > > which is called during a kexec_load syscall. If any limitations are reached an > > error is returned by the kexec_load syscall. Many of the limitations can be > > removed with some enhancment to the CPU shutdown management code in > > machine_kexec.c. > > I think if we cannot offline a CPU through the generic hotplug > infrastructure then we should fail to kexec. If we need a way of doing > hotplug without PSCI then we should sort that out [1] rather than > working around brokenness with more brokenness OK, as I mentioned in the cover message I agree with this. > I also don't think that kexec should care about the precise hotplug > mechanism so long as it works. If the DTB passed to the new kernel > describes a different mechanism than is currently in use, that's the > caller's choice and I don't see any point second-guessing that -- we're > already allowing them to boot an arbitrary kernel and so long as we > leave the system in a sane state I don't see a good reason to deny the > user from immediately destroying that. One use case for kexec is to use linux as a bootloader. 
The 1st stage bootloader kernel should be able to boot any other kernel. For
this to work the 1st stage kernel should do whatever it can to get the
secondary cpus into a state compatible with the 2nd stage kernel. If any of
the secondary cpus have a 1st stage enable method different from the 2nd stage
enable method, then the 1st stage kernel should move the cpus to their 2nd
stage enable method at shutdown. Also, any cpus stuck in the kernel
secondary_holding_pen should be moved to their 2nd stage enable method. I have
not tried to implement this, but it seems to me that it can be done.

Do you see any reason why this would not work?

> [1] For hotplug without PSCI I think we can get away with adding an
> optional property to spin-table describing a (physical) address to
> branch to which returns CPUs to their original spin code. So long as the
> kernel zeroes the release address first we should be able to boot a CPU
> exactly as we managed to the first time.

I think this would be just the address of the cpu's spin code, with the
restriction that the code is entered at this address also. If there is no code
there, in the case of a PSCI to spin-table re-boot, then the 1st stage kernel
needs to install some spin code. Also, shouldn't this be a required property to
avoid the spin code memory leakage problem?

> For spin-table we'll also need to jump back up to EL2 when EL2 is
> present. It should be possible to do that within the spin-table code if
> we update the initial EL2 vector and get KVM to tear itself down before
> cpu_die happens.

OK, I have not yet considered EL2.

> > +/**
> > + * struct kexec_cpu_info_spin - Info needed for the "spin table" enable method.
> > + */
> > +
> > +struct kexec_cpu_info_spin {
> > +	phys_addr_t phy_release_addr; /* cpu order */
>
> I assume you mean this is in the endianness of the current kernel?
I was > initially confused by the comment, and I think it might be better to > drop it -- unless told that a variable is in a specific endianness I > would assume that it's the current native endianness anyway. The value is read as LE from the device tree, and this comment is to clarify that the conversion from LE to cpu has been done. Maybe 'cpu byte order' is less confusing? Normally, I think the term 'machine order' would be used, but I chose 'cpu' from the name of the le64_to_cpu routine. > As a general point, could you please use "phys" rather than "phy" in > variable names? It'll make this more consistent with the rest of the > arm64 code, easier to search for, and reads better IMO. Sure. > > +struct kexec_ctx { > > + struct kexec_dt_info dt_1st; > > + struct kexec_dt_info dt_2nd; > > +}; > > + > > +static struct kexec_ctx *ctx; > > Is there any reason this should be dynamically allocated? > > Do we even need the current DTB if we rely on hotplug? I think so, for this implementation we need to at least check if the enable methods and cpu counts of the two DTs match and fail the kexec-load syscall if they do not. > > +/** > > + * kexec_is_dtb_user - Helper routine to check the device tree header signature. > > + */ > > + > > +static int kexec_is_dtb_user(const void *dtb) > > +{ > > + __be32 magic; > > + > > + get_user(magic, (__be32 *)dtb); > > Return value check? Unless we know this can't fail? > > If it can fail, surely we should return an appropriate error and forward > it to userspace. EFAULT? > > > + > > + return kexec_is_dtb(magic); > > And EINVAL if we can read this but it's not a DTB? These kexec_is_dtb are just used to search for the DTB segment, so are expected to return false for non-DTB segments. I'll change the return type to bool to make the usage more clear. > > +/** > > + * kexec_read_memreserve - Initialize memreserve info from a dtb. 
> > + */ > > + > > +static int kexec_read_memreserve(const void *dtb, struct kexec_dt_info *info) > > +{ > > + const struct boot_param_header *h = dtb; > > + struct pair { > > + __be64 phy_addr; > > + __be64 size; > > + } const *pair; > > + > > + pair = dtb + be32_to_cpu(h->off_mem_rsvmap); > > + > > + if ((pair + 1)->size) > > + pr_warn("kexec: Multiple memreserve regions found."); > > Huh? Multiple arbitrary memreserves are entirely valid. Why should we > warn in that case? If a user reports a problem I thought this comment may be useful in debugging since the current implementation does not consider them. > > + > > + info->phy_memreserve_addr = be64_to_cpu(pair->phy_addr); > > + info->memreserve_size = be64_to_cpu(pair->size); > > So we're assuming that the memory described in an arbitrary memreserve > entry (which is intended to describe memory which shouldn't be touched > unless we know what we're doing) is for our arbitrary use!? > > NAK. > > We shouldn't need to special-case reserved memory handling if we rely on > cpu hotplug. If we don't then the only functional option is to add a > memreserve, but that will end up leaking a small amount of memory upon > every kexec. I believe that the former is the only sane option. I think the solution is to have the cpu spin code address property. > > +static int kexec_setup_cpu_spin(const struct device_node *dn, > > + struct kexec_cpu_info_spin *info) > > +{ > > + int result; > > + u64 t1; > > + > > + memset(info, 0, sizeof(*info)); > > + > > + result = of_property_read_u64(dn, "cpu-release-addr", &t1); > > + > > + if (result) { > > + pr_warn("kexec: Read cpu-release-addr failed.\n"); > > + return result; > > + } > > + > > + info->phy_release_addr = le64_to_cpu(t1); > > Why are we calling le64_to_cpu here? > > of_property_read_u64 reads a be64 value from dt into cpu endianness, so > at the very least the annotation is the wrong way around. I'll check it again. 
I read this in booting.txt and thought the conversion was needed:

  The value will be written as a single 64-bit little-endian value, so CPUs
  must convert the read value to their native endianness before jumping to it.

> Have you tested this with a BE kernel? We should ensure that LE->LE,
> LE->BE, BE->BE, BE->LE all work.

Not yet. I'm in the process of setting up a BE test environment.

> > +int kexec_cpu_info_init(const struct device_node *dn,
> > +	struct kexec_dt_info *info)
> > +{
> > +	int result;
> > +	unsigned int cpu;
> > +	const struct device_node *i;
> > +
> > +	info->cpu_info = kmalloc(
> > +		(1 + info->cpu_count) * sizeof(struct kexec_cpu_info),
> > +		GFP_KERNEL);
>
> Why one more than the cpu count? I thought cpu_count covered all the
> CPUs in the dtb?

Yes, a leftover from when the array was zero terminated. Thanks for such
a detailed check!

> [...]
>
> > +int kexec_dt_info_init(void *dtb, struct kexec_dt_info *info)
> > +{
> > +	int result;
> > +	struct device_node *i;
> > +	struct device_node *dn;
> > +
> > +	if (!dtb) {
> > +		/* 1st stage. */
> > +		dn = NULL;
> > +	} else {
> > +		/* 2nd stage. */
> > +
> > +		of_fdt_unflatten_tree(dtb, &dn);
>
> This may fail. We should check that dn is not NULL before we try to use
> it later -- many of_* functions will traverse the current kernel's boot
> DT if provided with a NULL root.

OK, I'll fix it.

> > +
> > +		result = kexec_read_memreserve(dtb, info);
> > +
> > +		if (result)
> > +			return result;
> > +	}
> > +
> > +	/*
> > +	 * We may need to work with offline cpus to get them into the correct
> > +	 * state for a given enable method to work, and so need an info_array
> > +	 * that has info about all the platform cpus.
> > +	 */
>
> What exactly do we need to do to offline CPUs?

As mentioned above, to get them into the state expected by the 2nd stage
kernel, if we choose to do so. For this implementation the info is only used
for the compatibility checks. Maybe I'll change the wording of this comment.
> > +void kexec_spin_2(unsigned int cpu, phys_addr_t signal_1, > > + phys_addr_t phy_release_addr, phys_addr_t signal_2) > > +{ > > + typedef void (*fn_t)(phys_addr_t, phys_addr_t); > > + > > + fn_t spin_3; > > + > > + atomic_dec((atomic_t *)signal_1); > > + > > + /* Wait for next signal. */ > > + > > + while (!atomic_read((atomic_t *)signal_2)) > > + (void)0; > > Why not cpu_relax()? Sure. > > [...] > > > + /* Check for cpus still spinning in secondary_holding_pen. */ > > + > > + if (NR_CPUS < dt1->spinner_count) { > > + pr_err("kexec: Error: NR_CPUS too small for spin enable %u < %u.\n", > > + NR_CPUS, dt1->spinner_count + 1); > > + result++; > > + } > > In some cases people might describe fewer CPUs in the DTB than are > actually present, which we should give up for also. I think we can alter > secondary_holding_pen to be a bit more intelligent for that case and get > any unexpected secondaries to write a flag to indicate their presence. OK, I'll look into that change. > > +void machine_crash_shutdown(struct pt_regs *regs) > > +{ > > +} > > Missing implementation? If it's fine for this to be empty it would be > nice to have a comment to that effect. Sure, the core kexec code calls this, but it is a todo for kdump. > > +int machine_kexec_prepare(struct kimage *image) > > +{ > > + int result; > > + const struct kexec_segment *seg; > > + void *dtb; > > + > > + machine_kexec_cleanup(NULL); > > + > > + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); > > + > > + if (!ctx) { > > + pr_debug("%s: out of memory", __func__); > > + return -ENOMEM; > > + } > > + > > + seg = kexec_find_dtb_seg(image); > > + BUG_ON(!seg); > > + > > + dtb = kexec_copy_dtb(seg); > > + BUG_ON(!dtb); > > + BUG_ON(!kexec_is_dtb(*(const __be32 *)dtb)); > > Why BUG_ON rather than report the failure and return an error? Sure, the user space kexec helper should have set these up correctly, so these were intended as sanity checks, but to report an error would be better. 
> > + > > + result = kexec_dt_info_init(NULL, &ctx->dt_1st); > > + > > + if (result) > > + goto on_error; > > + > > + result = kexec_dt_info_init(dtb, &ctx->dt_2nd); > > + > > + if (result) > > + goto on_error; > > + > > + if (ctx->dt_2nd.spinner_count) { > > + BUG_ON(!ctx->dt_2nd.phy_memreserve_addr); > > + BUG_ON(kexec_cpu_spin_size >= ctx->dt_2nd.memreserve_size > > + - kexec_spin_code_offset); > > + } > > + > > + result = kexec_compat_check(&ctx->dt_1st, &ctx->dt_2nd); > > + > > + if (result) > > + goto on_error; > > + > > + kexec_dtb_addr = seg->mem; > > + kexec_kimage_start = image->start; > > + kexec_spinner_count = ctx->dt_1st.spinner_count - 1; > > + > > + smp_spin_table_set_die(kexec_spin_1); > > I very much dislike hooking into the spin-table code like this. > > [...] > > > + if (ctx->dt_2nd.spinner_count) { > > + void *va; > > + > > + /* > > + * Copy the spin code to the 2nd stage memreserve area as > > + * dictated by the arm64 boot specification. > > + */ > > The boot documentation says any spin code must be protected with a > memreserve. This does not mean that the first memreserve is a special > location that the spin table must be placed at. This comment seems to > have the implication backwards. I'll think about how we can reword the boot documentation to make it more clear. > > + * kexec_cpu_spin - Spin the CPU as described in the arm64/booting.txt document. > > + * > > + * Prototype: void kexec_cpu_spin(phys_addr_t release_addr, phys_addr_t signal); > > + * > > + * The caller must initialize release_addr to zero or a valid address > > + * prior to calling kexec_cpu_spin. Note that if the MMU will be turned on > > + * or off while the CPU is spinning here this code must be in an identity > > + * mapped page. The value written to release_addr must be in little endian > > + * order. 
> > The MMU _must_ be off upon entry to the kernel, as is explicitly stated > in Documentation/arm64/booting.txt, and I don't see why the spinning > code should have the MMU on. It seems like an endless source of subtle > bugs. Sorry, this is an old comment. The part about the MMU is no longer valid. > > +/* > > + * relocate_new_kernel - Put the 2nd stage kernel image in place and boot it. > > + * > > + * The memory that the old kernel occupies may be overwritten when coping the > > + * new kernel to its final location. To assure that the relocate_new_kernel > > + * routine which does that copy is not overwritten, all code and data needed > > + * by relocate_new_kernel must be between the symbols relocate_new_kernel and > > + * relocate_new_kernel_end. The machine_kexec() routine will copy > > + * relocate_new_kernel to the kexec control_code_page, a special page which > > + * has been set up to be preserved during the kernel copy operation. > > + */ > > + > > +/* These definitions correspond to the kimage_entry flags in linux/kexec.h */ > > + > > +#define IND_DESTINATION_BIT 0 > > +#define IND_INDIRECTION_BIT 1 > > +#define IND_DONE_BIT 2 > > +#define IND_SOURCE_BIT 3 > > These should live in linux/kexec.h -- the existing macros can be > generated from these and we should be able to protect everything with > #ifndef __ASSEMBLY__ I'll make that change. -Geoff ^ permalink raw reply [flat|nested] 23+ messages in thread
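The endianness exchange earlier in this message involves two different byte orders: device tree property values such as cpu-release-addr are stored big-endian, while the booting.txt text quoted above says the value written to the release address is little-endian. The sketch below shows both conversions done byte by byte, so it behaves the same on LE and BE hosts. The helper names dt_read_be64 and release_write_le64 are hypothetical and are not kernel APIs; in the kernel, of_property_read_u64() already returns the property value in cpu byte order.

```c
#include <stdint.h>

/* Read a 64-bit big-endian device tree cell pair into a native value. */
static uint64_t dt_read_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	/* Most significant byte comes first in the property data. */
	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];

	return v;
}

/* Store a native value as 64-bit little-endian, as the release slot expects. */
static void release_write_le64(uint8_t *p, uint64_t v)
{
	int i;

	/* Least significant byte goes to the lowest address. */
	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}
```

Seen this way, no le64_to_cpu() is needed on a value that came from the device tree; the little-endian rule only applies to the entry point that is later written into the release slot.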
* [PATCH 7/8] arm64/kexec: Add core kexec support 2014-05-13 22:27 ` Geoff Levand @ 2014-05-16 10:26 ` Mark Rutland 0 siblings, 0 replies; 23+ messages in thread From: Mark Rutland @ 2014-05-16 10:26 UTC (permalink / raw) To: linux-arm-kernel On Tue, May 13, 2014 at 11:27:30PM +0100, Geoff Levand wrote: > Hi Mark, > > On Fri, 2014-05-09 at 16:36 +0100, Mark Rutland wrote: > > On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote: > > > Add three new files, kexec.h, machine_kexec.c and relocate_kernel.S, to the > > > arm64 architecture that add support for the kexec re-boot mechanism on arm64 > > > (CONFIG_KEXEC). > > > > > > This implementation supports re-boots of kernels with either PSCI or spin-table > > > enable methods, but with some limitations on the match of 1st and 2nd stage > > > kernels. The known limitations are checked in the kexec_compat_check() routine, > > > which is called during a kexec_load syscall. If any limitations are reached an > > > error is returned by the kexec_load syscall. Many of the limitations can be > > > removed with some enhancment to the CPU shutdown management code in > > > machine_kexec.c. > > > > I think if we cannot offline a CPU through the generic hotplug > > infrastructure then we should fail to kexec. If we need a way of doing > > hotplug without PSCI then we should sort that out [1] rather than > > working around brokenness with more brokenness > > OK, as I mentioned in the cover message I agree with this. Ok. > > I also don't think that kexec should care about the precise hotplug > > mechanism so long as it works. If the DTB passed to the new kernel > > describes a different mechanism than is currently in use, that's the > > caller's choice and I don't see any point second-guessing that -- we're > > already allowing them to boot an arbitrary kernel and so long as we > > leave the system in a sane state I don't see a good reason to deny the > > user from immediately destroying that. 
> > One use case for kexec is to use linux as a bootloader. The 1st stage > bootloader kernel should be able to boot any other kernel. For this to work > the 1st stage kernel should do whatever it can to get the secondary cpus into a > state compatible with the 2nd stage kernel. If any of the secondary cpus > have a 1st stage enable method different from the 2nd stage enable method, > then the 1st stage kernel should move the cpus to their 2nd stage enable > method at shutdown. Also, any cpus stuck in the kernel secondary_holding_pen > should to be moved to their 2nd stage enable method. I have not tried to > implement this, but it seems to me that it can be done. While using Linux as a chained bootloader could make sense, I don't think it makes sense for that bootloader kernel to bring the secondaries up at all if we're not going to be able to hotplug them off. I'm not sure I follow the reasoning w.r.t. making the first kernel behave as a shim to use a different enable-method for the second kernel. Why would you not teach the secondary kernel to handle the real enable method? > Do you see any reason why this would not work? This may work, but I think there is a much better solution (i.e. ensuring we have real hotplug mechanisms where we need them). I am concerned that if and when we get more enable methods, the complexity of performing the shim work is going to expand dramatically. It certainly isn't going to be possible to shim all combinations (e.g. spin-table -> PSCI). > > [1] For hotplug without PSCI I think we can get away with adding an > > optional property to spin-table describing a (physical) address to > > branch to which returns CPUs to their original spin code. So long as the > > kernel zeroes the release address first we should be able to boot a CPU > > exactly as we managed to the first time. > > I think just this would be just the address of the cpu's spin code, with the > restriction that the code is entered at this address also.
If there is no code > there, in the case of a PSCI to spin-table re-boot, then the 1st stage kernel > needs to install some spin code. Also, shouldn't this be a required property > to avoid the spin code memory leakage problem? The initial spin code must have been protected with a memreserve, so it should still be there. Why would we need to install some new code there? As the original code is still present, retaining the original memreserve will be sufficient. We won't leak memory through additional memreserves, and we won't clobber the spin-table code as it was reserved. If it's not present we simply don't have hotplug, cannot hotplug the CPU, and cannot kexec in SMP. > > For spin-table we'll also need to jump back up to EL2 when EL2 is > > present. It should be possible to do that within the spin-table code if > > we update the initial EL2 vector and get KVM to tear itself down before > > cpu_die happens. > > OK, I have not yet considered EL2. > > > > +/** > > > + * struct kexec_cpu_info_spin - Info needed for the "spin table" enable method. > > > + */ > > > + > > > +struct kexec_cpu_info_spin { > > > + phys_addr_t phy_release_addr; /* cpu order */ > > > > I assume you mean this is in the endianness of the current kernel? I was > > initially confused by the comment, and I think it might be better to > > drop it -- unless told that a variable is in a specific endianness I > > would assume that it's the current native endianness anyway. > > The value is read as LE from the device tree, and this comment is to clarify > that the conversion from LE to cpu has been done. Maybe 'cpu byte order' is > less confusing? Normally, I think the term 'machine order' would be used, but > I chose 'cpu' from the name of the le64_to_cpu routine. To me, the code is more confusing due to the presence of the comment. My default assumption would be that all variables are {native,cpu}-endian unless commented otherwise. I'm not sure what you mean w.r.t. the LE conversion.
The address in the DT will be big-endian, and the accessors will convert it to native endianness as required. The value _at_ the address will be LE, but there is no conversion necessary on the address itself. > > > +struct kexec_ctx { > > > + struct kexec_dt_info dt_1st; > > > + struct kexec_dt_info dt_2nd; > > > +}; > > > + > > > +static struct kexec_ctx *ctx; > > > > Is there any reason this should be dynamically allocated? > > > > Do we even need the current DTB if we rely on hotplug? > > I think so, for this implementation we need to at least check if the enable > methods and cpu counts of the two DTs match and fail the kexec-load syscall > if they do not. With the approach I described, we would not need to perform this check. We should simply trust that the user knows what they are doing (we're letting them boot an arbitrary kernel anyway...). > > > > +/** > > > + * kexec_is_dtb_user - Helper routine to check the device tree header signature. > > > + */ > > > + > > > +static int kexec_is_dtb_user(const void *dtb) > > > +{ > > > + __be32 magic; > > > + > > > + get_user(magic, (__be32 *)dtb); > > > > Return value check? Unless we know this can't fail? > > > > If it can fail, surely we should return an appropriate error and forward > > it to userspace. EFAULT? > > > > > + > > > + return kexec_is_dtb(magic); > > > > And EINVAL if we can read this but it's not a DTB? > > These kexec_is_dtb are just used to search for the DTB segment, so are expected > to return false for non-DTB segments. I'll change the return type to bool to > make the usage more clear. While that's fine for the latter, I think the former should fail all the way to userspace which has handed us a pointer to memory which it does not own. > > > > +/** > > > + * kexec_read_memreserve - Initialize memreserve info from a dtb.
> > > + */ > > > + > > > +static int kexec_read_memreserve(const void *dtb, struct kexec_dt_info *info) > > > +{ > > > + const struct boot_param_header *h = dtb; > > > + struct pair { > > > + __be64 phy_addr; > > > + __be64 size; > > > + } const *pair; > > > + > > > + pair = dtb + be32_to_cpu(h->off_mem_rsvmap); > > > + > > > + if ((pair + 1)->size) > > > + pr_warn("kexec: Multiple memreserve regions found."); > > > > Huh? Multiple arbitrary memreserves are entirely valid. Why should we > > warn in that case? > > If a user reports a problem I thought this comment may be useful in debugging > since the current implementation does not consider them. Given that multiple memreserves are entirely valid this is going to result in false positives. I think we can get rid of this if we rely on userspace passing the right information along. > > > > + > > > + info->phy_memreserve_addr = be64_to_cpu(pair->phy_addr); > > > + info->memreserve_size = be64_to_cpu(pair->size); > > > > So we're assuming that the memory described in an arbitrary memreserve > > entry (which is intended to describe memory which shouldn't be touched > > unless we know what we're doing) is for our arbitrary use!? > > > > NAK. > > > > We shouldn't need to special-case reserved memory handling if we rely on > > cpu hotplug. If we don't then the only functional option is to add a > > memreserve, but that will end up leaking a small amount of memory upon > > every kexec. I believe that the former is the only sane option. > > I think the solution is to have the cpu spin code address property. Ok. 
> > > > +static int kexec_setup_cpu_spin(const struct device_node *dn, > > > + struct kexec_cpu_info_spin *info) > > > +{ > > > + int result; > > > + u64 t1; > > > + > > > + memset(info, 0, sizeof(*info)); > > > + > > > + result = of_property_read_u64(dn, "cpu-release-addr", &t1); > > > + > > > + if (result) { > > > + pr_warn("kexec: Read cpu-release-addr failed.\n"); > > > + return result; > > > + } > > > + > > > + info->phy_release_addr = le64_to_cpu(t1); > > > > Why are we calling le64_to_cpu here? > > > > of_property_read_u64 reads a be64 value from dt into cpu endianness, so > > at the very least the annotation is the wrong way around. > > I'll check it again. I read this and thought the conversion was needed: > > The value will be written as a single 64-bit little-endian > value, so CPUs must convert the read value to their native endianness > before jumping to it. Endianness conversion may be necessary on the value written/read to/from the address but you're converting the endianness of the address of the mailbox, not the value written to the mailbox... > > > Have you tested this with a BE kernel? We should ensure that LE->LE, > > LE->BE, BE->BE, BE->LE all work. > > Not yet. I'm in the process of setting up a BE test environment. Ok. I'd very much like to know that this will work across BE<->LE boundaries. > > > > +int kexec_cpu_info_init(const struct device_node *dn, > > > + struct kexec_dt_info *info) > > > +{ > > > + int result; > > > + unsigned int cpu; > > > + const struct device_node *i; > > > + > > > + info->cpu_info = kmalloc( > > > + (1 + info->cpu_count) * sizeof(struct kexec_cpu_info), > > > + GFP_KERNEL); > > > > Why one more than the cpu count? I thought cpu_count covered all the > > CPUs in the dtb? > > Yes, a left over from when the array was zero terminated. Thanks for such > a detailed check! > > > [...] 
> > > > > +int kexec_dt_info_init(void *dtb, struct kexec_dt_info *info) > > > +{ > > > + int result; > > > + struct device_node *i; > > > + struct device_node *dn; > > > + > > > + if (!dtb) { > > > + /* 1st stage. */ > > > + dn = NULL; > > > + } else { > > > + /* 2nd stage. */ > > > + > > > + of_fdt_unflatten_tree(dtb, &dn); > > > > This may fail. We should check that dn is not NULL before we try to use > > it later -- many of_* functions will traverse the current kernel's boot > > DT if provided with a NULL root. > > OK, I'll fix it. > > > > + > > > + result = kexec_read_memreserve(dtb, info); > > > + > > > + if (result) > > > + return result; > > > + } > > > + > > > + /* > > > + * We may need to work with offline cpus to get them into the correct > > > + * state for a given enable method to work, and so need an info_array > > > + * that has info about all the platform cpus. > > > + */ > > > > What exactly do we need to do to offline CPUs? > > As mentioned above, to get them into a state expected by the 2nd stage > kernel if we choose to do so, but to do the compatibility checks for > this implementation. Maybe I'll change the wording of this comment. I still don't understand why we would need to bring them online to later fake putting them offline, rather than getting the next kernel to handle the real enable method. Cheers, Mark. ^ permalink raw reply [flat|nested] 23+ messages in thread
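The endianness point raised in the review above can be illustrated with a small userspace sketch. It assumes that the mailbox address has already been read into a native value (as of_property_read_u64 does for DT properties), so only the value stored at that address needs a defined byte order. The helper names `store_le64`, `load_le64` and `release_cpu` are hypothetical and are not kernel APIs; in the kernel, cpu_to_le64() and le64_to_cpu() serve this purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only (not kernel APIs).
 * They store and load a 64-bit value in little-endian byte order,
 * regardless of the endianness of the CPU running the code.
 */
static void store_le64(uint8_t *mailbox, uint64_t value)
{
	size_t i;

	assert(mailbox != NULL);
	for (i = 0; i < 8; i++)
		mailbox[i] = (uint8_t)(value >> (8 * i));
}

static uint64_t load_le64(const uint8_t *mailbox)
{
	uint64_t value = 0;
	size_t i;

	assert(mailbox != NULL);
	for (i = 0; i < 8; i++)
		value |= (uint64_t)mailbox[i] << (8 * i);
	return value;
}

/*
 * Release a spinning CPU. 'release_addr' stands in for the memory that
 * cpu-release-addr points at. The pointer itself is an ordinary native
 * value and is never byte-swapped; only the entry point written through
 * it has a fixed (little-endian) layout.
 */
static void release_cpu(uint8_t *release_addr, uint64_t entry_point)
{
	store_le64(release_addr, entry_point);
}
```

Because the helpers work byte by byte, the same code gives the same mailbox contents on a BE and an LE kernel, which is the property the LE->BE and BE->LE test cases mentioned above would exercise.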
* [PATCH 7/8] arm64/kexec: Add core kexec support 2014-05-09 0:48 ` [PATCH 7/8] arm64/kexec: Add core kexec support Geoff Levand 2014-05-09 15:36 ` Mark Rutland @ 2014-05-14 10:54 ` Catalin Marinas 2014-05-14 23:20 ` Geoff Levand 2014-07-07 7:33 ` Dave Young 2 siblings, 1 reply; 23+ messages in thread From: Catalin Marinas @ 2014-05-14 10:54 UTC (permalink / raw) To: linux-arm-kernel On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote: > +KEXEC FOR ARM64 > +M: Geoff Levand <geoff@infradead.org> > +W: http://kernel.org/pub/linux/utils/kernel/kexec/ > +L: kexec at lists.infradead.org > +L: linux-arm-kernel at lists.infradead.org (moderated for non-subscribers) > +S: Maintained > +F: arch/arm64/machine_kexec.c > +F: arch/arm64/relocate_kernel.S These entries missed the full directory name. Anyway, this code already comes under the core arm64 MAINTAINERS entry and it doesn't make sense to have a special kexec case. Please add a proper header to the new files you introduce, including copyright and author information. -- Catalin ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 7/8] arm64/kexec: Add core kexec support 2014-05-14 10:54 ` Catalin Marinas @ 2014-05-14 23:20 ` Geoff Levand 0 siblings, 0 replies; 23+ messages in thread From: Geoff Levand @ 2014-05-14 23:20 UTC (permalink / raw) To: linux-arm-kernel Hi Catalin, On Wed, 2014-05-14 at 11:54 +0100, Catalin Marinas wrote: > On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote: > > +KEXEC FOR ARM64 > > +M: Geoff Levand <geoff@infradead.org> > > +W: http://kernel.org/pub/linux/utils/kernel/kexec/ > > +L: kexec at lists.infradead.org > > +L: linux-arm-kernel at lists.infradead.org (moderated for non-subscribers) > > +S: Maintained > > +F: arch/arm64/machine_kexec.c > > +F: arch/arm64/relocate_kernel.S > > These entries missed the full directory name. > > Anyway, this code already comes under the core arm64 MAINTAINERS entry > and it doesn't make sense to have a special kexec case. Please add a > proper header to the new files you introduce, including copyright and > author information. Thanks for the comments. I'll fix for v2 of the series. -Geoff ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 7/8] arm64/kexec: Add core kexec support 2014-05-09 0:48 ` [PATCH 7/8] arm64/kexec: Add core kexec support Geoff Levand 2014-05-09 15:36 ` Mark Rutland 2014-05-14 10:54 ` Catalin Marinas @ 2014-07-07 7:33 ` Dave Young 2014-07-11 9:47 ` Dave Young 2 siblings, 1 reply; 23+ messages in thread From: Dave Young @ 2014-07-07 7:33 UTC (permalink / raw) To: linux-arm-kernel [snip] > + > +/** > + * kexec_cpu_info_init - Initialize an array of kexec_cpu_info structures. > + * > + * Allocates a cpu info array and fills it with info for all cpus found in > + * the device tree passed. The cpu info array is zero terminated. > + */ > + > +int kexec_cpu_info_init(const struct device_node *dn, > + struct kexec_dt_info *info) > +{ > + int result; > + unsigned int cpu; > + const struct device_node *i; > + > + info->cpu_info = kmalloc( > + (1 + info->cpu_count) * sizeof(struct kexec_cpu_info), > + GFP_KERNEL); > + > + if (!info->cpu_info) { > + pr_debug("%s: out of memory", __func__); > + return -ENOMEM; > + } > + > + info->spinner_count = 0; > + > + for (cpu = 0, i = dn; cpu < info->cpu_count; cpu++) { > + struct kexec_cpu_info *cpu_info = &info->cpu_info[cpu]; > + > + i = of_find_node_by_type((struct device_node *)i, "cpu"); > + > + BUG_ON(!i); > + > + cpu_info->cpu = cpu; > + > + result = cpu_read_ops((struct device_node *)i, cpu, > + &cpu_info->cpu_ops); cpu_ops memory is not allocated? BTW cpu_read_ops will call cpu_get_ops which is marked as __init Thanks Dave ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 7/8] arm64/kexec: Add core kexec support 2014-07-07 7:33 ` Dave Young @ 2014-07-11 9:47 ` Dave Young 0 siblings, 0 replies; 23+ messages in thread From: Dave Young @ 2014-07-11 9:47 UTC (permalink / raw) To: linux-arm-kernel On 07/07/14 at 03:33pm, Dave Young wrote: > [snip] > > > + > > +/** > > + * kexec_cpu_info_init - Initialize an array of kexec_cpu_info structures. > > + * > > + * Allocates a cpu info array and fills it with info for all cpus found in > > + * the device tree passed. The cpu info array is zero terminated. > > + */ > > + > > +int kexec_cpu_info_init(const struct device_node *dn, > > + struct kexec_dt_info *info) > > +{ > > + int result; > > + unsigned int cpu; > > + const struct device_node *i; > > + > > + info->cpu_info = kmalloc( > > + (1 + info->cpu_count) * sizeof(struct kexec_cpu_info), > > + GFP_KERNEL); > > + > > + if (!info->cpu_info) { > > + pr_debug("%s: out of memory", __func__); > > + return -ENOMEM; > > + } > > + > > + info->spinner_count = 0; > > + > > + for (cpu = 0, i = dn; cpu < info->cpu_count; cpu++) { > > + struct kexec_cpu_info *cpu_info = &info->cpu_info[cpu]; > > + > > + i = of_find_node_by_type((struct device_node *)i, "cpu"); > > + > > + BUG_ON(!i); > > + > > + cpu_info->cpu = cpu; > > + > > + result = cpu_read_ops((struct device_node *)i, cpu, > > + &cpu_info->cpu_ops); > > cpu_ops memory is not allocated? Oops, I misread the code, it should be not a problem. Just ignore above comment But I surely have some problem, probably caused by some random issues. > > BTW cpu_read_ops will call cpu_get_ops which is marked as __init > > Thanks > Dave > > _______________________________________________ > kexec mailing list > kexec at lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 3/8] arm64: Add spin-table cpu_die
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
` (3 preceding siblings ...)
2014-05-09 0:48 ` [PATCH 7/8] arm64/kexec: Add core kexec support Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 8:54 ` Mark Rutland
2014-05-09 0:48 ` [PATCH 4/8] arm64: Add smp_spin_table_set_die Geoff Levand
` (3 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

Add two new minimal routines smp_spin_table_cpu_disable() and
smp_spin_table_cpu_die() and hook them up to the smp_spin_table_ops instance.

Kexec support will use smp_spin_table_cpu_die() for re-boot of spin table CPUs,
but also needs a compatible smp_spin_table_cpu_disable() to allow
execution to reach smp_spin_table_cpu_die().

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 arch/arm64/kernel/smp_spin_table.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c
index 7a530d2..26c780b 100644
--- a/arch/arm64/kernel/smp_spin_table.c
+++ b/arch/arm64/kernel/smp_spin_table.c
@@ -142,10 +142,23 @@ static void smp_spin_table_cpu_postboot(void)
 	raw_spin_unlock(&boot_lock);
 }
 
+static int smp_spin_table_cpu_disable(unsigned int cpu)
+{
+	return 0;
+}
+
+static void smp_spin_table_cpu_die(unsigned int cpu)
+{
+	while (1)
+		cpu_relax();
+}
+
 const struct cpu_operations smp_spin_table_ops = {
 	.name		= "spin-table",
 	.cpu_init	= smp_spin_table_cpu_init,
 	.cpu_prepare	= smp_spin_table_cpu_prepare,
 	.cpu_boot	= smp_spin_table_cpu_boot,
 	.cpu_postboot	= smp_spin_table_cpu_postboot,
+	.cpu_disable	= smp_spin_table_cpu_disable,
+	.cpu_die	= smp_spin_table_cpu_die,
 };
-- 
1.9.1

^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 3/8] arm64: Add spin-table cpu_die
2014-05-09 0:48 ` [PATCH 3/8] arm64: Add spin-table cpu_die Geoff Levand
@ 2014-05-09 8:54 ` Mark Rutland
0 siblings, 0 replies; 23+ messages in thread
From: Mark Rutland @ 2014-05-09 8:54 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote:
> Add two new minimal routines smp_spin_table_cpu_disable() and
> smp_spin_table_cpu_die() and hook them up to the smp_spin_table_ops instance.
>
> Kexec support will use smp_spin_table_cpu_die() for re-boot of spin table CPUs,
> but also needs a compatible smp_spin_table_cpu_disable() to allow
> execution to reach smp_spin_table_cpu_die().
>
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
>  arch/arm64/kernel/smp_spin_table.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c
> index 7a530d2..26c780b 100644
> --- a/arch/arm64/kernel/smp_spin_table.c
> +++ b/arch/arm64/kernel/smp_spin_table.c
> @@ -142,10 +142,23 @@ static void smp_spin_table_cpu_postboot(void)
>  	raw_spin_unlock(&boot_lock);
>  }
>
> +static int smp_spin_table_cpu_disable(unsigned int cpu)
> +{
> +	return 0;

If we cannot kill the CPU then we need to fail here. cpu_disable is
called early enough that the cpu hotplug infrastructure can recover if
we cannot disable the CPU.

> +}
> +
> +static void smp_spin_table_cpu_die(unsigned int cpu)
> +{
> +	while (1)
> +		cpu_relax();

This does not kill the CPU -- it's stuck spinning on some instructions
that might get clobbered upon a kexec. So this is insufficient for
kexec.

This also doesn't allow the CPU to be brought back, so it's completely
useless as-is.

NAK for pseudo-hotplug. If we cannot place the CPU somewhere out of the
way of the kernel then we _must_ fail.

Thanks,
Mark.

> +}
> +
>  const struct cpu_operations smp_spin_table_ops = {
>  	.name		= "spin-table",
>  	.cpu_init	= smp_spin_table_cpu_init,
>  	.cpu_prepare	= smp_spin_table_cpu_prepare,
>  	.cpu_boot	= smp_spin_table_cpu_boot,
>  	.cpu_postboot	= smp_spin_table_cpu_postboot,
> +	.cpu_disable	= smp_spin_table_cpu_disable,
> +	.cpu_die	= smp_spin_table_cpu_die,
>  };
> --
> 1.9.1
>
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

^ permalink raw reply [flat|nested] 23+ messages in thread
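The review comment above asks for cpu_disable to fail when the CPU cannot really be taken out of the kernel. A minimal userspace sketch of that shape follows. It assumes a hypothetical per-CPU record `struct spin_cpu` whose `return_addr` field models the "address to branch back to the original spin code" idea floated in this thread; neither the structure nor the field is an existing kernel interface or DT property.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-CPU record for illustration. 'return_addr' models an
 * optional way to hand the CPU back to its original spin code; it is an
 * assumption of this sketch, not an existing kernel field.
 */
struct spin_cpu {
	uint64_t release_addr;	/* physical address of the mailbox */
	uint64_t return_addr;	/* 0 when there is no way back */
};

/*
 * Model of a cpu_disable callback: return 0 only when the CPU can really
 * be moved out of the kernel. Otherwise report an error, so the caller
 * can abort the offline request while recovery is still possible.
 */
static int spin_cpu_disable(const struct spin_cpu *c)
{
	if (c == NULL || c->return_addr == 0)
		return -EOPNOTSUPP;
	return 0;
}
```

The design choice is that the refusal happens in the disable step rather than in the die step, because by the time a die handler runs there is nothing left that could recover.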
* [PATCH 4/8] arm64: Add smp_spin_table_set_die
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
` (4 preceding siblings ...)
2014-05-09 0:48 ` [PATCH 3/8] arm64: Add spin-table cpu_die Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 0:48 ` [PATCH 1/8] arm64: Use cpu_ops for smp_stop Geoff Levand
` (2 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

Remove the const attribute from the smp_spin_table_ops instance in
smp_spin_table.c and add the new routine smp_spin_table_set_die() which allows
a custom cpu die method to be set in the smp_spin_table_ops instance.

The ability to set a custom cpu die routine is needed by kexec to allow it to
manage any secondary CPUs that need to be shut down as described in
booting.txt.

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 arch/arm64/include/asm/cpu_ops.h   | 2 ++
 arch/arm64/kernel/smp_spin_table.c | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/cpu_ops.h b/arch/arm64/include/asm/cpu_ops.h
index 872f61a..42bcd24 100644
--- a/arch/arm64/include/asm/cpu_ops.h
+++ b/arch/arm64/include/asm/cpu_ops.h
@@ -63,4 +63,6 @@ extern int cpu_read_ops(struct device_node *dn, int cpu,
 	const struct cpu_operations **cpu_ops);
 extern void __init cpu_read_bootcpu_ops(void);
 
+void smp_spin_table_set_die(void (*fn)(unsigned int));
+
 #endif /* ifndef __ASM_CPU_OPS_H */
diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c
index 26c780b..ce7d0a3 100644
--- a/arch/arm64/kernel/smp_spin_table.c
+++ b/arch/arm64/kernel/smp_spin_table.c
@@ -153,7 +153,7 @@ static void smp_spin_table_cpu_die(unsigned int cpu)
 		cpu_relax();
 }
 
-const struct cpu_operations smp_spin_table_ops = {
+struct cpu_operations smp_spin_table_ops = {
 	.name		= "spin-table",
 	.cpu_init	= smp_spin_table_cpu_init,
 	.cpu_prepare	= smp_spin_table_cpu_prepare,
@@ -162,3 +162,9 @@ const struct cpu_operations smp_spin_table_ops = {
 	.cpu_disable	= smp_spin_table_cpu_disable,
 	.cpu_die	= smp_spin_table_cpu_die,
 };
+
+void smp_spin_table_set_die(void (*fn)(unsigned int))
+{
+	BUG_ON(!fn);
+	smp_spin_table_ops.cpu_die = fn;
+}
-- 
1.9.1

^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 1/8] arm64: Use cpu_ops for smp_stop
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
` (5 preceding siblings ...)
2014-05-09 0:48 ` [PATCH 4/8] arm64: Add smp_spin_table_set_die Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 8:44 ` Mark Rutland
2014-05-09 0:48 ` [PATCH 5/8] arm64: Split soft_restart into two stages Geoff Levand
2014-05-09 16:22 ` [PATCH 0/8] arm64 kexec kernel patches Mark Rutland
8 siblings, 1 reply; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

The current implementation of ipi_cpu_stop() is just a tight infinite loop
around cpu_relax(). Add a check for a valid cpu_die method of the appropriate
cpu_operations structure, and if a valid method is found, transfer control to
that method.

The core kexec code calls the arch specific machine_shutdown() routine to
shutdown any SMP secondary CPUs. The current implementation of the arm64
machine_shutdown() uses smp_send_stop(), which ultimately runs ipi_cpu_stop()
on the secondary CPUs. The infinite loop implementation of the current
ipi_cpu_stop() does not have any mechanism to get the CPU into a state
compatable with a kexec re-boot.

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 arch/arm64/kernel/smp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index f0a141d..020bbd5 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -508,6 +508,14 @@ static void ipi_cpu_stop(unsigned int cpu)
 
 	local_irq_disable();
 
+	/* If we have the cup_ops use them. */
+
+	if (cpu_ops[cpu]->cpu_disable && cpu_ops[cpu]->cpu_die
+		&& !cpu_ops[cpu]->cpu_disable(cpu))
+		cpu_ops[cpu]->cpu_die(cpu);
+
+	/* Spin here if the cup_ops fail. */
+
 	while (1)
 		cpu_relax();
 }
-- 
1.9.1

^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 1/8] arm64: Use cpu_ops for smp_stop 2014-05-09 0:48 ` [PATCH 1/8] arm64: Use cpu_ops for smp_stop Geoff Levand @ 2014-05-09 8:44 ` Mark Rutland 2014-05-13 22:27 ` Geoff Levand 0 siblings, 1 reply; 23+ messages in thread From: Mark Rutland @ 2014-05-09 8:44 UTC (permalink / raw) To: linux-arm-kernel On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote: > The current implementation of ipi_cpu_stop() is just a tight infinite loop > around cpu_relax(). Add a check for a valid cpu_die method of the appropriate > cpu_operations structure, and if a valid method is found, transfer control to > that method. > > The core kexec code calls the arch specific machine_shutdown() routine to > shutdown any SMP secondary CPUs. The current implementation of the arm64 > machine_shutdown() uses smp_send_stop(), which ultimately runs ipi_cpu_stop() > on the secondary CPUs. The infinite loop implementation of the current > ipi_cpu_stop() does not have any mechanism to get the CPU into a state > compatable with a kexec re-boot. > > Signed-off-by: Geoff Levand <geoff@infradead.org> > --- > arch/arm64/kernel/smp.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c > index f0a141d..020bbd5 100644 > --- a/arch/arm64/kernel/smp.c > +++ b/arch/arm64/kernel/smp.c > @@ -508,6 +508,14 @@ static void ipi_cpu_stop(unsigned int cpu) > > local_irq_disable(); > > + /* If we have the cup_ops use them. */ > + > + if (cpu_ops[cpu]->cpu_disable && cpu_ops[cpu]->cpu_die > + && !cpu_ops[cpu]->cpu_disable(cpu)) > + cpu_ops[cpu]->cpu_die(cpu); For PSCI 0.2 support, we're going to need a cpu_kill callback which we can't call from the dying CPU. Specifically, we'll need to poll CPU_AFFINITY_INFO to ensure that secondaries have _actually_ left the kernel and aren't going to be adversely affected by the kernel text getting clobbered. 
As we're going to wire that up to the cpu hotplug infrastructure it would be nice to perform the hotplug for kexec by reusing the generic hotplug infrastructure rather than calling portions of the arm64 implementation directly. > + > + /* Spin here if the cup_ops fail. */ > + > while (1) > cpu_relax(); This seems very dodgy to me. If a CPU doesn't actually die it's going to be spinning in some memory that we may later clobber. At that point the CPU will do arbitrarily bad things when it begins executing whatever its currently executing instructions (or vectors) were replaced by, and you will waste hours trying to figure out what went wrong (See 8121cf312a19 "ARM: 7766/1: versatile: don't mark pen as __INIT" for a similar mess). If we fail to hotplug a CPU we at minimum need some acknowledgement that we failed. I would rather we failed to kexec entirely in that case. Cheers, Mark. ^ permalink raw reply [flat|nested] 23+ messages in thread
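The two requests in the review above (confirm from another CPU that a secondary has really left the kernel, and fail the kexec if it has not) can be sketched in plain C. This is only a userspace model under stated assumptions: `struct cpu_model_ops`, `wait_cpu_off`, `shutdown_secondaries` and the `demo_*` callbacks are hypothetical names invented for illustration, and `is_off` merely stands in for whatever status query the platform offers (for example the PSCI affinity query mentioned above).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical callbacks for illustration. They mirror the roles of the
 * disable and kill steps discussed above but are not the kernel's
 * struct cpu_operations.
 */
struct cpu_model_ops {
	int  (*disable)(unsigned int cpu);	/* may refuse with an error */
	bool (*is_off)(unsigned int cpu);	/* polled from another CPU */
};

/*
 * Poll until 'cpu' is confirmed off or the retry budget runs out.
 * Returns 0 on success, -ETIMEDOUT if the CPU never reported off.
 */
static int wait_cpu_off(const struct cpu_model_ops *ops, unsigned int cpu,
			unsigned int max_polls)
{
	while (max_polls--) {
		if (ops->is_off(cpu))
			return 0;
	}
	return -ETIMEDOUT;
}

/*
 * Shut down every secondary (CPUs 1..nr_cpus-1). Any failure is returned
 * so the caller can refuse to kexec, instead of leaving a CPU spinning in
 * memory that is about to be overwritten.
 */
static int shutdown_secondaries(const struct cpu_model_ops *ops,
				unsigned int nr_cpus, unsigned int max_polls)
{
	unsigned int cpu;
	int ret;

	for (cpu = 1; cpu < nr_cpus; cpu++) {
		ret = ops->disable(cpu);
		if (ret)
			return ret;
		ret = wait_cpu_off(ops, cpu, max_polls);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callbacks: CPU 2 is stuck and never reports off. */
static int demo_disable(unsigned int cpu)
{
	(void)cpu;
	return 0;
}

static bool demo_is_off(unsigned int cpu)
{
	return cpu != 2;
}

static const struct cpu_model_ops demo_ops = {
	.disable = demo_disable,
	.is_off  = demo_is_off,
};
```

With the demo callbacks, a two-CPU system shuts down cleanly, while a four-CPU system reports an error at CPU 2, which is the point at which a kexec would be refused in this model.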
* [PATCH 1/8] arm64: Use cpu_ops for smp_stop 2014-05-09 8:44 ` Mark Rutland @ 2014-05-13 22:27 ` Geoff Levand 0 siblings, 0 replies; 23+ messages in thread From: Geoff Levand @ 2014-05-13 22:27 UTC (permalink / raw) To: linux-arm-kernel Hi Mark, On Fri, 2014-05-09 at 09:44 +0100, Mark Rutland wrote: > On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote: > > + /* If we have the cup_ops use them. */ > > + > > + if (cpu_ops[cpu]->cpu_disable && cpu_ops[cpu]->cpu_die > > + && !cpu_ops[cpu]->cpu_disable(cpu)) > > + cpu_ops[cpu]->cpu_die(cpu); > > For PSCI 0.2 support, we're going to need a cpu_kill callback which we > can't call from the dying CPU. Specifically, we'll need to poll > CPU_AFFINITY_INFO to ensure that secondaries have _actually_ left the > kernel and aren't going to be adversely affected by the kernel text > getting clobbered. > > As we're going to wire that up to the cpu hotplug infrastructure it > would be nice to perform the hotplug for kexec by reusing the generic > hotplug infrastructure rather than calling portions of the arm64 > implementation directly. OK, is there somewhere I can see that new code, and when do you expect it to be merged? > > + > > + /* Spin here if the cup_ops fail. */ > > + > > while (1) > > cpu_relax(); > > This seems very dodgy to me. If a CPU doesn't actually die it's going to > be spinning in some memory that we may later clobber. At that point the > CPU will do arbitrarily bad things when it begins executing whatever its > currently executing instructions (or vectors) were replaced by, and you > will waste hours trying to figure out what went wrong (See 8121cf312a19 > "ARM: 7766/1: versatile: don't mark pen as __INIT" for a similar mess). > > If we fail to hotplug a CPU we at minimum need some acknowledgement that > we failed. I would rather we failed to kexec entirely in that case. This loop is for the non-hotplug power-off shutdown. 
This whole smp_stop support needs to be reconsidered for a hotplug spin-table re-work. -Geoff ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 5/8] arm64: Split soft_restart into two stages
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
` (6 preceding siblings ...)
2014-05-09 0:48 ` [PATCH 1/8] arm64: Use cpu_ops for smp_stop Geoff Levand
@ 2014-05-09 0:48 ` Geoff Levand
2014-05-09 16:22 ` [PATCH 0/8] arm64 kexec kernel patches Mark Rutland
8 siblings, 0 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-09 0:48 UTC (permalink / raw)
To: linux-arm-kernel

Remove the call to setup_restart() from within soft_restart(), change
setup_restart() from local to global scope by removing its static
attribute, and add a function prototype for setup_restart() in
system_misc.h.

Splitting setup_restart() and soft_restart() into two different
operations allows the calling routine to call setup_restart(), then
perform some operation with the identity map enabled, and then call
soft_restart() to disable the MMU and perform the soft reset. It is
expected that the calling routine will call setup_restart(), then
shortly thereafter call soft_restart().

To properly manage the shutdown of spin-table secondary CPUs it is
necessary to transition those CPUs into a state compatible with
booting.txt. The existing implementation of soft_restart() did not
allow for the transition to an identity mapped spin loop.

Signed-off-by: Geoff Levand <geoff@infradead.org>
---
 arch/arm64/include/asm/system_misc.h | 1 +
 arch/arm64/kernel/process.c          | 4 +---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/system_misc.h b/arch/arm64/include/asm/system_misc.h
index 7a18fab..15bc824 100644
--- a/arch/arm64/include/asm/system_misc.h
+++ b/arch/arm64/include/asm/system_misc.h
@@ -41,6 +41,7 @@ struct mm_struct;
 extern void show_pte(struct mm_struct *mm, unsigned long addr);
 extern void __show_regs(struct pt_regs *);
 
+void setup_restart(void);
 void soft_restart(unsigned long);
 extern void (*arm_pm_restart)(enum reboot_mode reboot_mode, const char *cmd);
 
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6391485..d52c660 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -50,7 +50,7 @@
 #include <asm/processor.h>
 #include <asm/stacktrace.h>
 
-static void setup_restart(void)
+void setup_restart(void)
 {
 	/*
 	 * Tell the mm system that we are going to reboot -
@@ -74,8 +74,6 @@ void soft_restart(unsigned long addr)
 	typedef void (*phys_reset_t)(unsigned long);
 	phys_reset_t phys_reset;
 
-	setup_restart();
-
 	/* Switch to the identity mapping */
 	phys_reset = (phys_reset_t)virt_to_phys(cpu_reset);
 	phys_reset(addr);
-- 
1.9.1
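
The calling convention this patch introduces (the caller runs setup_restart() first, optionally does work while the identity map is in place, then calls soft_restart()) can be modelled in user space. The sketch below is an assumption-laden illustration, not the kernel code: the function bodies only record state in the hypothetical variables idmap_active, mmu_on and reset_target, the helper restart_in_two_stages() is invented for the example, and unlike the real soft_restart() the model returns to its caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model state (hypothetical, for illustration only). */
static bool idmap_active;           /* identity map has been set up */
static bool mmu_on = true;          /* MMU still enabled */
static unsigned long reset_target;  /* address handed to the reset code */

/* Stage 1: stands in for preparing the identity mapping. */
void setup_restart(void)
{
	idmap_active = true;
}

/*
 * Stage 2: after the split this no longer calls setup_restart() itself,
 * so the model asserts that the caller already did.
 */
void soft_restart(unsigned long addr)
{
	assert(idmap_active);
	mmu_on = false;
	reset_target = addr;
}

/* Hypothetical caller showing the expected ordering of the two stages. */
void restart_in_two_stages(unsigned long entry, void (*idmap_work)(void))
{
	setup_restart();
	if (idmap_work != NULL)
		idmap_work();	/* e.g. park secondaries in an idmap spin loop */
	soft_restart(entry);
}
```

The assert in the model's soft_restart() makes the new obligation on callers explicit: any existing caller that relied on soft_restart() doing the setup must now call setup_restart() itself.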
* [PATCH 0/8] arm64 kexec kernel patches
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
` (7 preceding siblings ...)
2014-05-09 0:48 ` [PATCH 5/8] arm64: Split soft_restart into two stages Geoff Levand
@ 2014-05-09 16:22 ` Mark Rutland
2014-05-13 22:26 ` Geoff Levand
8 siblings, 1 reply; 23+ messages in thread
From: Mark Rutland @ 2014-05-09 16:22 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, May 09, 2014 at 01:48:17AM +0100, Geoff Levand wrote:
> Hi Maintainers,
>
> This patchset adds support for kexec re-boots on arm64. I have tested with the
> VE fast model using various kernel config options with both spin and psci enable
> methods. I'll continue to test in the coming weeks.
>
> I tried to re-use the existing hot plug cpu_ops support for the secondary CPU
> shutdown as much as possible, but needed to do some things specific to kexec
> that I couldn't do with what was already there. A significant change is in
> [PATCH 4/8] (arm64: Add smp_spin_table_set_die) where I add the ability to setup
> a custom cpu_die handler.
>
> To get the the spin-table secondary CPUs into the proper state described in
> Documentation/arm64/booting.txt I use a three step spin loop. First in the
> kernel's virtual address space, then to the identity mapped address, then jump
> to the final spin code in the 2nd stage kernel's /memreserve/ area. To support
> this three step spin I needed [PATCH 5/8] (arm64: Split soft_restart into two
> stages). Please see the patch comments for more info. If we added the 2nd
> stage kernel's /memreserve/ area to the identity map we could eliminate the
> middle step and go from the VA space to the /memreserve/ area directly.

As I've covered in my reply to patch 7 [1] I don't think this is a good
approach. I think a vastly better approach is to make kexec depend on
cpu hotplug support in SMP, and enable a simple hotplug-capable boot
protocol (e.g. extend spin-table with a cpu-return-addr).

That way the in-kernel portions of kexec can use the existing
infrastructure without tonnes of point hacks, and we enable a generic
hotplug capable mechanism for those systems which cannot implement PSCI.

>
> Please consider all patches for inclusion. Any comments or suggestions on how
> to improve would be very welcome.
>
> To load a kexec kernel and execute a kexec re-boot on arm64 my patches to
> kexec-tools, which have not yet been merged upstream, are needed:
>
> https://git.linaro.org/people/geoff.levand/kexec-tools.git

Is the master branch up-to-date? The commit dates on all branches I can
see imply they haven't been updated in a while, and the code looks like
it needs some cleanup (there are some unused functions, hard-coded
values, etc).

Cheers,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-May/254819.html
* [PATCH 0/8] arm64 kexec kernel patches
2014-05-09 16:22 ` [PATCH 0/8] arm64 kexec kernel patches Mark Rutland
@ 2014-05-13 22:26 ` Geoff Levand
0 siblings, 0 replies; 23+ messages in thread
From: Geoff Levand @ 2014-05-13 22:26 UTC (permalink / raw)
To: linux-arm-kernel

Hi Mark,

Thanks for taking the time to review the patches in such detail.

On Fri, 2014-05-09 at 17:22 +0100, Mark Rutland wrote:
> As I've covered in my reply to patch 7 [1] I don't think this is a good
> approach. I think a vastly better approach is to make kexec depend on
> cpu hotplug support in SMP, and enable a simple hotplug-capable boot
> protocol (e.g. extend spin-table with a cpu-return-addr).
>
> That way the in-kernel portions of kexec can use the existing
> infrastructure without tonnes of point hacks, and we enable a generic
> hotplug capable mechanism for those systems which cannot implement PSCI.

I think this is a sound approach. As I was working on the kexec code I
felt the same, that the core SMP CPU management should be doing more and
kexec should just use that existing support. I'll look into splitting
off what spin-table handling I have in kexec into a patch to update the
hotplug support.

> > https://git.linaro.org/people/geoff.levand/kexec-tools.git
>
> Is the master branch up-to-date? The commit dates on all branches I can
> see imply they haven't been updated in a while, and the code looks like
> it needs some cleanup (there are some unused functions, hard-coded
> values, etc).

I'm working on the cleanup of kexec-tools now. I pushed out a version
that should boot vanilla 2nd stage kernels and the branches in my repo.

-Geoff
end of thread, other threads:[~2014-07-11 9:47 UTC | newest]

Thread overview: 23+ messages
2014-05-09 0:48 [PATCH 0/8] arm64 kexec kernel patches Geoff Levand
2014-05-09 0:48 ` [PATCH 8/8] arm64: Enable kexec in defconfig Geoff Levand
2014-05-09 0:48 ` [PATCH 2/8] arm64: Make cpu_read_ops generic Geoff Levand
2014-05-09 0:48 ` [PATCH 6/8] arm64/kexec: kexec needs cpu_die Geoff Levand
2014-05-09 8:24 ` Mark Rutland
2014-05-13 22:27 ` Geoff Levand
2014-05-09 0:48 ` [PATCH 7/8] arm64/kexec: Add core kexec support Geoff Levand
2014-05-09 15:36 ` Mark Rutland
2014-05-13 22:27 ` Geoff Levand
2014-05-16 10:26 ` Mark Rutland
2014-05-14 10:54 ` Catalin Marinas
2014-05-14 23:20 ` Geoff Levand
2014-07-07 7:33 ` Dave Young
2014-07-11 9:47 ` Dave Young
2014-05-09 0:48 ` [PATCH 3/8] arm64: Add spin-table cpu_die Geoff Levand
2014-05-09 8:54 ` Mark Rutland
2014-05-09 0:48 ` [PATCH 4/8] arm64: Add smp_spin_table_set_die Geoff Levand
2014-05-09 0:48 ` [PATCH 1/8] arm64: Use cpu_ops for smp_stop Geoff Levand
2014-05-09 8:44 ` Mark Rutland
2014-05-13 22:27 ` Geoff Levand
2014-05-09 0:48 ` [PATCH 5/8] arm64: Split soft_restart into two stages Geoff Levand
2014-05-09 16:22 ` [PATCH 0/8] arm64 kexec kernel patches Mark Rutland
2014-05-13 22:26 ` Geoff Levand