* [RFC 0/5] parker: PARtitioned KERnel
@ 2025-09-23 15:31 Fam Zheng
2025-09-23 15:31 ` [RFC 1/5] x86/boot/e820: Fix memmap to parse with 1 argument Fam Zheng
` (6 more replies)
0 siblings, 7 replies; 18+ messages in thread
From: Fam Zheng @ 2025-09-23 15:31 UTC (permalink / raw)
To: linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, fam.zheng, Zhang Rui, fam, H. Peter Anvin, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
From: Thom Hughes <thom.hughes@bytedance.com>
Hi all,
Parker is a proposed Linux feature that allows multiple Linux kernels to run
simultaneously on a single machine, without traditional KVM virtualisation. This
is achieved by partitioning the CPU cores, memory and devices between
partitioning-aware Linux kernels.
=== Side note begin ===
Coincidentally it has some similarities with [1], but the design and
implementation are entirely separate.
While there are still many open questions and pending work in this direction, we
hope to share the idea and collect feedback from you!
=== Side note end ===
Each kernel instance can use the same image, but the initial kernel, or Boot
Kernel, controls hardware allocation and partitioning. All other kernels are
secondary kernels, or Application Kernels, which only touch their own assigned
CPU/memory/IO devices.
The primary use case in mind for parker is machines with high core counts,
where scalability concerns may arise. Once started, there is no communication
between kernel instances. In other words, they share nothing, which improves
scalability. Each kernel needs its own (PCIe) devices for IO, such as NVMe
drives or NICs.
Another possible use case is for different kernel instances to have different
performance tunings, CONFIG_ options, or FDO/PGO builds according to the
workload.
On the implementation side, parker exposes a kernfs directory interface, and
uses kexec to hot-load secondary kernel images into reserved memory regions.
Before creating partitions, the Boot Kernel will offline CPUs, reserve physical
memory (using CMA), unbind PCI devices, etc., allocating those resources to the
Application Kernel so that it can use them safely.
In terms of fault isolation or security, all kernel instances share the same
domain, as there is no supervising mechanism. A kernel bug in any partition can
cause problems for the whole physical machine. This is a trade-off for low
overhead and low complexity, but we hope that in the future we can take
advantage of hardware mechanisms to introduce some isolation.
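As a concrete sketch, a session on the Boot Kernel might look like the
following. The file names follow the kernfs attributes introduced in patch 3
(cpus, memory, bind, online); the mount point, directory name, CPU list, size
and PCI address are assumptions of this illustration, not part of the series:

```shell
# Hypothetical Boot Kernel session creating and starting one partition.
mount -t parker none /sys/fs/parker               # mount point is an assumption
mkdir /sys/fs/parker/kernel0                      # one directory per partition
echo 8-15 > /sys/fs/parker/kernel0/cpus           # CPUs are offlined and reserved
echo 4G > /sys/fs/parker/kernel0/memory           # CMA-backed contiguous memory
echo 0000:3b:00.0 > /sys/fs/parker/kernel0/bind   # hand over a PCI device
kexec -l ...                                      # load the Application Kernel image
echo 1 > /sys/fs/parker/kernel0/online            # wake its CPUs and boot it
```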
Signed-off-by: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
[1] https://lore.kernel.org/lkml/20250918222607.186488-1-xiyou.wangcong@gmail.com/
Thom Hughes (5):
x86/boot/e820: Fix memmap to parse with 1 argument
x86/smpboot: Export wakeup_secondary_cpu_via_init
x86/parker: Introduce parker kernfs interface
x86/parker: Add parker initialisation code
x86/apic: Make Parker instance use physical APIC
arch/x86/Kbuild | 3 +
arch/x86/Kconfig | 2 +
arch/x86/include/asm/smp.h | 1 +
arch/x86/kernel/apic/apic_flat_64.c | 3 +-
arch/x86/kernel/e820.c | 2 +-
arch/x86/kernel/setup.c | 4 +
arch/x86/kernel/smpboot.c | 2 +-
arch/x86/parker/Kconfig | 4 +
arch/x86/parker/Makefile | 3 +
arch/x86/parker/Makefile-full | 3 +
arch/x86/parker/internal.h | 54 ++
arch/x86/parker/kernfs.c | 1266 +++++++++++++++++++++++++++
arch/x86/parker/setup.c | 423 +++++++++
arch/x86/parker/trampoline.S | 55 ++
arch/x86/parker/trampoline.h | 10 +
drivers/thermal/intel/therm_throt.c | 3 +
include/linux/parker-bkup.h | 22 +
include/linux/parker.h | 22 +
include/uapi/linux/magic.h | 1 +
19 files changed, 1880 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/parker/Kconfig
create mode 100644 arch/x86/parker/Makefile
create mode 100644 arch/x86/parker/Makefile-full
create mode 100644 arch/x86/parker/internal.h
create mode 100644 arch/x86/parker/kernfs.c
create mode 100644 arch/x86/parker/setup.c
create mode 100644 arch/x86/parker/trampoline.S
create mode 100644 arch/x86/parker/trampoline.h
create mode 100644 include/linux/parker-bkup.h
create mode 100644 include/linux/parker.h
--
2.39.5
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC 1/5] x86/boot/e820: Fix memmap to parse with 1 argument
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
@ 2025-09-23 15:31 ` Fam Zheng
2025-09-23 15:31 ` [RFC 2/5] x86/smpboot: Export wakeup_secondary_cpu_via_init Fam Zheng
` (5 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Fam Zheng @ 2025-09-23 15:31 UTC (permalink / raw)
To: linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, fam.zheng, Zhang Rui, fam, H. Peter Anvin, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
From: Thom Hughes <thom.hughes@bytedance.com>
This is needed because, in the simplest case, a parker Application Kernel
gets only one user e820 entry from memmap=.
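For illustration, such an Application Kernel might boot with a single
user-defined region on its command line (the size and address below are
made-up example values, not taken from the patch):

```shell
# Hypothetical Application Kernel command line fragment: exactly one
# user-supplied e820 entry describing the partition's reserved memory.
memmap=4G@0x100000000
```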
Signed-off-by: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
---
arch/x86/kernel/e820.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 84264205dae5..05dfb192d4b9 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -330,7 +330,7 @@ int __init e820__update_table(struct e820_table *table)
/* If there's only one memory region, don't bother: */
if (table->nr_entries < 2)
- return -1;
+ return 0;
BUG_ON(table->nr_entries > max_nr_entries);
--
2.39.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC 2/5] x86/smpboot: Export wakeup_secondary_cpu_via_init
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
2025-09-23 15:31 ` [RFC 1/5] x86/boot/e820: Fix memmap to parse with 1 argument Fam Zheng
@ 2025-09-23 15:31 ` Fam Zheng
2025-09-23 15:31 ` [RFC 3/5] x86/parker: Introduce parker kernfs interface Fam Zheng
` (4 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Fam Zheng @ 2025-09-23 15:31 UTC (permalink / raw)
To: linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, fam.zheng, Zhang Rui, fam, H. Peter Anvin, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
From: Thom Hughes <thom.hughes@bytedance.com>
This will be used by the parker setup code.
Signed-off-by: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
---
arch/x86/include/asm/smp.h | 1 +
arch/x86/kernel/smpboot.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index ca073f40698f..cfc212bbb4a6 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -104,6 +104,7 @@ void native_smp_prepare_boot_cpu(void);
void smp_prepare_cpus_common(void);
void native_smp_prepare_cpus(unsigned int max_cpus);
void native_smp_cpus_done(unsigned int max_cpus);
+int wakeup_secondary_cpu_via_init(u32 phys_apicid, unsigned long start_eip);
int common_cpu_up(unsigned int cpunum, struct task_struct *tidle);
int native_kick_ap(unsigned int cpu, struct task_struct *tidle);
int native_cpu_disable(void);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c10850ae6f09..c9a941178488 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -715,7 +715,7 @@ static void send_init_sequence(u32 phys_apicid)
/*
* Wake up AP by INIT, INIT, STARTUP sequence.
*/
-static int wakeup_secondary_cpu_via_init(u32 phys_apicid, unsigned long start_eip)
+int wakeup_secondary_cpu_via_init(u32 phys_apicid, unsigned long start_eip)
{
unsigned long send_status = 0, accept_status = 0;
int num_starts, j, maxlvt;
--
2.39.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC 3/5] x86/parker: Introduce parker kernfs interface
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
2025-09-23 15:31 ` [RFC 1/5] x86/boot/e820: Fix memmap to parse with 1 argument Fam Zheng
2025-09-23 15:31 ` [RFC 2/5] x86/smpboot: Export wakeup_secondary_cpu_via_init Fam Zheng
@ 2025-09-23 15:31 ` Fam Zheng
2025-09-23 15:31 ` [RFC 4/5] x86/parker: Add parker initialisation code Fam Zheng
` (3 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Fam Zheng @ 2025-09-23 15:31 UTC (permalink / raw)
To: linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, fam.zheng, Zhang Rui, fam, H. Peter Anvin, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
From: Thom Hughes <thom.hughes@bytedance.com>
These are the control knobs exposed in the Boot Kernel in order to start
the secondary kernels.
Signed-off-by: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
---
arch/x86/Kbuild | 3 +
arch/x86/Kconfig | 2 +
arch/x86/parker/Kconfig | 4 +
arch/x86/parker/Makefile | 2 +
arch/x86/parker/internal.h | 54 ++
arch/x86/parker/kernfs.c | 1266 ++++++++++++++++++++++++++++++++++++
include/linux/parker.h | 7 +
include/uapi/linux/magic.h | 1 +
8 files changed, 1339 insertions(+)
create mode 100644 arch/x86/parker/Kconfig
create mode 100644 arch/x86/parker/Makefile
create mode 100644 arch/x86/parker/internal.h
create mode 100644 arch/x86/parker/kernfs.c
create mode 100644 include/linux/parker.h
diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
index f7fb3d88c57b..e50fec2e8e5a 100644
--- a/arch/x86/Kbuild
+++ b/arch/x86/Kbuild
@@ -16,6 +16,9 @@ obj-$(CONFIG_XEN) += xen/
obj-$(CONFIG_PVH) += platform/pvh/
+# Multi-kernel support
+obj-$(CONFIG_PARKER) += parker/
+
# Hyper-V paravirtualization support
obj-$(subst m,y,$(CONFIG_HYPERV)) += hyperv/
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f86e7072a5ba..490ea18cf783 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -3218,3 +3218,5 @@ config HAVE_ATOMIC_IOMAP
source "arch/x86/kvm/Kconfig"
source "arch/x86/Kconfig.assembler"
+
+source "arch/x86/parker/Kconfig"
diff --git a/arch/x86/parker/Kconfig b/arch/x86/parker/Kconfig
new file mode 100644
index 000000000000..716a2537f12c
--- /dev/null
+++ b/arch/x86/parker/Kconfig
@@ -0,0 +1,4 @@
+config PARKER
+ bool "Enable multi-kernel host support"
+ depends on X86_64 && SMP
+ select CMA
diff --git a/arch/x86/parker/Makefile b/arch/x86/parker/Makefile
new file mode 100644
index 000000000000..41c40fc64267
--- /dev/null
+++ b/arch/x86/parker/Makefile
@@ -0,0 +1,2 @@
+obj-y += kernfs.o
+$(obj)/kernfs.o: $(obj)/internal.h
diff --git a/arch/x86/parker/internal.h b/arch/x86/parker/internal.h
new file mode 100644
index 000000000000..a6150f1beb77
--- /dev/null
+++ b/arch/x86/parker/internal.h
@@ -0,0 +1,54 @@
+#ifndef _PARKER_INTERNAL_H
+#define _PARKER_INTERNAL_H
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/list.h>
+#include <linux/kernfs.h>
+
+/* Current limits on supported PCI devices and CPUs per partition */
+#define PARKER_MAX_PCI_DEVICES 256
+#define PARKER_MAX_CPUS 512
+
+/* For now, limited to one page,
+ * but could have chained pages for PCI devs + APIC ids */
+struct parker_control_structure {
+ phys_addr_t start_address;
+ bool online;
+ unsigned int parker_id;
+ u32 pci_dev_ids[PARKER_MAX_PCI_DEVICES];
+ unsigned int num_pci_devs;
+ u32 apic_ids[PARKER_MAX_CPUS];
+ unsigned int num_cpus;
+};
+
+struct parker_kernel_device_entry {
+ struct list_head list_entry;
+ struct kernfs_node *kn;
+ struct device *dev;
+};
+
+struct parker_kernel_entry {
+ struct kernfs_node *kn;
+ struct mutex mutex;
+ unsigned int id;
+ bool online;
+ struct cpumask cpu_mask;
+ /* Contiguous pages from CMA for parker physical memory */
+ struct page *physical_memory_pages;
+ unsigned long physical_memory_page_count;
+ /* Control structure PAGE for now */
+ struct page *control_structure_pages;
+ /* Currently always 1 but future proofing */
+ unsigned long control_structure_page_count;
+ struct kernfs_node *kn_devices;
+ /* List of each kernfs node, get kobj from kernfs_node */
+ struct list_head list_devices;
+};
+
+/* Ensure we don't exceed one page; if we do, we need to rethink the control
+ * structure and chain pages together. */
+static_assert(sizeof(struct parker_control_structure) < PAGE_SIZE,
+ "struct (parker_control_structure) too large!");
+
+#endif
diff --git a/arch/x86/parker/kernfs.c b/arch/x86/parker/kernfs.c
new file mode 100644
index 000000000000..68f4b7f779b5
--- /dev/null
+++ b/arch/x86/parker/kernfs.c
@@ -0,0 +1,1266 @@
+#define pr_fmt(fmt) "parker: " fmt
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/syscalls.h>
+#include <linux/kernel.h>
+#include <linux/cpumask.h>
+#include <linux/kexec.h>
+#include <linux/mm.h>
+#include <linux/cma.h>
+#include <linux/parker.h>
+#include <linux/magic.h>
+#include <linux/math.h>
+#include <linux/interrupt.h>
+#include <linux/irqreturn.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+#include <linux/delay.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/reboot.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/fs_parser.h>
+#include <linux/sysfs.h>
+#include <linux/kernfs.h>
+#include <linux/seq_buf.h>
+#include <linux/seq_file.h>
+#include <linux/nospec.h>
+
+#include <asm/mtrr.h>
+#include <asm/realmode.h>
+#include <asm/microcode.h>
+#include <asm/apic.h>
+#include <asm/espfix.h>
+#include <asm/irqdomain.h>
+#include <asm/init.h>
+#include <asm/hw_irq.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/set_memory.h>
+
+#include "internal.h"
+
+static struct cma *parker_cma[MAX_NUMNODES];
+static unsigned long long parker_cma_size;
+static unsigned long long parker_cma_size_in_node[MAX_NUMNODES];
+static struct page *parker_active_control_structure_page;
+
+static struct transition_pagetable_data {
+ struct x86_mapping_info info;
+ pgd_t *pgd;
+ void *stack;
+} transition_pagetable;
+
+static int __init parker_early_cma(char *p)
+{
+ int nid, count = 0;
+ unsigned long long tmp;
+ char *s = p;
+
+ while (*s) {
+ if (sscanf(s, "%llu%n", &tmp, &count) != 1)
+ break;
+
+ if (s[count] == ':') {
+ if (tmp >= MAX_NUMNODES)
+ break;
+ nid = array_index_nospec(tmp, MAX_NUMNODES);
+
+ s += count + 1;
+ tmp = memparse(s, &s);
+ parker_cma_size_in_node[nid] = tmp;
+ parker_cma_size += tmp;
+
+ /*
+ * Skip the separator if have one, otherwise
+ * break the parsing.
+ */
+ if (*s == ',')
+ s++;
+ else
+ break;
+ } else {
+ parker_cma_size = memparse(p, &p);
+ break;
+ }
+ }
+
+ return 0;
+}
+early_param("parker_cma", parker_early_cma);
+
+#define ORDER_1G 30
+void __init parker_cma_reserve(void)
+{
+ bool node_specific_cma_alloc = false;
+ unsigned long long size, reserved, per_node;
+ int nid;
+
+ if (!parker_cma_size)
+ return;
+
+ for (nid = 0; nid < MAX_NUMNODES; nid++) {
+ if (parker_cma_size_in_node[nid] == 0)
+ continue;
+
+ if (!node_online(nid)) {
+ pr_warn("invalid node %d specified for CMA allocation\n", nid);
+ parker_cma_size -= parker_cma_size_in_node[nid];
+ parker_cma_size_in_node[nid] = 0;
+ continue;
+ }
+
+ if (parker_cma_size_in_node[nid] < SZ_1G) {
+ pr_warn("cma area of node %d should be at least 1GiB\n", nid);
+ parker_cma_size -= parker_cma_size_in_node[nid];
+ parker_cma_size_in_node[nid] = 0;
+ } else {
+ node_specific_cma_alloc = true;
+ }
+ }
+ /* Validate the CMA size again in case some invalid nodes specified. */
+ if (!parker_cma_size)
+ return;
+
+ if (parker_cma_size < SZ_1G) {
+ pr_warn("cma area should be at least 1 GiB\n");
+ parker_cma_size = 0;
+ return;
+ }
+
+ if (!node_specific_cma_alloc) {
+ /*
+ * If 3 GB area is requested on a machine with 4 numa nodes,
+ * let's allocate 1 GB on first three nodes and ignore the last one.
+ */
+ per_node = DIV_ROUND_UP(parker_cma_size, nr_online_nodes);
+ pr_info("reserve CMA %llu MiB, up to %llu MiB per node\n",
+ parker_cma_size / SZ_1M, per_node / SZ_1M);
+ }
+
+ reserved = 0;
+ for_each_online_node(nid) {
+ int res;
+ char name[CMA_MAX_NAME];
+
+ if (node_specific_cma_alloc) {
+ if (parker_cma_size_in_node[nid] == 0)
+ continue;
+
+ size = parker_cma_size_in_node[nid];
+ } else {
+ size = min(per_node, parker_cma_size - reserved);
+ }
+
+ size = round_up(size, SZ_1G);
+
+ snprintf(name, sizeof(name), "parker%d", nid);
+ /*
+ * Note that 'order per bit' is based on smallest size that
+ * may be returned to CMA allocator in the case of
+ * huge page demotion.
+ */
+ res = cma_declare_contiguous_nid(0, size, 0,
+ SZ_1G,
+ ORDER_1G - PAGE_SHIFT, false, name,
+ &parker_cma[nid], nid);
+ if (res) {
+ pr_warn("reservation failed - err %d, node %d",
+ res, nid);
+ continue;
+ }
+
+ reserved += size;
+ pr_info("reserved %llu MiB on node %d\n",
+ size / SZ_1M, nid);
+
+ if (reserved >= parker_cma_size)
+ break;
+ }
+
+ if (!reserved)
+ /*
+ * parker_cma_size is used to determine if allocations from
+ * cma are possible. Set to zero if no cma regions are set up.
+ */
+ parker_cma_size = 0;
+}
+
+/* Make sure we don't overwrite initial_code too early */
+struct semaphore cpu_kick_semaphore;
+
+__attribute__((noreturn)) static void parker_bsp_start(void)
+{
+ /* Let parker_start_kernel know we're here */
+ up(&cpu_kick_semaphore);
+
+ if (kexec_image) {
+ machine_kexec(kexec_image);
+ }
+ /* Should not get here; spin forever just in case */
+ for (;;) {
+ continue;
+ }
+}
+
+__attribute__((noreturn)) static void parker_ap_wait(void)
+{
+ /* Let parker_start_kernel know we're here */
+ up(&cpu_kick_semaphore);
+
+ unsigned int cpu = smp_processor_id();
+ unsigned int apic_id = apic->cpu_present_to_apicid(cpu);
+
+ volatile struct parker_control_structure *pcs;
+ /* For now, use global active control page.
+ * Eventually we can add lookup from CPU -> control page */
+ pcs = page_address(parker_active_control_structure_page);
+ int idx = 0;
+ while (!READ_ONCE(pcs->start_address)) {
+ idx++;
+ continue;
+ }
+ pr_debug("parker trampoline physical address %llx\n", pcs->start_address);
+ smp_mb();
+ u64 call_addr = 0;
+ /* There's no race condition on stack as we don't read the stack pointer again */
+ asm volatile (
+ "mov (%1), %0\n\t"
+ "mov %3, %%rsp\n\t"
+ "mov %4, %%esi\n\t"
+ "mov %2, %%cr3\n\t"
+ ANNOTATE_RETPOLINE_SAFE
+ "call *%0\n\t"
+ : "+r" (call_addr)
+ : "r" (&pcs->start_address),
+ "r" (__sme_pa(transition_pagetable.pgd)),
+ "r" (__sme_pa(transition_pagetable.stack + PAGE_SIZE)),
+ "r" (apic_id)
+ : "esi", "rsp"
+ );
+
+ for (;;) {
+ continue;
+ }
+}
+
+static void parker_host_ipicb(void)
+{
+ pr_info("platform IPI received\n");
+}
+
+static void __init *alloc_pgt_page(void *dummy)
+{
+ return (void*)get_zeroed_page(GFP_ATOMIC);
+}
+
+static int __init init_transition_pgtable(pgd_t *pgd)
+{
+ pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
+ unsigned long vaddr, paddr;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ vaddr = (unsigned long)parker_ap_wait;
+ pgd += pgd_index(vaddr);
+ if (!pgd_present(*pgd)) {
+ p4d = (p4d_t *)alloc_pgt_page(NULL);
+ if (!p4d)
+ return -ENOMEM;
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+ p4d = p4d_offset(pgd, vaddr);
+ if (!p4d_present(*p4d)) {
+ pud = (pud_t *)alloc_pgt_page(NULL);
+ if (!pud)
+ return -ENOMEM;
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+ }
+ pud = pud_offset(p4d, vaddr);
+ if (!pud_present(*pud)) {
+ pmd = (pmd_t *)alloc_pgt_page(NULL);
+ if (!pmd)
+ return -ENOMEM;
+ set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+ }
+ pmd = pmd_offset(pud, vaddr);
+ if (!pmd_present(*pmd)) {
+ pte = (pte_t *)alloc_pgt_page(NULL);
+ if (!pte)
+ return -ENOMEM;
+ set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
+ }
+ pte = pte_offset_kernel(pmd, vaddr);
+
+ paddr = __pa(vaddr);
+ set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
+
+ return 0;
+}
+
+/* Allocate intermediate trampoline pagetable, that has all physical memory
+ * mapped allowing us to reuse this for all parker kernel instantiations. */
+static int __init parker_host_transition_pagetable_init(void)
+{
+ struct x86_mapping_info info = {
+ .alloc_pgt_page = alloc_pgt_page,
+// .free_pgt_page = free_pgt_page,
+ .page_flag = __PAGE_KERNEL_LARGE_EXEC,
+ .kernpg_flag = _KERNPG_TABLE_NOENC,
+ };
+
+ pgd_t *pgd = alloc_pgt_page(NULL);
+ void *stack = alloc_pgt_page(NULL);
+
+ if (!pgd || !stack)
+ return -ENOMEM;
+
+ for (int i = 0; i < nr_pfn_mapped; i++) {
+ unsigned long mstart, mend;
+
+ mstart = pfn_mapped[i].start << PAGE_SHIFT;
+ mend = pfn_mapped[i].end << PAGE_SHIFT;
+ if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) {
+ //kernel_ident_mapping_free(&info, pgd);
+ return -ENOMEM;
+ }
+ }
+
+ transition_pagetable.info = info;
+ transition_pagetable.pgd = pgd;
+ transition_pagetable.stack = stack;
+
+ return init_transition_pgtable(pgd);
+}
+static int __init parker_kernfs_init(void);
+
+/* Multi-kernel module code for Primary <-> secondary communication */
+static int __init parker_module_init(void)
+{
+ if (is_parker_instance())
+ return -ENODEV;
+ parker_kernfs_init();
+ sema_init(&cpu_kick_semaphore, 0);
+ // TODO: Device registration for sysfs interface
+ // copying resctrl interface style with folder creation
+ // and deletion to create kernels.
+ pr_info("Multikernel module loading...\n");
+ // TODO: Custom
+ if (x86_platform_ipi_callback) {
+ pr_err("Platform callback exists\n");
+ return -ENODEV;
+ }
+ x86_platform_ipi_callback = parker_host_ipicb;
+ if (parker_host_transition_pagetable_init()) {
+ pr_err("transition pagetable init failed\n");
+ return -ENODEV;
+ }
+
+ return 0;
+}
+
+static void __exit parker_module_exit(void)
+{
+ pr_info("Multikernel exiting.\n");
+ //__free_pages(parker_control_page, 0);
+}
+
+/* Ensure global parker lock is held */
+static int parker_start_kernel(struct parker_kernel_entry *pke)
+{
+ struct parker_control_structure *pcs;
+ struct list_head *dev_elem;
+ int ret;
+
+ WRITE_ONCE(parker_active_control_structure_page, pke->control_structure_pages);
+ pcs = page_address(parker_active_control_structure_page);
+
+ if (!pcs)
+ return -EINVAL;
+
+ /* Add PCI device IDs to control structure */
+ list_for_each(dev_elem, &pke->list_devices) {
+ struct parker_kernel_device_entry *pkde;
+ struct pci_dev *pdev;
+ int pci_dev_index = pcs->num_pci_devs++;
+ pkde = container_of(dev_elem,
+ struct parker_kernel_device_entry,
+ list_entry);
+ pdev = to_pci_dev(pkde->dev);
+ pcs->pci_dev_ids[pci_dev_index] = pci_dev_id(pdev);
+ }
+
+ int bsp_cpu, cpu, i = 0;
+ /* Partitioned kernel's AP will wait on BSP to jump to kernel's startup code */
+ for_each_cpu(cpu, &pke->cpu_mask) {
+ u32 apicid = apic->cpu_present_to_apicid(cpu);
+ pcs->apic_ids[i] = apicid;
+ ++pcs->num_cpus;
+ int old = i++;
+ if (old == 0) {
+ bsp_cpu = cpu;
+ continue;
+ }
+
+ smpboot_control = cpu;
+ initial_code = (unsigned long)parker_ap_wait;
+ init_espfix_ap(cpu);
+ smp_mb();
+
+ pr_debug("parker AP apicid %d\n", apicid);
+ unsigned long start_ip = real_mode_header->trampoline_start;
+ ret = wakeup_secondary_cpu_via_init(apicid, start_ip);
+ /* Continue on with errors for now */
+ if (ret) {
+ pr_err("Failed to start cpu %d\n", cpu);
+ --i;
+ --pcs->num_cpus;
+ continue;
+ }
+ /* Wait for CPU to wakeup and start executing AP wait function */
+ down(&cpu_kick_semaphore);
+ }
+
+ /* Start the partitioned kernel's BSP */
+ //mtrr_save_state();
+ u32 apicid = apic->cpu_present_to_apicid(bsp_cpu);
+ smpboot_control = bsp_cpu;
+ initial_code = (unsigned long)parker_bsp_start;
+ init_espfix_ap(bsp_cpu);
+ smp_mb();
+ unsigned long start_ip = real_mode_header->trampoline_start;
+ ret = wakeup_secondary_cpu_via_init(apicid, start_ip);
+ if (ret)
+ return ret;
+ down(&cpu_kick_semaphore);
+
+ /* Wait for partitioned kernel to start */
+ while (!READ_ONCE(pcs->online))
+ cpu_relax();
+
+ return 0;
+}
+
+static bool parker_kernel_is_online(struct parker_kernel_entry *pke)
+{
+ struct parker_control_structure *pcs;
+ pcs = page_address(pke->control_structure_pages);
+ return READ_ONCE(pcs->online);
+}
+
+/*
+ *
+ * Proper implementation:
+ * /sys/fs/parker new kernelfs
+ *
+ */
+/* The filesystem can only be mounted once. */
+/* TODO: Deal with recovery of structures if unmounted */
+// Forward declarations
+static int parker_get_tree(struct fs_context *fc);
+static int parker_init_fs_context(struct fs_context *fc);
+static void parker_fs_context_free(struct fs_context *fc);
+static void parker_kill_sb(struct super_block *sb);
+static int parker_kn_set_ugid(struct kernfs_node *kn);
+
+/* Mutex to protect parker access. */
+DEFINE_MUTEX(parker_mutex);
+atomic_t parker_kernels = ATOMIC_INIT(0);
+static bool parker_mounted;
+/* All CPUs belonging to secondary kernels */
+static struct cpumask parker_cpus;
+
+struct parker {
+ struct kernfs_node *kn;
+ /* TODO: control structures etc... */
+};
+
+struct parker_file_type {
+ char *name;
+ umode_t mode;
+ const struct kernfs_ops *kf_ops;
+
+ int (*seq_show)(struct kernfs_open_file *of, struct seq_file *sf, void *v);
+
+ ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off);
+};
+
+static int parker_add_files(struct kernfs_node *kn, struct parker_file_type *pfts, int len);
+
+static int parker_seqfile_show(struct seq_file *m, void *arg)
+{
+ struct kernfs_open_file *of = m->private;
+ struct parker_file_type *pft = of->kn->priv;
+
+ if (pft->seq_show)
+ return pft->seq_show(of, m, arg);
+
+ return 0;
+}
+
+static ssize_t parker_file_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct parker_file_type *pft = of->kn->priv;
+
+ if (pft->write)
+ return pft->write(of, buf, nbytes, off);
+
+ return -EINVAL;
+}
+
+static const struct kernfs_ops parker_kf_ops = {
+ .atomic_write_len = PAGE_SIZE,
+ .write = parker_file_write,
+ .seq_show = parker_seqfile_show,
+};
+
+/* List of attributes in root - currently none */
+static struct parker_file_type root_attributes[] = {};
+
+static int parker_kernel_index_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ mutex_lock(&pke->mutex);
+ seq_printf(seq, "%u\n", pke->id);
+ mutex_unlock(&pke->mutex);
+ return 0;
+}
+
+static int parker_kernel_control_structure_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ mutex_lock(&parker_mutex);
+ seq_printf(seq, "0x%llx\n", page_to_phys(pke->control_structure_pages));
+ mutex_unlock(&parker_mutex);
+ return 0;
+}
+
+
+static int parker_kernel_online_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ bool online;
+
+ mutex_lock(&pke->mutex);
+ online = parker_kernel_is_online(pke);
+ seq_printf(seq, "%u\n", online);
+ mutex_unlock(&pke->mutex);
+ return 0;
+}
+
+static ssize_t parker_kernel_online_write(struct kernfs_open_file *of,
+ char *buf,
+ size_t nbytes, loff_t off)
+{
+ int ret;
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+
+ mutex_lock(&parker_mutex);
+ mutex_lock(&pke->mutex);
+
+ ret = parker_start_kernel(pke);
+ /* Only set online if the second kernel successfully started */
+ if (!ret)
+ pke->online = true;
+
+
+ mutex_unlock(&pke->mutex);
+ mutex_unlock(&parker_mutex);
+
+ return ret ?: nbytes;
+}
+
+static int parker_kernel_cpus_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ mutex_lock(&pke->mutex);
+ seq_printf(seq, "%*pbl\n", cpumask_pr_args(&pke->cpu_mask));
+ mutex_unlock(&pke->mutex);
+ return 0;
+}
+
+static ssize_t parker_kernel_cpus_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ cpumask_var_t tmpmask, newmask;
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ int cpu, ret;
+
+ if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+ return -ENOMEM;
+
+ if (!zalloc_cpumask_var(&newmask, GFP_KERNEL)) {
+ free_cpumask_var(tmpmask);
+ return -ENOMEM;
+ }
+
+ mutex_lock(&parker_mutex);
+ mutex_lock(&pke->mutex);
+ ret = cpulist_parse(buf, newmask);
+
+ /* Check if any CPUs belong to another parker kernel */
+ cpumask_and(tmpmask, newmask, &parker_cpus);
+ if (!cpumask_empty(tmpmask)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* If CPUs are currently online, offline them */
+ cpumask_and(tmpmask, newmask, cpu_online_mask);
+ if (!cpumask_empty(tmpmask)) {
+ for_each_cpu(cpu, tmpmask)
+ remove_cpu(cpu);
+ }
+
+ cpumask_or(&parker_cpus, &parker_cpus, newmask);
+ cpumask_copy(&pke->cpu_mask, newmask);
+out:
+ free_cpumask_var(tmpmask);
+ free_cpumask_var(newmask);
+ mutex_unlock(&pke->mutex);
+ mutex_unlock(&parker_mutex);
+ return ret ?: nbytes;
+}
+
+static int parker_kernel_memory_show(struct kernfs_open_file *of,
+ struct seq_file *seq, void *v)
+{
+ int ret;
+
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ mutex_lock(&pke->mutex);
+ if (!pke->physical_memory_pages) {
+ ret = -EINVAL;
+ goto out;
+ }
+ phys_addr_t base = page_to_phys(pke->physical_memory_pages);
+ unsigned long long size = pke->physical_memory_page_count * PAGE_SIZE;
+ seq_printf(seq, "%llu@0x%llx\n", size, base);
+ ret = 0;
+out:
+ mutex_unlock(&pke->mutex);
+ return ret;
+}
+
+static ssize_t parker_kernel_memory_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ struct page *result;
+ int ret, memory_nid = NUMA_NO_NODE;
+ char *end;
+
+ unsigned long long size;
+ unsigned long page_count;
+
+ mutex_lock(&pke->mutex);
+ if (!(size = memparse(buf, &end))) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Ensure write is fully parsed */
+ if (*end != '\0' && *end != '\n') {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* We need a CPU to determine which NUMA node to allocate memory on */
+ if (cpumask_empty(&pke->cpu_mask)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* Get NUMA node for first cpu (BSP) */
+ memory_nid = cpu_to_node(cpumask_first(&pke->cpu_mask));
+
+ if (pke->physical_memory_pages) {
+ if (!cma_release(parker_cma[memory_nid],
+ pke->physical_memory_pages,
+ pke->physical_memory_page_count)) {
+ ret = -EBUSY;
+ goto out;
+ }
+ }
+
+ /* Assume size is page aligned; if not, the secondary kernel loses the partial page */
+ page_count = size >> PAGE_SHIFT;
+ result = cma_alloc(parker_cma[memory_nid], page_count, 0, false);
+
+ if (!result) {
+ ret = -ENOMEM;
+ pke->physical_memory_pages = NULL;
+ pke->physical_memory_page_count = 0;
+ goto out;
+ }
+
+ if (!cma_pages_valid(parker_cma[memory_nid], result, page_count)) {
+ ret = -EINVAL;
+ if (!cma_release(parker_cma[memory_nid], result, page_count))
+ pr_err("Failed to release invalid allocation.");
+ goto out;
+
+ }
+
+ pke->physical_memory_pages = result;
+ pke->physical_memory_page_count = page_count;
+ ret = 0;
+out:
+ mutex_unlock(&pke->mutex);
+ return ret ?: nbytes;
+}
+
+/* TODO: Consider implementation where we bind to pci-stub instead - avoid rescanning problem? */
+static ssize_t parker_kernel_bind_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ struct kernfs_node *dev_kn;
+ struct device *dev;
+ int ret = -ENODEV;
+
+ mutex_lock(&pke->mutex);
+ dev = bus_find_device_by_name(&pci_bus_type, NULL, buf);
+ /* Remove from bus to prevent anyone from using it */
+ if (dev) {
+ struct parker_kernel_device_entry *pkde;
+ /* If device already disabled, maybe owned by another kernel. Only claim enabled devices */
+ if (!pci_is_enabled(to_pci_dev(dev))) {
+ put_device(dev);
+ ret = -EBUSY;
+ goto out;
+ }
+
+ pkde = kzalloc(sizeof(*pkde), GFP_KERNEL);
+ pkde->dev = dev;
+ dev_kn = kernfs_create_dir(pke->kn_devices, dev_name(dev),
+ pke->kn_devices->mode, pkde);
+ /* Used after kernfs_remove() in the unbind & rmdir cases */
+ kernfs_get(dev_kn);
+ pkde->kn = dev_kn;
+ list_add_tail(&pkde->list_entry, &pke->list_devices);
+ kernfs_activate(dev_kn);
+ pci_bus_type.remove(dev);
+ ret = 0;
+ }
+
+out:
+ mutex_unlock(&pke->mutex);
+ return ret ?: nbytes;
+}
+
+static ssize_t parker_kernel_unbind_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct parker_kernel_entry *pke = of->kn->parent->priv;
+ struct parker_kernel_device_entry *pkde;
+ struct kernfs_node *dev_kn;
+ struct device *dev;
+ int ret = -ENODEV;
+
+ mutex_lock(&pke->mutex);
+ dev = bus_find_device_by_name(&pci_bus_type, NULL, buf);
+ /* Re-probe the device on the bus to return it to this kernel */
+ if (dev) {
+ /* Check if device is claimed by kernel */
+ dev_kn = kernfs_find_and_get(pke->kn_devices, dev_name(dev));
+ if (!dev_kn) {
+ put_device(dev);
+ goto out;
+ }
+
+ /* Ensure PCI device isn't enabled */
+ if (pci_is_enabled(to_pci_dev(dev))) {
+ put_device(dev);
+ kernfs_put(dev_kn);
+ goto out;
+ }
+ pkde = dev_kn->priv;
+
+ ret = pci_bus_type.probe(dev);
+ put_device(dev);
+ kernfs_remove(dev_kn);
+ /* One reference from getting above, one from device subdir creation */
+ kernfs_put(dev_kn);
+ kernfs_put(dev_kn);
+ list_del(&pkde->list_entry);
+ kfree(pkde);
+ }
+
+out:
+ mutex_unlock(&pke->mutex);
+ return ret ?: nbytes;
+}
+
+/* Secondary kernel attributes */
+static struct parker_file_type per_kernel_attributes[] = {
+ /* Passed to secondary kernel to identify */
+ {
+ .name = "id",
+ .mode = 0644,
+ .kf_ops = &parker_kf_ops,
+ .seq_show = parker_kernel_index_show,
+ },
+ {
+ .name = "control_structure",
+ .mode = 0644,
+ .kf_ops = &parker_kf_ops,
+ .seq_show = parker_kernel_control_structure_show,
+ },
+ {
+ .name = "cpus",
+ .mode = 0644,
+ .kf_ops = &parker_kf_ops,
+ .seq_show = parker_kernel_cpus_show,
+ .write = parker_kernel_cpus_write,
+ },
+ /* Add per numa node memory? */
+ {
+ .name = "memory",
+ .mode = 0644,
+ .kf_ops = &parker_kf_ops,
+ .seq_show = parker_kernel_memory_show,
+ .write = parker_kernel_memory_write,
+ },
+ {
+ .name = "bind",
+ .mode = 0644,
+ .kf_ops = &parker_kf_ops,
+ .write = parker_kernel_bind_write,
+ },
+ {
+ .name = "unbind",
+ .mode = 0644,
+ .kf_ops = &parker_kf_ops,
+ .write = parker_kernel_unbind_write,
+ },
+ /* TODO: is status better? */
+ {
+ .name = "online",
+ .mode = 0644,
+ .kf_ops = &parker_kf_ops,
+ .seq_show = parker_kernel_online_show,
+ .write = parker_kernel_online_write, /* TODO */
+ },
+};
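Putting the attribute table together, a host-side session against the kernfs interface might look like the following. The mount point and the `kernel0` directory name are illustrative assumptions (the patch only defines the per-kernel file names); the commands simulate the layout under a temporary directory so the sketch is runnable anywhere:

```shell
# Simulated layout; on a real system this would be the mounted parker fs
# (e.g. mount -t parker none /sys/fs/parker -- path assumed, not from the patch).
root=$(mktemp -d)/parker
mkdir -p "$root/kernel0"                    # mkdir creates a parker_kernel_entry

echo "8-15" > "$root/kernel0/cpus"          # provision CPUs to the instance
echo "4G" > "$root/kernel0/memory"          # CMA-backed memory grant
echo "0000:03:00.0" > "$root/kernel0/bind"  # claim a PCI device for the instance
echo 1 > "$root/kernel0/online"             # kick off the secondary kernel

cat "$root/kernel0/cpus"
```

On the real interface each write is handled by the corresponding `*_write` handler above, and `rmdir kernel0` tears the instance down once it is offline.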
+
+struct parker_fs_context {
+ struct kernfs_fs_context kfc;
+};
+
+static int parker_setup_root(struct parker_fs_context *ctx);
+static void parker_destroy_root(void);
+
+static struct kernfs_root *parker_root;
+struct parker parker_default;
+
+static const struct fs_context_operations parker_fs_context_ops = {
+ .free = parker_fs_context_free,
+ .get_tree = parker_get_tree,
+};
+
+static struct file_system_type parker_fs_type = {
+ .name = "parker",
+ .init_fs_context = parker_init_fs_context,
+ .kill_sb = parker_kill_sb,
+};
+
+
+static int parker_kernel_entry_destroy(struct parker_kernel_entry *pke)
+{
+ int cpu, memory_nid = NUMA_NO_NODE, ret = 0;
+ struct list_head *dev_elem, *n;
+
+
+ /* Bring back parker CPUs */
+ for_each_cpu(cpu, &pke->cpu_mask) {
+ add_cpu(cpu);
+ if (memory_nid == NUMA_NO_NODE)
+ memory_nid = cpu_to_node(cpu);
+ }
+ cpumask_andnot(&parker_cpus, &parker_cpus, &pke->cpu_mask);
+
+ /* Free memory allocated */
+ if (pke->physical_memory_page_count > 0 &&
+ !cma_release(parker_cma[memory_nid],
+ pke->physical_memory_pages,
+ pke->physical_memory_page_count)) {
+ ret = -EBUSY;
+ }
+
+ for (int i = 0; i < pke->control_structure_page_count; ++i) {
+ __free_pages(pke->control_structure_pages + i, 0);
+ }
+
+
+ /* Unclaim PCI devices */
+ list_for_each_safe(dev_elem, n, &pke->list_devices) {
+ struct parker_kernel_device_entry *pkde;
+ pkde = container_of(dev_elem,
+ struct parker_kernel_device_entry,
+ list_entry);
+ ret = pci_bus_type.probe(pkde->dev);
+ if (ret)
+ continue;
+ put_device(pkde->dev);
+ kernfs_remove(pkde->kn);
+ kernfs_put(pkde->kn);
+ kfree(pkde);
+ }
+
+ atomic_dec(&parker_kernels);
+ mutex_destroy(&pke->mutex);
+ if (pke->kn)
+ kernfs_put(pke->kn);
+ kfree(pke);
+
+ return ret;
+}
+
+static int parker_kernel_control_structure_alloc(struct parker_kernel_entry *pke)
+{
+ pke->control_structure_pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, 0);
+ if (!pke->control_structure_pages)
+ return -ENOMEM;
+
+ pke->control_structure_page_count = 1;
+ return 0;
+}
+
+static int parker_kernel_entry_init(struct parker_kernel_entry *pke)
+{
+ struct kernfs_node *kn;
+ int ret;
+ /* Also allocate any secondary structures? */
+
+ ret = parker_kernel_control_structure_alloc(pke);
+ if (ret)
+ return ret;
+
+ pke->id = atomic_inc_return(&parker_kernels);
+ pke->online = false;
+ mutex_init(&pke->mutex);
+ INIT_LIST_HEAD(&pke->list_devices);
+
+ kn = kernfs_create_dir(pke->kn, "devices", pke->kn->mode, pke);
+ if (IS_ERR(kn)) {
+ /* No devices are attached yet, so destroy cannot fail */
+ parker_kernel_entry_destroy(pke);
+ return PTR_ERR(kn);
+ }
+ pke->kn_devices = kn;
+
+ return 0;
+}
+
+static int parker_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
+{
+ int ret = 0;
+ struct parker_kernel_entry *pke;
+ struct kernfs_node *kn;
+
+ /* Only allow creation from within root directory */
+ if (parent_kn != parker_default.kn)
+ return -EINVAL;
+
+ if (strchr(name, '\n'))
+ return -EINVAL;
+
+ mutex_lock(&parker_mutex);
+ pke = kzalloc(sizeof(*pke), GFP_KERNEL);
+ if (!pke) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+ kn = kernfs_create_dir(parent_kn, name, mode, pke);
+ if (IS_ERR(kn)) {
+ ret = PTR_ERR(kn);
+ goto out_free_pke;
+ }
+ pke->kn = kn;
+
+ ret = parker_kernel_entry_init(pke);
+ if (ret)
+ goto out_unlock;
+
+ /* As we will use pke after kernfs_remove */
+ kernfs_get(pke->kn);
+
+ ret = parker_kn_set_ugid(kn);
+ if (ret) {
+ goto out_destroy;
+ }
+
+ ret = parker_add_files(kn, per_kernel_attributes, ARRAY_SIZE(per_kernel_attributes));
+ if (ret) {
+ goto out_destroy;
+ }
+
+ kernfs_activate(kn);
+ goto out_unlock;
+
+out_destroy:
+ kernfs_remove(pke->kn);
+ kernfs_put(pke->kn);
+out_free_pke:
+ kfree(pke);
+out_unlock:
+ mutex_unlock(&parker_mutex);
+ return ret;
+}
+
+static int parker_rmdir(struct kernfs_node *kn)
+{
+ struct parker_kernel_entry *pke = kn->priv;
+ int ret = 0;
+
+ /* Only handle rmdir of kernel */
+ if (pke->kn != kn) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (parker_kernel_is_online(pke)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ /* First remove, ensuring no new operations */
+ mutex_lock(&pke->mutex);
+ kernfs_remove_self(kn);
+ mutex_unlock(&pke->mutex);
+
+ ret = parker_kernel_entry_destroy(pke);
+out:
+ return ret;
+}
+
+static struct kernfs_syscall_ops parker_kf_syscall_ops = {
+ .mkdir = parker_mkdir,
+ .rmdir = parker_rmdir,
+};
+
+static inline struct parker_fs_context *parker_fc2context(struct fs_context *fc)
+{
+ struct kernfs_fs_context *kfc = fc->fs_private;
+
+ return container_of(kfc, struct parker_fs_context, kfc);
+}
+
+static int parker_kn_set_ugid(struct kernfs_node *kn)
+{
+ struct iattr iattr = { .ia_valid = ATTR_UID | ATTR_GID,
+ .ia_uid = current_fsuid(),
+ .ia_gid = current_fsgid(), };
+
+ if (uid_eq(iattr.ia_uid, GLOBAL_ROOT_UID) &&
+ gid_eq(iattr.ia_gid, GLOBAL_ROOT_GID))
+ return 0;
+
+ return kernfs_setattr(kn, &iattr);
+}
+
+static int parker_add_file(struct kernfs_node *parent_kn,
+ struct parker_file_type *pft)
+{
+ struct kernfs_node *kn;
+ int ret;
+
+ kn = __kernfs_create_file(parent_kn, pft->name, pft->mode,
+ GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
+ 0, pft->kf_ops, pft, NULL, NULL);
+ if (IS_ERR(kn))
+ return PTR_ERR(kn);
+
+ ret = parker_kn_set_ugid(kn);
+ if (ret) {
+ kernfs_remove(kn);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int parker_add_files(struct kernfs_node *kn, struct parker_file_type *pfts, int len)
+{
+ struct parker_file_type *pft;
+ int ret;
+
+ lockdep_assert_held(&parker_mutex);
+
+ for (pft = pfts; pft < pfts + len; pft++) {
+ ret = parker_add_file(kn, pft);
+ if (ret)
+ goto error;
+ }
+
+ return 0;
+error:
+ pr_warn("Failed to add %s, err=%d\n", pft->name, ret);
+ while (--pft >= pfts) {
+ kernfs_remove_by_name(kn, pft->name);
+ }
+ return ret;
+}
+
+
+static int parker_init_fs_context(struct fs_context *fc)
+{
+ struct parker_fs_context *ctx;
+ ctx = kzalloc(sizeof(struct parker_fs_context), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->kfc.magic = PARKER_SUPER_MAGIC;
+ fc->fs_private = &ctx->kfc;
+ fc->ops = &parker_fs_context_ops;
+ put_user_ns(fc->user_ns);
+ fc->user_ns = get_user_ns(&init_user_ns);
+ fc->global = true;
+ return 0;
+}
+
+static int parker_get_tree(struct fs_context *fc)
+{
+ struct parker_fs_context *ctx = parker_fc2context(fc);
+ int ret = 0;
+
+ mutex_lock(&parker_mutex);
+ if (parker_mounted) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ /* filesystem was unmounted but kernels weren't cleared up, reactivate last root */
+ if (parker_default.kn) {
+ ctx->kfc.root = parker_root;
+ goto activate_root;
+ }
+
+ ret = parker_setup_root(ctx);
+ if (ret)
+ goto destroy_root;
+
+ ret = parker_add_files(parker_default.kn, root_attributes, ARRAY_SIZE(root_attributes));
+ if (ret < 0)
+ goto destroy_root;
+
+activate_root:
+ kernfs_activate(parker_default.kn);
+ ret = kernfs_get_tree(fc);
+ if (ret < 0)
+ goto destroy_root;
+ parker_mounted = true;
+out:
+ mutex_unlock(&parker_mutex);
+ return ret;
+
+destroy_root:
+ parker_destroy_root();
+ mutex_unlock(&parker_mutex);
+ return ret;
+}
+
+static void parker_fs_context_free(struct fs_context *fc)
+{
+ struct parker_fs_context *ctx = parker_fc2context(fc);
+
+ kernfs_free_fs_context(fc);
+ kfree(ctx);
+}
+
+static void parker_kill_sb(struct super_block *sb)
+{
+ mutex_lock(&parker_mutex);
+ parker_mounted = false;
+
+ /* Only destroy root if no kernels are still declared */
+ if (atomic_read(&parker_kernels) == 0) {
+ parker_destroy_root();
+ }
+
+ kernfs_kill_sb(sb);
+ mutex_unlock(&parker_mutex);
+}
+
+static void parker_destroy_root(void)
+{
+ kernfs_destroy_root(parker_root);
+ parker_default.kn = NULL;
+}
+
+static int parker_setup_root(struct parker_fs_context *ctx)
+{
+ parker_root = kernfs_create_root(
+ &parker_kf_syscall_ops,
+ KERNFS_ROOT_CREATE_DEACTIVATED | KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK,
+ &parker_default);
+
+ if (IS_ERR(parker_root))
+ return PTR_ERR(parker_root);
+
+ ctx->kfc.root = parker_root;
+ parker_default.kn = kernfs_root_to_node(parker_root);
+
+ return 0;
+}
+
+/* Prevent us from onlining CPUs provisioned to parker instance */
+static int parker_cpu_offline_startup(unsigned int cpu)
+{
+ int ret;
+
+ mutex_lock(&parker_mutex);
+ ret = cpumask_test_cpu(cpu, &parker_cpus) ? -EINVAL : 0;
+ mutex_unlock(&parker_mutex);
+
+ return ret;
+}
+
+
+static int __init parker_kernfs_init(void)
+{
+ int ret = 0;
+
+ if (!parker_cma_size) {
+ pr_err("No parker CMA regions allocated, disabling parker.\n");
+ return -ENOENT;
+ }
+
+ ret = sysfs_create_mount_point(fs_kobj, "parker");
+ if (ret)
+ return ret;
+
+ ret = register_filesystem(&parker_fs_type);
+ if (ret)
+ goto cleanup_mountpoint;
+
+ ret = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "parker", parker_cpu_offline_startup, NULL);
+ if (ret < 0)
+ goto cleanup_filesystem;
+
+ return ret;
+cleanup_filesystem:
+ unregister_filesystem(&parker_fs_type);
+cleanup_mountpoint:
+ sysfs_remove_mount_point(fs_kobj, "parker");
+ return ret;
+}
+
+module_init(parker_module_init);
+module_exit(parker_module_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Thom Hughes");
+MODULE_DESCRIPTION("Parker Linux host module.");
+
diff --git a/include/linux/parker.h b/include/linux/parker.h
new file mode 100644
index 000000000000..4984aefcee0f
--- /dev/null
+++ b/include/linux/parker.h
@@ -0,0 +1,7 @@
+#ifndef _LINUX_PARKER_H
+#define _LINUX_PARKER_H
+#ifdef CONFIG_PARKER
+
+#endif /* CONFIG_PARKER */
+#endif /* _LINUX_PARKER_H */
+
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..25658054e3a7 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -38,6 +38,7 @@
#define OVERLAYFS_SUPER_MAGIC 0x794c7630
#define FUSE_SUPER_MAGIC 0x65735546
#define BCACHEFS_SUPER_MAGIC 0xca451a4e
+#define PARKER_SUPER_MAGIC 0x5041524b /* "PARK" */
#define MINIX_SUPER_MAGIC 0x137F /* minix v1 fs, 14 char names */
#define MINIX_SUPER_MAGIC2 0x138F /* minix v1 fs, 30 char names */
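The new magic value spells out "PARK" in ASCII, which can be checked directly:

```python
PARKER_SUPER_MAGIC = 0x5041524b

# Big-endian byte order reads the constant as written: 0x50 0x41 0x52 0x4b.
print(PARKER_SUPER_MAGIC.to_bytes(4, "big"))  # b'PARK'
```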
--
2.39.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC 4/5] x86/parker: Add parker initialisation code
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
` (2 preceding siblings ...)
2025-09-23 15:31 ` [RFC 3/5] x86/parker: Introduce parker kerfs interface Fam Zheng
@ 2025-09-23 15:31 ` Fam Zheng
2025-09-23 15:31 ` [RFC 5/5] x86/apic: Make Parker instance use physical APIC Fam Zheng
` (2 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Fam Zheng @ 2025-09-23 15:31 UTC (permalink / raw)
To: linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, fam.zheng, Zhang Rui, fam, H. Peter Anvin, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
From: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
---
arch/x86/kernel/setup.c | 4 +
arch/x86/parker/Makefile | 3 +-
arch/x86/parker/Makefile-full | 3 +
arch/x86/parker/setup.c | 423 ++++++++++++++++++++++++++++
arch/x86/parker/trampoline.S | 55 ++++
arch/x86/parker/trampoline.h | 10 +
drivers/thermal/intel/therm_throt.c | 3 +
include/linux/parker-bkup.h | 22 ++
include/linux/parker.h | 15 +
9 files changed, 537 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/parker/Makefile-full
create mode 100644 arch/x86/parker/setup.c
create mode 100644 arch/x86/parker/trampoline.S
create mode 100644 arch/x86/parker/trampoline.h
create mode 100644 include/linux/parker-bkup.h
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index cebee310e200..a3c7909efaf5 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -20,6 +20,7 @@
#include <linux/pci.h>
#include <linux/root_dev.h>
#include <linux/hugetlb.h>
+#include <linux/parker.h>
#include <linux/tboot.h>
#include <linux/usb/xhci-dbgp.h>
#include <linux/static_call.h>
@@ -917,6 +918,7 @@ void __init setup_arch(char **cmdline_p)
* called before cache_bp_init() for setting up MTRR state.
*/
init_hypervisor_platform();
+ parker_init();
tsc_early_init();
x86_init.resources.probe_roms();
@@ -1110,6 +1112,8 @@ void __init setup_arch(char **cmdline_p)
if (boot_cpu_has(X86_FEATURE_GBPAGES))
hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+ /* Allocate memory for PARKER kernels */
+ parker_cma_reserve();
/*
* Reserve memory for crash kernel after SRAT is parsed so that it
diff --git a/arch/x86/parker/Makefile b/arch/x86/parker/Makefile
index 41c40fc64267..506ad8cbff00 100644
--- a/arch/x86/parker/Makefile
+++ b/arch/x86/parker/Makefile
@@ -1,2 +1,3 @@
-obj-y += kernfs.o
+obj-y += kernfs.o setup.o trampoline.o
$(obj)/kernfs.o: $(obj)/internal.h
+$(obj)/setup.o: $(obj)/internal.h
diff --git a/arch/x86/parker/Makefile-full b/arch/x86/parker/Makefile-full
new file mode 100644
index 000000000000..506ad8cbff00
--- /dev/null
+++ b/arch/x86/parker/Makefile-full
@@ -0,0 +1,3 @@
+obj-y += kernfs.o setup.o trampoline.o
+$(obj)/kernfs.o: $(obj)/internal.h
+$(obj)/setup.o: $(obj)/internal.h
diff --git a/arch/x86/parker/setup.c b/arch/x86/parker/setup.c
new file mode 100644
index 000000000000..2d36dac05289
--- /dev/null
+++ b/arch/x86/parker/setup.c
@@ -0,0 +1,423 @@
+#define pr_fmt(fmt) "parker: " fmt
+
+#include <linux/module.h>
+#include <linux/memblock.h>
+#include <linux/parker.h>
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+#include <linux/irqreturn.h>
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/pgtable.h>
+#include <linux/reboot.h>
+
+#include <asm/apic.h>
+#include <asm/realmode.h>
+#include <asm/reboot.h>
+#include <asm/acpi.h>
+#include <asm/i8259.h>
+#include <asm/mpspec.h>
+#include <asm/x86_init.h>
+#include <asm/pci_x86.h>
+#include <asm/realmode.h>
+
+#include "internal.h"
+#include "trampoline.h"
+
+bool is_parker = false;
+
+static phys_addr_t parker_control_structure_address;
+static volatile struct parker_control_structure *parker_control_structure;
+
+static void (*old_shutdown)(void);
+static void (*old_restart)(char*);
+
+/*
+ * Take the parker control page address as a kernel parameter; its presence
+ * also indicates that we are booting as a parker kernel. Currently assumes
+ * the control structure is one page.
+ */
+static __init int parker_parse_early_param(char *opt)
+{
+ if (!opt)
+ return -EINVAL;
+ char *oldopt = opt;
+ parker_control_structure_address = memparse(opt, &opt);
+ if (oldopt == opt)
+ return -EINVAL;
+ is_parker = true;
+ return 0;
+}
+early_param("parker", parker_parse_early_param);
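memparse() accepts an integer with an optional size suffix, so the boot parameter can be given as e.g. `parker=0x100000000`. A simplified Python model of that parsing (the real memparse also handles T/P/E suffixes and returns the end pointer that the `oldopt == opt` check relies on):

```python
def memparse(s):
    # Simplified model of the kernel's memparse(): decimal or 0x-prefixed
    # integer with an optional K/M/G suffix.
    mult = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    s = s.strip().lower()
    if s and s[-1] in mult:
        return int(s[:-1], 0) * mult[s[-1]]
    return int(s, 0)

print(hex(memparse("0x100000000")))  # 0x100000000
print(memparse("16M"))               # 16777216
```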
+
+bool is_parker_instance(void)
+{
+ return is_parker;
+}
+
+static struct resource parker_control_structure_resource = {
+ .name = "Parker Control Structure",
+ .start = 0,
+ .end = 0,
+ .flags = IORESOURCE_SYSTEM_RAM,
+ .desc = IORES_DESC_RESERVED
+};
+
+static struct real_mode_header parker_dummy_real_mode_header;
+
+static void parker_reserve_control_structure(unsigned long long addr)
+{
+ parker_control_structure_resource.start = addr;
+ parker_control_structure_resource.end = addr + PAGE_SIZE - 1;
+ insert_resource(&iomem_resource, &parker_control_structure_resource);
+}
+
+static void __init parker_x2apic_init(void)
+{
+#ifdef CONFIG_X86_X2APIC
+ if (!x2apic_enabled())
+ return;
+ x2apic_phys = 1;
+ /*
+ * This will trigger the switch to apic_x2apic_phys. Empty OEM IDs
+ * ensure that only this APIC driver picks up the call.
+ */
+ default_acpi_madt_oem_check("", "");
+#endif
+}
+
+/* Setup trampoline pagetable, stack and initial code pointer */
+static int __init parker_init_trampoline(void)
+{
+ /* Setup trampoline lock or else head_64.S:secondary_startup_64 will crash */
+ trampoline_lock = (u32 *)&parker_trampoline_lock;
+ WRITE_ONCE(*trampoline_lock, 0);
+
+ /* Map kernel page table so we can access kernel memory */
+ for (int i = pgd_index(__PAGE_OFFSET); i < PTRS_PER_PGD; i++)
+ WRITE_ONCE(parker_trampoline_pgt[i], init_top_pgt[i].pgd);
+
+ /* 1:1 map parker_ap_trampoline physical memory so we can jump from the host */
+ u64 paddr_map = virt_to_phys(parker_ap_trampoline);
+ pgd_t *pgd = (pgd_t*)parker_trampoline_pgt;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ pgd += pgd_index(paddr_map);
+ pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
+ if (!pgd_present(*pgd)) {
+ p4d = (p4d_t *)memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+ if (!p4d)
+ return -ENOMEM;
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+ p4d = p4d_offset(pgd, paddr_map);
+ if (!p4d_present(*p4d)) {
+ pud = (pud_t *)memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+ if (!pud) {
+ memblock_free(p4d, PAGE_SIZE);
+ return -ENOMEM;
+ }
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
+ }
+ pud = pud_offset(p4d, paddr_map);
+ if (!pud_present(*pud)) {
+ pmd = (pmd_t *)memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+ if (!pmd) {
+ memblock_free(p4d, PAGE_SIZE);
+ memblock_free(pud, PAGE_SIZE);
+ return -ENOMEM;
+ }
+ set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+ }
+ pmd = pmd_offset(pud, paddr_map);
+ if (!pmd_present(*pmd)) {
+ pte = (pte_t *)memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+ if (!pte) {
+ memblock_free(p4d, PAGE_SIZE);
+ memblock_free(pud, PAGE_SIZE);
+ memblock_free(pmd, PAGE_SIZE);
+ return -ENOMEM;
+ }
+ set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
+ }
+ pte = pte_offset_kernel(pmd, paddr_map);
+
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+ prot = PAGE_KERNEL_EXEC;
+
+ set_pte(pte, pfn_pte(paddr_map >> PAGE_SHIFT, prot));
+
+ /* Write initial code (within parker) for secondary CPU initialisation */
+ WRITE_ONCE(parker_trampoline_start, secondary_startup_64);
+
+ /* Store virtual address of top of stack at bottom of stack */
+ WRITE_ONCE(parker_trampoline_stack, &parker_trampoline_stack_end);
+
+ /* Synchronise all updates */
+ smp_mb();
+
+ return 0;
+}
+
+static void parker_set_spoof_bsp(bool enabled)
+{
+ u64 msr;
+ rdmsrl(MSR_IA32_APICBASE, msr);
+ msr = enabled ? (msr | MSR_IA32_APICBASE_BSP) :
+ (msr & ~MSR_IA32_APICBASE_BSP);
+ wrmsrl(MSR_IA32_APICBASE, msr);
+}
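parker_set_spoof_bsp() toggles the BSP flag (bit 8) of the IA32_APICBASE MSR so the secondary kernel's first CPU looks like a real bootstrap processor. The bit manipulation, modeled in Python with an illustrative MSR value:

```python
MSR_IA32_APICBASE_BSP = 1 << 8  # BSP flag bit in IA32_APICBASE

def spoof_bsp(msr, enabled):
    # Mirrors the rdmsrl / modify / wrmsrl sequence above.
    if enabled:
        return msr | MSR_IA32_APICBASE_BSP
    return msr & ~MSR_IA32_APICBASE_BSP

msr = 0xfee00800  # example: xAPIC enabled (bit 11), BSP flag clear
assert spoof_bsp(msr, True) & MSR_IA32_APICBASE_BSP
assert spoof_bsp(msr | MSR_IA32_APICBASE_BSP, False) & MSR_IA32_APICBASE_BSP == 0
```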
+
+static void __init parker_parse_smp_cfg(void)
+{
+ int ret;
+ pr_info("Parsing SMP configuration\n");
+ /* Disable legacy PIC */
+ pic_mode = 0;
+
+ /* Initialize x2apic as ACPI disabled */
+ parker_x2apic_init();
+
+ /* Spoof parker BSP as BSP or kernel thinks it's crash kernel */
+ parker_set_spoof_bsp(true);
+
+ /* Assume that the lapic address is unchanged */
+ register_lapic_address(APIC_DEFAULT_PHYS_BASE);
+ ret = parker_init_trampoline();
+ if (ret < 0) {
+ pr_err("Failed to initialise trampoline.\n");
+ smp_found_config = 0;
+ return;
+ }
+
+ /* Register all APIC ID's for parker APs */
+ for (int i = 0; i < parker_control_structure->num_cpus; ++i) {
+ topology_register_apic(READ_ONCE(parker_control_structure->apic_ids[i]),
+ CPU_ACPIID_INVALID,
+ true);
+ }
+
+ smp_found_config = 1;
+}
+
+static bool __init parker_x2apic_available(void)
+{
+ return x2apic_enabled();
+}
+
+static void parker_init_host_control(void)
+{
+ phys_addr_t address = parker_control_structure_address;
+
+ /* Reserve the physical control page before mapping it */
+ parker_reserve_control_structure(address);
+ parker_control_structure = early_memremap(address, PAGE_SIZE);
+ /* Announce trampoline physical address to host kernel */
+ phys_addr_t trampoline_phys_addr = virt_to_phys(parker_ap_trampoline);
+ WRITE_ONCE(parker_control_structure->start_address, trampoline_phys_addr);
+ smp_mb();
+}
+
+/* Some APIC callback overrides */
+static int parker_wakeup_secondary_cpu_64(u32 apicid, unsigned long _dummy_start_eip)
+{
+ WRITE_ONCE(parker_trampoline_apicid, apicid);
+ smp_mb();
+
+ /* Wait for APIC id to be reset before continuing,
+ * ensuring no CPU misses trampoline kick. */
+ while (READ_ONCE(parker_trampoline_apicid) != 0)
+ cpu_relax();
+
+ return 0;
+}
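This wakeup path pairs with the spin loops in trampoline.S: the host writes the target APIC ID into a shared word, each parked CPU spins until it sees its own ID, serializes on the trampoline lock, and clears the mailbox to release the host for the next kick. A toy model of the handshake (Python threads standing in for CPUs; real shared-memory ordering relies on the smp_mb()/WRITE_ONCE discipline the sketch only approximates):

```python
import threading

class Mailbox:
    def __init__(self):
        self.apicid = 0                # models parker_trampoline_apicid
        self.lock = threading.Lock()   # models parker_trampoline_lock
        self.booted = []

def ap_thread(mb, my_apicid):
    while mb.apicid != my_apicid:      # .Lwrong_apicid spin
        pass
    with mb.lock:                      # .Ltrampoline_locked serialization
        mb.apicid = 0                  # release host; APIC ID 0 is never an AP
        mb.booted.append(my_apicid)

def host_wakeup(mb, apicid):
    mb.apicid = apicid                 # parker_wakeup_secondary_cpu_64()
    while mb.apicid != 0:              # wait until the AP takes the kick
        pass

mb = Mailbox()
aps = [threading.Thread(target=ap_thread, args=(mb, i)) for i in (1, 2, 3)]
for t in aps:
    t.start()
for apicid in (1, 2, 3):
    host_wakeup(mb, apicid)
for t in aps:
    t.join()
print(sorted(mb.booted))  # [1, 2, 3]
```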
+
+static void parker_send_IPI_allbutself(int vector)
+{
+ if (num_online_cpus() < 2)
+ return;
+
+ __apic_send_IPI_mask_allbutself(cpu_online_mask, vector);
+}
+
+static void parker_send_IPI_all(int vector)
+{
+ __apic_send_IPI_mask(cpu_online_mask, vector);
+}
+
+
+/* Setup real mode header so SMP doesn't dereference null pointer */
+static void __init parker_realmode_init(void)
+{
+ real_mode_header = &parker_dummy_real_mode_header;
+}
+
+static void parker_emergency_restart(void)
+{
+ pr_notice("Restart not supported, spinning\n");
+ for (;;)
+ cpu_relax();
+}
+
+static void parker_offline(void)
+{
+ /* Remove BSP flag from APIC MSR
+ * or we crash on second use of BSP in parker kernel */
+ parker_set_spoof_bsp(false);
+ parker_control_structure = memremap(parker_control_structure_address, PAGE_SIZE, MEMREMAP_WB);
+ if (!parker_control_structure) {
+ pr_err("Unable to map control structure, unable to tell host we are offline.\n");
+ return;
+ }
+ WRITE_ONCE(parker_control_structure->online, false);
+ memunmap((void*)parker_control_structure);
+}
+
+static void parker_shutdown(void)
+{
+ pr_info("shutting down.\n");
+ parker_offline();
+ old_shutdown();
+}
+
+/* No restart occurs, will just effectively shutdown */
+static void parker_restart(char *msg)
+{
+ pr_info("rebooting\n");
+ parker_offline();
+ old_restart(msg);
+}
+
+static struct pci_bus __init *parker_pci_init_root_bus(int busno)
+{
+ struct pci_bus *bus;
+ struct pci_sysdata *sd;
+ LIST_HEAD(resources);
+
+ /* If bus exists, continue */
+ /* TODO: Is domain always 0? (probably not) */
+ bus = pci_find_bus(0, busno);
+ if (bus)
+ return bus;
+
+ sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+ if (!sd) {
+ pr_err("OOM, skipping PCI bus %02x\n", busno);
+ return NULL;
+ }
+
+ sd->node = x86_pci_root_bus_node(busno);
+ x86_pci_root_bus_resources(busno, &resources);
+ bus = pci_create_root_bus(NULL, busno, &pci_root_ops, sd, &resources);
+ if (!bus) {
+ pci_free_resource_list(&resources);
+ kfree(sd);
+ return NULL;
+ }
+
+ return bus;
+}
+
+static int __init parker_pci_init(void)
+{
+ /* Set to 0 as we are manually setting up and probing busses ourselves */
+ pcibios_last_bus = 0;
+
+ /* Scan only passed-through PCI devices; passing through a PCIe port is unsupported */
+ for (int i = 0; i < parker_control_structure->num_pci_devs; ++i) {
+ u32 dev_id = parker_control_structure->pci_dev_ids[i];
+ u32 busno = PCI_BUS_NUM(dev_id);
+ u32 devfn = dev_id & 0xff;
+ struct pci_bus *bus = parker_pci_init_root_bus(busno);
+ if (!bus) {
+ pr_err("Failed to get bus: %d\n", busno);
+ continue;
+ }
+ struct pci_dev *dev = pci_scan_single_device(bus, devfn);
+ if (!dev) {
+ pr_err("Failed to get dev: %d\n", devfn);
+ continue;
+ }
+ pci_bus_add_device(dev);
+ }
+
+ /* We can announce online now to host kernel */
+ WRITE_ONCE(parker_control_structure->online, true);
+ smp_mb();
+
+ /* PCI initialisation is the last time the early mapped structure is used */
+ early_memunmap((void*)parker_control_structure, PAGE_SIZE);
+
+ /* TODO: Disable rescan! */
+ return 0;
+}
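The loop above assumes each `pci_dev_ids[]` entry packs the bus number in bits 15:8 and the devfn in bits 7:0, matching the kernel's PCI_BUS_NUM()/PCI_SLOT()/PCI_FUNC() encoding. A decoding sketch (the example value is hypothetical):

```python
def pci_bus_num(dev_id):
    return (dev_id >> 8) & 0xff   # PCI_BUS_NUM()

def pci_slot(devfn):
    return (devfn >> 3) & 0x1f    # PCI_SLOT()

def pci_func(devfn):
    return devfn & 0x07           # PCI_FUNC()

dev_id = 0x0318                   # hypothetical entry: 0000:03:03.0 in BDF notation
devfn = dev_id & 0xff             # mirrors `devfn = dev_id & 0xff` above
print(pci_bus_num(dev_id), pci_slot(devfn), pci_func(devfn))  # 3 3 0
```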
+
+static int parker_pci_enable_irq(struct pci_dev *dev)
+{
+ /* Pretend legacy IRQ setup succeeded; drivers are expected to use MSI/MSI-X */
+ /* TODO: Find drivers that are MSI-capable but fail to load without INTx;
+ * then we can return -EINVAL here */
+ return 0;
+}
+
+static void parker_pci_disable_irq(struct pci_dev *dev)
+{
+}
+
+void __init parker_init(void)
+{
+ if (!is_parker_instance())
+ return;
+
+ pr_info("Initialising parker...\n");
+ /* TODO: Re-enable! */
+ legacy_pic = &null_legacy_pic;
+ /* Reserve dummy header so existing smpboot.c:do_boot_cpu code
+ * doesn't dereference NULL pointer */
+ x86_platform.realmode_reserve = x86_init_noop;
+ x86_platform.realmode_init = parker_realmode_init;
+
+ /* Disable legacy code */
+ x86_platform.legacy.rtc = 0;
+ x86_platform.legacy.warm_reset = 0;
+ x86_platform.legacy.i8042 = X86_LEGACY_I8042_PLATFORM_ABSENT;
+
+ /* Disable emergency restart */
+ machine_ops.emergency_restart = parker_emergency_restart;
+
+ /* Save old machine ops */
+ old_shutdown = machine_ops.shutdown;
+ old_restart = machine_ops.restart;
+
+ /* Ensure shutdown / restart makes host kernel aware parker is offline */
+ machine_ops.shutdown = parker_shutdown;
+ machine_ops.restart = parker_restart;
+
+ /* Use control structure for SMP CPU APIC ID enumeration */
+ x86_init.mpparse.find_mptable = x86_init_noop;
+ x86_init.mpparse.early_parse_smp_cfg = x86_init_noop;
+ x86_init.mpparse.parse_smp_cfg = parker_parse_smp_cfg;
+
+ /* TODO: Investigate x2apic alternative, but requires baremetal */
+ x86_init.hyper.x2apic_available = parker_x2apic_available;
+
+ /* Disable PCI IRQ handling as we don't support INT-A mode */
+ x86_init.pci.init = parker_pci_init;
+ x86_init.pci.init_irq = x86_init_noop;
+ x86_init.pci.fixup_irqs = x86_init_noop;
+ pcibios_enable_irq = parker_pci_enable_irq;
+ pcibios_disable_irq = parker_pci_disable_irq;
+
+ /* No ACPI, so no hotplugging (be nice) */
+ disable_acpi();
+
+ /* Setup host kernel control page */
+ parker_init_host_control();
+
+ /* Let smpboot.c:do_boot_cpu use our wakeup routine */
+ apic_update_callback(wakeup_secondary_cpu_64, parker_wakeup_secondary_cpu_64);
+
+ /* Prevent shorthand IPIs */
+ apic_update_callback(send_IPI_all, parker_send_IPI_all);
+ apic_update_callback(send_IPI_allbutself, parker_send_IPI_allbutself);
+}
diff --git a/arch/x86/parker/trampoline.S b/arch/x86/parker/trampoline.S
new file mode 100644
index 000000000000..2107201eb1de
--- /dev/null
+++ b/arch/x86/parker/trampoline.S
@@ -0,0 +1,55 @@
+#include <linux/linkage.h>
+
+#include <asm/page_types.h>
+#include <asm/nospec-branch.h>
+#include <asm/unwind_hints.h>
+
+/* NOTE: No SME, host kernel and secondary kernel must match N-level pgt */
+/* NOTE: Changing this file requires a full recompilation as the makefile isn't set up properly */
+.text
+.code64
+SYM_CODE_START(parker_ap_trampoline)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR
+/* Spin for now */
+.Lno_trampoline_start:
+ mov parker_trampoline_start(%rip), %rcx
+ test %rcx, %rcx
+ jz .Lno_trampoline_start
+ leaq parker_trampoline_pgt(%rip), %rax
+.Lno_stack:
+ /* Store vaddr of stack at top of stack */
+ movq parker_trampoline_stack(%rip), %rsp
+ test %rsp, %rsp
+ jz .Lno_stack
+ mov %rax, %cr3
+.Lwrong_apicid:
+ cmp parker_trampoline_apicid, %esi
+ jne .Lwrong_apicid
+.Ltrampoline_locked:
+ lock btsl $0, parker_trampoline_lock
+ jnc .Ltrampoline_unlocked
+ pause
+ jmp .Ltrampoline_locked
+/* Assume APIC ID 0 is never secondary processor */
+.Ltrampoline_unlocked:
+ movl $0, parker_trampoline_apicid
+ ANNOTATE_RETPOLINE_SAFE
+ call *%rcx
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+SYM_CODE_END(parker_ap_trampoline)
+
+.data
+.balign PAGE_SIZE
+SYM_DATA(parker_trampoline_pgt, .skip 4096)
+SYM_DATA(parker_trampoline_start, .quad 0)
+SYM_DATA(parker_trampoline_apicid, .long 0)
+SYM_DATA(parker_trampoline_lock, .long 0)
+.balign 4096
+/* TODO: Just allocate a stack; why waste 4KB here? */
+SYM_DATA_START(parker_trampoline_stack)
+ .skip 4096
+SYM_DATA_END_LABEL(parker_trampoline_stack, SYM_L_GLOBAL, parker_trampoline_stack_end)
+SYM_DATA(parker_trampoline_end, .quad 0)
diff --git a/arch/x86/parker/trampoline.h b/arch/x86/parker/trampoline.h
new file mode 100644
index 000000000000..b93ca612db99
--- /dev/null
+++ b/arch/x86/parker/trampoline.h
@@ -0,0 +1,10 @@
+#ifndef _TRAMPOLINE_H
+#define _TRAMPOLINE_H
+void parker_ap_trampoline(void);
+extern u64 parker_trampoline_pgt[];
+extern void *parker_trampoline_start;
+extern u32 parker_trampoline_lock;
+extern void *parker_trampoline_stack;
+extern u32 parker_trampoline_apicid;
+extern u64 parker_trampoline_stack_end;
+#endif
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index e69868e868eb..dabc7e35ff72 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -20,6 +20,7 @@
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/export.h>
+#include <linux/parker.h>
#include <linux/types.h>
#include <linux/init.h>
#include <linux/smp.h>
@@ -690,6 +691,8 @@ void intel_thermal_interrupt(void)
/* Thermal monitoring depends on APIC, ACPI and clock modulation */
static int intel_thermal_supported(struct cpuinfo_x86 *c)
{
+ if (is_parker_instance())
+ return 0;
if (!boot_cpu_has(X86_FEATURE_APIC))
return 0;
if (!cpu_has(c, X86_FEATURE_ACPI) || !cpu_has(c, X86_FEATURE_ACC))
diff --git a/include/linux/parker-bkup.h b/include/linux/parker-bkup.h
new file mode 100644
index 000000000000..b00833b5a24b
--- /dev/null
+++ b/include/linux/parker-bkup.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_PARKER_H
+#define _LINUX_PARKER_H
+#ifdef CONFIG_PARKER
+extern void __init parker_cma_reserve(void);
+extern void __init parker_init(void);
+extern bool is_parker_instance(void);
+#else
+static inline __init void parker_cma_reserve(void)
+{
+}
+
+static inline __init void parker_init(void)
+{
+}
+
+static inline bool is_parker_instance(void)
+{
+ return false;
+}
+#endif /* CONFIG_PARKER */
+#endif /* _LINUX_PARKER_H */
+
diff --git a/include/linux/parker.h b/include/linux/parker.h
index 4984aefcee0f..b00833b5a24b 100644
--- a/include/linux/parker.h
+++ b/include/linux/parker.h
@@ -1,7 +1,22 @@
#ifndef _LINUX_PARKER_H
#define _LINUX_PARKER_H
#ifdef CONFIG_PARKER
+extern void __init parker_cma_reserve(void);
+extern void __init parker_init(void);
+extern bool is_parker_instance(void);
+#else
+static inline __init void parker_cma_reserve(void)
+{
+}
+static inline __init void parker_init(void)
+{
+}
+
+static inline bool is_parker_instance(void)
+{
+ return false;
+}
#endif /* CONFIG_PARKER */
#endif /* _LINUX_PARKER_H */
--
2.39.5
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC 5/5] x86/apic: Make Parker instance use physical APIC
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
` (3 preceding siblings ...)
2025-09-23 15:31 ` [RFC 4/5] x86/parker: Add parker initialisation code Fam Zheng
@ 2025-09-23 15:31 ` Fam Zheng
2025-09-23 19:46 ` [RFC 0/5] parker: PARtitioned KERnel H. Peter Anvin
2025-09-24 15:22 ` Dave Hansen
6 siblings, 0 replies; 18+ messages in thread
From: Fam Zheng @ 2025-09-23 15:31 UTC (permalink / raw)
To: linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, fam.zheng, Zhang Rui, fam, H. Peter Anvin, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
From: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Thom Hughes <thom.hughes@bytedance.com>
Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
---
arch/x86/kernel/apic/apic_flat_64.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index e0308d8c4e6c..e753125a1de8 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -9,6 +9,7 @@
* James Cleverdon.
*/
#include <linux/export.h>
+#include <linux/parker.h>
#include <asm/apic.h>
@@ -21,7 +22,7 @@ static u32 physflat_get_apic_id(u32 x)
static int physflat_probe(void)
{
- return 1;
+ return is_parker_instance();
}
static int physflat_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
--
2.39.5
* Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
` (4 preceding siblings ...)
2025-09-23 15:31 ` [RFC 5/5] x86/apic: Make Parker instance use physical APIC Fam Zheng
@ 2025-09-23 19:46 ` H. Peter Anvin
2025-09-24 15:22 ` Dave Hansen
6 siblings, 0 replies; 18+ messages in thread
From: H. Peter Anvin @ 2025-09-23 19:46 UTC (permalink / raw)
To: Fam Zheng, linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, Zhang Rui, fam, x86, liangma, Dave Hansen,
Rafael J. Wysocki, guojinhui.liam, linux-pm, Thom Hughes
On 2025-09-23 08:31, Fam Zheng wrote:
>
> Parker is a proposed feature in linux for multiple linux kernels to run
> simultaneously on single machine, without traditional kvm virtualisation. This
> is achieved by partitioning the CPU cores, memory and devices for
> partitioning-aware Linux kernel.
>
This seems to be much better handled by a lightweight hypervisor. There is a
reason why ALL IBM mainframes have a low-level hard-partitioning hypervisor.
Typically that hypervisor will expose a static, very low level view of the
machine (e.g. no scheduling - VCPUs are mapped 1:1 to physical CPUs; no I/O
sharing or emulation, except possibly as needed to boot, and so on.)
Because the functionality of the hypervisor is so limited, the overhead is
minimal, but it CAN (but doesn't HAVE TO) provide memory and I/O isolation
between partitions.
-hpa
* Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
` (5 preceding siblings ...)
2025-09-23 19:46 ` [RFC 0/5] parker: PARtitioned KERnel H. Peter Anvin
@ 2025-09-24 15:22 ` Dave Hansen
2025-09-24 16:21 ` [External] " Fam Zheng
2025-09-24 19:01 ` H. Peter Anvin
6 siblings, 2 replies; 18+ messages in thread
From: Dave Hansen @ 2025-09-24 15:22 UTC (permalink / raw)
To: Fam Zheng, linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, Zhang Rui, fam, H. Peter Anvin, x86, liangma,
Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On 9/23/25 08:31, Fam Zheng wrote:
> In terms of fault isolation or security, all kernel instances share
> the same domain, as there is no supervising mechanism. A kernel bug
> in any partition can cause problems for the whole physical machine.
> This is a tradeoff for low-overhead / low-complexity, but hope in
> the future we can take advantage of some hardware mechanism to
> introduce some isolation.
I just don't think this approach is viable. The buck needs to stop
_somewhere_. You can't just have a bunch of different kernels, with
nothing in charge of the system as a whole.
Just think of bus locks. They affect the whole system. What if one
kernel turns off split lock detection? Or has a different rate limit
than the others? What if one kernel is a big fan of WBINVD? How about
when they use resctrl to partition an L3 cache? How about microcode updates?
I'd just guess that there are a few hundred problems like that. Maybe more.
I'm not saying this won't be useful for a handful of folks in a tightly
controlled environment. But I just don't think it has a place in
mainline where it needs to work for everyone.
* Re: [External] Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-24 15:22 ` Dave Hansen
@ 2025-09-24 16:21 ` Fam Zheng
2025-09-24 18:32 ` Dave Hansen
2025-09-24 19:01 ` H. Peter Anvin
1 sibling, 1 reply; 18+ messages in thread
From: Fam Zheng @ 2025-09-24 16:21 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, Zhang Rui, fam, H. Peter Anvin, x86, liangma,
Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On Wed, Sep 24, 2025 at 4:23 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 9/23/25 08:31, Fam Zheng wrote:
> > In terms of fault isolation or security, all kernel instances share
> > the same domain, as there is no supervising mechanism. A kernel bug
> > in any partition can cause problems for the whole physical machine.
> > This is a tradeoff for low-overhead / low-complexity, but hope in
> > the future we can take advantage of some hardware mechanism to
> > introduce some isolation.
> I just don't think this approach is viable. The buck needs to stop
> _somewhere_. You can't just have a bunch of different kernels, with
> nothing in charge of the system as a whole.
>
> Just think of bus locks. They affect the whole system. What if one
> kernel turns off split lock detection? Or has a different rate limit
> than the others? What if one kernel is a big fan of WBINVD? How about
> when they use resctrl to partition an L3 cache? How about microcode updates?
The model and motivation here is not to split the domain and give
different shares to different sysadmins, it's intended for one kernel
to partition itself. I agree we shouldn't have different kernels here:
one old, one new, one Linux, one Windows... All partitions should run
a verified parker-aware kernel. Actually, it may be a good idea to
force the same buildid in kexec between the boot kernel and secondary
ones.
Fam
* Re: [External] Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-24 16:21 ` [External] " Fam Zheng
@ 2025-09-24 18:32 ` Dave Hansen
[not found] ` <CABgc4wRgpYNARf+7MhsadfXjDJ0Vd01OoqnVdrW3m6dXMzQSaQ@mail.gmail.com>
0 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2025-09-24 18:32 UTC (permalink / raw)
To: Fam Zheng
Cc: linux-kernel, Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, Zhang Rui, fam, H. Peter Anvin, x86, liangma,
Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On 9/24/25 09:21, Fam Zheng wrote:
...
> The model and motivation here is not to split the domain and give
> different shares to different sysadmins, it's intended for one kernel
> to partition itself. I agree we shouldn't have different kernels here:
> one old, one new, one Linux, one Windows... All partitions should run
> a verified parker-aware kernel. Actually, it may be a good idea to
> force the same buildid in kexec between the boot kernel and secondary
> ones.
Uhhh.... From the cover letter:
> Another possible use case is for different kernel instances to have
> different performance tunings, CONFIG_ options, FDO/PGO according to
> the workload.
Wouldn't the buildid change with CONFIG_ options and FDO/PGO?
Thank you for posting this series. It's interesting and
thought-provoking. But that's where it stops for me. I don't think this
approach has any future upstream. I probably won't look at it again,
even if it hits my inbox. (I hope it _isn't_ sent again unless there is
some *MAJOR* *MAJOR* change to the approach).
* Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-24 15:22 ` Dave Hansen
2025-09-24 16:21 ` [External] " Fam Zheng
@ 2025-09-24 19:01 ` H. Peter Anvin
[not found] ` <CABgc4wTjc9nxmB16LkxiOL5gYO9K8kr46OqM=asyUkX7cT50Sg@mail.gmail.com>
2025-10-22 12:11 ` Pavel Machek
1 sibling, 2 replies; 18+ messages in thread
From: H. Peter Anvin @ 2025-09-24 19:01 UTC (permalink / raw)
To: Dave Hansen, Fam Zheng, linux-kernel
Cc: Lukasz Luba, linyongting, songmuchun, satish.kumar,
Borislav Petkov, Thomas Gleixner, yuanzhu, Ingo Molnar,
Daniel Lezcano, Zhang Rui, fam, x86, liangma, Dave Hansen,
Rafael J. Wysocki, guojinhui.liam, linux-pm, Thom Hughes
On September 24, 2025 8:22:54 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>On 9/23/25 08:31, Fam Zheng wrote:
>> In terms of fault isolation or security, all kernel instances share
>> the same domain, as there is no supervising mechanism. A kernel bug
>> in any partition can cause problems for the whole physical machine.
>> This is a tradeoff for low-overhead / low-complexity, but hope in
>> the future we can take advantage of some hardware mechanism to
>> introduce some isolation.
>I just don't think this approach is viable. The buck needs to stop
>_somewhere_. You can't just have a bunch of different kernels, with
>nothing in charge of the system as a whole.
>
>Just think of bus locks. They affect the whole system. What if one
>kernel turns off split lock detection? Or has a different rate limit
>than the others? What if one kernel is a big fan of WBINVD? How about
>when they use resctrl to partition an L3 cache? How about microcode updates?
>
>I'd just guess that there are a few hundred problems like that. Maybe more.
>
>I'm not saying this won't be useful for a handful of folks in a tightly
>controlled environment. But I just don't think it has a place in
>mainline where it needs to work for everyone.
Again, this comes down to why a partitioning top level hypervisor is The Right Thing[TM].
IBM mainframes are, again, the archetype here, having done it standard since VM/370 in 1972. This was running on machines with a *maximum* of 4 MB memory.
This approach works.
Nearly every OS on these machines tends to run under a *second* level hypervisor, although that isn't required.
* Re: [External] Re: [RFC 0/5] parker: PARtitioned KERnel
[not found] ` <CABgc4wRgpYNARf+7MhsadfXjDJ0Vd01OoqnVdrW3m6dXMzQSaQ@mail.gmail.com>
@ 2025-09-24 20:13 ` Fam Zheng
0 siblings, 0 replies; 18+ messages in thread
From: Fam Zheng @ 2025-09-24 20:13 UTC (permalink / raw)
To: Dave Hansen
Cc: Fam Zheng, linux-kernel, Lukasz Luba, linyongting, songmuchun,
satish.kumar, Borislav Petkov, Thomas Gleixner, yuanzhu,
Ingo Molnar, Daniel Lezcano, Zhang Rui, H. Peter Anvin, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm
On Wed, Sep 24, 2025 at 9:05 PM Fam Zheng <fam@euphon.net> wrote:
>
>
>
> On Wed, Sep 24, 2025 at 7:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 9/24/25 09:21, Fam Zheng wrote:
>> ...
>> > The model and motivation here is not to split the domain and give
>> > different shares to different sysadmins, it's intended for one kernel
>> > to partition itself. I agree we shouldn't have different kernels here:
>> > one old, one new, one Linux, one Windows... All partitions should run
>> > a verified parker-aware kernel. Actually, it may be a good idea to
>> > force the same buildid in kexec between the boot kernel and secondary
>> > ones.
>> Uhhh.... From the cover letter:
>>
>> > Another possible use case is for different kernel instances to have
>> > different performance tunings, CONFIG_ options, FDO/PGO according to
>> > the workload.
>>
>> Wouldn't the buildid change with CONFIG_ options and FDO/PGO?
>>
>
>
Discussing goals and non-goals is exactly what we were looking for
from this RFC, and these were just stretch ideas that we can decide
not to pursue.
Thanks for looking at this!
Forgot to turn off email html mode in my previous message..
(outside working hours now so replying from personal email)
Thanks,
Fam
* Re: [RFC 0/5] parker: PARtitioned KERnel
[not found] ` <CABgc4wTjc9nxmB16LkxiOL5gYO9K8kr46OqM=asyUkX7cT50Sg@mail.gmail.com>
@ 2025-09-24 20:14 ` Fam Zheng
2025-09-24 22:13 ` H. Peter Anvin
0 siblings, 1 reply; 18+ messages in thread
From: Fam Zheng @ 2025-09-24 20:14 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Dave Hansen, Fam Zheng, linux-kernel, Lukasz Luba, linyongting,
songmuchun, satish.kumar, Borislav Petkov, Thomas Gleixner,
yuanzhu, Ingo Molnar, Daniel Lezcano, Zhang Rui, x86, liangma,
Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On Wed, Sep 24, 2025 at 9:10 PM Fam Zheng <fam@euphon.net> wrote:
>
>
>
> On Wed, Sep 24, 2025 at 8:02 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On September 24, 2025 8:22:54 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>> >On 9/23/25 08:31, Fam Zheng wrote:
>> >> In terms of fault isolation or security, all kernel instances share
>> >> the same domain, as there is no supervising mechanism. A kernel bug
>> >> in any partition can cause problems for the whole physical machine.
>> >> This is a tradeoff for low-overhead / low-complexity, but hope in
>> >> the future we can take advantage of some hardware mechanism to
>> >> introduce some isolation.
>> >I just don't think this approach is viable. The buck needs to stop
>> >_somewhere_. You can't just have a bunch of different kernels, with
>> >nothing in charge of the system as a whole.
>> >
>> >Just think of bus locks. They affect the whole system. What if one
>> >kernel turns off split lock detection? Or has a different rate limit
>> >than the others? What if one kernel is a big fan of WBINVD? How about
>> >when they use resctrl to partition an L3 cache? How about microcode updates?
>> >
>> >I'd just guess that there are a few hundred problems like that. Maybe more.
>> >
>> >I'm not saying this won't be useful for a handful of folks in a tightly
>> >controlled environment. But I just don't think it has a place in
>> >mainline where it needs to work for everyone.
>>
>> Again, this comes down to why a partitioning top level hypervisor is The Right Thing[TM].
>>
>> IBM mainframes are, again, the archetype here, having done it standard since VM/370 in 1972. This was running on machines with a *maximum* of 4 MB memory.
>>
>> This approach works.
>>
>> Nearly every OS on these machines tends to run under a *second* level hypervisor, although that isn't required.
>
>
I'm trying to think about the hypervisor approach you mentioned, but
if it doesn't provide memory and I/O isolation, what is the advantage
over this RFC? (If it does, then I think we're talking about a
specially configured KVM which does 1:1 vCPU pinning, etc.)
Sorry, forgot to turn off email html mode in my previous message..
Fam
* Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-24 20:14 ` Fam Zheng
@ 2025-09-24 22:13 ` H. Peter Anvin
2025-09-25 7:26 ` [External] " Fam Zheng
0 siblings, 1 reply; 18+ messages in thread
From: H. Peter Anvin @ 2025-09-24 22:13 UTC (permalink / raw)
To: fam, Fam Zheng
Cc: Dave Hansen, Fam Zheng, linux-kernel, Lukasz Luba, linyongting,
songmuchun, satish.kumar, Borislav Petkov, Thomas Gleixner,
yuanzhu, Ingo Molnar, Daniel Lezcano, Zhang Rui, x86, liangma,
Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On September 24, 2025 1:14:26 PM PDT, Fam Zheng <fam@euphon.net> wrote:
>On Wed, Sep 24, 2025 at 9:10 PM Fam Zheng <fam@euphon.net> wrote:
>>
>>
>>
>> On Wed, Sep 24, 2025 at 8:02 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>>
>>> On September 24, 2025 8:22:54 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>>> >On 9/23/25 08:31, Fam Zheng wrote:
>>> >> In terms of fault isolation or security, all kernel instances share
>>> >> the same domain, as there is no supervising mechanism. A kernel bug
>>> >> in any partition can cause problems for the whole physical machine.
>>> >> This is a tradeoff for low-overhead / low-complexity, but hope in
>>> >> the future we can take advantage of some hardware mechanism to
>>> >> introduce some isolation.
>>> >I just don't think this approach is viable. The buck needs to stop
>>> >_somewhere_. You can't just have a bunch of different kernels, with
>>> >nothing in charge of the system as a whole.
>>> >
>>> >Just think of bus locks. They affect the whole system. What if one
>>> >kernel turns off split lock detection? Or has a different rate limit
>>> >than the others? What if one kernel is a big fan of WBINVD? How about
>>> >when they use resctrl to partition an L3 cache? How about microcode updates?
>>> >
>>> >I'd just guess that there are a few hundred problems like that. Maybe more.
>>> >
>>> >I'm not saying this won't be useful for a handful of folks in a tightly
>>> >controlled environment. But I just don't think it has a place in
>>> >mainline where it needs to work for everyone.
>>>
>>> Again, this comes down to why a partitioning top level hypervisor is The Right Thing[TM].
>>>
>>> IBM mainframes are, again, the archetype here, having done it standard since VM/370 in 1972. This was running on machines with a *maximum* of 4 MB memory.
>>>
>>> This approach works.
>>>
>>> Nearly every OS on these machines tends to run under a *second* level hypervisor, although that isn't required.
>>
>>
>I'm trying to think about the hypervisor approach you mentioned, but
>if it doesn't provide memory and I/O isolation, what is the advantage
>over this RFC? (If it does, then I think we're talking about a
>specially configured KVM which does 1:1 vCPU pinning, etc.)
>
>
>Sorry, forgot to turn off email html mode in my previous message..
>
>
>Fam
>
The difference is that this is highly invasive to the OS, which affects developers and users not wanting this feature.
* Re: [External] Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-24 22:13 ` H. Peter Anvin
@ 2025-09-25 7:26 ` Fam Zheng
2025-09-25 15:32 ` H. Peter Anvin
0 siblings, 1 reply; 18+ messages in thread
From: Fam Zheng @ 2025-09-25 7:26 UTC (permalink / raw)
To: H. Peter Anvin
Cc: fam, Dave Hansen, linux-kernel, Lukasz Luba, linyongting,
songmuchun, satish.kumar, Borislav Petkov, Thomas Gleixner,
yuanzhu, Ingo Molnar, Daniel Lezcano, Zhang Rui, x86, liangma,
Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
> From: "H. Peter Anvin"<hpa@zytor.com>
> The difference is that this is highly invasive to the OS, which affects developers and users not wanting this feature.
Yeah that makes sense, thanks for clarifying. By having a hypervisor
at least in early boot of the secondary kernels, we don't need to patch
device enumeration, etc., in the kernel code.
Once the kernel is up, it can then be promoted to run directly on bare
metal, with zero performance overhead.
Fam
* Re: [External] Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-25 7:26 ` [External] " Fam Zheng
@ 2025-09-25 15:32 ` H. Peter Anvin
0 siblings, 0 replies; 18+ messages in thread
From: H. Peter Anvin @ 2025-09-25 15:32 UTC (permalink / raw)
To: Fam Zheng
Cc: fam, Dave Hansen, linux-kernel, Lukasz Luba, linyongting,
songmuchun, satish.kumar, Borislav Petkov, Thomas Gleixner,
yuanzhu, Ingo Molnar, Daniel Lezcano, Zhang Rui, x86, liangma,
Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On 2025-09-25 00:26, Fam Zheng wrote:
>> From: "H. Peter Anvin"<hpa@zytor.com>
>> The difference is that this is highly invasive to the OS, which affects developers and users not wanting this feature.
>
> Yeah that makes sense, thanks for clarifying. By having a hypervisor
> at least in early boot of secondary kernels, we don't need to patch
> device enumeration etc. In the kernel code.
>
> Once the kernel is up, it can be then promoted to run directly on bare
> metal, so zero performance overhead.
Realistically you would remain in the hypervisor, but nothing or almost
nothing will trap into the hypervisor, so again, zero or negligible
performance overhead. You also *can* put some isolation or protection
features in the low-level hypervisor.
The important thing here is that the maintenance burden *and* the policy
choices fall on the users of the feature, and as the upstream maintainers
cannot and thus will not test this use case, it is likely to break on a
regular basis.
This is basically "paravirt_ops all over again." There are very good reasons
we are trying to get rid of them.
-hpa
* Re: [RFC 0/5] parker: PARtitioned KERnel
2025-09-24 19:01 ` H. Peter Anvin
[not found] ` <CABgc4wTjc9nxmB16LkxiOL5gYO9K8kr46OqM=asyUkX7cT50Sg@mail.gmail.com>
@ 2025-10-22 12:11 ` Pavel Machek
2025-10-23 1:26 ` H. Peter Anvin
1 sibling, 1 reply; 18+ messages in thread
From: Pavel Machek @ 2025-10-22 12:11 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Dave Hansen, Fam Zheng, linux-kernel, Lukasz Luba, linyongting,
songmuchun, satish.kumar, Borislav Petkov, Thomas Gleixner,
yuanzhu, Ingo Molnar, Daniel Lezcano, Zhang Rui, fam, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On Wed 2025-09-24 12:01:52, H. Peter Anvin wrote:
> On September 24, 2025 8:22:54 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
> >On 9/23/25 08:31, Fam Zheng wrote:
> >> In terms of fault isolation or security, all kernel instances share
> >> the same domain, as there is no supervising mechanism. A kernel bug
> >> in any partition can cause problems for the whole physical machine.
> >> This is a tradeoff for low-overhead / low-complexity, but hope in
> >> the future we can take advantage of some hardware mechanism to
> >> introduce some isolation.
> >I just don't think this approach is viable. The buck needs to stop
> >_somewhere_. You can't just have a bunch of different kernels, with
> >nothing in charge of the system as a whole.
> >
> >Just think of bus locks. They affect the whole system. What if one
> >kernel turns off split lock detection? Or has a different rate limit
> >than the others? What if one kernel is a big fan of WBINVD? How about
> >when they use resctrl to partition an L3 cache? How about microcode updates?
> >
> >I'd just guess that there are a few hundred problems like that. Maybe more.
> >
> >I'm not saying this won't be useful for a handful of folks in a tightly
> >controlled environment. But I just don't think it has a place in
> >mainline where it needs to work for everyone.
>
> Again, this comes down to why a partitioning top level hypervisor is The Right Thing[TM].
>
> IBM mainframes are, again, the archetype here, having done it
> standard since VM/370 in 1972. This was running on machines with a
> *maximum* of 4 MB memory.
Is there a good resource on IBM mainframes, preferably written in
language that can be understood by a mostly-x86 kernel hacker?
BR,
Pavel
--
I don't work for Nazis and criminals, and neither should you.
Boycott Putin, Trump, Netanyahu and Musk!
* Re: [RFC 0/5] parker: PARtitioned KERnel
2025-10-22 12:11 ` Pavel Machek
@ 2025-10-23 1:26 ` H. Peter Anvin
0 siblings, 0 replies; 18+ messages in thread
From: H. Peter Anvin @ 2025-10-23 1:26 UTC (permalink / raw)
To: Pavel Machek
Cc: Dave Hansen, Fam Zheng, linux-kernel, Lukasz Luba, linyongting,
songmuchun, satish.kumar, Borislav Petkov, Thomas Gleixner,
yuanzhu, Ingo Molnar, Daniel Lezcano, Zhang Rui, fam, x86,
liangma, Dave Hansen, Rafael J. Wysocki, guojinhui.liam, linux-pm,
Thom Hughes
On October 22, 2025 5:11:19 AM PDT, Pavel Machek <pavel@ucw.cz> wrote:
>On Wed 2025-09-24 12:01:52, H. Peter Anvin wrote:
>> On September 24, 2025 8:22:54 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
>> >On 9/23/25 08:31, Fam Zheng wrote:
>> >> In terms of fault isolation or security, all kernel instances share
>> >> the same domain, as there is no supervising mechanism. A kernel bug
>> >> in any partition can cause problems for the whole physical machine.
>> >> This is a tradeoff for low-overhead / low-complexity, but hope in
>> >> the future we can take advantage of some hardware mechanism to
>> >> introduce some isolation.
>> >I just don't think this approach is viable. The buck needs to stop
>> >_somewhere_. You can't just have a bunch of different kernels, with
>> >nothing in charge of the system as a whole.
>> >
>> >Just think of bus locks. They affect the whole system. What if one
>> >kernel turns off split lock detection? Or has a different rate limit
>> >than the others? What if one kernel is a big fan of WBINVD? How about
>> >when they use resctrl to partition an L3 cache? How about microcode updates?
>> >
>> >I'd just guess that there are a few hundred problems like that. Maybe more.
>> >
>> >I'm not saying this won't be useful for a handful of folks in a tightly
>> >controlled environment. But I just don't think it has a place in
>> >mainline where it needs to work for everyone.
>>
>> Again, this comes down to why a partitioning top level hypervisor is The Right Thing[TM].
>>
>> IBM mainframes are, again, the archetype here, having done it
>> standard since VM/370 in 1972. This was running on machines with a
>> *maximum* of 4 MB memory.
>
>Is there a good resource on IBM mainframes, prefferably written in
>language that can be understood by mostly x86 kernel hacker?
>
>BR,
> Pavel
I don't know... perhaps ask the s390 guys?
end of thread, other threads:[~2025-10-23 1:27 UTC | newest]
Thread overview: 18+ messages
2025-09-23 15:31 [RFC 0/5] parker: PARtitioned KERnel Fam Zheng
2025-09-23 15:31 ` [RFC 1/5] x86/boot/e820: Fix memmap to parse with 1 argument Fam Zheng
2025-09-23 15:31 ` [RFC 2/5] x86/smpboot: Export wakeup_secondary_cpu_via_init Fam Zheng
2025-09-23 15:31 ` [RFC 3/5] x86/parker: Introduce parker kernfs interface Fam Zheng
2025-09-23 15:31 ` [RFC 4/5] x86/parker: Add parker initialisation code Fam Zheng
2025-09-23 15:31 ` [RFC 5/5] x86/apic: Make Parker instance use physical APIC Fam Zheng
2025-09-23 19:46 ` [RFC 0/5] parker: PARtitioned KERnel H. Peter Anvin
2025-09-24 15:22 ` Dave Hansen
2025-09-24 16:21 ` [External] " Fam Zheng
2025-09-24 18:32 ` Dave Hansen
[not found] ` <CABgc4wRgpYNARf+7MhsadfXjDJ0Vd01OoqnVdrW3m6dXMzQSaQ@mail.gmail.com>
2025-09-24 20:13 ` Fam Zheng
2025-09-24 19:01 ` H. Peter Anvin
[not found] ` <CABgc4wTjc9nxmB16LkxiOL5gYO9K8kr46OqM=asyUkX7cT50Sg@mail.gmail.com>
2025-09-24 20:14 ` Fam Zheng
2025-09-24 22:13 ` H. Peter Anvin
2025-09-25 7:26 ` [External] " Fam Zheng
2025-09-25 15:32 ` H. Peter Anvin
2025-10-22 12:11 ` Pavel Machek
2025-10-23 1:26 ` H. Peter Anvin