Linux Kernel Selftest development

* (no subject)
@ 2026-05-29 14:32 George Guo
  2026-05-29 14:32 ` [PATCH v2 1/4] LoongArch: kexec: add KHO support for FDT-based systems George Guo
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: George Guo @ 2026-05-29 14:32 UTC (permalink / raw)
  To: Huacai Chen, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Shuah Khan
  Cc: WANG Xuerui, Alexander Graf, loongarch, linux-kernel, kexec,
	linux-mm, linux-kselftest, George Guo

From: George Guo <guodongtai@kylinos.cn>

WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#26: 
containing only /chosen with the two KHO properties.  Since DEVICE_TREE_GUID

ERROR: Invalid commit separator - some tools may have problems applying this
#34: 
-------------------------------

total: 2 errors, 1 warnings, 0 lines checked
>
Date: Fri, 29 May 2026 21:54:01 +0800
Subject: [PATCH v2 0/4] LoongArch: add KHO support and selftests

This series adds Kexec Handover (KHO) support for LoongArch and extends
the KHO selftest infrastructure to run on LoongArch under QEMU.

KHO passes metadata (the KHO state FDT and scratch area addresses) to the
second kernel via the FDT /chosen node, using the linux,kho-fdt and
linux,kho-scratch properties that drivers/of/kexec.c:kho_add_chosen()
writes and drivers/of/fdt.c:early_init_dt_check_kho() reads.

KHO support (patches 1-2):

Patch 1 adds KHO support for FDT-based systems (initial_boot_params !=
NULL, e.g. QEMU virt without OVMF).  kho_load_fdt() copies the running
kernel's FDT, appends linux,kho-fdt and linux,kho-scratch to /chosen,
and loads the result as a kexec segment.  machine_kexec() updates the
DEVICE_TREE_GUID entry in the EFI config table to point to this segment
so the second kernel's fdt_setup() can find and parse it.

Patch 2 adds KHO support for ACPI-only systems (initial_boot_params ==
NULL, e.g. LoongArch servers with UEFI or QEMU with OVMF).  Because no
system FDT is available, kho_load_fdt() builds a minimal FDT from
scratch containing only /chosen with the two KHO properties.  Since
DEVICE_TREE_GUID is absent from the EFI config table on ACPI-only
systems, a new extended config table is built with the entry appended
and loaded as a kexec segment; machine_kexec() switches st->tables to
point to it before jumping.  The second kernel's fdt_setup() calls
efi_fdt_pointer() to detect the KHO FDT and passes it to
early_init_dt_check_kho().

Selftest support (patches 3-4):

Patch 3 adds loongarch.conf and extends vmtest.sh to recognise loongarch64
as a build target.  The LoongArch virt machine is FDT-only (no ACPI), so
'earlycon' must appear on the kernel cmdline or the console UART is never
discovered.  PS/2 input devices are also disabled since QEMU's LoongArch
virt machine has no i8042 controller; the fallback port probe hits a page
fault and panics before reaching userspace.

Patch 4 handles QEMU not exiting after kexec on LoongArch.  QEMU provides
no EFI runtime services, so machine_restart() falls through to an infinite
idle loop.  QEMU_NEEDS_KILL=1 in loongarch.conf signals vmtest.sh to run
QEMU in the background, poll the serial output for the test verdict, and
kill QEMU once it appears, so the test completes unattended.

George Guo (4):
  LoongArch: kexec: add KHO support for FDT-based systems
  LoongArch: kexec: add KHO support for ACPI-only systems
  selftests/kho: add LoongArch vmtest support
  selftests/kho: handle QEMU not exiting after kexec on LoongArch

 arch/loongarch/Kconfig                     |   3 +
 arch/loongarch/include/asm/kexec.h         |   7 +
 arch/loongarch/kernel/machine_kexec.c      |  38 +++
 arch/loongarch/kernel/machine_kexec_file.c | 256 +++++++++++++++++++++
 arch/loongarch/kernel/setup.c              |  21 +-
 tools/testing/selftests/kho/loongarch.conf |  13 ++
 tools/testing/selftests/kho/vmtest.sh      |  35 ++-
 7 files changed, 365 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/kho/loongarch.conf

-- 
2.25.1


* [PATCH v2 1/4] LoongArch: kexec: add KHO support for FDT-based systems
  2026-05-29 14:32 George Guo
@ 2026-05-29 14:32 ` George Guo
  2026-05-29 14:32 ` [PATCH v2 2/4] LoongArch: kexec: add KHO support for ACPI-only systems George Guo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: George Guo @ 2026-05-29 14:32 UTC (permalink / raw)
  To: Huacai Chen, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Shuah Khan
  Cc: WANG Xuerui, Alexander Graf, loongarch, linux-kernel, kexec,
	linux-mm, linux-kselftest, George Guo, Kexin Liu

From: George Guo <guodongtai@kylinos.cn>

Enable Kexec Handover (KHO) on LoongArch64 for FDT-based systems.

- Kconfig: select ARCH_SUPPORTS_KEXEC_HANDOVER for CONFIG_64BIT
- kexec.h: add fdt/fdt_mem fields to kimage_arch to hold the KHO FDT
  kexec segment virtual and physical addresses
- machine_kexec_file.c: add kho_load_fdt() which copies the running
  kernel's FDT (initial_boot_params), appends linux,kho-fdt and
  linux,kho-scratch properties to /chosen, and loads the result as a
  kexec segment; called from load_other_segments().  Returns -EINVAL
  when initial_boot_params is NULL (ACPI-only boot) since that path
  requires separate handling.
- machine_kexec.c: before jumping to the new kernel, update the
  DEVICE_TREE_GUID entry in the EFI config table to point to the KHO
  FDT segment so the second kernel finds it via efi_fdt_pointer() and
  early_init_dt_check_kho() calls kho_populate()

Co-developed-by: Kexin Liu <liukexin@kylinos.cn>
Signed-off-by: Kexin Liu <liukexin@kylinos.cn>
Signed-off-by: George Guo <guodongtai@kylinos.cn>
---
 arch/loongarch/Kconfig                     |   3 +
 arch/loongarch/include/asm/kexec.h         |   4 +
 arch/loongarch/kernel/machine_kexec.c      |  27 +++++
 arch/loongarch/kernel/machine_kexec_file.c | 118 +++++++++++++++++++++
 4 files changed, 152 insertions(+)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 606597da46b8..d494418545f5 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -684,6 +684,9 @@ config ARCH_SUPPORTS_KEXEC
 config ARCH_SUPPORTS_KEXEC_FILE
 	def_bool 64BIT
 
+config ARCH_SUPPORTS_KEXEC_HANDOVER
+	def_bool 64BIT
+
 config ARCH_SELECTS_KEXEC_FILE
 	def_bool 64BIT
 	depends on KEXEC_FILE
diff --git a/arch/loongarch/include/asm/kexec.h b/arch/loongarch/include/asm/kexec.h
index 209fa43222e1..adf54bfcdd49 100644
--- a/arch/loongarch/include/asm/kexec.h
+++ b/arch/loongarch/include/asm/kexec.h
@@ -39,6 +39,10 @@ struct kimage_arch {
 	unsigned long efi_boot;
 	unsigned long cmdline_ptr;
 	unsigned long systable_ptr;
+#ifdef CONFIG_KEXEC_HANDOVER
+	void *fdt;		/* virtual address of KHO FDT segment buffer */
+	unsigned long fdt_mem;	/* physical address of KHO FDT segment */
+#endif
 };
 
 #ifdef CONFIG_KEXEC_FILE
diff --git a/arch/loongarch/kernel/machine_kexec.c b/arch/loongarch/kernel/machine_kexec.c
index d7fafda1d541..130f1adfa515 100644
--- a/arch/loongarch/kernel/machine_kexec.c
+++ b/arch/loongarch/kernel/machine_kexec.c
@@ -6,6 +6,7 @@
  */
 #include <linux/compiler.h>
 #include <linux/cpu.h>
+#include <linux/efi.h>
 #include <linux/kexec.h>
 #include <linux/crash_dump.h>
 #include <linux/delay.h>
@@ -284,6 +285,32 @@ void machine_kexec(struct kimage *image)
 	pr_notice("We will call new kernel at 0x%lx\n", start_addr);
 	pr_notice("Bye ...\n");
 
+#ifdef CONFIG_KEXEC_HANDOVER
+	/*
+	 * Point the EFI FDTPTR config table entry at the modified FDT so the
+	 * second kernel picks up the linux,kho-fdt and linux,kho-scratch
+	 * properties via early_init_dt_check_kho().
+	 */
+	if (internal->fdt_mem) {
+		/*
+		 * FDT-based system: DEVICE_TREE_GUID already exists in the EFI
+		 * config table; just update its pointer to our KHO FDT.
+		 */
+		efi_system_table_t *st =
+			(efi_system_table_t *)TO_CACHE(systable_ptr);
+		efi_config_table_t *ct =
+			(efi_config_table_t *)TO_CACHE((unsigned long)st->tables);
+		unsigned long i;
+
+		for (i = 0; i < st->nr_tables; i++) {
+			if (!efi_guidcmp(ct[i].guid, DEVICE_TREE_GUID)) {
+				ct[i].table = (void *)internal->fdt_mem;
+				break;
+			}
+		}
+	}
+#endif
+
 	/* Make reboot code buffer available to the boot CPU. */
 	flush_cache_all();
 
diff --git a/arch/loongarch/kernel/machine_kexec_file.c b/arch/loongarch/kernel/machine_kexec_file.c
index 5584b798ba46..bf1e8c1c7e70 100644
--- a/arch/loongarch/kernel/machine_kexec_file.c
+++ b/arch/loongarch/kernel/machine_kexec_file.c
@@ -13,7 +13,9 @@
 #include <linux/ioport.h>
 #include <linux/kernel.h>
 #include <linux/kexec.h>
+#include <linux/libfdt.h>
 #include <linux/memblock.h>
+#include <linux/of_fdt.h>
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/types.h>
@@ -32,6 +34,11 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 	image->elf_headers = NULL;
 	image->elf_headers_sz = 0;
 
+#ifdef CONFIG_KEXEC_HANDOVER
+	kvfree(image->arch.fdt);
+	image->arch.fdt = NULL;
+#endif
+
 	return kexec_image_post_load_cleanup_default(image);
 }
 
@@ -55,6 +62,111 @@ static void cmdline_add_initrd(struct kimage *image, unsigned long *cmdline_tmpl
 	*cmdline_tmplen += initrd_strlen;
 }
 
+#ifdef CONFIG_KEXEC_HANDOVER
+/*
+ * Add KHO metadata to an FDT /chosen node and load the FDT as a kexec
+ * segment.  The second kernel reads linux,kho-fdt and linux,kho-scratch
+ * from /chosen via early_init_dt_check_kho() and calls kho_populate().
+ *
+ */
+static int kho_load_fdt(struct kimage *image)
+{
+	void *fdt;
+	int ret, chosen_node;
+	size_t fdt_size;
+	struct kexec_buf kbuf = {
+		.image		= image,
+		.buf_min	= 0,
+		.buf_max	= ULONG_MAX,
+		.top_down	= true,
+	};
+
+	if (!image->kho.fdt || !image->kho.scratch)
+		return 0;
+
+	if (initial_boot_params) {
+		/*
+		 * FDT boot: copy the running kernel's FDT and append KHO
+		 * properties to /chosen.
+		 */
+
+		/*
+		 * Only two KHO properties are added to /chosen (linux,kho-fdt
+		 * and linux,kho-scratch), so SZ_1K of extra space is
+		 * sufficient.
+		 */
+		fdt_size = fdt_totalsize(initial_boot_params) + SZ_1K;
+		fdt = kvmalloc(fdt_size, GFP_KERNEL);
+		if (!fdt)
+			return -ENOMEM;
+
+		ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
+		if (ret < 0) {
+			pr_err("Failed to open FDT: %d\n", ret);
+			goto out_free;
+		}
+
+		chosen_node = fdt_path_offset(fdt, "/chosen");
+		if (chosen_node == -FDT_ERR_NOTFOUND) {
+			pr_debug("No /chosen node in FDT, creating one\n");
+			chosen_node = fdt_add_subnode(fdt,
+						      fdt_path_offset(fdt, "/"),
+						      "chosen");
+		}
+		if (chosen_node < 0) {
+			ret = chosen_node;
+			goto out_free;
+		}
+
+		/* Remove stale KHO properties left by a previous kexec load */
+		fdt_delprop(fdt, chosen_node, "linux,kho-fdt");
+		fdt_delprop(fdt, chosen_node, "linux,kho-scratch");
+
+		ret = fdt_appendprop_addrrange(fdt, 0, chosen_node,
+					       "linux,kho-fdt",
+					       image->kho.fdt, PAGE_SIZE);
+		if (ret)
+			goto out_free;
+
+		ret = fdt_appendprop_addrrange(fdt, 0, chosen_node,
+					       "linux,kho-scratch",
+					       image->kho.scratch->mem,
+					       image->kho.scratch->bufsz);
+		if (ret)
+			goto out_free;
+
+		/*
+		 * Shrink totalsize to the actual data size so the kexec segment
+		 * allocated by kexec_add_buffer() covers only the packed FDT data.
+		 * The slack added above for property insertion is part of the
+		 * kvmalloc'd buffer, which is freed by kimage_file_post_load_cleanup()
+		 * once the kexec image has been loaded.
+		 */
+		fdt_pack(fdt);
+
+		kbuf.buffer	= fdt;
+		kbuf.bufsz	= fdt_totalsize(fdt);
+		kbuf.memsz	= kbuf.bufsz;
+		kbuf.buf_align	= PAGE_SIZE;
+		kbuf.mem	= KEXEC_BUF_MEM_UNKNOWN;
+
+		ret = kexec_add_buffer(&kbuf);
+		if (ret)
+			goto out_free;
+
+		image->arch.fdt     = fdt;
+		image->arch.fdt_mem = kbuf.mem;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+
+out_free:
+	kvfree(fdt);
+	return ret;
+}
+#endif
+
 #ifdef CONFIG_CRASH_DUMP
 
 static int prepare_elf_headers(void **addr, unsigned long *sz)
@@ -230,6 +342,12 @@ int load_other_segments(struct kimage *image,
 	cmdline = modified_cmdline;
 	image->arch.cmdline_ptr = (unsigned long)cmdline;
 
+#ifdef CONFIG_KEXEC_HANDOVER
+	ret = kho_load_fdt(image);
+	if (ret)
+		goto out_err;
+#endif
+
 	return 0;
 
 out_err:
-- 
2.25.1



* [PATCH v2 2/4] LoongArch: kexec: add KHO support for ACPI-only systems
  2026-05-29 14:32 George Guo
  2026-05-29 14:32 ` [PATCH v2 1/4] LoongArch: kexec: add KHO support for FDT-based systems George Guo
@ 2026-05-29 14:32 ` George Guo
  2026-05-29 14:32 ` [PATCH v2 3/4] selftests/kho: add LoongArch vmtest support George Guo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: George Guo @ 2026-05-29 14:32 UTC (permalink / raw)
  To: Huacai Chen, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Shuah Khan
  Cc: WANG Xuerui, Alexander Graf, loongarch, linux-kernel, kexec,
	linux-mm, linux-kselftest, George Guo

From: George Guo <guodongtai@kylinos.cn>

On ACPI-only systems fdt_setup() returns early when it detects ACPI,
so initial_boot_params remains NULL and the FDT-based kho_load_fdt()
path cannot be used.

machine_kexec_file.c:
- kho_load_fdt(): add an else branch that builds a minimal FDT from
  scratch (SZ_1K) containing only a /chosen node with linux,kho-fdt
  and linux,kho-scratch properties, using the libfdt creation API.
- Since DEVICE_TREE_GUID is absent from the EFI config table on
  ACPI-only systems, build a new config table with DEVICE_TREE_GUID
  appended and load it as a kexec segment.  Store the result in
  kimage_arch so machine_kexec() can switch st->tables before jumping.
- arch_kimage_file_post_load_cleanup(): free the efi_tables kvmalloc
  buffer once the kexec image has been loaded.

machine_kexec.c:
- Before jumping, update the EFI system table pointer: for FDT-based
  systems update the existing DEVICE_TREE_GUID entry; for ACPI-only
  systems switch st->tables / st->nr_tables to the new extended table.

setup.c:
- fdt_setup(): when ACPI is detected, use efi_fdt_pointer() to detect
  whether this is a KHO kexec boot.  The first kernel switches the EFI
  config table to a new one that includes a DEVICE_TREE_GUID entry
  pointing to the minimal KHO FDT.  If found, call early_init_dt_scan()
  so early_init_dt_check_kho() can consume linux,kho-fdt and
  linux,kho-scratch, then reset initial_boot_params to NULL so the rest
  of the ACPI boot path is unaffected.

kexec.h:
- Add efi_tables, efi_tables_mem, efi_tables_cnt to kimage_arch.

Signed-off-by: George Guo <guodongtai@kylinos.cn>
---
 arch/loongarch/include/asm/kexec.h         |   3 +
 arch/loongarch/kernel/machine_kexec.c      |  11 ++
 arch/loongarch/kernel/machine_kexec_file.c | 162 +++++++++++++++++++--
 arch/loongarch/kernel/setup.c              |  21 ++-
 4 files changed, 184 insertions(+), 13 deletions(-)

diff --git a/arch/loongarch/include/asm/kexec.h b/arch/loongarch/include/asm/kexec.h
index adf54bfcdd49..e1abaf40b06a 100644
--- a/arch/loongarch/include/asm/kexec.h
+++ b/arch/loongarch/include/asm/kexec.h
@@ -42,6 +42,9 @@ struct kimage_arch {
 #ifdef CONFIG_KEXEC_HANDOVER
 	void *fdt;		/* virtual address of KHO FDT segment buffer */
 	unsigned long fdt_mem;	/* physical address of KHO FDT segment */
+	void *efi_tables;		/* new EFI config table buffer (virtual) */
+	unsigned long efi_tables_mem;	/* physical address of new EFI config table */
+	unsigned long efi_tables_cnt;	/* number of entries in new EFI config table */
 #endif
 };
 
diff --git a/arch/loongarch/kernel/machine_kexec.c b/arch/loongarch/kernel/machine_kexec.c
index 130f1adfa515..4ee9a433d2de 100644
--- a/arch/loongarch/kernel/machine_kexec.c
+++ b/arch/loongarch/kernel/machine_kexec.c
@@ -308,6 +308,17 @@ void machine_kexec(struct kimage *image)
 				break;
 			}
 		}
+	} else if (internal->efi_tables_mem) {
+		/*
+		 * ACPI-only system: DEVICE_TREE_GUID was not in the original
+		 * EFI config table.  Switch to the new table that was built in
+		 * kho_load_fdt() with DEVICE_TREE_GUID appended.
+		 */
+		efi_system_table_t *st =
+			(efi_system_table_t *)TO_CACHE(systable_ptr);
+
+		st->tables    = internal->efi_tables_mem;
+		st->nr_tables = internal->efi_tables_cnt;
 	}
 #endif
 
diff --git a/arch/loongarch/kernel/machine_kexec_file.c b/arch/loongarch/kernel/machine_kexec_file.c
index bf1e8c1c7e70..c1955d991061 100644
--- a/arch/loongarch/kernel/machine_kexec_file.c
+++ b/arch/loongarch/kernel/machine_kexec_file.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) "kexec_file: " fmt
 
+#include <linux/efi.h>
 #include <linux/ioport.h>
 #include <linux/kernel.h>
 #include <linux/kexec.h>
@@ -20,6 +21,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/vmalloc.h>
+#include <asm/addrspace.h>
 #include <asm/bootinfo.h>
 
 const struct kexec_file_ops * const kexec_file_loaders[] = {
@@ -37,6 +39,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
 #ifdef CONFIG_KEXEC_HANDOVER
 	kvfree(image->arch.fdt);
 	image->arch.fdt = NULL;
+	kvfree(image->arch.efi_tables);
+	image->arch.efi_tables = NULL;
 #endif
 
 	return kexec_image_post_load_cleanup_default(image);
@@ -68,6 +72,13 @@ static void cmdline_add_initrd(struct kimage *image, unsigned long *cmdline_tmpl
  * segment.  The second kernel reads linux,kho-fdt and linux,kho-scratch
  * from /chosen via early_init_dt_check_kho() and calls kho_populate().
  *
+ * On FDT-based systems (initial_boot_params != NULL), the current FDT is
+ * copied and the KHO properties are appended to /chosen.
+ *
+ * On ACPI-only systems (initial_boot_params == NULL), a minimal FDT
+ * containing only /chosen is built from scratch.  machine_kexec() updates
+ * the EFI config table DEVICE_TREE_GUID entry to point to this segment so
+ * that the second kernel's fdt_setup() can find and parse it.
  */
 static int kho_load_fdt(struct kimage *image)
 {
@@ -143,24 +154,151 @@ static int kho_load_fdt(struct kimage *image)
 		 * once the kexec image has been loaded.
 		 */
 		fdt_pack(fdt);
+	} else {
+		/*
+		 * ACPI boot: build a minimal FDT containing only /chosen with
+		 * the two KHO properties.  No system FDT is available to copy.
+		 */
 
-		kbuf.buffer	= fdt;
-		kbuf.bufsz	= fdt_totalsize(fdt);
-		kbuf.memsz	= kbuf.bufsz;
-		kbuf.buf_align	= PAGE_SIZE;
-		kbuf.mem	= KEXEC_BUF_MEM_UNKNOWN;
+		__be64 prop[2];
 
-		ret = kexec_add_buffer(&kbuf);
-		if (ret)
+		fdt_size = SZ_1K;
+		fdt = kvmalloc(fdt_size, GFP_KERNEL);
+		if (!fdt)
+			return -ENOMEM;
+
+		ret = fdt_create(fdt, fdt_size);
+		if (ret < 0)
+			goto out_free;
+		ret = fdt_finish_reservemap(fdt);
+		if (ret < 0)
+			goto out_free;
+		ret = fdt_begin_node(fdt, "");	/* root */
+		if (ret < 0)
+			goto out_free;
+		ret = fdt_property_u32(fdt, "#address-cells", 2);
+		if (ret < 0)
+			goto out_free;
+		ret = fdt_property_u32(fdt, "#size-cells", 2);
+		if (ret < 0)
+			goto out_free;
+		ret = fdt_begin_node(fdt, "chosen");
+		if (ret < 0)
 			goto out_free;
 
-		image->arch.fdt     = fdt;
-		image->arch.fdt_mem = kbuf.mem;
-		return 0;
-	} else {
-		return -EINVAL;
+		prop[0] = cpu_to_be64(image->kho.fdt);
+		prop[1] = cpu_to_be64(PAGE_SIZE);
+		ret = fdt_property(fdt, "linux,kho-fdt", prop, sizeof(prop));
+		if (ret < 0)
+			goto out_free;
+
+		prop[0] = cpu_to_be64(image->kho.scratch->mem);
+		prop[1] = cpu_to_be64(image->kho.scratch->bufsz);
+		ret = fdt_property(fdt, "linux,kho-scratch", prop, sizeof(prop));
+		if (ret < 0)
+			goto out_free;
+
+		ret = fdt_end_node(fdt);	/* chosen */
+		if (ret < 0)
+			goto out_free;
+		ret = fdt_end_node(fdt);	/* root */
+		if (ret < 0)
+			goto out_free;
+		ret = fdt_finish(fdt);
+		if (ret < 0)
+			goto out_free;
 	}
 
+	kbuf.buffer	= fdt;
+	kbuf.bufsz	= fdt_totalsize(fdt);
+	kbuf.memsz	= kbuf.bufsz;
+	kbuf.buf_align	= PAGE_SIZE;
+	kbuf.mem	= KEXEC_BUF_MEM_UNKNOWN;
+
+	ret = kexec_add_buffer(&kbuf);
+	if (ret)
+		goto out_free;
+
+	image->arch.fdt     = fdt;
+	image->arch.fdt_mem = kbuf.mem;
+
+	/*
+	 * On ACPI-only systems DEVICE_TREE_GUID is not in the EFI config
+	 * table, so the second kernel's efi_fdt_pointer() cannot locate the
+	 * KHO FDT.  Build a new EFI config table with DEVICE_TREE_GUID added
+	 * and load it as a kexec segment; machine_kexec() will update
+	 * st->tables / st->nr_tables to point to it before jumping.
+	 */
+
+	/*
+	 * fw_arg2 is the EFI system table physical address passed by the
+	 * firmware/bootloader.  Use it directly here because
+	 * image->arch.systable_ptr is set later in machine_kexec_prepare(),
+	 * which runs after load_other_segments() / kho_load_fdt().
+	 */
+	if (!initial_boot_params && fw_arg2) {
+		efi_system_table_t *st =
+			(efi_system_table_t *)TO_CACHE(fw_arg2);
+		efi_config_table_t *ct =
+			(efi_config_table_t *)TO_CACHE((unsigned long)st->tables);
+		unsigned long i;
+		bool found = false;
+
+		/*
+		 * Scan the original config table;
+		 * DEVICE_TREE_GUID is absent on ACPI-only systems.
+		 */
+		for (i = 0; i < st->nr_tables; i++) {
+			if (!efi_guidcmp(ct[i].guid, DEVICE_TREE_GUID)) {
+				found = true;
+				break;
+			}
+		}
+
+		if (!found) {
+			size_t old_sz = st->nr_tables * sizeof(efi_config_table_t);
+			size_t new_sz = old_sz + sizeof(efi_config_table_t);
+			efi_config_table_t *new_ct;
+			struct kexec_buf tbuf = {
+				.image		= image,
+				.buf_min	= 0,
+				.buf_max	= ULONG_MAX,
+				.top_down	= true,
+			};
+
+			/*
+			 * Allocate a new table with n+1 entries and append
+			 * the DEVICE_TREE_GUID entry.
+			 */
+			new_ct = kvmalloc(new_sz, GFP_KERNEL);
+			if (!new_ct)
+				return -ENOMEM;
+
+			memcpy(new_ct, ct, old_sz);
+			new_ct[st->nr_tables].guid  = DEVICE_TREE_GUID;
+			new_ct[st->nr_tables].table = (void *)image->arch.fdt_mem;
+
+			/* Register the new config table as a kexec segment. */
+			tbuf.buffer   = new_ct;
+			tbuf.bufsz    = new_sz;
+			tbuf.memsz    = new_sz;
+			tbuf.buf_align = sizeof(void *);
+			tbuf.mem      = KEXEC_BUF_MEM_UNKNOWN;
+
+			ret = kexec_add_buffer(&tbuf);
+			if (ret) {
+				kvfree(new_ct);
+				return ret;
+			}
+
+			image->arch.efi_tables     = new_ct;
+			image->arch.efi_tables_mem = tbuf.mem;
+			image->arch.efi_tables_cnt = st->nr_tables + 1;
+		}
+	}
+
+	return 0;
+
 out_free:
 	kvfree(fdt);
 	return ret;
diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
index 839b23edee87..c82067d1dc75 100644
--- a/arch/loongarch/kernel/setup.c
+++ b/arch/loongarch/kernel/setup.c
@@ -286,8 +286,27 @@ static void __init fdt_setup(void)
 	void *fdt_pointer;
 
 	/* ACPI-based systems do not require parsing fdt */
-	if (acpi_os_get_root_pointer())
+	if (acpi_os_get_root_pointer()) {
+#ifdef CONFIG_KEXEC_HANDOVER
+		/*
+		 * On a KHO kexec boot the first kernel builds a minimal FDT
+		 * containing only /chosen with linux,kho-fdt and
+		 * linux,kho-scratch, and switches the EFI config table to a
+		 * new one that includes a DEVICE_TREE_GUID entry pointing to
+		 * it.  Use efi_fdt_pointer() to detect this case.
+		 *
+		 * Call early_init_dt_scan() to let early_init_dt_check_kho()
+		 * consume the KHO data, then reset initial_boot_params so the
+		 * rest of the ACPI boot path is not confused by this FDT.
+		 */
+		fdt_pointer = efi_fdt_pointer();
+		if (fdt_pointer && !fdt_check_header(fdt_pointer)) {
+			early_init_dt_scan(fdt_pointer, __pa(fdt_pointer));
+			initial_boot_params = NULL;
+		}
+#endif
 		return;
+	}
 
 	/* Prefer to use built-in dtb, checking its legality first. */
 	if (IS_ENABLED(CONFIG_BUILTIN_DTB) && !fdt_check_header(__dtb_start))
-- 
2.25.1



* [PATCH v2 3/4] selftests/kho: add LoongArch vmtest support
  2026-05-29 14:32 George Guo
  2026-05-29 14:32 ` [PATCH v2 1/4] LoongArch: kexec: add KHO support for FDT-based systems George Guo
  2026-05-29 14:32 ` [PATCH v2 2/4] LoongArch: kexec: add KHO support for ACPI-only systems George Guo
@ 2026-05-29 14:32 ` George Guo
  2026-05-29 14:32 ` [PATCH v2 4/4] selftests/kho: handle QEMU not exiting after kexec on LoongArch George Guo
  2026-05-31  9:28 ` Mike Rapoport
  4 siblings, 0 replies; 7+ messages in thread
From: George Guo @ 2026-05-29 14:32 UTC (permalink / raw)
  To: Huacai Chen, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Shuah Khan
  Cc: WANG Xuerui, Alexander Graf, loongarch, linux-kernel, kexec,
	linux-mm, linux-kselftest, George Guo, Kexin Liu

From: George Guo <guodongtai@kylinos.cn>

Add loongarch.conf to configure QEMU's LoongArch virt machine with a
la464 CPU, enable the 8250 serial console, and set the kernel image
to vmlinux.efi. Extend vmtest.sh to recognize loongarch64 as a
supported target and map it to the 'loongarch' kernel arch name.

QEMU's LoongArch virt machine provides no ACPI tables and relies on
FDT to describe hardware. Without 'earlycon' on the kernel command
line, the FDT is not scanned for a console UART, no output reaches
the console, and vmtest.sh's console log stays empty causing the test
to always fail. Add 'earlycon' to KERNEL_CMDLINE in loongarch.conf
to fix this.

QEMU's LoongArch virt machine has no i8042 PS/2 controller.  When PNP
detection finds nothing, i8042_init() falls back to probing the ports
directly.  On LoongArch the I/O ports are memory-mapped, and the i8042
port addresses are not backed by any device on the virt machine, so
i8042_flush() takes a page fault and the kernel panics:

  i8042: PNP: No PS/2 controller found.
  i8042: Probing ports directly.
  CPU 0 Unable to handle kernel paging request at virtual address ffff800000008064
  ERA: i8042_flush+0x50/0x198
   RA: i8042_init+0x2a8/0x35c
  Kernel panic - not syncing: Attempted to kill init!

Disable SERIO_I8042 and its dependents (KEYBOARD_ATKBD, MOUSE_PS2) in
the QEMU_KCONFIG fragment to prevent the driver from being built.

Co-developed-by: Kexin Liu <liukexin@kylinos.cn>
Signed-off-by: Kexin Liu <liukexin@kylinos.cn>
Signed-off-by: George Guo <guodongtai@kylinos.cn>
---
 tools/testing/selftests/kho/loongarch.conf | 10 ++++++++++
 tools/testing/selftests/kho/vmtest.sh      |  3 ++-
 2 files changed, 12 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/kho/loongarch.conf

diff --git a/tools/testing/selftests/kho/loongarch.conf b/tools/testing/selftests/kho/loongarch.conf
new file mode 100644
index 000000000000..0145cb49e5b2
--- /dev/null
+++ b/tools/testing/selftests/kho/loongarch.conf
@@ -0,0 +1,10 @@
+QEMU_CMD="qemu-system-loongarch64 -M virt -cpu la464"
+QEMU_KCONFIG="
+CONFIG_SERIAL_8250=y
+CONFIG_SERIAL_8250_CONSOLE=y
+# CONFIG_KEYBOARD_ATKBD is not set
+# CONFIG_MOUSE_PS2 is not set
+# CONFIG_SERIO_I8042 is not set
+"
+KERNEL_IMAGE="vmlinux.efi"
+KERNEL_CMDLINE="console=ttyS0 earlycon"
diff --git a/tools/testing/selftests/kho/vmtest.sh b/tools/testing/selftests/kho/vmtest.sh
index 49fdac8e8b15..a6ae9ac09595 100755
--- a/tools/testing/selftests/kho/vmtest.sh
+++ b/tools/testing/selftests/kho/vmtest.sh
@@ -21,7 +21,7 @@ Options:
 	-d)	path to the kernel build directory
 	-j)	number of jobs for compilation, similar to -j in make
 	-t)	run test for target_arch, requires CROSS_COMPILE set
-		supported targets: aarch64, x86_64
+		supported targets: aarch64, x86_64, loongarch64
 	-h)	display this help
 EOF
 }
@@ -123,6 +123,7 @@ function target_to_arch() {
 	case $target in
 	     aarch64) echo "arm64" ;;
 	     x86_64) echo "x86" ;;
+	     loongarch64) echo "loongarch" ;;
 	     *) skip "architecture $target is not supported"
 	esac
 }
-- 
2.25.1
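
The target-name mapping that the vmtest.sh hunk above extends can be
tried on its own.  This is a minimal sketch, not the script itself: the
skip() helper here is a simplified stand-in (an assumption) that returns
the kselftest skip code 4 instead of exiting, so the function can be
called repeatedly from an interactive shell.

```shell
#!/bin/bash
# Simplified stand-in for vmtest.sh's skip helper (assumption): report the
# reason on stderr and return the kselftest skip code instead of exiting.
skip() {
	echo "SKIP: $*" >&2
	return 4
}

# Map a QEMU target name to the kernel ARCH= name, mirroring the case
# statement in target_to_arch() with the loongarch64 entry added.
target_to_arch() {
	local target="$1"

	case "$target" in
	aarch64) echo "arm64" ;;
	x86_64) echo "x86" ;;
	loongarch64) echo "loongarch" ;;
	*) skip "architecture $target is not supported" ;;
	esac
}

target_to_arch loongarch64
```

Any target without an entry falls through to skip(), which is why the
help text and the case statement have to be extended together.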


* [PATCH v2 4/4] selftests/kho: handle QEMU not exiting after kexec on LoongArch
  2026-05-29 14:32 George Guo
                   ` (2 preceding siblings ...)
  2026-05-29 14:32 ` [PATCH v2 3/4] selftests/kho: add LoongArch vmtest support George Guo
@ 2026-05-29 14:32 ` George Guo
  2026-05-31  9:28 ` Mike Rapoport
  4 siblings, 0 replies; 7+ messages in thread
From: George Guo @ 2026-05-29 14:32 UTC (permalink / raw)
  To: Huacai Chen, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Shuah Khan
  Cc: WANG Xuerui, Alexander Graf, loongarch, linux-kernel, kexec,
	linux-mm, linux-kselftest, George Guo

From: George Guo <guodongtai@kylinos.cn>

On LoongArch, QEMU provides only a minimal EFI stub with no runtime
services and no ACPI tables, so machine_restart() falls through to
its infinite idle loop and QEMU never exits after kexec.  The test
result is already printed to the serial console and vmtest.sh reports
success, but the user must press Ctrl+C to get the prompt back.

Add QEMU_NEEDS_KILL=1 to loongarch.conf so the test completes
unattended.

Signed-off-by: George Guo <guodongtai@kylinos.cn>
---
 tools/testing/selftests/kho/loongarch.conf |  3 ++
 tools/testing/selftests/kho/vmtest.sh      | 32 ++++++++++++++++++----
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kho/loongarch.conf b/tools/testing/selftests/kho/loongarch.conf
index 0145cb49e5b2..02a6add633f1 100644
--- a/tools/testing/selftests/kho/loongarch.conf
+++ b/tools/testing/selftests/kho/loongarch.conf
@@ -8,3 +8,6 @@ CONFIG_SERIAL_8250_CONSOLE=y
 "
 KERNEL_IMAGE="vmlinux.efi"
 KERNEL_CMDLINE="console=ttyS0 earlycon"
+# QEMU never exits after kexec on LoongArch (no EFI runtime services),
+# so vmtest.sh must kill it once the test verdict appears.
+QEMU_NEEDS_KILL=1
diff --git a/tools/testing/selftests/kho/vmtest.sh b/tools/testing/selftests/kho/vmtest.sh
index a6ae9ac09595..821e13fa69a5 100755
--- a/tools/testing/selftests/kho/vmtest.sh
+++ b/tools/testing/selftests/kho/vmtest.sh
@@ -107,12 +107,32 @@ function run_qemu() {
 
 	cmdline="$cmdline kho=on panic=-1"
 
-	$qemu_cmd -m 1G -smp 2 -no-reboot -nographic -nodefaults \
-		  -accel kvm -accel hvf -accel tcg  \
-		  -serial file:"$serial" \
-		  -append "$cmdline" \
-		  -kernel "$kernel" \
-		  -initrd "$initrd"
+	local qemu_args=(
+		-m 1G -smp 2 -no-reboot -nographic -nodefaults
+		-accel kvm -accel hvf -accel tcg
+		-serial file:"$serial"
+		-append "$cmdline"
+		-kernel "$kernel"
+		-initrd "$initrd"
+	)
+
+	# If the target does not exit QEMU after kexec (e.g. no EFI runtime
+	# services), the conf file sets QEMU_NEEDS_KILL=1.  Run QEMU in the
+	# background, poll for the test verdict, then kill it.
+	if [[ "${QEMU_NEEDS_KILL:-0}" == "1" ]]; then
+		$qemu_cmd "${qemu_args[@]}" &
+		local qemu_pid=$!
+		local remaining=100
+		while ((remaining-- > 0)); do
+			grep -q "KHO restore succeeded\|KHO restore failed" \
+				"$serial" 2>/dev/null && break
+			sleep 1
+		done
+		kill "$qemu_pid" 2>/dev/null
+		wait "$qemu_pid" 2>/dev/null || true
+	else
+		$qemu_cmd "${qemu_args[@]}"
+	fi
 
 	grep "KHO restore succeeded" "$serial" &> /dev/null || fail "KHO failed"
 }
-- 
2.25.1
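
The poll step in the hunk above can be exercised without QEMU.  The
sketch below factors it into a standalone function; wait_for_verdict and
its two parameters are hypothetical names chosen for illustration, while
the two verdict strings are the ones the patch greps for.

```shell
#!/bin/bash
# Poll a serial log until a KHO verdict line appears or the time budget is
# used up.  Returns 0 if a verdict was seen, 1 otherwise.
# wait_for_verdict is a hypothetical helper, not part of vmtest.sh.
wait_for_verdict() {
	local log="$1"
	local remaining="$2"	# time budget in seconds

	while [ "$remaining" -gt 0 ]; do
		# Either verdict ends the wait; the caller decides pass/fail.
		if grep -q -e "KHO restore succeeded" \
			   -e "KHO restore failed" "$log" 2>/dev/null; then
			return 0
		fi
		sleep 1
		remaining=$((remaining - 1))
	done
	return 1
}

# Usage: fake a serial log that already contains the verdict.
demo_log=$(mktemp)
echo "KHO restore succeeded" > "$demo_log"
wait_for_verdict "$demo_log" 5 && echo "verdict found"
rm -f "$demo_log"
```

The function only reports whether a verdict line showed up; as in the
patch, the final pass/fail decision stays with the existing grep for
"KHO restore succeeded".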



* Re:
  2026-05-29 14:32 George Guo
                   ` (3 preceding siblings ...)
  2026-05-29 14:32 ` [PATCH v2 4/4] selftests/kho: handle QEMU not exiting after kexec on LoongArch George Guo
@ 2026-05-31  9:28 ` Mike Rapoport
  4 siblings, 0 replies; 7+ messages in thread
From: Mike Rapoport @ 2026-05-31  9:28 UTC (permalink / raw)
  To: George Guo
  Cc: Huacai Chen, Pasha Tatashin, Pratyush Yadav, Shuah Khan,
	WANG Xuerui, Alexander Graf, loongarch, linux-kernel, kexec,
	linux-mm, linux-kselftest, George Guo

Hi,

On Fri, May 29, 2026 at 10:32:34PM +0800, George Guo wrote:
> From: George Guo <guodongtai@kylinos.cn>
> 
> WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
> #26: 
> containing only /chosen with the two KHO properties.  Since DEVICE_TREE_GUID
> 
> ERROR: Invalid commit separator - some tools may have problems applying this
> #34: 
> -------------------------------
> 
> total: 2 errors, 1 warnings, 0 lines checked

I believe something went wrong with sending the emails :)

Please include changes from revision to revision in the cover letter next
time.

You also may want to consider using b4 prep/send for sending the patches.

> >
> Date: Fri, 29 May 2026 21:54:01 +0800
> Subject: [PATCH v2 0/4] LoongArch: add KHO support and selftests
> 
> This series adds Kexec Handover (KHO) support for LoongArch and extends
> the KHO selftest infrastructure to run on LoongArch under QEMU.
> 
> KHO passes metadata (the KHO state FDT and scratch area addresses) to the
> second kernel via the FDT /chosen node, using the linux,kho-fdt and
> linux,kho-scratch properties that drivers/of/kexec.c:kho_add_chosen()
> writes and drivers/of/fdt.c:early_init_dt_check_kho() reads.
> 
> Selftest support (patches 3-4):
> 
> Patch 3 adds loongarch.conf and extends vmtest.sh to recognise loongarch64
> as a build target.  The LoongArch virt machine is FDT-only (no ACPI), so
> 'earlycon' must appear on the kernel cmdline or the console UART is never
> discovered.  PS/2 input devices are also disabled since QEMU's LoongArch
> virt machine has no i8042 controller; the fallback port probe hits a page
> fault and panics before reaching userspace.

Patch 3 should not introduce regressions
 
> Patch 4 handles QEMU not exiting after kexec on LoongArch.  QEMU provides
> no EFI runtime services, so machine_restart() falls through to an infinite
> idle loop.  QEMU_NEEDS_KILL=1 in loongarch.conf signals vmtest.sh to run
> QEMU in the background, poll the serial output for the test verdict, and
> kill QEMU once it appears, so the test completes unattended.

And patch 4 should be folded into patch 3.
As I said during the v1 review, the whole wait-and-kill loop can be replaced
with the 'timeout' command, which does not need to be specific to LoongArch.
The actual timeout value might, though.
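
A rough sketch of what the 'timeout' approach could look like, not taken
from the series itself. The names run_with_timeout and QEMU_TIMEOUT are
hypothetical, and it assumes timeout(1) from coreutils is available on the
host running the selftest:

```shell
#!/usr/bin/env bash
# Sketch only: run_with_timeout and QEMU_TIMEOUT are hypothetical names,
# not part of vmtest.sh.

# Run a command for at most $1 seconds using timeout(1).
# timeout exits with 124 when it had to stop the command; that case is
# mapped to success here because the verdict is read from the serial log
# afterwards, not from QEMU's exit status.
run_with_timeout() {
	local secs=$1 rc=0
	shift
	timeout "$secs" "$@" || rc=$?
	((rc == 124)) && rc=0
	return "$rc"
}

# Possible use in the function that launches QEMU, replacing the
# background/poll/kill branch:
#   run_with_timeout "${QEMU_TIMEOUT:-100}" $qemu_cmd "${qemu_args[@]}"
#   grep "KHO restore succeeded" "$serial" &> /dev/null || fail "KHO failed"
```

The trade-off is that a target which never exits QEMU always runs for the
full timeout, whereas targets that exit on their own finish early as before.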

Will wait for v3 for more detailed review.
 
> George Guo (4):
>   LoongArch: kexec: add KHO support for FDT-based systems
>   LoongArch: kexec: add KHO support for ACPI-only systems
>   selftests/kho: add LoongArch vmtest support
>   selftests/kho: handle QEMU not exiting after kexec on LoongArch
> 
>  arch/loongarch/Kconfig                     |   3 +
>  arch/loongarch/include/asm/kexec.h         |   7 +
>  arch/loongarch/kernel/machine_kexec.c      |  38 +++
>  arch/loongarch/kernel/machine_kexec_file.c | 256 +++++++++++++++++++++
>  arch/loongarch/kernel/setup.c              |  21 +-
>  tools/testing/selftests/kho/loongarch.conf |  13 ++
>  tools/testing/selftests/kho/vmtest.sh      |  35 ++-
>  7 files changed, 365 insertions(+), 8 deletions(-)
>  create mode 100644 tools/testing/selftests/kho/loongarch.conf
> 
> -- 
> 2.25.1
> 
> 

-- 
Sincerely yours,
Mike.


* (no subject)
@ 2021-06-06 19:19 Davidlohr Bueso
  2021-06-07 16:02 ` André Almeida
  0 siblings, 1 reply; 7+ messages in thread
From: Davidlohr Bueso @ 2021-06-06 19:19 UTC (permalink / raw)
  To: André Almeida
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	krisman, pgriffais, z.figura12, joel, malteskarupke, linux-api,
	fweimer, libc-alpha, linux-kselftest, shuah, acme, corbet,
	Peter Oskolkov, Andrey Semashev, mtk.manpages

Bcc:
Subject: Re: [PATCH v4 07/15] docs: locking: futex2: Add documentation
Reply-To:
In-Reply-To: <20210603195924.361327-8-andrealmeid@collabora.com>

On Thu, 03 Jun 2021, André Almeida wrote:

>Add a new documentation file specifying both userspace API and internal
>implementation details of futex2 syscalls.

I think equally important would be to provide a manpage for each new
syscall you are introducing, and keep mtk in the loop, as in the past he
extensively documented and improved futex manpages, and overall has a
lot of experience with dealing with kernel interfaces.

Thanks,
Davidlohr

>
>Signed-off-by: André Almeida <andrealmeid@collabora.com>
>---
> Documentation/locking/futex2.rst | 198 +++++++++++++++++++++++++++++++
> Documentation/locking/index.rst  |   1 +
> 2 files changed, 199 insertions(+)
> create mode 100644 Documentation/locking/futex2.rst
>
>diff --git a/Documentation/locking/futex2.rst b/Documentation/locking/futex2.rst
>new file mode 100644
>index 000000000000..2f74d7c97a55
>--- /dev/null
>+++ b/Documentation/locking/futex2.rst
>@@ -0,0 +1,198 @@
>+.. SPDX-License-Identifier: GPL-2.0
>+
>+======
>+futex2
>+======
>+
>+:Author: André Almeida <andrealmeid@collabora.com>
>+
>+futex, or fast user mutex, is a set of syscalls to allow userspace to create
>+performant synchronization mechanisms, such as mutexes, semaphores and
>+condition variables in userspace. C standard libraries, like glibc, use it
>+as a means to implement more high level interfaces like pthreads.
>+
>+The interface
>+=============
>+
>+uAPI functions
>+--------------
>+
>+.. kernel-doc:: kernel/futex2.c
>+   :identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue
>+
>+uAPI structures
>+---------------
>+
>+.. kernel-doc:: include/uapi/linux/futex.h
>+
>+The ``flag`` argument
>+---------------------
>+
>+The flag is used to specify the size of the futex word
>+(FUTEX_[8, 16, 32, 64]). It's mandatory to define one, since there's no
>+default size.
>+
>+By default, the timeout uses a monotonic clock, but can be used as a realtime
>+one by using the FUTEX_REALTIME_CLOCK flag.
>+
>+By default, futexes are of the private type, that means that this user address
>+will be accessed by threads that share the same memory region. This allows for
>+some internal optimizations, so they are faster. However, if the address needs
>+to be shared with different processes (like using ``mmap()`` or ``shm()``), they
>+need to be defined as shared and the flag FUTEX_SHARED_FLAG is used to set that.
>+
>+By default, the operation has no NUMA-awareness, meaning that the user can't
>+choose the memory node where the kernel side futex data will be stored. The
>+user can choose the node where it wants to operate by setting the
>+FUTEX_NUMA_FLAG and using the following structure (where X can be 8, 16, 32 or
>+64)::
>+
>+ struct futexX_numa {
>+         __uX value;
>+         __sX hint;
>+ };
>+
>+This structure should be passed at the ``void *uaddr`` of futex functions. The
>+address of the structure will be the one waited on/woken on, and the
>+``value`` will be compared to ``val`` as usual. The ``hint`` member is used to
>+define which node the futex will use. When waiting, the futex will be
>+registered on a kernel-side table stored on that node; when waking, the futex
>+will be searched for on that given table. That means that there's no redundancy
>+between tables, and the wrong ``hint`` value will lead to undesired behavior.
>+Userspace is responsible for dealing with node migrations issues that may
>+occur. ``hint`` can range from [0, MAX_NUMA_NODES), for specifying a node, or
>+-1, to use the same node the current process is using.
>+
>+When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be stored on a
>+global table allocated on the first node.
>+
>+The ``timo`` argument
>+---------------------
>+
>+As per the Y2038 work done in the kernel, new interfaces shouldn't add timeout
>+options known to be buggy. Given that, ``timo`` should be a 64-bit timeout at
>+all platforms, using an absolute timeout value.
>+
>+Implementation
>+==============
>+
>+The internal implementation follows a similar design to the original futex.
>+Given that we want to replicate the same external behavior of current futex,
>+this should be somewhat expected.
>+
>+Waiting
>+-------
>+
>+For the wait operations, they are all treated as if you want to wait on N
>+futexes, so the path for futex_wait and futex_waitv is basically the same.
>+For both syscalls, the first step is to prepare an internal list for the list
>+of futexes to wait for (using struct futexv_head). For futex_wait() calls, this
>+list will have a single object.
>+
>+We have a hash table, where waiters register themselves before sleeping. Then
>+the wake function checks this table looking for waiters at uaddr.  The hash
>+bucket to be used is determined by a struct futex_key, that stores information
>+to uniquely identify an address from a given process. Given the huge address
>+space, there'll be hash collisions, so we store information to be later used on
>+collision treatment.
>+
>+First, for every futex we want to wait on, we check if (``*uaddr == val``).
>+This check is done holding the bucket lock, so we are correctly serialized with
>+any futex_wake() calls. If any waiter fails the check above, we dequeue all
>+futexes. The check (``*uaddr == val``) can fail for two reasons:
>+
>+- The values are different, and we return -EAGAIN. However, if while
>+  dequeueing we found that some futexes were awakened, we prioritize this
>+  and return success.
>+
>+- When trying to access the user address, we do so with page faults
>+  disabled because we are holding a bucket's spin lock (and can't sleep
>+  while holding a spin lock). If there's an error, it might be a page
>+  fault, or an invalid address. We release the lock, dequeue everyone
>+  (because it's illegal to sleep while there are futexes enqueued, we
>+  could lose wakeups) and try again with page fault enabled. If we
>+  succeed, this means that the address is valid, but we need to do
>+  all the work again. For serialization reasons, we need to have the
>+  spin lock when getting the user value. Additionally, for shared
>+  futexes, we also need to recalculate the hash, since the underlying
>+  mapping mechanisms could have changed when dealing with page fault.
>+  If, even with page fault enabled, we can't access the address, it
>+  means it's an invalid user address, and we return -EFAULT. For this
>+  case, we prioritize the error, even if some futexes were awakened.
>+
>+If the check is OK, they are enqueued on a linked list in our bucket, and
>+proceed to the next one. If all waiters succeed, we put the thread to sleep
>+until a futex_wake() call, timeout expires or we get a signal. After waking up,
>+we dequeue everyone, and check if some futex was awakened. This dequeue is done
>+by iteratively walking at each element of struct futex_head list.
>+
>+All enqueuing/dequeuing operations require holding the bucket lock, to avoid
>+racing while modifying the list.
>+
>+Waking
>+------
>+
>+We get the bucket that's storing the waiters at uaddr, and wake the required
>+number of waiters, checking for hash collision.
>+
>+There's an optimization that makes futex_wake() not take the bucket lock if
>+there's no one to be woken on that bucket. It checks an atomic counter that each
>+bucket has; if it is 0, the syscall exits. In order for this to work, the
>+waiter thread increases it before taking the lock, so the wake thread will
>+correctly see that there's someone waiting and will continue the path to take
>+the bucket lock. To get the correct serialization, the waiter issues a memory
>+barrier after increasing the bucket counter and the waker issues a memory
>+barrier before checking it.
>+
>+Requeuing
>+---------
>+
>+The requeue path first checks for each struct futex_requeue and their flags.
>+Then, it will compare the expected value with the one at uaddr1::uaddr.
>+Following the same serialization explained at Waking_, we increase the atomic
>+counter for the bucket of uaddr2 before taking the lock. We need to have both
>+bucket locks at the same time so we don't race with other futex operations. To
>+ensure the locks are taken in the same order for all threads (and thus avoiding
>+deadlocks), every requeue operation takes the "smaller" bucket first, when
>+comparing both addresses.
>+
>+If the compare with user value succeeds, we proceed by waking ``nr_wake``
>+futexes, and then requeuing ``nr_requeue`` from bucket of uaddr1 to the uaddr2.
>+This consists in a simple list deletion/addition and replacing the old futex key
>+with the new one.
>+
>+Futex keys
>+----------
>+
>+There are two types of futexes: private and shared ones. The private are futexes
>+meant to be used by threads that share the same memory space, are easier to be
>+uniquely identified and thus can have some performance optimization. The
>+elements for identifying one are: the start address of the page where the
>+address is, the address offset within the page and the current->mm pointer.
>+
>+Now, for uniquely identifying a shared futex:
>+
>+- If the page containing the user address is an anonymous page, we can
>+  just use the same data used for private futexes (the start address of
>+  the page, the address offset within the page and the current->mm
>+  pointer); that will be enough for uniquely identifying such futex. We
>+  also set one bit at the key to differentiate if a private futex is
>+  used on the same address (mixing shared and private calls does not
>+  work).
>+
>+- If the page is file-backed, current->mm may not be the same one for
>+  every user of this futex, so we need to use other data: the
>+  page->index, a UUID for the struct inode and the offset within the
>+  page.
>+
>+Note that members of futex_key don't have any particular meaning after they
>+are part of the struct - they are just bytes to identify a futex.  Given that,
>+we don't need to use a particular name or type that matches the original data,
>+we only need to care about the bitsize of each component and make both private
>+and shared fit in the same memory space.
>+
>+Source code documentation
>+=========================
>+
>+.. kernel-doc:: kernel/futex2.c
>+   :no-identifiers: sys_futex_wait sys_futex_wake sys_futex_waitv sys_futex_requeue
>diff --git a/Documentation/locking/index.rst b/Documentation/locking/index.rst
>index 7003bd5aeff4..9bf03c7fa1ec 100644
>--- a/Documentation/locking/index.rst
>+++ b/Documentation/locking/index.rst
>@@ -24,6 +24,7 @@ locking
>     percpu-rw-semaphore
>     robust-futexes
>     robust-futex-ABI
>+    futex2
>
> .. only::  subproject and html
>
>--
>2.31.1
>


* Re:
  2021-06-06 19:19 Davidlohr Bueso
@ 2021-06-07 16:02 ` André Almeida
  0 siblings, 0 replies; 7+ messages in thread
From: André Almeida @ 2021-06-07 16:02 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
	linux-kernel, Steven Rostedt, Sebastian Andrzej Siewior, kernel,
	krisman, pgriffais, z.figura12, joel, malteskarupke, linux-api,
	fweimer, libc-alpha, linux-kselftest, shuah, acme, corbet,
	Peter Oskolkov, Andrey Semashev, mtk.manpages

Às 16:19 de 06/06/21, Davidlohr Bueso escreveu:
> Bcc:
> Subject: Re: [PATCH v4 07/15] docs: locking: futex2: Add documentation
> Reply-To:
> In-Reply-To: <20210603195924.361327-8-andrealmeid@collabora.com>
> 
> On Thu, 03 Jun 2021, André Almeida wrote:
> 
>> Add a new documentation file specifying both userspace API and internal
>> implementation details of futex2 syscalls.
> 
> I think equally important would be to provide a manpage for each new
> syscall you are introducing, and keep mtk in the loop, as in the past he
> extensively documented and improved futex manpages, and overall has a
> lot of experience with dealing with kernel interfaces.

Right, I'll add the man pages in a future version and make sure to have
mtk in the loop. Thanks for the tip.

> 
> Thanks,
> Davidlohr
> 

